# Performance Optimization Standards for DuckDB
This document outlines the performance optimization standards for DuckDB, providing guidelines for developers to write efficient and performant code. These standards are tailored for DuckDB's architecture and are designed to improve application speed, responsiveness, and resource usage.
## 1. Query Optimization
### 1.1. Understanding Query Plans
**Standard:** Analyze query plans to identify bottlenecks and optimize query execution.
* **Do This:** Use "EXPLAIN" to examine the query plan and identify areas for improvement.
* **Don't Do This:** Blindly execute queries without understanding their underlying execution strategy.
**Why:** Understanding the query plan allows developers to make informed decisions about indexing, data types, and query structure.
**Example:**
"""sql
EXPLAIN SELECT * FROM lineitem WHERE l_orderkey = 12345;
"""
This will output the query plan, showing the steps DuckDB will take to execute the query. Areas of concern include full table scans, inefficient joins, or suboptimal sorting.
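For a query that is already slow, "EXPLAIN ANALYZE" goes one step further than "EXPLAIN": it executes the query and annotates each operator with actual timings and row counts. A minimal sketch (the filter value is illustrative):
"""sql
-- Runs the query and reports per-operator timings and cardinalities
EXPLAIN ANALYZE
SELECT l_returnflag, COUNT(*) AS cnt
FROM lineitem
WHERE l_shipdate > DATE '1995-01-01'
GROUP BY l_returnflag;
"""
Comparing estimated and actual cardinalities in this output is often the fastest way to spot a misbehaving filter or join.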
### 1.2. Indexing Strategies
**Standard:** Employ appropriate indexing strategies to accelerate data retrieval.
* **Do This:** Create indexes on frequently queried columns, especially those used in "WHERE" clauses and join conditions. Consider using multi-column indexes for composite queries.
* **Don't Do This:** Over-index tables, as this can slow down write operations and increase storage overhead. Avoid indexing columns with low cardinality or those rarely used in queries.
**Why:** Indexes let DuckDB locate matching rows without scanning the whole table. In DuckDB, ART indexes pay off mainly for highly selective point lookups and for enforcing key constraints; large analytical scans benefit more from row-group min/max metadata (zonemaps) and well-ordered data.
**Example:**
"""sql
-- Single-column index
CREATE INDEX idx_orderkey ON lineitem (l_orderkey);
-- Multi-column index
CREATE INDEX idx_order_ship ON lineitem (l_orderkey, l_shipdate);
"""
Carefully consider the order of columns in multi-column indexes. The most frequently queried column should come first. DuckDB (as of recent versions) also supports expression indexes, though these should be used judiciously as they can complicate maintenance.
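As a sketch of an expression index (the expression and table are illustrative; confirm support in your DuckDB version):
"""sql
-- Index over a computed value; queries filtering on the same expression can use it
CREATE INDEX idx_lineitem_ship_year ON lineitem (year(l_shipdate));
-- e.g. SELECT COUNT(*) FROM lineitem WHERE year(l_shipdate) = 1995;
"""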
### 1.3. Data Type Considerations
**Standard:** Use the most appropriate data types for your data to minimize storage and improve performance.
* **Do This:** Use smaller integer types (e.g., "SMALLINT", "INTEGER") when the range of values allows. Use the "DATE" and "TIMESTAMP" types for date and time data rather than storing them as strings. Note that in DuckDB "VARCHAR", "TEXT", and "STRING" are the same type, and a declared length such as "VARCHAR(255)" is accepted for compatibility but does not affect storage or performance.
* **Don't Do This:** Use unnecessarily large data types, such as "BIGINT" when "INTEGER" suffices, or store dates, timestamps, and numbers as strings.
**Why:** Smaller data types reduce storage space and memory usage, leading to faster data processing.
**Example:**
"""sql
-- Good: Using SMALLINT when appropriate
CREATE TABLE orders (
order_id SMALLINT, -- Assuming order IDs won't exceed the range of SMALLINT
order_date DATE
);
-- Bad: Using BIGINT unnecessarily
CREATE TABLE products (
    product_id BIGINT,      -- INTEGER might be sufficient
    product_name VARCHAR
);
-- Better: Right-sized integer and an explicit timestamp type
CREATE TABLE products (
    product_id INTEGER,
    product_name VARCHAR,   -- A length such as VARCHAR(255) is allowed but does not affect storage or performance in DuckDB
    created_at TIMESTAMP
);
"""
### 1.4. Join Optimization
**Standard:** Optimize join operations to minimize the amount of data processed.
* **Do This:** Let DuckDB's optimizer choose the join algorithm; it selects strategies such as hash joins for equality predicates based on table sizes and statistics, and it cannot be overridden with hints. Keep join predicates as simple equality comparisons on columns with matching types, verify the plan with "EXPLAIN", and leverage pre-calculated aggregates where appropriate. Indexes on join keys can help very selective joins, though DuckDB's hash joins do not require them.
* **Don't Do This:** Join on complex expressions rather than simple column comparisons, join on columns with mismatched data types (forcing implicit casts), or produce cartesian products by omitting join conditions.
**Why:** Efficient join operations are critical for query performance, especially in data warehousing scenarios.
**Example:**
"""sql
-- Good: Indexed join columns (index names must be unique within a schema)
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
CREATE INDEX idx_customers_customer_id ON customers (customer_id);
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id;
-- Note: DuckDB does not support optimizer hints such as /*+ BROADCAST(customers) */;
-- they are parsed as ordinary comments. The optimizer decides the join strategy
-- (for example, building the hash table on the smaller input) on its own.
-- Less ideal: without indexes, a highly selective join cannot fall back to an index-based plan
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id; -- Assuming no index on customer_id
"""
### 1.5. Subquery Optimization
**Standard:** Rewrite subqueries where possible to improve performance.
* **Do This:** Use "JOIN" operations instead of correlated subqueries when possible. Use Common Table Expressions (CTEs) to break down complex queries into smaller, manageable parts.
* **Don't Do This:** Use correlated subqueries excessively, as they can significantly slow down query execution.
**Why:** Correlated subqueries can be inefficient because they are executed for each row in the outer query.
**Example:**
"""sql
-- Bad: Correlated subquery
SELECT o.order_id
FROM orders o
WHERE EXISTS (
SELECT 1
FROM lineitem l
WHERE l.order_id = o.order_id
);
-- Good: Using a JOIN instead
SELECT DISTINCT o.order_id
FROM orders o
JOIN lineitem l ON o.order_id = l.order_id;
-- Good: Using a CTE for readability and potential optimization
WITH OrderItems AS (
SELECT order_id FROM lineitem
)
SELECT o.order_id
FROM orders o
WHERE o.order_id IN (SELECT order_id FROM OrderItems);
"""
### 1.6. Filtering Early
**Standard:** Apply filters as early as possible in the query execution pipeline.
* **Do This:** Place "WHERE" clauses that significantly reduce the number of rows processed at the beginning of the query.
* **Don't Do This:** Filter data late in the query execution pipeline, after expensive operations like joins or aggregations.
**Why:** Filtering early reduces the amount of data that subsequent operations need to process.
**Example:**
"""sql
-- Good: Filtering early significantly reduces rows
SELECT *
FROM orders
WHERE order_date > '2023-01-01'
AND customer_id IN (SELECT customer_id FROM active_customers);
-- Less ideal: the filter appears only after the join; DuckDB usually pushes simple predicates below the join, but filters that depend on join results or aggregates cannot be pushed down
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
WHERE order_date > '2023-01-01';
"""
## 2. Data Loading and Storage
### 2.1. Bulk Loading
**Standard:** Use bulk loading techniques for large datasets.
* **Do This:** Use "COPY" command or DuckDB's API to load data in bulk. Use vectorized reads when possible.
* **Don't Do This:** Load data row-by-row using individual "INSERT" statements.
**Why:** Bulk loading is significantly faster than individual "INSERT" statements.
**Example:**
"""sql
-- CSV import
COPY lineitem FROM 'lineitem.tbl' (DELIMITER '|');
-- Parquet import (highly recommended due to DuckDB's columnar nature)
COPY lineitem FROM 'lineitem.parquet' (FORMAT 'PARQUET');
"""
If possible, load data pre-sorted on the columns you filter or range-scan most often. DuckDB keeps min/max metadata (zonemaps) per row group, so sorted data lets scans skip irrelevant row groups (see Section 2.2).
### 2.2. Data Clustering and Sorting
**Standard:** Cluster and sort data based on common query patterns.
* **Do This:** Use "ALTER TABLE ... CLUSTER BY" to physically sort the data on disk based on specific columns. This is extremely beneficial for range queries.
* **Don't Do This:** Neglect to cluster data, especially for large tables. Cluster by columns that are rarely used in queries.
**Why:** Clustering data improves query performance by reducing the amount of data that needs to be scanned for range queries or queries involving a specific order.
**Example:**
"""sql
ALTER TABLE lineitem CLUSTER BY l_orderkey, l_shipdate; --Cluster by orderkey, then by shipdate within each orderkey
"""
### 2.3. Compression
**Standard:** Enable compression for large datasets to reduce storage space and improve I/O performance.
* **Do This:** Use compression algorithms like Zstd or Snappy, especially when storing data in Parquet format. DuckDB automatically handles compression for its internal storage.
* **Don't Do This:** Store uncompressed data unnecessarily.
**Why:** Compression reduces the amount of data that needs to be read from disk, leading to faster query execution.
**Example:**
"""sql
-- Parquet with Zstd compression (generally a good balance of compression ratio and speed)
COPY lineitem TO 'lineitem_compressed.parquet' (FORMAT 'PARQUET', COMPRESSION 'ZSTD'); -- Explicit compression
-- DuckDB auto-compression (will use a reasonable default)
CREATE TABLE compressed_table AS SELECT * FROM lineitem;
"""
### 2.4. Partitioning (using Parquet Files)
**Standard:** Partition data into separate files based on logical criteria (e.g., date ranges, geographic regions).
* **Do This:** Store data in Parquet files, partitioned by relevant columns. Use DuckDB's globbing capabilities to efficiently query specific partitions.
* **Don't Do This:** Store all data in a single large file, as this can slow down query execution.
**Why:** Partitioning allows DuckDB to only read the relevant files for a given query, improving performance.
**Example:**
Assume you have Parquet files partitioned by year and month: "/data/orders/year=2023/month=01/orders.parquet", "/data/orders/year=2023/month=02/orders.parquet", etc.
"""sql
-- Query data for a specific month
SELECT * FROM read_parquet('/data/orders/year=2023/month=01/*.parquet');
-- Query data for a specific year (note the extra wildcard for the month directories)
SELECT * FROM read_parquet('/data/orders/year=2023/*/*.parquet');
-- Query all data
SELECT * FROM read_parquet('/data/orders/*/*/*.parquet'); -- Use cautiously: reading every partition defeats the purpose of partitioning
"""
### 2.5. Vectorized Reads
**Standard:** Utilize DuckDB's vectorized reads for efficient data processing from disk or other external sources.
* **Do This:** When reading from Parquet or other file formats, ensure DuckDB is configured to utilize vectorized reads. This is enabled by default; however, verify configurations in case of custom setups.
* **Don't Do This:** Implement custom, row-by-row processing when reading data into DuckDB, especially when standard file formats are used.
**Why:** Vectorized reads allow DuckDB to process data in batches, significantly improving the throughput of data ingestion and query execution.
**Example:**
DuckDB automatically utilizes vectorized reads for Parquet files. You generally will not need to configure this directly. However, for custom data-loading implementations, ensure that you are reading data in batches and passing it to DuckDB's vectorized execution engine.
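Even without custom code, you can stay on the vectorized path by expressing ingestion as set-based SQL instead of per-row statements. A minimal sketch (the file name is illustrative):
"""sql
-- Set-based insert: both the Parquet scan and the insert run vectorized
INSERT INTO lineitem
SELECT * FROM read_parquet('new_lineitem.parquet');
"""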
## 3. Concurrency and Parallelism
### 3.1. Connection Management
**Standard:** Manage database connections efficiently.
* **Do This:** Use connection pooling to reuse connections and avoid the overhead of creating new connections for each query. Close connections when they are no longer needed.
* **Don't Do This:** Create a new connection for each query. Leave connections open indefinitely.
**Why:** Establishing database connections can be expensive. Connection pooling improves performance by reusing existing connections.
**Example (Python):**
"""python
import duckdb
import threading

# Reuse one connection per thread instead of opening a new one per query
local = threading.local()

def get_connection():
    if not hasattr(local, "con"):
        local.con = duckdb.connect('my_database.duckdb')
    return local.con

def run_query(query):
    # Alternatively, share a single connection and derive per-thread cursors with con.cursor()
    con = get_connection()
    result = con.execute(query).fetchall()
    return result
"""
### 3.2. Parallel Query Execution
**Standard:** Leverage DuckDB's parallel query execution capabilities.
* **Do This:** Configure the number of threads used for query execution using "PRAGMA threads". Ensure that queries are designed to benefit from parallelism (e.g., large scans, aggregations).
* **Don't Do This:** Set the number of threads too high, as this can lead to excessive context switching and reduced performance.
**Why:** Parallel query execution can significantly improve performance for CPU-bound operations.
**Example:**
"""sql
PRAGMA threads=8; -- Use 8 threads
SELECT l_returnflag, l_linestatus, SUM(l_quantity) AS sum_qty, SUM(l_extendedprice) AS sum_base_price, SUM(l_discount) AS sum_discount
FROM lineitem
GROUP BY l_returnflag, l_linestatus;
RESET threads; -- Revert to the default (all available cores)
"""
Carefully assess the optimal number of threads for your workload. For I/O-bound workloads, increasing the number of threads excessively can introduce contention overhead rather than additional throughput.
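To confirm the currently effective value before and after tuning, a quick sketch using DuckDB's settings functions:
"""sql
SELECT current_setting('threads') AS threads;
-- Or inspect the setting together with its description
SELECT * FROM duckdb_settings() WHERE name = 'threads';
"""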
## 4. Runtime Configuration
### 4.1. Memory Management
**Standard:** Configure the amount of memory available to DuckDB.
* **Do This:** Use "PRAGMA memory_limit" to set the memory available to DuckDB. Monitor memory usage to ensure that the limit is appropriate.
* **Don't Do This:** Allow DuckDB to use excessive amounts of memory, potentially starving other processes. Set the memory limit too low, which can lead to disk spilling and reduced performance.
**Why:** Proper memory management prevents out-of-memory errors and ensures efficient query execution.
**Example:**
"""sql
PRAGMA memory_limit='16GB'; -- Set memory limit to 16GB
"""
### 4.2. Temporary Storage
**Standard:** Ensure that temporary storage is configured correctly.
* **Do This:** Use the "temp_directory" configuration option to specify a location for temporary files. Ensure that the specified location has sufficient storage space and high I/O performance.
* **Don't Do This:** Allow temporary files to be written to the default location, which may be on a slower storage device.
**Why:** DuckDB uses temporary storage for intermediate results. Configuring temporary storage correctly can improve query performance, especially when dealing with large datasets.
**Example:**
"""sql
PRAGMA temp_directory='/mnt/fast_ssd/duckdb_tmp';
"""
### 4.3. Detailed Monitoring
**Standard:** Actively monitor the performance of queries and I/O operations.
* **Do This:** Use DuckDB's built-in profiling features alongside external system monitoring tools.
* **Don't Do This:** Neglect to measure the impact of configuration changes and code optimizations. Changes should be tested thoroughly, as they can degrade performance for some workloads.
**Why:** Consistent monitoring helps ensure that changes are having the impact you expect, and catches unexpected degradation.
**Example:**
While DuckDB ships with profiling facilities of its own, the application should also be instrumented with the monitoring frameworks already used in the deployment environment, such as Prometheus, Grafana, or similar tools.
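On the DuckDB side, the built-in profiler is the primary tool: it can be enabled per connection and its output written to a file for later analysis. A sketch of a typical workflow (the output path is illustrative):
"""sql
PRAGMA enable_profiling = 'json';          -- or 'query_tree' for human-readable output
PRAGMA profiling_output = '/tmp/profile.json';
SELECT COUNT(*) FROM lineitem;             -- the query being profiled
PRAGMA disable_profiling;
"""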
## 5. Code Maintainability and Readability
### 5.1. Code Formatting and Style
**Standard:** Follow a consistent code formatting style.
* **Do This:** Use a consistent indentation style (e.g., 4 spaces). Use meaningful variable and function names. Add comments to explain complex logic.
* **Don't Do This:** Use inconsistent indentation. Use cryptic variable names. Write code without comments.
**Why:** Consistent code formatting improves readability and maintainability.
**Example:**
"""sql
-- Good: Well-formatted SQL
SELECT
c.customer_id,
c.customer_name,
COUNT(o.order_id) AS order_count
FROM
customers c
LEFT JOIN
orders o ON c.customer_id = o.customer_id
WHERE
c.region = 'North America'
GROUP BY
c.customer_id, c.customer_name
ORDER BY
order_count DESC
LIMIT 10;
-- Bad: Poorly formatted SQL
select c.customer_id,c.customer_name,count(o.order_id) from customers c left join orders o on c.customer_id=o.customer_id where c.region='North America' group by c.customer_id,c.customer_name order by count(o.order_id) desc limit 10;
"""
### 5.2. Modular Design
**Standard:** Break down complex queries and logic into smaller, reusable modules.
* **Do This:** Use Common Table Expressions (CTEs) to break down complex queries into smaller parts. Create reusable functions for common operations.
* **Don't Do This:** Write monolithic queries that are difficult to understand and maintain.
**Why:** Modular design improves code organization and reduces code duplication.
**Example:**
"""sql
-- Good: Using CTEs to break down a complex query
WITH CustomerOrders AS (
SELECT
customer_id,
COUNT(order_id) AS order_count
FROM
orders
GROUP BY
customer_id
),
TopCustomers AS (
SELECT
customer_id
FROM
CustomerOrders
ORDER BY
order_count DESC
LIMIT 10
)
SELECT
c.customer_id,
c.customer_name,
co.order_count
FROM
customers c
JOIN
TopCustomers tc ON c.customer_id = tc.customer_id
JOIN
CustomerOrders co ON c.customer_id = co.customer_id;
-- Bad: Monolithic query
SELECT c.customer_id, c.customer_name, COUNT(o.order_id) FROM customers c JOIN orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.customer_name ORDER BY COUNT(o.order_id) DESC LIMIT 10;
"""
By adhering to these coding standards, DuckDB developers can write efficient, maintainable, and performant code, ensuring that applications utilizing DuckDB run smoothly and effectively. The consistent application of these rules, aided by AI tools, should lead to a higher quality codebase and improved overall system performance. Remember to stay current with DuckDB's release notes, especially those regarding optimization, as the engine is rapidly evolving.
# API Integration Standards for DuckDB This document outlines the coding standards for integrating DuckDB with external APIs and backend services. It focuses on best practices to ensure maintainability, performance, and security when leveraging DuckDB in conjunction with external data sources and services. ## 1. General Principles of API Integration ### 1.1. Clear Separation of Concerns **Do This:** * Isolate API interaction logic from core database operations. * Create dedicated modules or functions responsible for communicating with external APIs. **Don't Do This:** * Embed API calls directly within SQL queries or stored procedures. This makes debugging incredibly difficult and tightly couples your SQL logic to an external service. **Why:** Separate concerns promote modularity and testability. API interactions are often subject to change (e.g., API version updates, schema changes), so isolating them reduces the impact of these changes on core database logic. **Example:** """python # Correct: Separate API interaction logic import requests import duckdb def fetch_data_from_api(api_url): """Fetches data from an external API.""" try: response = requests.get(api_url) response.raise_for_status() #
# Testing Methodologies Standards for DuckDB This document outlines the testing methodologies standards for DuckDB, aiming to provide a clear and actionable guide for developers. The focus is on ensuring high-quality code through effective unit, integration, and end-to-end testing strategies, tailored specifically for DuckDB's unique features and architecture. ## 1. Overview of Testing Strategies DuckDB's testing strategy encompasses several layers to ensure reliability and correctness. These layers include unit tests, integration tests, and end-to-end tests. Each layer serves a distinct purpose and contributes to overall code quality. * **Unit Tests:** Focus on individual components or functions in isolation. They verify that each unit performs as expected under various conditions. * **Integration Tests:** Verify the interaction between different components or modules. They ensure that components work together correctly. * **End-to-End Tests:** Simulate real-world scenarios and validate the entire system from start to finish. They ensure that the system meets the desired requirements and performs accurately. ## 2. Unit Testing Standards Unit tests are the foundation of a robust testing suite. They isolate individual components, making it easier to identify and fix bugs. ### 2.1. Writing Effective Unit Tests * **Do This:** Write unit tests for every non-trivial function or method. * **Why:** Ensures that each part of the codebase works as intended, reducing the risk of regressions. * **Do This:** Use descriptive names for test functions to clearly indicate what is being tested. * **Why:** Improves readability and maintainability of the test suite. * **Do This:** Ensure each unit test covers a specific functionality or edge case. Keep unit tests small and focused. * **Why:** Easier to pinpoint the source of a failure and reduces debugging time. * **Don't Do This:** Write unit tests that are too broad or cover multiple functionalities in a single test. * **Why:** Makes it harder to understand the cause of failure and increases maintenance overhead. * **Don't Do This:** Write unit tests that depend on external resources or services. * **Why:** Makes tests brittle and unreliable. Unit tests should be isolated and repeatable. ### 2.2. Example: Unit Testing a DuckDB Function Assume we have a simple C++ function that calculates the square of a number: """c++ // src/math_utils.cpp #include <cmath> namespace duckdb { double square(double x) { return x * x; } } // namespace duckdb """ Here’s how you can write a unit test for it: """c++ // test/unittest/math_utils_test.cpp #include "catch.hpp" #include "duckdb.hpp" #include "src/math_utils.cpp" // Include the source file directly for unit testing using namespace duckdb; TEST_CASE("Square function test", "[math_utils]") { REQUIRE(square(2.0) == Approx(4.0)); REQUIRE(square(0.0) == Approx(0.0)); REQUIRE(square(-2.0) == Approx(4.0)); REQUIRE(square(3.14) == Approx(9.8596)); } """ * **Explanation:** * We use "catch.hpp" as the testing framework, which is common in DuckDB projects. * "TEST_CASE" defines a new test case with a descriptive name and tag. * "REQUIRE" asserts that the condition is true. "Approx" handles floating-point comparisons to account for potential precision issues. * Including the ".cpp" file directly is often necessary for unit testing due to DuckDB's internal build structure. ### 2.3. 
Common Anti-Patterns in Unit Testing * **Ignoring Edge Cases:** Failing to test boundary conditions or edge cases can lead to unexpected behavior in production. * **Example:** Not testing the "square" function with very large or very small numbers. * **Over-Mocking:** Using mocks excessively can make tests less reliable and harder to maintain. * **Why:** Over-mocking can hide underlying issues and makes it harder to refactor code. * **Not Cleaning Up:** Failing to clean up resources after a test can lead to resource leaks or interference with subsequent tests. * **Example:** Forgetting to close database connections or delete temporary files. ### 2.4. Modern Approaches * **Property-Based Testing:** Generate a large number of test cases automatically based on defined properties. * **Benefit:** Can uncover unexpected edge cases and improve test coverage. * **Example:** Testing that "square(x)" always returns a non-negative number for any real number "x". * **Mutation Testing:** Introduce small changes (mutations) to the code and verify that the tests fail. * **Benefit:** Helps identify weak tests that do not adequately cover the code. ## 3. Integration Testing Standards Integration tests verify the interaction between different components or modules. They ensure that these components work together correctly. ### 3.1. Writing Effective Integration Tests * **Do This:** Focus on testing the interfaces between components. * **Why:** Ensures that data is passed correctly and that components communicate effectively. * **Do This:** Use realistic test data that simulates real-world scenarios. * **Why:** Increases the likelihood of detecting integration issues. * **Do This:** Write tests that verify the expected behavior of the system as a whole. * **Why:** Ensures that the system meets the desired requirements. * **Don't Do This:** Write integration tests that are too granular or focused on individual component behavior. * **Why:** Overlaps with unit tests and increases maintenance overhead. * **Don't Do This:** Depend on unstable or unreliable external resources. * **Why:** Makes tests flaky and hard to reproduce. ### 3.2. Example: Integration Testing with DuckDB Suppose we have a module that imports data from a CSV file into a DuckDB database and another module that performs analytics on the data. Here’s how you can write an integration test: """c++ // test/integration/csv_import_test.cpp #include "catch.hpp" #include "duckdb.hpp" #include <fstream> using namespace duckdb; TEST_CASE("CSV import and analytics integration test", "[integration]") { // Create a temporary CSV file std::ofstream csv_file("test.csv"); csv_file << "id,name,value\n"; csv_file << "1,Alice,10\n"; csv_file << "2,Bob,20\n"; csv_file.close(); // Initialize DuckDB database DuckDB db(":memory:"); Connection con(db); // Import data from CSV file con.Query("CREATE TABLE my_table AS SELECT * FROM read_csv_auto('test.csv')"); // Perform analytics auto result = con.Query("SELECT SUM(value) FROM my_table"); // Verify the result REQUIRE(result->GetValue(0, 0).GetValue<int64_t>() == 30); // Clean up the temporary file std::remove("test.csv"); } """ * **Explanation:** * We create a temporary CSV file with sample data. * We initialize a DuckDB database in memory. * We import the data from the CSV file into a table using "read_csv_auto". * We perform a simple aggregation query to calculate the sum of values. * We verify that the result matches the expected value. * Finally, we clean up the temporary CSV file. ### 3.3. 
Common Anti-Patterns in Integration Testing * **Ignoring Error Handling:** Failing to test how the system handles errors during integration can lead to unexpected behavior in production. * **Example:** Not testing what happens when the CSV file is malformed or missing. * **Overlapping with Unit Tests:** Writing integration tests that duplicate the functionality of unit tests. * **Why:** Increases maintenance overhead without providing additional value. * **Using Hardcoded Values:** Using hardcoded values in tests can make them brittle and hard to maintain. * **Example:** Hardcoding the expected sum of values instead of calculating it dynamically. ### 3.4. Modern Approaches * **Containerization:** Use Docker containers to create isolated and reproducible test environments. * **Benefit:** Ensures that tests run consistently across different environments. * **Test Data Management:** Use dedicated tools to manage test data and ensure data consistency. * **Benefit:** Reduces the risk of data-related issues and improves test reliability. ## 4. End-to-End Testing Standards End-to-end tests simulate real-world scenarios and validate the entire system from start to finish. They ensure that the system meets the desired requirements and performs accurately. ### 4.1. Writing Effective End-to-End Tests * **Do This:** Simulate realistic user interactions and workflows. * **Why:** Increases the likelihood of detecting issues that users might encounter. * **Do This:** Verify that the system meets the business requirements. * **Why:** Ensures that the system provides the expected value. * **Do This:** Write tests that cover the most critical user journeys. * **Why:** Focuses testing efforts on the most important functionalities. * **Don't Do This:** Write end-to-end tests that are too detailed or cover every possible scenario. * **Why:** Increases maintenance overhead and can make tests brittle. * **Don't Do This:** Rely on manual setup or configuration. * **Why:** Makes tests hard to automate and reproduce. ### 4.2. Example: End-to-End Testing with DuckDB Suppose we have a system that allows users to upload CSV files, perform SQL queries on the data, and download the results. Here’s how you can write an end-to-end test: """python # test/e2e/upload_query_download_test.py import duckdb import os def test_upload_query_download(): # Create a temporary CSV file with open("test.csv", "w") as f: f.write("id,name,value\n") f.write("1,Alice,10\n") f.write("2,Bob,20\n") # Initialize DuckDB database con = duckdb.connect(database=':memory:', read_only=False) # Upload data from CSV file con.execute("CREATE TABLE my_table AS SELECT * FROM read_csv_auto('test.csv')") # Perform SQL query result = con.execute("SELECT SUM(value) FROM my_table").fetchone()[0] # Verify the result assert result == 30 # Download the result (simulated) - write to file with open("result.txt", "w") as f: f.write(str(result)) # Clean up the temporary file os.remove("test.csv") os.remove("result.txt") """ * **Explanation:** * Uses pytest as a framework * We create a temporary CSV file with sample data. * We initialize a DuckDB database in memory. * We upload the data from the CSV file into a table using "read_csv_auto". * We perform a SQL query to calculate the sum of values. * We verify that the result matches the expected value. * Finally, we clean up the temporary file. We also simulate "downloading" the result by writing it into a separate file and clearing that up too. ### 4.3. 
Common Anti-Patterns in End-to-End Testing * **Ignoring Performance:** Failing to test the performance of the system during end-to-end tests can lead to performance bottlenecks in production. * **Example:** Not measuring the time it takes to upload, query, and download data. * **Testing Implementation Details:** Writing tests that are tightly coupled to the implementation details of the system. * **Why:** Makes tests brittle and hard to maintain. * **Using Manual Assertions:** Relying on manual inspection of the system to verify the results. * **Why:** Makes tests hard to automate and reproduce. ### 4.4. Modern Approaches * **Behavior-Driven Development (BDD):** Use BDD frameworks like Cucumber to define tests in a human-readable format. * **Benefit:** Improves collaboration between developers, testers, and business stakeholders. * **Continuous Integration/Continuous Deployment (CI/CD):** Automate the execution of end-to-end tests as part of the CI/CD pipeline. * **Benefit:** Provides rapid feedback and ensures that the system is always in a deployable state. ## 5. DuckDB-Specific Testing Considerations Due to DuckDB's specific architecture (in-process, embedded), certain aspects of testing require special attention. * **Concurrency:** When testing concurrent operations, ensure proper isolation to prevent race conditions and data corruption. Use transactions and locking mechanisms as needed. * **Memory Management:** Pay close attention to memory usage during tests, as DuckDB operates within the application's memory space. Use tools to monitor memory consumption and detect leaks. * **Storage:** When testing persistent storage features, ensure that data is properly persisted and recovered after restarts. Verify the integrity of the stored data. * **Extensions:** When testing DuckDB extensions, ensure that the extensions are properly loaded and initialized. Verify that the extension functions work as expected. ## 6. Test Data Generation Creating realistic and diverse test data is crucial for effective testing. * **Use data generation tools:** Libraries such as Faker (Python) or other data synthesis tools can generate data that mimics real-world datasets. * **Utilize existing datasets:** If possible, leverage anonymized or sample datasets that represent the type of data your DuckDB instance will handle. * **Coverage:** Ensure test data covers a wide range of values, including edge cases, nulls, and invalid data, to ensure comprehensive testing. ## 7. Performance and Regression Testing * **Implement performance benchmarks:** Regularly run performance benchmarks for key queries and operations to detect performance regressions. * **Automate regression testing:** Create regression tests that capture known bugs or performance issues. Ensure these tests run with every code change to prevent the reintroduction of previously resolved problems. ## 8. Collaboration with AI Coding Assistants When using AI coding assistants like GitHub Copilot, provide clear and specific instructions related to testing. * **Contextualize code generation:** When generating test code, provide specific details about the functionality being tested, expected inputs, and expected outputs. * **Review suggestions carefully:** Always review AI-generated test code to ensure it accurately reflects the testing requirements and standards. Do not blindly accept AI-generated suggestions. 
* **Leverage AI for test case generation**: Where appropriate, prompt the AI to generate additional test cases, especially boundary conditions and edge cases. ## 9. Conclusion Adhering to these testing methodologies standards will ensure that DuckDB projects are robust, reliable, and maintainable. By incorporating thorough unit, integration, and end-to-end testing strategies, developers can build high-quality data solutions with confidence. Ongoing refinement and adaptation of these standards based on evolving best practices and project-specific needs are crucial for continued success.
# Tooling and Ecosystem Standards for DuckDB This document outlines the coding standards and best practices related to **Tooling and Ecosystem** for DuckDB development. It provides specific guidance on leveraging recommended libraries, tools, and extensions to enhance development workflows, maintainability, performance, and security of DuckDB-based projects. These standards are designed to work in concert with AI coding assistants like GitHub Copilot and Cursor, providing them with the necessary context to generate high-quality, DuckDB-idiomatic code. ## 1. Development Environment and Tooling ### 1.1. Recommended IDEs and Editors **Do This:** * Use IDEs and editors with strong DuckDB support (e.g., VS Code with the DuckDB extension, JetBrains DataGrip, DBeaver). These provide syntax highlighting, code completion, and integration with DuckDB's CLI. * Configure your editor with a DuckDB language server if available. * Utilize linters and formatters specific for the host language (e.g., "flake8" and "black" for Python) to maintain code consistency. Apply formatting before committing code. **Don't Do This:** * Rely solely on basic text editors lacking DuckDB-specific features. * Skip configuring linters and formatters, leading to inconsistent code style. **Why This Matters:** Enhanced tooling improves developer productivity, reduces errors, and ensures code maintainability. **Example:** """python # VS Code settings.json for Python and DuckDB { "python.linting.flake8Enabled": true, "python.formatting.provider": "black", "[python]": { "editor.formatOnSave": true, "editor.codeActionsOnSave": { "source.organizeImports": true } }, "files.autoSave": "afterDelay", "files.autoSaveDelay": 500 } """ ### 1.2. Version Control (Git) **Do This:** * Use Git for version control. Commit frequently with descriptive commit messages. * Create branches for new features or bug fixes. Follow Gitflow or a similar branching strategy. * Utilize pull requests for code review. * Store DuckDB-related scripts, configurations, and any required data (if suitable for the repository) in the Git repository. Exclude generated DuckDB database files ( ".duckdb" files) from version control using ".gitignore". **Don't Do This:** * Commit directly to the main branch without code review. * Store sensitive data (e.g., API keys, passwords) directly in the repository. Utilize environment variables and secure configuration management instead. * Track large data files or binary artifacts in Git. Consider using a separate data storage solution and version control for code only. **Why This Matters:** Version control is essential for collaboration, code management, and rollback capabilities. **Example:** """.gitignore # DuckDB database files *.duckdb *.wal # Temporary files tmp/ *.tmp # venv venv/ """ ### 1.3. Build Systems (CMake) **Do This:** * For C/C++ projects which interface with DuckDB, use CMake to manage the build process. CMake simplifies cross-platform compilation and dependency management. * Properly link against the DuckDB library in your "CMakeLists.txt" file. * Use CMake's "find_package" command to locate DuckDB. **Don't Do This:** * Manually manage compilation flags and linker options. * Hardcode paths to the DuckDB library. **Why This Matters:** CMake ensures portability and reproducible builds. 
**Example:** """cmake # CMakeLists.txt cmake_minimum_required(VERSION 3.15) project(MyDuckDBProject) find_package(DuckDB REQUIRED) add_executable(my_duckdb_app main.cpp) target_link_libraries(my_duckdb_app DuckDB::duckdb) # Link against DuckDB """ ## 2. DuckDB Extensions and Libraries ### 2.1. Utilizing DuckDB Extensions **Do This:** * Enable and use relevant DuckDB extensions to enhance functionality. Popular extensions include "httpfs" (for accessing data over HTTP/S), "parquet" (for reading Parquet files), "json" (for working with JSON data), and "excel" (for reading Excel files). * Install extensions using the "INSTALL" statement: "INSTALL httpfs;". * Load extensions using the "LOAD" statement: "LOAD httpfs;". Load extensions within your DuckDB scripts or application code. Check if an extension is already loaded before attempting to load it again. **Don't Do This:** * Forget to install and load extensions before using their functionality. * Load extensions unnecessarily, as this can increase startup time. **Why This Matters:** Extensions extend DuckDB's capabilities and improve data integration. **Example:** """sql -- Install and load the httpfs extension for accessing data over HTTP INSTALL httpfs; LOAD httpfs; -- Query a CSV file directly from a URL SELECT * FROM read_csv_auto('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv'); """ ### 2.2. Python Integration (DuckDB's "duckdb" Library) **Do This:** * Use the "duckdb" Python library for seamless integration with Python applications. * Utilize parameterized queries to prevent SQL injection vulnerabilities: "con.execute("SELECT * FROM my_table WHERE id = ?", (user_input,))". * Use the "df" method to convert between DuckDB result sets and Pandas DataFrames for data analysis: "df = con.execute("SELECT * FROM my_table").df()". * Use DuckDB's ability to directly query Pandas DataFrames: "con.execute("SELECT * FROM df").df()". **Don't Do This:** * Use string formatting to construct SQL queries, as this is prone to SQL injection. * Ignore error handling when interacting with the DuckDB library. * Fail to close connections and cursors properly. Use context managers ("with") to ensure resources are released. **Why This Matters:** The Python library simplifies data interaction and enables complex data workflows. **Example:** """python import duckdb import pandas as pd # Connect to an in-memory DuckDB database con = duckdb.connect(':memory:') # Create a Pandas DataFrame data = {'id': [1, 2, 3], 'value': ['a', 'b', 'c']} df = pd.DataFrame(data) # Register the DataFrame with DuckDB con.register('my_dataframe', df) # Query the DataFrame using SQL result = con.execute('SELECT * FROM my_dataframe WHERE id > 1').fetchdf() print(result) # Create a table from the DataFrame con.execute("CREATE TABLE my_table AS SELECT * FROM my_dataframe") # Execute a parameterized query user_input = 2 result = con.execute("SELECT * FROM my_table WHERE id = ?", (user_input,)).fetchdf() print(result) # Close the connection con.close() """ ### 2.3. R Integration (DuckDB's "duckdb" Package) **Do This:** * Use the "duckdb" R package for interacting with DuckDB from R. * Leverage "dbConnect()", "dbExecute()", "dbFetch()" and other functions provided by the package. * Use "dplyr" verbs within DuckDB using "dplyr::tbl()" to transparently push down operations. **Don't Do This:** * Manually construct SQL queries when R functions can achieve the same result. * Forget to disconnect from the database after use. 
**Why This Matters:** The R package enables efficient data analysis workflows within the R ecosystem. **Example:** """R library(duckdb) library(dplyr) # Connect to an in-memory DuckDB database con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:", read_only = FALSE) # Create a data frame df <- data.frame(id = 1:3, value = c("a", "b", "c")) # Write the data frame to a DuckDB table dbWriteTable(con, "my_table", df) # Query the table using dplyr result <- tbl(con, "my_table") %>% filter(id > 1) %>% collect() print(result) # Execute a SQL query result <- dbGetQuery(con, "SELECT * FROM my_table WHERE id = 1") print(result) # Disconnect from the database dbDisconnect(con, shutdown = TRUE) """ ### 2.4 Arrow Integration **Do This:** * Use Arrow for efficient data exchange between DuckDB and other systems. DuckDB has native support for Arrow. * Convert DuckDB result sets to Arrow tables using ".arrow()" in Python and ".arrow()" in R. * Pass Arrow tables directly to DuckDB for querying. * Install and utilize the "arrow" extension for advanced Arrow support: "INSTALL arrow; LOAD arrow;". **Don't Do This:** * Rely on inefficient data serialization methods when Arrow provides a faster alternative. * Ignore potential data type incompatibilities between DuckDB and other systems. **Why This Matters:** Arrow enables zero-copy data transfer, significantly boosting performance. **Example (Python):** """python import duckdb import pyarrow.parquet as pq import pyarrow as pa # Connect to DuckDB con = duckdb.connect(':memory:') # Create a sample Arrow table table = pa.Table.from_pydict({'id': [1, 2, 3], 'value': ['a', 'b', 'c']}) # Register the Arrow table with DuckDB con.register('my_arrow_table', table) # Query the Arrow table using SQL result = con.execute('SELECT * FROM my_arrow_table WHERE id > 1').df() print(result) # write arrow to parquet pq.write_table(table, 'test.parquet') # Load parquet to duckdb con.execute("CREATE TABLE parquet_table AS SELECT * FROM 'test.parquet'") result = con.execute('SELECT * FROM parquet_table WHERE id > 1').df() print(result) con.close() """ ### 2.5. Data Visualization Tools **Do This:** * Connect data visualization tools (e.g., Tableau, Power BI, Metabase) to DuckDB as a data source. * Prefer direct connections to DuckDB using the appropriate drivers. * Leverage the ability to rapidly prototype dashboards locally using DuckDB and subsequently deploy them using larger data warehouses. **Don't Do This:** * Rely on exporting data to files for visualization when direct connections are possible. * Overload DuckDB with very complex queries for visualization purposes. Consider pre-aggregating data if necessary. **Why This Matters:** Data visualization helps understand trends and insights in your data. ### 2.6. Testing Frameworks **Do This:** * Use testing frameworks (e.g., pytest in Python, testthat in R) to write unit and integration tests for your DuckDB-based code. * Write tests to verify the correctness of SQL queries, data transformations, and data loading processes. Use assertion libraries for expressive testing. * Utilize DuckDB's in-memory capabilities for isolated testing environments. **Don't Do This:** * Skip writing tests, leading to undetected bugs and regressions. * Use production databases for testing, which can lead to data corruption. **Why This Matters:** Testing ensures code quality and reliability. 
**Example (Python with pytest):** """python import duckdb import pytest @pytest.fixture(scope="function") def duckdb_conn(): conn = duckdb.connect(':memory:') conn.execute("CREATE TABLE test_table (id INTEGER, value VARCHAR)") conn.execute("INSERT INTO test_table VALUES (1, 'a'), (2, 'b'), (3, 'c')") yield conn conn.close() def test_select_all(duckdb_conn): result = duckdb_conn.execute("SELECT * FROM test_table").fetchall() assert len(result) == 3 assert result[0] == (1, 'a') def test_where_clause(duckdb_conn): result = duckdb_conn.execute("SELECT * FROM test_table WHERE id > 1").fetchall() assert len(result) == 2 assert result[0] == (2, 'b') """ ## 3. Performance Optimization Tools ### 3.1. Query Profiling **Do This:** * Use DuckDB's built-in query profiler to identify performance bottlenecks in SQL queries. * Use "PRAGMA show_profile" to analyze query execution plans and identify slow operations. * Utilize the "EXPLAIN" statement to understand the query execution plan before running the query. **Don't Do This:** * Guess at performance bottlenecks without profiling. * Ignore the query execution plan when optimizing queries. **Why This Matters:** Profiling and analyzing query plans provides insights into performance issues. **Example:** """sql -- Enable profiling PRAGMA enable_profiling; -- Execute a query SELECT COUNT(*) FROM lineitem; -- Show the profiling information PRAGMA show_profile; --To clear the profiling information PRAGMA disable_profiling -- Analyze query execution plan EXPLAIN SELECT * FROM lineitem WHERE l_shipdate > '1998-12-01'; """ ### 3.2. Indexing **Do This:** * Create indexes on frequently queried columns to speed up data retrieval. Ensure that indexes are actually used and not slowing down write operations. * Consider using different index types (e.g., B-Tree, Hash) based on query patterns. **Don't Do This:** * Create indexes on every column, as this can slow down write operations. * Forget to analyze the performance impact of indexes. **Why This Matters:** Indexes improve query performance by allowing DuckDB to quickly locate data. **Example:** """sql -- Create an index on the l_shipdate column CREATE INDEX idx_shipdate ON lineitem (l_shipdate); """ ### 3.3. Data Partitioning (future) **Do This (when available):** * Explore data partitioning strategies to improve query performance on large datasets (future feature). **Don't Do This (currently):** * Rely on data partitioning until it is a fully supported feature in DuckDB. ## 4. Security Tools and Practices ### 4.1. Data Encryption **Do This:** * Utilize DuckDB's encryption features to protect sensitive data at rest and in transit (if supported and required). * Consult with security experts on choosing appropriate encryption algorithms and key management strategies. **Don't Do This:** * Store encryption keys directly in code or configuration files. Use secure key management systems. **Why This Matters:** Encryption protects data from unauthorized access. ### 4.2. Least Privilege Principle **Do This:** * Grant users only the necessary privileges to access and modify data. * Use roles to manage permissions and simplify administration. **Don't Do This:** * Grant users excessive privileges, which can lead to security vulnerabilities. **Why This Matters:** Limiting privileges reduces the impact of security breaches. At the time of writing DuckDB doesn't have significant user permissioning controls. ### 4.3. 
Input Validation and Sanitization **Do This:** * Validate and sanitize all user inputs to prevent SQL injection attacks. Use parameterized queries and prepared statements to escape user-provided data. **Don't Do This:** * Trust user inputs without validation. * Use string concatenation to build SQL queries with user inputs. **Why This Matters:** Input validation prevents malicious code from being executed. Always use parameterized queries. ## 5. Community and Support ### 5.1. Engaging with the DuckDB Community **Do This:** * Participate in the DuckDB community forums, mailing lists, and GitHub discussions. * Contribute to the DuckDB project by submitting bug reports, feature requests, and pull requests. * Share your knowledge and experience with other DuckDB users. **Don't Do This:** * Be afraid to ask questions or seek help from the community. * Ignore community guidelines and best practices. **Why This Matters:** Community involvement fosters collaboration and improves the DuckDB ecosystem. ### 5.2. Staying Up-to-Date **Do This:** * Follow the DuckDB release notes and documentation to stay informed about new features, bug fixes, and security updates. * Subscribe to the DuckDB newsletter or RSS feed. **Don't Do This:** * Use outdated versions of DuckDB, which may contain known bugs and security vulnerabilities. **Why This Matters:** Staying up-to-date ensures you are using the latest and greatest features and security patches. These standards serve as a comprehensive guide to leveraging tooling and ecosystem components when developing with DuckDB. By following these guidelines, developers can build robust, efficient, and secure DuckDB-based applications. Adherence will also ensure AI coding assistants provide more targeted and appropriate suggestions during development.
# State Management Standards for DuckDB This document outlines the coding standards for managing state within applications using DuckDB, focusing on data flow, reactivity, and persistence. These standards aim to ensure maintainability, performance, and security for DuckDB-driven applications. ## 1. Principles of State Management Effective state management is crucial for building robust and scalable DuckDB applications. A well-defined approach simplifies debugging, enhances testability, and improves overall code quality. ### 1.1. Explicit vs. Implicit State * **Do This:** Favor explicit state management. Clearly define and declare all state variables, data structures, and their relationships. Use appropriate data types. * **Don't Do This:** Rely on hidden or implicit state, such as global variables or mutable shared objects without clear boundaries. **Why:** Explicit state improves traceability and reduces the risk of unexpected side effects. **Example:** """python # Explicit State import duckdb def execute_query(db_connection, query): """Executes a SQL query against a DuckDB database.""" try: result = db_connection.execute(query).fetchall() return result except duckdb.Error as e: print(f"Error executing query: {e}") return None # Example Usage (Explicit Connection Object) conn = duckdb.connect(':memory:') conn.execute("CREATE TABLE mytable (id INTEGER, value VARCHAR)") conn.execute("INSERT INTO mytable VALUES (1, 'hello'), (2, 'world')") result = execute_query(conn, "SELECT * FROM mytable") print(result) conn.close() # Implicit State (Avoid) # (Using global database connections) """ ### 1.2. Immutable Data Structures * **Do This:** Use immutable data structures whenever possible to represent state. Prefer creating new copies of data upon modification rather than mutating existing objects. * **Don't Do This:** Modify data structures in place without considering the potential side effects on other parts of the application. **Why:** Immutability simplifies debugging and reasoning about data flow, particularly in concurrent environments. **Example:** """python # Immutable Data Structures & DuckDB import duckdb def update_records(db_path, table_name, updates): """ Simulates updating records by creating a new table with the modifications This is an example of immutable approach since DuckDB doesn't allow direct update in embedded mode """ conn = duckdb.connect(db_path) try: # 1. Read the existing records using DuckDB existing_records = conn.execute(f"SELECT * FROM {table_name}").fetchall() # Convert the result into a manageable format, like a dict records_dict = {record[0]: list(record[1:]) for record in existing_records} # Assuming id is record[0], and the rest are fields. # 2. Apply updates (generating new records) - Immutability approach: create new dict new_records_dict = records_dict.copy() # Create a copy for row_number, record_data in updates.items(): if row_number in new_records_dict: # We need to know the row number new_records_dict[row_number] = record_data # Update the dictionary (copy). 
#3 Delete old table and then add new table using dictionary conn.execute(f"DROP TABLE IF EXISTS {table_name}") # Convert each values(lists) in dictionary to tuple before adding a new table new_record_lists = {row_number: tuple(value) for row_number, value in new_records_dict.items()} table_data = list(new_record_lists.values()) # Define the column names for the new table column_names = ['id', 'name', 'age', 'city'] #Example of column Names # Create the new table using DuckDB conn.execute(f"CREATE TABLE {table_name} AS SELECT * FROM (VALUES {', '.join(map(str, table_data))}) AS t ({', '.join(column_names)})") # Verify result by reading sample data from updated table result = conn.execute(f"SELECT * FROM {table_name}").fetchall() print(f"Updated table records: {result}") except duckdb.Error as e: print(f"Error during update: {e}") finally: conn.close() # Example Usage - Important Note DuckDB requires to pass data as tuples instead of list to avoid type conversion issues db_path = 'my_example.duckdb' original_data = [(1, 'Alice', 30, 'New York'),(2, 'Bob', 25, 'Los Angeles'),(3, 'Charlie', 35, 'Chicago')] conn = duckdb.connect(db_path) conn.execute('CREATE TABLE IF NOT EXISTS users (id INTEGER, name VARCHAR, age INTEGER, city VARCHAR)') conn.executemany('INSERT INTO users VALUES (?, ?, ?, ?)', original_data) conn.close() updates = { 1: ['Alice Updated', 31, 'New Jersey'], # Key represents the row number 2: ['Bob Updated',26,'San Francisco'] } update_records(db_path, 'users', updates) """ ### 1.3. Single Source of Truth * **Do This:** Ensure that each piece of data has a single, authoritative source. Avoid redundant copies or derived data that can become inconsistent. Use DuckDB as the single source of truth for analytical data where possible. * **Don't Do This:** Cache data aggressively without proper invalidation mechanisms. **Why:** A single source of truth minimizes discrepancies and simplifies data synchronization. **Example:** """python # Single Source of Truth - DuckDB import duckdb def get_user_data(db_path, user_id): """Retrieves user data from DuckDB as the single source of truth.""" conn = duckdb.connect(db_path) try: result = conn.execute(f"SELECT * FROM users WHERE id = {user_id}").fetchone() if result: return { 'id': result[0], 'name': result[1], 'age': result[2], 'city': result[3] } else: return None except duckdb.Error as e: print(f"Error retrieving user data: {e}") return None finally: conn.close() # Usage db_path = 'my_example.duckdb' user_id = 1 user_data = get_user_data(db_path, user_id) print(user_data) """ ## 2. State Management Approaches in DuckDB Applications Different applications have different state management needs. Here's how to approach this for applications leveraging DuckDB: ### 2.1. Embedded DuckDB State * **Do This:** For small to medium-sized datasets, use DuckDB's embedded mode for direct data manipulation within the application's process. * **Don't Do This:** Attempt complex concurrent write operations in embedded mode without proper locking and transaction handling. * **Consider:** The limits of in-process memory and CPU usage for large datasets when using embedded DuckDB. **Why:** Embedded DuckDB offers simplicity and low latency for local analytics. 
**Example:**
"""python
# Embedded DuckDB Example
import duckdb

db_conn = duckdb.connect(':memory:')  # In-memory database for embedded use
db_conn.execute("CREATE TABLE items (id INTEGER, name VARCHAR)")
db_conn.execute("INSERT INTO items VALUES (1, 'Laptop')")
db_conn.execute("INSERT INTO items VALUES (2, 'Keyboard')")

results = db_conn.execute("SELECT * FROM items").fetchall()
print(results)
db_conn.close()
"""
### 2.2. Persistent DuckDB State
* **Do This:** Store the DuckDB database on disk for persisting data across application sessions.
* **Don't Do This:** Neglect backup and recovery mechanisms for persistent DuckDB databases.
* **Consider:** Using relative paths for the database file location to improve portability.
**Why:** Persistent storage ensures data continuity even across application restarts.
**Example:**
"""python
import duckdb

db_path = 'my_persistent_db.duckdb'  # Database file path

# Connect, create the table, and close the connection
db_conn = duckdb.connect(db_path)
db_conn.execute("CREATE TABLE IF NOT EXISTS user_profiles (id INTEGER, username VARCHAR, email VARCHAR)")
db_conn.close()

# Function to insert data
def insert_user_profile(db_path, user_id, username, email):
    conn = duckdb.connect(db_path)
    try:
        conn.execute("BEGIN TRANSACTION")  # Explicit transaction so commit/rollback below are valid
        conn.execute("INSERT INTO user_profiles VALUES (?, ?, ?)", (user_id, username, email))
        conn.commit()
        print(f"Inserted user: {username}")
    except duckdb.Error as e:
        print(f"Error inserting user: {e}")
        conn.rollback()
    finally:
        conn.close()

# Insert sample data into the persistent database
insert_user_profile(db_path, 1, 'john_doe', 'john.doe@example.com')
insert_user_profile(db_path, 2, 'jane_smith', 'jane.smith@example.com')

# Read function for retrieving a user profile
def get_user_profile(db_path, user_id):
    conn = duckdb.connect(db_path)
    try:
        result = conn.execute(f"SELECT * FROM user_profiles WHERE id={user_id}").fetchone()
        if result:
            return {
                'id': result[0],
                'username': result[1],
                'email': result[2]
            }
        else:
            return None
    except duckdb.Error as e:
        print(f"Error getting user profile: {e}")
        return None
    finally:
        conn.close()

# Get the data from the database and print it
user_profile = get_user_profile(db_path, 1)
print(user_profile)
"""
### 2.3. Connecting to External Data Sources
* **Do This:** Utilize DuckDB's ability to directly query data from Parquet, CSV, JSON, and other file formats without importing.
* **Don't Do This:** Assume that external data sources always conform to the expected schema. Implement robust error handling and schema validation.
* **Consider:** Optimizing access to external data sources by filtering and aggregating data within DuckDB rather than transferring large amounts of data to the application.
**Why:** External data access enables real-time analytics without data duplication.
**Example:**
"""python
# External Data Source - JSON (use read_json_auto so DuckDB infers the schema)
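# Note (illustrative): read_json_auto infers the schema from the file contents. When the source
# schema is not guaranteed, inspect it first, for example with
#   DESCRIBE SELECT * FROM read_json_auto('products.json')
# and validate the expected column names and types before trusting the results.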
import duckdb
import os

def analyze_json_data(json_file_path, query=None):
    """Analyzes JSON data using DuckDB."""
    try:
        # Default to selecting everything; read_json_auto lets DuckDB infer the schema
        full_query = query or f"SELECT * FROM read_json_auto('{json_file_path}')"
        conn = duckdb.connect(':memory:')
        result = conn.execute(full_query).fetchall()
        conn.close()
        return result
    except duckdb.Error as e:
        print(f"Error querying JSON data: {e}")
        return None

# Prepare a sample JSON file
json_data = '[{"id": 1, "name": "Laptop", "price": 1200}, {"id": 2, "name": "Keyboard", "price": 75}]'
with open('products.json', 'w') as f:
    f.write(json_data)

json_file_path = 'products.json'
query = f"SELECT name, price FROM read_json_auto('{json_file_path}') WHERE price > 100"
results = analyze_json_data(json_file_path, query)
print(results)
os.remove('products.json')  # clean up the file
"""
### 2.4. Managing Large Datasets
* **Do This:** Use DuckDB's efficient query engine to perform aggregations, filtering, and joins on large datasets directly within the database.
* **Don't Do This:** Load entire large datasets into application memory.
* **Consider:** Partitioning and indexing techniques to optimize query performance on large datasets.
**Why:** Optimized query execution minimizes memory usage and processing time.
**Example:**
"""python
# Large Dataset Handling
import os
import duckdb
import pandas as pd

def analyze_large_dataset(csv_file_path, query):
    """Analyzes a large CSV dataset using DuckDB."""
    try:
        # Establish a connection to DuckDB (in-memory for example)
        conn = duckdb.connect(':memory:')
        # Load the CSV file into a DuckDB table (a CREATE VIEW over read_csv_auto would avoid materializing a copy)
        conn.execute(f"CREATE TABLE my_data AS SELECT * FROM read_csv_auto('{csv_file_path}')")
        # Execute the query
        result = conn.execute(query).fetchdf()  # Retrieve the result as a Pandas DataFrame
        conn.close()
        return result
    except duckdb.Error as e:
        print(f"Error querying large dataset: {e}")
        return None

# Example: create a test file
data = {'col1': [1, 2, 3, 4, 5], 'col2': ['A', 'B', 'C', 'D', 'E'], 'col3': [1.1, 2.2, 3.3, 4.4, 5.5]}
df = pd.DataFrame(data)
csv_file_path = "test.csv"
df.to_csv(csv_file_path, index=False)

query = "SELECT col2, AVG(col3) FROM my_data GROUP BY col2"
results = analyze_large_dataset(csv_file_path, query)
print(results)
os.remove('test.csv')  # Clean up the test file
"""
### 2.5. Transactions
* **Do This:** Use transactions to ensure atomicity, consistency, isolation, and durability (ACID) when performing multiple write operations on DuckDB.
* **Don't Do This:** Perform write operations without transactions, which can lead to data corruption or inconsistencies in case of errors.
* **Consider:** DuckDB manages transaction isolation internally (MVCC) and does not expose configurable isolation levels; design concurrent writers with that in mind.
**Why:** Transactions guarantee data integrity during complex operations.
**Example:**
"""python
import duckdb

def transfer_funds(db_path, account_from, account_to, amount):
    """Transfers funds between two accounts using a transaction."""
    conn = duckdb.connect(db_path)
    try:
        conn.execute("BEGIN TRANSACTION")  # Start transaction
        # 1. Check if the sender account has sufficient balance.
        sender_balance = conn.execute(f"SELECT balance FROM accounts WHERE id = {account_from}").fetchone()[0]
        if sender_balance < amount:
            raise ValueError("Insufficient funds.")
        # 2. Withdraw from the sender account.
        conn.execute(f"UPDATE accounts SET balance = balance - {amount} WHERE id = {account_from}")
        # 3. Deposit to the receiver account.
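        # (Illustrative note: a parameterized form such as
        #  conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", [amount, account_to])
        #  avoids string interpolation entirely; see the parameterized-query guidance in section 3.3.
        #  f-strings are kept in this example for brevity.)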
conn.execute(f"UPDATE accounts SET balance = balance + {amount} WHERE id = {account_to}") conn.commit() print("Funds transferred successfully.") except ValueError as e: conn.rollback() print(f"Transaction rolled back due to {e}") except duckdb.Error as e: conn.rollback() print(f"Error during transfer: {e}") finally: conn.close() #Setup initial state def setup_accounts(db_path): conn = duckdb.connect(db_path) try: conn.execute('CREATE TABLE IF NOT EXISTS accounts (id INTEGER, balance REAL)') conn.execute('INSERT INTO accounts VALUES (1, 1000.0)') conn.execute('INSERT INTO accounts VALUES (2, 500.0)') conn.commit() print ("Set up user account") except duckdb.Error as e: print(f"Error setting up accounts: {e}") conn.rollback() #Roll back in case of an error finally: conn.close() db_path = 'bank_db.duckdb' setup_accounts(db_path) transfer_funds(db_path, 1, 2, 200.0) #Transfer 200 from user 1 to user 2 #Verify results conn = duckdb.connect(db_path) print (conn.execute("SELECT * from accounts").fetchall()) conn.close() """ ## 3. Modern Approaches and Patterns ### 3.1. Reactive Programming * **Do This:** Use reactive programming techniques (e.g., RxPY) to automatically update application state in response to changes in the underlying DuckDB data. * **Don't Do This:** Poll the database repeatedly to detect changes. * **Consider:** Using change data capture (CDC) mechanisms if available within your DuckDB environment (though DuckDB itself has limited direct CDC). **Why:** Reactive programming enables efficient and real-time state updates. **Example (Conceptual, requires external libraries):** """python # Conceptual Reactive Example (Requires e.g., RxPY) # Note: This is a simplified conceptual example. Integration would depend on # specific libraries providing reactive capabilities around database changes. # This demonstrates the idea, not a fully working example. import duckdb import reactivex from reactivex import operators as ops def create_database_observable(db_path, query, interval): """Creates an observable that emits data from a DuckDB query at a given interval.""" def subscribe(observer, scheduler=None): def run(): try: conn = duckdb.connect(db_path) result = conn.execute(query).fetchall() observer.on_next(result) conn.close() except Exception as e: observer.on_error(e) # Propagate any errors to the observable #Recursive function to keep schedule until disposed if not observer.is_stopped: scheduler.schedule(run, interval) #Initial Schedule with recusive function scheduler.schedule(run, interval) return reactivex.disposable.Disposable(run, interval) return reactivex.create(subscribe) #Example DB Setup db_path = 'reactive_db.duckdb' conn = duckdb.connect(db_path) conn.execute("CREATE TABLE IF NOT EXISTS sensor_data (timestamp TIMESTAMP, temperature REAL)") conn.execute("INSERT INTO sensor_data VALUES ('2024-11-07 10:00:00', 25.5)") conn.close() # Create an observable that queries the DuckDB database every 5 seconds. db_observable = create_database_observable(db_path, "SELECT * FROM sensor_data", 5) # Subscribe to the observable and print the data. def on_next(data): print(f"Data emitted: {data}") def on_error(error): print(f"Error: {error}") def on_completed(): print("Completed") disposable = db_observable.subscribe( on_next=on_next, # Function to call when data is emitted on_error=on_error, # Function to call if there's an error on_completed=on_completed # Function when observable is stopped ) # Wait for 15 seconds to receive three emissions. 
import time time.sleep(15) # Dispose and close the database connection. disposable.dispose() conn = duckdb.connect(db_path) #To avoid errors conn.close() """ ### 3.2. Using DuckDB with Arrow for Data Transfer * **Do This:** Leverage Apache Arrow as a data transfer format between DuckDB and other systems (e.g., Pandas, Spark). Use the "arrow()" method from DuckDB connection objects to fetch data as Arrow tables. * **Don't Do This:** Rely on inefficient data serialization formats when transferring data between DuckDB and other systems. **Why:** Arrow provides zero-copy data sharing, minimizing overhead. **Example:** """python # Arrow Example import duckdb import pyarrow as pa db_conn = duckdb.connect(':memory:') db_conn.execute("CREATE TABLE my_data (id INTEGER, value VARCHAR)") db_conn.execute("INSERT INTO my_data VALUES (1, 'hello'), (2, 'world')") arrow_table = db_conn.execute("SELECT * FROM my_data").arrow() print(arrow_table) print(type(arrow_table)) # Print the type of the arrow_table db_conn.close() """ ### 3.3. Parameterized Queries * **Do This:** Use parameterized queries to prevent SQL injection attacks and improve query performance. * **Don't Do This:** Concatenate user input directly into SQL queries. **Why:** Parameterized queries sanitize user input and allow DuckDB to optimize query execution. **Example:** """python # Parameterized Query import duckdb def get_user(db_path, user_id): """Retrieves a user from the database using a parameterized query.""" conn = duckdb.connect(db_path) try: result = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchone() if result: return { 'id': result[0], 'username': result[1], 'email': result[2] } else: return None except duckdb.Error as e: print(f"Error retrieving user: {e}") return None finally: conn.close() db_path = 'user_db.duckdb' conn = duckdb.connect(db_path) conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, username VARCHAR, email VARCHAR)") conn.execute("INSERT INTO users VALUES (1, 'john_doe', 'john.doe@example.com')") conn.close() user = get_user(db_path, 1) print (user) """ ## 4. Error Handling and Logging ### 4.1. Specific Exception Handling * **Do This:** Catch specific "duckdb.Error" exceptions to handle different error conditions (e.g., "duckdb.CatalogException", "duckdb.InvalidInputException"). * **Don't Do This:** Use generic "except Exception:" blocks that can mask underlying issues. **Why:** Specific exception handling allows for targeted error recovery logic. **Example:** """python import duckdb def execute_query(db_path, query): """Executes a SQL query and handles potential DuckDB errors.""" conn = duckdb.connect(db_path) try: result = conn.execute(query).fetchall() return result except duckdb.CatalogException as e: print(f"Table not found: {e}") return None except duckdb.InvalidInputException as e: print(f"Invalid input: {e}") return None except duckdb.Error as e: print(f"General DuckDB error: {e}") return None finally: conn.close() db_path = 'test_db.duckdb' results = execute_query(db_path, "SELECT * FROM non_existent_table") #Raises duckdb.CatalogException print(results) results = execute_query(db_path, "SELECT * FROM 123") #Invalid query, raises duckdb.InvalidInputException """ ### 4.2. Logging * **Do This:** Use a logging framework (e.g., "logging" in Python) to record significant events, errors, and warnings related to DuckDB operations. * **Don't Do This:** Rely solely on "print()" statements for debugging in production code. Include log levels (INFO, WARNING, ERROR) appropriately. 
* **Consider:** Implementing structured logging to facilitate analysis of log data. **Why:** Logging provides valuable insights into application behavior and simplifies troubleshooting. **Example:** """python import duckdb import logging # Configure the logger logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') def execute_query(db_path, query): """Executes a SQL query with logging.""" conn = duckdb.connect(db_path) try: logging.info(f"Executing query: {query}") result = conn.execute(query).fetchall() logging.info(f"Query executed successfully.") return result except duckdb.Error as e: logging.error(f"Error executing query: {e}", exc_info=True) # Log the exception details return None finally: conn.close() # Create a dummy database db_path = 'test_logging.duckdb' execute_query(db_path, "SELECT * FROM t") # Intentionally cause an error (table does not exist) """ This document provides a foundational set of standards for effective state management in DuckDB applications. By adhering to these guidelines, developers can create robust, maintainable, and performant solutions. Remember to continually review and adapt these standards as DuckDB evolves and new best practices emerge.
# Code Style and Conventions Standards for DuckDB This document outlines the coding style and conventions for the DuckDB project. Adhering to these guidelines is crucial for maintaining code readability, consistency, and ultimately, the long-term maintainability and performance of DuckDB. This applies equally to human contributors and AI coding assistants. ## 1. General Principles ### 1.1. Consistency * **Do This:** Maintain consistency in style within a single file, module, and throughout the entire codebase. If a file uses a particular naming convention or formatting style, new code introduced to that file should follow that convention. * **Don't Do This:** Introduce inconsistent styling within the same file or module. Avoid abrupt changes of style unless there's a compelling reason and broad agreement within the development team. **Why:** Consistency reduces cognitive load, making code easier to read, understand, and debug. It also simplifies the process of integrating contributions from multiple developers. ### 1.2. Readability * **Do This:** Write code that is easy to understand at a glance. Use meaningful names, clear comments, and appropriate indentation to enhance readability. * **Don't Do This:** Write overly condensed or terse code that sacrifices clarity for brevity. Avoid magic numbers or deeply nested conditional statements without clear explanations. **Why:** Readability is paramount. The code is read far more often than it is written. Optimizing for readability simplifies debugging, maintenance, and future modifications. ### 1.3. Simplicity * **Do This:** Favor simple, straightforward solutions over complex, convoluted ones. Break down large functions into smaller, more manageable pieces. * **Don't Do This:** Prematurely optimize code or introduce unnecessary complexity. Avoid over-engineering solutions before understanding the actual problem. **Why:** Simpler code is less prone to errors, easier to test, and easier to understand. It reduces the surface area for bugs and security vulnerabilities. ### 1.4. Following Existing Conventions * **Do This:** Familiarize yourself with the existing code and style. Emulate the conventions already in use. * **Don't Do This:** Impose your personal coding style if it significantly deviates from the established conventions. **Why:** Respecting the existing code conventions keeps the overall codebase uniform and easy to navigate for all developers. ## 2. Formatting ### 2.1. Indentation * **Do This:** Use 4 spaces for indentation. Do not use tabs. Configure your editor to automatically convert tabs to spaces. """c++ // Correct if (condition) { statements; } // Incorrect if (condition) { statements; // Using tab for indentation } """ **Why:** Spaces ensure consistent indentation across different editors and platforms, while 4 spaces provide sufficient visual separation without excessive horizontal space. ### 2.2. Line Length * **Do This:** Limit lines to a maximum of 120 characters (including indentation). Break long lines into multiple shorter lines. """c++ // Correct auto result = very_long_function_name(argument1, argument2, argument3, argument4, argument5, argument6); // Incorrect auto result = very_long_function_name(argument1, argument2, argument3, argument4, argument5, argument6); // Line exceeds 120 characters """ **Why:** Limiting line length improves readability, especially on smaller screens or when viewing code side-by-side in diff tools. Also, it may prevent issues with automated code review tools. ### 2.3. 
Whitespace * **Do This:** * Use a single space after commas, colons, and semicolons. * Use spaces around operators (e.g., "=", "+", "-", "*", "/", "<", ">"). * Add a blank line between logical blocks of code within a function. * Put a blank line between functions and classes. * No space after "(" or before ")" in function calls. """c++ // Correct int x = (a + b) * c; for (int i = 0; i < n; i++) { process(data[i], param1, param2); } // Incorrect int x= (a+b)*c; // Missing spaces around operators and after comma. for(int i=0;i<n;i++){ // Missing spaces. process(data[i],param1, param2); // Missing spaces after comma. } """ **Why:** Whitespace improves readability by visually separating different elements of the code. ### 2.4. Braces * **Do This:** * Place opening braces "{" on the same line as the statement (e.g., "if", "for", "while", function definitions). * Closing braces "}" should be on their own line. """c++ // Correct if (condition) { // ... } else { // ... } // Incorrect if (condition) { // ... } else { // ... } """ * For one-line statements inside if/else/for/while, braces are optional, *but consistency is preferred*. If one branch of an "if" statement uses braces, the other branch *should* use braces as well. """c++ // Correct (Consistent) if (condition) { do_something(); } else { do_something_else(); } // Also Correct (Both one-liners, but using braces) if (condition) { do_something(); } else { do_something_else(); } // Correct (Both one-liners, no braces) if (condition) do_something(); else do_something_else(); // Incorrect (Inconsistent) if (condition) { do_something(); } else do_something_else(); """ **Why:** Consistent brace placement improves readability and reduces the risk of errors related to scope. ### 2.5. Vertical Alignment (Use Sparingly) * **Do This:** Consider using vertical alignment to enhance readability, especially for complex initializations or assignments, but avoid overusing it. """c++ // Correct enum class DataType { INTEGER = 1, FLOAT = 2, VARCHAR = 3, }; // Less Readable enum class DataType { INTEGER = 1, FLOAT = 2, VARCHAR = 3, }; """ **Why:** Vertical alignment can visually group related elements, making the code easier to scan and understand. However, excessive or inconsistent vertical alignment can make the code harder to maintain. ## 3. Naming Conventions ### 3.1. General Naming * **Do This:** Use descriptive and meaningful names that clearly indicate the purpose of the variable, function, or class. * **Don't Do This:** Use single-letter variable names or acronyms unless they are widely understood within the context (e.g., "i" for loop counters). Also, avoid extremely long names; instead, strive to be succinct. **Why:** Meaningful names significantly improve code readability and reduce the need for comments. ### 3.2. Variables * **Do This:** Use "snake_case" for variable names (e.g., "row_count", "data_buffer"). Variables should be named following what they represent; often a noun. """c++ // Correct int row_count = 100; std::string file_path = "/path/to/file"; // Incorrect int r = 100; // Unclear meaning std::string FilePath = "/path/to/file"; // Incorrect case """ **Why:** "snake_case" is widely used in C++ and improves readability. ### 3.3. Constants * **Do This:** Use "UPPER_SNAKE_CASE" for constant names (e.g., "MAX_ROWS", "DEFAULT_BUFFER_SIZE"). 
"""c++ // Correct const int MAX_ROWS = 1000; const std::string DEFAULT_FILE_PATH = "/default/path"; // Incorrect const int maxRows = 1000; // Incorrect case """ **Why:** Using a distinct naming convention for constants makes it easy to identify them in the code. ### 3.4. Functions * **Do This:** Use "snake_case" for function names (e.g., "calculate_average", "process_data"). Function names should be verbs with clear meaning. """c++ // Correct int calculate_average(const std::vector<int>& data); void process_data(const std::string& file_path); // Incorrect int CalculateAverage(const std::vector<int>& data); // Incorrect case void DataProcesser(const std::string& file_path); // Confusing name """ **Why:** "snake_case" promotes readability. Verb-based names clearly convey the action performed by the function. ### 3.5. Classes and Structs * **Do This:** Use "PascalCase" for class and struct names (e.g., "QueryResult", "DataBlock"). """c++ // Correct class QueryResult { // ... }; struct DataBlock { // ... }; // Incorrect class query_result { // Incorrect case // ... }; """ **Why:** "PascalCase" is a commonly used convention for class and struct names. ### 3.6. Member Variables * **Do This:** Prefix member variables with "m_" (e.g., "m_row_count", "m_data_buffer"). """c++ class MyClass { private: int m_row_count; std::string m_data_buffer; }; """ **Why:** The "m_" prefix clearly distinguishes member variables from local variables, enhancing code readability within the class. ### 3.7. Template Parameters * **Do This:** Use single uppercase letters or descriptive names for template parameters (e.g., "T", "KeyType", "ValueType"). """c++ // Correct template <typename T> T square(T value); template <typename KeyType, typename ValueType> class MyMap { // ... }; // Incorrect template <typename t> // Lowercase letter for template T square(T value); """ **Why:** Clarity over verbosity. Single letters are good for simple cases, descriptive names helpful for complex ones. ## 4. Comments ### 4.1. General Guidelines * **Do This:** Write comments to explain complex logic, algorithms, or design decisions. Comments should explain *why* the code is doing something, not *what* it is doing (the code itself explains *what*). * **Don't Do This:** Write obvious comments that simply restate what the code already says. Avoid using comments as a substitute for clear and readable code. Be wary of redundant comment blocks. **Why:** Effective comments provide valuable context and help other developers understand the intent behind the code. Bad comments clutter the code and become outdated. ### 4.2. Function Comments (Doxygen Style) * **Do This:** Use Doxygen-style comments to document functions, classes, and structs. Include a brief description, parameter descriptions, and return value descriptions. """c++ /** * @brief Calculates the average of a vector of integers. * * @param data The input vector of integers. * @return The average of the input data, or 0 if the vector is empty. */ int calculate_average(const std::vector<int>& data) { // ... } """ **Why:** Doxygen comments can be automatically processed to generate API documentation. ### 4.3. Inline Comments * **Do This:** Use inline comments ("//") to explain specific lines or sections of code that are not immediately obvious. """c++ // Calculate the offset based on the page size size_t offset = page_number * page_size; """ **Why:** Inline comments can help clarify tricky logic. They bridge the gap between intent and implementation. ### 4.4. 
TODO Comments * **Do This:** Use "TODO" comments to mark sections of code that need further attention or improvement (e.g., "// TODO: Handle edge cases"). * **Don't Do This:** Leave "TODO" comments in the codebase indefinitely. Ensure that all "TODO" comments are addressed before merging code. **Why:** "TODO" comments serve as reminders for future work. ## 5. C++ Specifics ### 5.1. Smart Pointers * **Do This:** Prefer smart pointers ("std::unique_ptr", "std::shared_ptr") over raw pointers to manage memory automatically and prevent memory leaks. Use "std::make_unique" and "std::make_shared" for exception safety and efficiency. """c++ // Correct std::unique_ptr<MyObject> obj = std::make_unique<MyObject>(); // Incorrect MyObject* obj = new MyObject(); // Raw pointer, potential memory leak """ **Why:** Smart pointers automate memory management, reducing the risk of memory leaks and dangling pointers. Using "make_unique" and "make_shared" provides exception safety, particularly when constructing function arguments. ### 5.2. RAII (Resource Acquisition Is Initialization) * **Do This:** Utilize RAII to manage resources (e.g., file handles, mutexes) by associating resource ownership with an object. When the object goes out of scope, the resource is automatically released. """c++ class FileHandler { public: FileHandler(const std::string& filename) : m_file(fopen(filename.c_str(), "r")) { if (!m_file) { throw std::runtime_error("Failed to open file"); } } ~FileHandler() { if (m_file) { fclose(m_file); } } private: FILE* m_file; }; // Usage { FileHandler handler("my_file.txt"); // ... use the file } // File is automatically closed when handler goes out of scope """ **Why:** RAII ensures that resources are always released, even in the presence of exceptions. This is crucial for preventing resource leaks and ensuring program correctness. ### 5.3. Const Correctness * **Do This:** Use the "const" keyword whenever possible to indicate that a variable, argument, or function does not modify the underlying data. """c++ // Correct int get_value() const { return m_value; // This function does not modify the class state. } void process_data(const std::vector<int>& data); // Data is not modified // Incorrect int get_value() { // Missing const qualifier return m_value; } """ **Why:** "const" correctness helps the compiler catch errors related to unintended modifications. It also improves code readability by clearly indicating which parts of the code are read-only. ### 5.4. Exceptions * **Do This:** Use exceptions to signal exceptional conditions or errors that cannot be handled locally. Avoid using exceptions for normal control flow. * **Don't Do This:** Ignore exceptions or catch them without proper handling. This can lead to undefined behavior or data corruption. **Why:** Exceptions provide a robust mechanism for error handling. They ensure that errors are propagated to a level where they can be properly addressed. ### 5.5. Avoidance of C-style casts * **Do This:** Prefer C++-style casts ("static_cast", "dynamic_cast", "reinterpret_cast", "const_cast") over C-style casts. """c++ // Correct double value = static_cast<double>(integer_value); // Incorrect double value = (double)integer_value; """ **Why:** C++-style casts provide better type safety and allow the compiler to perform more thorough checks. ### 5.6. Namespaces * **Do This:** Enclose DuckDB-specific code within the "duckdb" namespace. Create sub-namespaces for better organization when components grow in size. 
Use anonymous namespaces or "static" for file-local symbols to avoid naming conflicts. """c++ namespace duckdb { // Code related to DuckDB's core functionality namespace catalog { // Code relating to the DuckDB catalog } } // namespace duckdb """ **Why:** Namespaces prevent naming collisions, especially within large projects, and provide scope and organization. ### 5.7. Modern C++ Features (C++17 and later) * **Do This:** Utilize modern C++ features like structured bindings, "std::optional", "std::variant", and range-based for loops to write more concise and expressive code. Favor "auto" for type deduction where it improves readability. """c++ // Structured binding std::pair<int, std::string> get_data() { return {1, "hello"}; } auto [id, message] = get_data(); // Optional std::optional<int> maybe_value = get_value_if_present(); if (maybe_value) { std::cout << "Value: " << *maybe_value << std::endl; } // Range-based for loop (if you don't need the index itself in the loop) std::vector<int> values = {1, 2, 3, 4, 5}; for (int value : values) { std::cout << value << std::endl; } // For loop with index and the value at the index for (size_t i = 0; i < size(values); i++) { std::cout << values[i] << std::endl; } """ **Why:** Modern C++ features improve code safety, readability, and maintainability. "auto" makes the code more flexible and reduces the risk of type-related errors. ## 6. DuckDB Specifics ### 6.1. Data Structures * **Do This:** Utilize DuckDB's internal data structures (e.g., "Vector", "DataChunk", "SelectionVector") for efficient data processing. Understand the semantics and performance characteristics of these structures. Prefer modern mechanisms like "ArenaAllocator" to manage memory lifetimes. * **Don't Do This:** Reinvent the wheel by creating custom data structures that duplicate existing functionality. Try to stay DRY (Don't Repeat Yourself.) **Why:** Using DuckDB's internal data structures ensures compatibility with the query execution engine and optimizes performance. ### 6.2. Expression Evaluation * **Do This:** Follow the expression evaluation framework when implementing new functions or operators. Consider using DuckDB's vectorized execution model for better performance. * **Don't Do This:** Manually iterate over data when you can leverage DuckDB's vectorized execution capabilities. **Why:** The expression evaluation framework provides a consistent and efficient way to process data. ### 6.3. File Formats * **Do This:** Adhere to the specified file format conventions when implementing new file format readers or writers. * **Don't Do This:** Introduce custom file formats without proper documentation and integration with the DuckDB ecosystem. **Why:** Consistent file format handling enhances interoperability and simplifies data exchange. ### 6.4. Extension API * **Do This:** Follow the guidelines of the extension API when creating custom functions or extensions. Ensure that extensions are properly documented and tested. * **Don't Do This:** Modify DuckDB's core code directly unless you are a core developer. **Why:** The extension API allows you to extend DuckDB's functionality without modifying the core code, maintaining modularity. ## 7. Testing ### 7.1. Unit Tests * **Do This:** Write unit tests for all new code and bug fixes. Ensure that tests cover all relevant scenarios and edge cases. Leverage DuckDB's testing framework. * **Don't Do This:** Skip writing tests or write incomplete tests. 
**Why:** Unit tests verify the correctness of individual components and prevent regressions. ### 7.2. Integration Tests * **Do This:** Write integration tests to verify the interaction between different components. * **Don't Do This:** Assume that individual components will work correctly together without proper integration testing. **Why:** Integration tests ensure that different parts of the system work together as expected. ### 7.3. Performance Benchmarks * **Do This:** Benchmark new code and compare its performance against existing implementations. Use DuckDB's benchmarking tools. * **Don't Do This:** Introduce performance regressions without proper justification. **Why:** Performance benchmarks help identify and prevent performance bottlenecks. ## 8. Code Review ### 8.1. Review Process * **Do This:** Submit all code changes for review by other developers. Provide clear and concise descriptions of the changes. Actively participate in code reviews conducted on your code. * **Don't Do This:** Merge code without review or ignore feedback from reviewers. **Why:** Code review helps identify potential problems and ensure that the code adheres to the coding standards. ### 8.2. Review Focus * **Do This:** Focus on code correctness, readability, performance, and security during code reviews. Check for potential bugs, memory leaks, and security vulnerabilities. * **Don't Do This:** Focus solely on superficial aspects of the code (e.g., whitespace). **Why:** Code review improves the overall quality of the codebase and prevents defects from being introduced. By adhering to these coding standards, we can ensure the long-term maintainability, performance, and security of DuckDB. This document should be considered a living document and will be updated as needed to reflect current best practices and the evolving landscape of DuckDB. Remember: consistently applied standards enhance collaboration and expedite development, whether you are human or machine.