# Security Best Practices Standards for DuckDB
This document outlines the security best practices for developing applications using DuckDB. These standards aim to guide developers in writing secure, maintainable, and performant code that mitigates common vulnerabilities. This will also enable AI coding assistants to produce code that aligns with these standards.
## 1. Input Validation and Sanitization
### 1.1. Rationale
DuckDB, while primarily operating locally, can be exposed to external data sources depending on your architecture. Malicious or improperly formatted input can lead to data corruption, unexpected behavior, or potentially, although less common, code execution if using user-defined functions (UDFs) that are not properly secured. Input validation and sanitization are critical to prevent these issues.
### 1.2. Standards
* **Do This:** Validate all external inputs to DuckDB before processing them.
* **Don't Do This:** Directly use untrusted data from external sources (files, network) without validation.
### 1.3. Implementation
#### 1.3.1. Data Type Validation
Always verify that incoming data matches the expected data type. DuckDB performs implicit type coercion, which can sometimes mask issues; explicit checks provide better control.
"""python
import duckdb
def insert_data(conn, data: dict):
"""
Inserts data into a DuckDB table after validating data types.
"""
try:
# Validate data types before insertion
if not isinstance(data["id"], int):
raise ValueError("id must be an integer")
if not isinstance(data["name"], str):
raise ValueError("name must be a string")
if not isinstance(data["value"], float):
raise ValueError("value must be a float")
conn.execute("INSERT INTO my_table VALUES (?, ?, ?)", (data["id"], data["name"], data["value"]))
except ValueError as e:
print(f"Data validation error: {e}")
except duckdb.Error as e:
print(f"DuckDB error: {e}")
# Example usage
conn = duckdb.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, name VARCHAR, value DOUBLE)")
good_data = {"id": 1, "name": "example", "value": 1.23}
insert_data(conn, good_data)
bad_data = {"id": "string", "name": "example", "value": 1.23} # Incorrect id type
insert_data(conn, bad_data) # This will now raise and be handled gracefully.
conn.close()
"""
#### 1.3.2. Range and Format Validation
Beyond data types, constrain the range of possible values and enforce specific formats where necessary.
"""python
import duckdb
import re
def validate_email(email: str) -> bool:
"""Validates an email address using a regular expression."""
email_regex = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
return bool(re.match(email_regex, email))
def insert_user(conn, user_data: dict):
"""Inserts user data into a DuckDB table after validation."""
try:
if not isinstance(user_data["age"], int) or not (0 <= user_data["age"] <= 120): # Reasonable age range
raise ValueError("Age must be an integer between 0 and 120")
if not validate_email(user_data["email"]):
raise ValueError("Invalid email format")
conn.execute("INSERT INTO users (age, email) VALUES (?, ?)", (user_data["age"], user_data["email"]))
except ValueError as e:
print(f"Data validation error: {e}")
except duckdb.Error as e:
print(f"DuckDB error: {e}")
# Example Usage
conn = duckdb.connect(":memory:")
conn.execute("CREATE TABLE users (age INTEGER, email VARCHAR)")
good_user = {"age": 30, "email": "test@example.com"}
insert_user(conn, good_user)
bad_user = {"age": -5, "email": "invalid-email"} # Invalid age and email
insert_user(conn, bad_user)
conn.close()
"""
#### 1.3.3. Sanitization
Sanitize data to remove or escape potentially harmful characters. This applies especially to strings being used within SQL queries dynamically. Consider data masking techniques, like hashing or tokenization, for sensitive data before storing it in DuckDB.
"""python
import duckdb
import bleach # pip install bleach
def insert_comment(conn, comment: str):
"""Inserts a sanitized comment into the database."""
sanitized_comment = bleach.clean(comment, tags=[], attributes={}, styles=[], strip=True) # Options can be configured.
try:
conn.execute("INSERT INTO comments (comment) VALUES (?)", (sanitized_comment,))
except duckdb.Error as e:
print(f"DuckDB error: {e}")
# Example Usage
conn = duckdb.connect(":memory:")
conn.execute("CREATE TABLE comments (comment VARCHAR)")
unsafe_comment = " This is a comment."
insert_comment(conn, unsafe_comment)
conn.execute("SELECT * FROM comments").show() # The script tag will be removed.
conn.close()
"""
Bleach is a Python library designed for sanitizing HTML. Other types of potentially malicious input may require a different sanitization approach. SQL parameters should always be used (see section 2) to prevent code injection, but sanitization pre-emptively reduces the risk of errors and of storing unintended markup or executable content.
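As one illustration of the masking suggestion in section 1.3.3, sensitive fields can be hashed before they are stored. The sketch below uses the standard "hashlib" module with an illustrative table; note that an unsalted hash only pseudonymizes low-entropy values such as email addresses, so a keyed hash (HMAC) or a tokenization service is stronger in practice.
"""python
import duckdb
import hashlib

def mask_email(email: str) -> str:
    """Returns a one-way SHA-256 hash of an email address (hex-encoded)."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

conn = duckdb.connect(":memory:")
conn.execute("CREATE TABLE signups (id INTEGER, email_hash VARCHAR)")
# Store only the hash; the plaintext email never reaches the database.
conn.execute("INSERT INTO signups VALUES (?, ?)", (1, mask_email("test@example.com")))
print(conn.execute("SELECT * FROM signups").fetchall())
conn.close()
"""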
## 2. SQL Injection Prevention
### 2.1. Rationale
SQL injection is a critical vulnerability where malicious SQL code can be injected into a database query through user input. This can lead to unauthorized data access, modification, or deletion.
### 2.2. Standards
* **Do This:** Always use parameterized queries or prepared statements.
* **Don't Do This:** Dynamically construct SQL queries by directly concatenating user input.
### 2.3. Implementation
#### 2.3.1. Parameterized Queries
Parameterized queries are the most effective way to prevent SQL injection. Parameters are treated as data, not as part of the SQL command, thus preventing malicious injected commands from executing.
"""python
import duckdb
def search_products(conn, search_term: str):
"""Searches for products using a parameterized query to prevent SQL injection."""
try:
# Use ? as a placeholder for the parameter. DuckDB will escape the parameter.
results = conn.execute("SELECT * FROM products WHERE name LIKE ?", ('%' + search_term + '%',)).fetchall() # Always wrap parameters in a tuple
return results
except duckdb.Error as e:
print(f"DuckDB error: {e}")
return []
# Example Usage
conn = duckdb.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name VARCHAR)")
conn.execute("INSERT INTO products VALUES (1, 'Laptop'), (2, 'Mouse'), (3, 'Keyboard')")
search_term = "Laptop"
products = search_products(conn, search_term)
print(products)
# Malicious input - this will NOT result in SQL injection because of the parameter
search_term = "'; DROP TABLE products; --"
products = search_products(conn, search_term) # it is treated as literal string, even with ' and --
print(products)
conn.close()
"""
**Important:** The "?" placeholder is the correct syntax for parameterized queries in DuckDB. Ensure you always wrap the parameters in a tuple, even if there's only one parameter.
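The same placeholder syntax extends to bulk inserts through "executemany", which binds one tuple per row and avoids any string concatenation. A brief sketch with an illustrative table:
"""python
import duckdb

conn = duckdb.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name VARCHAR)")
rows = [(1, "Laptop"), (2, "Mouse"), (3, "Keyboard")]
# Each row is bound as parameters - no string building, no injection risk.
conn.executemany("INSERT INTO products VALUES (?, ?)", rows)
print(conn.execute("SELECT count(*) FROM products").fetchone())
conn.close()
"""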
#### 2.3.2. Escaping (Discouraged, but sometimes necessary as a last resort)
While parameterized queries are strongly preferred, there might be cases where you need to dynamically build parts of the SQL query (e.g., table names or column names). In such rare scenarios, proper escaping or whitelisting is necessary.
**WARNING:** Escaping should be treated as a last resort only when parameterized queries are strictly impossible.
"""python
import duckdb
import shlex # For string escaping
def dynamic_sort(conn, column_name: str):
"""Sorts a table dynamically based on a column name. AVOID IF POSSIBLE, use parameterized queries when possible."""
# Whitelist valid column names before escaping
valid_columns = ["id", "name", "price"]
if column_name not in valid_columns:
raise ValueError("Invalid column name")
# Even with whitelisting, still escape the column name
escaped_column = shlex.quote(column_name) # Escaping function
try:
# Concatenating the escaped column name into the query
query = f"SELECT * FROM items ORDER BY {escaped_column}"
results = conn.execute(query).fetchall()
return results
except duckdb.Error as e:
print(f"DuckDB error: {e}")
return []
# Example usage
conn = duckdb.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name VARCHAR, price DOUBLE)")
conn.execute("INSERT INTO items VALUES (1, 'Apple', 1.0), (2, 'Banana', 0.5), (3, 'Orange', 0.75)")
sorted_items = dynamic_sort(conn, "price")
print(sorted_items) # Output: [(2, 'Banana', 0.5), (3, 'Orange', 0.75), (1, 'Apple', 1.0)]
#Demonstrates what a failure would look like:
try:
sorted_items = dynamic_sort(conn, "injected_code; DROP TABLE items;") # Invalid column given as input
except ValueError as e:
print(f"caught: {e}")
conn.close()
"""
**Explanation:**
* **Whitelisting:** Ensuring the "column_name" is within a predefined set of valid columns is the primary defense; identifiers cannot be bound as query parameters, so nothing outside the whitelist should ever reach the query string.
* **Identifier quoting:** Wrap the whitelisted identifier in double quotes when interpolating it into the query. Shell-escaping helpers such as "shlex.quote" are designed for shell commands, not SQL, and should not be relied on for SQL identifiers.
* **Error Handling:** Include comprehensive error handling to catch any unexpected exceptions during query execution.
## 3. Principle of Least Privilege
### 3.1. Rationale
The principle of least privilege (PoLP) dictates that a user, process, or system should have only the minimum necessary privileges required to perform its intended function. Limiting privileges reduces the potential damage that can be caused by accidental misuse or malicious exploitation. DuckDB itself doesn't have users/roles like a traditional client-server database as it is in-process. However, consider scenarios where your Python application connects to, and possibly creates, databases in the filesystem.
### 3.2. Standards
* **Do This:** Grant only the necessary file system permissions to the user/process running the DuckDB application. Restrict database file creation to specific directories.
* **Don't Do This:** Run the application with superuser or excessive permissions. Make database files universally accessible/writable.
### 3.3. Implementation
#### 3.3.1. Filesystem Permissions
The following is a Linux example, but the same principles apply to all operating systems.
* **Create a dedicated user:** Create a dedicated user account (e.g., "duckdb_app") to run the application.
* **Restrict database directory:** Create a directory for DuckDB databases (e.g., "/opt/duckdb_data") and set the owner and group to the dedicated user.
"""bash
sudo adduser duckdb_app
sudo mkdir /opt/duckdb_data
sudo chown duckdb_app:duckdb_app /opt/duckdb_data
sudo chmod 700 /opt/duckdb_data # Ensures no one but the user can read and write
"""
#### 3.3.2. Application Configuration
* **Configure the application:** In your application's configuration, explicitly specify the database path within the restricted directory. Use relative paths within that directory when opening databases.
* **Avoid hardcoding credentials:** Don't commit credentials to the repository.
"""python
import duckdb
import os
# This assumes the duckdb_app has full permissions to the /opt/duckdb_data directory
DATABASE_PATH = "/opt/duckdb_data/my_database.duckdb" # Explicitly define the path
# Best practice to keep data in a separate directory, with limited permissions outside directory
# Can use relative path from that location when in a secure directory. However, using
# absolute paths when initially connecting may be better since it makes it clear where the app
# is connecting, avoiding any relative path issues
# Create a database at the explicitly defined path.
conn = duckdb.connect(DATABASE_PATH)
conn.execute("CREATE TABLE IF NOT EXISTS my_table (id INTEGER, name VARCHAR)") # Example
conn.close()
"""
## 4. Secure User-Defined Functions (UDFs)
### 4.1. Rationale
User-Defined Functions (UDFs) extend DuckDB's functionality with custom code, but they also introduce potential security risks. If UDFs are not properly vetted and secured, they can become a gateway for malicious code execution within the DuckDB process.
### 4.2. Standards
* **Do This:** Carefully review and test all UDFs before deploying them. Implement input validation and sanitization within the UDF itself. If accessing external resources, use secure methods. Limit side effects.
* **Don't Do This:** Allow untrusted users to define or execute UDFs without thorough security checks. Execute external system commands directly within a UDF.
### 4.3. Implementation
#### 4.3.1. Input Validation within UDFs
Always validate and sanitize inputs within the UDF to prevent unexpected behavior or vulnerabilities.
"""python
import duckdb
import subprocess
def safe_udf(input_string: str) -> str:
"""
A secure UDF that validates the input string before processing.
Uses a safe subprocess call to avoid shell injection.
"""
# Input Validation: Only allow alphanumeric characters and spaces in the input string.
if not input_string.isalnum() and not ' ' in input_string:
return "ERROR: Invalid input characters." # Handle the error safely
# If you NEED external tools, use subprocess.run with shell=False.
try:
result = subprocess.run(["echo", input_string], capture_output=True, text=True, shell=False, timeout=5) # example
return result.stdout.strip()
except subprocess.TimeoutExpired:
return "ERROR: Timeout during command execution."
except Exception as e:
return f"ERROR: An unexpected error occurred: {str(e)}"
# Example usage:
conn = duckdb.connect(":memory:")
conn.create_function("safe_udf", safe_udf)
# Execute the UDF on some data.
result = conn.execute("SELECT safe_udf('Hello World')").fetchone()[0]
print(result)
result = conn.execute("SELECT safe_udf('Hello World; rm -rf /')").fetchone()[0] # This will return an error now!
print(result)
conn.close()
"""
#### 4.3.2. Limiting Side Effects
UDFs should ideally be pure functions, meaning they don't have side effects (e.g., modifying global state, writing to files). If side effects are unavoidable, carefully control and audit them.
"""python
import duckdb
import os
def file_writing_udf(input_string: str) -> str:
"""Writes input to a file in a controlled directory. This UDF serves as an example of how to implement a secure UDF."""
#Input Validation is CRITICAL here.
if not input_string.isalnum():
return "ERROR: Invalid filename characters."
# Determine the file path
file_path = os.path.join("/tmp/secure_udf_dir/", input_string + ".txt") # Ensure the directory exist
if not file_path.startswith("/tmp/secure_udf_dir/"): #Sanity check
return "ERROR: Path escaping attempt"
try:
with open(file_path, "w") as f:
f.write("This is written by the file_writing_udf")
return f"File written to {file_path}"
except OSError as e:
return f"Error: {e}" #Handle error
def setup_udf_filesystem():
"""Prepare the filesystem outside of DuckDB connection session."""
# Create the directory, make it writable by our user only
try:
os.makedirs("/tmp/secure_udf_dir", exist_ok=True)
os.chmod("/tmp/secure_udf_dir", 0o700) #read, write and execute for the user
except OSError as e:
print(f"Error when setting up a secure directory: {e}")
# Example Usage
conn = duckdb.connect(":memory:")
setup_udf_filesystem()
conn.create_function("file_writing_udf", file_writing_udf)
# Execute the UDF:
file_name = "output"
result = conn.execute(f"SELECT file_writing_udf('{file_name}')").fetchone()[0]
print(result)
#Check to see if there is any error from invalid call
file_name = "/../../../../very_invalid_output"
result = conn.execute(f"SELECT file_writing_udf('{file_name}')").fetchone()[0]
print(result)
conn.close()
"""
#### 4.3.3. External Dependencies
If your UDF uses external dependencies, carefully manage and vet these dependencies. Always use virtual environments and pin dependency versions to prevent supply chain attacks.
## 5. Data Encryption
### 5.1. Rationale
Data encryption protects sensitive data both at rest (stored on disk) and in transit (while being transmitted over a network). Even though DuckDB is often used as an embedded database, encrypting sensitive data adds an extra layer of protection against unauthorized access.
### 5.2. Standards
* **Do This:** Encrypt sensitive data at rest and in transit, consider full disk encryption and authenticated connections where possible.
* **Don't Do This:** Store sensitive data in plain text without encryption.
### 5.3. Implementation
#### 5.3.1. DuckDB Encryption Support
Native encryption at rest was introduced in DuckDB 1.4: a key is supplied through the "ENCRYPTION_KEY" option when attaching a database file. Earlier releases (including the 0.x line) have no built-in database encryption, so rely on filesystem or full-disk encryption there; note that the SQLite-style "PRAGMA key" command is not DuckDB syntax. The sketch below assumes DuckDB 1.4 or later; check your version's documentation for the exact options.
"""python
import duckdb
import os

# Generate a random encryption key (for demonstration purposes only - store keys securely in production)
encryption_key = os.urandom(32).hex()  # 32 random bytes, hex-encoded

# Create an encrypted DuckDB database by attaching it with an encryption key (DuckDB 1.4+)
conn = duckdb.connect()
conn.execute(f"ATTACH 'encrypted_database.duckdb' AS enc (ENCRYPTION_KEY '{encryption_key}')")
conn.execute("CREATE TABLE enc.my_table (id INTEGER, value VARCHAR)")
conn.execute("INSERT INTO enc.my_table VALUES (1, 'Secret data')")
conn.close()

# Re-open the encrypted database (the same key must be provided) to access the data
conn = duckdb.connect()
conn.execute(f"ATTACH 'encrypted_database.duckdb' AS enc (ENCRYPTION_KEY '{encryption_key}')")
result = conn.execute("SELECT * FROM enc.my_table").fetchall()
print(result)  # [(1, 'Secret data')]
conn.close()

# Demonstrate failure: attaching without the key raises an error
try:
    conn = duckdb.connect()
    conn.execute("ATTACH 'encrypted_database.duckdb' AS enc")
    conn.execute("SELECT * FROM enc.my_table").fetchall()
except duckdb.Error as e:
    print(f"error: {e}")  # The database cannot be opened without the correct key
"""
**Important Considerations:**
* **Key Management:** The most critical aspect of encryption is secure key management. *Never* hardcode encryption keys in your application code. Use a secure key management system (e.g., HashiCorp Vault, AWS KMS, Azure Key Vault) to store and access encryption keys.
* **Error Handling:** Always handle potential encryption errors gracefully, such as incorrect key provided.
* **Performance:** Encryption can impact performance. Test the performance impact of encryption on your application and optimize accordingly.
* **Transit Encryption:** DuckDB itself does not handle network connections. If you are transmitting DuckDB data over a network (e.g., using a remote file system), ensure you encrypt the data in transit using protocols like TLS/SSL.
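Building on the key-management point above, the encryption key should be injected at runtime from the environment or a secrets manager, never embedded in source code. A minimal sketch, assuming DuckDB 1.4+ encryption and an illustrative "DUCKDB_ENCRYPTION_KEY" variable:
"""python
import duckdb
import os

# Fail fast if the key is not provided by the environment / secrets manager.
key = os.environ.get("DUCKDB_ENCRYPTION_KEY")
if not key:
    raise RuntimeError("DUCKDB_ENCRYPTION_KEY is not set")

conn = duckdb.connect()
conn.execute(f"ATTACH 'encrypted_database.duckdb' AS enc (ENCRYPTION_KEY '{key}')")
print(conn.execute("SELECT count(*) FROM enc.my_table").fetchone())
conn.close()
"""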
## 6. Dependency Management
### 6.1 Rationale
Using outdated or vulnerable dependencies can expose your DuckDB application to security risks. Managing dependencies properly is crucial.
### 6.2 Standards
* **Do This:** Use a dependency manager (e.g., pip in Python projects) to track and update dependencies. Regularly scan dependencies for known vulnerabilities.
* **Don't Do This:** Use outdated versions of libraries with known security vulnerabilities. Add non-essential dependencies.
### 6.3. Implementation
#### 6.3.1. Using "pip"
In Python environments, "pip" is the standard package installer:
* **"requirements.txt":** Create a "requirements.txt" file to list all project dependencies with specific versions.
"""
duckdb==0.9.2 # Pin specific versions.
bleach==6.1.0
requests==2.31.0 #Example
"""
* **Install dependencies:** Use "pip install -r requirements.txt" to install the dependencies.
#### 6.3.2. Vulnerability Scanning
* **OWASP Dependency-Check:** Integrate OWASP Dependency-Check (or similar tools) into your build process to automatically scan dependencies for known vulnerabilities.
#### 6.3.3 Virtual Environments
It is also vital to develop inside virtual environments, which isolate project dependencies from the system Python. For more information on virtual environments and "pip", see the official Python packaging documentation; a minimal workflow is sketched below.
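The following sketch combines an isolated environment, pinned dependencies, and vulnerability scanning; "pip-audit" is one scanner option among several.
"""bash
python -m venv .venv                 # Create an isolated environment
source .venv/bin/activate
pip install -r requirements.txt      # Install pinned dependencies only
pip install pip-audit
pip-audit -r requirements.txt        # Scan pinned dependencies for known vulnerabilities
"""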
## 7. Security Audits and Testing
### 7.1. Rationale
Regular security audits and testing are essential to identify and remediate potential vulnerabilities in your DuckDB application.
### 7.2. Standards
* **Do This:** Conduct regular code reviews, perform penetration testing, and implement security monitoring.
* **Don't Do This:** Assume your application is secure without regular verification.
### 7.3. Implementation
#### 7.3.1. Code Reviews
Enforce mandatory code reviews by experienced developers to identify potential security flaws.
#### 7.3.2. Penetration Testing
Engage security professionals to perform penetration testing to simulate real-world attacks against your application. Use tools like OWASP ZAP or Burp Suite.
#### 7.3.3. Security Monitoring
Implement security monitoring to detect and respond to suspicious activity. Monitor system logs, application logs, and network traffic for anomalies.
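As a starting point for application-level monitoring, query execution can be wrapped so that durations and failures are logged centrally and later shipped to your log aggregation system. A minimal sketch using Python's standard "logging" module; the wrapper name is illustrative:
"""python
import duckdb
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("duckdb_app")

def monitored_execute(conn, query, params=None):
    """Executes a query, logging its duration and any failure for later analysis."""
    start = time.monotonic()
    try:
        if params is None:
            result = conn.execute(query).fetchall()
        else:
            result = conn.execute(query, params).fetchall()
        logger.info("query ok (%.3fs): %s", time.monotonic() - start, query)
        return result
    except duckdb.Error:
        logger.exception("query failed: %s", query)
        raise

conn = duckdb.connect(":memory:")
monitored_execute(conn, "SELECT 42")
conn.close()
"""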
## 8. Specific DuckDB Considerations
### 8.1. Limited User Management
DuckDB, being in-process, has limited user management compared to client-server databases. However, be aware of the file permissions the process is executing under. Ensure the process only has required read/write permissions.
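When a process only needs to query data, open the database read-only so the connection itself cannot modify the file, complementing the filesystem permissions described in section 3. A brief sketch:
"""python
import duckdb

# Open the database file in read-only mode; writes through this connection will fail.
conn = duckdb.connect("/opt/duckdb_data/my_database.duckdb", read_only=True)
print(conn.execute("SELECT count(*) FROM my_table").fetchone())
conn.close()
"""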
### 8.2 Extension Security
Be wary of installing third-party extensions. Ensure they come from trusted sources, since extensions can execute arbitrary code within the DuckDB process.
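Depending on your DuckDB version, configuration settings can further reduce the extension attack surface. The sketch below assumes the settings named here exist in your version; check the configuration reference before relying on them.
"""python
import duckdb

# Disallow loading unsigned extensions (must be set when the connection is created).
conn = duckdb.connect(":memory:", config={"allow_unsigned_extensions": "false"})

# Optionally disable automatic installation/loading of known extensions,
# then freeze the configuration so later SET statements cannot re-enable it.
conn.execute("SET autoinstall_known_extensions = false")
conn.execute("SET autoload_known_extensions = false")
conn.execute("SET lock_configuration = true")
conn.close()
"""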
### 8.3. Shared Memory (Multi-threading)
If using DuckDB in a multi-threaded application, pay careful attention to locking and data consistency. DuckDB supports concurrent reads, but writes should be properly serialized (or wrapped in transactions) to prevent data corruption. Using external libraries for concurrency adds another external dimension to the security risk.
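In the Python client, a common pattern is to share one connection object and give each worker thread its own cursor, which DuckDB duplicates as an independent handle to the same database. A minimal sketch:
"""python
import duckdb
import threading

db = duckdb.connect(":memory:")
db.execute("CREATE TABLE events (thread_id INTEGER)")

def worker(thread_id: int):
    # Each thread uses its own cursor; sharing one connection object across
    # threads without this is not safe.
    cur = db.cursor()
    cur.execute("INSERT INTO events VALUES (?)", (thread_id,))
    cur.close()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(db.execute("SELECT count(*) FROM events").fetchone())
db.close()
"""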
danielsogl
Created Mar 6, 2025
This guide explains how to effectively use .clinerules
with Cline, the AI-powered coding assistant.
The .clinerules
file is a powerful configuration file that helps Cline understand your project's requirements, coding standards, and constraints. When placed in your project's root directory, it automatically guides Cline's behavior and ensures consistency across your codebase.
Place the .clinerules
file in your project's root directory. Cline automatically detects and follows these rules for all files within the project.
# Project Overview project: name: 'Your Project Name' description: 'Brief project description' stack: - technology: 'Framework/Language' version: 'X.Y.Z' - technology: 'Database' version: 'X.Y.Z'
# Code Standards standards: style: - 'Use consistent indentation (2 spaces)' - 'Follow language-specific naming conventions' documentation: - 'Include JSDoc comments for all functions' - 'Maintain up-to-date README files' testing: - 'Write unit tests for all new features' - 'Maintain minimum 80% code coverage'
# Security Guidelines security: authentication: - 'Implement proper token validation' - 'Use environment variables for secrets' dataProtection: - 'Sanitize all user inputs' - 'Implement proper error handling'
Be Specific
Maintain Organization
Regular Updates
# Common Patterns Example patterns: components: - pattern: 'Use functional components by default' - pattern: 'Implement error boundaries for component trees' stateManagement: - pattern: 'Use React Query for server state' - pattern: 'Implement proper loading states'
Commit the Rules
.clinerules
in version controlTeam Collaboration
Rules Not Being Applied
Conflicting Rules
Performance Considerations
# Basic .clinerules Example project: name: 'Web Application' type: 'Next.js Frontend' standards: - 'Use TypeScript for all new code' - 'Follow React best practices' - 'Implement proper error handling' testing: unit: - 'Jest for unit tests' - 'React Testing Library for components' e2e: - 'Cypress for end-to-end testing' documentation: required: - 'README.md in each major directory' - 'JSDoc comments for public APIs' - 'Changelog updates for all changes'
# Advanced .clinerules Example project: name: 'Enterprise Application' compliance: - 'GDPR requirements' - 'WCAG 2.1 AA accessibility' architecture: patterns: - 'Clean Architecture principles' - 'Domain-Driven Design concepts' security: requirements: - 'OAuth 2.0 authentication' - 'Rate limiting on all APIs' - 'Input validation with Zod'
# Component Design Standards for DuckDB This document outlines the component design standards for DuckDB, focusing on creating reusable, maintainable, and performant components within the DuckDB ecosystem. These standards aim to ensure code quality, consistency, and long-term maintainability across the DuckDB codebase. ## 1. Core Principles of Component Design in DuckDB DuckDB, being an embedded analytical database, benefits greatly from well-designed components. Components should be modular, loosely coupled, and highly cohesive. These principles increase reusability, testability, and ease of maintenance. * **Modularity:** Each component should have a clear, well-defined purpose. * **Loose Coupling:** Components should minimize dependencies on other components. This reduces the impact of changes and makes components more independent. * **High Cohesion:** All elements within a component should be closely related and work together towards a single, well-defined purpose. * **Abstraction:** Hide implementation details and expose only necessary interfaces. This allows for internal changes without affecting external users of the component. * **Single Responsibility Principle (SRP):** Each component should have only one reason to change. ## 2. Component Granularity and Scope ### 2.1 Defining Component Boundaries * **Do This:** Define clear boundaries for each component based on logical functionality. For example, a component might be responsible for parsing SQL, optimizing queries, or executing specific operator types. * **Don't Do This:** Create components that are too large and encompass multiple unrelated functionalities, or too small, leading to excessive fragmentation and overhead. * **Why:** Well-defined boundaries help maintainability and make it easier to understand and reason about the system. ### 2.2 Examples of DuckDB Components Examples of concrete components within DuckDB's architecture include: * **Parser:** Responsible for parsing SQL queries into an abstract syntax tree (AST). * **Optimizer:** Responsible for transforming the AST into an optimized query plan. * **Execution Engine:** Responsible for executing the query plan. * **Storage Manager:** Manages data storage and retrieval. * **Vector Operations (e.g., "src/common/vector_operations"):** Implements vectorized operations for efficient data processing. * **Scalar Functions (e.g., "src/function/scalar"):** Houses implementations of SQL scalar functions. ### 2.3 Standard: Component naming and file structure * **Do This:** Use descriptive names for components and organize files logically within the "src" directory. Follow the existing DuckDB directory structure which typically separates components by related functionalities. For example, functions are grouped under "src/function", optimizers under "src/optimizer", etc. * **Don't Do This:** Use generic or ambiguous names that don't clearly indicate the component's purpose. Scatter code across unrelated directories. * **Why:** A clear and consistent file structure improves code discoverability and maintainability. Predictable naming makes it easier to navigate the codebase. **Example:** A new aggregate function related to windowing might be placed in "src/function/aggregate/distributive". The function's implementation would be in a file named "my_new_window_agg_function.cpp" or similar. ## 3. Interfaces and Abstraction ### 3.1 Defining Component Interfaces * **Do This:** Define clear and concise interfaces for each component. 
Use abstract classes or interfaces in C++ to define the contracts that components must adhere to. * **Don't Do This:** Expose internal implementation details in the interface. This can lead to tight coupling and make it difficult to change the internal implementation without breaking other components. * **Why:** Well-defined interfaces promote loose coupling and allow for easier testing and mocking. **Example (Abstract Class):** """c++ // src/optimizer/rule.h class Rule { public: virtual ~Rule() = default; virtual std::unique_ptr<LogicalOperator> Apply(std::unique_ptr<LogicalOperator> op, Optimizer &optimizer) = 0; virtual bool CanApply(const std::unique_ptr<LogicalOperator> &op) = 0; }; """ ### 3.2 Abstraction Levels * **Do This:** Consider different levels of abstraction. Provide high-level interfaces for common use cases and lower-level interfaces for more advanced scenarios. This aligns with the principle of "progressive disclosure," where complexity is hidden until needed. * **Don't Do This:** Expose only low-level interfaces, forcing users to deal with unnecessary complexity. Conversely, don't hide too much information, preventing access to needed customization. * **Why:** Flexible abstraction levels cater to diverse use cases and skill levels. ### 3.3 Standard: Using Abstract Factories * **Do This:** Use abstract factories to decouple the creation of objects from their usage. This allows you to switch between different implementations of a component without modifying the code that uses it. * **Don't Do This:** Directly instantiate concrete classes throughout the codebase, creating tight dependencies on specific implementations. * **Why:** Abstract factories enhance flexibility and testability. **Example:** The creation of different physical operators (e.g., Hash Join, Nested Loop Join) could be handled by an abstract factory. ## 4. Error Handling and Logging ### 4.1 Robust Error Handling * **Do This:** Implement robust error handling within each component. Use exceptions or error codes to signal failures and provide informative error messages. DuckDB uses exceptions extensively for error handling. * **Don't Do This:** Ignore errors or propagate them silently. This can lead to unpredictable behavior and make it difficult to debug issues. * **Why:** Proper error handling is crucial for reliability and maintainability. **Example:** """c++ #include "duckdb.hpp" #include "iostream" using namespace duckdb; void MyComponent::MyFunction(int input) { if (input < 0) { throw InvalidInputException("Input must be non-negative"); } // ... perform operations } int main() { DuckDB db(":memory:"); Connection con(db); try { con.Query("SELECT MyFunction(-1)"); } catch (InvalidInputException &e) { std::cerr << "Error: " << e.what() << std::endl; } return 0; } """ ### 4.2 Logging * **Do This:** Use DuckDB's logging mechanisms to log important events, warnings, and errors. Configure logging levels appropriately to control the verbosity of the output. Consider integration with the DuckDB telemetry system. * **Don't Do This:** Use "std::cout" or "printf" for logging. These methods are not controllable and cannot be easily disabled in production environments. * **Why:** Logging provides valuable insights into the behavior of the system and helps diagnose issues. Using a consistent logging framework allows for centralized management and analysis of logs. **Note:** DuckDB has a sophisticated logging mechanism, but exact usage details are not readily available in the public documentation. 
Refer to internal DuckDB documentation and existing code for specific logging patterns. ## 5. Testing ### 5.1 Unit Testing * **Do This:** Write unit tests for each component to verify its functionality and robustness. Use a testing framework to automate the execution of tests and ensure consistent results. * **Don't Do This:** Neglect unit testing. Untested code is more likely to contain bugs and be difficult to maintain. * **Why:** Unit tests provide confidence in the correctness of the code and help prevent regressions. DuckDB uses a custom testing framework. Refer to the existing tests in the "test" directory for examples of how to write unit tests for DuckDB components. ### 5.2 Integration Testing * **Do This:** Write integration tests to verify the interaction between different components. This ensures that the components work together correctly. DuckDB's SQL-based testing framework is well-suited for this. * **Don't Do This:** Assume that components will work together correctly without integration testing. * **Why:** Integration tests catch issues that may not be apparent from unit tests alone. **Example:** Test a query involving the Parser, Optimizer, and Execution Engine components. ## 6. Concurrency and Thread Safety ### 6.1 Thread Safety * **Do This:** Design components to be thread-safe, especially if they are accessed from multiple threads concurrently. Use appropriate synchronization mechanisms (e.g., mutexes, atomic operations) to protect shared data. DuckDB's internal concurrency model often relies on immutable data structures and vectorized operations to minimize lock contention. * **Don't Do This:** Assume that components are thread-safe without proper verification. Data races and other concurrency issues can lead to unpredictable behavior. * **Why:** Thread safety is crucial for performance and stability in a multi-threaded environment. ### 6.2 Data Structures * **Do This:** When possible, prioritize using concurrent data structures that inherently handle thread safety, or design your code to avoid shared mutable state altogether using techniques like message passing. DuckDB makes extensive use of vectorized operations, which often allows for lock-free concurrency. * **Don't Do This:** Use simple data structures without considering thread safety and rely entirely on locks. * **Why:** Concurrent data structures often offer better performance and scalability compared to using simple data structures with locks. Avoiding mutable shared state is the ideal (though often impractical) approach. ## 7. Code Style and Formatting ### 7.1 Consistent Style * **Do This:** Follow a consistent code style throughout the codebase. Use a code formatter (e.g., clang-format) to automatically enforce the style guidelines. Refer to the DuckDB's existing code for style conventions. * **Don't Do This:** Use inconsistent code styles. This can make the code harder to read and understand. * **Why:** A consistent code style improves readability and maintainability. ### 7.2 Naming Conventions * **Do This:** Follow consistent naming conventions for variables, functions, classes, and other identifiers. Use descriptive names that clearly indicate the purpose of the identifier. Prefer longer, descriptive names over short, cryptic ones. * **Don't Do This:** Use inconsistent or ambiguous naming conventions. * **Why:** Clear naming conventions improve code readability and maintainability. ## 8. 
Performance Considerations ### 8.1 Vectorized Operations * **Do This:** Leverage DuckDB's vectorized execution engine to perform operations on large batches of data efficiently. Use the "Vector" class and its associated methods for vectorized operations. Understanding and using vectorized operations is paramount to building performant DuckDB components. * **Don't Do This:** Implement operations on individual data elements. This can lead to significant performance overhead. * **Why:** Vectorized operations maximize CPU utilization and minimize data transfer overhead. **Example:** """c++ #include "duckdb.hpp" using namespace duckdb; void MyComponent::Add(Vector &left, Vector &right, Vector &result, idx_t count) { auto left_data = FlatVector::GetData<int32_t>(left); auto right_data = FlatVector::GetData<int32_t>(right); auto result_data = FlatVector::GetData<int32_t>(result); for (idx_t i = 0; i < count; i++) { result_data[i] = left_data[i] + right_data[i]; } } """ ### 8.2 Data Locality * **Do This:** Design components to maximize data locality. This means keeping related data close together in memory to reduce cache misses and improve performance. DuckDB's columnar storage format enhances data locality for analytical workloads. * **Don't Do This:** Scatter related data across different memory locations. * **Why:** Data locality is crucial for performance. Accessing data from memory is much faster than accessing data from disk. ### 8.3 Minimize Memory Allocation * **Do This:** Minimize memory allocation and deallocation within performance-critical sections of code. Use memory pools or other techniques to reuse memory. DuckDB has its own memory management mechanisms; familiarize yourself with them. * **Don't Do This:** Allocate and deallocate memory frequently. * **Why:** Memory allocation and deallocation can be expensive operations, especially in high-performance systems. ## 9. Security Considerations ### 9.1 Input Validation * **Do This:** Validate all inputs to components to prevent security vulnerabilities such as SQL injection and buffer overflows. Use parameterized queries to prevent SQL injection. * **Don't Do This:** Trust inputs without proper validation. * **Why:** Input validation is crucial for security. Malicious inputs can be used to compromise the system. ### 9.2 Data Encryption * **Do This:** Consider encrypting sensitive data at rest and in transit. Use strong encryption algorithms and follow security best practices. DuckDB supports encryption extensions; use them where appropriate. * **Don't Do This:** Store sensitive data in plain text. * **Why:** Data encryption protects sensitive data from unauthorized access. ## 10. Future-Proofing * **Do This:** Follow design principles that accommodate and anticipate future changes in DuckDB. Consider the potential implications of adding new data types, storage formats, or query optimization techniques. * **Don't Do This:** Create components that are tightly coupled to specific implementation details or that make assumptions about the current state of the system. * **Why:** Designed-for-change architecture ensures that future updates or new features will not break components as easily as a more brittle design. By following these component design standards, DuckDB developers can create a robust, maintainable, and performant codebase. This will contribute to the long-term success of the DuckDB project.
# Performance Optimization Standards for DuckDB This document outlines the performance optimization standards for DuckDB, providing guidelines for developers to write efficient and performant code. These standards are tailored for DuckDB's architecture and are designed to improve application speed, responsiveness, and resource usage. ## 1. Query Optimization ### 1.1. Understanding Query Plans **Standard:** Analyze query plans to identify bottlenecks and optimize query execution. * **Do This:** Use "EXPLAIN" to examine the query plan and identify areas for improvement. * **Don't Do This:** Blindly execute queries without understanding their underlying execution strategy. **Why:** Understanding the query plan allows developers to make informed decisions about indexing, data types, and query structure. **Example:** """sql EXPLAIN SELECT * FROM lineitem WHERE l_orderkey = 12345; """ This will output the query plan, showing the steps DuckDB will take to execute the query. Areas of concern include full table scans, inefficient joins, or suboptimal sorting. ### 1.2. Indexing Strategies **Standard:** Employ appropriate indexing strategies to accelerate data retrieval. * **Do This:** Create indexes on frequently queried columns, especially those used in "WHERE" clauses and join conditions. Consider using multi-column indexes for composite queries. * **Don't Do This:** Over-index tables, as this can slow down write operations and increase storage overhead. Avoid indexing columns with low cardinality or those rarely used in queries. **Why:** Indexes significantly reduce the amount of data that needs to be scanned, resulting in faster query execution. **Example:** """sql -- Single-column index CREATE INDEX idx_orderkey ON lineitem (l_orderkey); -- Multi-column index CREATE INDEX idx_order_ship ON lineitem (l_orderkey, l_shipdate); """ Carefully consider the order of columns in multi-column indexes. The most frequently queried column should come first. DuckDB (as of recent versions) also supports expression indexes, though these should be used judiciously as they can complicate maintenance. ### 1.3. Data Type Considerations **Standard:** Use the most appropriate data types for your data to minimize storage and improve performance. * **Do This:** Use smaller integer types (e.g., "SMALLINT", "INTEGER") if the range of values allows. Use "VARCHAR" with length limits when appropriate instead of "TEXT" for string data. Use the "DATE" and "TIMESTAMP" types for date and time data, respectively. * **Don't Do This:** Use unnecessarily large data types, such as "BIGINT" when "INTEGER" suffices. Use "TEXT" for columns that contain short, fixed-length strings. **Why:** Smaller data types reduce storage space and memory usage, leading to faster data processing. **Example:** """sql -- Good: Using SMALLINT when appropriate CREATE TABLE orders ( order_id SMALLINT, -- Assuming order IDs won't exceed the range of SMALLINT order_date DATE ); -- Bad: Using BIGINT unnecessarily CREATE TABLE products ( product_id BIGINT, -- INTEGER might be sufficient product_name VARCHAR -- Length limit missing ); -- Better: Using VARCHAR with length limit and explicit timestamp CREATE TABLE products ( product_id INTEGER, product_name VARCHAR(255), created_at TIMESTAMP ); """ ### 1.4. Join Optimization **Standard:** Optimize join operations to minimize the amount of data processed. * **Do This:** Use appropriate join algorithms (DuckDB generally auto-selects based on table sizes and statistics). Ensure join columns are indexed. 
If applicable, use "HASH JOIN" for equality joins on larger tables. Use "BROADCAST JOIN" when joining a large table to a considerably small table (DuckDB often optimizes automatically but understanding the strategy is important). Leverage pre-calculated aggregates if appropriate. * **Don't Do This:** Perform joins without indexes on join columns. Join on complex expressions rather than simple column lookups. Perform cartesian products by omitting join conditions. **Why:** Efficient join operations are critical for query performance, especially in data warehousing scenarios. **Example:** """sql -- Good: Indexed join columns CREATE INDEX idx_customer_id ON orders (customer_id); CREATE INDEX idx_customer_id ON customers (customer_id); SELECT * FROM orders JOIN customers ON orders.customer_id = customers.customer_id; -- Consider broadcasting small tables: (DuckDB might do this automatically though) SELECT /*+ BROADCAST(customers) */ * FROM orders JOIN customers ON orders.customer_id = customers.customer_id; -- Bad: No indexes, forcing a full table scan SELECT * FROM orders JOIN customers ON orders.customer_id = customers.customer_id; -- Assuming no index on customer_id """ ### 1.5. Subquery Optimization **Standard:** Rewrite subqueries where possible to improve performance. * **Do This:** Use "JOIN" operations instead of correlated subqueries when possible. Use Common Table Expressions (CTEs) to break down complex queries into smaller, manageable parts. * **Don't Do This:** Use correlated subqueries excessively, as they can significantly slow down query execution. **Why:** Correlated subqueries can be inefficient because they are executed for each row in the outer query. **Example:** """sql -- Bad: Correlated subquery SELECT o.order_id FROM orders o WHERE EXISTS ( SELECT 1 FROM lineitem l WHERE l.order_id = o.order_id ); -- Good: Using a JOIN instead SELECT DISTINCT o.order_id FROM orders o JOIN lineitem l ON o.order_id = l.order_id; -- Good: Using a CTE for readability and potential optimization WITH OrderItems AS ( SELECT order_id FROM lineitem ) SELECT o.order_id FROM orders o WHERE o.order_id IN (SELECT order_id FROM OrderItems); """ ### 1.6. Filtering Early **Standard:** Apply filters as early as possible in the query execution pipeline. * **Do This:** Place "WHERE" clauses that significantly reduce the number of rows processed at the beginning of the query. * **Don't Do This:** Filter data late in the query execution pipeline, after expensive operations like joins or aggregations. **Why:** Filtering early reduces the amount of data that subsequent operations need to process. **Example:** """sql -- Good: Filtering early significantly reduces rows SELECT * FROM orders WHERE order_date > '2023-01-01' AND customer_id IN (SELECT customer_id FROM active_customers); -- Bad: Filtering late after a join (less efficient if only a fraction of orders are recent) SELECT * FROM orders JOIN customers ON orders.customer_id = customers.customer_id WHERE order_date > '2023-01-01'; """ ## 2. Data Loading and Storage ### 2.1. Bulk Loading **Standard:** Use bulk loading techniques for large datasets. * **Do This:** Use "COPY" command or DuckDB's API to load data in bulk. Use vectorized reads when possible. * **Don't Do This:** Load data row-by-row using individual "INSERT" statements. **Why:** Bulk loading is significantly faster than individual "INSERT" statements. 
**Example:** """sql -- CSV import COPY lineitem FROM 'lineitem.tbl' (DELIMITER '|'); -- Parquet import (highly recommended due to DuckDB's columnar nature) COPY lineitem FROM 'lineitem.parquet' (FORMAT 'PARQUET'); """ Ensure the data is pre-sorted by clustering key for even greater performance, especially when creating clustered indexes. ### 2.2. Data Clustering and Sorting **Standard:** Cluster and sort data based on common query patterns. * **Do This:** Use "ALTER TABLE ... CLUSTER BY" to physically sort the data on disk based on specific columns. This is extremely beneficial for range queries. * **Don't Do This:** Neglect to cluster data, especially for large tables. Cluster by columns that are rarely used in queries. **Why:** Clustering data improves query performance by reducing the amount of data that needs to be scanned for range queries or queries involving a specific order. **Example:** """sql ALTER TABLE lineitem CLUSTER BY l_orderkey, l_shipdate; --Cluster by orderkey, then by shipdate within each orderkey """ ### 2.3. Compression **Standard:** Enable compression for large datasets to reduce storage space and improve I/O performance. * **Do This:** Use compression algorithms like Zstd or Snappy, especially when storing data in Parquet format. DuckDB automatically handles compression for its internal storage. * **Don't Do This:** Store uncompressed data unnecessarily. **Why:** Compression reduces the amount of data that needs to be read from disk, leading to faster query execution. **Example:** """sql -- Parquet with Zstd compression (best generally for both compression ratio and speed) COPY lineitem TO 'lineitem_compressed.parquet' (FORMAT 'PARQUET', COMPRESSION 'ZSTD'); -- Explicit compression -- DuckDB auto-compression (will use a reasonable default) CREATE TABLE compressed_table AS SELECT * FROM lineitem; """ ### 2.4. Partitioning (using Parquet Files) **Standard:** Partition data into separate files based on logical criteria (e.g., date ranges, geographic regions). * **Do This:** Store data in Parquet files, partitioned by relevant columns. Use DuckDB's globbing capabilities to efficiently query specific partitions. * **Don't Do This:** Store all data in a single large file, as this can slow down query execution. **Why:** Partitioning allows DuckDB to only read the relevant files for a given query, improving performance. **Example:** Assume you have Parquet files partitioned by year and month: "/data/orders/year=2023/month=01/orders.parquet", "/data/orders/year=2023/month=02/orders.parquet", etc. """sql -- Query data for a specific month SELECT * FROM read_parquet('/data/orders/year=2023/month=01/*.parquet'); -- Query data for a specific year SELECT * FROM read_parquet('/data/orders/year=2023/*.parquet'); -- Query for all data SELECT * FROM read_parquet('/data/orders/*/*.parquet'); --Use cautiously. Is this REALLY what you meant? """ ### 2.5. Vectorized Reads **Standard:** Utilize DuckDB's vectorized reads for efficient data processing from disk or other external sources. * **Do This:** When reading from Parquet or other file formats, ensure DuckDB is configured to utilize vectorized reads. This is enabled by default; however, verify configurations in case of custom setups. * **Don't Do This:** Implement custom, row-by-row processing when reading data into DuckDB, especially when standard file formats are used. **Why:** Vectorized reads allow DuckDB to process data in batches, significantly improving the throughput of data ingestion and query execution. 
**Example:** DuckDB automatically utilizes vectorized reads for Parquet files. You generally will not need to configure this directly. However, for custom data-loading implementations, ensure that you are reading data in batches and passing it to DuckDB's vectorized execution engine. ## 3. Concurrency and Parallelism ### 3.1. Connection Management **Standard:** Manage database connections efficiently. * **Do This:** Use connection pooling to reuse connections and avoid the overhead of creating new connections for each query. Close connections when they are no longer needed. * **Don't Do This:** Create a new connection for each query. Leave connections open indefinitely. **Why:** Establishing database connections can be expensive. Connection pooling improves performance by reusing existing connections. **Example (Python):** """python import duckdb import threading # Use a thread-local connection local = threading.local() def get_connection(): if not hasattr(local, "con"): local.con = duckdb.connect('my_database.duckdb') return local.con def run_query(query): con = get_connection() result = con.execute(query).fetchall() return result """ ### 3.2. Parallel Query Execution **Standard:** Leverage DuckDB's parallel query execution capabilities. * **Do This:** Configure the number of threads used for query execution using "PRAGMA threads". Ensure that queries are designed to benefit from parallelism (e.g., large scans, aggregations). * **Don't Do This:** Set the number of threads too high, as this can lead to excessive context switching and reduced performance. **Why:** Parallel query execution can significantly improve performance for CPU-bound operations. **Example:** """sql PRAGMA threads=8; -- Use 8 threads SELECT l_returnflag, l_linestatus, SUM(l_quantity) AS sum_qty, SUM(l_extendedprice) AS sum_base_price, SUM(l_discount) AS sum_disc_price FROM lineitem GROUP BY l_returnflag, l_linestatus; PRAGMA threads=-1; --Use all available cores """ Carefully assess the optimal number of threads for your workload. For I/O bound workloads, increasing number of threads excessively can introduce contention overhead. ## 4. Runtime Configuration ### 4.1. Memory Management **Standard:** Configure the amount of memory available to DuckDB. * **Do This:** Use "PRAGMA memory_limit" to set the memory available to DuckDB. Monitor memory usage to ensure that the limit is appropriate. * **Don't Do This:** Allow DuckDB to use excessive amounts of memory, potentially starving other processes. Set the memory limit too low, which can lead to disk spilling and reduced performance. **Why:** Proper memory management prevents out-of-memory errors and ensures efficient query execution. **Example:** """sql PRAGMA memory_limit='16GB'; -- Set memory limit to 16GB """ ### 4.2. Temporary Storage **Standard:** Ensure that temporary storage is configured correctly. * **Do This:** Use the "temp_directory" configuration option to specify a location for temporary files. Ensure that the specified location has sufficient storage space and high I/O performance. * **Don't Do This:** Allow temporary files to be written to the default location, which may be on a slower storage device. **Why:** DuckDB uses temporary storage for intermediate results. Configuring temporary storage correctly can improve query performance, especially when dealing with large datasets. **Example:** """sql PRAGMA temp_directory='/mnt/fast_ssd/duckdb_tmp'; """ ### 4.3. 
Detailed Monitoring **Standard:** Using tools to actively monitor performance of queries and IO operations * **Do This:** Use DuckDB's built-in performance monitoring features along with external system monitoring tools * **Don't Do This:** Neglect monitoring the impact of configuration changes and code optimization. Changes should be tested thoroughly, and can sometimes negatively impact performance for some workloads. **Why:** Consistent monitoring helps ensure that changes are having the impact you expect, and catches unexpected degradation. **Example:** While DuckDB contains some minimal internal monitoring, focus should be on wrapping the application in well-known monitoring frameworks used in the deployment envirnoment such as Prometheus, Grafana, or similar tools. ## 5. Code Maintainability and Readability ### 5.1. Code Formatting and Style **Standard:** Follow a consistent code formatting style. * **Do This:** Use a consistent indentation style (e.g., 4 spaces). Use meaningful variable and function names. Add comments to explain complex logic. * **Don't Do This:** Use inconsistent indentation. Use cryptic variable names. Write code without comments. **Why:** Consistent code formatting improves readability and maintainability. **Example:** """sql -- Good: Well-formatted SQL SELECT c.customer_id, c.customer_name, COUNT(o.order_id) AS order_count FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id WHERE c.region = 'North America' GROUP BY c.customer_id, c.customer_name ORDER BY order_count DESC LIMIT 10; -- Bad: Poorly formatted SQL select c.customer_id,c.customer_name,count(o.order_id) from customers c left join orders o on c.customer_id=o.customer_id where c.region='North America' group by c.customer_id,c.customer_name order by count(o.order_id) desc limit 10; """ ### 5.2. Modular Design **Standard:** Break down complex queries and logic into smaller, reusable modules. * **Do This:** Use Common Table Expressions (CTEs) to break down complex queries into smaller parts. Create reusable functions for common operations. * **Don't Do This:** Write monolithic queries that are difficult to understand and maintain. **Why:** Modular design improves code organization and reduces code duplication. **Example:** """sql -- Good: Using CTEs to break down a complex query WITH CustomerOrders AS ( SELECT customer_id, COUNT(order_id) AS order_count FROM orders GROUP BY customer_id ), TopCustomers AS ( SELECT customer_id FROM CustomerOrders ORDER BY order_count DESC LIMIT 10 ) SELECT c.customer_id, c.customer_name, co.order_count FROM customers c JOIN TopCustomers tc ON c.customer_id = tc.customer_id JOIN CustomerOrders co ON c.customer_id = co.customer_id; -- Bad: Monolithic query SELECT c.customer_id, c.customer_name, COUNT(o.order_id) FROM customers c JOIN orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.customer_name ORDER BY COUNT(o.order_id) DESC LIMIT 10; """ By adhering to these coding standards, DuckDB developers can write efficient, maintainable, and performant code, ensuring that applications utilizing DuckDB run smoothly and effectively. The consistent application of these rules, aided by AI tools, should lead to a higher quality codebase and improved overall system performance. Remember to stay current with DuckDB's release notes, especially those regarding optimization, as the engine is rapidly evolving.
# API Integration Standards for DuckDB This document outlines the coding standards for integrating DuckDB with external APIs and backend services. It focuses on best practices to ensure maintainability, performance, and security when leveraging DuckDB in conjunction with external data sources and services. ## 1. General Principles of API Integration ### 1.1. Clear Separation of Concerns **Do This:** * Isolate API interaction logic from core database operations. * Create dedicated modules or functions responsible for communicating with external APIs. **Don't Do This:** * Embed API calls directly within SQL queries or stored procedures. This makes debugging incredibly difficult and tightly couples your SQL logic to an external service. **Why:** Separate concerns promote modularity and testability. API interactions are often subject to change (e.g., API version updates, schema changes), so isolating them reduces the impact of these changes on core database logic. **Example:** """python # Correct: Separate API interaction logic import requests import duckdb def fetch_data_from_api(api_url): """Fetches data from an external API.""" try: response = requests.get(api_url) response.raise_for_status() #
# State Management Standards for DuckDB This document outlines the coding standards for managing state within applications using DuckDB, focusing on data flow, reactivity, and persistence. These standards aim to ensure maintainability, performance, and security for DuckDB-driven applications. ## 1. Principles of State Management Effective state management is crucial for building robust and scalable DuckDB applications. A well-defined approach simplifies debugging, enhances testability, and improves overall code quality. ### 1.1. Explicit vs. Implicit State * **Do This:** Favor explicit state management. Clearly define and declare all state variables, data structures, and their relationships. Use appropriate data types. * **Don't Do This:** Rely on hidden or implicit state, such as global variables or mutable shared objects without clear boundaries. **Why:** Explicit state improves traceability and reduces the risk of unexpected side effects. **Example:** """python # Explicit State import duckdb def execute_query(db_connection, query): """Executes a SQL query against a DuckDB database.""" try: result = db_connection.execute(query).fetchall() return result except duckdb.Error as e: print(f"Error executing query: {e}") return None # Example Usage (Explicit Connection Object) conn = duckdb.connect(':memory:') conn.execute("CREATE TABLE mytable (id INTEGER, value VARCHAR)") conn.execute("INSERT INTO mytable VALUES (1, 'hello'), (2, 'world')") result = execute_query(conn, "SELECT * FROM mytable") print(result) conn.close() # Implicit State (Avoid) # (Using global database connections) """ ### 1.2. Immutable Data Structures * **Do This:** Use immutable data structures whenever possible to represent state. Prefer creating new copies of data upon modification rather than mutating existing objects. * **Don't Do This:** Modify data structures in place without considering the potential side effects on other parts of the application. **Why:** Immutability simplifies debugging and reasoning about data flow, particularly in concurrent environments. **Example:** """python # Immutable Data Structures & DuckDB import duckdb def update_records(db_path, table_name, updates): """ Simulates updating records by creating a new table with the modifications This is an example of immutable approach since DuckDB doesn't allow direct update in embedded mode """ conn = duckdb.connect(db_path) try: # 1. Read the existing records using DuckDB existing_records = conn.execute(f"SELECT * FROM {table_name}").fetchall() # Convert the result into a manageable format, like a dict records_dict = {record[0]: list(record[1:]) for record in existing_records} # Assuming id is record[0], and the rest are fields. # 2. Apply updates (generating new records) - Immutability approach: create new dict new_records_dict = records_dict.copy() # Create a copy for row_number, record_data in updates.items(): if row_number in new_records_dict: # We need to know the row number new_records_dict[row_number] = record_data # Update the dictionary (copy). 
        # 3. Drop the old table and rebuild it from the updated dictionary
        conn.execute(f"DROP TABLE IF EXISTS {table_name}")
        # Re-attach the id to each row and convert the values to tuples
        table_data = [(row_id, *values) for row_id, values in new_records_dict.items()]
        # Define the column names for the new table (example column names)
        column_names = ['id', 'name', 'age', 'city']
        # Create the new table using DuckDB
        conn.execute(
            f"CREATE TABLE {table_name} AS "
            f"SELECT * FROM (VALUES {', '.join(map(str, table_data))}) AS t ({', '.join(column_names)})"
        )
        # Verify the result by reading back the updated table
        result = conn.execute(f"SELECT * FROM {table_name}").fetchall()
        print(f"Updated table records: {result}")
    except duckdb.Error as e:
        print(f"Error during update: {e}")
    finally:
        conn.close()

# Example usage - note: pass rows as tuples (not lists) to avoid type conversion issues
db_path = 'my_example.duckdb'
original_data = [(1, 'Alice', 30, 'New York'), (2, 'Bob', 25, 'Los Angeles'), (3, 'Charlie', 35, 'Chicago')]
conn = duckdb.connect(db_path)
conn.execute('CREATE TABLE IF NOT EXISTS users (id INTEGER, name VARCHAR, age INTEGER, city VARCHAR)')
conn.executemany('INSERT INTO users VALUES (?, ?, ?, ?)', original_data)
conn.close()

updates = {
    1: ['Alice Updated', 31, 'New Jersey'],  # Key is the row id
    2: ['Bob Updated', 26, 'San Francisco']
}
update_records(db_path, 'users', updates)
"""

### 1.3. Single Source of Truth

* **Do This:** Ensure that each piece of data has a single, authoritative source. Avoid redundant copies or derived data that can become inconsistent. Use DuckDB as the single source of truth for analytical data where possible.
* **Don't Do This:** Cache data aggressively without proper invalidation mechanisms.

**Why:** A single source of truth minimizes discrepancies and simplifies data synchronization.

**Example:**

"""python
# Single Source of Truth - DuckDB
import duckdb

def get_user_data(db_path, user_id):
    """Retrieves user data from DuckDB as the single source of truth."""
    conn = duckdb.connect(db_path)
    try:
        result = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchone()
        if result:
            return {
                'id': result[0],
                'name': result[1],
                'age': result[2],
                'city': result[3]
            }
        else:
            return None
    except duckdb.Error as e:
        print(f"Error retrieving user data: {e}")
        return None
    finally:
        conn.close()

# Usage
db_path = 'my_example.duckdb'
user_id = 1
user_data = get_user_data(db_path, user_id)
print(user_data)
"""

## 2. State Management Approaches in DuckDB Applications

Different applications have different state management needs. Here is how to approach this for applications leveraging DuckDB.

### 2.1. Embedded DuckDB State

* **Do This:** For small to medium-sized datasets, use DuckDB's embedded mode for direct data manipulation within the application's process.
* **Don't Do This:** Attempt complex concurrent write operations in embedded mode without proper locking and transaction handling.
* **Consider:** The limits of in-process memory and CPU usage for large datasets when using embedded DuckDB.

**Why:** Embedded DuckDB offers simplicity and low latency for local analytics.
**Example:**

"""python
# Embedded DuckDB Example
import duckdb

db_conn = duckdb.connect(':memory:')  # In-memory database for embedded use
db_conn.execute("CREATE TABLE items (id INTEGER, name VARCHAR)")
db_conn.execute("INSERT INTO items VALUES (1, 'Laptop')")
db_conn.execute("INSERT INTO items VALUES (2, 'Keyboard')")
results = db_conn.execute("SELECT * FROM items").fetchall()
print(results)
db_conn.close()
"""

### 2.2. Persistent DuckDB State

* **Do This:** Store the DuckDB database on disk to persist data across application sessions.
* **Don't Do This:** Neglect backup and recovery mechanisms for persistent DuckDB databases.
* **Consider:** Using relative paths for the database file location to improve portability.

**Why:** Persistent storage ensures data continuity across application restarts.

**Example:**

"""python
import duckdb

db_path = 'my_persistent_db.duckdb'  # Database file path

# Connect, create the table, and close the connection
db_conn = duckdb.connect(db_path)
db_conn.execute("CREATE TABLE IF NOT EXISTS user_profiles (id INTEGER, username VARCHAR, email VARCHAR)")
db_conn.close()

# Function to insert data
def insert_user_profile(db_path, id, username, email):
    conn = duckdb.connect(db_path)
    try:
        conn.execute("INSERT INTO user_profiles VALUES (?, ?, ?)", (id, username, email))
        conn.commit()
        print(f"Inserted user: {username}")
    except duckdb.Error as e:
        print(f"Error inserting user: {e}")
        conn.rollback()
    finally:
        conn.close()

# Insert sample data into the persistent database
insert_user_profile(db_path, 1, 'john_doe', 'john.doe@example.com')
insert_user_profile(db_path, 2, 'jane_smith', 'jane.smith@example.com')

# Read function for retrieving a user profile
def get_user_profile(db_path, user_id):
    conn = duckdb.connect(db_path)
    try:
        result = conn.execute("SELECT * FROM user_profiles WHERE id = ?", (user_id,)).fetchone()
        if result:
            return {
                'id': result[0],
                'username': result[1],
                'email': result[2]
            }
        else:
            return None
    except duckdb.Error as e:
        print(f"Error getting user profile: {e}")
        return None
    finally:
        conn.close()

# Get the data from the database and print it
user_profile = get_user_profile(db_path, 1)
print(user_profile)
"""

### 2.3. Connecting to External Data Sources

* **Do This:** Utilize DuckDB's ability to directly query data from Parquet, CSV, JSON, and other file formats without importing it first.
* **Don't Do This:** Assume that external data sources always conform to the expected schema. Implement robust error handling and schema validation.
* **Consider:** Optimizing access to external data sources by filtering and aggregating data within DuckDB rather than transferring large amounts of data to the application.

**Why:** External data access enables real-time analytics without data duplication.

**Example:**

"""python
# External data source: JSON (use read_json_auto so DuckDB infers the schema)
import duckdb
import os

def analyze_json_data(json_file_path, query=None):
    """Analyzes JSON data using DuckDB."""
    try:
        if query is None:
            # Default query: read everything; read_json_auto lets DuckDB infer the schema
            query = f"SELECT * FROM read_json_auto('{json_file_path}')"
        conn = duckdb.connect(':memory:')
        result = conn.execute(query).fetchall()
        conn.close()
        return result
    except duckdb.Error as e:
        print(f"Error querying JSON data: {e}")
        return None

# Prepare a sample JSON file
json_data = '[{"id": 1, "name": "Laptop", "price": 1200}, {"id": 2, "name": "Keyboard", "price": 75}]'
with open('products.json', 'w') as f:
    f.write(json_data)

json_file_path = 'products.json'
query = f"SELECT name, price FROM read_json_auto('{json_file_path}') WHERE price > 100"
results = analyze_json_data(json_file_path, query)
print(results)
os.remove('products.json')  # Clean up the sample file
"""

### 2.4. Managing Large Datasets

* **Do This:** Use DuckDB's efficient query engine to perform aggregations, filtering, and joins on large datasets directly within the database.
* **Don't Do This:** Load entire large datasets into application memory.
* **Consider:** Partitioning and indexing techniques to optimize query performance on large datasets.

**Why:** Optimized query execution minimizes memory usage and processing time.

**Example:**

"""python
# Large Dataset Handling
import duckdb
import pandas as pd
import os

def analyze_large_dataset(csv_file_path, query):
    """Analyzes a large CSV dataset using DuckDB."""
    try:
        # Establish a connection to DuckDB (in-memory for this example)
        conn = duckdb.connect(':memory:')
        # Load the CSV file into a table, letting DuckDB infer the schema
        conn.execute(f"CREATE TABLE my_data AS SELECT * FROM read_csv_auto('{csv_file_path}')")
        # Execute the query and retrieve the result as a Pandas DataFrame
        result = conn.execute(query).fetchdf()
        conn.close()
        return result
    except duckdb.Error as e:
        print(f"Error querying large dataset: {e}")
        return None

# Example: create a test file
data = {'col1': [1, 2, 3, 4, 5], 'col2': ['A', 'B', 'C', 'D', 'E'], 'col3': [1.1, 2.2, 3.3, 4.4, 5.5]}
df = pd.DataFrame(data)
csv_file_path = "test.csv"
df.to_csv(csv_file_path, index=False)

query = "SELECT col2, AVG(col3) FROM my_data GROUP BY col2"
results = analyze_large_dataset(csv_file_path, query)
print(results)
os.remove('test.csv')  # Clean up the test file
"""

### 2.5. Transactions

* **Do This:** Use transactions to ensure atomicity, consistency, isolation, and durability (ACID) when performing multiple write operations on DuckDB.
* **Don't Do This:** Perform write operations without transactions; this can lead to data corruption or inconsistencies in case of errors.
* **Consider:** Choosing the appropriate isolation level for transactions based on the application's concurrency requirements.

**Why:** Transactions guarantee data integrity during complex operations.

**Example:**

"""python
import duckdb

def transfer_funds(db_path, account_from, account_to, amount):
    """Transfers funds between two accounts using a transaction."""
    conn = duckdb.connect(db_path)
    try:
        conn.execute("BEGIN TRANSACTION")  # Start transaction
        # 1. Check that the sender account has a sufficient balance.
        sender_balance = conn.execute("SELECT balance FROM accounts WHERE id = ?", (account_from,)).fetchone()[0]
        if sender_balance < amount:
            raise ValueError("Insufficient funds.")
        # 2. Withdraw from the sender account.
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, account_from))
        # 3. Deposit to the receiver account.
conn.execute(f"UPDATE accounts SET balance = balance + {amount} WHERE id = {account_to}") conn.commit() print("Funds transferred successfully.") except ValueError as e: conn.rollback() print(f"Transaction rolled back due to {e}") except duckdb.Error as e: conn.rollback() print(f"Error during transfer: {e}") finally: conn.close() #Setup initial state def setup_accounts(db_path): conn = duckdb.connect(db_path) try: conn.execute('CREATE TABLE IF NOT EXISTS accounts (id INTEGER, balance REAL)') conn.execute('INSERT INTO accounts VALUES (1, 1000.0)') conn.execute('INSERT INTO accounts VALUES (2, 500.0)') conn.commit() print ("Set up user account") except duckdb.Error as e: print(f"Error setting up accounts: {e}") conn.rollback() #Roll back in case of an error finally: conn.close() db_path = 'bank_db.duckdb' setup_accounts(db_path) transfer_funds(db_path, 1, 2, 200.0) #Transfer 200 from user 1 to user 2 #Verify results conn = duckdb.connect(db_path) print (conn.execute("SELECT * from accounts").fetchall()) conn.close() """ ## 3. Modern Approaches and Patterns ### 3.1. Reactive Programming * **Do This:** Use reactive programming techniques (e.g., RxPY) to automatically update application state in response to changes in the underlying DuckDB data. * **Don't Do This:** Poll the database repeatedly to detect changes. * **Consider:** Using change data capture (CDC) mechanisms if available within your DuckDB environment (though DuckDB itself has limited direct CDC). **Why:** Reactive programming enables efficient and real-time state updates. **Example (Conceptual, requires external libraries):** """python # Conceptual Reactive Example (Requires e.g., RxPY) # Note: This is a simplified conceptual example. Integration would depend on # specific libraries providing reactive capabilities around database changes. # This demonstrates the idea, not a fully working example. import duckdb import reactivex from reactivex import operators as ops def create_database_observable(db_path, query, interval): """Creates an observable that emits data from a DuckDB query at a given interval.""" def subscribe(observer, scheduler=None): def run(): try: conn = duckdb.connect(db_path) result = conn.execute(query).fetchall() observer.on_next(result) conn.close() except Exception as e: observer.on_error(e) # Propagate any errors to the observable #Recursive function to keep schedule until disposed if not observer.is_stopped: scheduler.schedule(run, interval) #Initial Schedule with recusive function scheduler.schedule(run, interval) return reactivex.disposable.Disposable(run, interval) return reactivex.create(subscribe) #Example DB Setup db_path = 'reactive_db.duckdb' conn = duckdb.connect(db_path) conn.execute("CREATE TABLE IF NOT EXISTS sensor_data (timestamp TIMESTAMP, temperature REAL)") conn.execute("INSERT INTO sensor_data VALUES ('2024-11-07 10:00:00', 25.5)") conn.close() # Create an observable that queries the DuckDB database every 5 seconds. db_observable = create_database_observable(db_path, "SELECT * FROM sensor_data", 5) # Subscribe to the observable and print the data. def on_next(data): print(f"Data emitted: {data}") def on_error(error): print(f"Error: {error}") def on_completed(): print("Completed") disposable = db_observable.subscribe( on_next=on_next, # Function to call when data is emitted on_error=on_error, # Function to call if there's an error on_completed=on_completed # Function when observable is stopped ) # Wait for 15 seconds to receive three emissions. 
time.sleep(15)

# Dispose of the subscription when done.
disposable.dispose()
"""

### 3.2. Using DuckDB with Arrow for Data Transfer

* **Do This:** Leverage Apache Arrow as a data transfer format between DuckDB and other systems (e.g., Pandas, Spark). Use the "arrow()" method on DuckDB connection objects to fetch results as Arrow tables.
* **Don't Do This:** Rely on inefficient data serialization formats when transferring data between DuckDB and other systems.

**Why:** Arrow provides zero-copy data sharing, minimizing overhead.

**Example:**

"""python
# Arrow Example
import duckdb
import pyarrow as pa

db_conn = duckdb.connect(':memory:')
db_conn.execute("CREATE TABLE my_data (id INTEGER, value VARCHAR)")
db_conn.execute("INSERT INTO my_data VALUES (1, 'hello'), (2, 'world')")

arrow_table = db_conn.execute("SELECT * FROM my_data").arrow()
print(arrow_table)
print(type(arrow_table))  # Print the type of the Arrow table
db_conn.close()
"""

### 3.3. Parameterized Queries

* **Do This:** Use parameterized queries to prevent SQL injection attacks and improve query performance.
* **Don't Do This:** Concatenate user input directly into SQL queries.

**Why:** Parameterized queries keep user input out of the SQL text and allow DuckDB to optimize query execution.

**Example:**

"""python
# Parameterized Query
import duckdb

def get_user(db_path, user_id):
    """Retrieves a user from the database using a parameterized query."""
    conn = duckdb.connect(db_path)
    try:
        result = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchone()
        if result:
            return {
                'id': result[0],
                'username': result[1],
                'email': result[2]
            }
        else:
            return None
    except duckdb.Error as e:
        print(f"Error retrieving user: {e}")
        return None
    finally:
        conn.close()

db_path = 'user_db.duckdb'
conn = duckdb.connect(db_path)
conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, username VARCHAR, email VARCHAR)")
conn.execute("INSERT INTO users VALUES (1, 'john_doe', 'john.doe@example.com')")
conn.close()

user = get_user(db_path, 1)
print(user)
"""

## 4. Error Handling and Logging

### 4.1. Specific Exception Handling

* **Do This:** Catch specific "duckdb.Error" subclasses to handle different error conditions (e.g., "duckdb.CatalogException", "duckdb.InvalidInputException").
* **Don't Do This:** Use generic "except Exception:" blocks that can mask underlying issues.

**Why:** Specific exception handling allows for targeted error recovery logic.

**Example:**

"""python
import duckdb

def execute_query(db_path, query):
    """Executes a SQL query and handles potential DuckDB errors."""
    conn = duckdb.connect(db_path)
    try:
        result = conn.execute(query).fetchall()
        return result
    except duckdb.CatalogException as e:
        print(f"Table not found: {e}")
        return None
    except duckdb.InvalidInputException as e:
        print(f"Invalid input: {e}")
        return None
    except duckdb.Error as e:
        print(f"General DuckDB error: {e}")
        return None
    finally:
        conn.close()

db_path = 'test_db.duckdb'
results = execute_query(db_path, "SELECT * FROM non_existent_table")  # Raises duckdb.CatalogException
print(results)
results = execute_query(db_path, "SELECT * FROM 123")  # Malformed query; caught by one of the handlers above
"""

### 4.2. Logging

* **Do This:** Use a logging framework (e.g., "logging" in Python) to record significant events, errors, and warnings related to DuckDB operations, and use log levels (INFO, WARNING, ERROR) appropriately.
* **Don't Do This:** Rely solely on "print()" statements for debugging in production code.
* **Consider:** Implementing structured logging to facilitate analysis of log data.

**Why:** Logging provides valuable insights into application behavior and simplifies troubleshooting.

**Example:**

"""python
import duckdb
import logging

# Configure the logger
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def execute_query(db_path, query):
    """Executes a SQL query with logging."""
    conn = duckdb.connect(db_path)
    try:
        logging.info(f"Executing query: {query}")
        result = conn.execute(query).fetchall()
        logging.info("Query executed successfully.")
        return result
    except duckdb.Error as e:
        logging.error(f"Error executing query: {e}", exc_info=True)  # Log the exception details
        return None
    finally:
        conn.close()

# Create a dummy database and intentionally cause an error (the table does not exist)
db_path = 'test_logging.duckdb'
execute_query(db_path, "SELECT * FROM t")
"""

This document provides a foundational set of standards for effective state management in DuckDB applications. By adhering to these guidelines, developers can create robust, maintainable, and performant solutions. Remember to continually review and adapt these standards as DuckDB evolves and new best practices emerge.
# Core Architecture Standards for DuckDB

This document outlines the core architectural standards for contributing to and maintaining the DuckDB project. It focuses on the high-level structure, organization, and key design principles that guide development. Adherence to these standards ensures consistency, maintainability, performance, and security within the DuckDB codebase.

## 1. Fundamental Architectural Principles

DuckDB's architecture is designed around several key principles that guide its development:

* **Columnar Data Storage:** Data is stored in columns, enabling efficient analytical processing by minimizing I/O and maximizing vectorization opportunities.
* **In-Process Execution:** DuckDB operates within the same process as the application, eliminating serialization/deserialization overhead and enabling tight integration. This design choice favors simplicity and speed for many common use cases.
* **Vectorized Execution:** Queries are processed using vectorized execution, where operations are applied to entire columns (or chunks of columns) at once. This dramatically improves performance compared to row-by-row processing.
* **Extensibility:** DuckDB is designed to be extensible. Custom scalar functions ("UDFs"), table functions, and other extensions can be added to provide specialized functionality.
* **Data Locality:** DuckDB tries to maintain high data locality by grouping related data together (e.g., using radix partitioning). This improves cache hit ratios and reduces memory access latency.
* **Minimal Dependencies:** Aiming for ease of deployment and portability, DuckDB strives to minimize external dependencies.
* **Cost-Based Optimizer:** DuckDB uses a cost-based optimizer for query planning. It estimates the cost of different execution strategies and selects the most performant one.

## 2. Project Structure and Organization

The DuckDB project follows a well-defined directory structure:

* **"src/":** Contains the core source code.
  * **"catalog/":** Manages database metadata (tables, schemas, functions, etc.).
  * **"common/":** Common utility functions and data structures used throughout the codebase.
  * **"execution/":** Implements query execution logic, including the vectorized processing engine.
  * **"function/":** Contains built-in SQL functions.
  * **"main/":** Main entry point for the DuckDB library.
  * **"optimizer/":** Implements the query optimizer, including rule-based and cost-based optimizations.
  * **"parser/":** Responsible for parsing SQL queries.
  * **"planner/":** Creates the logical plan from the parsed SQL query.
  * **"storage/":** Implements storage management and data access.
  * **"transaction/":** Manages database transactions and concurrency control.
* **"include/duckdb/":** Public header files for the DuckDB API.
* **"test/":** Contains unit tests and integration tests.
* **"extension/":** Location for extensions to DuckDB.
* **"third_party/":** External libraries used by DuckDB.

### 2.1 Standards for Project Structure Contributions

* **Do This:** Place new source code in the appropriate subdirectory within "src/". If a new component is introduced, create a dedicated subdirectory.
* **Don't Do This:** Add source files to the root "src/" directory unless it is absolutely unavoidable. Keeping files in their subdirectories keeps the codebase organized and navigable.

**Example:** If you are adding a new string function, create the files "src/function/string/my_new_string_function.cpp" and "src/function/string/my_new_string_function.hpp".
### 2.2 Namespaces

* **Do This:** All DuckDB code should reside within the "duckdb" namespace. Nested namespaces (e.g., "duckdb::storage") can be used to further organize code within modules. Use anonymous namespaces for file-local symbols.
* **Don't Do This:** Use the global namespace or other top-level namespaces for DuckDB code.

**Example:**

"""cpp
namespace duckdb {
namespace storage {

class MyStorageClass {
	// ...
};

} // namespace storage
} // namespace duckdb
"""

### 2.3 Directory Naming Conventions

* **Do This:** Keep directory names lowercase and descriptive. Use underscores to separate words (e.g., "storage_manager").
* **Don't Do This:** Use camelCase or mixed-case directory names. Avoid abbreviations unless they are well established within the project (e.g., "UDF" is acceptable rather than "user_defined_function").

## 3. Coding Style and Formatting

DuckDB follows a consistent coding style based on LLVM's style guide, with minor customizations.

* **Do This:**
  * Use clang-format to automatically format code. A ".clang-format" file is provided in the root of the repository.
  * Follow the naming conventions for variables (snake_case), classes (PascalCase), and functions (camelCase, starting with a lowercase letter).
  * Use expressive and descriptive names.
  * Keep lines within a reasonable length (ideally under 120 characters).
* **Don't Do This:**
  * Manually format code. Let clang-format handle the formatting.
  * Use cryptic or single-letter variable names (except in very localized contexts like loop counters).

**Example:**

"""cpp
// Correct
class MyStorageManager {
public:
	void initializeStorage(const string &path);

private:
	string database_path_;
};

// Incorrect
class mystoragemanager { // Class name should be PascalCase
public:
	void initstorage(const string &p); // Function and parameter names are unclear

private:
	string dbpath; // Variable name unclear, should be descriptive snake_case
};
"""

## 4. Memory Management

DuckDB employs a combination of manual memory management (using "new" and "delete"), smart pointers ("unique_ptr", "shared_ptr") for resource ownership, and a custom memory pool allocator for managing the lifetime of short-lived objects within the vectorized execution engine.

### 4.1 Standards for Memory Management

* **Do This:**
  * Use "unique_ptr" for exclusive ownership of resources. This is the preferred way to manage memory in most cases.
  * Use "shared_ptr" only when shared ownership is explicitly required. Carefully consider the lifetime implications when using "shared_ptr" to avoid circular dependencies and memory leaks.
  * Use the memory pool allocator ("Allocator") for allocating short-lived objects within the vectorized execution engine, especially within inner loops or frequently called functions. This avoids the overhead of "new" and "delete" for each object.
  * When using raw pointers, ensure clear ownership transfer and deallocation, document the ownership semantics, and consider using RAII (Resource Acquisition Is Initialization) to tie the lifetime of the resource to the lifetime of an object.
* **Don't Do This:**
  * Use raw pointers for resource ownership without clear ownership transfer.
  * Leak memory by failing to "delete" allocated objects.
  * Double-free memory.
  * Access memory after it has been freed (use-after-free).
  * Mix different memory allocation strategies haphazardly.
**Example using "unique_ptr":** """cpp #include <memory> namespace duckdb { class MyObject { public: MyObject(int value) : value_(value) {} int GetValue() const { return value_; } private: int value_; }; void processObject(std::unique_ptr<MyObject> obj) { // 'obj' is exclusively owned here. std::cout << "Processing object with value: " << obj->GetValue() << std::endl; } // 'obj' is automatically deleted when it goes out of scope. std::unique_ptr<MyObject> createObject(int initialValue) { return std::make_unique<MyObject>(initialValue); } } // namespace duckdb """ **Example using the memory pool allocator:** """cpp namespace duckdb { class Vector { public: Vector(Allocator &allocator) : data_(allocator.Allocate(1024)) {} private: data_ptr_t data_; }; void myFunction(Allocator &allocator) { // Allocate a Vector using the provided allocator. Vector my_vector(allocator); } // Vector's memory is automatically deallocated when the Allocator's scope ends, usually at the end of query execution. } // namespace duckdb """ ## 5. Concurrency and Parallelism DuckDB leverages multi-threading for parallel query execution, particularly within the vectorized execution engine. ### 5.1 Standards for Concurrency * **Do This:** * Use appropriate locking mechanisms (e.g., "std::mutex", "std::shared_mutex") to protect shared data structures from race conditions. * Use fine-grained locking to minimize lock contention and maximize parallelism. * Consider using lock-free data structures for high-contention scenarios, but only when appropriate and with careful consideration of the complexity involved. The "atomic" types can be helpful here. * Utilize the task scheduler for managing parallel tasks. * **Don't Do This:** * Introduce data races by accessing shared data without proper synchronization. * Hold locks for extended periods, blocking other threads. * Create deadlocks by acquiring locks in inconsistent orders. **Example using "std::mutex":** """cpp #include <mutex> namespace duckdb { class SharedData { public: void incrementCounter() { std::lock_guard<std::mutex> lock(mutex_); // RAII-style locking counter_++; } int getCounter() const { std::lock_guard<std::mutex> lock(mutex_); return counter_; } private: int counter_ = 0; std::mutex mutex_; }; } // namespace duckdb """ ## 6. Error Handling Robust error handling is crucial for maintaining the stability and reliability of DuckDB. ### 6.1 Standards for Error Handling * **Do This:** * Use exceptions ("std::exception" or custom exception classes derived from it) to signal errors. Specifically "duckdb::Exception" and its subclasses are preferred. * Catch exceptions at appropriate levels and handle them gracefully. * Provide informative error messages that include the context of the error (e.g., the SQL query being executed, the file being processed). * Use "D_ASSERT" macros for internal assertions that should always be true. These assertions are enabled in debug builds and can help catch bugs early. * Return "Value" objects which contain error states when appropriate, especially for functions. * **Don't Do This:** * Ignore errors. * Use return codes for error handling unless there is a very specific reason to do so. Exceptions provide a much cleaner separation of concerns. * Throw generic exceptions without providing specific error information. * Use assertions for error conditions that can occur in production. Assertions are only enabled in debug builds; use exceptions for handling runtime errors. 
**Example using exceptions:**

"""cpp
#include <iostream>
#include <stdexcept>

#include "duckdb.hpp"

namespace duckdb {

void myFunction(int value) {
	if (value < 0) {
		throw InvalidInputException("Value must be non-negative");
	}
	// ...
}

void anotherFunction() {
	try {
		myFunction(-1);
	} catch (const InvalidInputException &e) {
		// Handle the error appropriately (e.g., log it, return an error value).
		std::cerr << "Error: " << e.what() << std::endl;
	} catch (const Exception &e) {
		// Catch DuckDB-specific exceptions.
		std::cerr << "DuckDB Error: " << e.what() << std::endl;
	} catch (const std::exception &e) {
		// Catch standard exceptions.
		std::cerr << "Standard exception: " << e.what() << std::endl;
	} catch (...) {
		// Handle unexpected errors.
		std::cerr << "Unknown error occurred." << std::endl;
	}
}

} // namespace duckdb
"""

## 7. Logging

DuckDB uses a logging system to record events and diagnostic information at different levels of severity.

### 7.1 Standards for Logging

* **Do This:**
  * Use the logging macros (e.g., "D_LOG", "D_INFO", "D_DEBUG", "D_WARN", "D_ERROR") to log events at the appropriate severity level.
  * Include relevant context in log messages (e.g., the function name, the current state of the system).
  * Use structured logging to make log messages easier to parse and analyze.
* **Don't Do This:**
  * Over-log, creating excessive noise in the logs.
  * Log sensitive information (e.g., passwords, API keys).
  * Use "std::cout" or "std::cerr" for logging. Use the DuckDB logging macros instead for consistency and configurability.

**Example using logging macros:**

"""cpp
#include "duckdb.hpp"

namespace duckdb {

void myFunction(int value) {
	D_DEBUG("myFunction called with value: {}", value); // Debug-level message
	if (value < 0) {
		D_ERROR("Invalid value: {}", value); // Error-level message
		throw InvalidInputException("Value must be non-negative");
	}
	D_INFO("Processing value: {}", value);
}

} // namespace duckdb
"""

## 8. Extensibility

DuckDB is designed to be extensible, allowing developers to add custom functions, table functions, and other extensions.

### 8.1 Standards for Extensibility

* **Do This:**
  * Follow the documented API for creating custom functions and table functions. Refer to the DuckDB documentation for the latest API details.
  * Provide clear documentation and examples for your extensions.
  * Consider contributing your extensions back to the DuckDB community or publishing them as separate packages that others can use.
  * Ensure that extensions are thread-safe and do not introduce data races. Use proper synchronization mechanisms when accessing shared data structures.
* **Don't Do This:**
  * Modify the core DuckDB code to add custom functionality. Use the extension API instead.
  * Introduce breaking changes to the extension API without careful consideration and communication with the community.
  * Create extensions that are insecure or unreliable.
**Example Registering UDFs:**

"""cpp
#include "duckdb.hpp"
#include "duckdb/function/scalar_function.hpp"

namespace duckdb {

static void my_scalar_function(DataChunk &args, ExpressionState &state, Vector &result) {
	auto &input = args.data[0];
	// Add one to every input value.
	UnaryExecutor::Execute<int32_t, int32_t>(input, result, args.size(),
	                                         [&](int32_t input) { return input + 1; });
}

class MyExtension : public Extension {
public:
	std::string Name() override {
		return "my_extension";
	}

	void Load(DatabaseInstance &instance) override {
		Connection con(instance);
		con.BeginTransaction();
		auto &catalog = Catalog::GetCatalog(*con.GetContext());
		ScalarFunction my_function("my_scalar_function", {LogicalType::INTEGER}, LogicalType::INTEGER,
		                           my_scalar_function);
		catalog.CreateFunction(*con.GetContext(), my_function);
		con.Commit();
	}
};

extern "C" {

DUCKDB_EXTENSION_API void MyExtension_init(duckdb::DatabaseInstance &db) {
	db.RegisterExtension(std::make_unique<MyExtension>());
}

DUCKDB_EXTENSION_API const char *MyExtension_version() {
	return duckdb::DuckDB::LibraryVersion();
}

} // extern "C"

} // namespace duckdb
"""

## 9. Testing

Tests are written using the Catch test framework and are located in the "test/" directory. Each component of the DuckDB system should have corresponding unit tests.

### 9.1 Testing Standards

* **Do This:**
  * Write both unit tests and integration tests to cover different aspects of the code. Unit tests should focus on individual components, while integration tests should verify the interaction between multiple components.
  * Use descriptive test names that clearly indicate what is being tested.
  * Write tests that are reliable and repeatable. Avoid tests that depend on external factors (e.g., network connectivity, a specific file system layout) unless those factors are explicitly part of the test.
  * Aim for good test coverage by exercising all code paths and edge cases.
* **Don't Do This:**
  * Skip writing tests, even for small changes.
  * Write tests that are flaky or unreliable. Fix or remove such tests.
  * Commit code without running the tests first.

**Example Test:**

"""cpp
#include "catch.hpp"
#include "duckdb.hpp"

using namespace duckdb;

TEST_CASE("Basic test", "[core]") {
	DBConfig config;
	DuckDB db(nullptr, &config);
	Connection con(db);
	REQUIRE(con.Query("SELECT 42")->GetValue(0, 0) == Value::INTEGER(42));
}

TEST_CASE("Test that asserts", "[common]") {
	// This test will only fail in debug mode.
	REQUIRE_ASSERT(D_ASSERT(1 == 2));
}
"""

## 10. Documentation

Clear and up-to-date documentation is essential for making DuckDB easy to use and contribute to.

### 10.1 Standards for Documentation

* **Do This:**
  * Document all public APIs (functions, classes, etc.) using Doxygen-style comments.
  * Provide clear and concise explanations of the purpose, usage, and limitations of each API.
  * Include examples to illustrate how to use the API.
  * Keep the documentation up to date as the code evolves.
  * Document internal design decisions and architectural choices to help other developers understand the codebase.
  * Use meaningful comments within the code to explain complex logic or non-obvious decisions.
* **Don't Do This:**
  * Skip documenting public APIs.
  * Write documentation that is vague, incomplete, or inaccurate.
  * Let the documentation become outdated.

**Example using Doxygen comments:**

"""cpp
namespace duckdb {

/**
 * @brief Initializes the storage manager.
 * @param path The path to the database file.
 */
void initializeStorage(const std::string &path);

} // namespace duckdb
"""
## 11. Specific Considerations for Core Architecture

* **Catalog Management:** The "catalog/" directory is critical; changes here affect the entire database and require extensive testing. Correct locking is essential to prevent database corruption, and all catalog changes must be properly logged for recovery purposes.
* **Query Optimizer:** The "optimizer/" directory is performance-sensitive. New optimization rules should be carefully evaluated for their impact on query performance: run benchmarks before and after changes, and pay special attention to corner cases for robustness.
* **Storage Layer:** The "storage/" directory is responsible for data persistence. A correct implementation of the write-ahead log (WAL) is critical for durability. Thoroughly test recovery scenarios after system crashes or power failures; performance changes in the storage layer have a global impact.

By adhering to these coding standards, developers can contribute to the DuckDB project in a consistent, maintainable, and high-quality manner. This collaborative effort ensures that DuckDB remains a powerful and reliable analytical database system.