# Performance Optimization Standards for DuckDB
This document outlines the performance optimization standards for DuckDB, providing guidelines for developers to write efficient and performant code. These standards are tailored for DuckDB's architecture and are designed to improve application speed, responsiveness, and resource usage.
## 1. Query Optimization
### 1.1. Understanding Query Plans
**Standard:** Analyze query plans to identify bottlenecks and optimize query execution.
* **Do This:** Use "EXPLAIN" to examine the query plan and identify areas for improvement.
* **Don't Do This:** Blindly execute queries without understanding their underlying execution strategy.
**Why:** Understanding the query plan allows developers to make informed decisions about indexing, data types, and query structure.
**Example:**
"""sql
EXPLAIN SELECT * FROM lineitem WHERE l_orderkey = 12345;
"""
This will output the query plan, showing the steps DuckDB will take to execute the query. Areas of concern include full table scans, inefficient joins, or suboptimal sorting.
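For a query that is already slow, "EXPLAIN ANALYZE" goes one step further than "EXPLAIN": it executes the query and annotates each operator with actual timings and row counts. A minimal sketch (the filter value is illustrative):
"""sql
-- Runs the query and reports per-operator timings and cardinalities
EXPLAIN ANALYZE
SELECT l_returnflag, COUNT(*) AS cnt
FROM lineitem
WHERE l_shipdate > DATE '1995-01-01'
GROUP BY l_returnflag;
"""
Comparing estimated and actual cardinalities in this output is often the fastest way to spot a misbehaving filter or join.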
### 1.2. Indexing Strategies
**Standard:** Employ appropriate indexing strategies to accelerate data retrieval.
* **Do This:** Create indexes on frequently queried columns, especially those used in "WHERE" clauses and join conditions. Consider using multi-column indexes for composite queries.
* **Don't Do This:** Over-index tables, as this can slow down write operations and increase storage overhead. Avoid indexing columns with low cardinality or those rarely used in queries.
**Why:** Indexes let DuckDB locate matching rows without scanning the whole table. In DuckDB, ART indexes pay off mainly for highly selective point lookups and for enforcing key constraints; large analytical scans benefit more from row-group min/max metadata (zonemaps) and well-ordered data.
**Example:**
"""sql
-- Single-column index
CREATE INDEX idx_orderkey ON lineitem (l_orderkey);
-- Multi-column index
CREATE INDEX idx_order_ship ON lineitem (l_orderkey, l_shipdate);
"""
Carefully consider the order of columns in multi-column indexes. The most frequently queried column should come first. DuckDB (as of recent versions) also supports expression indexes, though these should be used judiciously as they can complicate maintenance.
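As a sketch of an expression index (the expression and table are illustrative; confirm support in your DuckDB version):
"""sql
-- Index over a computed value; queries filtering on the same expression can use it
CREATE INDEX idx_lineitem_ship_year ON lineitem (year(l_shipdate));
-- e.g. SELECT COUNT(*) FROM lineitem WHERE year(l_shipdate) = 1995;
"""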
### 1.3. Data Type Considerations
**Standard:** Use the most appropriate data types for your data to minimize storage and improve performance.
* **Do This:** Use smaller integer types (e.g., "SMALLINT", "INTEGER") when the range of values allows. Use the "DATE" and "TIMESTAMP" types for date and time data rather than storing them as strings. Note that in DuckDB "VARCHAR", "TEXT", and "STRING" are the same type, and a declared length such as "VARCHAR(255)" is accepted for compatibility but does not affect storage or performance.
* **Don't Do This:** Use unnecessarily large data types, such as "BIGINT" when "INTEGER" suffices, or store dates, timestamps, and numbers as strings.
**Why:** Smaller data types reduce storage space and memory usage, leading to faster data processing.
**Example:**
"""sql
-- Good: Using SMALLINT when appropriate
CREATE TABLE orders (
order_id SMALLINT, -- Assuming order IDs won't exceed the range of SMALLINT
order_date DATE
);
-- Bad: Using BIGINT unnecessarily
CREATE TABLE products (
    product_id BIGINT,      -- INTEGER might be sufficient
    product_name VARCHAR
);
-- Better: Right-sized integer and an explicit timestamp type
CREATE TABLE products (
    product_id INTEGER,
    product_name VARCHAR,   -- A length such as VARCHAR(255) is allowed but does not affect storage or performance in DuckDB
    created_at TIMESTAMP
);
"""
### 1.4. Join Optimization
**Standard:** Optimize join operations to minimize the amount of data processed.
* **Do This:** Let DuckDB's optimizer choose the join algorithm; it selects strategies such as hash joins for equality predicates based on table sizes and statistics, and it cannot be overridden with hints. Keep join predicates as simple equality comparisons on columns with matching types, verify the plan with "EXPLAIN", and leverage pre-calculated aggregates where appropriate. Indexes on join keys can help very selective joins, though DuckDB's hash joins do not require them.
* **Don't Do This:** Join on complex expressions rather than simple column comparisons, join on columns with mismatched data types (forcing implicit casts), or produce cartesian products by omitting join conditions.
**Why:** Efficient join operations are critical for query performance, especially in data warehousing scenarios.
**Example:**
"""sql
-- Good: Indexed join columns (index names must be unique within a schema)
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
CREATE INDEX idx_customers_customer_id ON customers (customer_id);
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id;
-- Note: DuckDB does not support optimizer hints such as /*+ BROADCAST(customers) */;
-- they are parsed as ordinary comments. The optimizer decides the join strategy
-- (for example, building the hash table on the smaller input) on its own.
-- Less ideal: without indexes, a highly selective join cannot fall back to an index-based plan
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id; -- Assuming no index on customer_id
"""
### 1.5. Subquery Optimization
**Standard:** Rewrite subqueries where possible to improve performance.
* **Do This:** Use "JOIN" operations instead of correlated subqueries when possible. Use Common Table Expressions (CTEs) to break down complex queries into smaller, manageable parts.
* **Don't Do This:** Use correlated subqueries excessively, as they can significantly slow down query execution.
**Why:** Correlated subqueries can be inefficient because they are executed for each row in the outer query.
**Example:**
"""sql
-- Bad: Correlated subquery
SELECT o.order_id
FROM orders o
WHERE EXISTS (
SELECT 1
FROM lineitem l
WHERE l.order_id = o.order_id
);
-- Good: Using a JOIN instead
SELECT DISTINCT o.order_id
FROM orders o
JOIN lineitem l ON o.order_id = l.order_id;
-- Good: Using a CTE for readability and potential optimization
WITH OrderItems AS (
SELECT order_id FROM lineitem
)
SELECT o.order_id
FROM orders o
WHERE o.order_id IN (SELECT order_id FROM OrderItems);
"""
### 1.6. Filtering Early
**Standard:** Apply filters as early as possible in the query execution pipeline.
* **Do This:** Place "WHERE" clauses that significantly reduce the number of rows processed at the beginning of the query.
* **Don't Do This:** Filter data late in the query execution pipeline, after expensive operations like joins or aggregations.
**Why:** Filtering early reduces the amount of data that subsequent operations need to process.
**Example:**
"""sql
-- Good: Filtering early significantly reduces rows
SELECT *
FROM orders
WHERE order_date > '2023-01-01'
AND customer_id IN (SELECT customer_id FROM active_customers);
-- Less ideal: the filter appears only after the join; DuckDB usually pushes simple predicates below the join, but filters that depend on join results or aggregates cannot be pushed down
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
WHERE order_date > '2023-01-01';
"""
## 2. Data Loading and Storage
### 2.1. Bulk Loading
**Standard:** Use bulk loading techniques for large datasets.
* **Do This:** Use "COPY" command or DuckDB's API to load data in bulk. Use vectorized reads when possible.
* **Don't Do This:** Load data row-by-row using individual "INSERT" statements.
**Why:** Bulk loading is significantly faster than individual "INSERT" statements.
**Example:**
"""sql
-- CSV import
COPY lineitem FROM 'lineitem.tbl' (DELIMITER '|');
-- Parquet import (highly recommended due to DuckDB's columnar nature)
COPY lineitem FROM 'lineitem.parquet' (FORMAT 'PARQUET');
"""
If possible, load data pre-sorted on the columns you filter or range-scan most often. DuckDB keeps min/max metadata (zonemaps) per row group, so sorted data lets scans skip irrelevant row groups (see Section 2.2).
### 2.2. Data Clustering and Sorting
**Standard:** Cluster and sort data based on common query patterns.
* **Do This:** Use "ALTER TABLE ... CLUSTER BY" to physically sort the data on disk based on specific columns. This is extremely beneficial for range queries.
* **Don't Do This:** Neglect to cluster data, especially for large tables. Cluster by columns that are rarely used in queries.
**Why:** Clustering data improves query performance by reducing the amount of data that needs to be scanned for range queries or queries involving a specific order.
**Example:**
"""sql
ALTER TABLE lineitem CLUSTER BY l_orderkey, l_shipdate; --Cluster by orderkey, then by shipdate within each orderkey
"""
### 2.3. Compression
**Standard:** Enable compression for large datasets to reduce storage space and improve I/O performance.
* **Do This:** Use compression algorithms like Zstd or Snappy, especially when storing data in Parquet format. DuckDB automatically handles compression for its internal storage.
* **Don't Do This:** Store uncompressed data unnecessarily.
**Why:** Compression reduces the amount of data that needs to be read from disk, leading to faster query execution.
**Example:**
"""sql
-- Parquet with Zstd compression (generally a good balance of compression ratio and speed)
COPY lineitem TO 'lineitem_compressed.parquet' (FORMAT 'PARQUET', COMPRESSION 'ZSTD'); -- Explicit compression
-- DuckDB auto-compression (will use a reasonable default)
CREATE TABLE compressed_table AS SELECT * FROM lineitem;
"""
### 2.4. Partitioning (using Parquet Files)
**Standard:** Partition data into separate files based on logical criteria (e.g., date ranges, geographic regions).
* **Do This:** Store data in Parquet files, partitioned by relevant columns. Use DuckDB's globbing capabilities to efficiently query specific partitions.
* **Don't Do This:** Store all data in a single large file, as this can slow down query execution.
**Why:** Partitioning allows DuckDB to only read the relevant files for a given query, improving performance.
**Example:**
Assume you have Parquet files partitioned by year and month: "/data/orders/year=2023/month=01/orders.parquet", "/data/orders/year=2023/month=02/orders.parquet", etc.
"""sql
-- Query data for a specific month
SELECT * FROM read_parquet('/data/orders/year=2023/month=01/*.parquet');
-- Query data for a specific year (note the extra wildcard for the month directories)
SELECT * FROM read_parquet('/data/orders/year=2023/*/*.parquet');
-- Query all data
SELECT * FROM read_parquet('/data/orders/*/*/*.parquet'); -- Use cautiously: reading every partition defeats the purpose of partitioning
"""
### 2.5. Vectorized Reads
**Standard:** Utilize DuckDB's vectorized reads for efficient data processing from disk or other external sources.
* **Do This:** When reading from Parquet or other file formats, ensure DuckDB is configured to utilize vectorized reads. This is enabled by default; however, verify configurations in case of custom setups.
* **Don't Do This:** Implement custom, row-by-row processing when reading data into DuckDB, especially when standard file formats are used.
**Why:** Vectorized reads allow DuckDB to process data in batches, significantly improving the throughput of data ingestion and query execution.
**Example:**
DuckDB automatically utilizes vectorized reads for Parquet files. You generally will not need to configure this directly. However, for custom data-loading implementations, ensure that you are reading data in batches and passing it to DuckDB's vectorized execution engine.
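Even without custom code, you can stay on the vectorized path by expressing ingestion as set-based SQL instead of per-row statements. A minimal sketch (the file name is illustrative):
"""sql
-- Set-based insert: both the Parquet scan and the insert run vectorized
INSERT INTO lineitem
SELECT * FROM read_parquet('new_lineitem.parquet');
"""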
## 3. Concurrency and Parallelism
### 3.1. Connection Management
**Standard:** Manage database connections efficiently.
* **Do This:** Use connection pooling to reuse connections and avoid the overhead of creating new connections for each query. Close connections when they are no longer needed.
* **Don't Do This:** Create a new connection for each query. Leave connections open indefinitely.
**Why:** Establishing database connections can be expensive. Connection pooling improves performance by reusing existing connections.
**Example (Python):**
"""python
import duckdb
import threading

# Reuse one connection per thread instead of opening a new one per query
local = threading.local()

def get_connection():
    if not hasattr(local, "con"):
        local.con = duckdb.connect('my_database.duckdb')
    return local.con

def run_query(query):
    # Alternatively, share a single connection and derive per-thread cursors with con.cursor()
    con = get_connection()
    result = con.execute(query).fetchall()
    return result
"""
### 3.2. Parallel Query Execution
**Standard:** Leverage DuckDB's parallel query execution capabilities.
* **Do This:** Configure the number of threads used for query execution using "PRAGMA threads". Ensure that queries are designed to benefit from parallelism (e.g., large scans, aggregations).
* **Don't Do This:** Set the number of threads too high, as this can lead to excessive context switching and reduced performance.
**Why:** Parallel query execution can significantly improve performance for CPU-bound operations.
**Example:**
"""sql
PRAGMA threads=8; -- Use 8 threads
SELECT l_returnflag, l_linestatus, SUM(l_quantity) AS sum_qty, SUM(l_extendedprice) AS sum_base_price, SUM(l_discount) AS sum_discount
FROM lineitem
GROUP BY l_returnflag, l_linestatus;
RESET threads; -- Revert to the default (all available cores)
"""
Carefully assess the optimal number of threads for your workload. For I/O-bound workloads, increasing the number of threads excessively can introduce contention overhead rather than additional throughput.
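To confirm the currently effective value before and after tuning, a quick sketch using DuckDB's settings functions:
"""sql
SELECT current_setting('threads') AS threads;
-- Or inspect the setting together with its description
SELECT * FROM duckdb_settings() WHERE name = 'threads';
"""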
## 4. Runtime Configuration
### 4.1. Memory Management
**Standard:** Configure the amount of memory available to DuckDB.
* **Do This:** Use "PRAGMA memory_limit" to set the memory available to DuckDB. Monitor memory usage to ensure that the limit is appropriate.
* **Don't Do This:** Allow DuckDB to use excessive amounts of memory, potentially starving other processes. Set the memory limit too low, which can lead to disk spilling and reduced performance.
**Why:** Proper memory management prevents out-of-memory errors and ensures efficient query execution.
**Example:**
"""sql
PRAGMA memory_limit='16GB'; -- Set memory limit to 16GB
"""
### 4.2. Temporary Storage
**Standard:** Ensure that temporary storage is configured correctly.
* **Do This:** Use the "temp_directory" configuration option to specify a location for temporary files. Ensure that the specified location has sufficient storage space and high I/O performance.
* **Don't Do This:** Allow temporary files to be written to the default location, which may be on a slower storage device.
**Why:** DuckDB uses temporary storage for intermediate results. Configuring temporary storage correctly can improve query performance, especially when dealing with large datasets.
**Example:**
"""sql
PRAGMA temp_directory='/mnt/fast_ssd/duckdb_tmp';
"""
### 4.3. Detailed Monitoring
**Standard:** Actively monitor the performance of queries and I/O operations.
* **Do This:** Use DuckDB's built-in profiling features alongside external system monitoring tools.
* **Don't Do This:** Neglect to measure the impact of configuration changes and code optimizations. Changes should be tested thoroughly, as they can degrade performance for some workloads.
**Why:** Consistent monitoring helps ensure that changes are having the impact you expect, and catches unexpected degradation.
**Example:**
While DuckDB ships with profiling facilities of its own, the application should also be instrumented with the monitoring frameworks already used in the deployment environment, such as Prometheus, Grafana, or similar tools.
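On the DuckDB side, the built-in profiler is the primary tool: it can be enabled per connection and its output written to a file for later analysis. A sketch of a typical workflow (the output path is illustrative):
"""sql
PRAGMA enable_profiling = 'json';          -- or 'query_tree' for human-readable output
PRAGMA profiling_output = '/tmp/profile.json';
SELECT COUNT(*) FROM lineitem;             -- the query being profiled
PRAGMA disable_profiling;
"""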
## 5. Code Maintainability and Readability
### 5.1. Code Formatting and Style
**Standard:** Follow a consistent code formatting style.
* **Do This:** Use a consistent indentation style (e.g., 4 spaces). Use meaningful variable and function names. Add comments to explain complex logic.
* **Don't Do This:** Use inconsistent indentation. Use cryptic variable names. Write code without comments.
**Why:** Consistent code formatting improves readability and maintainability.
**Example:**
"""sql
-- Good: Well-formatted SQL
SELECT
c.customer_id,
c.customer_name,
COUNT(o.order_id) AS order_count
FROM
customers c
LEFT JOIN
orders o ON c.customer_id = o.customer_id
WHERE
c.region = 'North America'
GROUP BY
c.customer_id, c.customer_name
ORDER BY
order_count DESC
LIMIT 10;
-- Bad: Poorly formatted SQL
select c.customer_id,c.customer_name,count(o.order_id) from customers c left join orders o on c.customer_id=o.customer_id where c.region='North America' group by c.customer_id,c.customer_name order by count(o.order_id) desc limit 10;
"""
### 5.2. Modular Design
**Standard:** Break down complex queries and logic into smaller, reusable modules.
* **Do This:** Use Common Table Expressions (CTEs) to break down complex queries into smaller parts. Create reusable functions for common operations.
* **Don't Do This:** Write monolithic queries that are difficult to understand and maintain.
**Why:** Modular design improves code organization and reduces code duplication.
**Example:**
"""sql
-- Good: Using CTEs to break down a complex query
WITH CustomerOrders AS (
SELECT
customer_id,
COUNT(order_id) AS order_count
FROM
orders
GROUP BY
customer_id
),
TopCustomers AS (
SELECT
customer_id
FROM
CustomerOrders
ORDER BY
order_count DESC
LIMIT 10
)
SELECT
c.customer_id,
c.customer_name,
co.order_count
FROM
customers c
JOIN
TopCustomers tc ON c.customer_id = tc.customer_id
JOIN
CustomerOrders co ON c.customer_id = co.customer_id;
-- Bad: Monolithic query
SELECT c.customer_id, c.customer_name, COUNT(o.order_id) FROM customers c JOIN orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.customer_name ORDER BY COUNT(o.order_id) DESC LIMIT 10;
"""
By adhering to these coding standards, DuckDB developers can write efficient, maintainable, and performant code, ensuring that applications utilizing DuckDB run smoothly and effectively. The consistent application of these rules, aided by AI tools, should lead to a higher quality codebase and improved overall system performance. Remember to stay current with DuckDB's release notes, especially those regarding optimization, as the engine is rapidly evolving.
# API Integration Standards for DuckDB This document outlines the coding standards for integrating DuckDB with external APIs and backend services. It focuses on best practices to ensure maintainability, performance, and security when leveraging DuckDB in conjunction with external data sources and services. ## 1. General Principles of API Integration ### 1.1. Clear Separation of Concerns **Do This:** * Isolate API interaction logic from core database operations. * Create dedicated modules or functions responsible for communicating with external APIs. **Don't Do This:** * Embed API calls directly within SQL queries or stored procedures. This makes debugging incredibly difficult and tightly couples your SQL logic to an external service. **Why:** Separate concerns promote modularity and testability. API interactions are often subject to change (e.g., API version updates, schema changes), so isolating them reduces the impact of these changes on core database logic. **Example:** """python # Correct: Separate API interaction logic import requests import duckdb def fetch_data_from_api(api_url): """Fetches data from an external API.""" try: response = requests.get(api_url) response.raise_for_status() #
# Testing Methodologies Standards for DuckDB This document outlines the testing methodologies standards for DuckDB, aiming to provide a clear and actionable guide for developers. The focus is on ensuring high-quality code through effective unit, integration, and end-to-end testing strategies, tailored specifically for DuckDB's unique features and architecture. ## 1. Overview of Testing Strategies DuckDB's testing strategy encompasses several layers to ensure reliability and correctness. These layers include unit tests, integration tests, and end-to-end tests. Each layer serves a distinct purpose and contributes to overall code quality. * **Unit Tests:** Focus on individual components or functions in isolation. They verify that each unit performs as expected under various conditions. * **Integration Tests:** Verify the interaction between different components or modules. They ensure that components work together correctly. * **End-to-End Tests:** Simulate real-world scenarios and validate the entire system from start to finish. They ensure that the system meets the desired requirements and performs accurately. ## 2. Unit Testing Standards Unit tests are the foundation of a robust testing suite. They isolate individual components, making it easier to identify and fix bugs. ### 2.1. Writing Effective Unit Tests * **Do This:** Write unit tests for every non-trivial function or method. * **Why:** Ensures that each part of the codebase works as intended, reducing the risk of regressions. * **Do This:** Use descriptive names for test functions to clearly indicate what is being tested. * **Why:** Improves readability and maintainability of the test suite. * **Do This:** Ensure each unit test covers a specific functionality or edge case. Keep unit tests small and focused. * **Why:** Easier to pinpoint the source of a failure and reduces debugging time. * **Don't Do This:** Write unit tests that are too broad or cover multiple functionalities in a single test. * **Why:** Makes it harder to understand the cause of failure and increases maintenance overhead. * **Don't Do This:** Write unit tests that depend on external resources or services. * **Why:** Makes tests brittle and unreliable. Unit tests should be isolated and repeatable. ### 2.2. Example: Unit Testing a DuckDB Function Assume we have a simple C++ function that calculates the square of a number: """c++ // src/math_utils.cpp #include <cmath> namespace duckdb { double square(double x) { return x * x; } } // namespace duckdb """ Here’s how you can write a unit test for it: """c++ // test/unittest/math_utils_test.cpp #include "catch.hpp" #include "duckdb.hpp" #include "src/math_utils.cpp" // Include the source file directly for unit testing using namespace duckdb; TEST_CASE("Square function test", "[math_utils]") { REQUIRE(square(2.0) == Approx(4.0)); REQUIRE(square(0.0) == Approx(0.0)); REQUIRE(square(-2.0) == Approx(4.0)); REQUIRE(square(3.14) == Approx(9.8596)); } """ * **Explanation:** * We use "catch.hpp" as the testing framework, which is common in DuckDB projects. * "TEST_CASE" defines a new test case with a descriptive name and tag. * "REQUIRE" asserts that the condition is true. "Approx" handles floating-point comparisons to account for potential precision issues. * Including the ".cpp" file directly is often necessary for unit testing due to DuckDB's internal build structure. ### 2.3. 
Common Anti-Patterns in Unit Testing * **Ignoring Edge Cases:** Failing to test boundary conditions or edge cases can lead to unexpected behavior in production. * **Example:** Not testing the "square" function with very large or very small numbers. * **Over-Mocking:** Using mocks excessively can make tests less reliable and harder to maintain. * **Why:** Over-mocking can hide underlying issues and makes it harder to refactor code. * **Not Cleaning Up:** Failing to clean up resources after a test can lead to resource leaks or interference with subsequent tests. * **Example:** Forgetting to close database connections or delete temporary files. ### 2.4. Modern Approaches * **Property-Based Testing:** Generate a large number of test cases automatically based on defined properties. * **Benefit:** Can uncover unexpected edge cases and improve test coverage. * **Example:** Testing that "square(x)" always returns a non-negative number for any real number "x". * **Mutation Testing:** Introduce small changes (mutations) to the code and verify that the tests fail. * **Benefit:** Helps identify weak tests that do not adequately cover the code. ## 3. Integration Testing Standards Integration tests verify the interaction between different components or modules. They ensure that these components work together correctly. ### 3.1. Writing Effective Integration Tests * **Do This:** Focus on testing the interfaces between components. * **Why:** Ensures that data is passed correctly and that components communicate effectively. * **Do This:** Use realistic test data that simulates real-world scenarios. * **Why:** Increases the likelihood of detecting integration issues. * **Do This:** Write tests that verify the expected behavior of the system as a whole. * **Why:** Ensures that the system meets the desired requirements. * **Don't Do This:** Write integration tests that are too granular or focused on individual component behavior. * **Why:** Overlaps with unit tests and increases maintenance overhead. * **Don't Do This:** Depend on unstable or unreliable external resources. * **Why:** Makes tests flaky and hard to reproduce. ### 3.2. Example: Integration Testing with DuckDB Suppose we have a module that imports data from a CSV file into a DuckDB database and another module that performs analytics on the data. Here’s how you can write an integration test: """c++ // test/integration/csv_import_test.cpp #include "catch.hpp" #include "duckdb.hpp" #include <fstream> using namespace duckdb; TEST_CASE("CSV import and analytics integration test", "[integration]") { // Create a temporary CSV file std::ofstream csv_file("test.csv"); csv_file << "id,name,value\n"; csv_file << "1,Alice,10\n"; csv_file << "2,Bob,20\n"; csv_file.close(); // Initialize DuckDB database DuckDB db(":memory:"); Connection con(db); // Import data from CSV file con.Query("CREATE TABLE my_table AS SELECT * FROM read_csv_auto('test.csv')"); // Perform analytics auto result = con.Query("SELECT SUM(value) FROM my_table"); // Verify the result REQUIRE(result->GetValue(0, 0).GetValue<int64_t>() == 30); // Clean up the temporary file std::remove("test.csv"); } """ * **Explanation:** * We create a temporary CSV file with sample data. * We initialize a DuckDB database in memory. * We import the data from the CSV file into a table using "read_csv_auto". * We perform a simple aggregation query to calculate the sum of values. * We verify that the result matches the expected value. * Finally, we clean up the temporary CSV file. ### 3.3. 
Common Anti-Patterns in Integration Testing * **Ignoring Error Handling:** Failing to test how the system handles errors during integration can lead to unexpected behavior in production. * **Example:** Not testing what happens when the CSV file is malformed or missing. * **Overlapping with Unit Tests:** Writing integration tests that duplicate the functionality of unit tests. * **Why:** Increases maintenance overhead without providing additional value. * **Using Hardcoded Values:** Using hardcoded values in tests can make them brittle and hard to maintain. * **Example:** Hardcoding the expected sum of values instead of calculating it dynamically. ### 3.4. Modern Approaches * **Containerization:** Use Docker containers to create isolated and reproducible test environments. * **Benefit:** Ensures that tests run consistently across different environments. * **Test Data Management:** Use dedicated tools to manage test data and ensure data consistency. * **Benefit:** Reduces the risk of data-related issues and improves test reliability. ## 4. End-to-End Testing Standards End-to-end tests simulate real-world scenarios and validate the entire system from start to finish. They ensure that the system meets the desired requirements and performs accurately. ### 4.1. Writing Effective End-to-End Tests * **Do This:** Simulate realistic user interactions and workflows. * **Why:** Increases the likelihood of detecting issues that users might encounter. * **Do This:** Verify that the system meets the business requirements. * **Why:** Ensures that the system provides the expected value. * **Do This:** Write tests that cover the most critical user journeys. * **Why:** Focuses testing efforts on the most important functionalities. * **Don't Do This:** Write end-to-end tests that are too detailed or cover every possible scenario. * **Why:** Increases maintenance overhead and can make tests brittle. * **Don't Do This:** Rely on manual setup or configuration. * **Why:** Makes tests hard to automate and reproduce. ### 4.2. Example: End-to-End Testing with DuckDB Suppose we have a system that allows users to upload CSV files, perform SQL queries on the data, and download the results. Here’s how you can write an end-to-end test: """python # test/e2e/upload_query_download_test.py import duckdb import os def test_upload_query_download(): # Create a temporary CSV file with open("test.csv", "w") as f: f.write("id,name,value\n") f.write("1,Alice,10\n") f.write("2,Bob,20\n") # Initialize DuckDB database con = duckdb.connect(database=':memory:', read_only=False) # Upload data from CSV file con.execute("CREATE TABLE my_table AS SELECT * FROM read_csv_auto('test.csv')") # Perform SQL query result = con.execute("SELECT SUM(value) FROM my_table").fetchone()[0] # Verify the result assert result == 30 # Download the result (simulated) - write to file with open("result.txt", "w") as f: f.write(str(result)) # Clean up the temporary file os.remove("test.csv") os.remove("result.txt") """ * **Explanation:** * Uses pytest as a framework * We create a temporary CSV file with sample data. * We initialize a DuckDB database in memory. * We upload the data from the CSV file into a table using "read_csv_auto". * We perform a SQL query to calculate the sum of values. * We verify that the result matches the expected value. * Finally, we clean up the temporary file. We also simulate "downloading" the result by writing it into a separate file and clearing that up too. ### 4.3. 
Common Anti-Patterns in End-to-End Testing * **Ignoring Performance:** Failing to test the performance of the system during end-to-end tests can lead to performance bottlenecks in production. * **Example:** Not measuring the time it takes to upload, query, and download data. * **Testing Implementation Details:** Writing tests that are tightly coupled to the implementation details of the system. * **Why:** Makes tests brittle and hard to maintain. * **Using Manual Assertions:** Relying on manual inspection of the system to verify the results. * **Why:** Makes tests hard to automate and reproduce. ### 4.4. Modern Approaches * **Behavior-Driven Development (BDD):** Use BDD frameworks like Cucumber to define tests in a human-readable format. * **Benefit:** Improves collaboration between developers, testers, and business stakeholders. * **Continuous Integration/Continuous Deployment (CI/CD):** Automate the execution of end-to-end tests as part of the CI/CD pipeline. * **Benefit:** Provides rapid feedback and ensures that the system is always in a deployable state. ## 5. DuckDB-Specific Testing Considerations Due to DuckDB's specific architecture (in-process, embedded), certain aspects of testing require special attention. * **Concurrency:** When testing concurrent operations, ensure proper isolation to prevent race conditions and data corruption. Use transactions and locking mechanisms as needed. * **Memory Management:** Pay close attention to memory usage during tests, as DuckDB operates within the application's memory space. Use tools to monitor memory consumption and detect leaks. * **Storage:** When testing persistent storage features, ensure that data is properly persisted and recovered after restarts. Verify the integrity of the stored data. * **Extensions:** When testing DuckDB extensions, ensure that the extensions are properly loaded and initialized. Verify that the extension functions work as expected. ## 6. Test Data Generation Creating realistic and diverse test data is crucial for effective testing. * **Use data generation tools:** Libraries such as Faker (Python) or other data synthesis tools can generate data that mimics real-world datasets. * **Utilize existing datasets:** If possible, leverage anonymized or sample datasets that represent the type of data your DuckDB instance will handle. * **Coverage:** Ensure test data covers a wide range of values, including edge cases, nulls, and invalid data, to ensure comprehensive testing. ## 7. Performance and Regression Testing * **Implement performance benchmarks:** Regularly run performance benchmarks for key queries and operations to detect performance regressions. * **Automate regression testing:** Create regression tests that capture known bugs or performance issues. Ensure these tests run with every code change to prevent the reintroduction of previously resolved problems. ## 8. Collaboration with AI Coding Assistants When using AI coding assistants like GitHub Copilot, provide clear and specific instructions related to testing. * **Contextualize code generation:** When generating test code, provide specific details about the functionality being tested, expected inputs, and expected outputs. * **Review suggestions carefully:** Always review AI-generated test code to ensure it accurately reflects the testing requirements and standards. Do not blindly accept AI-generated suggestions. 
* **Leverage AI for test case generation**: Where appropriate, prompt the AI to generate additional test cases, especially boundary conditions and edge cases. ## 9. Conclusion Adhering to these testing methodologies standards will ensure that DuckDB projects are robust, reliable, and maintainable. By incorporating thorough unit, integration, and end-to-end testing strategies, developers can build high-quality data solutions with confidence. Ongoing refinement and adaptation of these standards based on evolving best practices and project-specific needs are crucial for continued success.
# Tooling and Ecosystem Standards for DuckDB This document outlines the coding standards and best practices related to **Tooling and Ecosystem** for DuckDB development. It provides specific guidance on leveraging recommended libraries, tools, and extensions to enhance development workflows, maintainability, performance, and security of DuckDB-based projects. These standards are designed to work in concert with AI coding assistants like GitHub Copilot and Cursor, providing them with the necessary context to generate high-quality, DuckDB-idiomatic code. ## 1. Development Environment and Tooling ### 1.1. Recommended IDEs and Editors **Do This:** * Use IDEs and editors with strong DuckDB support (e.g., VS Code with the DuckDB extension, JetBrains DataGrip, DBeaver). These provide syntax highlighting, code completion, and integration with DuckDB's CLI. * Configure your editor with a DuckDB language server if available. * Utilize linters and formatters specific for the host language (e.g., "flake8" and "black" for Python) to maintain code consistency. Apply formatting before committing code. **Don't Do This:** * Rely solely on basic text editors lacking DuckDB-specific features. * Skip configuring linters and formatters, leading to inconsistent code style. **Why This Matters:** Enhanced tooling improves developer productivity, reduces errors, and ensures code maintainability. **Example:** """python # VS Code settings.json for Python and DuckDB { "python.linting.flake8Enabled": true, "python.formatting.provider": "black", "[python]": { "editor.formatOnSave": true, "editor.codeActionsOnSave": { "source.organizeImports": true } }, "files.autoSave": "afterDelay", "files.autoSaveDelay": 500 } """ ### 1.2. Version Control (Git) **Do This:** * Use Git for version control. Commit frequently with descriptive commit messages. * Create branches for new features or bug fixes. Follow Gitflow or a similar branching strategy. * Utilize pull requests for code review. * Store DuckDB-related scripts, configurations, and any required data (if suitable for the repository) in the Git repository. Exclude generated DuckDB database files ( ".duckdb" files) from version control using ".gitignore". **Don't Do This:** * Commit directly to the main branch without code review. * Store sensitive data (e.g., API keys, passwords) directly in the repository. Utilize environment variables and secure configuration management instead. * Track large data files or binary artifacts in Git. Consider using a separate data storage solution and version control for code only. **Why This Matters:** Version control is essential for collaboration, code management, and rollback capabilities. **Example:** """.gitignore # DuckDB database files *.duckdb *.wal # Temporary files tmp/ *.tmp # venv venv/ """ ### 1.3. Build Systems (CMake) **Do This:** * For C/C++ projects which interface with DuckDB, use CMake to manage the build process. CMake simplifies cross-platform compilation and dependency management. * Properly link against the DuckDB library in your "CMakeLists.txt" file. * Use CMake's "find_package" command to locate DuckDB. **Don't Do This:** * Manually manage compilation flags and linker options. * Hardcode paths to the DuckDB library. **Why This Matters:** CMake ensures portability and reproducible builds. 
**Example:** """cmake # CMakeLists.txt cmake_minimum_required(VERSION 3.15) project(MyDuckDBProject) find_package(DuckDB REQUIRED) add_executable(my_duckdb_app main.cpp) target_link_libraries(my_duckdb_app DuckDB::duckdb) # Link against DuckDB """ ## 2. DuckDB Extensions and Libraries ### 2.1. Utilizing DuckDB Extensions **Do This:** * Enable and use relevant DuckDB extensions to enhance functionality. Popular extensions include "httpfs" (for accessing data over HTTP/S), "parquet" (for reading Parquet files), "json" (for working with JSON data), and "excel" (for reading Excel files). * Install extensions using the "INSTALL" statement: "INSTALL httpfs;". * Load extensions using the "LOAD" statement: "LOAD httpfs;". Load extensions within your DuckDB scripts or application code. Check if an extension is already loaded before attempting to load it again. **Don't Do This:** * Forget to install and load extensions before using their functionality. * Load extensions unnecessarily, as this can increase startup time. **Why This Matters:** Extensions extend DuckDB's capabilities and improve data integration. **Example:** """sql -- Install and load the httpfs extension for accessing data over HTTP INSTALL httpfs; LOAD httpfs; -- Query a CSV file directly from a URL SELECT * FROM read_csv_auto('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv'); """ ### 2.2. Python Integration (DuckDB's "duckdb" Library) **Do This:** * Use the "duckdb" Python library for seamless integration with Python applications. * Utilize parameterized queries to prevent SQL injection vulnerabilities: "con.execute("SELECT * FROM my_table WHERE id = ?", (user_input,))". * Use the "df" method to convert between DuckDB result sets and Pandas DataFrames for data analysis: "df = con.execute("SELECT * FROM my_table").df()". * Use DuckDB's ability to directly query Pandas DataFrames: "con.execute("SELECT * FROM df").df()". **Don't Do This:** * Use string formatting to construct SQL queries, as this is prone to SQL injection. * Ignore error handling when interacting with the DuckDB library. * Fail to close connections and cursors properly. Use context managers ("with") to ensure resources are released. **Why This Matters:** The Python library simplifies data interaction and enables complex data workflows. **Example:** """python import duckdb import pandas as pd # Connect to an in-memory DuckDB database con = duckdb.connect(':memory:') # Create a Pandas DataFrame data = {'id': [1, 2, 3], 'value': ['a', 'b', 'c']} df = pd.DataFrame(data) # Register the DataFrame with DuckDB con.register('my_dataframe', df) # Query the DataFrame using SQL result = con.execute('SELECT * FROM my_dataframe WHERE id > 1').fetchdf() print(result) # Create a table from the DataFrame con.execute("CREATE TABLE my_table AS SELECT * FROM my_dataframe") # Execute a parameterized query user_input = 2 result = con.execute("SELECT * FROM my_table WHERE id = ?", (user_input,)).fetchdf() print(result) # Close the connection con.close() """ ### 2.3. R Integration (DuckDB's "duckdb" Package) **Do This:** * Use the "duckdb" R package for interacting with DuckDB from R. * Leverage "dbConnect()", "dbExecute()", "dbFetch()" and other functions provided by the package. * Use "dplyr" verbs within DuckDB using "dplyr::tbl()" to transparently push down operations. **Don't Do This:** * Manually construct SQL queries when R functions can achieve the same result. * Forget to disconnect from the database after use. 
**Why This Matters:** The R package enables efficient data analysis workflows within the R ecosystem. **Example:** """R library(duckdb) library(dplyr) # Connect to an in-memory DuckDB database con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:", read_only = FALSE) # Create a data frame df <- data.frame(id = 1:3, value = c("a", "b", "c")) # Write the data frame to a DuckDB table dbWriteTable(con, "my_table", df) # Query the table using dplyr result <- tbl(con, "my_table") %>% filter(id > 1) %>% collect() print(result) # Execute a SQL query result <- dbGetQuery(con, "SELECT * FROM my_table WHERE id = 1") print(result) # Disconnect from the database dbDisconnect(con, shutdown = TRUE) """ ### 2.4 Arrow Integration **Do This:** * Use Arrow for efficient data exchange between DuckDB and other systems. DuckDB has native support for Arrow. * Convert DuckDB result sets to Arrow tables using ".arrow()" in Python and ".arrow()" in R. * Pass Arrow tables directly to DuckDB for querying. * Install and utilize the "arrow" extension for advanced Arrow support: "INSTALL arrow; LOAD arrow;". **Don't Do This:** * Rely on inefficient data serialization methods when Arrow provides a faster alternative. * Ignore potential data type incompatibilities between DuckDB and other systems. **Why This Matters:** Arrow enables zero-copy data transfer, significantly boosting performance. **Example (Python):** """python import duckdb import pyarrow.parquet as pq import pyarrow as pa # Connect to DuckDB con = duckdb.connect(':memory:') # Create a sample Arrow table table = pa.Table.from_pydict({'id': [1, 2, 3], 'value': ['a', 'b', 'c']}) # Register the Arrow table with DuckDB con.register('my_arrow_table', table) # Query the Arrow table using SQL result = con.execute('SELECT * FROM my_arrow_table WHERE id > 1').df() print(result) # write arrow to parquet pq.write_table(table, 'test.parquet') # Load parquet to duckdb con.execute("CREATE TABLE parquet_table AS SELECT * FROM 'test.parquet'") result = con.execute('SELECT * FROM parquet_table WHERE id > 1').df() print(result) con.close() """ ### 2.5. Data Visualization Tools **Do This:** * Connect data visualization tools (e.g., Tableau, Power BI, Metabase) to DuckDB as a data source. * Prefer direct connections to DuckDB using the appropriate drivers. * Leverage the ability to rapidly prototype dashboards locally using DuckDB and subsequently deploy them using larger data warehouses. **Don't Do This:** * Rely on exporting data to files for visualization when direct connections are possible. * Overload DuckDB with very complex queries for visualization purposes. Consider pre-aggregating data if necessary. **Why This Matters:** Data visualization helps understand trends and insights in your data. ### 2.6. Testing Frameworks **Do This:** * Use testing frameworks (e.g., pytest in Python, testthat in R) to write unit and integration tests for your DuckDB-based code. * Write tests to verify the correctness of SQL queries, data transformations, and data loading processes. Use assertion libraries for expressive testing. * Utilize DuckDB's in-memory capabilities for isolated testing environments. **Don't Do This:** * Skip writing tests, leading to undetected bugs and regressions. * Use production databases for testing, which can lead to data corruption. **Why This Matters:** Testing ensures code quality and reliability. 
**Example (Python with pytest):** """python import duckdb import pytest @pytest.fixture(scope="function") def duckdb_conn(): conn = duckdb.connect(':memory:') conn.execute("CREATE TABLE test_table (id INTEGER, value VARCHAR)") conn.execute("INSERT INTO test_table VALUES (1, 'a'), (2, 'b'), (3, 'c')") yield conn conn.close() def test_select_all(duckdb_conn): result = duckdb_conn.execute("SELECT * FROM test_table").fetchall() assert len(result) == 3 assert result[0] == (1, 'a') def test_where_clause(duckdb_conn): result = duckdb_conn.execute("SELECT * FROM test_table WHERE id > 1").fetchall() assert len(result) == 2 assert result[0] == (2, 'b') """ ## 3. Performance Optimization Tools ### 3.1. Query Profiling **Do This:** * Use DuckDB's built-in query profiler to identify performance bottlenecks in SQL queries. * Use "PRAGMA show_profile" to analyze query execution plans and identify slow operations. * Utilize the "EXPLAIN" statement to understand the query execution plan before running the query. **Don't Do This:** * Guess at performance bottlenecks without profiling. * Ignore the query execution plan when optimizing queries. **Why This Matters:** Profiling and analyzing query plans provides insights into performance issues. **Example:** """sql -- Enable profiling PRAGMA enable_profiling; -- Execute a query SELECT COUNT(*) FROM lineitem; -- Show the profiling information PRAGMA show_profile; --To clear the profiling information PRAGMA disable_profiling -- Analyze query execution plan EXPLAIN SELECT * FROM lineitem WHERE l_shipdate > '1998-12-01'; """ ### 3.2. Indexing **Do This:** * Create indexes on frequently queried columns to speed up data retrieval. Ensure that indexes are actually used and not slowing down write operations. * Consider using different index types (e.g., B-Tree, Hash) based on query patterns. **Don't Do This:** * Create indexes on every column, as this can slow down write operations. * Forget to analyze the performance impact of indexes. **Why This Matters:** Indexes improve query performance by allowing DuckDB to quickly locate data. **Example:** """sql -- Create an index on the l_shipdate column CREATE INDEX idx_shipdate ON lineitem (l_shipdate); """ ### 3.3. Data Partitioning (future) **Do This (when available):** * Explore data partitioning strategies to improve query performance on large datasets (future feature). **Don't Do This (currently):** * Rely on data partitioning until it is a fully supported feature in DuckDB. ## 4. Security Tools and Practices ### 4.1. Data Encryption **Do This:** * Utilize DuckDB's encryption features to protect sensitive data at rest and in transit (if supported and required). * Consult with security experts on choosing appropriate encryption algorithms and key management strategies. **Don't Do This:** * Store encryption keys directly in code or configuration files. Use secure key management systems. **Why This Matters:** Encryption protects data from unauthorized access. ### 4.2. Least Privilege Principle **Do This:** * Grant users only the necessary privileges to access and modify data. * Use roles to manage permissions and simplify administration. **Don't Do This:** * Grant users excessive privileges, which can lead to security vulnerabilities. **Why This Matters:** Limiting privileges reduces the impact of security breaches. At the time of writing DuckDB doesn't have significant user permissioning controls. ### 4.3. 
Input Validation and Sanitization **Do This:** * Validate and sanitize all user inputs to prevent SQL injection attacks. Use parameterized queries and prepared statements to escape user-provided data. **Don't Do This:** * Trust user inputs without validation. * Use string concatenation to build SQL queries with user inputs. **Why This Matters:** Input validation prevents malicious code from being executed. Always use parameterized queries. ## 5. Community and Support ### 5.1. Engaging with the DuckDB Community **Do This:** * Participate in the DuckDB community forums, mailing lists, and GitHub discussions. * Contribute to the DuckDB project by submitting bug reports, feature requests, and pull requests. * Share your knowledge and experience with other DuckDB users. **Don't Do This:** * Be afraid to ask questions or seek help from the community. * Ignore community guidelines and best practices. **Why This Matters:** Community involvement fosters collaboration and improves the DuckDB ecosystem. ### 5.2. Staying Up-to-Date **Do This:** * Follow the DuckDB release notes and documentation to stay informed about new features, bug fixes, and security updates. * Subscribe to the DuckDB newsletter or RSS feed. **Don't Do This:** * Use outdated versions of DuckDB, which may contain known bugs and security vulnerabilities. **Why This Matters:** Staying up-to-date ensures you are using the latest and greatest features and security patches. These standards serve as a comprehensive guide to leveraging tooling and ecosystem components when developing with DuckDB. By following these guidelines, developers can build robust, efficient, and secure DuckDB-based applications. Adherence will also ensure AI coding assistants provide more targeted and appropriate suggestions during development.
# State Management Standards for DuckDB This document outlines the coding standards for managing state within applications using DuckDB, focusing on data flow, reactivity, and persistence. These standards aim to ensure maintainability, performance, and security for DuckDB-driven applications. ## 1. Principles of State Management Effective state management is crucial for building robust and scalable DuckDB applications. A well-defined approach simplifies debugging, enhances testability, and improves overall code quality. ### 1.1. Explicit vs. Implicit State * **Do This:** Favor explicit state management. Clearly define and declare all state variables, data structures, and their relationships. Use appropriate data types. * **Don't Do This:** Rely on hidden or implicit state, such as global variables or mutable shared objects without clear boundaries. **Why:** Explicit state improves traceability and reduces the risk of unexpected side effects. **Example:** """python # Explicit State import duckdb def execute_query(db_connection, query): """Executes a SQL query against a DuckDB database.""" try: result = db_connection.execute(query).fetchall() return result except duckdb.Error as e: print(f"Error executing query: {e}") return None # Example Usage (Explicit Connection Object) conn = duckdb.connect(':memory:') conn.execute("CREATE TABLE mytable (id INTEGER, value VARCHAR)") conn.execute("INSERT INTO mytable VALUES (1, 'hello'), (2, 'world')") result = execute_query(conn, "SELECT * FROM mytable") print(result) conn.close() # Implicit State (Avoid) # (Using global database connections) """ ### 1.2. Immutable Data Structures * **Do This:** Use immutable data structures whenever possible to represent state. Prefer creating new copies of data upon modification rather than mutating existing objects. * **Don't Do This:** Modify data structures in place without considering the potential side effects on other parts of the application. **Why:** Immutability simplifies debugging and reasoning about data flow, particularly in concurrent environments. **Example:** """python # Immutable Data Structures & DuckDB import duckdb def update_records(db_path, table_name, updates): """ Simulates updating records by creating a new table with the modifications This is an example of immutable approach since DuckDB doesn't allow direct update in embedded mode """ conn = duckdb.connect(db_path) try: # 1. Read the existing records using DuckDB existing_records = conn.execute(f"SELECT * FROM {table_name}").fetchall() # Convert the result into a manageable format, like a dict records_dict = {record[0]: list(record[1:]) for record in existing_records} # Assuming id is record[0], and the rest are fields. # 2. Apply updates (generating new records) - Immutability approach: create new dict new_records_dict = records_dict.copy() # Create a copy for row_number, record_data in updates.items(): if row_number in new_records_dict: # We need to know the row number new_records_dict[row_number] = record_data # Update the dictionary (copy). 
#3 Delete old table and then add new table using dictionary conn.execute(f"DROP TABLE IF EXISTS {table_name}") # Convert each values(lists) in dictionary to tuple before adding a new table new_record_lists = {row_number: tuple(value) for row_number, value in new_records_dict.items()} table_data = list(new_record_lists.values()) # Define the column names for the new table column_names = ['id', 'name', 'age', 'city'] #Example of column Names # Create the new table using DuckDB conn.execute(f"CREATE TABLE {table_name} AS SELECT * FROM (VALUES {', '.join(map(str, table_data))}) AS t ({', '.join(column_names)})") # Verify result by reading sample data from updated table result = conn.execute(f"SELECT * FROM {table_name}").fetchall() print(f"Updated table records: {result}") except duckdb.Error as e: print(f"Error during update: {e}") finally: conn.close() # Example Usage - Important Note DuckDB requires to pass data as tuples instead of list to avoid type conversion issues db_path = 'my_example.duckdb' original_data = [(1, 'Alice', 30, 'New York'),(2, 'Bob', 25, 'Los Angeles'),(3, 'Charlie', 35, 'Chicago')] conn = duckdb.connect(db_path) conn.execute('CREATE TABLE IF NOT EXISTS users (id INTEGER, name VARCHAR, age INTEGER, city VARCHAR)') conn.executemany('INSERT INTO users VALUES (?, ?, ?, ?)', original_data) conn.close() updates = { 1: ['Alice Updated', 31, 'New Jersey'], # Key represents the row number 2: ['Bob Updated',26,'San Francisco'] } update_records(db_path, 'users', updates) """ ### 1.3. Single Source of Truth * **Do This:** Ensure that each piece of data has a single, authoritative source. Avoid redundant copies or derived data that can become inconsistent. Use DuckDB as the single source of truth for analytical data where possible. * **Don't Do This:** Cache data aggressively without proper invalidation mechanisms. **Why:** A single source of truth minimizes discrepancies and simplifies data synchronization. **Example:** """python # Single Source of Truth - DuckDB import duckdb def get_user_data(db_path, user_id): """Retrieves user data from DuckDB as the single source of truth.""" conn = duckdb.connect(db_path) try: result = conn.execute(f"SELECT * FROM users WHERE id = {user_id}").fetchone() if result: return { 'id': result[0], 'name': result[1], 'age': result[2], 'city': result[3] } else: return None except duckdb.Error as e: print(f"Error retrieving user data: {e}") return None finally: conn.close() # Usage db_path = 'my_example.duckdb' user_id = 1 user_data = get_user_data(db_path, user_id) print(user_data) """ ## 2. State Management Approaches in DuckDB Applications Different applications have different state management needs. Here's how to approach this for applications leveraging DuckDB: ### 2.1. Embedded DuckDB State * **Do This:** For small to medium-sized datasets, use DuckDB's embedded mode for direct data manipulation within the application's process. * **Don't Do This:** Attempt complex concurrent write operations in embedded mode without proper locking and transaction handling. * **Consider:** The limits of in-process memory and CPU usage for large datasets when using embedded DuckDB. **Why:** Embedded DuckDB offers simplicity and low latency for local analytics. 
**Example:**
"""python
# Embedded DuckDB Example
import duckdb

db_conn = duckdb.connect(':memory:')  # In-memory database for embedded use
db_conn.execute("CREATE TABLE items (id INTEGER, name VARCHAR)")
db_conn.execute("INSERT INTO items VALUES (1, 'Laptop')")
db_conn.execute("INSERT INTO items VALUES (2, 'Keyboard')")

results = db_conn.execute("SELECT * FROM items").fetchall()
print(results)
db_conn.close()
"""
### 2.2. Persistent DuckDB State
* **Do This:** Store the DuckDB database on disk for persisting data across application sessions.
* **Don't Do This:** Neglect backup and recovery mechanisms for persistent DuckDB databases.
* **Consider:** Using relative paths for the database file location to improve portability.
**Why:** Persistent storage ensures data continuity even across application restarts.
**Example:**
"""python
import duckdb

db_path = 'my_persistent_db.duckdb'  # Database file path

# Connect, create the table, and close the connection
db_conn = duckdb.connect(db_path)
db_conn.execute("CREATE TABLE IF NOT EXISTS user_profiles (id INTEGER, username VARCHAR, email VARCHAR)")
db_conn.close()

# Function to insert data
def insert_user_profile(db_path, user_id, username, email):
    conn = duckdb.connect(db_path)
    try:
        conn.execute("BEGIN TRANSACTION")  # Explicit transaction so commit/rollback below are valid
        conn.execute("INSERT INTO user_profiles VALUES (?, ?, ?)", (user_id, username, email))
        conn.commit()
        print(f"Inserted user: {username}")
    except duckdb.Error as e:
        print(f"Error inserting user: {e}")
        conn.rollback()
    finally:
        conn.close()

# Insert sample data into the persistent database
insert_user_profile(db_path, 1, 'john_doe', 'john.doe@example.com')
insert_user_profile(db_path, 2, 'jane_smith', 'jane.smith@example.com')

# Read function for retrieving a user profile
def get_user_profile(db_path, user_id):
    conn = duckdb.connect(db_path)
    try:
        result = conn.execute(f"SELECT * FROM user_profiles WHERE id={user_id}").fetchone()
        if result:
            return {
                'id': result[0],
                'username': result[1],
                'email': result[2]
            }
        else:
            return None
    except duckdb.Error as e:
        print(f"Error getting user profile: {e}")
        return None
    finally:
        conn.close()

# Get the data from the database and print it
user_profile = get_user_profile(db_path, 1)
print(user_profile)
"""
### 2.3. Connecting to External Data Sources
* **Do This:** Utilize DuckDB's ability to directly query data from Parquet, CSV, JSON, and other file formats without importing.
* **Don't Do This:** Assume that external data sources always conform to the expected schema. Implement robust error handling and schema validation.
* **Consider:** Optimizing access to external data sources by filtering and aggregating data within DuckDB rather than transferring large amounts of data to the application.
**Why:** External data access enables real-time analytics without data duplication.
**Example:**
"""python
# External Data Source - JSON (use read_json_auto so DuckDB infers the schema)
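# Note (illustrative): read_json_auto infers the schema from the file contents. When the source
# schema is not guaranteed, inspect it first, for example with
#   DESCRIBE SELECT * FROM read_json_auto('products.json')
# and validate the expected column names and types before trusting the results.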
import duckdb
import os

def analyze_json_data(json_file_path, query=None):
    """Analyzes JSON data using DuckDB."""
    try:
        # Default to selecting everything; read_json_auto lets DuckDB infer the schema
        full_query = query or f"SELECT * FROM read_json_auto('{json_file_path}')"
        conn = duckdb.connect(':memory:')
        result = conn.execute(full_query).fetchall()
        conn.close()
        return result
    except duckdb.Error as e:
        print(f"Error querying JSON data: {e}")
        return None

# Prepare a sample JSON file
json_data = '[{"id": 1, "name": "Laptop", "price": 1200}, {"id": 2, "name": "Keyboard", "price": 75}]'
with open('products.json', 'w') as f:
    f.write(json_data)

json_file_path = 'products.json'
query = f"SELECT name, price FROM read_json_auto('{json_file_path}') WHERE price > 100"
results = analyze_json_data(json_file_path, query)
print(results)
os.remove('products.json')  # clean up the file
"""
### 2.4. Managing Large Datasets
* **Do This:** Use DuckDB's efficient query engine to perform aggregations, filtering, and joins on large datasets directly within the database.
* **Don't Do This:** Load entire large datasets into application memory.
* **Consider:** Partitioning and indexing techniques to optimize query performance on large datasets.
**Why:** Optimized query execution minimizes memory usage and processing time.
**Example:**
"""python
# Large Dataset Handling
import os
import duckdb
import pandas as pd

def analyze_large_dataset(csv_file_path, query):
    """Analyzes a large CSV dataset using DuckDB."""
    try:
        # Establish a connection to DuckDB (in-memory for example)
        conn = duckdb.connect(':memory:')
        # Load the CSV file into a DuckDB table (a CREATE VIEW over read_csv_auto would avoid materializing a copy)
        conn.execute(f"CREATE TABLE my_data AS SELECT * FROM read_csv_auto('{csv_file_path}')")
        # Execute the query
        result = conn.execute(query).fetchdf()  # Retrieve the result as a Pandas DataFrame
        conn.close()
        return result
    except duckdb.Error as e:
        print(f"Error querying large dataset: {e}")
        return None

# Example: create a test file
data = {'col1': [1, 2, 3, 4, 5], 'col2': ['A', 'B', 'C', 'D', 'E'], 'col3': [1.1, 2.2, 3.3, 4.4, 5.5]}
df = pd.DataFrame(data)
csv_file_path = "test.csv"
df.to_csv(csv_file_path, index=False)

query = "SELECT col2, AVG(col3) FROM my_data GROUP BY col2"
results = analyze_large_dataset(csv_file_path, query)
print(results)
os.remove('test.csv')  # Clean up the test file
"""
### 2.5. Transactions
* **Do This:** Use transactions to ensure atomicity, consistency, isolation, and durability (ACID) when performing multiple write operations on DuckDB.
* **Don't Do This:** Perform write operations without transactions, which can lead to data corruption or inconsistencies in case of errors.
* **Consider:** DuckDB manages transaction isolation internally (MVCC) and does not expose configurable isolation levels; design concurrent writers with that in mind.
**Why:** Transactions guarantee data integrity during complex operations.
**Example:**
"""python
import duckdb

def transfer_funds(db_path, account_from, account_to, amount):
    """Transfers funds between two accounts using a transaction."""
    conn = duckdb.connect(db_path)
    try:
        conn.execute("BEGIN TRANSACTION")  # Start transaction
        # 1. Check if the sender account has sufficient balance.
        sender_balance = conn.execute(f"SELECT balance FROM accounts WHERE id = {account_from}").fetchone()[0]
        if sender_balance < amount:
            raise ValueError("Insufficient funds.")
        # 2. Withdraw from the sender account.
        conn.execute(f"UPDATE accounts SET balance = balance - {amount} WHERE id = {account_from}")
        # 3. Deposit to the receiver account.
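        # (Illustrative note: a parameterized form such as
        #  conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", [amount, account_to])
        #  avoids string interpolation entirely; see the parameterized-query guidance in section 3.3.
        #  f-strings are kept in this example for brevity.)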
conn.execute(f"UPDATE accounts SET balance = balance + {amount} WHERE id = {account_to}") conn.commit() print("Funds transferred successfully.") except ValueError as e: conn.rollback() print(f"Transaction rolled back due to {e}") except duckdb.Error as e: conn.rollback() print(f"Error during transfer: {e}") finally: conn.close() #Setup initial state def setup_accounts(db_path): conn = duckdb.connect(db_path) try: conn.execute('CREATE TABLE IF NOT EXISTS accounts (id INTEGER, balance REAL)') conn.execute('INSERT INTO accounts VALUES (1, 1000.0)') conn.execute('INSERT INTO accounts VALUES (2, 500.0)') conn.commit() print ("Set up user account") except duckdb.Error as e: print(f"Error setting up accounts: {e}") conn.rollback() #Roll back in case of an error finally: conn.close() db_path = 'bank_db.duckdb' setup_accounts(db_path) transfer_funds(db_path, 1, 2, 200.0) #Transfer 200 from user 1 to user 2 #Verify results conn = duckdb.connect(db_path) print (conn.execute("SELECT * from accounts").fetchall()) conn.close() """ ## 3. Modern Approaches and Patterns ### 3.1. Reactive Programming * **Do This:** Use reactive programming techniques (e.g., RxPY) to automatically update application state in response to changes in the underlying DuckDB data. * **Don't Do This:** Poll the database repeatedly to detect changes. * **Consider:** Using change data capture (CDC) mechanisms if available within your DuckDB environment (though DuckDB itself has limited direct CDC). **Why:** Reactive programming enables efficient and real-time state updates. **Example (Conceptual, requires external libraries):** """python # Conceptual Reactive Example (Requires e.g., RxPY) # Note: This is a simplified conceptual example. Integration would depend on # specific libraries providing reactive capabilities around database changes. # This demonstrates the idea, not a fully working example. import duckdb import reactivex from reactivex import operators as ops def create_database_observable(db_path, query, interval): """Creates an observable that emits data from a DuckDB query at a given interval.""" def subscribe(observer, scheduler=None): def run(): try: conn = duckdb.connect(db_path) result = conn.execute(query).fetchall() observer.on_next(result) conn.close() except Exception as e: observer.on_error(e) # Propagate any errors to the observable #Recursive function to keep schedule until disposed if not observer.is_stopped: scheduler.schedule(run, interval) #Initial Schedule with recusive function scheduler.schedule(run, interval) return reactivex.disposable.Disposable(run, interval) return reactivex.create(subscribe) #Example DB Setup db_path = 'reactive_db.duckdb' conn = duckdb.connect(db_path) conn.execute("CREATE TABLE IF NOT EXISTS sensor_data (timestamp TIMESTAMP, temperature REAL)") conn.execute("INSERT INTO sensor_data VALUES ('2024-11-07 10:00:00', 25.5)") conn.close() # Create an observable that queries the DuckDB database every 5 seconds. db_observable = create_database_observable(db_path, "SELECT * FROM sensor_data", 5) # Subscribe to the observable and print the data. def on_next(data): print(f"Data emitted: {data}") def on_error(error): print(f"Error: {error}") def on_completed(): print("Completed") disposable = db_observable.subscribe( on_next=on_next, # Function to call when data is emitted on_error=on_error, # Function to call if there's an error on_completed=on_completed # Function when observable is stopped ) # Wait for 15 seconds to receive three emissions. 
import time time.sleep(15) # Dispose and close the database connection. disposable.dispose() conn = duckdb.connect(db_path) #To avoid errors conn.close() """ ### 3.2. Using DuckDB with Arrow for Data Transfer * **Do This:** Leverage Apache Arrow as a data transfer format between DuckDB and other systems (e.g., Pandas, Spark). Use the "arrow()" method from DuckDB connection objects to fetch data as Arrow tables. * **Don't Do This:** Rely on inefficient data serialization formats when transferring data between DuckDB and other systems. **Why:** Arrow provides zero-copy data sharing, minimizing overhead. **Example:** """python # Arrow Example import duckdb import pyarrow as pa db_conn = duckdb.connect(':memory:') db_conn.execute("CREATE TABLE my_data (id INTEGER, value VARCHAR)") db_conn.execute("INSERT INTO my_data VALUES (1, 'hello'), (2, 'world')") arrow_table = db_conn.execute("SELECT * FROM my_data").arrow() print(arrow_table) print(type(arrow_table)) # Print the type of the arrow_table db_conn.close() """ ### 3.3. Parameterized Queries * **Do This:** Use parameterized queries to prevent SQL injection attacks and improve query performance. * **Don't Do This:** Concatenate user input directly into SQL queries. **Why:** Parameterized queries sanitize user input and allow DuckDB to optimize query execution. **Example:** """python # Parameterized Query import duckdb def get_user(db_path, user_id): """Retrieves a user from the database using a parameterized query.""" conn = duckdb.connect(db_path) try: result = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchone() if result: return { 'id': result[0], 'username': result[1], 'email': result[2] } else: return None except duckdb.Error as e: print(f"Error retrieving user: {e}") return None finally: conn.close() db_path = 'user_db.duckdb' conn = duckdb.connect(db_path) conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, username VARCHAR, email VARCHAR)") conn.execute("INSERT INTO users VALUES (1, 'john_doe', 'john.doe@example.com')") conn.close() user = get_user(db_path, 1) print (user) """ ## 4. Error Handling and Logging ### 4.1. Specific Exception Handling * **Do This:** Catch specific "duckdb.Error" exceptions to handle different error conditions (e.g., "duckdb.CatalogException", "duckdb.InvalidInputException"). * **Don't Do This:** Use generic "except Exception:" blocks that can mask underlying issues. **Why:** Specific exception handling allows for targeted error recovery logic. **Example:** """python import duckdb def execute_query(db_path, query): """Executes a SQL query and handles potential DuckDB errors.""" conn = duckdb.connect(db_path) try: result = conn.execute(query).fetchall() return result except duckdb.CatalogException as e: print(f"Table not found: {e}") return None except duckdb.InvalidInputException as e: print(f"Invalid input: {e}") return None except duckdb.Error as e: print(f"General DuckDB error: {e}") return None finally: conn.close() db_path = 'test_db.duckdb' results = execute_query(db_path, "SELECT * FROM non_existent_table") #Raises duckdb.CatalogException print(results) results = execute_query(db_path, "SELECT * FROM 123") #Invalid query, raises duckdb.InvalidInputException """ ### 4.2. Logging * **Do This:** Use a logging framework (e.g., "logging" in Python) to record significant events, errors, and warnings related to DuckDB operations. * **Don't Do This:** Rely solely on "print()" statements for debugging in production code. Include log levels (INFO, WARNING, ERROR) appropriately. 
* **Consider:** Implementing structured logging to facilitate analysis of log data. **Why:** Logging provides valuable insights into application behavior and simplifies troubleshooting. **Example:** """python import duckdb import logging # Configure the logger logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') def execute_query(db_path, query): """Executes a SQL query with logging.""" conn = duckdb.connect(db_path) try: logging.info(f"Executing query: {query}") result = conn.execute(query).fetchall() logging.info(f"Query executed successfully.") return result except duckdb.Error as e: logging.error(f"Error executing query: {e}", exc_info=True) # Log the exception details return None finally: conn.close() # Create a dummy database db_path = 'test_logging.duckdb' execute_query(db_path, "SELECT * FROM t") # Intentionally cause an error (table does not exist) """ This document provides a foundational set of standards for effective state management in DuckDB applications. By adhering to these guidelines, developers can create robust, maintainable, and performant solutions. Remember to continually review and adapt these standards as DuckDB evolves and new best practices emerge.
# Code Style and Conventions Standards for DuckDB This document outlines the coding style and conventions for the DuckDB project. Adhering to these guidelines is crucial for maintaining code readability, consistency, and ultimately, the long-term maintainability and performance of DuckDB. This applies equally to human contributors and AI coding assistants. ## 1. General Principles ### 1.1. Consistency * **Do This:** Maintain consistency in style within a single file, module, and throughout the entire codebase. If a file uses a particular naming convention or formatting style, new code introduced to that file should follow that convention. * **Don't Do This:** Introduce inconsistent styling within the same file or module. Avoid abrupt changes of style unless there's a compelling reason and broad agreement within the development team. **Why:** Consistency reduces cognitive load, making code easier to read, understand, and debug. It also simplifies the process of integrating contributions from multiple developers. ### 1.2. Readability * **Do This:** Write code that is easy to understand at a glance. Use meaningful names, clear comments, and appropriate indentation to enhance readability. * **Don't Do This:** Write overly condensed or terse code that sacrifices clarity for brevity. Avoid magic numbers or deeply nested conditional statements without clear explanations. **Why:** Readability is paramount. The code is read far more often than it is written. Optimizing for readability simplifies debugging, maintenance, and future modifications. ### 1.3. Simplicity * **Do This:** Favor simple, straightforward solutions over complex, convoluted ones. Break down large functions into smaller, more manageable pieces. * **Don't Do This:** Prematurely optimize code or introduce unnecessary complexity. Avoid over-engineering solutions before understanding the actual problem. **Why:** Simpler code is less prone to errors, easier to test, and easier to understand. It reduces the surface area for bugs and security vulnerabilities. ### 1.4. Following Existing Conventions * **Do This:** Familiarize yourself with the existing code and style. Emulate the conventions already in use. * **Don't Do This:** Impose your personal coding style if it significantly deviates from the established conventions. **Why:** Respecting the existing code conventions keeps the overall codebase uniform and easy to navigate for all developers. ## 2. Formatting ### 2.1. Indentation * **Do This:** Use 4 spaces for indentation. Do not use tabs. Configure your editor to automatically convert tabs to spaces. """c++ // Correct if (condition) { statements; } // Incorrect if (condition) { statements; // Using tab for indentation } """ **Why:** Spaces ensure consistent indentation across different editors and platforms, while 4 spaces provide sufficient visual separation without excessive horizontal space. ### 2.2. Line Length * **Do This:** Limit lines to a maximum of 120 characters (including indentation). Break long lines into multiple shorter lines. """c++ // Correct auto result = very_long_function_name(argument1, argument2, argument3, argument4, argument5, argument6); // Incorrect auto result = very_long_function_name(argument1, argument2, argument3, argument4, argument5, argument6); // Line exceeds 120 characters """ **Why:** Limiting line length improves readability, especially on smaller screens or when viewing code side-by-side in diff tools. Also, it may prevent issues with automated code review tools. ### 2.3. 
Whitespace * **Do This:** * Use a single space after commas, colons, and semicolons. * Use spaces around operators (e.g., "=", "+", "-", "*", "/", "<", ">"). * Add a blank line between logical blocks of code within a function. * Put a blank line between functions and classes. * No space after "(" or before ")" in function calls. """c++ // Correct int x = (a + b) * c; for (int i = 0; i < n; i++) { process(data[i], param1, param2); } // Incorrect int x= (a+b)*c; // Missing spaces around operators and after comma. for(int i=0;i<n;i++){ // Missing spaces. process(data[i],param1, param2); // Missing spaces after comma. } """ **Why:** Whitespace improves readability by visually separating different elements of the code. ### 2.4. Braces * **Do This:** * Place opening braces "{" on the same line as the statement (e.g., "if", "for", "while", function definitions). * Closing braces "}" should be on their own line. """c++ // Correct if (condition) { // ... } else { // ... } // Incorrect if (condition) { // ... } else { // ... } """ * For one-line statements inside if/else/for/while, braces are optional, *but consistency is preferred*. If one branch of an "if" statement uses braces, the other branch *should* use braces as well. """c++ // Correct (Consistent) if (condition) { do_something(); } else { do_something_else(); } // Also Correct (Both one-liners, but using braces) if (condition) { do_something(); } else { do_something_else(); } // Correct (Both one-liners, no braces) if (condition) do_something(); else do_something_else(); // Incorrect (Inconsistent) if (condition) { do_something(); } else do_something_else(); """ **Why:** Consistent brace placement improves readability and reduces the risk of errors related to scope. ### 2.5. Vertical Alignment (Use Sparingly) * **Do This:** Consider using vertical alignment to enhance readability, especially for complex initializations or assignments, but avoid overusing it. """c++ // Correct enum class DataType { INTEGER = 1, FLOAT = 2, VARCHAR = 3, }; // Less Readable enum class DataType { INTEGER = 1, FLOAT = 2, VARCHAR = 3, }; """ **Why:** Vertical alignment can visually group related elements, making the code easier to scan and understand. However, excessive or inconsistent vertical alignment can make the code harder to maintain. ## 3. Naming Conventions ### 3.1. General Naming * **Do This:** Use descriptive and meaningful names that clearly indicate the purpose of the variable, function, or class. * **Don't Do This:** Use single-letter variable names or acronyms unless they are widely understood within the context (e.g., "i" for loop counters). Also, avoid extremely long names; instead, strive to be succinct. **Why:** Meaningful names significantly improve code readability and reduce the need for comments. ### 3.2. Variables * **Do This:** Use "snake_case" for variable names (e.g., "row_count", "data_buffer"). Variables should be named following what they represent; often a noun. """c++ // Correct int row_count = 100; std::string file_path = "/path/to/file"; // Incorrect int r = 100; // Unclear meaning std::string FilePath = "/path/to/file"; // Incorrect case """ **Why:** "snake_case" is widely used in C++ and improves readability. ### 3.3. Constants * **Do This:** Use "UPPER_SNAKE_CASE" for constant names (e.g., "MAX_ROWS", "DEFAULT_BUFFER_SIZE"). 
"""c++ // Correct const int MAX_ROWS = 1000; const std::string DEFAULT_FILE_PATH = "/default/path"; // Incorrect const int maxRows = 1000; // Incorrect case """ **Why:** Using a distinct naming convention for constants makes it easy to identify them in the code. ### 3.4. Functions * **Do This:** Use "snake_case" for function names (e.g., "calculate_average", "process_data"). Function names should be verbs with clear meaning. """c++ // Correct int calculate_average(const std::vector<int>& data); void process_data(const std::string& file_path); // Incorrect int CalculateAverage(const std::vector<int>& data); // Incorrect case void DataProcesser(const std::string& file_path); // Confusing name """ **Why:** "snake_case" promotes readability. Verb-based names clearly convey the action performed by the function. ### 3.5. Classes and Structs * **Do This:** Use "PascalCase" for class and struct names (e.g., "QueryResult", "DataBlock"). """c++ // Correct class QueryResult { // ... }; struct DataBlock { // ... }; // Incorrect class query_result { // Incorrect case // ... }; """ **Why:** "PascalCase" is a commonly used convention for class and struct names. ### 3.6. Member Variables * **Do This:** Prefix member variables with "m_" (e.g., "m_row_count", "m_data_buffer"). """c++ class MyClass { private: int m_row_count; std::string m_data_buffer; }; """ **Why:** The "m_" prefix clearly distinguishes member variables from local variables, enhancing code readability within the class. ### 3.7. Template Parameters * **Do This:** Use single uppercase letters or descriptive names for template parameters (e.g., "T", "KeyType", "ValueType"). """c++ // Correct template <typename T> T square(T value); template <typename KeyType, typename ValueType> class MyMap { // ... }; // Incorrect template <typename t> // Lowercase letter for template T square(T value); """ **Why:** Clarity over verbosity. Single letters are good for simple cases, descriptive names helpful for complex ones. ## 4. Comments ### 4.1. General Guidelines * **Do This:** Write comments to explain complex logic, algorithms, or design decisions. Comments should explain *why* the code is doing something, not *what* it is doing (the code itself explains *what*). * **Don't Do This:** Write obvious comments that simply restate what the code already says. Avoid using comments as a substitute for clear and readable code. Be wary of redundant comment blocks. **Why:** Effective comments provide valuable context and help other developers understand the intent behind the code. Bad comments clutter the code and become outdated. ### 4.2. Function Comments (Doxygen Style) * **Do This:** Use Doxygen-style comments to document functions, classes, and structs. Include a brief description, parameter descriptions, and return value descriptions. """c++ /** * @brief Calculates the average of a vector of integers. * * @param data The input vector of integers. * @return The average of the input data, or 0 if the vector is empty. */ int calculate_average(const std::vector<int>& data) { // ... } """ **Why:** Doxygen comments can be automatically processed to generate API documentation. ### 4.3. Inline Comments * **Do This:** Use inline comments ("//") to explain specific lines or sections of code that are not immediately obvious. """c++ // Calculate the offset based on the page size size_t offset = page_number * page_size; """ **Why:** Inline comments can help clarify tricky logic. They bridge the gap between intent and implementation. ### 4.4. 
TODO Comments * **Do This:** Use "TODO" comments to mark sections of code that need further attention or improvement (e.g., "// TODO: Handle edge cases"). * **Don't Do This:** Leave "TODO" comments in the codebase indefinitely. Ensure that all "TODO" comments are addressed before merging code. **Why:** "TODO" comments serve as reminders for future work. ## 5. C++ Specifics ### 5.1. Smart Pointers * **Do This:** Prefer smart pointers ("std::unique_ptr", "std::shared_ptr") over raw pointers to manage memory automatically and prevent memory leaks. Use "std::make_unique" and "std::make_shared" for exception safety and efficiency. """c++ // Correct std::unique_ptr<MyObject> obj = std::make_unique<MyObject>(); // Incorrect MyObject* obj = new MyObject(); // Raw pointer, potential memory leak """ **Why:** Smart pointers automate memory management, reducing the risk of memory leaks and dangling pointers. Using "make_unique" and "make_shared" provides exception safety, particularly when constructing function arguments. ### 5.2. RAII (Resource Acquisition Is Initialization) * **Do This:** Utilize RAII to manage resources (e.g., file handles, mutexes) by associating resource ownership with an object. When the object goes out of scope, the resource is automatically released. """c++ class FileHandler { public: FileHandler(const std::string& filename) : m_file(fopen(filename.c_str(), "r")) { if (!m_file) { throw std::runtime_error("Failed to open file"); } } ~FileHandler() { if (m_file) { fclose(m_file); } } private: FILE* m_file; }; // Usage { FileHandler handler("my_file.txt"); // ... use the file } // File is automatically closed when handler goes out of scope """ **Why:** RAII ensures that resources are always released, even in the presence of exceptions. This is crucial for preventing resource leaks and ensuring program correctness. ### 5.3. Const Correctness * **Do This:** Use the "const" keyword whenever possible to indicate that a variable, argument, or function does not modify the underlying data. """c++ // Correct int get_value() const { return m_value; // This function does not modify the class state. } void process_data(const std::vector<int>& data); // Data is not modified // Incorrect int get_value() { // Missing const qualifier return m_value; } """ **Why:** "const" correctness helps the compiler catch errors related to unintended modifications. It also improves code readability by clearly indicating which parts of the code are read-only. ### 5.4. Exceptions * **Do This:** Use exceptions to signal exceptional conditions or errors that cannot be handled locally. Avoid using exceptions for normal control flow. * **Don't Do This:** Ignore exceptions or catch them without proper handling. This can lead to undefined behavior or data corruption. **Why:** Exceptions provide a robust mechanism for error handling. They ensure that errors are propagated to a level where they can be properly addressed. ### 5.5. Avoidance of C-style casts * **Do This:** Prefer C++-style casts ("static_cast", "dynamic_cast", "reinterpret_cast", "const_cast") over C-style casts. """c++ // Correct double value = static_cast<double>(integer_value); // Incorrect double value = (double)integer_value; """ **Why:** C++-style casts provide better type safety and allow the compiler to perform more thorough checks. ### 5.6. Namespaces * **Do This:** Enclose DuckDB-specific code within the "duckdb" namespace. Create sub-namespaces for better organization when components grow in size. 
Use anonymous namespaces or "static" for file-local symbols to avoid naming conflicts. """c++ namespace duckdb { // Code related to DuckDB's core functionality namespace catalog { // Code relating to the DuckDB catalog } } // namespace duckdb """ **Why:** Namespaces prevent naming collisions, especially within large projects, and provide scope and organization. ### 5.7. Modern C++ Features (C++17 and later) * **Do This:** Utilize modern C++ features like structured bindings, "std::optional", "std::variant", and range-based for loops to write more concise and expressive code. Favor "auto" for type deduction where it improves readability. """c++ // Structured binding std::pair<int, std::string> get_data() { return {1, "hello"}; } auto [id, message] = get_data(); // Optional std::optional<int> maybe_value = get_value_if_present(); if (maybe_value) { std::cout << "Value: " << *maybe_value << std::endl; } // Range-based for loop (if you don't need the index itself in the loop) std::vector<int> values = {1, 2, 3, 4, 5}; for (int value : values) { std::cout << value << std::endl; } // For loop with index and the value at the index for (size_t i = 0; i < size(values); i++) { std::cout << values[i] << std::endl; } """ **Why:** Modern C++ features improve code safety, readability, and maintainability. "auto" makes the code more flexible and reduces the risk of type-related errors. ## 6. DuckDB Specifics ### 6.1. Data Structures * **Do This:** Utilize DuckDB's internal data structures (e.g., "Vector", "DataChunk", "SelectionVector") for efficient data processing. Understand the semantics and performance characteristics of these structures. Prefer modern mechanisms like "ArenaAllocator" to manage memory lifetimes. * **Don't Do This:** Reinvent the wheel by creating custom data structures that duplicate existing functionality. Try to stay DRY (Don't Repeat Yourself.) **Why:** Using DuckDB's internal data structures ensures compatibility with the query execution engine and optimizes performance. ### 6.2. Expression Evaluation * **Do This:** Follow the expression evaluation framework when implementing new functions or operators. Consider using DuckDB's vectorized execution model for better performance. * **Don't Do This:** Manually iterate over data when you can leverage DuckDB's vectorized execution capabilities. **Why:** The expression evaluation framework provides a consistent and efficient way to process data. ### 6.3. File Formats * **Do This:** Adhere to the specified file format conventions when implementing new file format readers or writers. * **Don't Do This:** Introduce custom file formats without proper documentation and integration with the DuckDB ecosystem. **Why:** Consistent file format handling enhances interoperability and simplifies data exchange. ### 6.4. Extension API * **Do This:** Follow the guidelines of the extension API when creating custom functions or extensions. Ensure that extensions are properly documented and tested. * **Don't Do This:** Modify DuckDB's core code directly unless you are a core developer. **Why:** The extension API allows you to extend DuckDB's functionality without modifying the core code, maintaining modularity. ## 7. Testing ### 7.1. Unit Tests * **Do This:** Write unit tests for all new code and bug fixes. Ensure that tests cover all relevant scenarios and edge cases. Leverage DuckDB's testing framework. * **Don't Do This:** Skip writing tests or write incomplete tests. 
**Why:** Unit tests verify the correctness of individual components and prevent regressions. ### 7.2. Integration Tests * **Do This:** Write integration tests to verify the interaction between different components. * **Don't Do This:** Assume that individual components will work correctly together without proper integration testing. **Why:** Integration tests ensure that different parts of the system work together as expected. ### 7.3. Performance Benchmarks * **Do This:** Benchmark new code and compare its performance against existing implementations. Use DuckDB's benchmarking tools. * **Don't Do This:** Introduce performance regressions without proper justification. **Why:** Performance benchmarks help identify and prevent performance bottlenecks. ## 8. Code Review ### 8.1. Review Process * **Do This:** Submit all code changes for review by other developers. Provide clear and concise descriptions of the changes. Actively participate in code reviews conducted on your code. * **Don't Do This:** Merge code without review or ignore feedback from reviewers. **Why:** Code review helps identify potential problems and ensure that the code adheres to the coding standards. ### 8.2. Review Focus * **Do This:** Focus on code correctness, readability, performance, and security during code reviews. Check for potential bugs, memory leaks, and security vulnerabilities. * **Don't Do This:** Focus solely on superficial aspects of the code (e.g., whitespace). **Why:** Code review improves the overall quality of the codebase and prevents defects from being introduced. By adhering to these coding standards, we can ensure the long-term maintainability, performance, and security of DuckDB. This document should be considered a living document and will be updated as needed to reflect current best practices and the evolving landscape of DuckDB. Remember: consistently applied standards enhance collaboration and expedite development, whether you are human or machine.