# Tooling and Ecosystem Standards for DuckDB

This document outlines the coding standards and best practices related to **Tooling and Ecosystem** for DuckDB development. It provides specific guidance on leveraging recommended libraries, tools, and extensions to enhance development workflows, maintainability, performance, and security of DuckDB-based projects. These standards are designed to work in concert with AI coding assistants like GitHub Copilot and Cursor, providing them with the necessary context to generate high-quality, DuckDB-idiomatic code.

## 1. Development Environment and Tooling

### 1.1. Recommended IDEs and Editors

**Do This:**

* Use IDEs and editors with strong DuckDB support (e.g., VS Code with the DuckDB extension, JetBrains DataGrip, DBeaver). These provide syntax highlighting, code completion, and integration with DuckDB's CLI.

* Configure your editor with a DuckDB language server if available.

* Utilize linters and formatters specific to the host language (e.g., "flake8" and "black" for Python) to maintain code consistency. Apply formatting before committing code.

**Don't Do This:**

* Rely solely on basic text editors lacking DuckDB-specific features.

* Skip configuring linters and formatters, leading to inconsistent code style.

**Why This Matters:** Enhanced tooling improves developer productivity, reduces errors, and ensures code maintainability.

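**Example:**

```jsonc
// VS Code settings.json for Python and DuckDB
// Note: recent versions of the Python extension replace the "python.linting.*"
// and "python.formatting.*" settings with dedicated Flake8 and Black Formatter extensions.
{
  "python.linting.flake8Enabled": true,
  "python.formatting.provider": "black",
  "[python]": {
    "editor.formatOnSave": true,
    "editor.codeActionsOnSave": {
      "source.organizeImports": true
    }
  },
  "files.autoSave": "afterDelay",
  "files.autoSaveDelay": 500
}
```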

### 1.2. Version Control (Git)

**Do This:**

* Use Git for version control. Commit frequently with descriptive commit messages.

* Create branches for new features or bug fixes. Follow Gitflow or a similar branching strategy.

* Utilize pull requests for code review.

* Store DuckDB-related scripts, configurations, and any required data (if suitable for the repository) in the Git repository. Exclude generated DuckDB database files (".duckdb" files) from version control using ".gitignore".

**Don't Do This:**

* Commit directly to the main branch without code review.

* Store sensitive data (e.g., API keys, passwords) directly in the repository. Utilize environment variables and secure configuration management instead.

* Track large data files or binary artifacts in Git. Consider using a separate data storage solution and version control for code only.

**Why This Matters:** Version control is essential for collaboration, code management, and rollback capabilities.

**Example:**

```gitignore
# DuckDB database files
*.duckdb
*.wal

# Temporary files
tmp/
*.tmp

# venv
venv/
```

### 1.3. Build Systems (CMake)

**Do This:**

* For C/C++ projects which interface with DuckDB, use CMake to manage the build process. CMake simplifies cross-platform compilation and dependency management.

* Properly link against the DuckDB library in your "CMakeLists.txt" file.

* Use CMake's "find_package" command to locate DuckDB.

**Don't Do This:**

* Manually manage compilation flags and linker options.

* Hardcode paths to the DuckDB library.

**Why This Matters:** CMake ensures portability and reproducible builds.

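**Example:**

```cmake
# CMakeLists.txt
cmake_minimum_required(VERSION 3.15)
project(MyDuckDBProject)

find_package(DuckDB REQUIRED)

add_executable(my_duckdb_app main.cpp)

# Link against DuckDB. The exported target name depends on how DuckDB was
# packaged; the CMake config installed by DuckDB itself typically exports
# "duckdb" (shared) and "duckdb_static" rather than a namespaced target.
target_link_libraries(my_duckdb_app PRIVATE duckdb)
```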

## 2. DuckDB Extensions and Libraries

### 2.1. Utilizing DuckDB Extensions

**Do This:**

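* Enable and use relevant DuckDB extensions to enhance functionality. Popular extensions include "httpfs" (for accessing data over HTTP/S and S3), "parquet" (for reading and writing Parquet files), "json" (for working with JSON data), and "excel" (for Excel-style formatting and, in recent versions, reading ".xlsx" files). Some of these, such as "parquet" and "json", ship built in with most DuckDB distributions.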

* Install extensions using the "INSTALL" statement: "INSTALL httpfs;".

* Load extensions using the "LOAD" statement: "LOAD httpfs;". Load extensions within your DuckDB scripts or application code. Check if an extension is already loaded before attempting to load it again.

**Don't Do This:**

* Forget to install and load extensions before using their functionality.

* Load extensions unnecessarily, as this can increase startup time.

**Why This Matters:** Extensions extend DuckDB's capabilities and improve data integration.

**Example:**

```sql
-- Install and load the httpfs extension for accessing data over HTTP
INSTALL httpfs;
LOAD httpfs;

-- Query a CSV file directly from a URL
SELECT * FROM read_csv_auto('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv');
```

### 2.2. Python Integration (DuckDB's "duckdb" Library)

**Do This:**

* Use the "duckdb" Python library for seamless integration with Python applications.

* Utilize parameterized queries to prevent SQL injection vulnerabilities: "con.execute('SELECT * FROM my_table WHERE id = ?', (user_input,))".

* Use the "df" method to convert DuckDB result sets to Pandas DataFrames for data analysis: "df = con.execute('SELECT * FROM my_table').df()".

* Use DuckDB's ability to directly query Pandas DataFrames that are in scope: "con.execute('SELECT * FROM df').df()".

**Don't Do This:**

* Use string formatting to construct SQL queries, as this is prone to SQL injection.

* Ignore error handling when interacting with the DuckDB library.

* Fail to close connections and cursors properly. Use context managers ("with") to ensure resources are released.

**Why This Matters:** The Python library simplifies data interaction and enables complex data workflows.

**Example:**

```python
import duckdb
import pandas as pd

# Connect to an in-memory DuckDB database
con = duckdb.connect(':memory:')

# Create a Pandas DataFrame
data = {'id': [1, 2, 3], 'value': ['a', 'b', 'c']}
df = pd.DataFrame(data)

# Register the DataFrame with DuckDB
con.register('my_dataframe', df)

# Query the DataFrame using SQL
result = con.execute('SELECT * FROM my_dataframe WHERE id > 1').fetchdf()
print(result)

# Create a table from the DataFrame
con.execute("CREATE TABLE my_table AS SELECT * FROM my_dataframe")

# Execute a parameterized query
user_input = 2
result = con.execute("SELECT * FROM my_table WHERE id = ?", (user_input,)).fetchdf()
print(result)

# Close the connection
con.close()
```

### 2.3. R Integration (DuckDB's "duckdb" Package)

**Do This:**

* Use the "duckdb" R package for interacting with DuckDB from R.

* Leverage "dbConnect()", "dbExecute()", "dbFetch()" and other functions provided by the package.

* Use "dplyr" verbs within DuckDB using "dplyr::tbl()" to transparently push down operations.

**Don't Do This:**

* Manually construct SQL queries when R functions can achieve the same result.

* Forget to disconnect from the database after use.

**Why This Matters:** The R package enables efficient data analysis workflows within the R ecosystem.

**Example:**

```R
library(duckdb)
library(dplyr)

# Connect to an in-memory DuckDB database
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:", read_only = FALSE)

# Create a data frame
df <- data.frame(id = 1:3, value = c("a", "b", "c"))

# Write the data frame to a DuckDB table
dbWriteTable(con, "my_table", df)

# Query the table using dplyr
result <- tbl(con, "my_table") %>%
  filter(id > 1) %>%
  collect()
print(result)

# Execute a SQL query
result <- dbGetQuery(con, "SELECT * FROM my_table WHERE id = 1")
print(result)

# Disconnect from the database
dbDisconnect(con, shutdown = TRUE)
```

### 2.4. Arrow Integration

**Do This:**

* Use Arrow for efficient data exchange between DuckDB and other systems. DuckDB has native support for Arrow.

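* Convert DuckDB result sets to Arrow tables using ".arrow()" in Python, and using "duckdb::duckdb_fetch_arrow()" or "arrow::to_arrow()" in R.

* Pass Arrow tables directly to DuckDB for querying.

* Note that the Python and R clients exchange Arrow data without any extension. Install the "arrow" extension only if you need the additional functionality it provides: "INSTALL arrow; LOAD arrow;".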

**Don't Do This:**

* Rely on inefficient data serialization methods when Arrow provides a faster alternative.

* Ignore potential data type incompatibilities between DuckDB and other systems.

**Why This Matters:** Arrow enables zero-copy (or near zero-copy) data transfer, which can significantly boost performance.

**Example (Python):**

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to DuckDB
con = duckdb.connect(':memory:')

# Create a sample Arrow table
table = pa.Table.from_pydict({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})

# Register the Arrow table with DuckDB
con.register('my_arrow_table', table)

# Query the Arrow table using SQL
result = con.execute('SELECT * FROM my_arrow_table WHERE id > 1').df()
print(result)

# Write the Arrow table to a Parquet file
pq.write_table(table, 'test.parquet')

# Load the Parquet file into a DuckDB table
con.execute("CREATE TABLE parquet_table AS SELECT * FROM 'test.parquet'")
result = con.execute('SELECT * FROM parquet_table WHERE id > 1').df()
print(result)

con.close()
```

### 2.5. Data Visualization Tools

**Do This:**

* Connect data visualization tools (e.g., Tableau, Power BI, Metabase) to DuckDB as a data source.

* Prefer direct connections to DuckDB using the appropriate drivers.

* Leverage the ability to rapidly prototype dashboards locally using DuckDB and subsequently deploy them using larger data warehouses.

**Don't Do This:**

* Rely on exporting data to files for visualization when direct connections are possible.

* Overload DuckDB with very complex queries for visualization purposes. Consider pre-aggregating data if necessary.

**Why This Matters:** Data visualization helps understand trends and insights in your data.

### 2.6. Testing Frameworks

**Do This:**

* Use testing frameworks (e.g., pytest in Python, testthat in R) to write unit and integration tests for your DuckDB-based code.

* Write tests to verify the correctness of SQL queries, data transformations, and data loading processes. Use assertion libraries for expressive testing.

* Utilize DuckDB's in-memory capabilities for isolated testing environments.

**Don't Do This:**

* Skip writing tests, leading to undetected bugs and regressions.

* Use production databases for testing, which can lead to data corruption.

**Why This Matters:** Testing ensures code quality and reliability.

**Example (Python with pytest):**

```python
import duckdb
import pytest


@pytest.fixture(scope="function")
def duckdb_conn():
    # Each test gets a fresh, isolated in-memory database
    conn = duckdb.connect(':memory:')
    conn.execute("CREATE TABLE test_table (id INTEGER, value VARCHAR)")
    conn.execute("INSERT INTO test_table VALUES (1, 'a'), (2, 'b'), (3, 'c')")
    yield conn
    conn.close()


def test_select_all(duckdb_conn):
    # ORDER BY makes the row order deterministic
    result = duckdb_conn.execute("SELECT * FROM test_table ORDER BY id").fetchall()
    assert len(result) == 3
    assert result[0] == (1, 'a')


def test_where_clause(duckdb_conn):
    result = duckdb_conn.execute(
        "SELECT * FROM test_table WHERE id > 1 ORDER BY id"
    ).fetchall()
    assert len(result) == 2
    assert result[0] == (2, 'b')
```

## 3. Performance Optimization Tools

### 3.1. Query Profiling

**Do This:**

* Use DuckDB's built-in query profiler to identify performance bottlenecks in SQL queries.

* Use "EXPLAIN ANALYZE" to run a query and see per-operator timings and row counts, which helps identify slow operations.

* Utilize the "EXPLAIN" statement to understand the query execution plan before running the query.

**Don't Do This:**

* Guess at performance bottlenecks without profiling.

* Ignore the query execution plan when optimizing queries.

**Why This Matters:** Profiling and analyzing query plans provides insights into performance issues.

**Example:**

```sql
-- Enable profiling; the profile is emitted after each subsequent query
PRAGMA enable_profiling;

-- Execute a query
SELECT COUNT(*) FROM lineitem;

-- Disable profiling again
PRAGMA disable_profiling;

-- Inspect the query plan without running the query
EXPLAIN SELECT * FROM lineitem WHERE l_shipdate > '1998-12-01';

-- Run the query and report per-operator timings
EXPLAIN ANALYZE SELECT * FROM lineitem WHERE l_shipdate > '1998-12-01';
```

### 3.2. Indexing

**Do This:**

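* Create indexes on columns used in highly selective filters (for example, point lookups) to speed up data retrieval. Verify that indexes are actually used and are not slowing down write operations.

* Remember that DuckDB automatically maintains min-max indexes (zonemaps) for all columns, and that "CREATE INDEX" builds an Adaptive Radix Tree (ART) index. There is no choice between B-Tree and Hash index types.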

**Don't Do This:**

* Create indexes on every column, as this can slow down write operations.

* Forget to analyze the performance impact of indexes.

**Why This Matters:** Indexes improve query performance by allowing DuckDB to quickly locate data, particularly for selective lookups. Large scans and aggregations usually benefit little from them.

**Example:**

"""sql

-- Create an index on the l_shipdate column

CREATE INDEX idx_shipdate ON lineitem (l_shipdate);

"""

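### 3.3. Data Partitioning

**Do This:**

* Partition large file-based datasets using Hive-style partitioning. DuckDB can write partitioned Parquet or CSV files with "COPY ... (PARTITION_BY ...)" and prunes partitions when reading them with "hive_partitioning = true".

**Don't Do This:**

* Expect native partitioning of DuckDB tables. At the time of writing, partitioning applies to external files, not to tables stored in a DuckDB database file.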

## 4. Security Tools and Practices

### 4.1. Data Encryption

**Do This:**

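* Utilize DuckDB's encryption features to protect sensitive data at rest where they are supported and required (for example, Parquet encryption, and encrypted database files in recent releases). Because DuckDB runs in-process, encryption in transit mainly concerns remote sources, so prefer HTTPS endpoints when using "httpfs".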

* Consult with security experts on choosing appropriate encryption algorithms and key management strategies.

**Don't Do This:**

* Store encryption keys directly in code or configuration files. Use secure key management systems.

**Why This Matters:** Encryption protects data from unauthorized access.

### 4.2. Least Privilege Principle

**Do This:**

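* Grant the application process only the access it needs to the database file and its directory. DuckDB runs in-process and has no user accounts, "GRANT" statements, or roles, so file system permissions are the main access control.

* Open connections in read-only mode when a component only needs to query data.

**Don't Do This:**

* Give every component write access to the database file, which widens the impact of a bug or a compromised process.

**Why This Matters:** Limiting privileges reduces the impact of security breaches. At the time of writing, DuckDB doesn't have significant user permissioning controls.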

### 4.3. Input Validation and Sanitization

**Do This:**

* Validate and sanitize all user inputs to prevent SQL injection attacks. Use parameterized queries and prepared statements so that user-provided data is bound as values and never parsed as SQL.

**Don't Do This:**

* Trust user inputs without validation.

* Use string concatenation to build SQL queries with user inputs.

**Why This Matters:** Input validation prevents malicious code from being executed. Always use parameterized queries.

## 5. Community and Support

### 5.1. Engaging with the DuckDB Community

**Do This:**

* Participate in the DuckDB community forums, mailing lists, and GitHub discussions.

* Contribute to the DuckDB project by submitting bug reports, feature requests, and pull requests.

* Share your knowledge and experience with other DuckDB users.

**Don't Do This:**

* Be afraid to ask questions or seek help from the community.

* Ignore community guidelines and best practices.

**Why This Matters:** Community involvement fosters collaboration and improves the DuckDB ecosystem.

### 5.2. Staying Up-to-Date

**Do This:**

* Follow the DuckDB release notes and documentation to stay informed about new features, bug fixes, and security updates.

* Subscribe to the DuckDB newsletter or RSS feed.

**Don't Do This:**

* Use outdated versions of DuckDB, which may contain known bugs and security vulnerabilities.

**Why This Matters:** Staying up-to-date ensures you benefit from new features, bug fixes, and security patches.

These standards serve as a comprehensive guide to leveraging tooling and ecosystem components when developing with DuckDB. By following these guidelines, developers can build robust, efficient, and secure DuckDB-based applications. Adherence will also ensure AI coding assistants provide more targeted and appropriate suggestions during development.
