
Getting Started with DuckLake 1.0: A SQL-Based Data Lake Format

Published 2026-05-03 22:36:44 · Technology

Overview

DuckLake 1.0 introduces a fresh approach to managing data lake metadata. Instead of scattering metadata across numerous files in object storage, it centralizes table metadata in a SQL database—making updates, deletes, and partition management more efficient. Built as a DuckDB extension, DuckLake integrates with existing workflows and can sit alongside Iceberg tables in the same session. This guide walks you through its setup, core operations, and common pitfalls.

Source: www.infoq.com

Prerequisites

  • DuckDB: Version 1.3.0 or higher (command-line interface or Python binding)—the DuckLake extension targets these releases.
  • Object Storage: A bucket or directory (e.g., S3, MinIO, local filesystem) for storing parquet files.
  • SQL Database: For the catalog—DuckDB itself works for local testing; production uses PostgreSQL or MySQL.
  • DuckLake Extension: Install via INSTALL ducklake; LOAD ducklake;.

Step-by-Step Instructions

1. Install and Load the DuckLake Extension

Open DuckDB and run:

INSTALL ducklake;
LOAD ducklake;

This registers DuckLake’s functions and types. Verify with SELECT extension_name, installed, loaded FROM duckdb_extensions() WHERE extension_name = 'ducklake';

2. Create a DuckLake Catalog

A catalog holds all table metadata. Attach one with the ducklake: prefix, naming the catalog database and a DATA_PATH where the Parquet data files will be written:

ATTACH 'ducklake:/path/to/catalog.ducklake' AS my_catalog
    (DATA_PATH '/path/to/data/');

-- Switch to the catalog
USE my_catalog;

Tip: For a remote catalog database, point the prefix at PostgreSQL instead, e.g. ATTACH 'ducklake:postgres:dbname=mydb host=myhost' AS my_catalog (DATA_PATH 's3://bucket/path/'); (this also requires the postgres extension).

3. Create a DuckLake Table

Define a table with standard CREATE TABLE syntax, then set a partition key:

CREATE TABLE sales (
    order_id INTEGER,
    amount DECIMAL(10,2),
    order_date DATE,
    region VARCHAR
);

ALTER TABLE sales SET PARTITIONED BY (region);

This creates a logical table. Data is stored as Parquet files in your object storage.

4. Insert Data

Insert directly or from a SELECT:

INSERT INTO sales VALUES
    (1, 150.00, '2025-01-15', 'East'),
    (2, 200.50, '2025-01-16', 'West');

DuckLake automatically writes new Parquet files per partition and updates the catalog.
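Every committed change creates a new snapshot in the catalog. As a quick check (a sketch, assuming the catalog was attached as my_catalog), the ducklake_snapshots table function lists them:

SELECT * FROM ducklake_snapshots('my_catalog');

Each insert, delete, or merge should appear as its own snapshot row.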

5. Query the Table

Standard SQL works—DuckLake reads the catalog to locate files:

SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE order_date >= '2025-01-01'
GROUP BY region;

Partition pruning and sorting are applied automatically.
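To confirm that pruning actually kicks in, inspect the query plan (the exact plan output varies by DuckDB version; this is just a sanity check):

EXPLAIN
SELECT SUM(amount)
FROM sales
WHERE region = 'East';

With region as the partition key, the scan should touch only the files written for the East partition.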


6. Manage Partitions and Small Updates

DuckLake supports incremental updates without rewriting whole partitions. Use MERGE or DELETE:

DELETE FROM sales WHERE order_id = 1;

MERGE INTO sales AS target
USING (VALUES (3, 300.00, DATE '2025-01-20', 'East'))
    AS src(order_id, amount, order_date, region)
ON target.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET amount = src.amount
WHEN NOT MATCHED THEN INSERT (order_id, amount, order_date, region)
    VALUES (src.order_id, src.amount, src.order_date, src.region);

The catalog tracks these small changes efficiently.
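Because each change commits a new snapshot, earlier table states remain queryable. A minimal time-travel sketch, assuming at least two snapshots exist (snapshot IDs come from ducklake_snapshots):

-- table as of an earlier snapshot
SELECT * FROM sales AT (VERSION => 1);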

7. Iceberg Compatibility

DuckLake is its own format, but DuckDB's separate iceberg extension lets you read existing Iceberg tables in the same session:

INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('s3://bucket/iceberg_table');

Write support is limited to DuckLake-native tables.

Common Mistakes

  • Forgetting to load the extension: Always run LOAD ducklake; after installation.
  • Wrong catalog connection string: Ensure the path or database URL is correct and accessible.
  • Partition key mismatch: Always populate the partition column when inserting; NULL or missing values land in a NULL partition and weaken pruning.
  • Overwriting small files: DuckLake handles small updates, but avoid streams of tiny inserts—compact periodically (e.g. CALL ducklake_merge_adjacent_files('my_catalog');).
  • Ignoring data layout: Insert data pre-sorted on your common range-filter column (e.g. INSERT INTO sales SELECT ... ORDER BY order_date) so file-level min/max statistics can skip files; otherwise range queries read more data than necessary.

Summary

DuckLake 1.0 simplifies data lake management by storing metadata in SQL, enabling faster updates and smarter partitioning. With its DuckDB extension, you get a lightweight yet powerful alternative to Hive or Iceberg for analytical workloads. Start small, tune your partitions, and enjoy seamless SQL-driven data lakes.