10 Crucial Lessons from Cloudflare's ClickHouse Slowdown: How a Hidden Bottleneck Almost Delayed Millions in Billing
1. The High Stakes of ClickHouse in Cloudflare's Operations
ClickHouse is an open-source OLAP database that powers critical financial systems at Cloudflare. Every day, millions of analytical queries run to calculate usage-based billing for customers. These jobs must finish within strict windows; otherwise, invoices become inconsistent and hard to audit. Beyond billing, ClickHouse supports fraud detection and other revenue-sensitive processes. Any slowdown has immediate downstream impacts, making performance reliability a top priority. Understanding this context explains why even a few seconds of delay can trigger widespread concern.

2. The Migration That Triggered the Slowdown
During a routine cluster migration, daily aggregation jobs — responsible for ensuring bills go out — slowed dramatically. The team initially checked standard performance metrics: I/O rates, memory usage, rows scanned, and parts read. All were within normal ranges. This ruled out common issues like insufficient hardware or data skew. The mystery deepened as delays persisted for hours, threatening the billing schedule. It became clear that the bottleneck was not visible in typical monitoring dashboards.
3. The Petabyte-Scale Analytics Platform: Ready-Analytics
To simplify onboarding, Cloudflare built Ready-Analytics in early 2022. Instead of creating new tables per team, teams stream data into one massive shared table (over 2 PiB by December 2024). Datasets are separated by a namespace field, and all records share a standard schema: 20 floats, 20 strings, a timestamp, and an indexID. The primary key is (namespace, indexID, timestamp). This design optimized sorting per namespace but also created long-term challenges.
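The shared layout described above can be sketched in a few lines. This is an illustrative model, not Cloudflare's actual code; the field names other than namespace, indexID, and timestamp are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime

# A minimal sketch of the shared Ready-Analytics record: 20 generic float
# columns, 20 generic string columns, a timestamp, and an indexID, with
# tenants separated by a namespace field.
@dataclass
class Record:
    namespace: str          # separates each team's dataset
    index_id: str           # per-namespace sort column ("indexID")
    timestamp: datetime
    floats: list            # the 20 generic float columns
    strings: list           # the 20 generic string columns

    def sort_key(self):
        # Mirrors the table's primary key: (namespace, indexID, timestamp)
        return (self.namespace, self.index_id, self.timestamp)

# Rows from different namespaces share one table but cluster together on
# disk, because namespace is the leading column of the sort key.
rows = [
    Record("waf", "rule=1001", datetime(2024, 12, 1), [], []),
    Record("dns", "zone=a.example", datetime(2024, 12, 2), [], []),
    Record("dns", "zone=a.example", datetime(2024, 12, 1), [], []),
]
rows.sort(key=Record.sort_key)
```

After the sort, all "dns" rows precede all "waf" rows, ordered by indexID and then timestamp, which is exactly the clustering the primary key gives each namespace.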
4. How Sorting Affects Query Performance in ClickHouse
In ClickHouse, data is physically sorted on disk according to the primary key. The indexID (a string) plays a crucial role, allowing each namespace to have its own optimal sort order. Queries that match the sort order are extremely fast because ClickHouse can skip large data chunks. However, any mismatch forces full scans. The Ready-Analytics system relied on this design to serve hundreds of applications efficiently. Yet this strength became a weakness when retention policies changed.
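The skip-versus-scan behavior can be modeled with a toy sparse index. This sketch assumes one index entry per "granule" of rows (roughly how ClickHouse's sparse primary index works); the namespaces and keys are invented for illustration.

```python
import bisect

# First (namespace, indexID) key of each granule, in primary-key order.
granule_keys = [
    ("dns", "zone=a"), ("dns", "zone=m"), ("http", "host=b"),
    ("http", "host=q"), ("waf", "rule=1"),
]

def granules_to_read(namespace):
    """Return the half-open granule range that may contain this namespace.

    A predicate on the leading sort-key column can binary-search the
    sparse index and skip every other granule; a predicate on an
    unsorted column would have to scan all of them.
    """
    lo = bisect.bisect_left(granule_keys, (namespace, ""))
    hi = bisect.bisect_right(granule_keys, (namespace + "\x00",))
    # Include the preceding granule: its rows may spill into this namespace.
    return max(lo - 1, 0), hi

lo, hi = granules_to_read("http")
# Only granules lo..hi-1 are read; the rest are skipped entirely.
```

A query filtering on a column that is not a sort-key prefix gets no such range and must read every granule, which is the "full scan" failure mode described above.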
5. The One-Size-Fits-All Retention Policy
Cloudflare uses partition-based retention (daily partitions) because the platform predates ClickHouse's native TTL features. The Ready-Analytics table kept data for exactly 31 days by dropping older partitions. This worked for some teams but blocked others: teams needing years of storage (legal/compliance) or just a few days had to avoid Ready-Analytics entirely. The lack of per-namespace flexibility was a major limitation, forcing many to use more complex conventional setups.
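Partition-based retention is easy to sketch. This assumes daily partitions named YYYYMMDD and the fixed 31-day window described above; dropping a whole partition is a cheap metadata operation, but it necessarily applies to every namespace in the table at once.

```python
from datetime import date, timedelta

RETENTION_DAYS = 31  # the one-size-fits-all window

def partitions_to_drop(existing, today):
    """Return daily partition names older than the retention window."""
    cutoff = (today - timedelta(days=RETENTION_DAYS)).strftime("%Y%m%d")
    return sorted(p for p in existing if p < cutoff)

parts = ["20241101", "20241115", "20241201", "20241210"]
drop = partitions_to_drop(parts, date(2024, 12, 10))
# cutoff is 20241109, so only "20241101" is eligible to drop
```

Because the drop unit is the whole partition, a single table cannot keep one namespace for years while expiring another after days, which is exactly the limitation the next section addresses.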
6. Why Per-Namespace Retention Was Needed
Different teams at Cloudflare have vastly different data retention requirements. Some require multi-year storage for audit trails; others only need ephemeral logs. The rigid 31-day policy forced teams to choose between simplicity (Ready-Analytics) and meeting obligations. This led to a backlog of requests for a custom retention solution. The engineering team understood that enabling per-namespace retention would unlock Ready-Analytics for many more use cases and reduce onboarding friction.
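A per-namespace policy changes the retention logic qualitatively. The sketch below is hypothetical (the namespace names and day counts are invented): a partition can only be dropped wholesale once the longest-lived namespace in it has expired, and shorter-lived namespaces need row-level deletes inside still-live partitions.

```python
from datetime import date, timedelta

# Hypothetical per-namespace retention windows, in days.
retention = {"audit": 365, "dns": 31, "debug": 3}

def actions_for_partition(part_day, today):
    """Classify what retention should do with one daily partition."""
    age = (today - part_day).days
    if age > max(retention.values()):
        # Every namespace has expired: cheap whole-partition drop.
        return "drop whole partition"
    expired = sorted(ns for ns, days in retention.items() if age > days)
    # Some namespaces expired, others not: row-level deletes are needed.
    return f"delete rows for: {', '.join(expired)}" if expired else "keep"

today = date(2024, 12, 10)
actions_for_partition(date(2024, 12, 9), today)   # age 1: keep
actions_for_partition(date(2024, 11, 1), today)   # age 39: delete some rows
```

Those row-level deletes inside live partitions are what makes per-namespace retention more expensive than the original drop-a-partition scheme, and they set the stage for the merge-time interaction described next.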
7. The Hidden Bottleneck in ClickHouse Internals
After the migration slowed jobs, engineers dug deeper. They discovered the bottleneck was not in typical metrics but buried deep inside ClickHouse's internal sorting and merging logic. When multiple namespaces with different retention periods coexist in the same table, the primary key sorting interacts poorly with partition-level deletions. The result was excessive CPU overhead during merge operations, not I/O wait. This subtle issue required a forensic analysis of ClickHouse's source code.
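Why a merge can be CPU-bound rather than I/O-bound is easy to demonstrate. This is an illustration of the general mechanism, not Cloudflare's patched code: a background merge compares full (namespace, indexID, timestamp) keys, and when many rows share long identical string prefixes, every comparison walks those strings before the timestamp breaks the tie.

```python
import heapq

class CountingKey:
    """Wraps a sort key and counts how often it is compared."""
    comparisons = 0

    def __init__(self, key):
        self.key = key

    def __lt__(self, other):
        CountingKey.comparisons += 1
        return self.key < other.key

def merge_parts(*parts):
    """K-way merge of sorted parts, as a background merge would do."""
    wrapped = ([CountingKey(k) for k in p] for p in parts)
    return [ck.key for ck in heapq.merge(*wrapped)]

# Two sorted parts whose rows differ only in the trailing timestamp.
part_a = [("dns", "zone=a.example", t) for t in range(3)]
part_b = [("dns", "zone=a.example", t) for t in range(3)]
merged = merge_parts(part_a, part_b)
# Each comparison first scans the identical namespace and indexID strings,
# so the merge burns CPU on comparisons even though I/O is trivial.
```

Profiling comparison counts and costs like this, rather than bytes read, is the kind of signal that ordinary I/O dashboards never surface.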

8. Debugging the Unseen: Tools and Techniques
To locate the bottleneck, the team used advanced profiling: perf traces, ClickHouse's query profiler, and custom instrumentation. They compared query execution plans before and after the migration. The key clue came from examining the number of parts being merged and the sorting key's influence on merge granularity. By isolating the namespace with the longest retention, they reproduced the slowdown in a test environment. This systematic approach confirmed the root cause.
9. The Three Patches That Fixed It
The team wrote three patches to address the hidden bottleneck. The first optimized merge behavior when partitions have heterogeneous retention periods. The second improved the sort key selection to avoid redundant comparisons across namespaces. The third added a new internal metric to surface such bottlenecks proactively. These patches were contributed back to the ClickHouse open-source project. After deployment, aggregation jobs returned to normal speeds, and billing resumed on schedule.
10. Lessons for Scaling ClickHouse in Multi-Tenant Environments
Cloudflare's experience offers valuable lessons: design primary keys with cross-tenant workloads in mind; monitor merge CPU time, not just I/O; and treat flexible retention policies as critical for adoption. Most importantly, when all standard metrics look healthy but performance is poor, look deeper into the database engine's internals. The three patches not only fixed the immediate issue but also improved ClickHouse's robustness for all users. This story underscores the importance of investing in deep database knowledge even when using mature technology.
Conclusion: The hidden bottleneck in Cloudflare's ClickHouse pipeline was a stark reminder that performance issues can lurk where you least expect them. By carefully dissecting internals and writing targeted patches, the team restored billing reliability and improved the platform for everyone. For organizations running large-scale analytics, these ten lessons provide a roadmap for diagnosing similar problems and building more resilient systems.