6 Critical Improvements from Cloudflare's 'Code Orange: Fail Small' Project

Published 2026-05-03 01:17:22 · Education & Careers

After two intense quarters of engineering work, Cloudflare has completed its internal project, code-named Code Orange: Fail Small. The goal was to prevent global outages like those on November 18 and December 5, 2025, and build a more resilient, secure, and reliable network for every customer. While resilience is never a final destination, the team has shipped key innovations that directly address those failures. Here are the six most impactful improvements every Cloudflare customer should know about.

1. Health-Mediated Deployment for Configuration Changes

Previously, internal configuration changes reached the network instantly—a risky approach that contributed to past outages. Now, Cloudflare has adopted health-mediated deployment for all configuration changes that affect customer traffic. Changes are rolled out progressively, with real-time health monitoring that can automatically detect anomalies and revert modifications before they impact your traffic. This means safer, more controlled updates across the network, reducing the chance of human error or misconfiguration causing widespread disruption.

Source: blog.cloudflare.com
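
The idea of a health-mediated rollout can be illustrated with a minimal Python sketch. It is not Cloudflare's implementation: the `Fleet` class, the stage fractions, and the `is_healthy` callback are all hypothetical names chosen for illustration. The sketch widens a change in stages and restores the previous state as soon as a health check fails.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical stage sizes: fraction of the fleet running the change after each stage.
DEFAULT_STAGES = (0.01, 0.10, 0.50, 1.00)


@dataclass
class Fleet:
    """Toy stand-in for a network: maps server name to its active config version."""
    servers: Dict[str, str]

    def apply(self, names: List[str], version: str) -> None:
        for name in names:
            self.servers[name] = version


def progressive_rollout(
    fleet: Fleet,
    new_version: str,
    is_healthy: Callable[[Fleet], bool],
    stages: Sequence[float] = DEFAULT_STAGES,
) -> bool:
    """Widen the rollout stage by stage; restore the snapshot if a health check fails."""
    snapshot = dict(fleet.servers)  # state to restore on rollback
    names = sorted(fleet.servers)
    for fraction in stages:
        count = max(1, int(len(names) * fraction))  # always touch at least one server
        fleet.apply(names[:count], new_version)
        if not is_healthy(fleet):
            fleet.servers = dict(snapshot)  # automatic revert
            return False
    return True
```

In this sketch a bad change that only shows symptoms at scale is stopped at the stage where the health check first fails, so the remaining servers never receive it.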

2. Introducing Snapstone: A Unified Progressive Rollout System

To make health-mediated deployment consistent and easy, Cloudflare built Snapstone, an internal system that bundles configuration changes into packages. Snapstone enables gradual rollout, automated health checks, and instant rollback—without requiring each team to build their own solution. It’s flexible: teams can define any unit of configuration that needs health mediation, whether it’s a data file or a control flag. Before Snapstone, applying progressive deployment to config was manual and inconsistent. Now it’s the default, closing a critical gap in Cloudflare’s resilience infrastructure.
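
To make the "package" idea concrete, here is a small sketch of how a bundle of unrelated configuration units (a flag and a data blob) could be applied and reverted as one unit. Snapstone's real interface is not described in this article, so `ConfigPackage`, `apply`, `rollback`, and the key names below are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict

_MISSING = object()  # sentinel: the key did not exist before the package was applied


@dataclass
class ConfigPackage:
    """Hypothetical bundle of config units that ship and revert together."""
    name: str
    changes: Dict[str, Any]  # e.g. a control flag and a data file side by side

    def apply(self, store: Dict[str, Any]) -> Dict[str, Any]:
        """Write every change into the store and return an undo record."""
        undo = {key: store.get(key, _MISSING) for key in self.changes}
        store.update(self.changes)
        return undo


def rollback(store: Dict[str, Any], undo: Dict[str, Any]) -> None:
    """Restore the values captured in the undo record, removing keys that were new."""
    for key, old_value in undo.items():
        if old_value is _MISSING:
            store.pop(key, None)
        else:
            store[key] = old_value
```

Because the undo record is captured at apply time, a rollback does not need to know what kind of configuration the package contained, which is the property that lets one system serve many teams.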

3. Identifying and Securing High-Risk Configuration Pipelines

Not all configuration changes pose the same risk. Cloudflare analyzed its internal pipelines to identify those most likely to cause network-wide issues. For these high-risk pipelines, new tools were built to enforce stricter validation and testing before any change goes live. This proactive approach ensures that dangerous deployments are caught early, even before they reach a health-mediated rollout. By focusing on the most sensitive areas, Cloudflare reduces the number of incidents originating from configuration errors.
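
As a rough illustration of validation that runs before a change ever enters a rollout, the sketch below chains simple checks over a generated list of entries. The specific checks (`max_entries`, `no_duplicates`) are hypothetical examples, not the validations Cloudflare built.

```python
from typing import Callable, List

# A validator takes the candidate entries and returns a list of error messages.
Validator = Callable[[List[str]], List[str]]


def max_entries(limit: int) -> Validator:
    """Build a validator that rejects inputs larger than a fixed limit."""
    def check(entries: List[str]) -> List[str]:
        if len(entries) > limit:
            return [f"{len(entries)} entries exceeds limit of {limit}"]
        return []
    return check


def no_duplicates(entries: List[str]) -> List[str]:
    """Report each entry that repeats an earlier one."""
    seen, errors = set(), []
    for entry in entries:
        if entry in seen:
            errors.append(f"duplicate entry: {entry}")
        seen.add(entry)
    return errors


def validate(entries: List[str], validators: List[Validator]) -> List[str]:
    """Run every validator; an empty result means the change may proceed."""
    errors: List[str] = []
    for validator in validators:
        errors.extend(validator(entries))
    return errors
```

A pipeline using this pattern would refuse to publish whenever `validate` returns a non-empty list, so a malformed artifact is rejected before any server sees it.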

4. Reducing Failure Impact with Targeted Measures

Beyond safer deployments, Cloudflare implemented measures to limit the blast radius of any failure. This includes redesigning system boundaries so that an issue in one service doesn’t cascade to others. Features like circuit breakers, graceful degradation, and automated traffic rerouting have been strengthened. The result: even if a change goes wrong, its impact is contained, and your traffic continues flowing with minimal interruption. These changes were directly inspired by the lessons learned from the November and December outages.
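
A circuit breaker, one of the patterns named above, can be sketched in a few lines. This is a generic textbook version, not Cloudflare's code; the class name, thresholds, and injectable `clock` parameter are assumptions for illustration.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

A caller can catch `CircuitOpenError` and serve a cached or reduced response instead, which is one simple form of graceful degradation: the failing dependency is left alone to recover while the rest of the request path keeps working.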

5. Revamped Break Glass and Incident Management Procedures

When emergencies happen, clear and fast procedures are critical. Cloudflare overhauled its break-glass protocols, the emergency access methods used to override normal controls. The new procedures are more secure and auditable, and they include faster rollback options. Incident management has also been refined: teams now have clearer roles, communication channels, and decision-making frameworks. This reduces confusion during a crisis and speeds up recovery, directly improving uptime for your services.

6. Strengthening Customer Communication During Outages

Transparency is key during incidents. Cloudflare has improved how it communicates with customers when things go wrong. Status pages are updated faster, with more precise information. Internal alerts now trigger automated customer notifications, and post-incident reports are published more quickly. These changes ensure you’re never left in the dark—you get timely, actionable updates that help you manage your own systems during a Cloudflare issue.

With these six improvements, Cloudflare’s network is better prepared for the kind of failures seen in late 2025. The Code Orange: Fail Small project may be complete, but the commitment to resilience continues. Configuration changes now roll out more safely, failures are designed to stay contained, and customers are better informed. That’s a win for the entire internet.