By Alex van Andel

When a Dependency Melts Down: Our June 22–23 Incident Post-Mortem

On June 22, 2026, our primary asynchronous task and queue management partner, Trigger.dev, suffered a catastrophic, multi-region cascading failure. Because our core platform infrastructure relies heavily on their engine to process background jobs, webhooks, and critical workflows, this upstream failure immediately created a high-severity incident for our own application.

Trigger.dev has published their own transparent engineering breakdown here: Trigger.dev Incident Report (June 22, 2026). Below is the aligned breakdown of how their infrastructure failure directly mapped to our internal engineering timeline and how we engineered our way out of a total blackout.

The Aligned Timeline

| Time (UTC) | What Happened Upstream (Full Report) | What Happened Internally / Our Experience |
| --- | --- | --- |
| June 22, 19:00 | The Trigger: AWS capacity shortage in us-east-1 forces ~10,000 runs to pile up in upstream queues. | Our background tasks quietly begin stalling. Queues start growing under the hood. |
| June 22, 20:09 | The Meltdown: Upstream Kubernetes control plane hits 100% CPU, loses database quorum, and us-east-1 goes completely down. | Our production environment stops processing runs. Thousands of tasks platform-wide are backlog-locked. |
| June 22, 21:58 | Upstream engineering is actively fighting a "reconnection storm" loop. | Incident Reported: Elevated failure rates are noted, and an internal incident room is initialized. |
| June 23, 00:59 | Upstream provider recommends users manually switch workloads over to their eu-central-1 region. | Incident Lead Assigned: The incident response room is fully activated with an assigned lead to handle cross-region evaluation. |
| June 23, 01:01 - 02:10 | Upstream provider deploys an emergency dashboard update to allow bulk-moving runs between regions. | Active Mitigation: At 01:10, the team manually cancels our queued US runs to clear our account's blocked execution capacity and routes all new traffic to the European region. By 02:10, messages begin processing again. |
| June 23, 02:02 - 02:33 | Upstream provider is manually scaling up their database backing store. | Severity Upgraded to Critical: More than 30,000 of our own tasks are now queued up and waiting. The team posts a critical status update to our public status page. |
| June 23, 04:35 - 05:05 | Upstream infrastructure updates slow execution speeds back to a crawl. | The Webhook Bottleneck: Processing rates drop. By 05:05, we identify that our webhook-delivery queue is the primary bottleneck refusing to flush. |
| June 23, 08:38 | The Second Cascade: Redirected global traffic overwhelms the provider’s European region, knocking it completely offline. | Total Standstill: Both upstream regions are now effectively dead. We have more than 13,000 tasks frozen in flight, entirely at the mercy of upstream recovery. |
| June 23, 10:34 | Upstream engineering is struggling to bring their European region back online. | The Circuit Breaker Deployment: Realizing the primary background processing layer is completely stalled, our team shifts from infrastructure mitigation to an architectural bypass. We deploy an emergency hotfix across all core application and API environments to temporarily disable async task management. |
| June 23, 10:41 | Upstream provider is still working on resizing their control plane nodes. | Immediate Relief: The architectural bypass works flawlessly. By routing around the frozen background queues, critical app processes fall back to direct execution paths. Engineering confirms: "Emails are flowing again." |
| June 23, 17:06 | Upstream provider finally stabilizes both regions and restores processing speeds. | Queue Cleared: Engineering confirms our queues across both regions have completely emptied out. System performance returns to normal. |
| June 23, 19:31 | Post-recovery stability holds. | Incident Resolved: Systems are fully nominal; the team enters the post-incident review flow. |

Key Takeaways & Our Experience Evaluation

While the upstream post-mortem focused on database tuning and cluster infrastructure, our internal evaluation focuses on graceful degradation and architectural resilience.

1. The Power of the "Kill Switch"

The defining moment of our recovery occurred at 10:34 UTC. When our provider's secondary region collapsed, we shifted from trying to manage background queues to implementing a structural bypass.

  • The Bypass: By deploying a global configuration change to disable background tasking, we effectively snipped the cord to our broken dependency. This forced critical user transactions, like booking confirmation emails, to temporarily bypass the frozen background engine and process directly.

  • The Result: Within 7 minutes, core business functions were restored for our users. For us, this proved that built-in, toggleable fallback paths for major third-party dependencies are a non-negotiable requirement for high availability.
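To make the idea of a toggleable fallback path concrete, here is a minimal sketch in Python. It is not our production code: the `TaskDispatcher` class, the `async_enabled` flag, and the task name are hypothetical names chosen for illustration. In a real deployment the flag would more likely come from an environment variable or a feature-flag service than from an attribute set in code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str, Dict], None]


@dataclass
class TaskDispatcher:
    """Routes work to the async engine or to a direct path (hypothetical sketch)."""

    enqueue: Handler            # hands the task to the background engine
    run_inline: Handler         # executes the task directly in the request path
    async_enabled: bool = True  # the "kill switch": False bypasses the queue

    def dispatch(self, name: str, payload: Dict) -> str:
        # Normal operation: defer the work to the background engine.
        if self.async_enabled:
            self.enqueue(name, payload)
            return "queued"
        # Kill switch flipped: skip the queue and do the work right away.
        self.run_inline(name, payload)
        return "inline"


# Stand-ins for the real queue client and the real email sender.
queued: List[Tuple[str, Dict]] = []
sent: List[Tuple[str, Dict]] = []

dispatcher = TaskDispatcher(
    enqueue=lambda name, payload: queued.append((name, payload)),
    run_inline=lambda name, payload: sent.append((name, payload)),
)

first = dispatcher.dispatch("booking-confirmation-email", {"booking_id": 1})

# Simulate the emergency hotfix: disable async task management.
dispatcher.async_enabled = False
second = dispatcher.dispatch("booking-confirmation-email", {"booking_id": 2})
```

The trade-off in a design like this is that inline execution adds latency to the user-facing request and loses the retry behavior of the queue, so it suits a temporary emergency mode better than a default.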

2. Managing Multi-Region Concurrency Traps

During the mitigation phase, navigating a multi-region setup presented unexpected platform limitations. Tasks stuck in an intermediate phase in the US region continued to consume our account's global execution capacity even after we pointed our application to Europe.

  • Identifying and executing a targeted bulk-cancel of the stuck US pool at 01:10 was critical to freeing up execution capacity in our newly targeted European region.
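The trap is easier to see with a toy model. The sketch below assumes a single account-wide concurrency limit shared by all regions, and assumes that runs in an intermediate state still hold a slot. The state names, the limit of 10, and the `Run` type are invented for illustration and are not taken from the provider's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    run_id: int
    region: str
    state: str  # hypothetical states: "queued", "dequeued", "executing", "cancelled"


# Assumption: these states hold one slot of the account-wide limit each.
SLOT_HOLDING_STATES = {"dequeued", "executing"}


def free_slots(runs: List[Run], account_limit: int) -> int:
    """Slots left under the account-wide limit, counted across every region."""
    used = sum(1 for run in runs if run.state in SLOT_HOLDING_STATES)
    return max(account_limit - used, 0)


def bulk_cancel(runs: List[Run], region: str) -> int:
    """Cancel every non-cancelled run in one region and return how many changed."""
    cancelled = 0
    for run in runs:
        if run.region == region and run.state != "cancelled":
            run.state = "cancelled"
            cancelled += 1
    return cancelled


# Eight US runs are stuck mid-flight; five EU runs are waiting for a slot.
runs = [Run(i, "us-east-1", "dequeued") for i in range(8)]
runs += [Run(100 + i, "eu-central-1", "queued") for i in range(5)]

before = free_slots(runs, account_limit=10)   # stuck US runs still hold slots
cancelled = bulk_cancel(runs, "us-east-1")
after = free_slots(runs, account_limit=10)    # capacity is available to EU again
```

In this model, pointing new traffic at Europe changes nothing until the stuck US runs release their slots, which is why the bulk-cancel had to come first.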

3. Upstream Blast-Radius Hardening

While our emergency fallback saved the day for mission-critical flows like email, we still had more than 13,000 tasks sitting dead in frozen upstream queues until the provider fully recovered.

  • The Future Path: Moving forward, we are evaluating a more localized buffer. For background jobs and outbound webhooks, this would be a lightweight internal staging queue that stores tasks locally during an upstream blackout and replays them in their original order once the provider recovers.
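One possible shape for such a staging queue is sketched below using SQLite from the Python standard library. This is an illustration of the idea under evaluation, not a design we have committed to. The `StagingQueue` class, the table layout, and the `send` callback are all assumptions. Ordering relies on an auto-incrementing row id, and replay stops at the first failure so later tasks never overtake earlier ones.

```python
import json
import sqlite3
from typing import Callable, Dict, List


class StagingQueue:
    """Local, ordered buffer for tasks that cannot reach the upstream engine."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS staged ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "name TEXT NOT NULL, payload TEXT NOT NULL)"
        )
        self.db.commit()

    def stage(self, name: str, payload: Dict) -> None:
        """Persist a task locally; the row id records arrival order."""
        self.db.execute(
            "INSERT INTO staged (name, payload) VALUES (?, ?)",
            (name, json.dumps(payload)),
        )
        self.db.commit()

    def pending(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM staged").fetchone()[0]

    def replay(self, send: Callable[[str, Dict], None]) -> int:
        """Send staged tasks oldest first; stop at the first failure."""
        rows = self.db.execute(
            "SELECT id, name, payload FROM staged ORDER BY id"
        ).fetchall()
        replayed = 0
        for row_id, name, payload in rows:
            try:
                send(name, json.loads(payload))
            except Exception:
                break  # upstream still unhealthy: keep this row and all later ones
            self.db.execute("DELETE FROM staged WHERE id = ?", (row_id,))
            self.db.commit()
            replayed += 1
        return replayed


upstream = {"healthy": False}
delivered: List[int] = []


def send(name: str, payload: Dict) -> None:
    # Stand-in for the real upstream client.
    if not upstream["healthy"]:
        raise ConnectionError("upstream unavailable")
    delivered.append(payload["seq"])


queue = StagingQueue()
for seq in (1, 2, 3):
    queue.stage("webhook-delivery", {"seq": seq})

during_outage = queue.replay(send)    # nothing is delivered, nothing is lost
upstream["healthy"] = True
after_recovery = queue.replay(send)   # tasks drain in their original order
```

A buffer like this gives at-least-once delivery: if the process dies after `send` succeeds but before the row is deleted, the task is sent again on the next replay, so downstream handlers would need to be idempotent.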

Summary

Ultimately, our engineering team successfully diagnosed a complex multi-region concurrency trap, pivoted traffic, and deployed a creative architectural bypass that rescued our user experience hours before our provider managed to stabilize their infrastructure.

We know this was a very difficult moment for the Trigger.dev team. We respect them and continue to have a great partnership. This post simply shares our side of the incident and how our team reacted.
