Studio Outage

Incident Report for WellSaid Labs

Postmortem

Summary

WellSaid Labs Studio experienced an outage caused by a traffic routing policy change made at the load balancer level. The change, intended to reduce the latency of clip generation and downloading, pointed the routing rules for the Studio API and Studio web at backend services that had not been successfully updated. As a result, the load balancer was unable to route incoming requests to the backend services.

Timeline

  • 2024-09-18 22:57 UTC : Infrastructure change begins to roll out
  • 2024-09-18 23:00 UTC : Routing rules for studio production services are updated
  • 2024-09-18 23:00 UTC : Studio page and API requests begin failing
  • 2024-09-18 23:05 UTC : Outage detected by automated systems and infrastructure team is alerted
  • 2024-09-18 23:06 UTC : Cause of outage identified
  • 2024-09-18 23:10 UTC : Fix identified
  • 2024-09-18 23:15 UTC : Changes to fix routing rules begin to roll out
  • 2024-09-18 23:18 UTC : Routing changes applied to production
  • 2024-09-18 23:18 UTC : Services begin to spin back up to handle traffic
  • 2024-09-18 23:19 UTC : Infrastructure team is able to access studio web and call the studio API
  • 2024-09-18 23:20 UTC : System reports healthy status
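
The key durations implied by the timeline can be worked out with a short script. This is only an illustrative sketch: the event names (`requests_failing`, `alerted`, `fix_applied`, `healthy`) are labels chosen here for convenience, and the timestamps are copied from the entries above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) taken from the timeline; the keys are illustrative labels.
events = {
    "requests_failing": "2024-09-18 23:00",
    "alerted": "2024-09-18 23:05",
    "fix_applied": "2024-09-18 23:18",
    "healthy": "2024-09-18 23:20",
}
parsed = {name: datetime.strptime(ts, FMT) for name, ts in events.items()}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    return int((parsed[end] - parsed[start]).total_seconds() // 60)


time_to_detect = minutes_between("requests_failing", "alerted")    # 5 minutes
time_to_fix = minutes_between("requests_failing", "fix_applied")   # 18 minutes
time_to_restore = minutes_between("requests_failing", "healthy")   # 20 minutes
```

By this reading, requests failed for roughly 20 minutes, of which the first 5 elapsed before the infrastructure team was alerted.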

Impact

All users attempting to load pages within WellSaid Labs Studio would have experienced failing requests. Users attempting to generate clips would have been unaffected.

Resolution

The infrastructure team removed and recreated the failing backend services using the automated deployment pipeline, allowing the routing layer to reach them properly.

Follow-Up

Next steps

The infrastructure team is working on changes to keep the different layers involved in routing requests to the Studio services aligned, and to tighten the dependencies between them so that new backends must exist before any traffic is routed to them.
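
As a rough illustration of the kind of guard described above, the sketch below refuses to apply a routing update if any rule targets a backend that does not yet exist. It is not WellSaid Labs' actual tooling: `RoutingRule`, `missing_backends`, `apply_routing`, and the backend names in the usage note are all hypothetical, and a real deployment would more likely express this ordering in its infrastructure tooling.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class RoutingRule:
    """Hypothetical routing rule: a path prefix mapped to a backend name."""
    path_prefix: str
    backend: str


def missing_backends(rules: Iterable[RoutingRule],
                     existing_backends: Iterable[str]) -> List[str]:
    """Return sorted names of backends that rules reference but that do not exist."""
    return sorted({rule.backend for rule in rules} - set(existing_backends))


def apply_routing(rules: List[RoutingRule],
                  existing_backends: Iterable[str],
                  apply: Callable[[List[RoutingRule]], None]) -> None:
    """Apply the rules only if every target backend exists; otherwise refuse."""
    missing = missing_backends(rules, existing_backends)
    if missing:
        raise RuntimeError(
            f"refusing to update routing; missing backends: {missing}"
        )
    apply(rules)
```

For example, with rules targeting the hypothetical backends `studio-api-v2` and `studio-web-v2`, calling `apply_routing` while only `studio-api-v2` exists raises an error instead of sending traffic to a backend that is not there.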

Posted Sep 19, 2024 - 01:19 UTC

Resolved

This incident has been resolved.
Posted Sep 18, 2024 - 23:20 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Sep 18, 2024 - 23:15 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Sep 18, 2024 - 23:00 UTC
This incident affected: Studio (Editor, API).