Increased latency & error rates for API customers

Incident Report for WellSaid Labs

Postmortem

Incident Summary

Elevated latency and error rates affected all API customers. The errors were caused by a single gateway pod that kept serving traffic in a faulty state after it failed to initialize a system component properly.

Impact

Roughly 20% of API customer traffic experienced increased latency and/or error rates for a period of 24 hours.

Root Cause

The source of the elevated error rates and latency was traced to a single pod responsible for handling API customer traffic. This pod failed to initialize a middleware component properly but continued to serve traffic in that faulty state. Service was restored once the problematic pod was terminated.
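
The failure mode described above (a component fails to initialize, yet the pod stays in rotation) can be illustrated with a minimal sketch. This is not the actual gateway code: the `Gateway` class, the component names, and the `readiness` method are hypothetical, and the sketch assumes the orchestrator polls a readiness check and stops routing traffic to a pod that reports a non-200 status.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Gateway:
    """Hypothetical gateway that records which components failed to initialize."""

    # Maps a component name to the function that initializes it (assumed shape).
    initializers: Dict[str, Callable[[], None]]
    failed: Dict[str, str] = field(default_factory=dict)
    started: bool = False

    def start(self) -> None:
        # Run every initializer and remember failures instead of ignoring them.
        for name, init in self.initializers.items():
            try:
                init()
            except Exception as exc:
                self.failed[name] = repr(exc)
        self.started = True

    def readiness(self) -> Tuple[int, str]:
        # Report "not ready" until startup has finished and nothing failed,
        # so a partially initialized pod is never handed customer traffic.
        if not self.started:
            return 503, "starting"
        if self.failed:
            return 503, "failed to initialize: " + ", ".join(sorted(self.failed))
        return 200, "ready"
```

With a check like this, a pod whose middleware initializer raises would report 503 and never be marked ready. A stricter alternative is to exit the process on any initialization failure so that the pod is restarted.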

Preventative Measures

  • Improve gateway initialization logic and handling of failure scenarios
  • Improve logging severity to ensure relevant errors trigger alerts accordingly
  • Adjust alerting policies around error rates and latency for gateway components
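
As one possible reading of the alerting item above, the sketch below evaluates error rates per pod rather than only in aggregate, since a single faulty pod can be diluted by healthy ones in a fleet-wide average. The function name, pod names, thresholds, and input shape are all hypothetical and do not describe the actual alerting configuration.

```python
from typing import Dict, List, Tuple


def unhealthy_pods(
    counts: Dict[str, Tuple[int, int]],
    max_error_rate: float = 0.05,   # illustrative threshold, not a real policy
    min_requests: int = 100,        # skip pods with too little traffic to judge
) -> List[str]:
    """Return pods whose error rate exceeds max_error_rate.

    counts maps a pod name to (total_requests, failed_requests) for one
    evaluation window.
    """
    flagged = []
    for pod, (requests, errors) in counts.items():
        if requests < min_requests:
            continue
        if errors / requests > max_error_rate:
            flagged.append(pod)
    return sorted(flagged)
```

For example, with hypothetical counts `{"gw-a": (1000, 5), "gw-b": (1000, 900)}`, only `gw-b` is flagged, even though the combined error rate across both pods is lower than that pod's own rate.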
Posted Mar 05, 2024 - 17:34 UTC

Resolved

Elevated latency and error rates impacting all API customers.

Incident Start: 2024-03-02 13:48 PT
Incident End: 2024-03-03 13:48 PT
Incident Duration: 24 hours
Impact: Roughly 20% of API customer traffic experienced increased latency and/or error rates
Posted Mar 02, 2024 - 22:00 UTC