Increased latency & error rates for API customers

Incident Report for WellSaid Labs

Postmortem

Incident Summary

Elevated latency and error rates affected all API customers. The errors were caused by a single gateway pod that kept serving traffic in a faulty state after it failed to initialize a system component properly.

Impact

Roughly 20% of API customer traffic experienced increased latency and/or error rates over a 24-hour period.

Root Cause

The elevated error rates and latency were traced to a single pod responsible for handling API customer traffic. This pod failed to initialize a middleware component properly but kept accepting traffic, and the requests it handled were slow or failed. Service was restored once the problematic pod was terminated.
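
The failure mode described above (a component fails to initialize, yet the process keeps serving) can be illustrated with a minimal sketch. This is not the gateway's actual code; the `Gateway` class, the `initializers` registry, and the `is_ready` check are hypothetical names chosen for illustration. The idea is that a failed initializer is treated as fatal, and a readiness check reports not ready so an orchestrator would stop routing traffic to the instance.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class InitializationError(RuntimeError):
    """Raised when one or more required components fail to initialize."""


@dataclass
class Gateway:
    # Hypothetical registry mapping a component name to its init function.
    initializers: Dict[str, Callable[[], None]]
    failed: List[str] = field(default_factory=list)
    started: bool = False

    def start(self) -> None:
        """Run every initializer; refuse to start if any of them fails."""
        for name, init in self.initializers.items():
            try:
                init()
            except Exception:
                # Record the failure instead of silently continuing.
                self.failed.append(name)
        if self.failed:
            raise InitializationError(
                "components failed to initialize: " + ", ".join(self.failed)
            )
        self.started = True

    def is_ready(self) -> bool:
        """Readiness signal: True only after a fully successful start."""
        return self.started and not self.failed
```

In a setup like this, a readiness endpoint backed by `is_ready()` would keep a partially initialized instance out of rotation, and the raised `InitializationError` would let the process exit so it can be restarted.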

Preventative Measures

Improve gateway initialization logic and handling of failure scenarios
Improve logging severity to ensure relevant errors trigger alerts accordingly
Adjust alerting policies around error rates and latency for gateway components
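
As a rough illustration of the alerting measure above, the sketch below checks error rates per pod rather than only in aggregate, since one faulty pod can be diluted in a fleet-wide average. The function name, the input shape, the pod names, and the thresholds (5% error rate, 100 requests minimum) are all assumptions made for this example, not the values used in production.

```python
from typing import Dict, List, Tuple


def unhealthy_pods(
    stats: Dict[str, Tuple[int, int]],
    max_error_rate: float = 0.05,  # assumed threshold, for illustration only
    min_requests: int = 100,       # skip pods with too little traffic to judge
) -> List[str]:
    """Return pods whose error rate exceeds max_error_rate.

    stats maps a pod name to (total_requests, failed_requests).
    """
    flagged = []
    for pod, (requests, errors) in sorted(stats.items()):
        if requests < min_requests:
            continue
        if errors / requests > max_error_rate:
            flagged.append(pod)
    return flagged


# Hypothetical sample: one pod fails most requests, the other is healthy.
sample = {
    "gateway-a": (1000, 5),
    "gateway-b": (1000, 900),
}
print(unhealthy_pods(sample))
```

A per-pod check of this kind could feed an alert that pages when any single gateway instance stays above the threshold for several consecutive evaluation windows.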

Posted Mar 05, 2024 - 17:34 UTC

Resolved

Elevated latency and error rates impacting all API customers.

Incident Start: 2024-03-02 13:48 PT
Incident End: 2024-03-03 13:48 PT
Incident Duration: 24 hours
Impact: Roughly 20% of API customer traffic experienced increased latency and/or error rates

Posted Mar 02, 2024 - 22:00 UTC