Developer TTS API authorization failures

Incident Report for WellSaid Labs

Postmortem

Summary

During a migration of our internal TTS service infrastructure, our external-facing TTS Developer API experienced a partial outage caused by authentication failures against the new TTS service. The root cause was a key mismatch: the authentication keys used by the Developer API did not match those configured in the new internal TTS service, so the new service rejected them as invalid. This resulted in a service degradation in which 19 requests failed.

Timeline

  • 2024-11-10 01:39 UTC : Migration of internal TTS service to new infrastructure begins
  • 2024-11-10 01:41 UTC : Developer TTS API Gateway nodes begin to pick up new routing change
  • 2024-11-10 01:41 UTC : The nodes routing to the new TTS service begin to fail requests
  • 2024-11-10 01:42 UTC : Monitoring system shows failing requests between the two services
  • 2024-11-10 01:42 UTC : Routing logic changed back to existing infrastructure
  • 2024-11-10 01:43 UTC : Root cause of mismatching keys is identified
  • 2024-11-10 01:43 UTC : Monitoring system alert for failed requests to Developer TTS fires
  • 2024-11-10 01:44 UTC : Impact analysis begins
  • 2024-11-10 01:44 UTC : Developer TTS API Gateway successfully authenticates to existing internal TTS service
  • 2024-11-10 01:44 UTC : Existing authentication keys used by Developer TTS API Gateway are manually added to new TTS service
  • 2024-11-10 01:51 UTC : Verification of existing keys against new TTS infrastructure is performed
  • 2024-11-10 02:00 UTC : Impact analysis finds only Developer TTS API is affected
  • 2024-11-10 02:01 UTC : Routing logic changed back to new infrastructure
  • 2024-11-10 02:02 UTC : Developer TTS API Gateway begins successfully authenticating to internal TTS service
  • 2024-11-10 02:05 UTC : Infrastructure team continues to monitor both services
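
The key verification performed at 01:51 is the kind of check that could also run before any routing change. The following is a minimal sketch of such a pre-cutover check, not the tooling actually used during this incident. The function names, the key names, and the idea that the new service exposes a set of accepted key fingerprints are all assumptions made for illustration.

```python
import hashlib
import hmac


def fingerprint(key: str) -> str:
    """Return a SHA-256 fingerprint so raw keys never need to be logged."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def missing_keys(gateway_keys: dict, accepted_fingerprints: set) -> list:
    """Return the names of gateway keys the new service would reject.

    gateway_keys maps a hypothetical key name to the raw key the gateway
    sends; accepted_fingerprints is the (assumed) set of fingerprints
    configured on the new service.
    """
    missing = []
    for name, key in gateway_keys.items():
        fp = fingerprint(key)
        # compare_digest performs a constant-time comparison.
        if not any(hmac.compare_digest(fp, known) for known in accepted_fingerprints):
            missing.append(name)
    return sorted(missing)


def safe_to_cut_over(gateway_keys: dict, accepted_fingerprints: set) -> bool:
    """Routing should only change when no gateway key is missing."""
    return not missing_keys(gateway_keys, accepted_fingerprints)
```

Under these assumptions, a migration script would call `safe_to_cut_over` and refuse to change routing while it returns `False`, reporting the names from `missing_keys` instead.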

Impact

  • Some users of the Developer TTS API whose requests hit nodes connected to the new TTS service experienced server errors while attempting to generate new clips. In total, 19 requests failed over a 3-minute window.

Action Items

  • A planned extension of this infrastructure migration is to implement automatic rotation and refreshing of API keys between the two services to remove the need for manual syncing
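
One common way to rotate keys without failing in-flight requests is to keep accepting the previous key for a short grace period after each rotation. The sketch below illustrates that idea only. It is an assumption about how such rotation might work, not a description of the planned implementation, and the `KeyRing` class, its fields, and the 300-second default are hypothetical.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KeyRing:
    """Holds the current key plus the previous one during a grace period."""

    grace_seconds: float = 300.0
    current: str = field(default_factory=lambda: secrets.token_urlsafe(32))
    previous: Optional[str] = None
    rotated_at: float = 0.0

    def rotate(self, now: Optional[float] = None) -> str:
        """Generate a new key and keep the old one valid for the grace period."""
        now = time.time() if now is None else now
        self.previous = self.current
        self.current = secrets.token_urlsafe(32)
        self.rotated_at = now
        return self.current

    def accepts(self, key: str, now: Optional[float] = None) -> bool:
        """Accept the current key, or the previous key while still in grace."""
        now = time.time() if now is None else now
        if secrets.compare_digest(key, self.current):
            return True
        in_grace = (
            self.previous is not None
            and now - self.rotated_at <= self.grace_seconds
        )
        return in_grace and secrets.compare_digest(key, self.previous)
```

With this shape, a gateway node that has not yet refreshed its key keeps authenticating until the grace period ends, which avoids the kind of mismatch window described in this report.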
Posted Nov 10, 2024 - 03:21 UTC

Resolved

All Developer TTS API gateway nodes are properly communicating with the internal TTS service.
Posted Nov 10, 2024 - 02:02 UTC

Monitoring

Keys between the two services have been synced and we are seeing successful requests coming from the Developer API gateway. We are continuing to monitor to ensure all nodes reflect this change.
Posted Nov 10, 2024 - 01:44 UTC

Identified

Some TTS Developer API nodes have started sending requests to the internal TTS service with incorrect API keys.
Posted Nov 10, 2024 - 01:41 UTC
This incident affected: Text to Speech (Developer API).