Dashboard 502 errors
Incident Report for Alloy
Postmortem

Summary:

On August 30th, our dashboard was inaccessible for about 30 minutes and slow to update for several hours. The API, which processes applications and data, was not affected, and no data was lost or put at risk. Operations teams performing manual reviews were the primary group impacted by this incident.

Root Cause:

The API server received thousands of requests that placed an unusually large number of entities in groups. The Groups feature is intended to manage situations where multiple entities belong to the same Application, such as 3 people applying for a joint checking account, and it was not designed to support (or defend against) groups of the size received. Our dashboard displays data from a separate database, and several of the queries that database performs were not optimized to process a group of entities this large. The database slowed down, a backlog of queries developed, and eventually the database went completely offline, causing the dashboard outage.

We restored service by rate-limiting the client, removing the impacted entity links, and restarting the services that power the dashboard. This immediately restored access to the dashboard, and after roughly another hour the backlog of queries had cleared and the dashboard was updating in real time again.

Actions taken or planned to avoid future incidents:

To avoid this situation in the future, we are:

  • Implementing rate limits on this feature to prevent unexpected behavior from misuse
  • Improving our proactive monitoring and alerting for the database that powers the dashboard, so that we can identify potential issues more quickly
  • Implementing generalized guardrails and optimizations to keep intensive queries from impacting the overall system
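
As an illustration only, the first and third items above could look something like the sketch below. It is not Alloy's actual implementation. The names `MAX_GROUP_SIZE`, `validate_group`, and `TokenBucket` are hypothetical, and the cap of 10 entities per group is an assumed value chosen for the example, not a limit stated in this report.

```python
import time
from dataclasses import dataclass, field

# Hypothetical cap on entities per group; the real limit is not stated here.
MAX_GROUP_SIZE = 10


class GroupTooLargeError(ValueError):
    """Raised when a request tries to link more entities than the cap allows."""


def validate_group(entity_ids, max_size=MAX_GROUP_SIZE):
    """Reject oversized groups before they reach the dashboard database."""
    unique_ids = list(dict.fromkeys(entity_ids))  # de-duplicate, keep order
    if len(unique_ids) > max_size:
        raise GroupTooLargeError(
            f"group has {len(unique_ids)} entities; limit is {max_size}"
        )
    return unique_ids


@dataclass
class TokenBucket:
    """Per-client rate limiter: each request spends one token."""

    capacity: float           # maximum burst size
    refill_per_second: float  # sustained request rate
    tokens: float = field(init=False)
    updated_at: float = field(init=False)

    def __post_init__(self):
        self.tokens = self.capacity
        self.updated_at = time.monotonic()

    def allow(self, now=None):
        """Return True if the request may proceed, False if it is throttled."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In a sketch like this, a request handler would call `validate_group` on the incoming entity list and `allow()` on the calling client's bucket, rejecting the request if either check fails, so that an oversized or overly frequent request never generates dashboard queries.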
Posted Sep 07, 2021 - 14:38 EDT

Resolved
This incident has been resolved.
Posted Aug 30, 2021 - 13:02 EDT
Monitoring
The dashboard is functioning, but some features are slow. We are continuing to work on this.
Posted Aug 30, 2021 - 12:38 EDT
Identified
A partial fix has been implemented, and we're working to get everything back online.
Posted Aug 30, 2021 - 12:09 EDT
Investigating
The dashboard is currently unreachable. We are investigating and working to get it back up.
Posted Aug 30, 2021 - 11:45 EDT
This incident affected: Customer Dashboard.