Summary:
On August 30th, our dashboard was inaccessible for about 30 minutes and slow to update for several hours. The API, which processes applications and data, was not affected, and no data was lost or put at risk. Operations teams performing manual reviews were the primary group affected by this incident.
Root Cause:
The API server received thousands of requests containing groups with an unusually large number of entities. The Groups feature is intended to manage situations where multiple entities belong to the same Application, such as three people applying for a joint checking account. It was not designed to support (or defend against) groups of the size received. Our dashboard reads from a dedicated database, and several of the queries that run against it were not optimized to process groups this large. The database slowed down, a backlog of queries developed, and eventually the database went completely offline, causing the dashboard outage.
We restored service by rate-limiting the client, removing the affected entity links, and restarting the services that power the dashboard. This brought the dashboard back online immediately. After roughly another hour, the backlog of queries had cleared and the dashboard was updating in real time again.
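To illustrate the two kinds of safeguard this incident points to, the sketch below shows a group-size check at the API boundary and a simple per-client rate limiter. It is a minimal sketch under stated assumptions, not a description of the production code: the names validate_group, GroupTooLargeError, and TokenBucket, and the limit MAX_GROUP_SIZE = 25, are hypothetical and chosen only for the example.

```python
# Hypothetical limit: this report does not state a supported group size.
MAX_GROUP_SIZE = 25


class GroupTooLargeError(ValueError):
    """Raised when a request links more entities than the assumed limit."""


def validate_group(entity_ids, max_size=MAX_GROUP_SIZE):
    """Return the de-duplicated entity IDs, or raise if the group is too large."""
    unique_ids = list(dict.fromkeys(entity_ids))  # drop repeats, keep order
    if len(unique_ids) > max_size:
        raise GroupTooLargeError(
            f"group has {len(unique_ids)} entities; limit is {max_size}"
        )
    return unique_ids


class TokenBucket:
    """Per-client limiter: `rate` requests per second, bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last_seen = 0.0

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) may proceed."""
        elapsed = max(0.0, now - self.last_seen)
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_seen = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In a sketch like this, an oversized group is rejected before it reaches any database, and a client that sends a burst of requests is slowed down rather than allowed to build a backlog. The clock is passed in as `now` so the limiter can be exercised in tests without waiting on real time.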
Actions taken or planned to avoid future incidents:
To avoid this situation in the future, we are: