Production API Error Elevation

Incident Report for Alloy

Resolved

This incident has been resolved and the indicators that we have been observing on our side have recovered to the extent that we're confident in the current health of the system. The incident's effects lasted from approximately 20:39 - 22:28 ET, though the impact was intermittent during that time and mostly impacted our APIs.

The root cause appears to be a cleanup process initiated by our Amazon Web Services Aurora database causing some read processes to slow down resulting in most API queries failing. We were able to restore service by moving these workloads to a different location while the process finished. We are working with AWS and our internal teams to avoid both this specific issue and any related issues in the future.

Posted Jan 04, 2022 - 01:00 EST

Monitoring

We've moved some of our load off of read replicas, which seems to have mitigated the issues we were seeing so far. We are still monitoring and discussing the root cause. We will continue to monitor this to make sure it is stable before closing this incident.

Posted Jan 03, 2022 - 22:56 EST

Update

We're still investigating - something is causing massive issues with all read replicas of our production database cluster. We've tried a series of experiments and methods to recover the service. We are now working on larger-scale fixes which we'll be able to roll out in the new few minutes. We will post immediately upon recovery or the issue is identified.

Posted Jan 03, 2022 - 22:27 EST

Update

We are currently aware of major issues with our APIs and certain degradations with our dashboard. We are trying to diagnose the root cause currently.

Posted Jan 03, 2022 - 21:25 EST

Investigating

Our API integration tests have encountered an increase in errors. We are currently investigating. Stay tuned for updates.

Posted Jan 03, 2022 - 20:40 EST

This incident affected: Production API, Customer Dashboard, and Sandbox API.