Incident Timeline:
- 05:45 UTC: The first customer reported that a Chef run on their environment had failed due to an ongoing run. This was fixed in the standard fashion and the customer was informed.
- 06:30 - 07:00 UTC: Three further customers reported the same issue. No further reports were received after this time.
- 09:00 UTC: Support Engineers working the customer tickets found that the standard fix did not resolve the issue and that Chef runs initiated via the ey-core gem were failing to start correctly in our platform backend. This was found to apply to all requests resulting in platform actions (Apply, Deploy, etc.) made via the API, whether generated by the ey-core gem or by direct curl requests to the API. Dashboard actions and calls using the older ey gem CLI remained functional.
- 09:45 UTC: The status page and dashboard were updated to acknowledge the incident and to request that customers avoid performing tasks via the API and ey-core gem.
- 09:45 - 11:30 UTC: Engineers continued to investigate various platform components in order to identify the issue.
- 11:30 UTC: The database instance in the environment of the platform component used to stream logs for API calls was identified as failed, due to failing hardware at AWS. The instance was restarted to move it to a new host, and the application was reconfigured and restarted to bring it back online.
- 12:25 UTC: After full testing, the incident was declared resolved and the impacted customers were informed.
Incident Root Cause:
The root cause of the incident was confirmed to be the failure, at the host hardware level, of the database instance in the log streaming environment. The API does not use this environment directly but depends on it for logging, so the failure also affected the ey-core gem, which calls the API.
Incident Impact:
All customers on the EY Platform were unable to perform actions on their environments through API calls, most commonly made through the ey-core gem. As the EY Cloud Dashboard was still fully functional, customers were able to perform all required actions via the UI. The sole exception is the uploading of custom Chef recipes, which is an API-only process, but this action remained available throughout because it does not use the impacted log streaming component. In addition, due to the time of day, the number of customers impacted was minimal. The impact can therefore be classified as Minor.
Incident Corrective Actions:
Initial corrective actions will entail ensuring that monitoring and alerting are in place on the log streaming environment, in order to make any future failures more immediately apparent to frontline staff.
Future corrective actions will be to investigate the API's reliance on the log streaming environment and whether failure of that environment can be made a non-breaking event.
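
As an illustration only of what a "non-breaking" log streaming failure could look like, the sketch below runs a platform action and treats log streaming as best effort. It is not the platform's actual code: `run_platform_action`, `LogStreamUnavailable`, and the `action` and `stream_log` callables are hypothetical names introduced for this example.

```python
import logging

logger = logging.getLogger("api.actions")


class LogStreamUnavailable(Exception):
    """Hypothetical error raised when the log streaming backend is down."""


def run_platform_action(action, stream_log):
    """Run a platform action and stream its log on a best-effort basis.

    `action` is a zero-argument callable performing the work (Apply,
    Deploy, etc.); `stream_log` forwards one message to the log streaming
    environment. Both are assumptions made for this sketch.
    """
    result = action()
    try:
        stream_log(f"action completed: {result}")
        streamed = True
    except Exception:
        # Fail open: record the streaming failure locally, but do not
        # let it fail the platform action itself.
        logger.warning("log streaming unavailable; continuing", exc_info=True)
        streamed = False
    return result, streamed


def failing_stream(message):
    # Simulates the failed database instance behind log streaming.
    raise LogStreamUnavailable("database instance unreachable")


result, streamed = run_platform_action(lambda: "deploy-ok", failing_stream)
```

In this sketch the action still returns its result when streaming fails, and the `streamed` flag lets the caller or monitoring notice that logs were not delivered.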