Issue preventing Chef runs is currently blocking configuration changes and the spinning up of new instances on stacks up to V5
Incident Report for Engine Yard
Postmortem

Incident Timeline:

  • 18:15 UTC: First customer reports issues with Chef Apply runs on instances.
  • 18:50 UTC: Investigation into the issue links the issue to failures with Portage (Gentoo package manager), specifically the failure of customer instances to access the platform’s Portage server.
  • 20:15 UTC: Further testing shows the issue to be wide-reaching, impacting all stacks (stable-v1 to stable-v5) except the Ubuntu-based stable-v6. An Incident is therefore created to inform customers.
  • 21:00 UTC: The Portage server is identified and restarted, though it remains inaccessible.
  • 21:30 UTC: Portage server access is gained and direct investigation of connection failures from platform instances is performed.
  • 21:50 UTC: Flushing of IPTables firewall rules on Portage server restores connectivity of customer instances to Portage.
  • 22:00 UTC: Re-application of dynamically generated IPTables firewall rules is found to not impact connectivity.
  • 22:15 UTC: Source of issue is tracked to a change in the format of the AWS IP Range list published shortly before the latest automatic update to the dynamically generated IPTables firewall rules on the Portage server. This change was found to be reverted in later published lists.
  • 22:30 UTC: Incident is declared as Resolved.

Incident Root Causes:

  • Engine Yard instances running stacks stable-v1 to stable-v5 run the Gentoo OS, which utilises Portage as its package manager. Engine Yard curates its own packages through a dedicated Portage server.
  • Upon a Chef Apply run instances synchronise the local Portage Tree from the Portage server.
  • For security reasons the Portage server is firewalled from the wider internet, but still needs to allow access from AWS IP addresses.
  • To keep this fire-walling up to date, the Portage server runs a regular task to download the latest published AWS IP Ranges list and utilise this to dynamically generate firewall rules, granting these ranges access.
  • On the evening of 1st April (UTC), Amazon published a new IP list containing an additional field not previously included in the list.
  • The firewall rule generation script failed to parse the list because of this additional field. As a result the Portage server no longer granted access to the AWS IP ranges, which blocked access to Portage from customer instances.
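
The rule generation step described above can be sketched as follows. This is a minimal illustration, not the script that runs on the Portage server. It assumes the list is the JSON document AWS publishes, with a "prefixes" array whose entries carry an "ip_prefix" key. The function names, the INPUT chain and port 873 (rsync) are hypothetical choices made for the example. The point it shows is reading only the field that is needed, so that an unfamiliar extra field on an entry is ignored rather than causing a parse failure.

```python
import ipaddress
import json


def extract_ipv4_ranges(raw_json):
    """Return the sorted, de-duplicated IPv4 CIDR blocks from an IP range list.

    Only the "ip_prefix" key of each entry is read, so any additional
    fields on an entry are simply ignored.
    """
    data = json.loads(raw_json)
    ranges = set()
    for entry in data.get("prefixes", []):
        prefix = entry.get("ip_prefix")
        if prefix is None:
            continue  # entry without an IPv4 prefix: skip it
        # ip_network() raises ValueError on anything that is not a valid CIDR
        ranges.add(str(ipaddress.ip_network(prefix)))
    return sorted(ranges)


def build_rules(ranges, chain="INPUT", port=873):
    """Render one ACCEPT rule per range (chain and port are assumptions)."""
    return [
        f"-A {chain} -s {cidr} -p tcp --dport {port} -j ACCEPT"
        for cidr in ranges
    ]
```

Called on a list whose entries include a key the script has never seen, `extract_ipv4_ranges` still returns the CIDR blocks, and `build_rules` turns them into rule lines that could be fed to the firewall.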

Incident Impact:

Customers on stacks stable-v1 to stable-v5 saw Chef Apply run failures when creating new instances or configuring existing instances, whenever such runs required updating the Portage Tree or installing packages. This failure applied to all regions. The incident did not impact the deployment or running of customer applications.

Incident Corrective Actions:

We have reached out to Amazon regarding the changes to the published IP Range list, asking why the list was published in such a state and whether such changes will be published again in future.

Engine Yard will be undertaking a review of the dynamic fire-walling script, in order to prevent future parsing failures resulting in unwanted firewall lockdowns. We shall also be reviewing and improving internal documentation and knowledge sharing in order to improve investigation and resolution of any future Platform issues.
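
One possible shape for such a safeguard is sketched below, purely as an illustration of the idea and not as the outcome of the review. It assumes the list is JSON with a "prefixes" array of entries keyed by "ip_prefix". The function names and the `min_ratio` threshold are hypothetical. If the new list cannot be parsed, or yields far fewer ranges than the last known good set, the previous ranges are kept, so a bad download does not lock the firewall down.

```python
import ipaddress
import json


def parse_ranges(raw_json):
    """Parse the range list strictly; raises on malformed input."""
    data = json.loads(raw_json)
    return sorted(
        {str(ipaddress.ip_network(entry["ip_prefix"])) for entry in data["prefixes"]}
    )


def select_ranges(raw_json, last_known_good, min_ratio=0.5):
    """Return (ranges_to_apply, status).

    Falls back to last_known_good when the new list fails to parse or
    shrinks below min_ratio of the previous size (hypothetical threshold).
    """
    try:
        candidate = parse_ranges(raw_json)
    except (ValueError, KeyError, TypeError) as exc:
        # json.JSONDecodeError and invalid CIDRs are both ValueError
        return last_known_good, f"parse failed: {exc!r}"
    if len(candidate) < min_ratio * len(last_known_good):
        return last_known_good, "candidate list suspiciously small"
    return candidate, "ok"
```

The returned status string could be logged or alerted on, so that a fallback is visible to operators instead of silently persisting.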

Posted Apr 02, 2020 - 09:50 UTC

Resolved
This incident has been resolved.
Posted Apr 01, 2020 - 21:34 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 01, 2020 - 21:14 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 01, 2020 - 20:13 UTC
This incident affected: Engine Yard Cloud.