Elevate Technology Group Incident – RCA (Root Cause Analysis)
November 21, 2023
Overview
On November 2nd, 2023, Elevate Technology Group received reports of clients experiencing intermittent cloud service outages at the Hillsboro, OR location. Elevate Technology Group immediately activated its disaster response plan and worked to identify, restore, and monitor the affected services. The outage was caused by multiple unexpected power incidents, which are described later in this report.
The power-related outage window for impacted services ran from 4:43 a.m. PST to 5:23 a.m. PST on November 2nd, 2023. Once power was restored to the physical infrastructure, Elevate Technology Group began restoring services that did not recover automatically. By 6:25 a.m. PST, over 70 percent of impacted services were restored. By 7:23 a.m. PST, all services were restored except for a small subset of voice customers.
It’s important to note that not all customers and services were impacted.
Timeline
All events listed below took place on November 2nd, 2023, and all times are in PST. Events in bold were provided by Elevate Technology Group (ETG). Events in regular text were provided by a third party.
12:05 a.m. – Flexential team gets an alert that a subset of generators is running.
12:06 a.m. – The site infrastructure manager confirms the site is stable on generator.
12:14 a.m. – The operations on-call engineer arrives on site.
12:15 a.m. – The site infrastructure manager confirms with the utility provider, PGE, that it is having a utility outage from the local substation.
4:40 a.m. – All generators go into cool-down status and are no longer providing power to the facility. The facility is now running on batteries.
4:40 a.m. – A ground fault is detected.
4:40 a.m. – UPS batteries start draining.
4:46 a.m. to 5:01 a.m. – Batteries for each UPS deplete and shut power down.
5:03 a.m. – ETG has first contact with a customer regarding intermittent outages.
5:09 a.m. – Incident response plan is activated, and an internal communication bridge is opened.
5:05 a.m. – Approval received from PGE to use utility feed B.
5:16 a.m. – Site engineers use manual transfer switches to feed mechanical systems, giving them the ability to restore power via generator on feed A.
5:17 a.m. – Generators are manually switched over to feed A.
5:27 a.m. – Flexential sends initial notification to data center customers.
5:32 a.m. – Site is fully powered via generators on feed A.
6:21 a.m. – Notice sent to customers using out-of-band communication.
6:25 a.m. – Approximately 70 percent of ETG’s cloud infrastructure is restored.
6:31 a.m. – ETG confirms networking services are 100 percent online.
7:23 a.m. – All cloud services are restored except for two voice customers.
8:24 a.m. – One voice customer outage remains.
8:44 a.m. – ETG engineer arrives onsite to diagnose the remaining voice customer issue.
10:19 a.m. – Last impacted voice customer’s services are restored.
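For readers who want the elapsed times between key milestones, the timeline above can be turned into simple arithmetic. The following is a minimal sketch, not part of the original report: the event labels and the `minutes_between` helper are hypothetical names chosen for illustration, and the timestamps are copied from the timeline entries, written in 24-hour form.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all on November 2nd, 2023, PST).
# The labels are illustrative names, not terms used in the report.
EVENTS = {
    "generator_alert": "00:05",              # 12:05 a.m. alert that generators are running
    "generators_stop": "04:40",              # 4:40 a.m. generators stop providing power
    "site_repowered": "05:32",               # 5:32 a.m. site fully powered via generators
    "cloud_70_percent": "06:25",             # 6:25 a.m. about 70 percent of cloud restored
    "cloud_restored_except_voice": "07:23",  # 7:23 a.m. all but two voice customers restored
    "last_voice_restored": "10:19",          # 10:19 a.m. last voice customer restored
}


def minutes_between(start_label: str, end_label: str) -> int:
    """Return whole minutes elapsed between two labeled same-day events."""
    fmt = "%H:%M"
    start = datetime.strptime(EVENTS[start_label], fmt)
    end = datetime.strptime(EVENTS[end_label], fmt)
    return int((end - start).total_seconds() // 60)


if __name__ == "__main__":
    pairs = [
        ("generator_alert", "generators_stop"),
        ("generators_stop", "site_repowered"),
        ("generators_stop", "cloud_70_percent"),
        ("generators_stop", "cloud_restored_except_voice"),
        ("generators_stop", "last_voice_restored"),
    ]
    for start, end in pairs:
        print(f"{start} -> {end}: {minutes_between(start, end)} min")
```

Measured this way, the generators ran for 275 minutes between the first alert and their shutdown, the site was fully repowered 52 minutes after the shutdown, and the last voice customer was restored 339 minutes after the shutdown. These figures use the 4:40 a.m. generator shutdown as the starting point, which is an assumption made for this sketch. The overview measures the power-related outage from 4:43 a.m. to 5:23 a.m.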
Root Cause
Elevate Technology Group leases space in a data center (PDX02) run by Flexential in Hillsboro, OR, where it houses its internal and client infrastructure. On November 2nd, 2023, Portland General Electric (PGE), the utility company that services PDX02, had an unplanned maintenance event affecting one of its independent power feeds into the building, which caused PDX02 to lose power on that feed. The data center’s generators started automatically and carried the data center’s load until 4:40 a.m., when a ground fault at the PGE substation caused all ten generators to go into protection mode, which disabled their power output and shut them down. Following this event, most of the battery backups (UPS) drained and shut down power to most of the facility.
Closing Statement and Ongoing Efforts
The impact on Elevate Technology Group’s services was significantly less than the impact on other Flexential customers, owing to the considerable time and research invested in automating its systems. Additional details are pending a meeting between Elevate Technology Group and Flexential engineers to better understand why ETG lost power to three of its twelve power feeds and to learn what ETG is permitted to change in its design to prevent these types of events in the future.
Cloudflare RCA
https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/