On June 12th, 2025, a subset of Golioth platform services, including stream and OTA, experienced an outage that impacted customer devices. While the degraded availability of the Golioth platform was linked to outages on other cloud platforms, we take full responsibility for the impact on our customers. We are deeply sorry, and we will be taking steps to eliminate single points of failure that were exposed by this incident.
Timeline
The first signs of disruption were observed by the Golioth engineering team at 18:10 UTC when metric absence alerts fired on our monitoring notification channels. These alerts are an indication that the platform may be experiencing issues that are not triggering alerts due to degraded observability. We confirmed at this time that we were unable to access the console and APIs of our metrics and alerting provider, significantly reducing our platform visibility. At this point in time, the Golioth console remained fully functional, and devices in our test fleet remained connected to the platform.
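To illustrate the idea behind a metric absence alert, here is a minimal sketch in Python. It is not Golioth's alerting configuration: the `AbsenceDetector` class, the metric names, and the 120-second threshold are all hypothetical. The point is only that the alert fires on silence from a metric, rather than on a bad value.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AbsenceDetector:
    """Flags metrics that have not reported within `max_silence_s` seconds.

    Hypothetical helper for illustration only.
    """

    max_silence_s: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, metric: str, now: Optional[float] = None) -> None:
        # Remember the last time a sample arrived for this metric.
        self.last_seen[metric] = time.monotonic() if now is None else now

    def absent(self, now: Optional[float] = None) -> List[str]:
        # A metric is "absent" when its last sample is older than the threshold.
        now = time.monotonic() if now is None else now
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.max_silence_s
        )


# Assumed metric names and timestamps (seconds), chosen for the example.
detector = AbsenceDetector(max_silence_s=120.0)
detector.record("stream_ingest_total", now=0.0)
detector.record("ota_downloads_total", now=100.0)
print(detector.absent(now=150.0))  # prints ['stream_ingest_total']
```

An alert of this kind only helps if the check runs somewhere independent of the metrics pipeline it is watching.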
We quickly verified that we still had management access to our compute clusters, allowing us to access logging and metrics directly from our services. At this point we observed issues with streaming data ingestion and processing, while other services appeared to remain stable. We took action to update status.golioth.io at 18:14 UTC in order to notify customers of the degraded stream processing functionality. We also began exercising the OTA service to determine whether devices were able to access and download assets. Unfortunately, our status page was also impacted by provider outages.
After identifying issues with the OTA service, and observing status page update failures, we posted on the Golioth forum at 18:51 UTC to inform customers of the scope of the outage. At this point in time, underlying service providers still reported no issues, though our knowledge of the Golioth platform architecture led us to believe that a failure to authenticate to managed service offerings was the root cause of the outage.
As we started to receive information about the outage from service providers, we began working towards diverting traffic to alternative managed and self-hosted service offerings. During this time, we continued to provide updates on the forum. We observed OTA operations beginning to recover shortly before 20:16 UTC. By 20:28 UTC, errors emitted from streaming data ingestion began to subside. Golioth’s platform architecture is designed to handle degraded stream processing gracefully. Because of this, as the service is recovering, newly ingested data may experience slight delays in delivery as data that was ingested while the service was unavailable is processed. By 21:00 UTC, we were able to verify that any heightened delivery latency had subsided.
Remediation
When an incident occurs, our most critical responsibility to our customers is communication. Given this, our inability to immediately update our status page with accurate and timely information is especially unfortunate. While the likelihood of simultaneous outages of multiple service providers is low, incidents such as this one have revealed brittle infrastructure in our status reporting machinery. We will be taking immediate steps to provide a more robust offering.
In a similar vein, the loss of metrics visualization and alerting infrastructure during this outage was disappointing. Our ability to directly access services, combined with our familiarity with visualization tooling that can be run locally, allowed us to obtain a fairly clear picture of the scope of the incident. Ideally, however, this process would be much faster, and alerting infrastructure would automatically fail over from one provider to another. Fortunately, our instrumentation of metric absence alerts still allowed us to respond quickly to this incident.
While we were pleased that the impact of managed services outages remained isolated to the components of the Golioth platform that directly interact with those services, stream and OTA are critical for our customers and any downtime can cause significant negative impact on their products. As previously mentioned, when issues were observed, we began to take steps towards moving stream and OTA traffic to alternative infrastructure.
While all Golioth services are thoughtfully designed to be agnostic of a particular managed infrastructure offering, the provisioning of alternative infrastructure and the reconfiguration of services to target it is not a fully automated process. Because of this, we made a calculated decision not to divert traffic in the early stages of this incident. This process should be fully automated, and there should be a seamless transition from one infrastructure provider to another. There are both short- and long-term action items that we will be taking to improve our posture in this area.
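As a rough picture of what an automated transition between providers could look like at the call site, the following Python sketch tries an ordered list of backends and falls back when one fails. Everything here is an assumption made for illustration: `store_with_failover`, `AllBackendsFailed`, and the two stand-in backends are hypothetical and do not describe Golioth's services or any provider's API.

```python
from typing import Callable, List, Sequence, Tuple

# A backend is a (name, put) pair; `put` raises on failure. Hypothetical shape.
Backend = Tuple[str, Callable[[bytes], None]]


class AllBackendsFailed(Exception):
    """Raised when no configured backend accepted the payload."""


def store_with_failover(payload: bytes, backends: Sequence[Backend]) -> str:
    """Try each backend in order and return the name of the first that succeeds."""
    errors: List[str] = []
    for name, put in backends:
        try:
            put(payload)
            return name
        except Exception as exc:  # collect the failure and try the next backend
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


# Stand-in backends: the managed one fails to authenticate, the other works.
def managed_put(payload: bytes) -> None:
    raise PermissionError("authentication failed")


self_hosted: List[bytes] = []

used = store_with_failover(
    b"sensor-reading",
    [("managed", managed_put), ("self-hosted", self_hosted.append)],
)
print(used)  # prints self-hosted
```

A real system would also need health checks and a way to move traffic back once the primary recovers; this sketch covers only the fallback path.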
Upon service recovery, Golioth stream processing infrastructure began to churn through any data that had been successfully ingested during, or immediately prior to the beginning of, the incident. As expected, our services and compute clusters automatically scaled horizontally to handle this temporary period of increased load, and previously implemented flow controls ensured that services were not overwhelmed during this period. However, we were able to identify a few minor improvements with respect to load balancing and scalability, which could have expedited the recovery process in this instance. Some of the changes have already been deployed to production since yesterday’s incident.
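The interaction between a backlog and flow control can be shown with a small Python sketch. It is a simplified model, not Golioth's stream processing code: the `drain` function, the record names, and the fixed per-tick capacity are hypothetical. It shows why records ingested after recovery wait behind the backlog when processing is first in, first out and capped per interval.

```python
from collections import deque
from typing import Deque, Iterable, List


def drain(records: Iterable[str], per_tick: int) -> List[List[str]]:
    """Process queued records in FIFO order, at most `per_tick` per tick.

    Returns the batch handled in each tick. Hypothetical model for illustration.
    """
    if per_tick <= 0:
        raise ValueError("per_tick must be positive")
    queue: Deque[str] = deque(records)
    ticks: List[List[str]] = []
    while queue:
        # Flow control: never take more than `per_tick` records at once.
        batch = [queue.popleft() for _ in range(min(per_tick, len(queue)))]
        ticks.append(batch)
    return ticks


# Three records queued during the outage, then one that arrived after recovery.
print(drain(["old-1", "old-2", "old-3", "new-1"], per_tick=2))
# prints [['old-1', 'old-2'], ['old-3', 'new-1']]
```

In this model, raising the per-tick capacity (for example by scaling out workers) shortens the wait for new data, which matches the role horizontal scaling plays in the recovery described above.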
Looking Forward
In the hours since this incident, we have been able to identify key areas of improvement. We will continue working through our remediation strategy, and we will be evaluating larger scale architectural changes to our multi-tenant SaaS offering to reduce the impact of any future service provider outages. We take the reliability of the Golioth platform very seriously, and we welcome any inquiries from individual customers regarding the impact of the outage on their device fleets. We apologize again for the disruption and look forward to offering improved availability moving forward.