Amazon.com’s cloud service returned to normal operations late Monday afternoon, the company announced, following a widespread internet outage that caused global turmoil for thousands of sites, including popular apps like Snapchat and Reddit. Amazon Web Services (AWS) later noted that some services still had a backlog of messages that would take a few more hours to fully process.
AWS, the world’s largest cloud provider, supplies computing power, data storage, and other essential digital services to companies, governments, and individuals worldwide. The disruption affected workers from London to Tokyo, halting everyday tasks such as processing payments via Venmo or using the video calling service Zoom.
Experts noted that this was the largest internet disruption since last year’s CrowdStrike malfunction, which disabled technology systems in hospitals, banks, and airports. The incident once again highlighted the vulnerability of the world’s interconnected, and often fragile, digital infrastructure.
The root cause and vulnerable data center
The outage marked at least the third major internet disruption in five years linked to AWS’s northern Virginia cluster, known as US-EAST-1. AWS said the outage originated at this location, its oldest and largest for web services, which also suffered outages in 2020 and 2021. According to AWS documentation, US-EAST-1 is the default region for many services.
Amazon did not immediately address why this particular data center remains so vulnerable.
AWS stated that the root cause of the outage was the failure of an underlying subsystem that monitors the health of its network load balancers, which distribute traffic across servers. The issue originated within the “EC2 internal network,” part of Amazon’s Elastic Compute Cloud (EC2) service. This failure prevented applications from finding the correct address for the API of DynamoDB, a critical cloud database service used to store user information and other essential data.
Expert criticism and global impact
Computer science professor Ken Birman of Cornell University emphasized the need for developers to build better fault tolerance. He noted that while AWS offers tools to protect against such problems, many companies do not use them fully and do not set up backups with other cloud providers.
“When people cut costs and cut corners to try to get an application up, and then forget that they skipped that last step and didn’t really protect against an outage, those companies are the ones who really ought to be scrutinized later,” Birman told Reuters.
Jake Moore, a global cybersecurity advisor at ESET, concurred, stating that the incident “once again highlights the dependency we have on relatively fragile infrastructures.”
Ookla, which owns Downdetector, reported that over 4 million users and at least a thousand companies were affected globally.
Affected U.S. apps and platforms included:
Social Media/Apps: Reddit, Snapchat, Duolingo, and Signal
Financial/Trading: Venmo, cryptocurrency exchange Coinbase, and trading app Robinhood
Gaming: Roblox, Fortnite, Clash Royale, and Clash of Clans
Rideshare: Uber competitor Lyft
Amazon’s Own Services: Amazon’s shopping website, Prime Video, and Alexa
For major businesses, hours of cloud downtime translate to millions in lost productivity and revenue. Despite the operational turmoil, Amazon’s shares rose 1.6% to $216.48, a sign that Wall Street was largely unfazed by the disruption.