Justin Hicks | November 06, 2025 | 3 min read

When the Cloud Crashes: Lessons from the AWS and Azure Outages


Executive Summary

In October 2025, two of the world's largest cloud service providers, Amazon Web Services (AWS) and Microsoft Azure, suffered major outages nine days apart. On October 20, a failure in the Domain Name System (DNS), the system that translates web addresses into IP addresses, disrupted AWS's US-EAST-1 region. On October 29, Azure customers worldwide faced widespread sign-in and access issues caused by a global service disruption.

Together, these events served as a wake-up call for organizations that rely heavily on the public cloud, revealing that even the biggest names in cloud computing can go down.

For financial institutions and other critical sectors, now is the time to reassess cloud dependency, review business continuity plans, and strengthen resilience strategies. Cloud convenience does not replace redundancy.

 

Who Can Be Affected?

Any organization that hosts critical systems, data, or operations in the cloud — especially those relying on a single provider or region — is vulnerable to disruption.

Even industries with strong uptime expectations, such as banking, healthcare, and government services, felt the effects of the October AWS outage. Major platforms like Snapchat, Fortnite, Coinbase, Halifax Bank, and the Bank of Scotland were all impacted when AWS's DNS failure took key infrastructure offline. Less than two weeks later, Azure customers worldwide experienced authentication and connectivity issues that left users unable to log in or access applications.

These incidents underscore that no cloud provider is infallible, no matter how large or sophisticated.

 

How Does This Threat Work?

Both incidents illustrate how complex, interconnected systems can fail — and how those failures cascade quickly.

  • AWS US-EAST-1 outage: A DNS failure in one of AWS's busiest regions caused services across multiple industries to go dark for hours.
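  • Azure outage: Microsoft attributed the incident to an inadvertent configuration change in Azure Front Door, its global edge and content delivery service. The resulting failures blocked sign-ins and access to Microsoft Entra ID (formerly Azure Active Directory) and related applications worldwide.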

 

For many organizations, the issue wasn't just downtime — it was dependency. If your environment is built on a single cloud provider or region, even a brief service interruption can disrupt customer access, transaction processing, and critical internal operations.

 

Increased Probability

Cloud outages are not new, but the frequency and scale of recent incidents are reminders that "rare" events can and do happen. As cloud infrastructure becomes more centralized and interconnected, the blast radius of a single point of failure increases.

High availability, once considered a built-in benefit of the cloud, now depends on intentional architecture: redundancy across regions, hybrid strategies that blend on-premises and cloud workloads, and well-tested failover plans.

Organizations that assume "99.99% uptime" means total reliability may be unprepared for the real-world implications of even a few hours of downtime.
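To put that number in perspective, an availability percentage can be converted into a yearly downtime budget. The short Python sketch below does the arithmetic; the function name and the list of targets are illustrative choices, not figures from any provider's service agreement.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, that an availability target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets only; check your own contracts for the real numbers.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.99%, the budget works out to roughly 52.6 minutes per year, so a single three-hour outage (180 minutes) would use more than three years' worth of that allowance in one afternoon.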

 

What Can You Do?

 

Diversify Cloud Deployment

Avoid hosting all workloads in a single region or provider. Deploying across multiple regions or cloud providers can help maintain availability if one platform experiences an outage.
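As a rough illustration of what failover logic looks like, the Python sketch below walks a priority-ordered list of endpoints and returns the first one that passes a health check. The endpoint URLs, the function names, and the simulated health check are all hypothetical; a real deployment would typically rely on DNS failover or a load balancer rather than application code like this.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose health check passes, in priority order."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # Treat a failing probe (timeout, DNS error) as an unhealthy endpoint.
            continue
    return None  # Nothing is reachable; fall back to manual procedures.


# Hypothetical endpoints: primary region, secondary region, second provider.
ENDPOINTS = [
    "https://app.us-east-1.example.com",
    "https://app.us-west-2.example.com",
    "https://app.other-cloud.example.com",
]


def simulated_check(endpoint: str) -> bool:
    """Stand-in for a real probe; pretends the primary region is down."""
    return "us-east-1" not in endpoint


print(pick_healthy_endpoint(ENDPOINTS, simulated_check))
```

With the primary region simulated as down, the function returns the secondary region. If every check fails it returns None, which is the case your business continuity plan needs to cover.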

 

Test Your Recovery Objectives

Define and routinely test your recovery time objectives (RTO) and recovery point objectives (RPO). Simulate downtime events to ensure recovery plans work in practice, not just on paper.
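One simple way to score a recovery drill is to compare what actually happened against the objectives. The Python sketch below assumes you record three timestamps during the exercise; the class, field names, and sample values are hypothetical and only illustrate the comparison.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    outage_start: datetime      # when the simulated outage began
    service_restored: datetime  # when the service was usable again
    last_good_backup: datetime  # most recent recoverable data before the outage


def evaluate_drill(result: DrillResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare measured recovery time and data-loss window with the RTO and RPO."""
    recovery_time = result.service_restored - result.outage_start
    data_loss_window = result.outage_start - result.last_good_backup
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical drill: outage at 09:00, restored at 11:30, last backup at 08:45.
drill = DrillResult(
    outage_start=datetime(2025, 11, 6, 9, 0),
    service_restored=datetime(2025, 11, 6, 11, 30),
    last_good_backup=datetime(2025, 11, 6, 8, 45),
)
report = evaluate_drill(drill, rto=timedelta(hours=2), rpo=timedelta(minutes=30))
print(report["rto_met"], report["rpo_met"])
```

In this made-up drill, the 15-minute data-loss window is inside a 30-minute RPO, but the 2.5-hour recovery misses a 2-hour RTO. That is exactly the kind of gap a test should surface before a real outage does.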

 

Strengthen Backup Connectivity

Ensure your business can still operate if a cloud service is disrupted. This may include on-premises backups, local failover servers, or alternative authentication methods.

 

Evaluate Vendor Dependencies

Review your vendor ecosystem for hidden dependencies. Some "on-premises" applications rely on cloud-based services for updates or authentication without your knowledge.
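If you keep even a basic inventory of which systems depend on which, hidden cloud dependencies can be found by following the chain. The Python sketch below uses an invented dependency map and an invented "(cloud)" naming convention purely for illustration; your own inventory will look different.

```python
from typing import Dict, List, Set

# Hypothetical inventory: each system maps to the things it directly depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "teller-workstations": ["core-banking (on-prem)"],
    "core-banking (on-prem)": ["license-server", "sso-gateway"],
    "license-server": ["vendor-update-api (cloud)"],
    "sso-gateway": ["identity-provider (cloud)"],
    "vendor-update-api (cloud)": [],
    "identity-provider (cloud)": [],
}


def transitive_dependencies(system: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return everything the system depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(system, []))
    while stack:
        dependency = stack.pop()
        if dependency not in seen:
            seen.add(dependency)
            stack.extend(graph.get(dependency, []))
    return seen


def cloud_dependencies(system: str, graph: Dict[str, List[str]]) -> List[str]:
    """List the cloud-hosted services a system ultimately relies on."""
    return sorted(
        d for d in transitive_dependencies(system, graph) if d.endswith("(cloud)")
    )


print(cloud_dependencies("teller-workstations", DEPENDS_ON))
```

In this example, the teller workstations never talk to the cloud directly, yet they still depend on two cloud services through the licensing and single sign-on layers.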

 

Build a Culture of Resilience

Technology alone won't solve the problem. Educate leadership and teams on resilience planning, incident response, and business continuity practices. Resilience is not just an IT function — it's an organizational mindset.

 

Don't Wait Until It's Too Late

The October AWS and Azure outages are a reminder that even the most reliable technology has limits. The goal isn't to eliminate downtime entirely but rather to minimize impact and ensure continuity when downtime happens.

Organizations that proactively test failover plans, diversify infrastructure, and plan for the unexpected will weather the next outage far better than those that assume the cloud will always be up.

A little redundancy today can save a lot of explaining tomorrow.


Justin Hicks

Justin Hicks is an information security consultant for SBS CyberSecurity, where he works with organizations to strengthen their information security programs and enhance their overall cybersecurity posture. Justin has more than 19 years of experience in system administration and information security analysis, providing a deep technical foundation for understanding and mitigating evolving cyber threats. He holds several industry-leading certifications, including Certified Information Systems Security Professional (CISSP), Certified Information Systems Auditor (CISA), Certified Information Security Manager (CISM), and GIAC Security Essentials Certification (GSEC). Justin is passionate about the financial industry and dedicated to helping financial institutions protect their data, maintain compliance, and build resilient security programs that safeguard their customers and communities.
