The fragile web: 2025's lessons on uptime, reality, and engineering rigor


Towards a philosophical reset for SREs

If you work in IT operations or IT leadership, you likely spent at least one weekend in 2025 huddled over a laptop while the rest of the world slept. For the last decade, our industry has pursued five nines (99.999% uptime) as the holy grail. We architected redundant systems, deployed across multiple availability zones, and optimized our code until it hummed. We convinced ourselves that if we just engineered hard enough, we could tame the chaos of the internet. We thought we could. We really did. But 2025 was the year the internet pushed back. With brutal clarity, this year demonstrated that the internet can never really be under our control.

The internet is a sprawling, breathing, interdependent organism held together by BGP routes, DNS propagations, API handshakes, content delivery networks, internet service providers, and, on top of it all, human actions, many of them taken by third parties around the globe whom we will never meet. When the underlying fabric of this digital world buckles, like a configuration error in the US rippling out to a mobile user in Singapore, your application goes belly up, your phone starts to buzz, and the rest of the workflow unfolds in a sequence you know as well as the back of your hand.

As we move on from 2025, it feels like a good time for a philosophical reset. The goal for 2026 may no longer be "perfect" uptime, because, after all these years, we know that perfection in a distributed system is a mirage. The goal must be to achieve a state of anti-fragility: the ability to improve under dire circumstances through continuous learning and corrective action. We can go further when we stop asking questions like, "How do we prevent failure?" and start asking, "How do we bounce back before the end users hit the rage emojis?" and "How can we avoid failing the same way twice, fail a different way instead, and learn a better lesson?" As you read this blog, sit back, take stock philosophically, and find ways to act with the engineering rigor that is non-negotiable in IT.

The events: A timeline of fragility

2025 was not defined by a single catastrophic event but by a cascading series of failures that exposed the hidden dependencies of the modern web. From global cloud giants to security-induced lockdowns, the year showed us that complexity is now an ever-present dragon that everyone must navigate. Below is a summary of the major internet incidents that defined our year, in chronological order. Please note that these are not listed to assign blame, but to highlight the shared reality in which we all operate.

Top 2025 internet incidents, causes, and impacts

  • June 12, 2025, Google Cloud IAM lockout:
    Cause: An invalid automated update to the Identity and Access Management (IAM) system.
    Impact: Services were technically up, but users could not log in. This "zombie uptime" (running but inaccessible) confused standard monitoring tools that only checked for HTTP 200 OK statuses.
  • Oct. 20, 2025, AWS US-EAST-1:
    Cause: A DynamoDB subsystem update triggered a latent bug, causing failures in the internal DNS plane.
    Impact: A 15-hour disruption that affected nearly 20% of the web, halting logistics, IoT devices, and major SaaS platforms.
  • Oct. 29, 2025, Microsoft Azure global outage:
    Cause: A configuration change to Azure Front Door (AFD) propagated incorrectly, creating a routing loop.
    Impact: An 8-hour blackout for Microsoft 365, Teams, and Xbox that stalled global enterprise productivity as emails and files became inaccessible.
  • Nov. 18, 2025, Cloudflare network event:
    Cause: A software update to the bot management module caused a spike in 5xx errors due to resource exhaustion.
    Impact: Critical platforms, such as X (formerly Twitter) and ChatGPT, went offline. It highlighted how reliant the modern web is on a handful of CDN providers for security and routing.
  • Other supply chain breaches:
    Cause: Exploitation of third-party software vulnerabilities.
    Impact: A recorded 34% spike in vulnerability exploits targeted the tools SREs use to manage their own stacks, turning trusted software into attack vectors.

The explanation: Why the internet breaks

Despite billions of dollars in investment, why does the internet still feel so fragile? Here’s a reality check:

1. The hidden dependency crisis

In the 2020s, no application is an island. A typical e-commerce checkout flow might rely on a payment gateway (Stripe, PayPal), a shipping calculator (FedEx API), a tax calculator (Avalara), and a CDN (Cloudflare, Akamai). If any one of these fails, the user perceives your site as broken. You cannot control a third-party API with your own SLA, yet your users still hold you responsible for its performance.
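
To make that concrete, here is a minimal sketch of a parallel dependency probe a checkout service might run. The dependency names and status URLs are hypothetical; a real check would point at your vendors' actual health or status endpoints.

```python
import concurrent.futures
import urllib.request

# Hypothetical status endpoints for the third-party services a checkout flow depends on.
DEPENDENCIES = {
    "payments": "https://status.example-payments.com/health",
    "shipping": "https://status.example-shipping.com/health",
    "tax": "https://status.example-tax.com/health",
}

def probe(name, url, timeout=2.0):
    """Return (dependency name, reachable?) without letting one slow vendor block the rest."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return name, 200 <= resp.status < 300
    except Exception:
        return name, False

def dependency_report():
    # Probe every dependency in parallel so the check itself stays fast.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(DEPENDENCIES)) as pool:
        futures = [pool.submit(probe, name, url) for name, url in DEPENDENCIES.items()]
        return dict(f.result() for f in concurrent.futures.as_completed(futures))

if __name__ == "__main__":
    for name, healthy in dependency_report().items():
        print(f"{name}: {'OK' if healthy else 'DEGRADED -- fail over or degrade gracefully'}")
```

The point is less the code than the posture: every external dependency gets a timeout, a fallback, and a place on your dashboard.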

2. Configuration is the new code

Notice the root causes in the list above? None of them were server fires or hard drive failures. They were all software and configuration issues. As we move toward Infrastructure as Code (IaC), a bad configuration push is just as destructive as a bad code deployment, but it often propagates faster. Automation has not just advanced our ability to grow at scale; it has also advanced our ability to break things at scale.
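
One countermeasure is to give configuration pushes the same pre-flight checks as code deployments. Below is a minimal sketch of a guardrail that refuses to propagate a change touching too much of a config at once; the 10% threshold, the file names, and the flat JSON layout are assumptions for illustration, not a prescription.

```python
import json

# Guardrail threshold is an illustrative assumption, not a vendor recommendation.
MAX_CHANGED_FRACTION = 0.10  # refuse to push a change touching more than 10% of entries at once

def diff_fraction(current, proposed):
    """Fraction of keys whose value changes (or that are added/removed) between two configs."""
    keys = set(current) | set(proposed)
    changed = sum(1 for k in keys if current.get(k) != proposed.get(k))
    return changed / max(len(keys), 1)

def validate_push(current_path, proposed_path):
    with open(current_path) as f:
        current = json.load(f)
    with open(proposed_path) as f:
        proposed = json.load(f)

    if not proposed:
        raise SystemExit("Refusing to push: proposed config is empty.")
    fraction = diff_fraction(current, proposed)
    if fraction > MAX_CHANGED_FRACTION:
        raise SystemExit(
            f"Refusing to push: {fraction:.0%} of entries change; stage this rollout instead."
        )
    print(f"OK to push: {fraction:.0%} of entries change.")
```

Wired into a CI pipeline (for example, validate_push("routes.live.json", "routes.proposed.json")), a check like this turns "fat-finger propagated globally" into "pipeline blocked the push."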

3. The complexity of MTTI—mean time to innocence

When an outage hits, the most stressful phase is not fixing it; it is finding it. In 2025, SRE teams reported wasting hours just trying to answer the question: "Is it us, or is it the cloud provider?" Without deep observability, teams tore apart their own perfectly functioning code while the actual issue lay in a fiber cut thousands of miles away or a DNS resolution error at the ISP level.
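
A lightweight first-response triage script can shorten that "is it us?" phase. The sketch below assumes hypothetical hostnames and provider status URLs; swap in your own service and whatever status pages your providers actually publish.

```python
import socket
import urllib.request

# Hypothetical endpoints; substitute your own service and your providers' status pages.
OWN_HEALTH_URL = "https://app.example.com/healthz"
PROVIDER_STATUS_URLS = [
    "https://status.cloud-provider.example/status.json",
    "https://status.cdn-provider.example/status.json",
]

def reachable(url, timeout=3.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except Exception:
        return False

def triage(hostname="app.example.com"):
    # 1. Can the name even be resolved from here?
    try:
        socket.gethostbyname(hostname)
    except OSError:
        return "DNS resolution failing: suspect DNS provider or ISP, not your code."
    # 2. Is our own health endpoint serving?
    if reachable(OWN_HEALTH_URL):
        return "Own health endpoint is fine: suspect a regional or client-side issue."
    # 3. Are the upstream providers admitting to an incident?
    if any(not reachable(url) for url in PROVIDER_STATUS_URLS):
        return "A provider status page is unreachable: likely an upstream incident."
    return "Providers look healthy and our endpoint is down: it is probably us."

if __name__ == "__main__":
    print(triage())
```

Even a crude ruling like this, printed into the incident channel in the first minute, keeps engineers from tearing apart healthy code.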

The rationalization: Accepting the new normal

If we accept that we cannot control the internet, how do we proceed? We proceed by changing our metrics and our mindset. We must move away from vanity metrics. Server uptime percentages are irrelevant if your users cannot log in due to an IAM failure. Page load time averages can be misleading if 5% of your users in a specific region experience timeouts.

We need SMART goals for reliability

  • Specific: Do not target uptime. Target checkout availability.
  • Measurable: Use service-level objectives (SLOs) that align with user pain points, not server health.
  • Achievable: Acknowledge that 100% is impossible. Aim for 99.9% and use the remaining 0.1% (your error budget) to experiment and ship faster.
  • Relevant: Prioritize the questions that matter, such as "Does this metric impact revenue or brand reputation?"
  • Time-bound: Measure reliability over rolling windows (e.g., 28 days) to smooth out short-term anomalies, spot long-term degradation, and account for periods when IT resources are strained, such as holiday shopping seasons or calamities. (A quick error-budget calculation over such a window follows this list.)
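
For instance, here is the arithmetic behind a 99.9% SLO measured over a 28-day rolling window. The numbers are illustrative; real error-budget accounting would pull "bad minutes" from your SLO tooling rather than a hard-coded value.

```python
# Minimal error-budget math for a 99.9% SLO over a 28-day rolling window (illustrative numbers).
SLO_TARGET = 0.999
WINDOW_MINUTES = 28 * 24 * 60  # 40,320 minutes in a 28-day window

def error_budget_minutes(slo=SLO_TARGET, window=WINDOW_MINUTES):
    """Total minutes of 'allowed' unavailability in the window."""
    return (1.0 - slo) * window

def budget_remaining(bad_minutes):
    """Fraction of the error budget still unspent (negative means the SLO is blown)."""
    budget = error_budget_minutes()
    return (budget - bad_minutes) / budget

if __name__ == "__main__":
    print(f"Budget: {error_budget_minutes():.1f} minutes per 28 days")        # about 40.3 minutes
    print(f"After a 15-minute incident: {budget_remaining(15):.0%} of budget left")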

This philosophical shift frees us. It allows us to stop panicking every time a graph dips so we can start focusing on what actually matters: resilience.

Best practices for SREs to follow in 2026

We cannot prevent the hurricanes of the internet, but we can build houses that withstand them. Drawing on the hard lessons of 2025, here are the engineering practices that separate fragile stacks from robust ones.

1. Diversify your critical paths

The all-in-one-cloud strategy is passé. While you do not need to go fully multi-cloud for its own sake (that adds complexity of its own), you must have a failover plan for critical dependencies.

  • If you use a CDN, have a break-glass procedure in place to route traffic directly to the origin or through a secondary backup CDN (a minimal sketch of such a procedure follows this list).
  • Ensure your DNS has a secondary provider or a long Time-to-Live (TTL) failover strategy.
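
As a rough illustration of that break-glass idea, the sketch below probes the CDN-fronted endpoint and, if it is failing, points the record back at the origin. The hostnames, the documentation-range IP, and the update_dns_record() placeholder are all assumptions; in practice that call would hit your DNS provider's (or secondary provider's) API.

```python
import urllib.request

# Hypothetical values; update_dns_record() stands in for your DNS provider's API call.
CDN_URL = "https://www.example.com/healthz"   # traffic served through the CDN
ORIGIN_IP = "203.0.113.10"                    # documentation-range IP for the origin
FAILOVER_HOSTNAME = "www.example.com"

def cdn_healthy(url=CDN_URL, timeout=3.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except Exception:
        return False

def update_dns_record(hostname, target_ip, ttl=60):
    # Placeholder: call your DNS provider's (or secondary provider's) API here.
    print(f"[break-glass] Pointing {hostname} -> {target_ip} with TTL {ttl}s")

def break_glass():
    """Route traffic straight to the origin when the CDN path is failing."""
    if not cdn_healthy():
        update_dns_record(FAILOVER_HOSTNAME, ORIGIN_IP)
    else:
        print("CDN path healthy; no failover needed.")
```

The important part is that this procedure exists, is rehearsed, and can be triggered without a war-room debate about whether you are allowed to bypass the CDN.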

2. Adopt outside-in monitoring

Your servers live in a data center; your users live in the real world. Monitoring your CPU usage tells you nothing about the user who is accessing your site from a slow 5G connection in London. Implement digital experience monitoring (DEM). Synthetically simulate user journeys (login, search, checkout) from global locations every five minutes. This alerts you to regional outages before your real users even become aware of them.
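
A basic synthetic journey can be as simple as the sketch below, scheduled every five minutes from probes in different regions. The URLs and journey steps are hypothetical, and a real check would also submit forms and assert on response content, not just the status code.

```python
import time
import urllib.request

# Hypothetical journey steps; a production synthetic check would also assert on page content.
JOURNEY = [
    ("login page", "https://app.example.com/login"),
    ("search", "https://app.example.com/search?q=status"),
    ("checkout", "https://app.example.com/checkout"),
]

def run_journey(region):
    """Time each step of a user journey; run this from several regions every five minutes."""
    for step, url in JOURNEY:
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=5.0) as resp:
                ok = resp.status == 200
        except Exception:
            ok = False
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"[{region}] {step}: {'OK' if ok else 'FAIL'} in {elapsed_ms:.0f} ms")

if __name__ == "__main__":
    run_journey(region="eu-west")
```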

3. Converge security and observability

The 2025 outages demonstrated that performance issues and security breaches often appear the same at first glance (e.g., a DDoS attack resembles a traffic spike; ransomware encryption resembles high disk I/O). Stop treating information security and IT operations as silos. Your observability tool should be able to correlate a spike in latency with a spike in blocked firewall requests.
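
Even a crude correlation between those two signals is enough to trigger a joint triage. The toy example below (illustrative numbers; Python 3.10+ for statistics.correlation) correlates per-minute p95 latency with requests blocked by the WAF.

```python
from statistics import correlation  # available in Python 3.10+

# Illustrative per-minute series: p95 latency (ms) and requests blocked by the WAF.
latency_ms = [120, 118, 125, 130, 410, 620, 590, 140, 122, 119]
blocked_reqs = [12, 10, 15, 14, 480, 900, 870, 20, 11, 13]

# A strong positive correlation suggests the "performance" incident may actually be an attack
# being absorbed by the firewall, so ops and security should triage it together.
r = correlation(latency_ms, blocked_reqs)
print(f"Latency vs. blocked requests correlation: {r:.2f}")
```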

4. Automate the boring remediation

You cannot scale reliability with human hours. If a known issue (like a full disk or a hung process) wakes an engineer up at 3am, that is a failure of automation. Use AIOps to detect anomalies and trigger automated runbooks. If a server is non-responsive, the system should attempt a restart and capture logs before paging a human. Use event correlation to look past the red herrings, applying machine learning to perform causal analysis across correlated probable root causes.
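
In its simplest form, such a runbook looks like the sketch below: check health, snapshot logs, restart, and only page a human if the restart does not help. The service name, health endpoint, and the systemd/journalctl commands are assumptions about the host; adapt them to your own stack and paging integration.

```python
import subprocess
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"   # hypothetical service endpoint
SERVICE = "example-app"                        # hypothetical systemd unit name

def healthy(timeout=3.0):
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def capture_logs():
    # Snapshot recent logs *before* restarting, so the evidence survives the remediation.
    with open("/tmp/example-app-incident.log", "wb") as f:
        subprocess.run(["journalctl", "-u", SERVICE, "-n", "500"], stdout=f, check=False)

def page_human(reason):
    print(f"PAGE ON-CALL: {reason}")  # placeholder for your paging integration

def remediate():
    if healthy():
        return
    capture_logs()
    subprocess.run(["systemctl", "restart", SERVICE], check=False)
    if not healthy():
        page_human(f"{SERVICE} still unhealthy after automated restart")

if __name__ == "__main__":
    remediate()
```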

5. Data tiering for cost control

Observability costs skyrocketed in 2025 and will continue to rise in 2026. Go essentialist, if not minimalist. Logging everything is no longer economically viable. Adopt a strategy where you keep high-fidelity data for three days (for immediate debugging) and aggregate or sample data for 30+ days (for trend analysis). This keeps your budget in check without blinding you.
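
A tiering policy can start as something as small as the sketch below, which keeps the last three days of raw metric points and rolls everything older up into hourly averages. The three-day cutoff mirrors the strategy above; the data shape is an arbitrary assumption for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

RAW_RETENTION = timedelta(days=3)   # keep full-fidelity points for immediate debugging

def tier(points, now):
    """Split (timestamp, value) metric points into recent raw data and hourly rollups of older data."""
    recent, hourly = [], defaultdict(list)
    for ts, value in points:
        if now - ts <= RAW_RETENTION:
            recent.append((ts, value))
        else:
            hourly[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    rollups = {hour: sum(values) / len(values) for hour, values in hourly.items()}
    return recent, rollups
```

The same idea applies to logs and traces: full fidelity while the incident is fresh, summaries once it is history.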

How ManageEngine Site24x7 can help

To navigate the vagaries of the web, you need an observability partner that sees the whole picture, grows with you, and supports you through it all: a platform that not only shows you green lights but also provides the context to understand the red ones. ManageEngine Site24x7 is built for modern IT, having evolved from simple monitoring into a full-stack, AI-powered observability platform.

  • We tell you where: With our global network of monitoring nodes, we test your uptime from where your users are, not just where your servers are. We help you pinpoint if the issue is in your code, your cloud provider, or the local ISP.
  • We tell you why: Our unified agent connects the dots. We correlate your application traces, server logs, and network packets into a single view. No more tab-switching during a crisis.
  • We help you act: Our IT automation capabilities allow you to heal your infrastructure automatically. Whether it is clearing a cache or restarting a container, Site24x7 can handle the routine tasks so your engineers can focus on the complex ones.
  • We respect your budget: With flexible data retention and a unified licensing model, we help you achieve complete visibility without the "observability tax" that other vendors charge.

The internet of 2026 may break. It may stutter. It may surprise us. But with Site24x7, rest assured that you will not be left in the dark. Try ManageEngine Site24x7 today.
