Those old enough to remember the turn of the last century recall the global almost-disaster that nearly occurred at the stroke of midnight on December 31, 1999. Known as Y2K, it stemmed from a widely used programming shortcut in which many programs stored only two digits instead of four to indicate the year (e.g., 99 rather than 1999). There were enormous fears that computers and programs wouldn't be able to operate once the calendar flipped and the year became "00".

Prognosticators at the time envisioned chaos in the streets and various end-of-days scenarios. They surmised that the work needed to remedy the problem was too vast, too costly, and too manual to surmount in time for New Year’s Eve. Unlike most disasters, Y2K provided technologists with ample time to avoid catastrophe, and in the end, tragedy was indeed averted.

And Then There Was Last Week…

Fast forward more than two decades, to last week, when a spate of seemingly avoidable certificate and configuration problems left vast swaths of heavily trafficked websites out of commission for hours on end. On Thursday, September 30th, reports of outages at Amazon, Google, Microsoft, and Cisco began to surface. The culprit, it seemed, was a long-scheduled event: the expiration of an aging root certificate (DST Root CA X3) in the chain of trust used by certificate authority Let's Encrypt. The group, according to Wikipedia, is "a non-profit certificate authority run by Internet Security Research Group that provides X.509 certificates for Transport Layer Security encryption at no charge."

The expiration didn't happen in a vacuum, and despite the "Internet Blackout" that followed, businesses arguably had ample warning to get ahead of the changes required. Yet for reasons unknown, but surely related to complexity, workload, and competing priorities, they didn't address them in time. As a result, the free certificates that multi-billion-dollar companies were relying on likely caused outages costing millions.

And Then There Was This…

Just a few days later, reports of widespread outages were again making news. This time, billions of Facebook, WhatsApp, and Instagram users found their attempts to connect and communicate greeted with error messages. Facebook employees reportedly lost access to internal systems as well. The resulting user downtime, lost revenue, and plunging employee productivity piled onto a major PR problem that had just arisen for Facebook thanks to a whistleblower. It has been a rough week for them, to be sure.

In today's age of advanced technologies such as machine learning and AI, it's difficult to understand how or why repeatable, largely manual tasks – such as configuration changes – could have such an outsized impact on global commerce. How could such a seemingly simple task, or set of tasks, erase some $30 billion in market value within a single day?

Why Would This Happen?

When we think back to the challenges of finding a Y2K fix, we are talking about a time when computing was in its early adolescence. Automation tools were nonexistent or simplistic (at least by today's standards); every solution in 2000 was hard-coded and bespoke. In the years since Y2K, innovation has progressed at lightning speed. We're now in an age when the solution to a problem like a two-digit date field need not cost millions of dollars, require custom code, or consume countless hours.

There are solutions available today that offer:

  • Democratized IT
  • Out-of-the-box API support
  • Professional services in times of priority need
  • 15-day SLA adherence
  • Rapid workflow deployment

Similarly, the rollout of encryption certificates should be easy. Yet it remains anything but, for enterprises of all sizes – even the very largest organizations in the world.
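Much of the pain of an expiring certificate is avoidable with a simple automated check. As a minimal sketch using only the Python standard library (the function names are illustrative, not taken from any particular product), a script can report how many days remain on the certificate a server actually presents:

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after: str, now: datetime) -> int:
    """Parse an OpenSSL-style 'notAfter' timestamp
    (e.g. 'Sep 30 14:01:15 2021 GMT') and return whole
    days remaining relative to `now`."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - now).days

def check_certificate(host: str, port: int = 443) -> int:
    """Fetch the certificate a live server presents and
    report the days until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"], datetime.now(timezone.utc))
```

Run on a schedule against every public endpoint, a check like this turns a surprise expiration into a routine ticket filed weeks in advance.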

What Can We Learn From These Outages?

The key takeaway from these recent episodes is twofold. First, our IT resources and operations, and the business processes they support, are highly complex and growing more so each day. Second, managing this complexity effectively requires intelligent automation.

In plain English: our systems and processes now have too many ephemeral pieces moving too fast for humans to keep up with. We need help from the machines, and that help comes in the form of automation. Automating processes like configuration changes – and handling them quickly and correctly – is the answer.

Application programming interfaces (APIs) are the nexus of this needed automation. Broad and deep API-driven automation can multiply and extend the reach of IT organizations every day, but especially in times of trouble.
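What API-driven guardrails around a configuration change can look like is sketched below. Everything here is hypothetical – the names, the validation rules, and the `POST /api/v1/config` endpoint mentioned in the comment are illustrative, not any specific product's API – but the validate-then-apply, dry-run-first pattern is the core idea:

```python
from dataclasses import dataclass

@dataclass
class ConfigChange:
    device: str
    key: str
    value: str

def validate(change: ConfigChange) -> list:
    """Return a list of problems; an empty list means safe to apply."""
    problems = []
    if not change.device:
        problems.append("missing device")
    if change.key == "dns.primary" and not change.value:
        problems.append("primary DNS cannot be empty")
    return problems

def apply_change(change: ConfigChange, dry_run: bool = True) -> str:
    """Validate the change; apply it only when validation passes
    and dry_run is explicitly disabled."""
    problems = validate(change)
    if problems:
        return "rejected: " + "; ".join(problems)
    if dry_run:
        return "dry-run ok"
    # In a real workflow, this is where an automation platform would
    # call the device's management API (e.g., POST /api/v1/config).
    return "applied"
```

The point of the pattern is that a bad change – an empty DNS server, say – is rejected by a machine before it ever reaches production, instead of being caught by users after the fact.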

Final Thoughts

API-driven automation is the foundation of Pliant’s technology. With our Pliant Platform, we’re enabling organizations to get ahead and stay ahead of these types of complex process challenges. Once deployed, our Platform helps organizations proactively address requirements around key IT resources and critical processes so they can move their businesses forward.

To learn more about Pliant’s API-driven intelligent automation platform, contact us by clicking here.

Or, if you’d like to see our Platform in action first, attend the joint webinar we’re presenting with our friends at ElastiFlow entitled “Automate Critical Tasks with ElastiFlow & Pliant” on Wednesday, October 27th at 11:00 AM Eastern.

In today’s super-connected and hyper-speed environments, simple, little things like config changes can blow up in your face. Don’t be tomorrow’s version of what’s happened over the past week. Be smart, and start leading your organization down the process automation path.

Ready to take your first step? Pliant can help.

Click here to learn how Pliant automates the configuration of HTTPS application load balancing on the F5 BIG-IP.