IT workers around the world were likely barraged with AOL emails from their elderly relatives on October 4, 2021, asking why their computers were broken and their phones had stopped working. (At least that’s what happened to this humble reporter.) But it wasn’t a massive internet outage that caused this mass panic. In fact, the social networking site favored by those over a certain age, Facebook, was down. Everybody on Twitter was talking about it, including Twitter, which took the opportunity to troll its rival.
The outage came the day after a scathing 60 Minutes report featured an interview with a former Facebook insider turned whistleblower, who said the company’s algorithms purposely fed political polarization because that kind of content tends to be more profitable.
So far, the timing of Facebook’s outage seems to be just a coincidence. In a blog post after the network had begun its recovery, Facebook VP of Infrastructure Santosh Janardhan issued an apology and explained what had gone wrong.
“The underlying cause of this outage also impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem,” he wrote. “Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”
For those in IT, maybe this scenario sounds familiar. Maybe you’ve been in a situation where, say, an expired server certificate was the single spark that set off an apocalypse across your entire network.
Janardhan also said there was no foul play involved in the outage.
“We want to make clear that there was no malicious activity behind this outage — its root cause was a faulty configuration change on our end,” he wrote.
Although Facebook is mostly considered a consumer network, its outage offers a series of lessons for IT organizations. First and foremost, it shows how to avoid similar catastrophes in your own enterprise. Facebook’s influence also extends beyond consumer markets: its other brands, Instagram and WhatsApp, went down too, and WhatsApp is certainly used in business scenarios as well as by consumers.
The following slides cover five lessons every IT organization should learn from the Facebook outage.