New analysis: Taylor Swift & server crashes

Hanging out over the weekend with three 12-year-old girls, I was not going to miss the fact that Taylor Swift just released a new song. Over the course of just 90 minutes, these girls shouted “Alexa – play Taylor Swift’s new song” five times.

And we were just one house with three 12-year-old girls. Imagine that scenario, played out over millions of cell phones across the globe – asking not just for the audio of her song but clicking to play the video. That’s a lot of data to serve up, all in a condensed time. It’s a lot like a Black Friday sale, where shoppers swamp Macy’s or Target or Saks (all examples of Black Friday meltdown victims). In this case, the victims are all the online music providers getting hit up simultaneously for one particular video, over and over.

So we shouldn’t be surprised to hear that servers all around the world are crashing under the weight of Taylor Swift’s popularity. Already moody teenagers will become decidedly grumpy when their videos won’t load or play all jittery. And the media providers will pay the price.

It’s logical to ask – these providers knew her new song was coming, and they know how popular a singer she is. Why aren’t they more prepared? They should have planned for this major traffic deluge and been ready with all the needed IT capacity, right? How do these meltdowns keep happening? We all saw ESPN go down on opening game day when Fantasy Football players crashed their site. And HBO had a similar catastrophe when the new season opener of “Game of Thrones” was posted.

Why does this happen, over and over – that an event of predictable popularity crashes a site? Let’s look at what’s involved from a technology perspective, and let’s think about the steps organisations can take to avoid this kind of disaster.

Serving up a Taylor Swift video takes a lot of IT resources. You need significant network, storage, and database capacity. Turns out, some aspects of those technology assets are pretty easy to expand but others aren’t. You can add storage pretty readily, and you can increase network bandwidth pretty easily too. Given a couple of months, both of those resources can grow without much struggle. You can even grow database capacity fairly quickly – but here’s the rub. Apps have to talk directly to databases, so while you might be able to add database capacity quickly, recoding apps to talk to that database is far from a quick endeavour.

The cost of an outage

We’ve seen customers with nine- and 18-month projects slotted for refactoring applications to take advantage of scaled-out databases. Why? Because you have to go into the application code and update each query to tell it which database server should get its request. And that’s if you can change the code. Sometimes organisations are running off-the-shelf software and can’t access the code to adjust it for database scale. Other times, we’ve had customers with full control over the source code but they’re terrified to change it for database scalability. Perhaps the timing is too close to an event – like the debut of Taylor Swift’s new song. Or maybe the problem is personnel turnover – the team members who really understand the source code aren’t there anymore, and the current engineers are afraid to break the inner workings of the application code.
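To make that concrete, here is a minimal sketch in Python of the kind of routing decision that ends up inside application code once a database is scaled out. The server names and the pick_server helper are hypothetical, invented purely for illustration; the point is only that every query call site needs some version of this choice, which is why the refactoring drags on.

```python
# Hypothetical server names; nothing here comes from a real deployment.
PRIMARY = "db-primary"
REPLICAS = ["db-replica-1", "db-replica-2"]


def pick_server(sql: str, request_number: int) -> str:
    """Return the name of the server a statement should be sent to.

    Reads (SELECT statements) rotate across the replicas;
    everything else goes to the primary.
    """
    is_read = sql.lstrip().upper().startswith("SELECT")
    if is_read:
        # Simple rotation: request 0 -> first replica, request 1 -> second, ...
        return REPLICAS[request_number % len(REPLICAS)]
    return PRIMARY


if __name__ == "__main__":
    print(pick_server("SELECT title FROM videos WHERE id = 42", 0))
    print(pick_server("UPDATE videos SET plays = plays + 1 WHERE id = 42", 1))
```

This sketch treats anything that is not a SELECT as a write, which is a deliberate simplification. Real workloads also have to deal with transactions and with replicas that lag behind the primary.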

For companies supporting this music debut, any outages will cost them dearly. In the moment, the pain will likely be loss of eyeballs and therefore loss of ad revenue. In the longer term, they may not get the rights to host new music because of poor performance this time around. Few companies can withstand outages today without significant financial impact. Amazon remains an extreme example, where the company is said to lose $66,000 every minute it’s offline.

These music companies won’t suffer to the same degree, but they will suffer financially. So the question becomes – what steps could they have taken to survive such a surge in traffic? Given some notice, here are steps you can take to ready your infrastructure for a big onslaught of customer traffic:

· Leverage cloud capacity – if you’re already running in the cloud, consider growing the capacity you can dedicate to the impacted applications and data sources. Public-facing sites and applications in particular should be the focus for enhancement to accommodate surge capacity.

· Reinforce web resources – you can grow the capacity of web servers fairly easily, leveraging additional hardware for scale out paired with TCP load balancers to distribute the load. Augmenting your web infrastructure with additional horizontal scale out will help keep the website online.

· Focus on database resources – the database is often the weakest link in many organisations’ technology stacks. Without a high-performing database server, all your other capacity increases won’t matter because your apps and web servers will grind to a halt. Remember, “slow” is the same as “down.” Monitor your databases’ performance, and, as with the web tier, consider horizontal scale out to increase capacity, paired with database load balancing software to distribute the load with no need for code changes.

· Monitor the complete application stack – look to Application Performance Monitoring software for a holistic view of your customers’ experiences. Where are the bottlenecks in your application’s overall performance? What recommendations can the software make for root cause and troubleshooting?
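The “slow is the same as down” idea from the list above can be turned into a simple check. The sketch below, in Python, flags a service as effectively down when its 95th-percentile response time crosses a limit. The function names, the 500 ms threshold, and the sample numbers are all assumptions chosen for illustration, not figures from any monitoring product.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def is_effectively_down(latencies_ms, threshold_ms=500, pct=95):
    """Treat the service as down if the chosen percentile is too slow.

    threshold_ms and pct are illustrative defaults, not recommendations.
    """
    return percentile(latencies_ms, pct) > threshold_ms


if __name__ == "__main__":
    # Made-up query latencies in milliseconds: mostly fast, with a slow tail.
    sample = [20, 25, 30, 35, 40, 45, 50, 60, 900, 1200]
    print(is_effectively_down(sample))
```

Watching a high percentile rather than the average matters here, because a surge tends to show up first as a slow tail while the average still looks healthy.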

Let Taylor Swift be a lesson to us all – we should all be ready to support major spikes in traffic without encountering an IT disaster. Whether you’re running in your own data centres or in the cloud – or the more common “both” scenario of hybrid architectures today – keep watch over all metrics that reveal how “hot” your systems are running. For business continuity, move from a disaster recovery approach (have a disaster, go down, and then move systems to a backup site) to an active/active architecture, where you’ve got multiple instances of all critical systems running in different locales. In that mode, if one entire site goes down, you’ve got the capacity in your other location(s) to take on the full load with no interruption to your users.
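The active/active requirement boils down to a piece of arithmetic: with any one site gone, the sites that remain must still be able to carry peak load. Here is a minimal sketch of that check in Python. The site names, capacities, and the survives_site_loss helper are hypothetical, and it assumes capacity can be expressed as a single requests-per-second number per site.

```python
def survives_site_loss(site_capacity, peak_load):
    """Return True if peak_load still fits after losing the largest site.

    site_capacity maps a site name to the load it can carry (for example,
    requests per second). Losing the largest site is the worst case, so
    passing this check covers the loss of any single site.
    """
    if len(site_capacity) < 2:
        # A single site has nowhere to fail over to.
        return False
    remaining = sum(site_capacity.values()) - max(site_capacity.values())
    return remaining >= peak_load


if __name__ == "__main__":
    sites = {"east": 60_000, "west": 60_000}  # made-up capacities
    print(survives_site_loss(sites, peak_load=50_000))
    print(survives_site_loss(sites, peak_load=90_000))
```

In the second case the two sites together can handle the peak, but neither can handle it alone, so a single site failure would still take users offline.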

Don’t let your own success – a major increase in site traffic – become your failure. Take steps ahead of time to scale that trickiest of technology areas – your data tier – to avoid your own Taylor Swift outage disaster. Because no one wants to write the next “Here’s why we failed, and here’s why it won’t happen again” mea culpa blog.

Michelle McLean, VP of Marketing at ScaleArc