Blade Runner was set in 2019, in a world of humanoid replicants and flying cars. In reality, it’s 2018, and organisations like Secret Cinema have their own, more mundane technology problems to worry about.
Like many of my peers in the tech world, no doubt, I count Ridley Scott’s Blade Runner as an important film. Tackling themes of identity and humanity against the backdrop of a dystopian future, the film explores the idea that technology gone rogue can be dangerous.
In 2018, companies are still struggling to run high-demand ticket sales without experiencing server trouble. When Secret Cinema opened its site last week for Blade Runner tickets, the company was forced to send an email, a tweet and other communications explaining that it had taken tickets off sale.
In the email and tweet, the company explained: "due to overwhelming demand, the surge of traffic has crashed our ticket provider's entire site globally and all its related events".
The problems Secret Cinema faced meant not only that people couldn’t buy tickets, but also that many were unsure whether they had successfully purchased tickets, or had inadvertently bought multiple tickets because no confirmation arrived. All of this was caused, presumably, by infrastructure failure.
The exact cause of the failure is hard to identify from the outside, but what we do know is that a distributed system rarely fails for a single reason.
Even so, there are steps that companies can and should take to avoid scaling trouble. A crashed site can work as a marketing ploy, showing off the enormous demand for an event, but the people buying tickets want to be able to do what they came to the site to do.
Large spikes in traffic can cause problems for even the most established business, but if you’re expecting one, you can take measures to prepare and minimise the potential for damage.
Firstly, there is pre-sale work that can be done to increase the chance of a successful on-sale. Capacity modelling involves estimating what the spike is likely to look like and how much demand you can expect.
This modelling draws on historical information or broader industry data, in addition to what you’ve seen on the site since first announcing the event. These analytics help you estimate demand and audience size, and work out how many tickets you are likely to sell. You’ll also know how you plan to release the tickets, which shapes the spike.
Some companies release tickets on a staged basis, emailing potential buyers in batches; others release all tickets at once.
It's always wise to err on the side of caution and plan for traffic on the higher side of your calculations.
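The arithmetic behind a basic capacity model can be sketched in a few lines of Python. This is a minimal illustration, not a recommended formula: the audience size, the share of visitors arriving in the opening window, the requests per visitor, the even split across staged email batches and the 1.5x safety margin are all hypothetical placeholders to be replaced with your own analytics.

```python
from dataclasses import dataclass


@dataclass
class OnSaleEstimate:
    """Hypothetical capacity-model inputs; replace them with your own data."""
    expected_visitors: int        # people likely to turn up for the on-sale
    peak_fraction: float          # share of them arriving in the opening window
    arrival_window_seconds: int   # length of that opening window
    requests_per_visitor: int     # page views and API calls per journey
    release_batches: int = 1      # staged email batches (1 = all at once)
    safety_margin: float = 1.5    # err on the high side of the estimate


def peak_requests_per_second(est: OnSaleEstimate) -> float:
    """Estimate the peak request rate the platform must sustain.

    Assumes visitors are split evenly across release batches and that each
    batch's requests are spread evenly over the arrival window; real traffic
    is burstier, which is one more reason for the safety margin.
    """
    peak_visitors = est.expected_visitors * est.peak_fraction / est.release_batches
    base_rate = peak_visitors * est.requests_per_visitor / est.arrival_window_seconds
    return base_rate * est.safety_margin


if __name__ == "__main__":
    all_at_once = OnSaleEstimate(200_000, 0.6, 600, 20)
    staged = OnSaleEstimate(200_000, 0.6, 600, 20, release_batches=4)
    print(f"All at once: {peak_requests_per_second(all_at_once):,.0f} req/s")
    print(f"Four batches: {peak_requests_per_second(staged):,.0f} req/s")
```

With these made-up inputs, the model asks for about 6,000 requests per second if everything goes on sale at once, and about 1,500 per batch if buyers are spread evenly over four staged batches.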
Once you know how much traffic you’re expecting, you can start load testing. The more realistic you can make these tests, the better.
It can be tempting simply to throw large numbers of HTTP requests at a URL, but that approach often doesn’t account for user behaviour such as dwell or wait time, which more sophisticated tools can model.
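To show what modelling dwell time means in practice, the sketch below runs virtual users that each walk a ticket-buying journey and pause between pages, rather than firing requests back to back. It is a minimal illustration under stated assumptions: the page paths, the dwell ranges and the `fake_request` stub are hypothetical, and the stub sends no traffic at all. A real test would replace it with an HTTP client or, more practically, use a dedicated load-testing tool.

```python
import asyncio
import random

# Hypothetical ticket-buying journey: (path, min dwell seconds, max dwell seconds).
JOURNEY = [
    ("/", 1.0, 3.0),
    ("/events/blade-runner", 5.0, 20.0),
    ("/basket", 2.0, 8.0),
    ("/checkout", 10.0, 40.0),
]


async def fake_request(path: str) -> int:
    """Stand-in for a real HTTP call; it sends nothing and always returns 200."""
    await asyncio.sleep(0.001)
    return 200


async def virtual_user(
    user_id: int, rng: random.Random, time_scale: float
) -> list[tuple[int, str, int]]:
    """Walk the journey one page at a time, dwelling on each page like a person."""
    log = []
    for path, min_dwell, max_dwell in JOURNEY:
        status = await fake_request(path)
        log.append((user_id, path, status))
        # Dwell time: the user reads the page before moving on.
        await asyncio.sleep(rng.uniform(min_dwell, max_dwell) * time_scale)
    return log


async def run(users: int, time_scale: float = 0.001, seed: int = 42) -> list[tuple[int, str, int]]:
    """Start all virtual users concurrently; time_scale shrinks dwell for a quick demo."""
    rng = random.Random(seed)
    logs = await asyncio.gather(*(virtual_user(i, rng, time_scale) for i in range(users)))
    return [entry for log in logs for entry in log]


if __name__ == "__main__":
    entries = asyncio.run(run(users=50))
    print(f"{len(entries)} requests from 50 virtual users")
```

The `time_scale` parameter exists only so the demo finishes in milliseconds; in a real run you would keep dwell times at full length so the test reproduces realistic pacing and session lengths.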
Some people may use multiple devices to improve their chances of getting tickets. Others may join a syndicate to be among the first to buy. Some may even deploy bots to grab as many cheap tickets as possible. All of this behaviour adds traffic, which needs to be accounted for in the test model.
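One way to fold this behaviour into the test model is to weight the expected audience by segment, so the load test simulates sessions rather than just people. The segments, their shares and the sessions each one opens are invented for illustration; in practice they would come from your own analytics.

```python
# Hypothetical audience mix: (share of people, sessions each person opens).
SEGMENTS = {
    "single_device": (0.70, 1),   # one browser tab, one session
    "multi_device": (0.20, 3),    # phone, laptop and tablet all trying at once
    "syndicate": (0.08, 5),       # syndicate members working several accounts
    "bot_operator": (0.02, 50),   # scripted clients opening many sessions
}


def expected_sessions(people: int, segments: dict[str, tuple[float, int]]) -> float:
    """Sessions a load test should simulate for a given head count and mix."""
    total_share = sum(share for share, _ in segments.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"segment shares must sum to 1, got {total_share}")
    return sum(people * share * sessions for share, sessions in segments.values())


if __name__ == "__main__":
    people = 100_000
    sessions = expected_sessions(people, SEGMENTS)
    print(f"{people:,} people -> {sessions:,.0f} sessions ({sessions / people:.1f}x)")
```

With these invented figures, 100,000 people turn into roughly 270,000 sessions, so a test sized on head count alone would fall well short.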
If you use tooling that accurately simulates the traffic you expect and replicates the paths users may take at launch, you can test empirically whether your kit will stand up. Some vendors offer a service that records many representative user journeys and then scales them up to expected levels of traffic. Some organisations might only run this testing before a big event, but for businesses where online ticket sales are a core function, it is advisable to build a testing framework and run regular performance regression tests as part of the general software delivery lifecycle.
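As a sketch of what a performance regression check in a delivery pipeline might look like, the gate below compares a test run's 95th-percentile latency against a stored baseline and fails if it has degraded by more than a tolerance. The choice of p95, the 10% tolerance and the sample latencies are assumptions for illustration, not a standard.

```python
import statistics


def p95(samples_ms: list[float]) -> float:
    """95th-percentile latency of a test run, in milliseconds."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(samples_ms, n=20)[-1]


def passes_regression_gate(
    baseline_ms: list[float], candidate_ms: list[float], tolerance: float = 0.10
) -> bool:
    """True if the candidate build's p95 is no more than `tolerance` worse than baseline."""
    return p95(candidate_ms) <= p95(baseline_ms) * (1 + tolerance)


if __name__ == "__main__":
    baseline = [float(ms) for ms in range(100, 200)]   # hypothetical stored run
    candidate = [ms * 1.25 for ms in baseline]          # new build is 25% slower
    ok = passes_regression_gate(baseline, candidate)
    print("gate passed" if ok else "gate failed: p95 latency regressed")
```

In a CI job, a failed gate would typically mark the build as failed so the regression is caught before the next big on-sale rather than during it.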
Another issue that can be overlooked is that you’re likely to rely on other vendors also being able to scale to the level of your traffic spike. As a result, a failure may have nothing to do with your own capacity. For example, if a payment gateway or email provider fails, it can bring everything else to a halt. These are contractual conversations: it’s important to understand the service level agreements your suppliers have committed to.
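The effect of these dependencies can be put into numbers. If a purchase needs every supplier in the chain to be working, and their failures are independent, the chance that the whole chain is up is the product of the individual availabilities. The supplier names and SLA figures below are hypothetical.

```python
from math import prod

# Hypothetical availability figures taken from supplier SLAs.
DEPENDENCIES = {
    "ticketing_platform": 0.999,
    "payment_gateway": 0.999,
    "email_provider": 0.995,
}


def serial_availability(availabilities: dict[str, float]) -> float:
    """Availability of a chain where every dependency must be up at once.

    Assumes independent failures, which is optimistic: a traffic spike can
    push several suppliers over at the same time.
    """
    return prod(availabilities.values())


if __name__ == "__main__":
    combined = serial_availability(DEPENDENCIES)
    minutes_down = (1 - combined) * 30 * 24 * 60
    print(f"Combined availability {combined:.4%}, ~{minutes_down:.0f} min down per 30 days")
```

With these illustrative figures, three suppliers that each look reliable on their own combine to roughly 99.3% availability, or around five hours of potential downtime in a 30-day month.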
You may want to load test their services yourself, though you may meet some reluctance. You could also consider having multiple similar vendors to lean on. Comic Relief, for example, has a very short window of very high demand for donations, so it spreads payment handling across different financial receivers and can remove any of them from the equation if that receiver starts to struggle.
When The Scale Factory was asked to support a major ticket launch for one of the world’s most in-demand pop concerts, we worked with the client to route ticket sales to multiple PayPal accounts, so that payment processing didn’t become a bottleneck.
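Spreading payments across several receivers, and taking one out of rotation when it struggles, can be sketched as weighted random selection with a consecutive-failure counter. This is a hypothetical illustration, not how Comic Relief or The Scale Factory actually implemented it; the account names, weights and failure threshold are invented.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provider:
    """A payment receiver, e.g. one of several merchant accounts (hypothetical)."""
    name: str
    weight: int                    # relative share of payments to send here
    max_failures: int = 3          # consecutive failures before removal
    consecutive_failures: int = 0

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.max_failures


class PaymentRouter:
    """Spread payments across providers by weight, skipping unhealthy ones."""

    def __init__(self, providers: list[Provider], seed: Optional[int] = None):
        self.providers = providers
        self.rng = random.Random(seed)

    def choose(self) -> Provider:
        live = [p for p in self.providers if p.healthy]
        if not live:
            raise RuntimeError("no healthy payment providers available")
        return self.rng.choices(live, weights=[p.weight for p in live], k=1)[0]

    def record(self, provider: Provider, success: bool) -> None:
        # Success resets the counter; repeated failures take the provider out of rotation.
        if success:
            provider.consecutive_failures = 0
        else:
            provider.consecutive_failures += 1


if __name__ == "__main__":
    accounts = [Provider("account_a", 1), Provider("account_b", 1), Provider("account_c", 2)]
    router = PaymentRouter(accounts, seed=7)
    for _ in range(3):
        router.record(accounts[1], success=False)  # account_b starts to struggle
    picks = {router.choose().name for _ in range(1_000)}
    print(sorted(picks))
```

In this sketch a removed provider never receives traffic again; a real system would need a way to probe it and bring it back, as well as idempotent payment handling so that a retried payment is not taken twice.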
Often when a big tech failure occurs, it’s put down to human error. Pointing the finger at a human is never a good explanation and is often seen as an easy way out; just ask the state of Hawaii. A healthier way to look at a high-impact incident is to ask which systemic failures allowed it to happen, rather than pinning it on a single individual working within that system.
Another area that is often overlooked is managing users’ expectations. In the case of Secret Cinema, many of the complaints came from people who were unsure whether they had been allocated tickets because they had not received an email immediately. A simple line on the payment confirmation page, warning that emails could take a few hours to arrive, could have made a difference.
So what now?
By all accounts, Secret Cinema was back up and running the same day, with only a few hours in which customers were unable to make purchases.
This being Secret Cinema, though, it could all have been part of the experience.
Jon Topper is CTO of The Scale Factory
Image source: Shutterstock/violetkaipa