Last year, I wrote an article called “Performance testing: When will they learn?”, pointing out that most websites were still unprepared to handle large volumes of traffic. Everyone remembers the Apple and AT&T pre-order crash last year in the US, a typical example of what not to do when launching a new product. In France, the most famous web failure was the launch of the 2-million-euro government website france.fr, which crashed miserably five minutes after going live. This morning, we have a good challenger for the top spot.
Rueducommerce.fr is one of the leading eCommerce websites in France, and today they decided to sell 1000 HP TouchPads at 7am. So I decided to wake up a bit earlier today, knowing it would be a win-win situation for me: I would end up with either a very cheap tablet or a very good story to tell. Obviously, I didn’t end up with my tablet, but I’ve got my story. A story I’ve heard too many times in the past.
It was a perfect setup for a failure:
- A hot and cheap product – Thanks to great marketing from HP, the TouchPad is THE hottest tablet this summer. I don’t think they planned this stunt before discontinuing the tablet though. I mean, c’mon, this is HP we’re talking about here. It’s been a while since they’ve planned anything.
- Scarcity – 1000 tablets to be sold at a very low price. Even if you don’t need a tablet, at 99 euros you probably need one. I know I need one.
- Social media buzz – Long gone are the days when you had to pay millions for a marketing campaign, now that you’ve got Facebook, Twitter and a couple of blogs covering your story. Yesterday they posted the announcement on their Facebook page and let people do the marketing job for them.
At 6:50am I tried to log on to the site and for 5 minutes got the following error: “Service Unavailable - DNS failure The server is temporarily unable to service Reference”. I was expecting a sluggish experience, but not to be stuck at the door without being redirected to a throttle page (there is a very nice article on the subject by Dan “the Man” Bartow: The graceful turn-away page). A typical DNS server should be able to handle something around 1M requests per second if properly configured. Either the whole country was up to buy a tablet, or they were facing a DDoS attack, or … someone didn’t do their job.
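To make the throttle-page idea concrete, here is a minimal sketch of an admission gate, assuming a fixed cap on how many visitors are let into the store at once. The `AdmissionGate` class, the `max_active` cap and the page labels are hypothetical names made up for illustration; this is not what rueducommerce or their CDN actually runs.

```python
import threading


class AdmissionGate:
    """Let at most `max_active` visitors in; turn the rest away politely."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.active = 0
        self._lock = threading.Lock()

    def try_enter(self):
        """Return True if the visitor is admitted, False if turned away."""
        with self._lock:
            if self.active < self.max_active:
                self.active += 1
                return True
            return False

    def leave(self):
        """Free a slot when an admitted visitor finishes."""
        with self._lock:
            if self.active > 0:
                self.active -= 1


def handle_request(gate):
    """Return the (hypothetical) page a visitor would be served."""
    if gate.try_enter():
        return "store"
    # A cheap static page is served instead of an error or a timeout.
    return "turn-away"


# Hypothetical cap of 2 visitors, just to show the behaviour.
gate = AdmissionGate(max_active=2)
pages = [handle_request(gate) for _ in range(3)]
print(pages)

gate.leave()  # one visitor finishes, so a slot opens up
after_leave = handle_request(gate)
print(after_leave)
```

The point of the design is that the turn-away path does almost no work, so visitors beyond the cap see a clear message while the visitors already inside keep a usable site.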
Finally, after 5 minutes of constant reloading, I was in! The website was super slow, with some dynamic content not loading properly, but I was able to get to the TouchPad product page.
I nervously added the tablet to my shopping cart, only to see the following message after a 2-minute wait:
Sorry, our website is temporarily overloaded but is still available (!). Our engineers have been made aware of the incident at the same time you discover this page (!) and are working actively to get the service back online. Apologies for the inconvenience. We advise you to reload this page regularly.
An unfortunately typical message, I would say. Needless to say, at 7:10am I gave up. I was monitoring the situation on Twitter and obviously I was not the only one facing these issues. No surprise here.
So what happened exactly? Thanks to Yannick Simon, a Vice President at rueducommerce, we have some technical details about the issues they faced (kudos for the transparency on that one!):
This is an Akamai snapshot during the peak. They topped 40k requests/sec (around 800k hits/sec). Not as dramatic as I would have thought but still a nice peak.
- Their firewall got burnt (!) trying to handle 100k concurrent connections.
- Some of their DNS servers were down before and after the peak.
- Their Varnish caches were quickly brought to their knees.
- Average bandwidth was 2 Gbps with peaks at 4 Gbps.
- Traffic was 5 times the typical sales traffic.
- Their Ad servers were dying and slowing down the whole site.
I don’t know how they test the scalability and reliability of their system, but I know one thing: they don’t perform load testing outside of their firewall. Or they don’t do it at scale. Or they don’t do it on the production system. My guess is that they rely on internal testing + some good old extrapolation. Fools.
They could have avoided this disaster very easily. At SOASTA, we unfortunately handle these types of situations all the time. Don’t get me wrong, it’s good for business, but I would rather help companies put the right practices in place with the right products than help them in panic mode.
So what should they have done?
- First, get in touch with us at SOASTA late last week (the sooner the better but this is an easy case for us to handle).
- We would have created the right scenarios, probably in a couple of hours. This is your typical eCommerce website, and fairly generic scenarios cover the most probable problems.
- We would have tested from our global cloud against their production system, ramping up gradually to their expected traffic. How many simultaneous connections were they expecting? Good question. They probably didn’t have a clue and I can’t blame them. But for this type of event, we recommend testing at up to 500% of expected traffic. We would have been right on par in this case if they had considered their typical sales traffic as their expected traffic for this event.
- We would probably have identified their DNS configuration problem very rapidly, probably around 20k VUs or so (just a guess here). The great thing about having real-time analytics is being able to make changes to the infrastructure while the test is running and understand the impact of each change. You don’t have to finish running your tests, wait for the performance analysis, make the change and retest. A typical dance if you’re using traditional (outdated?) load testing products.
- Because we’re testing externally from the cloud, we would have been able to bang on that firewall and clearly identify the capacity problem (firewall problems are among the top 5 problems we find with our customers).
- I’m not quite sure if they had a bandwidth bottleneck, but again, testing outside of your firewall allows you to identify bandwidth capacity problems super fast.
- I didn’t hear any reports of load balancer issues, but I guess they have some in the mix. Again, testing at high volume allows customers to identify load balancer problems in a snap.
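The gradual ramp to 500% of expected traffic described above can be sketched as a simple stepped plan with a stop condition. This is only an illustration under stated assumptions: the function names, the 20,000 expected users, the 10-minute hold and the error rates are all made up, and none of this is SOASTA’s actual tooling or API.

```python
def ramp_plan(expected_users, max_multiplier=5.0, steps=10, hold_minutes=10):
    """Build a stepped ramp that ends at max_multiplier times the
    expected traffic (500% by default)."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    plan = []
    for i in range(1, steps + 1):
        multiplier = max_multiplier * i / steps
        plan.append({
            "step": i,
            "multiplier": multiplier,
            "virtual_users": round(expected_users * multiplier),
            "hold_minutes": hold_minutes,
        })
    return plan


def first_failing_step(plan, error_rates, max_error_rate=0.01):
    """Return the first step whose observed error rate exceeds the limit,
    or None if every step stayed healthy."""
    for step, rate in zip(plan, error_rates):
        if rate > max_error_rate:
            return step
    return None


# Hypothetical input: 20,000 expected concurrent users.
plan = ramp_plan(expected_users=20_000)
for step in plan:
    print(step["step"], f'{step["multiplier"]:.0%}', step["virtual_users"])

# Made-up error rates per step, purely to show the stop condition.
observed = [0.0, 0.0, 0.001, 0.002, 0.004, 0.006, 0.02, 0.08, 0.2, 0.3]
failing = first_failing_step(plan, observed)
print("first failing step:", failing["step"] if failing else None)
```

With these made-up numbers the ramp goes from 50% to 500% of expected traffic in ten steps, and the first step to cross the 1% error limit tells you roughly where the infrastructure starts to give.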
During the test, we would typically be on the right side of this graph and, assuming they had tested properly in the previous stages, it would have been a breeze to find their infrastructure limitations.
I reached out to Yannick this morning and got this all-too-typical answer:
“Even with maximum precaution, we would have had crashes. We were at x5 our typical sales traffic. We have 3 hosting providers, 3 CDNs, Varnishes, big brand firewalls, etc. It would not have been enough”
The answer is typical because most companies take their infrastructure for granted. They think handling traffic is just a matter of throwing hardware, CDNs, firewalls, load balancers and bandwidth at the problem. Wake-up call, guys: you have no idea how your infrastructure is going to handle unexpected traffic unless you … TEST and expose it, ahead of time, to the same amount of unexpected traffic. Load balancers are the number one problem we face with our customers. They buy one and leave it configured at its default settings … Sure, it’s going to distribute your traffic nicely with a few thousand concurrent connections (I’ve seen load balancer issues with fewer than 500 …). Is it going to be properly configured to handle 10k? 50k? 100k? Probably not. How can you decide which algorithm is best suited if you don’t test your load balancer with the right amount of traffic? The same goes for all of your infrastructure: web servers, databases, CDNs, ad servers, media servers, third-party services, etc.
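To illustrate why the balancing algorithm matters, here is a toy simulation comparing round robin with least connections. It is a deliberately contrived sketch, not a model of any real load balancer: one request arrives per tick, every fourth request is assumed to be slow (40 ticks) and the rest fast (1 tick), and there are 4 servers. All of these numbers are assumptions chosen to expose the effect.

```python
def simulate(durations, n_servers, policy):
    """One request arrives per tick; return the peak number of
    concurrent connections seen on any single server."""
    finish_times = [[] for _ in range(n_servers)]  # per-server end ticks
    peak = 0
    for tick, duration in enumerate(durations):
        # Drop connections that have finished by this tick.
        finish_times = [[t for t in server if t > tick]
                        for server in finish_times]
        if policy == "round_robin":
            target = tick % n_servers
        elif policy == "least_connections":
            target = min(range(n_servers),
                         key=lambda s: len(finish_times[s]))
        else:
            raise ValueError(f"unknown policy: {policy}")
        finish_times[target].append(tick + duration)
        # A server's count only grows when it receives a request,
        # so checking the target is enough to track the peak.
        peak = max(peak, len(finish_times[target]))
    return peak


# Contrived workload: every 4th request is slow, the others are fast.
durations = [40 if i % 4 == 0 else 1 for i in range(200)]

rr_peak = simulate(durations, n_servers=4, policy="round_robin")
lc_peak = simulate(durations, n_servers=4, policy="least_connections")
print("round robin peak:", rr_peak)        # 10
print("least connections peak:", lc_peak)  # 3
```

In this workload round robin keeps sending every slow request to the same server, which ends up holding 10 connections while the others sit nearly idle; least connections never goes above 3 on any server. Real traffic is messier than this, which is exactly why the choice has to be validated under realistic load rather than assumed.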
In summary, there is no valid reason to see this type of problem anymore. There are inexpensive ways to avoid such disasters, and our long and growing list of customers is the proof. Today we have 1000 happy TouchPad users and 99,000 very angry customers who might decide never to shop at rueducommerce again. Hopefully next time I’ll receive a call from Yannick and get them ready for such an event. The holiday season is approaching rapidly and maybe, after he watches this webinar, I’ll get a call.