After four years, the wait is finally over! The World Cup started last Friday. Besides bringing lots of excitement all over the world, and lots of buzzing ears due to too many vuvuzelas, the World Cup also seems to be bringing a lot of pain to the Twitter engineering team. It might not be related, but the fact is that the Twittersphere has been experiencing high levels of frustration ever since the event started. Performance degradation was visible three days before the start and has been getting worse ever since.
This is not the first occurrence of major problems, but as far as I can recall it is the first time they have lasted this long. From my perspective, it definitely demonstrates that Twitter might be skating on thin ice and might not have that much room for growth.
As Ed Ceaser and Nick Kallen explain in their anatomy of a whale article, and as most of you know, many factors can affect the performance of a high-volume service running on the web. Major issues arise when you end up with a spike in volume and your hardware infrastructure is not able to cope. Unfortunately, throwing more servers, with more CPUs and memory, at the problem is only a temporary measure and definitely not a cure. Understanding which component of your system is the bottleneck is a nightmare, especially when you have millions of users screaming at you. You've got to think fast and have a way to analyze your data very, very quickly. We have a lot of tools today for gathering data from our servers, our network infrastructure, our load balancers, our caches, etc., but aggregating and correlating massive amounts of live data is still very difficult. This is a real struggle for the Twitter team and for everyone else running high-volume services. They candidly admit it in the article:
Debugging performance issues is really hard. But it’s not hard due to a lack of data; in fact, the difficulty arises because there is too much data. We measure dozens of metrics per individual site request, which, when multiplied by the overall site traffic, is a massive amount of information about Twitter’s performance at any given moment. Investigating performance problems in this world is more of an art than a science. It’s easy to confuse causes with symptoms and even the data recording software itself is untrustworthy.
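To make the "too much data" problem concrete, here is a minimal sketch (my own illustration, not Twitter's actual tooling; the metric names are hypothetical) of the usual first step: collapsing thousands of per-request metric samples into per-metric percentiles, so you can spot which component owns the tail latency instead of drowning in raw samples.

```python
import random
from collections import defaultdict


def percentile(samples, p):
    """Nearest-rank percentile of the samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(p / 100.0 * len(ordered)))
    return ordered[index]


def summarize(requests):
    """Collapse a list of per-request metric dicts into per-metric percentiles."""
    by_metric = defaultdict(list)
    for metrics in requests:
        for name, value in metrics.items():
            by_metric[name].append(value)
    return {
        name: {"p50": percentile(values, 50), "p99": percentile(values, 99)}
        for name, values in by_metric.items()
    }


# Simulated request timings in milliseconds; a real site emits dozens of
# metrics per request, multiplied by overall traffic.
random.seed(42)
requests = [
    {"db_ms": random.expovariate(1 / 20.0), "cache_ms": random.expovariate(1 / 2.0)}
    for _ in range(10_000)
]
summary = summarize(requests)
print(summary)  # the db_ms tail dwarfs the cache_ms tail in this simulation
```

The point of the sketch is exactly what the quote says: the raw data is overwhelming, and it only becomes actionable once it is aggregated into a handful of comparable numbers per component.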
This is why Performance Intelligence is becoming more and more critical, with lots of vendors jumping on the bandwagon: CA with its Real-Time maps for Performance Intelligence, Soasta with its OLTP engine, Confio, and Hyperic.
What I don’t quite understand from the recent Twitter outages is the strategy the Twitter engineering team uses to test its fixes and enhancements. With today’s access to cheap cloud environments and ways to generate loads and loads of virtual users, you would think that a staging environment would be in place for deploying new chunks of code. When I look at the latest updates from their status page, I’m left wondering.
Update: We’re currently experiencing site availability issues resulting from the failed enhancement of a new approach to timeline caching. Our infrastructure and operations engineers are currently working to resolve this. We’ll update you soon with an ETA.
Update: 10:17 PM PDT: Users may temporarily experience missing tweets from their timelines. They will be restored shortly.
Update: 11:51 PM PDT: We’re recovering from the site outage. Remember, users may temporarily experience missing or duplicate tweets from their timelines. Normal timelines will be restored shortly.
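Generating "loads and loads of virtual users" against a staging environment, as suggested above, doesn't need to be exotic. Here is a minimal sketch of the idea (my own illustration; the function names and parameters are placeholders, and the fake request stands in for an HTTP call to a staging endpoint):

```python
import time
import random
from concurrent.futures import ThreadPoolExecutor


def run_load(request_fn, virtual_users, requests_per_user):
    """Drive `virtual_users` concurrent workers against `request_fn`
    and return every observed latency in milliseconds."""
    def user_session(_):
        latencies = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            request_fn()
            latencies.append((time.perf_counter() - start) * 1000.0)
        return latencies

    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        results = pool.map(user_session, range(virtual_users))
    return [latency for session in results for latency in session]


def fake_timeline_request():
    # Placeholder for an HTTP call to a staging endpoint.
    time.sleep(random.uniform(0.001, 0.005))


latencies = run_load(fake_timeline_request, virtual_users=20, requests_per_user=10)
print(len(latencies), max(latencies))
```

Run a change like the new timeline-caching approach under this kind of synthetic load in staging first, compare the latency distribution against the old code, and a "failed enhancement" has a decent chance of being caught before it hits production.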
What I do appreciate, though, is the level of visibility the Twitter team gives us. They screwed up, and they were probably not ready to handle an event such as the World Cup (especially the USA vs. England game!), but they’re honest about it:
In brief, we made three mistakes:
* We put two critical, fast-growing, high-bandwidth components on the same segment of our internal network.
* Our internal network wasn’t appropriately being monitored.
* Our internal network was temporarily misconfigured.
There is a lot of information shared by the engineering team on their blog; unfortunately, I can’t find anything regarding testing: how they do it, who does it (devs vs. dedicated Software Engineers in Test), how they cope with the hardware challenge, etc. Google is very good at reaching out to the test community, but I haven’t seen much coming from Twitter. Maybe this is the right opportunity for them to do so?