As we get closer to the Exabyte age (especially once we have millions of sensors everywhere!), I wanted to get my head around the state of data today and share with you my opinion about where I think data is headed.
The 90′s saw software companies competing against each other to come up with the killer apps, whether in desktop applications, back-office, supply chain management, data management etc. Business Intelligence was already in place with companies such as Business Objects, MicroStrategy and Oracle, and they were already helping businesses make sense of all their internal data, but mostly for reporting purposes rather than predictive analysis. The 00′s saw the explosion of the Internet with its load of new online services, new business models (only a few of which are actually viable), new mechanisms to deliver information and better collaboration models, reducing costs and improving relationships between business partners. As in the 90′s, competition is fierce, with new players arriving every day to compete in the same global market. It’s a huge market and everyone would like a piece of the pie.
But today we’re seeing a lot of companies, large or small, offering the same kind of products or services with no real competitive advantage. They all have access to low-cost labor, the same technologies and the same global knowledge, and we’re seeing (in my opinion) a stage of slow innovation from a software or service perspective. The real differentiator today between two similar businesses is not so much at the feature level but in how efficiently each understands the data pertaining to its business, predicting future trends to trigger actionable decisions for better profit. For example, analytics helps businesses better manage their supply chain (reducing inventory and stock-outs), optimize customer selection and increase loyalty, optimize pricing, increase product and service quality by detecting problems early, optimize marketing campaigns for stronger returns etc. Business is of course one of the main drivers of innovation in the analytics world.
Science is an obvious contributor and beneficiary as well, from astrophysics to health care, biology, physics and meteorology. They all need improved data-collection technology and analysis.
As Joe Hellerstein mentions, we’ve entered the industrial revolution of data and it’s a brand new ball game!
The Petabyte age
My first computer back in 1984, an Apple IIe, had 128KB of memory (I was one of the lucky ones!) and I was able to store a whopping 140KB on a single 5.25 inch floppy. Then came the Atari ST and Commodore Amiga years, with their 512KB of memory and 3.5 inch floppies storing 720KB. In the 90′s, my first PC was a 386sx with 1MB of memory and a 20MB hard drive. Then it went ballistic, and today I end up with 4GB of memory on my primary machine and multiple terabytes of data (mainly pictures; you can see some of them on my photo blog!)
I couldn’t have dreamed of a terabyte of storage 5 years ago, and now it’s becoming the norm for consumers. Digital data consumption has exploded for consumers between music, pictures, movies, books etc. Business and science are of course following this trend.
To give you an idea:
- Wal-Mart handles 1 million customer transactions per hour, feeding databases estimated at more than 2.5 petabytes.
- It’s difficult to evaluate the amount of data Google is currently storing and handling every day, but a white paper released in 2008 mentions that between indexing, processing and serving up ads, Google was processing 20,000 terabytes of data (20 petabytes!) a day.
- Ebay stores 8.5 petabytes of data.
- AMSTAR, a digital library used by the National Center for Atmospheric Research (NCAR), will preserve and protect 30 petabytes of data.
- The amount of information consumed by Americans in 2008 totaled 3.6 zettabytes, based on a study from researchers at the University of California, San Diego (UCSD).
- More than 20 hours of YouTube video are uploaded… every minute (source)
- Over 200 billion emails are sent… every day (source)
- In 2005, mankind created 150 exabytes of data… this year it is estimated we will create 1,200 exabytes (source)
- In 2020 we expect to create 35 zettabytes of data, or 35,000 exabytes. (source)
If you’re lost with all these bytes, here is a refresher!
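A quick sketch of the scale, using the decimal convention (1 KB = 1,000 bytes) that the figures above follow:

```python
# Decimal (SI) byte units, each 1,000 times the previous one.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to a number of bytes."""
    return value * 1000 ** (UNITS.index(unit) + 1)

for unit in UNITS:
    print(f"1 {unit} = {to_bytes(1, unit):,} bytes")
```

So Google’s 20,000 terabytes a day really is 20 petabytes, and 35 zettabytes is 35,000 exabytes, as above.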
From Data to Wisdom
The data part is only the tip of the iceberg, and it is actually pretty pointless to gather data without doing anything meaningful with it. In the late 1970s the DIKW model was built to allow data managers to treat their data as a strategic resource and establish a relationship between data and knowledge, allowing them to feed decisions based on this knowledge.
The model assumes the following chain of action:
- Data comes in the form of raw observations and measurements: raw webserver logs, database transaction logs, mobile phone call logs etc. There is nothing you can get from this data unless you do some crunching on it.
- Information is created by analyzing relationships and connections between the data. It is capable of answering simple “who/what/where/when” style questions. Information is a message, and considers its audience and purpose. Your data starts to make some sense and becomes relevant to your business: where do your visitors navigate on your website, where do they click, how long do they stay on a specific page, where do they come from, where do they go after their visit, what do they buy, whom have you called from your mobile, where were you, what did you buy after being in store X or Y etc.
- Knowledge is created by using the information for action. Knowledge answers the question “how”. Basically, based on the information we’ve aggregated, we want to understand the conditions that make people behave the way they do. If we know that a 25-year-old man, living in New York, married, working for Microsoft, with a credit score of 650, bought a laptop last month, can we predict what he is likely to buy in the next 2 weeks, how much he will spend, which store he will choose and how he will react if we send him a coupon for product X or Y?
- Wisdom deals with the future, as it takes implications and lagged effects into account. We now know the who, what, where, when and hopefully why. We need to test our predictions, execute them, monitor the results and probably adapt. This is where you end up with your A/B or Champion/Challenger strategy (Smashing Magazine has just released its ultimate guide to A/B testing! How convenient.)
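To make that last step concrete, here is a minimal sketch of how an A/B test’s results might be compared: a two-proportion z-test in plain Python. The traffic and conversion numbers are invented for illustration.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic comparing two conversion rates, using the pooled rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # standard error of the difference
    return (p_b - p_a) / se

# Hypothetical experiment: champion page A vs challenger page B,
# 10,000 visitors each, 200 vs 260 conversions.
z = two_proportion_z(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}")
```

Here z comes out around 2.8, above the usual 1.96 threshold for 5% significance, so the challenger would win and the model (and the site) would be adapted accordingly.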
Today, a lot of companies stop at the Information stage and analyze their metrics retroactively to understand what has happened. The successful companies go all the way to the Wisdom stage and develop models to predict the optimal action for the future. Predictive analytics is a booming market where companies such as SPSS (acquired by IBM last year), Prediction Impact and of course Experian mine historical behavioral, transactional and demographic data to develop models predicting future behavior. Earlier this year, Google invested in Recorded Future, which tries to analyze the past and present to predict the future, with technology that extracts event and time information from the web to predict stock market events or even terrorist attacks.
Predictive analytics spans all business verticals: Marriott International has developed a company-wide program called Total Hotel Optimization for establishing the optimal price for guest rooms (revenue management). The software integrates a system to optimize offerings to frequent customers and calculates the risk of these customers defecting to a competitor. UPS has deployed a software solution able to accurately predict customer defections by examining usage patterns and complaints. Banks of course are not the last ones on this bandwagon, with Capital One and their information-based strategy and Barclays with their information-based customer management. Other examples include sports, where statistical analysis is used to select the best player for a particular game or phase within a game. The New England Patriots, Oakland A’s and Boston Red Sox all use software to make decisions before and during the game. Michael Lewis’s book Moneyball gives us some insight into the A’s overall approach to analytics.
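As a toy sketch of the kind of scoring behind a defection predictor like UPS’s, here is a logistic model in plain Python. The features and weights are entirely made up for illustration; a real system would learn them from historical usage and complaint data.

```python
import math

# Hypothetical, hand-picked weights; a real model would fit these to data.
WEIGHTS = {"complaints_last_90d": 0.8, "usage_drop_pct": 0.04, "years_as_customer": -0.3}
BIAS = -2.0

def defection_risk(customer):
    """Logistic score in [0, 1]: estimated probability the customer defects."""
    score = BIAS + sum(WEIGHTS[k] * customer[k] for k in WEIGHTS)
    return 1 / (1 + math.exp(-score))

loyal = {"complaints_last_90d": 0, "usage_drop_pct": 5, "years_as_customer": 6}
at_risk = {"complaints_last_90d": 3, "usage_drop_pct": 60, "years_as_customer": 1}
print(f"loyal: {defection_risk(loyal):.2f}, at risk: {defection_risk(at_risk):.2f}")
```

The loyal customer scores near zero, the complaining one near one; the business action (a retention offer, a call) is then triggered off that score.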
Leveraging free data
In May 2010, the Open Data Study written by Becky Hogge was published by the Open Society Institute. It explores the benefits of opening government data in the US and the UK and tries to give clues as to whether similar efforts should be pushed outside Western democracies. The origin of this approach is the belief that opening public data can provide important economic and social advantages. For Becky Hogge, available spatial, demographic, budgetary and social data can be used to improve services and create economic growth.
Data.gov.uk was officially launched in January 2010. The site has been seen as a victory for the pro-OpenData community. The United Kingdom now sees various websites and applications built on the released data, especially map data related to postal codes and to the last general election in May 2010.
Data.gov is a U.S. Government portal providing access to databases created by the U.S. federal government and its agencies. It was launched in 2009 with two objectives. The first was to encourage bottom-up communication and the emergence of new ideas of governance, enhancing the transparency of public services, citizen participation and collaboration between the state and its citizens. The opening of government data has also been conceived as a means of improving the efficiency of government agencies.
It originally contained 76 databases from 11 government agencies. Fearing that the momentum towards OpenData would fall back and that too little data was being publicly released, Obama adopted a decree on December 8, 2009, requiring each government agency to publish at least three databases of quality.
The British portal already offers three times more data, even though its American counterpart has a six-month head start. And data.gov.uk chose standardized formats to foster the development of the semantic web, unlike data.gov.
The World Bank has joined the movement as well by opening its World Development Indicators (WDI). “The WDI provides a valuable statistical picture of the world and how far we’ve come in advancing development,” said Justin Yifu Lin, the World Bank’s Chief Economist and Senior Vice President for Development Economics. “Making this comprehensive data free for all is a dream come true.”
They’ve also announced the launch of a freely available online database and a public API to 1,000+ indicators. This data used to be very expensive, and I can’t begin to imagine all the mashups we’re going to see flourishing in the coming months.
Any company serious about analytics needs to keep an eye on Open Data, whether to contribute its own data or to leverage others’. There is a real opportunity here to open new collaborations between the public and private sectors.
A need for new tools and computing environments
This deluge of data brings new challenges from a software and computing-environment perspective. Traditional RDBMSs typically show their limits when dealing with vast amounts of data, and supercomputers are still not accessible to most companies.
Cloud computing is the obvious choice to collect, store and process data. With the illusion of infinite computing resources available on demand, the elimination of up-front commitment and the ability to pay for computing resources on a short-term, as-needed basis, you have the perfect environment to process your data. No wonder that by 2012, customers are expected to spend $42 billion on cloud computing!
The next trend fueling these new uses of data is innovative open-source frameworks that support data-intensive distributed applications, enabling work with thousands of nodes and petabytes of data. Hadoop is the most popular framework today and includes the following sub-projects:
- Hadoop Common: The common utilities that support the other Hadoop subprojects.
- Chukwa: A data collection system for managing large distributed systems.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- HDFS: A distributed file system that provides high-throughput access to application data.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- MapReduce: A software framework for distributed processing of large data sets on compute clusters.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper: A high-performance coordination service for distributed applications.
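To give a feel for the MapReduce model, here is the classic word-count job sketched in Python, in the spirit of Hadoop Streaming (where the mapper and reducer are plain programs reading and writing key/value lines). This runs locally on a tiny in-memory “corpus”; on a real cluster Hadoop would shard the map step across nodes and sort the pairs by key before the reduce step.

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce step: sum the counts per word (pairs must be grouped by key,
    which sorting guarantees)."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

docs = ["big data is big", "data everywhere"]
print(dict(reducer(mapper(docs))))  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 1}
```

The appeal is that the mapper and reducer know nothing about distribution: the framework handles splitting, shuffling and fault tolerance, which is what lets the same two functions scale from this toy to petabytes.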
As I discussed a while back, the NoSQL movement is gaining more and more importance and seeing accelerating usage compared to traditional RDBMSs. Parallel database management systems are good at working with structured data; the NoSQL approach gives the user much more control over how unstructured data is retrieved, making it more suitable for processing petabytes of data.
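A toy illustration of that control, not tied to any particular NoSQL product: in a key-value store the records can be free-form, and the retrieval logic lives in application code rather than in a SQL planner.

```python
# A dict standing in for a key-value store: values are free-form records,
# with no schema enforced across keys.
store = {
    "user:1": {"name": "Ana", "tags": ["photo", "blog"]},
    "user:2": {"name": "Ben", "tags": ["video"]},
    "log:17": "GET /index.html 200",
}

def scan(store, predicate):
    """Full scan with application-defined retrieval logic (no query planner)."""
    return {k: v for k, v in store.items() if predicate(k, v)}

# The application decides how to interpret each record while filtering.
photographers = scan(store, lambda k, v: isinstance(v, dict) and "photo" in v["tags"])
print(photographers)  # {'user:1': {'name': 'Ana', 'tags': ['photo', 'blog']}}
```

The trade-off is exactly the one described above: you give up declarative queries over a fixed schema in exchange for full control over heterogeneous data, which parallelizes naturally over many nodes.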
Create an analytics mindset in your organization
Pushing for an analytics culture can create some tension between people with entrepreneurial, visionary minds and those who are more pragmatic and require a lot of proof before implementing a change of strategy. The goal here is to find the right balance: being able to validate the strategy at a small scale and find a way to extrapolate the results to a larger scale. There is always a risk in extrapolating these results, but that’s why successful organizations hire the best minds to lead this effort. At Google, Procter & Gamble, UPS, Harrah’s, Capital One, Barclays, Amazon, Ebay etc., the organizations driving an analytics strategy all have leading authorities in math, statistical analysis, optimization etc. Companies are competing for top talent, and statistician is becoming THE sexy job of the future. A lot of the big players are getting really serious about attracting the best talent; one example is IBM and their plan to open a specialized data analytics center.
Because this strategy might impact the culture, processes, approach to business and skills, it needs to be pushed strongly from the top of the organization. You need charismatic leaders who can combine a blue-sky approach that empowers their people (especially R&D) with a pragmatic, business-minded, down-to-earth side that maximizes returns. Gary Loveman, Jeff Bezos and Rich Fairbank are of this kind, with the clout, perspective and cross-functional scope to change the culture.
There are 2 important additional topics I will cover in subsequent posts:
Visualization of Data
Privacy and the danger of an open society