Until a few years ago the term Big Data was known only to a limited group of experts. Now it is nearly impossible to visit a website or read an article without stumbling across it. A Google query returns over 1.5 billion hits in less than half a second; two years ago it returned only one fifth of that number. The hits include links to web pages on the increased number of visitors of a museum or the improved supply chain performance of Tesco due to the use of Big Data. You get the impression that Big Data is everywhere. Often Big Data is positioned as the answer to everything. It is the end of theory, as former Wired editor Chris Anderson wants us to believe. The promise of Big Data seems to imply that some kind of magic lurks in the sheer size of data sets. Once the size of a data set passes some critical threshold, the answers to all questions will come forward, as if Apollo no longer lives in Delphi but in large data sets. Has Deep Thought become reality with Big Data, and will it answer all our questions, including the ultimate question of Life, the Universe and Everything? Or a slightly simpler one: whether P = NP or not? Will Big Data end Operations Research?
[Figure: red = Big Data, blue = ORMS]
The introduction of enterprise-wide information and planning systems like ERP, together with the Internet, has led to a vast increase in the data that is being collected and stored. IBM estimates that each day 2.5 quintillion bytes of data are generated, and this rate is growing every day. It is growing so fast that 90% of the data available today was created in the past two years. The ability to use this data can have enormous benefits. The success of companies like Google and Amazon proves that. When sales of two items correlate at Amazon, they end up in each other’s “Customers Who Bought This Item Also Bought” lists, boosting sales. The same principle is used by Google in its PageRank algorithm. Using similar techniques, Target was able to identify a correlation between the sales of a set of products and pregnancy. Using point-of-sale data and data from customer loyalty cards, Target used this correlation to personalise the ads and offers it sent to customers, upsetting a father when his teenage daughter started to get ads for diapers and baby oil. Quite a story, but is it proof of the success of Big Data? We must be careful not to be fooled by observation bias, as we don’t know how many Target customers incorrectly received the pregnancy-related ads.
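To make the “also bought” idea concrete, here is a minimal sketch in Python. It is not Amazon’s actual method, which is not described here; it only counts how often two items appear in the same basket. The function name `also_bought` and the baskets are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def also_bought(baskets, item, top_n=3):
    """Rank other items by how often they share a basket with `item`."""
    pair_counts = Counter()
    for basket in baskets:
        # Count each unordered pair of distinct items once per basket.
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts[(a, b)] += 1

    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            related[b] = n
        elif b == item:
            related[a] = n
    return [name for name, _ in related.most_common(top_n)]


# Hypothetical baskets, invented for this example only.
baskets = [
    ["lotion", "vitamins", "cotton balls"],
    ["lotion", "vitamins"],
    ["lotion", "bread"],
    ["bread", "milk"],
]

print(also_bought(baskets, "lotion", top_n=1))
```

Note what the sketch does not do: it says nothing about why two items are bought together, and it cannot tell how many of its suggestions are wrong. Those questions still have to be asked by the analyst.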
As data is not objective and correlation most of the time doesn’t imply a causal link, we must be careful not to follow blindly what the data seems to tell us. Data is created as we gather it; it acquires meaning when we analyse it. Every analyst should know the many pitfalls in each of these steps. For example, the analysis of Twitter and Foursquare data from the New York area during hurricane Sandy showed some interesting results. The number of tweets coming from Manhattan suggested that Manhattan was worse off than Coney Island. However, the reverse was true: smartphone ownership is much higher in Manhattan, so more tweets came from there to begin with. On top of that, blackouts made recharging smartphones impossible, lowering the number of tweets and check-ins from Coney Island even more. A similar thing happened at the beginning of the year, when Google estimated the number of people infected with flu. Google estimated that 11% of the US population was infected, twice the level the CDC estimated. The overestimation was probably caused by the media hype, which boosted the number of Google queries on the subject. So a lot of data is not a panacea after all?
Making decisions based on correlations found in data alone can bring you benefits when you are a Target customer, but can keep you from rescue when you’re living in Coney Island. Quality decision making doesn’t result from data alone, let alone from a random quest for correlations in large data sets. Data, however, is an important ingredient for quality decision making. As Ron Howard suggests, quality decision making starts with framing the problem. The decision is supported by what you know (data), what you want (decision criteria, objectives) and what you can do (alternatives, requirements and restrictions). Collectively, these represent the decision basis, the specification of the decision. Logic (the math model) operates on the decision basis to produce the best alternative. Note that if any one of the basic elements is missing, there is no decision to be made. What to decide when there are no alternatives? How to decide between alternatives when it’s unclear how to rank them? If you do not have any data linking what you decide to what will happen, then all alternatives serve equally well. The reverse is also true: gathering data that doesn’t help to judge or rank the alternatives is pointless. Data is said to be the new oil. My take is that organisations shouldn’t be fixated on gathering and mining data, but should combine it with a structured approach to decision making. Then data will become the new soil for improvements, new insights and innovations.
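The decision basis can be sketched in a few lines of Python. This is only an illustration of the idea, not Howard’s formal method: the names `DecisionBasis` and `decide`, and the shipping example with its costs, are assumptions made up for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DecisionBasis:
    alternatives: List[str]          # what you can do
    data: Dict[str, float]           # what you know: predicted outcome per alternative
    value: Callable[[float], float]  # what you want: scores an outcome, higher is better


def decide(basis: DecisionBasis) -> str:
    """The 'logic': pick the alternative whose predicted outcome scores best."""
    if not basis.alternatives:
        raise ValueError("no alternatives, so there is no decision to make")
    missing = [a for a in basis.alternatives if a not in basis.data]
    if missing:
        # Without data linking a choice to an outcome it cannot be ranked.
        raise ValueError(f"no data for alternatives: {missing}")
    return max(basis.alternatives, key=lambda a: basis.value(basis.data[a]))


# Hypothetical example: predicted cost per shipment, lower cost is better.
basis = DecisionBasis(
    alternatives=["ship by truck", "ship by rail"],
    data={"ship by truck": 120.0, "ship by rail": 95.0},
    value=lambda cost: -cost,
)

print(decide(basis))
```

The sketch refuses to answer when alternatives or data are missing, which mirrors the point that without all elements of the decision basis there is no decision to be made.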
It took Deep Thought 7.5 million years to answer the ultimate question. As nobody knew what the ultimate question of Life, the Universe and Everything actually was, nobody knew what to make of the answer (42). To find the question belonging to the answer (some kind of intergalactic Jeopardy!), a new computer is constructed (not Watson), which needs 10 million years to find it. Unfortunately it is destroyed 5 minutes before it finishes. An Operations Researcher (or Certified Analytics Professional) would probably have done a better job, starting with framing the question, gathering and validating the relevant data, constructing and calibrating a model, and finally providing the best possible answer. I already know that it isn’t 42 or Big Data, but Operations Research.