Wednesday 8 December 2010

Statistical Reasoning for Dummies


"Statistical thinking and reasoning is as necessary for efficient citizenship as the ability to read and write"

Is this statement too bold? I don’t think so. We are surrounded by statistics, uncertainties and probabilities, and we need to understand them, use them and make decisions with them. But, as it turns out, statistical reasoning is very difficult, judging by the many mistakes that are made in newspapers, medical decision making, social science, gambling and politics. You name it, it’s everywhere and so are the mistakes.

To give you an extreme example, in Innumeracy J.A. Paulos tells a story about a weather forecaster. The weather forecaster reports a 50% chance of rain on Saturday and also a 50% chance of rain on Sunday. He concludes that it is certain to rain during the weekend. More recently, Ben Goldacre showed that a Stonewall publication claiming that the average coming-out age has been dropping was wrong. The Stonewall survey is seriously flawed and proves only the obvious point that people tend to get older when they get older, nothing more and nothing less. See Ben’s Bad Science weblog for more details. Yesterday a big news item on local television was that a mother, son and grandson were born on the same date. Statistically it’s not that extraordinary, contrary to what the journalist said (“It’s a miracle”). It’s easy to make a long list of these kinds of mistakes (the next Great Operations Research Blog Challenge theme?), but how do we resolve this? Maybe some statistical reasoning for dummies could help? Let’s start with an introductory chapter, some basics.
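To see why the weather forecaster in Paulos' story is wrong, it helps to do the arithmetic. The sketch below assumes that rain on Saturday and rain on Sunday are independent events, which the story does not state and which real weather will not satisfy exactly. The function name is my own, chosen for illustration.

```python
def chance_of_rain_at_least_once(daily_chances):
    """Probability of rain on at least one day.

    Assumption (not from the story): the days are independent.
    """
    chance_all_dry = 1.0
    for chance in daily_chances:
        # The weekend stays dry only if every single day stays dry.
        chance_all_dry *= 1.0 - chance
    return 1.0 - chance_all_dry


weekend = chance_of_rain_at_least_once([0.5, 0.5])
print(f"Chance of rain this weekend: {weekend:.0%}")  # prints 75%, not 100%
```

Adding the two percentages gives 100%, but under the independence assumption there is still a 25% chance that both days stay dry.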

As an Operations Researcher I am used to working with terms like probability, risk, variance, covariance, t-test, and many other statistical “Red” words, as Sam Savage calls them in his book The Flaw of Averages. Many times these “Red” words are used to express a probability or risk, leading to many mistakes and much confusion. Take for example the story from journalist David Duncan that was in Wired magazine a few years ago. David did a complete gene scan that checked for genetic disease markers in his DNA. Such tests will soon be part of everyday medical care (and insurance acceptance terms?). To his distress David received the message that he has mutations in his DNA that raise his risk of having a heart attack. Such risks are expressed as “the probability that you will have a heart attack is x%”, a single event probability. It is similar to the statement that the probability that it will rain tomorrow is 30%. But what does it mean? Will it rain 30% of the time tomorrow, or in 30% of the country? Both inferences are wrong, by the way. The problem with single event probabilities expressed in this way is that without a reference to the class of events the probability relates to, you are left in the dark as to how to interpret it. It causes Duncan to worry about having a heart attack, but should he be worried? A way around this confusion is to add the reference class to the probability. So, the weather forecaster should state something like: in 3 out of 10 cases where he made this forecast, there was at least a trace of rain the next day. Much of David’s confusion could have been resolved if the doctor had added a reference class, putting things in perspective.
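Rephrasing a single event probability with its reference class is mechanical enough to sketch in a few lines. The function below is a hypothetical helper of my own, not something from Savage or Duncan; it simply turns a percentage into a count within a stated reference class.

```python
def as_frequency_statement(probability, event, reference_class, out_of=10):
    """Rephrase a single event probability as a count within a reference class."""
    count = round(probability * out_of)
    return f"In {count} out of {out_of} {reference_class}, {event}."


statement = as_frequency_statement(
    0.30,
    "there was at least a trace of rain the next day",
    "cases where this forecast was made",
)
print(statement)
```

The point of the design is that the caller cannot produce a statement without naming the reference class, which is exactly the piece of information that "30% chance of rain" leaves out.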

Another classic misunderstanding is the interpretation of a conditional probability, as in interpreting diagnostic tests in medicine. See my earlier blog entry on that. The approach I used there to explain the correct way to interpret the test results “translates” the probabilities (stated in percentages) into concrete numbers, making them easier to understand. It does more or less the same as adding the reference class to the single event probability: it adds context.

The last example of a much misunderstood statistic is relative risk. In the Netherlands there was much debate on whether girls should be vaccinated against cervical cancer caused by the human papillomavirus (HPV). To express the effectiveness of the vaccination, a relative risk reduction was used. Something like: “This vaccine will reduce the risk of getting cervical cancer from an HPV infection by x%”. This kind of statement is used regularly to express the effectiveness of preventive methods like screening, vaccines or other risk mitigation strategies. Using relative risk reduction as a measure can, however, be confusing. For example, if the number of women dying of cervical cancer drops from 4 to 3 per 1000, the relative risk reduction is 25%. A massive risk reduction, you would say. However, the absolute reduction in women dying is only 0.1 percentage points (1 in 1000). The confusion, again, comes from not stating the reference data, causing many people to think that the 25% applies to everyone who takes the vaccine, while it actually applies to the few who would have died without it (25% fewer deaths).
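The difference between the two measures is easy to check using the numbers from the example above (4 versus 3 deaths per 1000 women). This is just a minimal sketch; the variable names are my own.

```python
from fractions import Fraction

# Numbers from the example in the text: deaths per 1000 women.
risk_without = Fraction(4, 1000)
risk_with = Fraction(3, 1000)

# Relative risk reduction: the drop measured against the original risk.
relative_reduction = (risk_without - risk_with) / risk_without

# Absolute risk reduction: the drop measured against all women.
absolute_reduction = risk_without - risk_with

print(f"Relative risk reduction: {float(relative_reduction):.0%}")
print(f"Absolute risk reduction: {float(absolute_reduction):.1%} "
      f"({absolute_reduction.numerator} in {absolute_reduction.denominator})")
```

Both numbers describe the same single death avoided per 1000 women; they only differ in what they are divided by.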

So the first lesson in statistical reasoning is: look for the reference class and translate probabilities and risks into numbers. As for us “professionals”, let’s present our results in a smart and easy to understand way and skip those Red words.

1) I used H.G. Wells’ statement on statistical reasoning, from somewhere at the beginning of the last century, as a starting point.
