The Art of Statistics

How to Learn from Data

Contributors

By David Spiegelhalter

Formats and Prices

Price

$18.99

This item is a preorder. Your payment method will be charged immediately, and the product is expected to ship on or around September 3, 2019. This date is subject to change due to shipping delays beyond our control.

In this "important and comprehensive" guide to statistical thinking (New Yorker), discover how data literacy is changing the world and giving you a better understanding of life's biggest problems.  
 
Statistics are everywhere, as integral to science as they are to business, and in the popular media hundreds of times a day. In this age of big data, a basic grasp of statistical literacy is more important than ever if we want to separate fact from fiction, the ostentatious embellishments from the raw evidence — and even more so if we hope to participate in the future, rather than being simple bystanders.
 
In The Art of Statistics, world-renowned statistician David Spiegelhalter shows readers how to derive knowledge from raw data by focusing on the concepts and connections behind the math. Drawing on real world examples to introduce complex issues, he shows us how statistics can help us determine the luckiest passenger on the Titanic, whether a notorious serial killer could have been caught earlier, and if screening for ovarian cancer is beneficial. The Art of Statistics not only shows us how mathematicians have used statistical science to solve these problems — it teaches us how we too can think like statisticians. We learn how to clarify our questions, assumptions, and expectations when approaching a problem, and — perhaps even more importantly — we learn how to responsibly interpret the answers we receive.
 
Combining the incomparable insight of an expert with the playful enthusiasm of an aficionado, The Art of Statistics is the definitive guide to stats that every modern person needs.

Excerpt




List of Figures

0.1 Age and Year of Death of Harold Shipman’s Victims
0.2 Time of Death of Harold Shipman’s Patients
0.3 The PPDAC Problem-Solving Cycle
1.1 30-Day Survival Rates Following Heart Surgery
1.2 Proportion of Child Heart Operations Per Hospital
1.3 Percentage of Child Heart Operations Per Hospital
1.4 Risk of Eating Bacon Sandwiches
2.1 Jar of Jelly Beans
2.2 Different Ways of Displaying Jelly Bean Guesses
2.3 Jelly-Bean Guesses Plotted on a Logarithmic Scale
2.4 Reported Number of Lifetime Opposite-Sex Partners
2.5 Survival Rates Against Number of Operations in Child Heart Surgery
2.6 Pearson Correlation Coefficients of 0
2.7 World Population Trends
2.8 Relative Increase in Population by Country
2.9 Popularity of the Name ‘David’ Over Time
2.10 Infographic on Sexual Attitudes and Lifestyles
3.1 Diagram of Inductive Inference
3.2 Distribution of Birth Weights
5.1 Scatter of Sons’ Heights v. Fathers’ Heights
5.2 Fitted Logistic Regression Model for Child Heart Surgery Data
6.1 Memorial to Titanic Victim
6.2 Summary Survival Statistics for Titanic Passengers
6.3 Classification Tree for Titanic Data
6.4 ROC Curves for Algorithms Applied to Training and Test Sets
6.5 Probabilities of Surviving the Titanic Sinking
6.6 Over-Fitted Classification Tree for Titanic Data
6.7 Post-Surgery Survival Rates for Women with Breast Cancer
7.1 Empirical Distribution of Number of Sexual Partners for Varying Sample Sizes
7.2 Bootstrap Resamples from Original Sample of 50
7.3 Bootstrap Distribution of Means at Varying Sample Sizes
7.4 Bootstrap Regressions on Galton's Mother–Daughter Data
8.1 A Simulation of the Chevalier de Méré’s Games
8.2 Expected Frequency Tree for Two Coin Flips
8.3 Probability Tree for Flipping Two Coins
8.4 Expected Frequency Tree for Breast Cancer Screening
8.5 Observed and Expected Number of Homicides
9.1 Probability Distribution of Left-Handers
9.2 Funnel Plot of Bowel-Cancer Death Rates
9.3 BBC Plot of Opinion Polls Before the 2017 General Election
9.4 Homicide Rates in England and Wales
10.1 Sex Ratio for London Baptisms, 1629–1710
10.2 Empirical Distribution of Observed Difference in Proportions of Left/Right Arm Crossers
10.3 Cumulative Number of Death Certificates Signed by Shipman
10.4 Sequential Probability Ratio Test for Detection of a Doubling in Mortality Risk
10.5 Expected Frequencies of the Outcomes of 1,000 Hypothesis Tests
11.1 Expected Frequency Tree for Three-Coin Problem
11.2 Expected Frequency Tree for Sports Doping
11.3 Reversed Expected Frequency Tree for Sports Doping
11.4 Bayes’ ‘Billiard’ Table
12.1 Traditional Information Flows for Statistical Evidence



List of Tables

1.1 Outcomes of Children’s Heart Surgery
1.2 Methods for Communicating the Lifetime Risk of Bowel Cancer in Bacon Eaters
2.1 Summary Statistics for Jelly-Bean Guesses
2.2 Summary Statistics for the Lifetime Number of Sexual Partners
4.1 Outcomes for Patients in the Heart Protection Study
4.2 Illustration of Simpson’s Paradox
5.1 Summary Statistics of Heights of Parents and Their Adult Children
5.2 Correlations between Heights of Adult Children and Parent of Same Gender
5.3 Results of a Multiple Linear Regression Relating Adult Offspring Height to Mother and Father
6.1 Error Matrix of Classification Tree on Titanic Training and Test Data
6.2 Fictional ‘Probability of Precipitation’ Forecasts
6.3 Results of a Logistic Regression for Titanic Survivor Data
6.4 Performance of Different Algorithms on Titanic Test Data
6.5 Breast Cancer Survival Rates Using the Predict 2.1 Algorithm
7.1 Summary Statistics for Lifetime Sexual Partners Reported by Men
7.2 Sample Means of Lifetime Sexual Partners Reported by Men
9.1 Comparison of Exact and Bootstrap Confidence Intervals
10.1 Cross-Tabulation of Arm-Crossing Behaviour by Gender
10.2 Observed and Expected Counts of Arm-Crossing by Gender
10.3 Observed and Expected Days with Each Number of Homicide Incidents
10.4 Results of Heart Protection Study with Confidence Intervals and P-values
10.5 The Output in R of a Multiple Regression Using Galton’s Data
10.6 Possible Outcomes of a Hypothesis Test
11.1 Likelihood Ratios for Evidence Concerning Richard III Skeleton
11.2 Recommended Verbal Interpretations of Likelihood Ratios
11.3 Kass and Raftery’s Scale for Interpretation of Bayes Factors
12.1 Questionable Interpretation and Communication Practices
13.1 Exit Poll Predictions for Three Recent General Elections



Introduction

The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning.

—Nate Silver, The Signal and the Noise1

Why We Need Statistics

Harold Shipman was Britain’s most prolific convicted murderer, though he does not fit the archetypal profile of a serial killer. A mild-mannered family doctor working in a suburb of Manchester, between 1975 and 1998 he injected at least 215 of his mostly elderly patients with a massive opiate overdose. He finally made the mistake of forging the will of one of his victims so as to leave him some money: her daughter was a solicitor, suspicions were aroused, and forensic analysis of his computer showed he had been retrospectively changing patient records to make his victims appear sicker than they really were. He was well known as an enthusiastic early adopter of technology, but he was not tech-savvy enough to realize that every change he made was time-stamped (incidentally, a good example of data revealing hidden meaning).

Of his patients who had not been cremated, fifteen were exhumed and lethal levels of diamorphine, the medical form of heroin, were found in their bodies. Shipman was subsequently tried for fifteen murders in 1999, but chose not to offer any defence and never uttered a word at his trial. He was found guilty and jailed for life, and a public inquiry was set up to determine what crimes he might have committed apart from those for which he had been tried, and whether he could have been caught earlier. I was one of a number of statisticians called to give evidence at the public inquiry, which concluded that he had definitely murdered 215 of his patients, and possibly 45 more.2

This book will focus on using statistical science * to answer the kind of questions that arise when we want to better understand the world—some of these questions will be highlighted in a box. In order to get some insight into Shipman’s behaviour, a natural first question is:

What kind of people did Harold Shipman murder, and when did they die?

The public inquiry provided details of each victim's age, gender and date of death. Figure 0.1 is a fairly sophisticated visualization of this data, showing a scatter-plot of the age of victim against their date of death, with the shading of the points indicating whether the victim was male or female. Bar-charts have been superimposed on the axes showing the pattern of ages (in 5-year bands) and years.
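The marginal bar-charts described above are simple counts by band. A minimal sketch of that tabulation, using invented records rather than the inquiry's actual data:

```python
# Sketch of the marginal summaries behind a Figure 0.1-style plot.
# The records below are invented for illustration, not Shipman case data.
from collections import Counter

records = [  # (year of death, age, sex) -- hypothetical
    (1985, 72, "F"), (1987, 81, "F"), (1990, 77, "M"),
    (1994, 68, "F"), (1995, 84, "F"), (1997, 49, "M"),
]

# Count victims per 5-year age band (the bar-chart on the right axis)
age_bands = Counter(5 * (age // 5) for _, age, _ in records)
# Count victims per calendar year (the bar-chart on the top axis)
years = Counter(year for year, _, _ in records)

for band in sorted(age_bands):
    print(f"ages {band}-{band + 4}: {'#' * age_bands[band]}")
```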

Some conclusions can be drawn by simply taking some time to look at the figure. There are more black than white dots, and so Shipman’s victims were mainly women. The bar-chart on the right of the picture shows that most of his victims were in their 70s and 80s, but looking at the scatter of points reveals that although initially they were all elderly, some younger cases crept in as the years went by. The bar-chart at the top clearly shows a gap around 1992 when there were no murders. It turned out that before that time Shipman had been working in a joint practice with other doctors but then, possibly as he felt under suspicion, he left to form a single-handed general practice. After this his activities accelerated, as demonstrated by the top bar-chart.




Figure 0.1
A scatter-plot showing the age and the year of death of Harold Shipman’s 215 confirmed victims. Bar-charts have been added on the axes to reveal the pattern of ages and the pattern of years in which he committed murders.




This analysis of the victims identified by the inquiry raises further questions about the way he committed his murders. Some statistical evidence is provided by data on the time of day of the death of his supposed victims, as recorded on the death certificate. Figure 0.2 is a line graph comparing the times of day that Shipman’s patients died to the times that a sample of patients of other local family doctors died. The pattern does not require subtle analysis: the conclusion is sometimes known as ‘inter-ocular’, since it hits you between the eyes. Shipman’s patients tended overwhelmingly to die in the early afternoon.

The data cannot tell us why they tended to die at that time, but further investigation revealed that he performed his home visits after lunch, when he was generally alone with his elderly patients. He would offer them an injection that he said was to make them more comfortable, but which was in fact a lethal dose of diamorphine: after a patient had died peacefully in front of him, he would change their medical record to make it appear as if this was an expected natural death. Dame Janet Smith, who chaired the public inquiry, later said, ‘I still do feel it was unspeakably dreadful, just unspeakable and unthinkable and unimaginable that he should be going about day after day pretending to be this wonderfully caring doctor and having with him in his bag his lethal weapon… which he would just take out in the most matter-of-fact way.’




Figure 0.2
The time at which Harold Shipman’s patients died, compared to the times at which patients of other local general practitioners died. The pattern does not require sophisticated statistical analysis.




He was taking some risk, since a single post-mortem would have exposed him, but given the age of his patients and the apparent natural causes of death, none were performed. And his reasons for committing these murders have never been explained: he gave no evidence at his trial, never spoke about his misdeeds to anyone, including his family, and committed suicide in prison, conveniently just in time for his wife to collect his pension.

We can think of this type of iterative, exploratory work as ‘forensic’ statistics, and in this case it was literally true. There is no mathematics, no theory, just a search for patterns that might lead to more interesting questions. The details of Shipman’s misdeeds were determined using evidence specific to each individual case, but this data analysis supported a general understanding of how he went about his crimes.

Later in this book, in Chapter 10, we will see whether formal statistical analysis could have helped catch Shipman earlier.* In the meantime, the Shipman story amply demonstrates the great potential of using data to help us understand the world and make better judgements. This is what statistical science is all about.

Turning the World Into Data

A statistical approach to Harold Shipman’s crimes required us to stand back from the long list of individual tragedies for which he was responsible. All those personal, unique details of people’s lives, and deaths, had to be reduced to a set of facts and numbers that could be counted and drawn on graphs. This might at first seem cold and dehumanizing, but if we are to use statistical science to illuminate the world, then our daily experiences have to be turned into data, and this means categorizing and labelling events, recording measurements, analysing the results and communicating the conclusions.

Simply categorizing and labelling can, however, present a serious challenge. Take the following basic question, which should be of interest to everyone concerned with our environment:

How many trees are there on the planet?

Before even starting to think about how we might go about answering this question, we first have to settle a rather basic issue. What is a ‘tree’? You may feel you know a tree when you see it, but your judgement may differ considerably from others who might consider it a bush or a shrub. So to turn experience into data, we have to start with rigorous definitions.

It turns out that the official definition of a ‘tree’ is a plant with a woody stem that has a sufficiently large diameter at breast height, known as the DBH. The US Forest Service demands a plant has a DBH of greater than 5 inches (12.7 cm) before officially declaring it a tree, but most authorities use a DBH of 10 cm (4 inches).

But we cannot wander round the entire planet individually measuring each woody-stemmed plant and counting up those that meet this criterion. So the researchers who investigated this question took a more pragmatic approach: they first took a series of areas with a common type of landscape, known as a biome, and counted the average number of trees found per square kilometre. They then used satellite imaging to estimate the total area of the planet covered by each type of biome, carried out some complex statistical modelling, and eventually came up with an estimated total of 3.04 trillion (that is 3,040,000,000,000) trees on the planet. This sounds a lot, except they reckoned there used to be twice this number.*3

If authorities differ about what they call a tree, it should be no surprise that more nebulous concepts are even more challenging to pin down. To take an extreme example, the official definition of ‘unemployment’ in the UK was changed at least thirty-one times between 1979 and 1996.4 The definition of Gross Domestic Product (GDP) is continually being revised, as when trade in illegal drugs and prostitution was added to the UK GDP in 2014; the estimates used some unusual data sources—for example Punternet, a review website that rates prostitution services, provided prices for different activities.5

Even our most personal feelings can be codified and subjected to statistical analysis. In the year ending September 2017, 150,000 people in the UK were asked as part of a survey: ‘Overall, how happy did you feel yesterday?’6 Their average response, on a scale from zero to ten, was 7.5, an improvement from 2012 when it was 7.3, which might be related to economic recovery since the financial crash of 2008. The lowest scores were reported for those aged between 50 and 54, and the highest between 70 and 74, a typical pattern for the UK.*

Measuring happiness is hard, whereas deciding whether someone is alive or dead should be more straightforward: as the examples in this book will demonstrate, survival and mortality is a common concern of statistical science. But in the US each state can have its own legal definition of death, and although the Uniform Declaration of Death Act was introduced in 1981 to try to establish a common model, some small differences remain. Someone who had been declared dead in Alabama could, at least in principle, cease to be legally dead were they across the state border in Florida, where the registration must be made by two qualified doctors.7

These examples show that statistics are always to some extent constructed on the basis of judgements, and it would be an obvious delusion to think the full complexity of personal experience can be unambiguously coded and put into a spreadsheet or other software. Challenging though it is to define, count and measure characteristics of ourselves and the world around us, it is still just information, and only the starting point to real understanding of the world.

Data has two main limitations as a source of such knowledge. First, it is almost always an imperfect measure of what we are really interested in: asking how happy people were last week on a scale from zero to ten hardly encapsulates the emotional wellbeing of the nation. Second, anything we choose to measure will differ from place to place, from person to person, from time to time, and the problem is to extract meaningful insights from all this apparently random variability.

For centuries, statistical science has faced up to these twin challenges, and played a leading role in scientific attempts to understand the world. It has provided the basis for interpreting data, which is always imperfect, in order to distinguish important relationships from the background variability that makes us all unique. But the world is always changing, as new questions are asked and new sources of data become available, and statistical science has had to change too.

People have always counted and measured, but modern statistics as a discipline really began in the 1650s when, as we shall see in Chapter 8, probability was properly understood for the first time by Blaise Pascal and Pierre de Fermat. Given this solid mathematical basis for dealing with variability, progress was then remarkably rapid. When combined with data on the ages at which people die, the theory of probability provided a firm basis for calculating pensions and annuities. Astronomy was revolutionized when scientists grasped how probability theory could handle variability in measurements. Victorian enthusiasts became obsessed with collecting data about the human body (and everything else), and established a strong connection between statistical analysis and genetics, biology and medicine. Then in the twentieth century statistics became more mathematical and, unfortunately for many students and practitioners, the topic became synonymous with the mechanical application of a bag of statistical tools, many named after eccentric and argumentative statisticians that we shall meet later in this book.

This common view of statistics as a basic ‘bag of tools’ is now facing major challenges. First, we are in an age of data science, in which large and complex data sets are collected from routine sources such as traffic monitors, social media posts and internet purchases, and used as a basis for technological innovations such as optimizing travel routes, targeted advertising or purchase recommendation systems—we shall look at algorithms based on ‘big data’ in Chapter 6. Statistical training is increasingly seen as just one necessary component of being a data scientist, together with skills in data management, programming and algorithm development, as well as proper knowledge of the subject matter.

Another challenge to the traditional view of statistics comes from the huge rise in the amount of scientific research being carried out, particularly in the biomedical and social sciences, combined with pressure to publish in high-ranking journals. This has led to doubts about the reliability of parts of the scientific literature, with claims that many ‘discoveries’ cannot be reproduced by other researchers—such as the continuing dispute over whether adopting an assertive posture popularly known as a ‘power pose’ can induce hormonal and other changes.8 The inappropriate use of standard statistical methods has received a fair share of the blame for what has become known as the reproducibility or replication crisis in science.

With the growing availability of massive data sets and user-friendly analysis software, it might be thought that there is less need for training in statistical methods. This would be naïve in the extreme. Far from freeing us from the need for statistical skills, bigger data and the rise in the number and complexity of scientific studies makes it even more difficult to draw appropriate conclusions. More data means that we need to be even more aware of what the evidence is actually worth.

For example, intensive analysis of data sets derived from routine data can increase the possibility of false discoveries, both due to systematic bias inherent in the data sources and from carrying out many analyses and only reporting whatever looks most interesting, a practice sometimes known as ‘data-dredging’. In order to be able to critique published scientific work, and even more the media reports which we all encounter on a daily basis, we should have an acute awareness of the dangers of selective reporting, the need for scientific claims to be replicated by independent researchers, and the danger of over-interpreting a single study out of context.
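The danger of running many analyses and reporting only what looks interesting can be made concrete with a small simulation (a sketch of the general phenomenon, not a method from the book): compare many pairs of pure-noise samples and count how many differences cross a conventional significance threshold.

```python
# Simulating 'data-dredging': many tests on pure noise still yield
# a steady trickle of apparently 'significant' findings (~5%).
import random

random.seed(42)

def noise_comparison(n=30):
    """Compare two samples of pure noise; return True if their means
    differ by more than ~2 standard errors (roughly p < 0.05)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Pooled variance of the two samples
    var = (sum((x - mean_a) ** 2 for x in a)
           + sum((x - mean_b) ** 2 for x in b)) / (2 * n - 2)
    se = (2 * var / n) ** 0.5  # standard error of the difference in means
    return abs(mean_a - mean_b) > 2 * se

trials = 1000
false_hits = sum(noise_comparison() for _ in range(trials))
print(f"{false_hits} of {trials} tests on pure noise looked 'significant'")
```

Roughly one in twenty comparisons of identical noise will look "significant"; report only those, and the literature fills with findings that cannot be replicated.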

All these insights can be brought together under the term data literacy, which describes the ability to not only carry out statistical analysis on real-world problems, but also to understand and critique any conclusions drawn by others on the basis of statistics. But improving data literacy means changing the way statistics is taught.

Teaching Statistics

Generations of students have suffered through dry statistics courses based on learning a set of techniques to be applied in different situations, with more regard to mathematical theory than understanding both why the formulae are being used, and the challenges that arise when trying to use data to answer questions.

Fortunately this is changing. The needs of data science and data literacy demand a more problem-driven approach, in which the application of specific statistical tools is seen as just one component of a complete cycle of investigation. The PPDAC structure has been suggested as a way of representing a problem-solving cycle, which we shall adopt throughout this book.9 Figure 0.3 is based on an example from New Zealand, which has been a world-leader in statistics education in schools.

The first stage of the cycle is specifying a Problem; statistical inquiry always starts with a question, such as our asking about the pattern of Harold Shipman’s murders or the number of trees in the world. Later in this book we shall focus on problems ranging from the expected benefit of different therapies immediately following breast cancer surgery, to why old men have big ears.

Praise:

  • "An important and comprehensive new book"—Hannah Fry, The New Yorker
  • "David Spiegelhalter's The Art of Statistics shines a light on how we can use the ever-growing deluge of data to improve our understanding of the world.... The Art of Statistics will serve students well. And it will be a boon for journalists eager to use statistics responsibly -- along with anyone who wants to approach research and its reportage with healthy skepticism."—Evelyn Lamb, Nature
  • "The Art of Statistics is alight with Spiegelhalter's enthusiasm.... It leaves readers with a better handle on the ins and outs of data analysis, as well as a heightened awareness that, as Spiegelhalter writes, 'Numbers may appear to be cold, hard facts, but ... they need to be treated with delicacy.'"—Science News
  • "A book that crams in so much statistical information and nonetheless remains lucid and readable is highly improbable, and yet here it is. In an age of scientific clickbait, 'big data' and personalised medicine, this is a book that nearly everyone would benefit from reading"—Stuart Ritchie, The Spectator
  • "This is an excellent book. Spiegelhalter is great at explaining difficult ideas...Yes, statistics can be difficult. But much less difficult if you read this book"—The Evening Standard (UK)
  • "What David Spiegelhalter does here is provide a very thorough introductory grounding in statistics without making use of mathematical formulae. And it's remarkable. Spiegelhalter is warm and encouraging -- it's a genuinely enjoyable read.... This book should be required reading for all politicians, journalists, medics and anyone who tries to influence people (or is influenced) by statistics. A tour de force."—Popular Science
  • "Do you trust headlines telling you...that bacon, ham and sausages carry the same cancer risk as cigarettes? No, nor do I. That is why we need a book like this that explains how such implausible nonsense arises in the first place. Written by a master of the subject...this book tells us to examine our assumptions. Bravo."—Standpoint
  • "Spiegelhalter goes beyond debunking numerical nonsense to deliver a largely mathematics-free but often formidable education on the vocabulary and techniques of statistical science.... An admirable corrective to fake news and sloppy thinking."—Kirkus
  • "A call to arms for greater societal data literacy.... Spiegelhalter's work serves as a reminder that there are passionate, self-aware statisticians who can argue eloquently that their discipline is needed now more than ever."—Financial Times
  • "Like the fictional investigator Sherlock Holmes, Spiegelhalter takes readers on a trail to challenge methodology and stats thrown at us by the media and others. But where other authors have attempted this and failed, he is inventive and clever in picking the right examples that spark the reader's interest to become active on their own."—Engineering & Technology
  • "In this wonderfully accessible introduction to modern statistics, David Spiegelhalter has created a worthy successor to classics such as Mooney's Facts from Figures. Using many real examples, he introduces the methods and underlying concepts, showing the power and elegance of statistics for gaining understanding and for informing decision-making."—David J. Hand, author of The Improbability Principle
  • "David Spiegelhalter combines clarity of thinking with superb communication skills and a wealth of experience of applying statistics to everyday problems. The result is The Art of Statistics, a book that manages to be enjoyable as well as informative: an engaging introduction for the lay person who wants to gain a better understanding of statistics. Even those with expertise in statistics will find much within these pages to stimulate the mind and cast new light on familiar topics. A real tour de force which deserves to be widely read."—Dorothy Bishop, professor of developmental neuropsychology and Wellcome Trust Principal Research Fellow in the Department of Experimental Psychology, University of Oxford
  • "If I had to trust just one person to interrogate statistical data, I'd trust David Spiegelhalter. He is a master of the art. Here, he shows us how it's done. The result is brilliant; nothing short of an essential guide to finding things out -- delivered through a series of detective-like investigations of specific examples ranging from sexual behavior to murder. The technical essentials are also all here: from averages to infographics, algorithms and Bayesian statistics - both their power and their limitations. All this makes The Art of Statistics a first call for all those setting out on a career or study that involves working with data. But beyond that, it's self-help for anyone with a serious desire to become a clued-up citizen in a world of numbers. If you want pat answers, or meat for your prejudices, go elsewhere. But if you want to develop the skills to see the world as it is, and to tell it how it is -- honestly and seriously -- this is the book."—Michael Blastland, co-author of The Tiger That Isn't: Seeing Through a World of Numbers
  • "David Spiegelhalter is probably the greatest living statistical communicator; more than that, he's one of the great communicators in any field. This marvelous book will transform your relationship with the numbers that swirl all around us. Read it and learn. I did."—Tim Harford, author of The Undercover Economist
  • "Some (including Einstein) define genius as the art of taking something complex and making it simple. In this equation-free, all-encompassing, and totally-understandable-by-anyone introduction to the ideas, tools, and practice of statistics, Spiegelhalter meets that definition. This book is perfect for anyone who has wanted to learn statistics but felt overwhelmed by complicated mathematical equations."—Scott Page, author of The Model Thinker

On Sale
Sep 3, 2019
Page Count
448 pages
Publisher
Basic Books
ISBN-13
9781541618527

David Spiegelhalter

About the Author

David Spiegelhalter is a British statistician and chair of the Winton Centre for Risk and Evidence Communication in the Statistical Laboratory at the University of Cambridge. In 2014 he was knighted for his services to statistics, and from 2017 to 2018 he served as president of the Royal Statistical Society. He lives in the United Kingdom.

Learn more about this author