Statistics Commentary Series: Commentary No. 24
Box plots were first described by John Tukey in a very influential book called Exploratory Data Analysis,[2] or just EDA to the cognoscenti. In this book, Tukey emphasizes 3 points: (1) we should never begin analyzing data before we have visualized them in some way (a self-evident point overlooked far too often even by seasoned data analysts); (2) we can learn much about variables and their relationships with other variables by simply graphing them; and (3) we should rely more on robust statistical methods; that is, on measures such as the median and inter-quartile range (IQR) that are more resilient to non-normality of the data and the effects of outliers. His approach to statistics is summed up in one of his quotes: “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”[3, p. 13] Unfortunately, Tukey’s fertile mind (he introduced the terms “bit” for “binary digit” and “software,” co-developed the fast Fourier transform algorithm, and was a major contributor to statistics) also led him to introduce many new and somewhat confusing terms to describe the parts of the box plot; here, I’ll use the more common ones in addition to his.
An example of a box plot is shown in Figure 1. There are a few points to bear in mind before we discuss its anatomy. First, this box plot is drawn vertically, but box plots can be drawn horizontally as well. Second, it’s shown with a dashed line for the median, open circles for outliers, an X for the mean, and an asterisk for the extreme outlier (these terms will be defined in a bit). Different computer programs may use other symbols, such as a solid line for the median, closed circles for the outliers, a + for the mean, and so forth, but the context or a footnote should make it clear what’s what.
To illustrate the construction of a box plot, we’ll use these 19 values:
8 14 26 26 28 29 33 34 34 35 36 39 39 40 43 52 65 73 92
As you can see, they’re already rank ordered, from lowest to highest. The median, as you likely remember from introductory statistics, is the value that divides the data in half.
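As a quick check of that definition, here is a minimal Python sketch (not part of the original commentary) that finds the median of the 19 values listed above. It assumes the usual convention that, with an odd number of values, the median is the middle one once the data are ordered. The variable names are purely illustrative.

```python
import statistics

# The 19 values from the text, already in rank order.
values = [8, 14, 26, 26, 28, 29, 33, 34, 34, 35,
          36, 39, 39, 40, 43, 52, 65, 73, 92]

n = len(values)
ordered = sorted(values)  # a no-op here, but safe for unsorted data

# With an odd n, the median is the single middle value:
# 9 values fall below it and 9 fall above it.
middle = n // 2  # 0-based index of the middle value
median_by_hand = ordered[middle]

# The standard library gives the same answer.
median_by_library = statistics.median(values)

print(n, median_by_hand, median_by_library)  # prints: 19 35 35
```

With an even number of values there is no single middle value. The usual convention, which `statistics.median` follows, is to average the two middle values.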