Statistics Commentary Series: Commentary No. 24

When we want to show how 2 or more groups differ on some variable, we often use a bar chart, with the height of the bars reflecting the magnitudes of the variables. If we want to convey some more information, we can add error bars, corresponding to the standard deviation (SD), the standard error of the mean (SEM), or – better yet – 1.96 SEMs, reflecting the 95% confidence interval (CI).1 But that’s a lot of real estate on a journal page devoted to portraying just 2 pieces of information for each group. On the other hand, there is much more information about the variables that we may want to know but that isn’t shown in a bar chart, such as whether or not the data are skewed, if there are outliers, how tightly the data are clustered around the mean, and so forth. This article will introduce you to a graphing method – box plots (also called box-and-whisker plots) and an extension of them, notched box plots – that can meet the demand for displaying a lot of information about a variable in an easily digestible manner.
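As a rough illustration of the three error-bar choices just mentioned, here is a minimal Python sketch. The sample values and the function name `error_bar_half_widths` are hypothetical, made up only for this illustration. The 1.96 multiplier is the large-sample normal approximation; with a sample this small, a t-based multiplier would usually be preferred.

```python
import math
from statistics import mean, stdev

def error_bar_half_widths(sample):
    """Return the half-width of each kind of error bar for one group."""
    n = len(sample)
    sd = stdev(sample)            # sample standard deviation (n - 1 denominator)
    sem = sd / math.sqrt(n)       # standard error of the mean
    ci95 = 1.96 * sem             # normal-approximation 95% CI half-width
    return {"sd": sd, "sem": sem, "ci95": ci95}

# Hypothetical scores for a single group (not taken from the article)
hypothetical_scores = [12, 15, 11, 18, 14, 16, 13, 17]

centre = mean(hypothetical_scores)          # height of the bar
bars = error_bar_half_widths(hypothetical_scores)

for name, half_width in bars.items():
    print(f"{name}: {centre - half_width:.2f} to {centre + half_width:.2f}")
```

Whichever option is chosen, the bar chart still shows only two numbers per group: the mean and one measure of spread or precision.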
Box plots were first described by John Tukey in a very influential book called Exploratory Data Analysis2 or just EDA by the cognoscenti. In this book, Tukey emphasizes 3 points: (1) we should never begin analyzing data before we have visualized them in some way (a self-evident point overlooked far too often even by seasoned data analysts); (2) we can learn much about variables and their relationships with other variables by simply graphing them; and (3) we should rely more on robust statistical methods; that is, parameters such as the median and inter-quartile range (IQR) that are more resilient to non-normality of the data and the effects of outliers. His approach to statistics is summed up in one of his quotes: “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”3, p. 13 Unfortunately, Tukey’s fertile mind (he introduced the terms “bit” for “binary digit” and “software,” co-developed the fast Fourier transform algorithm, and was a major contributor to statistics) also led him to introduce many new and somewhat confusing terms to describe the parts of the box plot; here, I’ll use the more common ones in addition to his.
An example of a box plot is shown in Figure 1. There are a few points to bear in mind before we discuss its anatomy. First, this box plot is drawn vertically, but box plots can be drawn horizontally as well. Second, it’s shown with a dashed line for the median, open circles for outliers, an X for the mean, and an asterisk for the extreme outlier (these terms will be defined in a bit). Different computer programs may use other symbols, such as a solid line for the median, closed circles for the outliers, a + for the mean, and so forth, but the context or a footnote should make it clear what’s what.
To illustrate the construction of a box plot, we’ll use these 19 values:
8 14 26 26 28 29 33 34 34 35 36 39 39 40 43 52 65 73 92
As you can see, they’re already rank ordered, from lowest to highest. The median, as you likely remember from introductory statistics, is the value that divides the data in half.
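For readers who like to check such things by computer, here is a minimal Python sketch that finds the median of the 19 values listed above. The variable names are mine, chosen only for illustration. With an odd number of observations, the median is simply the value at rank (n + 1)/2.

```python
from statistics import median

# The 19 rank-ordered values from the text
values = [8, 14, 26, 26, 28, 29, 33, 34, 34, 35,
          36, 39, 39, 40, 43, 52, 65, 73, 92]

n = len(values)
middle_rank = (n + 1) // 2            # rank of the middle value when n is odd
by_rank = values[middle_rank - 1]     # Python lists are indexed from 0
med = median(values)                  # same answer from the standard library

print(f"n = {n}, middle rank = {middle_rank}, median = {med}")
```

Nine values fall below the value at rank 10 and nine fall above it, which is what “divides the data in half” means here.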