In the “Behold the Box Plot” post, we examined a simple box plot, which is good for most purposes. The version that most statisticians use, which I’ll call the statistical box plot, has two additional features.
The first is to show the average, or mean, of the distribution in the box. The second is to characterize some observations as outliers and show them explicitly. Figure 1 illustrates the statistical box plot. I will explain each part of this new schematic.
The inter-quartile range (IQR) is the distance between the lower and upper quartiles. The box that spans it contains (by definition) the middle 50% of the data points, around the median. The average, denoted by a dashed line, completes the picture of the data distribution. Compensation professionals rely more on the median, since it is a measure of central tendency that is far less sensitive to outliers than the average.
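To make these quantities concrete, here is a minimal Python sketch. The salary figures are invented for illustration, and the quartile convention (the "inclusive" method of the standard library's `statistics.quantiles`) is an assumption; statistical packages use slightly different quartile definitions and may give slightly different numbers.

```python
import statistics

# Hypothetical salaries in thousands of dollars (illustrative only).
salaries = [38, 42, 45, 47, 49, 50, 52, 54, 56, 58, 65, 80, 250]

# n=4 asks for the three cut points that split the data into quarters.
# The "inclusive" method is an assumed convention, not the only one.
q1, median, q3 = statistics.quantiles(salaries, n=4, method="inclusive")

iqr = q3 - q1                      # height of the box
mean = statistics.mean(salaries)   # the dashed line in the box plot

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}, mean={mean:.1f}")
```

In this made-up sample the single very large salary pulls the mean above the upper quartile, while the median stays in the middle of the box. That is the behavior that makes the median the more robust measure.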
It is important to identify data points that are far from the median. These may or may not be relevant to the analysis. When looking at salaries, for example, a person with an extremely high or low salary may be a special case and therefore outside the scope of the analysis. In general, the focus of any analysis is where the bulk of the data are – those data points within the IQR.
The IQR is used as a metric to denote two ranges of data that are illustrated in Figure 2. The “inner fence” is 1.5 times the IQR beyond the lower and upper quartiles (i.e., beyond the box). The “outer fence” extends 1.5 times the IQR beyond the inner fence (i.e., 3 times the IQR beyond the box).
These ranges allow us to classify data points beyond the IQR as “outliers” or “extremes.” An outlier is any data point that is beyond the inner fence but within the outer fence. Outliers are denoted by an “x” symbol. An extreme is any data point that lies beyond the outer fence. Extreme values are denoted by an “o” symbol.
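The fence arithmetic and the outlier/extreme classification can be sketched in a few lines of Python. The function names `fences` and `classify` are hypothetical, the data are invented, and the quartile convention is again an assumption, so treat this as an illustration of the definitions rather than what any particular package does.

```python
import statistics

# Hypothetical salaries in thousands of dollars (illustrative only).
salaries = [38, 42, 45, 47, 49, 50, 52, 54, 56, 58, 65, 80, 250]

def fences(data):
    """Return ((inner_low, inner_high), (outer_low, outer_high))."""
    # Assumed quartile convention; other conventions shift the fences slightly.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # 1.5 x IQR beyond the box
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # 3 x IQR beyond the box
    return inner, outer

def classify(value, inner, outer):
    """Label a data point relative to the two fences."""
    if value < outer[0] or value > outer[1]:
        return "extreme"          # beyond the outer fence
    if value < inner[0] or value > inner[1]:
        return "outlier"          # beyond the inner fence, within the outer
    return "within fences"

inner, outer = fences(salaries)
flagged = {x: classify(x, inner, outer)
           for x in salaries
           if classify(x, inner, outer) != "within fences"}
print(inner, outer, flagged)
```

With these numbers the box runs from 47 to 58, so the IQR is 11. The inner fence sits at 30.5 and 74.5 and the outer fence at 14 and 91, which makes 80 an outlier and 250 an extreme.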
As you can see in Figure 3, the lines that extend from the box no longer reach out to the minimum and maximum of the data. These lines are also known as “whiskers” (box plots are sometimes referred to as box-and-whisker diagrams).
In the statistical box plot, each line extends only as far as the most extreme data point that still lies within the inner fence. If there is no data point between the box and the inner fence, then the line extends to the inner fence (there can still be outlier or extreme data points beyond it).
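As a sketch of where the whiskers end, the snippet below follows one common convention (the one Tukey described), in which each whisker stops at the most extreme data point that still lies within the inner fence. The function name `whisker_ends` and the data are hypothetical, and the sketch does not implement the fallback of drawing the line out to the fence itself when no data point lies between the box and the fence.

```python
import statistics

# Hypothetical salaries in thousands of dollars (illustrative only).
salaries = [38, 42, 45, 47, 49, 50, 52, 54, 56, 58, 65, 80, 250]

def whisker_ends(data):
    """Return the (lower, upper) whisker end points."""
    # Assumed quartile convention; packages differ in the details.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Keep only the points that are not outliers or extremes.
    inside = [x for x in data if low_fence <= x <= high_fence]
    return min(inside), max(inside)

low_whisker, high_whisker = whisker_ends(salaries)
print(low_whisker, high_whisker)   # compare with min(salaries), max(salaries)
```

Here the lower whisker still reaches the minimum (38), but the upper whisker stops at 65 instead of the maximum of 250, because 80 and 250 fall beyond the inner fence.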
There is a lot of terminology involved here, but the picture itself speaks a thousand words. Once you are familiar with the elements of a statistical box plot, you can glean all the relevant features of the underlying data distribution at a glance.
Statistical packages such as Stata, SAS and SPSS allow you to customize the box plots to various degrees. Each uses slightly different terminology for each of the elements we have reviewed so far. However, now that you have the general idea, you should be able to navigate the peculiarities of any software that generates customized box plots.
One useful feature is the ability to vary the width of each box by the sample size (i.e., the number of observations) so that you can visually weight the data when comparing different distributions. If you are not interested in the outliers and extremes, you can suppress them from the graphic.
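How the width is scaled depends on the software. One convention, used for example by R's `varwidth` option, makes the width proportional to the square root of the number of observations. The sketch below assumes that convention; the function name `box_widths`, the `max_width` parameter and the group sizes are all hypothetical.

```python
import math

def box_widths(group_sizes, max_width=0.8):
    """Scale box widths by sqrt(n); the largest group gets max_width."""
    roots = {name: math.sqrt(n) for name, n in group_sizes.items()}
    largest = max(roots.values())
    return {name: max_width * root / largest for name, root in roots.items()}

# Hypothetical group sizes (number of observations per box).
widths = box_widths({"Engineering": 400, "Sales": 100, "Legal": 25})
print(widths)
```

A group with a quarter of the observations gets a box half as wide, so small samples are visibly de-emphasized without disappearing from the chart.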
Box plots can be used to characterize data distributions in any field. Figure 4 shows the results of the Michelson-Morley experiments of 1887 to measure the speed of light. Of course, the box plots were created a century later, having been invented in the late 1970s.
There is an even more elaborate way to represent a distribution (hint: it is named after a musical instrument). I’ll defer that description until we’ve taken a look at probability densities and cumulative distributions. Stay tuned!