Compensation professionals are constantly examining the distribution of data – e.g., the distribution of salaries, salary increases, bonus payouts, etc. While typically focused on quartiles and medians of the distribution, they recognize that they need to look at the entire distribution to truly understand the data in question and recommend the right decisions.

The box plot is one of the most widely used devices for examining and comparing data distributions. Because they rely almost exclusively on Excel for data analysis, however, the vast majority of compensation professionals have not been exposed to box plots or their usefulness (Excel does not offer box plots in its graph portfolio). This post attempts to persuade compensation professionals and other human capital analysts to consider introducing box plots into their analyses and presentations.

The important features of any data distribution are the variation of the data and their clustering, if any, around certain values. The variation is characterized by the range of the data, as determined by the minimum and maximum values. If the data cluster around one value, that value is the central tendency of the data. The other important feature is the spread: the degree to which the data cluster around the central tendency. If there is only a little clustering, the data have a large spread.
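For readers who work outside Excel, here is a minimal sketch in Python of the three features just described. The salary figures are invented for illustration, and the median and standard deviation are just one common choice each for measuring central tendency and spread.

```python
import statistics

# Hypothetical salary sample in $ thousands, invented for illustration only.
salaries = [48, 52, 55, 58, 60, 61, 63, 67, 72, 95]

# Variation: the range is the distance between the minimum and maximum.
minimum, maximum = min(salaries), max(salaries)
data_range = maximum - minimum

# Central tendency: the median is one common measure.
median = statistics.median(salaries)

# Spread: the sample standard deviation is one common measure;
# a larger value means the data cluster less tightly around the center.
spread = statistics.stdev(salaries)

print(f"min={minimum}, max={maximum}, range={data_range}")
print(f"median={median}, standard deviation={spread:.1f}")
```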

For small data sets, the astute analyst can typically get a sense of the distribution by looking at the column of numbers (sorting them in ascending or descending order helps a lot). However, for large data sets, you would need to plot a histogram. Even then, some more work would be required to identify the quartiles. This is where the box plot comes in – it portrays the data distribution in a simple graphic that displays all its important features. Figure 1 shows a simplified box plot.

The box represents the middle 50% of the data, between the 25th and 75th percentiles (i.e., the lower and upper quartiles), and draws focus to the center of the distribution, the median, which is depicted by the red line inside the box. The vertical lines extending from the box reach out to the minimum and maximum values. Looking at two different distributions via their box plots as in Figure 2, we can compare their central tendencies, spreads and ranges at a glance.
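The five values that a simplified box plot draws (minimum, lower quartile, median, upper quartile, maximum) can be computed directly. The sketch below uses Python's standard library; the function name five_number_summary and both data sets are hypothetical, and the "inclusive" quartile method is only one of several conventions, so other tools may report slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max): the values a simplified box plot shows."""
    # quantiles(n=4) returns the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Two hypothetical salary groups in $ thousands, invented for illustration only.
group_a = [50, 54, 57, 60, 62, 65, 70, 74, 80]
group_b = [40, 55, 58, 60, 61, 63, 66, 90, 120]

for name, group in (("A", group_a), ("B", group_b)):
    lo, q1, med, q3, hi = five_number_summary(group)
    # Box height is Q3 - Q1 (the spread of the middle 50%);
    # the lines extending from the box span the full range, hi - lo.
    print(f"Group {name}: median={med}, box={q1} to {q3} "
          f"(height {q3 - q1}), range={lo} to {hi} (width {hi - lo})")
```

In this made-up example the two groups have nearly the same median, but group B has a shorter box and a much wider range, which is exactly the kind of difference a pair of box plots makes visible. If a plotting library such as matplotlib is available, its boxplot function can draw the same picture; passing whis=(0, 100) extends the lines to the minimum and maximum, as in the simplified box plot described here.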

The box plots are much more striking and informative than the tabular data.

How might box plots be used in compensation? Figure 3 illustrates how internal salaries are arrayed in relation to salary ranges and the market data. Again, in one quick glance you can see where the outliers are (both internal and versus the market) and how the internal salaries compare with the market salaries. It is unfortunate that most survey vendors do not illustrate market data using box plots and are not in a position to report back participants’ overall salary distribution in relation to the market.
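The comparison described above can also be roughed out numerically before any chart is drawn. This is only a sketch under stated assumptions: the salaries, the salary range and the market quartiles are all invented, and "outlier" here simply means a salary outside the salary range for the grade.

```python
import statistics

# All figures are hypothetical, in $ thousands, for a single job grade.
internal_salaries = [58, 62, 64, 66, 69, 71, 75, 88]
range_min, range_max = 60, 85                      # assumed salary range
market_q1, market_median, market_q3 = 63, 70, 78   # assumed survey quartiles

# Internal outliers: salaries that fall outside the salary range.
below_range = [s for s in internal_salaries if s < range_min]
above_range = [s for s in internal_salaries if s > range_max]

# Position against the market: compare the medians, and count how many
# internal salaries sit outside the market's middle 50%.
internal_median = statistics.median(internal_salaries)
gap_to_market = internal_median - market_median
outside_market_box = [s for s in internal_salaries
                      if s < market_q1 or s > market_q3]

print("Below range:", below_range)
print("Above range:", above_range)
print("Internal median minus market median:", gap_to_market)
print("Outside market interquartile box:", outside_market_box)
```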

I referred to the box plots depicted above as simplified box plots. A subsequent post will explain variations on the simplified box plot. I should mention that the box plot is a fairly recent invention. John Tukey introduced it in the 1970s, most prominently in his 1977 book Exploratory Data Analysis, and it has been a staple of statisticians since then. Box plots are used widely in medical research and economics and are a standard graphing format in statistical software programs.

In a subsequent post, we’ll examine how the average of the distribution and outliers are depicted.
