Example Code

Boxplots and Stem-and-Leaf Displays

Products and Environment

This section reflects the products and operating system used to create the example.

To download NI software, including the products shown below, visit ni.com/downloads.

    Software

  • LabVIEW

Code and Documents

Attachment

Description

You can use boxplots and stem-and-leaf displays in exploratory data analysis (EDA) to display the basic statistics of data sets in a visual format. The boxplot is useful for summarizing a data set. A boxplot consists of rectangles that you position according to the quartiles and median of the data set. The stem-and-leaf display shows all the data points in a tabular form. Individually, the boxplot and the stem-and-leaf display typically yield only half the picture of the data set. However, when you present the boxplot and stem-and-leaf display together, they complement each other well.

 

How to Use

1. Introduction to Boxplots and Stem-and-Leaf Displays

You can use boxplots to see the distribution of a single data set or, when you plot several boxplots side by side, to compare the location and variation of several data sets. A boxplot consists of rectangles and lines and shows the median of the data, the upper and lower quartiles, and any data points that are possible outside values. The boxplot is a popular EDA tool because of its visual nature and its ability to show you information quickly. The following figure shows a typical boxplot.



The large, divided rectangle in the middle forms the box around which you draw the rest of the boxplot. The upper and lower quartiles of the data set determine the size and location of this box. The line that divides the box horizontally through the middle represents the median of the data set. You draw the top edge of the box at the value corresponding to the upper quartile of the data. The upper quartile is the median of the upper 50% of the data values, or the values greater than the global median. You draw the bottom edge of the box at the value corresponding to the lower quartile of the data. The lower quartile is the median of the lower 50% of the data values, or the values less than the global median. Vertical lines called whiskers extend from the middle of the top and bottom edges of the box. The whiskers are 1.5 times the inner quartile spread in length, which you measure from the median. The inner quartile spread, also called the interquartile range, is the difference between the upper and lower quartiles of the data. The whiskers provide an arbitrary cutoff point to identify data points that are possible outside values. You represent data points that fall outside the whiskers but within three times the inner quartile spread with small x's. You represent points beyond three times the inner quartile spread with large X's.
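The following Python sketch restates these definitions in code. It is not part of the LabVIEW example; the function name boxplot_stats and the dictionary keys are hypothetical. It follows the description above literally, measuring both cutoffs from the median, whereas many statistics packages measure 1.5 times the spread from the quartiles instead.

```python
from statistics import median


def boxplot_stats(data):
    """Summarize one data set using the definitions in the text above.

    Assumes the data has at least one value below and one value above
    the median; otherwise the quartiles are undefined in this sketch.
    """
    values = sorted(data)
    med = median(values)
    # Quartiles: medians of the values below and above the global median.
    lower_q = median([v for v in values if v < med])
    upper_q = median([v for v in values if v > med])
    spread = upper_q - lower_q  # inner quartile spread
    # Assumption taken from the text: both cutoffs are measured from the median.
    near_limit = 1.5 * spread
    far_limit = 3.0 * spread
    near = [v for v in values if near_limit < abs(v - med) <= far_limit]
    far = [v for v in values if abs(v - med) > far_limit]
    return {
        "median": med,
        "lower_quartile": lower_q,
        "upper_quartile": upper_q,
        "spread": spread,
        "near_outliers": near,  # drawn as small x's
        "far_outliers": far,    # drawn as large X's
    }


stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 30])
print(stats)
```

For this sample data the median is 6 and the quartiles are 3 and 9, so the value 20 falls in the near-outlier band and 30 falls beyond it.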

Boxplots are useful for determining where the majority of the data lies. Boxplots also draw attention to extreme data that you need to examine for measurement errors. The boxplot in the figure above shows data that has a median of 2.07, an upper quartile of 2.10, and a lower quartile of 2.06. This plot shows data that is not widespread because 50% of the data points lie within 0.04 units of each other. The plot also contains seven data points that are potential outside values. This plot suggests that the data follows the bell curve in its distribution, but you need to inspect the extreme points for measurement errors.

You can display several boxplots side by side to contrast the differences in distributions and medians of different sets of data. The following figure shows four boxplots side by side.

Although the medians are all roughly the same, you can see at a glance that the spread of each data set is different. The boxplot on the left shows data that appears to be distributed evenly. The median is in the middle of the rectangle, and the whiskers are about the same length. In addition, the plot contains no outside values. The median of the second plot from the left appears to be slightly off-center. The number of extreme values is a point of concern because it suggests that the data vary widely. The third boxplot shows data that has less variation and spread than the other plots. The fourth boxplot shows data that is significantly skewed upward. The median of this plot is closer to the top of the rectangle than to the bottom, and the upper whisker is longer than the bottom one. The two boxplots on the left have approximately the same variation in the data.

You can use boxplots in conjunction with stem-and-leaf displays. In contrast to the summary data of the boxplot, stem-and-leaf displays group individual observations to show the distribution of the data set. When you use boxplots and stem-and-leaf displays together, each compensates for the weaknesses of the other.

To construct a stem-and-leaf display, you subdivide the range of the data set into 6 to 14 divisions. You can choose the number of divisions arbitrarily and adjust the divisions later to make the display most meaningful. List the divisions as ranges in a column on the left side of the display. Then draw a vertical line to the right of the column, and write the identifying digits of the observations to the right of the line to form a row. The column of ranges is the stem of the display, and the individual observations on the right of the vertical line are the leaves. For example, the data set {19, 21, 21, 23, 23, 23, 23, 26, 26, 28, 30, 33} looks like the following figure when you divide the data into a set of four equal ranges.
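As a rough text-only illustration of this construction, the Python sketch below builds a display for the same data set. It is not part of the LabVIEW example. The function name stem_and_leaf, the starting value of 18, the division width of 4, and the row format are assumptions chosen for illustration, and the sketch handles only non-negative integers.

```python
def stem_and_leaf(data, start, width, divisions):
    """Return one text row per division: 'low-high | leaves'.

    Assumes non-negative integer data. The leaf is the last digit of
    each observation that falls inside the division.
    """
    values = sorted(data)
    rows = []
    for i in range(divisions):
        low = start + i * width
        high = low + width  # exclusive upper bound of this division
        leaves = "".join(str(v % 10) for v in values if low <= v < high)
        rows.append(f"{low}-{high - 1} | {leaves}")
    return rows


data = [19, 21, 21, 23, 23, 23, 23, 26, 26, 28, 30, 33]
for row in stem_and_leaf(data, start=18, width=4, divisions=4):
    print(row)
```

With these assumed divisions, the second row holds the four observations of 23, which makes the peak of the distribution visible at a glance.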



The following figure shows a symmetrical boxplot and its corresponding stem-and-leaf display.



The boxplot shows that 50% of the data set lies between 2.05 and 2.15. The stem-and-leaf display supports this conclusion. More importantly, the symmetric boxplot suggests that the data set has a bell-shaped distribution, which the stem-and-leaf display also supports. By looking at the stem-and-leaf display, you can immediately determine that the potential outlier at the bottom of the boxplot has a value of 1.75.

Stem-and-leaf displays are especially useful because they can show gaps in the data that are not obvious with the boxplot alone. In the following figure, the boxplot is symmetrical and evenly spaced, suggesting that the data set has a bell-shaped distribution.




The stem-and-leaf display clearly tells another story. It shows that the data is not normally distributed. In fact, the stem-and-leaf display shows two concentrations of data points, one at each end of the range.

Data sets with flat distributions appear to be distributed normally on the boxplot as in the figure above. Once again, the stem-and-leaf display corrects this assumption by showing that all the stems have roughly the same number of leaves.

 

2. Creating Boxplots in LabVIEW

You construct boxplots with rectangles, lines, and x's, which the waveform graph in LabVIEW does not support. Instead, you can use the XY graph to construct boxplots in LabVIEW. The XY graph accepts each point as an explicit pair of x and y values, which allows you to draw lines to and from any point on the graph. Therefore, you can construct a boxplot by stringing together a list of points that you carefully format to create either connected lines or disconnected x's.

The following figure illustrates a hierarchy tree consisting of example VIs that together create a boxplot.


The top-level VI, Boxplot, has two subVIs. In the branch on the right, you calculate the data necessary to display the plot. The Compute Boxplot data VI is the top-level VI for this branch. In the branch on the left, you format the plots using the Format Plots VI. This arrangement of the hierarchy tree is necessary because the Boxplot VI can display several boxplots in the same XY graph. You cannot determine the final number of boxplots when you write the program, so you need to be able to format the plots dynamically. The interaction of the top three VIs in the figure above is nontrivial. The block diagram for the Boxplot VI, shown in the following figure, contains the functionality of these three VIs.




The block diagram of the Boxplot VI shows that the For Loop accepts the data as a 1D array of clusters of 1D arrays, allowing you to process data sets of different lengths and display them in parallel boxplots. If you use a 2D array instead, LabVIEW adds zeros to the individual data sets so that the data sets are all the same length. Using the array of clusters of arrays avoids this behavior. When you run the Boxplot VI, the VI autoindexes the data array to show a cluster containing a 1D array. This VI then unbundles the cluster and passes the data to the Compute Boxplot data subVI. Data from this subVI contains information that you need to display a single boxplot. The Boxplot VI builds this information into an array and passes it to Box Plot, an XY graph. The sequence structure in the block diagram above ensures that this VI passes the data to the XY graph before the Format Plots VI formats the individual boxplots.
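The effect of zero padding is easy to see in a small Python analogy, offered here only as an illustration and not as a translation of the block diagram. The sample values are made up. A list of lists stands in for the array of clusters of arrays, and the padded rows stand in for the 2D array.

```python
from statistics import median

# Two data sets of different lengths (illustrative values).
data_sets = [[2.0, 2.5, 3.0, 3.5, 4.0], [5.0, 6.0]]

# Analogy for the array of clusters of arrays: each set keeps its own length.
ragged_medians = [median(s) for s in data_sets]

# Analogy for the 2D array: shorter rows are padded with zeros.
longest = max(len(s) for s in data_sets)
padded = [s + [0.0] * (longest - len(s)) for s in data_sets]
padded_medians = [median(row) for row in padded]

print(ragged_medians)  # medians of the real data
print(padded_medians)  # second median is distorted by the padding zeros
```

The padded second row contains three zeros, so its median becomes 0.0 instead of 5.5, and the resulting boxplot would misrepresent the data.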

The construction of each boxplot on the XY graphs requires the use of five plots: one for the box, one for the upper whisker, one for the lower whisker, one for the little x's, or near outliers, and one for the big X's, or far outliers. You can use three plots instead of five if you combine the box and the whiskers, but you then cannot include the outlier values in the same plot because they require a different point style that is not compatible with the line plots. The following figure shows the block diagram of the Compute Boxplot data VI. This figure illustrates that the box with the median constitutes the first plot because its data is wired to the first position of the Build Array function. The far outliers and near outliers are the second and third plots, respectively, and the upper and lower whiskers are the last two plots.



The block diagram in the figure above contains annotated sections A through F. The Compute Boxplot data VI first calculates the median and quartiles of the data, as shown in section A. As shown in section B, this VI then calculates the x location of the center of the boxplot. This calculation is vital when you combine two or more boxplots in one XY graph. Section C shows that this VI computes the data that represents the box and median. In section D, this VI uses the median and quartile information to establish the length of the whiskers and builds the whisker arrays. In section E, this VI sorts the data into near and far outliers. This VI then bundles outlier arrays and whisker arrays with arrays having the center x value. Finally, in section F, this VI combines all the clusters of arrays into an array of clusters of arrays.
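The idea of describing one boxplot as five lists of points can be sketched in Python as follows. This is an illustration under stated assumptions, not the contents of the Compute Boxplot data VI: the function name, the dictionary keys, the half_width parameter, and the order in which the box corners are visited are all hypothetical. The plot order matches the order described above.

```python
def build_boxplot_plots(center_x, half_width, stats):
    """Return five (xs, ys) pairs: box, far outliers, near outliers,
    upper whisker, lower whisker.

    `stats` is a hypothetical dictionary of precomputed values.
    """
    left, right = center_x - half_width, center_x + half_width
    q1 = stats["lower_quartile"]
    q3 = stats["upper_quartile"]
    med = stats["median"]

    # One connected line: trace the rectangle, then draw the median line.
    box = ([left, right, right, left, left, left, right],
           [q1, q1, q3, q3, q1, med, med])

    # Outliers are unconnected points stacked on the center x value.
    far = ([center_x] * len(stats["far_outliers"]), list(stats["far_outliers"]))
    near = ([center_x] * len(stats["near_outliers"]), list(stats["near_outliers"]))

    # Whiskers run from the box edges to precomputed end values.
    upper = ([center_x, center_x], [q3, stats["upper_whisker_end"]])
    lower = ([center_x, center_x], [q1, stats["lower_whisker_end"]])

    return [box, far, near, upper, lower]


plots = build_boxplot_plots(
    center_x=1.0,
    half_width=0.25,
    stats={
        "median": 6, "lower_quartile": 3, "upper_quartile": 9,
        "upper_whisker_end": 15, "lower_whisker_end": -3,
        "near_outliers": [20], "far_outliers": [30],
    },
)
```

Giving each boxplot its own center_x value is what keeps several boxplots from overlapping when you combine them in one graph.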

The Format Plots VI formats each of the five plots that the Compute Boxplot data VI creates to display the boxplot. The following figure shows the block diagram of the Format Plots VI.



This VI immediately resizes the XY graph using the first property node on the left. This VI then passes the reference to the For Loop that iterates through each of the boxplots that you will display. The property node within the For Loop sets the characteristics of all five plots of each boxplot by repeating the properties Active Plot, Color, Line Style, Point Style, and Interpolation for each of the component plots. The Active Plot property selects the plot that subsequent properties will affect. This VI then sets the color, line thickness, and representation of the exact data points. This VI also determines whether to automatically connect the data points. The Format Plots VI then sets the Active Plot property to the next plot, and the process repeats.

This VI uses one property node instead of several because calling property nodes forces LabVIEW to switch to the user interface thread, requiring a small amount of time that can add up over several property node calls. By using just one property node, you can increase performance by calling the user interface only once.

The error cluster in the For Loop runs through a shift register so you can abort future property changes when an error occurs. If you use a tunnel instead, any error not in the last iteration of the For Loop is lost and no indication that an error occurred is shown.
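A loose Python analogy may help illustrate the difference. It is only a sketch of the data-flow idea, not LabVIEW behavior reproduced exactly, and the function names are hypothetical. Each step returns None on success or an error message on failure.

```python
def run_with_shift_register(steps):
    """Carry the error from one iteration to the next."""
    error = None
    for step in steps:
        if error is None:  # skip remaining work once an error exists
            error = step()
    return error


def run_with_tunnel(steps):
    """Keep only the value produced by the last iteration."""
    result = None
    for step in steps:
        result = step()  # every step runs; earlier results are overwritten
    return result


def ok():
    return None


def fail():
    return "property failed"


print(run_with_shift_register([ok, fail, ok]))  # the error is preserved
print(run_with_tunnel([ok, fail, ok]))          # the error is lost
```

In the first function, the error from the second step survives to the end and the third step is skipped. In the second function, the successful third step overwrites the error.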

Boxplot.llb contains an example along with the VIs described above.
 

 

3. Creating Stem-and-Leaf Displays in LabVIEW


The stem-and-leaf display is unique in EDA because it is text-based. The following figure shows the block diagram of the Stem-and-Leaf example VI, which creates a stem-and-leaf display. Rather than manipulating graphs, the Stem-and-Leaf VI manipulates strings. This VI determines the range of the stems, appends the leaves, and builds up all the rows.





This VI first checks to see if the data array you wire in contains any elements. If the data array is empty, the VI does not run. Otherwise, this VI executes Case 1 from the block diagram. Inside this case, this VI first sorts the data and calculates the range of the stems. This VI then passes this information to the For Loop where the VI creates the stems before passing the data to the While Loop. The While Loop steps through the array, takes the last digits of values falling into the range of the particular stem, then appends each digit to the stem string until the VI reaches a value outside of the range. This VI then appends a stem string to the existing stem-and-leaf string, and the process repeats until the VI processes all the data.

Before this VI can display the data, the Format Stem-and-Leaf VI must process the data. LabVIEW wraps strings extending beyond the border of a string indicator to the next line, which negatively affects the formatting for the stem-and-leaf display. Therefore, the Format Stem-and-Leaf VI truncates the rows before displaying data. The following figure shows the block diagram of the Format Stem-and-Leaf VI.


This VI first breaks the string into an array of strings in the first While Loop. Some stems have fewer characters than others, which causes the "|" character not to line up. This problem can occur when the data range is from positive to negative, or when the range is large. For instance, you can split a range from -10 to 14 into the following divisions: -10 to -6, -6 to -2, -2 to 2, 2 to 6, 6 to 10, and 10 to 14, producing stems of lengths 5, 4, 4, 3, 4, and 5, respectively. In order for the "|" character to line up, you need to calculate an array of spaces outside of the While Loop. Therefore, this VI also processes an array of string lengths to the "|" character in the While Loop. The For Loop builds the array of spaces and adds the space string to the beginning of the string. The next logical step is to truncate the string so that the string fits in the string indicator. The Format Stem-and-Leaf VI then compares the number of digits to the length of the string to fill the display that has a default of 64 characters. This VI uses the smaller of the two numbers to truncate the string, which it then concatenates into a final string that you can display.
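The alignment and truncation steps can be approximated in a few lines of Python. This is a sketch of the same idea rather than a translation of the Format Stem-and-Leaf VI. The function name and the stem labels in the sample text are hypothetical, and the sketch assumes that every row contains a "|" character.

```python
def format_stem_and_leaf(text, width=64):
    """Align the '|' characters and truncate each row to `width` characters."""
    rows = text.splitlines()
    if not rows:
        return ""
    # Position of the '|' character in each row.
    bar_positions = [row.index("|") for row in rows]
    target = max(bar_positions)
    formatted = []
    for row, position in zip(rows, bar_positions):
        # Pad shorter stems on the left so that every '|' lines up.
        padded = " " * (target - position) + row
        # Slicing keeps the whole row when it is shorter than `width`.
        formatted.append(padded[:width])
    return "\n".join(formatted)


sample = "-10:-6 |12\n2:6 |345"
print(format_stem_and_leaf(sample))
```

Padding on the left keeps the leaves starting in the same column, which is what makes the row lengths comparable by eye.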
 

 

Example code from the Example Code Exchange in the NI Community is licensed with the MIT license.
