Statistics let you summarize data and draw conclusions by condensing large amounts of data into a form that brings out the essential information and is still easy to remember. To condense data, you replace it with single numbers that make the data more intelligible and support useful inferences. For example, in one season a sports player participates in 51 games and scores a total of 1,568 points. That total includes 45 points in Game A, 36 points in Game B, 51 points in Game C, 45 points in Game D, and 40 points in Game E. As the number of games increases, remembering how many points the player scored in each game becomes increasingly difficult. If you divide the total number of points by the number of games played, you obtain a single number: the average number of points the player scored per game. Equation A yields the points-per-game average for the player.
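$\text{points per game} = \dfrac{\text{total points}}{\text{games played}} = \dfrac{1568}{51} \approx 30.75$ | (A) |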
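As a quick check of the arithmetic, the division can be sketched in a few lines of Python. The variable names are illustrative only and do not come from the original text.

```python
# Season totals quoted in the example above.
total_points = 1568
games_played = 51

# Average points per game: total divided by number of games.
points_per_game = total_points / games_played

print(f"{points_per_game:.2f}")  # prints 30.75
```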
Computing percentages provides a method for making comparisons. For example, the officials of an American city are considering installing a traffic signal at a major intersection. The purpose of the traffic signal is to protect motorists turning left from oncoming traffic. The city has only enough money to fund one traffic signal, but three intersections potentially need one. Traffic engineers study each of the three intersections for a week. The engineers record the total number of cars using the intersection, the number of cars traveling straight through the intersection, the number of cars making left-hand turns, and the number of cars making right-hand turns. The following table shows the data for one of the intersections.
Day | Total Number of Cars Using the Intersection | Number of Cars Turning Left | Number of Cars Turning Right | Number of Cars Continuing Straight |
--- | --- | --- | --- | --- |
1 | 1,258 | 528 | 330 | 400 |
2 | 1,306 | 549 | 340 | 417 |
3 | 1,355 | 569 | 352 | 434 |
4 | 1,227 | 515 | 319 | 393 |
5 | 1,334 | 560 | 346 | 428 |
6 | 694 | 291 | 180 | 223 |
7 | 416 | 174 | 108 | 134 |
Totals | 7,590 | 3,186 | 1,975 | 2,429 |
Looking only at the raw data from each intersection might make determining which intersection needs the traffic signal difficult because the raw numbers can vary widely. However, computing the percentage of cars turning at each intersection provides a common basis for comparison. To obtain the percentage of cars turning left, divide the number of cars turning left by the total number of cars using the intersection and multiply that result by 100. For the intersection whose data is shown in the previous table, the following equation gives the percentage of cars turning left.
$\text{percentage turning left} = \dfrac{3186}{7590} \times 100 \approx 41.98\%$ | (B) |
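The percentage calculation can be reproduced from the daily counts in the table. The following Python sketch is only an illustration; the function name `percent` and the list names are assumptions, not part of the original material.

```python
# Daily counts for days 1 through 7, copied from the table above.
total_cars = [1258, 1306, 1355, 1227, 1334, 694, 416]
left_turns = [528, 549, 569, 515, 560, 291, 174]


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Weekly totals, then the share of cars that turned left.
pct_left = percent(sum(left_turns), sum(total_cars))

print(f"{pct_left:.2f}%")  # prints 41.98%
```

Running the same function on the counts for the other two intersections would give the numbers needed to rank all three.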
Given the data for the other two intersections, the city officials can obtain the percentage of cars turning left at those two intersections. Converting the raw data to a percentage condenses the information for the three intersections into single numbers representing the percentage of cars that turn left at each intersection. The city officials can compare the percentage of cars turning left at each intersection and rank the intersections in order of highest percentage of cars turning left to the lowest percentage of cars turning left. Ranking the intersections can help determine where the traffic signal is needed most. Thus, in a broad sense, the term statistics implies different ways to summarize data to derive useful and important information from it.
The mean value is the average value for a set of data samples. The following equation defines an input sequence X consisting of n samples.
$X = \{x_0, x_1, x_2, x_3, \ldots, x_{n-1}\}$ | (C) |
The following equation yields the mean value for input sequence X.
$\mu = \dfrac{1}{n} \sum_{i=0}^{n-1} x_i$ | (D) |
The mean equals the sum of all the sample values divided by the number of samples. Equation A applies the same computation to the points-per-game example.
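A minimal Python sketch of the mean follows, using the five game scores quoted in the opening example as sample data. The function name `mean` is chosen here for illustration.

```python
def mean(x):
    """Sum of the samples divided by the number of samples."""
    if len(x) == 0:
        raise ValueError("mean of an empty sequence is undefined")
    return sum(x) / len(x)


# Points scored in Games A through E from the opening example.
scores = [45, 36, 51, 45, 40]
print(mean(scores))  # prints 43.4
```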
The median of a data sequence is the midpoint value in the sorted version of the sequence. The median is useful for making qualitative statements, such as whether a particular data point lies in the upper or lower portion of an input sequence.
The following equation represents the sorted sequence of an input sequence X.
$S = \{s_0, s_1, s_2, \ldots, s_{n-1}\}$ | (E) |
You can sort the sequence either in ascending order or in descending order. The following equation yields the median value of S.
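$x_{\text{median}} = \begin{cases} s_i & n \text{ is odd} \\ 0.5\,(s_{k-1} + s_k) & n \text{ is even} \end{cases}$ | (F) |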
where i = (n – 1)/2 and k = n/2.
Equation G defines a sorted sequence consisting of an odd number of samples sorted in descending order.
S = {5, 4, 3, 2, 1} | (G) |
In Equation G, the median is the midpoint value 3.
Equation H defines a sorted sequence consisting of an even number of samples sorted in ascending order.
S = {1, 2, 3, 4} | (H) |
The sorted sequence in Equation H has two midpoint values, 2 and 3. Using the case of Equation F in which n is even, with k = 4/2 = 2, the following equation yields the median value for the sorted sequence in Equation H.
$x_{\text{median}} = 0.5\,(s_{k-1} + s_k) = 0.5\,(2 + 3) = 2.5$
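The two cases of the median, odd and even n, can be sketched in Python as follows. This is an illustrative implementation under the definitions above, and the function name `median` is an assumption.

```python
def median(x):
    """Midpoint value of the sorted sequence, per the odd/even rule."""
    s = sorted(x)  # ascending order; descending gives the same result
    n = len(s)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    if n % 2 == 1:
        i = (n - 1) // 2       # single midpoint index for odd n
        return s[i]
    k = n // 2                 # upper of the two midpoint indices
    return 0.5 * (s[k - 1] + s[k])


print(median([5, 4, 3, 2, 1]))  # prints 3
print(median([1, 2, 3, 4]))     # prints 2.5
```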
The standard deviation $s$ of an input sequence equals the positive square root of the sample variance $s^2$, as shown in the following equation.
$s = \sqrt{s^2}$ | (K) |
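The following Python sketch computes the standard deviation as the square root of the sample variance. This section does not define the sample variance, so the sketch assumes the common definition that divides the sum of squared deviations by n - 1. The function names are illustrative.

```python
import math


def sample_variance(x):
    """Sample variance. ASSUMPTION: divisor is n - 1, which this
    section does not state explicitly."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")
    mu = sum(x) / n
    return sum((xi - mu) ** 2 for xi in x) / (n - 1)


def standard_deviation(x):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(x))


print(sample_variance([1, 2, 3, 4, 5]))  # prints 2.5
```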
The mode of an input sequence is the value that occurs most often in the input sequence. The following equation defines an input sequence X.
X = {0, 1, 3, 3, 4, 4, 4, 5, 5, 7} | (L) |
The mode of X is 4 because 4 is the value that occurs most often in X.
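One way to find the mode in Python is to count occurrences with `collections.Counter` from the standard library. The sketch below is an illustration. If several values tie for the highest count, it returns only one of them, which is a simplification the text above does not address.

```python
from collections import Counter


def mode(x):
    """Return a value that occurs most often in x."""
    if len(x) == 0:
        raise ValueError("mode of an empty sequence is undefined")
    counts = Counter(x)
    # most_common(1) returns a list holding one (value, count) pair.
    value, _count = counts.most_common(1)[0]
    return value


X = [0, 1, 3, 3, 4, 4, 4, 5, 5, 7]
print(mode(X))  # prints 4
```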
The moment about the mean is a measure of the deviation of the elements in an input sequence from the mean. The following equation yields the $m$th-order moment $\sigma_n^m$ for an input sequence X.
$\sigma_n^m = \dfrac{1}{n} \sum_{i=0}^{n-1} (x_i - \mu)^m$ | (M) |
where n is the number of elements in X and $\mu$ is the mean of X.
For m = 2, the moment about the mean equals the population variance $\sigma^2$.
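A short Python sketch of the moment about the mean follows. The function name `moment_about_mean` is an assumption made for illustration. With m = 2 it returns the population variance, which divides by n rather than n - 1.

```python
def moment_about_mean(x, m):
    """m-th order moment about the mean of sequence x."""
    n = len(x)
    if n == 0:
        raise ValueError("moment of an empty sequence is undefined")
    mu = sum(x) / n
    return sum((xi - mu) ** m for xi in x) / n


data = [1, 2, 3, 4, 5]
print(moment_about_mean(data, 2))  # prints 2.0 (population variance)
```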
A histogram is a bar graph that displays frequency data and is an indication of the data distribution. A histogram provides a method for graphically displaying data and summarizing key information.
Equation N defines a data sequence.
X = {0, 1, 3, 3, 4, 4, 4, 5, 5, 8} | (N) |
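To compute a histogram for X, divide the total range of values into the following eight intervals, or bins: 0–1, 1–2, 2–3, 3–4, 4–5, 5–6, 6–7, and 7–8.
The histogram display for X indicates the number of data samples that lie in each interval, excluding the upper boundary. In this example, the sample 8 is counted in the last interval, 7–8, so that interval includes its upper boundary. The following figure shows the histogram for the sequence in Equation N.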
The previous figure shows that no data samples are in the 2–3 and 6–7 intervals. One data sample lies in each of the intervals 0–1, 1–2, and 7–8. Two data samples lie in each of the intervals 3–4 and 5–6. Three data samples lie in the 4–5 interval.
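The bin counts described above can be reproduced with a small Python sketch. It assumes unit-width bins with edges at 0 through 8, that each bin excludes its upper boundary, and that the last bin also includes its upper boundary so that the sample 8 is counted. The function name `histogram` is illustrative.

```python
def histogram(x, edges):
    """Count samples per bin. Bin b covers edges[b] <= v < edges[b + 1];
    the last bin also accepts v equal to its upper edge (assumption)."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in x:
        for b in range(len(counts)):
            lo, hi = edges[b], edges[b + 1]
            if lo <= v < hi or (b == last and v == hi):
                counts[b] += 1
                break
    return counts


X = [0, 1, 3, 3, 4, 4, 4, 5, 5, 8]
edges = list(range(9))  # 0, 1, ..., 8 gives eight unit-width bins
print(histogram(X, edges))  # prints [1, 1, 0, 2, 3, 2, 0, 1]
```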
The number of intervals in the histogram affects the resolution of the histogram. A common method of determining the number of intervals to use in a histogram is Sturges' Rule, which is given by the following equation.
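$\text{Number of Intervals} = 1 + 3.3 \log_{10}(\text{size of } X)$
where the size of X is the number of samples in X and the logarithm is base 10.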
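Sturges' Rule is a one-line calculation. The Python sketch below returns the raw value and leaves rounding to the caller, because the text above does not say how to round. The function name is illustrative.

```python
import math


def sturges_intervals(n_samples):
    """Suggested number of histogram intervals for n_samples points,
    using 1 + 3.3 * log10(n). The result is not rounded."""
    if n_samples < 1:
        raise ValueError("need at least one sample")
    return 1 + 3.3 * math.log10(n_samples)


print(round(sturges_intervals(10), 1))    # prints 4.3
print(round(sturges_intervals(1000), 1))  # prints 10.9
```

For a ten-sample sequence the rule suggests four or five intervals, depending on how you round.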
The mean square error (mse) is the average of the squared differences between the corresponding elements of two input sequences. The following equation yields the mse for two input sequences X and Y.
$\text{mse} = \dfrac{1}{n} \sum_{i=0}^{n-1} (x_i - y_i)^2$ | (O) |
where n is the number of data points.
You can use the mse to compare two sequences. For example, system S1 receives a digital signal x and produces an output signal y1. System S2 produces y2 when it receives x. Theoretically, y1 = y2. To verify that y1 = y2, you want to compare y1 and y2. Because both y1 and y2 contain a large number of data points, an element-by-element comparison is difficult. Instead, you can calculate the mse of y1 and y2. If the mse is smaller than an acceptable tolerance, you can treat y1 and y2 as equivalent.
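The comparison described above can be sketched in Python as follows. The function names and the tolerance value are assumptions chosen for illustration; an appropriate tolerance depends on the application.

```python
def mse(x, y):
    """Mean of the squared element-wise differences of x and y."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("sequences must be non-empty and equal length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def equivalent(y1, y2, tolerance):
    """Treat y1 and y2 as equivalent when their mse is below tolerance."""
    return mse(y1, y2) < tolerance


y1 = [1.0, 2.0, 3.0]
y2 = [1.0, 2.0, 3.001]
print(equivalent(y1, y2, tolerance=1e-6))  # prints True
```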
The root mean square (rms) of an input sequence equals the positive square root of the mean of the square of the input sequence. In other words, you square each element of the input sequence, take the mean of the squared sequence, and then take the square root of that mean. The following equation yields the rms $\Psi_x$ for an input sequence X.
$\Psi_x = \sqrt{\dfrac{1}{n} \sum_{i=0}^{n-1} x_i^2}$ | (P) |
where n is the number of elements in X.
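The three steps (square, mean, square root) map directly onto a short Python sketch. The function name `rms` is illustrative.

```python
import math


def rms(x):
    """Square each element, average the squares, take the square root."""
    if len(x) == 0:
        raise ValueError("rms of an empty sequence is undefined")
    mean_square = sum(v * v for v in x) / len(x)
    return math.sqrt(mean_square)


print(rms([1, -1, 1, -1]))  # prints 1.0
```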
Root mean square is a widely used quantity for analog signals. The following equation yields the root mean square voltage $V_{rms}$ for a sinusoidal voltage waveform.
$V_{rms} = \dfrac{V_p}{\sqrt{2}}$ | (Q) |
where $V_p$ is the peak amplitude of the signal.
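The relationship between peak amplitude and rms voltage can be checked numerically. The Python sketch below samples one full period of a sine wave at evenly spaced points and compares the rms of the samples with the peak amplitude divided by the square root of 2. The function name, the sample count, and the 10 V peak are arbitrary choices for illustration.

```python
import math


def sine_rms(v_peak, samples=1000):
    """rms of one full period of v_peak * sin(t), sampled evenly."""
    total = 0.0
    for k in range(samples):
        v = v_peak * math.sin(2 * math.pi * k / samples)
        total += v * v
    return math.sqrt(total / samples)


v_peak = 10.0
numeric = sine_rms(v_peak)
closed_form = v_peak / math.sqrt(2)
print(abs(numeric - closed_form) < 1e-9)  # prints True
```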