Engineering LibreTexts

3.2: Measures of Variation

    Learning Objectives

    By the end of this section, you should be able to:

    • 3.2.1 Define and calculate the range, the variance, and the standard deviation for a dataset.
    • 3.2.2 Use Python to calculate measures of variation for a dataset.

    Providing some measure of the spread, or variation, in a dataset is crucial to a comprehensive summary of the dataset. Two datasets may have the same mean but can exhibit very different spread, and so a measure of dispersion for a dataset is very important. While measures of central tendency (like mean, median, and mode) describe the center or average value of a distribution, measures of dispersion give insights into how much individual data points deviate from this central value.

    The following two datasets are the exam scores for a group of three students in a biology course and in a statistics course.

    Dataset A: Exam scores for students in a biology course: 40, 70, 100
    Dataset B: Exam scores for students in a statistics course: 69, 70, 71

    Notice that the mean score for both Dataset A and Dataset B is 70.

    However, the two datasets differ greatly in spread:

    Dataset A has larger variability: one student scored 30 points below the mean and another student scored 30 points above the mean.
    Dataset B has smaller variability: the exam scores are tightly clustered around the mean of 70.

    This example illustrates that publishing the mean of a dataset is often inadequate to fully communicate the characteristics of the dataset. Instead, data scientists will typically include a measure of variation as well.

    The three primary measures of variability are range, variance, and standard deviation, and these are described next.

    Range

    Range is a measure of dispersion for a dataset that is calculated by subtracting the minimum from the maximum of the dataset:

    \[ \text{Range} = \text{Max} - \text{Min} \]

    Range is a straightforward calculation but makes use of only two of the data values in a dataset. The range can also be affected by outliers.

    Example 3.6

    Calculate the range for Dataset A and Dataset B:

    Dataset A: Exam scores for students in a biology course: 40, 70, 100
    Dataset B: Exam scores for students in a statistics course: 69, 70, 71

    Answer

    For Dataset A, the maximum data value is 100 and the minimum data value is 40.
    The range is then calculated as:

    \[ \text{Range} = \text{Max} - \text{Min} = 100 - 40 = 60 \]

    For Dataset B, the maximum data value is 71 and the minimum data value is 69.
    The range is then calculated as:

    \[ \text{Range} = \text{Max} - \text{Min} = 71 - 69 = 2 \]

    The range clearly indicates that there is much less spread in Dataset B as compared to Dataset A.
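
As a minimal sketch, the calculation in Example 3.6 can be reproduced with Python's built-in max() and min() functions. The helper name data_range and the variable names are illustrative choices, not part of the original text.

```python
def data_range(values):
    """Return the range (maximum minus minimum) of a sequence of numbers."""
    return max(values) - min(values)

# Exam scores from Example 3.6
dataset_a = [40, 70, 100]  # biology course
dataset_b = [69, 70, 71]   # statistics course

range_a = data_range(dataset_a)
range_b = data_range(dataset_b)

print("Range of Dataset A:", range_a)  # 60
print("Range of Dataset B:", range_b)  # 2
```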

    One drawback to the use of the range is that it doesn’t take into account every data value. The range only uses two data values from the dataset: the minimum (min) and the maximum (max). In addition, the range is sensitive to outliers, since an outlier will often be the minimum or maximum data value and thus skew the result. For these reasons, we typically use other measures of variation, such as variance or standard deviation.

    Variance

    The variance provides a measure of the spread of data values by using the squared deviations from the mean. The more the individual data values differ from the mean, the larger the variance.

    A financial advisor might use variance to determine the volatility of an investment and therefore help guide financial decisions. For example, a more cautious investor might opt for investments with low volatility.

    The formula used to calculate variance depends on whether the data is collected from a sample or a population. The notation \(s^2\) is used to represent the sample variance, and the notation \(\sigma^2\) is used to represent the population variance.

    Formula for the sample variance:

    \[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]

    Formula for the population variance:

    \[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \]

    In these formulas:
    \(x\) represents the individual data values
    \(\bar{x}\) represents the sample mean
    \(n\) represents the sample size
    \(\mu\) represents the population mean
    \(N\) represents the population size
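
The two variance formulas can be sketched directly in Python, here applied to Dataset A from earlier in this section. The function names are illustrative; the comparison at the end uses statistics.variance() and statistics.pvariance() from the Python standard library, which divide by n - 1 and N respectively.

```python
import statistics

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    size = len(values)
    mu = sum(values) / size
    return sum((x - mu) ** 2 for x in values) / size

dataset_a = [40, 70, 100]

s2 = sample_variance(dataset_a)          # (900 + 0 + 900) / 2 = 900.0
sigma2 = population_variance(dataset_a)  # (900 + 0 + 900) / 3 = 600.0
print(s2, sigma2)

# The standard library gives the same values for this dataset.
print(statistics.variance(dataset_a), statistics.pvariance(dataset_a))
```

For the same data, the sample variance is always at least as large as the population variance, because it divides the same sum of squares by a smaller number.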

    Alternate Formula for Variance

    An alternate formula for the variance is available. It is sometimes used for more efficient computations:

    \[ \sigma^2 = \frac{\sum x^2}{N} - \mu^2 \]
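
A quick check, sketched here with illustrative function names, suggests that the alternate formula agrees with the definition of the population variance on Dataset A (40, 70, 100):

```python
def population_variance_definition(values):
    """Population variance as the mean squared deviation from mu."""
    size = len(values)
    mu = sum(values) / size
    return sum((x - mu) ** 2 for x in values) / size

def population_variance_shortcut(values):
    """Population variance as (sum of x squared) / N minus mu squared."""
    size = len(values)
    mu = sum(values) / size
    return sum(x ** 2 for x in values) / size - mu ** 2

dataset_a = [40, 70, 100]

by_definition = population_variance_definition(dataset_a)
by_shortcut = population_variance_shortcut(dataset_a)

print(by_definition)  # 600.0
print(by_shortcut)    # 16500 / 3 - 70**2 = 600.0
```

The shortcut needs only the running totals of x and x squared, which is where the efficiency comes from. In floating-point arithmetic it can lose precision when the mean is very large compared with the spread, so the two results may differ slightly on such data.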

    In the formulas for sample variance and population variance, notice that the denominator for the sample variance is \(n - 1\), whereas the denominator for the population variance is \(N\). Dividing by \(n - 1\) makes the sample variance an unbiased estimate of the population variance, in the sense that if repeated samples of size \(n\) are taken and the sample variance is computed each time, then the average of those sample variances will tend to the population variance as the number of repeated samples increases.
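
The effect of the n - 1 divisor can be illustrated on a tiny population. The sketch below treats Dataset A as a whole population and, as a simplifying assumption, enumerates every ordered sample of size 2 drawn with replacement instead of sampling at random. Averaging the variance over all of these samples recovers the population variance when dividing by n - 1, but not when dividing by n.

```python
from itertools import product
import statistics

population = [40, 70, 100]
sigma2 = statistics.pvariance(population)  # population variance: 600

n = 2
# All 9 ordered samples of size 2, drawn with replacement
samples = list(product(population, repeat=n))

def variance_with_divisor(sample, divisor):
    """Sum of squared deviations from the sample mean, over a chosen divisor."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / divisor

avg_n_minus_1 = sum(variance_with_divisor(s, n - 1) for s in samples) / len(samples)
avg_n = sum(variance_with_divisor(s, n) for s in samples) / len(samples)

print(sigma2)         # 600
print(avg_n_minus_1)  # 600.0, matches the population variance
print(avg_n)          # 300.0, underestimates the population variance
```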

    It is important to note that in many data science applications, population data is unavailable, and so we typically calculate the sample variance. For example, if a researcher wanted to estimate the percentage of smokers for all adults in the United States, it would be impractical to collect data from every adult in the United States.

    Notice that the sample variance is built from a sum of squared deviations, so its units of measurement are the squares of the units of the original data. Since these square units differ from the units of the original data, the variance can be hard to interpret. By contrast, standard deviation is measured in the same units as the original dataset, and thus the standard deviation is more commonly used to measure the spread of a dataset.

    Standard Deviation

    The standard deviation of a dataset provides a numerical measure of the overall amount of variation in a dataset in the same units as the data; it can be used to determine whether a particular data value is close to or far from the mean, relative to the typical distance from the mean.

    The standard deviation is always positive or zero. It is small when the data values are all concentrated close to the mean, exhibiting little variation, or spread. It is larger when the data values are spread out more from the mean, exhibiting more variation. A smaller standard deviation implies less variability in a dataset, and a larger standard deviation implies more variability in a dataset.

    Suppose that we are studying the variability of two companies (A and B) with respect to employee salaries. The average salary for both companies is $60,000. For Company A, the standard deviation of salaries is $8,000, whereas the standard deviation for salaries for Company B is $19,000. Because Company B has a higher standard deviation, we know that there is more variation in the employee salaries for Company B as compared to Company A.

    There are two different formulas for calculating standard deviation. Which formula to use depends on whether the data represents a sample or a population. The notation \(s\) is used to represent the sample standard deviation, and the notation \(\sigma\) is used to represent the population standard deviation. In the formulas shown, \(\bar{x}\) is the sample mean, \(\mu\) is the population mean, \(n\) is the sample size, and \(N\) is the population size.

    Formula for the sample standard deviation:

    \[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]

    Formula for the population standard deviation:

    \[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \]

    Notice that the sample standard deviation is the square root of the sample variance. This means that once the sample variance has been calculated, the sample standard deviation follows immediately, as in Example 3.7.
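
As a short illustration, the standard library functions statistics.stdev() (sample) and statistics.pstdev() (population) can be applied to Datasets A and B from the start of this section. The variable names are illustrative.

```python
import math
import statistics

dataset_a = [40, 70, 100]  # biology exam scores
dataset_b = [69, 70, 71]   # statistics exam scores

# Sample standard deviation (divides by n - 1)
s_a = statistics.stdev(dataset_a)
s_b = statistics.stdev(dataset_b)

# Population standard deviation (divides by N)
sigma_a = statistics.pstdev(dataset_a)
sigma_b = statistics.pstdev(dataset_b)

print(f"Dataset A: s = {s_a:.2f}, sigma = {sigma_a:.2f}")  # s = 30.00, sigma = 24.49
print(f"Dataset B: s = {s_b:.2f}, sigma = {sigma_b:.2f}")  # s = 1.00, sigma = 0.82

# Each standard deviation is the square root of the matching variance.
print(math.isclose(s_a, math.sqrt(statistics.variance(dataset_a))))  # True
```

Both datasets have a mean of 70, but the standard deviation of Dataset A is about 30 times that of Dataset B, which matches the visual impression of the raw scores.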

    Example 3.7

    A biologist calculates that the sample variance for the amount of plant growth for a sample of plants is 8.7 cm². Calculate the sample standard deviation.

    Answer

    The sample standard deviation (\(s\)) is calculated as the square root of the variance.

    \[ s = \sqrt{s^2} = \sqrt{8.7} \approx 2.9 \text{ cm} \]

    Example 3.8

    Assume the sample variance (\(s^2\)) for a dataset is calculated as 42.2. Based on this, calculate the sample standard deviation.

    Answer

    The sample standard deviation (\(s\)) is calculated as the square root of the variance.

    \[ s = \sqrt{s^2} = \sqrt{42.2} \approx 6.5 \text{ years} \]

    This result indicates that the standard deviation is about 6.5 years.

    Notice that the sample variance is the square of the sample standard deviation, so if the sample standard deviation is known, the sample variance can easily be calculated.
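
The conversions in Examples 3.7 and 3.8 can be checked with math.sqrt(). This is a minimal sketch; the function and variable names are illustrative.

```python
import math

def std_from_variance(variance):
    """Return the standard deviation that corresponds to a given variance."""
    return math.sqrt(variance)

s_plants = std_from_variance(8.7)   # Example 3.7, variance in cm squared
s_other = std_from_variance(42.2)   # Example 3.8

print(round(s_plants, 1))  # 2.9
print(round(s_other, 1))   # 6.5

# Squaring the standard deviation recovers the variance (up to rounding error).
print(math.isclose(s_plants ** 2, 8.7))  # True
```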

    Use of Technology for Calculating Measures of Variability

    Because calculating variance and standard deviation by hand is tedious for all but the smallest datasets, technology is typically used to calculate these measures of variability. For example, refer to Using Python for Measures of Variation at the end of this section.

    Coefficient of Variation

    A data scientist might be interested in comparing the variation of datasets that have different units of measurement or different means, and in these scenarios the coefficient of variation (CV) can be used. The coefficient of variation measures the variation of a dataset by expressing the standard deviation as a percentage of the mean. Note: the coefficient of variation is typically expressed in a percentage format.

    \[ \text{CV} = \frac{\sigma}{\mu} \times 100\% \]

    \[ \text{Sample CV} = \frac{s}{\bar{x}} \times 100\% \]

    Example 3.9

    Compare the relative variability for Company A versus Company B using the coefficient of variation, based on the following sample data:

    Company A: Sample Mean = $68,000, Sample Standard Deviation = $9,200

    Company B: Sample Mean = $71,000, Sample Standard Deviation = $6,400

    Answer

    Calculate the coefficient of variation for each company:

    \[ \text{CV for Company A} = \frac{s}{\bar{x}} \times 100\% = \frac{9{,}200}{68{,}000} \times 100\% \approx 13.5\% \]

    \[ \text{CV for Company B} = \frac{s}{\bar{x}} \times 100\% = \frac{6{,}400}{71{,}000} \times 100\% \approx 9.0\% \]

    Company A exhibits more variability relative to the mean as compared to Company B.
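
The comparison in Example 3.9 can be sketched in a few lines of Python. The function name coefficient_of_variation is an illustrative choice, and the guard against a zero mean is an added assumption, since the ratio is undefined in that case.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("The coefficient of variation is undefined for a mean of 0.")
    return std_dev / mean * 100

# Sample statistics from Example 3.9 (in dollars)
cv_a = coefficient_of_variation(9_200, 68_000)
cv_b = coefficient_of_variation(6_400, 71_000)

print(f"Company A: {cv_a:.1f}%")  # Company A: 13.5%
print(f"Company B: {cv_b:.1f}%")  # Company B: 9.0%
```

Because the CV is a ratio of two quantities in the same units, the units cancel, which is what allows comparisons across datasets measured on different scales.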

    Using Python for Measures of Variation

    The pandas method DataFrame.describe() reports the standard deviation of each numeric column of a dataset, along with other summary statistics. The std row of the output lists the sample standard deviation of each column (see Figure 3.3).

    A data table summarizing statistics about 966 items in the “movie profit” dataset, with columns for “unnamed: 0,” “rating,” “duration,” “US gross” and “worldwide gross.”  The standard deviation row is highlighted. The standard deviation is about 0.89 for ratings and about 21.6 for durations.  The standard deviation for US gross earnings is about $110.6 million and for worldwide gross earnings about $294.76 million.
    Figure 3.3 The Output of DataFrame.describe() with the Movie Profit Dataset
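
The Movie Profit dataset is not reproduced here, so the following sketch uses a small stand-in DataFrame built from the exam scores of Datasets A and B. The column names are illustrative, and the example assumes pandas is installed.

```python
import pandas as pd

# Stand-in data: exam scores from Datasets A and B in this section
scores = pd.DataFrame({
    "biology": [40, 70, 100],
    "statistics": [69, 70, 71],
})

summary = scores.describe()   # count, mean, std, min, quartiles, max
std_row = summary.loc["std"]  # sample standard deviation of each column
print(std_row)

# Individual measures are also available as DataFrame methods.
sample_var = scores.var()                  # sample variance (ddof=1 by default)
population_std = scores.std(ddof=0)        # population standard deviation
value_range = scores.max() - scores.min()  # range of each column
print(sample_var, population_std, value_range, sep="\n")
```

By default, pandas divides by n - 1 when computing std() and var(), so the std row in describe() is a sample standard deviation. Passing ddof=0 switches to the population formula.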

    This page titled 3.2: Measures of Variation is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by OpenStax via source content that was edited to the style and standards of the LibreTexts platform.