Engineering LibreTexts

15.3: Numerical Data- Other Summary Statistics


    That YouTube data set is a good segue to talking about that most overused of all statistics: the mean. Nearly everyone, if you ask them “what’s the typical number of plays for these videos?” will use the mean, or average, to get at the answer. After all, isn’t that what we mean by “the average number of plays?”

    The answer is: not really, and not usually. Look what happens if we compute the mean (using the .mean() Series method) in this case:

    Code \(\PageIndex{1}\) (Python):

    print(num_plays.mean())

    | 14018.888235294118

    Consider just how misleading that really is. The “average” number of plays is over 14,000. Yet the .9-quantile was less than 1/10th of that! In fact, even the .97-quantile is only:

    Code \(\PageIndex{2}\) (Python):

    print(num_plays.quantile(.97))

    | 3836.0

    So over 97% of the videos have fewer plays than the mean of about 14,000. I think you’ll agree that it is nonsensical to claim that “the typical number of plays is 14,018,” no matter how you slice it.

    We’ll see in the next section why the mean is hopelessly skewed here. Basically, unless the data is roughly symmetrical and “bell-curvy,” the mean gets dragged toward the most extreme values and stops describing a typical data point. It is almost always safer and more illuminating to look at the median (or other quantiles).
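To see the effect in miniature, here is a small sketch that does not use the book's YouTube data. The play counts are made up for illustration, and the code uses Python's standard statistics module so that it runs on its own. With a pandas Series, the corresponding calls would be .mean() and .median().

```python
import statistics

# Hypothetical play counts (invented for illustration, not the book's data):
# nine modest videos and one viral hit.
plays = [120, 250, 310, 400, 480, 520, 610, 700, 950, 150000]

mean_plays = statistics.mean(plays)
median_plays = statistics.median(plays)

# Count how many videos have fewer plays than the mean.
below_mean = sum(1 for p in plays if p < mean_plays)

print("mean:  ", mean_plays)
print("median:", median_plays)
print(below_mean, "of", len(plays), "videos are below the mean")
```

A single huge value is enough to pull the mean above nearly every video in the list, while the median stays in the middle of the modest ones.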

    For completeness, one other commonly cited summary statistic is the standard deviation, which can be computed with the .std() method:

    Code \(\PageIndex{3}\) (Python):

    print(num_plays.std())

    The standard deviation, like the IQR, is a measure of the “spread” of a data set – a high number means (in this example) higher variability in the number of plays from video to video. As with the mean, it’s essentially meaningless (no pun intended) unless the data is nice and bell-curve shaped.
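The difference between the two measures of spread can be sketched on a toy example. The numbers below are invented, not taken from the YouTube data set, and the iqr helper is a hypothetical function written for this sketch. It uses the standard library's statistics module. Its stdev function computes the sample standard deviation, which is also what the pandas .std() method returns by default.

```python
import statistics

# Hypothetical play counts (invented): the same videos with and without
# one viral hit added.
typical = [120, 250, 310, 400, 480, 520, 610, 700, 950]
with_hit = typical + [150000]

def iqr(values):
    """Interquartile range: the .75-quantile minus the .25-quantile."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

for label, data in [("without hit", typical), ("with hit", with_hit)]:
    print(label, "| std:", round(statistics.stdev(data), 1), "| IQR:", iqr(data))
```

Adding the one extreme video multiplies the standard deviation many times over, while the IQR barely moves.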

    Speaking of which, we’ll never be able to judge the “shape” of anything unless we get some graphical plots involved. So let’s turn our focus to that.


    This page titled 15.3: Numerical Data- Other Summary Statistics is shared under a CC BY-SA 4.0 license and was authored, remixed, and/or curated by Stephen Davies (allthemath.org) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.