15.3: Numerical Data - Other Summary Statistics

That YouTube data set is a good segue to talking about that most overused of all statistics: the mean. Nearly everyone, if you ask them “what’s the typical number of plays for these videos?” will use the mean, or average, to get at the answer. After all, isn’t that what we mean by “the average number of plays?”

The answer is: not really, and not usually. Look at what happens if we compute the mean (using the .mean() Series method) in this case:

Code \(\PageIndex{1}\) (Python):

print(num_plays.mean())

| 14018.888235294118

Consider just how misleading that really is. The “average” number of plays is over 14,000. Yet the .9-quantile was less than one-tenth of that! In fact, even the .97-quantile is only:

Code \(\PageIndex{2}\) (Python):

print(num_plays.quantile(.97))

| 3836.0

So over 97% of the videos have fewer plays than the mean of about 14,000. I think you’ll agree that it is nonsensical to claim that “the typical number of plays is 14,018,” no matter how you slice it.

We’ll see in the next section why the mean is so badly distorted here. Basically, unless the data is roughly symmetrical and “bell-curvy,” a handful of extreme values can drag the mean far away from anything typical. It is almost always safer and more illuminating to look at the median (or other quantiles).
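To see the effect on a scale small enough to check by hand, here is a minimal sketch. The toy_plays Series is made up for illustration (it is not the YouTube data set, and the variable names are my own): nine modest videos plus one viral hit.

```python
import pandas as pd

# Hypothetical play counts, invented for this illustration only:
# nine ordinary videos and a single viral hit.
toy_plays = pd.Series([12, 30, 45, 60, 80, 110, 150, 200, 400, 250000])

toy_mean = toy_plays.mean()      # pulled far upward by the one huge value
toy_median = toy_plays.median()  # midpoint of the sorted values

# Fraction of videos with fewer plays than the mean.
# Comparing gives a Series of True/False; its mean is the fraction of Trues.
frac_below_mean = (toy_plays < toy_mean).mean()

print(toy_mean)
print(toy_median)
print(frac_below_mean)
```

With these made-up numbers the mean lands above 25,000 even though nine of the ten videos have 400 plays or fewer, while the median sits at 95, right in the middle of the ordinary videos.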

For completeness, one other commonly cited summary statistic is the standard deviation, which can be computed with the .std() method:

Code \(\PageIndex{3}\) (Python):

print(num_plays.std())

The standard deviation, like the IQR, is a measure of the “spread” of a data set – a high number means (in this example) higher variability in the number of plays from video to video. As with the mean, it’s essentially meaningless (no pun intended) unless the data is nice and bell-curve shaped.
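The same sensitivity can be sketched with invented numbers. Below, toy_plays is again a made-up Series (not the YouTube data), and the IQR is computed by hand as the .75-quantile minus the .25-quantile, using pandas' default interpolation. We compare both measures of spread with and without the single viral video.

```python
import pandas as pd

# Hypothetical play counts, invented for this illustration only.
toy_plays = pd.Series([12, 30, 45, 60, 80, 110, 150, 200, 400, 250000])

# The same data with the one viral video removed.
without_hit = toy_plays[toy_plays < 250000]

# Standard deviation, with and without the extreme value.
std_all = toy_plays.std()
std_trimmed = without_hit.std()

# IQR: distance between the .75-quantile and the .25-quantile.
iqr_all = toy_plays.quantile(.75) - toy_plays.quantile(.25)
iqr_trimmed = without_hit.quantile(.75) - without_hit.quantile(.25)

print(std_all, std_trimmed)
print(iqr_all, iqr_trimmed)
```

Dropping one video changes the standard deviation by a factor of several hundred, while the IQR barely moves. That is the sense in which the standard deviation, like the mean, is at the mercy of a few extreme values.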

Speaking of which, we’ll never be able to judge the “shape” of anything unless we get some graphical plots involved. So let’s turn our focus to that.