19.1: The Concept of Statistical Significance


    Before we get to the details, we need to face head-on what is probably the single most important concept in statistics, that of statistical significance (or “stat sig,” for short). It is so immensely important that I’m going to ask you to put down whatever snack you’re eating right now, fold your hands, and pay very close attention.

    All forms of bivariate analysis are variations on a single theme, namely: discovering whether or not an association exists between two variables. Recall from section 10.2 (p. 91) that an association means that two variables are correlated in some way: that certain values of one tend to go more often with certain values of the other. To make it concrete, let’s say one of our variables is sex (at birth, male or female) and the other is height (in inches, say). We want to know: “are taller people more often male, and shorter people more often female, or is there no connection between sex and height?”

    Now the first thing you think of doing, of course, is getting a sample (recruiting volunteers, say) of both males and females, measuring their heights, and taking the average (mean). Let’s say you do that, and you come up with the following numbers:

    Females – average height: 65.5 inches

    Males – average height: 69.3 inches
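In code, this first step is nothing fancy. Here’s a minimal sketch in Python, using invented individual heights chosen so that the group means come out to the numbers above:

```python
# Invented raw measurements, for illustration only.
female_heights = [64.0, 66.5, 63.8, 67.2, 65.9, 65.6]   # inches
male_heights = [70.1, 68.4, 69.9, 68.7, 69.4]           # inches

# Step one of the analysis: take each group's average (mean).
female_avg = sum(female_heights) / len(female_heights)
male_avg = sum(male_heights) / len(male_heights)

print(f"Females - average height: {female_avg:.1f} inches")   # 65.5
print(f"Males - average height: {male_avg:.1f} inches")       # 69.3
```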

    Clearly, in your sample, males were on average somewhat taller – 3.8 inches taller, in fact. A careless thinker would immediately conclude: “aha! My hypothesis is confirmed. I scientifically carried out my study, and mathematically computed the results, and now here is some hard data proving the conclusion that generally speaking, men tend to be taller than women.”

    Are you convinced by that reasoning?

    I hope you’re not. Here’s why. Let’s change the example, and suppose that instead of height, we measured our volunteers’ IQ. Taking the averages as before, we come up with these numbers:

    Females – average IQ: 102.4

    Males – average IQ: 98.6

    In this case, the average IQ of the females in the sample was higher than that of the males. Shall we conclude that in general, women tend to be smarter than men?

    Confirmation bias

    If you’re like most people, you’ll accept that first finding as confirmation of men’s tallness, and you’ll reject the second finding as just a fluke of the sample. Undoubtedly, this is because you went into the question already having an opinion about the matter. You just know in your heart that men do tend to be taller than women (you’ve observed thousands of both sexes, in fact, and have noticed that trend), whereas you know in your heart that neither sex has an advantage in intelligence (ditto). This leads you to reason as follows:

    1. “Well of course my male volunteers were taller than the female ones. I’ve known all along that males tend to be taller in general, and this just confirms it!”
    2. “Aw, c’mon, we only sampled a few people and measured their IQs. Sure, these particular women might have been a bit smarter than these particular men, but if I ran the experiment again on different volunteers, it might just as easily go the other way. It’d be silly to draw a grand conclusion from that.”

    Psychologists call this fallacy of reasoning “confirmation bias.” We have a natural tendency to interpret information in a way that affirms our prior beliefs. Data that seems to contradict them, we simply talk our way out of.

    Confirmation bias is one of the most insidious enemies of humankind. It leads to wrong reasoning, the entrenchment of beliefs, dangerous overconfidence, polarization, and in the worst cases, groupthink. When a group of people succumbs to groupthink, “orthodox” viewpoints are encouraged, while alternative viewpoints are dismissed and suppressed. Every piece of evidence that conforms to the group’s consensus belief is hailed as evidence confirming it, and evidence that contradicts it is chalked up to mere statistical anomaly.

    One of many examples of this phenomenon was the CIA circa 2002: from the top to the bottom, nearly every member of the organization was certain that Saddam Hussein’s terrible regime in Iraq possessed weapons of mass destruction (WMDs). Later, when it was discovered that this “fact” wasn’t true after all (after we had made irreversible decisions based on it), analysts pored over the CIA’s decision-making process to try to make sense of it. Confirmation bias was perhaps the key ingredient.

    Be aware of it in your own thinking, and at all costs steer yourself away from it!

    The Perils of Eyeballing It

    Okay. Let’s suppose that we’ve freed ourselves from confirmation bias, and we’re actually looking at the numbers objectively. The first question still remains: does an association exist between the two variables?

    Men were an average of 3.8 inches taller in our sample. That’s a difference...but is it enough of a difference? Women were smarter by 3.8 IQ points. That’s a difference...but is it enough?

    To clarify, when we say “enough,” what we mean is: “enough of a difference to generalize our findings to the population at large.” Here’s a paradox: what we have is a sample; yet oddly, it’s never the sample we actually care about. We care about what the sample tells us about the population.

    Think about it. Nobody cares whether the 14 females we surveyed are smarter on average than the 17 males we surveyed. But if we make the claim: “in general women are smarter than men,” suddenly everybody cares. To use another example, nobody cares that 58% of the 2,000 people in our phone sample said they intend to vote for Elizabeth Warren, and only 39% said Donald Trump. But if we say “across the country, we predict more Warren votes than Trump votes by such-and-such margin,” this is big news.

    Now the first law to beat into your head is that you absolutely cannot reliably eyeball it. This is what everyone who hasn’t taken a Data Science or Statistics class tries to do. They squint at the difference (3.8 IQ points, e.g.) and bite their lip and mutter, “well, that sure seems (or doesn’t seem) like a pretty big difference. I’ll bet this says (or doesn’t say) something about intelligence among the sexes in general.”

    Stop. You cannot. People are demonstrably very bad at judging whether or not a difference between groups is “enough.” Part of the problem is that the answer to the question turns on three separate things: how big the difference is, how large your sample size is, and – importantly – how variable the data is (meaning, how widely the points you sample differ from each other). All three of these factors need to be mixed into a soup in just the right way in order to properly judge, and human intuition is just flat terrible at doing that.
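Don’t take my word for it; here’s a little simulation (all numbers invented for illustration) that shows the problem. We plant a true 3.8-point gap between two groups, “re-run the study” a thousand times, and count how often the observed difference comes out backwards. With a small, noisy sample it flips alarmingly often; with a bigger sample, or less variable data, it almost never does:

```python
# A sketch of why eyeballing fails: the same true 3.8-point gap can be
# rock solid or pure noise, depending on sample size and variability.
import numpy as np

rng = np.random.default_rng(17)

def flip_rate(n, spread, true_gap=3.8, trials=1000):
    """Simulate the study many times; return the fraction of runs in
    which the observed difference in sample means comes out reversed."""
    flips = 0
    for _ in range(trials):
        group_a = rng.normal(100.0, spread, n)             # e.g., men's IQs
        group_b = rng.normal(100.0 + true_gap, spread, n)  # e.g., women's IQs
        if group_b.mean() < group_a.mean():
            flips += 1
    return flips / trials

print(flip_rate(n=10, spread=15.0))   # small, noisy sample: flips ~25-30% of runs
print(flip_rate(n=500, spread=15.0))  # large sample: essentially never flips
print(flip_rate(n=10, spread=2.0))    # low variability: essentially never flips
```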

    So eyeballing is a non-starter. But happily, it turns out that statistics provides us an iron-clad, dependable, quantitative, take-it-to-the-bank method for determining whether the pattern in a data set is “enough” to justifiably claim an association between variables. And that is the concept of statistical significance.

    A “Statistically Significant Difference”

    The correct way to determine whether a difference is “enough” is to use the appropriate statistical test. A statistical test is a standard procedure for incorporating all three factors I previously mentioned (the degree of difference, the sample size, and the amount of variance among data points) to come up with a principled, defensible answer to this question: is the pattern I see in my sample a statistically significant one? Can we be reasonably confident that it will also be true of the population as a whole?

    All the statistical tests we’ll learn (in the next chapter) have a common output: a p-value. That’s nice because you don’t have to memorize a lot of different rules for interpreting a lot of different tests.
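As a preview (the tests themselves are next chapter’s business), here’s roughly what running one looks like in Python. This sketch uses SciPy’s two-sample t-test on invented IQ scores; the thing to notice is the output, a p-value:

```python
from scipy import stats

# Invented IQ scores, for illustration only.
female_iqs = [108, 95, 112, 99, 104, 97, 101, 110, 96, 102]
male_iqs = [93, 105, 88, 101, 99, 107, 94, 100]

# Whatever test we run, the output we care about is the same: a p-value.
result = stats.ttest_ind(female_iqs, male_iqs)
print(result.pvalue)
```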

    So what the heck is a “p-value”? There have been controversies galore1, and even entire books2 devoted to the subject, which means that whatever I choose to write here can be nitpicked by statisticians a dozen different ways.

    That’s okay. I’m going to write it anyway, and this will serve you very well. Here goes:

    Definition: p-value

    A p-value is a number between 0 and 1. If you run a statistical test on your bivariate data, and the p-value is less than your α (“alpha”) value (normally, .05), then there is a statistically significant association between the variables. Otherwise, there isn’t. End of story.

    Recall from section 10.5 (p. 102) that α is “where to set the bar” to detect a meaningful association. It’s essentially the probability we’re willing to accept of drawing a false conclusion (declaring an association that isn’t really there). For social science data (that is, data involving humans), you should always choose .05 to avoid controversy. For physical science data, you should always choose .01.

    The bottom line is this: if you spot a possible relationship between two of your variables (like gender and IQ), run the appropriate statistical test (see next chapter) and look at the p-value. If it’s less than α, then the difference you thought you saw is officially “enough.” You can therefore declare “yep, these two variables are associated, to a confidence level of α.” If it’s not less than α, then even though you thought you saw a meaningful tendency in the data, you can officially say, “nope, it’s not a stat sig diff. This is very likely just an artifact of the particular data points I collected in my sample. Pay it no mind.”
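That whole bottom line fits in a few lines of code. Here’s a minimal sketch of the decision rule, where p_value stands for whatever number your statistical test spit out:

```python
ALPHA = 0.05   # "where to set the bar" for social science data

def is_stat_sig(p_value, alpha=ALPHA):
    """The entire decision rule: p less than alpha means stat sig."""
    return p_value < alpha

print(is_stat_sig(0.031))   # True: declare the association (at alpha = .05)
print(is_stat_sig(0.24))    # False: likely just an artifact of the sample
```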

    1 For instance:

    • Colquhoun, D. (2017). “p-values.” Royal Society Open Science, 4(12): 171085.
    • Murtaugh, P. A. (2014). “In defense of p-values.” Ecology, 95(3): 611–617.
    • Wasserstein, R. L. and Lazar, N. A. (2016). “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician, 70(2): 129–133.

    2 Vickers, A. J. (2009). What is a p-value anyway? Boston: Pearson.


    This page titled 19.1: The Concept of Statistical Significance is shared under a CC BY-SA 4.0 license and was authored, remixed, and/or curated by Stephen Davies (allthemath.org) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.