Chapter 2: T-tests, P-values, and Hypothesis Tests
Central Limit Theorem
We just examined the dispersion of sample values around the sample mean, \(\overline{x}\). But we have not yet obtained an estimate of the uncertainty in using \(\overline{x}\) as an approximation of the true mean \(\mu\). To do this we would have to repeat the experiment and compare the \(\overline{x}\) of each sample. Thus we would also obtain a set of sample means.
One of the most important and profound theorems in statistics deals with this set of sample means and their distribution: the Central Limit Theorem. So far we have been working on the assumption that the population we are working with is normally distributed. While that may often be true, it is not always the case. Consider, for instance, a uniform population distribution like the one below.

At this point one might think we must give up and cannot perform any data analysis... if only I were so kind.
No, instead the Central Limit Theorem makes a very powerful claim: regardless of the shape of the population distribution, the distribution of the mean values of samples will be normally distributed as long as you obtain a large number of means, roughly \(n > 20\). We can see this visually with an example in the Mathematica Notebook for this lecture. If we randomly pick 6 samples from this distribution and average them, then with only 2 or 10 such averages, as seen below, the distributions do not look Gaussian. But if we collect 30 or more means (sometimes 20 is enough, but I would obtain 30 just to be safe), then we see that the distribution of the mean measurand values becomes Gaussian!

This has implications for the analysis we can perform, but first let's demonstrate that this also works for other types of distributions, such as the exponential, \(\chi^2\), and \(\beta\) distributions. As you can see below, the distributions of the means are clearly normal, or Gaussian, distributions.
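The course notebook demonstrates this in Mathematica; here is a minimal stand-alone sketch in Python (standard library only, with all parameter choices mine) that draws many sample means from a uniform population and checks that they cluster around the true mean with spread \(\sigma/\sqrt{n}\):

```python
import math
import random
import statistics

random.seed(0)

# Population: Uniform(0, 1) -- decidedly non-Gaussian.
# True mean mu = 0.5, true standard deviation sigma = 1/sqrt(12).
pop_sigma = 1 / math.sqrt(12)

n = 30            # observations averaged in each sample
num_means = 5000  # how many sample means we collect

means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(num_means)]

# The CLT predicts the means cluster around 0.5 with standard
# deviation sigma/sqrt(n), even though the population is uniform.
print(f"mean of means:  {statistics.fmean(means):.3f}")
print(f"stdev of means: {statistics.stdev(means):.3f}")
print(f"sigma/sqrt(n):  {pop_sigma / math.sqrt(n):.3f}")
```

Plotting a histogram of `means` would show the bell shape directly; the printed numbers confirm the scaling quantitatively.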

The practical application that is of interest to us as experimentalists is that we do not need to think about or worry about the shape of the population distribution. Since we know that the sample means are normally distributed, we can simply use the normal distribution of the means to calculate confidence intervals and do more complex analysis and comparisons moving forward... you are so lucky—t-test comparisons and hypothesis testing are coming soon!
So when we invoke the Central Limit Theorem, we can make the following definitions. The standard deviation of the distribution of the means is:
$$
\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}
$$
We can use the standard deviation of the means to re-write our PDF, or Gaussian function; when describing the distribution of the means, \(z\) now becomes:
$$
z = \frac{\overline{x} - \mu}{\sigma_{\overline{x}}} = \frac{\overline{x} - \mu}{\frac{\sigma}{\sqrt{n}}}
$$
This new definition of \(z\) allows us to re-write our confidence interval equation as follows; now we can explicitly determine the confidence level, or uncertainty, in the population mean! Here \(z_{\frac{\alpha}{2}}\) is the z-value that cuts off an upper-tail area of \(\frac{\alpha}{2}\), with \(\alpha = 1 - c\) for confidence level \(c\):
$$
\mu = \overline{x} \pm z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}
$$
We can also use the approximation that \(\sigma \approx S_x\) for \(n > 30\), and we can then define the sample standard deviation of the sample means to be:
$$
S_{\overline{x}} = \frac{S_{x}}{\sqrt{n}}
$$
Now we can express the confidence interval of the population mean as:
$$
\mu = \overline{x} \pm z_{\frac{\alpha}{2}} \frac{S_x}{\sqrt{n}}
$$
Whew!! That was a lot of work. Now let's go back to our rolling particles and calculate the 95% confidence interval for the population mean pressure.
Well that will simply be:
$$
\mu = \overline{x} \pm z_{\frac{\alpha}{2}} \frac{S_x}{\sqrt{n}} = 2.94 \pm 0.063
$$
As you can see, this is a much smaller range than our previous calculation, due to the \(\frac{1}{\sqrt{n}}\) factor. This is the key difference when estimating the uncertainty in the population mean, that is, when using \(\overline{x}\) as an estimate of \(\mu\). Previously we were just looking at the likelihood of observing an individual value that deviates from the population mean by a particular amount.
In summary, the \(c\%\) interval for the mean value is narrower than the data by a factor of \(\frac{1}{\sqrt{n}}\). This is because \(n\) observations have been used to average out the random deviations of individual measurements.
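As a sketch of this calculation in Python (the actual pressure data set is not reproduced here, so the sample below is simulated for illustration; 1.96 is the familiar \(z\) value for 95% confidence):

```python
import math
import random
import statistics

# Simulated stand-in for a large sample (n >= 30); the real
# rolling-particle pressure data are not reproduced in the text.
random.seed(1)
data = [random.gauss(2.94, 0.20) for _ in range(40)]

n = len(data)
xbar = statistics.fmean(data)
s_x = statistics.stdev(data)   # S_x, approximating sigma since n > 30

z_c = 1.96                     # z value for a 95% confidence level
half_width = z_c * s_x / math.sqrt(n)

print(f"mu = {xbar:.3f} +/- {half_width:.3f}")
```

Note that the half-width shrinks as \(1/\sqrt{n}\): quadrupling the sample size halves the uncertainty in the mean.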
Confidence Intervals for Small Samples
We have been working with large samples thus far, generally taken to mean \(n \geq 30\), but there are many experiments in which \(n\) will be less than 30. For these situations, we will utilize the Student's t-distribution, developed by a statistician writing under the pseudonym Student:
$$
t = \frac{\overline{x} - \mu}{\frac{S_x}{\sqrt{n}}}
$$
Student was actually William Gosset, a Guinness brewer who was applying statistical analysis to the brewing process. Guinness did not want him publishing the results (and thus giving away Guinness trade secrets), so he published under the pseudonym.
The assumption here is that the underlying population satisfies the Gaussian distribution. This distribution also depends on the degrees of freedom, \(\nu = n - 1\).
The t-distribution is similar to the Gaussian PDF: it is symmetric and its total area is unity. Moreover, the t-distribution approaches the standard Gaussian PDF as \(\nu\) (or \(n\)) becomes large; for \(n > 30\) the two distributions are essentially identical, as we will see in the t-table soon.
The tabulated area, \(\alpha\), is the area between \(t\) and \(t \rightarrow \infty\) for a given number of degrees of freedom. This is very different from our Z-table, where we were looking at the area between \(z = 0\) and \(z\). The value of \(\alpha\) corresponds to the probability that, for a given sample size, \(t\) will have a value greater than that given in the table. We can assert with confidence level \(c = 1 - \alpha\) that the actual value of \(t\) does not fall in that tail region.
A two-sided confidence interval is then:
$$
\overline{x} - t_{\frac{\alpha}{2}, \nu} \frac{S_x}{\sqrt{n}} < \mu < \overline{x} + t_{\frac{\alpha}{2}, \nu} \frac{S_x}{\sqrt{n}}
$$
In the literature, \(\alpha\) is sometimes referred to as the level of significance. The precision uncertainty in the value of \(\overline{x}\) is:
$$
P_x = t_{\frac{\alpha}{2}, \nu} \frac{S_x}{\sqrt{n}}
$$
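A small Python sketch of the small-sample interval (the t-table excerpt below lists standard two-tailed 95% critical values; the PCM-like weights are hypothetical, made up for illustration):

```python
import math
import statistics

# Excerpt from a two-tailed t-table: t_{0.025, nu} for 95% confidence.
T_TABLE_95 = {5: 2.571, 9: 2.262, 17: 2.110, 29: 2.045}

# Hypothetical small sample of PCM-like weights in mg (illustration only).
data = [2.01, 1.98, 2.03, 1.99, 2.02, 1.97, 2.00, 2.04, 1.96, 2.05]

n = len(data)
nu = n - 1                      # degrees of freedom
xbar = statistics.fmean(data)
s_x = statistics.stdev(data)

# Precision uncertainty in xbar: P_x = t_{alpha/2, nu} * S_x / sqrt(n)
P_x = T_TABLE_95[nu] * s_x / math.sqrt(n)
print(f"mu = {xbar:.3f} +/- {P_x:.3f} mg")
```

Swapping the t critical value for 1.96 here would give a slightly narrower, overconfident interval; that is the whole point of the t-distribution for small \(n\).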
Hypothesis Testing for a Single Mean for a Small Sample Size
The reason we are using these statistical tools is to make decisions regarding a measurand. One of the most common methods for doing this is hypothesis testing.
Typically, we deal with two hypotheses.
Null Hypothesis:
This is the first step in hypothesis testing.
The null hypothesis is written as
\( H_0 : \mu = \mu_0 \),
where \(\mu_0\) is some specific constant value.
Alternative Hypothesis:
This is the second step. The choice should reflect what we are attempting to show.
There are three common forms of alternative hypotheses:
- Two-tailed test: concerned with whether a population mean \(\mu\) is different from a specific value \(\mu_0\)
That is, \( H_a : \mu \neq \mu_0 \)
- Left-tailed test: concerned with whether a population mean is less than a specific value
That is, \( H_a : \mu < \mu_0 \)
- Right-tailed test: concerned with whether a population mean is greater than a specific value
That is, \( H_a : \mu > \mu_0 \)

Procedure for Hypothesis Testing
1. Define the null hypothesis, \( H_0 \)
2. Define the alternative hypothesis, \( H_a \)
3. Define the confidence level (for example, 95 percent)
4. Calculate the value of \( t_{exp} \) from the sample data
5. Determine the critical value from the t-distribution: \( t_{\alpha,\nu} \) or \( t_{\frac{\alpha}{2},\nu} \), where \(\nu\) is degrees of freedom
6. If \( t_{exp} \) falls in the rejection region, reject \( H_0 \) in favor of \( H_a \)
7. If \( t_{exp} \) falls in the do-not-reject region, conclude there is not enough evidence to reject \( H_0 \) at the given confidence level
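Steps 6 and 7 of the procedure can be sketched as a small decision function (a Python illustration, using the numbers from the PCM example that follows):

```python
def t_test_decision(t_exp, t_crit, tail="two"):
    """Steps 6-7 of the procedure: is t_exp in the rejection region?

    tail: "two"   -> reject if |t_exp| > t_crit, with t_crit = t_{alpha/2, nu}
          "right" -> reject if t_exp >  t_crit, with t_crit = t_{alpha, nu}
          "left"  -> reject if t_exp < -t_crit
    """
    if tail == "two":
        reject = abs(t_exp) > t_crit
    elif tail == "right":
        reject = t_exp > t_crit
    else:  # "left"
        reject = t_exp < -t_crit
    return "reject H0" if reject else "fail to reject H0"

# Two-tailed PCM example: t_exp = 0.99011 vs t_{0.025,17} = 2.11
print(t_test_decision(0.99011, 2.11, tail="two"))
```

The same function handles the left- and right-tailed examples later in this chapter by passing `tail="left"` or `tail="right"` with the appropriate one-tailed critical value.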
Let’s apply this to an example. Suppose we want to test if the PCM data has a mean of 2.00 mg at 95 percent confidence:
$$
H_0: \mu = 2.00 \text{ mg}
$$
$$
H_a: \mu \neq 2.00 \text{ mg}
$$
$$
t_{exp} = \frac{\overline{x} - \mu_0}{\frac{S_x}{\sqrt{n}}} = 0.99011
$$
$$
t_{0.025,17} = 2.11
$$
Since \( |t_{exp}| < t_{crit} \), we fail to reject \( H_0 \). The data are consistent with a mean of 2.00 mg at 95 percent confidence.
P-Values
In scientific literature and statistics courses, you may have encountered the term P-value. This term is very common and often used more frequently than t-values or z-values in reporting results.
The P-value is the probability of obtaining a result as extreme as, or more extreme than, the value actually observed, assuming the null hypothesis is true.
Let’s revisit the hypothesis test example we just discussed. We performed a two-tailed test with a confidence level of 95 percent, so the level of significance is \(\alpha = 0.05\), split as 2.5 percent in each tail.
Because the two-tailed P-value already accounts for both tails, we compare it directly to \(\alpha\): if the P-value is greater than 0.05, the result falls in the do-not-reject region; if the P-value is less than 0.05, it falls in the reject region.
In our case, the test yielded:
P-value = 0.336
Since 0.336 is greater than 0.05, we do not reject the null hypothesis. This agrees with the result obtained using the critical t-value.
The interpretation of the P-value in this case is that there is a 33.6 percent chance of obtaining a result as extreme or more extreme than the one observed, if the true mean were 2.00 mg.
This tells us that our sample data is consistent with the null hypothesis at the 95 percent confidence level.
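The P-value above can be reproduced without statistical software by numerically integrating the t-distribution PDF, which needs only the standard-library `math.gamma`. This is a Python sketch (the course notebook uses Mathematica); the trapezoid-rule settings are my own choices:

```python
import math

def t_pdf(t, nu):
    # Student's t probability density function.
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + t * t / nu) ** (-(nu + 1) / 2)

def t_tail_prob(t, nu, upper=100.0, steps=20000):
    # One-sided tail area P(T > t) by the trapezoid rule; the tail
    # beyond `upper` is negligible for moderate nu.
    h = (upper - t) / steps
    area = 0.5 * (t_pdf(t, nu) + t_pdf(upper, nu))
    for i in range(1, steps):
        area += t_pdf(t + i * h, nu)
    return area * h

# Two-tailed P-value for the example above: t_exp = 0.99011, nu = 17.
p_value = 2 * t_tail_prob(0.99011, 17)
print(round(p_value, 3))  # close to the 0.336 reported above
```

In practice one would call a library routine (e.g. a t-distribution CDF), but the integral makes the definition of the P-value concrete: it is literally the area in the tails beyond the observed statistic.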
Another PCM Application
Let’s now apply hypothesis testing to a new scenario. Suppose we want to know:
Does the sample come from a population whose true mean weight is greater than 1.99 mg, assuming a confidence level of 99 percent?
We will use a right-tailed test.
The hypotheses are:
$$
H_0: \mu = 1.99 \text{ mg}
$$
$$
H_a: \mu > 1.99 \text{ mg}
$$
From our data, we calculate:
$$
t_{exp} = \frac{\overline{x} - \mu_0}{\frac{S_x}{\sqrt{n}}} = 2.025
$$
The critical value from the t-distribution table is:
$$
t_{0.01,17} = 2.567
$$
Because \( t_{exp} = 2.025 \) is less than \( t_{0.01,17} = 2.567 \), it does not fall in the rejection region.
Therefore, we conclude that we do not have sufficient evidence to reject the null hypothesis at the 99 percent confidence level.
This means the population mean is not significantly greater than 1.99 mg.
Note that this does not prove the population mean is 1.99 mg. We are simply saying we do not have strong enough evidence to conclude otherwise.
Now, let’s look at the P-value.
The calculated P-value is:
P-value = 0.02943
For a right-tailed test at 99 percent confidence, we compare against alpha = 0.01. Since 0.02943 is greater than 0.01, the result also falls in the do-not-reject region. This agrees with the conclusion from the t-test.
Yet Another PCM Application
Let’s consider a new hypothesis test.
Does the sample come from a population whose true mean weight is less than 2.01 mg, assuming a confidence level of 90 percent?
This will be a left-tailed test.
The hypotheses are:
$$
H_0: \mu = 2.01 \text{ mg}
$$
$$
H_a: \mu < 2.01 \text{ mg}
$$
From our data, we calculate:
$$
t_{exp} = \frac{\overline{x} - \mu_0}{\frac{S_x}{\sqrt{n}}} = -0.0448035
$$
The critical value from the t-distribution table for a 90 percent confidence level and 17 degrees of freedom is:
$$
t_{0.1,17} = -1.33
$$
Because \( t_{exp} = -0.0448 \) is greater than \( -1.33 \), it does not fall in the rejection region.
Therefore, we do not reject the null hypothesis. The data does not provide sufficient evidence to conclude that the population mean is less than 2.01 mg at 90 percent confidence.
Now, let’s examine the P-value.
P-value = 0.4824
For a 90 percent confidence level, the cutoff is alpha = 0.10. Since the P-value is greater than 0.10, we are again in the do-not-reject region. This agrees with our earlier conclusion.
Hypothesis Testing for a Large Sample Size
When the sample size is large, we follow the same hypothesis testing procedure as with small samples. However, we use the z-distribution instead of the t-distribution.
Specifically, for large \( n \), we replace \( t_{\alpha,\nu} \) with \( z_{\alpha} \), and we compute the test statistic using:
$$
z_{exp} = \frac{\overline{x} - \mu_0}{\frac{S_x}{\sqrt{n}}}
$$
Note that when \( n > 30 \), we can either use the z-distribution or use the t-distribution with \( \nu \approx \infty \), which produces essentially the same result.
This approach is useful when working with larger data sets, as the normal (Gaussian) approximation becomes more accurate due to the Central Limit Theorem.
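For the large-sample case, the standard normal CDF is available in the Python standard library via `math.erf`, so a z-test needs no tables at all. A hedged sketch (the sample numbers here are hypothetical, not taken from the text's data):

```python
import math

def phi(z):
    # Standard normal CDF from the error function (math.erf is stdlib).
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def z_test(xbar, mu0, s_x, n, alpha, tail):
    z_exp = (xbar - mu0) / (s_x / math.sqrt(n))
    if tail == "left":
        p = phi(z_exp)
    elif tail == "right":
        p = 1 - phi(z_exp)
    else:  # two-tailed
        p = 2 * (1 - phi(abs(z_exp)))
    decision = "reject H0" if p < alpha else "fail to reject H0"
    return z_exp, p, decision

# Hypothetical large-sample numbers (illustration only):
z_exp, p, decision = z_test(xbar=3.05, mu0=3.1, s_x=0.25, n=50,
                            alpha=0.01, tail="left")
print(f"z_exp = {z_exp:.3f}, P-value = {p:.4f}, {decision}")
```

Comparing the returned P-value to \(\alpha\) gives the same decision as comparing \(z_{exp}\) to the critical \(z\) value.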
Looking Back at Rolling Velocity
Let’s apply hypothesis testing to a real experimental question:
Does the sample of rolling magnetic beads come from a population with a velocity less than 3.1 micrometers per second, at a confidence level of 99 percent?
This is a left-tailed test.
The hypotheses are:
$$
H_0: \mu = 3.1 \, \frac{\mu m}{s}
$$
$$
H_a: \mu < 3.1 \, \frac{\mu m}{s}
$$
From the data, we calculate:
$$
t_{exp} = \frac{\overline{x} - \mu_0}{\frac{S_x}{\sqrt{n}}} = -1.3383
$$
Since this is a large sample size case, we can use the z-distribution. The critical value for a one-tailed 99 percent confidence test is:
$$
z_{0.01} = -2.326
$$
Now compare:
- \( t_{exp} = -1.3383 \)
- \( z_{crit} = -2.326 \)
Because \( t_{exp} > z_{crit} \), it does not fall in the rejection region.
Conclusion: We do not reject the null hypothesis. The data does not provide sufficient evidence to conclude that the velocity is less than 3.1 micrometers per second at 99 percent confidence.
Now look at the P-value:
P-value = 0.09151
Since the P-value is greater than 0.01, this again confirms that we are in the do-not-reject region.
t-Test Comparison of Sample Means
We can also compare two samples based solely on their means using the t-test for independent samples.
The test statistic is:
$$
t = \frac{\overline{x}_1 - \overline{x}_2}{\sqrt{\left( \frac{S_1^2}{n_1} \right) + \left( \frac{S_2^2}{n_2} \right)}}
$$
Where:
- \( \overline{x}_1, S_1, n_1 \) are the sample mean, standard deviation, and size of sample 1
- \( \overline{x}_2, S_2, n_2 \) are the sample mean, standard deviation, and size of sample 2
The degrees of freedom \( \nu \) can be approximated using:
$$
\nu = \frac{ \left[ \left( \frac{S_1^2}{n_1} \right) + \left( \frac{S_2^2}{n_2} \right) \right]^2 }{ \frac{ \left( \frac{S_1^2}{n_1} \right)^2 }{n_1 - 1} + \frac{ \left( \frac{S_2^2}{n_2} \right)^2 }{n_2 - 1} }
$$
After computing this value, round \( \nu \) to the nearest integer.
If the value of \( t \) falls within the interval \( \pm t_{\frac{\alpha}{2}, \nu} \), then the two means are not significantly different at the specified confidence level.
This test is very versatile — it works for comparing large samples, small samples, or a mix of both.
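The two formulas above translate directly into code. A Python sketch of the two-sample statistic and the approximate degrees of freedom (the summary statistics below are hypothetical, chosen only to exercise the function):

```python
import math

def welch_t_and_nu(x1, s1, n1, x2, s2, n2):
    # Test statistic and approximate degrees of freedom for comparing
    # two independent sample means (unequal variances allowed).
    a, b = s1**2 / n1, s2**2 / n2
    t = (x1 - x2) / math.sqrt(a + b)
    nu = (a + b)**2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))
    return t, round(nu)  # round nu to the nearest integer

# Hypothetical summary statistics for two samples (illustration only).
t, nu = welch_t_and_nu(x1=10.2, s1=1.1, n1=10, x2=9.4, s2=1.4, n2=12)
print(f"t = {t:.3f}, nu = {nu}")
```

One would then compare \(|t|\) against \(t_{\frac{\alpha}{2},\nu}\) from the table, exactly as in the single-sample tests.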
Are These Materials Significantly Stiffer?
Let’s compare two samples to determine whether Material A and Material B differ significantly in stiffness.
Here is the sample data:
Material A:
Mean \( \overline{x}_A = 302.6 \) GPa
Standard deviation \( S_A = 1.27 \) GPa
Sample size \( n_A = 12 \)
Material B:
Mean \( \overline{x}_B = 302.3 \) GPa
Standard deviation \( S_B = 1.7 \) GPa
Sample size \( n_B = 15 \)
Hypotheses:
$$
H_0: \mu_A = \mu_B
$$
$$
H_a: \mu_A \neq \mu_B
$$
Approximated degrees of freedom:
$$
\nu \approx 22
$$
Calculated test statistic:
$$
t_{exp} = 1.44
$$
Critical value for a two-tailed 95 percent confidence level:
$$
t_{0.025,22} = 2.074
$$
Since \( t_{exp} = 1.44 \) is less than \( 2.074 \), the result falls in the do-not-reject region.
Conclusion: There is no statistically significant difference in the stiffness of Material A and Material B at the 95 percent confidence level.
From a simulation, suppose the P-value was calculated to be:
P-value = 0.162
This also falls in the do-not-reject region, since 0.162 > 0.05. Both the t-test and P-value support the same conclusion.


