20.3: Two categorical variables

Okay. Let’s return our attention to the people DataFrame, and begin with a bivariate analysis of the gender and color columns. The first thing we should do, of course, is inspect each one individually, using .value_counts() and perhaps a bar chart (see sections 15.1 and 15.4). Let’s say we’ve done that.
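In case it’s been a while, here is a minimal sketch of that one-column-at-a-time inspection. The real people DataFrame isn’t reproduced on this page, so the tiny toy DataFrame below is invented purely for illustration; only the column names gender and color come from our data set.

```python
import pandas as pd

# Hypothetical stand-in for the real people DataFrame (invented rows).
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "other", "male"],
    "color":  ["pink", "blue", "blue", "purple", "green", "red"],
})

# Count how often each value occurs, one column at a time.
gender_counts = toy.gender.value_counts()
color_counts = toy.color.value_counts()

print(gender_counts)
print(color_counts)
```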

The next obvious question: is there an association between the two variables? In other words, are there particular values of one that tend to go with particular values of the other? In still other words, do people of different genders tend to have different favorite colors?

Contingency Tables

The first tool to get at this question is called a contingency table. This is very much like .value_counts(), but for two variables instead of one. The function we want is crosstab() from the Pandas package: if we give it two columns as arguments, it counts how many rows fall into every possible combination of their values. Here’s what it looks like:

Code \(\PageIndex{1}\) (Python):

pd.crosstab(people.gender, people.color)

| color   blue  green  pink  purple  red  yellow
| gender
| female   240    402   665     644  289     378
| male    1403      0     0     248  463     258
| other      1      2     2       2    1       2

Interpreting this is straightforward. Every cell in the matrix tells us how many people had a particular gender and a particular favorite color. For instance, there were 378 females who named yellow as their favorite color, and no males at all chose green.
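Raw counts can be a little hard to compare when the groups are very different sizes (the “other” row has only 10 people in it, for instance). One option, sketched here, is to convert each row to proportions. The counts are typed in by hand from the table above so that the snippet runs on its own; counts and row_share are just illustrative names, not part of the original analysis.

```python
import pandas as pd

# Counts copied by hand from the crosstab output above.
counts = pd.DataFrame(
    {
        "blue":   [240, 1403, 1],
        "green":  [402, 0, 2],
        "pink":   [665, 0, 2],
        "purple": [644, 248, 2],
        "red":    [289, 463, 1],
        "yellow": [378, 258, 2],
    },
    index=pd.Index(["female", "male", "other"], name="gender"),
)
counts.columns.name = "color"

# Divide every row by its own total, so each row sums to 1.
row_share = counts.div(counts.sum(axis=1), axis=0)

print(row_share.round(3))
```

With the real DataFrame, pd.crosstab(people.gender, people.color, normalize="index") should give the same proportions directly.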

Plotting Two Categorical Variables

So now we have a table of counts – how to turn this into a pretty and informative plot?

Unfortunately, there doesn’t seem to be any great way to do this. There’s something called a “mosaic plot” which attempts it, but it’s not very easy to interpret visually. Another option is a “heat map,” which essentially reproduces the above table as squares in a grid, with each square color coded on a continuum by its count (for instance, low numbers might be dark blue and high numbers bright yellow, with a spectrum of colors for the numbers in between). That’s sort of okay, but to be honest I prefer to just look at the numbers.
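For the curious, here is a rough sketch of what such a heat map could look like with Matplotlib. It is one possible way to draw it, not a recommended recipe: the counts are typed in by hand from the contingency table above, and the "viridis" color scale and the variable names are arbitrary choices made for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied by hand from the contingency table above.
counts = pd.DataFrame(
    {
        "blue":   [240, 1403, 1],
        "green":  [402, 0, 2],
        "pink":   [665, 0, 2],
        "purple": [644, 248, 2],
        "red":    [289, 463, 1],
        "yellow": [378, 258, 2],
    },
    index=["female", "male", "other"],
)

fig, ax = plt.subplots()
# One colored square per cell: dark for low counts, bright for high counts.
image = ax.imshow(counts.to_numpy(), cmap="viridis")

# Label the grid with the category names.
ax.set_xticks(range(len(counts.columns)))
ax.set_xticklabels(counts.columns)
ax.set_yticks(range(len(counts.index)))
ax.set_yticklabels(counts.index)

# Write each count inside its square so the numbers stay visible.
halfway = counts.to_numpy().max() / 2
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        value = counts.iat[i, j]
        text_color = "white" if value < halfway else "black"
        ax.text(j, i, str(value), ha="center", va="center", color=text_color)

fig.colorbar(image, ax=ax, label="number of people")
```

In a Jupyter notebook the figure appears on its own; in a plain script, finish with plt.show().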

The χ² test

The statistical test to use for two categorical variables is called the χ² test (pronounced “kai-squared,” not “chai-squared,” by the way). To run it, it’s convenient to first store the contingency table itself as a variable. I’ll call it gender_color since it’s a table of the genders of people and their favorite colors:

Code \(\PageIndex{2}\) (Python):

gender_color = pd.crosstab(people.gender, people.color)

Now, we run the test by calling the chi2_contingency() function from SciPy:

Code \(\PageIndex{3}\) (Python):

import scipy.stats
scipy.stats.chi2_contingency(gender_color)

| (2125.8933435, 0.0, 10,
|  array([[8.60798e+02, 2.11534e+02, 3.49241e+02, 4.68098e+02, 3.94270e+02, 3.34056e+02],
|         [7.79913e+02, 1.91657e+02, 3.16424e+02, 4.24113e+02, 3.57223e+02, 3.02667e+02],
|         [3.28800e+00, 8.08000e-01, 1.33400e+00, 1.78800e+00, 1.50600e+00, 1.27600e+00]]))

I know, I know: that output is downright hideous. Here’s the deal, though: all you have to do is look at the second number in that long, banana-and-boxie-laden thing. The second number is the p-value. It prints as 0.0, which means it is too small for Python to tell apart from zero. This is obviously lower than .05 (our α), and therefore we can conclude that gender and color are associated.

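All the other stuff in that output (the test statistic, the degrees of freedom, and the counts we would expect to see if there were no association) is fine-grained detail that statisticians like to pore over. For us, the only thing we need to see from a χ² (or any other) test is the p-value.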
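If hunting for the second number by eye gets old, one option is to unpack the result into named variables and print only the p-value. The sketch below rebuilds the table by hand from the counts shown earlier so that it runs on its own; chi2, p_value, dof, expected, and alpha are just labels chosen for this example, not names SciPy requires.

```python
import pandas as pd
import scipy.stats

# Counts copied by hand from the contingency table earlier in this section.
counts = pd.DataFrame(
    {
        "blue":   [240, 1403, 1],
        "green":  [402, 0, 2],
        "pink":   [665, 0, 2],
        "purple": [644, 248, 2],
        "red":    [289, 463, 1],
        "yellow": [378, 258, 2],
    },
    index=["female", "male", "other"],
)

# The result unpacks into four pieces, in this order:
# test statistic, p-value, degrees of freedom, expected counts.
chi2, p_value, dof, expected = scipy.stats.chi2_contingency(counts)

alpha = 0.05
print("p-value:", p_value)
if p_value < alpha:
    print("Evidence of an association between gender and color.")
else:
    print("No evidence of an association.")
```

With the real data, you would pass gender_color instead of the hand-typed counts.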