20.1: Three Bivariate Scenarios


As we saw with univariate data in Chapter 15, different kinds of plots and statistics are appropriate depending on the variable’s scale of measurement: categorical or numeric. With two variables, there are thus three different cases for bivariate analysis:

Two categorical variables
One categorical variable and one numeric variable
Two numeric variables

We’ll consider each case in turn. Throughout all the remaining sections, we’ll use this fictitious data set, called people:

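|   | gender | salary | color  | followers |
|---|--------|--------|--------|-----------|
| 0 | male   | 54.94  | purple | 26        |
| 1 | female | 72.48  | purple | 22        |
| 2 | male   | 9.47   | blue   | 27        |
| 3 | other  | 60.08  | red    | 22        |
| 4 | male   | 37.62  | red    | 13        |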

Each row represents one fictional person we interviewed, and includes their gender, their salary (in thousands of dollars per year), their favorite color, and the number of followers they have on some unspecified social media website.
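To connect these four columns to the three cases, here is a minimal sketch that rebuilds just the five rows shown above and reports which case applies to a given pair of columns. The helper name `bivariate_case` is hypothetical (it is not part of Pandas), and it relies on the simplifying assumption that any column with a numeric dtype is a numeric variable and everything else is categorical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Only the five rows displayed above, not the full data set.
sample = pd.DataFrame({
    "gender": ["male", "female", "male", "other", "male"],
    "salary": [54.94, 72.48, 9.47, 60.08, 37.62],
    "color": ["purple", "purple", "blue", "red", "red"],
    "followers": [26, 22, 27, 22, 13],
})

def bivariate_case(df, col_a, col_b):
    """Hypothetical helper: name the bivariate case for two columns.

    Assumption: a numeric dtype means a numeric variable; any other
    dtype (such as strings) is treated as categorical.
    """
    n_numeric = sum(is_numeric_dtype(df[c]) for c in (col_a, col_b))
    cases = {
        0: "two categorical variables",
        1: "one categorical variable and one numeric variable",
        2: "two numeric variables",
    }
    return cases[n_numeric]

print(bivariate_case(sample, "gender", "color"))
print(bivariate_case(sample, "gender", "salary"))
print(bivariate_case(sample, "salary", "followers"))
```

A dtype check like this is only a rough guide: a variable stored as integer codes (say, 1, 2, 3 for three colors) would be reported as numeric even though it is really categorical.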

The DataFrame has 5000 rows, and no special “index” variable: none of the columns that we collected are unique, so we just let Pandas default to indexing the rows by number, 0 through 4,999.
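The real people data isn't reproduced here, so the following sketch builds a randomly generated stand-in with the same four column names and 5000 rows, purely to show what the default row numbering looks like. The seed, the value ranges, and the set of colors are all assumptions made for illustration; the values will not match the table above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
n = 5000

# Stand-in data: same column names as people, but invented values.
people = pd.DataFrame({
    "gender": rng.choice(["male", "female", "other"], size=n),
    "salary": rng.uniform(0, 100, size=n).round(2),   # thousands of dollars
    "color": rng.choice(["purple", "blue", "red"], size=n),
    "followers": rng.integers(0, 50, size=n),
})

# No index was specified, so Pandas numbers the rows itself.
print(people.index)
print(people.index[0], people.index[-1])

# Check whether any column could have served as a unique index.
for col in people.columns:
    print(col, people[col].is_unique)
```

Because no index was given when the DataFrame was created, `people.index` is a `RangeIndex` that starts at 0 and stops just before 5000.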