18.2: The .groupby() method


One of the most useful methods in the whole DataFrame repertoire is .groupby(). It applies when you want summary statistics (mean, quantile, max/min, etc.) not for the whole data set, but for each subset of the data set, where the subsets are defined by the values of one of the variables.

Here’s an example in action. It’s old news that we could find, say, the median IQ of the Simpsons family overall:

Code \(\PageIndex{1}\) (Python):

print(simpsons['IQ'].median())

But it’s new news that we can do this for each gender separately, via:

Code \(\PageIndex{2}\) (Python):

print(simpsons.groupby('gender')['IQ'].median())

| gender
| F 120.0
| M 74.0
| Name: IQ, dtype: float64

We give the name of a categorical variable as the argument to .groupby(), and specify (in square brackets) the numeric column we wish to analyze. Finally, we choose the summary statistic we want (.median() in the above case).

All this produces a resulting Series. Think hard: the keys of the resulting Series are the values of the categorical variable (in the original DataFrame) that we grouped by, and the values of the resulting Series are the results of applying the summary statistic function to each of the subsets separately.
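To make the anatomy of that result concrete, here is a minimal sketch on a small made-up DataFrame. The name `toy`, its columns, and its values are illustrative assumptions, not the Simpsons data. It shows that the grouped result is an ordinary Series whose keys are the group labels.

```python
import pandas as pd

# Hypothetical stand-in data; NOT the simpsons DataFrame from the text.
toy = pd.DataFrame({
    'team':  ['red', 'blue', 'red', 'blue', 'red'],
    'score': [10, 40, 30, 60, 50],
})

result = toy.groupby('team')['score'].median()

print(type(result).__name__)   # the result is a Series
print(list(result.index))      # keys: the distinct values of 'team'
print(list(result.values))     # values: one median per group
print(result['red'])           # look up one group's statistic by its key
```

Because the result is just a Series, everything you already know about Series (indexing by key, sorting, plotting) applies to it.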

So now, in addition to knowing that the overall Simpsons family median IQ is 95, we also know that among Simpson boys and men, it’s only 74, whereas among girls and women, it’s an impressive 120.
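One way to see what .groupby() is doing is to compute the same numbers the long way, filtering the rows for each group in turn. The sketch below does that on made-up data (the `toy` DataFrame and its columns are illustrative assumptions) and prints both answers side by side.

```python
import pandas as pd

# Hypothetical stand-in data; NOT the simpsons DataFrame from the text.
toy = pd.DataFrame({
    'team':  ['red', 'blue', 'red', 'blue', 'red'],
    'score': [10, 40, 30, 60, 50],
})

# The long way: filter each subset by hand, then summarize it.
by_hand = {}
for label in toy['team'].unique():
    subset = toy[toy['team'] == label]
    by_hand[label] = subset['score'].median()

# The short way: let .groupby() do the splitting for us.
grouped = toy.groupby('team')['score'].median()

# Both routes should give the same statistic for every group.
for label in by_hand:
    print(label, by_hand[label], grouped[label])
```

The loop needs one filter per group, while the .groupby() line stays the same no matter how many distinct values the grouping variable has.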

Another example: let’s find the maximum age for each hair style:

Code \(\PageIndex{3}\) (Python):

print(simpsons.groupby('hair')['age'].max())

| hair
| buzz 10
| curly 8
| none 36
| shaggy 4
| stacked tall 34
| Name: age, dtype: int64

(Since there are so many different hairstyles present, Maggie turns out to be the only one whose age is not represented here.)
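That observation follows from how .max() works on groups: each group reports only its single largest value, so anyone who shares a group with an older member drops out of the output. A minimal sketch on made-up data (the `toy` DataFrame, its names, and its values are illustrative assumptions) uses .size() to count the rows in each group and then lists who is hidden.

```python
import pandas as pd

# Hypothetical stand-in data; NOT the simpsons DataFrame from the text.
toy = pd.DataFrame({
    'name': ['Ann', 'Bo', 'Cy', 'Di'],
    'hair': ['curly', 'curly', 'buzz', 'none'],
    'age':  [8, 1, 10, 36],
})

oldest = toy.groupby('hair')['age'].max()   # one age per hairstyle
counts = toy.groupby('hair').size()         # number of rows per hairstyle

print(oldest)
print(counts)

# Rows whose age is not any group's maximum. This shortcut works here
# only because every age in the toy data is distinct.
hidden = toy[~toy['age'].isin(oldest.tolist())]
print(hidden['name'].tolist())
```

Any group of size 1 necessarily shows its only member's age; only groups with two or more rows can hide someone.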