15.1: Categorical Data- Counts of Occurrences


Let’s say you had access to a poll on people’s favorite pop stars. You import this into a big ol’ Pandas Series called faves:

Code \(\PageIndex{1}\) (Python):

print(faves)

| 0 Katy Perry

| 1 Rihanna

| 2 Justin Bieber

| 3 Drake

| 4 Rihanna

| 5 Taylor Swift

| 6 Adele

| 7 Adele

| 8 Taylor Swift

| 9 Justin Bieber

| ...

| 1395 Katy Perry

| dtype: object

That’s great, but it’s also kinda TMI. You probably don’t care who the first person’s idol is, nor the fifteenth, nor the last. Much more interesting is simply how many times each value appears in the Series. This information is available from the Pandas .value_counts() method:

Code \(\PageIndex{2}\) (Python):

counts = faves.value_counts()

print(counts)

| Taylor Swift 388

| Katy Perry 265

| Drake 261

| Adele 212

| Rihanna 136

| Justin Bieber 134

| dtype: int64

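The .value_counts() method returns another Series, but the values of the original Series become the keys of the new one, and each key's value is the number of times it appeared. By default the result is sorted from most common to least common, which tells us at a glance how popular each answer is relative to the others.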
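Because the result is itself a Series, you can look up a single answer's count by its key. Here's a minimal sketch using a tiny made-up Series (the names small_faves and small_counts are invented for illustration and aren't the poll data):

```python
import pandas as pd

# A tiny stand-in for the poll data (hypothetical values, for illustration only).
small_faves = pd.Series(['Adele', 'Drake', 'Adele', 'Rihanna', 'Adele', 'Drake'])

small_counts = small_faves.value_counts()

# The original values are now the index (keys) of the new Series.
print(list(small_counts.index))   # ['Adele', 'Drake', 'Rihanna']

# Look up one answer's count by its key.
print(small_counts['Adele'])      # 3
```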

To get percentages instead of totals, just divide by the total number of responses (the length of the original faves Series, not of counts) and multiply by 100:

Code \(\PageIndex{3}\) (Python):

print(counts / len(faves) * 100)

| Taylor Swift 27.7937

| Katy Perry 18.9828

| Drake 18.6963

| Adele 15.1862

| Rihanna 9.7421

| Justin Bieber 9.5989

| dtype: float64
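Pandas can also do the division for you: passing normalize=True to .value_counts() returns proportions that add up to 1. A small sketch, again with a made-up Series (small_faves is hypothetical, not the poll data):

```python
import pandas as pd

# Hypothetical mini-poll: 4 votes for Adele, 3 for Drake, 1 for Rihanna.
small_faves = pd.Series(['Adele', 'Drake', 'Adele', 'Rihanna',
                         'Adele', 'Drake', 'Adele', 'Drake'])

# Percentages computed by hand: counts divided by the number of responses.
by_hand = small_faves.value_counts() / len(small_faves) * 100

# The same thing, letting Pandas compute the proportions.
built_in = small_faves.value_counts(normalize=True) * 100

print(built_in['Adele'])     # 50.0
print(built_in['Drake'])     # 37.5
print(built_in['Rihanna'])   # 12.5
```

Either way, the percentages should add up to 100, which makes for a handy sanity check.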

Recall that the mode is the only measure of central tendency that makes sense for categorical data. To find it, all you have to do is call .value_counts() and look at the top result. (In this case, Taylor Swift.)
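If you'd rather have the mode handed to you than read it off the top row, here are a few ways to pull it out in code. This is a sketch on a made-up Series (small_faves and small_counts are hypothetical names), not the poll data itself:

```python
import pandas as pd

# A tiny stand-in for the poll data (hypothetical values).
small_faves = pd.Series(['Adele', 'Drake', 'Adele', 'Rihanna', 'Adele', 'Drake'])

small_counts = small_faves.value_counts()

# The counts are sorted from most to least common, so the first key is the mode.
print(small_counts.index[0])    # Adele

# .idxmax() returns the key with the largest count.
print(small_counts.idxmax())    # Adele

# .mode() works directly on the original Series; it returns a Series
# because several values could tie for most common.
print(small_faves.mode()[0])    # Adele
```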

Note that .value_counts() is a Pandas Series method, not a NumPy method. If you find yourself with a NumPy array instead, you can just wrap it in a Series as we did in Section 11.1:

Code \(\PageIndex{4}\) (Python):

import numpy as np

import pandas as pd

my_array = np.array(['red','blue','red','green','green', 'green','blue'])

print(pd.Series(my_array).value_counts())

| green 3

| red 2

| blue 2

| dtype: int64
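If you'd rather stay in NumPy, one alternative (not the approach used in this section) is np.unique() with return_counts=True. It returns the distinct values in sorted order, along with how many times each one appears. The names values and tallies below are just illustrative:

```python
import numpy as np

my_array = np.array(['red','blue','red','green','green', 'green','blue'])

# values holds the distinct entries (sorted alphabetically);
# tallies holds the matching counts, in the same order.
values, tallies = np.unique(my_array, return_counts=True)

for value, tally in zip(values, tallies):
    print(value, tally)
# blue 2
# green 3
# red 2
```

Unlike .value_counts(), this orders the results by value rather than by count, so the most common entry isn't necessarily first.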