
15.1: Categorical Data- Counts of Occurrences


    Let’s say you had access to a poll on people’s favorite pop stars. You import this into a big ol’ Pandas Series called faves:

    Code \(\PageIndex{1}\) (Python):

    print(faves)

    | 0 Katy Perry

    | 1 Rihanna

    | 2 Justin Bieber

    | 3 Drake

    | 4 Rihanna

    | 5 Taylor Swift

    | 6 Adele

    | 7 Adele

    | 8 Taylor Swift

    | 9 Justin Bieber

    | ...

    | 1395 Katy Perry

    | dtype: object

    That’s great, but it’s also kinda TMI. You probably don’t care who the first person’s idol is, nor the fifteenth, nor the last. Much more interesting is simply how many times each value appears in the Series. This information is available from the Pandas .value_counts() method:

    Code \(\PageIndex{2}\) (Python):

    counts = faves.value_counts()

    print(counts)

    | Taylor Swift 388

    | Katy Perry 265

    | Drake 261

    | Adele 212

    | Rihanna 136

    | Justin Bieber 134

    | dtype: int64

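    The .value_counts() method returns another Series, but the values of the original Series become the keys (the index) of the new one, and the number of times each appeared becomes the corresponding value. By default the result is sorted from most frequent to least frequent, so it tells us at a glance how popular each answer is relative to the others.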
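As a minimal sketch of what "the values become the keys" means in practice, the snippet below builds a tiny stand-in for faves from just the ten rows printed earlier (an assumption for illustration, not the full poll) and then looks up one pop star's count by name.

```python
import pandas as pd

# Stand-in data: only the first ten answers shown above, not the full poll.
faves = pd.Series(['Katy Perry', 'Rihanna', 'Justin Bieber', 'Drake', 'Rihanna',
                   'Taylor Swift', 'Adele', 'Adele', 'Taylor Swift', 'Justin Bieber'])

counts = faves.value_counts()

# The distinct answers are now the index (the "keys") of the new Series.
print(counts.index.tolist())

# So a single count can be looked up by key, much like a dictionary.
print(counts['Adele'])    # 2 in this ten-row sample
```

Because the result is an ordinary Series, everything you already know about indexing a Series by key applies to it.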

    To get percentages instead of totals, just divide by the total and multiply by 100, of course:

    Code \(\PageIndex{3}\) (Python):

    print(counts / len(faves) * 100)

    | Taylor Swift 27.7937

    | Katy Perry 18.9828

    | Drake 18.6963

    | Adele 15.1862

    | Rihanna 9.7421

    | Justin Bieber 9.5989

    | dtype: float64
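Pandas can also do the division for you: .value_counts() accepts a normalize=True argument that returns fractions of the total instead of raw counts. Here is a small sketch, again using a ten-row stand-in for faves (an assumption for illustration, not the real poll data).

```python
import pandas as pd

# Stand-in data: only the first ten answers shown above, not the full poll.
faves = pd.Series(['Katy Perry', 'Rihanna', 'Justin Bieber', 'Drake', 'Rihanna',
                   'Taylor Swift', 'Adele', 'Adele', 'Taylor Swift', 'Justin Bieber'])

# normalize=True gives each answer's share of the total (values between 0 and 1).
fractions = faves.value_counts(normalize=True)

# Multiply by 100 to turn the fractions into percentages.
percentages = fractions * 100
print(percentages)
```

Either way, the percentages should add up to 100, which makes for a quick sanity check.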

    Recall that the mode is the only measure of central tendency that makes sense for categorical data. To find it, all you have to do is call .value_counts() and look at the top result (in this case, Taylor Swift).
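If you would rather have code pick out the mode than eyeball it, here are a few ways it might be done. This is only a sketch on a small made-up Series (the data below is hypothetical); the calls used are the standard .idxmax() and .mode() Series methods.

```python
import pandas as pd

# Hypothetical mini-poll, just for illustration.
faves = pd.Series(['Taylor Swift', 'Adele', 'Taylor Swift',
                   'Drake', 'Taylor Swift', 'Adele'])

counts = faves.value_counts()

# value_counts() sorts from most to least frequent, so the first key is the mode.
top = counts.index[0]
print(top)

# .idxmax() returns the key whose count is largest.
print(counts.idxmax())

# .mode() works on the original Series and returns every value tied for most frequent.
print(faves.mode().tolist())
```

The .mode() version is worth knowing about because it reports all the winners when there is a tie, whereas "the top result" shows only one of them.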

    Note that .value_counts() is a Pandas Series method, not a NumPy method. If you find yourself with a NumPy array instead, you can just wrap it in a Series as we did in Section 11.1:

    Code \(\PageIndex{4}\) (Python):

    import numpy as np

    import pandas as pd

    my_array = np.array(['red', 'blue', 'red', 'green', 'green', 'green', 'blue'])

    print(pd.Series(my_array).value_counts())

    | green 3

    | red 2

    | blue 2

    | dtype: int64
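For comparison, if you want to stay in NumPy, one alternative (not the approach used in this section) is np.unique() with return_counts=True, which hands back the distinct values and their counts as two parallel arrays. A minimal sketch:

```python
import numpy as np

my_array = np.array(['red', 'blue', 'red', 'green', 'green', 'green', 'blue'])

# np.unique returns the distinct values in sorted order; with return_counts=True
# it also returns how many times each one appears.
values, tallies = np.unique(my_array, return_counts=True)

for value, tally in zip(values, tallies):
    print(value, tally)
```

Note that the results come back sorted alphabetically by value rather than by frequency, so wrapping the array in a Series is still the more convenient route when you want the most common answer on top.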


    This page titled 15.1: Categorical Data- Counts of Occurrences is shared under a CC BY-SA 4.0 license and was authored, remixed, and/or curated by Stephen Davies (allthemath.org) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.