16.2: Missing Values

Last updated
Save as PDF

Page ID: 39307

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

Let’s change the example to a different family, and a slightly bigger DataFrame. The “simpsons.csv” file is reproduced below. Do you notice anything odd about it?

Code \(\PageIndex{1}\) (Python):

name,species,age,gender,fave,IQ,hair,salary

Homer,human,36,M,beer,74,,52000

Marge,human,34,F,helping others,120,stacked tall,

Bart,human,10,M,skateboard,90,buzz,

Lisa,human,8,F,saxophone,200,curly,

Maggie,human,1,F,pacifier,100,curly,

SLH,dog,4,M,,,shaggy,

What I mean is the positioning of some of the commas. The sharpeyed reader will see a “double comma” in Homer’s row. Even a dull-eyed reader will notice several commas in a row in SLH’s2 row. And nearly every row (the exception being Homer’s) ends with a comma, which just looks messed up.

This weird punctuation implies the existence of missing values, which means just what it sounds like: there’s simply no data for certain columns of certain rows. Homer doesn’t have a “hair” value, no one but Homer has a “salary” value, and SLH is missing all kinds of stuff.

When we read this into a Pandas DataFrame a la:

Code \(\PageIndex{2}\) (Python):

simpsons = pd.read_csv("simpsons.csv").set_index('name')

the result looks like this:

| species age gender fave IQ hair salary

| name

| Homer human 36 M beer 74.0 NaN 52000.0

| Marge human 34 F helping others 120.0 stacked tall NaN

| Bart human 10 M skateboard 90.0 buzz NaN

| Lisa human 8 F saxophone 200.0 curly NaN

| Maggie human 1 F pacifier 100.0 curly NaN

| SLH dog 4 M NaN NaN shaggy NaN

The missing values come up as NaN’s, the same value you may remember from p. 114. The monker “not a number” makes sense for the salary case, although I think it’s a bit weird for Homer’s hair (not a number? is hair supposed to be a number?...) At any rate, we can expect that this will be the case for many real-world data sets.

“Missing” can mean quite a few subtly different things, actually. Maybe it means that the value for that object of study was collected, but lost. Maybe it means it was never collected at all. Maybe it means that variable doesn’t really make sense for that object, as in the case of a dog’s IQ. Ultimately, if we want to use the other values in that row, we’ll have to come to terms with what the missing values mean. For now, let’s just learn a couple of coarse ways of dealing with them.

One (sometimes) handy method is .dropna(). If you call it, it will return a modified copy of the DataFrame in which any row with an NaN is removed. This turns out to be overkill in the Simpson’s case, though:

Code \(\PageIndex{3}\) (Python):

print(simpsons.dropna())

| Empty DataFrame

| Columns:[species, age, gender, fave, IQ, hair, salary]

| Index: []

In other words, nothing’s left. (Every row had at least one NaN in it, so nothing survived.)

We could pass an optional argument to .dropna() called “how”, and set it equal to "all": in this case only rows with all NaN values are removed. Sometimes that’s “underkill,” as in our Simpson’s example: after all, none of the rows are entirely NaN’s, so calling .dropna(how="all") would leave everything intact.

Another option is the .fillna() method, which takes a “default value” argument: any NaN value is replaced with the default in the modified copy returned. Let’s try it with the string "none" as the default value:

Code \(\PageIndex{4}\) (Python):

print(simpsons.fillna("none"))

| species age gender fave IQ hair salary

| name

| Homer human 36 M beer 74.0 none 52000.0

| Marge human 34 F helping others 120.0 stacked tall none

| Bart human 10 M skateboard 90.0 buzz none

| Lisa human 8 F saxophone 200.0 curly none

| Maggie human 1 F pacifier 100.0 curly none

| SLH dog 4 M NaN NaN shaggy none

This is possibly useful, but in this case it’s not a perfect fit because different columns call for different defaults. The fave and hair columns could well have “none” (indicating no favorite thing, and no hair, respectively) but we might want the default salary to be 0. The way to accomplish that is to change the individual columns of the DataFrame. Here goes:

Code \(\PageIndex{5}\) (Python):

simpsons['salary'] = simpsons['salary'].fillna(0)

simpsons['IQ'] = simpsons['IQ'].fillna(100)

simpsons['hair'] = simpsons['hair'].fillna("none")

print(simpsons)

| species age gender fave IQ hair salary

| name

| Homer human 36 M beer 74.0 none 52000.0

| Marge human 34 F helping others 120.0 stacked tall none

| Bart human 10 M skateboard 90.0 buzz none

| Lisa human 8 F saxophone 200.0 curly none

| Maggie human 1 F pacifier 100.0 curly none

| SLH dog 4 M NaN 100.0 shaggy none

Here we’ve assumed that the default IQ, for someone who hasn’t taken the test, is 100 (the average). I left the NaN in fave as is, since that seemed appropriate.

By the way, that code is actually more than it may appear at first. When we execute a line like:

simpsons['salary'] = simpsons['salary'].fillna(0)

we’re really saying “please replace the salary column of the simpsons DataFrame with a new column. That new column should be – wait for it – the existing salary column but with zeros replacing the NaN’s.”

We’ll see many more cases of changing DataFrame columns wholesale in the following chapters.