29.1: What “Doing Well” Means


    The most common (and simplest) way to measure a classifier’s performance is to count how many of the test points it correctly classifies, and divide by the total number of test points. This gives us the classification accuracy as a fraction between 0 and 1 (or, multiplying by 100, a “percentage accuracy” from 0% to 100%). We can do this because, as you’ll remember, our test data consists of labeled examples, just like our training data. Therefore, we know the “right answer” for each test point, and we can simply compare it to our classifier’s prediction.
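This accuracy computation can be sketched in a few lines of Python. (The labels and predictions below are made-up illustrative data, not from the text.)

```python
# Hypothetical example data: true labels vs. a classifier's predictions.
true_labels = ["gamer", "gamer", "non-gamer", "gamer", "non-gamer"]
predictions = ["gamer", "non-gamer", "non-gamer", "gamer", "non-gamer"]

# Count the test points that were classified correctly, then divide
# by the total number of test points.
correct = sum(1 for t, p in zip(true_labels, predictions) if t == p)
accuracy = correct / len(true_labels)
print(accuracy)   # 4 of 5 correct → 0.8
```

Multiplying `accuracy` by 100 gives the “percentage accuracy” (here, 80%).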

    Even though this is the most common approach, it’s worth taking a moment to consider alternatives. The key assumption of this accuracy measure is that all kinds of prediction errors are equal. In the videogame case, we’re saying that mistakenly labeling a videogamer as a non-videogamer is “just as bad” as mistakenly labeling a non-videogamer as a videogamer. And that might be just the right thing for our gaming company to do.

    But consider other settings. Suppose that our classifier’s inputs are features from an MRI image, and our prediction is “cancer” or “no cancer.” Now, it’s a much different story. Mistakenly predicting that a certain patient has cancer when they actually don’t might throw a needless scare into them. That’s bad. But it’s far worse in the other direction: mistakenly giving a clean bill of health to a patient who actually has early-stage cancer risks losing a life. In cases like this, we would need to penalize our classifier more harshly for false negatives than for false positives.
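One way to act on this asymmetry is to charge the classifier a heavier cost for false negatives than for false positives when scoring it. A minimal sketch, assuming a hypothetical 10-to-1 cost ratio (the specific weights are not from the text):

```python
# Assumed costs: a missed cancer (false negative) is penalized ten
# times as heavily as a needless scare (false positive).
FALSE_NEGATIVE_COST = 10   # hypothetical weight
FALSE_POSITIVE_COST = 1    # hypothetical weight

def error_cost(true_label, predicted_label):
    """Return the penalty for one prediction, given its true label."""
    if true_label == predicted_label:
        return 0
    if true_label == "cancer" and predicted_label == "no cancer":
        return FALSE_NEGATIVE_COST
    return FALSE_POSITIVE_COST

# One false negative plus one false positive: 10 + 1 = 11.
total = error_cost("cancer", "no cancer") + error_cost("no cancer", "cancer")
print(total)   # 11
```

Summing `error_cost` over all test points (instead of counting raw mistakes) gives a score that reflects the asymmetry.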

    It’s also a different story when the labels aren’t equally represented. Recall the NFL fan prediction problem from Figure 26.1 (p. 262). Suppose we performed fan prediction in a city like Dallas, which consists of (say) 99% Cowboys fans and only 1% Ravens fans. If we were to penalize a classifier equally for mistaken-Cowboys-predictions and mistaken-Ravens-predictions, a one-line classifier could earn a pretty good score:

    Code \(\PageIndex{1}\) (Python):

    def predict(age, hometown, current_residence, yrs_in_residence):
        return "Cowboys"

    It’s not even worth trying hard to ferret out the few Ravens fans if we’re going to be docked a full point every time we dare to predict one. They’re just too rare. The only way to get a classifier to be bold and try to identify the tiny population of Ravens fans is to penalize it more heavily for missing them than for falsely identifying them.
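To see the numbers concretely: with the assumed 99/1 split, the one-line always-“Cowboys” classifier earns 99% plain accuracy, yet a scoring scheme that penalizes each missed Ravens fan more heavily exposes it. (The 50× weight below is an arbitrary assumption for illustration.)

```python
# Assumed population: 99 Cowboys fans, 1 Ravens fan.
n_cowboys, n_ravens = 99, 1
RAVENS_MISS_PENALTY = 50   # hypothetical weight, not from the text

# Plain accuracy of the always-"Cowboys" classifier: it gets every
# Cowboys fan right and every Ravens fan wrong.
plain_accuracy = n_cowboys / (n_cowboys + n_ravens)
print(plain_accuracy)      # 0.99

# Weighted penalty: every Ravens fan is predicted "Cowboys", so each
# one costs RAVENS_MISS_PENALTY; Cowboys fans cost nothing.
weighted_penalty = n_ravens * RAVENS_MISS_PENALTY
print(weighted_penalty)    # 50
```

Under the weighted scheme, predicting “Cowboys” for everyone is no longer a safe bet, which is exactly what pushes the classifier to hunt for the rare Ravens fans.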

    Anyway, for the rest of this chapter, we’ll use the vanilla “count all prediction mistakes equally” approach, but it’s worth remembering that this doesn’t make sense in all situations.


    This page titled 29.1: What “Doing Well” Means is shared under a not declared license and was authored, remixed, and/or curated by Stephen Davies (allthemath.org) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.