26.2: Three Kinds of Examples


Now here’s the deal. There are three kinds of example rows we’ll be working with:

    1. training data – labeled examples which we will show to our classifier, and from which it will try to find useful patterns to make future predictions.

    2. test data – labeled examples which we will not show to our classifier, but which we will use to measure how well it performs.

    3. new data – unlabeled examples that we will get in the future, after we’ve deployed our classifier in the field, and which we will feed to our classifier to make predictions.

    The purpose of the first group is to give the classifier useful information so it can intelligently classify.

The purpose of the second group is to assess how good the classifier’s predictions are. Since the test set consists of labeled examples, we know the “true answer” for each one. So we can feed each of these test points to the classifier, look at its prediction, compare it to the true answer, and judge whether or not the classifier got it right. Assessing its accuracy is then usually just a matter of computing the percentage of test points it got right (a quick code sketch of this appears below).

    The third group exists because after we’ve built and evaluated our classifier, we actually want to put it into action! These are new data points (new sporting goods customers, say) for which we don’t know the “true answer” but want to predict it so we can send catalogs likely to be well-received.
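
To make that accuracy computation concrete, here’s a minimal sketch in code. (The predicted and actual lists below are made up for illustration; in practice, predicted would come from the classifier and actual would be the test set’s labels.)

# Hypothetical predictions from a classifier for five test points, alongside
# the true labels from the test set.
predicted = ["Bengals", "Steelers", "Steelers", "Bengals", "Browns"]
actual    = ["Bengals", "Steelers", "Browns",   "Bengals", "Browns"]

# Count the matches, and express them as a percentage of all test points.
num_correct = sum(1 for p, a in zip(predicted, actual) if p == a)
print("Accuracy: {}%".format(num_correct / len(actual) * 100))

Running this prints “Accuracy: 80.0%”, since the classifier got four of the five test points right.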

    Thou shalt not reuse

    Now one common question – which leads to a super important point – is this: why can’t we use all the labeled examples as training data? After all, if we have 1000 labeled examples we’ve had to work hard (or pay $$) to get, it seems silly to only use some of them to train our classifier. Shouldn’t we want to give it all the labeled data possible, so it can learn the maximum amount before predicting?

    The first reply is: “but then we wouldn’t have any test data, and so we wouldn’t know how good our classifier was before putting it out in the field.” Clearly, before we base major business decisions on the results of our automated predictor, we need to have some idea of how accurate its predictions are.

    It’s then often countered: “sure, but why not then re-use that data for testing? Instead of splitting the 1000 examples into training points and test points, why not just use all 1000 for training, and then test the classifier on all 1000 points? What’s not to like?”

    This is where the super important point comes in, and it’s so important that I’ll put it all in boldface. It turns out that you absolutely cannot test your classifier on data points that you gave it to train on, because you will get an overly optimistic estimate of how good your classifier actually is.

    Here’s an analogy to make this point more clear. Suppose there’s a final exam coming up in your class, and your professor distributes a “sample exam” a week before exam day for you to study from. This is a reasonable thing to do. As long as the questions on the sample exam are of the same type and difficulty as the ones that will appear on the actual final, you’ll learn lots about what the professor expects you to know from taking the sample exam. And you’ll probably increase your actual exam score, since this will help you master exactly the right material.

    But suppose the professor uses the exact same exam for both the sample exam and the actual final exam? Sure, the students would be ecstatic, but that’s not the point. The point is: in this case, students wouldn’t even have to learn the material. They could simply memorize the answers! And after they all get their A’s back, they might be tempted to think they’re really great at chemistry...but they probably aren’t. They’re probably just really great at memorizing and regurgitating.

Going from “the kinds of questions you may be asked” to “exactly the questions you will be asked” makes all the difference. And if you just studied the sample exam by memorization, and were then asked (surprise!) to demonstrate your understanding of the material on a new exam, you’d probably bomb it.

And so, the absolute iron-clad rule is this: any data that is given to the classifier to learn from must not be used to test it. The test data must consist of representative, but different, examples. It’s the only way to assess how well the classifier generalizes to new data that it hasn’t yet seen (which, of course, is the whole point).

    Splitting the difference

    Okay, so given that we have to split our precious labeled examples into two sets, one for training and one for testing, how much do we devote to each? It turns out that there are some sophisticated techniques (beyond the scope of this book, but stay tuned for Volume Two) in which we can cleverly re-use portions of the data for different purposes, and effectively make use of nearly all of it for training.

    But for our introductory approach here, we’ll just use a rule of thumb: 70% for training data, and the other 30% for test data.

As I mentioned earlier, we’ll normally shuffle the rows randomly before dividing them into these two groups, just in case there’s any pattern to the order in which they appear. For example, in our NFL fan data set, it might turn out that the data came to us sequenced such that people living on the east coast were at the beginning of the DataFrame and those living out west were at the end. Any arrangement like this would spell doom for our classification endeavor. For one thing, we wouldn’t be training on any west coast people, and so our classifier would be oblivious to what those data points looked like. For another thing, we’d only be using west coasters to test our classifier, meaning that whatever accuracy measure we computed would likely be way off. Randomizing the data is the sure way around this.
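
Conveniently, the .sample() method we’re about to meet can do this shuffling all by itself: asking it for a random 100% of the rows hands back the entire DataFrame in scrambled order. A quick sketch:

# Sampling a fraction of 1 (i.e., 100%) returns every row of fans,
# but in random order.
shuffled = fans.sample(frac=1)
print(shuffled)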

    Here’s some code to create training and test sets. The .sample() method of a DataFrame lets you choose some percentage of its rows randomly. Its frac argument is a number between 0 and 1 and specifies what fraction of the rows you want. Using the above rule of thumb, let’s choose 70% of them for our training data:

    Code \(\PageIndex{1}\) (Python):

# Randomly choose 70% of the rows of fans for the training set.
training = fans.sample(frac=.7)

    print(training)

    Notice that the numeric index values (far left) are in no particular order, since that’s the point of taking a random sample. Also notice that there are only 14 rows in this DataFrame instead of the full 20 that were in fans.

Now, we want our test set. The trick here is to say: “give me all the rows of fans that were not selected for the training set.” By building a query with the tilde operator (“~”, meaning “not”) in conjunction with the .isin() method, we can create a new DataFrame called test that has exactly these rows:

    Code \(\PageIndex{2}\) (Python):

# Keep only those rows of fans whose index does not appear in training's index.
test = fans[~fans.index.isin(training.index)]

    print(test)

    That code says, in English: “create a new variable test that contains only those rows of fans whose index is not present in any of the training DataFrame’s indices.” As you can verify through visual inspection, the result does have exactly the 6 rows that were missing from training.
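
For extra peace of mind, two quick checks will confirm that the split behaved: the training and test sets should have no rows in common, and together they should account for every row of fans:

# The two index sets should be disjoint...
print(training.index.intersection(test.index).empty)    # True

# ...and the row counts should add back up to all of fans.
print(len(training) + len(test) == len(fans))            # True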

