27.1: A Working Example


Here’s a (fictitious) domain problem that we’ll use to demonstrate the principles in this chapter and the next. Say we own a videogame business, and we want to send full-color product catalogs to unsuspecting college students, so that they will buy our games and keep us in business (while meanwhile failing out of school due to playing games all the time).

Now full-color catalogs are expensive to print and ship, so we want to be smart about this. We definitely don’t want to send a bunch of catalogs to students who aren’t likely buyers; that would run our business into the ground. Instead, we’d like to identify the subset of students who are probably gamers, and send catalogs to only those students.

Suppose that through nefarious means, we have acquired the following data set:

[Figure: the students data set as a table, one row per student, with columns for major, age, gender, and the videogamer target]

Each row represents one college student, with three features. The first is their major – PSYC (Psychology), MATH (Mathematics), or CPSC (Computer Science). (For simplicity, we’ll say these are the only three possibilities, since your author happens to like them the best.) The second is their age (numeric), and the third is their gender: male, female, or other. The last column is our target: whether or not this student is a videogamer. Glance over this DataFrame for a moment.

Eyeing the prior

As you remember from section 26.3, before we even think about features, we might take a minute to just look at the target variable itself. We ask ourselves “given no other information about a student, what would be our gut feel about their videogame status?” Our pal the .value_counts() method is perfect for computing this:

Code 27.1.1 (Python):

# Count how many students fall into each class of the target column VG
print(students.VG.value_counts())

Output:

[Output: the number of students in each videogamer class; the majority class, “no”, accounts for 10 of the 17 students]

So if we’re smart, we’d guess “no” for such mysterious persons, but we could only expect to be right 10/17, or about 59%, of the time. Not great, although better than a coin flip.
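To make the arithmetic concrete, here is a minimal sketch of computing that baseline with pandas. It does not use the book's actual DataFrame: the Series below is a hypothetical stand-in built only to match the 10-versus-7 split, and the label strings "No" and "Yes" are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the VG column: 10 non-gamers and 7 gamers,
# matching the 10/17 split quoted above. The "No"/"Yes" labels are assumed.
vg = pd.Series(["No"] * 10 + ["Yes"] * 7, name="VG")

# normalize=True returns proportions instead of raw counts.
prior = vg.value_counts(normalize=True)

majority_class = prior.idxmax()   # the most common label
baseline_accuracy = prior.max()   # fraction we get right by always guessing it

print(majority_class, round(baseline_accuracy, 2))
```

Any classifier we build later should beat this always-guess-the-majority baseline to be worth the trouble.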

Sticking with categorical features

Now it turns out that decision trees work best with all categorical features, not a mix of categorical and numeric. So for now, we’re going to simply classify each of our students into three buckets: “young” (18 or younger), “middle” (19-21), and “old” (22+). For the moment, don’t ask why we chose three age categories instead of two or four, and don’t ask why we chose those particular split points. We just did. More on that later. Our training data now looks like this:

[Figure: the same data set with each student’s numeric age replaced by one of the categories “young”, “middle”, or “old”]

and we’re now officially ready to consider decision trees.
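The age bucketing described above can be sketched with pandas' cut() function. This is one possible way to do it, not necessarily how the book's table was produced, and the ages below are hypothetical values chosen to sit on either side of each split point.

```python
import pandas as pd

# Hypothetical ages, for illustration only (not the book's data set).
ages = pd.Series([17, 18, 19, 21, 22, 30], name="age")

# Bin edges are right-inclusive by default: (-inf, 18], (18, 21], (21, inf).
# For whole-number ages that gives 18 or younger, 19-21, and 22+.
age_group = pd.cut(
    ages,
    bins=[float("-inf"), 18, 21, float("inf")],
    labels=["young", "middle", "old"],
)

print(age_group.tolist())
```

Because the bins are closed on the right, an 18-year-old lands in “young” and a 21-year-old in “middle”, which matches the split points given in the text.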