
28.2: Decision Tree Induction


    Okay, so now we understand what a decision tree is, and even how to code one up in Python. The key question that remains is: how do we figure out what tree to build?

    There are lots of different choices, even for our little videogame example. We could put any of the three features at the root. For each branch from the root, we could put either of the other features, or we could stop with a leaf. And the leaf could be a Yes leaf or a No leaf. That’s a lot of “coulds.” How can we know what a good tree might be – i.e., a tree that classifies new points more or less correctly? The answer, of course, is to take advantage of the training data. It consists of labeled examples that are supposed to be our guide. Using the training data to “learn” a good tree is called inducing a decision tree. Let’s see how.

    “Greedy” algorithms

    Our decision tree induction algorithm is going to be a greedy one. This means that instead of looking ahead and strategizing about future nodes far down on the tree, we’re just going to grab the immediate best-looking feature at every individual step and use that. This won’t by any means guarantee us the best possible tree, but it will be quick to learn one.

    An illustration to help you understand greedy algorithms is to think about a strategy game like chess. If you’ve ever played chess, you know that the only way to play well is to think ahead several moves, and anticipate your opponent’s probable responses. You can’t just look at the board naïvely and say, “why look at that: if I move my rook up four squares, I’ll capture my opponent’s pawn! Let’s do it!” Without considering the broader implications of your move, you’re likely to discover that as soon as you take her pawn, she turns around and takes your rook because she’s lured you into a trap.

    A greedy algorithm for chess would do exactly that, however. It would just grab whatever morsel was in front of it without considering the fuller consequences. That may seem really dumb – and it is, for chess – but for certain other problems it turns out to be a decent approach. And decision tree induction is one of those.

    The reason we resort to a greedy algorithm is that for any real-sized data set, the number of possible trees to consider is absolutely overwhelming. There’s simply not enough time left in the universe to look at them all – and that’s not an exaggeration. So you have to find some way of picking a tree without actually contemplating every one, and it turns out that grabbing the immediately best-looking feature at each level is a pretty good way to do that.

    Choosing “the immediate best” feature

    Now what does that mean, anyway: “choosing the immediate best feature?” We’re going to define it as follows: the best feature to put at any given node is the one which, if we did no further branching from that node but instead put only leaves below it, would classify the most training points correctly. Let’s see how this works for the videogame example.
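
    Before we dive into the example, here is a minimal sketch of that scoring rule in pandas (a hypothetical helper, not code from this book). For a candidate feature, each branch’s leaf would predict the majority label in its group, so the number of training points the one-level tree gets right is just the sum of the per-group majority counts:

    def one_level_score(df, feature, label='VG'):
        # Count each label within each value of the candidate feature.
        counts = df.groupby(feature)[label].value_counts()
        # Each branch's leaf predicts its group's majority label, so it gets the
        # majority count correct; summing over branches gives the total correct.
        return counts.groupby(level=0).max().sum()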

    Our left-most feature in the DataFrame is Major, so let’s consider that one first. Suppose we put Major at the root of the tree, and then made each of its branches lead to leaves. What value should we predict for each of the majors? Well, we can answer that with another clever use of .value_counts(), this time conjoining it with a call to .groupby(). Check out this primo line of code:

    Code \(\PageIndex{1}\) (Python):

    students.groupby('Major').VG.value_counts()

    Stare hard at that code. You’ll realize that all these pieces are things you already know: we’re just combining them in new ways. That line of code says “take the entire students DataFrame, but treat each of the majors as a separate group. And what should we do with each group? Well, we count up the values of the VG column for the rows in that group.” The result is as follows:
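
    Major  VG
    CPSC   No     4
           Yes    4
    MATH   No     3
           Yes    1
    PSYC   No     3
           Yes    2

    (The exact formatting, and the order of rows within each group, can vary a bit between pandas versions; the counts are the ones we’ll read off below.)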

    We can answer “how many would we get right?” by reading right off that chart. For the PSYC majors, there are two who play videogames and three who do not. Clearly, then, if we presented a Psychology major to this decision tree, it ought to predict ’No’, and that prediction would be correct for 3 out of the 5 Psychology majors on record. For the MATH majors, we would again predict ’No’, and we’d be correct 3 out of 4 times. Finally, for the CPSC majors, we have 4 Yeses and 4 Nos, so that’s not much help. We essentially have to pick randomly since the training data doesn’t guide us to one answer or the other. Let’s choose ‘Yes’ for our Computer Science answer, just so it’s different than the others. The best one-level decision tree that would result from putting Major at the top is therefore depicted in Figure 28.2.1. It gets ten out of the seventeen training points correct (59%). Your reaction is probably “Big whoop – we got that good a score just using the prior, and ignoring all the features!” Truth. Don’t lose hope, though: Major was only one of our three choices.

    Figure \(\PageIndex{1}\): A one-level decision tree if we put the Major feature at the root – it would classify ten of the seventeen training points correctly.
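
    As a quick sanity check with the hypothetical one_level_score() helper sketched earlier, that ten-out-of-seventeen figure is just the sum of the majority counts in the three groups:

    one_level_score(students, 'Major')    # 3 + 3 + 4 = 10 correct out of 17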

    Let’s repeat this analysis for the other two features and see if either one fares any better. Here’s the query for Age:

    Code \(\PageIndex{2}\) (Python):

    students.groupby('Age').VG.value_counts()

    This yields:

    Making the sensible predictions at the leaves based on these values gives the tree in Figure 28.2.2. It gets eleven points right (65%) – a bit of an improvement.

    Figure \(\PageIndex{2}\): A one-level decision tree if we chose the Age feature for the root – it would classify eleven of the seventeen training points correctly.

    Finally, we could put Gender at the root. Here’s the query for it:

    Code \(\PageIndex{3}\) (Python):

    students.groupby('Gender').VG.value_counts()

    Paydirt! Splitting on the Gender feature first, as shown in Figure 28.2.3, gets us a whopping fourteen points correct, or over 82%.

    This is clearly the winner of the three. And since we’re being greedy and not bothering to look further downstream anyway, we hereby elect to put Gender at the root of our tree.

    Figure \(\PageIndex{3}\): A one-level decision tree if we chose the Gender feature for the root. It would classify fourteen of the seventeen training points correctly – easily the best of the three choices.
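
    In code, the whole comparison is a short loop (again using the hypothetical one_level_score() helper from earlier; the counts it prints are the ones we just worked out by hand):

    for feature in ['Major', 'Age', 'Gender']:
        print(feature, one_level_score(students, feature))
    # Major 10, Age 11, Gender 14 -- Gender wins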

    It’s worth taking a moment to look at those .value_counts() outputs and see if you can develop some intuition about why Gender worked so much better at the root than the other two features did. The reason is that for this data set, Gender split the data into groups that were more homogeneous than the other splits gave. “Homogeneous” here means that each group was more “pure,” or put another way, more lopsided towards one of the labels. Gender gave us a 5-to-1 lopsided ratio on the M branch, and an even more lopsided 2-to-9 ratio on the F branch. Intuitively, this means that Gender really is correlated with videogame use, and this shows up in purer splits. Contrast this with the situation when we split on Major first, and we ended up with a yucky 4-to-4 ratio on the CPSC branch. An even split is the worst of all possible worlds: here, it means that learning someone’s a Computer Science major doesn’t tell you jack about their videogame use. That in turn means it’s pretty useless to split on.
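
    If you want to put a number on that intuition, one simple (hypothetical, not from this book) measure of each branch’s purity is the fraction of its points that carry the majority label:

    counts = students.groupby('Gender').VG.value_counts()
    counts.groupby(level=0).max() / counts.groupby(level=0).sum()
    # roughly .83 for the M branch (5 of 6) and .82 for the F branch (9 of 11),
    # versus a useless .50 for the CPSC branch of a Major split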

    Lather, Rinse, Repeat

    So far, we’ve done all that work just to figure out which feature to put at the root of our tree. Now, we progress down each of the branches and do the exact same thing: figure out what to put at each branch. We’ll continue on and on like this for the entire tree. It’s turtles all the way down.

    Let’s consider the left branch of Figure 28.2.3. What do we do with males? There are now only two remaining features to split on. (It wouldn’t make sense to split on Gender again, since the only people who will reach the left branch are males anyway: there’d be nothing to split on.)

    Thus we could put either Major or Age at that left branch. To figure out which one is better, we’ll do the same thing we did before, only with one slight change: now, we need to consider only males in our analysis.

    We augment our primo line of code from above with a query at the beginning, so that our counts include only males:

    Code \(\PageIndex{4}\) (Python):

    students[students.Gender=="M"].groupby('Major').VG.value_counts()

    Wow, cool: the CPSC and PSYC folks are perfectly homogeneous now. If we end up deciding to split on Major here, we can put permanent dark purple squares for each of those majors simply declaring “Yes.” In all, splitting here gives us 5 out of 6 correct. The tree-in-progress we’d end up with is in Figure 28.2.4.

    Figure \(\PageIndex{4}\): The tree-in-progress if we choose to split on Major in the male branch.

    Our other choice, of course, is to split on Age instead:

    Code \(\PageIndex{5}\) (Python):

    students[students.Gender=="M"].groupby('Age').VG.value_counts()

    Again, 5 out of 6 correct. Here, middle-aged students are the only heterogeneous group; the old folks and young-uns are clean splits. With this choice, our tree would appear as in Figure 28.2.5.

    Figure \(\PageIndex{5}\): On the other hand, the tree-in-progress if we choose to split on Age in the male branch instead.

    So at this point, since splitting on either feature and then stopping would give us exactly 5 out of 6 points correct, we just flip a coin. I just flipped one, and it came out tails (for Age) – hope that’s okay with you.
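
    The same scoring confirms the tie: restricted to the males, either remaining feature gets five of the six points right (a sketch, using the hypothetical helper from earlier):

    males = students[students.Gender == "M"]
    one_level_score(males, 'Major')    # 5
    one_level_score(males, 'Age')      # 5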

    Finishing Up the Left Branch

    The two “Yes” leaves in Figure 28.2.5 are now set in stone, since every single young male in our training set was indeed a videogamer, as was every old male. Now we just have to deal with the middle branch.

    Only one feature now remains to split on – Major – so we’ll do that, and produce the result in Figure 28.2.6. There’s exactly one middle-aged male MATH major in the original DataFrame (line 12, p. 272), and he’s labeled “No,” so we’ll guess “No” in the MATH branch. Similarly, we have one data point to guide us for CPSC majors (line 3), so we’ll predict “Yes” in this case. The PSYC branch presents a conundrum, though: our data set doesn’t have any middle-aged male Psychology majors, so how do we know what to guess in this case?

    Figure \(\PageIndex{6}\): Going one level further down after splitting on Age for males. We have data for middle-aged CPSC and MATH males...but what to do with middle-aged PSYC males?

    The best way to handle this is to fall back to a more general case where you do have examples. It’s true that we have no training points for middle-aged male Psychology majors, but we do have points for males in general, and we discovered that 5 out of 6 of them were gamers. So it makes sense to default to “Yes” in the PSYC branch of this part of the tree, even though we don’t have any data points exactly like that. So that’s what we’ll do. The finished left branch is depicted in Figure 28.2.7.

    Figure \(\PageIndex{7}\): The decision tree we’re in the process of inducing, with the left branch entirely completed.

    Finishing Up the Rest of the Tree

    The rest of the process is just the same stuff done over and over. At each branch of the tree, we take the subset of the training points that remain (i.e., the training points that match the path from the root thus far, and are therefore applicable), and decide what to branch on next. When we get to a completely homogeneous group, we stop and put a leaf there. The end result of all these efforts is the final decision tree for the videogame data set, in Figure 28.2.8, and its Python equivalent in Figure 28.2.9.
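
    To make “the same stuff over and over” concrete, here is a minimal sketch of the whole greedy, recursive procedure (a hypothetical implementation, not the book’s code from Figure 28.2.9). It represents the tree as nested dicts, stores each node’s majority label as a fallback for the no-data situation we hit in the PSYC branch, breaks ties between equally good features (like Major versus Age in the male branch) by list order, and breaks contradictions by whichever label .mode() happens to list first rather than by an actual coin flip:

    def induce_tree(df, features, label='VG'):
        # Completely homogeneous group: stop and put a leaf here (a label string).
        if df[label].nunique() == 1:
            return df[label].iloc[0]
        majority = df[label].mode().iloc[0]   # majority-rules; ties broken arbitrarily
        if not features:                      # nothing left to split on (contradictions)
            return majority
        # Greedy step: pick the feature whose one-level split would classify the
        # most of the remaining training points correctly.
        def score(f):
            counts = df.groupby(f)[label].value_counts()
            return counts.groupby(level=0).max().sum()
        best = max(features, key=score)
        rest = [f for f in features if f != best]
        return {
            'feature': best,
            'default': majority,              # fallback for values with no training data
            'branches': {value: induce_tree(group, rest, label)
                         for value, group in df.groupby(best)},
        }

    tree = induce_tree(students, ['Major', 'Age', 'Gender'])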

    One interesting aspect of our final tree is the female→PSYC→middle-aged branch. You’ll see that this leaf is labeled “Yes(?)” in the diagram. Why the question mark? Because this is the one case where we have a contradiction in our training data. Check out lines 2 and 16 back on p. 272. They each reflect a middle-aged female Psychology major, but with different labels: the first one is not a videogame player, but the second one is.

    I always thought the term “contradiction” was amusing here. Two similar people don’t have exactly the same hobbies – so what? Is that really so surprising? Do all middle-aged female Psychology majors have to be identical?

    Of course not. But you can also see things from the decision tree’s point of view. The only things it knows about people are those three attributes, and so as far as the decision tree is concerned, the people on lines 2 and 16 really are indistinguishable. When contradictions occur, we have no choice but to fall back on some sort of majority-rules strategy: if out of seven otherwise-identical people, two play videogames and five do not, we’d predict “No” in that branch. In the present case, we can’t even do that much, because we have exactly one of each. So I’ll just flip a coin again. (*flip*) It came up heads, so we’ll go with “Yes.”

    Notice that in this situation, the resulting tree will actually misclassify one or more training points. If we called our function in Figure 28.2.9 and passed it our person from line 2 ('PSYC', 'middle', 'F'), it would return "Yes" even though line 2 is not a gamer. Furthermore, contradictions are the only situation in which this will ever happen; if the data is contradiction-free, then every training point will be classified correctly by the decision tree.
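
    A matching classifier for that nested-dict sketch is just a walk down the tree (again hypothetical, not the function in Figure 28.2.9); the 'default' entry is what handles feature values the training data never covered, like the middle-aged male PSYC case:

    def classify(tree, point):
        # Leaves are plain label strings; internal nodes are dicts.
        while isinstance(tree, dict):
            value = point[tree['feature']]
            tree = tree['branches'].get(value, tree['default'])
        return tree

    classify(tree, {'Major': 'PSYC', 'Age': 'middle', 'Gender': 'F'})

    With this sketch, that last call comes out to whichever label .mode() picked for the contradictory branch, so it may be “Yes” or “No”; the book’s tree, thanks to the coin flip, answers “Yes.” Either way, on contradiction-free data a tree induced this way classifies every training point correctly.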

    Paradoxically, it turns out that’s not necessarily a good thing, as we’ll discover in Volume Two of this series. For now, though, we’ll simply declare victory.

    Figure \(\PageIndex{8}\): The final decision tree for the videogame data set.

    Figure \(\PageIndex{9}\): The final decision tree for the videogame data set, as a Python function.


    This page titled 28.2: Decision Tree Induction is shared under a not declared license and was authored, remixed, and/or curated by Stephen Davies (allthemath.org) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.