26.3: "The Prior"

    One more piece of lingo before we dive into a particular classification technique next chapter. It’s known as “the prior” of a data set.

    The term comes from something called Bayesian reasoning, which is a whole subject (and a super cool one!) in its own right. All you need to know here is the concept of two different quantities: the prior, and the posterior.

    In common usage, the word “prior” means “beforehand,” and so it does here: the prior is your best judgment about what the target value of a new example might be before you actually look at the feature values in that example. “Posterior,” on the other hand, means “afterwards”: it’s your best judgment about the target value after duly taking all the feature values into consideration.

    For example, you may have noticed that in my made-up data set, above, I had a lot of Ravens fans. This is because I live in the D.C. area and happen to know a lot of Ravens fans. Out of my 20 labeled examples, a whopping 12 had Ravens as their value in the team column.

    Thus, consider the following question. Suppose you knew nothing about a person except that they were one of Stephen’s friends. Which NFL team do you think they’d support? Assuming this data set is representative of Stephen’s friends, you’d say: “I’d predict they’d be a Ravens fan, and I’d estimate that I’d have about a 60% chance of being right (12/20).” This is the prior. You’re not taking into account anything about their age, where they were born, etc.; in fact, you weren’t even told those things. Instead, you’re just “using the prior” and treating everyone the same.
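
    To make this concrete, here’s a minimal sketch of “using the prior” in Python with pandas. The DataFrame and column names (friends, team, hometown) and the non-Ravens rows are made up for illustration; only the 12-out-of-20 Ravens count and the two Giants-fan New Yorkers come from the data set described above.

        import pandas as pd

        # A stand-in for the data set described above: 20 friends, 12 of
        # whom are Ravens fans, including two New Yorkers who both root
        # for the Giants. The remaining rows are invented filler.
        friends = pd.DataFrame({
            'team': ['Ravens'] * 12 + ['Giants'] * 2
                    + ['Cowboys'] * 3 + ['Eagles'] * 3,
            'hometown': ['Baltimore'] * 12 + ['New York'] * 2
                        + ['Dallas'] * 3 + ['Philadelphia'] * 3,
        })

        # "Using the prior": ignore every feature and just report the most
        # common target value, along with how often it occurs.
        counts = friends['team'].value_counts(normalize=True)
        prior_guess = counts.index[0]     # 'Ravens'
        prior_prob = counts.iloc[0]       # 0.6  (12 out of 20)
        print(prior_guess, prior_prob)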

    It would be a different story if I told you that this person was born in New York City. Then you might squint your eyes at my data set and realize that there are only two New Yorkers in it, and neither one is a Ravens fan: they’re both Giants fans! Now you might very well move away from your prior assumption. “Sure, most of Stephen’s friends are Ravens fans, so ‘Ravens’ is a reasonable guess; but now that you’ve told me they’re from New York, I’m changing my answer. My guess is ‘Giants’.”
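
    In pandas terms, that “squint at the data set” step is just conditioning on a feature value: filter the rows down to the New Yorkers and look at the team distribution among them. Continuing the sketch from above (same assumed names):

        # Condition on the hometown feature: among New Yorkers only,
        # what is the most common team, and how common is it?
        new_yorkers = friends[friends['hometown'] == 'New York']
        post_counts = new_yorkers['team'].value_counts(normalize=True)
        posterior_guess = post_counts.index[0]   # 'Giants'
        posterior_prob = post_counts.iloc[0]     # 1.0  (2 out of 2)
        print(posterior_guess, posterior_prob)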

    I keep saying “might” and “may” because different kinds of classifiers work in different ways. Some of them may choose to take advantage of some features but not others; some may just stick with the prior in certain situations. The notion of “the prior” is mainly useful as a baseline for comparison: it’s the best you can do when you have no other, potentially correlating, information to go on. The name of the game in classification, of course, is to use that other information intelligently to make more informed guesses, and to beat the prior. One of many ways to approach this is the decision tree classification algorithm, which we’ll look at in detail next.
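
    As a quick sanity check on why the prior makes a sensible baseline: a “classifier” that ignores the features entirely and always guesses the prior is right about 60% of the time on this data set, so any classifier that actually uses the features ought to do better than that. In the sketch above:

        # Baseline accuracy of always guessing the prior ('Ravens'):
        baseline_accuracy = (friends['team'] == prior_guess).mean()
        print(baseline_accuracy)    # 0.6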

