13.1: Queries

Last updated
Save as PDF

Page ID: 39285

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

One of the most powerful things we’ll do with a data set is to query it. This means that instead of specifying (say) a particular key, or something like “the minimum” or “the maximum,” we provide our own custom criteria and ask Pandas to give us all values that match. This kind of operation is also sometimes called filtering, because we’re taking a long list of items and sifting out only the ones we want.

The syntax is interesting: you still use the boxies (like you do when giving a specific key) but inside the boxies you put a condition that will be used to select elements. It’s best seen with an example. Re-using the understanding variable from above, we can query it and ask for all the elements greater than 5:

Code \(\PageIndex{1}\) (Python):

more_than_five = understanding[understanding > 5]

print(more_than_five)

| 4 15

| 2 13

| 9 7

| dtype: int64

The new thing here is the “understanding > 5” thing inside the boxies. The result of this query is itself a Series, but one in which everything that doesn’t match the condition is filtered out. Thus we only have three elements instead of five. Notice the keys didn’t change, and they also had nothing to do with the query: our query was about values.

We could change this, if we were interested in putting a restriction on keys instead, using the .index syntax:

Code \(\PageIndex{2}\) (Python):

index_more_than_five = understanding[understanding.index > 5]

print(index_more_than_five)

| 10 4

| 12 3

| 9 7

| dtype: int64

See how tacking on “.index” in the query made all the difference.

Query Operators

Now I have a surprise for you. It makes perfect sense to use the character “>” (called “greater-than,” ”right-angle-bracket,” or simply ”wakka”) to mean “greater than.” And the character “<” makes sense as “less than.” Unfortunately, the others don’t make quite as much sense. See the top table in Figure 13.1.1.

“Greater/less than or equal to” isn’t hard to remember, and it’s a good thing Python doesn’t require symbols like “≤” or “≥” since those are hard to find on your keyboard. You just type both symbols back-to-back, with no space. More problematic are the last two entries in the top table. The “!=” operator (pronounced “bangequals”) is used as a stand-in for “≠” which also isn’t keyboard friendly. And that one doesn’t have a good mnemonic; you just have to memorize it.

By far the most error-prone of this set is the “==” (double-equals) operator, which simply means “equals.” Yes, you do have to use double-equals instead of single-equals in your queries, and yes it matters. As additional incentive, let me inform you that if you use single-equals when you needed to use double-equals, it will seem to work at first, but you will silently get the wrong answer.

Memorize this fact! Failing to use double-equals is quite possibly the single most common programming error for beginners.

Figure \(\PageIndex{1}\): Query operators: simple and compound

Here are some more examples to test your understanding. Make sure you understand why each output is what it is.

Code \(\PageIndex{3}\) (Python):

understanding = pd.Series([15,4,13,3,7], index=[4,10,2,12,9])

print(understanding[understanding <= 7])

| 10 4

| 12 3

| 9 7

| dtype: int64

Code \(\PageIndex{4}\) (Python):

print(understanding[understanding != 13])

| 4 15

| 10 4

| 12 3

| 9 7

| dtype: int64

Code \(\PageIndex{5}\) (Python):

print(understanding[understanding == 3])

| 12 3

| dtype: int64

Code \(\PageIndex{6}\) (Python):

print(understanding[understanding.index >= 9])

| 10 4

| 12 3

| 9 7

| dtype: int64

Compound Queries

Often, your query will involve more than one criterion. This is called a compound condition. It’s not as common with Serieses as it will be with DataFrames in a couple chapters, but there are still uses for it here.

Suppose I want all the key/value pairs of understanding where the value is between 5 and 14. This is really two conditions masquerading as one: we want all pairs where (1) the value is greater than 5, and also (2) the value is less than 14. I put the word “and” in boldface in the previous sentence because that’s the operator called for here. We only want elements in our results where both things are true, and therefore, we “and together the two conditions.” (“And” is being used as a verb here.)

The way to achieve this is as follows. The syntax is nutty, so pay close attention:

Code \(\PageIndex{7}\) (Python):

x = understanding[(understanding > 5) & (understanding < 14)]

print(x)

| 2 13

| 9 7

| dtype: int64

First, notice that we put each of our two conditions in bananas here. This is not optional, as it turns out: you’ll get a non-obvious error message if you omit them. Second, see how we combined the two with the “&” operator from the bottom half of Figure 13.1.1. The result, then, was only the elements that satisfied both conditions.

It can be tricky to figure out whether you want an and or an or. Unfortunately they don’t always correspond to their colloquial English usage. Let’s see what happens if we switch the “&” symbol to a “|” (pronounced “pipe”):

Code \(\PageIndex{8}\) (Python):

y = understanding[(understanding > 5) | (understanding < 14)]

print(y)

| 4 15

| 10 4

| 2 13

| 12 3

| 9 7

| dtype: int64

You can see that we got everything back. That’s because or means “only give me the elements where either one of the conditions, or both, are true.” In this case, this is guaranteed to match everything, because if you think about it, every number is either greater than five, or less than fourteen, or both. (Think deeply.)

Even though in this example it didn’t do anything exciting, an “or” does sometimes return a useful result. Consider this example:

Code \(\PageIndex{9}\) (Python):

z = understanding[(understanding.index > 10) | (understanding > 5)]

print(z)

| 4 15

| 2 13

| 12 3

| dtype: int64

Here we’re asking for all key/value pairs in which either the key is greater than ten, or the value is greater than ten, or both. This reeled in exactly three fish as shown above. If we changed this “|” to an “&”, we’d have caught no fish. (Take a moment to convince yourself of that.)

The last entry in Figure 13.1.1 is the “~” sign, which is pronounced “tilde,” “twiddle,” or “squiggle.” It corresponds to the English word not, although in an unusual place in the sentence. Here’s an example:

Code \(\PageIndex{10}\) (Python):

a = understanding[~(understanding.index > 10) | (understanding > 10)]

print(a)

| 4 15

| 10 4

| 2 13

| 9 7

| dtype: int64

Search for and stare at the squiggle in that line of code. In English, what we said was “give me elements where either the key is not greater than ten, or the value is greater than ten, or both.” The four matching elements are shown above.

Changing the “or” back to an “and” here gives us this output instead:

Code \(\PageIndex{11}\) (Python):

b = understanding[~(understanding.index > 10) & (understanding > 10)]

print(b)

| 4 15

| 2 13

| dtype: int64

These are the only two rows where both conditions are true (and remember that the first one is “not-ted.”)

It can be tricky to get compound queries right. As with most things, it just takes some practice.

Queries on strings

So far our examples have involved only numbers. Pandas also lets us perform queries on text data, specifying constraints on such things as the length of strings, letters in certain positions, and case (upper/lower).

Let’s return to the Marvel-themed series from section 11.1:

Code \(\PageIndex{12}\) (Python):

alter_egos = pd.Series(['Hulk','Spidey','Iron Man','Thor'], index=['Bruce','Peter','Tony','Thor'])

By appending “.str” to the end of the variable name, we can get access to most of the string-based methods we’d like to use. For instance, find all the values with exactly four letters:

Code \(\PageIndex{13}\) (Python):

four_letter_names = alter_egos[alter_egos.str.len() == 4]

print(four_letter_names)

| Bruce Hulk

| Thor Thor

| dtype: object

or all the values that contain a space:

Code \(\PageIndex{14}\) (Python):

spaced_out = alter_egos[alter_egos.str.contains(' ')]

print(spaced_out)

| Tony Iron Man

| dtype: object

or all the keys whose first character is a T:

Code \(\PageIndex{15}\) (Python):

to_a_tee = alter_egos[alter_egos.index.str.startswith('T')]

print(to_a_tee)

| Tony Iron Man

| Thor Thor

| dtype: object

or all entries where either the value is greater than five letters long or the key is the same as the value:

Code \(\PageIndex{16}\) (Python):

huh = alter_egos[(alter_egos.str.len() > 5) | (alter_egos.index == alter_egos)]

print(huh)

| Peter Spidey

| Thor Thor

| dtype: object

The possibilities are endless. Some of the more common functions are summarized in Figure 13.1.2.

Figure \(\PageIndex{2}\): Common query methods for string data.

Last word

A couple things before we move on. You’ve noticed that in all the above examples, it was necessary to type the Series variable name several times:

Code \(\PageIndex{17}\) (Python):

understanding[(understanding < 12) | (understanding > 18)]

alter_egos[(alter_egos.str.isupper()) & (alter_egos.str.len() < 10)]

There’s really no way around that, sorry; you just have to get used to it. A very common beginner error is to try and write this:

Code \(\PageIndex{18}\) (Python):

understanding[understanding < 12 | > 18]

This seems to make perfect sense, especially since it mimics the natural English sentence: “give me all values where understanding is less than 12 or greater than 18.” Unfortunately, it doesn’t work like that in Python. The rule is: each side of an and or an or must be a complete sentence. The phrase “understanding is greater than 18” counts as a complete sentence, but “is greater than 18” does not.

Also, whenever I see a line of code that specifies a key to a Series (or array), I mentally pronounce the opening boxie (“[”) as the word “of”. So when I read:

Code \(\PageIndex{19}\) (Python):

print(x[5])

I say to myself “print x of five.”

However, whenever I see a query, I mentally pronounce the boxie as the word “where”. So when I read:

Code \(\PageIndex{20}\) (Python):

print(x[x > 12])

I say to myself “print x where x is greater than 12.” I’ve found this helpful in making sense of the meaning of queries, since they’re complicated enough as it is!