16.1: Reading a DataFrame from a .csv file

Unlike NumPy arrays and Pandas Serieses, which we learned several different ways to create, we’re only going to learn one way to create a DataFrame. That’s because DataFrames are normally big enough that it’s just too tedious to ever type them in manually. Instead, we’ll read them from an external source: a .csv file.

We’ll actually use the same read_csv() function that we used in section 11.1 (p. 107), although oddly, this time we won’t need to specify as many arguments. Let’s say we have a “davieses.csv” file with these contents:

Code \(\PageIndex{1}\) (Python):

person,age,gender,height,instrument
Dad,50,M,73,piano
Mom,49,F,66,flute
Lizzy,21,F,63,guitar
TJ,20,M,71,trombone
Johnny,17,M,72,euphonium

We can read it into a DataFrame with this code:

Code \(\PageIndex{2}\) (Python):

import pandas as pd

my_first_df = pd.read_csv("davieses.csv").set_index('person')
print(my_first_df)

|         age gender  height instrument
| person
| Dad      50      M      73      piano
| Mom      49      F      66      flute
| Lizzy    21      F      63     guitar
| TJ       20      M      71   trombone
| Johnny   17      M      72  euphonium

A couple of things. First, you may have noticed that the davieses.csv file had a “header” row. This means that the first line of the file is not like the others: instead of containing information on a specific family member, it names the kinds of information recorded for every family member. It looked like this:

person,age,gender,height,instrument

and you’ll notice that these words (except for the first one; more on that in a moment) became the column names when we imported the data. This sort of information, by the way, is called “metadata,” a geeky-sounding word that basically means “data about data.” If “Lizzy plays the guitar” is a piece of data, then “family members play instruments” is a piece of metadata.
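To see the header row turn into column names, here is a minimal sketch. It assumes the same five rows as davieses.csv, but holds them in a string (the names csv_text and df are just for illustration) so that it runs without a file on disk. Since it skips .set_index(), all five header words show up as column names, person included.

```python
import io
import pandas as pd

# Same contents as davieses.csv, kept in a string so no file is needed.
csv_text = """person,age,gender,height,instrument
Dad,50,M,73,piano
Mom,49,F,66,flute
Lizzy,21,F,63,guitar
TJ,20,M,71,trombone
Johnny,17,M,72,euphonium
"""

# read_csv() accepts a file-like object as well as a filename.
df = pd.read_csv(io.StringIO(csv_text))

# The words from the header row became the column names...
column_names = list(df.columns)
print(column_names)

# ...and the header row itself is not counted as a row of data.
print(len(df))
```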

Second, don’t miss the ending I tacked on to the read_csv() line, where I called the .set_index() method on the DataFrame. This tells Pandas that one of the columns in the DataFrame should be designated as the index (or the keys).

Back on p. 57 I asserted that unlike associative arrays, tables didn’t have keys. And that’s true of the general “table” concept. But Pandas designed their DataFrames to behave in the same way as their Serieses: one uniquely-valued column will be used to identify each row.

This choice is usually easy; if you glance back to Figure 7.3, we’d probably want to choose the screenname as the index (although a case could be made for the real name column instead). For the table in Figure 7.4, it would be the item column. In the DataFrame we just created above, obviously person is the correct choice – it’s the only one sure to be unique.¹
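If you'd rather not eyeball it, one possible way to check which columns are uniquely-valued is the .is_unique attribute of a Series. The sketch below is only an illustration, again using the file's contents as a string (csv_text and unique_by_column are made-up names). Keep in mind that it only reports on the rows we happen to have: age and height pass the check for these five people, but nothing guarantees they'd stay unique if the family grew.

```python
import io
import pandas as pd

# Same contents as davieses.csv, kept in a string so no file is needed.
csv_text = """person,age,gender,height,instrument
Dad,50,M,73,piano
Mom,49,F,66,flute
Lizzy,21,F,63,guitar
TJ,20,M,71,trombone
Johnny,17,M,72,euphonium
"""

df = pd.read_csv(io.StringIO(csv_text))

# For each column, ask whether all of its values are different.
unique_by_column = {name: bool(df[name].is_unique) for name in df.columns}

for name, is_unique in unique_by_column.items():
    print(name, is_unique)
```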

Anyway, designating a column as the index in this way sort of removes it from the other, “ordinary” columns. In the output above, you may notice that the word “person” is printed somewhat lower than the other column names are. It turns out that if we want to talk about the index column specifically, we’ll need to use a slightly different technique than we do for the other columns. More on that next chapter.
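Here is a small sketch of what "removed from the ordinary columns" looks like in practice. It assumes the same data as davieses.csv, held in a string for convenience (csv_text is an illustrative name), and simply inspects the DataFrame's .columns and .index after .set_index() has been called.

```python
import io
import pandas as pd

# Same contents as davieses.csv, kept in a string so no file is needed.
csv_text = """person,age,gender,height,instrument
Dad,50,M,73,piano
Mom,49,F,66,flute
Lizzy,21,F,63,guitar
TJ,20,M,71,trombone
Johnny,17,M,72,euphonium
"""

my_first_df = pd.read_csv(io.StringIO(csv_text)).set_index('person')

# "person" is no longer among the ordinary columns...
print(list(my_first_df.columns))
print('person' in my_first_df.columns)   # False

# ...because it now lives in the index instead.
print(my_first_df.index.name)            # person
print(list(my_first_df.index))
```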

Finally, note that calling .set_index() is optional. It’s perfectly fine to just call pd.read_csv() and leave it at that. In that case, Pandas will use integers (starting with 0, of course) as the index/keys.
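For comparison, this sketch reads the same data without calling .set_index(). As before, the file's contents are held in a string and the names csv_text and plain_df are only for illustration.

```python
import io
import pandas as pd

# Same contents as davieses.csv, kept in a string so no file is needed.
csv_text = """person,age,gender,height,instrument
Dad,50,M,73,piano
Mom,49,F,66,flute
Lizzy,21,F,63,guitar
TJ,20,M,71,trombone
Johnny,17,M,72,euphonium
"""

# No .set_index() this time.
plain_df = pd.read_csv(io.StringIO(csv_text))

# The index is just the integers 0 through 4...
print(list(plain_df.index))

# ...and "person" stays behind as an ordinary column.
print('person' in plain_df.columns)      # True
```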