Engineering LibreTexts

16.1: Reading a DataFrame from a .csv file


    Unlike NumPy arrays and Pandas Serieses, which we learned several different ways to create, we’re only going to learn one way to create a DataFrame. That’s because DataFrames are normally big enough that it’s just too tedious to type them in manually. Instead, we’ll read them from an external source: a .csv file.

    We’ll actually use the same read_csv() function that we used in section 11.1 (p. 107), although oddly, this time we won’t need to specify as many arguments. Let’s say we have a “davieses.csv” file with these contents:

    Code \(\PageIndex{1}\) (Python):

    person,age,gender,height,instrument
    Dad,50,M,73,piano
    Mom,49,F,66,flute
    Lizzy,21,F,63,guitar
    TJ,20,M,71,trombone
    Johnny,17,M,72,euphonium

    We can read it into a DataFrame with this code:

    Code \(\PageIndex{2}\) (Python):

    import pandas as pd

    # Read the file, then designate the 'person' column as the index.
    my_first_df = pd.read_csv("davieses.csv").set_index('person')
    print(my_first_df)

    |         age  gender  height  instrument
    | person
    | Dad      50       M      73       piano
    | Mom      49       F      66       flute
    | Lizzy    21       F      63      guitar
    | TJ       20       M      71    trombone
    | Johnny   17       M      72   euphonium

    A couple of things. First, you may have noticed that the davieses.csv file had a “header” row. This means that the first line of the file is not like the others: instead of containing information on a specific family member, it names the kinds of information recorded for every family member. It looked like this:

    person,age,gender,height,instrument

    and you’ll notice that these words (except for the first one; more on that in a moment) became the column names when we imported the data. This sort of information, by the way, is called “metadata,” a geeky-sounding word that basically means “data about data.” If “Lizzy plays the guitar” is a piece of data, then “family members play instruments” is a piece of metadata.
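    To see the header row turning into column names, here is a minimal sketch. It assumes the same contents as davieses.csv but holds them in a string (the names csv_text and raw are made up for this illustration) so that it runs without the file on disk. Note that it skips .set_index(), so all five header words show up as columns.

```python
import io
import pandas as pd

# Same contents as davieses.csv, kept in a string so no file is needed.
csv_text = """person,age,gender,height,instrument
Dad,50,M,73,piano
Mom,49,F,66,flute
Lizzy,21,F,63,guitar
TJ,20,M,71,trombone
Johnny,17,M,72,euphonium
"""

# io.StringIO lets read_csv() treat the string as if it were a file.
raw = pd.read_csv(io.StringIO(csv_text))

# The header row supplied the column names; it is not counted as a data row.
print(list(raw.columns))
print(len(raw))
```

    The file has six lines, but the DataFrame has only five rows, since the header line was consumed as metadata.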

    Second, don’t miss the ending I tacked on to the read_csv() line, where I called the .set_index() method on the DataFrame. This tells Pandas that one of the columns in the DataFrame should be designated as the index (or the keys).
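    The tacked-on call can also be written as a separate step, which may make it easier to see what it does. The sketch below is only an illustration: the variable names (family, indexed, chained) are made up, and the data is read from a string rather than from davieses.csv so the example is self-contained.

```python
import io
import pandas as pd

csv_text = """person,age,gender,height,instrument
Dad,50,M,73,piano
Mom,49,F,66,flute
Lizzy,21,F,63,guitar
TJ,20,M,71,trombone
Johnny,17,M,72,euphonium
"""

# Step 1: read the data. 'person' is still an ordinary column here.
family = pd.read_csv(io.StringIO(csv_text))

# Step 2: set_index() returns a NEW DataFrame; 'family' itself is unchanged.
indexed = family.set_index('person')

# The one-line version from the text gives the same result.
chained = pd.read_csv(io.StringIO(csv_text)).set_index('person')

print('person' in family.columns)
print('person' in indexed.columns)
print(indexed.equals(chained))
```

    Because .set_index() hands back a new DataFrame by default, the result has to be stored in a variable (or chained, as in the text) for the new index to stick.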

    Back on p. 57 I asserted that unlike associative arrays, tables didn’t have keys. And that’s true of the general “table” concept. But Pandas designed their DataFrames to behave in the same way as their Serieses: one uniquely-valued column will be used to identify each row.

    This choice is usually easy; if you glance back to Figure 7.3, we’d probably want to choose the screenname as the index (although a case could be made for the real name column instead). For the table in Figure 7.4, it would be the item column. In the DataFrame we just created above, obviously person is the correct choice – it’s the only one sure to be unique.
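    One way to sanity-check a candidate index column is to ask Pandas whether its values are all different, using the .is_unique attribute of a Series. This is a sketch under assumptions: the data is read from a string instead of the file, and unique_flags is a made-up name for this illustration.

```python
import io
import pandas as pd

csv_text = """person,age,gender,height,instrument
Dad,50,M,73,piano
Mom,49,F,66,flute
Lizzy,21,F,63,guitar
TJ,20,M,71,trombone
Johnny,17,M,72,euphonium
"""

df = pd.read_csv(io.StringIO(csv_text))

# For each column, record whether every value in it is distinct.
unique_flags = {name: bool(df[name].is_unique) for name in df.columns}

for name, flag in unique_flags.items():
    print(name, flag)
```

    Only gender fails the check here, but that’s just a fact about these five rows. Two family members could easily share an age, a height, or an instrument, which is why person is the only column we can count on. The check tells you what is unique in the data you have, not what is guaranteed to stay unique.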

    Anyway, designating a column as the index in this way sort of removes it from the other, “ordinary” columns. In the output above, you may notice that the word “person” is printed somewhat lower than the other column names are. It turns out that if we want to talk about the index column specifically, we’ll need to use a slightly different technique than we do for the other columns. More on that in the next chapter.
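    As a small preview (just a sketch, with the data read from a string so it stands alone), the separation is visible in the DataFrame’s .columns and .index attributes. The variable name df is arbitrary.

```python
import io
import pandas as pd

csv_text = """person,age,gender,height,instrument
Dad,50,M,73,piano
Mom,49,F,66,flute
Lizzy,21,F,63,guitar
TJ,20,M,71,trombone
Johnny,17,M,72,euphonium
"""

df = pd.read_csv(io.StringIO(csv_text)).set_index('person')

# The ordinary columns: 'person' is no longer among them.
print(list(df.columns))

# The index keeps the old column's name and its values.
print(df.index.name)
print(list(df.index))
```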

    Finally, note that calling .set_index() is optional. It’s perfectly fine to just call pd.read_csv() and leave it at that. In that case, Pandas will use integers (starting with 0, of course) as the index/keys.
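    Here is a minimal sketch of that default behavior, again reading the same data from a string rather than from davieses.csv. The name plain_df is made up for this illustration.

```python
import io
import pandas as pd

csv_text = """person,age,gender,height,instrument
Dad,50,M,73,piano
Mom,49,F,66,flute
Lizzy,21,F,63,guitar
TJ,20,M,71,trombone
Johnny,17,M,72,euphonium
"""

# No set_index() call: Pandas numbers the rows itself.
plain_df = pd.read_csv(io.StringIO(csv_text))

print(list(plain_df.index))
print('person' in plain_df.columns)
print(plain_df)
```

    In this version person stays an ordinary column, and the rows are identified by the integers 0 through 4.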


    This page titled 16.1: Reading a DataFrame from a .csv file is shared under a CC BY-SA 4.0 license and was authored, remixed, and/or curated by Stephen Davies (allthemath.org) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.