Engineering LibreTexts

1.5: Data Science with Python

    Learning Objectives

    By the end of this section, you should be able to

    • 1.5.1 Load data to Python.
    • 1.5.2 Perform basic data analysis using Python.
    • 1.5.3 Use visualization principles to graphically plot data using Python.

    Multiple tools are available for writing and executing Python programs. Jupyter Notebook is one convenient and user-friendly tool. The next section explains how to set up the Jupyter Notebook environment using Google Colaboratory (Colab) and then provides the basics of two open-source Python libraries named Pandas and Matplotlib. These libraries are specialized for data analysis and data visualization, respectively.

    Exploring Further: Python Programming

    In the discussion below, we assume you are familiar with basic Python syntax and know how to write a simple program using Python. If you need a refresher on the basics, please refer to Das, U., Lawson, A., Mayfield, C., & Norouzi, N. (2024). Introduction to Python Programming. OpenStax. https://openstax.org/books/introduction-python-programming/pages/1-introduction.

    Jupyter Notebook on Google Colaboratory

    Jupyter Notebook is a web-based environment that lets you run a Python program interactively, combining program code, math equations, visualizations, and plain text in one document. Several web applications and software packages can edit a Jupyter Notebook, but in this textbook we will use Google’s free application named Google Colaboratory, often abbreviated as Colab. It is a cloud-based platform, which means that you can open, edit, run, and save a Jupyter Notebook on your Google Drive.

    Setting up Colab is simple. On your Google Drive, click New > More. If Colab has already been installed on your Google Drive, you will see Colaboratory under More. If not, click “Connect more apps” and install Colab by searching for “Colaboratory” in the app store (Figure 1.15). For further information, see the Google Colaboratory Ecosystem animation.

    A screenshot of the Google Drive icon with an add new button. Below is the Google Workspace Marketplace menu with a magnifying glass and Colaboratory highlighted.
    Figure 1.15 Install Google Colaboratory (Colab)

    Now click New > More > Google Colaboratory. A new, empty Jupyter Notebook will show up as in Figure 1.16.

    A screenshot of an empty Jupyter Notebook within Google Colaboratory. The notebook is named Untitled.ipynb.
    Figure 1.16 Google Colaboratory Notebook

    The gray area with the play button is called a cell. A cell is a block where you can type either code or plain text. Notice that there are two buttons on top of the first cell—“+ Code” and “+ Text.” These two buttons add a code or text cell, respectively. A code cell is for the code you want to run; a text cell is to add any text description or note.

    Let’s run a Python program on Colab. Type the following code in a code cell.

    Python Code

          print ("hello world!")
    

    The resulting output will look like this:

    hello world!

    You can write a Python program across multiple cells and put text cells in between. Colab treats all the code cells in a Jupyter Notebook as parts of a single program: variables defined in one cell remain available in the cells you run afterward. For example, the two code cells below run as if they were a single program.

    When running one cell at a time from the top, we see the following outputs under each cell.

    Python Code

          
          a = 1
          print ("The a value in the first cell:", a)
          

    The resulting output will look like this:

    The a value in the first cell: 1

    Python Code

          b = 3
          print ("a in the second cell:", a)
          print ("b in the second cell:", b)
          a + b

    The resulting output will look like this:

    a in the second cell: 1
    b in the second cell: 3
    4

    Conventional Python versus Jupyter Notebook Syntax

    While a conventional Python program needs print() to write a value to the program console, Jupyter Notebook does not always require it. In a notebook, ending a cell with the line a+b instead of print(a+b) also displays the value of a+b as an output. But keep in mind that this applies only to the last line of a cell: if several lines of code evaluate to a value, only the value from the last line will show.
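
    The display rule can be seen in a short sketch. The variable names are illustrative only, and the comments describe what a notebook cell is expected to display; running the same lines as a plain Python script would show only the print() output.

```python
a = 1
b = 3

a - b           # evaluated, but not displayed: it is not the last line
print(a * b)    # print() always writes its value, wherever it appears
a + b           # last line of the cell: a notebook displays this value
```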

    You can also run multiple cells in bulk. Click Runtime on the menu, and you will see there are multiple ways of running multiple cells at once (Figure 1.17). The two commonly used ones are “Run all” and “Run before.” “Run all” runs all the cells in order from the top; “Run before” runs all the cells before the currently selected one.

    A screenshot of the Runtime menu in Google Colab with options to Run all, Run before, Run the focused cell, Run selection, and Run after with the keyboard shortcuts for each.
    Figure 1.17 Multiple Ways of Running Cells on Colab

    One thing to keep in mind is that being able to split a long program into multiple cells and run one cell at a time raises the chance of user error. Let’s look at a modified version of the code from the previous example.

    Python Code

          a = 1
          print ("the value in the first cell:", a)
    

    The resulting output will look like this:

    the value in the first cell: 1

    Python Code

          b = 3
          print ("a in the second cell:", a)
          print ("b in the second cell:", b)
          a + b
    

    The resulting output will look like this:

    a in the second cell: 1
    b in the second cell: 3
    4

    Python Code

          a = 2
          a + b
    

    The resulting output will look like this:

    5

    The modified code has an additional cell at the end, which updates a from 1 to 2. Notice that a+b now returns 5 because a has been changed to 2. Now suppose you need to run the second cell again for some reason.

    Python Code

          a = 1
          print ("the a value in the first cell:", a)
    

    The resulting output will look like this:

    the a value in the first cell: 1

    Python Code

          b = 3
          print ("a in the second cell:", a)
          print ("b in the second cell:", b)
          a + b
          

    The resulting output will look like this:

    a in the second cell: 2
    b in the second cell: 3
    5

    Python Code

          a = 2
          a + b
          

    The resulting output will look like this:

    5

    The second cell now reports that the value of a is 2. This implies that the execution order of the cells matters! If you run the third cell before the second cell, a will hold the value assigned in the third cell, even though the third cell is located below the second one. Therefore, it is recommended to use “Run all” or “Run before” after you make changes across multiple cells of code. This way your code is guaranteed to run sequentially from the top.
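
    The effect of execution order can also be imitated in plain Python. The sketch below is only an analogy, not how Colab is implemented: each function stands in for one of the three cells above, and the dictionary ns is a hypothetical stand-in for the variables a notebook keeps between cells.

```python
# Each function stands in for one notebook cell. The dictionary `ns` is a
# hypothetical stand-in for the variables a notebook keeps between cells.
def cell_1(ns):
    ns["a"] = 1

def cell_2(ns):
    ns["b"] = 3
    return ns["a"] + ns["b"]

def cell_3(ns):
    ns["a"] = 2
    return ns["a"] + ns["b"]

# Run cells 1, 2, 3 and then cell 2 again, as in the example above
ns = {}
cell_1(ns)
first_run = cell_2(ns)   # a is still 1 here
cell_3(ns)               # a becomes 2
rerun = cell_2(ns)       # cell 2 now sees a == 2

# "Run all" style: start fresh and go from the top
ns = {}
cell_1(ns)
top_down = cell_2(ns)

print(first_run, rerun, top_down)   # prints: 4 5 4
```

    The same cell gives 4 or 5 depending on what ran before it, which is why rerunning from the top is the safer habit.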

    Python Pandas

    One of the strengths of Python is that a wide variety of free, open-source libraries is available for it. A library is a collection of already-implemented methods that a programmer can call, which spares the programmer from building common functions from scratch.

    Pandas is a Python library specialized for data manipulation and analysis, and it is very commonly used among data scientists. It offers a variety of ready-made methods that data scientists can apply directly to data analysis tasks. You will learn how to analyze data using Pandas throughout this textbook.

    Colab already has Pandas installed, so you just need to import Pandas and you are set to use all the methods in Pandas. Note that it is convention to abbreviate pandas to pd so that when you call a method from Pandas, you can type pd instead of pandas every time. It offers a bit of convenience for a programmer!

    Python Code

          # import Pandas and assign an abbreviated identifier "pd"
          import pandas as pd
          

    Exploring Further: Installing Pandas on Your Computer

    If you wish to install Pandas on your own computer, refer to the installation page of the Pandas website.

    Load Data Using Python Pandas

    The first step for data analysis is to load the data of your interest to your Notebook. Let’s create a folder on Google Drive where you can keep a CSV file for the dataset and a Notebook for data analysis. Download a public dataset, ch1-movieprofit.csv, and store it in a Google Drive folder. Then open a new Notebook in that folder by entering that folder and clicking New > More > Google Colaboratory.

    Open the Notebook and allow it to access files in your Google Drive by following these steps:

    First, click the Files icon on the side tab (Figure 1.18).

    A screenshot of the side tab of Google Colab showing the following icons: hamburger menu, magnifying glass, x in brackets, key, and folder. The folder icon is highlighted and the word “Files” has popped up.
    Figure 1.18 Side Tab of Colab

    Then click the Mount Drive icon (Figure 1.19) and select “Connect to Google Drive” on the pop-up window.

    A screenshot of the Files popup menu on Google Colab. The menu includes four icons, and the Mount Drive icon is selected.
    Figure 1.19 Features under Files on Colab

    Notice that a new cell has been inserted on the Notebook as a result (Figure 1.20).

    Code snippet in Google Colab displaying a Python command to mount Google Drive. The command imports the 'drive' module and uses the 'mount' function with the path '/content/drive'.
    Figure 1.20 An Inserted Cell to Mount Your Google Drive

    Connect your Google Drive by running the cell, and now your Notebook file can access all the files under /content/drive. Navigate the folders under drive to find your Notebook and ch1-movieprofit.csv files. Then click “…” > Copy Path (Figure 1.21).

    A screenshot showing how to copy the path of a C S V file located in a Google Drive folder.
    Figure 1.21 Copying the Path of a CSV File Located in a Google Drive Folder

    Now replace [Path] with the copied path in the code below. Run the code and you will see the dataset has been loaded as a table and stored as a Python variable data.

    Python Code

            # import Pandas and assign an abbreviated identifier "pd"
            import pandas as pd
            
            data = pd.read_csv("[Path]")
            data
    

    The resulting output will look like this:

    A Python output table displaying movie data, including title, year, genre, rating, duration, US gross, worldwide gross, and votes. The table is sorted by worldwide gross in descending order.

    The read_csv() method in Pandas loads a CSV file and stores it as a DataFrame. A DataFrame is a data type that Pandas uses to store multi-column tabular data. Therefore, the variable data holds the table in ch1-movieprofit.csv in the form of a Pandas DataFrame.
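
    To try read_csv() without a file on Google Drive, the sketch below reads CSV text from memory instead. This is only an illustration: the two rows are copied from the outputs shown in this section, the three-column layout is a simplifying assumption rather than the layout of the real file, and io.StringIO is a standard-library helper that makes a string behave like a file.

```python
import io
import pandas as pd

# A tiny stand-in for ch1-movieprofit.csv (assumed layout, two rows only)
csv_text = """Title,Year,US_Gross_Million
Avatar,2009,760.51
Titanic,1997,659.33
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)   # the class of the loaded object
print(sample.shape)            # (number of rows, number of columns)
print(list(sample.columns))    # column names taken from the header line
```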

    DataFrame versus Series

    Pandas defines two data types for tabular data—DataFrame and Series. While DataFrame is used for multi-column tabular data, Series is used for single-column data. Many methods in Pandas support both DataFrame and Series, but some are only for one or the other. It is always good to check if the method you are using works as you expect. For more information, refer to the Pandas documentation or Das, U., Lawson, A., Mayfield, C., & Norouzi, N. (2024). Introduction to Python Programming. OpenStax. https://openstax.org/books/introduction-python-programming/pages/1-introduction.
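
    A quick way to see the difference is to check the type of what a selection returns. In the minimal sketch below, the two-row table is a hypothetical example; selecting with one column name gives a Series, while selecting with a list of names gives a DataFrame even when the list holds a single name.

```python
import pandas as pd

# Hypothetical two-column table for illustration
toy = pd.DataFrame({"Title": ["Avatar", "Titanic"], "Year": [2009, 1997]})

one_column = toy["Year"]       # single brackets with one name
as_table = toy[["Year"]]       # double brackets with a list of names

print(type(one_column).__name__)
print(type(as_table).__name__)
```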

    Example 1.9

    Remember the Iris dataset we used in Data and Datasets? Load the dataset ch1-iris.csv to a Python program using Pandas.

    Answer

    The following code loads the ch1-iris.csv that is stored in a Google Drive. Make sure to replace the path with the actual path to ch1-iris.csv on your Google Drive.

    Python Code

            import pandas as pd 
    
            data = pd.read_csv("[Path to ch1-iris.csv]") # Replace the path 
            data
          

    The resulting output will look like this:

    A Python output table displaying a portion of the Iris dataset. Columns include sepal length, sepal width, petal length, petal width, and species. Rows show data for individual Iris flowers.

    Exploring Further

    Can I load a file that is uploaded to someone else’s Google Drive and shared with me?

    Yes! This is useful especially when your Google Drive runs out of space. Simply add the shortcut of the shared file to your own drive. Right-click > Organize > Add Shortcut will let you select where to store the shortcut. Once done, you can call pd.read_csv() using the path of the shortcut.

    Summarize Data Using Python Pandas

    You can compute basic statistics for data quite quickly by using the DataFrame.describe() method. Add and run the following code in a new cell. It calls the describe() method upon data, the DataFrame we defined earlier with ch1-movieprofit.csv.

    Python Code

          data = pd.read_csv("[Path to ch1-movieprofit.csv]")
          data.describe()
          

    The resulting output will look like this:

    A Python output table displaying descriptive statistics for movie data, including count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values for rating, duration, US gross, and worldwide gross in millions.

    describe() returns a table whose columns are, by default, the numeric columns of the dataset and whose rows are different statistics. The statistics include the number of non-missing values in a column (count), mean (mean), standard deviation (std), minimum and maximum values (min/max), and the quartiles (25%/50%/75%), which you will learn about in Measures of Variation. Using this representation, you can compare such statistics across different columns easily.
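
    The meaning of count is easy to check on a small table. The sketch below uses made-up data in which one rating is missing and the other two are equal, so the number of non-missing values and the number of distinct values differ.

```python
import pandas as pd

# Hypothetical data: one rating is missing, and two ratings are equal
toy = pd.DataFrame({
    "Genre": ["Drama", "Action", "Drama"],
    "Rating": [7.9, 7.9, None],
})

summary = toy.describe()

print(list(summary.columns))            # Genre is text, so it is left out
print(summary.loc["count", "Rating"])   # counts the non-missing values
print(toy["Rating"].nunique())          # number of distinct values, for contrast
```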

    Example 1.10

    Summarize the IRIS dataset (ch1-iris.csv) that you loaded in the previous example, using describe().

    Answer

    The following code in a new cell returns the summary of the dataset.

    Python Code

          data = pd.read_csv("[Path to ch1-iris.csv]")
          data.describe()
          

    The resulting output will look like this:

    A Python output table displaying descriptive statistics for four Iris flower measurements: sepal length, sepal width, petal length, and petal width, including count, mean, standard deviation, minimum, quartiles, and maximum values.

    Select Data Using Python Pandas

    The Pandas DataFrame allows a programmer to use the column name itself when selecting a column. For example, the following code prints all the values in the “US_Gross_Million” column in the form of a Series (remember the data from a single column is stored in the Series type in Pandas).

    Python Code

          data = pd.read_csv("[Path to ch1-movieprofit.csv]")
          
          data["US_Gross_Million"]
    

    The resulting output will look like this:

    0   760.51
    1   858.37
    2   659.33
    3   936.66
    4   678.82
        ... 
    961   77.22
    962  177.20
    963  102.31
    964  106.89
    965   75.47
    Name: US_Gross_Million, Length: 966, dtype: float64

    DataFrame.iloc[] enables a more powerful selection—it lets a programmer select by both column and row, using column and row indices. Let’s look at some code examples below.

    Python Code

          data.iloc[:, 2] # select all values in the third column (index 2)
          

    The resulting output will look like this:

    0   2009
    1   2019
    2   1997
    3   2015
    4   2018
        ... 
    961  2010
    962  1982
    963  1993
    964  1999
    965  2017
    Name: Year, Length: 966, dtype: object

    Python Code

          data.iloc[2,:] # select all values in the third row
          

    The resulting output will look like this:

    Unnamed: 0             3
    Title            Titanic
    Year              1997
    Genre             Drama
    Rating              7.9
    Duration             194
    US_Gross_Million       659.33
    Worldwide_Gross_Million   2201.65
    Votes           1,162,142
    Name: 2, dtype: object

    To pinpoint a specific value within the “US_Gross_Million” column, you can use an index number.

    Python Code

          
          print (data["US_Gross_Million"][0]) # index 0 refers to the top row
          print (data["US_Gross_Million"][2]) # index 2 refers to the third row
          

    The resulting output will look like this:

    760.51
    659.33
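
    Note that an index number in square brackets after a column is matched against the row labels shown on the left of the output, not against the position. For a freshly loaded file the two coincide, but they can differ after filtering. A minimal sketch, assuming a three-value Series built from the first rows shown above; .loc[] (by label) and .iloc[] (by position) make the intent explicit.

```python
import pandas as pd

gross = pd.Series([760.51, 858.37, 659.33], name="US_Gross_Million")

# Keep the values below 800; the surviving rows keep their labels 0 and 2
subset = gross[gross < 800]

by_label = subset.loc[2]       # the row labeled 2
by_position = subset.iloc[1]   # the second remaining row

print(list(subset.index))
print(by_label, by_position)
```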

    You can also use DataFrame.iloc[] to select a specific group of cells in the table. iloc[] can be used in many ways; the example code below shows a couple of common ones. You will learn more techniques for working with data throughout this textbook.

    Python Code

          
          data.iloc[:, 1] # select all values in the second column (index 1)
          

    The resulting output will look like this:

    0                     Avatar
    1                Avengers: Endgame
    2                     Titanic
    3   Star Wars: Episode VII - The Force Awakens
    4             Avengers: Infinity War
                 ...          
    961                  The A-Team
    962                    Tootsie
    963              In the Line of Fire
    964                 Analyze This
    965            The Hitman's Bodyguard
    Name: Title, Length: 966, dtype: object

    Python Code

          data.iloc[[1, 3], [2, 3]]  
          # select the rows at index 1 and 3, the columns at index 2 and 3

    The resulting output will look like this:

    A Python output table with two columns and two rows. The first column is labeled “Year,” and the second column is labeled “Genre.” The first row contains the values “2019” and “Action,” and the second row contains the values “2015” and “Action.” There are two icons to the right of the table, one that looks like a calendar and one that looks like a bar chart.
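
    The common iloc[] patterns can be compared side by side on a small table. The sketch below uses a made-up DataFrame (the column names and values are placeholders) so that each result is easy to verify by eye.

```python
import pandas as pd

# Hypothetical table; the values are made up for illustration
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "x": [10, 20, 30, 40],
    "y": [1.5, 2.5, 3.5, 4.5],
})

column = toy.iloc[:, 1]            # every row, column at index 1 ("x")
row = toy.iloc[2, :]               # row at index 2, every column
cell = toy.iloc[0, 2]              # one value: row 0, column 2
block = toy.iloc[[1, 3], [1, 2]]   # rows 1 and 3, columns 1 and 2
first_two = toy.iloc[0:2, :]       # slice: rows 0 and 1 (the end index is excluded)

print(block)
```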

    Example 1.11

    Select a “sepal_width” column of the IRIS dataset using the column name.

    Answer

    The following code in a new cell returns the “sepal_width” column.

    Python Code

          data = pd.read_csv("[Path to ch1-iris.csv]")
          
          data["sepal_width"]
          

    The resulting output will look like this:

    0   3.5
    1   3.0
    2   3.2
    3   3.1
    4   3.6
       ... 
    145  3.0
    146  2.5
    147  3.0
    148  3.4
    149  3.0
    Name: sepal_width, Length: 150, dtype: float64

    Example 1.12

    Select a “petal_length” column of the IRIS dataset using iloc[].

    Answer

    The following code in a new cell returns the “petal_length” column.

    Python Code

          data.iloc[:, 2]
          

    The resulting output will look like this:

    0   1.4
    1   1.4
    2   1.3
    3   1.5
    4   1.4
       ... 
    145  5.2
    146  5.0
    147  5.2
    148  5.4
    149  5.1
    Name: petal_length, Length: 150, dtype: float64

    Search Data Using Python Pandas

    To search for data entries that fulfill specific criteria (i.e., to filter the data), you can use DataFrame.loc[] of Pandas. When you put the filtering criteria inside the brackets, [], the output contains only the rows of the DataFrame that meet them. For example, the code below keeps only the rows whose genre is comedy. Notice that the output has only 307 of the full 966 rows. You can check the output on your own, and you will see their Genre values are all “Comedy.”

    Python Code

          data = pd.read_csv("[Path to ch1-movieprofit.csv]")
          
          data.loc[data['Genre'] == 'Comedy']
          

    The resulting output will look like this:

    A Python output table displaying movie data, including title, year, genre, rating, duration, US gross, worldwide gross, and votes. The table is sorted by worldwide gross in descending order.
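
    It may help to see what happens inside the brackets. The comparison data['Genre'] == 'Comedy' produces a Series of True/False values, one per row, and loc[] keeps the rows marked True. A small sketch with made-up rows (the titles and genres below are placeholders, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Genre": ["Comedy", "Drama", "Comedy", "Action"],
})

mask = toy["Genre"] == "Comedy"   # Boolean Series, one entry per row
comedies = toy.loc[mask]          # keeps only the rows where mask is True

print(mask.tolist())
print(len(comedies), "of", len(toy), "rows kept")
print(int(mask.sum()))            # True counts as 1, so this counts the matches
```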

    Example 1.13

    Using DataFrame.loc[], search for all the items of Iris-virginica species in the IRIS dataset.

    Answer

    The following code returns a filtered DataFrame whose species are Iris-virginica. All such rows show up as an output.

    Python Code

          data = pd.read_csv("[Path to ch1-iris.csv]")
          
          data.loc[data['species'] == 'Iris-virginica']
          

    The resulting output will look like this:

    A Python output table displaying a portion of the Iris dataset, specifically Iris-virginica species. Columns include sepal length, sepal width, petal length, petal width, and species. Rows show data for individual Iris flowers.
          (Rows 109 through 149 not shown.)
          

    Example 1.14

    This time, search for all the items whose species is Iris-virginica and whose sepal width is greater than 3.2.

    Answer

    You can use a Boolean expression—in other words, an expression that evaluates as either True or False—inside data.loc[].

    Python Code

          data.loc[(data['species'] == 'Iris-virginica') & (data['sepal_width'] > 3.2)]
          

    The resulting output will look like this:

    A Python output table displaying a portion of the Iris dataset, specifically Iris-virginica species. Columns include sepal length, sepal width, petal length, petal width, and species. Rows show data for individual Iris flowers.
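
    Conditions are combined with & (and), | (or), and ~ (not). When a comparison is written inline, as in the answer above, it needs its own parentheses because these operators bind more tightly than comparisons such as == and >. A minimal sketch on a made-up table that borrows the Iris column names:

```python
import pandas as pd

# Hypothetical measurements, not rows from ch1-iris.csv
toy = pd.DataFrame({
    "species": ["Iris-virginica", "Iris-virginica", "Iris-setosa", "Iris-setosa"],
    "sepal_width": [3.4, 2.8, 3.6, 3.0],
})

is_virginica = toy["species"] == "Iris-virginica"
is_wide = toy["sepal_width"] > 3.2

both = toy.loc[is_virginica & is_wide]     # virginica AND wide
either = toy.loc[is_virginica | is_wide]   # virginica OR wide
others = toy.loc[~is_virginica]            # NOT virginica

print(len(both), len(either), len(others))   # prints: 1 3 2
```

    Storing each condition in a named variable first, as done here, avoids the parentheses and keeps long filters readable.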

    Visualize Data Using Python Matplotlib

    There are multiple ways to draw plots of data in Python. The most common and straightforward way is to import another library, Matplotlib, which is specialized for data visualization. Matplotlib is a huge library, but to draw the plots in this section you only need to import a submodule named pyplot.

    Type the following import statement in a new cell. Note that, by convention, matplotlib.pyplot is abbreviated to plt, just as pandas is abbreviated to pd.

    Python Code

          import matplotlib.pyplot as plt
    

    Matplotlib offers a method for each type of plot, and you will learn the Matplotlib methods for all of the commonly used types throughout this textbook. In this chapter, however, let’s briefly look at how to draw a plot using Matplotlib in general.

    Suppose you want to draw a scatterplot between “US_Gross_Million” and “Worldwide_Gross_Million” of the movie profit dataset (ch1-movieprofit.csv). You will investigate scatterplots in more detail in Correlation and Linear Regression Analysis. The example code below draws such a scatterplot using the method scatter(). scatter() takes the two columns of your interest—data["US_Gross_Million"] and data["Worldwide_Gross_Million"]—as the inputs and assigns them for the x- and y-axes, respectively.

    Python Code

          data = pd.read_csv("[Path to ch1-movieprofit.csv]")
          
          # draw a scatterplot using matplotlib’s scatter()
          plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])
    

    The resulting output will look like this:

    An unlabeled scatter plot. The X axis ranges from 0 to 1,000. The Y axis ranges from 0 to 3,000. Data points are clustered toward the lower left corner, with a general upward trend indicating that a higher value on the X axis tends to correlate with a higher value on the Y axis.

    Notice that the plot is simply a set of dots on a white plane. It does not show what each axis represents or what the plot is about, and without that information it is difficult to grasp what the plot shows. You can set a title and axis labels with the following code. The resulting plot below indicates that there is a positive correlation between domestic gross and worldwide gross.

    Python Code

          # draw a scatterplot
          plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])
          
          # set the title
          plt.title("Domestic vs. Worldwide Gross")
          
          # set the x-axis label
          plt.xlabel("Domestic")
          
          # set the y-axis label
          plt.ylabel("Worldwide")
          

    The resulting output will look like this:

    A scatter plot comparing the domestic gross versus worldwide gross of movies. The x-axis represents domestic gross and ranges from 0 to 1,000, and the y-axis represents worldwide gross and ranges from 0 to 3,000. Each data point is a blue dot representing a movie. The plot shows a general positive correlation between domestic and worldwide gross, indicating that movies with higher domestic gross tend to also have higher worldwide gross.  Data points are clustered toward the lower left corner, with a general upward trend.

    You can also change the range of numbers along the x- and y-axes with plt.xlim() and plt.ylim(). Add the last two lines of the following code to the cell from the previous example, which draws the scatterplot.

    Python Code

          # draw a scatterplot
          plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])
          
          # set the title
          plt.title("Domestic vs. Worldwide Gross")
          
          # set the x-axis label
          plt.xlabel("Domestic")
          
          # set the y-axis label
          plt.ylabel("Worldwide")
          
          # set the range of values of the x- and y-axes
          plt.xlim(1*10**2, 3*10**2) # x axis: 100 to 300
          plt.ylim(1*10**2, 1*10**3) # y axis: 100 to 1,000
          

    The resulting output will look like this:

    A scatter plot comparing the domestic gross versus worldwide gross of movies. The x-axis represents domestic gross and ranges from 100 to 300, and the y-axis represents worldwide gross and ranges from 100 to 1,000. Each data point is a blue dot representing a movie. The plot shows a general positive correlation between domestic and worldwide gross, indicating that movies with higher domestic gross tend to also have higher worldwide gross.  Data points are clustered toward the lower left corner, with a general upward trend.

    The resulting plot with the additional lines of code has a narrower range of values along the x- and y-axes.
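
    Calling plt.xlim() or plt.ylim() with no arguments reports the current range instead of changing it, which is a quick way to check the effect of the code above. A minimal sketch with made-up points (the x and y values are placeholders, not data from the movie dataset):

```python
import matplotlib.pyplot as plt

# Hypothetical points, used only to have something to plot
x = [50, 150, 250, 350]
y = [120, 400, 700, 1500]

plt.scatter(x, y)
automatic = plt.xlim()      # range that Matplotlib picked from the data

plt.xlim(100, 300)          # x axis: 100 to 300
plt.ylim(100, 1000)         # y axis: 100 to 1,000

print(automatic)
print(plt.xlim(), plt.ylim())
```

    Points outside the chosen range are not deleted; they are simply not drawn, so widening the range again brings them back.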

    Example 1.15

    Using the iris dataset, draw a scatterplot between petal length and petal width of Setosa Iris. Set the title, x-axis label, and y-axis label properly as well.

    Answer

    Python Code

          import pandas as pd
          import matplotlib.pyplot as plt
          
          data = pd.read_csv("[Path to ch1-iris.csv]")
          
          # select the rows whose species are Setosa Iris
          setosa = data.loc[(data['species'] == 'Iris-setosa')]
          
          # draw a scatterplot
          plt.scatter(setosa["petal_length"], setosa["petal_width"])
          
          # set the title
          plt.title("Petal Length vs. Petal Width of Setosa Iris")
          
          # set the x-axis label
          plt.xlabel("Petal Length")
          
          # set the y-axis label
          plt.ylabel("Petal Width")
          

    The resulting output will look like this:

    A scatter plot showing the relationship between petal length and petal width of setosa iris flowers. The x-axis represents petal length, and the y-axis represents petal width. Each data point is represented by a blue dot. Data points are clustered within specific ranges of petal length and width, with some overlap between clusters.

    Datasets

    Note: The primary datasets referenced in the chapter code may also be downloaded here.


    This page titled 1.5: Data Science with Python is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by OpenStax via source content that was edited to the style and standards of the LibreTexts platform.