Engineering LibreTexts

15.3: Pandas


Learning Objectives

By the end of this section you should be able to

  • Describe the Pandas library.
  • Create a DataFrame and a Series object.
  • Choose appropriate Pandas functions to gain insight from heterogeneous data.

Pandas library

Pandas is an open-source Python library used for data cleaning, processing, and analysis. Pandas provides data structures and data analysis tools to analyze structured data efficiently. The name "Pandas" is derived from the term "panel data," which refers to multidimensional structured datasets. Key features of Pandas include:

  • Data structure: Pandas implements two main data structures:
    • Series: A Series is a one-dimensional labeled array.
    • DataFrame: A DataFrame is a two-dimensional labeled data structure that consists of columns and rows. A DataFrame can be thought of as a spreadsheet-like data structure where each column represents a Series. DataFrame is a heterogeneous data structure where each column can have a different data type.
  • Data processing functionality: Pandas provides various functionalities for data processing, such as data selection, filtering, slicing, sorting, merging, joining, and reshaping.
  • Integration with other libraries: Pandas integrates well with other Python libraries, such as NumPy. The integration capability allows for data exchange between different data analysis and visualization tools.
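As a rough illustration of the processing features listed above, the following sketch shows selection, filtering, sorting, merging, and export to NumPy. The two small DataFrames (people and cities) are hypothetical sample data made up for this example.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration
people = pd.DataFrame({
    "Name": ["Emma", "Gireeja", "Sophia"],
    "Age": [15, 28, 22],
})
cities = pd.DataFrame({
    "Name": ["Emma", "Gireeja", "Sophia"],
    "City": ["Dubai", "London", "San Jose"],
})

# Selection: a single column comes back as a Series
ages = people["Age"]

# Filtering: keep only the rows where Age is greater than 18
adults = people[people["Age"] > 18]

# Sorting: order the rows by Age, largest first
by_age = people.sort_values("Age", ascending=False)

# Merging: combine the two DataFrames on their shared "Name" column
merged = people.merge(cities, on="Name")

# NumPy integration: export a column as a NumPy array
age_array = people["Age"].to_numpy()

print(adults)
print(by_age)
print(merged)
print(age_array)
```

Each operation returns a new object and leaves the people DataFrame unchanged.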

Pandas is conventionally imported under the alias pd, that is, with the statement import pandas as pd. Examples of DataFrame and Series objects are shown below.

DataFrame example

      Name  Age      City
0     Emma   15     Dubai
1  Gireeja   28    London
2   Sophia   22  San Jose

Series example

0       Emma
1    Gireeja
2     Sophia
dtype: object

Table 15.1
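The two objects in Table 15.1 could be produced along the following lines. This is a minimal sketch: the variable names are chosen for illustration, and the exact dtype shown for text columns may vary between Pandas versions.

```python
import pandas as pd

# A Series built directly from a list; Pandas assigns the labels 0, 1, 2
names = pd.Series(["Emma", "Gireeja", "Sophia"])

# A DataFrame with three columns of different data types
df = pd.DataFrame({
    "Name": ["Emma", "Gireeja", "Sophia"],
    "Age": [15, 28, 22],
    "City": ["Dubai", "London", "San Jose"],
})

# Selecting one column of the DataFrame yields a Series
name_column = df["Name"]

print(type(df).__name__)
print(type(name_column).__name__)

# Each column carries its own data type
print(df.dtypes)
```

Because every column is itself a Series, a DataFrame can hold text in one column and integers in another.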

Data input and output

A DataFrame can be created from a dictionary, list, NumPy array, or a CSV file. Column names and column data types can be specified at the time of DataFrame instantiation.
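As a small sketch of specifying column names and a data type at instantiation, the example below passes both columns and dtype to pd.DataFrame(). The dtype parameter applies one data type to the whole DataFrame. Per-column types are set afterwards with astype(), which is one possible approach among several.

```python
import numpy as np
import pandas as pd

# Integer input data
data = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [2, 3, 4],
])

# Column labels and a single data type are given at creation time
df = pd.DataFrame(data, columns=["A", "B", "C"], dtype="float64")
print(df.dtypes)

# Per-column types can be adjusted after creation with astype()
df_mixed = df.astype({"A": "int64"})
print(df_mixed.dtypes)
```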

Description Example Output Explanation
DataFrame from a dictionary
import pandas as pd

# Create a dictionary of columns
data = {
  "Name": ["Emma", "Gireeja", "Sophia"],
  "Age": [15, 28, 22],
  "City": ["Dubai", "London", "San Jose"]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the DataFrame
df
Name Age City
0 Emma 15 Dubai
1 Gireeja 28 London
2 Sophia 22 San Jose

The pd.DataFrame() function takes in a dictionary and converts it into a DataFrame. Dictionary keys will be column labels and values are stored in respective columns.

DataFrame from a list
import pandas as pd

# Create a list of rows
data = [
  ["Emma", 15, "Dubai"],
  ["Gireeja", 28, "London"],
  ["Sophia", 22, "San Jose"]
]

# Define column labels
columns = ["Name", "Age", "City"]

# Create a DataFrame from list using column labels
df = pd.DataFrame(data, columns=columns)

# Display the DataFrame
df
Name Age City
0 Emma 15 Dubai
1 Gireeja 28 London
2 Sophia 22 San Jose

The pd.DataFrame() function takes in a list containing the records in different rows of a DataFrame, along with a list of column labels, and creates a DataFrame with the given rows and column labels.

DataFrame from a NumPy array
import numpy as np
import pandas as pd

# Create a NumPy array
data = np.array([
  [1, 0, 0],
  [0, 1, 0],
  [2, 3, 4]
])

# Define column labels
columns = ["A", "B", "C"]

# Create a DataFrame from the NumPy array
df = pd.DataFrame(data, columns=columns)

# Display the DataFrame
df
A B C
0 1 0 0
1 0 1 0
2 2 3 4

A NumPy array, along with column labels, is passed to the pd.DataFrame() function to create a DataFrame object.

DataFrame from a CSV file
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("data.csv")

# Display the DataFrame
df
The content of the CSV file will be printed in a tabular format.

The pd.read_csv() function reads a CSV file into a DataFrame and organizes the content in a tabular format.

DataFrame from an Excel file
import pandas as pd

# Read the Excel file into a DataFrame
df = pd.read_excel("data.xlsx")

# Display the DataFrame
df
The content of the Excel file will be printed in a tabular format.

The pd.read_excel() function reads an Excel file into a DataFrame and organizes the content in a tabular format.

Table 15.2 DataFrame creation.
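The CSV example in Table 15.2 depends on a file named data.csv. To try pd.read_csv() without a file on disk, the sketch below reads CSV text from an in-memory buffer instead. The CSV content is made up to match the earlier examples, and to_csv() is shown as one way to write a DataFrame back out.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for the file data.csv
csv_text = """Name,Age,City
Emma,15,Dubai
Gireeja,28,London
Sophia,22,San Jose
"""

# read_csv accepts a file path or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# With no path given, to_csv returns the CSV text as a string;
# index=False leaves out the row labels
csv_out = df.to_csv(index=False)
print(csv_out)
```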
Concepts in Practice: Pandas basics
1. Which of the following is a Pandas data structure?
  a. Series
  b. dictionary
  c. list
2. A DataFrame object can be considered as a collection of Series objects.
  a. true
  b. false
3. What are the benefits of Pandas over NumPy?
  a. Pandas provides integration with other libraries.
  b. Pandas supports heterogeneous data whereas NumPy supports homogeneous numerical data.
  c. Pandas supports both one-dimensional and two-dimensional data structures.

    Pandas for data manipulation and analysis

    The Pandas library provides functions and techniques to explore, manipulate, and gain insights from data. The following table describes key DataFrame functions, each applied to the sample DataFrame created by this code.

        import pandas as pd
        import numpy as np
        
        # Create a sample DataFrame
        days = {
          'Season': ['Summer', 'Summer', 'Fall', 'Winter', 'Fall', 'Winter'],
          'Month': ['July', 'June', 'September', 'January', 'October', 'February'],
          'Month-day': [1, 12, 3, 7, 20, 28],
          'Year': [2000, 1990, 2020, 1998, 2001, 2022]
        }
        df = pd.DataFrame(days)
    Season Month Month-day Year
    0 Summer July 1 2000
    1 Summer June 12 1990
    2 Fall September 3 2020
    3 Winter January 7 1998
    4 Fall October 20 2001
    5 Winter February 28 2022
    Table 15.3
    Function name Explanation Example Output

    head(n)

    Returns the first n rows. If a value is not passed, the first 5 rows will be shown.

    df.head(4)
    Season Month Month-day Year
    0 Summer July 1 2000
    1 Summer June 12 1990
    2 Fall September 3 2020
    3 Winter January 7 1998

    tail(n)

    Returns the last n rows. If a value is not passed, the last 5 rows will be shown.

    df.tail(3)
    Season Month Month-day Year
    3 Winter January 7 1998
    4 Fall October 20 2001
    5 Winter February 28 2022

    info()

    Prints a summary of the DataFrame, including the column names, data types, and the number of non-null values. The summary also reports the DataFrame's memory usage.
    df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 6 entries, 0 to 5
    Data columns (total 4 columns):
     #    Column        Non-Null Count    Dtype
    ---   ------        --------------    -----
     0    Season        6 non-null        object
     1    Month         6 non-null        object
     2    Month-day     6 non-null        int64
     3    Year          6 non-null        int64
    dtypes: int64(2), object(2)
    memory usage: 320.0+ bytes
    

    describe()

    For each numeric column, generates the count, mean, standard deviation, minimum, maximum, and quartiles.
    df.describe()
    Month-day Year
    count 6.000000 6.000000
    mean 11.833333 2005.166667
    std 10.457852 12.875040
    min 1.000000 1990.000000
    25% 4.000000 1998.500000
    50% 9.500000 2000.500000
    75% 18.000000 2015.250000
    max 28.000000 2022.000000

    value_counts()

    Counts the occurrences of each unique value in the column passed as an argument and presents the counts in descending order.
    df.value_counts('Season')
    Season
    Fall  2
    Summer  2
    Winter  2
    dtype: int64
    

    unique()

    Returns an array of unique values in a column when called on a column.
    df['Season'].unique()
    ['Summer' 'Fall' 'Winter']
    Table 15.4 DataFrame functions.
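The functions in Table 15.4 can be tried together on the sample DataFrame. The sketch below rebuilds that DataFrame so it runs on its own. The call to nunique() is not covered in the table and is included only to contrast the unique values with their count.

```python
import pandas as pd

# The sample DataFrame from this section
days = {
    "Season": ["Summer", "Summer", "Fall", "Winter", "Fall", "Winter"],
    "Month": ["July", "June", "September", "January", "October", "February"],
    "Month-day": [1, 12, 3, 7, 20, 28],
    "Year": [2000, 1990, 2020, 1998, 2001, 2022],
}
df = pd.DataFrame(days)

# head() with no argument returns the first 5 rows
first_rows = df.head()

# tail(3) returns the last 3 rows
last_rows = df.tail(3)

# describe() summarizes only the numeric columns by default
stats = df.describe()

# value_counts() on a column counts how often each value occurs
season_counts = df["Season"].value_counts()

# unique() returns the distinct values in order of first appearance
seasons = df["Season"].unique()

# nunique() returns how many distinct values there are
n_seasons = df["Season"].nunique()

print(first_rows)
print(last_rows)
print(stats)
print(season_counts)
print(seasons)
print(n_seasons)
```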
    Concepts in Practice: DataFrame operations
    4. Which of the following returns the top five rows of a DataFrame?
      a. df.head()
      b. df.head(5)
      c. both
    5. What does the unique() function do in a DataFrame when applied to a column?
      a. returns the number of unique columns
      b. returns the number of unique values in the given column
      c. returns the unique values in the given column
    6. Which function generates statistical information of columns with numerical data types?
      a. describe()
      b. info()
      c. unique()

    Exploring further

    Please refer to the Pandas user guide for more information about the Pandas library.

    Programming practice with Google

    Use the Google Colaboratory document below to practice Pandas functionalities to extract insights from a dataset.

    Google Colaboratory document


    This page titled 15.3: Pandas is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by OpenStax via source content that was edited to the style and standards of the LibreTexts platform.
