Skip to main content
Engineering LibreTexts

12.4: Retrieving web pages with urllib

  • Page ID
    3211
  • \( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

    While we can manually send and receive data over HTTP using the socket library, there is a much simpler way to perform this common task in Python by using the urllib library.

    Using urllib, you can treat a web page much like a file. You simply indicate which web page you would like to retrieve and urllib handles all of the HTTP protocol and header details.

    The equivalent code to read the romeo.txt file from the web using urllib is as follows:

    Code 12.4.1 (Python)
    %%python3
    
    import urllib.request
    
    fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
    for line in fhand:
        print(line.decode().strip())
    
    # Code: http://www.py4e.com/code3/urllib1.py
    
    

    Once the web page has been opened with urllib.urlopen, we can treat it like a file and read through it using a for loop.

    When the program runs, we only see the output of the contents of the file. The headers are still sent, but the urllib code consumes the headers and only returns the data to us.

    But soft what light through yonder window breaks
    It is the east and Juliet is the sun
    Arise fair sun and kill the envious moon
    Who is already sick and pale with grief

    As an example, we can write a program to retrieve the data for romeo.txt and compute the frequency of each word in the file as follows:

    Code 12.4.1 (Python)
    %%python3
    
    import urllib.request, urllib.parse, urllib.error
    
    fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
    
    counts = dict()
    for line in fhand:
        words = line.decode().split()
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    print(counts)
    
    # Code: http://www.py4e.com/code3/urlwords.py
    
    

    Again, once we have opened the web page, we can read it like a local file.


    This page titled 12.4: Retrieving web pages with urllib is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Chuck Severance via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.