
12.7: Parsing HTML using BeautifulSoup


    There are a number of Python libraries which can help you parse HTML and extract data from the pages. Each of the libraries has its strengths and weaknesses and you can pick one based on your needs.

    As an example, we will simply parse some HTML input and extract links using the BeautifulSoup library. You can download and install the BeautifulSoup code from:

    http://www.crummy.com/software/

    You can install BeautifulSoup with pip (the package is named beautifulsoup4), or you can simply download bs4.zip and unzip the bs4 folder into the same directory as your application.

    Even though HTML looks like XML and some pages are carefully constructed to be XML, most HTML is generally broken in ways that cause an XML parser to reject the entire page of HTML as improperly formed. BeautifulSoup tolerates highly flawed HTML and still lets you easily extract the data you need.
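
    The difference is easy to see with a small, deliberately flawed snippet. The sketch below is not taken from the book's examples; the broken_html string is made up for illustration. It feeds the same broken HTML to Python's strict XML parser, which rejects it, and to BeautifulSoup, which repairs it and still finds the link.

    from bs4 import BeautifulSoup
    import xml.etree.ElementTree as ET

    # Deliberately broken HTML: the <p>, <ul>, <li>, and <a> tags are never closed
    broken_html = '<html><body><p>Links<ul><li><a href="page2.htm">Second Page</body></html>'

    # A strict XML parser rejects the whole document
    try:
        ET.fromstring(broken_html)
    except ET.ParseError as err:
        print('XML parser failed:', err)

    # BeautifulSoup repairs the structure and still lets us pull out the link
    soup = BeautifulSoup(broken_html, 'html.parser')
    print(soup.a.get('href', None))    # page2.htm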

    We will use urllib to read the page and then use BeautifulSoup to extract the href attributes from the anchor (a) tags.

    # To run this, you can install BeautifulSoup
    # https://pypi.python.org/pypi/beautifulsoup4
    
    # Or download the file
    # http://www.py4e.com/code3/bs4.zip
    # and unzip it in the same directory as this file
    
    import urllib.request, urllib.parse, urllib.error
    from bs4 import BeautifulSoup
    import ssl
    
    # Ignore SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    
    url = input('Enter - ')
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        print(tag.get('href', None))
    
    # Code: http://www.py4e.com/code3/urllinks.py

    The program prompts for a web address, opens the web page, reads the data, and passes the data to the BeautifulSoup parser. It then retrieves all of the anchor tags and prints the href attribute for each one.

    When the program runs, it looks as follows:

    python urllinks.py
    Enter - http://www.dr-chuck.com/page1.htm
    http://www.dr-chuck.com/page2.htm

    python urllinks.py
    Enter - http://www.py4e.com/book.htm
    http://www.greenteapress.com/thinkpython/thinkpython.html
    http://allendowney.com/
    http://www.si502.com/
    http://www.lib.umich.edu/espresso-book-machine
    http://www.py4e.com/code
    http://www.py4e.com/
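
    In the program above, the call soup('a') is BeautifulSoup shorthand for soup.find_all('a'), which returns a list of all the anchor tags in the document. find_all can also filter by attribute if you only want certain links. The sketch below is not from the book; the HTML string and the class names in it are made up for illustration.

    from bs4 import BeautifulSoup

    html = '<a class="nav" href="/home">Home</a> <a class="ext" href="http://www.py4e.com/">PY4E</a>'
    soup = BeautifulSoup(html, 'html.parser')

    # soup('a') and soup.find_all('a') return the same list of tags
    print(soup('a') == soup.find_all('a'))      # True

    # Keep only the anchors whose class attribute is "ext"
    for tag in soup.find_all('a', class_='ext'):
        print(tag.get('href', None))            # http://www.py4e.com/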

    You can use BeautifulSoup to pull out various parts of each tag as follows:

    Code 12.7.1 (Python)
    # To run this, you can install BeautifulSoup
    # https://pypi.python.org/pypi/beautifulsoup4
    
    # Or download the file
    # http://www.py4e.com/code3/bs4.zip
    # and unzip it in the same directory as this file
    
    
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import ssl
    
    # Ignore SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    
    url = input('Enter - ')
    html = urlopen(url, context=ctx).read()
    
    # html.parser is the HTML parser included in the standard Python 3 library.
    # Information on other HTML parsers is here:
    # http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
    soup = BeautifulSoup(html, 'html.parser')
    
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        # Look at the parts of a tag
        print('TAG:', tag)
        print('URL:', tag.get('href', None))
        print('Contents:', tag.contents[0])
        print('Attrs:', tag.attrs)
    
    # Code: http://www.py4e.com/code3/urllink2.py
    
    
    python urllink2.py
    Enter - http://www.dr-chuck.com/page1.htm
    TAG: <a href="http://www.dr-chuck.com/page2.htm">
    Second Page</a>
    URL: http://www.dr-chuck.com/page2.htm
    Contents:
    Second Page
    Attrs: {'href': 'http://www.dr-chuck.com/page2.htm'}

    These examples only begin to show the power of BeautifulSoup when it comes to parsing HTML.
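
    For instance, you can pull the visible text out of a document, search by tag name, or use CSS-style selectors. The sketch below is not part of the book's code; the HTML string is made up for illustration.

    from bs4 import BeautifulSoup

    html = '''
    <html><body>
      <h1>Reading list</h1>
      <ul>
        <li><a href="http://www.py4e.com/book.htm">Python for Everybody</a></li>
        <li><a href="http://www.greenteapress.com/thinkpython/thinkpython.html">Think Python</a></li>
      </ul>
    </body></html>
    '''
    soup = BeautifulSoup(html, 'html.parser')

    # All of the visible text in the document, joined with spaces
    print(soup.get_text(' ', strip=True))

    # The text inside each list item's link
    for li in soup.find_all('li'):
        print(li.a.get_text())

    # CSS selectors also work: every <a> inside a <ul>
    for tag in soup.select('ul a'):
        print(tag.get('href', None))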

