
6.3: Using jsoup


    jsoup makes it easy to download and parse web pages, and to navigate the DOM tree. Here’s an example:

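    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    String url = "http://en.wikipedia.org/wiki/Java_(programming_language)";

    // download and parse the document (get() can throw IOException)
    Connection conn = Jsoup.connect(url);
    Document doc = conn.get();

    // select the content text and pull out the paragraphs
    Element content = doc.getElementById("mw-content-text");
    Elements paragraphs = content.select("p");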
    

    Jsoup.connect takes a URL as a String and creates a Connection for that URL; the get method sends the request to the web server, downloads the HTML, parses it, and returns a Document object, which represents the DOM.

    Document provides methods for navigating the tree and selecting nodes. In fact, it provides so many methods that it can be confusing. This example demonstrates two ways to select nodes:

    • getElementById takes a String and searches the tree for an element that has a matching “id” attribute. Here it selects the node <div id="mw-content-text" lang="en" class="mw-content-ltr">, which appears on every Wikipedia page to identify the <div> element that contains the main text of the page, as opposed to the navigation sidebar and other elements.
      • The return value from getElementById is an Element object that represents this <div> and contains the elements in the <div> as children, grandchildren, etc.
    • select takes a String containing a CSS-style selector, traverses the tree, and returns all the elements that match it. In this example, the selector "p" matches all paragraph elements that appear in content. The return value is an Elements object.
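    The two selection methods above can be tried without a network connection. The following is a minimal sketch, assuming jsoup is on the classpath; the class name SelectDemo, the helper paragraphsIn, and the small hand-written page are hypothetical and only imitate the Wikipedia layout. It uses Jsoup.parse to build a Document from a String instead of downloading one.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectDemo {
    // Hypothetical stand-in for a Wikipedia page: a sidebar and a content div.
    static final String HTML =
        "<html><body>"
        + "<div id=\"sidebar\"><p>Navigation</p></div>"
        + "<div id=\"mw-content-text\">"
        + "<p>Java is a <i>programming</i> language.</p>"
        + "<p>It was released in 1995.</p>"
        + "</div>"
        + "</body></html>";

    // Returns the paragraphs inside the element with the given id,
    // or an empty Elements if no element has that id.
    static Elements paragraphsIn(String html, String id) {
        Document doc = Jsoup.parse(html);          // parse a String, no download
        Element content = doc.getElementById(id);  // null when the id is absent
        if (content == null) {
            return new Elements();
        }
        return content.select("p");                // only descendants of content
    }

    public static void main(String[] args) {
        Elements paragraphs = paragraphsIn(HTML, "mw-content-text");
        System.out.println(paragraphs.size() + " paragraphs in the content div");
        for (Element p : paragraphs) {
            System.out.println(p.text());
        }
    }
}
```

    Because select is called on content rather than on doc, the paragraph in the sidebar is not included in the result.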

    Before you go on, you should skim the documentation of these classes so you know what they can do. The most important classes are Element, Elements, and Node, which you can read about at thinkdast.com/jsoupelt, thinkdast.com/jsoupelts, and thinkdast.com/jsoupnode.

    Node represents a node in the DOM tree; there are several subclasses that extend Node, including Element, TextNode, DataNode, and Comment. Elements is a Collection of Element objects.
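    To see the Node subclasses in practice, here is a small sketch, again assuming jsoup is on the classpath. The class name NodeKindsDemo, the helper describeChildren, and the sample HTML are hypothetical. It walks the direct children of one element and records which subclass of Node each child is.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class NodeKindsDemo {
    // Hypothetical sample: a paragraph holding text, an element, and a comment.
    static final String HTML =
        "<p id=\"x\">Hello <b>big</b><!-- note --> world</p>";

    // Lists the kind of each direct child of the element with the given id.
    static List<String> describeChildren(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element parent = doc.getElementById(id);
        List<String> kinds = new ArrayList<>();
        for (Node child : parent.childNodes()) {
            if (child instanceof Element) {
                kinds.add("Element");
            } else if (child instanceof TextNode) {
                kinds.add("TextNode");
            } else if (child instanceof Comment) {
                kinds.add("Comment");
            } else {
                kinds.add("other Node");
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        System.out.println(describeChildren(HTML, "x"));
    }
}
```

    Note that childNodes returns every kind of Node, while select and getElementById return only Element objects.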

    Figure \(\PageIndex{1}\) is a UML diagram showing the relationships among these classes. In a UML class diagram, a line with a hollow arrowhead indicates that one class extends another. For example, this diagram indicates that Elements extends ArrayList. We’ll get back to UML diagrams in Section 11.6.


    Figure \(\PageIndex{1}\): UML diagram for selected classes provided by jsoup. Edit: http://yuml.me/edit/4bc1c919


    This page titled 6.3: Using jsoup is shared under a CC BY-NC-SA 3.0 license and was authored, remixed, and/or curated by Allen B. Downey (Green Tea Press) .
