6.4: Iterating through the DOM

    To make your life easier, I provide a class called WikiNodeIterable that lets you iterate through the nodes in a DOM tree. Here’s an example that shows how to use it:

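    // Select every <p> element under content and take the first one.
    Elements paragraphs = content.select("p");
    Element firstPara = paragraphs.get(0);

    // Walk the subtree rooted at the first paragraph.
    Iterable<Node> iter = new WikiNodeIterable(firstPara);
    for (Node node: iter) {
        // Print text nodes only; Element nodes (the tags) are skipped.
        if (node instanceof TextNode) {
            System.out.print(node);
        }
    }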

    This example picks up where the previous one leaves off. It selects the first paragraph in paragraphs and then creates a WikiNodeIterable, which implements Iterable<Node>. WikiNodeIterable performs a “depth-first search”, visiting each node before its children, which produces the nodes in the order they would appear on the page.

    In this example, we print a Node only if it is a TextNode and ignore other types of Node, specifically the Element objects that represent tags. The result is the plain text of the HTML paragraph without any markup. The output is:

    Java is a general-purpose computer programming language that is concurrent, class-based, object-oriented,[13] and specifically designed ...
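
    To see why a depth-first traversal yields text in page order, here is a minimal, self-contained sketch that does not depend on jsoup. It is not the source of WikiNodeIterable; the names DepthFirstSketch, TreeNode, ElementNode, TextLeaf, and plainText are hypothetical stand-ins invented for this illustration. The sketch assumes a stack-based approach: pop a node, then push its children in reverse so that the leftmost child is visited next.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class DepthFirstSketch {

    // Hypothetical stand-in for a DOM node; not jsoup's Node class.
    static abstract class TreeNode {
        final List<TreeNode> children = new ArrayList<>();
    }

    // Hypothetical stand-in for an element (a tag with children).
    static class ElementNode extends TreeNode {
        final String tag;
        ElementNode(String tag, TreeNode... kids) {
            this.tag = tag;
            Collections.addAll(children, kids);
        }
    }

    // Hypothetical stand-in for a text node (a leaf holding text).
    static class TextLeaf extends TreeNode {
        final String text;
        TextLeaf(String text) { this.text = text; }
    }

    // Iterates over a subtree depth-first, each node before its children.
    static class DepthFirstIterable implements Iterable<TreeNode> {
        private final TreeNode root;

        DepthFirstIterable(TreeNode root) { this.root = root; }

        @Override
        public Iterator<TreeNode> iterator() {
            Deque<TreeNode> stack = new ArrayDeque<>();
            stack.push(root);
            return new Iterator<TreeNode>() {
                @Override
                public boolean hasNext() { return !stack.isEmpty(); }

                @Override
                public TreeNode next() {
                    if (stack.isEmpty()) throw new NoSuchElementException();
                    TreeNode node = stack.pop();
                    // Push children in reverse so the leftmost is popped first.
                    for (int i = node.children.size() - 1; i >= 0; i--) {
                        stack.push(node.children.get(i));
                    }
                    return node;
                }
            };
        }
    }

    // Concatenates the text leaves of a subtree in traversal order.
    static String plainText(TreeNode root) {
        StringBuilder sb = new StringBuilder();
        for (TreeNode node : new DepthFirstIterable(root)) {
            if (node instanceof TextLeaf) {
                sb.append(((TextLeaf) node).text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Models <p><b>Java</b> is a <a>language</a>.</p>
        TreeNode para = new ElementNode("p",
            new ElementNode("b", new TextLeaf("Java")),
            new TextLeaf(" is a "),
            new ElementNode("a", new TextLeaf("language")),
            new TextLeaf("."));
        System.out.println(plainText(para)); // prints "Java is a language."
    }
}
```

    As in the example above, the instanceof check keeps only the text and drops the tags. Pushing the children in reverse order is what makes the stack return them left to right.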
    

    This page titled 6.4: Iterating through the DOM is shared under a CC BY-NC-SA 3.0 license and was authored, remixed, and/or curated by Allen B. Downey (Green Tea Press) .
