15.4: Graph traversal

    If you did the “Getting to Philosophy” exercise in Chapter 7, you already have a program that reads a Wikipedia page, finds the first link, uses the link to load the next page, and repeats. This program is a specialized kind of crawler, but when people say “Web crawler” they usually mean a program that

    • Loads a starting page and indexes the contents,
    • Finds all the links on the page and adds the linked URLs to a collection, and
    • Works its way through the collection, loading pages, indexing them, and adding new URLs.
    • If it finds a URL that has already been indexed, it skips it (a sketch of this loop follows the list).
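
    This loop can be sketched in a few lines of Java. The sketch below is a minimal outline, not the book's crawler: loadPage, indexPage, and getLinks are hypothetical stand-ins for whatever fetching and indexing code you actually use.

        import java.util.ArrayDeque;
        import java.util.HashSet;
        import java.util.Queue;
        import java.util.Set;

        public class CrawlerSketch {
            // URLs waiting to be processed; a FIFO queue for now.
            private final Queue<String> frontier = new ArrayDeque<>();
            // URLs that have already been indexed, so we can skip them.
            private final Set<String> indexed = new HashSet<>();

            public void crawl(String startUrl) {
                frontier.add(startUrl);
                while (!frontier.isEmpty()) {
                    String url = frontier.poll();
                    if (indexed.contains(url)) {
                        continue;                    // already indexed: skip it
                    }
                    String page = loadPage(url);     // hypothetical: fetch the page
                    indexPage(url, page);            // hypothetical: index the contents
                    indexed.add(url);
                    for (String link : getLinks(page)) {
                        frontier.add(link);          // add linked URLs to the collection
                    }
                }
            }

            // Hypothetical helpers; a real crawler would use an HTTP client and an indexer.
            private String loadPage(String url) { return ""; }
            private void indexPage(String url, String page) { }
            private Iterable<String> getLinks(String page) { return new HashSet<>(); }
        }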

    You can think of the Web as a graph where each page is a node and each link is a directed edge from one node to another. If you are not familiar with graphs, you can read about them at thinkdast.com/graph.

    Starting from a source node, a crawler traverses this graph, visiting each reachable node once.
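
    To make the node-and-edge picture concrete, you can store the link structure as a map from each page's URL to the list of URLs it links to. This is only an illustration; the pages and links below are invented.

        import java.util.List;
        import java.util.Map;

        public class LinkGraph {
            public static void main(String[] args) {
                // Each key is a node (a page); each value lists the directed edges
                // (links) out of that page. The URLs are made-up examples.
                Map<String, List<String>> graph = Map.of(
                    "https://example.org/a", List.of("https://example.org/b",
                                                     "https://example.org/c"),
                    "https://example.org/b", List.of("https://example.org/c"),
                    "https://example.org/c", List.of());

                // The pages reachable from /a are exactly the nodes a crawler
                // starting at /a would visit.
                System.out.println(graph.get("https://example.org/a"));
            }
        }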

    The collection we use to store the URLs determines what kind of traversal the crawler performs, as the sketch after this list illustrates:

    • If it’s a first-in-first-out (FIFO) queue, the crawler performs a breadth-first traversal.
    • If it’s a last-in-first-out (LIFO) stack, the crawler performs a depth-first traversal.
    • More generally, the items in the collection might be prioritized. For example, we might want to give higher priority to pages that have not been indexed for a long time.
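
    Here is a small sketch, assuming a made-up link graph, that runs the same loop twice over an ArrayDeque: used as a FIFO queue it visits nodes breadth-first, and used as a LIFO stack it visits them depth-first. A prioritized crawler would replace the deque with a PriorityQueue ordered by, for example, how long ago each page was last indexed.

        import java.util.ArrayDeque;
        import java.util.ArrayList;
        import java.util.Deque;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Map;
        import java.util.Set;

        public class TraversalOrder {
            // A tiny made-up link graph: each page maps to the pages it links to.
            static final Map<String, List<String>> GRAPH = Map.of(
                "A", List.of("B", "C"),
                "B", List.of("D"),
                "C", List.of("D"),
                "D", List.of());

            static List<String> traverse(String start, boolean fifo) {
                Deque<String> frontier = new ArrayDeque<>();
                Set<String> visited = new HashSet<>();
                List<String> order = new ArrayList<>();
                frontier.add(start);
                while (!frontier.isEmpty()) {
                    // Take from the front for a FIFO queue (breadth-first),
                    // from the back for a LIFO stack (depth-first).
                    String url = fifo ? frontier.pollFirst() : frontier.pollLast();
                    if (!visited.add(url)) {
                        continue;                    // already visited: skip it
                    }
                    order.add(url);
                    frontier.addAll(GRAPH.get(url)); // add the outgoing links
                }
                return order;
            }

            public static void main(String[] args) {
                System.out.println("FIFO (breadth-first): " + traverse("A", true));   // [A, B, C, D]
                System.out.println("LIFO (depth-first):   " + traverse("A", false));  // [A, C, D, B]
            }
        }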

    You can read more about graph traversal at thinkdast.com/graphtrav.

