16.1: Crawler solution

Last updated
Save as PDF

Page ID: 12827

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

First, let’s go over our solution to the previous exercise. I provided an outline of WikiCrawler; your job was to fill in crawl. As a reminder, here are the fields in the WikiCrawler class:

public class WikiCrawler {
    // keeps track of where we started
    private final String source;

    // the index where the results go
    private JedisIndex index;

    // queue of URLs to be indexed
    private Queue<String> queue = new LinkedList<String>();

    // fetcher used to get pages from Wikipedia
    final static WikiFetcher wf = new WikiFetcher();
}

When we create a WikiCrawler, we provide source and index. Initially, queue contains only one element, source.

Notice that the implementation of queue is a LinkedList, so we can add elements at the end — and remove them from the beginning — in constant time. By assigning a LinkedList object to a Queue variable, we limit ourselves to using methods in the Queue interface; specifically, we’ll use offer to add elements and poll to remove them.

Here’s my implementation of WikiCrawler.crawl:

public String crawl(boolean testing) throws IOException {
    if (queue.isEmpty()) {
        return null;
    }
    String url = queue.poll();
    System.out.println("Crawling " + url);

    if (testing==false && index.isIndexed(url)) {
        System.out.println("Already indexed.");
        return null;
    }

    Elements paragraphs;
    if (testing) {
        paragraphs = wf.readWikipedia(url);
    } else {
        paragraphs = wf.fetchWikipedia(url);
    }
    index.indexPage(url, paragraphs);
    queueInternalLinks(paragraphs);
    return url;
}

Most of the complexity in this method is there to make it easier to test. Here’s the logic:

If the queue is empty, it returns null to indicate that it did not index a page.
Otherwise it removes and stores the next URL from the queue.
If the URL has already been indexed, crawl doesn’t index it again, unless it’s in testing mode.
Next it reads the contents of the page: if it’s in testing mode, it reads from a file; otherwise it reads from the Web.
It indexes the page.
It parses the page and adds internal links to the queue.
Finally, it returns the URL of the page it indexed.

I presented an implementation of Index.indexPage in Section 15.1. So the only new method is WikiCrawler.queueInternalLinks.

I wrote two versions of this method with different parameters: one takes an Elements object containing one DOM tree per paragraph; the other takes an Element object that contains a single paragraph.

The first version just loops through the paragraphs. The second version does the real work.

void queueInternalLinks(Elements paragraphs) {
    for (Element paragraph: paragraphs) {
        queueInternalLinks(paragraph);
    }
}

private void queueInternalLinks(Element paragraph) {
    Elements elts = paragraph.select("a[href]");
    for (Element elt: elts) {
        String relURL = elt.attr("href");

        if (relURL.startsWith("/wiki/")) {
            String absURL = elt.attr("abs:href");
            queue.offer(absURL);
        }
    }
}

To determine whether a link is “internal,” we check whether the URL starts with “/wiki/”. This might include some pages we don’t want to index, like meta-pages about Wikipedia. And it might exclude some pages we want, like links to pages in non-English languages. But this simple test is good enough to get started.

That’s all there is to it. This exercise doesn’t have a lot of new material; it is mostly a chance to bring the pieces together.