Skip to main content
Engineering LibreTexts

16.1: Crawler solution

  • Page ID
    12827
  • First, let’s go over our solution to the previous exercise. I provided an outline of WikiCrawler; your job was to fill in crawl. As a reminder, here are the fields in the WikiCrawler class:

    public class WikiCrawler {
        // keeps track of where we started
        private final String source;
    
        // the index where the results go
        private JedisIndex index;
    
        // queue of URLs to be indexed
        private Queue<String> queue = new LinkedList<String>();
    
        // fetcher used to get pages from Wikipedia
        final static WikiFetcher wf = new WikiFetcher();
    }
    

    When we create a WikiCrawler, we provide source and index. Initially, queue contains only one element, source.

    Notice that the implementation of queue is a LinkedList, so we can add elements at the end — and remove them from the beginning — in constant time. By assigning a LinkedList object to a Queue variable, we limit ourselves to using methods in the Queue interface; specifically, we’ll use offer to add elements and poll to remove them.

    Here’s my implementation of WikiCrawler.crawl:

    public String crawl(boolean testing) throws IOException {
        if (queue.isEmpty()) {
            return null;
        }
        String url = queue.poll();
        System.out.println("Crawling " + url);
    
        if (testing==false && index.isIndexed(url)) {
            System.out.println("Already indexed.");
            return null;
        }
    
        Elements paragraphs;
        if (testing) {
            paragraphs = wf.readWikipedia(url);
        } else {
            paragraphs = wf.fetchWikipedia(url);
        }
        index.indexPage(url, paragraphs);
        queueInternalLinks(paragraphs);
        return url;
    }
    

     Most of the complexity in this method is there to make it easier to test. Here’s the logic:

    • If the queue is empty, it returns null to indicate that it did not index a page.
    • Otherwise it removes and stores the next URL from the queue.
    • If the URL has already been indexed, crawl doesn’t index it again, unless it’s in testing mode.
    • Next it reads the contents of the page: if it’s in testing mode, it reads from a file; otherwise it reads from the Web.
    • It indexes the page.
    • It parses the page and adds internal links to the queue.
    • Finally, it returns the URL of the page it indexed.

    I presented an implementation of Index.indexPage in Section 15.1. So the only new method is WikiCrawler.queueInternalLinks.

    I wrote two versions of this method with different parameters: one takes an Elements object containing one DOM tree per paragraph; the other takes an Element object that contains a single paragraph.

    The first version just loops through the paragraphs. The second version does the real work.

    void queueInternalLinks(Elements paragraphs) {
        for (Element paragraph: paragraphs) {
            queueInternalLinks(paragraph);
        }
    }
    
    private void queueInternalLinks(Element paragraph) {
        Elements elts = paragraph.select("a[href]");
        for (Element elt: elts) {
            String relURL = elt.attr("href");
    
            if (relURL.startsWith("/wiki/")) {
                String absURL = elt.attr("abs:href");
                queue.offer(absURL);
            }
        }
    }
    

    To determine whether a link is “internal,” we check whether the URL starts with “/wiki/”. This might include some pages we don’t want to index, like meta-pages about Wikipedia. And it might exclude some pages we want, like links to pages in non-English languages. But this simple test is good enough to get started.

    That’s all there is to it. This exercise doesn’t have a lot of new material; it is mostly a chance to bring the pieces together.