6.1: Search engines

A web search engine, like Google Search or Bing, takes a set of “search terms” and returns a list of web pages that are relevant to those terms (I’ll discuss what “relevant” means later). You can read more at thinkdast.com/searcheng, but I’ll explain what you need as we go along.

The essential components of a search engine are:

• Crawling: We’ll need a program that can download a web page, parse it, and extract the text and any links to other pages.
• Indexing: We’ll need a data structure that makes it possible to look up a search term and find the pages that contain it (a minimal sketch appears after this list).
• Retrieval: And we’ll need a way to collect results from the Index and identify pages that are most relevant to the search terms.
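
To make the indexing idea concrete, here is a minimal sketch, assuming the index is represented as a map from each search term to the set of URLs that contain it. The class and method names here are illustrative only; the book’s actual indexer is developed in later chapters.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Minimal sketch of an index: maps each term to the set of pages
    // that contain it. Illustrative only, not the book's implementation.
    public class SimpleIndex {
        private final Map<String, Set<String>> map = new HashMap<>();

        // Record that the page at `url` contains `term`.
        public void add(String term, String url) {
            map.computeIfAbsent(term, k -> new HashSet<>()).add(url);
        }

        // Look up the pages that contain `term`; returns an empty set if none.
        public Set<String> get(String term) {
            return map.getOrDefault(term, new HashSet<>());
        }
    }

A real index stores more than membership, for example how often each term appears on each page, but the core lookup structure is essentially this map.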

We’ll start with the crawler. The goal of a crawler is to discover and download a set of web pages. For search engines like Google and Bing, the goal is to find all web pages, but crawlers are often limited to a smaller domain. In our case, we will only read pages from Wikipedia.

As a first step, we’ll build a crawler that reads a Wikipedia page, finds the first link, follows the link to another page, and repeats. We will use this crawler to test the “Getting to Philosophy” conjecture, which states:

    Clicking on the first lowercase link in the main text of a Wikipedia article, and then repeating the process for subsequent articles, usually eventually gets one to the Philosophy article.

This conjecture is stated at thinkdast.com/getphil, and you can read its history there.
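
To make that first-link loop concrete, here is a rough sketch using jsoup, an HTML parsing library for Java. The starting URL, the iteration limit, and the CSS selector are assumptions for illustration; in particular, a faithful test of the conjecture also has to skip links that appear in parentheses or italics, which this sketch does not do.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class FirstLinkFollower {
        public static void main(String[] args) throws Exception {
            // Hypothetical starting page; any article should work.
            String url = "https://en.wikipedia.org/wiki/Java_(programming_language)";
            String goal = "https://en.wikipedia.org/wiki/Philosophy";

            // Follow first links until we reach Philosophy or give up.
            for (int i = 0; i < 50 && !url.equals(goal); i++) {
                System.out.println("Visiting " + url);
                Document doc = Jsoup.connect(url).get();

                // Grab the first article link in the page body. This selector is
                // a simplification: it does not skip parenthesized or italic links.
                Element link = doc.select("#mw-content-text p a[href^=/wiki/]").first();
                if (link == null) {
                    System.out.println("No link found; giving up.");
                    return;
                }
                url = link.absUrl("href");
            }
            System.out.println("Finished at " + url);
        }
    }

Running a sketch like this from a few different starting pages gives a quick, informal check of the conjecture before we build the more careful crawler.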

Testing the conjecture will allow us to build the basic pieces of a crawler without having to crawl the entire web, or even all of Wikipedia. And I think the exercise is kind of fun!

In a few chapters, we’ll work on the indexer, and then we’ll get to the retriever.