17.2: Tutorial Example - Generating a Site Map

Our job is to write a simple application that will generate a site map for a web site that we have stored locally on our hard drive. The site map will contain links to each of the HTML files in the web site, using the title of the document as the text of the link. Furthermore, links will be indented to reflect the directory structure of the web site.

Accessing the web directory

To do

If you do not have a web site on your machine, copy a few HTML files to a local directory to serve as a test bed.

We will develop two classes, WebDir and WebPage, to represent directories and web pages. The idea is to create an instance of WebDir which will point to the root directory containing our web site. When we send it the message makeToc, it will walk through the files and directories inside it to build up the site map. It will then create a new file, called toc.html, containing links to all the pages in the web site.

One thing we will have to watch out for: each WebDir and WebPage must remember the path to the root of the web site, so it can properly generate links relative to the root.

To do

Define the class WebDir with instance variables webDir and homePath, and define the appropriate initialization method.

Also define class-side methods to prompt the user for the location of the web site on your computer, as follows:

WebDir >> setDir: dir home: path
    webDir := dir.
    homePath := path

WebDir class >> onDir: dir
    ^ self new setDir: dir home: dir pathString

WebDir class >> selectHome
    ^ self onDir: UIManager default chooseDirectory

The last method opens a file browser in which you select the directory. Now, if you inspect the result of WebDir selectHome, you will be prompted for the directory containing your web pages. In the inspector you can verify that webDir holds the directory of your web site and that homePath holds the full path name of this directory.

Figure \(\PageIndex{1}\): A WebDir instance.

It would be nice to be able to programmatically instantiate a WebDir, so let’s add another creation method.

To do

Add the following methods and try them out by inspecting the result of WebDir onPath: 'path to your web site'.

WebDir class >> onPath: homePath
    ^ self onPath: homePath home: homePath

WebDir class >> onPath: path home: homePath
    ^ self new setDir: path asFileReference home: homePath

Pattern matching HTML files

So far so good. Now we would like to use regexes to find out which HTML files this web site contains.

If we browse the FileReference class (the class of the object returned by asFileReference), we find that the method fileNames will list all the files in a directory. We want to select just those with the file extension .html. The regex that we need is '.*\.html'. The first dot will match any character except a newline:

'x' matchesRegex: '.'
>>> true

' ' matchesRegex: '.'
>>> true

Character cr asString matchesRegex: '.'
>>> false

The * (known as the Kleene star, after Stephen Kleene, who invented it) is a regex operator that will match the preceding regex any number of times (including zero).

'' matchesRegex: 'x*'
>>> true

'x' matchesRegex: 'x*'
>>> true

'xx' matchesRegex: 'x*'
>>> true

'y' matchesRegex: 'x*'
>>> false

Since the dot is a special character in regexes, if we want to literally match a dot, then we must escape it.

'.' matchesRegex: '.'
>>> true

'x' matchesRegex: '.'
>>> true

'.' matchesRegex: '\.'
>>> true

'x' matchesRegex: '\.'
>>> false

Now let’s check that our regex for finding HTML files works as expected.

'index.html' matchesRegex: '.*\.html'
>>> true

'foo.html' matchesRegex: '.*\.html'
>>> true

'style.css' matchesRegex: '.*\.html'
>>> false

'index.htm' matchesRegex: '.*\.html'
>>> false
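One detail these tests rely on is that matchesRegex: must match the whole receiver, not just part of it. The following sketch contrasts it with prefixMatchesRegex:, which only needs a match at the start of the string. The file name 'index.html.bak' is made up for illustration.

```smalltalk
"matchesRegex: succeeds only if the regex matches the entire receiver."
'index.html.bak' matchesRegex: '.*\.html'
>>> false

"prefixMatchesRegex: only requires a match starting at the first character."
'index.html.bak' prefixMatchesRegex: '.*\.html'
>>> true
```

This is why a backup file such as index.html.bak would not be picked up as an HTML file.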

Looks good. Now let’s try it out in our application.

To do

Add the following method to WebDir and try it out on your test web site.

WebDir >> htmlFiles
    ^ webDir fileNames select: [ :each | each matchesRegex: '.*\.html' ]

If you send htmlFiles to a WebDir instance and print it, you should see something like this:

(WebDir onPath: '...') htmlFiles
>>> #('index.html' ...)

Caching the regex

Now, if you browse matchesRegex:, you will discover that it is an extension method of String that creates a fresh instance of RxParser every time it is sent. That is fine for ad hoc queries, but if we are applying the same regex to every file in a web site, it is smarter to create just one instance of RxParser and reuse it. Let’s do that.

To do

Add a new instance variable htmlRegex to WebDir and initialize it by sending asRegex to our regex string. Modify WebDir>>htmlFiles to use the same regex each time as follows:

WebDir >> initialize
    super initialize.
    htmlRegex := '.*\.html' asRegex

WebDir >> htmlFiles
    ^ webDir fileNames select: [ :each | htmlRegex matches: each ]

Now listing the HTML files should work just as it did before, except that we reuse the same regex object many times.
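The same idea can be tried in a playground without a WebDir. This is a minimal sketch in which the file names are invented; the regex is compiled once with asRegex and the resulting matcher is then sent matches: for each name.

```smalltalk
| htmlRegex names |
"Compile the regex a single time."
htmlRegex := '.*\.html' asRegex.
"Hypothetical file names, standing in for the result of webDir fileNames."
names := #('index.html' 'style.css' 'about.html').
"Reuse the same matcher for every element."
names select: [ :each | htmlRegex matches: each ]
>>> #('index.html' 'about.html')
```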

Accessing web pages

Accessing the details of individual web pages should be the responsibility of a separate class, so let’s define it, and let the WebDir class create the instances.

To do

Define a class WebPage with instance variables path, to identify the HTML file, and homePath, to identify the root directory of the web site. (We will need this to correctly generate links from the root of the web site to the files it contains.) Define an initialization method on the instance side and a creation method on the class side.

WebPage >> initializePath: filePath homePath: dirPath
    path := filePath.
    homePath := dirPath

WebPage class >> on: filePath forHome: homePath
    ^ self new initializePath: filePath homePath: homePath

A WebDir instance should be able to return a list of all the web pages it contains.

To do

Add the following method to WebDir, and inspect the return value to verify that it works correctly.

WebDir >> webPages
    ^ self htmlFiles collect:
        [ :each | WebPage
            on: webDir pathString, '/', each
            forHome: homePath ]

You should see something like this:

(WebDir onPath: '...') webPages
>>> an Array(a WebPage a WebPage ...)

String substitutions

That’s not very informative, so let’s use a regex to get the actual file name for each web page. To do this, we want to strip away all the characters from the path name up to the last directory. On a Unix file system each directory name in a path is followed by a slash (/), so we need to delete everything up to and including the last slash in the file path.

The String extension method copyWithRegex:matchesReplacedWith: does what we want:

'hello' copyWithRegex: '[elo]+' matchesReplacedWith: 'i'
>>> 'hi'

In this example the regex [elo] matches any of the characters e, l or o. The operator + is like the Kleene star, but it matches one or more instances of the regex preceding it (never zero). Here it will match the entire substring 'ello' and replace it in a fresh string with the letter i.
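Applied to a file path, the same message can strip the directory part. The sketch below uses an illustrative path; because * is greedy, '.*/' matches everything up to and including the last slash, and a string without any slash is returned unchanged.

```smalltalk
"Everything up to the last slash is replaced by the empty string."
'/home/testweb/docs/index.html' copyWithRegex: '.*/' matchesReplacedWith: ''
>>> 'index.html'

"No slash, so there is nothing to replace."
'index.html' copyWithRegex: '.*/' matchesReplacedWith: ''
>>> 'index.html'
```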

To do

Add the following method and verify that it works as expected.

WebPage >> fileName
    ^ path copyWithRegex: '.*/' matchesReplacedWith: ''

Now you should see something like this on your test web site:

(WebDir onPath: '...') webPages collect: [:each | each fileName ]
>>> #('index.html' ...)

Extracting regex matches

Our next task is to extract the title of each HTML page. First we need a way to get at the contents of each page. This is straightforward.

To do

Add the following method and try it out.

WebPage >> contents
    ^ (FileStream oldFileOrNoneNamed: path) contents

Actually, you might have problems if your web pages contain non-ASCII characters, in which case you might be better off with the following code:

WebPage >> contents
    ^ (FileStream oldFileOrNoneNamed: path)
        converter: Latin1TextConverter new;
        contents

You should now be able to see something like this:

(WebDir onPath: '...') webPages first contents
>>> '<head>
<title>Home Page</title>
...
'
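If FileStream is not available in your image, a possible alternative is to go through a FileReference, as the WebDir creation methods already do with asFileReference. This is only a sketch: it assumes that FileReference understands contents and that its default text encoding suits your pages.

```smalltalk
WebPage >> contents
    "Assumption: path is a String and FileReference>>contents is available.
    The encoding used is whatever the image applies by default."
    ^ path asFileReference contents
```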

Now let’s extract the title. In this case we are looking for the text that occurs between the HTML tags <title> and </title>.

What we need is a way to extract part of the match of a regular expression. Subexpressions of regexes are delimited by parentheses. Consider the regex ([^aeiou]+)([aeiou]+); it consists of two subexpressions, the first of which will match a sequence of one or more non-vowels, and the second of which will match one or more vowels: the operator ^ at the start of a bracketed set of characters negates the set. (NB: In Pharo the caret is also the return keyword. It is the same character, but inside a regex string it has nothing to do with returning a value; there it only negates a set of characters.) Now we will try to match a prefix of the string 'pharo' and extract the submatches:

| re |
re := '([^aeiou]+)([aeiou]+)' asRegex.
re matchesPrefix: 'pharo'
>>> true
re subexpression: 1
>>> 'pha'
re subexpression: 2
>>> 'ph'
re subexpression: 3
>>> 'a'

After successfully matching a regex against a string, you can always send it the message subexpression: 1 to extract the entire match. You can also send subexpression: n + 1 to extract the match of the n-th subexpression in the regex. The regex above has two subexpressions, numbered 2 and 3.
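As a further illustration of the numbering, here is a small sketch that splits a file name into its base name and extension. The regex and the file name are chosen for this example only and are not used elsewhere in the tutorial.

```smalltalk
| re |
"Two subexpressions: the word before the dot and the word after it."
re := '(\w+)\.(\w+)' asRegex.
re matches: 'index.html'
>>> true
"Index 1 is the entire match."
re subexpression: 1
>>> 'index.html'
"Indices 2 and 3 are the first and second parenthesized groups."
re subexpression: 2
>>> 'index'
re subexpression: 3
>>> 'html'
```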

We will use the same trick to extract the title from an HTML file.

To do

Define the following method:

WebPage >> title
    | re |
    re := '[\w\W]*<title>(.*)</title>' asRegexIgnoringCase.
    ^ (re matchesPrefix: self contents)
        ifTrue: [ re subexpression: 2 ]
        ifFalse: [ '(', self fileName, ' -- untitled)' ]

There are a couple of subtle points to notice here. First, HTML does not care whether tags are upper or lower case, so we must make our regex case insensitive by instantiating it with asRegexIgnoringCase.

Second, since dot matches any character except a newline, the regex .*<title>(.*)</title> will not work as expected if multiple lines appear before the title. The regex \w matches any alphanumeric, and \W will match any non-alphanumeric, so [\w\W] will match any character including newlines. (If we expect titles to possibly contain newlines, we should play the same trick with the subexpression.)

Now we can test our title extractor, and we should see something like this:

(WebDir onPath: '...') webPages first title
>>> 'Home Page'

More string substitutions

In order to generate our site map, we need to generate links to the individual web pages. We can use the document title as the name of the link. We just need to generate the right path to the web page from the root of the web site. Luckily this is trivial: it is simply the full path to the web page minus the full path to the root directory of the web site.

We must watch out for only one thing. Since the homePath variable does not end in a /, we must append one, so that the relative path does not include a leading /. Notice the difference between the following two results:

'/home/testweb/index.html' copyWithRegex: '/home/testweb'
    matchesReplacedWith: ''
>>> '/index.html'

'/home/testweb/index.html' copyWithRegex: '/home/testweb/'
    matchesReplacedWith: ''
>>> 'index.html'

The first result would give us an absolute path, which is probably not what we want.

To do

Define the following methods:

WebPage >> relativePath
    ^ path
        copyWithRegex: homePath, '/'
        matchesReplacedWith: ''

WebPage >> link
    ^ '<a href="', self relativePath, '">', self title, '</a>'

You should now be able to see something like this:

(WebDir onPath: '...') webPages first link
>>> '<a href="index.html">Home Page</a>'
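One caveat with relativePath is that homePath is interpreted as a regex, so any special character in the directory name (a dot, for example) is not matched literally. The sketch below illustrates this with made-up paths, and shows one possible literal alternative based on beginsWith: and allButFirst:. The alternative is an assumption of this example, not part of the tutorial's code.

```smalltalk
| path prefix |
"The dot in the pattern acts as a wildcard, so a different directory matches too."
'/home/testXweb/index.html' copyWithRegex: '/home/test.web/' matchesReplacedWith: ''
>>> 'index.html'

"Literal alternative: drop the prefix only if the path really starts with it."
path := '/home/test.web/index.html'.
prefix := '/home/test.web/'.
(path beginsWith: prefix)
    ifTrue: [ path allButFirst: prefix size ]
    ifFalse: [ path ]
>>> 'index.html'
```

For a test site whose path contains only letters, digits and slashes, the regex version and the literal version give the same result.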

Generating the site map

Actually, we are now done with the regular expressions we need to generate the site map. We just need a few more methods to complete the application.

To do

If you want to see the site map generation, just add the following methods.

If our web site has subdirectories, we need a way to access them:

WebDir >> webDirs
    ^ webDir directoryNames
        collect: [ :each | WebDir onPath: webDir pathString, '/', each
            home: homePath ]

We need to generate HTML bullet lists containing links for each web page of a web directory. Subdirectories should be indented in their own bullet list.

WebDir >> printTocOn: aStream
    self htmlFiles
        ifNotEmpty: [
            aStream nextPutAll: '<ul>'; cr.
            self webPages
                do: [:each | aStream nextPutAll: '<li>';
                    nextPutAll: each link;
                    nextPutAll: '</li>'; cr].
            self webDirs
                do: [:each | each printTocOn: aStream].
            aStream nextPutAll: '</ul>'; cr]
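Because printTocOn: writes to any stream, the generated HTML can be checked in a playground before a file is written. A minimal sketch, assuming the WebDir methods defined so far and using '...' for the path to your web site, as elsewhere in this section:

```smalltalk
"Collect the table of contents in a String instead of a file.
'...' is a placeholder for the path to your own test web site."
String streamContents: [ :stream |
    (WebDir onPath: '...') printTocOn: stream ]
```

Printing or inspecting the result shows the nested bullet lists, one per directory that contains HTML files.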

We create a file called toc.html in the root web directory and dump the site map there.

WebDir >> tocFileName
    ^ 'toc.html'

WebDir >> makeToc
    | tocStream |
    tocStream := (webDir / self tocFileName) writeStream.
    self printTocOn: tocStream.
    tocStream close.

Now we can generate a table of contents for an arbitrary web directory!

WebDir selectHome makeToc

Figure \(\PageIndex{2}\): A small site map.