17.2: Tutorial Example - Generating a Site Map
Our job is to write a simple application that will generate a site map for a web site that we have stored locally on our hard drive. The site map will contain links to each of the HTML files in the web site, using the title of the document as the text of the link. Furthermore, links will be indented to reflect the directory structure of the web site.
Accessing the web directory
To do
If you do not have a web site on your machine, copy a few HTML files to a local directory to serve as a test bed.
We will develop two classes, WebDir and WebPage, to represent directories and web pages. The idea is to create an instance of WebDir which will point to the root directory containing our web site. When we send it the message makeToc, it will walk through the files and directories inside it to build up the site map. It will then create a new file, called toc.html, containing links to all the pages in the web site.
One thing we will have to watch out for: each WebDir and WebPage must remember the path to the root of the web site, so it can properly generate links relative to the root.
To do
Define the class WebDir with instance variables webDir and homePath, and define the appropriate initialization method.
Also define class-side methods to prompt the user for the location of the web site on your computer, as follows:
WebDir >> setDir: dir home: path
    webDir := dir.
    homePath := path

WebDir class >> onDir: dir
    ^ self new setDir: dir home: dir pathString

WebDir class >> selectHome
    ^ self onDir: UIManager default chooseDirectory
The last method opens a browser to select the directory to open. Now, if you inspect the result of WebDir selectHome, you will be prompted for the directory containing your web pages, and you will be able to verify that webDir and homePath are properly initialized to the directory holding your web site and the full path name of this directory.

It would be nice to be able to programmatically instantiate a WebDir, so let’s add another creation method.
To do
Add the following methods and try them out by inspecting the result of WebDir onPath: 'path to your web site'.
WebDir class >> onPath: homePath
    ^ self onPath: homePath home: homePath

WebDir class >> onPath: path home: homePath
    ^ self new setDir: path asFileReference home: homePath
Pattern matching HTML files
So far so good. Now we would like to use regexes to find out which HTML files this web site contains.
If we browse the FileReference class, we find that the method fileNames will list all the files in a directory. We want to select just those with the file extension .html. The regex that we need is '.*\.html'. The first dot will match any character except a newline:
'x' matchesRegex: '.' >>> true
' ' matchesRegex: '.' >>> true
Character cr asString matchesRegex: '.' >>> true
The * (known as the Kleene star, after Stephen Kleene, who invented it) is a regex operator that will match the preceding regex any number of times (including zero).
'' matchesRegex: 'x*' >>> true
'x' matchesRegex: 'x*' >>> true
'xx' matchesRegex: 'x*' >>> true
'y' matchesRegex: 'x*' >>> false
Since the dot is a special character in regexes, if we want to literally match a dot, then we must escape it.
'.' matchesRegex: '.' >>> true
'x' matchesRegex: '.' >>> true
'.' matchesRegex: '\.' >>> true
'x' matchesRegex: '\.' >>> false
Now let’s check that our regex for finding HTML files works as expected.
'index.html' matchesRegex: '.*\.html' >>> true
'foo.html' matchesRegex: '.*\.html' >>> true
'style.css' matchesRegex: '.*\.html' >>> false
'index.htm' matchesRegex: '.*\.html' >>> false
Looks good. Now let’s try it out in our application.
To do
Add the following method to WebDir and try it out on your test web site.
WebDir >> htmlFiles
    ^ webDir fileNames select: [ :each | each matchesRegex: '.*\.html' ]
If you send htmlFiles to a WebDir instance and print it, you should see something like this:
(WebDir onPath: '...') htmlFiles >>> #('index.html' ...)
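One property of matchesRegex: matters for htmlFiles: the regex has to match the whole file name, not just part of it. The lines below are a small sketch to evaluate in a Playground. The pattern '.*\.html?' is a hypothetical variation, not part of the tutorial, that would also accept the .htm extension by making the final l optional.

```smalltalk
"matchesRegex: must match the entire string, so a trailing suffix defeats the match."
'index.html.bak' matchesRegex: '.*\.html'.    "expected: false"

"Hypothetical variation: the ? operator makes the preceding l optional."
'index.htm' matchesRegex: '.*\.html?'.        "expected: true"
'index.html' matchesRegex: '.*\.html?'.       "expected: true"
```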
Caching the regex
Now, if you browse matchesRegex:, you will discover that it is an extension method of String that creates a fresh instance of RxParser every time it is sent. That is fine for ad hoc queries, but if we are applying the same regex to every file in a web site, it is smarter to create just one instance of RxParser and reuse it. Let’s do that.
To do
Add a new instance variable htmlRegex to WebDir and initialize it by sending asRegex to our regex string. Modify WebDir>>htmlFiles to use the same regex each time as follows:
WebDir >> initialize
    htmlRegex := '.*\.html' asRegex

WebDir >> htmlFiles
    ^ webDir fileNames select: [ :each | htmlRegex matches: each ]
Now listing the HTML files should work just as it did before, except that we reuse the same regex object many times.
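The same idea can be tried outside the class. The following Playground sketch compiles the regex once with asRegex and then sends matches: to that single object for every name. The array of file names is made up for illustration.

```smalltalk
| htmlRegex names |
"Compile the regex once..."
htmlRegex := '.*\.html' asRegex.

"...and reuse the same matcher for every (made-up) file name."
names := #('index.html' 'style.css' 'about.html').
names select: [ :each | htmlRegex matches: each ]
"expected: #('index.html' 'about.html')"
```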
Accessing web pages
Accessing the details of individual web pages should be the responsibility of a separate class, so let’s define it, and let the WebDir class create the instances.
To do
Define a class WebPage with instance variables path, to identify the HTML file, and homePath, to identify the root directory of the web site. (We will need this to correctly generate links from the root of the web site to the files it contains.) Define an initialization method on the instance side and a creation method on the class side.
WebPage >> initializePath: filePath homePath: dirPath
    path := filePath.
    homePath := dirPath

WebPage class >> on: filePath forHome: homePath
    ^ self new initializePath: filePath homePath: homePath
A WebDir instance should be able to return a list of all the web pages it contains.
To do
Add the following method to WebDir, and inspect the return value to verify that it works correctly.
WebDir >> webPages
    ^ self htmlFiles collect: [ :each |
        WebPage on: webDir pathString, '/', each forHome: homePath ]
You should see something like this:
(WebDir onPath: '...') webPages >>> an Array(a WebPage a WebPage ...)
String substitutions
That’s not very informative, so let’s use a regex to get the actual file name for each web page. To do this, we want to strip away all the characters from the path name up to the last directory. On a Unix file system directories end with a slash (/), so we need to delete everything up to the last slash in the file path.
The String extension method copyWithRegex:matchesReplacedWith: does what we want:
'hello' copyWithRegex: '[elo]+' matchesReplacedWith: 'i' >>> 'hi'
In this example the regex [elo] matches any of the characters e, l or o. The operator + is like the Kleene star, but it matches one or more instances of the regex preceding it. Here it will match the entire substring 'ello' and replace it in a fresh string with the letter i.
To do
Add the following method and verify that it works as expected.
WebPage >> fileName
    ^ path copyWithRegex: '.*/' matchesReplacedWith: ''
Now you should see something like this on your test web site:
(WebDir onPath: '...') webPages collect: [:each | each fileName ] >>> #('index.html' ...)
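To see why '.*/' strips every leading directory and not just the first one, it helps to try it on a nested path. The sketch below uses a made-up path. Because the star is greedy, the match extends to the last slash in the string.

```smalltalk
"The leading .* is greedy, so the match runs up to the LAST slash."
'/home/testweb/docs/index.html' copyWithRegex: '.*/' matchesReplacedWith: ''.
"expected: 'index.html'"

"A name without any slash has no match and comes back unchanged."
'index.html' copyWithRegex: '.*/' matchesReplacedWith: ''.
"expected: 'index.html'"
```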
Extracting regex matches
Our next task is to extract the title of each HTML page. First we need a way to get at the contents of each page. This is straightforward.
To do
Add the following method and try it out.
WebPage >> contents
    ^ (FileStream oldFileOrNoneNamed: path) contents
Actually, you might have problems if your web pages contain non-ASCII characters, in which case you might be better off with the following code:
WebPage >> contents
    ^ (FileStream oldFileOrNoneNamed: path)
        converter: Latin1TextConverter new;
        contents
You should now be able to see something like this:
(WebDir onPath: '...') webPages first contents >>> '<head> <title>Home Page</title> ... '
Now let’s extract the title. In this case we are looking for the text that occurs between the HTML tags <title> and </title>.
What we need is a way to extract part of the match of a regular expression. Subexpressions of regexes are delimited by parentheses. Consider the regex ([CARETaeiou]+)([aeiou]+); it consists of two subexpressions, the first of which will match a sequence of one or more non-vowels, and the second of which will match one or more vowels: the operator CARET at the start of a bracketed set of characters negates the set. (NB: In Pharo the caret is also the return keyword, which we write as ^. To avoid confusion, we will write CARET when we are using the caret within regular expressions to negate sets of characters, but do not forget that they are the same character: in actual code, type ^ wherever CARET appears.) Now we will try to match a prefix of the string 'pharo' and extract the submatches:
| re |
re := '([CARETaeiou]+)([aeiou]+)' asRegex.
re matchesPrefix: 'pharo' >>> true
re subexpression: 1 >>> 'pha'
re subexpression: 2 >>> 'ph'
re subexpression: 3 >>> 'a'
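After successfully matching a regex against a string, you can always send it the message subexpression: 1 to extract the entire match. You can also send subexpression: n for larger n, up to one more than the number of parenthesized subexpressions in the regex: subexpression: 2 answers the match of the first parenthesized subexpression, subexpression: 3 that of the second, and so on. The regex above has two subexpressions, numbered 2 and 3.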
We will use the same trick to extract the title from an HTML file.
To do
Define the following method:
WebPage >> title
    | re |
    re := '[\w\W]*<title>(.*)</title>' asRegexIgnoringCase.
    ^ (re matchesPrefix: self contents)
        ifTrue: [ re subexpression: 2 ]
        ifFalse: [ '(', self fileName, ' -- untitled)' ]
There are a couple of subtle points to notice here. First, HTML does not care whether tags are upper or lower case, so we must make our regex case insensitive by instantiating it with asRegexIgnoringCase.
Second, since dot matches any character except a newline, the regex .*<title>(.*)</title> will not work as expected if multiple lines appear before the title. The regex \w matches any alphanumeric, and \W will match any non-alphanumeric, so [\w\W] will match any character including newlines. (If we expect titles to possibly contain newlines, we should play the same trick with the subexpression.)
Now we can test our title extractor, and we should see something like this:
(WebDir onPath: '...') webPages first title >>> 'Home Page'
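The title regex can also be tried on a literal string, without touching any file. The sketch below uses a made-up three-line HTML fragment with an upper-case tag, so it exercises both the case-insensitive matching and the [\w\W]* prefix that has to skip over a line break. The fallback string '(untitled)' is a simplification of the one used in WebPage>>title.

```smalltalk
| re html |
"A made-up page: the title sits on the second line, in upper-case tags."
html := '<html>
<head><TITLE>Home Page</TITLE></head>
</html>'.

re := '[\w\W]*<title>(.*)</title>' asRegexIgnoringCase.
(re matchesPrefix: html)
    ifTrue: [ re subexpression: 2 ]    "the text captured by (.*)"
    ifFalse: [ '(untitled)' ]
"expected: 'Home Page'"
```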
More string substitutions
In order to generate our site map, we need to generate links to the individual web pages. We can use the document title as the name of the link. We just need to generate the right path to the web page from the root of the web site. Luckily this is trivial: it is simply the full path to the web page minus the full path to the root directory of the web site.
We must only watch out for one thing. Since the homePath variable does not end in a /, we must append one, so that the relative path does not include a leading /. Notice the difference between the following two results:
'/home/testweb/index.html' copyWithRegex: '/home/testweb' matchesReplacedWith: '' >>> '/index.html'
'/home/testweb/index.html' copyWithRegex: '/home/testweb/' matchesReplacedWith: '' >>> 'index.html'
The first result would give us an absolute path, which is probably not what we want.
To do
Define the following methods:
WebPage >> relativePath
    ^ path copyWithRegex: homePath, '/' matchesReplacedWith: ''

WebPage >> link
    ^ '<a href="', self relativePath, '">', self title, '</a>'
You should now be able to see something like this:
(WebDir onPath: '...') webPages first link >>> '<a href="index.html">Home Page</a>'
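One caveat about relativePath: homePath is handed to copyWithRegex:matchesReplacedWith: as a regex, so any regex operators that happen to occur in the directory name (a dot, a +, a parenthesis) are interpreted as operators, not as literal characters. The following is a sketch of an alternative that uses plain string operations instead, assuming the path really does start with the home path. It is not part of the tutorial's code, and the example paths are made up.

```smalltalk
| path homePath prefix |
path := '/home/test.web/docs/index.html'.
homePath := '/home/test.web'.

"Strip the home path plus the separating slash, but only if it is really a prefix."
prefix := homePath, '/'.
(path beginsWith: prefix)
    ifTrue: [ path allButFirst: prefix size ]
    ifFalse: [ path ]
"expected: 'docs/index.html'"
```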
Generating the site map
Actually, we are now done with the regular expressions we need to generate the site map. We just need a few more methods to complete the application.
To do
If you want to see the site map generation, just add the following methods.
If our web site has subdirectories, we need a way to access them:
WebDir >> webDirs
    ^ webDir directoryNames collect: [ :each |
        WebDir onPath: webDir pathString, '/', each home: homePath ]
We need to generate HTML bullet lists containing links for each web page of a web directory. Subdirectories should be indented in their own bullet list.
WebDir >> printTocOn: aStream
    self htmlFiles ifNotEmpty: [
        aStream nextPutAll: '<ul>'; cr.
        self webPages do: [ :each |
            aStream nextPutAll: '<li>'; nextPutAll: each link; nextPutAll: '</li>'; cr ].
        self webDirs do: [ :each | each printTocOn: aStream ].
        aStream nextPutAll: '</ul>'; cr ]
We create a file called toc.html in the root web directory and dump the site map there.
WebDir >> tocFileName
    ^ 'toc.html'

WebDir >> makeToc
    | tocStream |
    tocStream := (webDir / self tocFileName) writeStream.
    self printTocOn: tocStream.
    tocStream close.
Now we can generate a table of contents for an arbitrary web directory!
WebDir selectHome makeToc
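Since printTocOn: only needs a stream, the site map can also be previewed without writing toc.html. The sketch below, meant for a Playground, renders it into a String. It assumes the WebDir methods defined above are in place, and '...' stands for the path to your own test web site.

```smalltalk
"Render the site map into a String instead of a file.
 Replace '...' with the path to your own test web site."
String streamContents: [ :stream |
    (WebDir onPath: '...') printTocOn: stream ]
```

Inspecting or printing the result shows the nested bullet lists that makeToc would otherwise dump into the file.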
