17.4: Regex API

Up to now we have focused mainly on the syntax of regexes. Now we will have a closer look at the different messages understood by strings and regexes.

Matching prefixes and ignoring case

So far most of our examples have used the String extension method matchesRegex:.

Strings also understand the following messages: prefixMatchesRegex:, matchesRegexIgnoringCase: and prefixMatchesRegexIgnoringCase:.

The message prefixMatchesRegex: is just like matchesRegex:, except that the whole receiver is not expected to match the regular expression passed as the argument; matching just a prefix of it is enough.

'abacus' matchesRegex: '(a|b)+'
>>> false

'abacus' prefixMatchesRegex: '(a|b)+'
>>> true

'ABBA' matchesRegexIgnoringCase: '(a|b)+'
>>> true

'Abacus' matchesRegexIgnoringCase: '(a|b)+'
>>> false

'Abacus' prefixMatchesRegexIgnoringCase: '(a|b)+'
>>> true
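
These tests combine naturally with ordinary collection messages. A minimal sketch, with a sample array made up for illustration:

"keep only the strings that begin with a run of a's and b's"
#('abacus' 'cab' 'bat') select: [:each | each prefixMatchesRegex: '(a|b)+']
>>> #('abacus' 'bat')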

Enumeration interface

Some applications need to access all matches of a certain regular expression within a string. The matches are accessible using a protocol modeled after the familiar Collection-like enumeration protocol.

regex:matchesDo: evaluates a one-argument aBlock for every match of the regular expression within the receiver string.

| list |
list := OrderedCollection new.
'Jack meet Jill' regex: '\w+' matchesDo: [:word | list add: word].
list
>>> an OrderedCollection('Jack' 'meet' 'Jill')

regex:matchesCollect: evaluates a one-argument aBlock for every match of the regular expression within the receiver string. It then collects the results and answers them as a SequenceableCollection.

'Jack meet Jill' regex: '\w+' matchesCollect: [:word | word size]
>>> an OrderedCollection(4 4 4)

allRegexMatches: returns a collection of all matches (substrings of the receiver string) of the regular expression.

'Jack and Jill went up the hill' allRegexMatches: '\w+'
>>> an OrderedCollection('Jack' 'and' 'Jill' 'went' 'up' 'the' 'hill')
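
As a small illustration of the enumeration messages working together with other collections, the following sketch tallies word lengths in a Bag. The variable name lengths is only illustrative:

| lengths |
lengths := Bag new.
"add the size of every word found in the receiver"
'Jack and Jill went up the hill' regex: '\w+' matchesDo: [:word | lengths add: word size].
lengths occurrencesOf: 4 "Jack, Jill, went and hill"
>>> 4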

Replacement and translation

It is possible to replace all matches of a regular expression with a certain string using the message copyWithRegex:matchesReplacedWith:.

'Krazy hates Ignatz' copyWithRegex: '\<[[:lower:]]+\>'
    matchesReplacedWith: 'loves'
>>> 'Krazy loves Ignatz'

A more general substitution is match translation, performed by the message copyWithRegex:matchesTranslatedUsing:. This message evaluates a block, passing it each match of the regular expression in the receiver string, and answers a copy of the receiver with the block results spliced into it in place of the respective matches.

'Krazy loves Ignatz' copyWithRegex: '\b[a-z]+\b'
    matchesTranslatedUsing: [:each | each asUppercase]
>>> 'Krazy LOVES Ignatz'
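
The translation block may compute anything, as long as it answers a string. A sketch that doubles every number in a made-up sentence, using the standard asInteger and printString conversions:

'Ignatz throws 5 bricks' copyWithRegex: '\d+'
    matchesTranslatedUsing: [:each | (each asInteger * 2) printString] "each is the matched substring"
>>> 'Ignatz throws 10 bricks'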

All messages of the enumeration and replacement protocols perform a case-sensitive match. Case-insensitive versions are not provided as part of the String protocol. Instead, they are accessible using the lower-level matching interface presented in the following section.
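
For instance, a case-insensitive replacement can be sketched with a matcher built by asRegexIgnoringCase and the matcher message copy:replacingMatchesWith:, both covered later in this section. The sample strings are made up for illustration:

"the matcher ignores case, so IGNATZ is matched by 'ignatz'"
'ignatz' asRegexIgnoringCase
    copy: 'Krazy loves IGNATZ'
    replacingMatchesWith: 'Offissa Pupp'
>>> 'Krazy loves Offissa Pupp'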

Lower-level interface

When you send the message matchesRegex: to a string, the following happens:

A fresh instance of RxParser is created, and the regular expression string is passed to it, yielding the expression’s syntax tree.
The syntax tree is passed as an initialization parameter to an instance of RxMatcher. The instance sets up some data structure that will work as a recognizer for the regular expression described by the tree.
The original string is passed to the matcher, and the matcher checks for a match.
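
The three steps can be spelled out by hand. This is only a sketch: it assumes that the parser understands parse: and that RxMatcher offers the class-side constructor for:, as in the Regex package shipped with Pharo.

| tree matcher |
tree := RxParser new parse: '(a|b)+'. "step 1: parse the regex string into a syntax tree"
matcher := RxMatcher for: tree. "step 2: build a matcher from the tree"
matcher matches: 'abba' "step 3: check the string against the matcher"
>>> true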

The Matcher

If you repeatedly match a number of strings against the same regular expression using one of the messages defined in String, the regular expression string is parsed and a new matcher is created for every match. You can avoid this overhead by building a matcher for the regular expression, and then reusing the matcher over and over again. You can, for example, create a matcher at a class or instance initialization stage, and store it in a variable for future use. You can create a matcher using one of the following methods:

You can send asRegex or asRegexIgnoringCase to the string.
You can directly invoke the RxMatcher constructor methods forString: or forString:ignoreCase: (which is what the convenience methods above will do).

Here we send matchesIn: to collect all the matches found in a string:

| octal |
octal := '8r[0-7]+' asRegex.
octal matchesIn: '8r52 = 16r2A'
>>> an OrderedCollection('8r52')

| hex |
hex := '16r[0-9A-F]+' asRegexIgnoringCase.
hex matchesIn: '8r52 = 16r2A'
>>> an OrderedCollection('16r2A')

| hex |
hex := RxMatcher forString: '16r[0-9A-Fa-f]+' ignoreCase: true.
hex matchesIn: '8r52 = 16r2A'
>>> an OrderedCollection('16r2A')
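
The benefit of a stored matcher shows when many strings are tested against it: the regex is parsed only once. A sketch using the matcher message matches: (described below) on a made-up list of literals:

| hex |
hex := '16r[0-9A-F]+' asRegexIgnoringCase. "parsed and compiled once"
#('16r2A' '8r52' '16rff') select: [:each | hex matches: each] "reused for every element"
>>> #('16r2A' '16rff')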

Matching

A matcher understands these messages (all of them answer true to indicate a successful match or search, and false otherwise):

matches: aString — true if the whole argument string (aString) matches.

'\w+' asRegex matches: 'Krazy'
>>> true

matchesPrefix: aString — true if some prefix of the argument string (not necessarily the whole string) matches.

'\w+' asRegex matchesPrefix: 'Ignatz hates Krazy'
>>> true

search: aString — Search the string for the first occurrence of a matching substring. Note that the first two methods only try matching from the very beginning of the string. For example, with a matcher for a+, this method would answer success given the string 'baaa', while the previous two would fail.

'\b[a-z]+\b' asRegex search: 'Ignatz hates Krazy'
>>> true "finds 'hates'"
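
The difference between the three messages can be checked directly with a matcher for a+ and the string 'baaa':

| as |
as := 'a+' asRegex.
as matches: 'baaa' "the whole string must match"
>>> false
as matchesPrefix: 'baaa' "a match must start at the first character"
>>> false
as search: 'baaa' "a match may start anywhere"
>>> true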

The matcher also stores the outcome of the last match attempt and can report it. lastResult answers a Boolean, the outcome of the most recent match attempt. If no matches have been attempted, the answer is unspecified.

| number |
number := '\d+' asRegex.
number search: 'Ignatz throws 5 bricks'.
number lastResult
>>> true

matchesStream:, matchesStreamPrefix: and searchStream: are analogous to the above three messages, but take a stream as their argument.

| ignatz names |
ignatz := ReadStream on: 'Ignatz throws bricks at Krazy'.
names := '\<[A-Z][a-z]+\>' asRegex.
names matchesStreamPrefix: ignatz
>>> true
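
Searching a stream works the same way. A short sketch with searchStream: on a made-up sentence:

| stream number |
stream := ReadStream on: 'Ignatz throws 5 bricks'.
number := '\d+' asRegex.
number searchStream: stream "looks for the first run of digits anywhere in the stream"
>>> true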

Subexpression matches

After a successful match attempt, you can query which part of the original string has matched which part of the regex. A subexpression is a parenthesized part of a regular expression, or the whole expression. When a regular expression is compiled, its subexpressions are assigned indices starting from 1, depth-first, left-to-right.

For example, the regex ((\d+)\s*(\w+)) has four subexpressions, including itself.

1: ((\d+)\s*(\w+)) "the complete expression"
2: (\d+)\s*(\w+) "top parenthesized subexpression"
3: \d+ "first leaf subexpression"
4: \w+ "second leaf subexpression"

The highest valid index is equal to 1 plus the number of pairs of parentheses. (So, 1 is always a valid index, even if there are no parenthesized subexpressions.)

After a successful match, the matcher can report what part of the original string matched what subexpression. It understands these messages:

subexpressionCount answers the total number of subexpressions: the highest value that can be used as a subexpression index with this matcher. This value is available immediately after initialization and never changes.

subexpression: takes a valid index as its argument, and may be sent only after a successful match attempt. The method answers the substring of the original string that the corresponding subexpression matched.

subBeginning: and subEnd: answer the positions within the argument string or stream where the given subexpression match has started and ended, respectively.

| items |
items := '((\d+)\s*(\w+))' asRegex.
items search: 'Ignatz throws 1 brick at Krazy'.
items subexpressionCount
>>> 4
items subexpression: 1
>>> '1 brick' "complete expression"
items subexpression: 2
>>> '1 brick' "top subexpression"
items subexpression: 3
>>> '1' "first leaf subexpression"
items subexpression: 4
>>> 'brick' "second leaf subexpression"
items subBeginning: 3
>>> 14
items subEnd: 3
>>> 15
items subBeginning: 4
>>> 16
items subEnd: 4
>>> 21
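
Subexpressions make it easy to pull structured data out of a string. The following sketch drops the outer parentheses, so the indices shift down by one, and builds an association from the item to its quantity. The sentence and the use of an Association are illustrative choices:

| items |
items := '(\d+)\s*(\w+)' asRegex. "1: whole match, 2: quantity, 3: item"
(items search: 'Ignatz throws 3 bricks at Krazy')
    ifTrue: [(items subexpression: 3) -> (items subexpression: 2) asInteger]
    ifFalse: [nil]
>>> 'bricks'->3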

As a more elaborate example, the following piece of code uses an MMM DD, YYYY date format recognizer (limited to the years 1900 to 1999) to convert a date to a three-element array with two-digit year, month, and day strings:

| date result |
date :=
    '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d\d?)\s*,\s*19(\d\d)'
    asRegex.
result := (date matches: 'Aug 6, 1996')
    ifTrue: [{ (date subexpression: 4) .
        (date subexpression: 2) .
        (date subexpression: 3) } ]
    ifFalse: ['no match'].
result
>>> #('96' 'Aug' '6')
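
The recognizer above accepts only years of the form 19YY. A variant sketch that captures any four-digit year whole; the (\d\d\d\d) group and the sample date are illustrative changes, not part of the original example:

| date |
date :=
    '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d\d?)\s*,\s*(\d\d\d\d)'
    asRegex.
(date matches: 'Aug 6, 2024')
    ifTrue: [{ (date subexpression: 4) . "year"
        (date subexpression: 2) . "month"
        (date subexpression: 3) } ] "day"
    ifFalse: ['no match']
>>> #('2024' 'Aug' '6')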

Enumeration and Replacement

The String enumeration and replacement protocols that we saw earlier in this section are actually implemented by the matcher. RxMatcher implements the following methods for iterating over matches within strings: matchesIn:, matchesIn:do:, matchesIn:collect:, copy:replacingMatchesWith: and copy:translatingMatchesUsing:.

| seuss aWords |
seuss := 'The cat in the hat is back'.
aWords := '\<([^aeiou]|[a])+\>' asRegex. "match words whose only lowercase vowel is 'a'"
aWords matchesIn: seuss
>>> an OrderedCollection('cat' 'hat' 'back')
aWords matchesIn: seuss collect: [:each | each asUppercase ]
>>> an OrderedCollection('CAT' 'HAT' 'BACK')
aWords copy: seuss replacingMatchesWith: 'grinch'
>>> 'The grinch in the grinch is grinch'
aWords copy: seuss translatingMatchesUsing: [ :each | each asUppercase ]
>>> 'The CAT in the HAT is BACK'
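
The remaining message of this group, matchesIn:do:, evaluates a block for each match without building a collection. A sketch that sums the lengths of the matched words, with count as an illustrative variable:

| seuss aWords count |
seuss := 'The cat in the hat is back'.
aWords := '\<([^aeiou]|[a])+\>' asRegex.
count := 0.
aWords matchesIn: seuss do: [:each | count := count + each size]. "cat, hat, back"
count
>>> 10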

There are also the following methods for iterating over matches within streams: matchesOnStream:, matchesOnStream:do:, matchesOnStream:collect:, copyStream:to:replacingMatchesWith: and copyStream:to:translatingMatchesUsing:.
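
A sketch of two of the stream variants, assuming they behave like their string counterparts; the sample texts are made up for illustration:

| words digits out |
words := '\w+' asRegex.
words matchesOnStream: (ReadStream on: 'Jack and Jill')
>>> an OrderedCollection('Jack' 'and' 'Jill')
digits := '\d+' asRegex.
out := WriteStream on: String new.
"copy the input to out, writing # in place of every run of digits"
digits copyStream: (ReadStream on: 'call 555 now') to: out replacingMatchesWith: '#'.
out contents
>>> 'call # now'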

Error Handling

Several exceptions may be raised by RxParser when building regexes. The exceptions have the common parent RegexError. You may use the usual Smalltalk exception handling mechanism to catch and handle them.

RegexSyntaxError is raised if a syntax error is detected while parsing a regex.
RegexCompilationError is raised if an error is detected while building a matcher.
RegexMatchingError is raised if an error occurs while matching (for example, if a bad selector was specified using ':<selector>:' syntax, or because of the matcher’s internal error).

['+' asRegex] on: RegexError do: [:ex | ^ ex printString ]
>>> 'RegexSyntaxError: nullable closure'
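
A handler can also recover instead of reporting the error. The following sketch wraps matcher creation in a block that answers nil for a malformed regex; the name safeRegex is hypothetical:

| safeRegex |
"answer a matcher, or nil if the pattern cannot be parsed"
safeRegex := [:pattern |
    [pattern asRegex]
        on: RegexSyntaxError
        do: [:ex | ex return: nil]].
(safeRegex value: '+') isNil
>>> true
(safeRegex value: '\w+') isNil
>>> false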