17.3: Regex Syntax

We will now have a closer look at the syntax of regular expressions as supported by the Regex package.

The simplest regular expression is a single character. It matches exactly that character. A sequence of characters matches a string with exactly the same sequence of characters:

'a' matchesRegex: 'a'
>>> true

'foobar' matchesRegex: 'foobar'
>>> true

'blorple' matchesRegex: 'foobar'
>>> false

Operators are applied to regular expressions to produce more complex regular expressions. Sequencing (placing expressions one after another) as an operator is, in a certain sense, invisible — yet it is arguably the most common.

We have already seen the Kleene star (*) and the + operator. A regular expression followed by an asterisk matches any number (including zero) of consecutive occurrences of the original expression. For example:

'ab' matchesRegex: 'a*b'
>>> true

'aaaaab' matchesRegex: 'a*b'
>>> true

'b' matchesRegex: 'a*b'
>>> true

'aac' matchesRegex: 'a*b'
>>> false "b does not match"

The Kleene star has higher precedence than sequencing. A star applies to the shortest possible subexpression that precedes it. For example, ab* means a followed by zero or more occurrences of b, not zero or more occurrences of ab:

'abbb' matchesRegex: 'ab*'
>>> true

'abab' matchesRegex: 'ab*'
>>> false

To obtain a regex that matches zero or more occurrences of ab, we must enclose ab in parentheses:

'abab' matchesRegex: '(ab)*'
>>> true

'abcab' matchesRegex: '(ab)*'
>>> false "c spoils the fun"

Two other useful operators similar to * are + and ?. + matches one or more instances of the regex it modifies, and ? will match zero or one instance.

'ac' matchesRegex: 'ab*c'
>>> true

'ac' matchesRegex: 'ab+c'
>>> false "need at least one b"

'abbc' matchesRegex: 'ab+c'
>>> true

'abbc' matchesRegex: 'ab?c'
>>> false "too many b's"
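The examples above only show ? rejecting a string. As a small complementary sketch in the same notation (with >>> marking the expected result), these are the two cases it accepts:

```smalltalk
'ac' matchesRegex: 'ab?c'
>>> true "zero b's is allowed"

'abc' matchesRegex: 'ab?c'
>>> true "exactly one b is allowed"
```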

As we have seen, the characters *, +, ?, (, and ) have special meaning within regular expressions. To match any of them literally, escape it by preceding it with a backslash (\). The backslash is therefore also a special character and must itself be escaped for a literal match. The same holds for all further special characters we will see.

'ab*' matchesRegex: 'ab*'
>>> false "star in the right string is special"

'ab*' matchesRegex: 'ab\*'
>>> true

'a\c' matchesRegex: 'a\\c'
>>> true
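The same escaping applies to the other special characters seen so far. The following sketch uses illustrative strings that are not from the original text:

```smalltalk
'(a)' matchesRegex: '\(a\)'
>>> true "escaped parentheses match literally"

'a+b' matchesRegex: 'a\+b'
>>> true

'a+b' matchesRegex: 'a+b'
>>> false "unescaped + means one or more a's, followed by b"
```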

The last operator is |, which expresses choice between two subexpressions. It matches a string if either of the two subexpressions matches the string. It has the lowest precedence — even lower than sequencing. For example, ab*|ba* means a followed by any number of b’s, or b followed by any number of a’s:

'abb' matchesRegex: 'ab*|ba*'
>>> true

'baa' matchesRegex: 'ab*|ba*'
>>> true

'baab' matchesRegex: 'ab*|ba*'
>>> false
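Because | binds more loosely than sequencing, parentheses are needed whenever the choice should cover only part of the expression. The following sketch (the words are arbitrary illustrations) contrasts the grouped and ungrouped forms:

```smalltalk
'grey' matchesRegex: 'gr(a|e)y'
>>> true

'gray' matchesRegex: 'gr(a|e)y'
>>> true

'grey' matchesRegex: 'gra|ey'
>>> false "without parentheses the alternatives are gra and ey"

'ey' matchesRegex: 'gra|ey'
>>> true
```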

A slightly more complex example is the expression c(a|d)+r, which matches the name of any of the Lisp-style car, cdr, caar, cadr, ... functions:

'car' matchesRegex: 'c(a|d)+r'
>>> true

'cdr' matchesRegex: 'c(a|d)+r'
>>> true

'cadr' matchesRegex: 'c(a|d)+r'
>>> true

It is possible to write an expression that matches an empty string, for example the expression a| matches an empty string. However, it is an error to apply *, +, or ? to such an expression: (a|)* is invalid.
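As a small sketch of the first claim, the expression a| should accept either the single character a or the empty string, and nothing else:

```smalltalk
'' matchesRegex: 'a|'
>>> true "the right-hand alternative is empty"

'a' matchesRegex: 'a|'
>>> true

'b' matchesRegex: 'a|'
>>> false
```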

So far, we have used only characters as the smallest components of regular expressions. There are other, more interesting, components. A character set is a string of characters enclosed in square brackets. It matches any single character that appears between the brackets. For example, [01] matches either 0 or 1:

'0' matchesRegex: '[01]'
>>> true

'3' matchesRegex: '[01]'
>>> false

'11' matchesRegex: '[01]'
>>> false "a set matches only one character"

Using the plus operator, we can build the following binary number recognizer:

'10010100' matchesRegex: '[01]+'
>>> true

'10001210' matchesRegex: '[01]+'
>>> false

If the first character after the opening bracket is ^, the set is inverted: it matches any single character not appearing between the brackets:

'0' matchesRegex: '[^01]'
>>> false

'3' matchesRegex: '[^01]'
>>> true
>>> true

For convenience, a set may include ranges: pairs of characters separated by a hyphen (-). This is equivalent to listing all characters in between: '[0-9]' is the same as '[0123456789]'. Special characters within a set are ^, -, and ], which closes the set. Below are examples of how to match them literally in a set:

'^' matchesRegex: '[01^]'
>>> true "put the caret anywhere except the start"

'-' matchesRegex: '[01-]'
>>> true "put the hyphen at the end"

']' matchesRegex: '[]01]'
>>> true "put the closing bracket at the start"

Thus, empty and universal sets cannot be specified.
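Ranges themselves were not illustrated above. As a brief sketch with made-up input strings, a set may contain several ranges, and a range can be inverted with a leading caret (^) like any other set:

```smalltalk
'7' matchesRegex: '[0-9]'
>>> true

'c0ffee' matchesRegex: '[0-9a-f]+'
>>> true "two ranges in one set"

'coffee' matchesRegex: '[0-9a-f]+'
>>> false "o lies outside both ranges"

'x' matchesRegex: '[^0-9]'
>>> true "inverted range"
```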

Character classes

Regular expressions can also include the following backslash escapes to refer to common classes of characters: \w to match alphanumeric characters, \d to match digits, and \s to match whitespace. Their upper-case variants, \W, \D and \S, match the complementary characters (non-alphanumerics, non-digits and non-whitespace). Here is a summary of the syntax seen so far:

Table 17.3.1: Regex Syntax
Syntax	What it represents
a	literal match of character a
.	match any char (except newline)
(...)	group subexpression
\x	escape the following special character where ’x’ can be ’w’,’s’,’d’,’W’,’S’,’D’
*	Kleene star — match previous regex zero or more times
+	match previous regex one or more times
?	match previous regex zero times or once
|	match choice of left and right regex
[abcd]	match choice of characters abcd
[^abcd]	match negated choice of characters
[0-9]	match range of characters 0 to 9
\w	match alphanumeric
\W	match non-alphanumeric
\d	match digit
\D	match non-digit
\s	match space
\S	match non-space
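A few of the entries in this table have not yet been shown in action. The following sketch, using illustrative strings, exercises \w, \s, \D and the dot:

```smalltalk
'abc123' matchesRegex: '\w+'
>>> true

'abc 123' matchesRegex: '\w+'
>>> false "the space is not alphanumeric"

'abc 123' matchesRegex: '\w+\s\d+'
>>> true

'abc' matchesRegex: '\D+'
>>> true "no digits anywhere"

'a-c' matchesRegex: 'a.c'
>>> true "the dot matches any single character"
```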

As mentioned in the introduction, regular expressions are especially useful for validating user input, and character classes are particularly handy for defining such regexes. For example, non-negative integers can be matched with the regex \d+:

'42' matchesRegex: '\d+'
>>> true

'-1' matchesRegex: '\d+'
>>> false

Better yet, we might want to specify that non-zero numbers should not start with the digit 0:

'0' matchesRegex: '0|([1-9]\d*)'
>>> true

'1' matchesRegex: '0|([1-9]\d*)'
>>> true

'42' matchesRegex: '0|([1-9]\d*)'
>>> true

'099' matchesRegex: '0|([1-9]\d*)'
>>> false "leading 0"

We can check for negative and positive numbers as well:

'0' matchesRegex: '(0|((\+|-)?[1-9]\d*))'
>>> true

'-1' matchesRegex: '(0|((\+|-)?[1-9]\d*))'
>>> true

'42' matchesRegex: '(0|((\+|-)?[1-9]\d*))'
>>> true

'+99' matchesRegex: '(0|((\+|-)?[1-9]\d*))'
>>> true

'-0' matchesRegex: '(0|((\+|-)?[1-9]\d*))'
>>> false "negative zero"

'01' matchesRegex: '(0|((\+|-)?[1-9]\d*))'
>>> false "leading zero"

Floating point numbers should require at least one digit after the dot:

'0' matchesRegex: '(0|((\+|-)?[1-9]\d*))(\.\d+)?'
>>> true

'0.9' matchesRegex: '(0|((\+|-)?[1-9]\d*))(\.\d+)?'
>>> true

'3.14' matchesRegex: '(0|((\+|-)?[1-9]\d*))(\.\d+)?'
>>> true

'-42' matchesRegex: '(0|((\+|-)?[1-9]\d*))(\.\d+)?'
>>> true

'2.' matchesRegex: '(0|((\+|-)?[1-9]\d*))(\.\d+)?'
>>> false "need digits after ."

For dessert, here is a recognizer for a general number format: anything like 999, or 999.999, or -999.999e+21.

'-999.999e+21' matchesRegex: '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'
>>> true
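To see where this general recognizer draws the line, here is a sketch that runs the same regex over a few more inputs chosen for illustration. Unlike the stricter floating point regex above, it uses \d* after the dot and therefore accepts a bare trailing dot:

```smalltalk
'999' matchesRegex: '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'
>>> true

'999.999' matchesRegex: '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'
>>> true

'2.' matchesRegex: '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'
>>> true "digits after the dot are optional here"

'.5' matchesRegex: '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'
>>> false "at least one digit must precede the dot"

'1e' matchesRegex: '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'
>>> false "an exponent needs digits"
```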

Character classes can also include the following grep(1)-compatible elements:

Table 17.3.2: grep-compatible elements.
Syntax	What it represents
[:alnum:]	any alphanumeric
[:alpha:]	any alphabetic character
[:cntrl:]	any control character (ascii code below 32)
[:digit:]	any decimal digit
[:graph:]	any graphical character (ascii code above 32)
[:lower:]	any lowercase character
[:print:]	any printable character (here, the same as [:graph:])
[:punct:]	any punctuation character
[:space:]	any whitespace character
[:upper:]	any uppercase character
[:xdigit:]	any hexadecimal digit

Note that these elements are components of the character classes, i.e., they have to be enclosed in an extra set of square brackets to form a valid regular expression. For example, a non-empty string of digits would be represented as [[:digit:]]+. The above primitive expressions and operators are common to many implementations of regular expressions.

'42' matchesRegex: '[[:digit:]]+'
>>> true
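The other named elements are used the same way. This sketch, with illustrative strings, assumes they behave as described in the table above:

```smalltalk
'c0ffee' matchesRegex: '[[:xdigit:]]+'
>>> true

'hello' matchesRegex: '[[:lower:]]+'
>>> true

'Hello' matchesRegex: '[[:lower:]]+'
>>> false "H is uppercase"

'x42' matchesRegex: '[[:alpha:]][[:alnum:]]*'
>>> true "a letter followed by letters or digits"
```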

Special character classes

The next primitive expression is unique to this Smalltalk implementation. A sequence of characters between colons is treated as a unary selector that characters are expected to understand. A character matches such an expression if it answers true to a message with that selector. This allows a more readable and efficient way of specifying character classes. For example, [0-9] is equivalent to :isDigit:, but the latter is more efficient. Analogously to character sets, character classes can be negated: :^isDigit: matches a character that answers false to isDigit, and is therefore equivalent to [^0-9].

So far we have seen the following equivalent ways to write a regular expression that matches a non-empty string of digits: [0-9]+, \d+, [\d]+, [[:digit:]]+, :isDigit:+.

'42' matchesRegex: '[0-9]+'
>>> true

'42' matchesRegex: '\d+'
>>> true

'42' matchesRegex: '[\d]+'
>>> true

'42' matchesRegex: '[[:digit:]]+'
>>> true

'42' matchesRegex: ':isDigit:+'
>>> true
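Selector-based classes are not limited to isDigit. The sketch below assumes that Character understands isVowel, as it does in current Pharo images, and also shows the negated form described above:

```smalltalk
'aeiou' matchesRegex: ':isVowel:+'
>>> true "assumes Character>>isVowel is available"

'abc' matchesRegex: ':^isDigit:+'
>>> true "no character answers true to isDigit"

'a1c' matchesRegex: ':^isDigit:+'
>>> false "1 is a digit"
```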

Matching boundaries

The last group of special primitive expressions matches boundaries within strings.

Table 17.3.3: Boundary expressions.
Syntax	What it represents
^	match an empty string at the beginning of a line
$	match an empty string at the end of a line
\b	match an empty string at a word boundary
\B	match an empty string not at a word boundary
\<	match an empty string at the beginning of a word
\>	match an empty string at the end of a word

'hello world' matchesRegex: '.*\bw.*'
>>> true "word boundary before w"

'hello world' matchesRegex: '.*\bo.*'
>>> false "no boundary before o"
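The remaining boundary expressions can be tried in the same way. This sketch assumes they behave as the table describes. Since matchesRegex: already requires the whole string to match, the line anchors add nothing in the first example:

```smalltalk
'hello' matchesRegex: '^hello$'
>>> true

'hello world' matchesRegex: '.*\Bo.*'
>>> true "the o in hello sits inside a word"

'hello world' matchesRegex: '.*\<w.*'
>>> true "w begins a word"

'hello world' matchesRegex: '.*\<o.*'
>>> false "no word begins with o"
```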