# 3.1: Languages

In formal language theory, an alphabet is a finite, non-empty set. The elements of the set are called symbols. A finite sequence of symbols $$a_{1} a_{2} \ldots a_{n}$$ from an alphabet is called a string over that alphabet.

Example $$3.1 . \Sigma=\{0,1\}$$ is an alphabet, and $$011,1010,$$ and 1 are all strings over $$\Sigma .$$

Note that strings really are sequences of symbols, which implies that order matters. Thus 011, 101, and 110 are all different strings, though they are made up of the same symbols. The strings $$x=a_{1} a_{2} \ldots a_{n}$$ and $$y=b_{1} b_{2} \dots b_{m}$$ are equal only if m = n (i.e. the strings contain the same number of symbols) and $$a_{i}=b_{i}$$ for all $$1 \leq i \leq n$$.

Just as there are operations defined on numbers, truth values, sets, and other mathematical entities, there are operations defined on strings. Some important operations are:

1. length: the length of a string x is the number of symbols in it. The notation for the length of x is $$|x|$$. Note that this is consistent with other uses of $$| |,$$ all of which involve some notion of size: $$|$$ number $$|$$ measures how big a number is (in terms of its distance from 0); $$|\operatorname{set}|$$ measures the size of a set (in terms of the number of elements).

We will occasionally refer to a length-n string. This is a slightly awkward, but concise, shorthand for “a string whose length is n”.

2. concatenation: the concatenation of two strings $$x=a_{1} a_{2} \dots a_{m}$$ and $$y=b_{1} b_{2} \ldots b_{n}$$ is the sequence of symbols $$a_{1} \ldots a_{m} b_{1} \ldots b_{n}$$. Some- times · is used to denote concatenation, but it is far more usual to see the concatenation of x and y denoted by xy than by x · y. You can easily convince yourself that concatenation is associative (i.e. $$(x y) z=x(y z)$$ for all strings $$x, y$$ and $$z .$$ Concatenation is not com- mutative (i.e. it is not always true that $$x y=y x :$$ for example, if $$x=a$$ and $$y=b$$ then $$x y=a b$$ while $$y x=b a$$ and, as discussed above, these strings are not equal.)

3. reversal: the reverse of a string $$x=a_{1} a_{2} \ldots a_{n}$$ is the string $$x^{R}=$$ $$a_{n} a_{n-1} \ldots a_{2} a_{1}$$.

Example $$3.2 .$$ Let $$\Sigma=\{a, b\}, x=a, y=a b a a,$$ and $$z=b a b .$$ Then $$|x|=1,|y|=4,$$ and $$|z|=3 .$$ Also, $$x x=a a, x y=$$ abaa $$, x z=a b a b,$$ and $$z x=$$ baba. Finally, $$x^{R}=a, y^{R}=a a b a,$$ and $$z^{R}=b a b$$.

By the way, the previous example illustrates a naming convention stan- dard throughout language theory texts: if a letter is intended to represent a single symbol in an alphabet, the convention is to use a letter from the beginning of the English alphabet (a, b, c, d ); if a letter is intended to represent a string, the convention is to use a letter from the end of the English alphabet (u, v, etc).

In set theory, we have a special symbol to designate the set that contains no elements. Similarly, language theory has a special symbol ε which is used to represent the empty string, the string with no symbols in it. (Some texts use the symbol $$\lambda$$ instead. ) It is worth noting that $$|\varepsilon|=0,$$ that $$\varepsilon^{R}=\varepsilon$$, and that $$\varepsilon \cdot x=x \cdot \varepsilon=x$$ for all strings $$x$$. (This last fact may appear a bit confusing. Remember that ε is not a symbol in a string with length 1, but rather the name given to the string made up of 0 symbols. Pasting those 0 symbols onto the front or back of a string x still produces x.)

The set of all strings over an alphabet $$\Sigma$$ is denoted $$\Sigma^{*} .$$ (In language theory, the symbol $$*$$ is typically used to denote "zero or more", so $$\Sigma^{*}$$ is the set of strings made up of zero or more symbols from \Sigma.) Note that while an alphabet $$\Sigma$$ is by definition a finite set of symbols, and strings are by definition finite sequences of those symbols, the set $$\Sigma^{*}$$ is always infinite. Why is this? Suppose $$\Sigma$$ contains $$n$$ elements. Then there is one string over $$\Sigma$$ with 0 symbols, $$n$$ strings with 1 symbol, $$n^{2}$$ strings with 2 symbols ( since there are $$n$$ choices for the first symbol and $$n$$ choices for the second), $$n^{3}$$ strings with 3 symbols, etc.

Example $$3.3 .$$ If $$\Sigma=\{1\},$$ then $$\Sigma^{*}=\{\varepsilon, 1,11,111, \ldots\} .$$ If $$\Sigma=\{a, b\}$$ then $$\Sigma^{*}=\{\varepsilon, a, b, a a, a b, b a, b b, \text { aaa, aab, } \ldots\} .$$

Note that $$\Sigma^{*}$$ is countably infinite: if we list the strings as in the preceding example (length-0 strings, length-1 strings in “alphabetical” order, length-2 strings similarly ordered, etc) then any string over $$\sum$$ will eventually appear. (In fact, if $$|\Sigma|=n \geq 2$$ and $$x \in \Sigma^{*}$$ has length $$k,$$ then $$x$$ will appear on the list within the first $$\frac{n^{k+1}-1}{n-1}$$ entries.

We now come to the definition of a language in the formal language theoretical sense.

Definition 3.1.

A language over an alphabet $$\Sigma$$ is a subset of $$\Sigma^{*} .$$ Thus, a language over $$\Sigma$$ is an element of $$\mathcal{P}\left(\Sigma^{*}\right),$$ the power set of $$\Sigma^{*} .$$

In other words, any set of strings (over alphabet $$\Sigma )$$ constitutes a language (over alphabet $$\Sigma )$$

Example $$3.4 .$$ Let $$\Sigma=\{0,1\} .$$ Then the following are all languages over $$\Sigma :$$

$$L_{1}=\{011,1010,111\}$$
$$L_{2}=\{0,10,110,1110,11110, \ldots\}$$
$$L_{3}=\left\{x \in \Sigma^{*} | n_{0}(x)=n_{1}(x)\right\}$$,

where the notation n0(x) stands for the number of 0’s in the string x, and similarly for n1(x).

$$L_{4}=\{x | \quad x \text { represents a multiple of } 5 \text { in binary }\}$$

Note that languages can be either finite or infinite. Because $$\Sigma^{*}$$ is infinite, it clearly has an infinite number of subsets, and so there are an infinite number of languages over Σ. But are there countably or uncountably many such languages?

Theorem 3.1.

For any alphabet $$\Sigma,$$ the number of languages over $$\Sigma$$ is uncountable.

This fact is an immediate consequence of the result, proved in a previ- ous chapter, that the power set of a countably infinite set is uncountable. Since the elements of $$\mathcal{P}(\Sigma)$$ are exactly the languages over $$\Sigma$$, there are uncountably many such languages.

Languages are sets and therefore, as for any sets, it makes sense to talk about the union, intersection, and complement of languages. (When taking the complement of a language over an alphabet $$\Sigma,$$ we always consider the univeral set to be $$\Sigma^{*},$$ the set of all strings over $$\Sigma$$ .) Because languages are sets of strings, there are additional operations that can be defined on languages, operations that would be meaningless on more general sets. For example, the idea of concatenation can be extended from strings to languages.

For two sets of strings S and T, we define the concatenation of S and T (denotedS·T or just ST)to be the set $$S T=\{s t | s \in S \wedge t \in$$$$\mathrm{~ T \} . ~ F o r ~ e x a m p l e , ~ i f ~} S=\{a b, a a b\}$$ and $$T=\{\varepsilon, 110,1010\},$$ then $$S T=$$$$\{a b, a b 110, a b 1010, a a b, a a b 110, a a b 1010\} .$$ Note in particular that $$a b \in S T$$ because $$a b \in S, \varepsilon \in T,$$ and $$a b \cdot \varepsilon=a b .$$ Because concatenation of sets is defined in terms of the concatenation of the strings that the sets contain, concatenation of sets is associative and not commutative. (This can easily be verified.)

When a set S is concatenated with itself, the notation SS is usually scrapped in favour of $$S^{2} ;$$ if $$S^{2}$$ is concatenated with $$S,$$ we write $$S^{3}$$ for the resulting set, etc. So $$S^{2}$$ is the set of all strings formed by concatenating two (possibly different, possibly identical) strings from $$S, S^{3}$$ is the set of strings formed by concatenating three strings from $$S,$$ etc. Extending this notation, we take $$S^{1}$$ to be the set of strings formed from one string in $$S$$ $$\left(\text { i.e. } S^{1} \text { is } S \text { itself }\right),$$ and $$S^{0}$$ to be the set of strings formed from zero strings in $$S$$ (i.e. $$S^{0}=\{\varepsilon\} ) .$$ If we take the union $$S^{0} \cup S^{1} \cup S^{2} \cup \ldots,$$ then the resulting set is the set of all strings formed by concatenating zero or more strings from $$S,$$ and is denoted $$S^{*} .$$ The set $$S^{*}$$ is called the Kleene closure of $$S,$$ and the $$*$$ operator is called the Kleene star operator.

Example $$3.5 .$$ Let $$S=\{01, b a\} .$$ Then

$$S^{0}=\{\varepsilon\}$$
$$S^{1}=\{01, b a\}$$
$$S^{2}=\{0101,01 b a, b a 0, \text { baba }\}$$
$$S^{3}=\{010101,0101 b a, 01 b a 01,01 b a b a, b a 0101, b a 010 a, b a b a 01, b a b a b a\}$$
etc, so
$$S^{*}=\{\varepsilon, 01, b a, 0101,01 b a, b a 01, b a b a, 010101,0101 b a, \ldots\}$$

Note that this is the second time we have seen the notation something*.
We have previously seen that for an alphabet $$\Sigma, \Sigma^{*}$$ is defined to be the set of all strings over $$\Sigma .$$ If you think of $$\Sigma$$ as being a set of length-1 strings, and take its Kleene closure, the result is once again the set of all strings over $$\Sigma,$$ and so the two notions of * coincide.

Example 3.6. Let $$\Sigma=\{a, b\}$$ Then

$$\Sigma^{0}=\{\varepsilon\}$$
$$\Sigma^{0}=\{\varepsilon\}$$
$$\Sigma^{1}=\{a, b\}$$
$$\Sigma^{2}=\{a a, a b, b a, b b\}$$
$$\Sigma^{3}=\{a a a, a a b, a b a, a b b, b a a, b a b, b b a, b b b\}$$
$$\Sigma^{*}=\{\varepsilon, a, b, a a, a b, b a, b b, a a a, a a b, a b a, a b b, b a a, b a b, \ldots\}$$

Exercises

1. Let $$S=\{\varepsilon, a b, a b a b\}$$ and $$T=\{a a, a b a, a b b a, a b b b a, \ldots\}$$. Find the following:

a) $$S^{2} \quad$$ b) $$S^{3} \quad$$ c) $$S^{*} \quad$$ d) $$S T \quad$$ e) $$T S$$

2. The reverse of a language $$L$$ is defined to be $$L^{R}=\left\{x^{R} | x \in L\right\}$$. Find $$S^{R}$$ and $$T^{R}$$ for the $$S$$ and $$T$$ in the preceding problem.

3. Give an example of a language $$L$$ such that $$L=L^{*}$$