4.5 Noncontextfree Languages
 Page ID
 6733
We have seen that there are contextfree languages that are not regular. The natural question arises, are there languages that are not contextfree? It’s easy to answer this question in the abstract: For a given alphabet Σ, there are uncountably many languages over Σ, but there are only countably many contextfree languages over Σ. It follows that most languages are not contextfree. However, this answer is not very satisfying since it doesn’t give us any example of a specific language that is not contextfree.
As in the case of regular languages, one way to show that a given lan guage L is not contextfree is to find some property that is shared by all contextfree languages and then to show that L does not have that prop erty. For regular languages, the Pumping Lemma gave us such a property. It turns out that there is a similar Pumping Lemma for contextfree lan guages. The proof of this lemma uses parse trees. In the proof, we will need a way of representing abstract parse trees, without showing all the details of the tree. The picture
represents a parse tree which has the nonterminal symbol A at its root and the string x along the “bottom” of the tree. (That is, x is the string made up of all the symbols at the endpoints of the tree’s branches, read from left to right.) Note that this could be a partial parse tree—something that could be a part of a larger tree. That is, we do not require A to be the start symbol of the grammar and we allow x to contain both terminal and nonterminal symbols. The string x, which is along the bottom of the tree, is referred to as the yield of the parse tree. Sometimes, we need to show more explicit detail in the tree. For example, the picture
represents a parse tree in which the yield is the string xyz. The string y is the yield of a smaller tree, with root B, which is contained within the larger tree. Note that any of the strings x, y, or z could be the empty string.
We will also need the concept of the height of a parse tree. The height of a parse tree is the length of the longest path from the root of the tree to the tip of one of its branches.
Like the version for regular languages, the Pumping Lemma for context free languages shows that any sufficiently long string in a contextfree language contains a pattern that can be repeated to produce new strings that are also in the language. However, the pattern in this case is more complicated. For regular languages, the pattern arises because any sufficiently long path through a given DFA must contain a loop. For contextfree languages, the pattern arises because in a sufficiently large parse tree, along a path from the root of the tree to the tip of one of its branches, there must be some nonterminal symbol that occurs more than once.
Theorem 4.7
(Pumping Lemma for Contextfree Languages). Suppose that L is a contextfree language. Then there is an integer K such that any string \(w \in L(G)\) with \(w \geq K\) has the property that w can be written in the form w = uxyzv where
 x and z are not both equal to the empty string;
 xyz < K; and

For any \(n \in \mathbb{N},\) the string \(u x^{n} y z^{n} v\) is in \(L\)
Proof. Let \(G=(V, \Sigma, P, S)\) be a contextfree grammar for the languageL. Let N be the number of nonterminal symbols in G, plus 1. That is, \(N=V+1\). Consider all possible parse trees for the grammar G with height less than or equal to N. (Include parse trees with any nonterminal symbol as root, not just parse trees with root S.) There are only finitely many such parse trees, and therefore there are only finitely many different strings that are the yields of such parse trees. Let K be an integer which is greater than the length of any such string.
Now suppose that w is any string in L whose length is greater than or equal to K. Then any parse tree for w must have height greater than N. (This follows since \(w \geq K\) and the yield of any parse tree of height \(\leq N\) has length less than \(K\). Consider a parse tree for w of minimal size, that is one that contains the smallest possible number of nodes. Since the height of this parse tree is greater than N, there is at least one path from the root of the tree to tip of a branch of the tree that has length greater than N. Consider the longest such path. The symbol at the tip of this path is a terminal symbol, but all the other symbols on the path are non terminal symbols. There are at least N such nonterminal symbols on the path. Since the number of different nonterminal symbols is V  and since \(N=V+1\), some nonterminal symbol must occur twice on the path. In fact, some nonterminal symbol must occur twice among the bottommostN nonterminal symbols on the path. Call this symbol A. Then we see that the parse tree for w has the form shown here:
The structure of this tree breaks the string w into five substrings, as shown in the above diagram. We then have w = uxyzv. It only remains to show that x, y, and z satisfy the three requirements stated in the theorem.
Let T refer to the entire parse tree, let \(T_{1}\) refer to the parse tree whose root is the upper A in the diagram, and let \(T_{2}\) be the parse tree whose root is the lower A in the diagram. Note that the height of is \(T_{1}\) less than or equal to N. (This follows from two facts: The path shown in T1 from its root to its base has length less than or equal to N, because we chose the two occurrences of A to be among the N bottommost nonterminal symbols along the path in T from its root to its base. We know that there is no longer path from the root of T1 to its base, since we chose the path in Tto be the longest possible path from the root of T to its base.) Since any parse tree with height less than or equal to N has yield of length less than \(K,\) we see that \(x y z<K\).
If we remove \(T_{1}\) from T and replace it with a copy of \(T_{2}\), the result is a parse tree with yield uyv, so we see that the string uyv is in the language L. Now, suppose that both x and z are equal to the empty string. In that case, w = uyv, so the tree we have created would be another parse tree for w. But this tree is smaller than T, so this would contradict the fact that T is the smallest parse tree for w. We see that x and z cannot both be the empty string.
If we remove \(T_{2}\) from T and replace it with a copy of \(T_{1}\), the result is a parse tree with yield \(u x^{2} y z^{2} v,\) so we see that \(u x^{2} y z^{2} v \in L .\) The two parse trees that we have created look like this:
Furthermore, we can apply the process of replacing \(T_{2}\) with a copy of \(T_{1}\) to the tree on the right above to create a parse tree with yield \(u x^{3} y z^{3} v\). Continuing in this way, we see that \(u x^{n} y z^{n} v \in L\) for all \(n \in \mathbb{N}\). This completes the proof of the theorem.
Since this theorem guarantees that all contextfree languages have a certain property, it can be used to show that specific languages are not contextfree. The method is to show that the language in question does not have the property that is guaranteed by the theorem. We give two examples.
Corollary 4.8.
Let L be the language \(\left\{a^{n} b^{n} c^{n}  n \in \mathbb{N}\right\}\). Then L is not a contextfree language.
Proof. We give a proof by contradiction. Suppose that L is contextfree. Then, by the Pumping Lemma for Contextfree Languages, there is an integer \(K\) such that every string \(w \in L\) with \(w \geq K\) can be written in the form \(w=u x y z v\) where \(x\) and \(z\) are not both empty, \(x y z<K,\) and \(u x^{n} y z^{n} v \in L\) for every \(n \in \mathbb{N}\).
Consider the string \(w=a^{K} b^{K} c^{K},\) which is in \(L,\) and write \(w=u x y z v\), where u, x, y, z, and v satisfy the stated conditions. Since \(x y z<K\) we see that if xyz contains an a, then it cannot contain a c. And if it contains a c, then it cannot contain an a. It is also possible that xyz is made up entirely of \(b^{\prime} s .\) In any of these cases, the string \(u x^{2} y z^{2} v\) cannot be in L, since it does not contain equal numbers of a’s, b’s, and c’s. But this contradicts the fact that \(u x^{n} y z^{n} v \in L\) for all \(n \in \mathbb{N} .\) This contradiction shows that the assumption that L is contextfree is incorrect.
Corollary 4.9.
Let Σ be any alphabet that contains at least two symbols. Let L be the language over Σ defined by \(L=\left\{s s  s \in \Sigma^{*}\right\}\). Then L is not contextfree.
Proof. Suppose, for the sake of contradiction, that L is contextfree. Then, by the Pumping Lemma for Contextfree Languages, there is an integer \(K\) such that every string \(w \in L\) with \(w \geq K\) can be written in the form \(w=u x y z v\) where \(x\) and \(z\) are not both empty, \(x y z<K,\) and \(u x^{n} y z^{n} v \in L\) for every \(n \in \mathbb{N}\).
Let \(a\) and \(b\) represent distinct symbols in \(\Sigma .\) Let \(s=a^{K} b a^{K} b\) and let \(w=s s=a^{K} b a^{K} b a^{K} b a^{K} b,\) which is in \(L .\) Write \(w=u x y z v,\) where \(u, x, y\) z, and v satisfy the stated conditions.
Since \(x y z<K, x\) and \(z\) can, together, contain no more than one \(b\). If either \(x\) or \(y\) contains a \(b,\) then \(u x^{2} y z^{2} v\) contains exactly five \(b\) 's. But any string in L is of the form rr for some string r and so contains an even number of \(b^{\prime} s .\) The fact that \(u x^{2} y z^{2} z\) contains five \(b^{\prime}\) s contradicts the fact that \(u x^{2} y z^{2} v \in L .\) So, we get a contradiction in the case where \(x\) or \(y\) contains a b.
Now, consider the case where x and y consist entirely of a’s. Again since \(x y z<K,\) we must have either that \(x\) and \(y\) are both contained in the same group of \(a^{\prime} s\) in the string \(a^{K} b a^{K} b a^{K} b a^{K} b,\) or that \(x\) is contained in one group of a’s and y is contained in the next. In either case, it is easy to check that the string \(u x^{2} y z^{2} v\) is no longer of the form \(r r\) for any string \(r,\) which contradicts the fact that \(u x^{2} y z^{2} v \in L\).
Since we are led to a contradiction in every case, we see that the as sumption that L is contextfree must be incorrect.
Now that we have some examples of languages that are not context free, we can settle some other questions about contextfree languages. In particular, we can show that the intersection of two contextfree languages is not necessarily contextfree and that the complement of a contextfree language is not necessarily contextfree.
Theorem 4.10.
The intersection of two contextfree languages is not nec essarily a contextfree language.
Proof. To prove this, it is only necessary to produce an example of two contextfree languages \(L\) and \(M\) such that \(L \cap M\) is not a contextfree languages. Consider the following languages, defined over the alphabet \(\Sigma=\{a, b, c\}\):
\(L=\left\{a^{n} b^{n} c^{m}  n \in \mathbb{N} \text { and } m \in \mathbb{N}\right\}\)
\(M=\left\{a^{n} b^{m} c^{m}  n \in \mathbb{N} \text { and } m \in \mathbb{N}\right\}\)
Note that strings in L have equal numbers of a’s and b’s while strings in \(M\) have equal numbers of \(b^{\prime} s\) and \(c^{\prime} s .\) It follows that strings in \(L \cap M\) have equal numbers of a’s, b’s, and c’s. That is,
\(L \cap M=\left\{a^{n} b^{n} c^{n}  n \in \mathbb{N}\right\}\)
We know from the above theorem that \(L \cap M\) is not contextfree. However, both L and M are contextfree. The language L is generated by the context free grammar
\(S \longrightarrow T C\)
\(C \longrightarrow c C\)
\(C \longrightarrow \varepsilon\)
\(T \longrightarrow a T b\)
\(T \longrightarrow \varepsilon\)
and M is generated by a similar contextfree grammar.
Corollary 4.11.
The complement of a contextfree language is not necessarily contextfree.
Proof. Suppose for the sake of contradiction that the complement of every contextfree language is contextfree.
Let L and M be two contextfree languages over the alphabet Σ. By our assumption, the complements \(\overline{L}\) and \(\overline{M}\) are contextfree. By Theorem 4.3, it follows that \(\overline{L} \cup \vec{M}\) is contextfree. Applying our assumption once again, we have that \(\overline{\overline{L} \cup \overline{M}}\) is contextfree. But \(\overline{\overline{L}} \cup \overline{M}=L \cap M,\) so we have that \(L \cap M\) is contextfree.
We have shown, based on our assumption that the complement of any contextfree language is contextfree, that the intersection of any two context free languages is contextfree. But this contradicts the previous theorem, so we see that the assumption cannot be true. This proves the theorem.
Note that the preceding theorem and corollary say only that \(L \cap M\) is not contextfree for some contextfree languages \(L\) and \(M\) and that \(\overline{L}\) is not contextfree for some contextfree language L. There are, of course, many examples of contextfree languages \(L\) and \(M\) for which \(L \cap M\) and \(\overline{L}\) are in fact contextfree.
Even though the intersection of two contextfree languages is not necessarily contextfree, it happens that the intersection of a contextfree language with a regular language is always contextfree. This is not difficult to show, and a proof is outlined in Exercise 4.4.8. I state it here without proof:
Theorem 4.12.
Suppose that L is a contextfree language and that M is a regular language. Then \(L \cap M\) is a contextfree language.
For example, let L and M be the languages defined by \(L=\{w \in\)\(\{a, b\}^{*}  w=w^{R} \}\) and \(M=\left\{w \in\{a, b\}^{*}  \text { the length of } w \text { is a multiple }\right.\)of 5\(\} .\) since \(L\) is contextfree and \(M\) is regular, we know that \(L \cap M\) is contextfree. The language \(L \cap M\) consists of every palindrome over the alphabet \(\{a, b\}\) whose length is a multiple of five.
This theorem can also be used to show that certain languages are not contextfree. For example, consider the language \(L=\left\{w \in\{a, b, c\}^{*}  n_{a}(w)\right.\)\(=n_{b}(w)=n_{c}(w) \} .\) (Recall that \(n_{x}(w)\) is the number of times that the symbol x occurs in the string w.) We can use a proof by contradiction to show that L is not contextfree. Let M be the regular language defined by the regular expression \(a^{*} b^{*} c^{*} .\) It is clear that \(L \cap M=\left\{a^{n} b^{n} c^{n}  n \in \mathbb{N}\right\} .\) If \(L\) were contextfree, then, by the previous theorem, \(L \cap M\) would be contextfree. However, we know from Theorem 4.8 that \(L \cap M\) is not contextfree. So we can conclude that L is not contextfree.
Exercises

Show that the following languages are not contextfree:
a) \(\left\{a^{n} b^{m} c^{k}  n>m>k\right\}\)
b) \(\left\{w \in\{a, b, c\}^{*}  n_{a}(w)>n_{b}(w)>n_{c}(w)\right\}\)
c) \(\left\{w w w  w \in\{a, b\}^{*}\right\}\)
d) \(\left\{a^{n} b^{m} c^{k}  n, m \in \mathbb{N} \text { and } k=m * n\right\}\)
e) \(\left\{a^{n} b^{m}  m=n^{2}\right\}\)
2. Show that the languages \(\left\{a^{n}  \mathrm{n} \text { is a prime number }\right\}\) and \(\left\{a^{n^{2}}  n \in \mathbb{N}\right\}\) are not contextfree. (In fact, it can be shown that a language over the alphabet{a} is contextfree if and only if it is regular.)
3. Show that the language \(\left\{w \in\{a, b\}^{*}  n_{a}(w)=n_{b}(w) \text { and } w \text { contains the }\right.\) string baaab as a substring} is contextfree.
4. Suppose that M is any finite language and that L is any contextfree language. Show that the language L \ M is contextfree. (Hint: Any finite language is a regular language.)