4.1: Context-free Grammars
- Page ID
- 6729
\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)
\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)
\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)
( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)
\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)
\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)
\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)
\( \newcommand{\Span}{\mathrm{span}}\)
\( \newcommand{\id}{\mathrm{id}}\)
\( \newcommand{\Span}{\mathrm{span}}\)
\( \newcommand{\kernel}{\mathrm{null}\,}\)
\( \newcommand{\range}{\mathrm{range}\,}\)
\( \newcommand{\RealPart}{\mathrm{Re}}\)
\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)
\( \newcommand{\Argument}{\mathrm{Arg}}\)
\( \newcommand{\norm}[1]{\| #1 \|}\)
\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)
\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)
\( \newcommand{\vectorA}[1]{\vec{#1}} % arrow\)
\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}} % arrow\)
\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)
\( \newcommand{\vectorC}[1]{\textbf{#1}} \)
\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)
\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)
\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)
\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)
\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)
\(\newcommand{\avec}{\mathbf a}\) \(\newcommand{\bvec}{\mathbf b}\) \(\newcommand{\cvec}{\mathbf c}\) \(\newcommand{\dvec}{\mathbf d}\) \(\newcommand{\dtil}{\widetilde{\mathbf d}}\) \(\newcommand{\evec}{\mathbf e}\) \(\newcommand{\fvec}{\mathbf f}\) \(\newcommand{\nvec}{\mathbf n}\) \(\newcommand{\pvec}{\mathbf p}\) \(\newcommand{\qvec}{\mathbf q}\) \(\newcommand{\svec}{\mathbf s}\) \(\newcommand{\tvec}{\mathbf t}\) \(\newcommand{\uvec}{\mathbf u}\) \(\newcommand{\vvec}{\mathbf v}\) \(\newcommand{\wvec}{\mathbf w}\) \(\newcommand{\xvec}{\mathbf x}\) \(\newcommand{\yvec}{\mathbf y}\) \(\newcommand{\zvec}{\mathbf z}\) \(\newcommand{\rvec}{\mathbf r}\) \(\newcommand{\mvec}{\mathbf m}\) \(\newcommand{\zerovec}{\mathbf 0}\) \(\newcommand{\onevec}{\mathbf 1}\) \(\newcommand{\real}{\mathbb R}\) \(\newcommand{\twovec}[2]{\left[\begin{array}{r}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\ctwovec}[2]{\left[\begin{array}{c}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\threevec}[3]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\cthreevec}[3]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\fourvec}[4]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\cfourvec}[4]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\fivevec}[5]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\cfivevec}[5]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\mattwo}[4]{\left[\begin{array}{rr}#1 \amp #2 \\ #3 \amp #4 \\ \end{array}\right]}\) \(\newcommand{\laspan}[1]{\text{Span}\{#1\}}\) \(\newcommand{\bcal}{\cal B}\) \(\newcommand{\ccal}{\cal C}\) \(\newcommand{\scal}{\cal S}\) \(\newcommand{\wcal}{\cal W}\) \(\newcommand{\ecal}{\cal E}\) \(\newcommand{\coords}[2]{\left\{#1\right\}_{#2}}\) \(\newcommand{\gray}[1]{\color{gray}{#1}}\) \(\newcommand{\lgray}[1]{\color{lightgray}{#1}}\) \(\newcommand{\rank}{\operatorname{rank}}\) \(\newcommand{\row}{\text{Row}}\) \(\newcommand{\col}{\text{Col}}\) \(\renewcommand{\row}{\text{Row}}\) \(\newcommand{\nul}{\text{Nul}}\) \(\newcommand{\var}{\text{Var}}\) \(\newcommand{\corr}{\text{corr}}\) \(\newcommand{\len}[1]{\left|#1\right|}\) \(\newcommand{\bbar}{\overline{\bvec}}\) \(\newcommand{\bhat}{\widehat{\bvec}}\) \(\newcommand{\bperp}{\bvec^\perp}\) \(\newcommand{\xhat}{\widehat{\xvec}}\) \(\newcommand{\vhat}{\widehat{\vvec}}\) \(\newcommand{\uhat}{\widehat{\uvec}}\) \(\newcommand{\what}{\widehat{\wvec}}\) \(\newcommand{\Sighat}{\widehat{\Sigma}}\) \(\newcommand{\lt}{<}\) \(\newcommand{\gt}{>}\) \(\newcommand{\amp}{&}\) \(\definecolor{fillinmathshade}{gray}{0.9}\)In its most general form, a grammar is a set of rewriting rules. A rewriting rule specifies that a certain string of symbols can be substituted for all or part of another string. If \(w\) and \(u\) are strings, then \(w \longrightarrow u\) is a rewriting rule that specifies that the string w can be replaced by the string u. The symbol \(" \longrightarrow\) " is read "can be rewritten as." Rewriting rules are also called production rules or productions, and \(" \longrightarrow\) can also be read as “produces.” For example, if we consider strings over the alphabet {a, b, c}, then the production rule \(a b a \longrightarrow c c\) can be applied to the string abbabacto give the string abbccc. The substring aba in the string abbabac has been replaced with cc.
In a context-free grammar, every rewriting rule has the form \(A \longrightarrow\)w, where A is single symbol and w is a string of zero or more symbols. (The grammar is “context-free” in the sense that w can be substituted for A wher- ever A occurs in a string, regardless of the surrounding context in which Aoccurs.) The symbols that occur on the left-hand sides of production rules in a context-free grammar are called non-terminal symbols. By convention, the non-terminal symbols are usually uppercase letters. The strings on the right-hand sides of the production rules can include non-terminal symbols as well as other symbols, which are called terminal symbols. By convention, the terminal symbols are usually lowercase letters. Here are some typical production rules that might occur in context-free grammars:
\(A \longrightarrow a A b B\)
\(S \longrightarrow S S\)
\(C \longrightarrow A c c\)
\(B \longrightarrow b\)
\(A \longrightarrow \varepsilon\)
In the last rule in this list, ε represents the empty string, as usual. For example, this rule could be applied to the string aBaAcA to produce the string aBacA. The first occurrence of the symbol A in aBaAcA has been replaced by the empty string—which is just another way of saying that the symbol has been dropped from the string.
In every context-free grammar, one of the non-terminal symbols is designated as the start symbol of the grammar. The start symbol is often, though not always, denoted by S. When the grammar is used to generate strings in a language, the idea is to start with a string consisting of nothing but the start symbol. Then a sequence of production rules is applied. Each application of a production rule to the string transforms the string to a new string. If and when this process produces a string that consists purely of terminal symbols, the process ends. That string of terminal symbols is one of the strings in the language generated by the grammar. In fact, the language consists precisely of all strings of terminal symbols that can be produced in this way.
As a simple example, consider a grammar that has three production rules: \(S \longrightarrow a S, S \longrightarrow b S,\) and \(S \longrightarrow b\). Inthisexample,Sistheonly non-terminal symbol, and the terminal symbols are a and b. Starting from the string S, we can apply any of the three rules of the grammar to produce either aS, bS, or b. Since the string b contains no non-terminals, we see that b is one of the strings in the language generated by this grammar. The strings aS and bS are not in that language, since they contain the non- terminal symbol S, but we can continue to apply production rules to these strings. From aS, for example, we can obtain aaS, abS, or ab. From abS, we go on to obtain abaS, abbS, or abb. The strings ab and abb are in the language generated by the grammar. It’s not hard to see that any string of a’s and b’s that ends with a b can be generated by this grammar, and that these are the only strings that can be generated. That is, the language generated by this grammar is the regular language specified by the regular expression \((a | b)^{*} b\). It’s time to give some formal definitions of the concepts which we have been discussing.
Definition 4.1.
A context-free grammar is a 4-tuple \((V, \Sigma, P, S),\) where:
1. V is a finite set of symbols. The elements of V are the non-terminal symbols of the grammar.
2.Σ is a finite set of symbols such that \(V \cap \Sigma=\emptyset\). The elements of Σ are the terminal symbols of the grammar.
3. P is a set of production rules. Each rule is of the form \(A \longrightarrow w\) where A is one of the symbols in V and w is a string in the language \((V \cup \Sigma)^{*}\).
4. \(S \in V . S\) is the start symbol of the grammar.
Even though this is the formal definition, grammars are often specified informally simply by listing the set of production rules. When this is done it is assumed, unless otherwise specified, that the non-terminal symbols are just the symbols that occur on the left-hand sides of production rules of the grammar. The terminal symbols are all the other symbols that occur on the right-hand sides of production rules. The start symbol is the symbol that occurs on the left-hand side of the first production rule in the list.Thus, the list of production rules
\(T \longrightarrow T T\)
\(T \longrightarrow A\)
\(A \longrightarrow a A a\)
\(A \longrightarrow b B\)
\(B \longrightarrow b B\)
\(B \longrightarrow \varepsilon\)
specifies a grammar \(G=(V, \Sigma, P, T)\) where \(V\) is \(\{T, A, B\}, \Sigma\) is \(\{a, b\},\) and T is the start symbol. P, of course, is a set containing the six production rules in the list.
Let \(G=(V, \Sigma, P, S)\) be a context-free grammar. Suppose that \(x\) and \(y\) are strings in the language \((V \cup \Sigma)^{*} .\) The notation \(x \Longrightarrow_{G} y\) is used to express the fact that y can be obtained from x by applying one of the production rules in P . To be more exact, we say that \(x \Longrightarrow G\) y if and only if there is a production rule \(A \longrightarrow w\) in the grammar and two strings uand v in the language \((V \cup \Sigma)^{*}\) such that x = uAv and y = uwv. The fact that x = uAv is just a way of saying that A occurs somewhere in x. When the production rule \(A \ is applied to substitute w for A in uAv, the result is uwv, which is y. Note that either u or v or both can be the empty string.
If a string y can be obtained from a string x by applying a sequence of zero or more production rules, we write \(x \Longrightarrow_{G}^{*}\) y. In most cases, the “G” in the notations \(\Longrightarrow G\) and \(\Longrightarrow_{G}^{*}\) will be omitted, assuming that the grammar in question is understood. Note that \(\Longrightarrow\) is a relation on the set \((V \cup \Sigma)^{*}\). The relation \(\Longrightarrow^{*}\) is the reflexive, transitive closure of that relation. (This explains the use of “∗”, which is usually used to denote the transitive, but not necessarily reflexive, closure of a relation. In this case, \(\Longrightarrow^{*}\) is reflexive as well as transitive since x =⇒∗x is true for any string x.) For example, using the grammar that is defined by the above list of production rules, we have
\(\begin{aligned} a T B & \Longrightarrow a T T B \\ & \Longrightarrow a T A B \\ & \Longrightarrow a T A b B \\ & \Longrightarrow a T b B b B \\ & \Longrightarrow a T b b B \end{aligned}\)
From this, it follows that \(a T B \Longrightarrow^{*} a T b b B .\) The relation \(\Longrightarrow\) is read "yields" or "produces" while \(\Longrightarrow\) can be read "yields in zero or more steps” or “produces in zero or more steps.” The following theorem states some simple facts about the relations \(\Longrightarrow\) and \(\Longrightarrow^{*} :\)
Theorem 4.1.
Let \(G\) be the context-free grammar \((V, \Sigma, P, S) .\) Then:
1. If \(x\) and \(y\) are strings in \((V \cup \Sigma)^{*}\) such that \(x \Longrightarrow y,\) then \(x \Longrightarrow^{*} y\)
2. If \(x, y,\) and \(z\) are strings in \((V \cup \Sigma)^{*}\) such that \(x \Longrightarrow^{*} y\) and \(y \Longrightarrow^{*} z\) then \(x \Longrightarrow^{*} z\)
3. If \(x\) and \(y\) are strings in \((V \cup \Sigma)^{*}\) such that \(x \Longrightarrow y,\) and if \(s\) and \(t\) are any strings in \((V \cup \Sigma)^{*},\) then \(s x t \Longrightarrow s y t\)
4. If \(x\) and \(y\) are strings in \((V \cup \Sigma)^{*}\) such that \(x \Longrightarrow^{*} y,\) and if \(s\) and \(t\) are any strings in \((V \cup \Sigma)^{*},\) then sxt \(\Longrightarrow^{*} s y t\)
Proof. Parts 1 and 2 follow from the fact that \(\Longrightarrow^{*}\) is the transitive closure of \(\Longrightarrow\) Part 4 follows easily from Part \(3 .\) (I leave this as an exercise.) To prove Part 3, suppose that x, y, s, and t are strings such that \(x \Longrightarrow y\). By definition, this means that there exist strings u and v and a production rule \(A \longrightarrow w\) such that \(x=u A v\) and \(y=u w v .\) But then we also have \(s x t=\) suAvt and syt = suwvt. These two equations, along with the existence of the production rule \(A \longrightarrow w\) show, by definition, that \(s x t \Longrightarrow s y t\).
We can use \(\Longrightarrow^{*}\) to give a formal definition of the language generated by a context-free grammar:
Definition 4.2.
Suppose that \(G=(V, \Sigma, P, S)\) is a context-free grammar.
Then the language generated by G is the language L(G) over the alphabet Σ defined by
\(L(G)=\left\{w \in \Sigma^{*} | S \Longrightarrow_{G}^{*} w\right\}\)
That is, L(G) contains any string of terminal symbols that can be obtained by starting with the string consisting of the start symbol, S, and applying a sequence of production rules.
A language L is said to be a context-free language if there is a context-free grammar G such that L(G) is L. Note that there might be many different context-free grammars that generate the same context-free language. Two context-free grammars that generate the same language are said to be equivalent.
Suppose G is a context-free grammar with start symbol S and suppose \(w \in L(G)\). By definition, this means that there is a sequence of one or more applications of production rules which produces the string w from S. This sequence has the form \(S \Longrightarrow x_{1} \Longrightarrow x_{2} \Longrightarrow \cdots \Longrightarrow w\). Such a sequence is called a derivation of w (in the grammar G). Note that w might have more than one derivation. That is, it might be possible to produce w in several different ways.
Consider the language \(L=\left\{a^{n} b^{n} | n \in \mathbb{N}\right\} .\) We already know that \(L\) is not a regular language. However, it is a context-free language. That is, there is a context-free grammar such that L is the language generated byG. This gives us our first theorem about grammars:
Theorem 4.2.
Let \(L\) be the language \(L=\left\{a^{n} b^{n} | n \in \mathbb{N}\right\} .\) Let \(G\) be the context-free grammar \((V, \Sigma, P, S)\) where \(V=\{S\}, \Sigma=\{a, b\}\) and \(P\) consists of the productions
\(S \longrightarrow a S b\)
\(S \longrightarrow \varepsilon\)
Then L = L(G), so that L is a context-free language. In particular, there exist context-free languages which are not regular.
Proof. To show that \(L=L(G),\) we must show both that \(L \subseteq L(G)\) and that \(L(G) \subseteq L .\) To show that \(L \subseteq L(G),\) let \(w\) be an arbitrary element of \(L . \quad\) By definition of \(L, w=a^{n} b^{n}\) for some \(n \in \mathbb{N} .\) We show that In the case where n=0,we have w=ε. Now, ε ∈ L(G) since ε can be produced from the start symbol S by an application of the rule \(S \longrightarrow \varepsilon,\) so our claim is true for \(n=0\). Now, suppose that \(k \in \mathbb{N}\) and that we already know that \(a^{k} b^{k} \in L(G)\). We must show that \(a^{k+1} b^{k+1} \in L(G) .\) since \(S \Longrightarrow^{*} a^{k} b^{k}\), we also have, by Theorem 4.1, that \(a S b \Longrightarrow^{*} a a^{k} b^{k} b\). That is, \(a S b \Longrightarrow^{*} a^{k+1} b^{k+1}\). Combining this with the production rule \(S \longrightarrow a S b,\) we see that \(S \Longrightarrow^{*} a^{k+1} b^{k+1}\). This means that \(a^{k+1} b^{k+1} \in L(G)\), as we wanted to show. This completes the proof that \(L \subseteq L(G)\).
To show that \(L(G) \subseteq L,\) suppose that \(w \in L(G) .\) That is, \(S \Longrightarrow^{*} w .\) We must show that \(w=a^{n} b^{n}\) for some \(n .\) since \(S \Longrightarrow^{*} w,\) there is a derivation \(S \Longrightarrow x_{0} \Longrightarrow x_{1} \Longrightarrow \cdots \Longrightarrow x_{n},\) where \(w=x_{n}\). We first prove by induction on n that in any derivation \(S \Longrightarrow x_{0} \Longrightarrow x_{1} \Longrightarrow \cdots \Longrightarrow x_{n}\), we must have either \(x_{n}=a^{n} b^{n}\) or \(x_{n}=a^{n+1} S b^{n+1}\). Consider the case n = 0. Suppose \(S \Longrightarrow x_{0}\). Then, we must have that \(S \longrightarrow x_{0}\) is a rule in the grammar, so \(x_{0}\) must be either \(\varepsilon\) or \(a S b .\) since \(\varepsilon=a^{0} b^{0}\) and \(a S b=a^{0+1} S b^{0+1}, x_{0}\) is of the required form. Next, consider the inductive case. Suppose that k > 1 and we already know that in any derivation \(S \Longrightarrow x_{0} \Longrightarrow x_{1} \Longrightarrow \cdots \Longrightarrow x_{k}\), we must have \(x_{k}=a^{k} b^{k}\) or \(x=a^{k+1} S b^{k+1}\). Suppose that \(S \Longrightarrow x_{0} \Longrightarrow\) \(x_{1} \Longrightarrow \cdots \Longrightarrow x_{k} \Longrightarrow x_{k+1}\). We know by induction that \(x_{k}=a^{k} b^{k}\) or \(x=a^{k+1} S b^{k+1},\) but since \(x_{k} \Longrightarrow x_{k+1}\) and \(a^{k} b^{k}\) contains no non-terminal symbols, we must have \(x_{k}=a^{k+1} S b^{k+1}\). Since \(x_{k+1}\) is obtained by applying one of the production rules \(S \longrightarrow \varepsilon\) or \(S \longrightarrow a S b\) to \(x_{k}, x_{k+1}\) is either \(a^{k+1} \varepsilon b^{k+1}\) or \(a^{k+1} a S b b^{k+1} .\) That is, \(x_{k+1}\) is either \(a^{k+1} b^{k+2} S b^{k+2}\), as we wanted to show. This completes the induction. Turning back to w, we see that \(w\) must be of the form \(a^{n} b^{n}\) or of the form \(a^{n} S b^{n}\). But since \(w \in L(G)\), it can contain no non-terminal symbols, so w must be of the form \(a^{n} b^{n},\) as we wanted to show. This completes the proof that \(L(G) \subseteq L\).
I have given a very formal and detailed proof of this theorem, to show how it can be done and to show how induction plays a role in many proofs about grammars. However, a more informal proof of the theorem would probably be acceptable and might even be more convincing. To show that \(L \subseteq L(G),\) we could just note that the derivation \(S \Longrightarrow a S b \Longrightarrow a^{2} S b^{2} \Longrightarrow\) \(\cdots \Longrightarrow a^{n} S b^{n} \Longrightarrow a^{n} b^{n}\) demonstrates that \(a^{n} b^{n} \in L .\) On the other hand, it is clear that every derivation for this grammar must be of this form, so every string in \(L(G)\) is of the form \(a^{n} b^{n}\).
For another example, consider the language \(\left\{a^{n} b^{m} | n \geq m \geq 0\right\}\). Let’s try to design a grammar that generates this language. This is similar to the previous example, but now we want to include strings that contain more a’s than b’s. The production rule \(S \longrightarrow a S b\) always produces the same number of a’s and b’s. Can we modify this idea to produce more a’s than b’s? One approach would be to produce a string containing just as manya’s as b’s, and then to add some extra a’s. A rule that can generate any number of a’s is \(A \longrightarrow a A .\) After applying the rule \(S \longrightarrow a S b\) for a while, we want to move to a new state in which we apply the rule \(A \longrightarrow a A\). We can get to the new state by applying a rule \(S \longrightarrow A\) that changes the Sinto an A. We still need a way to finish the process, which means getting rid of all non-terminal symbols in the string. For this, we can use the rule \(A \longrightarrow \varepsilon .\) Putting these rules together, we get the grammar.
\(S \longrightarrow a S b\)
\(S \longrightarrow A\)
\(A \longrightarrow a A\)
\(A \longrightarrow \varepsilon\)
This grammar does indeed generate the language \(\left\{a^{n} b^{m} | n \geq m \geq 0\right\}\) With slight variations on this grammar, we can produce other related lan- guages. For example, if we replace the rule \(A \longrightarrow \varepsilon\) with \(A \longrightarrow a\), we get the language \(\left\{a^{n} b^{m} | n>m \geq 0\right\}\).
There are other ways to generate the language \(\left\{a^{n} b^{m} | n \geq m \geq 0\right\}\). For example, the extra non-terminal symbol, A, is not really necessary, if we allow S to sometimes produce a single a without a b. This leads to the grammar.
\(S \longrightarrow a S b\)
\(S \longrightarrow a S\)
\(S \longrightarrow \varepsilon\)
(But note that the rule \(S \longrightarrow S a\) would not work in place of \(S \longrightarrow a S\) since it would allow the production of strings in which an a can follow a b, and there are no such strings in the language \(\left\{a^{n} b^{m} | n \geq m \geq 0\right\} . )\) And here are two more grammars that generate this language:
\(S \longrightarrow A B \qquad S \longrightarrow A S b\)
\(A \longrightarrow a A \qquad A \longrightarrow a A\)
\(B \longrightarrow a B b \qquad S \longrightarrow \varepsilon\)
\(A \longrightarrow \varepsilon \qquad A \longrightarrow \varepsilon\)
\(B \longrightarrow \varepsilon\)
Consider another variation on the language \(\left\{a^{n} b^{n} | n \in \mathbb{N}\right\}\), in which thea’s and b’s can occur in any order, but the number of a’s is still equal to the number of b’s. This language can be defined as \(L=\left\{w \in\{a, b\}^{*} | n_{a}(w)=\right.\) \(n_{b}(w) \}\). This language includes strings such as abbaab, baab, and bbbaaa.
Let's start with the grammar containing the rules \(S \longrightarrow a S b\) and \(S \longrightarrow\) \(\varepsilon .\) We can try adding the rule \(S \longrightarrow b S a\). Every string that can be generated using these three rules is in the language L. However, not every string in L can be generated. A derivation that starts with \(S \Longrightarrow a S b\) can only produce strings that begin with a and end with b. A derivation that starts with \(S \Longrightarrow b S a\) can only generate strings that begin with b and end with a. There is no way to generate the strings baab or abbbabaaba, which are in the language L. But we shall see that any string in L that begins and ends with the same letter can be written in the form xy where x and yare shorter strings in L. To produce strings of this form, we need one more rule, \(S \longrightarrow S S\). The complete set of production rules for the language L is
\(S \longrightarrow a S b\)
\(S \longrightarrow b S a\)
\(S \longrightarrow S S\)
\(S \longrightarrow \varepsilon\)
It’s easy to see that every string that can be generated using these rules is in L, since each rule introduces the same number of a’s as b’s. But we also need to check that every string w in L can be generated by these rules.
This can be done by induction on the length of w, using the second form of the principle of mathematical induction. In the base case, \(|w|=0\) and \(w=\varepsilon .\) In this case, \(w \in L\) since \(S \Longrightarrow \varepsilon\) in one step. Suppose \(|w|=k\), where \(k>0,\) and suppose that we already know that for any \(x \in L\) with \(|x|<k, S \Longrightarrow^{*} x .\) To finish the induction we must show, based on this induction hypothesis, that \(S \Longrightarrow^{*} w\).
Suppose that the first and last characters of w are different. Then w is either of the form axb or of the form bxa, for some string x. Let’s assume that w is of the form axb. (The case where w is of the form bxa is handled in a similar way.) Since w has the same number of a’s and b’s and sincex has one fewer a than w and one fewer b than w, x must also have the same number of a’s as b’s.That is \(x \in L .\) But \(|x|=|w|-2<k,\) so by the induction hypothesis, \(x \in L(G) .\) So we have \(S \Longrightarrow^{*} x\). By Theorem 4.1, we get then \(a S b \Longrightarrow^{*} a x b .\) Combining this with the fact that \(S \Longrightarrow a S b\), we get that \(S \Longrightarrow^{*} a x b,\) that is \(, S \Longrightarrow^{*} w .\) This proves that \(w \in L(G)\).
Finally, suppose that the first and last characters of w are the same. Let’s say that w begins and ends with a. (The case where w begins and ends with b is handled in a similar way.) I claim that w can be written in the form xy where \(x \in L(G)\) and \(y \in L(G)\) and neither x nor y is the empty string. This will finish the induction, since we will then have by the induction hypothesis that \(S \Longrightarrow^{*} x\) and \(S \Longrightarrow^{*} y\), and we can derive xy fromS by first applying the rule \(S \longrightarrow S S\) and then using the first \(S\) on the right-hand side to derive x and the second to derive y.
It only remains to figure out how to divide w into two strings x and ywhich are both in L(G). The technique that is used is one that is more generally useful. Suppose that \(w=c_{1} c_{2} \cdots c_{k},\) where each \(c_{i}\) is either \(a\) or \(b .\) Consider the sequence of integers \(r_{1}, r_{2}, \ldots, r_{k}\) where for each \(i=1,2, \ldots, k, r_{i}\) is the number of \(a^{\prime} \sin c_{1} c_{2} \cdots c_{i}\) minus the number of \(b^{\prime} \mathrm{s}\) in \(c_{1} c_{2} \cdots c_{i} .\) since \(c_{1}=a, r_{1}=1 .\) since \(w \in L, r_{k}=0 .\) And since \(c_{k}=a\), we must have \(r_{k-1}=r_{k}-1=-1 .\) Furthermore the difference between \(r_{i+1}\) and \(r_{i}\) is either 1 or \(-1,\) for \(i=1,2, \ldots, k-1\).
Since \(r_{1}=1\) and \(r_{k-1}=-1\) and the value of \(r_{i}\) goes up or down by 1 when \(i\) increases by \(1, r_{i}\) must be zero for some \(i\) between 1 and \(k-1\). That is, \(r_{i}\) cannot get from 1 to \(-1\) unless it passes through zero. Let \(i\) be a number between 1 and \(k-1\) such that \(r_{i}=0 .\) Let \(x=c_{1} c_{2} \cdots c_{i}\) and let \(y=c_{i+1} c_{i+2} \cdots c_{k} .\) Note that \(x y=w .\) The fact that \(r_{i}=0\) means that the string \(c_{1} c_{2} \cdots c_{i}\) has the same number of \(a^{\prime} \mathrm{s}\) and \(b^{\prime} \mathrm{s},\) so \(x \in L(G) .\) It follows automatically that \(y \in L(G)\) also. Since \(i\) is strictly between 1 and k − 1, neither x nor y is the empty string. This is all that we needed to show to finish the proof that L = L(G).
The basic idea of this proof is that if w contains the same number of a’s as b’s, then an a at the beginning of w must have a “matching” b somewhere in w. This b matches the a in the sense that the corresponding ri is zero, and the b marks the end of a string x which contains the same number ofa’s as b’s. For example, in the string aababbabba, the a at the beginning of the string is matched by the third b, since aababb is the shortest prefix ofaababbabba that has an equal number of a’s and b’s.
Closely related to this idea of matching a’s and b’s is the idea of balanced parentheses. Consider a string made up of parentheses, such as(()(()))(()). The parentheses in this sample string are balanced because each left parenthesis has a matching right parenthesis, and the matching pairs are properly nested. A careful definition uses the sort of integer se- quence introduced in the above proof. Let w be a string of parentheses.Write \(w=c_{1} c_{2} \cdots c_{n},\) where each \(c_{i}\) is either \((\text { or }) .\) Define a sequence of integers \(r_{1}, r_{2}, \ldots, r_{n},\) where \(r_{i}\) is the number of left parentheses in \(c_{1} c_{2} \cdots c_{i}\) minus the number of right parentheses. We say that the parentheses in w are balanced if \(r_{n}=0\) and \(r_{i} \geq 0\) for all \(i=1,2, \ldots, n\). The fact that \(r_{n}=0\) says that \(w\) contains the same number of left parentheses as right parentheses. The fact the \(r_{i} \geq 0\) means that the nesting of pairs of parentheses is correct: You can’t have a right parenthesis unless it is balanced by a left parenthesis in the preceding part of the string. The language that consists of all balanced strings of parentheses is context-free. It is generated by the grammar
\(S \longrightarrow(S)\)
\(S \longrightarrow S S\)
\(S \longrightarrow \varepsilon\)
The proof is similar to the preceding proof about strings of a’s and b’s. (It might seem that I’ve made an awfully big fuss about matching and balancing. The reason is that this is one of the few things that we can do with context-free languages that we can’t do with regular languages.)
Before leaving this section, we should look at a few more general results. Since we know that most operations on regular languages produce languages that are also regular, we can ask whether a similar result holds for context-free languages. We will see later that the intersection of two context-free languages is not necessarily context-free. Also, the comple- ment of a context-free language is not necessarily context-free. However, some other operations on context-free languages do produce context-free languages.
Theorem 4.3.
Suppose that L and M are context-free languages. Then the languages \(L \cup M, L M,\) and \(L^{*}\) are also context-free.
Proof. I will prove only the first claim of the theorem, that \(L \cup M\) is context-free. In the exercises for this section, you are asked to construct grammars for LM and \(L^{*}\) (without giving formal proofs that your answers are correct).
Let \(G=(V, \Sigma, P, S)\) and \(H=(W, \Gamma, Q, T)\) be context-free grammars such that \(L=L(G)\) and \(M=L(H) .\) We can assume that \(W \cap V=\emptyset\), since otherwise we could simply rename the non-terminal symbols in W. The idea of the proof is that to generate a string in \(L \cup M\), we first decide whether we want a string in L or a string in M . Once that decision is made, to make a string in L, we use production rules from G, while to make a string in M, we use rules from H. We have to design a grammar, K, to represent this process.
Let \(R\) be a symbol that is not in any of the alphabets \(V, W, \Sigma,\) or \(\Gamma .\) R will be the start symbol of K. The production rules for K consist of all the production rules from G and H together with two new rules:
\(R \longrightarrow S\)
\(R \longrightarrow T\)
Formally, K is defined to be the grammar
\((V \cup W \cup\{R\}, P \cup Q \cup\{R \longrightarrow S, R \longrightarrow T\}, \Sigma \cup \Gamma, R)\)
Suppose that \(w \in L .\) That is \(w \in L(G),\) so there is a derivation \(S \Longrightarrow_{G}^{*} w\). since every rule from \(G\) is also a rule in \(K,\) if follows that \(S \Longrightarrow_{K}^{*} w\). Combining this with the fact that \(R \Longrightarrow_{K} S,\) we have that \(R \Longrightarrow_{K}^{*} w,\) and \(w \in L(K) .\) This shows that \(L \subseteq L(K) .\) In an exactly similar way, we can show that \(M \subseteq L(K) .\) Thus, \(L \cup M \subseteq L(K)\).
It remains to show that \(L(K) \subseteq L \cup M .\) Suppose \(w \in L(K) .\) Then there is a derivation \(R \Longrightarrow_{K}^{*} w .\) This derivation must begin with an application of one of the rules \(R \longrightarrow S\) or \(R \longrightarrow T,\) since these are the only rules in which \(R\) appears. If the first rule applied in the derivation is \(R \longrightarrow S\), then the remainder of the derivation shows that \(S \Longrightarrow_{K}^{*} w .\) Starting from S, the only rules that can be applied are rules from G, so in fact we have \(S \Longrightarrow_{G}^{*} w .\) This shows that \(w \in L .\) Similarly, if the first rule applied in the derivation \(R \Longrightarrow_{K}^{*} w\) is \(R \longrightarrow T,\) then \(w \in M .\) In any case, \(w \in L \cup M\). This proves that \(L(K) \subseteq L \cup M\).
Finally, we should clarify the relationship between context-free lan- guages and regular languages. We have already seen that there are context- free languages which are not regular. On the other hand, it turns out that every regular language is context-free. That is, given any regular language, there is a context-free grammar that generates that language. This means that any syntax that can be expressed by a regular expression, by a DFA, or by an NFA could also be expressed by a context-free grammar. In fact, we only need a certain restricted type of context-free grammar to duplicate the power of regular expressions.
Definition 4.3.
A right-regular grammar is a context-free grammar in which the right-hand side of every production rule has one of the following forms: the empty string; a string consisting of a single non-terminal symbol; or a string consisting of a single terminal symbol followed by a single non- terminal symbol.
Examples of the types of production rule that are allowed in a right- regular grammar are \(A \longrightarrow \varepsilon, B \longrightarrow C,\) and \(D \longrightarrow a E\). Theideaof the proof is that given a right-regular grammar, we can build a corresponding NFA and vice-versa. The states of the NFA correspond to the non-terminal symbols of the grammar. The start symbol of the grammar corresponds to the starting state of the NFA. A production rule of the form \(A \longrightarrow b C\) corresponds to a transition in the NFA from state \(A\) to state \(C\) while reading the symbol \(b .\) A production rule of the form \(A \longrightarrow B\) corresponds to an ε-transition from state A to state B in the NFA. And a production rule of the form \(A \longrightarrow \varepsilon\) exists in the grammar if and only if A is a final state in the NFA. With this correspondence, a derivation of a string w in the grammar corresponds to an execution path through the NFA as it accepts the string w. I won’t give a complete proof here. You are welcome to work through the details if you want. But the important fact is:
Theorem 4.4.
A language L is regular if and only if there is a right-regular grammar G such that L = L(G). In particular, every regular language is context-free.
Exercises
1. Show that Part 4 of Theorem 4.1 follows from Part 3.
2. Give a careful proof that the language \(\left\{a^{n} b^{m} | n \geq m \geq 0\right\}\) is generated by the context-free grammar
\(S \longrightarrow a S b\)
\(S \longrightarrow A\)
\(A \longrightarrow a A\)
\(A \longrightarrow \varepsilon\)
3. Identify the language generated by each of the following context-free grammars.
a) \(S \longrightarrow a a S b\)
\(S \longrightarrow \varepsilon\)
b) \(S \longrightarrow a S b\)
\(S \longrightarrow a a S b\)
\(S \longrightarrow \varepsilon\)
c) \(S \longrightarrow T S\)
\(S \longrightarrow \varepsilon\)
\(T \longrightarrow a T b\)
\(T \longrightarrow \varepsilon\)
d) \(S \longrightarrow A B A\)
\(A \longrightarrow a A\)
\(A \longrightarrow a\)
\(B \longrightarrow b B\)
\(B \longrightarrow c B\)
\(B \longrightarrow \varepsilon\)
4. For each of the following languages find a context-free grammar that generates the language:
\(\begin{array}{ll}{\text { a) }\left\{a^{n} b^{m} | n \geq m>0\right\}} & {\text { b) }\left\{a^{n} b^{m} | n, m \in \mathbb{N}\right\}} \\ {\text { c) }\left\{a^{n} b^{m} | n \geq 0 \wedge m=n+1\right\}} & {\text { d) }\left\{a^{n} b^{m} c^{n} | n, m \in \mathbb{N}\right\}} \\ {\text { e) }\left\{a^{n} b^{m} c^{k} | n=m+k\right\}} & {\text { f) }\left\{a^{n} b^{m} | n \neq m\right\}} \\ {\text { g) }\left\{a^{n} b^{m} c^{r} d^{t} | n+m=r+t\right\}} & {\text { h) }\left\{a^{n} b^{m} c^{k} | n \neq m+k\right\}}\end{array}\)
5.Find a context-free grammar that generates the language \(\left\{w \in\{a, b\}^{*} | n_{a}(w)>\right.\)\(n_{b}(w) \}\).
- Find a context-free grammar that generates the language \(\left\{w \in\{a, b, c\}^{*} | n_{a}(w)=\right.\)\(n_{b}(w) \}\).
- A palindrome is a string that reads the same backwards and forwards, such as “mom”, “radar”, or “aabccbccbaa”. That is, w is a palindrome if \(w=w^{R}\). Let \(L=\left\{w \in\{a, b, c\}^{*} | w \text { is a palindrome }\right\}\). Show that L is a context-free language by finding a context-free grammar that generates L.
- Let \(\Sigma=\{(,),[,]\} .\) That is, \(\Sigma\) is the alphabet consisting of the four symbols (, ), [, and ]. Let L be the language over Σ consisting of strings in which both parentheses and brackets are balanced. For example, the string([][()()])([]) is in L but [(]) is not. Find a context-free grammar that generates the language L.
- Suppose that G and H are context-free grammars. Let L = L(G) and let M =L(H). Explain how to construct a context-free grammar for the language LM. You do not need to give a formal proof that your grammar is correct.
- Suppose that G is a context-free grammar. Let L = L(G). Explain how to construct a context-free grammar for the language \(L^{*}\). You do not need to give a formal proof that your grammar is correct.
- Suppose that \(L\) is a context-free language. Prove that \(L^{R}\) is a context-free language. (Hint: Given a context-free grammar G for L, make a new grammar, \(G^{R},\) by reversing the right-hand side of each of the production rules in G. That is, \(A \longrightarrow w\) is a production rule in \(G\) if and only if \(A \longrightarrow w^{R}\) is a production rule in \(G^{R} . )\)
- Define a left-regular grammar to be a context-free grammar in which the right-hand side of every production rule is of one of the following forms: the empty string; a single non-terminal symbol; or a non-terminal symbol followed by a terminal symbol. Show that a language is regular if and only if it can be generated by a left-regular grammar. (Hint: Use the preceding exercise and Theorem 4.4.)