Context-Free Grammar (CFG)
In Part 3, we saw that finite automata recognize
exactly the regular languages. However, many important languages - including the syntax of most programming
languages - have recursive, nested structure that regular languages cannot capture.
Context-free grammars provide a more powerful formalism for describing such languages.
Definition: Context-Free Grammar (CFG)
A context-free grammar is a 4-tuple \((V, \Sigma, R, S)\) where
- \(V\) is a finite set called the variables,
- \(\Sigma\) is a finite set disjoint from \(V\), called the terminals,
- \(R\) is a finite set of production rules, each of the form \(A \to w\) where \(A \in V\) and \(w \in (V \cup \Sigma)^*\),
- \(S \in V\) is the start variable.
A language that can be generated by some context-free grammar is called a context-free language (CFL).
Example:
\[
\begin{align*}
&X \to 00X1 \, | \, Y01 \\\\
&Y \to \$
\end{align*}
\]
In this grammar, \(V = \{X, Y\}, \, \Sigma = \{0, 1, \$\}, \, S = X\), and \(R\) is the collection of rules above.
For example, this grammar generates the string \(0000\$0111\). The derivation of the string in this grammar is
\[
X \Rightarrow 00X1 \Rightarrow 0000X11 \Rightarrow 0000Y0111 \Rightarrow 0000\$0111.
\]
Note: \( X \to 00X1 \, | \, Y01\) means \( X \to 00X1\) OR \(X \to Y01\).
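The derivation above can be replayed mechanically. A minimal sketch in Python (the rule table mirrors the grammar just defined; leftmost replacement suffices here because each sentential form contains at most one variable):

```python
# Rules of the example grammar: X -> 00X1 | Y01,  Y -> $
RULES = {"X": ["00X1", "Y01"], "Y": ["$"]}

def apply_rule(form, variable, replacement):
    """Rewrite the leftmost occurrence of `variable` using a grammar rule."""
    assert replacement in RULES[variable]   # only listed rules may be applied
    return form.replace(variable, replacement, 1)

# Replay: X => 00X1 => 0000X11 => 0000Y0111 => 0000$0111
form = "X"
steps = [form]
for var, rule in [("X", "00X1"), ("X", "00X1"), ("X", "Y01"), ("Y", "$")]:
    form = apply_rule(form, var, rule)
    steps.append(form)

print(" => ".join(steps))
```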
Context-free grammars are strictly more powerful than regular expressions:
every regular language is context-free, but not vice versa. The key advantage is that CFGs can describe recursive
and nested structures.
Consider the language
\[
L_1 = \{0^n 1^n \mid n \geq 0\}.
\]
This language is not regular - no finite automaton can recognize it, because a finite
automaton cannot "count" an unbounded number of 0s and then match them with the same number of 1s.
To prove this rigorously, we use the Pumping Lemma.
Pumping Lemma
The Pumping Lemma is the standard tool for proving that a language is not regular.
The idea is that any sufficiently long string in a regular language must contain a substring
that can be "pumped" (repeated any number of times) while remaining in the language - a
consequence of the pigeonhole principle applied to the finite number of states.
Theorem: Pumping Lemma for Regular Languages
If \(L\) is a regular language, then there exists \(p \in \mathbb{N}\) such that any string \(s \in L\) of
length at least \(p\) may be divided into three pieces, \(s = xyz\), satisfying the following conditions:
- \(\forall i \geq 0, \, xy^i z \in L\)
- \(|y| > 0\)
- \(|xy| \leq p\)
Note: \(p\) is called the pumping length.
For example, assume \(L_1\) is regular, and choose \(s\) to be the string \(0^p1^p\), where \(p\) is the pumping length. The
lemma guarantees a split \(s = xyz\) with \(|xy| \leq p\) and \(|y| > 0\), so \(y\) lies within the first \(p\) symbols
of \(s\) and therefore consists only of \(0\)s. Pumping \(y\) (say, taking \(i = 2\)) then increases the number of \(0\)s
without changing the number of \(1\)s, producing a string not in \(L_1\). This contradicts the lemma, so \(L_1\) is not regular.
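The argument can also be checked exhaustively for a small pumping length. A brute-force sketch in Python (the value \(p = 4\) is an assumption for illustration, since the lemma only guarantees that some \(p\) exists):

```python
def in_L1(s):
    """Membership test for L1 = { 0^n 1^n : n >= 0 }."""
    n = len(s) // 2
    return s == "0" * n + "1" * n

p = 4                      # hypothetical pumping length, small for illustration
s = "0" * p + "1" * p      # the chosen string 0^p 1^p

# Enumerate every split s = xyz with |xy| <= p and |y| > 0;
# pumping with i = 2 must always produce a string outside L1.
for i in range(p + 1):             # i = |x|
    for j in range(i + 1, p + 1):  # j = |xy|, so |y| = j - i > 0
        x, y, z = s[:i], s[i:j], s[j:]
        assert not in_L1(x + y * 2 + z)

print("no split of 0^p 1^p survives pumping for p =", p)
```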
However, \(L_1\) is context-free. It can be generated by the grammar
\[
S \to 0S1 \mid \epsilon.
\]
For example, \(S \Rightarrow 0S1 \Rightarrow 00S11 \Rightarrow 0011\).
The recursive rule \(S \to 0S1\) ensures that every 0 is matched by a corresponding 1 - exactly the kind
of nested structure that a CFG can capture but a finite automaton cannot.
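Under this grammar, every string of \(L_1\) has exactly one derivation: apply \(S \to 0S1\) \(n\) times, then \(S \to \epsilon\). A minimal sketch:

```python
def generate(n):
    """Derive 0^n 1^n: apply S -> 0S1 n times, then S -> epsilon."""
    form = "S"
    for _ in range(n):
        form = form.replace("S", "0S1", 1)
    return form.replace("S", "", 1)     # final step: S -> epsilon

print([generate(n) for n in range(4)])
```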
Pushdown Automata
The Pumping Lemma showed that finite automata lack the memory to handle languages like \(L_1 = \{0^n 1^n\}\).
To recognize context-free languages, we need a computational model with auxiliary storage.
A pushdown automaton (PDA) extends a finite automaton with a stack -
an unbounded last-in, first-out memory - that provides exactly the right amount of additional power.
Definition: Pushdown Automaton (PDA)
A pushdown automaton is a 6-tuple \((Q, \Sigma, \Gamma, \delta, q_0, F)\) where
- \(Q\) is a finite set of states,
- \(\Sigma\) is the input alphabet,
- \(\Gamma\) is the stack alphabet,
- \(\delta: Q \times \Sigma_{\epsilon} \times \Gamma_{\epsilon} \to \mathcal{P}(Q \times \Gamma_{\epsilon})\) is the transition function,
- \(q_0 \in Q\) is the start state,
- \(F \subseteq Q\) is the set of accept states.
A PDA is similar to an NFA but can additionally push symbols onto and pop symbols from its stack at each step.
This extra memory allows PDAs to recognize some nonregular languages.
Again, consider the language:
\[
L_1 = \{0^n 1^n \, | \, n \geq 0\}.
\]
The PDA \(M_1\) reads symbols from the input string as follows:
- As each \(0\) is read, it is pushed onto the stack.
- After the sequence of \(0\)s, a \(0\) is popped off the stack for each \(1\) that is read.
- If the input is exhausted exactly when the stack becomes empty, the string is accepted; otherwise it is rejected.
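The steps above translate directly into a stack-based recognizer. A minimal sketch in Python (a plain list stands in for the PDA's stack, with \$ as the bottom-of-stack marker):

```python
def accepts_L1(s):
    """Stack-based recognizer for L1 = { 0^n 1^n : n >= 0 },
    following the three steps described above."""
    stack = ["$"]                        # $ marks the bottom of the stack
    i = 0
    while i < len(s) and s[i] == "0":    # phase 1: push each 0
        stack.append("0")
        i += 1
    while i < len(s) and s[i] == "1":    # phase 2: pop a 0 for each 1
        if stack[-1] != "0":
            return False                 # more 1s than 0s
        stack.pop()
        i += 1
    # accept iff the input is exhausted and only the bottom marker remains
    return i == len(s) and stack == ["$"]

print(accepts_L1("0011"), accepts_L1("001"))
```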
Let's build the state diagram for \(M_1\) that recognizes \(L_1\).
\[
\begin{align*}
&Q = \{q_1, q_2, q_3, q_4\} \\\\
&\Sigma = \{0, 1\} \\\\
&\Gamma = \{0, \$\} \\\\
&F = \{q_1, q_4\}
\end{align*}
\]
Note: A special symbol \(\$\) is used as a bottom-of-stack marker.
The edge labels have the form \(a,\, b \to c\), meaning: read \(a\) from the input, pop \(b\) from
the top of the stack, and push \(c\) onto the stack. For example, \(1,\, 0 \to \epsilon\) means
"read a 1, pop a 0, and push nothing" (effectively removing the 0).
Theorem: CFG-PDA Equivalence
A language is context-free if and only if some pushdown automaton recognizes it.
We omit the proof, which proceeds by constructing a PDA from a given CFG and vice versa.
The key insight is that the PDA's stack can simulate the derivation process of the grammar.
Moreover, every regular language is context-free, because every regular language
is recognized by a finite automaton, and every finite automaton is automatically a pushdown automaton
(one that simply ignores its stack).
Non-Context-Free Languages
Just as the Pumping Lemma for regular languages lets us prove that certain languages are not regular,
there is an analogous result for context-free languages. This time, a sufficiently long string can be
divided into five pieces, where the 2nd and 4th pieces may be "pumped" together
while the resulting string remains in the language.
Theorem: Pumping Lemma for Context-Free Languages
If \(L\) is a context-free language, then there exists a pumping length
\(p \in \mathbb{N}\) such that every string \(s \in L\) with \(|s| \geq p\) can be divided into
five pieces, \(s = uvxyz\), satisfying:
- \(\forall\, i \geq 0, \quad uv^i x y^i z \in L\),
- \(|vy| > 0\),
- \(|vxy| \leq p\).
Consider the language
\[
L_2 = \{a^n b^n c^n \mid n \geq 0\}.
\]
This is not a context-free language. Intuitively, recognizing \(L_2\) requires maintaining
two independent counts simultaneously (matching \(a\)'s with \(b\)'s and \(b\)'s with \(c\)'s),
which exceeds the capability of a single stack.
More precisely, for any decomposition \(s = uvxyz\) satisfying condition 3 (\(|vxy| \leq p\)),
the substring \(vxy\) can span at most two of the three symbol types. Pumping then disrupts the
balance among \(a\)'s, \(b\)'s, and \(c\)'s, violating condition 1 in both cases: if \(v\) and \(y\)
each contain a single symbol type, the counts become unequal, and if one of them mixes symbol types, pumping also scrambles their order.
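This case analysis can be verified exhaustively for a small pumping length. A brute-force sketch in Python (the value \(p = 3\) is an assumption for illustration, since the lemma only guarantees that some \(p\) exists):

```python
def in_L2(s):
    """Membership test for L2 = { a^n b^n c^n : n >= 0 }."""
    n = len(s) // 3
    return s == "a" * n + "b" * n + "c" * n

p = 3                              # hypothetical pumping length, small for illustration
s = "a" * p + "b" * p + "c" * p

# Enumerate every decomposition s = uvxyz with |vxy| <= p and |vy| > 0;
# pumping with i = 2 must always produce a string outside L2.
for a in range(len(s) + 1):                         # a = |u|
    for b in range(a, min(a + p, len(s)) + 1):      # b = |uv|
        for c in range(b, min(a + p, len(s)) + 1):  # c = |uvx|
            for d in range(c, min(a + p, len(s)) + 1):  # d = |uvxy|, so |vxy| <= p
                u, v, x, y, z = s[:a], s[a:b], s[b:c], s[c:d], s[d:]
                if v + y == "":                     # condition 2 requires |vy| > 0
                    continue
                assert not in_L2(u + v * 2 + x + y * 2 + z)

print("no decomposition of a^p b^p c^p survives pumping for p =", p)
```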
Recognizing \(L_2\) requires more computational power than a PDA - it requires a
Turing machine.
Deterministic Pushdown Automata (DPDAs)
Recall that for finite automata, deterministic (DFA) and nondeterministic (NFA) models recognize
the same class of languages. For pushdown automata, however, the situation is different:
nondeterministic PDAs are strictly more powerful than deterministic ones. Some context-free
languages cannot be recognized by any DPDA.
Nevertheless, deterministic context-free languages (DCFLs) - those recognizable
by DPDAs - are of great practical importance. Most programming languages are designed to be
DCFLs precisely so that efficient parsers (syntax analyzers) can be constructed
for their compilers.
Definition: Deterministic Pushdown Automaton (DPDA)
A deterministic pushdown automaton is a PDA whose transition function
\[
\delta: Q \times \Sigma_{\epsilon} \times \Gamma_{\epsilon} \to (Q \times \Gamma_{\epsilon}) \cup \{\emptyset\}
\]
satisfies the following condition: for every \(q \in Q\), \(a \in \Sigma\), and \(x \in \Gamma\),
exactly one of
\[
\delta(q, a, x), \quad \delta(q, a, \epsilon), \quad \delta(q, \epsilon, x), \quad \delta(q, \epsilon, \epsilon)
\]
is not \(\emptyset\).
Note that DPDAs still permit \(\epsilon\)-transitions - the determinism condition requires that
exactly one move is defined in each situation, not that every transition must read the input and inspect the stack simultaneously:
- \(\delta(q, a, \epsilon)\): read input \(a\) without inspecting the stack,
- \(\delta(q, \epsilon, x)\): pop \(x\) from the stack without reading input,
- \(\delta(q, \epsilon, \epsilon)\): neither read input nor inspect the stack.
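This "exactly one" condition can be checked mechanically over a transition table. A minimal sketch (the one-state table `delta` below is a hypothetical example, not the \(M_1\) from earlier; it reads \(0\)s while ignoring its stack):

```python
EPS = ""  # represents epsilon

def is_deterministic(delta, states, sigma, gamma):
    """Check the DPDA condition: for every (q, a, x), exactly one of
    delta(q,a,x), delta(q,a,eps), delta(q,eps,x), delta(q,eps,eps) is defined."""
    for q in states:
        for a in sigma:
            for x in gamma:
                candidates = [(q, a, x), (q, a, EPS), (q, EPS, x), (q, EPS, EPS)]
                if sum(key in delta for key in candidates) != 1:
                    return False
    return True

# Hypothetical one-state machine over Sigma = {0}, Gamma = {$}:
# it reads each 0 without touching the stack, so it is deterministic.
delta = {("q", "0", EPS): ("q", EPS)}
print(is_deterministic(delta, {"q"}, {"0"}, {"$"}))
```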
The language \(L_1 = \{0^n 1^n \mid n \geq 0\}\) from our earlier example is a DCFL - the PDA
we constructed for it is in fact deterministic, since at each step the current state, input symbol,
and top-of-stack symbol uniquely determine the next move.
We have now established the hierarchy: regular languages \(\subsetneq\) DCFLs \(\subsetneq\) CFLs.
In the next pages, we move beyond context-free languages to the most powerful standard computational model -
the Turing machine - which defines the boundary of
what is computable.