Chomsky–Schützenberger enumeration theorem

In formal language theory, the Chomsky–Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

Statement
In order to state the theorem, a few notions from algebra and formal language theory are needed.

Let $$\mathbb{N}$$ denote the set of nonnegative integers. A power series over $$\mathbb{N}$$ is an infinite series of the form
 * $$f = f(x) = \sum_{k=0}^\infty a_k x^k = a_0 + a_1 x^1 + a_2 x^2 + a_3 x^3 + \cdots$$

with coefficients $$a_k$$ in $$\mathbb{N}$$. The multiplication of two formal power series $$f = \sum_{k=0}^\infty a_k x^k$$ and $$g = \sum_{k=0}^\infty b_k x^k$$ is defined in the expected way as the convolution of the coefficient sequences $$a_n$$ and $$b_n$$:


 * $$f(x)\cdot g(x) = \sum_{k=0}^\infty \left(\sum_{i=0}^k a_i b_{k-i}\right) x^k.$$
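
The convolution formula can be tried out on truncated series. The following is a minimal sketch, assuming each power series is represented by a Python list of its leading coefficients; the function name `multiply` and the truncation parameter `n_terms` are illustrative choices, not part of the theorem.

```python
def multiply(f, g, n_terms):
    """Return the first n_terms coefficients of the product f * g.

    f and g are lists of leading coefficients (index k holds the
    coefficient of x^k); anything beyond n_terms is discarded.
    """
    product = [0] * n_terms
    for i, a in enumerate(f[:n_terms]):
        # only pairs with i + j < n_terms contribute to the kept coefficients
        for j, b in enumerate(g[:n_terms - i]):
            product[i + j] += a * b
    return product


# 1/(1-x) = 1 + x + x^2 + ... squared gives 1 + 2x + 3x^2 + ...
geometric = [1] * 6
print(multiply(geometric, geometric, 6))  # [1, 2, 3, 4, 5, 6]
```

Because the coefficient of $$x^k$$ in the product depends only on coefficients of index at most $$k$$ in each factor, truncating both factors does not change the coefficients that are kept.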

In particular, we write $$f^2 = f(x)\cdot f(x)$$, $$f^3 = f(x)\cdot f(x)\cdot f(x)$$, and so on. In analogy to algebraic numbers, a power series $$f(x)$$ is called algebraic over $$\mathbb{Q}(x)$$ if there exists a finite set of polynomials $$p_0(x), p_1(x), p_2(x), \ldots, p_n(x)$$, each with rational coefficients and not all zero, such that


 * $$p_0(x) + p_1(x) \cdot f + p_2(x)\cdot f^2 + \cdots + p_n(x)\cdot f^n = 0.$$
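
As an illustration of this definition (a standard example, not taken from the sources of the theorem), consider the generating function of the Catalan numbers, $$f(x) = 1 + x + 2x^2 + 5x^3 + 14x^4 + \cdots$$. It satisfies

 * $$1 - f + x \cdot f^2 = 0,$$

so it is algebraic over $$\mathbb{Q}(x)$$ with $$p_0(x) = 1$$, $$p_1(x) = -1$$, and $$p_2(x) = x$$. The low-order terms can be checked by hand: $$f^2 = 1 + 2x + 5x^2 + \cdots$$, hence $$x \cdot f^2 = x + 2x^2 + 5x^3 + \cdots$$, which cancels $$1 - f = -x - 2x^2 - 5x^3 - \cdots$$.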

A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree or, equivalently, only one leftmost derivation. With these notions in place, the theorem can be stated as follows.


 * Chomsky–Schützenberger theorem. If $$L$$ is a context-free language over an alphabet $$\Sigma$$ admitting an unambiguous context-free grammar, and $$a_k := | L \cap \Sigma^k |$$ is the number of words of length $$k$$ in $$L$$, then $$G(x)=\sum_{k = 0}^\infty a_k x^k$$ is a power series over $$\mathbb{N}$$ that is algebraic over $$\mathbb{Q}(x)$$.

Proofs of this theorem are given by, and by.

Asymptotic estimates
The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar, as n grows large. The following example is given by : the unambiguous context-free grammar G over the alphabet {0,1} has start symbol S and the following rules


 * S → M | U
 * M → 0M1M | ε
 * U → 0S | 0M1U.

To obtain an algebraic representation of the power series $$G(x)$$ associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by x, each occurrence of ε by the integer '1', each occurrence of '→' by '=', and each occurrence of '|' by '+'. Concatenation on the right-hand side of each rule corresponds to multiplication in the resulting equations. This yields the following system of equations:


 * S = M + U
 * M = M²x² + 1
 * U = Sx + MUx²
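
The coefficients of S, M, and U can be computed from this system directly. The following is a minimal sketch, assuming truncated series stored as Python lists and a simple fixed-point iteration that starts from zero and re-applies the three equations until nothing changes; the truncation order `N` and all helper names are illustrative choices.

```python
N = 10  # number of coefficients to keep (arbitrary choice for this sketch)


def mul(f, g):
    """Product of two series, truncated to N coefficients."""
    out = [0] * N
    for i, a in enumerate(f):
        for j, b in enumerate(g[:N - i]):
            out[i + j] += a * b
    return out


def add(f, g):
    """Coefficient-wise sum of two truncated series."""
    return [a + b for a, b in zip(f, g)]


def times_x(f, k=1):
    """Multiply a series by x^k, truncated to N coefficients."""
    return ([0] * k + f)[:N]


ONE = [1] + [0] * (N - 1)


def solve():
    """Iterate the three equations until the truncated series stop changing."""
    S = M = U = [0] * N
    while True:
        M_next = add(times_x(mul(M, M), 2), ONE)         # M = M^2 x^2 + 1
        U_next = add(times_x(S), times_x(mul(M, U), 2))  # U = S x + M U x^2
        S_next = add(M_next, U_next)                     # S = M + U
        if (S_next, M_next, U_next) == (S, M, U):
            return S, M, U
        S, M, U = S_next, M_next, U_next


S, M, U = solve()
print(S)  # [1, 1, 2, 3, 6, 10, 20, 35, 70, 126]
```

Each entry of `S` is the number of words of the corresponding length generated by the grammar. The iteration terminates because every pass fixes at least one further coefficient of the truncated series.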

In this system of equations, S, M, and U are functions of x, so one could also write $$S(x)$$, $$M(x)$$, and $$U(x)$$. The system can be solved for S, resulting in a single algebraic equation:

 * $$x(2x-1)S^2 + (2x-1)S + 1 = 0.$$

This quadratic equation has two solutions for S, one of which is the algebraic power series $$G(x)$$. By applying methods from complex analysis to this equation, the number $$a_n$$ of words of length n generated by G can be estimated as n grows large. In this case, one obtains $$a_n \in O((2+\epsilon)^n)$$ but $$a_n \notin O((2-\epsilon)^n)$$ for each $$\epsilon>0$$.
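
The growth rate can also be observed numerically. The following is a minimal sketch, assuming the quadratic equation $$x(2x-1)S^2 + (2x-1)S + 1 = 0$$ is rewritten as $$S = 1 + 2xS + (2x^2 - x)S^2$$, which determines the coefficients of its power-series solution one at a time; the function name `coefficients` is an illustrative choice. It is a numerical check, not the complex-analytic argument itself.

```python
def coefficients(n_terms):
    """Coefficients a_0 .. a_{n_terms-1} of the power-series solution of
    x(2x-1)S^2 + (2x-1)S + 1 = 0, using S = 1 + 2xS + (2x^2 - x)S^2."""
    a = [1]  # a_0 = 1 follows from setting x = 0 in the equation

    def square_coeff(m):
        # coefficient of x^m in S^2; it only needs a_0 .. a_m
        if m < 0:
            return 0
        return sum(a[i] * a[m - i] for i in range(m + 1))

    for n in range(1, n_terms):
        # compare coefficients of x^n on both sides of the rewritten equation
        a.append(2 * a[n - 1] + 2 * square_coeff(n - 2) - square_coeff(n - 1))
    return a


a = coefficients(61)
print(a[:8])  # [1, 1, 2, 3, 6, 10, 20, 35]

# The n-th root of a_n should approach 2 from below as n grows.
for n in (10, 20, 40, 60):
    print(n, a[n] ** (1 / n))
```

The printed roots stay below 2 and increase slowly toward it, which is consistent with the bounds stated above.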

The following example is from :

 * $$\left\{\begin{array}{l} S \rightarrow XY \\ T \rightarrow aT \mid TbT \mid YcY \\ Y \rightarrow YaY \mid cY \mid abTaYYa \mid X \\ X \rightarrow a \mid b \mid c \end{array}\right. \Rightarrow \left\{\begin{array}{l} s(z)=x(z) y(z) \\ t(z)=z t(z)+z t(z)^2+z y(z)^2 \\ y(z)=z y(z)^2+z y(z)+z^4 t(z) y(z)^2+x(z) \\ x(z)=3 z \end{array}\right.$$

which simplifies to

 * $$s(z)^8-27\left(z^3-z^2\right) s(z)^5+\ldots+59049 z^{10}=0.$$

Inherent ambiguity
In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous. For example, the Goldstine language $$L_G$$ over the alphabet $$\{a,b\}$$ consists of the words $$a^{n_1}ba^{n_2}b\cdots a^{n_p}b$$ with $$p\ge 1$$, $$n_i>0$$ for $$i \in \{1,2,\ldots,p\}$$, and $$n_j \neq j$$ for some $$j \in \{1,2,\ldots,p\}$$.
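
The definition can be made concrete with a small membership test. The following is a minimal sketch that follows the conditions exactly as stated above; the function name `in_goldstine` is an illustrative choice.

```python
import re


def in_goldstine(word):
    """Membership test for L_G, following the definition as stated above."""
    # the word must be a nonempty sequence of blocks a^n b with n > 0
    if not re.fullmatch(r"(a+b)+", word):
        return False
    # lengths n_1, ..., n_p of the runs of a's before each b
    block_lengths = [len(block) for block in word.split("b")[:-1]]
    # membership requires n_j != j for at least one position j (1-based)
    return any(n != j for j, n in enumerate(block_lengths, start=1))


print(in_goldstine("abaab"))  # False: n_1 = 1 and n_2 = 2
print(in_goldstine("abab"))   # True: n_2 = 1 differs from 2
```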

It is comparatively easy to show that the language $$L_G$$ is context-free. The harder part is to show that there does not exist an unambiguous grammar that generates $$L_G$$. This can be proved as follows: If $$g_k$$ denotes the number of words of length $$k$$ in $$L_G$$, then the associated power series satisfies $$G(x) = \sum_{k=0}^\infty g_k x^k = \frac{1-x}{1-2x}- \frac1x \sum_{k \ge 1} x^{k(k+1)/2-1} $$. Using methods from complex analysis, one can prove that this function is not algebraic over $$\mathbb{Q}(x)$$. By the Chomsky–Schützenberger theorem, one can conclude that $$L_G$$ does not admit an unambiguous context-free grammar.