User:7dare/sandbox

Parikh's theorem in theoretical computer science says that if one looks only at the number of occurrences of each terminal symbol in a context-free language, without regard to their order, then the language is indistinguishable from a regular language. It is useful for showing that strings with a given number of occurrences of certain terminals are not generated by a context-free grammar. It was first proved by Rohit Parikh in 1961 and republished in 1966.

Definitions and formal statement
Let $\Sigma=\{a_1,a_2,\ldots,a_k\}$ be an alphabet. The Parikh vector of a word $w\in\Sigma^*$ is $p(w)$, where the function $p:\Sigma^*\to\mathbb{N}^k$ is given by $$p(w)=(|w|_{a_1}, |w|_{a_2}, \ldots, |w|_{a_k})$$ and $|w|_{a_i}$ denotes the number of occurrences of the letter $a_i$ in the word $w$.
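As an illustration, the Parikh vector can be computed by counting letters. The following Python sketch assumes that words are strings and that the alphabet is given as an ordered sequence of distinct single characters; the name `parikh_vector` is chosen here for illustration only.

```python
from collections import Counter


def parikh_vector(word, alphabet):
    """Return the Parikh vector of `word` for the ordered `alphabet`."""
    if len(set(alphabet)) != len(alphabet):
        raise ValueError("alphabet must not contain repeated letters")
    counts = Counter(word)
    unknown = set(counts) - set(alphabet)
    if unknown:
        raise ValueError(f"letters outside the alphabet: {sorted(unknown)}")
    # One coordinate per letter, in the order fixed by the alphabet.
    return tuple(counts[a] for a in alphabet)


print(parikh_vector("abaab", "ab"))  # (3, 2)
print(parikh_vector("", "ab"))       # (0, 0)
```

Words that differ only in the order of their letters, such as `"ab"` and `"ba"`, receive the same vector.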

A subset of $\mathbb{N}^k$ is said to be linear if it is of the form $$u_0 + \mathbb{N}u_1 + \dots + \mathbb{N}u_m = \{u_0+t_1u_1+\dots+t_mu_m \mid t_1,\ldots,t_m\in\mathbb{N}\}$$ for some vectors $u_0,\ldots,u_m\in\mathbb{N}^k$. A subset of $\mathbb{N}^k$ is said to be semi-linear if it is a union of finitely many linear subsets.
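Membership of a vector in a linear or semi-linear set given by explicit vectors can be decided by exhaustive search, because all coordinates are non-negative and each coefficient $t_i$ is therefore bounded. The following Python sketch is a brute-force illustration, not an efficient algorithm; the function names and the representation of a semi-linear set as a list of `(base, periods)` pairs are assumptions made for this example.

```python
def in_linear_set(v, base, periods):
    """Decide whether v = base + t_1*periods[0] + ... + t_m*periods[m-1], t_i in N."""
    if any(x < 0 for vec in [base, *periods] for x in vec):
        raise ValueError("all vectors must lie in N^k")
    target = tuple(x - b for x, b in zip(v, base))
    # Zero periods contribute nothing and would make the loop below endless.
    periods = [p for p in periods if any(p)]

    def search(remaining, i):
        if all(r == 0 for r in remaining):
            return True
        if i == len(periods):
            return False
        # Try t_i = 0, 1, 2, ... until some coordinate would become negative.
        while all(r >= 0 for r in remaining):
            if search(remaining, i + 1):
                return True
            remaining = tuple(r - c for r, c in zip(remaining, periods[i]))
        return False

    return all(x >= 0 for x in target) and search(target, 0)


def in_semilinear_set(v, components):
    """`components` is a list of (base, periods) pairs, one per linear subset."""
    return any(in_linear_set(v, base, periods) for base, periods in components)


# The linear set (1,0) + N(1,1) + N(0,2).
print(in_linear_set((3, 4), (1, 0), [(1, 1), (0, 2)]))  # True
print(in_linear_set((2, 2), (1, 0), [(1, 1), (0, 2)]))  # False
```

In the first call, $(3,4) = (1,0) + 2\cdot(1,1) + 1\cdot(0,2)$.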

Statement 1: Let $L$ be a context-free language. Let $P(L)$ be the set of Parikh vectors of words in $L$, that is, $P(L) = \{p(w) \mid w \in L\}$. Then $P(L)$ is a semi-linear set.

Two languages are said to be commutatively equivalent if they have the same set of Parikh vectors.
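For example, the context-free language $\{a^nb^n \mid n\ge 0\}$, which is not regular, and the regular language $(ab)^*$ are commutatively equivalent: both have the linear set $(0,0)+\mathbb{N}(1,1)$ as their set of Parikh vectors. The following Python sketch checks this only on finite truncations of the two languages (words with $n < 6$), so it illustrates the definition rather than proving the equality; the helper names are chosen for this example.

```python
def parikh_vector(word, alphabet):
    """Count the occurrences of each (single-character) letter, in alphabet order."""
    return tuple(word.count(a) for a in alphabet)


def parikh_image(words, alphabet):
    """Set of Parikh vectors of a finite collection of words."""
    return {parikh_vector(w, alphabet) for w in words}


N = 6  # truncation bound, an arbitrary choice for the illustration
context_free_words = ["a" * n + "b" * n for n in range(N)]  # a^n b^n
regular_words = ["ab" * n for n in range(N)]                # (ab)^n
linear_set = {(n, n) for n in range(N)}                     # (0,0) + n*(1,1)

print(parikh_image(context_free_words, "ab") == linear_set)  # True
print(parikh_image(regular_words, "ab") == linear_set)       # True
```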

Statement 2: If $S$ is any semi-linear set, the language of words whose Parikh vectors are in $S$ is commutatively equivalent to some regular language. Combined with Statement 1, this shows that every context-free language is commutatively equivalent to some regular language.
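One standard way to obtain such a regular language is to read it off the vectors that define the semi-linear set: a linear set $u_0+\mathbb{N}u_1+\dots+\mathbb{N}u_m$ is the set of Parikh vectors of the words matching $x_0(x_1\mid\dots\mid x_m)^*$, where $x_i$ is any word with Parikh vector $u_i$. The following Python sketch builds this expression in the syntax of the standard `re` module, assuming single-character letters; the function names and the `(base, periods)` representation are illustrative assumptions.

```python
import re


def word_for(vector, alphabet):
    """A word with the given Parikh vector: a_1^{n_1} a_2^{n_2} ... a_k^{n_k}."""
    return "".join(a * n for a, n in zip(alphabet, vector))


def linear_set_regex(base, periods, alphabet):
    """Regular expression whose matches have exactly base + N*periods as Parikh image."""
    base_part = re.escape(word_for(base, alphabet))
    loops = [re.escape(word_for(p, alphabet)) for p in periods if any(p)]
    if not loops:
        return base_part
    return base_part + "(?:" + "|".join(loops) + ")*"


def semilinear_set_regex(components, alphabet):
    """Union of the expressions for each (base, periods) pair."""
    if not components:
        raise ValueError("at least one linear component is required")
    return "|".join(
        "(?:" + linear_set_regex(base, periods, alphabet) + ")"
        for base, periods in components
    )


pattern = linear_set_regex((1, 0), [(1, 1), (0, 2)], "ab")
print(pattern)                                 # a(?:ab|bb)*
print(bool(re.fullmatch(pattern, "aabbb")))    # True
```

The word `aabbb` factors as `a`, `ab`, `bb`, so its Parikh vector is $(1,0)+(1,1)+(0,2)=(2,3)$.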

These two equivalent statements can be summed up by saying that the context-free languages and the regular languages have the same images under $p$, namely the semi-linear sets.

Strengthening for bounded languages
A language $L$ is bounded if $L\subseteq w_1^*\ldots w_k^*$ for some fixed words $w_1,\ldots, w_k$. Ginsburg and Spanier gave a necessary and sufficient condition, similar to Parikh's theorem, for bounded languages.

Call a linear set stratified if, in its definition, each vector $u_i$ with $i\ge 1$ has at most two non-zero coordinates, and, for all $i,j\ge 1$, if $u_i$ and $u_j$ each have two non-zero coordinates, at positions $i_1<i_2$ and $j_1<j_2$ respectively, then these positions do not interleave as $i_1<j_1<i_2<j_2$. A semi-linear set is stratified if it is a union of finitely many stratified linear subsets.
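The two conditions on the vectors $u_i$ with $i\ge 1$ can be checked mechanically for a given representation of a linear set. The following Python sketch does so; the function names are illustrative. Note that a negative answer only concerns the representation supplied: the same set might still admit a different, stratified representation.

```python
from itertools import permutations


def support(vector):
    """Positions of the non-zero coordinates, in increasing order."""
    return [idx for idx, x in enumerate(vector) if x != 0]


def is_stratified_representation(periods):
    """Check the stratification conditions for the period vectors u_1, ..., u_m."""
    supports = [support(p) for p in periods]
    # Condition 1: at most two non-zero coordinates per period vector.
    if any(len(s) > 2 for s in supports):
        return False
    # Condition 2: no two two-element supports interleave as i1 < j1 < i2 < j2.
    pairs = [tuple(s) for s in supports if len(s) == 2]
    for (i1, i2), (j1, j2) in permutations(pairs, 2):
        if i1 < j1 < i2 < j2:
            return False
    return True


print(is_stratified_representation([(1, 1, 0, 0), (0, 0, 1, 1)]))  # True
print(is_stratified_representation([(1, 0, 0, 1), (0, 1, 1, 0)]))  # True
print(is_stratified_representation([(1, 0, 1, 0), (0, 1, 0, 1)]))  # False
print(is_stratified_representation([(1, 1, 1)]))                   # False
```

With base vector zero and words $a, b, c, d$, the first three representations describe the exponent sets of $a^nb^nc^md^m$, $a^nb^mc^md^n$ and $a^nb^mc^nd^m$: side-by-side and nested dependencies pass the check, while crossing ones do not.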

The Ginsburg-Spanier theorem says that a bounded language $L\subseteq w_1^*\ldots w_k^*$ is context-free if and only if $\{(n_1,\ldots,n_k)\mid w_1^{n_1}\ldots w_k^{n_k}\in L\}$ is a stratified semi-linear set.

Significance
The theorem has several consequences. It shows that a context-free language over a singleton alphabet must be a regular language: over a one-letter alphabet a word is determined by its Parikh vector, so commutatively equivalent languages are equal. It also shows that some context-free languages can only have ambiguous grammars. Such languages are called inherently ambiguous languages. From a formal grammar perspective, this means that some ambiguous context-free grammars cannot be converted to equivalent unambiguous context-free grammars.