Brzozowski derivative



In theoretical computer science, in particular in formal language theory, the Brzozowski derivative $$u^{-1}S$$ of a set $$S$$ of strings and a string $$u$$ is the set of all strings obtainable from a string in $$S$$ by cutting off the prefix $$u$$. Formally:
 * $$u^{-1}S = \{v \in \Sigma^* \mid uv \in S\}$$.

For example,
 * $$c^{-1}\{\text{cat}, \text{cow}, \text{dog}\} = \{\text{at}, \text{ow}\}.$$
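For a finite set of strings, the definition can be evaluated directly. The following Python sketch keeps the strings that start with the prefix and strips that prefix from each of them. The function name `derivative` is chosen here for illustration only.

```python
def derivative(u: str, language: set[str]) -> set[str]:
    """Return u^{-1}S for a finite language S: every v such that u + v is in S."""
    return {w[len(u):] for w in language if w.startswith(u)}


S = {"cat", "cow", "dog"}
print(derivative("c", S))    # the set containing "at" and "ow"
print(derivative("x", S))    # the empty set: no string in S starts with "x"
```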

The Brzozowski derivative has been introduced under various names since the late 1950s. Today it is named after the computer scientist Janusz Brzozowski, who investigated its properties and gave an algorithm to compute the derivative of a generalized regular expression.

Definition
Even though originally studied for regular expressions, the definition applies to arbitrary formal languages. Given any formal language $$S$$ over an alphabet $$\Sigma$$ and any string $$u \in \Sigma^*$$, the derivative of $$S$$ with respect to $$u$$ is defined as:


 * $$u^{-1}S = \{v \in \Sigma^* \mid uv \in S\}$$

The Brzozowski derivative is a special case of the left quotient, taken by the singleton set containing only $$u$$: $$u^{-1}S = \{u\} \;\backslash\; S$$.

Equivalently, for all $$u,v \in \Sigma^*$$:


 * $$v \in u^{-1}S \;\Leftrightarrow\; uv \in S.$$

From the definition, for all $$u, v \in \Sigma^*$$:


 * $$(uv)^{-1}S = v^{-1}(u^{-1}S)$$

since for all $$w \in \Sigma^*$$, we have $$w \in (uv)^{-1}S \Leftrightarrow uvw \in S \Leftrightarrow vw \in u^{-1}S \Leftrightarrow w \in v^{-1}(u^{-1}S)$$.

The derivative with respect to an arbitrary string reduces to successive derivatives over the symbols of that string, since for all $$a \in \Sigma, u \in \Sigma^*$$: $$\begin{align} (ua)^{-1}S &= a^{-1}(u^{-1}S) \\ \varepsilon^{-1}S &= S \end{align}$$

A language $$S \subseteq \Sigma^*$$ is called nullable if and only if it contains the empty string $$\varepsilon$$. Each language $$S$$ is uniquely determined by the nullability of its derivatives:


 * $$w \in S \ \Leftrightarrow\ \varepsilon \in w^{-1}S$$
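The reduction to single-symbol derivatives and the nullability test can be combined into a membership check. The sketch below does this for finite languages represented as Python sets. The helper names (`symbol_derivative`, `string_derivative`, `is_nullable`, `member`) are illustrative assumptions, not established terminology.

```python
from functools import reduce


def symbol_derivative(a: str, language: set[str]) -> set[str]:
    """a^{-1}S for a single symbol a."""
    return {w[1:] for w in language if w[:1] == a}


def string_derivative(u: str, language: set[str]) -> set[str]:
    """u^{-1}S, computed by taking symbol derivatives from left to right."""
    return reduce(lambda lang, a: symbol_derivative(a, lang), u, language)


def is_nullable(language: set[str]) -> bool:
    """A language is nullable when it contains the empty string."""
    return "" in language


def member(w: str, language: set[str]) -> bool:
    """w is in S exactly when w^{-1}S is nullable."""
    return is_nullable(string_derivative(w, language))


S = {"cat", "cow", "dog"}
print(member("cow", S))   # True
print(member("co", S))    # False
```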

A language can be viewed as a (potentially infinite) boolean-labelled tree (see also tree (set theory) and infinite-tree automaton). Each possible string $$w \in \Sigma^*$$ denotes a node in the tree, with label true when $$w \in S$$ and false otherwise. In this interpretation, the derivative with respect to a symbol $$a$$ corresponds to the subtree obtained by following the edge $$a$$ from the root. Decomposing a tree into the root and the subtrees $$a^{-1}S$$ corresponds to the following equality, which holds for every language $$S \subseteq \Sigma^*$$:


 * $$S = (\{\varepsilon\} \cap S) \cup \bigcup_{a \in \Sigma} a(a^{-1}S).$$
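
As a small worked instance (assuming $$\Sigma$$ is the lowercase Latin alphabet), take $$S = \{\text{cat}, \text{cow}, \text{dog}\}$$. This $$S$$ is not nullable, so $$\{\varepsilon\} \cap S = \emptyset$$, and $$a^{-1}S$$ is empty for every symbol other than $$c$$ and $$d$$, so the decomposition reads:

 * $$\{\text{cat}, \text{cow}, \text{dog}\} = \emptyset \cup c\{\text{at}, \text{ow}\} \cup d\{\text{og}\}.$$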

Derivatives of generalized regular expressions
When a language is given by a regular expression, the concept of derivatives leads to an algorithm for deciding whether a given word belongs to the language denoted by that regular expression.

Given a finite alphabet A of symbols, a generalized regular expression R denotes a possibly infinite set of finite-length strings over the alphabet A, called the language of R, denoted L(R).

A generalized regular expression can be one of the following (where a is a symbol of the alphabet A, and R and S are generalized regular expressions):
 * "∅" denotes the empty set: L(∅) = {},
 * "ε" denotes the singleton set containing the empty string: L(ε) = {ε},
 * "a" denotes the singleton set containing the single-symbol string a: L(a) = {a},
 * "R∨S" denotes the union of R and S: L(R∨S) = L(R) ∪ L(S),
 * "R∧S" denotes the intersection of R and S: L(R∧S) = L(R) ∩ L(S),
 * "¬R" denotes the complement of R (with respect to A*, the set of all strings over A): L(¬R) = A* \ L(R),
 * "RS" denotes the concatenation of R and S: L(RS) = L(R) · L(S),
 * "R*" denotes the Kleene closure of R: L(R*) = L(R)*.

In an ordinary regular expression, neither ∧ nor ¬ is allowed.

Computation
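For any given generalized regular expression R and any string u, the derivative $$u^{-1}R$$ is again a generalized regular expression (denoting the language $$u^{-1}L(R)$$). It may be computed recursively as follows.

$$\begin{align} (ua)^{-1}R &= a^{-1}(u^{-1}R) && \text{for a symbol } a \text{ and a string } u \\ \varepsilon^{-1}R &= R \end{align}$$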

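Using the previous two rules, the derivative with respect to an arbitrary string is explained by the derivative with respect to a single-symbol string a. The latter can be computed as follows:

$$\begin{align}
a^{-1}a &= \varepsilon \\
a^{-1}b &= \emptyset && \text{for each symbol } b \neq a \\
a^{-1}\varepsilon &= \emptyset \\
a^{-1}\emptyset &= \emptyset \\
a^{-1}(R^*) &= (a^{-1}R)R^* \\
a^{-1}(RS) &= (a^{-1}R)S \lor \nu(R)\,a^{-1}S \\
a^{-1}(R \land S) &= (a^{-1}R) \land (a^{-1}S) \\
a^{-1}(R \lor S) &= (a^{-1}R) \lor (a^{-1}S) \\
a^{-1}(\neg R) &= \neg(a^{-1}R)
\end{align}$$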

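Here, $$\nu(R)$$ is an auxiliary function yielding a generalized regular expression that evaluates to ε if the language of R contains ε, and to ∅ otherwise. This function can be computed by the following rules:

$$\begin{align}
\nu(a) &= \emptyset && \text{for any symbol } a \\
\nu(\varepsilon) &= \varepsilon \\
\nu(\emptyset) &= \emptyset \\
\nu(R^*) &= \varepsilon \\
\nu(RS) &= \nu(R) \land \nu(S) \\
\nu(R \land S) &= \nu(R) \land \nu(S) \\
\nu(R \lor S) &= \nu(R) \lor \nu(S) \\
\nu(\neg R) &= \begin{cases} \varepsilon & \text{if } \nu(R) = \emptyset \\ \emptyset & \text{if } \nu(R) = \varepsilon \end{cases}
\end{align}$$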
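
The standard rules for ν and for the derivative with respect to a single symbol translate almost line by line into a recursive program. The following Python sketch is one possible encoding, not a reference implementation. The class names (`Empty`, `Eps`, `Sym`, `Or`, `And`, `Not`, `Cat`, `Star`) and the functions `nullable`, `deriv` and `matches` are assumptions made for this example. As a simplification, `nullable` returns a Boolean instead of the expression ε or ∅, so the term ν(R)a⁻¹S in the concatenation rule becomes a conditional.

```python
from dataclasses import dataclass


class Regex:
    """Base class for generalized regular expressions."""


@dataclass(frozen=True)
class Empty(Regex):      # ∅
    pass

@dataclass(frozen=True)
class Eps(Regex):        # ε
    pass

@dataclass(frozen=True)
class Sym(Regex):        # a single symbol
    char: str

@dataclass(frozen=True)
class Or(Regex):         # R ∨ S
    left: Regex
    right: Regex

@dataclass(frozen=True)
class And(Regex):        # R ∧ S
    left: Regex
    right: Regex

@dataclass(frozen=True)
class Not(Regex):        # ¬R
    inner: Regex

@dataclass(frozen=True)
class Cat(Regex):        # RS
    left: Regex
    right: Regex

@dataclass(frozen=True)
class Star(Regex):       # R*
    inner: Regex


def nullable(r: Regex) -> bool:
    """True exactly when the language of r contains the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Sym)):
        return False
    if isinstance(r, Or):
        return nullable(r.left) or nullable(r.right)
    if isinstance(r, (And, Cat)):
        return nullable(r.left) and nullable(r.right)
    if isinstance(r, Not):
        return not nullable(r.inner)
    raise TypeError(f"unknown expression: {r!r}")


def deriv(a: str, r: Regex) -> Regex:
    """Derivative of r with respect to the single symbol a (no simplification)."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Sym):
        return Eps() if r.char == a else Empty()
    if isinstance(r, Or):
        return Or(deriv(a, r.left), deriv(a, r.right))
    if isinstance(r, And):
        return And(deriv(a, r.left), deriv(a, r.right))
    if isinstance(r, Not):
        return Not(deriv(a, r.inner))
    if isinstance(r, Cat):
        first = Cat(deriv(a, r.left), r.right)
        # The second summand is ∅ unless the left factor is nullable.
        return Or(first, deriv(a, r.right)) if nullable(r.left) else first
    if isinstance(r, Star):
        return Cat(deriv(a, r.inner), r)
    raise TypeError(f"unknown expression: {r!r}")


def matches(r: Regex, word: str) -> bool:
    """Take derivatives symbol by symbol, then test nullability."""
    for a in word:
        r = deriv(a, r)
    return nullable(r)


# (a ∨ b)* c
r = Cat(Star(Or(Sym("a"), Sym("b"))), Sym("c"))
print(matches(r, "abac"))        # True
print(matches(r, "ab"))          # False
print(matches(Not(r), "ab"))     # True
```

Because this sketch never simplifies the intermediate expressions, they can grow with each symbol consumed. Practical implementations usually normalise them, for example by rewriting ∅ ∨ R to R.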

Properties
A string u is a member of the string set denoted by a generalized regular expression R if and only if ε is a member of the string set denoted by the derivative $$u^{-1}R$$.

Considering all the derivatives of a fixed generalized regular expression R results in only finitely many different languages. If their number is denoted by $$d_R$$, all these languages can be obtained as derivatives of R with respect to strings of length less than $$d_R$$. Furthermore, there is a complete deterministic finite automaton with $$d_R$$ states that recognises the regular language given by R, as stated by the Myhill–Nerode theorem.
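
The finiteness property suggests a way to build such an automaton: treat each distinct derivative as a state and each symbol derivative as a transition. The Python sketch below does this under two assumptions that are not part of the text above. It handles only ordinary regular expressions (no ∧ or ¬), and it identifies derivatives syntactically after a light normalisation: unions are stored as sets, and ∅ and ε are simplified away. That normalisation is enough for this example to terminate, but syntactically distinct expressions can still denote the same language, so in general the result may have more than $$d_R$$ states. All names (`alt`, `cat`, `star`, `build_dfa`, `accepts`) are illustrative.

```python
from collections import deque

EMPTY = ("empty",)   # ∅
EPS = ("eps",)       # ε


def sym(a):
    return ("sym", a)


def alt(r, s):
    """Union, stored as a set so it is associative, commutative and idempotent."""
    members = set()
    for x in (r, s):
        members |= x[1] if x[0] == "alt" else {x}
    members.discard(EMPTY)
    if not members:
        return EMPTY
    if len(members) == 1:
        return next(iter(members))
    return ("alt", frozenset(members))


def cat(r, s):
    """Concatenation with ∅ and ε simplified away."""
    if EMPTY in (r, s):
        return EMPTY
    if r == EPS:
        return s
    if s == EPS:
        return r
    return ("cat", r, s)


def star(r):
    if r in (EMPTY, EPS):
        return EPS
    return r if r[0] == "star" else ("star", r)


def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag == "alt":
        return any(nullable(x) for x in r[1])
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    return False  # "empty" and "sym"


def deriv(a, r):
    tag = r[0]
    if tag == "sym":
        return EPS if r[1] == a else EMPTY
    if tag == "alt":
        out = EMPTY
        for x in r[1]:
            out = alt(out, deriv(a, x))
        return out
    if tag == "cat":
        left = cat(deriv(a, r[1]), r[2])
        return alt(left, deriv(a, r[2])) if nullable(r[1]) else left
    if tag == "star":
        return cat(deriv(a, r[1]), r)
    return EMPTY  # "empty" and "eps"


def build_dfa(r, alphabet):
    """Explore all derivatives of r breadth-first; state 0 is r itself."""
    index = {r: 0}
    delta = {}
    queue = deque([r])
    while queue:
        state = queue.popleft()
        for a in sorted(alphabet):
            target = deriv(a, state)
            if target not in index:
                index[target] = len(index)
                queue.append(target)
            delta[index[state], a] = index[target]
    accepting = {i for q, i in index.items() if nullable(q)}
    return index, delta, accepting


def accepts(dfa, word):
    _, delta, accepting = dfa
    state = 0
    for a in word:
        state = delta[state, a]
    return state in accepting


dfa = build_dfa(star(cat(sym("a"), sym("b"))), "ab")   # (ab)*
print(len(dfa[0]))            # 3
print(accepts(dfa, "abab"))   # True
print(accepts(dfa, "aba"))    # False
```

For (ab)* over the alphabet {a, b}, the three states correspond to the derivatives (ab)*, b(ab)* and ∅, and only the first is accepting.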

Derivatives of context-free languages
Derivatives are also effectively computable for recursively defined equations with regular expression operators, which are equivalent to context-free grammars. This insight has been used to derive parsing algorithms for context-free languages. Implementations of such algorithms have been shown to have cubic time complexity, matching the complexity of the Earley parser on general context-free grammars.