Tagged Deterministic Finite Automaton

In the automata theory, a tagged deterministic finite automaton (TDFA) is an extension of deterministic finite automaton (DFA). In addition to solving the recognition problem for regular languages, TDFA is also capable of submatch extraction and parsing. While canonical DFA can find out if a string belongs to the language defined by a regular expression, TDFA can also extract substrings that match specific subexpressions. More generally, TDFA can identify positions in the input string that match tagged positions in a regular expression (tags are meta-symbols similar to capturing parentheses, but without the pairing requirement).

History
TDFA were first described by Ville Laurikari in 2000. Prior to that it was unknown whether it is possible to perform submatch extraction in one pass on a deterministic finite-state automaton, so this paper was an important advancement. Laurikari described TDFA construction and gave a proof that the determinization process terminates, however the algorithm did not handle disambiguation correctly.

In 2007 Chris Kuklewicz implemented TDFA in a Haskell library Regex-TDFA with POSIX longest-match semantics. Kuklewicz gave an informal description of the algorithm and answered the principal question whether TDFA are capable of POSIX longest-match disambiguation, which was doubted by other researchers.

In 2017 Ulya Trafimovich described TDFA with one-symbol lookahead. The use of a lookahead symbol reduces the number of registers and register operations in a TDFA, which makes it faster and often smaller than Laurikari TDFA. Trafimovich called TDFA variants with and without lookahead TDFA(1) and TDFA(0) by analogy with LR parsers LR(1) and LR(0). The algorithm was implemented in the open-source lexer generator RE2C. Trafimovich formalized Kuklewicz disambiguation algorithm.

In 2018 Angelo Borsotti worked on an experimental Java implementation of TDFA; it was published later in 2021. In 2019 Borsotti and Trafimovich adapted POSIX disambiguation algorithm by Okui and Suzuki to TDFA. They gave a formal proof of correctness of the new algorithm and showed that it is faster than Kuklewicz algorithm in practice.

In 2020 Trafimovich published an article about TDFA implementation in RE2C.

In 2022 Borsotti and Trafimovich published a paper with a detailed description of TDFA construction. The paper incorporated their past research and presented multi-pass TDFA that are better suited to just-in-time determinization. They also compared TDFA against other algorithms and provided benchmarks.

Formal definition
TDFA have the same basic structure as ordinary DFA: a finite set of states linked by transitions. In addition to that, TDFA have a fixed set of registers that hold tag values, and register operations on transitions that set or copy register values. The values may be scalar offsets, or offset lists for tags that match repeatedly (the latter can be represented efficiently using a trie structure). There is no one-to-one mapping between tags in a regular expression and registers in a TDFA: a single tag may need many registers, and the same register may hold values of different tags.



The following definition is according to Trafimovich and Borsotti. The original definition by Laurikari is slightly different.

A tagged deterministic finite automaton $$F$$ is a tuple $$(\Sigma, T, S, S_f, s_0, R, R_f, \delta, \varphi)$$, where:


 * $$\Sigma$$ is a finite set of symbols (alphabet)
 * $$T$$ is a finite set of tags
 * $$S$$ is a finite set of states with initial state $$s_0$$ and a subset of final states $$S_f \subseteq S$$
 * $$R$$ is a finite set of registers with a subset of final registers $$R_f$$ (one per tag)
 * $$\delta: S \times \Sigma \rightarrow S \times O^*$$ is a transition function
 * $$\varphi: S_f \rightarrow O^*$$ is a final function, where $$O$$ is a set of register operations of the following types:
 * set register $$i$$ to nil or to the current position: $$i \leftarrow v$$, where $$v \in \{\mathbf{n},\mathbf{p}\}$$
 * copy register $$j$$ to register $$i$$: $$i \leftarrow j$$
 * copy register $$j$$ to register $$i$$ and append history: $$i \leftarrow j \cdot h$$, where $$h$$ is a string over $$\{\mathbf{n},\mathbf{p}\}$$

Example
Figure 0 shows an example TDFA for regular expression $$(1 a 2)^* 3 (a|4 b) 5 b^*$$ with alphabet $$\Sigma=\{a,b\}$$ and a set of tags $$T=\{1,2,3,4,5\}$$ that matches strings of the form $$a \dots a b \dots b$$ with at least one symbol. TDFA has four states $$S=\{0,1,2,3\}$$ three of which are final $$S_f=\{1,2,3\}$$. The set of registers is $$R=\{r_1,r_2,r_3,r_4,r_5\}$$ with a subset of final registers $$R_f=\{r_1,r_2,r_3,r_4,r_5\}$$ where register $$r_i$$ corresponds to $$i$$-th tag. Transitions have operations defined by the $$\delta$$ function, and final states have operations defined by the $$\varphi$$ function (marked with wide-tipped arrow). For example, to match string $$aab$$, one starts in state 0, matches the first $$a$$ and moves to state 1 (setting registers $$r_1, r_2$$ to undefined and $$r_3$$ to the current position 0), matches the second $$a$$ and loops to state 1 (register values are now $$r_1=0, r_2=r_3=1$$), matches $$b$$ and moves to state 2 (register values are now $$r_1=1, r_2=r_3=r_4=2$$), executes the final operations in state 2 (register values are now $$r_1=1, r_2=r_3=r_4=2, r_5=3$$) and finally exits TDFA.

Complexity
Canonical DFA solve the recognition problem in linear time. The same holds for TDFA, since the number of registers and register operations is fixed and depends only on the regular expression, but not on the length of input. The overhead on submatch extraction depends on tag density in a regular expression and nondeterminism degree of each tag (the maximum number of registers needed to track all possible values of the tag in a single TDFA state). On one extreme, if there are no tags, a TDFA is identical to a canonical DFA. On the other extreme, if every subexpression is tagged, a TDFA effectively performs full parsing and has many operations on every transition. In practice for real-world regular expressions with a few submatch groups the overhead is negligible compared to matching with canonical DFA.

TDFA construction
TDFA construction is performed in a few steps. First, a regular expression is converted to a tagged nondeterministic finite automaton (TNFA). Second, a TNFA is converted to a TDFA using a determinization procedure; this step also includes disambiguation that resolves conflicts between ambiguous TNFA paths. After that, a TDFA can optionally go through a number of optimizations that reduce the number of registers and operations, including minimization that reduces the number of states. Algorithms for all steps of TDFA construction with pseudocode are given in the paper by Borsotti and Trafimovich.

This section explains TDFA construction on the example of a regular expression $$a^* t b^* | ab$$, where $$t$$ is a tag and $$\{a,b\}$$ are alphabet symbols.

Tagged NFA
TNFA is a nondeterministic finite automaton with tagged ε-transitions. It was first described by Laurikari, although similar constructions were known much earlier as Mealy machines and nondeterministic finite-state transducers. TNFA construction is very similar to Thompson's construction: it mirrors the structure of a regular expression. Importantly, TNFA preserves ambiguity in a regular expression: if it is possible to match a string in two different ways, then TNFA for this regular expression has two different accepting paths for this string. TNFA definition by Borsotti and Trafimovich differs from the original one by Laurikari in that TNFA can have negative tags on transitions: they are needed to make the absence of match explicit in cases when there is a bypass for a tagged transition.



Figure 1 shows TNFA for the example regular expression $$a^* t b^* | ab$$. It has three kinds of transitions: transitions on alphabet symbols $$\{a, b\}$$ (dark blue), tagged ε-transitions (the one from state 4 to state 12 with tag $$t$$ and the one from state 10 to state 11 with negative tag $$-t$$, bright green), and untagged ε-transitions (light blue). TNFA has a single start state 0 and a single final state 11.

In order to understand TNFA determinization, it helps to understand TNFA simulation first. Recall the canonical ε-NFA simulation: it constructs a subset of active states as an ε-closure of the start state, and then in a loop repeatedly steps on the next input symbol and constructs ε-closure of the active state set. Eventually the loop terminates: either the active set becomes empty (which means a failure), or all input symbols get consumed (which means a success if the active set contains a final state, otherwise a failure). TNFA simulation is similar, but it additionally tracks tag values. Every time simulation encounters a tagged transition, it updates tag value to the current offset. Since the simulation tracks multiple nondeterministic paths simultaneously, tag values along these paths may differ and should be tracked separately. Another difficulty is the need for disambiguation: unlike canonical NFA simulation, TNFA simulation needs to distinguish ambiguous paths, as they may have different tag values.



Figure 2 shows an example of TNFA simulation on a string $$ab$$:

• Initially the active set contains start state 0 with tag value ? (unknown).

• ε-Closure of the active set gives the new active set containing states 2, 5, 8, 11 with tag values ?, 0, ?, 0 respectively.

• Stepping on symbol $a$ gives the new active set containing states 3, 9 with tag values ?, ? respectively.

• ε-Closure of the active set gives the new active set containing states 2, 5, 9, 11 with tag values ?, 1, ?, 1 respectively.

• Stepping on symbol $b$ gives the new active set containing states 6, 10 with tag values 1, ? respectively.

• ε-Closure of the active set gives the new active set containing states 5, 11 with tag values 1, 1 respectively. Reflecting ambiguity in the regular expression, there are two conflicting paths that reach state 11: the one from state 5 corresponds to the left alternative $a^* t b^*$ with tag value 1, and the one from state 9 corresponds to the right alternative $ab$ with tag value $\varnothing$ (which differs from the unknown value). The first path takes priority.

• Simulation terminates, as there are no more input symbols. The active set contains final state 11, so the match is successful and $t=1$.

Disambiguation
Ambiguity means the existence of multiple different parse trees for the same input. It is a property of a regular expression; ambiguity is preserved by TNFA construction and gets resolved during TNFA simulation or determinization. One way to resolve ambiguity is use a disambiguation policy, the most notable examples being the leftmost-greedy and the longest-match (POSIX) policies. The leftmost-greedy policy is defined in terms of regular expression structure; it admits a simple and efficient implementation. The POSIX policy, on the other hand, is defined in terms of the structure of parse results; it is more difficult to implement and computationally more complex than the leftmost-greedy policy. TDFA can work with both policies, and there is no runtime overhead on disambiguation, since it happens during determinization and gets built into TDFA structure. For TNFA simulation, on the other hand, the time spent on disambiguation is included in the runtime.

The example regular expression $$a^* t b^* | ab$$ is deliberately ambiguous, as it allows one to parse $$ab$$ in two different ways: either as the left alternative $$a^* t b^*$$, or the right one $$ab$$. Depending on which alternative is preferred, tag $$t$$ should either have value $$1$$ (the offset at the position between symbols $$a$$ and $$b$$), or $$\varnothing$$ (undefined). Both POSIX and leftmost greedy disambiguation policies agree that the first alternative is preferable in this case.

Determinization
TNFA determinization is based on the canonical powerset construction algorithm that converts an NFA to a DFA. The algorithm simulates NFA on all possible strings. At each step of the simulation, the active set of NFA states forms a new DFA state. If the new state is identical to an existing DFA state, it is discarded and replaced with the existing one, and the current branch of simulation terminates. Otherwise the new state is added to the growing set of DFA states and simulation from this state continues. Eventually determinization terminates: although the set of all possible strings is infinite, the set of different DFA states is finite, and at some point all new states become identical to existing ones.

In the case of TDFA naive powerset construction faces a problem: TDFA states contain tag information, which changes at every step of the simulation (as the offsets increase). This prevents TDFA states from mapping: a pair of states that contain identical TNFA states but different tag offsets are not identical and cannot be mapped. As a result, simulation continues indefinitely, the set of TDFA states grows, and determinization does not terminate. To solve this problem, Laurikari applied the idea of indirection: instead of storing immediate tag values in TDFA states, he suggested storing them indirectly in registers. Tag values in registers may be different, but it doesn't matter as long as the registers are identical. This solves the termination problem: even if TDFA states have different registers, they can be made identical (mapped to each other) by adding operations that copy the corresponding register values. Indirection is not free: it requires adding runtime register operations on transitions that update register values. To reduce the runtime overhead, Trafimovich used the lookahead optimization. The idea is to move register operations from the incoming transition into a TDFA state to the outgoing transitions from this state. This way the operations get split on the lookahead symbol, which reduces the overlap between register lifetimes and results in a faster TDFA. To use this optimization, it is necessary to track lookahead tags in each TDFA state under construction and take them into account when mapping TDFA states.





Figure 3 shows the determinization process for the running example.

• The first TDFA state $s_0$ is formed by the ε-closure of the initial TNFA state and contains states 2, 5, 8, 11. All states have the same register version $r_0$. ε-Paths from state 0 to states 5 and 11 pass through a tagged transition, therefore these states have a lookahead tag $t$, while states 2 and 8 have empty lookahead tags. Since $s_0$ contains state 11, it is final, and the lookahead tag $t$ for 11 results in a register operation $r_1 \leftarrow \mathbf{p}$.

• Stepping from $s_0$ on symbol $a$ and constructing ε-closure of states 3 and 9 results in $s_1$ with states 2, 5, 9, 11. Since the ε-closure originates from states 2 and 8 in $s_0$ that have empty lookahead tags, transition from $s_0$ to $s_1$ has no register operations, and the register version $r_0$ is unchanged. $s_1$ is also final with a register operation $r_1 \leftarrow \mathbf{p}$.

• Stepping from $s_1$ on symbol $a$ and constructing ε-closure of state 3 results in $s_2$ with states 2, 5, 11. Since the origin state 2 in $s_1$ has empty lookahead tags, transition from $s_1$ to $s_2$ has no operations, and $r_0$ is unchanged. $s_2$ is also final with a register operation $r_1 \leftarrow \mathbf{p}$.

• Stepping from $s_2$ on symbol $a$ and constructing ε-closure of state 3 results in a TDFA state identical to $s_2$, so no new state is added, but a looping transition from $s_2$ to $s_2$.

• Stepping from $s_0$ on symbol $b$ and constructing ε-closure of state 6 results in $s_3$ with states 5, 11. Since the origin state 5 in $s_0$ has a lookahead tag $t$, transition from $s_0$ to $s_3$ has a register operation $r_r \leftarrow \mathbf{p}$, and the register version in $s_3$ is changed to $r_1$. State $s_3$ is final, but it has no final register operations, as 11 has no lookahead tag.

• Stepping from $s_1$ on symbol $b$ and constructing ε-closure of states 6 and 10 requires disambiguation, as both paths from states 6 and 10 reach state 11. The path from state 6 is preferred (both by POSIX and by leftmost greedy policies). The new state is very similar to $s_3$, except that the new register is $r_2$ not $r_1$. In this case $r_1$ could be reused as well, making the new state completely identical to $s_3$. But in the general case it is unclear which registers to reuse. Also, it makes optimizations harder, as complex register lifetimes have more overlap with each other. Although the new state is not identical to $s_3$, it can be mapped to $s_3$ by adding a copy operation $r_1 \leftarrow r_2$. The two operations $r_2 \leftarrow \mathbf{p}$ and $r_1 \leftarrow r_2$ get fused into $r_1 \leftarrow \mathbf{p}$ on transition from $s_1$ to $s_3$.

• Stepping from $s_2$ on symbol $b$ is the same as the previous case, except that the new register version is $r_3$.

• Stepping from $s_3$ on symbol $b$ results in a state identical to $s_3$, therefore a looping transition from $s_3$ to $s_3$ is added to TDFA.

• There are no more states and symbols to explore, so the determinization process terminates.

Figure 4 shows the resulting TDFA.

Optimizations
The goal of optimizations is to reduce TDFA size and the number of registers and operations on transitions. This section describes a few optimizations that are used in a practical TDFA implementation. None of these optimizations is particularly complex or vital for TDFA operation, but when applied together and in the correct order they can make TDFA considerably faster and smaller.

Fixed-tags optimization is applied at the regular expression level (before TNFA construction). The idea is, if a pair of tags happens to be within fixed distance from each other, there is no need to track both of them: the value of one tag can be computed from the value of the other tag one by adding a fixed offset. For example, in the regular expression $$t_1 (a|b) \, t_2$$ tags $$t_1$$ and $$t_2$$ are within one symbol distance from each other, so $$t_1$$ can be computed as $$t_2 - 1$$. This optimization can reduce both TDFA construction time (as fixed tags are excluded from TNFA construction and determinization) and matching time (as there are fewer register operations).



Register optimizations are applied after TDFA construction. A TDFA induces a control flow graph (CFG) on registers: operations on transitions form basic blocks, and there is an arc between two blocks if one of them is reachable from the other one by TDFA transitions, without passing through other register operations. CFG represents a program on registers as variables, so the usual compiler optimizations can be applied to it (such as liveness analysis, dead code elimination and register allocation).

TDFA minimization is very similar to DFA minimization, except for one additional restriction: register actions on TDFA transitions must be taken into account. So, TDFA states that are identical, but have different register actions on incoming transitions on the same symbol, cannot be merged. All the usual algorithms for DFA minimization can be adapted to TDFA minimization, for example Moore's algorithm is used in the RE2C lexer generator.

Figure 5 shows an optimized TDFA for regular expression $$a^* t b^* | ab$$. Note that it is in fact the same as a TDFA for a semantically equivalent regular expression $$a^* t b^*$$, where ambiguity has been removed. Ambiguity is resolved during determinization, and the unoptimized TDFA on Figure 4 is unambiguous, but it has some built-in redundancy that can be removed by optimizations.

Multi-pass TDFA


TDFA with registers are well suited for ahead-of-time determinization, when the time spent on TDFA construction is not included in the runtime (e.g. in lexer generators). But for just-in-time determinization (e.g. in regular expression libraries) it is desirable to reduce TDFA construction time. Another concern is tag density in a regular expression: TDFA with registers are efficient if the number of tags is relatively small. But for heavily tagged regular expressions TDFA with registers are suboptimal: transitions get cluttered with operations, making TDFA execution slow. Register optimizations also become problematic due to the size of liveness and interference information.

Multi-pass TDFA address both issues: they reduce TDFA construction time, and they are better suited to dense submatch extraction. The main difference with canonical TDFA is that multi-pass TDFA have no register operations. Instead they have multiple passes: a forward pass that matches the input string and records a sequence of TDFA states, and one or more backward passes that iterate through the recorded states and collect submatch information. A single backward pass is sufficient, but an extra pass may be used e.g. to estimate the necessary amount of memory for submatch results. In order to trace back the matching TNFA path from a sequence of TDFA states, multi-pass TDFA use backlinks on transitions and in final states. A backlink is a pair, where the first component is an index in backlink arrays on preceding transitions, and the second component is a sequence of tags for a fragment of TNFA path between TDFA states.

Figure 6 shows an example of a multi-pass TDFA for regular expression $$(1 a 2)^* 3 (a|4 b)5b^*$$ matching on a string $$aab$$ (compare it to Figure 0 that shows TDFA with registers for the same regular expression):


 * 1) The forward pass collects a sequence of states 0, 1, 1, 2.
 * 2) The backward pass traces backlinks from state 2 to state 0 and concatenates tag sequences and input symbols. It starts with backlink $$(0,t_5)$$. Tagged string is $$t_5$$. Index 0 selects backlink $$(0,t_2 t_3 t_4)$$ on preceding transition from state 1 to state 2 (the only one on this transition). Tagged string becomes $$t_2 t_3 t_4 b \, t_5$$. Index 0 selects backlink $$(0,t_2 t_1)$$ on preceding looping transition from state 1 to state 1. Tagged string becomes $$t_2 t_1 a \, t_2 t_3 t_4 b \, t_5$$. Index 0 selects backlink $$(0,t_1)$$ on preceding transition from state 0 to state 1. The resulting tagged string is $$t_1 a \, t_2 t_1 a \, t_2 t_3 t_4 b \, t_5$$. It contains all the necessary information to reconstruct last offsets or offset lists for each tag.

Related automata
StaDFA described by Mohammad Imran Chowdhury are very similar to TDFA, except that they have register operations in states, not on transitions.

DSSTs (Deterministic Streaming String Transducers) described by Grathwohl are more distant relatives to TDFA, better suited to full parsing than submatch extraction. DSST states contain path trees constructed by the ε-closure, while TDFA states contain similar information in a decomposed form (register versions and lookahead tags).