Tree transducer

In theoretical computer science and formal language theory, a tree transducer (TT) is an abstract machine taking as input a tree, and generating output – generally other trees, but models producing words or other structures exist. Roughly speaking, tree transducers extend tree automata in the same way that word transducers extend word automata.

Manipulating tree structures instead of words enables TT to model syntax-directed transformations of formal or natural languages. However, TT are not as well-behaved as their word counterparts with respect to algorithmic complexity, closure properties, and so on. In particular, most of the main classes are not closed under composition.

The main classes of tree transducers are:

Top-Down Tree Transducers (TOP)
A TOP T is a tuple $(Q, Σ, Γ, I, δ)$ such that:


 * $Q$ is a finite set, the set of states;
 * $Σ$ is a finite ranked alphabet, called the input alphabet;
 * $Γ$ is a finite ranked alphabet, called the output alphabet;
 * $I$ is a subset of Q, the set of initial states; and
 * $δ$ is a finite set of rules of the form $$q(f(x_1,\dots,x_n)) \to u$$, where f is a symbol of Σ, n is the arity of f, q is a state, and u is a tree on Γ and $$Q\times \{1,\dots,n\}$$, such pairs being nullary.

Examples of rules and intuitions on semantics
For instance,
 * $$q(f(x_1,x_2,x_3)) \to g(a,q'(x_1),h(q''(x_3)))$$

is a rule – one customarily writes $$q(x_i)$$ instead of the pair $$(q,x_i)$$ – and its intuitive semantics is that, under the action of q, a tree with f at the root and three children is transformed into
 * $$g(a,q'(x_1),h(q''(x_3)))$$

where, recursively, $$q'(x_1)$$ and $$q''(x_3)$$ are replaced, respectively, with the application of $$q'$$ on the first child and with the application of $$q''$$ on the third.

Semantics as term rewriting
The semantics of each state of the transducer T, and of T itself, is a binary relation between input trees (on Σ) and output trees (on Γ).

A way of defining the semantics formally is to see $$\delta$$ as a term rewriting system, provided that in the right-hand sides the calls are written in the form $$q(x_i)$$, where states q are unary symbols. Then the semantics $$[\![q]\!]$$ of a state q is given by
 * $$[\![q]\!] = \{ u \mapsto v \mid u \text{ is a tree on } \Sigma,\ v\text{ is a tree on } \Gamma \text{, and } q(u) \to_{\delta}^* v \} . $$

The semantics of T is then defined as the union of the semantics of its initial states:
 * $$[\![T]\!] = \bigcup_{q\in I} [\![q]\!].$$
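
The rewriting semantics above can be sketched in a few lines of Python. This is only an illustration under assumed conventions: trees are nested tuples `(symbol, child_1, ..., child_n)`, a call $$q(x_i)$$ is a hypothetical `Call(state, index)` record, and the names `outputs`, `expand` and `transduce` are invented for this sketch. The example rules are the rule shown earlier (with q1 and q2 standing for $$q'$$ and $$q''$$) plus assumed leaf rules, one of which is nondeterministic.

```python
from itertools import product
from typing import NamedTuple


class Call(NamedTuple):
    """A call q(x_i) in a right-hand side (index is 1-based)."""
    state: str
    index: int


def outputs(rules, q, tree):
    """Return the set of all output trees v such that q(tree) rewrites to v."""
    symbol, children = tree[0], tree[1:]
    results = set()
    for rhs in rules.get((q, symbol, len(children)), []):
        results |= expand(rules, rhs, children)
    return results


def expand(rules, rhs, children):
    """Rewrite every call in rhs; each occurrence is resolved independently."""
    if isinstance(rhs, Call):
        return outputs(rules, rhs.state, children[rhs.index - 1])
    options = [expand(rules, sub, children) for sub in rhs[1:]]
    return {(rhs[0], *combo) for combo in product(*options)}


def transduce(rules, initial, tree):
    """Semantics of the transducer: union over its initial states."""
    results = set()
    for q in initial:
        results |= outputs(rules, q, tree)
    return results


# Rules are keyed by (state, input symbol, arity); values list the right-hand sides.
RULES = {
    ("q", "f", 3): [("g", ("a",), Call("q1", 1), ("h", Call("q2", 3)))],
    ("q1", "b", 0): [("b",)],
    ("q2", "c", 0): [("c",), ("d",)],  # two rules for q2(c): nondeterminism
}

INPUT = ("f", ("b",), ("b",), ("c",))
RESULT = transduce(RULES, {"q"}, INPUT)
```

Because `q2` has two rules for the leaf `c`, `RESULT` contains two output trees, which shows why the semantics is a relation rather than a function in general.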

Determinism and domain
As with tree automata, a TOP is said to be deterministic (abbreviated DTOP) if no two rules of δ share the same left-hand side, and there is at most one initial state. In that case, the semantics of the DTOP is a partial function from input trees (on Σ) to output trees (on Γ), as are the semantics of each of the DTOP's states.

The domain of a transducer is the domain of its semantics. Likewise, the image of a transducer is the image of its semantics.
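
The determinism condition is purely syntactic and can be checked directly on the rule set. The following sketch assumes rules are given as `(state, symbol, arity, rhs)` tuples, with the right-hand side left opaque; the function name `is_deterministic` and the sample rules are illustrative only.

```python
from collections import Counter


def is_deterministic(rules, initial):
    """True if no two rules share a left-hand side and there is at most one initial state."""
    lhs_counts = Counter((q, f, n) for q, f, n, _rhs in rules)
    return len(initial) <= 1 and all(count == 1 for count in lhs_counts.values())


# Right-hand sides are written as strings here because only left-hand sides matter.
DETERMINISTIC_RULES = [
    ("q", "f", 2, "g(p(x1), p(x2))"),
    ("p", "a", 0, "a"),
    ("p", "b", 0, "b"),
]
# Adding a second rule for p(a) breaks determinism.
NONDETERMINISTIC_RULES = DETERMINISTIC_RULES + [("p", "a", 0, "b")]
```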

Properties of DTOP

 * DTOP are not closed under union: this is already the case for deterministic word transducers.
 * The domain of a DTOP is a regular tree language. Furthermore, the domain is recognisable by a deterministic top-down tree automaton (DTTA) of size at most exponential in that of the initial DTOP.
 * That the domain is DTTA-recognisable is not surprising, since the left-hand sides of DTOP rules are the same as for DTTA. The exponential blow-up in the worst case, which does not occur in the word case, comes from copying: consider the rule $$q(f(x_1,x_2)) \to g(p_1(x_1),p_2(x_1),p_3(x_2))$$. For the computation to succeed, it must succeed on both children. The right child must therefore be in the domain of $$p_3$$, while the left child must be in the domain of both $$p_1$$ and $$p_2$$. More generally, since subtrees can be copied, a single subtree can be processed by several states during a run, despite the determinism, and unlike in a DTTA. The construction of the DTTA recognising the domain of a DTOP must therefore track sets of states and compute the intersections of their domains, hence the exponential bound. In the special case of linear DTOP, that is, DTOP where each $$x_i$$ appears at most once in the right-hand side of each rule, the construction is linear in time and space.
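
The set-of-states idea can be sketched as a membership test for the domain. The code below is an assumption-laden illustration, not the construction from the literature: it explores sets of states lazily on one input tree instead of building the whole automaton, and the names `Call`, `calls`, `step` and `in_domain` are invented here. The exponential bound corresponds to the fact that the argument `states` of `step` can range over every subset of Q.

```python
from typing import NamedTuple


class Call(NamedTuple):
    """A call q(x_i) in a right-hand side (index is 1-based)."""
    state: str
    index: int


def calls(rhs):
    """Yield every call occurring in a right-hand side."""
    if isinstance(rhs, Call):
        yield rhs
    else:
        for sub in rhs[1:]:
            yield from calls(sub)


def step(rules, states, symbol, arity):
    """One top-down transition on a set of states.

    Returns, for each child, the set of states that will process it,
    or None if some state in `states` has no rule for this symbol.
    """
    child_sets = [set() for _ in range(arity)]
    for q in states:
        rhs = rules.get((q, symbol, arity))
        if rhs is None:
            return None
        for call in calls(rhs):
            child_sets[call.index - 1].add(call.state)
    return tuple(frozenset(s) for s in child_sets)


def in_domain(rules, states, tree):
    """True if `tree` is in the domain of every state in `states`."""
    child_sets = step(rules, states, tree[0], len(tree) - 1)
    if child_sets is None:
        return False
    return all(in_domain(rules, s, c) for s, c in zip(child_sets, tree[1:]))


# The rule from the text, plus assumed leaf rules: p2 is undefined on b.
RULES = {
    ("q", "f", 2): ("g", Call("p1", 1), Call("p2", 1), Call("p3", 2)),
    ("p1", "a", 0): ("a",),
    ("p1", "b", 0): ("b",),
    ("p2", "a", 0): ("a",),
    ("p3", "a", 0): ("a",),
}
```

With these rules, the left child of f is checked against the set containing both p1 and p2, so `f(b, a)` is rejected even though p1 alone accepts b. A child that receives the empty set of states is never inspected and is accepted unconditionally.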


 * The image of a DTOP is not necessarily a regular tree language.
 * Consider the transducer coding the transformation $$f(x)\to g(x,x)$$; that is, duplicate the child of the input. This is easily done by a rule $$q(f(x_1)) \to g(p(x_1),p(x_1))$$, where p encodes the identity. Then, absent any restriction on the child of the input, the image is the set of trees $$g(t,t)$$, a classical non-regular tree language: a finite tree automaton cannot check that two subtrees of unbounded size are equal.
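
The copying transducer can be run on a concrete input to see the shape of its image. This is a minimal sketch under assumed conventions: trees are nested tuples, `Call`, `run` and `build` are names invented here, and the input alphabet (a unary symbol s and a constant z below f) is chosen only for illustration.

```python
from typing import NamedTuple


class Call(NamedTuple):
    """A call q(x_i) in a right-hand side (index is 1-based)."""
    state: str
    index: int


def run(rules, q, tree):
    """Apply state q of a DTOP to a tree; return None where it is undefined."""
    rhs = rules.get((q, tree[0], len(tree) - 1))
    if rhs is None:
        return None
    return build(rules, rhs, tree[1:])


def build(rules, rhs, children):
    """Instantiate a right-hand side by evaluating its calls on the children."""
    if isinstance(rhs, Call):
        return run(rules, rhs.state, children[rhs.index - 1])
    parts = [build(rules, sub, children) for sub in rhs[1:]]
    if any(part is None for part in parts):
        return None
    return (rhs[0], *parts)


COPY_RULES = {
    ("q", "f", 1): ("g", Call("p", 1), Call("p", 1)),  # q(f(x1)) -> g(p(x1), p(x1))
    ("p", "s", 1): ("s", Call("p", 1)),                # p is the identity on s ...
    ("p", "z", 0): ("z",),                             # ... and on z
}

OUTPUT = run(COPY_RULES, "q", ("f", ("s", ("s", ("z",)))))
```

Every output has the form `g(t, t)` with two identical children, whatever the depth of `t`.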


 * However, the domain of a DTOP cannot always be restricted to a regular tree language. That is to say, given a DTOP T and a regular tree language L, one cannot in general build a DTOP $$T'$$ such that the semantics of $$T'$$ is that of T, restricted to L.
 * This property is linked to the reason deterministic top-down tree automata are less expressive than bottom-up automata: once you go down a given path, information from other paths is inaccessible. Consider the transducer coding the transformation $$f(x,y)\to y$$; that is, output the right child of the input. This is easily done by a rule $$q(f(x_1,x_2)) \to p(x_2)$$, where p encodes the identity. Now suppose we want to restrict this transducer to the finite (and thus, in particular, regular) domain $$\{ f(c,a),\ f(c,b) \}$$. We must use the rules $$q(f(x_1,x_2)) \to p(x_2),\ p(a) \to a,\ p(b)\to b$$. But in the first rule, $$x_1$$ does not appear at all, since nothing is produced from the left child. Thus, it is not possible to test that the left child is c. In contrast, since we produce output from the right child, we can test that it is a or b. In general, the criterion is that DTOP cannot test properties of subtrees from which they do not produce output.


 * DTOP are not closed under composition. However, this problem can be solved by adding a lookahead: a tree automaton, coupled to the transducer, that can perform tests on the input that the transducer alone cannot.
 * This follows from the point about domain restriction: composing the DTOP encoding identity on $$\{ f(c,a),\ f(c,b) \}$$ with the one encoding $$f(x,y)\to y$$ must yield a transducer with the semantics $$\{ f(c,a) \mapsto a,\ f(c,b) \mapsto b \}$$, which we know is not expressible by a DTOP.
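
The counterexample can be replayed concretely. In the sketch below, all names (`Call`, `run`, `build`, `composed`, and the state names r, rc, rab) are assumptions made for illustration. Composition is computed by simply running one transducer on the output of the other, and the result is compared with the only candidate DTOP rules for the composed function.

```python
from typing import NamedTuple


class Call(NamedTuple):
    """A call q(x_i) in a right-hand side (index is 1-based)."""
    state: str
    index: int


def run(rules, q, tree):
    """Apply state q of a DTOP to a tree; return None where it is undefined."""
    rhs = rules.get((q, tree[0], len(tree) - 1))
    if rhs is None:
        return None
    return build(rules, rhs, tree[1:])


def build(rules, rhs, children):
    """Instantiate a right-hand side by evaluating its calls on the children."""
    if isinstance(rhs, Call):
        return run(rules, rhs.state, children[rhs.index - 1])
    parts = [build(rules, sub, children) for sub in rhs[1:]]
    if any(part is None for part in parts):
        return None
    return (rhs[0], *parts)


# Identity on { f(c,a), f(c,b) }, initial state r.
IDENTITY_ON_L = {
    ("r", "f", 2): ("f", Call("rc", 1), Call("rab", 2)),
    ("rc", "c", 0): ("c",),
    ("rab", "a", 0): ("a",),
    ("rab", "b", 0): ("b",),
}
# f(x, y) -> y restricted to y in {a, b}, initial state q.
RIGHT_CHILD = {
    ("q", "f", 2): Call("p", 2),
    ("p", "a", 0): ("a",),
    ("p", "b", 0): ("b",),
}


def composed(tree):
    """Semantic composition: run RIGHT_CHILD on the output of IDENTITY_ON_L."""
    middle = run(IDENTITY_ON_L, "r", tree)
    return None if middle is None else run(RIGHT_CHILD, "q", middle)


GOOD = ("f", ("c",), ("a",))
BAD = ("f", ("d",), ("a",))  # left child is not c
```

The composed function is undefined on `BAD`, whereas `RIGHT_CHILD` on its own maps `BAD` to `a`: since it produces nothing from the left child, it has no way of rejecting it.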


 * The typechecking problem, that is, testing whether the image of a regular tree language is included in another regular tree language, is decidable.
 * The equivalence problem, that is, testing whether two DTOP define the same function, is decidable.

Bottom-Up Tree Transducers (BOT)
As in the simpler case of tree automata, bottom-up tree transducers are defined similarly to their top-down counterparts, but proceed from the leaves of the tree to the root, instead of from the root to the leaves. The main difference lies in the rules, which have the form $$f(q_1(x_1),\dots,q_n(x_n)) \to q(u)$$, where u is a tree on Γ and the variables $$x_1,\dots,x_n$$.
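
A deterministic bottom-up run can be sketched as follows. The encoding is an assumption made for illustration: trees are nested tuples, a variable $$x_i$$ in u is a hypothetical `Var(i)` record, rules are keyed by the input symbol and the tuple of states reached on the children, and final states are left out. The example rules are invented: they output the right child of f, but only when the left child is c.

```python
from typing import NamedTuple


class Var(NamedTuple):
    """A variable x_i in the output template u (index is 1-based)."""
    index: int


def substitute(u, child_outputs):
    """Replace each variable x_i in u by the output computed for child i."""
    if isinstance(u, Var):
        return child_outputs[u.index - 1]
    return (u[0], *(substitute(sub, child_outputs) for sub in u[1:]))


def run_bottom_up(rules, tree):
    """Return (state, output) for the tree, or None if no rule applies."""
    results = [run_bottom_up(rules, child) for child in tree[1:]]
    if any(result is None for result in results):
        return None
    child_states = tuple(state for state, _output in results)
    rule = rules.get((tree[0], child_states))
    if rule is None:
        return None
    state, u = rule
    return state, substitute(u, [output for _state, output in results])


RULES = {
    ("c", ()): ("qc", ("c",)),            # c -> qc(c)
    ("a", ()): ("qab", ("a",)),           # a -> qab(a)
    ("b", ()): ("qab", ("b",)),           # b -> qab(b)
    ("f", ("qc", "qab")): ("qf", Var(2)), # f(qc(x1), qab(x2)) -> qf(x2)
}
```

In this sketch, the children are processed before the rule for f is chosen, so the state reached on the left child is known even though its output is discarded.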