User:VincentBMT/Ariane

History
From 1970 to 1988, the GÉTA research group (Grenoble, France) developed a linguistic methodology for building multilingual MT systems for revisors (MT-R), specialized to sublanguages and relying on heuristic analysis. A comprehensive MT computer environment called Ariane was programmed and used to develop a large number of MT-R mockups and prototypes, as well as two large-scale operational systems: Russian-French and French-English.

Between 1988 and 2002, the methodology and the computer tools were revised and further developed into a framework for quality MT for monolingual authors, relying on a disambiguation dialogue with the author (DBMT, dialogue-based MT) that follows an all-paths analysis. The software architecture became distributed, with the author using a middle-range Macintosh and Ariane-G5 running as a remote server accessed through the Internet.

This new basic software is now being used in two projects with different aims and requirements. First, in the UNL project of personal multilingual high-quality communication over the Internet, our challenge is to go from the LIDIA mockup to a real system relying on a huge lexical database. Second, we have started a long-term project in Speech Translation (ST), and are concentrating first on the French part of the C-STAR II demonstrators, to show that useful ST systems can be obtained by integrating admittedly imperfect speech recognition and machine translation components in a communication environment permitting limited user control and multimedia feedback.

In both projects, we reuse the Ariane-G5/LIDIA software, with some new developments: (1) for UNL, we are building a multilingual database to generate Ariane and interactive disambiguation dictionaries, and filters between UNL graphs and Ariane trees; (2) our first C-STAR prototype uses HTTP access to Ariane-G5, and MT starts from a phonetic word lattice output by the speech recognizer. In both projects, we also reuse the LIDIA multilevel pivot approach, where linguistic structures contain a lexical level of interlingual acceptions, but the analysis strategy is heuristic for C-STAR and all-paths (as in LIDIA) for UNL, and the interlingual acceptions are of a different nature.

Since 2002, ...

Basic software tools and linguistic methodology for multilingual MT-R => Software tools and linguistic methodology for multilingual machine translation

Ariane-G5, an MT shell for building multilingual machine translation systems

GÉTA’s linguistic methodology
Although Ariane-G5 does not propose or impose any methodology of linguistic programming, most of its users follow a certain number of principles, mainly due to Bernard Vauquois.

In the context of MT-R of a sublanguage, B. Vauquois recommends that analysis deliver a UMA-structure, for “unique, multilevel and abstract” structure [51]. The geometry of the tree reflects one choice for the organization into syntagmatic groups, but the structure is an “abstract tree”, from which the text may not necessarily be reconstructed in an immediate way. For example, it is convenient to regroup discontinuous constituents (e.g. “les garçons les ont tous vues” for “the kids_i have all_i seen them_j”), to “variabilize” negations, auxiliaries, articles, strongly governed prepositions, certain modals, etc., thus obtaining trees considerably smaller than the “concrete” (“surface”) trees provided by direct application of extended context-free grammars (GPSG or others). In principle, each internal node dominates a leaf which is the governor of the group (from “gouverneur” in French, usually “head” in English), unless the governor is itself a compound. In order to get a dependency structure analogous to those of the Prague school, it is enough, in a first approximation, to recursively “send up” each governor to replace its mother node. In other words, Vauquois’ structures are “lexicalized” intermediates between constituent and dependency structures, at least geometrically.
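The “send up the governor” operation can be illustrated with a minimal Python sketch. This is not Ariane code: the `Node` class, the `is_governor` flag and the `show` helper are hypothetical names introduced only for the illustration, and the original position of the governor among its sisters is deliberately not preserved (first approximation only).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                                  # syntagmatic class or word form
    children: List["Node"] = field(default_factory=list)
    is_governor: bool = False                   # hypothetical flag: this daughter governs its group


def to_dependency(node: Node) -> Node:
    """Recursively replace each internal node by the daughter marked as governor."""
    if not node.children:
        return Node(node.label)
    head = None
    dependents = []
    for child in node.children:
        converted = to_dependency(child)
        if child.is_governor and head is None:
            head = converted                    # the governor is "sent up"
        else:
            dependents.append(converted)        # the other daughters become its dependents
    if head is None:
        raise ValueError(f"group {node.label!r} has no governor")
    head.children = head.children + dependents
    return head


def show(node: Node) -> str:
    """Linear rendering of a tree: label(daughter, daughter, ...)."""
    if not node.children:
        return node.label
    return f"{node.label}({', '.join(show(c) for c in node.children)})"


constituents = Node("VCL", [
    Node("NP", [Node("the"), Node("kids", is_governor=True)]),
    Node("saw", is_governor=True),
    Node("NP", [Node("the"), Node("house", is_governor=True)]),
])
print(show(to_dependency(constituents)))        # saw(kids(the), house(the))
```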

Properties and relations are coded in the decorations attached to the nodes and constitute the “algebraic” part of the structure. For example, a node having “attribute of object” as value of the syntactic function (SF=ATROBJ) is the attribute of the group dominated by its (unique) sister node having SF=OBJ1. Hence, there are two syntactic levels, that of classes (morphosyntactic and syntagmatic), and that of functions. To translate into languages which are not very close to the source language without having to write large structural transfers, it is advisable to add two more levels, logical and semantic.

The logical level (RL variable, for “relation logique”) gives the positions of arguments of linguistic predicates. ARG0 denotes the logical subject (most often actor or agent) of the predicate which is the head (“governor”) of the same group, ARG1 denotes its logical object (in general the patient, but not in ergative constructs such as “the twig breaks”), and ARG2 denotes its third argument. The numbering is such that ARG1 corresponds to OBJ1 in standard active constructs, but that is purely a convention. For example, “the building of the house” and “to build the house” have identical structures at that level, the group “(of) the house” being ARG1.

TRL10 is used in place of ARG1 if the predicate is attributive (“to be”, “to seem”, “to appear”…), and TRL21 in place of ARG2 if the predicate attributes ARG2 to ARG1 (“to consider ARG1 as TRL21”). This is a way to indicate that the relation does not link the node with the governor (the predicate), but with another argument. Similarly, one often uses another variable (such as RLI, for “inverse logical relation”) to code the link between arguments in control constructs. For example, in “I ask him to come”, the group “to come” is ARG1 of “ask”, and bears RLI=00 if ARG0 (I) is coming and RLI=02 if ARG2 (he) is coming.

Finally, we use, mainly on circumstantials (modifiers), the semantic relation (RS), which grosso modo corresponds to the “deep case” (localization, origin, goal, accompaniment, manner, qualification, measure, cause, concession, etc.). In practice, RL and RS are complementary, because it is extremely difficult (even manually) to assign RS to arguments in a reliable manner, and circumstantials can be correctly translated only if their RS are known.

In this respect, the famous problem of the translation of prepositions is often not well stated. If an argument is concerned, the whole construct (predicate+arguments, e.g. “to talk about sth. with sb.”, “to count for sb., sth.”) should be translated as a block. If a circumstantial is concerned, the RS, possibly particularized by the preposition (or its absence), should be translated. For instance, in “to come by Lyon” and “to come via Lyon”, the circumstantial should bear RS=LOC, SEM=SPACE and SLOC=QUA (localization in space, movement through sth.), thus allowing for exact translation of “by” (“par” and not “près de”, “à côté de”, “devant”, “de”, “d’après”, “suivant”, “à”…). Keeping the preposition also allows one to translate more exactly into a language like French, which also has two prepositions for this sense (“par”, “via”).
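As an illustration of translating a circumstantial through its RS rather than through its preposition alone, here is a small Python sketch. The table layout, the `"*"` fallback key and the function name are assumptions made for the example; only the variable values (RS=LOC, SEM=SPACE, SLOC=QUA) and the French prepositions come from the text above.

```python
from typing import Dict, Optional, Tuple

# Hypothetical transfer table: (RS, SEM, SLOC) -> {source preposition: French preposition}.
# "*" is the fallback used when the source preposition has no dedicated entry.
CIRCUMSTANTIAL_PREPOSITIONS: Dict[Tuple[str, str, str], Dict[str, str]] = {
    ("LOC", "SPACE", "QUA"): {"via": "via", "*": "par"},
}


def translate_preposition(decoration: Dict[str, str],
                          source_prep: Optional[str] = None) -> Optional[str]:
    """Choose the target preposition from the semantic relation, refined by the source preposition."""
    key = (decoration.get("RS"), decoration.get("SEM"), decoration.get("SLOC"))
    choices = CIRCUMSTANTIAL_PREPOSITIONS.get(key)
    if choices is None:
        return None                      # no entry for this relation in the toy table
    return choices.get(source_prep, choices["*"])


through_lyon = {"RS": "LOC", "SEM": "SPACE", "SLOC": "QUA"}
print(translate_preposition(through_lyon, "by"))    # par
print(translate_preposition(through_lyon, "via"))   # via
```

Keeping the source preposition as a secondary key is what lets “via” survive as “via” while “by” is resolved from the relation alone.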

Several levels are similarly used for the actualization variables, such as number (morphological and logical), time vs. tense, etc. The order of the text is reflected as much as possible in the structure. As a matter of fact, order gives important information which is not well formalized, such as thematic articulation and emphasis. This avoids coding it explicitly in a tactical variable. If one considers the UMA-structure of a sentence only at the “deep” levels, it can be thought to represent a whole family of sentences of equivalent meanings. If it is considered at all levels, it should correspond to only one sentence, notwithstanding spelling variants (such as disc/disk, program/programme, or corpuses/corpora).

Besides these various levels of linguistic description, one also encodes in the UMA-structures produced by analysis unresolved ambiguities and doubts on parts of the construction, in order to avoid a combinatorial explosion, to be able to warn the revisor, and at the same time to try to transfer those ambiguities which persist in translation (e.g. “the conquest of the Romans”). The aim of transfer is to perform lexical translation and to adapt the structure so as to deliver to the generator a structure coherent with the linguistic system of the target language. This structure is called GMA-structure, for “generating, multilevel, abstract structure”. In principle, the generator considers that the GMA-structure it receives is under-specified with respect to the surface levels, and recomputes them.

Hence, the first logical step of generation consists of selecting a paraphrase of the meaning expressed by the GMA-structure by computing the UMA-structure of the translation to be produced. The second step consists of producing a surface tree (“concrete” tree, or UMC-structure), by creating nodes for articles, auxiliaries, negation elements and punctuation marks, by dividing or merging sentences if necessary, etc. The third step is the morphological generation which, starting from the sequence of the leaves, constructs the occurrences of the final text.

The specialized languages
Ariane-G5 is a generator (G) of MT systems based on five (5) specialized languages for linguistic programming (SLLP). Each such language is compiled. The internal structures produced by its compiler are used as parameters by its “engine”. The complete documentation, in French, is available at GÉTA ([6] and later additions).

ATEF, a language for morphological analysis
ATEF was designed in 1971 by J. Chauché [23], who wrote the engine, while P. Guillaume and M. Quézel-Ambrunaz wrote the compilers of the different components. Since then, ATEF has undergone numerous extensions, but the underlying algorithmic model has not varied. As a matter of fact, it is a very satisfactory tool.

The system handles each occurrence of the text in turn, examining a priori all possible analyses (non-deterministic total mode with backtracking). The current occurrence is named C. An analysis result is a decoration or a sequence of decorations (in the case of compound words). Each step of a particular analysis consists of choosing one of the open dictionaries, finding there an item whose key, the morph, is a prefix (or a suffix, in right-to-left mode) of what remains to be analyzed (noted A), which is reduced accordingly, and applying one of the rules associated with the morphological format of the considered item.
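The all-paths segmentation described above can be sketched in a few lines of Python. This is only an illustration of the search (left-to-right mode, every dictionary always open, no rule conditions), not ATEF syntax; the dictionary names, format labels and the function name `analyses` are hypothetical.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str, str]                 # (dictionary name, morph, morphological format)
Dictionaries = Dict[str, Dict[str, str]]    # dictionary name -> {morph: format label}


def analyses(remaining: str, dictionaries: Dictionaries,
             path: Tuple[Step, ...] = ()) -> List[List[Step]]:
    """Return every way of reducing `remaining` (the string A) to '' with dictionary morphs."""
    if remaining == "":
        return [list(path)]                 # A is empty: this analysis succeeds
    results: List[List[Step]] = []
    for dict_name, entries in dictionaries.items():
        for morph, fmt in entries.items():
            # Skip empty morphs so that every step consumes at least one character.
            if morph and remaining.startswith(morph):
                step = (dict_name, morph, fmt)
                results.extend(
                    analyses(remaining[len(morph):], dictionaries, path + (step,))
                )
    return results                          # an empty list means "unknown word"


toy_dictionaries: Dictionaries = {
    "BASES": {"port": "VERB_STEM", "porte": "NOUN_STEM"},
    "ENDINGS": {"es": "PRES_2SG", "s": "PLURAL"},
}

for analysis in analyses("portes", toy_dictionaries):
    print([morph for _, morph, _ in analysis])
# ['port', 'es']
# ['porte', 's']
```

In the real system the rules attached to each format decide whether a step is acceptable and which dictionaries remain open; the sketch keeps only the enumeration of segmentations.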

The rules may contain conditions bearing on the current state (decoration C), on the strings C and A, on the partial results produced by the current analysis (PS1 to PS9) in case of a compound word, and also on the four preceding occurrences (from P1 to P4) and on the results of their analysis. A particular form of condition consists of giving a list of “sub-rules” and requiring that at least one of them applies (as a sub-rule may itself have sub-rules, this happens in non-deterministic unary mode with backtracking). Finally, it is possible to store a condition on the analysis of the following occurrence.

There are three types of action: assignment of values to mask C, transformation of what remains to be segmented (string A), and call of special functions. These functions make it possible to:
 * control the built-in backtracking by pruning the choice tree (functions FINAL, ARRET, ARD, ARF, STOP) or by opening and closing dictionaries (through assignment of the obligatory non-exclusive morphological variable DICT);
 * produce a partial result from the current state C (function SOL);
 * transform C or A into a UL (lexical unit), thereby reducing A to '' (functions TRANS and TRANSA);
 * decide that a sentence boundary has been reached (function INIT).

If an occurrence is not recognized (“unknown word”), that is, if no analysis succeeds in reducing A to '' while producing a current state C having a non-null UL, the system starts analysis again, after having attached to the occurrence the obligatory morphological format MODINC, which must in particular call the obligatory rule MOTINC (“rule of the unknown word”). As that rule may call sub-rules, and as that format may call other rules, it is possible to construct a true “grammar of the unknown word”, and to program sophisticated strategies for analyzing unknown words.

During processing, the automaton (ATEF “engine”) constructs a (“4-colour backward”) graph where the nodes are the masks (or lists of masks for the compound words) associated with the solutions found, and the edges indicate compatibility between analyses (at distance 1 to 4). The final graph is then transformed into the desired output form. The Q-graph output is no longer available, and two other output forms, the 1-colour forward graph and the tree “with homosentences” (presenting all paths in the 4-colour graph separately), are no longer used.

The standard output of ATEF is a tree “without homosentences”, which encodes ambiguities. Its root corresponds to the whole text and bears UL='ULTXT'. Its daughters correspond to the sentences (determined by the grammar) and bear UL='ULFRA'. Under each of them are nodes with UL='ULOCC', which correspond to the occurrences (words or fixed contiguous idioms). Under each 'ULOCC' are the different results of the morphological analysis of the corresponding occurrence. Each result is either a mask of variables (a node) or a sub-tree whose root has UL='ULMCP' (compound word) and dominates the masks corresponding to the different parts recognized in the word.
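The shape of this output tree can be shown with a short Python sketch. The dictionary-based node representation, the function names and the toy masks (including the CAT variable) are assumptions made for the illustration; only the UL values 'ULTXT', 'ULFRA', 'ULOCC' and 'ULMCP' come from the description above.

```python
from typing import Dict, List, Optional, Union

Mask = Dict[str, str]                      # one analysis result: a mask of variables, including UL
Result = Union[Mask, List[Mask]]           # a list of masks stands for a compound word


def node(ul: str, variables: Optional[Dict[str, str]] = None,
         children: Optional[list] = None) -> dict:
    return {"UL": ul, "vars": variables or {}, "children": children or []}


def mask_node(mask: Mask) -> dict:
    return node(mask["UL"], {k: v for k, v in mask.items() if k != "UL"})


def result_node(result: Result) -> dict:
    if isinstance(result, list):           # compound word: ULMCP dominating its parts
        return node("ULMCP", children=[mask_node(m) for m in result])
    return mask_node(result)


def build_text_tree(sentences: List[List[List[Result]]]) -> dict:
    """sentences -> occurrences -> alternative results, wrapped as ULTXT / ULFRA / ULOCC."""
    return node("ULTXT", children=[
        node("ULFRA", children=[
            node("ULOCC", children=[result_node(r) for r in occurrence])
            for occurrence in sentence
        ])
        for sentence in sentences
    ])


tree = build_text_tree([[
    # first occurrence: two competing analyses (the ambiguity is kept in the tree)
    [{"UL": "PORTE", "CAT": "N"}, {"UL": "PORTER", "CAT": "V"}],
    # second occurrence: one analysis, as a compound word made of two parts
    [[{"UL": "PORTER", "CAT": "V"}, {"UL": "CLEF", "CAT": "N"}]],
]])
print([r["UL"] for r in tree["children"][0]["children"][0]["children"]])   # ['PORTE', 'PORTER']
```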

ROBRA, a language for transforming decorated trees
ROBRA [14] is a language for writing transformational systems working on decorated trees. It is the successor of the CETA language [23]. Numerous extensions have been introduced, the semantics has been made more precise, and the engine has been totally respecified and rewritten.

A transformational system (ST) is defined by a control graph (GC), a set of transformational grammars (GT) and a set of rules (RP for “production rules”). A GT is an ordered set of rules. A GC is a graph where each node bears a GT or the exit symbol (&NUL) and the edges bear tree conditions. Note that each “grammar” component GRi of a ROBRA phase actually contains a whole transformational system, possibly consisting of a large GC with dozens of GTs.

To execute an ST on an object tree (AO), ROBRA uses the GC as a non-deterministic (unary with backtracking) control structure: starting from an initial node, it looks for the first valid path leading to an exit. On this path, it executes the grammars contained in the nodes, and traverses an edge only if the current AO verifies its condition.
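One possible reading of this control regime is sketched below in Python. It is an interpretation, not ROBRA itself: grammars are modelled as plain functions on immutable trees, the names `run`, `grammars` and `graph` are hypothetical, and the sketch assumes an acyclic GC (a loop in the graph would make the recursion unbounded).

```python
from typing import Callable, Dict, List, Optional, Tuple

Tree = Tuple[str, tuple]                       # (label, daughters); immutable, so backtracking is free
Grammar = Callable[[Tree], Tree]               # stand-in for a transformational grammar (GT)
Edge = Tuple[Callable[[Tree], bool], str]      # (tree condition, target node)

EXIT = "&NUL"


def run(graph: Dict[str, List[Edge]], grammars: Dict[str, Grammar],
        start: str, tree: Tree) -> Optional[Tree]:
    """Follow the first valid path from `start` to an exit; return None if there is none."""
    if start == EXIT:
        return tree
    tree = grammars[start](tree)               # execute the grammar borne by this node
    for condition, target in graph.get(start, []):
        if condition(tree):                    # traverse an edge only if the current tree verifies it
            result = run(graph, grammars, target, tree)
            if result is not None:
                return result                  # unary mode: the first successful path wins
    return None                                # dead end: the caller tries its next edge


grammars: Dict[str, Grammar] = {
    "G1": lambda t: (t[0], t[1] + (("MARK", ()),)),   # add a daughter
    "G2": lambda t: ("X", t[1]),                      # relabel the root
}
graph: Dict[str, List[Edge]] = {
    "G1": [(lambda t: True, "G2"), (lambda t: True, EXIT)],
    "G2": [(lambda t: t[0] == "NP", EXIT)],           # fails after G2, forcing a backtrack
}
print(run(graph, grammars, "G1", ("NP", ())))         # ('NP', (('MARK', ()),))
```

In the example, the path through G2 is abandoned because its only outgoing condition fails, so the result is the tree as left by G1.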

The execution of a GT consists of one elementary application in unitary (U) mode, or of several in iterated mode (E for “exhaustive”). In an elementary application, many rules of the GT are applied in parallel, which necessitates a mechanism for conflict resolution. An elementary application ends only after the recursive calls of sub-grammars (SGT) or sub-systems (SST) possibly triggered by the application of certain rules have been completed.

A system of interdictions (rules are marked, nodes are blocked) allows one to statically test the ST for decidability: the compiler may warn the user of the risks of undecidability (loops in the GC, “free” mode in an iterative GT, constraint on recursive calls not satisfied, etc.).

The schemas which appear in left-hand sides of rules have a very great expressive power. For each node, it is possible to indicate whether its daughters are to be looked for in leftmost or rightmost positions, in order or out of order (free permutations). It is possible to look for nodes at unspecified depths by using “generalized nodes”. Finally, the rules may be context-sensitive, the root of the schema (RS) being possibly different from the root of the effective transformation (RT). What is not dominated by the RT constitutes the context, or “hat”. The RT may itself be active or contextual. What it dominates belongs to the active part.

The notion of parallel rewriting in ROBRA is quite strong, as parallelism may be “normal” (RTs located on distinct nodes of a cut of the AO), “vertical” (an RT may dominate another), and “horizontal” (several contextual RTs and at most one active RT may be instantiated on the same node of the AO).

Finally, it is possible to write extremely complex conditional assignments of variables in the right-hand side of rules, which contributes to making ROBRA an extremely powerful tool. ROBRA is really a production system of the substitution type, even if, in the current implementation, elementary application of a transformational grammar is done by transduction of an input tree into an output tree (both represented linearly). For that reason, the decoration type is necessarily preserved by a transformation system.

EXPANS, a language for lexical expansion and transfer
EXPANS is based on a model of transduction of decorated trees [30]. The decoration types of the input and output trees may be different. Each node is transformed into a sub-tree in the output tree. This sub-tree is determined by consultation of the dictionaries, in their order of priority, through the UL borne by the node. A default action is always provided. A dictionary item has a UL value as key, and a list of triplets (condition, image, assignments) as content. The conditions concern the node of the input tree and possibly its immediate neighbours (mother, left and right sisters). The image describes the geometry of the sub-tree to be produced, and the assignments make it possible to compute the values of the variables on the nodes of the sub-tree from those of the accessible input nodes.

At the level of a dictionary, if the UL of the node is found, EXPANS looks for the first triplet whose condition is verified (the last condition must be empty, that is, identically true), and the corresponding sub-tree is produced. Otherwise, this dictionary fails. In deterministic mode, dictionaries are searched in their order of priority until a success is obtained. In non-deterministic mode, all dictionaries are searched, in order, and the image produced is a sub-tree constructed by rooting the sub-trees produced by the dictionaries under a new node. This mode makes it possible, for example, not to “hide” the usual translation of a word that also has a different translation in a particular domain whose dictionary has been given higher priority.
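The two lookup modes can be sketched as follows in Python. This is a simplified model, not EXPANS syntax: an image is reduced to a single target UL, conditions see only the node itself, the default action (copying the source UL) and the "ALT" label of the new root are assumptions, and all dictionary contents are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Node = Dict[str, object]
Triplet = Tuple[Callable[[Node], bool], str, Callable[[Node], Dict[str, object]]]
Dictionary = Dict[str, List[Triplet]]           # UL -> list of (condition, image, assignments)


def expand(node: Node, dictionaries: List[Dictionary], deterministic: bool = True) -> Node:
    """Expand one input node using dictionaries given in decreasing order of priority."""
    subtrees: List[Node] = []
    for dictionary in dictionaries:
        for condition, image, assign in dictionary.get(node["UL"], []):
            if condition(node):                 # first triplet whose condition holds
                subtrees.append({"UL": image, **assign(node), "children": []})
                break
        if deterministic and subtrees:
            return subtrees[0]                  # stop at the first dictionary that succeeds
    if not subtrees:
        return {"UL": node["UL"], "children": []}   # assumed default action: keep the source UL
    return {"UL": "ALT", "children": subtrees}      # root all results under a new node


def always(node: Node) -> bool:
    return True


def copy_number(node: Node) -> Dict[str, object]:
    return {"NUM": node.get("NUM")}


computing: Dictionary = {"BUG": [(always, "BOGUE", copy_number)]}
general: Dictionary = {"BUG": [(always, "INSECTE", copy_number)]}
source = {"UL": "BUG", "NUM": "SG"}

print(expand(source, [computing, general])["UL"])                       # BOGUE
both = expand(source, [computing, general], deterministic=False)
print([alt["UL"] for alt in both["children"]])                          # ['BOGUE', 'INSECTE']
```

With the domain dictionary given priority, deterministic mode returns only the domain translation, while non-deterministic mode keeps the usual translation alongside it.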

SYGMOR, a language for morphological generation
SYGMOR is based on the model of a finite-state deterministic transducer, whose first version was designed by B. Thouin and programmed by D. Jaeger. [29] describes the extensions and amendments he contributed to it in recent years. SYGMOR takes as input a sequence of decorations and produces as output a string of characters. A context reduced to the current (C) and preceding (P) decorations is available, and two strings are used, the “working string” T (“chaîne de Travail”), and the output buffer S (“chaîne en Sortie”).

The grammar has quite a simple structure. Each rule is made of a condition, actions and “subsequent rules”. Actions consist essentially of writing to the right, to the left or at the middle (last point of concatenation) of T a literal string or the result of accessing one of the dictionaries through the value of one of the variables of C (dictionaries accessed by the UL give the bases, the others the affixes). It is also possible to modify C, and to “recall” S to T (by concatenating it to the left and emptying it).

For each decoration, SYGMOR looks for the first applicable rule and applies it. It then applies the subsequent rules in order, without taking their subsequent rules into account. A subsequent rule may be obligatory or optional. If an obligatory rule fails, SYGMOR goes back to the initial state and applies the rule MOTINC, if present, and otherwise the default error action (empty S, then do S:=T). Processing continues by taking into account the subsequent rules of the last rule applied, until an empty list of subsequent rules is reached.

TRACOMPL, a language for transforming decorations (“articulations”)
TRACOMPL [30] is the sub-language used to write all DV components. It has been made into an autonomous language to allow for writing “articulations”. The goal is to transform decorations of a Set1 into a Set2. For that, we proceed in two steps:


 * First, one describes Set2 and what should be known of Set1 in order to perform the transformation. The names of the variables present in both sets are prefixed by “$” and those of variables present in Set1 but not carried over to Set2 by “$$”. The others (not prefixed) are considered to be new.
 * One completes this by writing (CVAR part) a conditional action which can test the variables of the input decoration in Set1, in Set2 after the “reformatting” described by the preceding part, and in Set2 in their current state (during execution of the action). Because of this, it is possible to perform arbitrary transformations of variables.
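The two steps can be illustrated with a minimal Python sketch. It is not TRACOMPL syntax: the declaration is modelled as a plain list of prefixed names, the conditional action as a Python function, and the names `reformat`, `articulate`, `cvar` as well as the variables GNR and GENDER are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Decoration = Dict[str, Optional[str]]


def reformat(declaration: List[str], source: Decoration) -> Tuple[Decoration, Decoration]:
    """Step 1: build the visible part of Set1 and the initial Set2 decoration."""
    visible: Decoration = {}
    target: Decoration = {}
    for name in declaration:
        if name.startswith("$$"):              # known in Set1, not carried over to Set2
            visible[name[2:]] = source.get(name[2:])
        elif name.startswith("$"):             # present in both sets: copied
            visible[name[1:]] = source.get(name[1:])
            target[name[1:]] = source.get(name[1:])
        else:                                  # new variable of Set2, initially empty
            target[name] = None
    return visible, target


def articulate(declaration: List[str],
               cvar: Callable[[Decoration, Decoration, Decoration], None],
               source: Decoration) -> Decoration:
    """Step 2: run the conditional action on (Set1 view, reformatted Set2, current Set2)."""
    visible, reformatted = reformat(declaration, source)
    current = dict(reformatted)                # the action modifies only the current state
    cvar(visible, reformatted, current)
    return current


def cvar(set1: Decoration, set2_initial: Decoration, set2_current: Decoration) -> None:
    # Toy conditional action: recode a Set1 gender value into a new Set2 variable.
    if set1.get("GNR") == "FEM":
        set2_current["GENDER"] = "F"


result = articulate(["$UL", "$$GNR", "GENDER"], cvar,
                    {"UL": "MAISON", "GNR": "FEM", "K": "N"})
print(result)                                  # {'UL': 'MAISON', 'GENDER': 'F'}
```

Variables of Set1 that are not declared at all (K in the example) are simply invisible to the action and absent from the result.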