NooJ

NooJ is a linguistic development environment and corpus processor built by Max Silberztein. It allows linguists to construct grammars in all four classes of the Chomsky-Schützenberger hierarchy of generative grammars (finite-state, context-free, context-sensitive and unrestricted grammars), using either a text editor (e.g. to write regular expressions) or a graph editor.

NooJ allows linguists to develop orthographical and morphological grammars; dictionaries of simple words, compound words and discontinuous expressions; local syntactic grammars (such as named-entity recognizers); structural syntactic grammars (which produce syntactic trees); and Zellig Harris’s transformational grammars.

All NooJ parsers process Atomic Linguistic Units (ALUs) rather than word forms (i.e. sequences of letters between two space characters). This allows NooJ’s syntactic parser to parse sequences of word forms such as “can not” exactly as it parses contracted word forms such as “cannot” or “can’t”. As a result, linguists can write relatively simple syntactic grammars, even for agglutinative languages. ALUs are represented by annotations stored in the Text Annotation Structure (TAS): all NooJ parsers add annotations to the TAS or remove annotations from it. A typical NooJ analysis applies a series of elementary grammars to a text in cascade, in a bottom-up approach (from spelling to semantics).
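The idea of annotating a text with ALUs rather than word forms can be illustrated with a minimal Python sketch. It is not NooJ code: the `Annotation` class, the toy `LEXICON`, the label format and the `annotate` and `alu_labels` functions are all hypothetical names introduced here for illustration, and the whitespace tokenization is a deliberate simplification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the annotated span begins
    end: int     # character offset just past the annotated span
    label: str   # hypothetical label format: "lemma,CATEGORY"

# Hypothetical toy lexicon: each word form maps to the ALUs it contains.
LEXICON = {
    "can": ["can,V"],
    "not": ["not,ADV"],
    "cannot": ["can,V", "not,ADV"],
    "can't": ["can,V", "not,ADV"],
}

def annotate(text):
    """Build a toy annotation structure: one annotation per ALU found."""
    tas = []
    pos = 0
    for form in text.split():
        start = text.index(form, pos)   # locate the form in the original text
        end = start + len(form)
        for label in LEXICON.get(form.lower(), []):
            tas.append(Annotation(start, end, label))
        pos = end
    return tas

def alu_labels(text):
    """Return only the ALU labels, i.e. what a later grammar would see."""
    return [a.label for a in annotate(text)]
```

In this sketch, `alu_labels("can not")`, `alu_labels("cannot")` and `alu_labels("can't")` all return the same two labels, so a grammar written over ALUs does not need separate rules for each spelling.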

History
NooJ grew out of work by Silberztein and the INTEX community of linguist users on the Lexicon-Grammar approach of Maurice Gross’s LADL, which holds that no grammar rule can be developed independently of a strict delimitation of its domain of application.

NooJ has been used as a corpus processor by researchers in linguistics, history, psychology and literary studies, in sentiment analysis and data mining projects, and even for processing musical notes. For instance, NooJ was used in the MARS 500 experiment, and several software companies have used it to build information extraction and information retrieval software.

Complexity and application
NooJ’s dictionaries are represented by finite-state transducers. They can describe simple words (e.g. table), compound words (e.g. as a matter of fact) and discontinuous expressions, including phrasal verbs (e.g. to turn … off), idiomatic expressions (e.g. to take the bull by the horns) and support verb/predicative noun associations (e.g. to take a nap). NooJ allows linguists to create, edit, debug and maintain a large number of grammars belonging to the four classes of generative grammars in the Chomsky-Schützenberger hierarchy: finite-state grammars, context-free grammars, context-sensitive grammars and unrestricted grammars.
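As a rough illustration of how a single dictionary structure can hold both simple and compound words, the following Python sketch stores entries in a word-level trie, used here as a simplified stand-in for a finite-state transducer. It is not NooJ's dictionary format: `build_trie`, `lookup`, the entries and the category codes are assumptions made for this example, and discontinuous expressions are not handled.

```python
def build_trie(entries):
    """Store each entry word by word; the key None marks the end of an entry
    and holds its output (here, a hypothetical category code)."""
    root = {}
    for words, info in entries.items():
        node = root
        for word in words.split():
            node = node.setdefault(word, {})
        node[None] = info
    return root

def lookup(trie, tokens):
    """Return every (start, end, info) such that tokens[start:end] is an entry."""
    matches = []
    for i in range(len(tokens)):
        node = trie
        j = i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if None in node:
                matches.append((i, j, node[None]))
    return matches

# Hypothetical toy dictionary with one simple word and one compound word.
TOY_DICTIONARY = build_trie({
    "table": "N",
    "as a matter of fact": "ADV",
})
```

Calling `lookup(TOY_DICTIONARY, "as a matter of fact the table".split())` recognizes the compound as one unit spanning five word forms and the simple word as another unit, which is the behaviour the dictionary description above calls for.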

NooJ can often apply grammars to texts in linear time: for instance, most NooJ context-free grammars can be derecursived. NooJ context-sensitive grammars are made of two parts: the first is a context-free (or even finite-state) grammar that is applied to texts very efficiently; the second is a set of constraints applied to the matching sequences, each one checked in constant time. NooJ unrestricted grammars are context-sensitive grammars that can contain variables and can modify the text input. They are typically used to perform transformational analysis and generation (see Zellig Harris), but several teams of linguists have shown that, when used in conjunction with multilingual lexicons, they can also be used to perform machine translation.
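The two-part design described above can be sketched in Python on the textbook context-sensitive language aⁿbⁿcⁿ. This is an illustration of the principle only, not NooJ grammar syntax: the regular expression plays the role of the finite-state part, the length comparison plays the role of the constraints, and the names `PATTERN` and `matches_anbncn` are hypothetical.

```python
import re

# Part 1: a finite-state pattern that over-generates (any a+ b+ c+ sequence)
# and captures each block in a variable-like group.
PATTERN = re.compile(r"(a+)(b+)(c+)")

def matches_anbncn(s):
    """Accept s only if it matches the pattern AND satisfies the constraints."""
    m = PATTERN.fullmatch(s)
    if m is None:
        return False
    a_block, b_block, c_block = m.groups()
    # Part 2: constraints on the captured blocks; each length comparison
    # is a constant-time check once the match is available.
    return len(a_block) == len(b_block) == len(c_block)
```

Here `matches_anbncn("aabbcc")` is accepted, while `matches_anbncn("aabbc")` passes the finite-state part but is rejected by the constraints, and `matches_anbncn("abcabc")` is already rejected by the finite-state part.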