User:Steamerandy/editing/metacompiler2

metacompiler

META II
META II was first written and hand-compiled in META I, a hand-compiled version of META II. The history is unclear as to whether META I was a full implementation of META II or a minimalist subset required to compile the full META II compiler.

In its documentation, META II is described as resembling BNF (Backus normal form), which today is explained as a Chomsky production grammar. META II is an analytical grammar. In the TREE-META document these languages were described as reductive grammars.

For example, in BNF an arithmetic expression may be defined as:
 * <expr> := <term> | <expr> + <term> | <expr> - <term>

BNF rules are today production rules describing how constituent parts may be assembled to form only valid language constructs. A parser does the opposite, taking language constructs apart, disassembling them into their constituent parts. META II is a stack-based functional parser programming language that includes output directives. In META II the order of testing is specified: left to right and down within an equation. META II, like other programming languages, would overflow its call stack attempting left recursion. META II instead uses a $ (looping) zero-or-more sequence operator. Parsing equations written in META II are boolean conditional expressions that are evaluated left to right:
 * expr = term $('+' term .OUT('ADD') / '-' term .OUT('SUB'));

Above, the expr equation is defined by the expression to the right of the '='. Evaluating left to right from the '=', term is the first thing that must be tested. If term returns failure, expr fails. If successful, a term was recognized and we enter the indefinite $ zero-or-more loop, where we first test for a '+'. If that fails the alternative '-' is attempted, and finally, if a '-' is not recognized, the loop terminates with expr returning success, having recognized a single term. If a '+' or '-' is successful then term is called, and if successful the loop repeats. The expr equation can also be expressed using nested grouping as:
 * expr = term $(('+' / '-') term);

The code production elements were left out to simplify the example. Due to the limited character set of early computers the character / was used as the alternative (or) operator. The $, loop operator, is used to match zero or more of something. The above can be expressed in English: an expr is a term followed by zero or more of (plus term or minus term). Schorre describes this as being an aid to efficiency, but unlike a naive recursive descent compiler it will also ensure that the associativity of arithmetic operations is correct.

With the ability to express a sequence with a loop or right ("tail") recursion, the order of evaluation can be controlled.
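As a rough illustration, the loop form can be sketched in Python. This is a minimal hypothetical recursive-descent evaluator over a token list, not META II's actual generated code; `parse_expr` and the integer-only `term` are assumptions for the sketch:

```python
# Sketch of the META II loop form: expr = term $(('+' | '-') term).
# Hypothetical minimal evaluator, not Schorre's implementation.
def parse_expr(tokens):
    """Parse 'term (("+"|"-") term)*' left to right, evaluating as it goes."""
    pos = 0

    def term():
        nonlocal pos
        value = int(tokens[pos])   # a term here is just an integer literal
        pos += 1
        return value

    result = term()                # expr must begin with a term
    while pos < len(tokens) and tokens[pos] in ('+', '-'):
        op = tokens[pos]           # '+' or '-' recognized
        pos += 1
        rhs = term()               # operator must be followed by a term
        result = result + rhs if op == '+' else result - rhs
    return result                  # loop exit still means success

print(parse_expr(['7', '-', '3', '-', '2']))  # prints 2: (7-3)-2, left associative
```

Because each loop iteration folds the new term into the running result, subtraction associates to the left, which is the correctness property noted above.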

Syntax rules appear declarative but are actually made imperative by their semantic specifications.

work area
A stack-oriented programming language is one that relies on a stack (or stacks) for manipulating data and/or passing parameters. Several programming languages fit this description, notably Forth, RPL, and PostScript.

Reductive Grammar
Rp. Since our discussions at Talk:Formal grammar/Reductive grammar I have been trying to figure out how to explain the grammars that were called reductive in the 1960s, after the long and unfruitful discussions we had on Formal grammar. Not really a problem any more, as the articles linking to formal grammar no longer do so. But still your opinion is important. I think it valid to distinguish them from production grammars. What they are is parser programming languages ...

 program = $((declaration         // A program is a sequence of
                                  // declarations, terminated by
             | .EOF .STOP)        // .EOF (End Of File). If not,
            \ ERRORX["Error"]     // backtrack and report error,
                                  // flagging furthest parse point
              $(-';' (.ANY | .STOP)));  // Do error recovery: find a ';'
                                  // to continue. Stop on .ANY
                                  // failing, indicating the end
                                  // of file has been reached.

 declaration = "#" dirictive
             | comment
             | global            DECLAR[*1]     // inter-compile linkage
             | (id ( grammar     PARSER[*1]     // Parser function
                   | sequencer   GENERATOR[*1]  // Generator function
                   | optimizer   ISO[*1]        // In Sequence Optimizer
                   | pseudo_op   PRODUCTION[*1] // Pseudo instruction
                   | emitor_op   MACHOP[*1]     // Machine instruction
                   )                            // All the above start with an identifier
                \ ERRORX["Syntax error"] garbol);  // Backtrack ERROR

 grammar = ( ':'  class  :CLASS   // character class define
           | ".." token  :TOKEN   // token rule
           | "==" syntax :BCKTRAK // backtrack syntax rule
           | '='  syntax :SYNTAX  // syntax rule
           ) ';' !2               // Combine name and rule tree
           $(-.NL -"/*" .any);    // Comment to line separator?

 comment = "//" $(-.NL .ANY)        // C style comments
         | "/*" $(-"*/" .ANY) "*/";

syntax  = test_seq ('|' altn:ALTN!2 | '\' altb :ALTB!2 | :SEQ!1);

altn    = test_seq ('|' altn:ALTN!2 / .EMPTY); // alternative sequence.

altb    = test_seq ('\' altb:ALTB!2 / .EMPTY); // backtracking sequence

test_seq = +[test $test]+;

The above is a bit of one defining itself, an updated version of SLIC I wrote in the early 1970s. There are five sub-languages: grammar, sequencer, optimizer, pseudo_op, and emitor_op. The grammar language is what I have been trying to explain. SLIC is based on (a descendant of) META II. Some of the points I have found are:

1) Backtracking. Early languages until CWIC in 1969 did not allow backtracking. This was intentional, as all languages of the time could be parsed without backtracking. An alternative may not start with the same test as a previous alternative, as in:
 * x = a b | a c;

which is invalid, as both alternatives begin with a. So a must be factored out:
 * x = a (b|c);

That allows x to be parsed without backtracking. The multiple production rules:
 * x = a b;
 * x = a c;

are not valid on two points. Each rule is compiled into a function whose name is the name on the left of the equals sign. Function names, as in most programming languages, must be unique. CWIC added programmed backtracking:
 * x = a b \ a c;

Backtracking uses a long fail. At the start of the leftmost backtrack alternative a backtrack stack frame is created on the call stack. On a long failure the backtrack stack frame is returned to (like longjmp in C). The parse is reset and control transferred to the next backtrack alternative. Backtracking alternatives may not be mixed with non-backtracking alternatives within a sub-expression sequence.
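The long-fail mechanism can be sketched in Python, using an exception where C would use longjmp. All names here (`LongFail`, `backtrack_alternatives`, the `alt_*` functions, the dict-based parse state) are illustrative assumptions, not CWIC's actual machinery:

```python
# Sketch of CWIC-style programmed backtracking: an exception acts as the
# "long fail" that unwinds to the saved backtrack frame.
class LongFail(Exception):
    """Raised when a parse fails after partially advancing."""

def backtrack_alternatives(state, *alternatives):
    """Try each alternative; on LongFail restore the saved parse state."""
    for alt in alternatives[:-1]:
        saved = state.copy()          # backtrack "stack frame": saved parse state
        try:
            return alt(state)
        except LongFail:
            state.clear()
            state.update(saved)       # reset the parse to the saved point
    return alternatives[-1](state)    # last alternative: its failure propagates

# Usage for  x = a b \ a c  -- both alternatives start with 'a'.
def alt_ab(st):
    if st['input'][st['pos']:st['pos'] + 2] != 'ab':
        raise LongFail()
    st['pos'] += 2
    return 'ab'

def alt_ac(st):
    if st['input'][st['pos']:st['pos'] + 2] != 'ac':
        raise LongFail()
    st['pos'] += 2
    return 'ac'

state = {'input': 'ac', 'pos': 0}
print(backtrack_alternatives(state, alt_ab, alt_ac))  # prints ac
```

The saved copy of the state plays the role of the backtrack stack frame: when the first alternative long-fails after consuming input, the parse position is restored before the next alternative runs.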

Being a programming language the parsing equations have semantics associated with them.

The SYNTAX languages of CWIC and SLIC compile into lexerless parsers, but that is a poor description: they have on-the-fly lexing. Not all lexemes were explained as tokens. Token formulas recognize variable strings in these languages. Constant strings are used in recognizing key words. Tokens are variable character sequences that are kept. Character sequences recognized by constant string tests are not normally kept. The META II exp rule illustrates string recognition:
 * exp = term $('+' term .OUT('ADD') / '-' term .OUT('SUB'));

The strings '+' and '-' are not kept. Strings have semantic specifications: leading white space characters are ignored (skipped). In META II tokens are recognized by builtin functions: .ID, .STRING, and .NUMBER. Tokens are pushed on the * (star) stack. They are popped and output using * in a .OUT production. Schorre called his languages, META I and META II, transformational metalanguages.

CWIC was far more advanced, building a tree structure in the parsing language and generating binary code with a LISP 2 based generator language. A tree-building exp formula:
 * exp = term $(('+':ADD|'-':SUB) term!2);

It builds an abstract syntax tree, though I now use the terminology abstract parse tree as it is created by the parsing language. There are two stacks used: the node stack and the parse stack. : pushes a node on the node stack. ! creates a tree having branches. It actually creates an n+1 element list, the first element being the node. !2 creates a list of 3 elements:
 * [<node>, <*2>, <*1>]

for example [ADD, <*2>, <*1>] or [SUB, <*2>, <*1>]. *1 is the top parse stack entry, *2 the second or next-to-top parse stack entry. The exp formula created a tree of the form:

    ADD            SUB
   /   \          /   \

Token rules are defined using a .. token formula:
 * id .. alpha $('_' alphanum|alphanum);

An id starts with an alpha character. It may contain any number of alphanum (alphanumeric) or underscore characters. An underscore '_' must be followed by an alphanum, preventing an id ending with an underscore. $ is the zero-or-more loop operator. Token formulas skip leading white space characters.
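A hypothetical Python sketch of this token formula follows; `match_id` and the tuple return convention are illustrative assumptions, not how these compilers actually emit token code:

```python
# Sketch of the id token formula: alpha $('_' alphanum | alphanum).
# An id starts with a letter and cannot end with an underscore.
def match_id(text, pos=0):
    """Return (lexeme, new_pos) on success or None on failure."""
    if pos >= len(text) or not text[pos].isalpha():
        return None                      # must start with an alpha character
    start = pos
    pos += 1
    while pos < len(text):
        if text[pos] == '_':             # '_' must be followed by an alphanum
            if pos + 1 < len(text) and text[pos + 1].isalnum():
                pos += 2
            else:
                break                    # a trailing '_' is not consumed
        elif text[pos].isalnum():
            pos += 1
        else:
            break
    return text[start:pos], pos

print(match_id('foo_bar_ rest'))  # prints ('foo_bar', 7): trailing '_' excluded
```

Note how the two-character lookahead on `'_'` directly mirrors the grouped alternative `('_' alphanum | alphanum)` in the formula.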

These programming metalanguages very precisely define a language. They also, using programmed transformation, create an intermediate abstract parse tree or directly output stack machine assembly code.
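The two-stack tree building described earlier (the :node and !2 operations) can be sketched in Python. The function names and module-level stacks are illustrative assumptions for the sketch, not CWIC's implementation:

```python
# Sketch of the two-stack tree building: ':' pushes a node on the node
# stack; '!2' pops that node and the top two parse stack entries,
# pushing the 3-element list [node, *2, *1] back on the parse stack.
node_stack = []
parse_stack = []

def push_node(name):            # the :ADD / :SUB operation
    node_stack.append(name)

def make_tree(n):               # the !n operation
    branches = parse_stack[-n:]       # *n ... *1, oldest first
    del parse_stack[-n:]
    parse_stack.append([node_stack.pop()] + branches)

# Parsing "a + b" with  exp = term $(('+':ADD|'-':SUB) term!2):
parse_stack.append('a')         # term recognized token a
push_node('ADD')                # '+' matched, :ADD pushes the node
parse_stack.append('b')         # term recognized token b
make_tree(2)                    # !2 builds the 3-element list
print(parse_stack[-1])          # prints ['ADD', 'a', 'b']
```

Repeating the loop body would pop this whole tree as *2 for the next !2, which is how the left-associative nesting of trees arises.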

They define the language, its parsing, and transformation to another form. All of which is programmed. They define the recovery from grammar errors:

 \ ERRORX["Error"]              // Backtrack and report error,
                                // flagging furthest parse point
   $(-';' (.ANY | .STOP)) ';';  // Error recovery: find a ';' to continue.

The SLIC compiler I wrote implements the identical parser (SYNTAX) language of CWIC except for the == formula, which is just a shortcut for readability, equivalent to grouping the parsing expression and adding a backtrack \ .FAIL to a = formula.
 * a == x y z

in CWIC:
 * a = (x y z)\.FAIL;

These are similar to parsing expression grammars. However, they predate PEGs by decades, and are more powerful.

CWIC and SLIC are full optimizing compiler development systems. SLIC improved on CWIC by advancing code generation. A MACHOP function call has the appearance of an assembly instruction. MACHOPs transform their parameters into a sequence of bit fields. SLIC generates code into a bit-addressable memory space. PSEUDO functions are akin to assembly macros. Where CWIC generated binary code in the generator language, SLIC generates PSEUDO instructions. Both generate code into sections that are later output by flushing the section. The difference is that CWIC simply writes a block of memory associated with a named section. In SLIC, PSEUDO instructions (with parameters) are appended to a section list. Flushing a section executes the PSEUDO instruction list. The PSEUDO functions call MACHOP functions that output relocatable code to the object file. The MACHOP defining language provides for an optional symbolic assembly-like listing interspersed with the source.
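A rough Python sketch of SLIC-style deferred sections, with illustrative names (`plant`, `flush`, `pseudo_add` are assumptions, not SLIC's actual API): pseudo instructions are appended to a named section list and only executed when the section is flushed:

```python
# Sketch: PSEUDO calls are deferred onto a section list; flushing the
# section executes them in order, letting them emit machine code.
sections = {}

def plant(section, pseudo, *args):
    """Append a pseudo instruction (with parameters) to a section list."""
    sections.setdefault(section, []).append((pseudo, args))

def flush(section):
    """Execute the section's pseudo instruction list, then discard it."""
    for pseudo, args in sections.pop(section, []):
        pseudo(*args)

emitted = []
def pseudo_add(dst, src):        # a pseudo op: would call MACHOPs in SLIC
    emitted.append(f'ADD {dst},{src}')

plant('code', pseudo_add, 'r1', 'r2')
plant('code', pseudo_add, 'r1', 'r3')
flush('code')
print(emitted)   # prints ['ADD r1,r2', 'ADD r1,r3']
```

Deferring execution this way is what lets a later optimization pass (the ISO phase) rewrite the pseudo instruction list before any machine code is committed.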

These are very fast parsers as they can avoid backtracking. Most programming languages can be parsed without the need of backtracking, except as illustrated in my example above where it is used in error recovery. They combine lexing and parsing, eliminating time-consuming processes. By default token formulas enter recognized tokens into the dictionary, a structured symbol table. Built-in functions may be called as the last operation of a token formula. MAKESTR, for example, intercedes dictionary entry, creating a string object. MAKEINT creates an integer object.

Metacompiler
A metacompiler in the most general case is a compiler that takes as input programs written in a metalanguage, more specifically a transformational metalanguage. They have uses other than compiler development. Notable examples are metacompilers developed at SDC, funded by the United States Department of Defense, that were primarily used to write document analysis applications.

The early history of metacompilers is closely associated with the ACM SIGPLAN working group 1 on Syntax Directed Compilers. The group was primarily started by the efforts of Howard Metcalf. Howard designed two compilers in the fall of 1962: one, based on work by Ledley and Wilson, used a bottom-up parsing method; the other, using a top-down approach based on work by Glennie, generated random sentences from context-free grammars. META I, presented in a talk given by Dewey Val Schorre at the 1963 ACM conference in Denver, Colorado, introduced the basic concepts of transformational metalanguage programming. The Schorre META II compiler writing language article was published in the proceedings of the 1964 19th ACM national conference. Schorre's transformational metalanguages are analytical grammars that also include semantic transform operations as part of the metalanguage specification. That same year System Development Corporation (SDC), a government supported think tank, began major concerted efforts in metacompiler development. This effort led to powerful metacompilers, BOOK1, BOOK2 and META 5, written in LISP. META 5, released to many users, was used in writing many string-manipulating document analysis applications. META 5 has multiple push-down stacks, incorporated backup of the input stream, symbol attributes and other facilities enabling it to parse context-sensitive grammars. SDC released CWIC in 1970. CWIC used a Schorre transformational metalanguage and included a dialect of LISP 2 used in generating and optimizing output object code. Like META II, CWIC was written in its own languages.

Early metacompilers were compilers taking as input a META "language", i.e. META I, META II, METAFOUR, META 5 etc. This naming convention was started by Dewey Val Schorre, where in the META II document he gave license to freely use his work: "META II is not presented as a standard language, but as a point of departure from which a user may develop his own META 'language'." - Dewey Val Schorre. Of course programs coded in a META 'language' were compiled by the META 'language' compiler. Eventually a portmanteau of metalanguage compiler became the generic term metacompiler: a compiler whose input is a metalanguage.

Between 1964 and 1970 many compiler compilers were developed. These were complete compiler development systems that included all phases of translation from parsing to code generation. A prime example, "A GENERAL PURPOSE TABLE-DRIVEN COMPILER", used five phases of translation, each written in its own specialized language. It was not described as being a metacompiler or specifically using metalanguages. However its documentation clearly describes it as using specialized metalanguages for each phase of translation.

Phase 1 Syntax analysis. Using a BNF-like metalanguage, transformed input source code into tree structures.

Phase 2 Generator. Performed optimizations on the tree and transformed it into macro-like instructions.

Phase 3 ISO (In Sequence Optimization). Resequenced the macro sequences, moving constant calculations outside of loops and eliminating redundant calculations.

Phase 4 Macro expansion. Expanded macro instructions into assembly code.

Phase 5 Assembly. Transformed assembly instructions into binary object code.

During the 1960s metacompiler and compiler-compiler referred to complete compiler writing systems. When yacc came out in the 1970s it corrupted the meaning of compiler-compiler, its name being an acronym for yet another compiler compiler while being only a parser generator.

Metacompilers, by supplying functionality and automating aspects that are the same regardless of the object language, reduce the time, effort and cost of producing and maintaining compilers, translators, or interpreters. Metacompilers can make it economically feasible to include, in the solution of a problem, the development of a domain-specific language appropriate to the specification of that problem.

Besides being useful for development of programming languages, they are themselves a programming language useful for a wide range of string or symbol processing applications.

The term transformational metalanguage used in describing Schorre's META I and META II metalanguages shows a connection to Zellig Harris's transformational metalanguage theory. Schorre's META "language"s are analytical phrase structure grammars having very specific semantic specifications. Schorre chose to describe them as syntax equations; syntax formula better describes them today. The fact that they are procedural programming languages is commonly missed. The syntax formulas are used to write a top-down parser. Each syntax formula is compiled into a test (boolean) function.

that is an analytical string processing programming language

String processing object-oriented
transformational metalanguage programming.

The early metacompilers were originally explained as using syntax equations resembling BNF. However, as Schorre explained in his talk on META I, they use a form of transformational metalanguage. Today syntax formula is a closer fit to the technology. Each syntax formula is translated into a recursive test function. A top driving rule defines a valid program module's structure. The analysis and transformational aspects may be combined into a single language or divided into multiple sub-languages. In these languages the terms token and lexeme are used differently. Tokens, rather than being equivalent to lexemes, are a subset of lexemes. Lexeme applies to any sequence of characters recognized as a unit. Tokens are recognized by token formulas and kept for later use. For example, a quoted string is used in a syntax formula to recognize a constant lexeme that is not normally kept.

Early metalanguages META I and META II used builtin token recognizers:
 *  .ID      identifier token.
 *  .NUMBER  integer token.
 *  .STRING  quoted string token.

Saved token objects were then addressed in code production transformations using *, i.e. *1, *2.

The transformational metalanguage can be divided into a parser programming sub-language and object manipulation sub-language(s).

A valid program's structure as defined in the parser programming language is made up of string processing functions coded in the form of syntax formulas.

Advanced parser languages as found in CWIC have programmed token rules. CWIC's token rules create objects that are pushed on the parse stack. Symbol objects that are cataloged in the dictionary (symbol tables) are the default of a token recognition. Supplied generator functions may be used that intercede the default dictionary cataloging of symbols. These intercede functions are used to create numeric and string token objects, bypassing cataloging. The quoted string token formula below recognizes string tokens as used in the CWIC parser language:

 string .. """" $(-"""" .ANY | """""" ,"""") """" MAKESTR[];

The numbered points below explain the formula's elements, left to right:
 * 1 Function name or token identifier may be used in syntax formula to recognize the token. It may also be used to identify a dictionary entries type.
 * 2 .. signifies a token formula. The symbol to its left names the formula; the right side is the lexing formula.
 * 3 Conforming to the string rule, """" is a string constant lexeme test for a single " character.
 * 4 $ zero or more of the lexing term following it. In this case a grouped, parenthesized, sequence of tests.
 * 5 -"""" a negative look ahead. A single " character is tested for and if matched the negative look ahead fails. If not matched the test succeeds. The parse is not advanced in either case.
 * 6 .ANY matches any character. In token formula the matched character is appended to the token in process.
 * 7 | is the alternative operator. In this case when -"""" in the left alternative fails, the right alternative is attempted. If however .ANY, the second test of the left alternative, fails, a backtrack long failure would result.
 * 8 """""" is a test for two " characters and if successful ,"""" appends a single " character to the token.
 * 9 ,"""" The comma operator in token formula appends the following string to the token.
 * 10 """" matches a single " character. The string constant right bounding " character.
 * 11. MAKESTR[] is a supplied interceding function that creates a string object of the matched token characters.

Token formulas skip initial or leading skip_class characters. $ is the zero-or-more sequence operator. ( and ) are grouping operators. $( ... ) matches a sequence of the enclosed grouped expression.
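The quoted string token formula above can be sketched in Python, with `"` as the bounding character and `""` standing for an embedded `"`. The name `match_string` and the tuple return convention are illustrative assumptions:

```python
# Sketch of: string .. """" $(-"""" .ANY | """""" ,"""") """" MAKESTR[];
def match_string(text, pos=0):
    """Return (string_object, new_pos) on success or None on failure."""
    if pos >= len(text) or text[pos] != '"':
        return None                    # must begin with the bounding "
    pos += 1
    out = []
    while pos < len(text):
        if text[pos] != '"':           # -"""" .ANY : keep any non-quote char
            out.append(text[pos])
            pos += 1
        elif text[pos:pos + 2] == '""':  # """""" ,"""" : '""' appends one '"'
            out.append('"')
            pos += 2
        else:                          # the closing bounding "
            return ''.join(out), pos + 1
    return None                        # unterminated string fails

print(match_string('"say ""hi""" rest'))  # prints ('say "hi"', 12)
```

The doubled-quote branch corresponds to elements 8 and 9 in the numbered list above; collecting into `out` plays the role of MAKESTR[] building the string object.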

The analytical syntax formulas are string matching rules that collect characters into objects. The objects may simply be output using transform operators, as in META II's:
 * .OUT

The .OUT production consists of quoted strings and star-stack references, i.e. *1, *2 etc. *1 being the most recently recognized token, *2 the next most recent.

parser programming language
The parser programming language is programed using grammar equations:
 * rule = id rule_type parsing_expression ";";
 * rule_type = "=" | ".." | ":";
 * "=' syntax rule
 * ".." token making rule
 * ":" character class rule

Unlike BNF grammar rules, parser programming languages use equations, or perhaps more specifically formulas:

 <name> = <parsing expression> ;

In CWIC a parsing formula would be defined as:

 parse = id ( "="  parse_expr
            | ".." token_expe
            | ":"  class_expr);

As a programming language, parse is compiled into a function that first calls id. If successful, an equation type is determined by testing for the string "=". If successful, parse_expr is called. If unsuccessful recognizing "=", the alternative ".." is attempted; again if unsuccessful, the ":" alternative is attempted. Grouping the ("=" parse_expr | ".." token_expe | ":" class_expr) alternatives eliminates backtracking, as all equation forms start with an id. Backtracking is programmed using backtrack alternatives, indicated by using \ to separate alternatives. A backtrack alternative saves the parse state before attempting its left alternative. On failure the parse state is reset to the saved state and program control transfers to the right hand alternative. Lexeme failure is a backup failure, only resetting the input state and returning failure.
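As a sketch of how such a formula compiles into a test function, here is an illustrative Python version (not CWIC's generated code; the single-letter id and tuple results are assumptions for brevity):

```python
# Sketch of the parse formula as a compiled test function. The grouped
# alternatives share the common 'id' prefix, so no backtracking is needed:
# each branch commits only after its leading string test succeeds.
def parse(src):
    """parse = id ("=" parse_expr | ".." token_expe | ":" class_expr);"""
    pos = 0

    def test_string(s):          # lexeme test: match s or leave input as-is
        nonlocal pos
        if src[pos:pos + len(s)] == s:
            pos += len(s)
            return True
        return False             # backup failure: input not advanced

    # id: a single leading letter for this sketch
    if not (pos < len(src) and src[pos].isalpha()):
        return None
    name = src[pos]
    pos += 1

    if test_string('='):
        return (name, 'syntax')
    if test_string('..'):
        return (name, 'token')
    if test_string(':'):
        return (name, 'class')
    return None                  # no alternative matched: the formula fails

print(parse('x..'))   # prints ('x', 'token')
```

Each failed `test_string` leaves the input position untouched, so trying the next alternative needs no parse-state reset, which is exactly why the grouped form avoids backtracking.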

value type object copying

Object-oriented as originally used in LISP 2 talks referred to a programming paradigm where every datum is an object and variables are typeless containers (pointers). The datum types are fixed, defined by the language. A declared variable x for example may at one point contain an integer and in another instance a list, string or floating point number.

In the first generation of metacompilers only character sequences recognized as tokens by the parser languages become token objects. The more generalized objects were publicly described in CWIC (Compiler for Writing and Implementing Compilers). The generator language of CWIC was a dialect of LISP 2.

The early metacompilers were simply string processing languages taking as input a source code character stream and transforming it into an output object code character stream.

work ups
I had been told that CWIC had been classified. But another Wikipedian who had worked on the CWIC project told me that it had only been declared proprietary technology, which didn't mean much in the early 70s. Programming languages couldn't be copyrighted or patented at that time.

I based my SLIC (System of Languages for Implementing Compilers) primarily on CWIC. CWIC has SYNTAX and GENERATOR languages. It described the generating of machine instructions as planting. It planted byte, word, or long word values into sections (named memory blocks). Written on an IBM 360, the blocks were 8 bit bytes associated with a section name. Multiple sections could be used, separating code, constants, and data for example.

SLIC was also inspired by "A GENERAL-PURPOSE TABLE-DRIVEN COMPILER" by Stephen Warshall and Robert M. Shapiro, which describes five translation phases:

1. Syntactic Analysis. Converts input strings into tree representations.

2. Generator. Transforms the tree into a sequence of macro-instructions.

3. ISO. In-sequence optimizer which recognizes macro patterns.

4. Macro expansion. Transforms macros into machine code.

5. Assembly. Outputs the machine code to an object code file.

SLIC also has five phases, each written in its own domain-specific language:

1. SYNTAX sub-language made up of syntax formulas, token rules, and character class defines.

2. GENERATOR sub-language made up of generator functions.

3. ISO sub-language made up of sets of transformational rules.

4. PSEUDO sub-language functions, similar to generator action procedures, that call MACHOPs to output binary object code.

5. MACHOP sub-language used to define the target machine code.

SLIC compiled each language into machine executable code. Assembly instructions are defined in the MACHOP language. Each MACHOP is compiled into an executable subroutine whose appearance when called is like assembly. Pseudo instructions are subroutines whose appearance and function is that of a macro in an assembler. The ISO language has not been implemented. The SYNTAX and GENERATOR languages work as phase 1 and phase 2.

junk
A full featured development package would include a linker and a run-time support library. Usually a machine oriented language is required for writing the support library. C or C++ could be used as a machine oriented language. A library consisting of support functions required for the compilation process usually rounds out the full metacompiler package.

edited
The metacompiler's metalanguage is a specialized programming language designed for the purpose of parsing and transforming character sequences. Metacompilers reduce the task of writing compilers by automating the aspects that are the same regardless of the object language. This makes possible the design of domain-specific languages which are appropriate to the specification of a particular problem. A metacompiler reduces the cost of producing translators for such domain-specific object languages to a point where it becomes economically feasible to include in the solution of a problem a domain-specific language design.

Metacompiler metalanguages are powerful string and symbol processing languages that are also useful for generating a wide range of other software engineering and analysis tools.

Besides being useful for domain-specific language development, a metacompiler is itself a prime example of a domain-specific language, designed for the domain of compiler writing.

A metacompiler is a metaprogram usually written in its own metalanguage or an existing computer programming language. The process of a metacompiler, written in its own metalanguage, compiling itself is equivalent to a self-hosting compiler. Most common compilers written today are self-hosting compilers. Self-hosting is a powerful tool of many metacompilers, allowing the easy extension of their own metaprogramming metalanguage. The feature that sets a metacompiler apart from other compiler compilers is that it takes as input a specialized metaprogramming language that describes all aspects of the compiler's operation. A metaprogram produced by a metacompiler is as complete a program as a program written in C++, BASIC or any other general programming language. The metaprogramming metalanguage is a powerful attribute allowing the ease of development of computer programming languages and other computer tools. Command line processors, text string transforming and analysis are easily coded using the metaprogramming metalanguages of metacompilers.


metaprogramming metalanguage

e1
They employ specialized string and object orianted programming languages

Lead paragraph
A metacompiler is a software development tool used chiefly in the construction of compilers, translators, and interpreters for other programming languages. They are a subset of a specialized class of compiler writing tools called compiler-compilers that employ metaprogramming languages. Metaprogramming is the writing of computer programs with the ability to treat programs as their data. The input to a metacompiler is a metaprogram written in a specialized metalanguage designed chiefly for the purpose of constructing compilers. The language of the compiler produced is called the object language. The minimal input producing a compiler is a metaprogram specifying the object language grammar and semantic transformations into an object program.


and transforms it into object code instructions


 * b) is used chiefly in the construction of compilers, translators, and interpreters for other programming languages.

When the input is a complete programming language specification the transformation is a compiler for that language.

program transformation

software development tool

They are a subset of a specialized class of compiler writing tools called compiler-compilers: a program that reads a metalanguage program as input and translates that program into a set of instructions. If the input program is a complete specification of a programming language, the translation is a compiler for that language.

A metacompiler is a computer program that takes as input a metalanguage program in which a programming language specification is coded, i.e. the specification of source code syntax and a semantically equivalent transformation to machine code.

A compiler's source code, programming language specification and semantically equivalent transformation to object code are programmed. The language specification defines the grammar and the semantic transformations that operate on parsed language constructs. When the input is a complete programming language specification the output is a compiler for that language.

Metacompilers work by performing semantically equivalent program transformations on parsed language constructs.

A metacompiler may be used to construct compilers, translators, or interpreters for a variety of programming languages. With their powerful string-analyzing and transforming features, they are useful for many string processing applications.

stack automaton

History of compiler construction reductive syntax and Semantics Saul Rosen

edit Pt2
Basically, the metalanguage consists of rules. A rule can either succeed or fail. A rule consists of alternatives that are sequences of other rule invocations. A rule succeeds if any of its alternatives succeeds; these are tried in sequence. An alternative succeeds if all of its rule invocations succeed. The language provides operators to create evaluation loops without recursion.
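The rule model described above can be sketched in Python. This is an illustrative sketch only; the class and function names are invented for this example and are not taken from any metacompiler.

```python
# Hypothetical sketch: rules are boolean functions over a shared input
# cursor. A rule succeeds or fails; alternatives are tried in sequence;
# a sequence succeeds only if every test succeeds; repetition is a loop,
# not recursion.
class Input:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def match(self, lit):
        """Test for a literal; advance the cursor only on success."""
        if self.text.startswith(lit, self.pos):
            self.pos += len(lit)
            return True
        return False

def digit(inp):
    # A rule of alternatives: succeeds if any alternative succeeds.
    return any(inp.match(d) for d in "0123456789")

def number(inp):
    # A sequence: one digit followed by zero or more digits
    # (an evaluation loop in place of recursion).
    if not digit(inp):
        return False
    while digit(inp):
        pass
    return True

inp = Input("42+1")
print(number(inp), inp.pos)  # → True 2
```

Note how failure of the loop body simply ends the loop; the rule as a whole still succeeds, exactly as the text describes.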

The metacompiler class of compiler-compiler is characterized by its reductive grammar specification language incorporating semantic transformations. A language is specified top-down by a set of rules.


 * Token (parser)

Aspects of the Theory of Syntax

Proceedings of the 1964 19th ACM National Conference

Reductive Grammar
A reductive grammar is a set of rules, analytical in nature, defining a language's structure. They were used by philosophers and linguists in the early 20th century. A reductive grammar rule defines a named language formation: a verb phrase, a noun phrase, etc. Defining language in this fashion allowed language to be discussed.

A book, for example, has a structure that may contain sections, chapters, a table of contents, an index, paragraphs, sentences, phrases, and words. A reductive grammar would define a book top down, starting with a book rule.


 * books = intro TOC sections index;

As a programming language rule, the above specifies the top-level parts and their order, order being a sequential formation.

Each rule defines a structure's formation: textual constructs such as sentences are made up of phrases, which are in turn made up of words, and words of characters.

A reductive grammar rule names a formation of symbols. The name relates to the formation's function, e.g. a set of reductive grammar rules expressing the form of an algebraic expression using related rule names.

Consider the following algebraic expression grammar expressed in CWIC.

Annotation legend: 1 – exponent (power), 2 – factor (coefficient), 3 – term, 4 – sum operator, 5 – factor (constant), x, y – factor (variables)


 * expression = term $(('+'|'-') term);
 * term = factor $factor;
 * factor = power | coefficient | variable | constant;
 * power = variable '↑' exponent;
 * coefficient = number | constant;
 * variable = 's'|'t'|'u'|'v'|'w'|'x'|'y'|'z';
 * constant = 'c' | 'c0' | 'c1' | 'c2' | 'c3' | 'c4';
 * number .. dgt $dgt;
 * dgt: '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
 * exponent = number | constant;
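As a sketch of how the character class rule dgt and the token rule number (number .. dgt $dgt) might be interpreted, here is a hypothetical Python rendering; the return conventions are invented for illustration and are not CWIC's actual mechanism:

```python
# Illustrative sketch only: a character class rule is a single-character
# test; a token rule scans the input and yields the matched text.
def dgt(ch):
    # Character class rule: dgt: '0'|'1'|...|'9';
    return ch in "0123456789"

def number(text, pos):
    """Token rule: one digit followed by zero or more digits.
    Returns (token, new_pos) on success, None on failure."""
    start = pos
    if pos >= len(text) or not dgt(text[pos]):
        return None
    while pos < len(text) and dgt(text[pos]):
        pos += 1
    return text[start:pos], pos

print(number("123+x", 0))  # → ('123', 3)
```

The distinction mirrors the grammar above: character class rules (:) test one character, while token rules (..) loop over characters to form an atomic object.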

edit point
and transforms it into object code. If the input source code is a complete specification (syntax and semantics) of a programming language, the output object code is a compiler. Metacompilers are used chiefly in the construction of compilers for other programming languages.

In turn a metalanguage is described using a metasyntax.

translator

Metaprogramming is the writing of computer programs with the ability to treat programs as their data. The input to a metacompiler is a metaprogram written in a specialized metalanguage designed chiefly for the purpose of constructing compilers. The language of the compiler produced is called the object language. The minimal input producing a compiler is a metaprogram specifying the object language grammar and semantic transformations into an object program.

is a compiler that takes as input source code written in a metalanguage. If the input is a complete specification of a programming language, the output translation is a compiler for that programming language. Metacompilers are used chiefly in the construction of compilers, translators, and interpreters for other programming languages.

Metacompilers employ powerful string and symbol, analyzing and processing features that are useful for a wide range of software engineering and analysis applications. They are metaprogramming tools that reduce the effort of writing compilers and other language analysis tools by automating the common aspects that are the same regardless of the object language. This makes possible the economic design and development of domain-specific languages which are appropriate to the specification of a particular problem. Besides being useful for domain-specific language development, a metacompiler is itself a prime example of the use of domain-specific languages.

Metacompilers belong to a class of software development tools known as compiler-compilers. Metacompiler and compiler-compiler were at one time synonymous, but the term compiler-compiler has by common use become more general. Using a metacompiler, one programs the parser, defining top down the allowable structure and composition of the source language and its semantic transformation to the target language or object code.

Metalanguages
The first use of a specialized metalanguage in specifying a programming language was BNF in the ALGOL 60 reports. BNF, Backus–Naur Form, was specifically developed by John Backus to talk about ALGOL. What was used before BNF? A natural language such as English, Danish, or Russian. And natural language remained the means of talking about the semantics of ALGOL, but with BNF the two could be bound together as never before. BNF was immediately and unanimously accepted by the IAL group.

Early programming languages were primarily mathematical in nature. FORTRAN (FORmula TRANslation) was created by John Backus, a math major who had just acquired his BS degree when recruited by IBM to head up the IBM language development group. It is important to know this to understand the reasoning behind BNF. BNF is a metalanguage of formulae that specify a language construct, formation, or pattern. The ALGOL 60 report describes them as metalinguistic formulae. They are further explained by Peter Naur to be a set of named linguistic classes. A class name is bounded by the < > characters.

In algebra we learn that a mathematical expression is made up of terms that are added or subtracted. i.e.


 * a+b-c

An expression specification in BNF can be written as follows:

 * <expr> ::= <term> | <expr> + <term> | <expr> - <term>
Likewise, term can be defined:

 * <term> ::= <factor> | <term> * <factor> | <term> / <factor>
The linguists would describe BNF as a production language, ignoring the semantic descriptions. A programmer might agree, for they describe how to write valid language constructs. But a compiler writer would say they are reductive, or analytical, for they described the ALGOL programming language top down, making for an easy translation to a top-down recursive descent parser.

But both are individually wrong. For BNF is a metalanguage describing the syntax of a valid program, while a natural-language description of each class (its semantics) precisely describes what it does. It is used to communicate to a programmer writing a program the precise meaning of every language construct, and to the compiler writer the same thing. BNF, as originally defined, is a documentary specification metalanguage to be used along with natural-language semantic descriptions of each BNF language construct, or class. The semantic specification for <expr> would be written in a natural language. It would specify the acceptable data types of the <term>s, upward conversion to maintain accurate results, and any other significant details of implementation.

The programming metalanguages of metacompilers are similar to BNF. Their analytic grammar rules are boolean functions with embedded semantic transforms, and they are able to code many programming language specifications in their programming metalanguage, as well as their own metasyntax specifications.

Each rule defines a language element or construct. In the process of change to a programming language, the < > brackets were dropped and words of the target language were instead coded as quoted strings. BNF required writing a separate alternative for each form of a rule, the above for example, and a series of like constructs was coded as immediate left recursion, i.e.:

 * <expr> ::= <term> | <expr> + <term> | <expr> - <term>
In the above BNF class, expr is defined as a single term or a term followed by a sequence of terms. The immediate left recursion, a rule referencing itself, implies a loop. BNF simply was not a simple language to write a compiler for, and it could not define itself. The < > are cumbersome to write around every name, and computers were very memory limited. In 1963–1964 Dewey Val Schorre made a more conventional programming language following essentially the same concept. Quoted strings were used, as was already the norm in the 1401 Autocoder. META I and META II were written on an IBM 1401. Schorre removed the < >, no longer needed once quoted strings marked literals, and added grouping and a $ loop operator. Parsing is a complex conditional, in essence one huge conditional: if program then ...

Can one even imagine the enormity of the boolean expression deciding whether a program meets a language specification? That is a compiler's job. Using a programming metalanguage, that job is broken down into deciding the correctness of each language construct. As in BNF, each language construct has a named rule. The rule is a boolean expression returning success (true) or failure (false). The above expr rule written in META II:


 * EXPR = TERM $(('+'/'-') TERM);
 * TERM = FACTOR $(('*'/'/') FACTOR);
 * FACTOR = .ID / .NUMBER / '(' EXPR ')';

Built-in token rules .ID, .NUMBER, and .STRING were used to recognize tokens in the language. The early Schorre metalanguages, recognizing that tokens require different algorithms than syntax rules, provided built-in functions that followed ALGOL rules. Program input was on punched cards. The | character not being available, a / character was used.
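A Python sketch (assumed, not from the META II paper) can show how the $ loop in EXPR = TERM $(('+'/'-') TERM); parses repeated operators without left recursion while still folding the result to the left, preserving the associativity of arithmetic:

```python
# Hypothetical sketch: TERM is simplified to a single token so the loop
# structure stands out. The while loop plays the role of the $ operator.
def parse_expr(tokens):
    """tokens: a list like ['a', '-', 'b', '-', 'c'], consumed from the front."""
    left = tokens.pop(0)                 # first TERM
    while tokens and tokens[0] in "+-":  # $ zero-or-more: ('+' / '-') TERM
        op = tokens.pop(0)
        right = tokens.pop(0)            # next TERM
        left = (op, left, right)         # fold to the left
    return left

print(parse_expr(list("a-b-c")))  # → ('-', ('-', 'a', 'b'), 'c')
```

Note that a-b-c groups as (a-b)-c, the left-associative reading, with no recursive call to parse_expr and hence no risk of call-stack overflow on left recursion.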

Features of metacompilers

 * Specialized metalanguages that are easy to read, write, and maintain, specifically designed for writing compilers, with complete programmed control of all aspects of grammar analysis and transformation to object code. Metacompilers define the transformation of source code to object code completely within the metalanguage(s) of the metacompiler.
 * A defining feature of a metacompiler is that it compiles a program written in a metasyntax-defined metalanguage able to express many programming languages, including its own metasyntax. Not all metacompilers compile themselves, but utilizing this self-expressing feature allows them to do so, making them a specialized class of self-hosting compilers able to quickly and easily port or extend themselves.
 * Programmable parsers provide the ability to avoid backtracking by factoring ambiguous language constructs.
 * Context-sensitive domain-specific sub-languages, like COBOL divisions, can be handled by programmed analytical parsers.
 * Grammar analysis may include context-sensitive lexical rules.
 * Advanced metacompilers, having extensive backtracking features and lexical rules, can analyze most context-sensitive languages.
 * A full-featured development package includes a linker and a run-time support library.
 * A machine-oriented language is required for writing the support library. MOL 360 was used with the CWIC metacompiler. Today C or C++ might be used as a machine-oriented language.

Examples
These examples are primarily based on advanced metacompilers developed by SDC (System Development Corporation) and their predecessors in the Schorre line of metacompilers. Other metacompilers exist, usually of limited capability or lacking documentation.

These examples attempt to illustrate how a compiled metalanguage operates. They are quite different from parser generators in common use.

All metacompilers in the line of these Schorre-based metalanguages use the same basic programmed syntax rule method of language analysis and are able to program the generation of object code. The code generation method has advanced in the later generations. CWIC, the basis for these examples, is the last publicly documented generation in the Schorre metacompiler line. Small differences in the illustrated code reflect modern character sets and generalizations illustrating concepts.

Syntax Rules
Syntax rules are test functions specifying a language construct as a pattern of tests matched against the input character stream. They analyze the input, breaking it into elements, and, on success, output it in a different but equivalent form using embedded transform operations. A rule is itself a test that returns success or failure. Transformations do not occur on failure.
 * Assignment statement example:

 * assign = id '=' expr :STORE!2 codegen[*1];
 * expr = term $(('+':ADD|'-':SUB) term!2);
 * term = factor $(('*':MPY|'/':DIV) factor!2);
 * factor = '(' expr ')' | id | number;

Above is a simple arithmetic assignment defined in a metalanguage similar to the CWIC SYNTAX language. It is a functional programming language made up of rules that analyze a program, starting with a top driving rule defining the parsing of a source file. Syntax rules are functions that analyze the input character stream, returning success or failure. On success, the recognized language constructs are transformed and placed on the parse stack.

Rules are best thought of as logic equations. The '|' (or) operator is used to allow alternate language constructs. ('+' | '-') is read as + or -. The character constants '+' and '-' are tests on the input character stream. The input is advanced over characters that match.

The programming model consists of two stacks and a symbol table. The parse stack contains parsed objects and constructed lists. The SYNTAX rules use node and tree building operators to combine parse stack entries into list or tree objects, creating an abstract syntax tree. The : operator pushes a node on the node stack. The tree-building ! operator builds a tree object, popping the top node stack entry (the node name) and the top n parse stack entries, forming an n+1 entry list and pushing it on the parse stack. Generally the parse stack contains abstract syntax tree elements that are combined and eventually passed to an unparse transform rule. Token rules create atomic objects on the parse stack. Atomic objects are numbers, strings, and symbols.

Syntax rules are made up of logical operators, grouping operators, looping operators, tree-building operators, and tests consisting of literal strings or rules that are matched against the input stream, returning success or failure. A rule is itself a test. Starting with the assign rule, an assignment is defined to be an id followed by an = followed by an expr. The assign rule specifies a series of tests to be made against the input stream of characters. First an id must be matched. When a test succeeds, the input stream has been advanced over the matched characters. The '=' is then matched against the input stream. And finally expr is called to match an arithmetic expression. If all three tests are successful, a STORE tree is built and passed to the codegen unparse transform rule. The [*1] in codegen[*1] indicates codegen to be a generator transform rule; *1 is the top parse stack object, which is popped and passed as the single argument to codegen.
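The two-stack tree building described above can be sketched as follows. This is a hypothetical Python model: the operators named in the comments are the metalanguage's, while the Python function names are invented for this illustration.

```python
# Hypothetical sketch of the two-stack model: token rules push atoms on
# the parse stack, ':' pushes a node name on the node stack, and '!n'
# folds n parse-stack entries under that node.
parse_stack = []
node_stack = []

def push_atom(obj):          # a token rule's result
    parse_stack.append(obj)

def make_node(name):         # the ':' operator, e.g. :ADD
    node_stack.append(name)

def build_tree(n):           # the '!n' operator
    args = parse_stack[-n:]
    del parse_stack[-n:]
    parse_stack.append([node_stack.pop()] + args)

# Parsing "x = a + b" as the assign rule would:
push_atom("x"); push_atom("a"); push_atom("b")
make_node("ADD"); build_tree(2)      # ADD[a, b]
make_node("STORE"); build_tree(2)    # STORE[x, ADD[a, b]]
print(parse_stack)  # → [['STORE', 'x', ['ADD', 'a', 'b']]]
```

The final parse-stack entry is the tree that *1 would pop and pass to the codegen generator.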

The backtracking alternant allows ambiguous language constructs to be recognized. A more common use of backtracking is handling errors. A backtracking alternant resets the parser state to the beginning state of its left alternant before proceeding with the right alternant. Error recovery usually tries skipping to a point where normal operation can continue. For example, a ';' may be scanned for to find the end of a statement: ($-';' ';') skips characters that are not a ; and then matches a ;. A normal alternant will not be attempted if any part of the left alternant succeeds.

Operators of the syntax language:

 * |    Alternant separator (A | B). If A partially succeeds, B will not be attempted.
 * ||   Backtracking alternant (A || B). If A partially succeeds, the parser state is restored to the point where A began and B is attempted.
 * ( )  Groups constructs.
 * $    Loop, zero or more of. '(' id $(',' id) ')' matches an argument list of a function.
 * %    Zero or more with backtracking.
 * "    String delimiter. "ABC" is the character string ABC.
 * '    Character delimiter, a string of one character. Character class rules are restricted to single-character strings.
 * :    Used to define a character class rule. In syntax rules, creates a node and pushes it on the node stack.
 * !    Forms a tree from the top node stack entry and parse stack entries. The ! is followed by the number of parse stack entries to be removed and placed in the list.
 * < >  Creates a list of the parse stack entries created within their bounds.
 * -    Negates the following test or grouped test. (-record_contains) is true if record_contains fails. The input stream is not advanced in any case.
 * ~    Not the following character. Matches any character except the following character constant. ~';' matches any character except ;. string .. '"' $~'"' '"' MAKESTRING[]; is the token rule for a string.
 * ,    Pushes a string on the parse stack. ,"xyz" pushes the string "xyz" onto the parse stack.
 * +    Preceding a string constant, keeps the string. In a syntax rule it is pushed on the parse stack. In a token rule it becomes part of the token.

FORTH metacompiler
Programming in FORTH is adding new words to the language and defining their function. Adding words to the FORTH language is metaprogramming: extending the language, creating a new FORTH dialect. FORTH is a specialized metacompiler for FORTH language dialects.

Starting from a root vocabulary, that is defined using machine operations, a standard FORTH implementation can be defined. FORTH may be used to compile itself.

Many FORTH advocates call the process of creating a new implementation of FORTH a meta-compilation. That it compiles itself is what makes it a metacompiler. But that ignores the fact that the FORTH metacompiler is actually compiling new vocabulary, creating a full FORTH language implementation from a minimal base vocabulary. It is the programmed creation of new dictionary entries that makes FORTH a metacompiler.

Historical
Metacompilers have played a significant role in both computer science and the build-up of the computer industry. Initial metacompilers, including Meta-II and its descendant TreeMeta, are the better known metacompilers. There were many more metacompiler designs tried during the early history of compiler development.

The ideas about grammars, self-generation and extensions are widely used by Program transformation systems.

Early History of Metacompilers
The early history of metacompilers is closely tied with the history of SIG/PLAN Working Group 1 on Syntax Driven Compilers. The group was started primarily through the effort of Howard Metcalfe in the Los Angeles area. In the fall of 1962 Howard Metcalfe designed two compiler-writing interpreters. One used a bottom-to-top analysis technique based on a method described by Ledley and Wilson. The other used a top-to-bottom approach based on work by Glennie to generate random English sentences from a context-free grammar.

At the same time, Val Schorre described two "meta machines". One generative and one analytic. The generative machine was implemented and produced random algebraic expressions. Meta I the first metacompiler was implemented by Schorre on an IBM 1401 at UCLA in January 1963. His original interpreters and metamachines were written directly in a pseudo-machine language. Meta II, however, was written in a higher-level metalanguage able to describe its own compilation into the pseudo-machine language.

Lee Schmidt at Bolt, Beranek, and Newman wrote a metacompiler in March 1963 that utilized a CRT display on the time-sharing PDP-l. This compiler produced actual machine code rather than interpretive code and was partially bootstrapped from Meta I.

Schorre bootstrapped Meta II from Meta I during the spring of 1963. The paper on the refined metacompiler system, presented at the 1964 Philadelphia ACM conference, is the first paper on a metacompiler available as a general reference. The syntax and implementation technique of Schorre's system laid the foundation for most of the systems that followed. The system was implemented on a small 1401 and was used to implement a small ALGOL-like language.

Many similar systems immediately followed.

Roger Rutman of A. C. Sparkplug developed and implemented LOGIK, a language for logical design simulation, on the IBM 7090 in January 1964. This compiler used an algorithm that produced efficient code for Boolean expressions.

Another paper in the 1964 ACM proceedings describes Meta III, developed by Schneider and Johnson at UCLA for the IBM 7090. Meta III represents an attempt to produce efficient machine code, for a large class of languages. Meta III was implemented completely in assembly language. Two compilers were written in Meta III, CODOL, a compiler-writing demonstration compiler, and PUREGOL, a dialect of ALGOL 60. (It was pure gall to call it ALGOL).

Late in 1964, Lee Schmidt bootstrapped the metacompiler EQGEN, from the PDP-l to the Beckman 420. EQGEN was a logic equation generating language.

In 1964, System Development Corporation began a major effort in the development of metacompilers. This effort included the powerful metacompilers Book1 and Book2, written in LISP, which had extensive tree-searching and backup capability. An outgrowth of one of the Q-32 systems at SDC is Meta 5. This system was successfully released to a wide number of users and had many string-manipulation applications other than compiling. The Meta 5 system incorporates backup of the input stream and enough other facilities to parse any context-sensitive language. It has many elaborate push-down stacks, attribute setting and testing facilities, and output mechanisms. The fact that Meta 5 successfully translates JOVIAL programs to PL/I programs clearly demonstrates its power and flexibility.

The LOT system was developed during 1966 at Stanford Research Institute and was modeled very closely after Meta II. It had new special-purpose constructs allowing it to generate a compiler which would in turn be able to compile a subset of PL/l. This system had extensive statistic-gathering facilities and was used to study the characteristics of top-down analysis.

SIMPLE is a specialized translator system designed to aid the writing of pre-processors for PL/I. SIMPLE, written in PL/I, is composed of three components: an executive, a syntax analyzer, and a semantic constructor.

The TREE META compiler was developed at Stanford Research Institute in Menlo Park, California, in April 1968.

The early metacompiler history is well documented in the TREE META manual. TREE META paralleled some of the SDC developments. Unlike earlier metacompilers, it separated semantic processing from syntax processing. The syntax rules contained tree-building operations that combined recognized language elements with tree nodes. The tree structure representation of the input was then processed by a simple form of unparse rules. The unparse rules used node recognition and attribute testing that, when matched, resulted in the associated action being performed. In addition, like tree elements could also be tested in an unparse rule. Unparse rules were also a recursive language, able to call unparse rules, passing elements of the tree, before the action of the unparse rule was performed. The concept of the metamachine originally put forth by Glennie is so simple that three hardware versions have been designed and one actually implemented, the latter at Washington University in St. Louis. This machine was built from macro-modular components and has for instructions the codes described by Schorre.

CWIC (Compiler for Writing and Implementing Compilers) is the last known Schorre metacompiler. It was developed at System Development Corporation. With the full power of LISP 2, a list-processing language, optimizing algorithms could operate on syntax-generated lists and trees before code generation. CWIC also had a symbol table built into the language.

Information about later descendants of these metacompilers is not generally available. With the resurgence of domain-specific languages and the need for parser generators which are easy to use, easy to understand, and easy to maintain, metacompilers are becoming a valuable tool for advanced software engineering projects.

CWIC
In 1969–1970, Erwin Book, Dewey Val Schorre, and Steven J. Sherman developed CWIC. The SDC CWIC (Compiler for Writing and Implementing Compilers) documents are archived at the Charles Babbage Institute Center for the History of Information Technology (Box 12, folder 21).

CWIC is the last documented metacompiler in the Schorre line. It is a marriage of Schorre's metalanguage and LISP 2. CWIC is a metacompiler system composed of three special-purpose languages, each intended to permit the description of certain aspects of translation in a straightforward, natural manner. The syntax language is used to describe the recognition of source text and the construction from it of an intermediate tree structure. The generator language is used to describe the transformation of the tree into the appropriate object language. MOL/360 (Machine Oriented Language-360) is a mid-level systems programming language for the IBM System/360 family of computers that is used to provide an interface with the machine and its operating system. MOL/360 was independently developed in 1968 and used to develop the ADEPT time-sharing system.

The syntax language is similar to that of Dewey Val Schorre's line of metacompilers. It most resembles TREEMETA having tree building operations in the syntax language. The unparse rules of TREEMETA are extended to work with the object oriented generator language based on LISP 2.

An easily accessible source for information on CWIC is "The CWIC/360 system, a compiler for writing and implementing compilers", ACM SIGPLAN Notices, June 1970, volume 5, issue 6, pages 11–29. Archived System Development Corporation documents can be requested from the Charles Babbage Institute Center for the History of Information Technology. The CWIC Technical Manual is archived there (Box 12, folder 21).

CWIC includes three languages:
 * Syntax: Transforms the source program input into list structures using specialized grammars. A parsed expression structure is passed to a generator by placement of a generator call in a rule. A tree is a list whose first element is a node. The language has operators, < and >, specifically for making lists. The colon : operator is used to create node objects. :ADD creates an ADD node. The exclamation ! operator combines a number of parsed entries with the most recently created node to make a tree. The syntax rules are processed by generator functions, returning success or failure. The syntax language is very close to TREEMETA. Both use a colon to create a node. CWIC's tree-building ! operator functions the same as TREEMETA's [ ].
 * Generator: a named series of transforming rules, each consisting of an unparse (pattern matching) rule and an output production written in a LISP 2-like language. The translation was to IBM 360 binary machine code. Other facilities of the generator language generalized output.
 * MOL-360: an independent mid-level implementation language developed in 1968 and used for writing the underlying support library.

Generators Language
The Generator Language has semantics similar to LISP. The tree is thought of as a recursive list. The general form of a Generator Language function is:

 * function-name(first-unparse_rule) => first_generator_action;
 *              (second-unparse_rule) => second_generator_action;
 *              (third-unparse_rule) => third_generator_action;

The code to process a given tree included the features of a general-purpose programming language, plus the form <stuff>, which would emit stuff onto the output file. A generator call may be used in an unparse_rule. The generator is passed the element of the unparse_rule pattern in which it is placed, and the values it returns are bound in that position. For example:

 * expr_gen(ADD[expr_gen(x),expr_gen(y)]) => <AR + (x*16)+y;> releasereg(y); return x;
 *         (SUB[expr_gen(x),expr_gen(y)]) => <SR + (x*16)+y;> releasereg(y); return x;
 *         (MUL[expr_gen(x),expr_gen(y)]) => ...
 *         (x) => r1 = getreg; load(r1, x); return r1;

That is, if the parse tree looks like ADD[a,b], expr_gen(x) would be called with a and return x. A variable in the unparse rule is a local variable that can be used in the generator action. expr_gen(y) is called with b and returns y. Of note here: a generator call in an unparse rule is passed the element in the position it occupies. Hopefully, in the above, x and y will be registers on return. The last transform is intended to load an atom into a register and return the register. The first production would be used to generate the 360 "AR" (Add Register) instruction with the appropriate values in general registers. The above example is only part of a generator. Every generator expression evaluates to a value that can then be further processed.
The last transform could just as well have been written as:

 * (x) => return load(getreg, x);

In this case load returns its first parameter, the register returned by getreg. The functions load and getreg are other CWIC generators.
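The generator's unparse matching can be approximated in Python as a sketch. Register allocation is omitted and a single dummy register r is assumed; the function and instruction names loosely follow the expr_gen example, but this is not CWIC's actual semantics.

```python
# Illustrative sketch only: a generator-style transform that matches
# tree patterns top-down and emits assembly-like text.
def expr_gen(tree, emit):
    if isinstance(tree, list) and tree[0] == "ADD":
        x = expr_gen(tree[1], emit)   # generator calls inside the pattern
        y = expr_gen(tree[2], emit)
        emit(f"AR {x},{y}")           # 360 Add Register
        return x
    if isinstance(tree, list) and tree[0] == "SUB":
        x = expr_gen(tree[1], emit)
        y = expr_gen(tree[2], emit)
        emit(f"SR {x},{y}")           # 360 Subtract Register
        return x
    # (x) => load an atom into a register and return the register
    emit(f"L r,{tree}")
    return "r"

out = []
expr_gen(["ADD", "a", ["SUB", "b", "c"]], out.append)
print(out)  # → ['L r,a', 'L r,b', 'L r,c', 'SR r,r', 'AR r,r']
```

As in the generator language, each transform evaluates to a value (here, the register name) that the enclosing pattern can use further.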

CWIC addressed domain-specific languages before the term domain-specific language existed.
From the authors of CWIC: "A metacompiler assists the task of compiler-building by automating its noncreative aspects, those aspects that are the same regardless of the language which the produced compiler is to translate. This makes possible the design of languages which are appropriate to the specification of a particular problem. It reduces the cost of producing processors for such languages to a point where it becomes economically feasible to begin the solution of a problem with language design."