Re2c

re2c is a free and open-source lexer generator for C, C++, Go, and Rust. It compiles declarative regular expression specifications to deterministic finite automata. Originally written by Peter Bumbulis and described in his paper, re2c was put in the public domain and has since been maintained by volunteers. It is the lexer generator adopted by projects such as PHP, SpamAssassin, the Ninja build system and others. Together with the Lemon parser generator, re2c is used in BRL-CAD. This combination is also used with STEPcode, an implementation of the ISO 10303 standard.

Philosophy
The main goal of re2c is generating fast lexers: at least as fast as reasonably optimized C lexers coded by hand. Instead of using the traditional table-driven approach, re2c encodes the generated finite state machine directly in the form of conditional jumps and comparisons. The resulting program is faster than its table-driven counterpart and much easier to debug and understand. Moreover, this approach often results in smaller lexers, as re2c applies a number of optimizations such as DFA minimization and the construction of a tunnel automaton. Another distinctive feature of re2c is its flexible interface: instead of assuming a fixed program template, re2c lets the programmer write most of the interface code and adapt the generated lexer to any particular environment. The main idea is that re2c should be a zero-cost abstraction for the programmer: using it should never result in a slower program than the corresponding hand-coded implementation.
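The difference between the two encodings can be sketched by hand. The following C fragment is an illustration only, not re2c output: it matches the pattern [0-9]+ once with a transition table and once with the states written directly as comparisons and jumps. The names match_table and match_direct, the state names and the pattern itself are assumptions made for this sketch.

```c
#include <stddef.h>

/* Table-driven: transitions are looked up in an array at run time. */
enum { S_START, S_DIGITS, S_FAIL, N_STATES };
enum { C_DIGIT, C_OTHER, N_CLASSES };

static const int next_state[N_STATES][N_CLASSES] = {
    /* S_START  */ { S_DIGITS, S_FAIL },
    /* S_DIGITS */ { S_DIGITS, S_FAIL },
    /* S_FAIL   */ { S_FAIL,   S_FAIL },
};

/* Returns the length of the longest prefix of s matching [0-9]+, or 0. */
static size_t match_table(const char *s)
{
    int state = S_START;
    size_t i = 0;
    for (;;) {
        int cls = (s[i] >= '0' && s[i] <= '9') ? C_DIGIT : C_OTHER;
        int next = next_state[state][cls];
        if (next == S_FAIL) break;
        state = next;
        ++i;
    }
    return state == S_DIGITS ? i : 0;
}

/* Direct-coded: each state is a place in the code, each transition
   is a comparison followed by a jump. */
static size_t match_direct(const char *s)
{
    const char *p = s;
    /* start state: the first character must be a digit */
    if (*p < '0' || *p > '9') return 0;
    ++p;
digits:
    /* accepting state: consume further digits, stop on anything else */
    if (*p >= '0' && *p <= '9') { ++p; goto digits; }
    return (size_t)(p - s);
}
```

Both functions return 3 for the input "123abc" and 0 for "abc"; the direct-coded version needs no table lookups and its control flow can be followed in a debugger state by state.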

Features

 * Submatch extraction: re2c supports both POSIX-compliant capturing groups and standalone tags (with leftmost greedy disambiguation and optional handling of repeated submatch). The implementation is based on the lookahead-TDFA algorithm.


 * Encoding support: re2c supports ASCII, UTF-8, UTF-16, UTF-32, UCS-2 and EBCDIC.
 * Flexible user interface: the generated code uses a few primitive operations in order to interface with the environment (read input characters, advance to the next input position, etc.); users can redefine these primitives to whatever they need.
 * Storable state: re2c supports both pull-model lexers (when the lexer runs without interrupts and pulls more input as necessary) and push-model lexers (when the lexer is periodically stopped and resumed to parse new chunks of input).
 * Start conditions: re2c can generate multiple interrelated lexers, where each lexer is triggered by a certain condition in the program.
 * Self-validation: re2c has a special mode in which it ignores all user-defined interface code and generates a self-contained skeleton program. Additionally, re2c generates two files: one with the input strings derived from the regular grammar, and one with compressed match results that are used to verify lexer behavior on all inputs. Input strings are generated so that they extensively cover DFA transitions and paths. Data generation happens right after DFA construction and prior to any optimizations, but the lexer itself is fully optimized, so skeleton programs are capable of revealing any errors in optimizations and code generation.
 * Warnings: re2c performs static analysis of the program and warns its users about possible deficiencies or bugs, such as undefined control flow, unreachable code, ill-formed escape symbols and potential misuse of the interface primitives.
 * Debugging: besides generating human-readable lexers, re2c has a number of options that dump various intermediate representations of the generated lexer, such as the NFA, multiple stages of the DFA and the resulting program graph in DOT format.
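The push-model idea from the list above can be illustrated with a small hand-written C sketch. It is not code generated by re2c and does not use re2c's interface; the struct word_lexer and the functions word_lexer_init, word_lexer_feed and word_lexer_finish are hypothetical names. The lexer counts words made of lowercase letters, stops whenever a chunk is exhausted, and resumes from its stored state when the next chunk arrives.

```c
#include <stddef.h>

/* Hypothetical lexer state that survives between chunks of input. */
struct word_lexer {
    int in_word;     /* 1 if the previous chunk ended inside a word */
    unsigned count;  /* number of complete words seen so far */
};

static void word_lexer_init(struct word_lexer *lx)
{
    lx->in_word = 0;
    lx->count = 0;
}

/* Feed one chunk: the lexer stops at the end of the chunk and keeps
   its state in *lx so that it can be resumed with the next chunk. */
static void word_lexer_feed(struct word_lexer *lx, const char *chunk, size_t len)
{
    for (size_t i = 0; i < len; ++i) {
        int is_letter = chunk[i] >= 'a' && chunk[i] <= 'z';
        if (is_letter) {
            lx->in_word = 1;
        } else if (lx->in_word) {
            /* a non-letter ends the word in progress */
            lx->in_word = 0;
            ++lx->count;
        }
    }
}

/* Signal end of input: a word still in progress is now complete. */
static unsigned word_lexer_finish(struct word_lexer *lx)
{
    if (lx->in_word) {
        lx->in_word = 0;
        ++lx->count;
    }
    return lx->count;
}
```

Feeding the chunks "foo ba" and "r baz" and then calling word_lexer_finish yields the same count as lexing "foo bar baz" in one piece, because the word split across the chunk boundary is tracked in the stored state.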

Syntax
A re2c program can contain any number of /*!re2c ... */ blocks. Each block consists of a sequence of rules, definitions and configurations (they can be intermixed, but it is generally better to put configurations first, then definitions and then rules). Rules have the form REGEXP { CODE } or REGEXP := CODE; where REGEXP is a regular expression and CODE is a block of C code. When REGEXP matches the input string, control flow is transferred to the associated CODE. There is one special rule: the default rule with * instead of REGEXP; it is triggered if no other rule matches. re2c has greedy matching semantics: if multiple rules match, the rule that matches the longer prefix is preferred; if the conflicting rules match the same prefix, the earlier rule has priority. Definitions have the form NAME = REGEXP; (and also NAME REGEXP in Flex compatibility mode). Configurations have the form re2c:CONFIG = VALUE; where CONFIG is the name of the particular configuration and VALUE is a number or a string. For more advanced usage see the official re2c manual.
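The matching semantics (longest match first, then the earlier rule) can be sketched in plain C. This is only a model of the rule selection: re2c itself compiles all rules into a single automaton and does not try them one by one at run time. The two rules, a keyword "if" and an identifier [a-z]+, and the names rule_if, rule_ident and select_rule are assumptions made for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Each hypothetical rule returns the length of the prefix it matches, or 0. */
static size_t rule_if(const char *s)     /* "if" */
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t rule_ident(const char *s)  /* [a-z]+ */
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z') ++n;
    return n;
}

typedef size_t (*rule_fn)(const char *);

/* Rules in specification order: the keyword comes before the identifier. */
static const rule_fn rules[] = { rule_if, rule_ident };

/* Returns the index of the winning rule and stores the match length
   in *len; returns -1 (the default rule) if no rule matches. */
static int select_rule(const char *s, size_t *len)
{
    int best = -1;
    *len = 0;
    for (int i = 0; i < (int)(sizeof rules / sizeof rules[0]); ++i) {
        size_t n = rules[i](s);
        if (n > *len) {  /* strictly longer wins; a tie keeps the earlier rule */
            *len = n;
            best = i;
        }
    }
    return best;
}
```

For the input "if(" both rules match two characters, so the earlier keyword rule wins; for "iffy" the identifier rule matches a longer prefix and is preferred; for "123" neither rule matches and the default rule applies.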

Regular expressions
re2c uses the following syntax for regular expressions:

 * "foo": case-sensitive string literal
 * 'foo': case-insensitive string literal
 * [a-xyz], [^a-xyz]: character class (possibly negated)
 * . : any character except newline
 * R \ S: difference of character classes
 * R*: zero or more occurrences of R
 * R+: one or more occurrences of R
 * R?: zero or one occurrence of R
 * R{n}: repetition of R exactly n times
 * R{n,}: repetition of R at least n times
 * R{n,m}: repetition of R from n to m times
 * (R): just R; parentheses are used to override precedence or for POSIX-style submatch
 * R S: concatenation (R followed by S)
 * R | S: alternative (R or S)
 * R / S: lookahead (R followed by S, but S is not consumed)
 * name: the regular expression defined as name (except in Flex compatibility mode)
 * @stag: an s-tag, which saves the last input position at which @stag matches in a variable named stag
 * #mtag: an m-tag, which saves all input positions at which #mtag matches in a variable named mtag

Character classes and string literals may contain the following escape sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexadecimal escapes \xhh, \uhhhh and \Uhhhhhhhh.

Example
A very simple re2c program (example.re) checks that all input arguments are hexadecimal numbers. The re2c code is enclosed in comments; all the rest is plain C code. See the official re2c website for more complex examples.

From example.re, re2c generates example.c. The contents of the /*!re2c ... */ comment are replaced with a deterministic finite automaton encoded in the form of conditional jumps and comparisons; the rest of the program is copied verbatim into the output file. There are several code generation options: normally re2c uses switch statements, but it can use nested if statements (as in this example, with the -s option), or generate bitmaps and jump tables. Which option is better depends on the C compiler; re2c users are encouraged to experiment.
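To give an impression of the nested-if style, here is a hand-written C sketch of a lexer that accepts a hexadecimal number. It is not actual re2c output: the pattern ("0x" followed by one or more hexadecimal digits and a terminating NUL), the function name lex_hex and the variable names are assumptions chosen to fit the description of the example.

```c
/* Returns 0 if s is "0x" followed by one or more hexadecimal digits
   and then the terminating NUL, and 1 otherwise. */
static int lex_hex(const char *s)
{
    const char *cursor = s;
    char c;

    c = *cursor;
    if (c != '0') goto err;
    c = *++cursor;
    if (c != 'x') goto err;
    c = *++cursor;
    /* at least one hexadecimal digit is required */
    if (c <= '9') {
        if (c < '0') goto err;
    } else if (c <= 'F') {
        if (c < 'A') goto err;
    } else {
        if (c < 'a' || c > 'f') goto err;
    }
loop:
    c = *++cursor;
    if (c == 0x00) return 0;  /* end of input: accept */
    /* nested range checks: stay in the loop on any further hex digit */
    if (c <= '9') {
        if (c >= '0') goto loop;
    } else if (c <= 'F') {
        if (c >= 'A') goto loop;
    } else {
        if (c >= 'a' && c <= 'f') goto loop;
    }
err:
    return 1;
}
```

Calling lex_hex("0x1F") returns 0, while lex_hex("0x"), lex_hex("0xG") and lex_hex("1F") return 1. The range checks are nested so that each character is classified with at most a few comparisons, which is the trade-off the nested-if option makes against a switch statement.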