User:Vegard/Wikitext parsing

Reasons for using an LL/LR (context-free) parser:
 * Parsing efficiency
 * Unambiguity
 * A uniform parser/grammar also helps greatly when accessing documents with bots/programs (think DOM)

Don't try to capture all of today's wikitext constructs in a formal grammar. That would be counter-productive, and there are bound to be differences anyway. Even if strange things like nesting links within links are possible with the ad-hoc parsing, how many articles actually use them? The new parser should be simple and extensible (compare with today's regex hell).

Recursive descent parser
 /* XXX */
 document { section* | paragraph* }

 /* Container for elements that rely on separated lines for structure (such as
  * lists). */
 line { text "\n" }

 /* Text may contain mark-up like links and font styles, but only a single
  * contiguous line of text (therefore no lists or other elements that span
  * multiple lines). */
 text { text-plain | text-italic | text-bold }
 text-italic { "''" text "''" }
 text-bold { "'''" text "'''" }

 /* Plain-text may not contain additional markup. Plain-text may contain
  * markup that is not to be displayed as markup. Umm. */
 text-plain { /* XXX: Define this. Make sure to include all UTF-8 characters. */ }

 section { heading paragraph* }

 /* Headings */
 heading { heading-1 | heading-2 | heading-3 | heading-4 | heading-5 | heading-6 }
 heading-1 { "=" text "=" }
 heading-2 { "==" text "==" }
 heading-3 { "===" text "===" }
 heading-4 { "====" text "====" }
 heading-5 { "=====" text "=====" }
 heading-6 { "======" text "======" }
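The heading rules above are simple enough to show how the recursive descent approach looks in practice. This is only a sketch (function name and return convention are my own, not part of the grammar): count the leading "=" signs, require the same number at the end of the line, and treat what remains as the heading text.

```python
def parse_heading(line):
    """Return (level, text) for a heading line, or None if it is no heading."""
    stripped = line.rstrip("\n")
    # Count the run of leading '=' signs, capped at heading-6.
    level = 0
    while level < len(stripped) and level < 6 and stripped[level] == "=":
        level += 1
    if level == 0:
        return None
    # The line must close with the same number of '=' signs.
    if not stripped.endswith("=" * level):
        return None
    text = stripped[level:len(stripped) - level]
    if not text:
        return None
    return (level, text.strip())
```

Note that the grammar as written is ambiguous for input like "== foo =": a formal parser must decide whether to reject it or fall back to a lower heading level; the sketch simply rejects it.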

 /* A single paragraph of text. May contain some multi-line constructs like
  * lists, but not headings. */
 paragraph { (text | list)+ }

 /* Signatures */
 signature { signature-name | signature-name-date | signature-date }
 signature-name { "~~~" }
 signature-name-date { "~~~~" }
 signature-date { "~~~~~" }

 /* XXX: Match beginning/end of line */
 ruler { "----" }

 list { list-element* }
 list-element { ("*" | "#" | ":" | ";")+ line }
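The list rules lean on the line rule: each line's leading run of "*", "#", ":" or ";" characters determines its type and nesting depth, and a list ends at the first line without such a prefix. A minimal sketch of that (names are mine, not from the grammar):

```python
def parse_list_element(line):
    """Return (markers, text) if the line is a list element, else None."""
    i = 0
    while i < len(line) and line[i] in "*#:;":
        i += 1
    if i == 0:
        return None
    return (line[:i], line[i:].strip())

def parse_list(lines):
    """Consume consecutive list-element lines; return (elements, rest)."""
    elements = []
    while lines:
        element = parse_list_element(lines[0])
        if element is None:
            break
        elements.append(element)
        lines = lines[1:]
    return elements, lines
```

Building the nested tree from the marker strings ("*" vs. "**" etc.) is then a separate, purely structural pass.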

 comment { "<!--" /* XXX: Anything except "-->". */ "-->" }

 tag { "<" /* XXX: What to put here? */ ">" }

Practical implementation

 * Don't make many exceptions and special cases (for example: "A closing tag is not required. If it is missing, then the rest of the supplied text is treated as nowiki."). Deprecate these obscure features and produce warnings, so that pages in violation can be detected and corrected.
 * Allowing HTML was probably always a bad idea. Provide Wikitext replacements.
 * Use a bot to validate existing pages with the new parser. Maintain a list of pages that are not valid under the new parser.
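The validation bot could be as simple as the following sketch. Everything here is hypothetical: fetch_source stands in for whatever MediaWiki API call retrieves the raw wikitext, and parse_page for the new parser, assumed to raise SyntaxError on pages it cannot handle.

```python
def validate_pages(titles, fetch_source, parse_page):
    """Return the titles whose wikitext the new parser rejects."""
    invalid = []
    for title in titles:
        try:
            parse_page(fetch_source(title))
        except SyntaxError:
            invalid.append(title)
    return invalid
```

The resulting list is exactly the "pages in violation" mentioned above; re-running the bot after each parser change tracks progress toward full compatibility.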