User:Ryuki4716/SynTex/SynTex Generation Stage by Stage

Stage I: Text Prep
An appropriate Training Text (TT) is located. The longer the better. Public domain. Ideally a *.txt formatted file, but in practice probably a *.pdf file that will need translation to *.txt (introducing plenty of trash, especially Run-togethers and PERIODtrash). Correct language (e.g. English).
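As a rough sketch of that *.pdf to *.txt step (the tool and the file names here are assumptions, not what SynTex actually uses; pdfminer.six is just one option):

```python
# One possible *.pdf -> *.txt translation, assuming the pdfminer.six
# package is available. Expect Run-togethers and PERIODtrash in the output.
from pdfminer.high_level import extract_text

raw_text = extract_text("training_text.pdf")  # hypothetical input file
with open("training_text.txt", "w", encoding="utf-8") as f:
    f.write(raw_text)
```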

Ideally the text uses Left/Right Quotes, that is, DublinerQuotes, since they are probably the easiest to process, and we developed Generation for them first. There are many other kinds of Quotes, each requiring specific and extensive processing, a time-consuming deviation from our core SynTex Generation goal.
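A minimal sketch of why Left/Right Quotes are the easy case, assuming DublinerQuotes means the typographic marks U+201C and U+201D: because the opening and closing characters are distinct, quoted spans can be located without guessing which straight quote opens and which closes.

```python
import re

LEFT, RIGHT = "\u201c", "\u201d"  # left/right double quotation marks

def quoted_spans(text: str) -> list[str]:
    """Return the contents of every Left/Right-quoted span."""
    return re.findall(f"{LEFT}([^{LEFT}{RIGHT}]*){RIGHT}", text)

print(quoted_spans("He said \u201cCome along\u201d and left."))
# ['Come along']
```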

Topically focussed (Dubliners, Das Kapital, not Wikipedia). Topical focus is critical. For example, an English dump of Wikipedia may contain 1000s of articles with millions of words. Yet the resulting SynTex will read as bizarrely Phazed, since Extension Wafers are equally likely to be chosen from ANY of the 1000s of different topics.

Similar to mixing 100 colors of paint together. The result? Colorless grey. Or a bizarrely Phazed SynTex that might begin discussing Flamenco, then switch to Earth History after just 4-5 words, then to Auto Mechanics, and so forth. The resulting SynTex is usually close to formally correct but semantically chaotic.

It is extremely difficult to find a long Training Text (TT) focussed on a single narrow topic. Nobody writes a 500-page manual about 'watermelon seeds', for example, then makes it free and public on the internet. The prose alternative may not be topically focussed, but that is an advantage for SynTex Generation: when the artistic coherency lies in style, not content, SynTexGen can approximate it better. The resulting SynTex appears 'artistic', just as the unfocussed author also wrote Phazed prose that sounded good but didn't make much rational sense.

Trash Filtering
Raw text contains trash: typos and unwanted material (foot-notes, page-numbers, titles, graphs, figures) that need removing, but without disturbing textual flow. If empty holes are left after removing trash, then later SynTexGen will misinterpret the gaps as legitimate text and naively mimic the errors, thinking they are stylisms. It's likely a *.pdf text will be used, since they are much more numerous than *.txt files. The translation into *.txt format will introduce several varieties of new trash, notably Run-On words and PERIODtrash. We filter them out at a later stage.
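As an illustration only (these are not SynTex's actual filters), the kind of string-level pass that removes page numbers and foot-note markers while collapsing the holes they leave behind looks like this:

```python
import re

def strip_raw_trash(raw: str) -> str:
    """Remove some common PDF-translation trash from the raw string.

    The patterns are illustrative guesses: lone page-number lines and
    bracketed foot-note markers such as '[12]'.
    """
    raw = re.sub(r"(?m)^\s*\d+\s*$\n?", "", raw)  # lines that are only a page number
    raw = re.sub(r"\[\d+\]", "", raw)             # bracketed foot-note markers
    raw = re.sub(r"\n{3,}", "\n\n", raw)          # collapse empty holes left behind
    raw = re.sub(r"[ \t]{2,}", " ", raw)          # collapse doubled spaces
    return raw
```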

Trash Filtering is performed in 2 stages. Freshly input raw text resides in one long Python string of characters, not yet tokenized text (a List of numerous short strings), so the character operations that work well on string data are performed now. Later, tokenizing the text means slicing the long character string into many short string tokens, stored sequentially in a Python List. Only then can List operations be performed.
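A minimal sketch of that 2-stage split, with stand-in filters (the real SynTex filters are not shown on this page):

```python
import re

def stage_one_string_ops(raw: str) -> str:
    # Stage 1: character-string operations on the whole raw text,
    # e.g. normalising whitespace (a stand-in for the real filters).
    return re.sub(r"\s+", " ", raw).strip()

def stage_two_list_ops(tokens: list[str]) -> list[str]:
    # Stage 2: List operations on the token sequence,
    # e.g. dropping empty tokens (again a stand-in).
    return [t for t in tokens if t]

raw = "The  cat\n\n sat .  "
cleaned = stage_one_string_ops(raw)  # still one long Python string
tokens = cleaned.split(" ")          # sliced into a List of short strings
tokens = stage_two_list_ops(tokens)
print(tokens)                        # ['The', 'cat', 'sat', '.']
```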

Tokenizing
After preliminary trash is removed, the text is tokenized, arranging character sequences properly into a single very long list of tokens. The tokens comprise words and punctuation (and, peripherally, numbers). In Python a token is a short character string. It's critical to include punctuation, an often overlooked defining feature of text. Common punctuation symbols such as Comma ',' and Period '.' are extremely frequent, possibly accounting for 10%-15% of a text. They also appear in the list of the top 10 most frequent tokens.
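To see how frequent punctuation tokens really are, a toy tokenizer plus a frequency count is enough (the regex is an assumed stand-in, not SynTex's own tokenizer):

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Words become one token each; every punctuation mark becomes
    # its own single-character token.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Yes, he said. Yes, she said."
print(Counter(tokenize(sample)).most_common(4))
# Comma and Period rank right alongside the most frequent words,
# roughly: [('Yes', 2), (',', 2), ('said', 2), ('.', 2)]
```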

Tokenizers do not in practice achieve 100% accuracy, so after tokenizing there will still be pre-existing left-over trash, plus new trash the tokenizer itself injected. For English, PERIODtrash and Run-On trash are notorious. For a language such as English, it's notably easy to score 70%-80% accuracy with a home-made tokenizer, but near-100% isn't possible. Of course even humans can't score 100% either. A naive English tokenizer simply considers a token to be whatever is surrounded by whitespace. But then some punctuation usages introduce minority problems that are hard to overcome. Commas, periods, quotation marks, etc. are all positioned immediately to the right of the previous English word, with no intervening whitespace. And there are numerous exceptions and ambiguities. Then other languages (Japanese, Mandarin) don't use whitespace between words, so tokenizers for such languages are much harder to perfect.
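A small illustration of the naive whitespace approach and its limits (again, not SynTex's actual tokenizer):

```python
import re

sentence = "Mr. Duffy said, Come along."

naive = sentence.split()
# ['Mr.', 'Duffy', 'said,', 'Come', 'along.']
# Punctuation stays glued to the word on its left.

split_punct = re.findall(r"\w+|[^\w\s]", sentence)
# ['Mr', '.', 'Duffy', 'said', ',', 'Come', 'along', '.']
# Better, but the period of 'Mr.' is now indistinguishable from a
# sentence-ending Period: one of the ambiguities mentioned above.
```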