User:Ryuki4716/SynTex/Synthetic Text

Motivation
Nobody precisely knows how humans convert thoughts into language. A Synthetic Text (ST) generator tries to simulate that conversion, using a long Resource Text as our machine's equivalent of a human mind. We want the ST to mimic Human Text (HT) as closely as possible. Of course the results won't be as perfect as HT, but the exercise may help to show where and why ST and HT differ. Synthetic Text (ST) is often fazed in characteristic ways. Some typical examples:

 * ....if you like, when the first notes of the first flight, in the sleeping house.
 * If he was dead, I think he’s a man of the evening editions.
 * The man said they were to be called in the cycling-suit...
 * ...the man by the grey pallor of the man was grey...
 * “one night, man, I met her by me.” she turned as if to say: “that takes the solitary...crossed.”

None of these examples sounds convincingly human, yet they all approximate English in bizarre ways. They're examples of fazed text: ST mimics HT but often suffers fazings. Some common symptoms include:

Lack of Context
*....if you like, when the first notes of the first flight, in the sleeping house.

We could invent a plausible context for some fragments, for example:
 Charles: "When should I go?" "There are the first notes and then the rest of the notes, remember?" "And Dominiq told me they were all left in the sleeping house."
 Maria: "When you should go?" "You're asking me?" "Well....if you like, when the first notes of the first flight, in the sleeping house."

Zags
*...the man by the grey pallor of the man was grey...

Zags can seem bizarrely inappropriate, yet still plausible. Of course humans frequently correct their own speech in midstream, while written corrections usually erase mistakes unless the writer's intent is to illustrate the actual flawed speech. But consider:

*...the man by the grey side of the building was hungry...

The sentence structure (well: fragment structure) doesn't change, yet when these lexical substitutions are made, the fragment suddenly makes better sense:
 * pallor --> side     (singular noun)
 * man    --> building (singular noun)
 * grey   --> hungry   (adjective)

For now, we leave this issue unsolved while we try to optimize performance of our machine's language component.

Synthetic Text (ST) is generated by extending a Train-of-Thought (TOT), re-purposing text fragments that follow similar Left Contexts in a very long Resource Text. Humans also write text by choosing contextually appropriate extensions. ST contexts include:
 * the Train-of-Thought (TOT) context: the previous few (2-5) tokens before a new extension.
 * the Resource: a language knowledge database

Example of an (ST) Extension
We start with the given TOT:
 * ....'secessionist drive is the most serious'
We scan the Resource for possible Extensions found after matching Left Contexts, trying progressively shorter tails of the TOT:
 * ['is','the','most','serious']
 * ['the','most','serious']
 * ['most','serious']
 * ['serious']
The scan returns 3 Candidates:
 * 'existential challenge to a'
 * 'last name found in'
 * 'free gift to a'
which could extend the Train-of-Thought (TOT) as follows:
 * ....secessionist drive is the most serious existential challenge to a
 * ....secessionist drive is the most serious last name found in
 * ....secessionist drive is the most serious free gift to a
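A minimal sketch of this scan in Python (assuming the Resource is simply a list of tokens; the function name find_candidates and its parameter defaults are illustrative, not the project's actual code):

 def find_candidates(tot, resource, max_ctx=4, ext_len=4):
     """Scan the resource for Extensions whose Left Context matches
     the tail of the Train-of-Thought, trying longer contexts first."""
     for ctx_len in range(min(max_ctx, len(tot)), 0, -1):
         context = tot[-ctx_len:]
         candidates = []
         for i in range(len(resource) - ctx_len - ext_len + 1):
             if resource[i:i + ctx_len] == context:
                 candidates.append(resource[i + ctx_len:i + ctx_len + ext_len])
         if candidates:
             return context, candidates
     return None, []

 resource = ("said it was the most serious existential challenge to a "
             "country can face").split()
 tot = "secessionist drive is the most serious".split()
 context, candidates = find_candidates(tot, resource)
 # context -> ['the','most','serious']
 # candidates -> [['existential','challenge','to','a']]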

Tokens, Words, Punctuation in Synthetic Text (ST)
Text contains more than words: punctuation is abundant, important, and influences meaning. TOKENs are the words plus the punctuation in a text. Throughout this discussion of Synthetic Text (ST), we operate on TOKENs, not just words.

In this example:
 '$1.50 If they can, you are in for a treat. Does'
there are 11 WORDS: ['$1.50','If','they','can','you','are','in','for','a','treat','Does'] and 2 PUNCTUATION tokens: [',','.'], totalling 13 TOKENS. (Of course, different tokenizers may interpret '$1.50' differently.) The definition of a token is pragmatic: a token is what a tokenizer considers to be a token.
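A sketch of one possible tokenizer (an assumption for illustration; the project's actual tokenizer may differ, especially on forms like '$1.50'):

 import re

 # Words (including number forms like '$1.50') and punctuation marks
 # each become a separate TOKEN.
 TOKEN_RE = re.compile(r"\$?\d+(?:\.\d+)?|\w+|[^\w\s]")

 def tokenize(text):
     return TOKEN_RE.findall(text)

 tokens = tokenize('$1.50 If they can, you are in for a treat. Does')
 print(len(tokens))  # 13: the 11 WORDS plus ',' and '.'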

Context Length
Shorter contexts lead to gibberish: they match only a few tokens, so they match in many places and generate an overly broad range of extensions. Longer contexts can fit perfectly, but that is plagiarism. The longer the context, the less frequently it occurs in the Resource, so the extension range narrows. A longer Resource permits longer context matching, but such long texts are not easily available, and a short Resource doesn't contain enough matches to be viable. That's why nobody pursued Synthetic Text (ST) before computers and huge online databases became available.
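The trade-off can be made concrete with a small counting sketch (assuming, as above, that the Resource is a plain token list):

 def match_count(context, resource):
     """Count how often a context n-gram occurs in the resource."""
     n = len(context)
     return sum(resource[i:i + n] == context
                for i in range(len(resource) - n + 1))

 resource = "the cat sat on the mat and the dog".split()
 print(match_count(['the'], resource))               # 3: broad, gibberish-prone
 print(match_count(['on', 'the', 'mat'], resource))  # 1: narrow, plagiarism-prone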

DeadEnds
Extensions can reach DeadEnds, with no way to continue forward. Some causes include:
 * TextDict Sparsity: TextDict is the 8gram dictionary version of the Resource. It uses an ExcludeFilter to remove any 8grams containing NOISE of various types, e.g. 'all.134', 'lb.' (see the sketch after this list).
 * Puntos: tokens that were incorrectly tokenized, usually involving a PERIOD.
 * Oddities: 'c-m-c', 'xxxviii.', '7', '8', '9'
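A sketch of such an ExcludeFilter; the NOISE patterns below are guesses based on the examples above, not the project's actual filter:

 import re

 # Illustrative NOISE: forms like 'all.134', mis-tokenized periods
 # like 'lb.' or 'xxxviii.', chains like 'c-m-c', and bare digits.
 NOISE = re.compile(r"^(?:\w+\.\d+|[a-z]+\.|(?:\w-)+\w|\d+)$")

 def exclude_filter(eightgrams):
     """Drop any 8gram containing a NOISE token."""
     return [g for g in eightgrams
             if not any(NOISE.match(tok) for tok in g)]

 grams = [['the','conversion','of','the','weak','double','bond','in'],
          ['see','all.134','for','details','lb.','and','more','text']]
 print(exclude_filter(grams))  # keeps only the first, NOISE-free 8gram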

Repeats
Extension repeats happen when an extension and its neighbors are the same or similar; they need suppressing. Repeats can be contiguous, or remote, recurring over several extensions. Repeats happen in Synthetic Text (ST) because the Extender selects the best, or one of the best, Extensions. Since there is just one best Extension, it tends to appear often.

Example of a contiguous repeat
 21 Fire is hot because
 22 the conversion of the
 23 the conversion of the

Example of a remote repeat
 21 Fire is hot because
 22 the conversion of the
 23 weak double bond in
 24 molecular oxygen, O2, to
 25 the conversion of the
 26 weak double bond in
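One simple way to suppress both kinds of repeat is to reject any candidate Extension already seen recently (a sketch; the window size here is an arbitrary assumption):

 def is_repeat(extension, history, window=6):
     """True if this extension duplicates a recent one: window=1
     catches contiguous repeats, larger windows catch remote ones."""
     return extension in history[-window:]

 history = [['Fire', 'is', 'hot', 'because'],
            ['the', 'conversion', 'of', 'the']]
 print(is_repeat(['the', 'conversion', 'of', 'the'], history))  # True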

Resource Text
When the Resource Text is too short, searches cannot find enough matches. When the Resource Text isn't topic-specific enough, the SynText lacks coherency, since it shuffles together vocabulary from numerous disparate topics. Digitized texts are useful Resource Texts, but they need to be long, clean and focussed. In practice, as of 2017, most free downloadable texts are in *.pdf or *.htm format; we need *.txt or similar files for Python3 to read. There are *.pdf --> *.txt and *.htm --> *.txt converters, but they are prone to error, introducing enough Noise to sabotage a Synthetic Text (ST) project. Some *.txt files are available free, but most aren't long enough.

Topical focus is a critical problem since it's rare to find a long text narrowly focussed on just 1 topic.

Long prose *.txt files (e.g. Ulysses) work well since they tend to lack a single clear topic, unlike, say, a medical article discussing Tamoxifen. English Wikipedia dumps are long, but they contain millions of independent articles, each discussing an independent topic.

Hinges: Synthetic/Organic Extensions
Extensions grow from a hinge, the rightmost area of a given Train-of-Thought (TOT) that can grow an extension. The span of tokens across the hinge is SYNTHETIC, since the machine chooses them, possibly introducing incoherency. A span of tokens not crossing a hinge is ORGANIC, since it was repurposed from the Resource; an ORGANIC span doesn't increase incoherency since the original human meaning and form remain intact. Mismatches (X) are needed to avoid plagiarism. If a mismatch is required every 8th token or even more frequently, then no span of tokens in the Train-of-Thought (TOT) can be a verbatim quote from the text longer than 6 tokens. Consider: every octogram extension requires 1 mismatch (X) and at least 1 match (O), allowing a maximum of 6 new tokens in the XO case. In the XOOOO case, there is 1 required mismatch and 4 matches (O), allowing just 3 new tokens.
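The octogram arithmetic restated as a tiny sketch (a worked restatement of the numbers above, not project code):

 def max_new_tokens(matches, mismatches=1, ngram=8):
     """Tokens left over in an octogram extension after the required
     mismatches (X) and matches (O) at the hinge."""
     return ngram - matches - mismatches

 print(max_new_tokens(matches=1))  # XO case:    6 new tokens
 print(max_new_tokens(matches=4))  # XOOOO case: 3 new tokens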

Coherency/Incoherency
Let's use an example extending a Train-of-Thought (TOT) by hexagram THOTs, with just a 1-token match (minimal coherency). Here's the given Train-of-Thought (TOT) we want to extend:
 given TOT:   'the book estimates the original figure'

We scan the Resource text and find this Candidate:
 Candidate:   'figure was calculated with accuracy in'

It's a plausible extension since the token 'figure' matches both at the tail of the TOT and at the head of the THOT. We can extend to:

Extended TOT: 'the book estimates the original figure was calculated with accuracy in'

and luckily this extension sounds quite plausible. Now let's break the extended TOT down into its 4gram fold, all 8 of them. We find there are 6 organic and 2 synthetic 4grams, so this TOT has a 2/8 risk of incoherency. It's synthetic THOTs that can introduce incoherency, while organic THOTs don't.

6 organic THOTs:
 'the book estimates the'
 'book estimates the original'
 'estimates the original figure'
 'figure was calculated with'
 'was calculated with accuracy'
 'calculated with accuracy in'

2 synthetic THOTs:
 'the original figure was'
 'original figure was calculated'
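A sketch of this organic/synthetic classification (assuming we still know where the TOT ends and the Candidate begins; the substring test is a simplification that works for this example):

 def fourgram_fold(tokens, n=4):
     return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

 tot = 'the book estimates the original figure'.split()
 candidate = 'figure was calculated with accuracy in'.split()
 extended = tot + candidate[1:]  # overlap on the 1-token match 'figure'

 for gram in fourgram_fold(extended):
     text = ' '.join(gram)
     # ORGANIC: the 4gram lies wholly inside the TOT or the Candidate.
     organic = text in ' '.join(tot) or text in ' '.join(candidate)
     print('organic ' if organic else 'synthetic', text)

Running this prints the 6 organic and 2 synthetic 4grams listed above.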