User:Ryuki4716/October29 Theory

What is Synthetic Text
Here's a sample: I should also like to remind you that the objective of the funds has been channelled towards the development of partnerships with the countries of origin of the citizens of the twelve countries the euro zone on the grounds of the workers and the general public welcomes the fact that a future ban on the presence of women in the decision-making process will be beneficial to lift the embargo on serbia

Only people write text, not machines or animals. Synthetic Text (SynTex) is artificially generated text plausible enough for people to think it was written by humans rather than generated by software. SynTex is not boilerplate: it is Extended Chunk by Chunk to create unique Text after learning from a Training Text.

SynTex can be surprisingly well-formed: nearly syntactically correct, usually making bizarre (Zapped) sense, and often even making plausible sense. Current SynTex quality is far from human, largely because there is no Thought component (yet), so Topical Continuity is haphazard.

SynTex works for any written language
The SynTexter can produce SynTex in any written language using the same Generator, with no language-specific grammatical, syntactic, semantic or lexical changes needed.

Since SynTexter uses no traditional academic grammar at all, it is language-independent, in contrast to academic grammars, which are language-dependent and differ for each language.

There are a few minor but important technical, non-linguistic caveats specific to each individual language, though: the tokenizer, the Training Text, the text-prepper and the hingewords.

SynTexter is language-independent because it learns from a Training Text (in any language) to generate similar SynTex.

How SynTex is Generated
The SynTexter learns from a Training Text what it needs to Extend SynTex. SynTex Generation is similar in principle to Spell Completion, Google Smart Compose, Machine Translation, Google Search Correction and Japanese Kanji Input Completion Dictionaries.

SynTexter traverses a long Training Text (the longer the better), isolating Chunks: stretches of 4-8 Tokens (words and punctuation). An Ante is the first of an Ante/Sequitur pair of Chunks in the Text. Booking proceeds in three stages. First, a moving window Slices the Text into all its Ante/Sequitur pairs, booking the AnteSeqD. Next, the Ante/Para(phrase) dictionary (APD) is booked, pairing Antes with Paras (paraphrases) instead of the original Sequiturs; APD booking involves an exhaustive Text search for every Para that could substitute for the original Sequitur. Each Para's Magnitude is also tabulated: the length of the Solapa, the number of Tokens shared between the Ante, Sequitur and Para. Finally, the Cyclado Dictionary (PATD) is booked: every Para in one PATD Entry matches an Ante in a different PATD Entry, making the dictionary Cyclical. Solapa Magnitude is the criterion for Extension: the more Tokens an Ante and a Para share, the higher the Magnitude and the more Plausible the Extension on average. Higher-Magnitude Extensions are generally preferred, with several caveats (Plage avoidance).
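The first booking stage can be sketched in a few lines of Python. This is a minimal illustration, not SynTexter's actual code: the fixed Ante length of 6 Tokens, the Sequitur lengths of 5-7, and the one-Token overlap at the seam are assumptions chosen to match the worked example later on this page.

```python
from collections import defaultdict

def book_anteseqd(ttl, ante_len=6, seq_lens=(5, 6, 7)):
    """Slide a moving window over the Tokenized Training Text (TTL),
    Slicing it into Ante/Sequitur pairs and booking them in AnteSeqD."""
    anteseqd = defaultdict(set)
    for i in range(len(ttl) - ante_len):
        ante = tuple(ttl[i:i + ante_len])
        for n in seq_lens:
            # The Sequitur starts at the Ante's final Token, so the
            # pair shares a one-Token seam.
            seq = tuple(ttl[i + ante_len - 1:i + ante_len - 1 + n])
            if len(seq) == n:
                anteseqd[ante].add(seq)
    return anteseqd
```

Booked on the sample TTL shown later on this page, this yields entries such as AnteSeqD[('of', 'the', 'terrible', 'storms', ',', 'in')] containing ('in', 'the', 'various', 'countries', 'of').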

SynTexter can write but cannot think
People (hopefully) think before writing, thus producing Topically Cohesive Text that parallels and expresses a progression of thought. SynTexter cannot think, so sometimes the Coherency is plausible while other times it is Zapped. SynTexter Extends SynTex by textual criteria without thinking at all. Academic Linguistics also has little to say about Topical Coherency. Academicians take language as a given and focus on language proper. Genesis, the way thought morphs into language in human minds, is not in their domain. So how are we to model Genesis and Topical Coherency in software? Those questions remain unanswered.

Example: Text, Antes and Sequiturs
Here's a piece of TTL, the Tokenized Training Text:

['on', 'behalf', 'of', 'all', 'the', 'victims', 'concerned', ',', 'particularly', 'those', 'of', 'the', 'terrible', 'storms', ',', 'in', 'the', 'various', 'countries', 'of', 'the', 'european', 'union', '.', 'please', 'rise', ',', 'then', ',', 'for', 'this', 'minute', 's', 'silence', '.', '(', 'the', 'house', 'rose']
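A token list like this can be produced by a very simple text-prepper. The sketch below is a hypothetical stand-in, not SynTexter's actual tokenizer (which, as noted above, is one of the language-specific components): it lowercases the text and splits punctuation off as separate Tokens.

```python
import re

def naive_tokenize(text):
    """Lowercase the text and split words and punctuation into Tokens.
    Characters matched by neither alternative (e.g. the apostrophe)
    are simply dropped."""
    return re.findall(r"[a-z]+|[.,;:!?()]", text.lower())

naive_tokenize("Please rise, then, for this minute's silence.")
# → ['please', 'rise', ',', 'then', ',', 'for', 'this', 'minute', 's', 'silence', '.']
```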

and here are some entries from AnteSeqD, a dictionary of Ante/Sequitur pairs learned from TTL:

AnteSeqD[('the', 'victims', 'concerned', ',', 'particularly', 'those', 'of')]
{('of', 'the', 'terrible', 'storms', ',', 'in'),
 ('of', 'the', 'terrible', 'storms', ',', 'in', 'the')}

AnteSeqD[('of', 'the', 'terrible', 'storms', ',', 'in')]
{('in', 'the', 'various', 'countries', 'of'),
 ('in', 'the', 'various', 'countries', 'of', 'the')}

AnteSeqD[('in', 'the', 'various', 'countries', 'of')]
{('of', 'europe', 'must', 'keep', 'their', 'monopoly', 'of'),
 ('of', 'the', 'european', 'union', ',', 'but', 'from', 'the')}

AnteSeqD[('the', 'various', 'countries', 'of')]
{('of', 'europe', 'must', 'keep', 'their', 'monopoly', 'of'),
 ('of', 'the', 'european', 'union', ',', 'but', 'from', 'the'),
 ('of', 'the', 'european', 'union', 'to'),
 ('of', 'the', 'european', 'union', 'to', 'be'),
 ('of', 'the', 'union', ',', 'as')}

AnteSeqD[('for', 'this', 'minute', 's', 'silence', '.', '(', 'the')]
set()

The Antes are:

('the', 'victims', 'concerned', ',', 'particularly', 'those', 'of')
('of', 'the', 'terrible', 'storms', ',', 'in')
('in', 'the', 'various', 'countries', 'of')
('the', 'various', 'countries', 'of')
('for', 'this', 'minute', 's', 'silence', '.', '(', 'the')

The Sequiturs are:

{('of', 'the', 'terrible', 'storms', ',', 'in'),
 ('of', 'the', 'terrible', 'storms', ',', 'in', 'the')}

{('in', 'the', 'various', 'countries', 'of'),
 ('in', 'the', 'various', 'countries', 'of', 'the')}

{('of', 'europe', 'must', 'keep', 'their', 'monopoly', 'of'), ('of', 'the', 'european', 'union', ',', 'but', 'from', 'the')}

{('of', 'europe', 'must', 'keep', 'their', 'monopoly', 'of'),
 ('of', 'the', 'european', 'union', ',', 'but', 'from', 'the'),
 ('of', 'the', 'european', 'union', 'to'),
 ('of', 'the', 'european', 'union', 'to', 'be'),
 ('of', 'the', 'union', ',', 'as')}

Then, for one example Sequitur:

('of', 'the', 'terrible', 'storms', ',', 'in')

its Cyclical Postulates are:

[(4, 8, ('in', 'the', 'various', 'countries', ',', 'particularly', 'including', 'the')),
 (4, 7, ('in', 'the', 'various', 'countries', ',', 'especially', 'in')),
 (3, 6, ('in', 'the', 'various', 'states', 'of', 'the')),
 (3, 7, ('in', 'the', 'various', 'member', 'states', '.', 'to'))]

There are 4 Cyclical Postulates in this example, listed in order of Magnitude, so the higher Magnitude 4s come first.

SynTexter judges that any of these 4 Cyclicals would make a plausible Extension for the Ante, carefully excluding the original Sequitur itself as an Extension to avoid Plage:

Original TTL text Ante: 'of', 'the', 'terrible', 'storms', ',', 'in'

Original TTL text Sequitur: 'the', 'various', 'countries', 'of', 'the', 'european', 'union'

Original Ante/Sequitur: ['of', 'the', 'terrible', 'storms', ',', 'in', 'the', 'various', 'countries', 'of', 'the', 'european', 'union']

Plausible SynTex Extensions:

['of', 'the', 'terrible', 'storms', ',', 'in', 'the', 'various', 'countries', ',', 'particularly', 'including', 'the']
['of', 'the', 'terrible', 'storms', ',', 'in', 'the', 'various', 'countries', ',', 'especially', 'in']
['of', 'the', 'terrible', 'storms', ',', 'in', 'the', 'various', 'states', 'of', 'the']
['of', 'the', 'terrible', 'storms', ',', 'in', 'the', 'various', 'member', 'states', '.', 'to']
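Selecting among Cyclical Postulates can be sketched as follows, assuming each Postulate is a (Magnitude, length, Para) triple as listed above. The rule here (highest Magnitude first, with the original Sequitur excluded to avoid a Plage) is a simplification of SynTexter's actual preference logic.

```python
def choose_extension(ante, postulates, original_sequitur):
    """Pick the highest-Magnitude Para that is not the original
    Sequitur, and splice it onto the Ante."""
    candidates = [para for _, _, para in sorted(postulates, reverse=True)
                  if para != original_sequitur]
    if not candidates:
        return None
    # The Para's first Token repeats the Ante's last Token (the seam),
    # so drop it when Extending.
    return ante + candidates[0][1:]
```

Applied to the four Postulates above, this returns the first of the Plausible SynTex Extensions, the Magnitude-4 8-gram.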

Plage Avoidance
Quoting TTL Text verbatim too often and for too long is plagiarism. SynTexter tries to avoid or minimize Plages.

Plages are verbatim quotes that are too long. By convention a Plage is at least 9 Tokens long; the longer the quote, the more blatant the Plage.

8grams (groups of 8 Tokens) are standard, 9grams are common, 10grams are frequent but starting to resemble Plages ... 20grams are blatant Plages to be avoided.
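Under those conventions, a rough Plage check only needs the length of the longest Token run a candidate SynTex shares verbatim with the Training Text. The 8-Token threshold below follows the convention above; the real SynTexter's avoidance logic is richer than this sketch.

```python
def longest_shared_run(syntex, ttl):
    """Length of the longest Token run appearing verbatim in both texts."""
    best, n = 0, 1
    while n <= min(len(syntex), len(ttl)):
        # All n-grams of the Training Text, checked against the SynTex.
        ttl_grams = {tuple(ttl[i:i + n]) for i in range(len(ttl) - n + 1)}
        if not any(tuple(syntex[i:i + n]) in ttl_grams
                   for i in range(len(syntex) - n + 1)):
            break
        best, n = n, n + 1
    return best

def is_plage(syntex, ttl, limit=8):
    # Shared runs of 9 Tokens or more count as Plages.
    return longest_shared_run(syntex, ttl) > limit
```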

Why Plages are unavoidable, even for humans
Here's a sample of English text:

While she imposed the maximum sentence he faced, she said she may lower the term if prosecutors say he provided “substantial assistance” to their investigation. Krull attorney Oscar Rodriguez said Krull has been cooperating with prosecutors. They’re looking into Venezuelan President Nicolas Maduro, his three stepsons, and Raul Gorrin, owner of the Globovision television network in Venezuela, a person familiar with the matter said.

As a whole, this text is probably unique and not a plagiarism. However, when split into increasingly shorter subsegments, eventually ALL the subsegments have been recycled (plagiarized?) from previous use by huge numbers of English speakers. And clearly, taken down to the single-word level, 100% of the words have been recycled (plagiarized?) from previous use. That is the nature of language: people share it. Otherwise it wouldn't be intelligible.

How many of these subsegments have already been used before? Probably ALL of them, many times:

While she imposed
she imposed the
the maximum sentence
she said she
she may lower
lower the term

So at what length does a recyclage morph into a Plage? It's a continuum, with longer Chunks more blatantly plagiarisms.
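The subsegment argument above is easy to make concrete: enumerate every short n-gram of a token list and ask how ordinary each one is. A minimal sketch:

```python
def all_subsegments(tokens, min_len=2, max_len=4):
    """Every contiguous n-gram of the token list,
    for n from min_len to max_len inclusive."""
    return [tuple(tokens[i:i + n])
            for n in range(min_len, max_len + 1)
            for i in range(len(tokens) - n + 1)]

subs = all_subsegments(['while', 'she', 'imposed', 'the', 'maximum', 'sentence'])
# includes ('while', 'she'), ('she', 'imposed', 'the'), ...
```

On this 6-token fragment the sketch yields 12 subsegments, every one of which has almost certainly appeared in English text before, even though the full sentence may be unique.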