User:Ryuki4716/Oct18Freeze

Quotes: Direct/Indirect Speech
It is best to avoid texts laden with Direct Speech. That unfortunately includes most texts, especially most prose. Machine Translation databases of language pairs (e.g. EuroParl) are aware of such complications, so they limit themselves to Indirect Speech and no Quotation Marks, greatly simplifying Machine Translation.

Direct:   He said "I'm coming."
Indirect: He said (that) he was coming.

Direct Speech introduces many complications into Training Text and is best avoided. Firstly, quotation mark usage is not standardized: it varies by text, author and time period. Then direct/indirect usage is handled differently in different languages. Even for the same author (e.g. James Joyce) in the same time period, Direct Speech can vary stylistically.

Further, though Quotation Marks are supposed to occur in pairs (left/right, front/back), it is normal for some marks to be missing in long texts. It is then very difficult to supply the missing Quotation Mark in the right place. After all, the human author and printer couldn't do that either!

There is a complicated logic to positioning Direct/Indirect Speech within a surrounding context, and it varies greatly by language, perhaps enough to invalidate the claim that SynTexter is 100% language-independent.

P28FreezeJapan
This is the CodeDoc for the SynTexter at C:\Users\Robert\Jupyter\P28FreezeJapan.

Python Glossary*
The SynTexter is written in Python 3.6.1 using the Python datatypes described below.

Strings*
"a string" an ordered sequence of characters, usually Unicode. Tokens are implemented as Strings. A Python String is a specific DataType, expressed between quote symbols (either single or double): 'two', "strings"

Python Dictionaries
In the Python programming language a Dictionary is a specific datatype. Each Entry in a Python Dictionary has a Lemma (the key used to look up the Entry) and a Gloss (what the Entry contains besides the Lemma). For SynTexter the Lemma are usually Chunks, Slivers or Integers. The Glossa display greater variety, but are usually Lists, Tuples or Sets of Chunks.
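As a minimal sketch of the Lemma/Gloss convention (the data is the AnteSeqD sample used later in this document; the variable names are illustrative only):

    # a Chunk Tuple as Lemma, a Set of Chunks as Gloss
    lemma = ('of', 'that', 'kind', 'to')
    gloss = {('to', 'what', 'is', 'merely', 'a'),
             ('to', 'what', 'is', 'merely', 'a', 'report', ',', 'not')}
    dictionary = {lemma: gloss}
    print(dictionary[('of', 'that', 'kind', 'to')])  # look up the Entry by its Lemma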

Lists*
[a, b, c, a, b] is an ordered, mutable group of elements, not necessarily unique: repeated elements are allowed and appends/deletes are permitted. SynTexter Lists usually contain Token Strings.

Tuples*
("this", 'is', ',',"of", "course", "a", "tuple", "of", "strings", "of", "Tokens") are an alternative to Lists. Tuples are inmutable, while Lists are mutable, an important distinction since Python Dictionary Lemma (Keys) have to be immutable. So SynTexter Dictionary Lemma are implemented as Tuples (or Integers), not Lists, since a Python Tuple is a specific DataType and so processes different from Lists. SynTexGen Tuples usually contain Token Strings.

Sets*
{'a', 'set'} is mutable like a List but contains only Unique elements with no Repeats. SynTexter uses Sets to exclude Repeats.

Dictionary*
is a specific Python datatype: a HashTable with very fast lookup, much faster than other iterables such as Lists, Tuples or Sets. Dictionary Entries are unordered (except in Ordered Dictionaries). Dictionaries have a Lemma and a Gloss. The Glossa are of a specific type, often List or Set for SynTexter. Dictionary Entries are accessed by their Lemma, usually Chunks or Hunks: Tuples of String Tokens.

Ordered Dictionary*
are Python Dictionaries whose Entries are Ordered, an extremely useful property for discovering and grouping Entries. Entries with similar Lemma can be grouped as neighbors in Ordered Dictionaries, but not in standard Python Dictionaries.
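A minimal sketch of why ordering groups similar Lemma (the entries are illustrative only):

    from collections import OrderedDict

    d = {('a', 'bad', 'name', 'in'): 1,
         ('be', 'constantly'): 2,
         ('a', 'bad', 'name', '.'): 3}
    od = OrderedDict(sorted(d.items()))
    for lemma in od:
        print(lemma)  # the two ('a', 'bad', ...) Entries now appear as neighbors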

Entries*
Python Dictionaries are unordered collections of Entries, each composed of a Lemma and a Gloss.

Cuts$
is a generic term covering Slivers, Chunks and Hunks. Cuts are subsegments of Text, usually Lists or Tuples of Token Strings. Example: ('this', 'is', 'a', 'cut'). The example could be a Sliver, Chunk or Hunk depending on how it is used.

Chunks$
are ordered Cuts of length 4-8 Tokens. Subtypes include Antes and Rears. Chunks are delimited by HingeWords on both edges.

Chunk length is a pragmatic convention, used because it works. Shorter Chunks would not convey a 'meaning' as often, while longer Chunks would be Plages. In English around 4-8 Tokens is a convenient Chunk length.

Hunks$
("this", "is", "a", "text", "hunk") Hunks are Cuts, of any length and not necessarily representing Antes or Rears (tho they might.

Slivers$
The Entries in the Text16D Sliver Dictionary are Text Slivers. A Sliver is a Cut that is also a Python Slice of Text.

Lemma$
SynTexter Dictionary Entries are referenced by a Lemma, usually a Chunk, Hunk or Integer.

Gloss$
Glossa are usually iterables: Lists, Tuples or Sets. They are Antes or Rears, and Slivers thereof.

Thot$
an abstract, undefined minimal unit of 'thought' or 'idea', represented by 4-8 Tokens in English text. A Thot manifests as a Chunk or Hunk, since a Thot itself has no written representation: Thots have to be verbalized to be represented in text.

Token$
"token" are isolated in Text by a Tokenizer. Tokens represent each individual word or punctuation, usually delimited by white space (in English) or punctuation marks. Complications and ambiguities for Quote marks and Full Stop are numerous, difficult, arbitrary and idiosyncratic, even varying stylistically.

Tokenizer$
is a function that splits Python String Text into individual Tokens: words and punctuation. Tokenizers are language-specific. Many languages don't delimit Tokens with white space (Japanese, Mandarin), so their Tokenizers can be very complicated (and dictionary-lookup based). Many complexities are associated with Periods/FullStops since they aren't standardized. It's often easier to enumerate individual cases in a Tokenizer dictionary than to attempt generalizing principles. Tokenizers are also challenged by Named Entities: for example, is a two-word name 1 Token or 2?
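A minimal sketch of the enumerate-the-cases approach to the FullStop problem (the abbreviation set and function are illustrative, not the SynTexter's actual Tokenizer):

    # enumerated special cases, per the Tokenizer-dictionary approach above
    ABBREVIATIONS = {'mr.', 'dr.', 'i.e.', 'e.g.', 'etc.'}

    def tokenize(text):
        tokens = []
        for word in text.lower().split():       # white space delimits English Tokens
            if word in ABBREVIATIONS:
                tokens.append(word)             # keep the enumerated case intact
            elif word[-1] in '.,?!':
                tokens.append(word[:-1])        # split trailing punctuation...
                tokens.append(word[-1])         # ...into its own Token
            else:
                tokens.append(word)
        return tokens

    print(tokenize("Dr. Smith arrived."))  # ['dr.', 'smith', 'arrived', '.']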

Rears$
A Rear is a generic term for a Sequitur, Para or CycloPara. Rears come after Antes in Text.

Ante$
is a Chunk that precedes a Rear. Most Antes have Extensions, but some don't:
-at the very end of a text
-when the following text doesn't qualify as a Chunk, by lacking HingeWords or failing length requirements

Sequitur$
is a Chunk and a Rear that comes next after an Ante. Antes and Sequiturs exist in pairs; every Sequitur has at least 1 Ante.

Throughout Text, an Ante A might have several Sequiturs, so sequences like these might appear:

AS1: He said it was because he was gonna get lost.
AS2: He said it was because the plane already left.
AS3: He said it was because oil and water don't mix.

In all 3 cases the Ante is the same, but there are 3 different Sequiturs.

Para$
is short for Paraphrase, another kind of Chunk and Rear. Paras are what could come after an Ante, instead of the Sequitur that was actually used. A long enough Training Text usually contains multiple Paras that could be used instead of the Sequitur, since they all share the same Solapa.

CycloParas$
are Chunks and Rears that come after an Ante, like Paras but more exclusive, since CycloParas exclude Sequiturs and repeats of their Entry's Ante. CycloParas are Cyclical (while Paras and Sequiturs might not be). CycloParas form the Glossa for Cyclado Entries. So a CycloPara in a Gloss always refers to a Lemma/Ante in a different Entry, thus perpetuating SynTex Extensibility.

Plages$
are plagiarisms: SynTex segments that quote Text verbatim for too long, by convention never under 9 Tokens long. SynTex should not quote other people's words at length, but necessarily does so for shorter sequences. Longer segments are more undeniably Plages, while shorter ones (9-12 Tokens long) are commonplace and unavoidable. SynTexter tries to minimize Plages, especially longer ones.

Actually every speaker/writer in every language ALWAYS uses Plages of pre-existing text, when considered in short segments. "of the" for example is just 2 Tokens long, and certainly every English speaker has re-used that Plage many times. Should it be considered a plagiarism? Probably not, since it's so short, though strictly speaking it is one. However "Eventually Sequitur Flush should only return the single shortest Sequitur," is 11 Tokens long, hence undeniably a Plage to avoid if it's a verbatim Text quote.

Solapa$
is an overlap or margin shared between an Ante and its Rears: the sequence of HingeWords both share in between. The longer the Solapa, the tighter the Extension, and the more infrequently it is likely to appear in Text. Conversely, short Solapa are much more abundant, but much laxer.

For 2 million Text Tokens, 2gram Solapa are abundant, 3grams an order of magnitude scarcer, 4grams rare and longer Solapa are only tiny trace elements.

Chunks sharing a Solapa convey meaning in Text (otherwise the author wouldn't have written them!). In SynTex even short, lax 2gram Solapa will seem formally correct (more or less), but often sound Zapped, with bizarre, questionable, semi-random Topical Continuity. 2gram Solapa are too abundant and so laxly allow too great a variety of Extensions.

Zapped$
Zapped SynTex is too common and should be minimized. SynTex is full of segments that are structurally correct enough, could mean something, do mean something, but are highly improbable juxtapositions of otherwise unrelated Chunks that probably wouldn't appear in Human Text; they only appear as SynTex Extensions because they happen to share a (usually short) Solapa that is too lax.

A main goal for SynTexter is to minimize Zapped SynTex to more closely approximate Human Text. Using more HiMag (high Magnitude) Extensions with longer Solapa is an obvious strategy. Longer Training Text is required to tabulate a larger arsenal of longer Solapa. Then of course Humans (hopefully) use Thought prior to Genesis of Text, while SynTex has no Thought component at all.

HingeWords$
delimit Antes and Rears. HingeWords are some of the most common words in text, following Zipf's Law. SynTexter recognizes just 19 HingeWords by convention, and tweaking that set might greatly improve SynTex quality.

HingeWords exclude Megas such as '.' and ',' because they are too frequent, thus too lax, and as HingeWords would generate thoroughly Zapped Text lacking Topical Continuity.

As a common example: FullStop is the end of sentence A and the start of sentence B, so allowing FullStops as HingeWords would allow ANY sentence to Extend from ANY sentence, clearly too lax, generating Zapped SynTex.

Topical Continuity$
is the way text is held together by Thought. It is sorely lacking in SynTex, and SynTexter underscores the need to deal with it. In normal human text a Thot flow is verbalized at the instant of Genesis. SynTex has no Thot flow processor at all, and so too often appears Zapped due to questionable Topical Continuity.

Europarl Parallel Corpus$
supplies the Training Text used here. http://www.statmt.org/europarl/

Training Text(TT)$
TT should be as long as possible, since the quantity of longer High Magnitude CycloParas is proportional to Text length. The SynTexter currently uses just 2 million Tokens of the 54 million available in the English EuroParl. Implementation issues are the limiting factor (processes too slow, data too big).

Topically homogeneous TT generates more coherent SynTex. EuroParl transcriptions make good TT since they are formulaic and rhetorical, preferring style over content. They are European Parliament speeches, closer to poetry than prose. James Joyce makes good TT for similar reasons (though his varied use of Direct/Indirect Speech is problematic).

Wikipedia dumps make poor TT because they are so Topically diverse. A 6-million article Wiki dump results in truly well-Zapped SynTex. Imagine throwing all 6-million articles into a paper shredder, then splicing some randomly together!

Mechanical practicalities such as Punctuation usage, Direct/Indirect Speech and Named Entity issues can greatly complicate SynTex processing in ways only tangential to the central SynTex Generation thrust.

Also, modern web-based text is full of non-text content (images, diagrams, links, references) that needs removal before a Training Text is useful. Web-based text normally includes such heterogeneous non-text content and is incomplete, even incoherent, without it.

Language Independence
SynTex uses no academic linguistic constructs at all (grammars). SynTexter instead follows general principles shared with Spelling Completers, Japanese Kanji Dictionaries, Spell Check, Google Smart Compose, Google search correction, etc.

SynTexter tabulates all the Ante/Sequitur pairs from a Training Text, then books Cyclical Ante/Para Entries that Extend SynTex with minimal Plages.

The SAME SynTexter used here for English will also work for ANY other written language, given a large enough Training Text, with only a few language specifics: the Tokenizer, the choice of HingeWords, the Text Prepping, Quote complications, the ExcludeList. SynTexter has been tested and works for Spanish and English.

There is one important exception: Direct/Indirect Speech. The SynTexter is language-independent if trained on Text free of Direct Speech (Quotation Marks). But Direct Speech treatment is language-dependent, so the SynTexter isn't truly language-independent when trained on Text including Direct Speech. Further, at this point the SynTexter cannot handle Training Text using Direct Speech. Further still, Direct Speech usage is not only language-dependent, it's also non-standardized: it varies stylistically, by author, by text, by time period and even by printer. Finally, Academic Linguistics can't handle Direct/Indirect Speech (Quotations) cleanly either; it's an issue they have chosen to ignore. Transformational linguistics in particular has nothing whatsoever to say about this situation and omits any reference to punctuation.

SynTex quality is proportional to Training Text length
so eventually quality may improve up to 54/2 = 27 times if/when all 54 million Tokens of Training Text can be processed. The improvement comes from booking more Higher Magnitude CycloParas to Extend from. If, for example, in 2 million Tokens of Text only 1 Mag4, 3 Mag3 and 10 Mag2 CycloParas are detected, then from 54 million Text Tokens many more (up to 27 times more?) Mag4s might be detected, so Appropriateness will be tighter because Solapa will be wider.

Appropriateness
is the subjective degree to which an Extension seems fitting. While ungrammatical SynTex is objectively incorrect, more often SynTex merely seems Zapped: it's grammatically correct enough, and makes a bizarre kind of sense, but Topicality develops in Zapped, unnatural, pseudo-random ways, since the only coherence is Solapa Magnitude. Only the current Solapa (end) of a SynTex is considered when Extending. SynTexter doesn't use any of the earlier SynTex as context to decide how to Extend, and there's no Thot component to initiate Text Genesis.

Even so, Solapa Magnitude alone produces plausible if Zapped SynTex without using any academic linguistics at all, and its techniques may prove very useful for other Computational Linguistic activities like Machine Translation, Named Entity Detection, Sentiment Analysis, etc.

Unicode
The raw TextFiles are mostly in Unicode, a large character set that attempts to include all the Glyphs for all written languages, in contrast to ASCII, which only includes 128 characters (256 in extended variants) and ignores most of the world's writing systems.

EngTTL
is the full raw EuroParl text in Tokenized form.

TTL
is the first 2 million Tokens of EngTTL.

AnteSeqD
dictionary Entries are a tabulation of all the Antes in the Text paired with their Sequiturs, as Lemma/Glossa. There is usually just 1 Sequitur per Ante, though multiple are possible. Downstream, the Sequiturs will be replaced by Paras, then CycloParas. The CycloParas are what could be said instead of the original Sequitur in the original Ante context. That's how SynTex Extends.

Here's a sample AnteSeqD Entry:

('of', 'that', 'kind', 'to')
{('to', 'what', 'is', 'merely', 'a'), ('to', 'what', 'is', 'merely', 'a', 'report', ',', 'not')}

Lemma: Ante Tuple
Gloss: Set of Sequiturs

The Lemma/Ante, ('of', 'that', 'kind', 'to'), is what was stated first in Text. The Glossa/Sequiturs, {('to', 'what', 'is', 'merely', 'a'), ('to', 'what', 'is', 'merely', 'a', 'report', ',', 'not')}, were stated next in Text. In this sample the 2 Sequiturs form a SequiturFlush.

Data Attrition
Here are some lengths for the important Texts and Dictionaries:

EngTTL      769,104 KB  54,000,000 Tokens, currently too long for SynTexGen to process
TTL          26,236 KB   2,000,000 string Tokens in a List
Text16D     169,937 KB   1,874,403 16gram Tuple Entries in a Dictionary based on TTL
AnteSeqD     46,202 KB     315,949 Dictionary Entries of Ante/SequiturList Tuples/Lists based on Text16D
NumASD       64,507 KB     550,842 Dictionary Entries of Num/[Ante, ShortAnte, Sequitur, ShortSequitur], an isomer of AnteSeqD to expedite processing
APD       2,072,377 KB     290,666 Dictionary Entries based on NumASD
ATD          36,398 KB     182,289 Dictionary based on APD, but limited to High Magnitude Paras
PATD         20,694 KB     165,252 Dictionary based on ATD, but Cyclical
PD918       199,189 KB   1,902,193 Dictionary of 18gram Slivers from TTL for Plage detection

Direct size comparison is elusive, since Token Lists are not Dictionary Entries, and Dictionary Entries vary greatly in length. But clearly usable data shrinks by refinement from raw string Text form to the final PATD Ante/[CycloPara] Extensible form. The KB size on disk probably gives the best idea of relative size. APD is huge even though its dictionary entry count is modest, because its Glossa can be very long, including every qualifying Para from the entire TTL.

TTL
is the Training Text where SynTexter finds Chunks to learn. TTL is implemented as a Python List 2 million string Tokens long, the first 2 million Tokens of the 54 million EngTTL. At the moment only 2 million Tokens can fit in a Python session.

TTL is prepped by EuroPrep (P28FJ_Basics 1). The raw EuroParl file is in Unicode ('utf-8'), and various mechanical but critical filters refine the raw EuroParl text into TTL prepped Text. For example: removing problematic characters ('\n', '\r', "'", '\\', '—'), removing Wikipedia footnote references, tokenizing the raw string data into a Token List using word_tokenize, and lowercasing all uppercase.
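A minimal sketch of this prepping sequence, assuming nltk's word_tokenize (illustrative, not the actual EuroPrep code):

    from nltk import word_tokenize

    def euro_prep(path):
        with open(path, encoding='utf-8') as f:
            raw = f.read()
        for bad in ('\n', '\r', "'", '\\', '—'):   # problematic characters
            raw = raw.replace(bad, ' ')
        return [tok.lower() for tok in word_tokenize(raw)]  # Token List, lowercased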

EngTTL
is the Tokenized version of the EuroParl Parallel Corpus\English European Parliament speech transcripts; huge: 54,214,969 Tokens long, a Python List of Token strings.

EuroParl Parallel Corpus\English europarl-v7.es-en

is a compilation of vintage European Parliament speeches in English and many other European languages, scrupulously correlated to serve as Language Pairs for (an obsolete) Machine Translation method.

Exclude Filter
excludes disqualified Text Slices from Text16D. Most issues involve Punctuation: FullStop, Comma and Quotes in particular. Non-standard, antiquated usage also makes trouble (Das Kapital). There is 1 Exclude Filter for Texts using Quotes (Das Kapital, Ulysses, Dubliners) and a different Exclude Filter for Texts without Quotes (EuroParl).

Here's part of the Exclude Filter:

    ['—', '“', '”', '(', ')', '‘', '"', "'", '@', "``", '...',
     'pp', '&', 'c.', 'b', '/',
     'lb', 'lbs', 'i.', 'e.', '***',
     '12', 'm-c-m', '2d.', '3d.',
     '£', 'v', '°', '100°', '90°', '12', 'v.',
     'c', '10', '20:100', '1:1', '7½', '5:3',
     'l.c.', 'p.', '456', '51', '3s.',
     '180,000,000', '26½', 'les', 'comptes', 'à',
     'dijon', 'pour', '18', 'l.', 'ce', 'que', 'l', '220',
     ...]

Text16D
is of Python Dictionary datatype. It's TTL Sliced into 16gram Slivers, then booked as dictionary entries: the 16gram as Lemma, the (unused) count of 16 as Gloss. Python Dictionaries are orders of magnitude faster to use than Python Lists, Tuples or Sets, since Dictionaries are optimized Hash Tables. Here's an example of 4gram Slices of the Tuple Blue and the List Green:

Blue = (1,2,3,4,5,6,7,8)
4gram Slices:
(1,2,3,4)
(2,3,4,5)
(3,4,5,6)
etc.

Or for Text Tokens:

Green = ['the', 'meantime', ',', 'i', 'should', 'like', 'to', 'observe', 'a', 'minute']
4gram Slices:
['the', 'meantime', ',', 'i']
['meantime', ',', 'i', 'should']
[',', 'i', 'should', 'like']
['i', 'should', 'like', 'to']
etc.
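A minimal sketch of booking Text16D from such Slices (names are illustrative):

    def book_text16d(ttl, n=16):
        text16d = {}
        for i in range(len(ttl) - n + 1):
            sliver = tuple(ttl[i:i + n])   # Lemma: the 16gram Tuple of Tokens
            text16d[sliver] = n            # Gloss: the (unused) count of 16
        return text16d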

AnteSeqD
entries have Antes as Lemma and Sets of Sequiturs as Glossa. TTL is traversed, isolating Ante/Sequitur pairs to book AnteSeqD. An Ante is an initial Text segment, and its Sequitur is a following Text segment. Sometimes a single Ante reappears multiple times with different Sequiturs. Both Ante and Sequitur are implemented as Python TUPLEs of Token strings; by convention their length is limited to between 4 and 8 Tokens. Since the Glossa are Python Sets, all their elements are Unique: there are no repeated Sequiturs per Gloss (though they can repeat across all the Glossa).
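A minimal sketch of this booking traversal, under the Chunk conventions above and assuming that Ante and Sequitur share the HingeWord Token where one ends and the other begins (illustrative, not the actual booking code):

    from collections import defaultdict

    def book_anteseqd(ttl, hingewords):
        anteseqd = defaultdict(set)
        for i, tok in enumerate(ttl):
            if tok not in hingewords:
                continue
            # Chunks of length 4-8 ending at HingeWord i are Antes...
            antes = [tuple(ttl[i - n + 1:i + 1]) for n in range(4, 9)
                     if i - n + 1 >= 0 and ttl[i - n + 1] in hingewords]
            # ...and Chunks of length 4-8 starting at i are their Sequiturs
            seqs = [tuple(ttl[i:i + n]) for n in range(4, 9)
                    if i + n <= len(ttl) and ttl[i + n - 1] in hingewords]
            for ante in antes:
                for seq in seqs:
                    anteseqd[ante].add(seq)   # Set Gloss: no repeated Sequiturs
        return dict(anteseqd)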

Here is a sample AnteSeqD entry:

('of', 'that', 'kind', 'to') {('to', 'what', 'is', 'merely', 'a'), ('to', 'what', 'is', 'merely', 'a', 'report', ',', 'not')}

SequiturFlush
Evidently there was just 1 place in the Text, not 2, where the Ante ('of', 'that', 'kind', 'to') appeared, and it was followed by the Sequitur:

('to', 'what', 'is', 'merely', 'a', 'report', ',', 'not')

Notice however that 2 Sequiturs are listed, and one is a subsegment of the other: they form a SequiturFlush. Sequiturs are Chunks of length 4-8 delimited by HingeWords. Sometimes a SequiturFlush has 2+ subSequiturs telescoped within the same longest Sequitur. All such SequiturFlush members are legitimate Sequiturs, since they catalyze valid CycloParas and reduce Determinism.

ParaFlush
are analogous to SequiturFlushes, but for Paras of course. For example:

ANTE         ('on', 'thursday', 'prior', 'to', 'the')
PARAFLUSH1a  ('the', 'start', 'of', 'the', 'application', 'of', 'the')
PARAFLUSH1b  ('the', 'start', 'of', 'the', 'application', 'of')
PARAFLUSH2a  ('the', 'start', 'of', 'the', 'debate', ',', 'at', 'the')
PARAFLUSH2b  ('the', 'start', 'of', 'the', 'debate', ',', 'at')
PARAFLUSH3a  ('the', 'start', 'of', 'a')
PARAFLUSH3b  ('the', 'start', 'of', 'a', 'sitting', 'in', 'the')
PARAFLUSH3c  ('the', 'start', 'of', 'a', 'sitting', 'in')

NumASD*
is an isomer Dictionary of AnteSeqD. NumASD contains the same information as AnteSeqD, but in a different format to expedite downstream processing. Here is a sample NumASD entry:

LEMMA: INTEGER  100
GLOSS: LIST     [
  ANTE:            ('on', 'this', 'subject', 'in'),
  SHORT ANTE:      ('on', 'this'),
  SEQUITUR:        ('in', 'december', '1995', '.', 'the', 'result', 'of'),
  SHORT SEQUITUR:  ('in', 'december') ]

The Lemma is a Unique integer (not an Ante). The Gloss is a List of the Ante, the ShortAnte, the Sequitur and the ShortSequitur. While AnteSeqD included multiple Sequiturs for a single unique Ante, NumASD has a separate Entry for each Ante/Sequitur pair. Further, NumASD contains ShortSlivers for both Ante and Sequitur (their first two Tokens respectively), since the shortest Solapa is of length 2 by convention. These Shorts greatly expedite downstream Dictionary booking, since isolating the 2-Token Solapa becomes a fast Dictionary look-up instead of a much slower Python Slice or len() operation performed sequentially across a List or Tuple.
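A minimal sketch of the AnteSeqD-to-NumASD transformation described above (illustrative, not the actual code):

    def book_numasd(anteseqd):
        numasd = {}
        num = 0
        for ante, sequiturs in anteseqd.items():
            for seq in sequiturs:   # one Entry per Ante/Sequitur pair
                # Gloss: [Ante, ShortAnte, Sequitur, ShortSequitur]
                numasd[num] = [ante, ante[:2], seq, seq[:2]]
                num += 1
        return numasd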

HingeWords
('the', 'of', 'to', 'a', 'was', 'his', 'in', 'had', 'said', 'for', 'at', 'him', 'on', 'as', 'not', 'were', 'be', 'would', 'up') delimit both edges of Chunks. A set of just 19 HingeWords is used here by convention. These Hinges include some of the most frequently occurring Tokens in our English Training Text, yet the few most frequent Megas are NOT included as HingeWords. In Text, a HingeWord Solapa binds each Ante with its Sequitur.

Why these 19? They are very frequent, so they serve as convenient Chunk delimiters: most stretches of Text will contain several HingeWords. If infrequent Tokens were used as Hinges instead, they would occur only every 300-500 Tokens (for example), making it impractical to Slice Training Text into short, length 4-8 Chunks. Chunk length would have to be much, much longer: untenable, since that would clearly amount to Plages.
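A minimal sketch of HingeWord-delimited Chunk cutting under these conventions (illustrative, not the SynTexter's actual Cut code); note that one starting Hinge can pair with several ending Hinges, which is what produces Flushes:

    HINGEWORDS = {'the', 'of', 'to', 'a', 'was', 'his', 'in', 'had', 'said', 'for',
                  'at', 'him', 'on', 'as', 'not', 'were', 'be', 'would', 'up'}

    def cut_chunks(tokens):
        hinges = [i for i, t in enumerate(tokens) if t in HINGEWORDS]
        chunks = []
        for a in hinges:
            for b in hinges:
                if a < b and 4 <= b - a + 1 <= 8:          # length 4-8 by convention
                    chunks.append(tuple(tokens[a:b + 1]))  # HingeWords on both edges
        return chunks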

Some MegaTokens are rejected as HingeWords: by far the most frequent English Tokens are the punctuation symbols '.' and ',' (and to a lesser extent others like ';'). If, for example, '.' were used as a Hinge then any end-of-sentence would be an Ante with any beginning-of-sentence as its Sequitur. The result would be incoherent, indiscriminate, Zapped SynTex, since it would laxly allow ANY other sentence to Extend from an Ante, in total disregard for Topicality or Semantics. Careful choice of HingeWords probably impacts SynTex quality greatly and needs to be investigated.

Dictionary Sizes
are hard to compare. Python Dictionary length is just the number of entries, yet Glossa can be very short or very long. The Dictionary FileSize in bytes is perhaps more appropriate, since filesize matters when loading, dumping and traversing, independent of the number of entries. In a traversal it's typically every Para in a Gloss that's reviewed, and there might be 100s of Paras per Entry.

Here are some sizes for the present implementation:

           dictionary entries   KB file size
EngTTL             54,214,969        769,104
TTL                 2,000,000         26,236
Text16D             1,874,403        169,937
PD918               1,902,193        199,189
AnteSeqD              315,949         46,202
NumASD                550,842         64,507
APD                   290,666      2,072,377
ATD                   182,289         36,398
PATD                  165,252         20,694

APD
Here's an example APD entry:

('on', 'thursday', 'prior', 'to', 'the', 'start', 'of', 'the')
[('the', 'presentation', 'and', 'preparation', 'of'),
 ('the', 'presentation', 'by', 'the', 'commission', 'of', 'the'),
 ('the', 'presentation', 'and', 'examination', 'of'),
 ('the', 'presentation', 'on', 'behalf', 'of'),
 ('the', 'presentation', 'of', 'a'),
 ('the', 'presentation', 'on', 'behalf', 'of', 'the'),
 ('the', 'presentation', 'of', 'a', 'common', 'position', 'by', 'the'),
 ('the', 'presentation', 'and', 'definition', 'of', 'agricultural', 'policy', 'in'),
 ('the', 'presentation', 'of', 'a', 'resolution', '.', 'the'),
 ('the', 'presentation', 'of', 'amendments', 'is', '10.00', 'a.m.', 'on'),
 ('the', 'presentation', 'of', 'a', 'report', 'on'),
 ('the', 'presentation', 'of', 'a', 'report', 'on', 'the'),
 ('the', 'presentation', 'of', 'various', 'amendments', 'to'),
 ('the', 'presentation', 'of', ',', 'and', 'the'),
 ('the', 'presentation', 'of', 'their', 'works', 'to', 'the'),
 ('the', 'presentation', 'by', 'the', 'council', 'of', 'a'),
 ('the', 'presentation', 'of', 'various', 'amendments', 'to', 'the')]

with Lemma/Ante: ('on', 'thursday', 'prior', 'to', 'the', 'start', 'of', 'the')

and the Gloss: the List of Paras shown above, all sharing a 2Solapa.

APD is deep, since its Glossa include ALL Paras (Paraphrases of Sequiturs). APD Lemma/Antes are the Antes of Ante/Sequitur pairs, implemented as Python TUPLEs of Token strings. At this stage the Para Magnitude isn't yet calculated, and the final number of Paras isn't yet known, so all Paras are Listed, whether of Hi or Low Magnitude. APD booking is a slow bottleneck, since for each Sequitur of each Entry the whole Training Text must be traversed searching for Paras: worse than Quadratic time. APD is booked in several stages using 3 intermediate dictionaries, OD, ClumpumD and APDpre, to expedite booking time.

Dictionary Booking Order and Purpose
First AnteSeqD is transformed into auxiliary dictionary NumASD (discussed in the AnteSeqD section).

Here are some sample OD entries:

217559  ('a', 'bad', 'name', '.', 'we', 'should', 'not', 'be') ('a', 'bad') ('be', 'constantly', 're-regulating', 'incinerators', '.', 'we', 'should', 'be') ('be', 'constantly')

477552  ('a', 'bad', 'name', 'in') ('a', 'bad') ('in', 'the', 'past', 'when', 'most', 'of') ('in', 'the')

477553  ('a', 'bad', 'name', 'in') ('a', 'bad') ('in', 'the', 'past', 'when', 'most', 'of', 'the') ('in', 'the')

477555  ('a', 'bad', 'name', 'in', 'the') ('a', 'bad') ('the', 'past', 'when', 'most', 'of') ('the', 'past')

Next, OD is booked from NumASD. OD is an OrderedDict: a Python datatype where dictionary entries are kept in order, in this case ordered by the ShortAnte ('a', 'bad'). All Entries whose Antes share that same ShortAnte are then booked clumped together as neighbors, in order, obviating any need for exhaustive, slow traversals of entire dictionaries or texts to locate all such Antes.

Here's an example ClumpumD entry:

('the', 'position')
{('the', 'position', 'to', 'take', 'in'),
 ('the', 'position', 'of', 'the', 'committee', 'on'),
 ('the', 'position', 'taken', 'by', 'the', 'commission', 'in'),
 ('the', 'position', 'taken', 'by', 'the'),
 ('the', 'position', 'the', 'members', 'of', 'the'),
 ('the', 'position', 'of', 'the', 'european', 'union', 'in'),
 ('the', 'position', 'regarding', 'the'),
 ('the', 'position', 'the', 'members', 'of'),
 ('the', 'position', 'of', 'these', 'member', 'states', 'in'),
 ('the', 'position', 'regarding', 'amendments', '.', 'the'),
 ('the', 'position', 'as', 'regards', 'the', 'stability', 'of', 'the'),
 ('the', 'position', 'of', 'the', 'victims', 'of')}

The Lemma is a ShortSequitur, while the Gloss is a Set of Unique Sequiturs sharing that same ShortSequitur, again obviating any need to traverse entire Texts, Lists or Dictionaries searching exhaustively for all such Sequiturs: they're already listed clumped together as neighbors.

Then the ClumpumD dictionary is booked from OD. ClumpumD has Python SETs as the datatype for its Glossa, so each Gloss contains only Unique Paras with no repeats.

Here's a sample APDpre entry:

100
[('in', 'december', 'that', 'eurodac', 'would'),
 ('in', 'december', 'as', 'part', 'of'),
 ('in', 'december', '1999', 'we', 'approved', 'a'),
 ('in', 'december', '1998', ',', 'this', 'report', 'on'),
 ('in', 'december', '1998', ',', 'of', 'the', 'need', 'for'),
 ('in', 'december', '1999', ',', 'but', 'the'),
 ('in', 'december', 'a', 'compromise', 'was'),
 ('in', 'december', ',', 'i', ',', 'together', 'with', 'a'),
 ('in', 'december', '1999', 'only', 'refers', 'to', 'the'),
 ('in', 'december', '.', 'these', 'storms', 'were'),
 ('in', 'december', '1979', 'and', 'a', 'low', 'of'),
 ('in', 'december', 'of', 'last', 'year', '.', 'the'),
 ('in', 'december', '1999', 'there', 'were'),
 ('in', 'december', '1999', 'to', 'set', 'plans', 'in'),
 ('in', 'december', '1999', 'at')]

APDpre is the next intermediate dictionary to be booked. Each Gloss is a List of all the Sequiturs sharing a same ShortSequitur back in ClumpumD. The List members are Unique, since they derive from a Set in ClumpumD; they are transformed to Lists (for unclear reasons) by book_APDpre_LIST_1(limit), which uses the Lemma/Num and Gloss/ShortS data of each NumASD entry to book APDpre, provided the ShortS is present in ClumpumD as a Lemma. APDpre Glossa are, counterintuitively, Python LISTs: an unresolved downstream implementation problem precludes using the SET datatype for APDpre Glossa. APDpre booking is slow, but only 2 minutes or so for the current implementation.

Finally APD proper is booked by Refine_APDpreLIST(testlimit), using the Lemma/Num of each NumASD entry: if Num is present as a Lemma in APDpre the booking can proceed.

Any Antes or Sequiturs in the APDpre Glossa are filtered out to book the final APD. Lists of Paras are generated from Ante/Sequitur pairs, and sometimes an Ante can legitimately appear in a Para/Gloss List, but such Antes must be removed for implementation reasons. The situation is similar for Sequiturs in Para/Glossa.

Refine_APDpreLIST removes both. It's a major bottleneck, taking about 10 minutes for the current implementation. The expectation is for a SET-based APD booker eventually, executing much faster than the LIST-based one.

PATD
PATD is the final dictionary Extend uses to generate SynTex. All its Entries are Cyclical, so every CycloPara in every Gloss refers to the Lemma of a different Entry in PATD, thus ensuring continuity: SynTex can Extend indefinitely with no discontinuities.

Python Dictionary Entries have a Lemma and a Gloss. An Entry is accessed (looked up) by the Lemma, while the content of the Entry is the Gloss. PATD Lemma are Antes: the current CycloPara to Extend from. The Lemma is a Python TUPLE of 4-8 Tokens, themselves Python STRINGs. PATD Glossa are Python ORDERED LISTs of a limited number of the highest-Magnitude CycloParas that can Extend from the current Ante.

Any CycloPara in the List is a valid Extension, but various constraints improve SynTex quality. For example, higher-Magnitude CycloParas tend to fit a context better than lower-Magnitude ones. Then there is machinery in place to minimize Plages: SynTex should not quote verbatim from the Training Text for more than N Tokens (around 9-12). RunOns should also be minimized: SynTex should not have repeating text segments (very often), even though they are not Plages.

Extensions from PATD require a certain amount of Randomness in selection. Without any Randomness, SynTex would ALWAYS Extend in the SAME way (the 'optimal' way), so no matter where the SynTex started from, it would soon converge to the SAME 'optimal' SynTex. Random Extension choice is a major way to avoid Deterministic SynTex.

Here is a sample PATD entry:

('the', 'competent', 'services', 'have', 'not', 'included', 'them', 'in')
[(3, 8, ('in', 'the', 'agenda', 'for', 'this', 'part-session', 'to', 'the')),
 (3, 4, ('in', 'the', 'agenda', 'of')),
 (3, 7, ('in', 'the', 'agenda', 'for', 'this', 'part-session', 'to')),
 (3, 4, ('in', 'the', 'agenda', 'for')),
 (3, 5, ('in', 'the', 'agenda', 'of', 'the'))]

('the', 'competent', 'services', 'have', 'not', 'included', 'them', 'in') is the Lemma/Ante: a Python TUPLE containing 8 Tokens, themselves Python STRINGS.

[(3, 8, ('in', 'the', 'agenda', 'for', 'this', 'part-session', 'to', 'the')), (3, 4, ('in', 'the', 'agenda', 'of')), (3, 7, ('in', 'the', 'agenda', 'for', 'this', 'part-session', 'to')), (3, 4, ('in', 'the', 'agenda', 'for')), (3, 5, ('in', 'the', 'agenda', 'of', 'the'))] is the Gloss: a Python LIST of (in this case) 5 CycloParas, themselves Python TUPLEs containing Python STRING Tokens. CycloParas are Chunks, so their length is 4-8 Tokens by convention. Each CycloPara is a Trinomial: (Magnitude, Length, CycloPara text). In this case all the CycloParas are of Magnitude 3, their lengths varying from 4 to 8.

Generally, higher Magnitude promotes better Contextuality and less Zap. Conversely, higher-Magnitude CycloParas are exceedingly rare in Training Text: 3-Mags are 2-3 orders of magnitude rarer than 2-Mags, and 4-Mags are yet again 2-3 orders of magnitude rarer than 3-Mags, hence 4-6 orders of magnitude rarer than 2-Mags. The exceedingly rare HiMags fit SynTex context extremely well, while vulgar LowMags fit only generically.

How to get more HiMags? They are proportional to Training Text length. If, for example, 500 5-Mag CycloParas are found in 2 million Tokens of Training Text, then 1000 5-Mags can be expected in 4 million Training Text Tokens. Clean, long, valid Training Texts in proper format are very rare, a precious commodity. For longer Texts, computer performance issues arise, for example when 4 million Tokens must be compared against 4 million other Tokens. Machine Learning Neural Network methods address these performance issues by Matricizing the Training Text, then processing it in specially-designed GPUs. As of 2018 this is an emerging new technology.

When a SynTex ends in ('the', 'competent', 'services', 'have', 'not', 'included', 'them', 'in'), that SynTex is Extended by choosing a CycloPara from the 5 Candidates in the Gloss.
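A minimal sketch of such an Extension step, assuming PATD's structure as described above and assuming Ante and CycloPara share their edge HingeWord (illustrative, not the actual Extend code):

    import random

    def extend_once(syntex, patd, pool=2):
        # try the longest candidate Tail first, Chunk lengths 8 down to 4
        for n in range(8, 3, -1):
            ante = tuple(syntex[-n:])
            if ante in patd:
                # random choice amongst the 'pool' highest-Magnitude CycloParas
                mag, length, cyclopara = random.choice(patd[ante][:pool])
                syntex.extend(cyclopara[1:])   # drop the shared edge HingeWord
                return mag
        return None  # no Extension found: shouldn't happen in a purely Cyclical PATD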

PD918
entries look like this:

('in', 'the', 'meantime', ',', 'i', 'should', 'like', 'to', 'observe') ('a', 'minute', 's', 'silence', ',', 'as', 'a', 'number', 'of')

They amount to 18gram Slivers of the Training Text and serve to detect Plages. A PD918 Lemma is a Python Tuple of the first 9 Tokens, while the Gloss is another Tuple of the next 9 Tokens. EcoAutopsy (in P28FJ_Extend2) uses PD918 to detect Plages of any length between 9 and 18. For efficient traversal it's more effective to use 18gram Text Slivers in just 1 dictionary, PD918, than to use separate dictionaries of 9grams, 10grams, ... 18grams. Python memory is limited, and the single PD918 dictionary is only a fraction (about 1/9th) of the size of the alternative approach with separate dictionaries.
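A minimal sketch of booking PD918 as described above (in this sketch a repeated 9gram Lemma simply overwrites its earlier Gloss):

    def book_pd918(ttl):
        pd918 = {}
        for i in range(len(ttl) - 17):
            pd918[tuple(ttl[i:i + 9])] = tuple(ttl[i + 9:i + 18])  # Lemma -> Gloss
        return pd918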

ATD
The ATD dictionary precedes PATD. ATD Lemma are Antes; the Glossa are Magnitude-ordered Lists of Trinomial Tuples. Glossa are limited to 8 CycloParas: the highest-Magnitude 8 available. ATD Entries are not guaranteed Cyclical, and in fact many are not. Cyclado reads ATD to produce the Cyclical PATD at a later stage. Here is an ATD entry:

('of', 'the', 'new', 'year', ',', 'the')
[(3, 8, ('the', 'number', 'of', 'accidents', '.', 'he', 'objects', 'to')),
 (3, 6, ('the', 'number', 'of', 'members', 'of', 'the')),
 (3, 5, ('the', 'number', 'of', 'women', 'in')),
 (3, 5, ('the', 'number', 'of', 'commissioners', 'in')),
 (3, 8, ('the', 'number', 'of', 'countries', 'and', 'companies', 'prepared', 'for')),
 (3, 8, ('the', 'number', 'of', 'consultants', 'for', 'judges', 'at', 'the')),
 (3, 8, ('the', 'number', 'of', 'judges', 'at', 'the', 'court', 'of')),
 (3, 5, ('the', 'number', 'of', 'reservoirs', 'of'))]

Its Lemma is an Ante: ('of', 'the', 'new', 'year', ',', 'the'), while the Gloss is a List of 8 Paras (Paraphrases of the original Sequitur). Each Para is a Trinomial of (Magnitude, Length, Para text).

Paras are CycloParas before Cyclicity is guaranteed. In the example case all 8 Paras are 3Mags, with lengths varying from 5 to 8. A Magnitude 3 Para has its first 3 Tokens, ('the', 'number', 'of'), in common with a Sequitur that originally appeared in the Training Text sharing those same 3 Tokens. This stretch of 3 is called the Solapa (Spanish: flap, margin, overlap).

ATD Glossa are ORDERED SETs: all their constituents are Unique, with no repeats. The ordering is by Magnitude, so the highest-Magnitude Paras come first in the Gloss List. Later, at Extension time, the SynTexter may be constrained to only consider the first N CycloParas of PATD Entries.

Only using the first 2 CycloParas will promote higher-Magnitude context matches and better quality SynTex, but also more Deterministic SynTex and possibly more RunOns. HiMags also increase the risk of Plages, since their Solapa are wider, making it more likely to quote longer stretches of Training Text verbatim.

Conversely, using the first 5 CycloParas may allow lower-Magnitude Extensions, hence less specific context matching and more Zap, but also (perhaps) fewer RunOns and less Deterministic SynTex. There's less risk of Plages, at the cost of vaguer Appropriateness and more Zap.

ATD bookATD (in P28FJ_4_ATD_Cyclado_PATD_2)
bookATD books the ATD dictionary from (APD, AnteSeqD). The actual booking is done in Paraphraser(Lemma). ATD entries are Trinomials of (Magnitude, Length, Para), for example: (2, 6, ('of', 'the', 'structural', 'funds', ',', 'the')). bookATD's top loop traverses all the Sequiturs of all the entries in AnteSeqD[lemma], comparing them to all the Paras from APD[lemma]. The actual comparison is done by glenSolapa(Seq, APD_Para). glenSolapa returns the Para Magnitude; then the Trinomial (a future ATD entry) is composed and tabulated by the Frequent_CycloPara(Postulate) Metric. APD_Para_Magnitude_BellD also tabulates for Metrics. The user specifies the Exclusivity Threshold: for example, only Paras of Magnitude 3+ will be considered (omitting the numerous and more general Magnitude 2 Paras). If the current Para's Exclusivity qualifies, that Para is added to the ParaSet. The ParaSet is a SET, so it contains only Unique constituents. Control is passed to ATD_Exclusive_Booking for possible booking to ATD.

Paraphraser
compares each Seq (Sequitur) in each AnteSeqD Gloss with each APD_Para. Paraphraser builds a Trinomial (Magnitude, Length, Paratext) and, if the Magnitude meets the Exclusivity Threshold, that Trinomial is added to the ParaSet: a SET of unique Paras. Finally, ATD_Exclusive_Booking decides whether to book the ParaSet to ATD. glenSolapa(Seq, APD_Para) performs the actual comparison, returning the Magnitude. Various Metrics are updated: Frequent_Postulates, Total_APD_Paras_traversed, APD_Para_Magnitude_BellD, APD_Para_Width_BellD[Para_Width], APD_Para_Width_Bell_total_tab.

glenSolapa
is the core of bookATD. glenSolapa performs the actual Solapa width comparison between a Sequitur and a Para. The width of the Solapa becomes the Magnitude of the Para. For example, if a Sequitur and a Para have the same first 3 Tokens, the Solapa Width is 3 and the Para Magnitude is 3. glenSolapa only checks for Widths between 3 and 8, since every (Sequitur, Para) Solapa Width is at least 2, and Solapa Widths greater than 8 are exceedingly rare, so they are treated as Magnitude 8.
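A minimal sketch matching that description (by construction the first 2 Tokens already match, so counting from 0 is harmless; widths over 8 are reported as 8):

    def glen_solapa(seq, para, max_width=8):
        width = 0
        for a, b in zip(seq, para):
            if a != b:
                break
            width += 1                 # count the shared leading Tokens
        return min(width, max_width)   # the Solapa Width = the Para Magnitude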

ATD_Exclusive_Booking
decides whether to book a ParaSet to ATD. First the ParaSet is ordered by Magnitude, so the highest-Magnitude Paras are frontmost. Then, if the ParaSet length exceeds 1, only the first 8 Paras (those with the highest Magnitude) are booked to ATD[Lemma]. Various Metrics are updated: Trimmed_Para_Sigma, Sum_ATD_CycloPara(Postulates)_booked, Null_Exclusive_APD_Paras_not_ATD_booked, Mono_Exclusive_APD_Paras_not_ATD_booked.
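A minimal sketch of that decision (illustrative; Paras are assumed to be (Magnitude, Length, Text) Trinomials):

    def atd_exclusive_booking(atd, lemma, paraset, limit=8):
        # order by Magnitude, highest frontmost
        ordered = sorted(paraset, key=lambda trin: trin[0], reverse=True)
        if len(ordered) > 1:               # Null/Mono ParaSets are not booked
            atd[lemma] = ordered[:limit]   # keep only the 8 highest-Magnitude Paras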

possible improvements for ATD_Exclusive_Booking
Unclear why NewGloss=list(ordered) is used. It's probably just wasting processing time.

Gloss_Length Metric
displays ordered, tabulated Gloss length Metrics for (APD, ATD, PATD). APD, for example, has many entries with extremely long Gloss lengths in the 10s of 1000s. Clearly so many are unnecessary: they are probably Repeats, not Uniques, and limiting Glossa to an arbitrary 1000 would probably speed up bookATD.

APD_Para_Magnitude_Bell
is a Metric executed during ATD booking. It shows that nearly all (96.76%) APD Paras have just Magnitude 2, with a tiny 2.87% at Magnitude 3 and minuscule fractions at greater Magnitudes. For APD, not ATD.

CycloPara Postulate_Magnitude_Bell
should work for (ATD, PATD, etc.). It tabulates CycloPara Magnitudes, showing ATD (with Exclusivity 3) has 47.8% Mag3, 38.8% Mag4, 11% Mag5, 2% Mag6.

APD improvements
Gloss_Length shows APD with numerous extremely long Glossa in the 10s of 1000s. Clearly so many are unnecessary: they are probably Repeats, not Uniques, and limiting to an arbitrary 1000 would probably speed up bookATD. Are APD Glossa SETs with Unique Entries?

Reinitialize_book_ATD
performs the numerous reinitializations for each bookATD session (that's one reinitialization).

Frequent CycloPara
are tabulated in Paraphraser. Why do a few CycloPara reoccur much more frequently than most?

bookATD performance
is too slow, a major bottleneck: 1,150 s (about 19 minutes) to process 182,289 ATD entries from 290,666 APD entries, some 748,353,448 comparisons in all, originating from 2M Tokens of Training Text.

PATD Cyclado (Cloop, Cmain, Cyclado in P28FJ_4_ATD_Cyclado_PATD_2)
Cyclado books the PATD dictionary of Cyclical Entries from ATD. Every Entry has an Ante/Lemma and Para/Glossa. An Entry is Cyclical if each of its Para/Glossa has a referent: it points to another PATD Entry's Ante/Lemma. SynTex Extension relies on Cyclicity: if a PATD Entry were not Cyclical, its Para/Glossa would point somewhere undefined outside PATD, so there would be no Extension and SynTex Generation would fail. All PATD Entries are therefore Cyclical (as far as is known), but with too many (18%) Monos. Monos are Entries with just 1 possible Continuation (Para/Glossa); they spawn RunOns, hence are undesirable. Cyclado works well enough in practice, but for unclear reasons.

How Cyclado Filters out non-Cyclicals
Cyclado currently iterates through a Previous dictionary searching for non-Cyclical entries (initially starting with the ATD Dictionary, but afterwards using intermediate temporary dictionaries). Not all non-Cyclicals are identified at once, only SOME of them: the ones that lack referents in the current Previous Dictionary.

Why? Let's say at Time 1, Dictionary 1, the 1st iteration, an Entry A with Gloss B is non-Cyclical since there is no other Entry with Lemma B. Proceeding thus, all non-Cyclicals for Time 1 are easily identifiable, and a cleaner, smaller Dictionary 2 is booked that omits all the non-Cyclicals detected in Dictionary 1. Unfortunately (many) OTHER Entries in Dictionary 1 might ALSO have been referencing Entry B, which was OMITTED from Dictionary 2. So now any such OTHER Entries in Dictionary 2 have non-Cyclical Glossa referencing the non-existent Entry B, and all such OTHERs need updating. SOME of them will also be legitimate non-Cyclicals for Dictionary 2, Time 2, requiring ANOTHER iteration with Dictionary 3, Time 3.

In practice this approach does produce an end dictionary of pure Cyclicals, but with 18% Monos, undesirable for other reasons. ==>Perhaps the Cyclado Omission test should ALSO include Monicity as well as Cyclicity?

Cyclado doesn't, but should, search instead for Cyclical entries directly. But how? As it stands, Cyclado books a Next dictionary omitting all the identified non-Cyclicals, so the Next dictionary contains fewer non-Cyclicals than the Previous dictionary. The process is iterated until there are no more non-Cyclicals to be found: when the Previous and Next dictionaries have the same length.

non-Cyclicals
Why do non-Cyclicals exist in the first place? Before PATD, dictionary Entries don't require that each Para in the Gloss be Cyclical; that's Cyclado's job. A Cyclical Entry A has every Para in its Gloss pointing to the Lemma B of another Cyclical Entry. But at the instant Entry A's Cyclicity is being verified, the Cyclicity of Entry B is not yet established, so A's Cyclicity cannot be established either, since were Entry B NOT Cyclical after all, then A would NOT be Cyclical either. It's an Infinite Regress problem.

An example: let's say at Time 1 the ATD dictionary has 100 entries and we are trying to book Entry #101. For #101 to be Cyclical, all of its Glossa must already be present as Entry Lemma somewhere in the previous 100 Entries. That's easy enough to check. But our ATD dictionary will eventually contain a HUGE number of entries; say just 2000 total. At Time #101, half those 2000 are not yet booked, and Entry #101 might ultimately turn out to be Cyclical once all 2000 Entries are booked, but that's not knowable at Time 1.

If we just book all 2000 entries, then we can test for Cyclicity. Now suppose some (100/2000) Entries are in fact non-Cyclical, since their Glossa are not other ATD entries. Then we would need to delete the non-Cyclical polluters. But now OTHER Entries are probably pointing to the deleted polluters, so all those OTHERs would need to be updated to remove references to deleted Entries. And so forth, in reverse. At the very least, a performance issue for the implementation.

Cloop
has 3 purposes:
1) avoids infinite looping via a hardwired kludge: a 30-iteration limit
2) exits upon completion, when there are no more new current_nonCyclical_Paras
3) initializes another iteration until final completion by passing a COPY of InDict to Cmain
   why(???) with great confusion on InDict/PATD status: a global? on the function-parameter stack?

Cmain
performs the actual segregation of Cyclicals from non-Cyclicals. When a Para/Gloss from the Previous Dictionary is also present as an Ante/Lemma in a different Previous Dictionary Entry, it is Cyclical; otherwise it's non-Cyclical. The Cyclical Para/Glossa are then booked to the Next Dictionary, provided they have a Gloss: no NULL-Gloss Entries are permitted.
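A minimal sketch of the whole Cloop/Cmain iteration as described above (illustrative, not the actual code; Glossa are assumed to be Lists of (Magnitude, Length, Text) Trinomials):

    def cyclado(atd):
        prev = dict(atd)
        for _ in range(30):  # the hardwired iteration limit, as in Cloop
            nxt = {}
            for lemma, gloss in prev.items():
                # a Para is Cyclical if its Text is a different Entry's Lemma
                cyclical = [trin for trin in gloss
                            if trin[2] in prev and trin[2] != lemma]
                if cyclical:               # no NULL-Gloss Entries permitted
                    nxt[lemma] = cyclical
            if len(nxt) == len(prev):      # no more non-Cyclicals found
                return nxt
            prev = nxt
        return prev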

Mono Control
Perhaps a tweak of Next Dictionary booking requiring Gloss length 2+ would eliminate Monos? This possibility hasn't been explored as of Oct22 2018.

Cyclado Performance
is fast enough, not a problem.

Extend (in P28FJ_Extend2)
uses the PATD Cyclical Dictionary to Extend SynTex by following Ante/CycloParas. Some amount of random choice is necessary, else every SynTex generation would be the same identical 'optimal' generation, always choosing the first, highest-Magnitude CycloParas of PATD entries and never bothering with any other CycloPara.

Import/Load cell
Various modules need importing, especially P28FJ_Utilities1. P28FJ_LoDumDefs1 defines the Load and Dump functions and also controls the Loads for each Script opening. Each Script needs Dictionaries and Texts of data to operate on, often Dumping results. Data is currently stored at J:\P28FreezeJapan\2M. This cell is stable. Extend needs PATD and PD918 to work. PATD is the main dictionary: Lemma:Ante, then Gloss:List of CycloParas ordered by Magnitude. PD918 is used for Plage detection by EcoAutopsy.

LoDumDefs Usage
LoDumDefs was an attempt to centralize all load/dump-related code in 1 place. Once set up for a new Script Repository (P29FreezeJapan, P27, P26), LoDumDefs needs respecification each time different datafiles are to be loaded, such as when changing to a different Script.

Respecifications: inside each Script's top Import Cell:

from P28FJ_Utilities1 import *
from P28FJ_LoDumDefs1 import *

Those 2 valid *.py executables must be resident in the correct directory. For each Script, certain datafiles are required (e.g. ATD_Cyclado_PATD needs APD and AnteSeqD), so in LoDumDefs.ipynb at the bottom:

from P28FJ_LoDumDefs1 import *
L_AnteSeqD(directory)
L_NumASD(directory)
L_APD(directory)

need specification for autoload when the ATD_Cyclado_PATD Script is opened and its Import Cell is executed. Further, after each LoDumDefs change, the *.py executable version needs recompilation, so:

-new changes are saved to LoDumDefs.ipynb
-File/DownloadAs/Python(.py) is selected
-the *.py is carefully saved to the correct directory (P29FreezeJapan in this example)
-use File Explorer to locate, copy and paste into the correct P29FreezeJapan residence where the *.py will be saved (the default save directory is usually wrong)
-edit the new *.py name (wrong by default, with a spurious "(1)")
-overwrite any existing *.py already there
-have the Script file closed before the LoDumDefs.py changes, then open the Script and test that the new LoDumDefs.py works

The data directory (DD) changes whenever new testdata comes into use. The current P28FreezeJapan address is "J:\\P28FreezeJapan\\2M\\Tester_", so any loads/dumps will use that address as the DD.

Detoke cell
Detoke pretty-prints a SynTex list, reforming the Token List into a String for easy reading. Capitalization isn't attempted due to complexity. Paragraphing is arbitrary. There are probably many legacy features from earlier Texts (Dubliners) having quotes and complex punctuation. Detoke is peripheral, mechanical and stable.

EcoAutopsy cell
EcoAutopsy locates Plages in SynTex. A Plage is any stretch of SynTex longer than N (8-12) that quotes Training Text. Longer Plages are worse. 8 is arbitrarily OK, though lengths 9-12 commonly occur, since an 8gram CycloPara's Solapa can easily be 3 or 4 Tokens, meaning a Plage of 8+(3 or 4). Some Plaging is inevitable, but less is better. Eventually perhaps shorter CycloParas might be preferred, since their Solapa would total shorter Plages: a 6gram CycloPara with a 4gram Solapa amounts to a 10gram Plage, while an 8gram CycloPara with the same 4gram Solapa is a 12gram. EcoAutopsy uses PD918, a frugal Dictionary as short as possible. Plages up to 18grams are detected. EcoAutopsy searches for the longest Plages first, down to the shortest: 18gram through 9gram. Repeated false positives are avoided by swapping in a 金 marker, so later traversals don't repeatedly detect spurious PlageFragments. Without the 金, an 18gram Plage would spawn a Flush of spurious shorter Plages.
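One way to realize this scheme, as a hedged sketch (not the actual EcoAutopsy code): each 9gram of SynTex found in PD918 is extended Token by Token into its Gloss, up to 18grams, then masked with 金:

    def eco_autopsy(syntex, pd918):
        plages = []
        for i in range(len(syntex) - 8):
            lemma = tuple(syntex[i:i + 9])
            if lemma in pd918:
                gloss = pd918[lemma]
                n = 9   # extend the 9gram match into the Gloss, up to 18
                while n < 18 and i + n < len(syntex) and syntex[i + n] == gloss[n - 9]:
                    n += 1
                plages.append((i, n))            # Plage position and length
                syntex[i:i + n] = ['金'] * n     # mask spurious PlageFragments
        return plages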

PD918
has a 9gram Lemma and a continuation 9gram Gloss, permitting 9-18gram Plage detection dynamically after the SynTex is generated. PD918 is a TextSliver dictionary built by bookPD918 in P28FJ_2_NumASD_PD_FD_1. It contains TextSlivers of TTL passed through the Exclude Filter.

Extend1
performs the actual Extensions from the PATD Ante/[CycloPara] Dictionary. Extend1 randomly chooses a CycloPara that follows the current Ante.

Random?
Though Extend1 chooses an Extension randomly, the choices have been carefully constrained so the candidates have the highest Magnitudes. The choice-window width is a parameter, so it's a random choice from amongst the N most likely Extensions learned from an X-million-Token Training Text. "Most likely" means having the highest Magnitude: the widest Solapa (margin) shared with the previous Extension, which is currently the Tail of the generated SynTex. Presumably Extensions with wider Solapa are more likely to fit the Tail context than narrower ones. Why? Suppose the Solapa (overlap = margin) were just 1 Token. Then there might be 100s, maybe 1000s, of candidates, hence less likely to fit the current context tightly. For a Magnitude 4 Extension there will be very few such Candidates, each sharing 4 Tokens with the Tail, hence fitting the Context closer, therefore better choices.

Extend continued
Without some randomness, Extensions would CONVERGE to the SAME SynTex each time: an OPTIMAL SynTex, but undesirable. Yet eventually RANDOM choice needs improvement, to a rational regimen simulating Genesis. A hardwired selector at line 82, a=random.choice(Alts[:2]), constrains the candidate-pool size: whether to randomly choose from a few or many candidates. This influences the Magnitude of the choice, since deeper pools will have more lower Mags, hence greater Zap; but shallower pools risk more Repeats, since there are fewer choices. And Training Text length in turn influences the Magnitude range, since longer TTexts afford more higher-Magnitude CycloParas.

Extend1 executes various Metrics: SessionFavorites are listed, Loop length limit, Magnitude Calibre, Monos, Polys. Extend1 starts by printing the Header, randomly chosen from PATD. The Header becomes the Ante for a PATD lookup to get the CycloPara List of viable Extensions. Each Extension print includes Metrics:
-len(Alts): some are minimally short (just 2), others are very long
-the Magnitude, and the Magnitudes of the unchosen candidates
-the actual Extension Text
-Extension Length
-Extension Magnitude

AltctrLimit=30 is currently hardwired to iterate 30 printable Extensions displayed as SynTex. But more iterations take place (100??) to generate Metrics, including SessionFavorites, Monos/Polys, MagCalibre.

SessionFavorites
Fewer, less frequent Favorites is better at the moment, where CycloParas don't derive from pre-Genesis Thought. SessionFavorites are CycloParas chosen more frequently than normal, for various reasons. Sometimes they are Mono RunOns: the ONLY possible continuation. Other times, preselection of the highest-Magnitude Candidates favors certain Favorites. Even in Human Text, Favorites are normal and contribute to a sense of Topic Coherency Flow: suppose no CycloParas occurred in a Human Text; no ideas would be invoked more than once and no continuity would be contributed. In fact Topical Continuity is totally lacking from this SynTex. Unfortunately the overrepresented SessionFavorites in SynTex don't contribute to Topic Coherency, since they are not derived from pre-Genesis Thought.

Monos and Polys
Monos are PATD entries with just 1 CycloPara in the Gloss, while Polys have more than 1. Ideally every entry in the PATD CycloPara Candidate dictionary should have more than one Extension. In effect, Monos force RunOns: long, deterministic stretches of SynTex. If by chance SynTex extends Mono CycloPara A, then forcibly the only further possible Extension is from A's Gloss (let's call it B), so the A:B RunOn is guaranteed and twice as long as usual. The SynTex will contain too many Mono RunOns. In contrast, when every Gloss contains Polys, not Monos, there's a random choice at each Extension, producing less Determinism, more Variety and fewer RunOns, resulting in more plausible SynTex. Currently Cyclado generates about 18% Monos in the PATD dictionary. It's not clear how to avoid the Monos; in fact Cyclado itself works for unclear reasons.

Edges
Text16D gets traversed, keeping tabs on all 3grams regardless, into the TresD: the Lemma is the 3gram, the Gloss is its tally. All 3grams ranking over a Floor Threshold then become Edges, supplanting HingeWords. Next Text16D is retraversed from all the TresD Lemma, but for 4grams, and CuatroD is booked. And so forth up to Ngrams (7?). Problematic Entries are deleted (Punctuation...). Then Text16D gets Cut into max-8grams when displaying Edges as Solapa of proper length. Hopefully the Cuts will become more Content-based, since Edges will include Content Tokens, not just StopWords like the current HingeWords.
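A minimal sketch of one tally-and-threshold pass, assuming the Text is available as a flat Token sequence; ngram_tally, edges and FLOOR are illustrative names:

    from collections import Counter

    FLOOR = 3   # illustrative Floor Threshold

    def ngram_tally(tokens, n):
        # book every ngram as Lemma with its tally as Gloss (cf TresD for n=3)
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def edges(tokens, n, floor=FLOOR):
        # all ngrams ranking over the Floor Threshold become Edges
        return {gram for gram, tally in ngram_tally(tokens, n).items() if tally > floor}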

Genesis
Genesis is the instant when thought becomes word. Presumably thought precedes text; it is often nonverbal. It provides the Topical Continuum that SynTex lacks: a place to derive a Topic from. Thought-modelling will be necessary, tho at this point it is elusive. SynTex might approximate Human Text more closely using a Synthetic Topical Progression template. From a sequence of Training Text a Topical Skeleton is derived with a roster of Topical Tokens: those words that characterize the sequence, minus Frequent Words and minus Rare Words. Frequents dont contribute Topicality, and Rare Tokens are too hard to substitute: for a Rare Token (sodium hydroxide) there will be too few Alternates even in a long Training Text, so Rares are unmanageable.
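A minimal sketch of deriving the roster, assuming a flat Token sequence; the cutoff values and the names topical_tokens, lo, hi are illustrative:

    from collections import Counter

    def topical_tokens(tokens, lo=3, hi=200):
        # drop Rares (tally < lo: too few Alternates) and
        # Frequents (tally > hi: no Topicality); keep the middle band
        tally = Counter(tokens)
        return {tok for tok, n in tally.items() if lo <= n <= hi}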

SynTex Driver
The driver should run the complete Generation process from earliest Text Manipulation through to Extension, all in 1 Script, in a simple-to-run sequence. Currently 4-5 Scripts, requiring as many Dictionaries/Texts, are needed to complete a Generation. Data can be too big to fit into memory all at 1 time, so perhaps completed Dictionaries can be Dumped and cease to exist in the Python process.
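A minimal sketch of Dumping a completed Dictionary so it ceases to exist in the Python process, using the standard pickle module; the path and names are illustrative:

    import pickle

    def dump(dictionary, path):
        # Dump the completed Dictionary to disk
        with open(path, 'wb') as f:
            pickle.dump(dictionary, f)

    def load(path):
        with open(path, 'rb') as f:
            return pickle.load(f)

    # dump(APDpre, 'APDpre.pickle'); del APDpre   # the memory is freed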

Discourage Reps
Extended SynTex shouldnt contain many Reps: repeated CycloPara combinations. The Extension Function may need to keep track of Pre/Pos ordering for all CycloParas Extended in one SynTex. Then Reps can be discouraged unless there's no other choice.
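A minimal sketch of such tracking, assuming the Extension Function knows the previous CycloPara at each step; seen_pairs and choose_discouraging_reps are illustrative names:

    import random

    def choose_discouraging_reps(alts, prev, seen_pairs):
        # prefer candidates whose (prev, candidate) pairing hasnt yet
        # occurred in this SynTex; fall back to any candidate if all are Reps
        fresh = [a for a in alts if (prev, a) not in seen_pairs]
        choice = random.choice(fresh if fresh else alts)
        seen_pairs.add((prev, choice))
        return choice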

PATD Uniques
Cyclado generates a PATD that is full of Unique entries with only 1 CycloPara per Entry. For example, consider these 4 Entries: A:B, B:C, C:D, D:E. All 4 are Unique with only 1 possible continuation, so if A is Extended then only B can continue, and so forth, producing the sequence A-B-C-D-E, which will Repeat anytime A is Extended: a Deterministic sequence. Ideally all PATD entries should have at least 2 CycloPara Continuations, with even more being even better. Current PATDs contain around 18% Uniques. Simply disallowing them would probably upset other Cycles: if Unique PATD Entry A only has the Unique Continuation B, then disallowing B would require backtracking somehow to detect ALL other Glossa referring to B, then deleting them too if they are also Unique.
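A minimal sketch of detecting such Deterministic chains, assuming PATD maps each Entry to a List of continuations; follow_chain is an illustrative name:

    def follow_chain(patd, start):
        # follow Unique (single-continuation) Entries from start until
        # a Poly Entry or a dead end: the Deterministic sequence
        chain, node = [start], start
        while node in patd and len(patd[node]) == 1:
            node = patd[node][0]
            if node in chain:   # guard against a Cycle
                break
            chain.append(node)
        return chain

    # follow_chain({'A': ['B'], 'B': ['C'], 'C': ['D'], 'D': ['E']}, 'A')
    # -> ['A', 'B', 'C', 'D', 'E']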

Much Longer Training Text Processing in APD
SynTex Plausibility is proportional to Training Text length, so training from 2 million Tokens generates a more Plausible Extension because there will be more Higher Magnitude Continuations to choose from. However APD dictionary generation is a slow bottleneck. APD should work by Chapters: Chapter 1, 2 and 3 APD Dictionary entries are generated independently in 3 separate sessions to avoid Python memory issues. Then the Lemma from Chapter 1 should scan the other two Chapters for additional Alternatives, and likewise for APD Dict2 scanning Chapter 1 and 3 Text. And so forth.
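A minimal sketch of the Chapter scheme, assuming Glossa are booked as SETs and assuming some book_chapter function (an illustrative placeholder, not an existing Script) that books one Chapter independently:

    def cross_scan(own_apd, other_chapters, book_chapter):
        # after booking one Chapter in its own session, scan the other
        # Chapters for additional Alternatives under the same Lemma
        for chapter in other_chapters:
            other_apd = book_chapter(chapter)   # a separate, smaller session
            for lemma in own_apd:
                if lemma in other_apd:
                    own_apd[lemma].update(other_apd[lemma])
        return own_apd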

Refine APD SET vs LIST
RefineAPD is a slow bottleneck in its current LIST-based form. A SET-based equivalent seems to process very much faster, but it encounters fatal MEMORY problems partway thru a Booking. Probably some data structure (the SET? the Dictionary Gloss?) is copied onto a STACK so many times that it completely consumes all available memory, producing Python churning and even interference in external processes (Chrome browsers). Eventually Python throws a MEMORY ERROR. Specifically, APDpre Glossa are already SETs.

If APDG=APDpre[Num], then later Removes from APDG permeate back to become Removes from the APDpre[Num] Gloss itself, since APDG is just a synonym (alias) for APDpre[Num]. A MEMORY ERROR eventually happens. If instead APDG=APDpre[Num].copy(), then later Removes from APDG do not permeate back to APDpre. However the copy seems expensive in Memory Space, probably overflowing a Stack somehow, and the eventual MEMORY ERROR happens anyway. We need a way to make Removals from a LOCAL APDG data item that dont permeate, dont overflow an overwhelmed Stack, yet still run much faster than the LIST version. The SET version seems desirable since its members are Unique, with no Repeats. The LIST version of RefineAPD has to turn every SET in every APDpre Gloss into a LIST, a seemingly very expensive operation best avoided.
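A minimal demonstration of the alias-versus-copy behaviour; Num and the sample data are illustrative:

    APDpre = {7: {'a', 'b', 'c'}}
    Num = 7

    APDG = APDpre[Num]            # an alias: the very same SET object
    APDG.remove('a')              # permeates back
    print(APDpre[Num])            # now lacks 'a'

    APDG = APDpre[Num].copy()     # a shallow copy: a new SET
    APDG.remove('b')              # stays local
    print(APDpre[Num])            # still holds 'b' and 'c'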

REGEX version
The current version uses LISTs and SETs: sequences of Tokens. Maybe a REGEX-based Generator is faster or better? Intuitively, String-based comparisons seem faster than LIST- or SET-based ones. A clean way to isolate StringTokens is needed. A SPACE between words is generally useful, but there are numerous complicated subcases. Maybe instead it's faster/better to generate the LISTs first, then change them into STRINGs using an alien TOKEN DELIMITER wherever Python saw a border between 2 Tokens.
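A minimal sketch of the delimiter idea; the ASCII Unit Separator used here is just one possible alien TOKEN DELIMITER:

    DELIM = '\x1f'   # an alien TOKEN DELIMITER, unlikely to occur in Text

    def to_string(tokens):
        # generate the LIST first, then join into a STRING with the
        # delimiter marking every border between 2 Tokens
        return DELIM.join(tokens)

    def to_tokens(s):
        return s.split(DELIM)

    s = to_string(['the', 'start', 'of', 'the'])
    # substring tests like (DELIM + 'of' + DELIM) in s are plain String ops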

Smart Compose Interactivity
Google Email Smart Compose suggests continuations whenever they are strongly likely, but lets the Human choose which continuations to use, and whether to use them at all. Current SynTex Extension allows no Interactivity at all. For each Extension the Human could be consulted for the desired Extension, greatly increasing Plausibility, even resembling Topicality, while still qualifying as Text Generated Synthetically.
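A minimal sketch of consulting the Human at each Extension step; an illustrative console loop, not existing code:

    def consult_human(alts):
        # show the candidate Extensions and let the Human pick one
        for i, alt in enumerate(alts):
            print(i, ' '.join(alt))
        while True:
            reply = input('choose an Extension number: ')
            if reply.isdigit() and int(reply) < len(alts):
                return alts[int(reply)]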

Japanese Tokenizer
would allow experimentation with non-delimited Japanese Text. Written Nihongo doesnt use SPACE to delimit Tokens, so a Japanese Tokenizer is very much more complicated than an English one, which relies on the simple SPACE between Tokens for most Tokenization, with PERIOD manipulation being the main minority case. A Japanese Tokenizer is probably Dictionary-based, slow and imperfect. Some such tokenizers seem publicly available, but need Linux for downloading and installation. As of October 2018, Ubuntu Linux is only recently installed, so perhaps a Japanese Tokenizer can be incorporated.
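One publicly available, Dictionary-based option is the pure-Python Janome tokenizer (pip-installable, so Linux may not even be required); a minimal sketch, assuming janome is installed:

    from janome.tokenizer import Tokenizer

    t = Tokenizer()
    # wakati=True yields just the surface Tokens from space-free Japanese
    tokens = list(t.tokenize('私は学生です', wakati=True))
    # -> ['私', 'は', '学生', 'です']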

Long Japanese Training Text
The current 54-million Token English Training Text is from EuroParl, which also includes lengthy Training Texts for many other European languages, but not Japanese.

Cyclado Mono Tag
Monos are identified when booking PATD by Cyclado, and could be tagged. Then the Extender could avoid them when possible.

Named Entities
should be isolated at an early stage. NE identification is standard software.
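One standard option is spaCy's named-entity recognizer; a minimal sketch, assuming spaCy and its small English model are installed:

    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp('The Commission met Mr Santer in Brussels on Thursday.')
    for ent in doc.ents:
        print(ent.text, ent.label_)   # eg Santer PERSON, Brussels GPE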

ClumpumD Ordered Dictionary
if ClumpumD were an Ordered Dictionary, then book_APDpre_LIST_1 would be different, probably MUCH faster than now, altho currently it's only 105s to book APDpre for 2 million Tokens (a sketch follows the note below).
 * 1) ClumpumD Tuple/Set of Tuples but not an Ordered Dictionary
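A minimal sketch of producing the Ordered Dictionary, by inserting Entries in sorted-Lemma order so similar Lemma sit as neighbors; the sample data are illustrative:

    from collections import OrderedDict

    ClumpumD = {('the', 'start'): 4, ('of', 'the'): 9, ('the', 'end'): 2}
    ordered = OrderedDict(sorted(ClumpumD.items()))
    # Lemma beginning with 'the' are now adjacent, so grouping
    # similar Entries becomes a single linear pass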

APDpre Glossa as SETs
if APDpre Glossa were booked as SETs by book_APDpre_LIST_1, then maybe Refine_APDpreSET might succeed in booking APD without any Memory Errors... and hopefully faster.
 * 1) APDpre Integer/List of Tuples, an isoDict of ClumpumD

Topical Skeleton Match
would be (A) the preceding paragraph of text where a CycloPara is resident and (B) the current SynTex skeleton. Then preference goes to the CycloPara with the closest Skeleton match. Gotta keep track of each CycloPara's origin in the Text.
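A minimal sketch of scoring the match, treating each Skeleton as a SET of Topical Tokens; skeleton_match and best_candidate are illustrative names:

    def skeleton_match(origin_skeleton, syntex_skeleton):
        # Topical Tokens shared between the CycloPara's paragraph of
        # origin and the current SynTex Skeleton
        return len(origin_skeleton & syntex_skeleton)

    def best_candidate(candidates, syntex_skeleton):
        # candidates: (CycloPara, SET of Topical Tokens from its origin paragraph)
        return max(candidates, key=lambda c: skeleton_match(c[1], syntex_skeleton))[0]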

Load/Dump Specification local to Script
It's now in LoDumDefs, so it needs respecification every time a different Script is edited. It should be in each Script's Import Cell and automatic.

APD Compaction
All Para in APD Glossa are Unique since they derive from Sets, even tho the Glossa are Lists. However, some Para appear in the Text multiple times, so they should get a higher Tweight, similar to Magnitude. A Para appearing 1 time weighs less than a Para appearing 7 times, so the heavier Para should also appear in SynTex 7x more often than the lighter.
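A minimal sketch of honoring Tweight at choice time, using random.choices (available since Python 3.6), which accepts per-candidate weights; the names are illustrative:

    import random

    def choose_weighted(alts, tweights):
        # a Para appearing 7 times in Text is 7x as likely to be
        # chosen as a Para appearing once
        return random.choices(alts, weights=tweights, k=1)[0]

    # choose_weighted([paraA, paraB], [1, 7])   # paraB drawn ~7x as often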

Further, many Titan APD Glossa number 10s of 1000s of Para long, while very few have only 2 Para. Numerous Glossa (Ngloss) might sometimes be compactable by finding local non-HingeWord Solapa. Often they derive from a ParaFlush, analogous to a SequiturFlush. For example:
 ('on', 'thursday', 'prior', 'to', 'the')
 ('the', 'start', 'of', 'the', 'application', 'of', 'the')
 ('the', 'start', 'of', 'the', 'application', 'of')
 ('the', 'start', 'of', 'the', 'debate', ',', 'at', 'the')
 ('the', 'start', 'of', 'the', 'debate', ',', 'at')
 ('the', 'start', 'of', 'a')
 ('the', 'start', 'of', 'a', 'sitting', 'in', 'the')
 ('the', 'start', 'of', 'a', 'sitting', 'in')
 ('of', 'the', 'commission', 'being', 'able', 'to')
 ('of', 'the', 'commission', 'and', 'parliament', 'being', 'able', 'to')
 ('of', 'the', 'commission', 's', 'lack', 'of')
 ('of', 'the', 'commission', 'as')
 ('of', 'the', 'commission', '.', 'on', 'behalf', 'of', 'the')
 ('of', 'the', 'commission', 'and', 'the', 'need', 'for', 'the')
 ('of', 'the', 'commission', '.', 'in')
 ('of', 'the', 'commission', '.', 'it', 'was')
Further still, even though 'commission' isnt a HingeWord, it's clearly part of a MicroSolapa delimiting a group of Paras and relating them to a Sequitur. A review of the concept is warranted, considering there are 690 instances of <'of', 'the', 'commission'> in just 1 APD entry of length 13,233(!!): ('on', 'thursday', 'prior', 'to', 'the', 'start', 'of'). Likewise 26 instances of <'of', 'the', 'same'> and 1735 of <'of', 'the', 'euro'>.

Sequitur Tracking
may be useful to relate MicroSolapa to the original conText. Currently Sequitur arent used past AnteSeqD.

MicroSolapa added 2nd pass to APD?
Maybe all long APD Glossa should be scoured to find MicroSolapa to add to APD and propagate to PATD?