User:Ryuki4716/SynTex/CodeReview

Synthetic Text
---“ the difference between their two pontificates.” the french competitors and, in answer to sit by and a good dinner to speak to business men and he would only turn to him or come to himself : ---“ at the chapel. At her stillness and strained his ear to follow the voice, without looking at their church, too,” said mr henchy. ---“ I expect to his home. When he went in tone, attacked with great spirit the horse!’” the peal of breath and cried

SynTex is notoriously Phazed: it formally mimics proper English with good-enough syntax, but it sounds schizophrenic, at times incoherent. Topical Cohesion, RealWorld Knowledge and Language Knowledge are haphazard or absent. And that is precisely THE POINT about SynTex: we have got past the syntax analysis stage and can now focus on dePhazing. Humans base their text on human experience, but how do we represent human experience in a way a machine can understand?

Phazing has multiple causes. A few of them might be remedied by improvements to the Context-Matching used to Extend SynTex. But clearly our SynTex Generator lacks a World Knowledge Representation (KR). As of 2018, nobody has solved this problem, or even made meaningful progress. At least this SynTex Generator exposes the issue clearly, while previous traditional (transformational/generative) grammars side-stepped it. Incremental improvements to the Generator may improve SynTex coherency, but in this article we only outline the SynTex Generation process. It's progress just to expose the root KR issue. Most adult humans cannot produce a coherent page of Text either, tho their shortcomings are human and qualitatively different from Phazed SynTex. SynTex Generation uses Slices extensively. Here's a short tokenized text and a deck of some 4gram slices:

'it',"'",'s','difficult','to','get','here',',','so','i','am'

'it', "'",'s','difficult' "'", 's','difficult','to' 's','difficult','to','get' 'difficult','to','get','here' 'to','get','here',',' 'get','here',',','so' 'here',',','so','i'                                              ',','so','i','am'

Kontext
In this discussion Kontext means: Left/Right context, length and QuoteStatus. Eventually Kontext will expand to include many other criteria. Here's an example Candidate Wafer:

('for','him','at','this','time',':','it','was') Left context is 'for'. Right context is 'was'. Length is 8 tokens. QuoteStatus is Neither, since no quote symbols occur, so the status is neither Left nor Right.

('and','she','said',",",'“','no') Kontext is 'and'/'no', length 6, QuoteStatus: Left.

('and','she','said',",",'“','no','”') Kontext is 'and'/'”', length is 7, QuoteStatus: Right.
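A minimal sketch of how a Candidate Wafer's Kontext could be computed; the helper name wafer_kontext and the tuple layout are illustrative only, not the actual Generator code:

def wafer_kontext(wafer):
    """Return (Lcontext, Rcontext, length, QuoteStatus) for a tuple of tokens."""
    if '”' in wafer:
        status = 'Right'            #a closing Dubliner quote is present
    elif '“' in wafer:
        status = 'Left'             #an opening Dubliner quote is present
    else:
        status = 'Neither'          #no quote symbols at all
    return (wafer[0], wafer[-1], len(wafer), status)

print(wafer_kontext(('for','him','at','this','time',':','it','was')))   # ('for', 'was', 8, 'Neither')
print(wafer_kontext(('and','she','said',',','“','no')))                 # ('and', 'no', 6, 'Left')
print(wafer_kontext(('and','she','said',',','“','no','”')))             # ('and', '”', 7, 'Right')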

Text slices
Text slices are lists of tokens from a TrainingTextList (TTL). If the Training Text (TT) is [a,b,c,d,e,f,g,h], then some 4gram slices from it include: [a,b,c,d], [b,c,d,e], [c,d,e,f]. If we have 1000 tokens of TT, then our SliceDeck will hold nearly 1000 such slices (minus some we discard for various reasons such as typos, etc.). We will find that many of these slices resemble each other, having the same Kontext, and are thus interchangeable to some extent. Given this SynTex:

[a,b,c,d,e]. We might find in our SliceDeck:

[e,a,b,h] [e,b,d,h] [e,f,c,h]

that we previously found and booked by slicing our TTL into 4gram Slices. Just by coincidence, far away in unrelated places in the text, other similar-by-Kontext 4gram Slices occur. The longer the text, the more similar Slices may be found. Ultimately everything anybody ever wrote in any language is composed of re-purposed short Ngram segments that other people previously used. Language is SHARED and we hold it in COMMON. While your particular LONG text may be unique, its shorter parts are certainly NOT unique. Just consider vocabulary items for any language: if they weren't SHARED by many speakers, they would be useless. So now we might Extend SynTex in any of those 3 different ways:

[a,b,c,d,e]+[e,a,b,h] --> [a,b,c,d,e,a,b,h]
[a,b,c,d,e]+[e,b,d,h] --> [a,b,c,d,e,b,d,h]
[a,b,c,d,e]+[e,f,c,h] --> [a,b,c,d,e,f,c,h]

and we could keep extending indefinitely like this.

[a,b,c,d,e]+[e,f,c,h] --> [a,b,c,d,e,f,c,h]+[h,d,c,p]--> [a,b,c,d,e,f,c,h,d,c,p]

In this simple example there's only a simple Lcontext to match. In reality there are many other constraints and context dependencies.
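Here is a minimal sketch of the slicing and Lcontext-only extending described above, on toy data; the names slice_deck and extend_once are illustrative, not the Generator's functions:

def slice_deck(tokens, n=4):
    """Book every ngram slice of a token list."""
    return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

def extend_once(syntex, deck):
    """Append one slice whose Lcontext (first token) matches the SynTex tail."""
    candidates = [s for s in deck if s[0] == syntex[-1]]
    if not candidates:
        return syntex
    return syntex + list(candidates[0][1:])    #drop the shared Kontext token

tt = ['a','b','c','d','e','a','b','h','d','c','p']
print(extend_once(['a','b','c','d','e'], slice_deck(tt)))
# ['a', 'b', 'c', 'd', 'e', 'a', 'b', 'h']  (took the first matching slice)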

Let's add one more constraint: No Plagiarisms.

If the TT already contains the sequence [a,b,c,d,e,f,c,h,d,c,p] and we extend to produce that very same sequence, it counts as a Plage. Obviously, in practice everything anybody can ever say is a re-purposing of words somebody else previously used. So by convention we take 8 tokens as a convenient length for an (English) SynTex Extension. Shorter is fine, but longer Extensions are closer to a Plage, while shorter Extensions lead to incoherency sooner, because describing most thoughts seems to need at least several (8+) tokens. It depends what you define as a 'thought'; thoughts of fewer tokens are rarer. We need to choose some baseline number, and it's 8, since reviewing Training Text frequently displays 'thoughts' of about that length.

Text Slices vs. Python Slices
My SynTex Generator is implemented as a Python3 script that uses Python Slices. Let's not confuse Python Slices with SynTex Slices. In this article 'slices' are SynTex Slices.

Tokens
Tokens are words and punctuation in a Training Text (TT). It's important to include punctuation tokens in SynTex since they are numerous, pervasive and greatly contribute to form and meaning. It would also be impractical to exclude them since they are so frequent: perhaps 15% of typical text is punctuation. Disallowing punctuation-laden TT slices might leave gaps where matches couldn't find Candidates.

Tokenizer
A tokenizer is a standard software tool that splits normal text into tokens. It is difficult to achieve high accuracy due to numerous (English) ambiguities involving the PERIOD '.'. It's hard to get quality tokenizers for languages such as Japanese, where there's no white space between words, and probably no tokenizers at all exist for most written languages. It's not hard to get maybe 80% accuracy with a simple home-made tokenizer, but high accuracy is elusive. In practice any tokenizer will generate some noise: mistaken outputs that are best filtered out later, since that's much easier than developing a precision tokenizer.
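As a rough illustration of such a 'simple home-made tokenizer', a one-regex splitter could look like the sketch below (illustrative only; the Generator itself relies on NLTK, described later). It already shows the PERIOD problem: 'Mr.' splits into 'Mr' and '.':

import re

def naive_tokenize(text):
    """Split words and punctuation into separate tokens; crude, roughly 80%-grade."""
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("It's difficult to get to Mr. Kernan."))
# ['It', "'", 's', 'difficult', 'to', 'get', 'to', 'Mr', '.', 'Kernan', '.']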

Python Lists vs. Dictionaries
We use Python Dictionaries whenever possible since their look-up is faster than a Python List's, sometimes remarkably so. We also use Ngram Slices extensively since they book nicely into Python Dictionaries.
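The point about Slices booking nicely into Dictionaries: a tuple of tokens is hashable, so it can serve directly as a dictionary key, and membership tests are then a single hash lookup. A minimal sketch with illustrative data:

from collections import defaultdict

SliceDeck = defaultdict(int)                   #4gram slice --> count in the TT

tokens = ['it', "'", 's', 'difficult', 'to', 'get', 'here', ',', 'so', 'i', 'am']
for i in range(len(tokens)-3):
    SliceDeck[tuple(tokens[i:i+4])] += 1       #tuples are hashable, lists are not

print(('s','difficult','to','get') in SliceDeck)    # True, a fast dictionary lookup
print(len(SliceDeck))                               # 8 distinct 4gram slices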

CODE REVIEW
The discussion below details usage for the components of my SynTex Generator implemented in Python3. Notice that this Generator handles Dubliner Quotes (DQs), which are surprisingly complicated to process. Most of the script deals with DQs, while QuoteLess Training Text (TT) alone is much easier to handle. But in reality much Text contains Quotes (and other Paired devices such as Parentheses and Square Brackets). A common kind of Quote is a re-statement of something previously stated, but Generation of SynTex doesn't WANT to re-state, since that would be considered a Plage. Yet to mimic human Text the SynTex Generator needs to produce pseudo-Quotes. Most of this code will have evolved beyond recognition by the time you read it. It's Python3.6.1 in a Jupyter environment on a Windows10 machine.

Extend
def Extend(n):
    global SynTexList, QuoteToggle
    QuoteToggle="left"   #wants a LeftQuote by default initially
    HitsList=[]          #holds raw THOTs before splicing onto SynTexList
    SynTexList=[]        #THOTs properly spliced onto TOT
    cctr=0               #iteration counter for printing
    WaferHeads(n)        #is a wrapper for bookWafers which compiles
                         #WaferDicts per Quote specs
    for q in range(0,len(SKELETON)-1):
        cctr +=1
        Lcontext=Heads(SKELETON[q]); Rcontext=Heads(SKELETON[q+1])   #abbreviations
        if QuoteToggle=="left":
            extension=LNinset(Lcontext,Rcontext)
            print("Lcontext",cctr, extension)
        else:
            extension=RNinset(Lcontext,Rcontext)   #Jan25
            print("Rcontext",cctr, extension)
        LastQtype(extension)      #Quote management only
        HitsList.append(extension)
    for q in HitsList:
        SynTexList=SynTexList+q[:-1]
    QuoteDetoke(SynTexList)       #QuoteDetoke prints the detokenized text itself

Extend extends SynTexList n times by using the Insetters to append an Extension in Kontext. Extend calls WaferHeads to build a SKELETON and to book the Wafer dictionaries with Extension Candidates in Kontext. So Extend appends n Kontexted SynTex Extensions. Next Extend calls QuoteDetoke to pretty-print the resulting SynTexList, and finally Autopsy1 reports the Plage status. By default most Extensions are 8 tokens long, a few 9, even fewer 10, with longer, or doubled (16, 18, 20), considered Plages since they quote the Training Text (TT) verbatim for too long.

Some Extensions to the SynTexList
30 ['at', 'the', 'congregation', 'they', 'have.', '”', '“', 'the']
31 ['the', 'continent.', '”', '“', 'o', ',', 'on', 'the']
32 ['the', 'dear', 'knows.', '”', 'mrs', 'kearney', 'had', 'to']
33 ['to', 'strike', 'him', '.', 'he', 'turned', 'suddenly', 'to']
34 ['to', 'the', 'table', 'to', 'see', 'what', 'she', 'would']
35 ['would', 'only', 'turn', 'to', 'him', 'or', 'come', 'to']
36 ['to', 'himself', ':', '“', 'at', 'the', 'chapel', '.', 'at']
37 ['at', 'last', ',', 'when', 'she', 'judged', 'it', 'to']
38 ['to', 'you.', '”', '“', 'i', 'have', 'been', 'at']

An Extended SynTexList
This Extended SynTexList is just a list of tokens, both words and punctuation ('.'). It's not easy for humans to read, but Python easily manipulates such data structures: Lists of Strings.

['to', 'and', 'fro', '.', '“', 'i', 'bar', 'the', 'piano', 'had', 'twice', 'begun', 'the', 'prelude', 'to', 'his', 'confused', 'murmur', 'of', 'compliment', ',', 'the', 'church', 'for', 'twenty', 'years', '.', 'he', 'was', 'grace', 'and', 'mystery', 'in', 'her', 'attitude', 'as', 'he', 'had', 'failed', 'with', 'the', 'girl', 'in', 'the', 'end', 'he', 'had', 'got', 'mixed', 'up', 'such', 'an', 'affair', 'for', 'a', 'sum', 'of', 'forced', 'bravery', 'in', 'it', 'and', 'i', 'was', 'not', 'altogether', 'pleasant', 'for', 'him']

QuoteDetoke
QuoteDetoke paginates SynTexList for easy reading.

......To and fro.

“ I bar the piano had twice begun the prelude to his confused murmur of compliment, the church for twenty years. He was grace and mystery in her attitude as he had failed with the girl in the end he had got mixed up such an affair for a sum of forced bravery in it and I was not altogether pleasant for him,

def QuoteDetoke(inlist):                     #Detokenize for Quotes
    out0=[]
    for x in inlist:                         #flatten any interior lists
        if type(x)==list:
            for q in range(len(x)):
                out0.append(x[q])
        else:
            out0.append(x)
    out1=[out0[0].capitalize()]              #Oct27 prime the accumulator
    for x in range(len(out0)-1):             #Capitalize sentences
        if out0[x] in ['.',':',';']:
            out1.append(out0[x+1].capitalize())
        else:
            out1.append(out0[x+1])
    pctr=0                                   #new paragraph every 4 DOTs
    S2='\n\n......'                          #initial indentation
    S3=''
    tstring=' '.join(out1)
    for z in range(len(tstring)):            #Indentation for Quotes
        if tstring[z]=='“':
            S2=S2 + '\n\n---' + tstring[z]
        else:
            S2=S2 + tstring[z]
    #string text cosmetics
    S3= S2.replace(' ,',',').\
        replace(' .','.').replace(" 's","'s").\
        replace(' ;',';').replace('( ','(').replace(' )',')').\
        replace(' i ',' I ').replace(' !','!').replace(" d ’ ", " d’").\
        replace(' ?','?').replace(" 'd","'d").replace(" ’ s","’s").\
        replace(' ;',' ').replace(".; ",'. ').replace('_',' ').\
        replace('!.','!').replace("i'm","I'm").replace("' '","").\
        replace(" ’ ","’ ").replace(' “ ',' “').replace(' ” ','” ').\
        replace(" ‘ "," ‘")
    print(S3,'\n',len(S3))

WaferHeads
def WaferHeads(n):
    global SKELETON
    SKELETON=S=generateSkeleton(n)
    for x in range(len(S)-1):
        Q0=isinstance(S[x],str); Q1=isinstance(S[x+1],str)    #already Atomic Heads?
        snot1=S[x];   head1=snot1[0]                          #if not, get the Atomic Head
        snot2=S[x+1]; head2=snot2[0]
        if Q0 and Q1: bookWafers(S[x],S[x+1])                 #only pass Heads to
        elif Q0 and not Q1: bookWafers(S[x],head2)            #bookWafers(L,R)
        elif not Q0 and Q1: bookWafers(head1, S[x+1])
        elif not Q0 and not Q1: bookWafers(head1, head2)
        else: print("FAILED bSkeleton1 somehow fell thru the maze")
 * 1) Jan27 generateSkeleton gets atomicHeads only to pass to bookWafers(L,R)

WaferHeads(n) first calls generateSkeleton(n) to produce a SKELETON of length n. Then WaferHeads books Conforming Candidates into the WaferDicts for each L/R context pair in the SKELETON list. WaferHeads(n) uses 1-token Hinges for L/R context, so for Molecular Hinge cases it uses just the Head token.

generateSkeleton
def generateSkeleton(Limit):
    global RTextList, Hinges19
    sktn1=[]; pairctr=0
    RT=RTextList; H=Hinges19        #Abbreviations Jan22
    #x=3500                         #for testing Jan22
    #x=random.randint(0,4000)       #get skeleton from different place each run
    x=2105
    print("generateSkeleton: x=", x)
    ctr4=0; ctr3=0; ctr2=0; ctr1=0
    while pairctr<Limit:
        x0=(RT[x] in H);   x1=(RT[x+1] in H)
        x2=(RT[x+2] in H); x3=(RT[x+3] in H)    #Abbreviations Jan22
        if x0 and x1 and x2 and x3:
            print("cuatro: ", RT[x:x+4])
            sktn1.append(RT[x:x+4]); x+=4; ctr4+=1
        elif x0 and x1 and x2:
            print("tres: ", RT[x:x+3])
            sktn1.append(RT[x:x+3]); x+=3; ctr3+=1
        elif x0 and x1:
            sktn1.append(RT[x:x+2]); x+=2; ctr2+=1
        elif x0:
            sktn1.append(RT[x]); x+=1; ctr1+=1
        else:
            x+=1; pairctr -=1       #offset usual pairctr increment
        pairctr +=1
    print("ctr4: ", ctr4, "ctr3: ", ctr3, "ctr2: ", ctr2, "ctr1: ", ctr1)
    print("generateSkeleton: ", sktn1)
    return sktn1

generateSkeleton(Limit) builds the SKELETON of length Limit. It starts at a given place in the RTextList, checking each token to see if it's a HingeWord, then appends only the Hingewords to sktn1. Moleculars up to length 4 are recognized. For example, if RTextList is

['there','was','no','hope','for','him','this','time',':','it','was','the', 'third','stroke','.','night','after','night','i','had','passed','the','house']

then sktn1 is

['was','for','this',':','it','was','the','.','after','i','the']

gWafers
gWafers gets a span of Text and populates the Wafer Dictionaries with Candidate Wafers by Kontext. Then downstream the Insetters Extend a SynTex in Kontext from the Candidate Wafers in the Wafer Dictionaries.

gWafers builds a SKELETON, then calls gLRpairlist to generate W2 (LRpairset internally), then clears the Wafer Dicts, then finally books Wafers into the Dicts. gLRpairlist calls generateSkeleton then returns a List of unique LRpairs for bookWafers to book. A SKELETON might have multiple identical pairs ('repeats'), so gLRpairlist excludes the repeats, returning only unique L/R pairs. (Otherwise downstream bookWafers would book the same L/R pairs multiple times.)

def gLRpairlist(n):
    SKELETON=S=generateSkeleton(n)                            #generate a SKELETON
    LRpairlist=[]
    for x in range(len(S)-1):
        Q0=isinstance(S[x],str); Q1=isinstance(S[x+1],str)    #already Atomic Heads?
        snot1=S[x];   head1=snot1[0]                          #if not, get the Atomic Head
        snot2=S[x+1]; head2=snot2[0]
        if Q0 and Q1: LRpairlist.append((S[x],S[x+1]))
        elif Q0 and (not Q1): LRpairlist.append((S[x],head2))
        elif (not Q0) and Q1: LRpairlist.append((head1, S[x+1]))
        elif (not Q0) and (not Q1): LRpairlist.append((head1, head2))
        else: print("gLRpairlist FAILED")
    LRpairset=set(LRpairlist)
    return list(LRpairset)

def gWafers(n):                    #Feb4*
    global SKELETON, W2
    W2=gLRpairlist(n)              #generate a SKELETON, then a unique set of L/R pairs
    LWaferD.clear()                #clear Wafer Dicts
    NWaferD.clear()
    LNWaferD.clear()
    RWaferD.clear()
    for x in range(len(W2)):       #book Wafers into Dicts
        Front,Back=W2[x]
        bookWafers(Front, Back)

bookWafers

 * 1) bookWafers segregates the TextSlices by Kontext.
 * 2) the Wafer dictionaries hold slices grouped by Kontext.
 * 3) NWaferD contains only slices without any Quotes at all (QuoteStatus: NoQuote)
 * 4) LNWaferD contains no slices with RightQuotes. That means it holds slices with both LeftQuotes and NoQuotes, since that's the mix called for initially and whenever not requesting a RightQuote (QuoteStatus: Left/NoQuote)
 * 5) RWaferD contains only slices with RightQuotes---used to terminate started quotes (QuoteStatus: Right)
 * 6) LWaferD contains only slices with LeftQuotes (QuoteStatus: Left)

def bookWafers(Left,Right):
    #SliceDict is assumed to hold 10gram slices as tuple keys;
    #the WaferDicts are assumed to be collections.defaultdict(list)
    global LNWaferD
    Lctr=0; Nctr=0; Rctr=0
    for W in SliceDict:
        for q in range(6,9):
            if (W[0]==Left) and (W[q]==Right):
                if ('”' not in W[:q+1]) and ('“' not in W[:q+1]):
                    Nctr +=1                   #book the None and LeftNone dicts
                    NWaferD[(Left, Right, q+1)].append(list(W[:q+1]))
                    LNWaferD[(Left, Right, q+1)].append(list(W[:q+1]))
                    break
                for x in W:                    #book the Right dict
                    if x=='”':
                        Rctr +=1
                        RWaferD[(Left, Right, q+1)].append(list(W[:q+1]))
                        break
                    if x=='“':                 #book Lefts only
                        Lctr +=1               #also book Left/Nones together
                        LWaferD[(Left, Right, q+1)].append(list(W[:q+1]))
                        LNWaferD[(Left, Right, q+1)].append(list(W[:q+1]))
                        break

bookWafers traverses SliceDict for Wafer lengths 7, 8, 9 and books them in 3 different ways into the Wafer dictionaries. Extend discriminates by QuoteStatus: either extend (1) a Left or (2) a Right Quote Extension. But usually the Extension is (3) neither Left nor Right, since most Slices are NoQuote, and most quotes get extended a few times before finally terminating.

How often are quotes only 8 tokens long or shorter in normal human text? Not very often, tho possible. So we mimic that distribution: usually we Extend an initiated quote a few times before closing it, but occasionally we terminate it immediately.

Notice that Dubliners was chosen as our Training Text because it is full of quoted dialog: reported speech carefully written with distinct Left and Right quote symbols. That usage is obsolete nowadays, but it greatly facilitates Quote Management for SynTex Generation. This seemingly trivial point accounts for the greater part of the SynTex Generator script. Quotes present complications for several reasons:
 * 1) Quotes are often used loosely
 * 2) Quotes are often mismatched
 * 3) Quotes have multiple meanings: Scare quotes vs. Reported Speech quotes
 * 4) Left/Right quote usage is rare in modern English
 * 5) Quote representation is language-dependent (eg Japanese quotes are different from English quotes)
 * 6) SynTex lacking Quotes would be unrepresentative of an important aspect of text

Wafer Dictionaries
They hold Candidate Extensions in Kontext.

NWaferD
only contains NoQuote Candidates. That's most text, usually: when split into Ngram slices, most slices contain neither Left nor Right quotes. This is the case even inside quoted reported speech, since a quote is typically over 8 tokens long, so its internal slices won't contain quote symbols.

LNWaferD
contains Left-Quote and NoQuote Candidates and is important since NoQuote Candidates greatly outnumber Left-Quote Candidates and so are much more likely to be needed to match an L/R context for Extension. NoQuote Candidates are in demand since most of the time we are NOT starting Left quotes.

LWaferD
contains Left-Quote Candidates only, no NoQuote. Sometimes we specifically want a Left-Quote only, so we look for one in this dictionary.

RWaferD
contains Right-Quote Candidates only, not NoQuote. In some contexts we prefer a Right-Quote Candidate only, so we find them in this dictionary.
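Assuming the Wafer dictionaries are keyed by (Lcontext, Rcontext, length), as the bookWafers listing above suggests, an Insetter's lookup amounts to indexing one of these dicts; the values below are invented for illustration:

from collections import defaultdict

NWaferD = defaultdict(list)        #NoQuote Candidates only
NWaferD[('to','the',8)].append(['to','her','every','night','of','his','life','the'])
NWaferD[('to','the',8)].append(['to','die','and','then','go','to','see','the'])

#an Insetter wanting a NoQuote Candidate for Lcontext 'to', Rcontext 'the', length 8:
candidates = NWaferD[('to','the',8)]
print(len(candidates), candidates[0])
# 2 ['to', 'her', 'every', 'night', 'of', 'his', 'life', 'the']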

QuoteStatus Toggle
Every quote symbol is either a Left or a Right, even tho modern English usage is to conflate them into just 1 ambiguous symbol (tho not so in other languages). SynTex Generation carefully distinguishes Left and Right quote symbols, and we use an older Training Text that differentiates L/R quotes, such as Dubliners. It turns out Quote management is complex and consumes most of the generation effort. Without quotes (or parentheses or other paired devices) SynTex generation is much simpler, but loses some resemblance to human text. By default a new SynTex expects a Left Quote, or no quote, but not a Right quote. After a Left Quote is eventually attached, the SynTex then expects a Right Quote, or no quote.

SKELETONs
A SKELETON is a list of Hinges, the L/RContext bones we flesh out to generate SynTex.

['to', 'the', 'to', 'the', 'was', 'as', ['in', 'his'], ['up', 'to', 'the'],\ ['of', 'the'], ['was', 'not'], ['as', 'for', 'the'], 'his', 'a', 'his',\ 'was', 'a', ['was', 'a'], ['in', 'the'], ['in', 'the'], ['in', 'his'],\ 'in', ['in', 'the'], ['to', 'the'], 'a', 'of', ['on', 'the'], 'to', 'a',\ 'of', 'at', 'the', 'the', 'to', 'to', 'would', ['to', 'be'], 'at',\ ['to', 'the'], ['at', 'the'], 'said', ['to', 'a'], 'in', 'the', 'of',\ 'a', 'said', 'the', ['of', 'him'], ['had', 'a'], 'be']

Atomic Hinges: 'to','the','to','the','was'
Molecular Hinges: ['in','his'], ['up','to','the']

The example SKELETON might have been extracted from Text such as this: ....talked to the teacher before going to the reunion that was planned as an event in his school, although Charlie stated nothing was up to the principal....

The items in a SKELETON are all the Hinges found in a Text passage by generateSkeleton(n), which retains only the Hinges, discarding the non-Hinge content; later, new Candidate Extensions are fitted between a HingePair to Extend SynTex. L/Rcontext SKELETONs seem to Extend better quality SynTex than simpler Extension from a Left Context only. Whatever the likelihood that an Extension correlates to the original (discarded) Text, that correlation is probably greater when there are 2 Contexts, Left and Right, to correlate to. A SKELETON is a Python list of strings and sublists of strings. It includes Atomic or Molecular Hinges, where an Atomic is 1 token (a string), while a Molecular is a sublist of multiple strings (tokens). It seems (for English) that the Head (1st token of a Molecular) performs best as a Context, not the other members of the Molecular.

Extend traverses a SKELETON taking HingePairs as L/RContexts. Then Insets append a new Extension to SynTex according to the L/RContext (and other constraints). In our example, here are some HingePairs:

('to','the'),('was','as'),('as','in'),('in','up')

Given this SynTex to Extend:

['and','the','sign','said','welcome','to']

we generate a SKELETON that starts with 'to', the LContext we will Extend from:

['to', 'the', 'to', 'the']

so the first HingePair is ('to','the'): the LContext Hinge is 'to', the RContext Hinge is 'the'. The Insetters provide Extensions by choosing an appropriate one from the WaferDicts, conforming to various criteria. In this example some Candidates might include:

['to','her','every','night','of','the']
['to','die','and','go','to','the']
['to','every','good','girl','in','the']

so we could Extend 3 different ways:

['and','the','sign','said','welcome','to','her','every','night','of','the']
['and','the','sign','said','welcome','to','die','and','go','to','the']
['and','the','sign','said','welcome','to','every','good','girl','in','the']
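The splice itself just avoids duplicating the shared Hinge token. A sketch, assuming the Candidate's first token equals the SynTex's last token (the function name splice is illustrative):

def splice(syntex, candidate):
    """Append a Candidate Wafer, dropping its first token (the shared LContext Hinge)."""
    assert syntex[-1] == candidate[0], "Kontext mismatch"
    return syntex + candidate[1:]

print(splice(['and','the','sign','said','welcome','to'],
             ['to','her','every','night','of','the']))
# ['and', 'the', 'sign', 'said', 'welcome', 'to', 'her', 'every', 'night', 'of', 'the']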

Optimal Extending is a challenging linguistic task with numerous constraints, many of them not usually discussed by traditional linguistics, since such issues never come up when you don't attempt to generate SynTex by machine. But basically human Text seems to be (often) grounded in human knowledge and experience. Our TrainingText slices don't model human knowledge and experience, and so can produce (nearly) formally correct SynTex with bizarre Phazed semantics.

Insetters: Linsetter and Rinsetter
try to return either a LeftQuote or RightQuote Candidate. They choose Extension Candidates by Kontext from the Wafer dicts. But when, for example, no such LeftQuote Candidate exists, Linset(L,R) returns a NoQuote Candidate instead, since Insetters cannot return nothing, and a NoQuote can later Extend to a LeftQuote.

TextPrep
Prep is very important yet surprisingly complicated. It's not easy to find quality, free, complete texts online, especially in languages other than English. They usually contain plenty of obstacles for SynTex generation: footnote markers, figures, tables, symbols, images, various pair mechanisms (quotes, reported speech, parentheses). Even invisible symbols are important: linebreak, carriage return, tabs.

As of 2018 many (most?) such texts are available in PDF format, which needs translation into *.txt format before a tokenizer can use it. Unfortunately PDF->TXT translators are imperfect (since the imperfections don't matter to a human reader's eye). Often PDF translators render into *.txt with words run together: 'in the evening' --> 'inthe evening', or with period symbols suffixed onto words: "in the evening." --> ['in','the','evening.'] but it should be ['in','the','evening','.']. Wikipedia articles are full of footnote marks [23][24]. And so forth.
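For the Wikipedia footnote marks specifically, a narrow pattern that only strips numeric brackets is safer than a catch-all; a sketch (the Prep code shown later uses a broader pattern):

import re

raw = "The mills closed in 1847.[23][24] Wages fell inthe following year."
clean = re.sub(r'\[\d+\]', '', raw)     #strip only numeric footnote marks like [23]
print(clean)
# The mills closed in 1847. Wages fell inthe following year.
# note the PDF run-together 'inthe' survives: that damage needs separate remediation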

Such insignificant issues for humans are critical for machine success.

Unicode
Many (most?) recent online texts use Unicode, the standard encoding that attempts to express most of the world's writing systems. It contrasts with non-Unicode ASCII and many other encodings (several for Japanese alone). 'utf-8' is a very common Unicode encoding.

Flowing Text vs. Text Lines
We use a Python3.6.1 script, but similar issues present themselves for every Text Reader. Some texts are organized as Flowing Text, while others are organized line-by-line, and yet others in various different ways. So a new text to read is a trial-and-error challenge at this low level before any further SynTex processing can happen.

Linefeeds, Carriage Returns, Tabs, WhiteSpace
sometimes need elimination or alteration, other times not, sometimes are linguistically significant. It's trial-and-error, and 100% perfection is elusive.

Tokenizer
We use an NLTK Tokenizer for Python that gives imperfect results, notoriously for minority combinations involving the '.' PERIOD symbol. Oftentimes spurious '.' get incorrectly suffixed to previous words, and items like "Mr." and "Mrs." don't tokenize correctly. Further downstream the SynTex Generator mistakenly interprets the '.' PERIOD in "Mr." as an end-of-sentence, for example.
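One cheap downstream repair for the suffixed-period noise is a post-pass that splits a trailing '.' off ordinary words while leaving known abbreviations alone. A sketch with an illustrative, far-from-complete abbreviation set:

ABBREV = {'mr.', 'mrs.', 'dr.', 'st.'}          #illustrative only

def split_suffixed_periods(tokens):
    """'evening.' --> 'evening', '.'; leave listed abbreviations intact."""
    out = []
    for t in tokens:
        if t.endswith('.') and len(t) > 1 and t.lower() not in ABBREV:
            out.extend([t[:-1], '.'])
        else:
            out.append(t)
    return out

print(split_suffixed_periods(['in', 'the', 'evening.', 'mr.', 'kernan', 'left']))
# ['in', 'the', 'evening', '.', 'mr.', 'kernan', 'left']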

UpperCase LowerCase
Case processing is important yet surprisingly complicated and unlikely to approach perfection. We arbitrarily ignore it, rendering most SynTex as simple lowercase (except sometimes the obvious start-of-sentence after a full-stop). And of course UC/LC is language-dependent; some languages (Japanese) don't natively have it at all.

Quote Mechanism
Proper Quote-handling perversely takes the majority of the Generation effort. The SynTex Generator without Quotes is only perhaps 1/5 the total effort. There are many Quote Mechanisms that vary by language, even within a single language. We describe 3 English Quote Mechanisms:

"Normal Quoting" 'Normal Quoting'
is most common, using pairs of single or double quotemarks. But a Left and a Right single quotemark are identical, even in their Unicode representation, and likewise for double quotes. Meaning there's no way to tell from the Unicode symbol whether a quotemark is a Left or a Right quote. And authors/printers inevitably mismatch quotes repeatedly, so strict machine enumeration won't work (easily) either. Writers are also lax about quote usage, so oftentimes it's simply impossible to codify strictly.
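The only cheap heuristic for identical straight quotemarks is parity: treat every odd-numbered occurrence as a Left quote and every even-numbered one as a Right quote. A sketch; as noted above, one mismatched quote in the source throws every later assignment off:

def tag_straight_quotes(tokens):
    """Relabel ambiguous '"' tokens as Left/Right by simple alternation."""
    out, expecting_left = [], True
    for t in tokens:
        if t == '"':
            out.append('“' if expecting_left else '”')
            expecting_left = not expecting_left
        else:
            out.append(t)
    return out

print(tag_straight_quotes(['"', 'no', ',', '"', 'she', 'said']))
# ['“', 'no', ',', '”', 'she', 'said']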

“Dubliner Quotes”
as we call them, are uncommon and obsolete, yet make SynTex Generation much easier since there are distinct Left and Right quotes at Unicode level, and so the Generator knows which it's processing.

Ulysses Quotes
look like this:

--How about a lager, Mr. Cunningham?

--I'd love to, James.

where a double dash opening the line means Left Quote and a LineFeed (empty white line) means Right Quote. There are dozens more Quote mechanisms, often language-specific, so correctly processing all of them would turn the SynTex Generator into a Quote Manipulator, which is not our intent, but is mandatory to handle a realistic variety of text.

non-Text
obstructs SynTex Generation and can be dealt with by avoiding longer chunks of it, such as Prefaces, Tables-of-Contents, Indices, References, Tables, Figures, etc. A manual human intervention is effective: just spend 2 hours skimming through a raw text, manually deleting such passages.

ExcludeList
We compile an ExcludeList by trial-and-error for each text since repetitive imperfections are common. The example below is heavy on PERIOD-related trash specific to Das Kapital, since the Tokenizer fails to process it correctly. So as we book all the 10gram text slices into SliceDict, we exclude any containing any of the trash on the ExcludeList. In practice this might eliminate some 20% of total slices, and probably skews SynTex results in sophisticated ways:

ExcludeList=['“','”','(',')','‘','"',"'",'@','``','...',
    'pp','&','c.','b','/',
    'lb','lbs','i.','e.','***',
    '12','m-c-m','2d.','3d.',
    '£','v','°','100°','90°','12','v.',
    'c','10','20:100','1:1','7½','5:3',
    'l.c.','p.','456','51','3s.',
    '180,000,000','26½','les','comptes','à',
    'dijon','pour','18','l.','ce','que','l','220',
    'it.3','100','.1','2,000.44','1/4','7¾','2½',
    '£10','14','non-use-values','£7','quit-rent-corn','2s',
    '143','23','11½','£100','66,632','1/9','£10,000',
    'c-m-c',"''",'xxxviii.',
    'p.d.µa','l.l.d','s.s.','.p','l.c.','.s..','t..','p..','l.c',
    '..pe','s.µµet..a.','vol.ii','c.f','lieut.-col.','s..','..d..','t..t..',
    '..a..e.t.s','pa.ta','pa.a','a.ta.es..','j.b.','t.ii','a.taµee.ßes.a',
    '.d..','.a.','lond..','pe.te','c..','.a.a','.a','..t','child.empl.com.',
    'iv.-vi','vol.i','p.m','a.t.','h.o.c.','l.c..','e..','m.p.','xxi..',
    't.iii','.s.t..','..s..','a.t','..the','e.a','note.—','...','µat..',
    'apa.t..','ad..at..','p3','x/100','ch2','i9']
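At booking time the exclusion amounts to a containment test per slice; a minimal sketch with toy data and an abridged list:

ExcludeShort = ['...', 'pp', '£', 'l.c.']      #abridged stand-in for the full ExcludeList

def keep_slice(slice_tokens, exclude=ExcludeShort):
    """Discard any slice that contains an ExcludeList item."""
    return not any(tok in exclude for tok in slice_tokens)

print(keep_slice(('the', 'value', 'of', 'labour')))     # True, book it
print(keep_slice(('see', 'l.c.', 'pp', '23')))          # False, trash tokens present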

The example below is the current text Prepper:

import re
from nltk.tokenize import word_tokenize

def Prep(filename):
    t1=[]
    #with open(filename, encoding='utf-8') as raw1:
    with open(filename) as raw1:
        for line in raw1:
            t1.append(line)                      #t1 is raw lines
                                                 #the with-block closes the file
    #++++++++++++++++++++++++++++
    a=''.join(t1)                                #join lines into text
    b= a.replace('\n',' ').replace('\r',' ')\
        .replace("\'",'').replace('\\','')       #linefeed/CR become whitespace
    #return b
    #+++++++++++++++++++++++++                   #remove Wiki footnotes [22]
    ts1=re.sub(r'\[.*?\]', '', b)                #non-greedy: strip each bracketed mark only
    #+++++++++++++++++++++++++                   #tokenize string-->list
    tL1 = word_tokenize(ts1)
    #+++++++++++++++++++++++++                   #lowercase all
    tL2 = [x.lower() for x in tL1]
    return tL2[1:]                               #don't return the initial UTF-8 marker

FullTextList=Prep("J:\\PDF Resources\\Capital-Volume-I - Remedied.txt")

Prep opens a *.txt file without 'utf-8' Unicode encoding. Then it reads in the text line-by-line, as opposed to flowing-text, a detail discovered by trial-and-error. The lines are all appended onto 't1', then the file is closed. At this point the text is a long string of characters inside Python, so we standardize (to some extent) WhiteSpace by using 'join', then we replace (linefeed, carriage return, etc.) with WhiteSpace. Next we remove Wiki-reference brackets [1][2]. Now we let the NLTK tokenizer transform the long string into a long List of Tokens, each a Python String:

"in the beginning she had no idea, absolutely no idea"

['in','the','beginning','she','had','no','idea',',','absolutely','no','idea']

Punctuation ALSO turns into tokens, just like words. Then all UpperCase is rendered LowerCase. This step is important since Python is case-sensitive: 'water' matches 'water' but doesn't match 'Water' or 'waTer'. We don't try to return completed SynTex to UpperCase since the problem is intractable, requiring an exhaustive dictionary of 100s of 1000s of Proper Names, Named Entities and Place Names for each human language. And besides, writers are notoriously sloppy with Case management, which is often stylistic or idiosyncratic.

Autopsy
runs after Extend has finished a SynTex. It searches for Plages: text segments copied verbatim from the Training Text (TT) that are too long. And how long is too long? We arbitrarily say 9 tokens or longer risks being a Plage. So 8 or shorter is great, somewhat longer is somewhat worse, while 14+ or so is clearly a Plage (since it likely came from 2 Wafers juxtaposed in the SynTex that were also neighbors in the TT). It is possible, but unlikely, for occasional longer segments to occur when an 8gram Wafer, for example, happens to match a 9, 10, 11+ token segment in the TT. Eventually such Plages will be prohibited at Extend-time, but for now they are rare enough that we don't bother.

AutopsyB uses the battery of PlageDicts: 1 PlageDict for every Ngram length 8-16. PlageDict13, for example, is the SliceDeck of every 13gram slice in the Training Text (TT). So AutopsyB can simply check for matches between the SynTex and PlageDict13. Any such match would probably be considered a Plage, or at least poor form, unless perhaps it occurs spontaneously, for example when an 8gram Extension accidentally occurred in the TT with another 5 coincidental tokens, constituting a 13gram Plage.

AutopsyB starts by searching for the longest possible Plages first: 16grams, which, if found, are usually 2 8grams unfortunately juxtaposed. If found, the Plage is swapped out of Autopsy's local copy of the SynTex, replaced by an inert symbol '金' that can never occur in the TT. Without this replacement, every shorter search (15, 14, 13...) would report spurious rediscoveries of subsegments of the same 16gram Plage, proliferating trash.
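A condensed sketch of that procedure (not Autopsy1/AutopsyB themselves): here the PlageDicts are modeled as plain sets of Ngram tuples built from the TT token list, the longest lengths are checked first, and found spans are masked with '金' so that shorter passes skip their subsegments:

def autopsy_sketch(syntex, tt_tokens, longest=16, shortest=9):
    """Report verbatim TT spans of length >= shortest, longest first."""
    work = list(syntex)                              #local copy we may mask
    for n in range(longest, shortest-1, -1):
        plage_set = {tuple(tt_tokens[i:i+n]) for i in range(len(tt_tokens)-n+1)}
        for i in range(len(work)-n+1):
            span = tuple(work[i:i+n])
            if '金' not in span and span in plage_set:
                print("Plage, length", n, ":", span)
                work[i:i+n] = ['金']*n               #mask so shorter passes skip it
    return work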

Heads
is an auxiliary function inside Extend that passes only Atomic Heads, never Molecular Sublists, to Lcontext/Rcontext. A bookkeeping function. Given this SKELETON: [['for', 'to'], ['for', 'him'], 'was', 'up'], Heads applied to each item will return: ['for','for','was','up']

def Heads(n):
    if isinstance(n,str): return n
    else: return n[0]

LastQtype
is another auxiliary function inside Extend. It sets the QuoteToggle so Extend knows to want either a Left or Right Quote next iteration.

def LastQtype(j):
    global QuoteToggle
    k=j[::-1]                 #reverse order
    for x in k:
        if x=='“': QuoteToggle="right"; return   #Jan27
        if x=='”': QuoteToggle="left";  return