User:Ryuki4716/P14

Preface: Why Synthetic Text?
Please skip this Preface unless you want to know why I wrote this paper. The actual description of the Synthetic Text Generator starts below, at Text Issues: How Synthetic Text is Generated.

My original goal was to develop my own Machine Translator, at least for English/Spanish and ultimately for English/Japanese. But I gave up after realizing that long, time-consuming parallel texts are needed. Yet many issues and techniques are similar for Synthetic Text generation. Decades ago, in the early 1980s, I was a graduate student at UC Berkeley interested in computational linguistics. Since then the field has grown immensely, unrecognizable and infinitely more advanced nowadays. The whole world uses Google searches and Google Translate routinely. Statistical, probabilistic approaches similar to SynTexGeneration techniques (but far more advanced) have become everyday realities.

The SynTexGenerator provides access to the more fundamental underlying issue of Thought-to-Text: how the human mind organizes thoughts to structure text. Phazed SynTex illustrates what happens when Topic Coherence is absent while the generated SynTex is still (reasonably) structurally sound. We can use SynTex to avoid traditional linguistic issues (syntax, Chomsky tree derivations) and attempt to address Topic Coherence directly. Topic Coherence is only addressed in passing in this article: I only develop SynTexGeneration and note a few points about what is lacking for improved Topic Coherence. As far as I know, nobody to date has a working model of a human mind that can imitate human thought convincingly enough to generate SynTex without Zagged Topic Coherence. But now at least we have a better idea of what is missing.

I also wanted to explore computational linguistics while avoiding as much traditional linguistics as possible: to see what can be accomplished without recourse to the traditional apparatus (syntax, semantics, Chomsky trees). As it turns out, those artefacts are more obstacles than resources for SynTexGeneration. Finally, it is fascinating to see that the same techniques, even almost exactly the same code, work equally well for any written language. In other words, the approach outlined below is not dependent on any specific language (with, of course, a few surface peculiarities), while traditional linguistics is very much specific-language based. Traditional Spanish grammar is not analyzed in English grammar terms because it doesn't work (very well). SynTexGeneration, by contrast, works amazingly well for both languages, and indeed all written languages, in ways traditional linguistics never addresses. Of course, since the 1980s computers have come into use for countless applications, and it's time they started generating Synthetic Text.

the CycloD
The Generator isolates the Tail, the last Clause on the SynTex list. Then it looks up that Tail as a Lemma in the CycloD circular dictionary. Next a suitable Extension is chosen from the CycloD Glossa and attached to the previous SynTex Tail, becoming the new Tail. The Generator works because all CycloD entries are Circular. Start by randomly picking any one Lemma from CycloD to be the first, seminal SynTex: [1]. Any Gloss(2) of Lemma(1) can be a suitable Extension, though some are better than others. So the SynTex Extends to [1,2]; the new Lemma(2) has a suitable Gloss(3), so the SynTex Extends to [1,2,3], and so on. Clauses are the representations of THOTs, extracted from a TrainingText.
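Here is a toy illustration of that cycle (not the actual Generator, which appears further below under the SynTex Generator), using a made-up three-entry miniature CycloD where every Gloss is also a Lemma:

import random

miniCycloD = {                                    # invented, Circular: every Gloss is also a Lemma
    ('the','cat','sat','on'):   [('on','a','mat','of')],
    ('on','a','mat','of'):      [('of','straw','and','the')],
    ('of','straw','and','the'): [('the','cat','sat','on')],
}

tail = random.choice(list(miniCycloD))            # a random Lemma becomes the seminal SynTex
syntex = list(tail)
for _ in range(5):                                # Extend 5 times
    gloss = random.choice(miniCycloD[tail])       # pick a suitable Extension from the Glossa
    syntex += list(gloss[1:])                     # splice it on, dropping the shared edge Hinge
    tail = gloss                                  # the Extension becomes the new Tail
print(' '.join(syntex))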

DeadEnds
DeadEnds are non-Circular discontinuities that would interrupt the Extension cycle. They shouldn't exist, and don't (except by error). A Gloss without its corresponding CycloD Lemma is a DeadEnd. A Lemma that lacks Circular Glossa is a DeadEnd. Much of the P15 effort has gone to minimizing DeadEnds.
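A minimal sketch of a DeadEnd audit, assuming a candidate dictionary shaped like CycloD (Clause tuples mapped to lists of Clause tuples); findDeadEnds is a hypothetical helper, not part of the P15 code:

def findDeadEnds(d):
    """Audit a candidate dictionary: Glosses missing as Lemmas, Lemmas with no Circular Glossa."""
    orphanGlosses = set()                 # a Gloss without its corresponding Lemma is a DeadEnd
    barrenLemmas  = set()                 # a Lemma that lacks Circular Glossa is a DeadEnd
    for lemma, glossa in d.items():
        if not any(g in d for g in glossa):
            barrenLemmas.add(lemma)
        for g in glossa:
            if g not in d:
                orphanGlosses.add(g)
    return orphanGlosses, barrenLemmas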

Viable
A Clause of length 4 to 8, with a Hinge on both edges, is Viable. The precise Viable length is pragmatically chosen to express a simple THOT. Short Clauses tend to result in Zagged SynTex, since very short sequences don't correlate strongly with a specific THOT. Longer Clauses amount to Plages: they are meaningful and correlate tightly to specific THOTs, but are impermissible verbatim lifts from the Training Text (TT).
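A minimal sketch of the Viability test (isViable is an illustrative name, not a P15 function; Hinges19 is the Hinge set listed further below):

def isViable(clause, hinges):
    """A Clause is Viable if it is 4 to 8 Tokens long with a Hinge at both edges."""
    return 4 <= len(clause) <= 8 and clause[0] in hinges and clause[-1] in hinges

# e.g. isViable(('of','six','dozen','eggs','.','in','the'), Hinges19)  -->  True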

Viable Clauses segment Text into smaller units representing THOTs. Viables require Hinges at both edges since Hinges are extremely high-frequency tokens in Text, and therefore Hinge-delimited Clauses are frequent and numerous too. In turn that abundance affords higher Recur Count and broader Extension choice.

If instead low-frequency non-Hinges were allowed as Clause edges, the Recur Count would be much lower, so SynTexGeneration would be impractical, requiring a much larger Training Text. Extremely long Training Texts are unavailable (except maybe internally to Google Research).

Recur Count(RC)
RC is the length of the Glossa List for a CycloD entry. Suppose a Clause is found in the Training Text: ....of six dozen eggs. In the..... Remedied and tokenized: ['of','six','dozen','eggs','.','in','the']. Then, since both the left edge Hinge 'of' and the right edge Hinge 'the' are very frequent tokens, there may be a Recur Count of 174 (for example) other Equivalent Clauses in the TT:

...of the story in the....
...of the best ideas for the....
...of it, but then maybe it's the....
...of weird, don't you think it's the....

Low Recur Counts afford less Alt variety. A single Alt has Recur Count 1, so there's no choice but to always Extend in the same way. Very many CycloD entries have RC 1: each occurs just once in the TT, yet such entries are very abundant, and each correlates closely to a specific THOT. Conversely, very few entries have high RCs: each occurs extremely often in the TT, yet they don't correlate closely to any specific THOT.
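A quick way to see the RC distribution, assuming CycloD has already been booked (an illustrative snippet, not part of P15):

from collections import Counter

# RC for an entry is just len(Glossa); tally how many entries have each RC
rcHistogram = Counter(len(glossa) for glossa in CycloD.values())
print(rcHistogram.most_common(10))     # e.g. how many entries have RC 1, RC 2, ...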

Clauses
Clauses include all Viables: Antes, Sequiturs, Alts, Lemma, Glossa. A Clause is the SynTex avatar for a THOT.

Viable Clauses have Hinges on both edges. That property makes them interchangeable (Equivalent) to some extent. In a long Training Text there might be 200 instances of Equivalent Clauses sharing a particular left Hinge and right Hinge. Any of those 200 Equivalents could then Extend a SynTex whose Tail matches that left Hinge.

Equivalent Clauses (EC)
EC are interchangeable to some extent in a SynTex. Greater Equivalency fosters better Topic Coherency, while lesser Equivalency Phazes SynTex. A Clause's left edge Hinge must match for it to Extend a SynTex; that match is a prerequisite for Equivalency.

Alts are EC that are as equivalent as possible to the Sequitur they are modelled after, yet never Identical to a Sequitur, since that would amount to a Plage.
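A minimal sketch of the Equivalency test as bookBaseD applies it further below (isEquivalent is an illustrative name; the 2-token left Margin match mirrors the Marg2 check in that code):

def isEquivalent(altCand, sequitur):
    """Equivalent: shares the Sequitur's 2-token left Margin, but is never Identical to it."""
    return altCand[:2] == sequitur[:2] and altCand != sequitur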

Plages
Plages are plagiarisms, arbitrarily defined as verbatim copying of 9+ tokens of Text. All language is a re-purposing and re-ordering of previously used segments, but obviously rote copying of long TT passages would generate a shoddy SynTex. The 8-token maximum is a convention and could be set higher or lower; allowing longer Clauses would dramatically improve SynTex coherency, at the price of more verbatim copying.

Antes
Antes are all the Viable Clauses in a Text, whether they have Viable Sequiturs or not.

Sequiturs
Sequiturs are Clauses that follow an Ante in Text. SynTex Extension avoids using Sequiturs as Extensions since the result would often be a Plage. If the combined length of an Ante and a Sequitur exceeds 8, then it's a Plage. P15 tries to minimize Plages.
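Read literally, that length rule looks like the sketch below (an illustrative helper, not P15 code; whether the shared edge Hinge is counted once or twice is not spelled out here):

def overLength(ante, sequitur):
    """True when an Ante plus its Sequitur would reproduce too long a verbatim TT run."""
    return len(ante) + len(sequitur) > 8          # the rule above, read literally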

Alts
Alts (Alternatives) are Clauses that are Equivalent to a Sequitur, yet importantly never Identical. Circular Alts become the SynTex Extensions, driving the SynTex generation process. Alts must be Circular, else they'd be DeadEnds. So CycloD contains only Circular entries.

Circular Clauses
Circular Clauses can extend a SynTex. DeadEnds cannot Extend. The CycloD contains only Circulars. Circular Clauses include Lemma, Glossa, Antes, Alts but exclude Sequiturs. Sequiturs would become SynTex Plages, so they are excluded. Any Circulars must also, of course, be Viable Clauses.

THOTs
What are they? Well, just what you think they are. The elusive disembodied thoughts that (hopefully) underlie speech acts and Text. Humans develop THOTs into Text by writing, but it's a mystery precisely how the gap between THOT and Text is bridged. Anyway, SynTexGeneration necessarily only manipulates digital representations of THOTs. Non-verbal THOTs are not in the Generator's domain. There's no attempt to deal with feelings, smells, images, tactile sensations, mathematics, etc. Only written text is processed by the Generator.

Topic Continuity(TC)
Topic Continuity is the human progression of THOTs in Text. SynTex features ridiculous TC. Improving TC is perhaps the single most important task for Generator design. But SynTex is an important first step, since it avoids many of the traditional linguistic entanglements that have been obstacles to research on TC. It is extremely difficult for SynTex to Extend without Zagging Topic Continuity, since the Generator only manipulates [Tokens, Clauses, CycloD,...] but cannot address the (lack of) underlying Topic Continuity at all.

There is minimal referential coherence between SynTex Clauses. Clause10 has no idea about Clause8 or Clause12. Clause10 only uses the shallowest simple mechanisms for reference to its neighbors Clause9 and Clause11: only the keywords in the left and right Margins cohere Clause10 with Clause9 and Clause11. This situation parallels similar issues for Machine Translation and Neural Network representations of Text: anaphora, cataphora, pointer resolution, distant reference.

To date, nobody has made much progress specifying TC: the miraculous and spontaneous morphing of THOTs into Text that humans perform automatically. Traditional Linguistics side-steps this fundamental issue. We do too. SynTex Generation is merely the manipulation of language data with an illusion of Topically Continuous Text. Any calculus of THOTs by computer would need to represent them as language data overlying the Python Script reality (in our case), which in turn overlies a hexadecimal level, then ultimately a binary level.

Hinges
There are currently only 19 Hinges in use, though this number is arbitrary.

Hinges19=('the', 'of', 'to', 'a', 'was', 'his',
          'in', 'had', 'said', 'for', 'at', 'him',
          'on', 'as', 'not', 'were', 'be', 'would', 'up')

All of them are high-frequency Tokens. Some other high-frequency Tokens were excluded from Hinges for pragmatic reasons.

When more frequent Hinges are used, the RecurCount (RC) for Glossa rises; less frequent Hinges correlate to lower RC. In the extreme case, using a Unique token as a Hinge would allow booking a Unique Clause with no Equivalencies: a DeadEnd. Conversely, restricting the Hinge set to very few, very high frequency Tokens proliferates CycloD Glossa with very high RecurCounts (RC), which leads to Zagged SynTex since the range of Glossa is too broad.
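Candidate Hinges are simply the most frequent Tokens, so a frequency count is the natural starting point before pruning by hand (an illustrative snippet; DKDTTL is the prepped Token list built further below):

from collections import Counter

tokenFreq = Counter(DKDTTL)               # DKDTTL is the combined, prepped Token list
print(tokenFreq.most_common(30))          # candidate Hinges; prune by hand as described above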

Zags/Zagged
*...the man by the grey pallor of the man was grey...

Zags can seem bizarrely inappropriate, yet still plausible. Of course humans frequently correct their own speech in midstream, while written corrections usually erase mistakes unless the writer's intent is to illustrate the actual flawed speech. But consider:

*...the man by the grey side of the building was hungry...

Sentence structure (well: fragment structure) doesn't change, yet when lexical substitutions are made, the fragment suddenly makes better sense.

* pallor --> side      (singular noun)
* man    --> building  (singular noun)
* grey   --> hungry    (adjective)

Reducing Zags is a major challenge.

What is Language-Specific?
The same SynTexGenerator works for Texts written in any language, but with a few superficial caveats.

The choice of Hinges is manual and of course language-specific. Though the Hinges are invariably the highest-frequency Tokens, some work better than others and are best chosen manually by trial and error.

Then Tokenization and DeTokenization (DeToke) are language-specific. Punctuation rules are language-specific, conspicuously Pairs (Quotes) and Upper/LowerCase. (e.g. Japanese lacks an Upper/Lower distinction; Spanish Quotes work differently from English Quotes, and Spanish has additional Paired figures that English lacks: <¿?> <¡!>.)

There are language-specific high-frequency homonyms that complicate Generation. In English: "She wanted that red coat, not the green one" versus "He told me that it costs $7" (demonstrative 'that' versus complementizer 'that'). You can imagine every language has similar confusions, yet they are language-specific.

Finally, Japanese (and Mandarin?) needs a different type of Tokenizer (currently dictionary-based), since there is no white space between words (Tokens).
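A toy sketch of dictionary-based tokenization by greedy longest match (illustrative only; not the tokenizer actually used, and the tiny lexicon is made up):

def dictTokenize(text, lexicon, maxlen=8):
    """Greedy longest-match tokenization for scripts without word spaces."""
    tokens, i = [], 0
    while i < len(text):
        for width in range(min(maxlen, len(text)-i), 0, -1):
            if width == 1 or text[i:i+width] in lexicon:
                tokens.append(text[i:i+width])       # single chars are the fallback
                i += width
                break
    return tokens

print(dictTokenize('わたしはねこがすき', {'わたし','ねこ','すき'}))
# ['わたし', 'は', 'ねこ', 'が', 'すき']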

Yet these limitations are only superficial. The underlying SynTexGeneration techniques work the same for any written language.

Why Python Dictionaries (PDict)?
PDicts have much faster access times than Lists or Tuples, the alternative data types. PDicts are user-friendly Hash Tables. A 1,000,000-item List needs 500,000 probes on average to find a match, but the same match in a 1,000,000-item PDict Hash Table is orders of magnitude faster.
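A rough micro-benchmark of that claim (illustrative only, not part of P15):

import random, time

items  = [('tok%d' % i,) for i in range(1000000)]   # a million tuple keys
asList = items
asDict = dict.fromkeys(items)
probe  = random.choice(items)

t = time.time(); _ = probe in asList; print('List lookup :', time.time() - t)
t = time.time(); _ = probe in asDict; print('PDict lookup:', time.time() - t)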

Textslices
Textslices are 16gram slices from prepped raw text, Das Kapital and Dubliners in our case. ExcludeList and QuoteExcludeList specify disqualifications. The NLTK Tokenizer generates additional trash, especially PERIODtrash where PERIODs are incorrectly appended to Tokens. Then Pairs(Quotes, Parentheses) are excluded since they would require extensive special handling.

Autopsy
Autopsy searches for Plages in SynTex. A Plage means 9+ tokens matching Text verbatim. The Generator avoids plagiarism by preventing verbatim sequences of 9+ Tokens. The Generator should only produce its own original recombinations of repurposed Clauses trained from the Text. In practice, the optimal Clause length is 8, but a few 9s, 10s, even 11s are inevitable and excusable, since many frequent sequences of frequent Tokens <'and','one','of','the'> can hardly be considered plagiarisms.

Training Text (TT)
TT is a mix of equal parts Das Kapital and Dubliners. Those source texts are used because they are freely available online and relatively free of text-prep complications. Then Dubliners conveniently uses Left/Right Quote signs that distinguish a starting quote from an ending quote, an unusual and obsolete characteristic that makes Quote processing much easier. Das Kapital is thick and ponderous, with hardly any quotes (but abundant footnotes). Dubliners doesn't closely follow a logical topical progression but is instead more poetic and lyrical, more about expression than content, so Dubliners-derived SynTex appears less Zagged. Das Kapital famously focusses on a narrow topic range for 1000s of pages, and uses thick, dense, antiquated English translated from even more ponderous, obsolete German. Again, the effect masks the inherent Topical Phazing, since few readers can understand the meanings and vocabulary. It's not clear Das Kapital even has a clear logical flow. The result is SynTex that seems less Topically Zagged than the dysfunctional reality, due to the unique prose combination (something similar to the various translations of Christian Bibles and al-Qur'an, in my opinion).

Text Prep
Text Prep performs many mundane yet critical text-level functions, including handling linefeeds, carriage returns, and other invisible editing commands; removing the square-bracket-style reference notes that are legion in Das Kapital and Wikipedia; and lowercasing to avoid intractable uppercase/lowercase processing issues. Of course the text is tokenized using the nltk word_tokenize function, which introduces various new noise and artefacts, especially Dot-Monsters.

Manual Text Prep
Das Kapital Volume I is very long and full of artefacts best manually deleted. The process could be automated, but each raw text has unique artefacts, so it's faster and more effective just to spend 2-3 hours manually deleting tables of contents, figures, numeric tables, reference notes, references, footnotes and many other artefacts that would otherwise complicate SynTexGeneration. In particular, square brackets are numerous and very hard to simulate. Dubliners Text Prep includes deleting the Project Gutenberg prolog to minimize admixture to the Dubliners text flavor.

Upper/LowerCase
We standardize to all lowercase since Python string matching is case-sensitive, case usage in normal text is lax and informal, and Upper/Lower discrimination is very difficult to achieve with near-100% accuracy. Case considerations are peripheral, mechanical, and complicated. Many languages (e.g. Japanese) don't have an upper/lower-case distinction at all.

Code Review
Current cleanest code as of April 6, 2018 resides in C:\Users\Robert\Jupyter\P15.

P15 4 AnteSeqD.ipynb
    preps Das Kapital remedied
    combines Dubliners/DasKapital
    books Text16D remedied Text Slices
    books AnteSeqD
    Prereqs:
        Text16D, Hinges19, ExcludeList for bookAnteSeqD to book AnteSeqD

P15 4General.ipynb
    books BaseD
    books CycloD (and prior dictionaries) in BookCycloD
    Extender resides here
    Prereqs:
        CycloD, P15PlageDicts to run Extender
        BaseD for BookCycloD to book CycloD
        AnteSeqD for BaseD to book BaseD

MyUtilities
    everything else used at some point worth saving

the SynTex Generator
The Generator is told to Extend N Clauses of SynTex. The first one is a randomly chosen Lemma from CycloD that becomes the seminal SynTex and also the first SynTexTail. Now the main loop takes over, finding the CycloD Lemma matching the SynTexTail, then choosing a suitable Gloss to attach to the SynTex, which becomes the new SynTexTail. The loop Extends SynTex N times. After the SynTex Extension is complete, Detoke pretty-prints it in human-readable format, then Autopsy displays Plage metrics.

SynTexList=[]
#
def Extend(seed):                            #return an Alt for the seed if seed's in CycloD
    global Exctr, SynTexList
    if seed in CycloD:                       #April4  D3 previously
        q=random.choice(CycloD[seed])        #D3[seed]
        print(Exctr, q)
        SynTexList=SynTexList+list(q[1:])
        return q                             #in D3     #April4 in D6 not D3
    else:
        print("=", seed, "= not in CycloD")
        return False
#
def Generator(limit):
    global Exctr
    Exctr=1
    for x in range(0,20):
        a=Extend(random.choice(list(CycloD)))
        if a==False:
            print("Generator Head")
            return
        else:
            break
    for x in range(0,limit):
        Exctr+=1
        b=Extend(a)
        if b==False:
            print("Generator Loop")
            return
        a=b
#
Generator(100)
Detoke(SynTexList)
Autopsy()
 * 1) print(SynTexList)

BookCycloD
BookCycloD books CycloD by successively removing more and more DeadEnds from intermediate dictionaries. After several iterations (5+), there is convergence to the CycloD: the 100% DeadEnd-free Circular dictionary. DeadEnds found in a prior dictionary are suppressed when booking a newer dictionary. It's impossible to verify the presence of a Lemma in a new dictionary still under construction, so the verification is done against the prior dictionary. Even so, this process converges.

BookCycloD starts from the BaseD dictionary generated by BookBaseD from AnteSeqD. BaseD contains DeadEnds that need purging, so BookCycloD cyclically generates intermediate dictionaries, each with fewer DeadEnds than the previous, until finally the DeadEnd-free CycloD is distilled. Legitimacy as a Circular, non-DeadEnd dictionary entry requires that every Gloss also be a Lemma of a different Circular entry. Yet this test isn't feasible in one iteration, since the dictionary isn't yet complete, still under construction. Fortunately the distillation converges after only a few (5+) passes.

def statprintGrl(old,new,dud):
    print('DeadEnd Entries removed: ', len(old)-len(new))
    print('remaining DeadEnds:      ', len(dud))

def BookCycloD(limit,oldD,newD,newDeadEnds):
    ctr=0
    for Lemma, Gloss in oldD.items():
        ctr+=1
        #Short Test Exit
        if ctr>=limit: statprintGrl(oldD,newD,newDeadEnds); return
        Gloss2=[]
        for Alt in Gloss:
            if Alt in oldD:
                Gloss2.append(Alt)          #excluding DeadEnd Alts from Gloss2
            else:
                newDeadEnds[Alt]+=1
        if len(Gloss2)>0:                   #March25  >1 but what about when just 1 Gloss?
            newD[Lemma]=Gloss2
    statprintGrl(oldD,newD,newDeadEnds)

D2=defaultdict(list);     DeadEnd2=defaultdict(int)
D3=defaultdict(list);     DeadEnd3=defaultdict(int)
D4=defaultdict(list);     DeadEnd4=defaultdict(int)
D5=defaultdict(list);     DeadEnd5=defaultdict(int)
CycloD=defaultdict(list); DeadEnd6=defaultdict(int)

print("BaseD to D2") print("BaseD: ",len(BaseD)) BookCycloD(1000000,BaseD,D2,DeadEnd2) print("D2: ",len(D2)) print("\nD2 to D3") BookCycloD(1000000,D2,D3,DeadEnd3) print("D3: ", len(D3)) print("\nD3 to D4") BookCycloD(1000000,D3,D4,DeadEnd4) print("D4: ", len(D4)) print("\nD4 to D5") BookCycloD(1000000,D4,D5,DeadEnd5) print("D5: ", len(D5)) print("\nD5 to CycloD") BookCycloD(1000000,D5,CycloD,DeadEnd6) print("CycloD: ", len(CycloD))
 * 1) March 29

BookBaseD
BookBaseD books the BaseD dictionary. BookBaseD traverses the AnteSeqD, using each Gloss/Sequitur as a model to accumulate Alts. Alts must never be identical to any Sequitur, though they must be Equivalent. Because the N leftmost tokens match, an Alt can be spliced onto the end of a SynTex, so the SynTex can Extend. BookBaseD verifies that an AltCandidate is also a Lemma in AnteSeqD with a non-Null Glossa, to ensure Circularity. Then BookBaseD removes any Sequiturs present as AltCandidates. Finally, if the GlossList of AltCandidates has any members, a new Entry is booked into BaseD, its Lemma being the same AnteSeqD Lemma and its Glossa the GlossList of AltCandidates. All Glossa are Uniquified in a later pass: any repeated Glossa are removed. Single Glossa are allowed, though they force the Generator to attach the same double-Clause sequence for lack of any other choice. And Single Glossa are very common (7998/27592, currently 29% of AnteSeqD entries). Even so, they don't contain any Sequiturs in their single Gloss.

BaseD=defaultdict(list)

def bookBaseD(limit):
    global Seq1ctr, NullGlossctr, Remove, SingleGloss, ASctr, Mismatch, Iterations, Duds
    Seq1ctr=0; NullGlossctr=0; Remove=0; Iterations=0
    SingleGloss=0; ASctr=0; Mismatch=0; Duds=0
    #for SeqLemma,SDglosses in TestAnteSeqDict.items():     #SeqLemma becomes AltLemma
    for SeqLemma,SDglosses in AnteSeqDicta.items():          #SeqLemma becomes AltLemma March30
        Iterations+=1
        if Iterations%2000==0:
            print(Iterations, "Iterations %s secs " % (time.time() - start_time))
        if Seq1ctr>=limit: statprint1('\n++'); return        #Test Sampling
        SqtrLoopctr=0
        for Sequitur in SDglosses:        #Alts are non-identical approximations to Sequitur
            SqtrLoopctr+=1
            Marg2=Sequitur[:2]
            GlossList=[]
            Seq1ctr+=1
            for AltCand,ASgloss in AnteSeqDicta.items():
                if len(ASgloss)>0:        #March25b >1
                    if AltCand[:2]==Marg2:
                        if AltCand in AnteSeqDicta:
                            GlossList.append(AltCand)
                        else:
                            Duds+=1       #AltCand missing as AnteSeqDicta Lemma
                    else:
                        Mismatch+=1       #1.2 billion
                else:
                    ASctr+=1              #0.3 billion
            if GlossList != []:
                SqtrLoopctr=0
                for Sequitur in SDglosses:        #remove all Sequiturs
                    SqtrLoopctr+=1
                    if Sequitur in GlossList:
                        GlossList.remove(Sequitur)
                        Remove+=1
                if len(GlossList)>0:              #Single Gloss OK now
                    BaseD[SeqLemma]=GlossList
                if len(GlossList)==1: SingleGloss+=1
                if len(GlossList)==0: NullGlossctr+=1
    statprint1('==')

start_time = time.time()
bookBaseD(1000000)      #36562 Source slices 27592 for len 45678
print("--- %s seconds ---" % (time.time() - start_time))
 * 1) March25b 4 minutes to complete

def statprint1(sign):
    print(sign, 'Iterations: ', Iterations)
    print('Sequiturs: ', Seq1ctr)
    print('len(AnteSeqDicta): ', len(AnteSeqDicta))
    print('Mismatches: ', Mismatch)
    print('NullGloss[]: ', NullGlossctr)
    print('SingleGloss: ', SingleGloss)
    print('len(BaseD: )', len(BaseD))
    print('ASgloss too short: ', ASctr)
    print('Removes: ', Remove)
    print('Duds: ', Duds, 'should never happen')

Prep
def Prep(filename):
    t1=[]
    #with open(filename, encoding='utf-8') as raw1:
    with open(filename) as raw1:
        for line in raw1:
            t1.append(line)                       #t1 is raw lines
        raw1.close()
    #++++++++++++++++++++++++++++
    a=''.join(t1)                                 #join lines into text
    b=a.replace('\\n','').replace('\\r','').replace("\'",'')\
       .replace('\\','')
    #return b
    #+++++++++++++++++++++++++                    #remove Wiki footnotes [22]
    ts1=re.sub(r'\[.*\]', '', b)
    #+++++++++++++++++++++++++                    #tokenize string-->list
    tL1=word_tokenize(ts1)
    #+++++++++++++++++++++++++                    #lowercase all
    tL2=[x.lower() for x in tL1]
    return tL2[1:]                                #dont return the initial UTF-8 marker

DasKapitalTTL=Prep("J:\\PDF Resources\\Capital-Volume-I - Remedied.txt")
len(DasKapitalTTL)      #391099 March30 2018
 * 1) #-Capital-Volume-I - Remedied.txt-

Prep opens the utf-8 encoded, manually-remedied DasKapital TrainingText (TT), reading it in line by line. Then invisible characters ('\n' Linefeed, '\r' CarriageReturn, ...) are processed. The numerous square-bracket pairs representing text references are removed. The NLTK tokenizer converts raw text into a Token List, introducing noise (e.g. Period Monsters). The Token List is lowercased to avoid mismatching otherwise-identical upper- and lower-case tokens ('while' should match 'While', and 'While' is only uppercase in certain contexts such as sentence-initial position).

Load Dubliners
input = open("J:\\P11\\DublinersRTextListDec10.pickle", 'rb') DublinersComplete=pickle.load(input)  #Jan30 input.close print(len(DublinersComplete)) DublinersTTL=DublinersComplete[145:-3304] #Dec20 delete Gutenberg Prolog print(len(DublinersTTL))     #Old 84507  81203 March30 84652 81203 The manually-remedied and prepped Dubliners TT is not utf-8-encoded. The Gutenberg Project preface is removed since it is textually very dissimilar to the main body of Dubliners.

DKDTTL combines DasKapital and Dubliners
DKDTTL=DublinersTTL+DasKapitalTTL[:81203]
 * 1) Only part of DasKapitalTTL used so both texts equal lengths

DKDTTL Token list is half DasKapital and half Dubliners.

ExcludeList
BookNoQuotes uses the ExcludeFilter, which uses ExcludeList to exclude unacceptable Slices when booking SliceDictionaries such as Text16D and the PlageDicts. Most ExcludeList members originate in Das Kapital, which abounds in obsoletes and Period Monsters. Some are native to the raw Das Kapital text, while others are introduced by the NLTK Tokenizer.

Various Pairs are excluded, importantly Dubliners Quotes (Single and Double) and Parentheses. Processing Pairs is much more complicated than Pair-free text, with little return for the effort, so Pairs are simply excluded for now. Remember, for computers it's often easier to deal with special cases by enumeration than by general processing. Anything unusual and infrequent may warrant ExcludeList membership.

ExcludeList=['“','”','(',')','‘','"',"'",'@',"``",'...',
    'pp','&','c.','b','/',
    'lb','lbs','i.','e.','***',
    '12','m-c-m','2d.','3d.',
    '£','v','°','100°','90°','12','v.',
    'c','10','20:100','1:1','7½','5:3',
    'l.c.','p.','456','51','3s.',
    '180,000,000','26½','les','comptes','à',
    'dijon','pour','18','l.','ce','que','l','220',
    'it.3','100','.1','2,000.44','1/4','7¾','2½',
    '£10','14','non-use-values','£7','quit-rent-corn','2s',
    '143','23','11½','£100','66,632','1/9','£10,000',
    'c-m-c',"''",'xxxviii.',
    'p.d.µa','l.l.d','s.s.','.p','l.c.','.s..','t..','p..','l.c',
    '..pe','s.µµet..a.','vol.ii','c.f','lieut.-col.','s..','..d..','t..t..',
    '..a..e.t.s','pa.ta','pa.a','a.ta.es..','j.b.','t.ii','a.taµee.ßes.a',
    '.d..','.a.','lond..','pe.te','c..','.a.a','.a','..t','child.empl.com.',
    'iv.-vi','vol.i','p.m','a.t.','h.o.c.','l.c..','e..','m.p.','xxi..',
    't.iii','.s.t..','..s..','a.t','..the','e.a','note.—','...','µat..',
    'apa.t..','ad..at..','p3','x/100','ch2','i9']

BookNoQuotes
BookNoQuotes books various SliceDictionaries such as Text16D and the PlageDicts, excluding any inappropriate slices. The name BookNoQuotes contrasts with BookQuotes (unused in P15) that performs the analogous bookings when Pairs (Quotes) are to be included.

def BookNoQuotes(d,n):                  #Quotes are not booked
    for x in range(len(DKDTTL)-n):
        if ExcludeFilter(DKDTTL[x:x+n]):
            d[tuple(DKDTTL[x:x+n])]=n
    print(n, len(DKDTTL), len(d))

Text16D=defaultdict(list)
print("Text16D: ", end=" "); BookNoQuotes(Text16D,16)
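ExcludeFilter itself is not listed on this page; here is a minimal sketch of what it needs to do, assuming it returns True for Slices that are acceptable to book:

def ExcludeFilter(slice_):
    """True when the Slice contains no ExcludeList member, so the Slice may be booked."""
    for token in slice_:
        if token in ExcludeList:
            return False
    return True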

Dumps
Many (pickles, lists, dictionaries,...) are dumped, then loaded, often to avoid slow re-booking. (It can take 8 minutes to book D1a currently). Here is an example:

#
output = open("J:\\P15\\CS16SourceDicta.pickle", 'wb+')
pickle.dump(Text16D, output)
output.close()
print(len(Text16D))     #April4 119757  10862KB
 * 1) April4 Dump Text16D

Loads
Here's an example of how (pickles, lists, dictionaries,...) are loaded.

input = open("J:\\P15\\CS16SourceDicta.pickle", 'rb')
Text16D = pickle.load(input)
input.close()
print(len(Text16D))     #April4 119757
 * 1) April4  Load    Text16D

BookAnteSeqD(BASD)
BASD books the AnteSeqD (ASD) from the Text16D TextSlice Dictionary. BASD makes 5 passes, searching for Antes of increasing length 4, 5, 6, 7, 8. When an Ante is found in Text, the following 8 Tokens are given to SequiturFlush (SF). SF returns a list of the Sequitur Flushes, or [] if none. Any Sequitur Flushes are accumulated from all 5 Ante lengths. Then a new entry is booked into ASD with the Ante as Lemma and the accumulated Sequitur Flush list as Glossa. After completing the ASD booking, an additional UniquifyGlossa pass Uniquifies all Glossa, removing any repeated Glosses. Finally metrics are displayed.

def BookAnteSeqD():
    ANTEctr=0; Slicectr=0; NullSeq=0; MultiSeqs=0
    for Slice in Text16D:                 #April3
        Slicectr+=1
        for offset in range(0,5):         #generates:01234 lens 45678
            if (Slice[0] in Hinges19) and (Slice[3+offset] in Hinges19):
                ANTEctr+=1                #if its a Viable Ante
                Ante=Slice[:4+offset]
                #collect all Sequiturs
                Sequitur=SequiturFlush(tuple(Slice[3+offset:11+offset]))   #max length 8
                if Sequitur==[]: NullSeq+=1
                if Ante in AnteSeqD: MultiSeqs+=1
                AnteSeqD[Ante].extend(Sequitur)
    print(" %s secs " % (time.time() - start_time))
    UniquifyGlossa()
    print(" %s secs " % (time.time() - start_time))
    #
    print('Slices: ', Slicectr)
    print('NullSeq: ', NullSeq)
    print('len(AnteSeqD)', len(AnteSeqD))
    print("MultiSeqs ", MultiSeqs)
    print('Antes: ', ANTEctr)
    print("Sequitur Total: len(AnteSeqD) + MultiSeqs: ", len(AnteSeqD)+MultiSeqs)
    print("should equal Antes count:                  ", ANTEctr)
#
AnteSeqD=defaultdict(list)      #gloss is a list of tuples
start_time = time.time()
BookAnteSeqD()
 * 1)         if Slicectr%1000==0:
 * 2)             print(Slicectr, "Slicectr  %s secs " % (time.time - start_time))

SequiturFlush
SequiturFlush gets a Sequitur tuple and returns a list of its Flush, limited to lengths 4/8. Only bookAnteSeqD uses SequiturFlush at the moment. Given:

Hinges19=('the', 'of', 'to', 'a', 'was', 'his',
          'in', 'had', 'said', 'for', 'at', 'him',
          'on', 'as', 'not', 'were', 'be', 'would', 'up')

d3=('the', 'of', 'to', 'a', 'was', 'his', 'in', 'had', 'said', 'for', 'at', 'him', 'on', 'as', 'not', 'were', 'be', 'would', 'up')

SequiturFlush(d3) returns the list:

[('the', 'of', 'to', 'a', 'was', 'his', 'in', 'had'),
 ('the', 'of', 'to', 'a', 'was', 'his', 'in'),
 ('the', 'of', 'to', 'a', 'was', 'his'),
 ('the', 'of', 'to', 'a', 'was'),
 ('the', 'of', 'to', 'a')]

Note: SequiturFlush would never get a Sequitur longer than 8 anyway, since AnteSeqD limits Sequitur length to 8.

def SequiturFlush(tpl):             #March12
    accum=[]
    for x in reversed(range(len(tpl))):
        if tpl[x] in Hinges19:
            if len(tpl[:x+1])>3:
                accum.append(tpl[:x+1])
    return accum

UniquifyGlossa(UG)
UG Uniquifies AnteSeqD, removing any repeat Glossa from each AnteSeqD Glossa list. Only BookAnteSeqD uses UG.

def UniquifyGlossa():               #UG
    for k,v in AnteSeqD.items():
        if len(v)>1:        #40
            Vset=set(v)
            Vlist=list(Vset)
            AnteSeqD[k]=Vlist

Autopsy
Autopsy uses 9 PlageDicts, one for each SliceWidth 8-16. Autopsy starts with a fresh copy of SynTexList, the generated SynTex. If Autopsy finds a Plage in SynTexList, a '金' (Kane marker) is swapped into the Plage position in SynTexList, and the Plage itself is removed. That way later iterations won't rediscover the same Plage, or its Flush, again, avoiding multiple spurious detections of the same Plage. Autopsy searches for the longest (16gram) Plages first, then successively searches for shorter Plages. The PlageDicts themselves are booked from the same Text the Generator trains on. It's critical for the PlageDicts and the TrainingText (TT) to be booked from the same source texts, else Autopsy will fail.

def AutopsyB(n,plaged,before,after):
    global SynTexList
    ctr=0
    for q in range(len(SynTexList)-15):
        STLcopy=SynTexList[q:q+n]
        if '金' in STLcopy:
            continue
        elif (tuple(STLcopy) in plaged) and (tuple(STLcopy) not in Pdict):
            ctr+=1
            print(ctr, before, STLcopy)
            SynTexList[q:q+n]='金'
            Pdict[tuple(STLcopy)]          #register the Plage so it isn't detected again

Pdict=defaultdict(int)                     #used by Autopsy RealTime dict

def Autopsy():
    Pdict.clear()    #Feb1
    AutopsyB(16,P15PlageDict16,"16gram",'\nafter 16grams\n')
    AutopsyB(15,P15PlageDict15,"15gram",'\nafter 15grams\n')
    AutopsyB(14,P15PlageDict14,"14gram",'\nafter 14grams\n')
    AutopsyB(13,P15PlageDict13,"13gram",'\nafter 13grams\n')
    AutopsyB(12,P15PlageDict12,"12gram",'\nafter 12grams\n')
    AutopsyB(11,P15PlageDict11,"11gram",'\nafter 11grams\n')
    AutopsyB(10,P15PlageDict10,"10gram",'\nafter 10grams\n')
    AutopsyB( 9,P15PlageDict9, "9gram", '\nafter 9grams\n')
    AutopsyB( 8,P15PlageDict8, "8gram", '\nafter 8grams\n')

Detoke
Detoke (detokenize) takes a SynTex list of Tokens and renders it attractive to human readers.

['the', 'cold', 'air', ',', 'was', 'quite', 'able', 'to', 'keep', 'his', 'commodity', 'in', 'its', 'bodily', 'form', 'became', 'a', 'direct', 'relation', 'between', 'the', 'objects', 'to', 'be', 'measured', '?', 'plainly', ',', 'by', 'the', 'quantity', 'of', 'gold', 'and', 'silver', 'in', 'a', 'country', 'are', 'determined', 'by', 'the', 'quantity', 'of', 'gold', 'fixed', 'upon', 'as', 'the', 'standard', 'of', 'prices',

......The cold air, was quite able to keep his commodity in its bodily form became a direct relation between the objects to be measured? plainly, by the quantity of gold and silver in a country are determined by the quantity of gold fixed upon as the standard of prices,

#
def Detoke(inlist):
    out0=[]
    for x in inlist:                         #flatten any interior lists
        if type(x)==list:
            for q in range(len(x)):
                out0.append(x[q])
        else:
            out0.append(x)
    out1=[out0[0].capitalize()]              #Oct27 prime the accumulator
    for x in range(len(out0)-1):             #Capitalize sentences
        #if out0[x]=='.': out1.append(out0[x+1].capitalize())    #Oct31
        if out0[x] in ['.',':',';']:
            out1.append(out0[x+1].capitalize())
        else:
            out1.append(out0[x+1])
    #
    pctr=0                      #new paragraph every 4 DOTs
    S2='\n\n......'             #initial indentation Oct12
    S3=''
    tstring=''
    tstring=' '.join(out1)
    for z in range(len(tstring)):
        if (tstring[z] in ['.',':',';',',']) and (pctr==4):      #Oct31
            pctr=0
            S2=S2 + '.....\n......'
        elif tstring[z] in ['.',':',';',',']:
            S2=S2 + tstring[z]                                   #Oct31
            pctr+=1
        else:
            S2=S2 + tstring[z]
    #                                        #string text cosmetics
    S3=S2.replace(' ,',',').\
        replace(' .','.').replace(" 's","'s").\
        replace(' ;',';').replace('( ','(').replace(' )',')').\
        replace(' i ',' I ').replace(' !','!').replace(" d ’ ", " d’").\
        replace(' ?','?').replace(" 'd","'d").replace(" ’ s","’s").\
        replace(' ;',' ').replace(".; ",'. ').replace('_',' ').\
        replace('!.','!').replace("i'm","I'm").replace("' '","").\
        replace(" ’ ","’ ").replace(' “ ',' “').replace(' ” ','” ').\
        replace(" ‘ "," ‘")
    #
    print(S3,'\n',len(S3))
 * 1) April7 Detoke pretty-prints a SynTex list

First, Detoke flattens any interior lists (an unnecessary artefact). Next it capitalizes sentence heads. A new paragraph issues after every 4th sentence. Replace is used for various string-level cosmetics. Re-attaching tokens is important: 'i', "'", 'm' --> I'm. There is a Quote version of Detoke that handles Pairs (Quotes, Parentheses, etc.).