User:Dcx/sandbox/cryptography

Introduction
Develop a Markov Chain Monte Carlo (MCMC) method to decrypt a given encrypted message whose correct answer is known in advance, so the result can be checked.

First thought
1. Read in a very long text (War and Peace) and use the ordering of letters in that text to measure how often particular letters of the English language follow one another.
2. Create a $$26\times26$$ probability matrix, with a-z on both axes, filled with the likelihood that one letter is followed sequentially by another.
3. Generate a random solution key (a dictionary mapping each cipher letter to a decrypted letter) and calculate the overall likelihood of that key.
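The first two steps can be sketched as follows. This is a minimal illustration, not the original code; starting the counts at 1 (so no transition has zero probability) is my own assumption:

```python
import string

import numpy as np


def transition_matrix(text):
    """Count how often each letter follows another in `text`
    and normalize each row of counts into probabilities."""
    index = {c: i for i, c in enumerate(string.ascii_lowercase)}
    counts = np.ones((26, 26))  # start at 1 to avoid zero probabilities
    cleaned = [c for c in text.lower() if c in index]
    for a, b in zip(cleaned, cleaned[1:]):
        counts[index[a], index[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)


M = transition_matrix("the quick brown fox jumps over the lazy dog")
```

In practice `text` would be the full War and Peace corpus rather than a pangram.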

Why MCMC
The initial thought was to traverse all possible permutations of the alphabet and simply pick the one with the maximum likelihood. As a test of the time this traversal needs, I computed the likelihoods of all permutations of 8 letters, which took 5.7 seconds. Since the number of full permutations of $$n$$ letters is $$P(n,n)=n!$$, the time it would take for 26 letters should be
 * $$t=5.7\ \mathrm{s}\times\frac{26!}{8!}=5.7\times10^{22}\ \mathrm{s}=1.8\times10^{15}\ \mathrm{years}$$

This time scale is about 130 thousand times the age of the universe, which is clearly not achievable. This is why we need Markov Chain Monte Carlo.
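The back-of-the-envelope extrapolation can be reproduced in a few lines (a sketch; the 5.7 s is the measured timing quoted above):

```python
import math

t8 = 5.7  # seconds measured for all permutations of 8 letters
t26 = t8 * math.factorial(26) / math.factorial(8)  # extrapolate to 26 letters
years = t26 / (3600 * 24 * 365.25)
print(f"{t26:.1e} s = {years:.1e} years")
```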

Letter sequence only
Then I tried MCMC, but considering letter sequence only. After calculating the likelihood of the generated key, I randomly chose two letters in the key to swap and calculated the likelihood again. If the new likelihood was greater than 0.99 times the previous likelihood, I saved the key and went to the next step; otherwise I randomly swapped another two letters in the key.
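This simple swap chain can be sketched as below. The `likelihood` function is assumed to be supplied (e.g. accumulated from the probability matrix); the 0.99 acceptance factor follows the text, everything else is my reconstruction:

```python
import random
import string


def mcmc_sequence_only(cipher, likelihood, steps=100_000, factor=0.99):
    """Swap-two-letters chain: keep a swap if the new likelihood is at
    least `factor` times the previous one, otherwise undo it."""
    alphabet = string.ascii_lowercase
    key = list(alphabet)
    random.shuffle(key)
    decrypt = lambda k: cipher.translate(str.maketrans(alphabet, "".join(k)))
    best = likelihood(decrypt(key))
    for _ in range(steps):
        i, j = random.sample(range(26), 2)
        key[i], key[j] = key[j], key[i]
        new = likelihood(decrypt(key))
        if new > factor * best:
            best = new  # accept: save the key as a chain step
        else:
            key[i], key[j] = key[j], key[i]  # reject: undo the swap
    return key
```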

After running the chains for one night, I found the output was nonsense. So I decided to account for more factors such as letter frequency, word frequency, and word length.

Initialization
Define variables to describe the problem.

I use two sets of data. One is the book War and Peace from https://bitbucket.org/desika/narayanan_computational_astrophysics_2017/src. The other is a list of 20 thousand English words ranked by frequency from https://github.com/first20hours/google-10000-english.

I apply a filter that keeps only the letters of the data and transforms them all to lower case.
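One way to implement this filter (whether word boundaries are collapsed to single spaces, as here, is my assumption):

```python
import re


def clean(text):
    """Keep only ASCII letters, lower-cased; collapse everything
    else (punctuation, digits, whitespace) to single spaces."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()
```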

Letter sequence


First generate the probability matrix using both the War and Peace text and the word-list data. For War and Peace, the letter sequence between successive words is also counted, while for the word list only sequences within words are counted. The computed probability matrix is shown in Figure 1.

Now, given a candidate solution of the cipher phrase (a string with the same length as the phrase, for which the true answer is the best possible solution), I want to calculate the probability that each letter in that solution is the right letter (hereafter I just call it the "goodness" of the letter). I have tried three methods of accumulating it from the probability matrix: sum, product, and sum of $$\ln(p+1)$$.
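The three accumulation methods can be sketched as follows, where `M` is the 26×26 probability matrix and `text` a candidate decryption; the function and method names are mine:

```python
import string

import numpy as np


def sequence_goodness(text, M, method="sum"):
    """Accumulate the transition probabilities of successive letter
    pairs in `text` using one of three methods."""
    idx = {c: i for i, c in enumerate(string.ascii_lowercase)}
    p = np.array([M[idx[a], idx[b]] for a, b in zip(text, text[1:])])
    if method == "sum":
        return p.sum()
    if method == "product":
        return p.prod()
    if method == "logsum":
        return np.log(p + 1).sum()  # sum of ln(p + 1)
    raise ValueError(method)
```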

Letter frequency
Letter frequency is how often each letter appears in general English text. I calculate letter frequencies only from War and Peace, because the word list is not natural English expression. I take letter frequency into account in chain initialization, goodness estimation, and chain-step sampling. The basic idea is to compare each English letter's frequency to the frequencies of the letters in the cipher phrase: if a letter occurs frequently in the cipher phrase, we expect it to map to a high-frequency English letter. I use a Gaussian function to construct a probability distribution over English letters for each letter in the cipher phrase. The constructed probability distributions are shown in Figure 2.

The number of occurrences of each letter in the cipher phrase ranges from 1 to 5. From experience, the more occurrences a letter has, the more likely it maps to a high-frequency letter. If a letter occurs five times, its PDF should be narrow and centered on high-frequency letters, while if it occurs only once, its PDF should be wide and centered on low-frequency letters. Hence I construct the PDF's width $$\sigma$$ and center $$\mu$$ separately.

Width
I use a sigmoid function to describe how $$\sigma$$ varies with the occurrence count in the cipher phrase. The idea is to keep the width large at low occurrence and have it drop sharply at high occurrence. The sigmoid is a reshaped logistic function:
 * $$\sigma=\frac{26}{1+e^{\frac{6n}{\mathrm{max}(n)}-6}}$$

where $$n$$ is the number of occurrences of the letter in the cipher phrase, and $$\mathrm{max}(n)$$ is 5. This makes $$\sigma\sim26$$ at low occurrence, falling to $$\sigma=26/2$$ at maximum occurrence.
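The original code is not preserved here, but the equation translates directly; a sketch:

```python
import numpy as np


def width(n, n_max=5):
    """Width of the letter-frequency PDF: ~26 for rarely-occurring
    letters, dropping to 26/2 = 13 at the maximum occurrence count."""
    return 26.0 / (1.0 + np.exp(6.0 * n / n_max - 6.0))
```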

Center
I assume the letter with the highest occurrence count in the cipher phrase most likely maps to the highest-frequency English letter, and the letter with the lowest count most likely maps to the lowest-frequency letter, so their PDFs are centered on the highest- and lowest-frequency letters, respectively. Letters with counts in between are rescaled proportionally onto frequencies between the highest and lowest, and I pick the English letters whose frequencies are closest to these rescaled values as the centers.
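The rescaling can be sketched as below; the linear interpolation and the snap-to-closest-frequency step follow the text, while the function signature and names are my own:

```python
import numpy as np


def centers(counts, letter_freqs):
    """Map occurrence counts (one per cipher letter) linearly onto the
    range of English letter frequencies, then snap each rescaled value
    to the index of the letter with the closest frequency."""
    counts = np.asarray(counts, float)
    lo, hi = counts.min(), counts.max()
    f_lo, f_hi = letter_freqs.min(), letter_freqs.max()
    scaled = f_lo + (counts - lo) / max(hi - lo, 1) * (f_hi - f_lo)
    return np.array([np.abs(letter_freqs - s).argmin() for s in scaled])
```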

Goodness
I then construct the Gaussian distributions from $$\mu$$ and $$\sigma$$, and define the goodness of the letters in a solution by these Gaussian distributions.

Word finding
The idea is that the more words I can find in a solution, the more likely the solution is right. The longer and more frequent the words found, the higher the goodnesses of the letters in those words should be.

Frequency
I use the list of 20k words and assign linear frequencies according to their ranking.

Length
I assign a probability proportional to the length of the words.

Goodness
I take both frequency and length into account and define the goodness of the letters in a solution as the product of the two probabilities. I require words to be at least 4 letters long to be meaningful. The word search is designed not to find shorter words when a longer word containing them can be found. For example, if a solution contains the word "plant", the shorter words "plan" and "ant" should not be found. To achieve this goal, I make the following adjustments:

1. The word head pointer moves forwards while the word tail pointer moves backwards, so that longer words are found first.
2. The head pointer jumps ahead immediately when a word is found, without continuing the scan inside it.
3. Each found word is checked against the existing found words to make sure it is not contained in one of them.

This design makes the number of words found in a solution more accurate.
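These adjustments amount to a greedy longest-match scan; a sketch (the minimum length of 4 follows the text, while the function shape and the vocabulary-as-set representation are my reconstruction):

```python
def find_words(text, vocabulary, min_len=4):
    """Greedy scan: for each head position try the longest tail first,
    jump past any found word, and never report a word contained in
    an already-found word."""
    found = []
    head = 0
    while head < len(text):
        for tail in range(len(text), head + min_len - 1, -1):  # tail backwards
            if text[head:tail] in vocabulary:
                found.append(text[head:tail])
                head = tail - 1  # jump: do not search inside this word
                break
        head += 1
    return found
```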

Overall goodness


With the individual goodnesses from letter sequence, letter frequency and word finding, I combine them into an overall goodness. I give word finding double weight, the idea being that once words are found in a solution, the other letter properties matter far less. Some examples of computed goodnesses are shown in Figure 3. These are the goodnesses of letters in the solution, but what we care about are the goodnesses of letters in the key, which synthesize multi-occurrence letters in the solution. I convert the goodnesses of a solution to the corresponding goodnesses of the key used to translate the cipher phrase into that solution. For multi-occurrence letters I use the mean of their goodnesses in the solution.
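A sketch of the weighting, assuming the three per-letter goodness arrays are on comparable scales; the double weight for word finding follows the text, but the exact combination rule (a weighted mean) is my assumption:

```python
import numpy as np


def overall_goodness(seq_good, freq_good, word_good):
    """Per-letter weighted combination: word-finding goodness gets
    double the weight of the other two factors."""
    return (seq_good + freq_good + 2.0 * word_good) / 4.0
```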

Chain initialization
The letters in the cipher phrase do not cover the whole alphabet, so I only generate key entries for the letters that appear in the phrase. I use the Gaussian distributions from the letter-frequency analysis as weights to randomly pick letters to assemble the key. By making copies of the letters array and the Gaussian distributions array, I record which letters are picked into the key and which are left over. The leftover letters and their corresponding probability distributions are stored separately. The size of the key plus the number of leftover letters should be 26.
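The initialization can be sketched as a weighted draw without replacement; `pdfs` (one length-26 weight vector per cipher letter) and the names are my assumptions:

```python
import numpy as np


def init_key(cipher_letters, pdfs, rng=np.random.default_rng()):
    """Assemble a key by drawing, for each cipher letter, a plaintext
    letter (0-25) from its Gaussian weight vector, without replacement.
    Returns the key and the leftover (unassigned) letters."""
    available = list(range(26))
    key = {}
    for c in cipher_letters:
        w = np.array([pdfs[c][i] for i in available])
        pick = rng.choice(len(available), p=w / w.sum())
        key[c] = available.pop(pick)
    return key, available
```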

Chain run
Using the initialized key, I estimate an overall goodness (a single value). Then I change one letter in the key, either swapping it with another letter in the key (internal swap) or with one of the leftover letters (external swap). I estimate the overall goodness of the new key and compare it with the previous one. If the new goodness is greater than some factor of the old one, I save the new key as a chain step; otherwise I swap again until the criterion is satisfied.

Step sampling
Given a key, we can compute the goodness of every letter in it. We could randomly select a letter to change with weights based on these goodnesses, but in practice I just choose the one with the minimum goodness (i.e. the worst one). Then I decide which letter to replace it with. The alternatives can be letters already in the key (internal letters) or letters not in the key (external letters). For internal letters I use their goodnesses as the PDF for being chosen, while for external letters I use their Gaussian functions as the PDF. I then normalize the external PDF to have the same mean value as the internal PDF. To perform the change, I need to update the key, the leftover letters and their distributions simultaneously. If an external swap is chosen, I replace the chosen key letter with the external letter, removing that letter from the leftover arrays and adding the displaced one. If an internal swap is chosen, I simply exchange the two letters within the key.
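A simplified sketch of the proposal step (the 50/50 choice between internal and external swaps, and all names, are my assumptions; the PDF normalization described above is omitted for brevity):

```python
import numpy as np


def propose_swap(key, leftover, letter_goodness, external_pdf,
                 rng=np.random.default_rng()):
    """Pick the worst key entry and swap it either with another key
    entry (internal) or with a weighted-drawn leftover letter (external)."""
    cipher_letters = list(key)
    worst = min(cipher_letters, key=lambda c: letter_goodness[c])
    new_key, new_leftover = dict(key), list(leftover)
    if new_leftover and rng.random() < 0.5:  # external swap
        w = np.array([external_pdf[l] for l in new_leftover])
        pick = rng.choice(len(new_leftover), p=w / w.sum())
        new_leftover.append(new_key[worst])   # displaced letter becomes leftover
        new_key[worst] = new_leftover.pop(pick)
    else:                                     # internal swap
        other = rng.choice([c for c in cipher_letters if c != worst])
        new_key[worst], new_key[other] = new_key[other], new_key[worst]
    return new_key, new_leftover
```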

Accepting criteria
At each step, I need to estimate an overall goodness as the criterion of step acceptance. I have tried two forms of summed letter goodness, as well as the sum times the number of words found. If the new goodness is greater than some proportion of the previous accepted goodness, the step can be appended to the chain. I have tried different proportions, and also annealing the proportion at different annealing rates.
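A sketch of the acceptance rule; the original definition of the annealing rate is not preserved, so the linear schedule below (the proportion rising toward 1 with step number) is purely my assumption:

```python
def accept(new_g, old_g, step, f0=0.99, rate=0.0):
    """Accept a step if the new overall goodness exceeds a proportion
    f of the previous accepted goodness; f can be annealed upward."""
    f = min(f0 + rate * step, 1.0)
    return new_g > f * old_g
```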

Practice
I tried running the chains many times with different configurations. The longest runs are a set of 10 chains of 97,000 steps each, with 3,000 burn-in steps, using a fixed acceptance proportion and no annealing. To test the accuracy of the chains, I compute the proportion of correct letters in the key at each chain step (hereafter the correct rate). In these chains, I find: the solution with the best correct rate reaches 0.29, and 2 words can be found in it. The solution with the best value under one acceptance criterion scores 301.7, with 6 words found; under the other criterion the best solution scores 860.8, also with 6 words found. For reference, the true answer scores 514.3 and 1320.2 under the two criteria, respectively. None of the best solutions is the correct answer, but we are on the road to finding more and more words.

Testing configurations
I tried another set of chains with 5,000 steps to test different configurations: varying annealing rates, varying acceptance proportions of the previous goodness, and varying probability-matrix accumulation methods. I control the comparison by giving the different configurations the same initial chains. The overall goodnesses and correct rates as functions of chain step are plotted in Figure 4.

From the figure we can see:
1. Adding annealing makes the overall acceptance rate much smaller while also accepting more low-quality steps; but changing the annealing rate makes little difference to the sampling process.
2. The higher the proportion of the previous goodness required, the harder it is to accept a chain step. Chains with a higher proportion have more high-goodness samples, but lower-proportion chains achieve more correct samples simply because they accept more steps overall.
3. Using the simple sum accepts the most chain steps and the product the fewest, with the sum of logs in between. In terms of correct rate, the three methods make little difference.

Discussion


This MCMC method of cryptography differs from traditional MCMC methods in two ways:
1. It samples not a continuous parameter space but a discrete space of letters. Thus we cannot use the overall features of all the chain steps to make parameter estimates; we cannot draw a conclusion like "this position should be letter x plus/minus something". If we plot the samples in the parameter space (shown in Figure 5), they are very scattered and show no convergence. The parameter set with the most occurrences is far from the right answer (the red cross in the figure), near which almost no samples occur.
2. A swap changes only one parameter per step, whereas a traditional MCMC method can change all the parameters simultaneously in one step. This is why it is so hard to find the right solution.

If we only consider letter sequence, it is almost impossible for the chains to converge to the right answer. Even if at a certain step the solution is exactly the answer we want, the program cannot tell that it is the best answer, because the letter sequence of the true answer may not be the most probable one under the probability matrix. So considering only letter sequence is in fact not a practical method.

By adding the factors of letter frequency and especially word finding, the method becomes more practical, but also more time-consuming. If it finds a word that is not in the right answer, it takes a long time to dissolve that word and move in other directions. But if we keep the chains running, they should find the answer in the end, because the right answer has the most recognizable words.

I did not have time to test this, but if the cipher phrase were separated by spaces, my method should be far more efficient: all the word lengths would be known in advance, and we could search only among words of the matching lengths. At the least, if there were a one-letter word in the phrase, we could give it 99% probability of being "a".