Wikipedia:Reference desk/Archives/Mathematics/2020 April 15

= April 15 =

Comparing random bit-strings
Say you have a pair of random, uniformly distributed bit-streams. Is is there some kind of well-defined "metric of similarity" that could be used to compare the two? In particular I'm wondering what the "average" level of expected "similarity" might be. Earl of Arundel (talk) 13:12, 15 April 2020 (UTC)
 * The expected value of the Hamming weight of the bitwise XOR of two random n-bit bit strings would be n/2, but that is hardly satisfying. -- ToE 16:41, 15 April 2020 (UTC)


 * You may be looking for Hamming distance (which for a pair of bit strings is the weight of the XOR of the strings). For infinite streams you may want something like $$\lim_{n\rightarrow\infty} D_n/n$$, in which $$D_n$$ stands for the distance between the prefixes of length $$n$$. If the bits are independently random with fifty-fifty probability, that limit will be 1/2, though, not very interesting. --Lambiam 17:14, 15 April 2020 (UTC)


 * I see! But then how do you quantify something like the expected "spread" of possible weights? If the average is ~50%, then how much less likely would a "similarity score" of, say, 27% be? Earl of Arundel (talk) 18:38, 15 April 2020 (UTC)


 * For a given length $$n$$, the distance as a random variable has a binomial distribution, with as parameters that length and $$p = 0.5$$. The expected value, as already stated, is $$\mu = \tfrac{1}{2}n$$, and the standard deviation is $$\sigma = \tfrac{1}{2}\sqrt{n}$$. To scale to the interval $$[0,1]$$, divide by $$n$$, giving $$\mu = \tfrac{1}{2}$$ and $$\sigma = \frac{1}{2\sqrt{n}}$$. As $$n$$ gets larger, the distribution can be approximated by a normal distribution with these parameters. For smaller sample sizes, the exact discrete distribution can be used, using binomial coefficients. --Lambiam 20:33, 15 April 2020 (UTC)


 * Well that makes sense. Terrific explanation too. Kudos! Earl of Arundel (talk) 21:26, 15 April 2020 (UTC)


 * There's also mutual information. The expected mutual information is 0. --163.47.115.80 (talk) 21:28, 15 April 2020 (UTC)


 * Some kind of cross correlation with zero lag? →2A00:23C6:AA08:E500:809B:7CE7:690:F1F0 (talk) 10:09, 16 April 2020 (UTC)
 * I was more thinking absolute mutual information.163.47.115.80 (talk) 01:27, 17 April 2020 (UTC)
 * That is always a nonnegative quantity. Since no distribution is involved, the concept of expected value is not applicable – unless you use the universal prior probability induced by the function K, where string s has probability 2−K(s). Then the expected value, although small, will be positive. Unfortunately, K is incomputable, so the practical applicability is limited. --Lambiam 13:40, 17 April 2020 (UTC)
 * I was dividing by the length of the strings, as is being done for Hamming distance above. If X and Y are strings of length n chosen independently via the uniform distribution, then the expected value of $$\frac{I_K(X;Y)}{n}$$ tends towards 0 as n tends to infinity.  If X and Y are infinite binary strings, chosen independently via the uniform distribution (i.e. Lebesgue measure), then $$\lim_n \frac{I_K(X\upharpoonright n; Y\upharpoonright n)}{n}$$ is almost surely 0. (It suffices for $$X\oplus Y$$ to be Martin-Löf random.)--163.47.115.80 (talk) 03:38, 18 April 2020 (UTC)

Well aside from the theoretical implications of course, it obviously doesn't make much sense to talk about the "correlation between two random variables". My actual thought was more along the lines of this. Suppose we have two completely random strings, $$R$$ and $$R'$$. Comparing with some NON-random sequence $$V$$ we find that the similarity score of each, $$S$$ and $$S'$$ respectively, are of equal distance $$D$$ from the expected value, with $$S = \mu + D$$ and $$S' = \mu - D$$. Given that neither is any more or less so similar to V, they are thus in a sense "indistinguishable". Are they not? (That is to say, placed "side by side", both would thus basically appear to be "of equal entropy"?) Earl of Arundel (talk) 15:45, 18 April 2020 (UTC)