Wikipedia:Reference desk/Archives/Mathematics/2017 October 6

= October 6 =

Statistical significance of SHA1-prefix collisions
I have a GitHub repository in which SHA1 commit IDs from another repo are used as folder names. I've noticed that in the 75 commit folders so far, there are only 61 unique values of the first byte of the SHA1 hash; 14 values of the first byte appear twice. (Oddly, none appear three times, nor have I seen any duplication in the 12-bit prefix.) I would have expected only two or three pairs of duplicates from my intuitive understanding of the birthday paradox. What is the statistical significance of this, against the null hypothesis that the random oracle model describes the first byte of SHA1's output? If the null hypothesis is rejected, can this be explained by known weaknesses of SHA1? Neon Merlin  20:18, 6 October 2017 (UTC)
 * The question seems to require familiarity with a few things (GitHub, SHA-1) which probably aren't needed for the answer. The critical bit of information is the number of possible values, in this case the number of possible first bytes of the hash value. You might assume that the first byte has 256 possible values, but the fact that the hashes are used for folder names might mean they are encoded into ASCII somehow, reducing the number of possible values. It seems to me that without knowing the details of how GitHub turns the hash into a folder name, it's impossible to eliminate that as a factor. In any case, I would think that if a weakness in SHA-1 were that obvious then someone at the NSA would have noticed before it saw the light of day; those guys are supposed to be pretty good at cryptography, after all. --RDBury (talk) 00:19, 8 October 2017 (UTC)
 * I don't understand why you expect only two or three pairs of collisions. On average you get only about 65 different first bytes from 75 objects, and getting 61 or fewer different first bytes isn't that unusual either; it happens around 8% of the time.  So no, it doesn't have anything to do with the cryptographic weaknesses of SHA-1.
 * This is all assuming you are not specifically paying attention to those first bytes of the SHA-1 values when you write into the repository. If you deliberately created or removed objects with particular first bytes in their hash, then you could line up their distribution in whatever way you want.  – b_jonas 22:34, 8 October 2017 (UTC)
 * Let me give an explanation. Suppose that instead of creating a predetermined 75 objects, you create new objects until you get 66 different first bytes.  If you already have k different first bytes and you create a new object, then the chance that the first byte of its hash doesn't collide with any of the k first bytes you've got so far is (256−k)/256.  Thus, to jump from having k different first bytes to having k+1 different first bytes, you need to create on average 256/(256−k) new objects.  To go from 0 first bytes to 66 first bytes, then, the expected number of new objects you need to create is
 * 256/256 + 256/255 + 256/254 + … + 256/191.
 * The value of that sum is approximately 76.2. (If you don't want to compute the full sum, just quickly lower bound it by grouping the terms like 26·256/256 + 20·256/230 + 20·256/210 = 26·1 + 20·1.113 + 20·1.219 = 72.6.)  So you need on average more than 76 objects to get 66 different first bytes.
 * – b_jonas 10:48, 9 October 2017 (UTC)
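[The waiting-time sum above, and the quick grouping lower bound, can both be reproduced in a few lines of Python (a sketch of the same calculation, nothing more):]

```python
# Expected number of draws needed to see 66 distinct first bytes:
# sum of the per-step waiting times 256/(256-k) for k = 0..65,
# i.e. 256/256 + 256/255 + ... + 256/191.
expected_draws = sum(256 / (256 - k) for k in range(66))
print(f"{expected_draws:.1f}")  # about 76.2

# The quick lower bound from grouping the 66 terms as 26 + 20 + 20:
lower_bound = 26 * (256 / 256) + 20 * (256 / 230) + 20 * (256 / 210)
print(f"{lower_bound:.1f}")  # about 72.6
```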
 * To a first approximation one can consider this like a Poisson distribution for each of the 256 possibilities. The chance of no hits in a box is e^(-75/256) so we should only get about 256*(1-e^(-75/256)) different ones used - which comes out as about 65. The chance of 2 hits would be e^(-75/256)*(75/256)^2 / 2 so for all 256 we should get about 8, and for three we would expect 256*(e^(-75/256)*(75/256)^3 / 6 ) which comes out as about 1. 14 used twice instead of 8 isn't too unusual, the variance for the Poisson distribution would also be 75/256 so the standard deviation in the number of boxes would be sqrt(256*75/256) which is about 8. So I'd have thought the chances of 14 or more were even higher than b_jonas says but I could always be wrong with my finger in the air calculations. Dmcq (talk) 13:32, 9 October 2017 (UTC)
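[Dmcq's Poisson figures can be checked directly; this is a sketch of the same approximation (treating each of the 256 boxes as an independent Poisson count with mean 75/256), not an exact calculation:]

```python
import math

lam = 75 / 256                         # mean hits per box in the Poisson approximation
p0 = math.exp(-lam)                    # P(a given box gets no hits)
occupied = 256 * (1 - p0)              # expected distinct first bytes, about 65
exactly_two = 256 * p0 * lam**2 / 2    # expected bytes appearing exactly twice, about 8
exactly_three = 256 * p0 * lam**3 / 6  # expected bytes appearing three times, about 0.8
print(occupied, exactly_two, exactly_three)
```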
 * I don't think modeling this with a Poisson distribution is a reasonable approximation. Recall that NeonMerlin measured the total number of objects exactly, and got 75.  If each of the boxes gets a number of objects that is Poisson-distributed with mean 75/256, then the total number of objects follows a Poisson distribution with mean 75.  That already gives a 0.056 probability that you chose at most 61 objects in total.  When that happens, it's guaranteed that you get at most 61 boxes filled, but I don't see why this would explain any real phenomenon when you always take exactly 75 objects.  – b_jonas 16:16, 9 October 2017 (UTC)
 * I computed more precise numbers, rather than the approximations from yesterday. The probability that you get 61 or fewer distinct first bytes from 75 objects is 0.08398.  The following table gives C(k), the probability that you get k or fewer distinct first bytes.
 * – b_jonas 16:56, 9 October 2017 (UTC)
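[The exact probability quoted above can be reproduced with a small dynamic program (a sketch: after each draw it tracks the probability of having seen exactly k distinct first bytes):]

```python
def distinct_byte_distribution(n_draws=75, n_values=256):
    # p[k] = probability of having seen exactly k distinct values so far
    p = [0.0] * (n_values + 1)
    p[0] = 1.0
    for _ in range(n_draws):
        nxt = [0.0] * (n_values + 1)
        for k, pk in enumerate(p):
            if pk == 0.0:
                continue
            nxt[k] += pk * k / n_values                       # repeats a value already seen
            if k < n_values:
                nxt[k + 1] += pk * (n_values - k) / n_values  # draw is a new value
        p = nxt
    return p

p = distinct_byte_distribution()
print(f"P(at most 61 distinct) = {sum(p[:62]):.5f}")  # ~0.08398
```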
 * Yes, you're quite right; I was depending on the average values approximating the proper ones. I think that assumption is near enough okay in the interest of getting something quickly to show the questioner's figures were okay. However, I went far too far with the variance: the variation in the end value affects it strongly, and I don't see how to get a nice easy estimate of that, which unfortunately is necessary to say that 61 is a reasonable value. Dmcq (talk) 19:21, 9 October 2017 (UTC)

Quoting: "256/256 + 256/255 + 256/254 + … + 256/201. The value of that sum is approximately 76.2". No, the value is approximately 63. J code: +/256%201+i.56 gives 63.0564. Bo Jacoby (talk) 21:39, 9 October 2017 (UTC).
 * Argh. Thanks for the proofreading, Bo Jacoby.  Yes, I've made a mistake in writing down two of those indexes, although I did do the right computation with the right sum.  I edited in place above and underlined the changes.  – b_jonas 22:33, 9 October 2017 (UTC)
 * Dmcq: what I should have said above is that the problem with that Poisson approximation is that it would also show a 0.056 chance of 61 or fewer distinct hashes out of 75 objects even if you took the full hash, not just the first byte, which is absurd. But I don't know any simple heuristic for estimating the variance of the number of distinct first bytes either.  – b_jonas 22:43, 9 October 2017 (UTC)
 * $$\sum_{i=251}^{256} 256/i \approx 256 \int_{250.5}^{256.5} dx/x= 256 \ln \frac {256.5}{250.5}=63$$

No need for me to shoot that sparrow with a cannon! Bo Jacoby (talk) 07:06, 10 October 2017 (UTC).
 * Bo Jacoby, now you're making the same kind of stupid typos as I did. That index should be 201 instead of 251 to get that result, or 191 for the computation I actually did.  And yes, I know you can approximate that sum that way, but that's not very relevant here.  – b_jonas 12:23, 10 October 2017 (UTC)

Sorry. I should have written
 * $$\sum_{i=201}^{256} 256/i \approx 256 \int_{200.5}^{256.5} dx/x= 256 \ln \frac {256.5}{200.5}=63$$

Still the result is 63. I did not code it wrongly. Bo Jacoby (talk) 11:01, 11 October 2017 (UTC).
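[Both the exact sum and the midpoint-rule integral can be checked numerically (a sketch reproducing the 63.0564 from the J one-liner above):]

```python
import math

exact = sum(256 / i for i in range(201, 257))   # 256/201 + ... + 256/256
approx = 256 * math.log(256.5 / 200.5)          # midpoint-rule integral approximation
print(f"exact  = {exact:.4f}")   # 63.0564
print(f"approx = {approx:.4f}")
```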
 * I'm not certain what that 63 is about. The expected number of filled spaces should be the total minus the expected number of empty spaces. The probability of a space being empty is (1 − 1/256)^75, so the expected number of filled spaces is 256 − 256*(1 − 1/256)^75, which evaluates to about 65.1. I think whether a slot is filled can be approximated by a binomial distribution on the chances of an entry colliding, so the variance can be approximated by 75*(65/75)*(1 − 65/75), which gives a standard deviation of nearly 3. So 61 is about 1.4 standard deviations away, by a bit of finger licking. Dmcq (talk) 12:50, 11 October 2017 (UTC)
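[For comparison, the exact mean and standard deviation of the number of occupied boxes follow from indicator variables (a sketch using the standard occupancy identities: with X the sum of per-box occupancy indicators, Var(X) = E[X²] − E[X]², and two given boxes are both empty with probability (1 − 2/256)^75):]

```python
import math

m, n = 256, 75
q = (1 - 1 / m) ** n                  # P(a given box is empty)
r = (1 - 2 / m) ** n                  # P(two given boxes are both empty)

mean = m * (1 - q)                    # exact expected occupied boxes, about 65.1
both = 1 - 2 * q + r                  # P(two given boxes are both occupied)
var = mean + m * (m - 1) * both - mean ** 2
sd = math.sqrt(var)                   # roughly 2.6, a bit below the binomial estimate
print(f"mean = {mean:.2f}, sd = {sd:.2f}")
print(f"(mean - 61) / sd = {(mean - 61) / sd:.2f}")
```

With the exact standard deviation, 61 distinct bytes sits around 1.6 standard deviations below the mean, which is still unremarkable.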