Wikipedia:Reference desk/Archives/Mathematics/2018 April 12

= April 12 =

== size of largest gap in partition ==
Suppose I take the integers 1,2,...8000 and pick a random subset $$A=\{a_1,a_2,\dots,a_{1000}\}$$ of size 1000. Define $$a_0=0$$ and $$a_{1001}=8001$$ so you've got a random partition with its endpoints. Call M(A) the largest gap in A, i.e. $$M(A)=\max_{0\le i\le 1000} (a_{i+1}-a_i)$$. M(A) could be as large as 7001, but that is unlikely. I'd like to find an approximate smallest S such that I can be pretty sure ($$p>0.999$$) that $$M(A)<S$$ for random A, while not worrying about "extreme" ($$p\le 0.001$$) outliers.

I don't actually care too much about the exact answer to the above, since numerical simulation is good enough for my immediate purpose (it relates to compressed sensing). What I'm wondering is if this (finding the distribution of gap sizes in partitions) is a well known problem and if there's a standard way to solve it. Thanks! 173.228.123.166 (talk) 00:48, 12 April 2018 (UTC)
 * To recover a classic problem, remove the endpoint breaks. This makes a "rod" of length 8001 broken into 999 pieces and we want the size of the largest chunk. Now, the fact that we break at integer positions might affect the result slightly, but probably not much because the rod is very large. MSE came to my help for remembering the solution (for real-positioned break points), and with your numbers it yields an average length of maximum of $$8001*H_{999}/999 \approx 60$$. However this does not tell us the standard deviation (or the full probability distribution function) that we would need.
 * I will be back with some simulation results shortly. Tigraan Click here to contact me 16:44, 12 April 2018 (UTC)
 * Done. The code is certainly suboptimal but it does the job.


 * The above results in a (rounded) 60±10 maximum interval length in the first case (selection with replacement, i.e. the a_i may not be unique), with the 0.1% "largest largest" (your S value, in effect) starting at 109, and a 57±10 maximum interval length in the second case (without replacement, i.e. the a_i are unique), with the 0.1% "largest largest" starting at 107. Of course your numbers may vary but that's the general idea. A little Python knowledge would of course allow you to modify the script so as to retrieve the full histograms. Tigraan Click here to contact me 17:20, 12 April 2018 (UTC)
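
The script referred to above is not preserved in this archive. The following is a minimal Python sketch of the same kind of brute-force simulation, not Tigraan's original code: the function names (`max_gap`, `simulate`, `summarize`) and the way the 0.1% cutoff is picked are assumptions made for illustration. It covers both cases mentioned, drawing with replacement and without.

```python
import random
import statistics

N_MAX = 8000      # integers 1..8000, as in the question
SUBSET = 1000     # number of chosen points


def max_gap(points, n_max=N_MAX):
    """Largest difference between consecutive points, with endpoints 0 and n_max + 1 added."""
    pts = [0] + sorted(points) + [n_max + 1]
    return max(b - a for a, b in zip(pts, pts[1:]))


def simulate(trials, replace, rng):
    """Return a sorted list holding one maximum gap per trial."""
    results = []
    for _ in range(trials):
        if replace:
            # with replacement: the a_i may repeat
            pts = [rng.randint(1, N_MAX) for _ in range(SUBSET)]
        else:
            # without replacement: the a_i are distinct
            pts = rng.sample(range(1, N_MAX + 1), SUBSET)
        results.append(max_gap(pts))
    results.sort()
    return results


def summarize(results, tail=0.001):
    """Mean, standard deviation, and the value at which the top `tail` fraction starts."""
    k = len(results) - max(1, round(len(results) * tail))
    return statistics.mean(results), statistics.stdev(results), results[k]
```

Calling, for example, `summarize(simulate(10000, False, random.Random()))` gives the mean, the spread and the 0.1% cutoff for the case without replacement; passing `True` instead gives the case with replacement. The numbers vary from run to run, and the cutoff in particular is noisy unless the number of trials is much larger than 1000.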

Gap number i goes from element number $$a_i+1$$ to element number $$a_{i+1}-1$$, for $$0\le i\le 1000$$. The length of this gap is $$a_{i+1}-a_i-1$$. The total length of the gaps is $$8000-1000=7000$$. The number of gaps is 1001. The mean length of the gaps is $$\mu=7000^1 1001^{-1}\approx 7$$. Every gap length g satisfies $$0\le g \le 7000$$. The distribution of gap lengths is approximately a binomial distribution having $$N=7000$$ and $$p=1001^{-1}$$, because this distribution satisfies the above conditions. The standard deviation is $$\sigma=(Np(1-p))^{2^{-1}}$$ $$\approx 7^{2^{-1}}$$. Let $$S=\mu+3\sigma\approx 15$$. Bo Jacoby (talk) 20:43, 12 April 2018 (UTC).
 * I believe you made a reasoning error here. When making the gaussian approximation, you are applying the central limit theorem to study a tail of the distribution (since OP's problem is about the one largest gap, not the average gap or the first decile gap or...), and IIRC that does not work. Tigraan Click here to contact me 12:35, 13 April 2018 (UTC)
 * (What is IIRC?) Approximations are kind of errors even when they provide useful results. Surely there is room for improvement. Bo Jacoby (talk) 19:43, 13 April 2018 (UTC).
 * IIRC = If I Remember Correctly. (FYI, it's in Wiktionary.)
 * No, approximations are not "kinds of errors" and for them to be useful they need a theoretical background, not just "here's some calculations I made up, maybe they work." That you think otherwise is horrifying (although emblematic of the terribleness of the RD).  Things are approximately binomial because there is a convergence theorem, not just because that would be convenient!  This kind of garbage answer is a disservice to the question-asker. --JBL (talk) 12:29, 14 April 2018 (UTC)
 * The problem I have with your reasoning is not that for finite N the binomial distribution is slightly different from a Gaussian. The problem I have is that the binomial distribution is weakly convergent to the Gaussian, and it converges "near the center", and it is not valid even for very large N to apply it to the events on the tails.
 * I recognize this is not a rigorous way of phrasing it but my math is lacking to make the point more precisely. For an example: if you flip a coin N times (binomial distribution with p=0.5) the probability of "all tails" is 0.5^N or 2^(-N), whereas in the gaussian approximation with mean = 0.5N, variance = 0.25N, the PDF at zero decays at a different exponential rate: $$\frac{1}{\sqrt{2\pi \cdot 0.25N}}e^{-(0.5N)^2/(0.5N)} = \frac{1}{\sqrt{0.5\pi N}}e^{-N/2} \gg 2^{-N}$$ for large N. I.e. the approximation gets worse as N grows larger, roughly by an exponential factor that grows as $$(2/\sqrt{e})^N$$. Now again I am not sure this is the correct benchmark against which to compare your solution, but I am sure it deserves a bit of care. Tigraan Click here to contact me 20:24, 15 April 2018 (UTC)
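
The coin-flip comparison can be checked numerically. The sketch below is only an illustration with made-up function names: it evaluates the exact "all tails" probability and the density at zero of a normal distribution with mean 0.5N and variance 0.25N, and prints their ratio so that the size and direction of the mismatch are visible for a few values of N.

```python
import math


def exact_all_tails(n):
    """Exact probability of n tails in n fair coin flips."""
    return 0.5 ** n


def gaussian_density_at_zero(n):
    """Density at x = 0 of a normal distribution with mean 0.5 n and variance 0.25 n."""
    mean, var = 0.5 * n, 0.25 * n
    return math.exp(-(0 - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


for n in (10, 50, 100, 200):
    ratio = gaussian_density_at_zero(n) / exact_all_tails(n)
    # columns: N, exact probability, gaussian density at 0, ratio of the two
    print(n, exact_all_tails(n), gaussian_density_at_zero(n), ratio)
```

Comparing a density with a point probability is itself a rough benchmark, as the comment above already concedes; the sketch only shows that the two quantities drift apart exponentially in N.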

=== improvement ===
The size, $$X$$, of a randomly chosen gap is approximately Poisson distributed with mean value 7. The probability that $$X$$ assumes the value $$i$$ is $$(X=i)\approx e^{-7}\prod_{k=1}^i 7^1 k^{-1}$$ for $$i=0, 1, 2, \cdots, 7000$$. The probability that $$X$$ is not greater than $$S$$ is $$(X\le S)=\sum_{i=0}^S (X=i)$$. The probability that none of the sizes of 1001 randomly chosen gaps is greater than $$S$$ is $$(X\le S)^{1001}$$. The OP wants $$(X\le S)^{1001}\ge 1-10^{-3}$$.


 * $$(X\le S)\ge (1-10^{-3})^{1001^{-1}}\approx 1-10^{-6}$$.
 * $$(X>S)<10^{-6}$$.
 * $$\sum_{i=S+1}^\infty (X=i)\approx (X=S+1)<10^{-6}$$.
 * $$e^{-7}\prod_{k=1}^{S+1} 7^1 k^{-1}<10^{-6}$$.
 * $$\prod_{k=1}^{S+1} 7^1 k^{-1}<e^{7} 10^{-6}\approx 10^{-3}$$.

Now $$\prod_{k=1}^{23} 7^1 k^{-1}\approx 0.001$$ so the answer is $$S=22$$. Bo Jacoby (talk) 06:01, 14 April 2018 (UTC).
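
The arithmetic of this Poisson estimate can be reproduced with a few lines of Python. This is a sketch of the calculation as stated, under its own assumptions (Poisson gaps with mean 7, independence between gaps, per-gap tail target of 10^-6); the function names are invented here. One function uses the single-term tail estimate from the derivation, the other sums the tail explicitly.

```python
import math

MEAN = 7.0       # approximate mean gap length, 7000 / 1001
TARGET = 1e-6    # per-gap tail probability used in the derivation


def poisson_pmf(i, mean=MEAN):
    """P(X = i) for a Poisson variable, computed via logarithms to avoid overflow."""
    return math.exp(-mean + i * math.log(mean) - math.lgamma(i + 1))


def smallest_s_single_term(target=TARGET):
    """Smallest S with P(X = S + 1) < target (the one-term tail estimate)."""
    s = int(MEAN)  # start at the mean, where the probabilities are already decreasing
    while poisson_pmf(s + 1) >= target:
        s += 1
    return s


def smallest_s_full_tail(target=TARGET, terms=200):
    """Smallest S with P(X > S) < target, summing the tail term by term."""
    s = int(MEAN)
    while sum(poisson_pmf(i) for i in range(s + 1, s + 1 + terms)) >= target:
        s += 1
    return s
```

`smallest_s_single_term()` reproduces the value S = 22 found above. Because the first tail term is very close to the target, summing the whole tail with `smallest_s_full_tail()` can come out one higher.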


 * There is no independence here, so your claim that the probability is a certain 1001th power is completely false, and none of the subsequent calculations have value. --JBL (talk) 12:33, 14 April 2018 (UTC)
 * The dependence when some gaps are very big is rare, and so the approximation is good. Please improve the calculation rather than merely criticizing. Bo Jacoby (talk) 13:26, 14 April 2018 (UTC).
 * There is no way to "improve" this calculation, because it is just pseudomathematics. The improvement is to discard it and do something that makes sense.  When someone answers a question with nonsense, it is a service to the question-asker to point out that the answer is nonsense.  What would be even better is if you did not post nonsense answers. --JBL (talk) 14:16, 14 April 2018 (UTC)

Then do something that makes sense. Solve the problem if you can. The sum of the sizes of the 1001 gaps is exactly $$7000$$, while the sum of the Poisson distributed variables is $$7000\pm 84$$. So there is indeed dependence. Taking this dependence into account complicates the calculations without improving the result. Bo Jacoby (talk) 14:58, 14 April 2018 (UTC).
 * There is already a completely adequate discussion by Tigraan above! --JBL (talk) 17:59, 14 April 2018 (UTC)

Tigraan objected against using the gaussian distribution, which is consequently not used in the improvement section. Can JBL or Tigraan solve the problem? Bo Jacoby (talk) 18:38, 14 April 2018 (UTC).
 * Tigraan gave a clear and correct answer before you ever commented on this thread! I know they say that the internet is write-only, but really. --JBL (talk) 21:12, 14 April 2018 (UTC)

From 8000 consecutive numbers remove 1000 random numbers to get 7000 numbers divided into 1001 chunks (or gaps). Tigraan's formulation: 'This makes a "rod" of length 8001 broken into 999 pieces and we want the size of the largest chunk' should read: 'This makes a "rod" of length 7000 broken into 1001 pieces and we want the size of the largest chunk', and 'the average length of maximum' should read: $$7000*H_{1001}/1001\approx 52$$. I redid Tigraan's brute force simulation in J.

    A=.1+(10000$1000)?8000 NB. 10000 random subsets
    A=.(/:{])"1 A          NB. sort
    A=.0,.A,.8001          NB. append endpoints
    A=.(}.-}:)"1 A         NB. compute gap sizes
    A=.>./"1 A             NB. max gap size
    E=.+/%#                NB. Expected value
    sd=.E&.:*:&(-E)        NB. standard deviation
    (E,sd)A                NB. compute E and sd
 56.6289 9.56994

This calculation does not answer the question posed by the OP.
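
For reference, the two harmonic-number estimates quoted in this thread can be evaluated directly. The sketch below uses the continuous broken-rod formula (mean longest piece = length × H_n / n for n pieces); the function names are chosen here for illustration, and the formula ignores the restriction to integer break points.

```python
def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def expected_longest_piece(length, pieces):
    """Mean length of the longest piece when a rod of the given length is
    broken at uniformly random real positions into `pieces` pieces."""
    return length * harmonic(pieces) / pieces


print(expected_longest_piece(8001, 999))    # first formulation in the thread
print(expected_longest_piece(7000, 1001))   # formulation proposed just above
```

Note that the two formulations measure slightly different things: the question's M(A) is a difference of consecutive elements, which is one more than the number of unchosen integers between them.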

Once in 1000 times the maximum gap length was 117.

    >./>./(}.-}:)"1]0,.8001,.~(/:{])"1]1+(1000$1000)?8000
 117

I was aggrieved by JBL's insolence, but I deserved it. Bo Jacoby (talk) 04:33, 15 April 2018 (UTC).

I'm coming into this late, and I haven't read all of the discussion, but I ran 1 million trials, and if I haven't made a mistake, I get 38. Bubba73 You talkin' to me? 06:40, 16 April 2018 (UTC)


 * Whoops, I got it backwards. 99.9% of the time M < 38.  99.9% of the time M < 69.  Bubba73 You talkin' to me? 07:53, 16 April 2018 (UTC)
 * Actually, < 70. Bubba73 You talkin' to me? 14:08, 16 April 2018 (UTC)

I wrote above:
 * The size, $$X$$, of a randomly chosen gap is approximately Poisson distributed.

This is a serious mistake. The simulation indicates that
 * The size, $$X$$, of a randomly chosen gap is approximately geometrically distributed.

Then the argument goes as before.

The probability, that $$X$$ assumes the value $$i$$, is $$(X=i)\approx p^1 q^i$$, where $$p=1001^1 7000^{-1}$$ and $$q=1-p$$. The probability, that $$X$$ is not greater than $$S$$, is $$(X\le S)=\sum_{i=0}^S (X=i)$$. The probability, that no sizes of the 1001 (approximately independent) gaps are greater than $$S$$, is $$(X\le S)^{1001}$$. The OP wants $$(X\le S)^{1001}\ge 1-10^{-3}$$.


 * $$(X\le S)\ge (1-10^{-3})^{1001^{-1}}\approx 1-10^{-6}$$.
 * $$(X>S)<10^{-6}$$.
 * $$(X>S)=\sum_{i=S+1}^\infty (X=i)=p^1 q^{S+1}\sum_{i=0}^\infty q^i=q^{S+1}$$.
 * $$S\ge 89$$.

Bo Jacoby (talk) 14:52, 16 April 2018 (UTC).
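
The last step of this geometric estimate (solving q^(S+1) < 10^-6 for S) can be checked as follows. This is a sketch of the calculation exactly as stated above, taking p = 1001/7000 from the text and assuming independent gaps; the function names are invented for illustration.

```python
import math

P = 1001 / 7000   # parameter p as stated in the text
TARGET = 1e-6     # per-gap tail probability


def geometric_tail(s, p=P):
    """P(X > s) = q**(s + 1) when P(X = i) = p * q**i for i = 0, 1, 2, ..."""
    q = 1 - p
    return q ** (s + 1)


def smallest_s(target=TARGET, p=P):
    """Smallest S with P(X > S) < target."""
    q = 1 - p
    # q**(S + 1) < target  <=>  S + 1 > log(target) / log(q)
    return math.floor(math.log(target) / math.log(q))


print(smallest_s())
```

With the stated p this gives S = 89, matching the result above. The function accepts other values of p, so the sensitivity of S to the choice of parameter is easy to explore.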

I checked my program and reran it on 1,000,000 simulations.

The smallest max gap was 34. The largest max gap was 157. 99.9% of the max gaps are <= 69, so S = 70. Bubba73 You talkin' to me? 17:53, 16 April 2018 (UTC)
 * The question is not that 99.9% of all gaps on a large number of attempts fall below S. The question is that for one given attempt, the probability of at least one gap being >S is lower than 0.1%. That is not the same thing because there are multiple gaps in each attempt. Basically that is a confusion between median and mean. Tigraan Click here to contact me 11:58, 17 April 2018 (UTC)
 * That is the way I interpreted it - 99.9% of the time the largest gap is < 70. Bubba73 You talkin' to me? 17:20, 18 April 2018 (UTC)
 * something is definitely wrong with your experiment or how you've implemented it. --JBL (talk) 11:14, 19 April 2018 (UTC)
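
The distinction drawn above, between a cutoff that 99.9% of all individual gaps fall below and a cutoff that the largest gap of a single trial falls below 99.9% of the time, can be made concrete with a small experiment. The sketch below is an illustration only and does not claim to reproduce anyone's program from this thread; the function names and the quantile convention are assumptions.

```python
import random


def gaps(rng, n_max=8000, k=1000):
    """Differences between consecutive chosen points, with endpoints 0 and n_max + 1."""
    pts = [0] + sorted(rng.sample(range(1, n_max + 1), k)) + [n_max + 1]
    return [b - a for a, b in zip(pts, pts[1:])]


def quantile(values, frac):
    """Simple empirical quantile: the element at position frac * len in sorted order."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(frac * len(ordered)))]


def compare(trials, rng, frac=0.999):
    """Return (quantile over all gaps pooled, quantile over per-trial maxima)."""
    pooled, maxima = [], []
    for _ in range(trials):
        g = gaps(rng)
        pooled.extend(g)          # every gap of every trial
        maxima.append(max(g))     # only the largest gap of each trial
    return quantile(pooled, frac), quantile(maxima, frac)
```

For instance, `compare(10000, random.Random())` returns both cutoffs from the same set of trials. The pooled cutoff is the smaller of the two, because each trial contributes 1001 gaps but only one maximum; only the second number addresses the question as posed.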

My above simulation of 1000 cases gave the max gap size 117. This indicates that in one out of 1000 cases the max gap size is like that. Repeating the simulation gave the numbers 99 101 100 95 103 103 122 111. This is at variance with the analytic attempt (S=89) and Bubba73's simulation (S=70). A mystery needs to be clarified. Bo Jacoby (talk) 12:41, 17 April 2018 (UTC).
 * If it's a vote, then I get 9990491/10000000 tests with maxgap < 104 (1213/10000000 with maxgap = 104, 1273/10000000 with maxgap = 103). --catslash (talk) 23:23, 17 April 2018 (UTC)
 * Repeating that with a 'better' pseudo-random number generator gives 9990195/10000000 tests with maxgap < 104 (1255/10000000 with maxgap = 104, 1414/10000000 with maxgap = 103) - and it makes the tails noticeably smoother. --catslash (talk) 00:24, 18 April 2018 (UTC)

To everyone who replied: thanks very much for these answers, which I found enlightening and useful. Sorry I wasn't able to reply earlier. 173.228.123.166 (talk) 03:02, 19 April 2018 (UTC)