Wikipedia:Reference desk/Archives/Mathematics/2019 June 21

= June 21 =

Random fragments of a line segment
This question was inspired by wondering about the fragment size distribution from randomly sheared DNA, but it is a math question. Say you have a line segment of length 1 (or length N) and you place n random cuts along it. Assume each cut can land anywhere on the segment with uniform probability, and that the position of each cut is independent of the other cuts (in particular, a cut can absolutely land between previous cuts). What is the expected distribution of fragment lengths after this process? I have a gut feeling that it is not the multinomial distribution, but I cannot quite express why in mathematical terms. Someguy1221 (talk) 10:20, 21 June 2019 (UTC)
 * I'm assuming the length of the initial segment is 1; it should be easy to scale to length N from there. The first thing to settle is whether the distribution depends on which segment is chosen. In other words, it's conceivable that if the two points selected are y_1 < y_2, the distributions of y_1, y_2 - y_1, and 1 - y_2 are not all the same. I believe that, in fact, they are all the same, and I'll attempt a justification for this later, but for now assume that this is the case. Then the overall distribution is the same as the distribution of the first segment; in other words, it's the distribution of min{x_i} where the x_i are independent and uniformly distributed on [0, 1]. The cumulative distribution function is then 1 - (1 - x)^n and the probability density function is n(1 - x)^(n-1). (This is a special case of the beta distribution, but I'm not sure if it has its own name. The mean is 1/(n+1) as you would expect, but the distribution is highly skewed, so the median is 1 - 1/2^(1/n).)
 * To see that the distribution does not depend on which segment is chosen, instead start with a circle of circumference 1 and assume n+1 cuts are made. It is clear by symmetry that the different segments now have the same distributions. But if the circle is rotated by any amount, then there is no change in the distributions. So rotate the circle so that the first point, x_0, coincides with 0, and we get the original configuration. Note that this assumes the distributions are uniform; if not, then I imagine the calculations get much more involved. --RDBury (talk) 12:18, 21 June 2019 (UTC)
 * PS, I'm assuming the segment is continuous, but it now occurs to me that with DNA you might want to assume the cuts have a discrete distribution. If so, then how would you account for cases where two or more cuts occur at the same point? Also, it's hard to imagine that this computation hasn't been done already since it seems relevant to DNA sequencing, so I expect it's somewhere in the literature on the subject. --RDBury (talk) 12:32, 21 June 2019 (UTC)
 * Thank you. You're right, things like this have been presented in the DNA literature, usually as a null hypothesis that is rapidly rejected in favor of empirical evidence of non-random cutting. This paper actually gives the exact same formula you did for the probability density function for their initial simplest model. There's something I don't get about this, though. Namely, why is it monotonically decreasing over the entire range 0..1? I guess I would have expected a mode at some non-zero value that depended on n. Someguy1221 (talk) 02:24, 24 June 2019 (UTC)
 * Yeah, I thought it was a bit counterintuitive myself. Perhaps the issue is that in statistics you deal a lot with counts, sums or averages of quantities, and their distributions usually follow a bell curve of some sort. But other operations, such as minimums, produce other shapes. Examples include the Exponential distribution and Geometric distribution, both of which describe the first occurrence of something, and you might think of this result as fitting into that framework. Specifically, the length is the first occurrence of a break after a given break. --RDBury (talk) 11:55, 24 June 2019 (UTC)
 * Thanks. I just wanted to also note that I find it highly satisfying to realize that if "intensity" is defined as x*P(x), as in "amount of the original line in segments of size x", the mode appears to be at 1/n. Someguy1221 (talk) 22:27, 24 June 2019 (UTC)
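For anyone who wants to check the formulas in this thread numerically, here is a minimal simulation sketch. It assumes a unit segment with continuous uniform cuts, as in the answer above. The function names (`fragment_lengths`, `simulate`, `cdf`, `empirical_cdf`, `intensity_mode`) and the choices n = 5 and 20000 trials are made up for illustration and do not come from the discussion. It pools all n+1 fragments from each trial, which relies on the symmetry argument that every fragment has the same distribution.

```python
import random


def fragment_lengths(n_cuts, rng):
    """Cut the unit segment at n_cuts independent uniform points.

    Returns the n_cuts + 1 fragment lengths, left to right.
    """
    cuts = sorted(rng.random() for _ in range(n_cuts))
    points = [0.0] + cuts + [1.0]
    return [b - a for a, b in zip(points, points[1:])]


def simulate(n_cuts, trials, seed=0):
    """Pool the fragment lengths from many independent trials."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(trials):
        lengths.extend(fragment_lengths(n_cuts, rng))
    return lengths


def cdf(x, n_cuts):
    """Claimed cumulative distribution: 1 - (1 - x)^n."""
    return 1.0 - (1.0 - x) ** n_cuts


def empirical_cdf(lengths, x):
    """Fraction of simulated fragments with length <= x."""
    return sum(1 for v in lengths if v <= x) / len(lengths)


def intensity_mode(n_cuts, grid=10000):
    """Grid search for the maximum of x * n(1 - x)^(n-1) on [0, 1]."""
    xs = [i / grid for i in range(grid + 1)]
    return max(xs, key=lambda x: x * n_cuts * (1.0 - x) ** (n_cuts - 1))


if __name__ == "__main__":
    n = 5
    lengths = simulate(n, trials=20000)
    mean = sum(lengths) / len(lengths)
    median = sorted(lengths)[len(lengths) // 2]
    print("mean:     simulated", mean, " formula", 1 / (n + 1))
    print("median:   simulated", median, " formula", 1 - 0.5 ** (1 / n))
    print("CDF(0.2): simulated", empirical_cdf(lengths, 0.2), " formula", cdf(0.2, n))
    print("intensity mode:", intensity_mode(n), " expected about", 1 / n)
```

The simulated mean matches 1/(n+1) up to floating-point rounding, since each trial's n+1 fragments sum to 1. The median and CDF comparisons are only approximate and depend on the number of trials.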