Wikipedia:Reference desk/Archives/Mathematics/2019 June 17

= June 17 =

== Comparing two distributions with low sample size ==
I'm looking for a defensible method to compare two arbitrary distributions with a very small sample size (16 values each).

Brief background: I have a few hundred datasets from a roving window GIS analysis. Each consists of a percentage area breakdown over 16 landcover types. I want to compare these to the breakdown for the total area, to find the window with the landcover distribution closest to that of the total area. My notion for assessing this similarity is to treat the breakdown as an empirical distribution function, and test each window's function against the total area function. Neither distribution will have even a nodding acquaintance with a normal distribution, naturally; I was going to sort values by magnitude in the total area to smooth things somewhat, and use that order for all sets.

I was poking around Kullback-Leibler divergence and Kolmogorov-Smirnov test, but I can't quite figure out how happy either of these is with a sample size of 16. Does anyone know, or have a suggestion for a better metric? I'd be looking at an R implementation. -- Elmidae (talk · contribs) 21:40, 17 June 2019 (UTC)
 * Well, turns out neither is producing very useful results - specifically, KS only produces its test statistic in very broad intervals (300 wildly different distributions, but they each get one of the same 5 statistics...), which makes it pointless for fine ranking. Looking at Wilcoxon signed-rank test instead. -- Elmidae (talk · contribs) 22:08, 18 June 2019 (UTC)
 * At the risk of dating myself, I use the KS algorithm in Numerical Recipes, and have used it for small data numbers. I also use the related Kuiper test for instances where the data reside on a circle (also discussed in NR). Either way, the data need to be declustered if they are obtained from sampling a time series (or other autocorrelated source). Attic Salt (talk) 22:15, 18 June 2019 (UTC)
 * GIS isn't my area, so please take my comments with an appropriate pinch of salt. That said, I've seen (very vaguely) similar problems in other areas. My first thought is that KS tends to be used for one-dimensional continuous distributions (though there are ways to apply it to discrete distributions - see article for details), so my instinct is to look elsewhere. If I understand your question correctly, the data that you're looking at seem to be categorical in nature, so KS seems unlikely to be helpful. (You talk about "landcover types", which I take to mean something like "grass", "asphalt", "trees", "bare ground" and so on; I suppose you could order these in some way to get yourself into a one-dimensional world where KS might be a bit more applicable, but I'm inclined to suspect not.) What went through my mind was the idea of supposing that there is some empirical distribution of your 16 landcover types (e.g. 10% grass, 10% asphalt, 20% trees, 5% bare ground, etc.), and then for each of your data sets there's an observed distribution (e.g. 12% grass, 9% asphalt, 21% trees, 3% bare ground, etc.). Now, this reminds us of Chi-squared testing, so why not use the Chi-squared statistic? We do it like this: we suppose that the underlying empirical distribution is A% grass, B% asphalt, C% trees, D% bare ground, and so on. Now for each of our sets we can calculate the Chi-squared statistic. Our goal is then to suitably minimise the total of all the Chi-squared statistics over all your datasets by varying A, B, C, D, etc. This, of course, is not an easy problem, but we're in the world where we have ludicrous amounts of raw computing power on tap, so inelegant methods are allowed, and we could use something like simulated annealing. (The reason I suggest this approach is that I've used it in solving similarly complicated problems in the past, though not using the Chi-squared statistic as the target to be minimised.)
Finally, if I understand the problem correctly, one other possibility may hold: it may be that there is no single useful answer, as a general description of the total area (single values of A, B, C, D, etc.) may not adequately capture underlying gradients of different landcover types over the total area. But, as I've already said, this isn't my area of expertise, so you probably understand what can work and why in different geographical contexts. RomanSpa (talk) 23:28, 18 June 2019 (UTC)
 * Not sure I follow you here. I'm not trying to construct a hypothetical perfect combination (window) out of sampled category percentages - I'm trying to determine the one among the samples that is closest to the comparison set; which I suppose could be regarded as the "underlying distribution". Wouldn't I thus just compute the statistic for each 16-value set vs the comparison 16-value set, then pick the one with the lowest outcome? But: IIRC, a chi-square test cannot actually be used on percentage data? - it would have to be the count data, and that is not possible because the windows all have slight differences in size (due to irregular borders) and thus different sum totals of observations. -- Elmidae (talk · contribs) 00:16, 19 June 2019 (UTC)
 * Of course, I should have looked at the Chi-squared article myself before making my comment. Turns out there's a really handy example of Chi-squared for categorical data here. This deals with your "percentage data" issue, I think. I think I was misunderstanding your problem earlier: I thought you wanted to find the "average landcover proportions" and find the one closest to that, rather than just finding the dataset with characteristics closest to the total range. Sorry for this - when I do this sort of thing myself I usually have to go a different route from you. However, the example in the Chi-squared article looks like it should take you in the right direction. Please feel free to contact me directly if you want to discuss this further. RomanSpa (talk) 13:03, 19 June 2019 (UTC)
 * How did you get on with KL distance? What you would want to calculate is $$D_\text{KL}(P_{window} \parallel P_{population})$$ for each window, where $$P_{window}(x)$$ is the chance that a random point from your window has land-cover type $$x$$, and $$P_{population}(x)$$ is the chance that a random point from the whole population has that land-cover type; then choose the window for which this statistic is lowest. This gives you the window whose distribution is closest to the overall distribution in the following sense. Suppose you wanted to communicate the land-use of one point (not pre-specified) in your window, using a code based on a distribution, so that the code-length is related to $$- \log_2 p$$: a very likely land-use gets a very short code, while a very unlikely land-use is assigned a much longer code, which minimises the average length of the message you would need to send. The window that minimises the K-L distance is then the one where you suffer the least expected additional code-length by using codes based on the overall average land-use distribution, rather than the exact land-use distribution for that window. Note that the K-L distance is not symmetric, so be careful to calculate it the right way round: if a land-use does not occur in your window, that should contribute $$ 0 \times \log (0 / p_{pop}) $$ (= 0) to the expected extra message length, not $$ p_{pop} \times \log (p_{pop} / 0) $$ (= ∞). The first is the contribution for a possible thing (according to the population probabilities) that has not occurred, giving a low K-L contribution; the second is for an impossible thing that has occurred, giving a very high K-L contribution. If you run these calculations (which should be pretty quick), does this turn out to be a sane distance function?
Note that it will assess windows where globally unlikely things did happen as more "deviant" than windows where quite likely things didn't happen, though the latter will be marked down a bit. Jheald (talk) 14:09, 19 June 2019 (UTC)
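
The K-L ranking described in the reply above could be sketched roughly as follows. This is only an illustration in plain Python (it should port to R with little change): the function name `kl_divergence`, the four-category proportions and the window labels are all made up for the example and are not taken from the thread's data. Zero window proportions are treated as contributing 0, as described above.

```python
import math

def kl_divergence(p_window, p_population):
    """D_KL(P_window || P_population) in bits, for two lists of
    proportions given over the same categories in the same order."""
    total = 0.0
    for p_w, p_pop in zip(p_window, p_population):
        if p_w == 0.0:
            # A land-cover type absent from the window contributes
            # 0 * log(0 / p_pop), which is taken to be 0.
            continue
        if p_pop == 0.0:
            # Something "impossible" under the population occurred.
            return math.inf
        total += p_w * math.log2(p_w / p_pop)
    return total

# Hypothetical proportions for 4 land-cover types (illustration only).
population = [0.10, 0.10, 0.20, 0.60]
windows = {
    "A": [0.12, 0.09, 0.21, 0.58],
    "B": [0.00, 0.15, 0.25, 0.60],
    "C": [0.10, 0.10, 0.20, 0.60],
}

scores = {name: kl_divergence(p, population) for name, p in windows.items()}
# Lowest divergence first: the window closest to the population.
ranking = sorted(scores, key=scores.get)
print(ranking)
```

Swapping the two arguments computes the divergence the other way round, which would return infinity for any window that lacks a land-cover type present in the population.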

All right - once I got my head out of my behind and realized that for a Chi-square GOF test I want to compare counts (which will be summed per sample) to the distribution in the comparison set, and that different set sizes thus don't matter, it's actually working out. I can just rank sets by increasing Chi-square statistic (also staying away from significance testing, which is always nice), which accords with a common sense inspection of the distributions. So thanks, that was the right direction. Also thanks to for an accessible explanation of what looks like the kind of approach I might eventually have to circle back to, if someone demands a more accountable process! - I swear I'll make it to twenty years of system dynamics modeling and still faceplant on elementary statistics :p... -- Elmidae (talk · contribs) 18:26, 19 June 2019 (UTC)
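
A minimal sketch of the ranking described above, assuming each window is available as raw counts per land-cover type and the comparison set as proportions. The function name `chi_square_statistic`, the counts and the window labels are hypothetical, and every expected proportion is assumed to be greater than zero.

```python
def chi_square_statistic(counts, expected_proportions):
    """Pearson chi-square goodness-of-fit statistic for observed counts
    against expected proportions (same category order in both lists)."""
    n = sum(counts)
    stat = 0.0
    for observed, proportion in zip(counts, expected_proportions):
        expected = n * proportion  # expected count for this window's size
        stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical comparison-set proportions for 4 land-cover types.
population = [0.10, 0.10, 0.20, 0.60]

# Hypothetical windows with different total counts (illustration only).
windows = {
    "A": [12, 9, 21, 58],     # 100 observations
    "B": [30, 30, 60, 180],   # 300 observations, exactly proportional
    "C": [5, 30, 25, 60],     # 120 observations
}

stats = {name: chi_square_statistic(c, population) for name, c in windows.items()}
# Smallest statistic first: the window closest to the comparison set.
ranking = sorted(stats, key=stats.get)
print(ranking)
```

Note that the statistic grows with the total count for a fixed pattern of proportions, so if window sizes differed a lot rather than slightly, dividing each statistic by its window's total would be one way to keep the ranking comparable; that is a design choice, not something stated in the thread.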
 * The chi-squared statistic and the KL distance should probably give you reasonably similar rankings -- they're both examples of f-divergences, as is the Hellinger distance if you want to try another one. Not sure if there is a good characterization that discusses which sorts of differences register comparatively more strongly in one rather than the others, or the different ways their measures of closeness can be understood.  Some may be primarily useful in the proving of theorems.  Chi-squared, as was said, relates to the chance the observed counts could have occurred randomly, from the population distribution.  Any of them is probably reasonable for what you want to do. Jheald (talk) 20:23, 19 June 2019 (UTC)
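
For anyone wanting to try the Hellinger distance mentioned above, a rough sketch follows. It uses the standard definition for discrete distributions; the function name `hellinger_distance` and the example proportions are made up for illustration. Unlike the K-L distance it is symmetric and always lies between 0 and 1, so zero proportions need no special handling.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions given as
    lists of proportions over the same categories (range 0 to 1)."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)

# Hypothetical proportions for 4 land-cover types (illustration only).
population = [0.10, 0.10, 0.20, 0.60]
windows = {
    "A": [0.12, 0.09, 0.21, 0.58],
    "B": [0.00, 0.15, 0.25, 0.60],
}

distances = {name: hellinger_distance(p, population) for name, p in windows.items()}
# Smallest distance first: the window closest to the population.
ranking = sorted(distances, key=distances.get)
print(ranking)
```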