Wikipedia:Reference desk/Archives/Mathematics/2012 November 30

= November 30 =

== The R^2 of aggregated (X,Y) samples is much larger than the R^2 of individual (X,Y) samples ==
I collect and calculate velocity and curvature data for about 15 fruit flies of a certain genotype per experiment. Each experiment samples new data points at some frequency for 8000 frames, and I have N experiments. Strictly speaking, these are log velocity and log curvature data (so that I can calculate a power law), but I'm just going to call them v and k.

If I aggregate the data, i.e. I concatenate the velocity time-series array for one fly with another fly's time series, and so on for all the flies of a genotype (say, 135), and do the same for the curvature time series, so that the nth element of the concatenated velocity array corresponds to the nth element of the concatenated curvature array, I get a pretty high R^2: 0.694.

But when I average the R^2 values of the individual flies, the average is as low as 0.193.

Roughly the same magnitude of difference appears with every other genotype. How do I explain this? I don't think it's Simpson's paradox: I wouldn't expect that to give a higher aggregate R^2 if there were two different distributions, and I don't see two different distributions when I look at the log-log plot. John Riemann Soong (talk) 05:30, 30 November 2012 (UTC)


 * Can you give us an example? I don't understand how you can concatenate different flies' data together. Isn't there a discrepancy at the joins (a discontinuity in the curve)? StuRat (talk) 05:39, 30 November 2012 (UTC)


 * I'm not plotting against time. I have the following pseudocode (summarising)

 speed = []; kurv = [];              % start with empty arrays
 p = 1;
 while p <= num_experiments          % placeholder: number of experiments
     n = 1;
     while n <= num_flies(p)         % placeholder: number of flies in experiment p
         speed = cat(2, speed, cor(n,p).logx);  % concatenation along the 2nd dimension in MATLAB, i.e. joining arrays end to end
         kurv  = cat(2, kurv,  cor(n,p).logy);
         n = n + 1;
     end
     p = p + 1;
 end


 * The result is that I have two very long arrays with all the time series joined together. However, I am plotting the nth element of one array (i.e. velocity) against the nth element of the other (i.e. curvature). The order is kept because both arrays were joined in the same way. For example, say I have the following velocity and curvature arrays taken in time, for a first fly and a second fly:
 * k1 = [a b c d] v1 =[e f g h]
 * k2 = [i j k l] v2 =[m n o p]


 * The nth element of k1 (the first fly) corresponds to the same moment of time as the nth element of k2, v1 or v2, i.e. a, e, i and m all represent data taken at some common frame.
 * When we join the arrays together we get k = [a b c d i j k l] and v = [e f g h m n o p]. The first element of k corresponds to the first element of v; both represent data from fly 1 at the same time. The 5th element of k corresponds to the 5th element of v; both are data for fly 2 at the same time. Correspondence is maintained. You can add flies from other experiments (i.e. data taken on a different day) and the correspondence is still maintained.
 * It doesn't matter that there's a time discontinuity, because we are not plotting against time. The neighborhood of some point (k_x,v_x) on the collective plot might have points from times spaced far apart, or from a different experiment. We just have to ensure that v(n) and k(n) represent the same fly at the same moment. Then we can see how correlated that fly's curvature is to its velocity. Repeat for all n, which might be the same fly at a different moment, a different fly at the same moment, or a different fly at a different moment.
 * I've done autocorrelation and Fourier transform plots, and plots of v(t) against v(t+1) or v(t+k) (a recurrence plot). In general, two data points separated in time for the same fly are not very well correlated, i.e. the velocity at frame 1501 has very little predictive power for the velocity at frame 1512. Thus I think the time discontinuities aren't a big concern and we can ignore the correlation of the observed variables with time. John Riemann Soong (talk) 06:09, 30 November 2012 (UTC)
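The joining scheme described above can be sketched in a few lines of Python. This is only an illustration using the letters from the example; the variable names `flies` and `pairs` are made up here and are not part of the original script.

```python
# Hypothetical per-fly series, using the letters from the example above.
k1, v1 = ["a", "b", "c", "d"], ["e", "f", "g", "h"]   # fly 1: curvature, velocity
k2, v2 = ["i", "j", "k", "l"], ["m", "n", "o", "p"]   # fly 2: curvature, velocity

flies = [(k1, v1), (k2, v2)]

k, v = [], []
for k_fly, v_fly in flies:
    # Both series are extended in the same fly order, so index n of k
    # and index n of v always refer to the same fly at the same frame.
    k.extend(k_fly)
    v.extend(v_fly)

pairs = list(zip(k, v))
print(pairs[0])  # ('a', 'e'): fly 1, first frame
print(pairs[4])  # ('i', 'm'): fly 2, first frame
```

Because a regression of v on k only looks at the set of (k, v) pairs, the time gap at the join between two flies never enters the calculation.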


 * I think I now understand your concern. There would be a discontinuity if each fly (or each experiment) sampled from a different distribution. However when I look at the aggregate plots, the data is very smooth, and I don't see any apparent subdistributions. (I suppose a quantitative way would be to run an ANOVA. But surely running an ANOVA for the time series of 135 flies would make it likely to report a low p-value?) When I compare data from different genotypes, then subdistributions emerge. John Riemann Soong (talk) 06:30, 30 November 2012 (UTC)


 * It would be helpful if you'd link to the concept you're referring to. I'm guessing that by "R^2" you're referring to the coefficient of determination, which seems to measure "goodness of fit". I'm also supposing that you mean the fit to a linear model. I'm not a statistician, but I can see at least one potential problem (and I'm hoping that simply dealing with data sets of different sizes is not a contributor). Even though you're dealing with one genotype, you're dealing with distinct flies within the genotype. There may be random variation in the phenotype from fly to fly: one fly may be slightly faster, and accelerate more rapidly, than another. This will produce a different distribution for each fly. As soon as the data points are not drawn from the same distribution (each fly will have its own), you should start seeing effects such as you are observing. I would expect the large discrepancy in the values that you observe only if there is relatively little moment-to-moment correlation between velocity and acceleration, but a readily measurable difference between the behaviour of individual flies within a genotype. If you take the average velocity and average acceleration for each fly (i.e. you're not looking at correlation for the single fly at all), and determine the correlation between the two sets of averages over the flies of a genotype, I am guessing you'll find this is where your large R^2 comes from. — Quondum 08:38, 30 November 2012 (UTC)


 * Well yes, I'm aware the phenotype will be slightly different, but this is measuring from the same population (no one said that all the members of the population had to be identical!) Also, the individual linear estimators (slope and intercept) when averaged out, are very close to the aggregately-calculated slope and intercept. The 95% confidence intervals for the slope and intercept (where one sample = one fly's slope and intercept, and the population being all the flies) are small enough that I can cleanly separate genotypes with different slopes and intercepts. The 95% confidence interval for data containing time series data (both individual and aggregate) isn't very useful I think, but I have calculated it, with the result that the 95% confidence interval is something on the order of 0.002 for say, a slope and intercept of 0.39 and 0.66 (the 95% CI based on the population (not time series) average of the flies' slopes and intercepts are on the order of 0.02). 128.143.1.41 (talk) 13:15, 30 November 2012 (UTC)


 * If I understand the problem correctly, in your concatenated series the first n data points are from one fly, then the next n data points are from another fly, etc. If there is a lot more variation across flies in the mean of whichever is your dependent variable than there is variation within any one fly of its observed dependent variable values, then your aggregated regression is trying to explain dependent variable variance that is mostly due to the former, and you'll get a good R squared with the aggregate data if the flies are included in say increasing order of the mean of the dependent variable. To see what I mean, look at a plot of the aggregated independent and dependent data (always a good idea in data analysis). Maybe in the lower left of the plot you have a cloud of fly 1's data points, then farther to the right and a lot higher up you have a cloud of the second fly's data, then still farther up and to the right you have another cloud, etc. If you hand draw an approximate regression line, you'll explain the vertical variable's variance very well just because you have a large slope coefficient.


 * So try shuffling the order in which the flies appear in the aggregated series. If you try a large number of such shuffles, on average your aggregated slope should be about the same as the slope of the typical unaggregated regression.


 * Also, you could try putting in a dummy variable for all but one of the flies, which has the effect of shifting all your clouds of data down to where they have the same vertical mean as each other. My guess is that all the dummy variables will be significant, meaning that your aggregated regression is inappropriate in the absence of the dummies.


 * Given that you include dummies, you could also try a Chow test of whether all the slope coefficients are the same for all the flies. Duoduoduo (talk) 14:51, 30 November 2012 (UTC)
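The between-fly versus within-fly argument above can be tried out on made-up data. The sketch below is not the poster's data or script: the five fly means, the noise levels and the helper `ols` are assumptions chosen for illustration, and only the slope 0.39 and intercept 0.66 echo values quoted in the thread. Every simulated fly follows the same line, but each fly covers only a narrow range of x.

```python
import random

def ols(x, y):
    """Least-squares fit y = a + b*x; returns (slope, intercept, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy ** 2 / (sxx * syy)

rng = random.Random(0)                    # fixed seed so the run is repeatable
fly_means = [0.0, 2.0, 4.0, 6.0, 8.0]     # assumed: flies differ a lot in mean x
flies = []
for m in fly_means:
    x = [m + rng.gauss(0, 0.5) for _ in range(200)]          # small within-fly spread
    y = [0.66 + 0.39 * xi + rng.gauss(0, 0.5) for xi in x]   # same line for every fly
    flies.append((x, y))

# R^2 computed fly by fly, then averaged
individual_r2 = [ols(x, y)[2] for x, y in flies]
mean_individual_r2 = sum(individual_r2) / len(individual_r2)

# R^2 of the concatenated ("aggregated") data
all_x = [xi for x, _ in flies for xi in x]
all_y = [yi for _, y in flies for yi in y]
pooled_slope, _, pooled_r2 = ols(all_x, all_y)

# Subtracting each fly's own means before pooling gives the same slope
# as a regression with one dummy variable per fly.
dx, dy = [], []
for x, y in flies:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx += [xi - mx for xi in x]
    dy += [yi - my for yi in y]
within_slope, _, within_r2 = ols(dx, dy)

print(round(mean_individual_r2, 3), round(pooled_r2, 3), round(within_r2, 3))
print(round(pooled_slope, 3), round(within_slope, 3))
```

Under these assumptions the pooled R^2 should come out far above the average individual R^2, while the pooled and within-fly slopes both stay near 0.39. In this toy setup the gap comes purely from each fly spanning a small part of the x range, so a large gap by itself does not show that the aggregated regression is wrong; the dummy-variable and Chow tests suggested above are what would check that.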


 * Actually I messed up in my script. For the individual R^2's, the program was calculating the correlation between one type of curvature and another type! The new differences are more like 0.66 for the individual and 0.69 for the aggregate (for the intercept). 137.54.1.168 (talk) 15:07, 30 November 2012 (UTC)


 * Are you John Riemann Soong? If so, can we mark this Q resolved? StuRat (talk) 23:19, 1 December 2012 (UTC)

== Does the martingale system work if you have a more than 50% chance of winning? ==
Yes, the martingale betting system doesn't work in casinos; the article explains that. But let's imagine some game where you have a more than 50% chance of winning. Would martingale work in this game? — Preceding unsigned comment added by 187.58.189.95 (talk) 13:01, 30 November 2012 (UTC)


 * Inasmuch as it would produce the same expected value as any other method of play.--80.109.106.49 (talk) 18:15, 30 November 2012 (UTC)


 * Let's provide a link: Martingale betting system. StuRat (talk) 00:04, 1 December 2012 (UTC)


 * Also, you probably mean "winning", as "whining" is only a successful strategy for children. :-) StuRat (talk) 00:07, 1 December 2012 (UTC)
 * Surely the OP meant "wining" as in "winning and celebrating by drinking wine". -- Meni Rosenfeld (talk) 20:44, 1 December 2012 (UTC)
 * The martingale system is bad enough for games where the chance to win is <50%, but it's even worse for games where it is >50%. Its insanely high risk squanders your opportunity to profit from the game.
 * Instead you should repeatedly bet small amounts, say 1% of your net worth. After enough times you're guaranteed to become rich (several assumptions apply, of course). -- Meni Rosenfeld (talk) 20:44, 1 December 2012 (UTC)


 * Agreed, assuming that the payoff when you win is at least twice your bet, and also assuming that the game will continue forever. However, that last assumption is impossible, in the real world.  That is, if the house is steadily losing money, then they can't continue doing so for long.  They will either go bankrupt or come to their senses in short order.  So, in that case, you might want to make larger bets, to get their money before somebody else does. StuRat (talk) 23:17, 1 December 2012 (UTC)


 * How is martingale even worse for games where the chance to win is >50%? I don't get it. Let's imagine a roulette wheel that has 0 and 00, where if you bet on even and the ball falls on 0 or 00 you win (and you get 2x the amount you bet, as with even/odd). How would using the martingale system in this case be bad, or worse than betting in some normal way (betting on numbers, black/red, ...)? 201.78.162.15 (talk) 02:41, 2 December 2012 (UTC)
 * If you make 10 $1 bets in this game you will have the same expected earnings as with one $10 bet. But the variance is much smaller in the former case, because variance is quadratic in the bet size but linear in the number of independent bets. More variance means, among other things, more likelihood to go bankrupt before having a chance to profit from this money-making machine.
 * The martingale system dictates that you make very large bets, and is thus bad. You need to carefully choose the optimal bet size taking into account your net worth, the cost of the time required to play the game, the minimal bet size (the lower the minimal bet, the higher your bet should be because you can recover from greater losses) and so on. Martingale doesn't take any of this into account, it just arbitrarily pumps up the bet size. -- Meni Rosenfeld (talk) 09:45, 2 December 2012 (UTC)
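The point about bet size and variance can be checked with exact arithmetic. The sketch below assumes the hypothetical roulette game described above: an even-money bet on "even" that also wins on 0 and 00, so 20 winning pockets out of 38. The function name `bet_stats` is made up for this example.

```python
from fractions import Fraction

def bet_stats(n_bets, stake, p):
    """Mean and variance of total profit from n independent even-money bets."""
    mean_one = stake * (2 * p - 1)           # win +stake with prob. p, lose -stake otherwise
    var_one = stake ** 2 * 4 * p * (1 - p)   # E[X^2] - E[X]^2 = stake^2 * (1 - (2p - 1)^2)
    return n_bets * mean_one, n_bets * var_one

p = Fraction(20, 38)   # assumed: 18 even numbers plus 0 and 00 win, out of 38 pockets

mean_small, var_small = bet_stats(10, 1, p)   # ten $1 bets
mean_big, var_big = bet_stats(1, 10, p)       # one $10 bet

print(mean_small, mean_big)   # 10/19 10/19: same expected profit
print(var_big / var_small)    # 10: the single large bet has ten times the variance
```

The mean scales with the total amount staked, while the variance scales with the square of each stake, which is why splitting the same total into many small bets keeps the expected profit but lowers the risk.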


 * See Kelly criterion for how much to bet if you find a nice little earner paying out on average more than you put in and you want to exploit it for all it's worth as fast as possible. Dmcq (talk) 16:32, 2 December 2012 (UTC)
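As a rough illustration of the Kelly criterion, the sketch below uses the standard formula f* = p - (1 - p)/b for a bet that wins with probability p and pays net odds of b to 1. The numbers again assume the hypothetical roulette game from this thread (p = 20/38, even money), and the function names are made up for the example.

```python
import math
from fractions import Fraction

def kelly_fraction(p, b):
    """Kelly fraction f* = p - (1 - p) / b for win probability p and net odds b."""
    return p - (1 - p) / b

def log_growth(f, p, b):
    """Expected log-growth of the bankroll per bet when staking a fraction f."""
    f, p = float(f), float(p)
    return p * math.log(1 + b * f) + (1 - p) * math.log(1 - f)

p = Fraction(20, 38)   # assumed win probability for the hypothetical game
b = 1                  # even-money payout

f_star = kelly_fraction(p, b)
print(f_star)          # 1/19, i.e. stake a little over 5% of the bankroll each time

# Compare long-run growth for a smaller, the Kelly, and a much larger stake.
for f in (Fraction(1, 100), f_star, Fraction(1, 5)):
    print(f, log_growth(f, p, b))
```

Under these assumptions the growth rate should be highest at f* and negative at 20% of the bankroll per bet, which matches the remark above that over-large bets squander a favourable game.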

== Regression estimators on X as a function of Y have a much smaller confidence interval (spread) than Y as a function of X ==
How do I explain the following data? The v-k slope and intercept come from the linear regression with velocity as the y-variable and curvature as the x-variable; the k-v slope and intercept just switch the two. Why can the v-k parameters be used to separate genotypes much more efficiently than the k-v parameters? The Wilcoxon Mann–Whitney U test comparing the w1118 and C5v2 v-k slopes has a p-value of 8.6*10^-4. The p-value comparing the k-v slopes is 0.0206.

The following are my results; the 95% CI is in parentheses:

average w1118 v-k slope: 0.391291 (0.007820)
average w1118 v-k intercept: 0.658777 (0.015886)
average individual w1118 R^2: 0.667761
collective w1118 R^2: 0.694240
average w1118 k-v slope: 1.713136 (0.021954)
average w1118 k-v intercept: -1.099095 (0.026176)

average C5v2 v-k slope: 0.359071 (0.017672)
average C5v2 v-k intercept: 0.574655 (0.029339)
average individual C5v2 R^2: 0.626207
collective C5v2 R^2: 0.689704
average C5v2 k-v slope: 1.740820 (0.035692)
average C5v2 k-v intercept: -1.003023 (0.031987)

average fumin v-k slope: 0.398489 (0.007759)
average fumin v-k intercept: 0.691860 (0.016663)
average individual fumin R^2: 0.670038
collective fumin R^2: 0.688467
average fumin k-v slope: 1.681955 (0.026843)
average fumin k-v intercept: -1.118567 (0.023364)

average NorpA v-k slope: 0.366310 (0.020219)
average NorpA v-k intercept: 0.627074 (0.025159)
average individual NorpA R^2: 0.642634
collective NorpA R^2: 0.696703
average NorpA k-v slope: 1.758842 (0.035529)
average NorpA k-v intercept: -1.056938 (0.031151)

average CantonS v-k slope: 0.289517 (0.033790)
average CantonS v-k intercept: 0.330509 (0.056035)
average individual CantonS R^2: 0.505073
collective CantonS R^2: 0.648629
average CantonS k-v slope: 1.781094 (0.121968)
average CantonS k-v intercept: -0.742059 (0.049598)

137.54.1.168 (talk) 15:12, 30 November 2012 (UTC)


 * Which variable to put on the left versus the right side of the regression has to depend on underlying theoretical assumptions. If you believe that curvature causes speed (i.e., when the fly wants to curve more sharply it changes its speed), then the regression of curvature on speed is meaningless. If vice versa, then vice versa. Possibly one of the regressions is giving poor results because it specifies causation going in the wrong direction. Duoduoduo (talk) 15:50, 30 November 2012 (UTC)


 * Regression doesn't determine anything about causation. Right? I thought the Y and X stuff was just for data presentation. 137.54.11.174 (talk) 15:56, 30 November 2012 (UTC)


 * If you only look at the correlation coefficient, then you can do that without mentioning causation. But if you try to interpret anything else about the regression, you have to specify the regression in a valid way in advance. You assume y = a + bx + error and to engage in any kind of inference you assume the errors are uncorrelated with the x data. Without that assumption you run into trouble. But assuming the errors are uncorrelated with the x data is equivalent to assuming causation from x and the errors to y (although the strength of the causation from x may or may not turn out to be zero). So you are right that regression doesn't determine anything about causation; but regression results make no sense unless the regression has been specified in accordance with an assumption about causation. Duoduoduo (talk) 16:20, 30 November 2012 (UTC)
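One concrete way to see that the two regressions are not mirror images: for ordinary least squares, the slope of y on x multiplied by the slope of x on y equals R^2, so the k-v slope is R^2 divided by the v-k slope rather than its reciprocal. For w1118 above, 0.391291 × 1.713136 ≈ 0.670, close to the reported average individual R^2 of 0.667761 (only approximately, since these are averages over flies). The sketch below checks the identity on simulated data; the noise level and the names `k`, `v` and `slope` are assumptions for illustration, not the poster's data.

```python
import random

def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

rng = random.Random(1)                                     # fixed seed for repeatability
k = [rng.gauss(0, 1) for _ in range(2000)]                 # hypothetical log curvature
v = [0.66 + 0.39 * ki + rng.gauss(0, 0.28) for ki in k]    # hypothetical log velocity

b_vk = slope(k, v)   # v regressed on k
b_kv = slope(v, k)   # k regressed on v
r2 = r_squared(k, v)

print(b_vk, b_kv, 1 / b_vk)   # b_kv is smaller than 1 / b_vk whenever R^2 < 1
print(b_vk * b_kv, r2)        # equal up to floating-point rounding
```

Because the k-v slope depends on both the v-k slope and R^2, fly-to-fly variation in R^2 feeds into it as well, which may be part of why it separates the genotypes less cleanly; that is a possibility to check against the real data rather than an established explanation.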