Wikipedia:Reference desk/Archives/Mathematics/2010 September 4

= September 4 =

Need a better parametric test for a sine-wave distribution.
Lunar cycle theorists postulate cyclic fish activity associated with major and minor lunar periods. This creates a theoretical sine wave centered on the average catch/activity rate. Usually this is illustrated with sine-wave-like imagery with peaks at the top and the negative half of the wave flipped upward, since the data for major, minor, and in-between hours can only be nonnegative integer values.

I have a large data set of fish catches covering over 10,000 angling hours, with the majority of catches 0 or 1 per hour but a few hours ranging up to 9 or 10 catches per hour. I believe the data to be parametric. Using one-way AOV (Statistix 9), I get a P small enough to declare my overall data collection highly significant by scientific standards. However, breakdowns of the data into subsets by type of water fished and fish size show that several dependent variables affect results and produce less-than-significant data sets under AOV. These subsets still show peaks near the major and minor hours, as the lunar cycle postulates predict.

Are there statistical methods that focus on this tendency to have peaks in the right places, rather than just seeking sufficient differences between means and SDs? Please provide enough detail that I might apply your input.

Also, please comment if you feel my use of one-way AOV is inappropriate for this type of data. I'm long out of touch with academic sources of statistical guidance.

19:24, 4 September 2010 (UTC) —Preceding unsigned comment added by RManns (talk • contribs)
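As an aside for readers of this archive: the one-way AOV comparison described above can be sketched in Python with SciPy. Everything numeric here is hypothetical — the Poisson rates and the group sizes are invented for illustration, not taken from the actual data set:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
# Hypothetical hourly catch counts grouped by lunar period (rates are made up)
major = rng.poisson(1.4, 400)      # hours near major lunar periods
minor = rng.poisson(1.1, 400)      # hours near minor lunar periods
other = rng.poisson(0.8, 3000)     # all remaining angling hours

# One-way ANOVA across the three groups of hourly counts
stat, p = f_oneway(major, minor, other)
print(f"F = {stat:.2f}, p = {p:.2g}")
```

If the group means really differ, `f_oneway` returns a small p. Note, though, that counts like these are Poisson-distributed rather than normal, which is one reason the standard ANOVA assumptions are questionable here — a point the replies below touch on.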


 * The observed nonnegative integer number of fish caught within an hour, n, has a Poisson distribution with some nonnegative real parameter λ. So λ has a gamma distribution with mean value n+1 and variance n+1. The first transformation of the data is thus to add 1 to all the observed numbers of catches, for you are interested in λ rather than in n. The sine-like wave should be nonnegative too, so you should find A, B, ω, φ to minimize
 * Σ ((n(t)+1) − A·e^(B·sin(ωt+φ)))² / (n(t)+1)
 * where this weighted sum of squares is over all the observations, and t is the time coordinate. Bo Jacoby (talk) 07:12, 5 September 2010 (UTC).
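The weighted least-squares fit suggested here can be sketched in Python with SciPy. The 24.8-hour lunar-day period, the simulated rates, and the starting guesses are all assumptions for illustration, not values from the thread:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
t = np.arange(2400.0)                               # hourly time stamps
lam = 1.5 * np.exp(0.8 * np.sin(2*np.pi/24.8 * t))  # assumed "true" catch rate
n = rng.poisson(lam)                                # simulated hourly catches

# Residuals chosen so that sum(residuals**2) equals the suggested objective,
# sum(((n+1) - A*exp(B*sin(omega*t + phi)))**2 / (n+1))
def residuals(params, t, n):
    A, B, omega, phi = params
    model = A * np.exp(B * np.sin(omega * t + phi))
    return ((n + 1) - model) / np.sqrt(n + 1)

p0 = [n.mean() + 1, 0.5, 2*np.pi/24.8, 0.0]         # starting guesses
fit = least_squares(residuals, p0, args=(t, n))
A, B, omega, phi = fit.x
```

The fitted B measures how strongly the (shifted) catch rate swings with the lunar cycle; B near zero would mean no detectable lunar effect under this model.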

Mr. Jacoby,

Thanks for your response, but I'm too long away from school and very stupid when it comes to statistics. I just rely on a Statistix program to do the math. If I make a modified data set by adding 1 to all, what is the appropriate test called? Also, in doing this data conversion, is the one added to the zero values or only to the positive integers? R Manns RManns (talk) 03:04, 7 September 2010 (UTC)


 * You are welcome. Yes, the one is also added to the zero values. If you catch no fish within an hour (n=0), then you cannot conclude that the average number of fish caught per hour, λ, is zero. You might have been unlucky. But you do get some information regarding λ, namely that λ ≈ 1 ± 1. This should not change the name of the tests you want to perform. I do not know the Statistix program. Bo Jacoby (talk) 07:07, 7 September 2010 (UTC).
 * Unclear as to why you are doing this analysis. Do you want to advise people whether they should fish during a particular phase of the moon? If so, your 'bottom line' could be a prediction of the relative success they should expect if they select a particular phase of the moon (e.g. 'fish during the full moon and you will catch 50% more'). The p-value is not very strong guidance for this kind of prediction. It's also peculiar that, when you attempt to factor out other variables, your p-value becomes less impressive. If you are successful in removing other factors from the problem, then any relationship to the phase of the moon (if there really is one) ought to show up more clearly in your data. It is possible that your removal of the other factors is hurting your value of N, so that the data are less informative. EdJohnston (talk) 00:37, 8 September 2010 (UTC)

When to use Kendall / Spearman correlations instead of Pearson's?
The statistical software package I use offers, in addition to the Pearson product-moment correlation coefficient, the Spearman rank correlation coefficient and the Kendall tau rank correlation coefficient. I am trying to find an explanation of why one would want (or when one can...) use S/K instead of P. I found a bunch of descriptions, little different from our wiki pages - they go into the math, but I don't care about the theory as much as about application (when to use which).

I am guessing that there are times you want to use P and times you want to use S/K, and that if you use the incorrect one you'll get a misleading result. (It seems that for P, both variables should be normally distributed (how can I test whether this is true?), while S/K do not have this assumption (doesn't that make them better by default...?).) How do I determine which one to use?

In particular, I am looking at some data that seems not significant under P, but more so under S/K. What does that mean? Is the correlation in the data I am looking at statistically significant or not? --Piotr Konieczny aka Prokonsul Piotrus 21:06, 4 September 2010 (UTC)


 * Well, in short, the Pearson correlation coefficient tells us if there is a linear relationship between two variables, and the other two tell us if there is a monotonous relationship between those variables. Thus a low Pearson correlation coefficient with a high Spearman or Kendall tau correlation coefficient indicates that there might be a monotonous relationship between the two variables, but it is not linear. For example, it could be that one variable is proportional to the cube of the other. You might wish to look at a scatter graph to find out more. --Martynas Patasius (talk) 22:10, 4 September 2010 (UTC)
 * Just FYI, the word in English is monotonic :-). Of course it could also be monotonous.... Trovatore (talk) 09:02, 8 September 2010 (UTC)
 * Spearman's rank correlation looks for linear relationships between the ranks. If I recall correctly, Spearman's and Pearson's are equal to each other in cases in which the data consist of distinct ranks, but when you have ranks to work with rather than the raw numbers getting ranked, then there's a simpler formula for computing them. Michael Hardy (talk) 00:02, 5 September 2010 (UTC)
 * Ditto to above, and also: You certainly want normality if you are using Pearson's correlation coefficient; you can refer to the article normality test for how to test the normality of data. If normality is suspect, go with Kendall's or Spearman's coefficients. Nm420 (talk) 20:17, 6 September 2010 (UTC)
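The contrast described in these replies can be illustrated numerically. This is a Python/SciPy sketch with invented data: the cubic relationship is just one example of a monotonic but nonlinear association, and `shapiro` is one of the normality tests mentioned above:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau, shapiro

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 200)
y = x**3 + rng.normal(0, 1, 200)   # monotonic but decidedly nonlinear

r_pearson, _ = pearsonr(x, y)      # penalised by the nonlinearity
r_spearman, _ = spearmanr(x, y)    # near 1: the ranks are almost perfectly ordered
tau, _ = kendalltau(x, y)

_, p_norm = shapiro(y)             # tiny p: y is clearly not normal
```

Here Spearman's coefficient exceeds Pearson's because it only asks whether the ranks move together, and the Shapiro-Wilk p-value flags the non-normality that makes Pearson's coefficient suspect in the first place.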

Thanks again 18:54, 7 September 2010 (UTC) —Preceding unsigned comment added by RManns (talk • contribs)

What is the test of statistical significance for nominal variables?
Let's say I want to see whether a nominal variable (country, or certain groupings of countries) and a ratio one (geographical or population size, for example) are significantly associated. What test should I use? Correlation is out, because it is not useful for categorical variables, right...? --Piotr Konieczny aka Prokonsul Piotrus 21:39, 4 September 2010 (UTC)
 * You could do a multiple comparisons ANOVA. Michael Hardy (talk) 00:03, 5 September 2010 (UTC)
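A minimal sketch of that suggestion in Python with SciPy (the three country groupings and the log-population figures are invented for illustration; `tukey_hsd` needs a reasonably recent SciPy):

```python
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

rng = np.random.default_rng(2)
# Hypothetical log population sizes for three country groupings
group_a = rng.normal(16.0, 1.0, 30)
group_b = rng.normal(17.5, 1.0, 30)
group_c = rng.normal(16.2, 1.0, 30)

# Overall ANOVA: does the ratio variable differ across the nominal groups?
stat, p = f_oneway(group_a, group_b, group_c)

# Tukey HSD: which pairs of groups differ (the "multiple comparisons" part)
res = tukey_hsd(group_a, group_b, group_c)
```

A small overall p says the nominal grouping is associated with the ratio variable; the pairwise Tukey p-values (`res.pvalue`) then say which groups drive the effect.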