Wikipedia:Reference desk/Archives/Mathematics/2017 August 23

= August 23 =

Statistical estimation
I thought this was a stats-101 straightforward math question, but thinking about it more, it could be somewhat open ended. Basically I know X's shoe size and I want to guess his or her height, by the following method:
 * 1) Round up N=1000 (say) people and measure the height and shoe size of each one (assume X is drawn from the same population as this sample).  Enter all the data in a big spreadsheet.
 * 2) Figure that height H and shoe size S are each about normally distributed, and correlated.  Do some calculations (this is the part I need help with) on the spreadsheet data, to get a maximum likelihood estimate (hopefully with a confidence interval) of X's height given X's shoe size.

My question: how do I do step 2 above? (Step 3 is of course "Profit!!!"). In the simple version, it's ok to assume H and S's distributions are exactly normal, so it's a matter of estimating the parameters and figuring out a joint distribution somehow. But otherwise maybe messier techniques like Lowess could be involved.

Thanks!

(No this isn't homework, just something I've been thinking about).

173.228.123.121 (talk) 07:54, 23 August 2017 (UTC)
 * See Multivariate_normal_distribution I guess. PeterPresent (talk) 12:07, 23 August 2017 (UTC)


 * At the stats-101 level, I would think the easiest starting point is just to do a simple linear regression treating shoe size as your x and height as your y. The confidence interval can then be estimated from the fit, and especially the typical misfit of the line to the data.  Of course, in the real world you need to look at the data and verify whether or not a linear relationship is reasonable.  If the fit is poor and/or the data is non-linear, one may need to do more sophisticated analysis, but a quick linear regression is probably the easiest starting point.  Since you seem to be saying this is a largely hypothetical exercise, then I suppose one needs to ask how complicated you want to make it.  Ordinarily, the complexity of the model and analysis is heavily guided by the qualities of the data.  Dragons flight (talk) 14:46, 23 August 2017 (UTC)
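A minimal sketch of the regression approach described above, using invented shoe-size/height numbers (the 1.8 cm-per-size slope and 5 cm scatter are made up purely for illustration). The prediction interval here is the crude ±2s version, ignoring parameter uncertainty, which is negligible at n = 1000:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: shoe sizes (EU) and heights (cm), correlated
n = 1000
shoe = rng.normal(42, 2.5, n)
height = 100 + 1.8 * shoe + rng.normal(0, 5, n)   # assumed linear relation

# Ordinary least squares: height = a + b * shoe
b, a = np.polyfit(shoe, height, 1)
resid = height - (a + b * shoe)
s = resid.std(ddof=2)                 # residual standard deviation

def predict(shoe_size):
    """Point estimate plus a rough 95% prediction interval (+/- 2s)."""
    h = a + b * shoe_size
    return h, (h - 2 * s, h + 2 * s)

h, (lo, hi) = predict(44.0)
```

Plotting `resid` against `shoe` is the quick check for the non-linearity Dragons flight mentions: any visible curvature in the residuals means the straight-line model is leaving structure behind.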


 * Thanks! Linear regression sounds like a simple practical solution, though I wonder if the relationship would actually be linear, due to the Square–cube law.  But by "open ended" I was wondering more about methods to squeeze all the useful info from the data.  I thought of a stats 101 solution as doing something with the spreadsheet data involving a covariance matrix, but had no idea what the "something" was.  PeterPresent's suggestion of multivariate normal distributions sounds like about what I imagined, so I'll try to read that article. 173.228.123.121 (talk) 03:04, 24 August 2017 (UTC)
 * That article should be accessible if you've studied some probability and linear algebra. PeterPresent (talk) 10:51, 24 August 2017 (UTC)
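The multivariate-normal route amounts to estimating the sample covariance matrix and then using the standard conditional-distribution formula for a bivariate normal: E[H | S = s] = μ_H + ρ(σ_H/σ_S)(s − μ_S), with conditional standard deviation σ_H·√(1 − ρ²). A sketch on fabricated data (all parameters invented):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated sample (bivariate normal assumption)
n = 1000
shoe = rng.normal(42, 2.5, n)
height = 100 + 1.8 * shoe + rng.normal(0, 5, n)

mu_s, mu_h = shoe.mean(), height.mean()
cov = np.cov(shoe, height)            # 2x2 sample covariance matrix
rho = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

def conditional(s_obs):
    """Conditional distribution of H given S = s_obs, bivariate normal:
    mean = mu_H + rho * (sigma_H / sigma_S) * (s_obs - mu_S)
    sd   = sigma_H * sqrt(1 - rho^2)"""
    sigma_s, sigma_h = np.sqrt(cov[0, 0]), np.sqrt(cov[1, 1])
    mean = mu_h + rho * (sigma_h / sigma_s) * (s_obs - mu_s)
    sd = sigma_h * np.sqrt(1 - rho ** 2)
    return mean, sd

m0, sd0 = conditional(mu_s)
```

Note that the conditional mean is exactly the simple linear regression line, so for two jointly normal variables this and the regression answer coincide.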

Regarding the square–cube law, you can plot the data and see if it is indeed a linear relationship, or one of higher order. If it is higher order, you can do a polynomial regression with the appropriate higher-order terms, or even a non-linear regression if you can express the relationship between the two variables mathematically. OldTimeNESter (talk) 22:44, 24 August 2017 (UTC)
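Adding the higher-order term is a one-line change with `np.polyfit`; comparing residual sums of squares shows whether the extra term earns its keep. A sketch on invented data with deliberate curvature (the 0.05 coefficient is made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with mild curvature
x = np.linspace(35, 50, 200)
y = 60 + 0.05 * x ** 2 + rng.normal(0, 1, 200)

lin = np.polyfit(x, y, 1)      # degree-1 fit
quad = np.polyfit(x, y, 2)     # degree-2 fit with the extra x^2 term

rss_lin = ((y - np.polyval(lin, x)) ** 2).sum()
rss_quad = ((y - np.polyval(quad, x)) ** 2).sum()
```

If `rss_quad` is only marginally below `rss_lin`, the linear model is probably adequate; a formal F-test on the two nested fits makes that comparison precise.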

Take a logarithm before you make your statistical analysis in order to avoid dealing with negative shoe sizes and heights. Bo Jacoby (talk) 06:45, 25 August 2017 (UTC).
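The log transform has a second benefit relevant to the square–cube concern: a power law H = c·S^k becomes a straight line in log–log space, so ordinary linear regression recovers the exponent. A sketch with an invented power law (the exponent 0.45 and scale 30 are fabricated):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical power-law relation H = c * S^k with multiplicative noise
n = 500
shoe = rng.uniform(36, 48, n)
height = 30 * shoe ** 0.45 * np.exp(rng.normal(0, 0.02, n))

# In log space the power law is a straight line: log H = log c + k log S
k, logc = np.polyfit(np.log(shoe), np.log(height), 1)
```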

Approximating low-frequency modes with discrete non-uniform data
I am looking for some algorithms and/or conceptual pointers for whether or not there are generally preferred way(s) of estimating low-frequency behavior from discrete non-uniform data. In my context, an ideal time series would be uniformly sampled with one data point every year. Then one can define a low frequency representation of that data in several ways. For example, by taking an N-year moving average, or by using a Fourier transform to remove high frequency components, or by applying various other smoothing approaches. However, I am looking for tips on how best to get a similar low-frequency representation given non-uniform data.
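For the uniform-sampling baseline, the moving average and the Fourier low-pass described above look like this on a fabricated annual series (the 200-year cycle and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical annual series: slow cycle plus noise, one point per year
years = np.arange(1000)
signal = np.sin(2 * np.pi * years / 200)          # slow 200-year cycle
data = signal + rng.normal(0, 0.5, years.size)

# 20-year moving average ('valid' avoids zero-padded edges)
window = np.ones(20) / 20
smooth = np.convolve(data, window, mode="valid")

# Fourier low-pass: zero out components with period < 50 years
spec = np.fft.rfft(data)
freqs = np.fft.rfftfreq(data.size, d=1.0)
spec[freqs > 1 / 50] = 0
lowpass = np.fft.irfft(spec, n=data.size)
```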

Specifically, instead of one data point per year, imagine the data is still sampled on an annual time scale, but many years were not reported (i.e. missing data) and the distribution of missing data is random. Is there a best way to approximate, for example, the 20-year moving average, if 80% of the years are missing? There are obvious strategies, such as interpolating the data in various ways before smoothing or simply applying the moving average to just the data that one does have, but I am wondering if there are a selection of techniques that are generally preferred for situations like this. I am imagining that a "preferred" solution is an estimate of the low-frequency behavior that would be expected to come closest to the true low-frequency behavior (if the missing data were available) given some set of assumptions about the statistical distribution of the missing data (e.g. normality, autocorrelation, etc.). Dragons flight (talk) 14:31, 23 August 2017 (UTC)
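The two "obvious strategies" mentioned above can be compared directly on synthetic data where the truth is known (all numbers here are invented; real missingness patterns may not be random):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical annual series with a slow cycle; ~80% of years missing
years = np.arange(1000)
truth = np.sin(2 * np.pi * years / 250)
data = truth + rng.normal(0, 0.3, years.size)
observed = rng.random(years.size) < 0.2          # keep ~20% at random

# Strategy 1: 20-year mean using only the observed points in each window
def window_mean(t):
    mask = observed & (np.abs(years - t) <= 10)
    return data[mask].mean() if mask.any() else np.nan

est1 = np.array([window_mean(t) for t in years])

# Strategy 2: linearly interpolate the gaps, then take the moving average
interp = np.interp(years, years[observed], data[observed])
est2 = np.convolve(interp, np.ones(20) / 20, mode="same")

err1 = np.nanmean(np.abs(est1[50:-50] - truth[50:-50]))
err2 = np.mean(np.abs(est2[50:-50] - truth[50:-50]))
```

Running this kind of simulation with missingness and noise statistics matched to the real data set is itself a reasonable way to choose among candidate methods.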


 * Could filling in the missing years by multiple imputation help if that many are missing? Then run your moving average or a fancier low-pass filter.  173.228.123.121 (talk) 16:52, 23 August 2017 (UTC)
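A very rough sketch of the imputation idea (proper multiple imputation would draw from a fitted model; here the gaps are just interpolated with added noise matched to the observed scatter, and the smoothed draws averaged; all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(6)

years = np.arange(1000)
truth = np.sin(2 * np.pi * years / 250)
data = truth + rng.normal(0, 0.3, years.size)
observed = rng.random(years.size) < 0.2

# Baseline: interpolate through the observed points
base = np.interp(years, years[observed], data[observed])

# Crude noise estimate from consecutive observed differences
# (assumes the underlying trend changes slowly between observations)
noise_sd = np.diff(data[observed]).std() / np.sqrt(2)

window = np.ones(20) / 20
draws = []
for _ in range(20):                     # 20 imputed series
    imp = base.copy()
    gaps = ~observed
    imp[gaps] += rng.normal(0, noise_sd, gaps.sum())
    draws.append(np.convolve(imp, window, mode="same"))
est = np.mean(draws, axis=0)            # average of smoothed imputations
```

The spread across `draws` also gives a (rough) uncertainty band on the smoothed curve, which the single-interpolation approach does not.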


 * That kind of poor input data isn't going to give you good results no matter what you do to it. However, it should work better with a consistent and slowly changing stat, such as world population, than something more volatile, like a stock price. StuRat (talk) 17:31, 23 August 2017 (UTC)


 * My understanding is that Fourier analysis is a special case of least squares analysis; it's just that the math works out especially cleanly in the case of Fourier analysis. But that doesn't mean that other types of analysis are impossible. First you need to decide how to weight the data. For a rolling average you might give the last n data points a weight of 1/n, but another popular weighting is proportional to p^(−t), where p is fixed and t is the time difference between the data point and the present. Second you have to decide how you're modelling the data: constant for a plain average, linear for a trend, polynomial or trigonometric polynomial for more detailed results. Then use calculus to find the expression with the least square distance from the data. There is a limit to how much meaningful information you can get from a given data set, so you need to limit the degrees of freedom in your model. Probably the best way to start is to simply plot the data and let your brain's visual cortex detect a pattern. That should at least narrow down what type of model you should use. --RDBury (talk) 00:38, 24 August 2017 (UTC)
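A sketch of this recipe with the exponential p^(−t) weighting and a linear model, on invented irregular sample times (p = 1.02, the 0.05 slope, and the noise level are all made up); the weighted least-squares solve is done by scaling each row by √w:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical non-uniform sample times and a linear trend
t = np.sort(rng.uniform(0, 100, 80))          # irregular sample times
y = 2.0 + 0.05 * t + rng.normal(0, 0.5, t.size)

# Exponential weights p^-(T - t): recent points count more (p > 1)
p, T = 1.02, t.max()
w = p ** -(T - t)

# Weighted least squares for the linear model y = a + b t
A = np.vstack([np.ones_like(t), t]).T
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
a, b = coef
```

Nothing here requires uniform spacing: the design matrix simply uses whatever times were actually observed, which is the practical advantage of the least-squares view over FFT-based filtering.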


 * See also here. Count Iblis (talk) 00:42, 24 August 2017 (UTC)


 * The compressed sensing idea is interesting, though from looking at the article it seems the idea is to get high frequency info from samples taken more slowly than the Nyquist rate, using the uneven spacing of the samples. For low frequencies, I'm imagining something like 1000s of years of climate data, which I'd imagine to be some periodic components plus something like a random walk.  So one idea might be to guess a model for the periodic part, and fit parameters to it until the residual part looks like a random walk.  There has to be something better though. 173.228.123.121 (talk) 02:44, 24 August 2017 (UTC)
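The "fit the periodic part, inspect the residual" idea works with ordinary least squares even on irregular times, since a sinusoid of known trial period is linear in its sine and cosine coefficients (this is essentially what a Lomb–Scargle periodogram does per frequency). A sketch with an invented 400-year cycle:

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical irregular record: one periodic component plus noise
t = np.sort(rng.uniform(0, 2000, 300))
y = 1.5 * np.sin(2 * np.pi * t / 400 + 0.7) + rng.normal(0, 0.4, t.size)

# Least-squares fit of a sinusoid with trial period 400 yr:
# y ~ c1 sin(w0 t) + c2 cos(w0 t), linear in c1, c2
w0 = 2 * np.pi / 400
A = np.column_stack([np.sin(w0 * t), np.cos(w0 * t)])
(c1, c2), *_ = np.linalg.lstsq(A, y, rcond=None)
amp = np.hypot(c1, c2)                 # recovered amplitude
resid = y - A @ np.array([c1, c2])     # residual to inspect for a walk
```

Scanning `w0` over a grid of trial periods and keeping the fits that reduce the residual variance significantly is one way to separate the periodic components from the random-walk-like remainder.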

Angles
The difference between an obtuse angle, an acute angle, and a right angle: if it is larger than a right angle, it's obtuse — Preceding unsigned comment added by Jaayinlove1 (talk • contribs) 18:03, 23 August 2017 (UTC)


 * Right. --CiaPan (talk) 19:06, 23 August 2017 (UTC)


 * ... as clearly explained in our article: angle.   D b f i r s   19:56, 23 August 2017 (UTC)


 * And here I thought a cute angle had dimples and obtuse angles were stupid, right ? StuRat (talk) 20:01, 23 August 2017 (UTC)