Talk:Regression analysis/Archive 1

Mergers?
A variety of mergers has been proposed. Please add to the discussion at Talk:Linear regression.

Example
Is it possible to have a better example? Not only does the equation $$y_i=x_i^2+1$$ look obvious, but $$y_i=7.763/2-3.221\cos(x_i)+0.339\cos(2x_i)$$ is simpler than the result given. A regression example needs more data points than unknowns, leading to non-zero residuals. --Henrygb 02:08, 3 Feb 2005 (UTC)
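For what it's worth, the point about residuals is easy to demonstrate numerically. The following is only a sketch with hypothetical data (not from the article), assuming numpy: ten noisy points, three unknowns, so the least-squares fit leaves non-zero residuals.

```python
import numpy as np

# Hypothetical data roughly following y = x^2 + 1, with noise added so
# that the fit has non-zero residuals (more data points than unknowns).
rng = np.random.default_rng(0)
x = np.arange(0.0, 5.0, 0.5)            # 10 data points
y = x**2 + 1 + rng.normal(0, 0.5, x.size)

# Fit a quadratic by ordinary least squares: 3 unknowns, 10 equations.
coeffs = np.polyfit(x, y, deg=2)
residuals = y - np.polyval(coeffs, x)

print(coeffs)                            # close to [1, 0, 1]
print(np.round(residuals, 3))            # non-zero residuals
```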

Please feel free. I just rescued this example from function approximation, but I don't think it is the best example either. -- hike395 17:42, 3 Feb 2005 (UTC)

OLS
Should this page mention OLS regression?

Isn't it covered already under linear regression?

Feinstein 06:15, 6 May 2006 (UTC)

Appropriate level of rigor
The maths is too rigorous, to the point where few readers will be able to understand it. You wouldn't expect to see this in a general encyclopaedia, so I think it's not appropriate for Wikipedia. Blaise 22:39, 11 October 2005 (UTC)

Also, isn't alpha reported as a decimal? Writing alpha = 5% is different from writing 0.05 because of the two decimal places; every book I have used gives alpha as a decimal (percent is only really used when 1-alpha is considered).


 * It should be reported as .05 with no 0 before the decimal because, as the APA style guide suggests, any number that does not exceed 1 does not need to have a 0 before the decimal. It is always reported as a decimal in peer-reviewed research literature. Chris53516 13:20, 11 October 2006 (UTC)


 * APA style guidelines suggest that any number that cannot exceed 1 (such as probabilities) should not have a 0 before the decimal. All other decimals should report the value before the decimal point. —The preceding unsigned comment was added by 71.225.48.174 (talk • contribs).

Re: Appropriate level of rigor
Hi,

I think I'm largely responsible for this. I'm aware that this is probably not the simplest way to present regression analysis. However, it is hard to find rigorous definitions for all this on the internet. You usually get fairly vague explanations, sufficient for simple applications, but which are closer to a user's manual than to a text book.


 * [Comment: is there a reason why these definitions to which you refer need originate from the Internet? If you aren't already extensively familiar with the material, it's probably best to leave the article to someone else. -Dave]


 * I think you missed my point, Dave: as these definitions aren't available, I thought it would be good to make them available. In your edit summary, you say the typo you corrected was one of the smallest problems of this page: could you please be more specific? Deimos 28 13:13, 24 February 2006 (UTC)

Anyway, I think the comment is pertinent and I'll leave it to others to decide what to do with this article (whether it should be left "as is", simplified, moved to another section or deleted altogether). Obviously I'd be more inclined to keep it, but it's your encyclopedia as much as mine ;)

By the way, ordinary least squares is presented in this article...

I second that - the definition is good, but a little difficult to understand, particularly if the person accessing this page is coming from a social sciences background (e.g. poli sci, sociology) and just needs to know the significance of regression analysis, not necessarily how to do it. Could a section be inserted that explains what regression analysis can be used to prove? -sarah


 * Try MathWorld &mdash;James S. 09:39, 5 January 2006 (UTC)

Hi Sarah, I was thinking of adding an example with "real" data. I think it'll be ready in a few days. Cheers! -deimos

I think the article would benefit from adding a simpler introduction. There is material further down that is more accessible. But, I think a reader without mathematical background might give up. It would be nice if someone could write a couple of sentences at the top about the general idea. --Flitzer 14:36, 4 April 2006 (UTC)


 * In all seriousness, who is the imagined audience for this entry? Anyone who understands measure theory won't come here to learn about OLS. Most everyone else will click away after taking one look at the notation.
 * —Preceding unsigned comment added by 170.223.19.150 (talk • contribs) 14:04, April 5, 2006


 * What needs to be done is to write a sizable introductory section (not a paragraph, a section) that introduces the concept to the average reader, and then to retain the "rigor" in the sections that follow. &mdash;Lowellian (reply) 07:37, 21 April 2006 (UTC)

Philosophical Rigor
This page may have some "mathematical" rigor on it but it is philosophically rather shaky. Regression is arguably THE most abused technique in the social sciences. (see the textbooks by Berk or Freedman) I think that the presentation of this page needs to be brought more in line with reality by incorporating these criticisms. I.e. phrases like "The error term is usually posited to be normally distributed." need to be removed, modified, or preceded with some qualifiers discussing assumptions and their validity. Sections like "Regression diagnostics" need to be fully explained with qualifiers explaining when these diagnostics can be meaningfully interpreted. I think that we should break this page up into two broad sections, one discussing regression as a data-analytic technique for measuring trends in the data, and one discussing (and questioning) the additional structure, cautioning the reader against possible abuses of the material. The page as is is outright misleading. Cazort (talk) 15:15, 13 December 2007 (UTC)

Regression analysis
Where is the math outcome? Using SPSS, what is the significance level? If the correlation is more than .050 regarding your independent variables, your significance is too high. What and where are the simple answers? 2+2=4. If you have this number(?), you need to look at another variable, or your hypothesis is inaccurate.

—Preceding unsigned comment added by 69.152.245.238 (talk • contribs) 08:43, December 3, 2005

Clean up
I've done some cleaning up on this page: I moved a lot of the theoretical details to other articles so that people can skip them if they want to. I also added a detailed example of how to use the material presented in this article. Hope this helps. Let me know what you think about it.

Cheers,

Deimos.


 * Great work! I had nominated this for Mathematics Collaboration of the Week, but now I'm not so sure it needs much. Thanks. &mdash;James S. 20:21, 10 January 2006 (UTC)

Problems with the second example
Hi!

I found a few problems with the second example. First of all, this is an example of an interpolation problem which is a special case of regression for $$n=p$$. In this case, the regression function fits the points exactly. There's no problem with that, except that the way it is presented in the example, it can't work.

First of all, $$p>n$$, which contradicts the hypothesis given at the beginning of the article. This means we have too many coefficients to estimate for the data at hand, and therefore that the design matrix will necessarily be singular. If we do not have any more data points, we can reduce the size of the G subspace by noticing that the function we are looking for has to be even, as $$y(x)=y(-x)$$ for all the data at hand (which incidentally means that half the data is redundant if we choose a trigonometric function). This means we can reduce the problem to finding $$(\theta_0, \theta_1, \theta_2, \theta_3)$$ such that:



$$f(x) = \frac{\theta_0}{2} + \theta_1\cos(x) + \theta_2\cos(2x) + \theta_3\cos(3x)$$

for $$x=(0,1,2)$$ and $$y=(1,2,5)$$. We now have four coefficients to estimate with only three data points, so the system is under-determined and we still have one coefficient to get rid of. We choose $$\theta_0=0$$. We could choose any other value of $$\theta_0$$ by doing the regression on $$y-\theta_0$$ instead of the ys; adding $$\theta_0$$ back to the resulting regression formula then gives a function of the requested form taking the y values at x.

We build the matrix $$X$$ by putting together the column-vectors $$\cos(x),\cos(2x)$$ and $$\cos(3x)$$:

$$X=\begin{pmatrix} 1&1&1\\ 0.54&-0.42&-0.99\\ -0.42&-0.65&0.96\\ \end{pmatrix} $$ I find the same values as in the example, of course: $$(\theta_1,\theta_2,\theta_3)=(4.252563, -6.130016, 2.877453)$$
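The reduced system above can be checked numerically. This sketch (assuming numpy) rebuilds the $$3\times 3$$ design matrix with columns $$\cos(x),\cos(2x),\cos(3x)$$ and solves it exactly, which is why it is interpolation rather than regression:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 2.0, 5.0])

# Design matrix with columns cos(x), cos(2x), cos(3x); with theta_0 fixed
# at 0 this is a square (3x3) system, i.e. exact interpolation.
X = np.column_stack([np.cos(x), np.cos(2 * x), np.cos(3 * x)])
theta = np.linalg.solve(X, y)

print(np.round(theta, 6))   # approximately (4.252563, -6.130016, 2.877453)
```

As expected, the fitted function passes through all three data points with zero residuals, which is the sense in which this is a bad example of regression.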

I think this is a bad example for regression and should be removed from the article. Maybe more suited for trigonometric interpolation? -- Deimos 28 13:22, 24 February 2006 (UTC)

Model II regressions
Would it be good to add model II regressions? KimvdLinde 17:05, 19 March 2006 (UTC)


 * Sure! What is it?
 * Deimos 28 20:04, 19 March 2006 (UTC)

Linear regression in which the variance of both variables is included: not minimisation along the y values only, but along the y and x values simultaneously. The two best-known versions are major axis regression and reduced major axis regression. KimvdLinde 23:56, 19 March 2006 (UTC)

Gauss-Markov assumptions: what does V stand for?
Under "Gauss-Markov assumptions", it is unclear what V stands for: I guess it is the derivative with respect to X? Do the formulae in that line mean 1. eps_i = 0 for all i 2. (d eps_i)/(dX_j) = (sigma)^2*(kroneckerdelta)_ij  ? Thank you.

—Preceding unsigned comment added by 62.206.54.162 (talk • contribs) 03:31, March 27, 2006


 * The $$\mathbb{V}$$ stands for "variance". It is not the derivative (nowhere in the article is $$\vec{\varepsilon}$$ assumed to be differentiable with respect to any variable). The formulae mean:
 * $$\int_{\Omega}\vec{\varepsilon}(\omega)\,d\omega=\vec{0}$$ (vectors of size n)
 * $$\forall (i,j)\in[\![1,n]\!]^2, \int_{\Omega}\varepsilon_i(\omega)\varepsilon_j(\omega)\,d\omega=\sigma^2 \delta_{ij}$$
 * where $$\vec{\varepsilon}:\Omega\rightarrow \mathbb{R}^n$$ is a random variable, with $$\vec{\varepsilon}=(\varepsilon_1,\cdots,\varepsilon_n)$$.
 * Regards,
 * Deimos 28 09:20, 27 March 2006 (UTC)
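The two conditions can also be illustrated empirically. This is only a Monte Carlo sketch (assuming numpy), not part of the article's derivation: draw many error vectors, then check that the sample mean is near zero and the sample covariance near $$\sigma^2 I$$.

```python
import numpy as np

# Monte Carlo sketch of the two Gauss-Markov conditions discussed above:
# E[eps] = 0 and Cov(eps_i, eps_j) = sigma^2 * delta_ij (V = variance).
rng = np.random.default_rng(42)
n, sigma, draws = 4, 2.0, 200_000

eps = rng.normal(0.0, sigma, size=(draws, n))   # uncorrelated errors

mean_hat = eps.mean(axis=0)                     # should be near the zero vector
cov_hat = np.cov(eps, rowvar=False)             # should be near sigma^2 * I

print(np.round(mean_hat, 2))    # ~ [0, 0, 0, 0]
print(np.round(cov_hat, 1))     # ~ 4 on the diagonal, ~ 0 off it
```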

Logistic regression
A related article that could really use some work (hint, hint, this is a call for help to the editors here for this collaboration of the week), especially on how exactly one goes about doing the iterative calculations, is logistic regression. &mdash;Lowellian (reply) 07:40, 21 April 2006 (UTC)

Scrap it and start again
I'm sorry but this article is so bad, so full of mistakes, that it really needs to be completely scrapped and replaced. It is riddled with errors. For example, consider the following sentence from the article:

"The simplest type of regression uses a procedure to find the correlation between a quantitative response variable and a quantitative predictor."

There are three problems with this sentence.

First, regression does not use "a procedure to find the correlation". This is nonsense. If one wants to measure the correlation between two variables, then one computes a correlation coefficient -- regression is not needed to do so. Regression can be used to estimate a mathematical model, an equation, that expresses the response variable as a function of the predictor. Note that it is very possible that the traditional correlation coefficient between two variables will be 0 even though there is a perfect mathematical (functional) relationship between the two variables.
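The last point is worth a tiny numerical sketch (assuming numpy): here y is an exact function of x, yet the Pearson correlation is zero.

```python
import numpy as np

# y is a perfect function of x (y = x^2), yet the Pearson correlation
# between x and y is 0, because the relationship is nonlinear and the
# x values are symmetric about zero.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(r)   # 0.0
```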

Second, in regression, all variables are quantitative. That is, all variables must be expressed as quantities, as numbers. The mathematics of any kind of regression simply does not work using non-quantitative data. I believe the author(s) is confused about the distinction between types of quantitative data (ratio, interval, ordinal, and nominal) and the distinction between the two broad categories of data (quantitative and non-quantitative.)

Finally, I'm not sure how one goes about defining "simplest" in terms of types of regression analysis. In addition, the use of this adjective is potentially confusing since there is a type of regression analysis called "simple linear regression analysis". However I'm not sure it's any more or less simple (in the sense of "simplest") than a model of the form $$Y = X\theta + \varepsilon$$.

These are only three examples from one short sentence. The article is full of errors such as these.

I hope that no one ever uses this article to learn about regression analysis. It really needs to be taken down before too much damage is done. It's really bad.


 * "I hope that no one ever uses this article to learn about regression analysis." Have no fear. The article is so poorly written that nobody understands it. --Ioannes Pragensis 18:55, 2 May 2006 (UTC)
 * Be bold and fix it. I don't read talk pages, and nothing will get done if everything is discussed like this.  Mazin07C₪T 13:38, 22 May 2007 (UTC)

ANOVA and ANCOVA
''Predictor variables may be defined quantitatively or qualitatively(or categorical). Categorical predictors are sometimes called factors. Depending on the nature of these predictors, a different regression technique is used:''

Untrue: no different regression technique is used. The different names appear to exist for historical reasons, but there is no real difference between the models or the means used to estimate them. http://www.stats.ox.ac.uk/pub/MASS3/Exegeses.pdf, top of page 12. Similar remarks appear in the book by Seber and Lee, 2003. Please amend.

A word of support
I support the work that is being done here. It promises a useful definition of a basic concept in statistics. Often someone approaching a field from the outside (e.g. Soc Sci, Pol Sci) will be reluctant to ask an expert to explain concepts learnt on day one. In this case it helps to be able to refer to an encyclopedia. The technical maths also has its place, though perhaps could be put separately in a box. While MathWorld is probably the definitive online mathematics source, it is pitched at a much higher level. Wikipedia does have something to offer in terms of an introduction. Previous discussants should remove overly critical comments from this page and instead submit a constructive revision. Wikid 10:38, 24 May 2006 (UTC)

Height vs Weight Example
I don't know which is the first or second example discussed above, but regressing averages against each other leaves such a smooth data set that it gives first-time readers a very extreme view of most regressions. For example, how do we explain the r-squared value of the resulting relationship? -Finn


 * This example is not the one which I mentioned earlier on this talk page. It is indeed a bit "artificial", but I thought it would be clearer to have a simple example which "works" well (i.e. is straightforward to compute and gives a good fit). It does not serve well for any statistical inference. -- Deimos 28 07:27, 6 June 2006 (UTC)

Isn't it a simple fallacy to equate the confidence interval from a regression with an essentially Bayesian estimate along the lines of "with probability 0.95, these parameters lie in this range", as is done at the end of this example? I know it's commonly enough done (and most people don't know the difference). Jdannan 07:17, 5 July 2006 (UTC)

I ran this example through Splus.


 * the correlation between the estimate of the intercept and the estimate of the cubic parameter is 0.9961. This is unacceptably high in my opinion.
 * the model with the quadratic term retained is a better fit than the model chosen, which has it missing. I used AIC as my model selection criterion and got AIC = -27.83 for the model without the quadratic term and AIC = -37.26 for the model with the quadratic term retained.

I haven't done this, but I would next try to estimate this model using nonparametric or semiparametric regression. It's not a good example of polynomial regression and anyway polynomial regression is probably not the most useful thing to be demonstrating here. Blaise 23:38, 13 April 2007 (UTC)
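The S-Plus data set above is not reproduced here, so the following is only a sketch of the model-selection step on synthetic data (assuming numpy): compare a cubic fit with and without the quadratic term via a Gaussian AIC, where lower is better.

```python
import numpy as np

# Gaussian AIC up to an additive constant: n*log(RSS/n) + 2k,
# where k counts the estimated parameters (coefficients + error variance).
def aic(y, y_hat, k):
    n = y.size
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

# Synthetic data whose true model genuinely contains a quadratic term.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 30)
y = 1 + x + 0.5 * x**2 + 0.2 * x**3 + rng.normal(0, 0.05, x.size)

full = np.column_stack([np.ones_like(x), x, x**2, x**3])   # keeps x^2
restricted = np.column_stack([np.ones_like(x), x, x**3])   # drops x^2

beta_f, *_ = np.linalg.lstsq(full, y, rcond=None)
beta_r, *_ = np.linalg.lstsq(restricted, y, rcond=None)

aic_full = aic(y, full @ beta_f, k=5)
aic_restricted = aic(y, restricted @ beta_r, k=4)
print(aic_full, aic_restricted)   # keeping x^2 gives the lower (better) AIC
```

The numbers here are not Blaise's -27.83 and -37.26, since the original data are not available; the sketch only shows the mechanics of the comparison.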

Multicollinearity
I know this is not my area, but somehow the following statement in the section on Linear models does not make sense:
 * "Multicollinearity results in parameter estimates that are unbiased and consistent, but which may have relatively large variances."

The article on multicollinearity does not give this characterization, yet it seems to more accurately describe this phenomenon. Could someone who knows more than I do about Statistics please take a look at this and edit? Vonkje 21:26, 26 November 2006 (UTC)
 * The correct phrasing is, "unbiased, consistent, but inefficient." Wikiant 21:40, 26 November 2006 (UTC)

The original "relatively large variances" statement was mine. (I think I posted it before I had a Wikipedia login - my apologies.)

Assuming that a regression model is correctly specified, then in the presence of multicollinearity the estimates are unbiased, consistent, and efficient. That's what the Gauss-Markov theorem says.

When explanatory variables are correlated with one another, the estimated coefficients have higher variances than if the explanatory variables were uncorrelated. DickStartz 19:44, 29 December 2006 (UTC)
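This point can be illustrated with a small simulation (assuming numpy; hypothetical data, not from the article): the slope estimates stay unbiased whether or not the predictors are correlated, but their sampling variance is much larger in the correlated case.

```python
import numpy as np

# Sketch of the point above: OLS stays unbiased under multicollinearity,
# but coefficient variances grow when predictors are highly correlated.
rng = np.random.default_rng(0)

def slope_estimates(corr, reps=2000, n=100):
    """Repeatedly fit y = x1 + x2 + noise; return mean and variance
    of the estimated coefficient on x1."""
    cov = np.array([[1.0, corr], [corr, 1.0]])
    est = []
    for _ in range(reps):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = X[:, 0] + X[:, 1] + rng.normal(0, 1, n)
        design = np.column_stack([np.ones(n), X])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        est.append(beta[1])
    est = np.array(est)
    return est.mean(), est.var()

mean_lo, var_lo = slope_estimates(corr=0.0)
mean_hi, var_hi = slope_estimates(corr=0.95)
print(mean_lo, mean_hi)   # both near 1: unbiased either way
print(var_lo, var_hi)     # variance much larger in the correlated case
```

(The theoretical variance inflation here is roughly $$1/(1-0.95^2)\approx 10$$, which the simulation reproduces.)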

I have made the needed minor corrections.DickStartz 00:14, 14 January 2007 (UTC)

Some suggestions
Hi I would suggest at least the following changes to this article:

1. Use the symbols $$\alpha$$ and $$\beta$$ instead of $$\eta$$ or $$\theta$$. This will at least match up notation with the linear regression article. Similarly use boldface small letters for vectors and capitals for matrices as is done everywhere else.

2. Comment on the outcome of the analysis. Write down the interpretation of the estimates. Clarify the assumptions. Explain the meaning of each symbol more clearly.

3. The example is not so much "too rigorous" as it contains a lot of superfluous information. If we mention the information matrix, then it should have a purpose (e.g. explain the connection to the standard errors). Otherwise leave it out.

4. The part about Bayesian statistics is a bit strange, since it describes some selective parts of bayesian statistics which really apply to all statistical analyses. Better would be just a statement to the effect that bayesian statistics also apply to regression analysis and more can be found in the bayesian stats article. That or give a full example. (Also it is not true that small-sample estimates are biased if all the assumptions are met!)

5. I don't follow this sentence: "If a model has only one equation but more than one predictor, it does not employ multivariate regression". This is nitpicking on terms. Multivariate means employing multiple variates, and regression with more than one predictor definitely falls into that category, I would say.

6. Could you explain this sentence: "Geometrically, what we will be doing is an orthogonal projection of Y on the subspace generated by the variables...". Shouldn't this be a projection of X (the matrix) such that this projection is as close as possible to Y? (Or in ML terms, such that the likelihood is as large as possible.)

7. In general, in my opinion this should be more an article on how to check assumptions, formulate models, and interpret the coefficients and less a collection of buzzwords and notation.

87.219.190.15 11:26, 12 February 2007 (UTC)


 * I agree completely with your points 1-6. Point 7 would probably violate WP:NOT (Wikipedia articles are not instruction manuals). However we should certainly mention the main assumptions, along with the consequences of violating them, and describe or link to methods for checking whether they hold. -- Avenue 15:30, 12 February 2007 (UTC)


 * Supportive of above, including comment. On (5), possibly the offending line should be deleted. But then so should the 2nd sentence in previous paragraph, which seemed to say just what this sentence does. It uses 'multivariate' to refer to more than one equation. Your usage is the only one I was familiar with until I encountered this article.  (If both are changed, anyone objecting should at least be able to refer to a standard source to support another usage.  Then the usage difference could be duly noted.)  -- Thomasmeeks 03:06, 13 February 2007 (UTC)


 * Hi back again (same person different IP). I made some of the suggested changes and also corrected some mistakes I noticed. Now that I reread it, it is a bit clearer. (Some of the changes are largish, you may want to review.)
 * But to be completely honest, it still isn't very good. I'm beginning to think the earlier suggestions on a rewrite are warranted.
 * Perhaps I could suggest the following structure: start with a general easy to understand definition, with references to the many articles that already treat the concepts involved. Then list the Gauss-Markov assumptions, and state for each assumption what possible sources of violation are, and what models are warranted in that case. An alternative list with anchor-hyperlinks could be added which categorises the sections according to the type of data (binary, censored, categorical, etc.). The article could then end with two sections called "Estimation" (mentioning ML, OLS, GLS, Bayesian methods etc) and "Goodness-of-fit" (mentioning model and parameter tests, confidence intervals, and information criteria). The advantages of this approach would be:


 * 1. It would yield a structure that immediately clarifies the connection between the different "types" of regression as they are now called in the article.


 * 2. One can immediately see the purpose of the different analyses without being first bombarded with confusing notation and seemingly irrelevant examples.


 * 3. The old article contains much repetition because many themes apply to all "regression analyses". This would be avoided with the new structure.


 * 4. The article would be more legible because one could for example decide that censored data are not of interest, and so arcane descriptions of the tobit estimator can be skipped.


 * What do you think about this suggestion? 87.219.191.30 14:05, 18 February 2007 (UTC)


 * Please sign in. &mdash; Chris53516 (Talk) 14:54, 19 February 2007 (UTC)


 * One value of this article is its generality, which I hope your efforts will help preserve. You might want to consider whether linear regression is a more proximate article for your suggested additions. If you'd just be repeating what's already there (but could perhaps be improved on [or significantly expanded]), what would be the point? You could then reference here and in OLS. The string of complaints above suggests that too little has been done to make the present article more accessible to a wider audience. I'm not talking about dumbing it down but rather adding content so that someone reading the article with a little knowledge of the subject would get enough content to continue beyond the lead. --Thomasmeeks 05:08, 23 February 2007 (UTC)
 * One thing I am concerned about is the statement above that:
 * I'm beginning to think the earlier suggestions on a rewrite are warranted.
 * Perhaps the reference is to the section 12 "Scrap it and start again." The problem is that if the writer is unprepared to make any improvement, how can one know whether it would be better?  But such disgust is promising if it is backed by fixing the problems, either one at a time or by working on a grand alternative.  One can welcome improvements, and I can well understand how there could be frustration. The trick is to come up with the goods. One Wiki guideline is to be bold in updating pages, always (one hopes) observing the  principle of charity, if nothing else as a matter of expedience in recognizing a possible kernel of truth worth preserving, even if not originally well stated.   --Thomasmeeks 05:20, 23 February 2007 (UTC)

Proposed lead
Different users have contributed to the following proposed lead of the article. Please discuss it or comment below. Thanks to Chris53516 for suggestions and for creating this section to resolve a disputed proposed lead in an orderly fashion.

-

In statistics, regression analysis estimates the systematic relation of variables to each other in a mathematical model. The regression estimates use a data set for the variables as input.* Each regression equation includes the values of:
 * 1) a variable to be explained, called the response variable
 * 2) other variables,  called predictors
 * 3) estimated parameters (constants) of the model, which link (1) and (2) quantitatively.

Regression estimates are used because parameter values are not known with certainty. Applications include curve fitting, forecasting, testing scientific hypotheses, and describing how good the "fit" is between the response variable and the predictors. Standard ways to estimate regression equations include linear regression, the generalized linear model, nonlinear regression, and logistic regression.

*In real-world applications, the data used could come from any combination of public or private sources.

-


 * (Edited above as attempt to present a consensus best version available. If anyone was in the process of writing a critique, by all means post it.  The contributors below realize that an Edit of Talk page section is unusual, but then so are the circumstances of this section. --Thomasmeeks 20:50, 21 February 2007 (UTC) Thomasmeeks 14:03, 5 March 2007 (UTC))


 * The intent of this proposed lead is to fill in gaps of the current lead, which makes no reference to:


 * 1) a mathematical model  (sentence 1)
 * 2) data used in regression analysis (sentence 2)
 * 3) the relation of parameters to the response variable and predictors (sentence 3)
 * 4) reasons for the use of regression (sentences 4 & 5)
 * 5) where real-world data might come from (asterisk note at bottom of proposed lead).
 * Inclusion of these items would transition the lead to Section 1 and add context (as specified in WP:LEAD).  IMHO, the current lead presumes such context rather than supplying it.  --Thomasmeeks 22:06, 26 February 2007 (UTC) (edited) --Thomasmeeks 12:20, 3 March 2007 (UTC)

Please indicate below which of the following best applies:

(1) The proposed lead would be an improvement over the current lead.

(2) The proposed lead would not be an improvement over the current lead.

Suggested Edits of above:

Other comments:

If you wish to write your own lead that improves on the above, please do so. Thanks for your help --Thomasmeeks 03:11, 28 February 2007 (UTC)

There are certain problems with this introduction. Please see WP:LEAD for more information about introductions. &mdash; Chris53516 (Talk) 19:41, 19 February 2007 (UTC)


 * Concerning the above, if it is believed that there are certain problems, critical comments are welcome. I believe that the right way to read that Edit & the preceding Edit is in terms of what a reader not already conversant with the subject could get (or not) from either Edit. That is the perspective the Edit was intended to address. I find a relatively easy way to make a comparison is to go to the history tab of the article for the above Edit:


 * ( 19:23, 19 February 2007 Thomasmeeks (Talk | contribs) (Lede (regr. with less tears): leads up to previous Edit (starting with "data") in steps. "Predicted" explained.* Last para.: almost unchanged. *Exact definitional relationships vs. regression eq.)


 * Then click the button for that Edit, then the (last) hypertext to make side-by-side comparisons. No substantive point was deleted from the Edit, but substance was added, which I hope makes the transition to the next section easier. If there are any points that need raising, I'd do my best to address them. The objective all share is of course improving the final product. --Thomasmeeks 01:00, 20 February 2007 (UTC) (typo fixed) Thomasmeeks 01:04, 20 February 2007 (UTC)


 * Please read WP:LEAD. It still doesn't fit the appropriate style. By the way, you can correct your own comments, just delete the time stamp at the end and replace it with 5 ~ to insert the new time. And my point in bringing the lead section you edited here was to edit it, so go ahead and edit it. &mdash; Chris53516 (Talk) 15:08, 21 February 2007 (UTC)
 * Thx for your engagingly brief additional comment, to which I have tried to respond, as indicated above. If I have not (sufficiently) succeeded, to paraphrase the old saw, hit me again while I can still hear you. --Thomasmeeks 20:50, 21 February 2007 (UTC)


 * I appreciate the concern shown above. WP:LEAD is a guideline, not a policy, for the good reason that articles may differ as to what the lead may require to make it accessible. I hope the revision will be judged not by some standard of perfection but by whether it fills the gaps in the current lead, making the article that follows more accessible to those not already familiar with the subject. I do believe the above is enough of an improvement to let it run. If someone comes up with something better, so be it. Thx. --Thomasmeeks 21:25, 22 February 2007 (UTC)


 * I disagree. I think we should follow the guideline. Furthermore, you should give it time to let others view your suggestions before posting this lead. By the way, what's with the asterisk at the bottom of the lead? Why is that there? It doesn't make sense, and it doesn't fit with how we usually write in Wikipedia. &mdash; Chris53516 (Talk) 21:35, 22 February 2007 (UTC)


 * I've tried to guess what you had in mind, but my answers don't seem to satisfy you. Would you consider cutting and pasting those portions of WP:LEAD that you believe I'm not complying with? On the asterisk, it's there to make the note more accessible (because more important) than a footnote. With the mention of data, the reader might reasonably ask: "What data?" --Thomasmeeks 23:20, 22 February 2007 (UTC)

Thank you, Thomasmeeks, for this discussion. - I think that the proposed lead is better, but still far from perfect (of course the whole article is horrible etc. ...). It is written from the bad perspective, because the reason for regression is not estimation of parameters but estimation of something like E(x|y) - expectation of x (dependent) given y (independent). The parameters are only means to this end in the case of parametric models. Therefore I think that the lead (and the whole article) should be thoroughly rewritten. Greetings, --Ioannes Pragensis 08:45, 28 February 2007 (UTC)


 * Agreed on substantive points, Ioannes Pragensis, including critical comments on the proposed lead. The instrumental use of  regression analysis is prediction (in an E(x|y) sense).*    I trust that no one likely to read this section will mistake your passion (including your, may I say, charming paren. aside) for a lack of scientific temperament. You are the Velvet Revolution's gift to this page.
 * * There is also the scientific use of course, which in a comprehensive sense is also arguably instrumental.--Thomasmeeks 13:23, 28 February 2007 (UTC)

"Proposed lead": a proposal for decision
The alleged advantages and disadvantages of the proposed lead for Regression analysis are spelled out following the proposed lead above. Here is another proposal, one for avoiding a possible edit disagreement*:
 * Let a majority of registered users decide above on incorporation of the proposed lead (as amended) into the article lead.

Registered-users-only discourages double-"voting". If this proposal is not rejected after 3 days by a majority of registered users commenting on the proposal, the clock could continue ticking for an additional day as to acceptance or rejection of the above proposed lead (or variants thereof). Comments are welcome on this proposal below. (Comments are also welcome in the space provided above as to the proposed lead itself. Presently, of the 3 people who have commented above on the proposed lead, 2 lean in favor of it.) --Thomasmeeks 19:56, 1 March 2007 (UTC) (edited to "for an additional day" above) Thomasmeeks 12:20, 3 March 2007 (UTC)
 * Above proposal seems to be accepted by 3 users. So, we can solicit a request for comment (through a minor edit with Edit Summary) of the article.

Comments:

* This is not simply an imaginary concern. The following gives 2 successive earlier edit-and-reverts of the Regression analysis lead (with Edit summaries included), punctuated by (deep-indented) Talk page responses in this "Proposed lead" section:
 * 19:23, 19 February 2007 Thomasmeeks (Lede (regr. with less tears): leads up to previous Edit (starting with "data") in steps. "Predicted" explained.* Last para.: almost unchanged. *Exact definitional relationships vs. regression eq.)
 * 19:32, 19 February 2007 Chris53516 (←Undid revision 109362201 by Thomasmeeks (talk) - let's discuss these changes first)
 * 19:41, 19 February Chris53516 [Creation of this "Proposed lead" section]
 * 21:34, 22 February 2007 Thomasmeeks (See Talk:Regression analysis#Proposed introduction: Intended to make content of previous lead more accessible)
 * 21:36, 22 February 2007 Chris53516 (←Undid revision 110159582 by Thomasmeeks (talk) - you didn't wait for a response, and I still don't think this lead is good)
 * 08:45, 28 February 2007 Ioannes Pragensis (Talk | contribs) (→Proposed lead - comment) [above in this section]

--Thomasmeeks 19:56, 1 March 2007 (UTC)

Thomasmeeks, please wait with the change of the lead. I'll try to summarize the reasons below. Taken together, they mean that the proposed lead in its current form is full of errors and half-truths, so we should wait until it is more correct. I am glad that you work on it, but give it time to ripen, please.--Ioannes Pragensis 20:16, 3 March 2007 (UTC)
 * 1) The proposal does not read like a standard Wikipedia lead, as mentioned above
 * 2) As I told you, the parameters are not the most important thing in a regression; you should explain the substance of the regression model first and then the parameters and their estimation
 * 3) If you say "In statistics, regression analysis is the process used to estimate the parameters (constants) of a mathematical model from data for the variables in the model." then it defines all parametric models, not only regression models - the same is true, e.g., for many clustering techniques.
 * 4) "Regression estimates are used, because parameter values are not known with certainty beforehand." - It sounds as if they were known with certainty after the exercise, which is obviously false.
 * 5) "There are many methods developed to fit regression functions, and these methods typically depend on the type of function being used." - This is only partly true; they depend on the definition of the loss and of the priors much more than on the type of the regression function.
 * 6) "For example: linear regression, nonlinear regression, and logistic regression. A generalization of all of these models has been formalized in the "Generalized Linear Model"." - another false statement - GLM is not a generalization of nonlinear regression.


 * Thank you for your response, Ioannes Pragensis. (As for your next-to-last sentence, could you speak more plainly ; )? Again, please consider presenting an alternative proposed lead at the top of this section (borrowing, adding, or replacing anything you think appropriate).


 * I take it that your Edit summary here ("still not acceptable"), together with your earlier comment, means that the improvement you acknowledged earlier is not sufficient in this case for an Edit on the article. I grant that if the improvement is small enough, a big change might raise too many questions. I do think that the numbered advantages at the top of this "Proposed lead" section are large. Nor have you denied that. Are the disadvantages too large? Let me attempt to address each of your points above. Please excuse me if I have misunderstood.
 * Repetition is not substantiation.
 * This is a proposed lead. There is a link to mathematical model in the proposed lead. Why isn't that sufficient for the purpose of the proposed lead for this article? (I thought that I responded in part to this point in the section above. So, I'm surprised at your raising it here.) (The current proposed lead attempts to meet your point further in its 1st sentence. You & Woollymammoth make a good point.)
 * Doesn't regression analysis refer only to models that use data to estimate model parameters (even if not all parametric methods use regression analysis)? If the answer is "yes," then I don't follow your point. If the answer is "no," then I'm not clear what you are referring to.
 * I agree that this would be a false inference. Still, one cannot guard against every mistaken inference in one sentence. But I'd of course welcome a better alternative.   That inference (which, one hopes, would be infrequent)  is refuted elsewhere in the proposed lead with reference to goodness of fit.  (Deleted "beforehand.")
 * Not my doing.
 * Not my doing. (I left what was there.  Should these be deleted in the interest of brevity?  Should they be edited instead? Those are reasonable questions to pose.)


 * Despite the spirited exchange here (made easier by common courtesies reflecting mutual respect), I trust that no reader here will mistakenly conclude that there is agreement on a significant Edit of the article lead. I believe that any proposed lead should be a bridge to the rest of the article. That is what good leads do. They encourage further reading, rather than puzzlement. I feel that the attempt should be to improve on what is there, rather than waiting for the perfect lead to mate with the perfect non-lead. I hope too that the above apparent differences here can be (re)solved (analogously to a mathematical problem). Until there's more agreement, there will of course be a temporary delay on any big edit of the lead, as related to this subsection.  --Thomasmeeks 23:21, 3 March 2007 (UTC)

I have edited the lead based on statistical books I have, which consider regression analysis to be a special case of analysis of variance (e.g. p. 147f of Statistics Manual, Crow, Edwin, Frances A. Davis, Margaret W. Maxfield. Dover, 1960). I hope this improves the lead. (I was logged off for some reason?!) Woollymammoth 20:59, 4 March 2007 (UTC)


 * (Wiki may automatically log users out at times.) Interesting Edit above of the "Proposed lead" by Woollymammoth. A big advantage is noting at the start that regression analysis allows more than just estimation of parameters. (#4, deleted here, was a slip. Sorry.)
 * I have above tried to simplify, leaving out everything not necessary to the flow & avoiding unexplained terms covered later in the article. Deletion of the last sentence was consistent with that and the point raised in the section below. Thomasmeeks 05:10, 5 March 2007 (UTC)


 * Good, many problems were solved, and it is now clearly better than the current lead, which can IMHO be replaced. Thank you.--Ioannes Pragensis 09:09, 5 March 2007 (UTC)

Ideas for a Lead

I think a lead should have two elements. First, it should present an intuitive explanation of the notion of regression. For instance, it could explain that regression aims at studying how some variables may have an effect on a given variable (a practical example would be welcome here). Then the lead should contain, or be followed by, a rigorous definition of what a regression is. This is, in my opinion, what is missing in this article. I thought to do it myself, but I do not have enough time. Basically, regression is the calculation of the conditional expectation $$ \operatorname{E}(Y | X)=m(X) $$. This definition makes it possible to link the different regression techniques by describing the characteristics of $$m$$: linear, nonlinear, nonparametric... Such a presentation would make it possible to have an article understandable for non-technical users and useful for students in statistics by giving them a broad and encompassing view of the notion of regression. Gpeilon 12:07, 8 March 2007 (UTC)
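The conditional-expectation view of regression sketched in the comment above can be illustrated numerically; a minimal sketch (the data and the linear form of m(X) are assumptions for the demo, not from the article):

```python
import numpy as np

# Regression viewed as estimation of m(X) = E(Y | X).
# Here the true m is linear, m(x) = 2x + 1, so a linear fit recovers it.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, 500)  # Y = m(X) + noise

slope, intercept = np.polyfit(x, y, 1)  # least-squares estimate of m
```

Swapping in a different family for m (polynomial, nonparametric smoother, etc.) is what distinguishes the various regression techniques the comment lists.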


 * A very good idea. Just one qualification. I believe that the WP:LEAD should be "as simple as possible (but no simpler)" and that the best place for it would be after the lead, possibly in Section 1. The new Lindley reference (a real gem) does just what you propose early in his nontechnical entry, but for readers who could be expected to have a considerably stronger technical background on average than Wiki readers. Placement after the lead would allow building on, editing, or replacing part of Section 1 to assist in exposition. Thanks.    --Thomasmeeks 23:50, 10 March 2007 (UTC)

Reference in Intro to Generalized Linear Models
Generalized linear models generalize linear models but do not generalize nonlinear regression. y = x1 * log(1 + x2) + error is nonlinear, but what's the linear predictor?
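To illustrate the point, one reading of the commenter's model (treating x1 and x2 as parameters t1, t2 applied to a single regressor x; this reading, and the data, are assumptions for the demo) has a parameter inside the log, so no linear predictor exists and it must be fit by nonlinear least squares rather than as a GLM:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical model with parameters entering nonlinearly:
# y = t1 * log(1 + t2 * x) + error.  Because t2 sits inside the log,
# there is no linear predictor, so this is not a GLM.
def model(x, t1, t2):
    return t1 * np.log1p(t2 * x)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 5.0, 200)
y = model(x, 3.0, 0.8) + rng.normal(0.0, 0.1, 200)

# Nonlinear least squares recovers the parameters iteratively.
(t1_hat, t2_hat), _cov = curve_fit(model, x, y, p0=[1.0, 1.0])
```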

Proposal for exporting "History of regression" to separate article
In favor of the above: Provisional date for acting on this proposal if there is enough consensus: March 25, 2007. ++++
 * 1) It disrupts the narrative of the lead and section 1.
 * 2) It is very partial, stopping at 1925. Its treatment is very partial even up to that year. A lot happened after that.
 * 3) As an article, it could be referred to in one line.  Those interested could click to that article.
 * 1) Those interested in improving that section could give it undivided attention without concern that Edits would raise concerns among other readers and editors concentrating on other aspects of the article.
 * 2) Those with an interest in other regression articles could contribute to a separate article. --Thomasmeeks 23:16, 10 March 2007 (UTC) Comments welcome below, for or against, please:

Favor the above proposal (other comments besides above welcome):
 * I would be in favour of creating a separate page on the history of regression analysis. A brief summary in the article on regression should still be maintained. Woollymammoth 20:14, 13 March 2007 (UTC)

+++++

Do not favor the above proposal: (Please consider giving reasons against the proposal below)

+++++

Other (please specify):
 * The section should stay here, because there should be at least a brief history section in all articles about things with a history. But it is still possible to create a more comprehensive separate History of R. article after we collect more information. Read WP:SUMMARY, please.--Ioannes Pragensis 14:27, 12 March 2007 (UTC)


 * Excellent WP:SUMMARY link. The point to which I believe that Ioannes Pragensis is referring is in the 2nd box of that link -- on articles that have grown too long. I don't believe that Regression analysis is too long.  There are good and not-so-good ways to do history in an article.  Good ways are those in for example:
 * Derivative (one unbusy-looking, interesting paragraph)
 * Partial derivative (one sentence at the end of the lead).
 * I believe that there would be agreement that the question is not history or no history, but what is to be done in this article with the section in question as currently written, relative to the rest of the article. If someone could come up with a section that eliminated the above problems, we might not be having this (to be sure, pleasant) discussion. Thomasmeeks 17:08, 12 March 2007 (UTC)

--Thomasmeeks 23:16, 10 March 2007 (UTC)

I purged the mess there a bit and hope that it is far better now. We need a few words about the pre-history (Gauss etc.) and it will be OK.--Ioannes Pragensis 09:42, 13 March 2007 (UTC)


 * I believe that many would regard the new Edit of this section as indeed a great improvement, Ioannes Pragensis. Though some disadvantages of the previous Edit of this section were mentioned above, we can be grateful that earlier efforts set this section on the right course. --Thomasmeeks 13:22, 14 March 2007 (UTC)

Who is the audience for this article?
Most might agree that the article Regression analysis is not in any significant way written for specialists who already know the subject. Rather it should be written so that someone innocent of the subject but thirsting to learn could get an appreciation and substantive knowledge of the subject. To entice readers, the article should of course be a beautiful, elegantly simple, and  clear cumulative development of the subject, in the fashion of many great statisticians. Comments? --Thomasmeeks 13:22, 14 March 2007 (UTC)

Progress report
Almost 2 weeks ago, someone predicted elsewhere on Wiki that the use of the present article would soar in the near term (thanks to expository & content improvements). This has indeed happened. While it is easy enough to criticize earlier efforts, it would not have happened without those earlier efforts. Improvements are still possible of course. --Thomasmeeks 20:11, 26 March 2007 (UTC)

Thanks to statisticians (or close), consolation to others, & a plea
The "thanks" is for your interest in the Regression analysis article as readers and editors. What a compliment to non-statisticians whose Edits are corrected or improved (or reverted) by you, often with an Edit summary or Talk Page reference. This is in the best spirit of critical rationalism. The principle of charity is thereby exhibited (where there is any truth worth salvaging, of course). Rather, indifference ("Why bother fixing this mess?") is true contempt.

For others who may have been bruised in the editing process (and who hasn't been at times?), take consolation. Corrections or deletions are rarely personal. Even those few reverts without explanation usually have a good reason, which reverters should be prepared to give in detail upon request if necessary.* And there is always redress on the Talk page.

The "plea" referenced above is to statisticians (& their kissing cousins), but it applies to any professional field: Your editing with the lay public in mind, from the lead on, is sooo appreciated. And that's not easy to do well except in a careful, cumulative fashion. If you edit for the interested but presumably intelligent and uninformed layperson, that should be sufficient, provided that the exposition is relatively self-contained (and not circular) or that it has transparent links (at least after you edit them). "As simple as possible but no simpler" is a useful maxim, including clarifying wording as to what might otherwise come across as jargon. (Terms that have a specialized definition in stat may need supplementing, with possibly a very brief gloss.) Thanks again.
 * * If they are not prepared to explain themselves, they risk losing credibility. They risk nothing if particular reasons do not hold up. That's part of the normal conversation in an open editing process for such a collective enterprise. Comments? -- Thomasmeeks 13:22, 14 March 2007 (UTC) (Deleted redundant sentence.  Thomasmeeks 21:19, 14 March 2007 (UTC) (Proof edited.)  --Thomasmeeks 19:45, 26 March 2007 (UTC) (minor Edit) Thomasmeeks 21:35, 27 March 2007 (UTC)

... Ph.Ds Run Amok ~
~ BRAVO! MAGNIFIQUE! This is the most remarkably complete piece of theoretically impeccable work that I have ever seen. I stand in the pall of greatness. I am most unworthy.

A. Samuel Joseph III, Geospatial Econometric Analyst 08:02, 17 March 2007 (UTC)

PS~ Okay?

INCOMPREHENSIBLE. Totally.

Clarification
User:Thomasmeeks asked me to explain the Clarify tag I added back in. The SMOG Index (17) in the introduction is much too high (post-graduate), and the introduction doesn't summarize the article. Please see WP:BETTER. I originally came across this article when I wanted to remind myself what regression is, because I haven't actively used it for 20 years. I had to look at the example to understand what was written here, then I had to get a statistics book for a clear definition. To the best of my knowledge this article is accurate. To make it useful, see especially WP:OBVIOUS and WP:TRITE. I hope this is helpful. --SueHay 21:31, 12 May 2007 (UTC)


 * Thx for response.
 * (1) By the "introduction," I take it that you mean the lead paragraph, rather than the first labelled section ("Introduction"), correct?
 * Yes. And there shouldn't be a section "Introduction". --SueHay 00:18, 13 May 2007 (UTC)
 * (2) Would you consider the clarify template being relabelled for the current month?
 * No. I don't see any significant improvement since then. Sorry. Please keep in mind that I'm not questioning anyone's expertise here. It's a clarity problem. --SueHay 00:18, 13 May 2007 (UTC)
 * The same person who posted the earlier template took it down (me). I'll look forward to following up on your links.   Thx again. --Thomasmeeks 22:19, 12 May 2007 (UTC)


 * Well, one might mistakenly conclude that the current template had been up continuously since Feb. Would it not be reasonable, given the change of editor, to reflect that difference, even though you agreed with its being there earlier? That's what I've seen on other sites where the template comes down and someone else puts up another template. It avoids a misleading inference. Thx.  --Thomasmeeks 04:49, 13 May 2007 (UTC)
 * The clarification request refers to the text of the article. It's not directed at any single editor, it refers to problems with the text. I'm not looking at who's doing the editing, but I certainly hope there hasn't been a "change of editor" (in the singular form). Wikipedia is a collaboration to provide accessible information to the readers. As far as clarity goes, as I said, I don't see an improvement since February. --SueHay 15:21, 13 May 2007 (UTC)


 * In any case there does seem to be increased use of the article since March as noted above (possibly intrigued by the prospect of its obscurity ; ). The "change of editor" referred to the person who posted the clarify template. --Thomasmeeks 02:51, 14 May 2007 (UTC)


 * An earlier proposed lead is discussed at Talk:Regression analysis above.
 * (1) Do you consider that an improvement over the current lead?
 * (2) What would a clearer article look like? ("Is" inserted in last para., 1st line.) Thomasmeeks 20:57, 14 May 2007 (UTC)

Simple Vs Complicated
I would suggest that the first page have a more accessible description, with a link to this section for the maths behind it.

With statistics you must have the maths described somewhere, because of differing terminology and to be absolutely precise about what we mean.

But for the general public, we need a more accessible explanation without the mathematics.

"Thanks for listening"211.30.66.104 22:58, 12 May 2007 (UTC)

Template
Just thought I would say that the template saying this article was not clear gave me a good laugh -- if only my stats classes came with templates such as that, I could have tried to avoid them... :) XinJeisan 05:08, 18 July 2007 (UTC)

nice article; suggestions
This is a very nice article, and all the complainers should write something this nice. To satisfy the too-rigorous crowd, you might try something like this in the intro:

Regression analysis is a way to find a mathematical equation that describes the relationship between two "variables", such as height and weight. Intuitively, we might think that weight, on average, increases with height; regression analysis allows one to find a mathematical equation that describes this relation; such equations have wide use in science and industry.

This avoids use of terms like dependent and independent variable, which don't need to go in the intro.

Can you include an example of how to find the error term in the linear regression equation? E.g., in the height vs weight example, what is the CV or error in the y values? In immunoassay lingo, this is called using transformed numbers? Cinnamon colbert 17:34, 11 August 2007 (UTC)
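The height-weight example suggested above, together with the error-term question, can be sketched minimally; the numbers below are made up purely for illustration:

```python
import numpy as np

# Hypothetical height (cm) vs weight (kg) data for illustration only.
height = np.array([150, 160, 165, 170, 175, 180, 185, 190], float)
weight = np.array([52, 58, 63, 68, 72, 77, 83, 89], float)

b, a = np.polyfit(height, weight, 1)   # fit: weight ~ a + b*height
residuals = weight - (a + b * height)  # per-point "error term"
# Residual standard deviation (n - 2 degrees of freedom):
s = np.sqrt(np.sum(residuals ** 2) / (len(height) - 2))
```

The residuals are the estimated error terms; s summarizes their typical size in the units of y (kg here).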


 * There might be the core of a good idea here. I believe that there is not much hope for a reader stumped by 'dependent' & 'independent' variable lingo in the Lead, esp. since it prepares the reader for what follows and links are provided. (After all, the article is on "regression analysis," not just "regression.") Those terms might be familiar or easy to pick up on. They could be used to suggest the height-weight example (that the taller the members of a sample are, the heavier they are on average), which might be worked briefly into the Lead as a short para.


 * The last para. above goes beyond the Lead. Others might wish to comment on it.  --Thomasmeeks 22:25, 11 August 2007 (UTC)

Simple linear regression
The formula does not contain "alpha"; instead it is written with "beta". Shouldn't it look like Yi = a + bXi + e, not Yi = b + bXi + e?

Wikestod 18:01, 25 August 2007 (UTC)
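For reference, the corrected form Yi = a + bXi + e can be estimated with the textbook closed-form least-squares formulas; a small sketch on made-up data:

```python
import numpy as np

# Closed-form OLS estimates for Y_i = a + b*X_i + e_i, on made-up data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Slope: covariance of x and y over variance of x.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()  # intercept: line passes through the means
```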

Linear regression based on polychoric correlations ?

In the "Generalizing simple linear regression" section somebody stated:

"An alternative to such procedures is linear regression based on polychoric or polyserial correlations between the categorical variables."

Can anybody tell me how this works or who wrote this ? Or maybe someone knows literature describing this specific procedure !?

I'd appreciate it if you could give me a lead! Thanks in advance!

Cosmoxxx 11:28, 3 September 2007 (UTC)

This is not fit to be an encyclopedia article
I Googled "regression analysis" and this article was the first choice. It's too bad because there are better choices like the one here: http://www.law.uchicago.edu/Lawecon/WkngPprs_01-25/20.Sykes.Regression.pdf

I spent years as a computer specialist, which means that I'm not afraid of math. Although I didn't do so well in my college-level math courses (where I don't remember encountering regression analysis), I'm comfortable discussing numbers. For a work-related project, I needed to learn something about regression analysis. Thus, my attempt to Google the term. Unfortunately, the information provided is complete gobbledygook to those who don't already know what a regression is.

The whole point behind an encyclopedia article should be to educate those who want more than a definition but less than expert-level material. It should explain the basics, and provide sources for those who wish to learn more. This article, as written, is not meant for people like me -- somebody comfortable discussing numbers, but far from an expert at math.

If you want to see an example of how mathematics can be explained to non-math but number-comfortable people, look at the U Chicago article I cited above. Would somebody please try to rewrite this article so that those looking for basic information can read it?

Techielaw 03:06, 19 September 2007 (UTC)

"Population regression function" section moved from article
The indented material below with the heading "Population regression function" was removed from the Regression analysis article:
 * The population regression function (PRF) is a linear function that is derived from the sample regression function (SRF) which represent the population and sample regression lines, respectively. The SRF can be expressed as: the estimated dependent variable (Y) equals the estimated beta1 parameter value plus the estimated beta2 parameter value multiplied by the explanatory variable (X) plus the (sample) estimated residual (denoted as u-hat sub i). From this function, the PRF can be expressed as: the dependent variable (Y) equals the beta1 paramenter value plus the beta2 paramater value times the explanatory variable (X) plus the stochastic error (denoted as u sub i). These functions serve purposeful during regression analysis, which ultimately determines how the "average value of the dependent variable (or regressand) varies with the given value of the explanatory variable (or regressor)." The stochastic version of the PRF is critical for empirical studies - stochastic meaning that the disturbance term is added to the function in order to completely estimate the PRF.

It was added to the article on 19:00, 14 July 2007 with the Edit summary "merged in Population regression function." In the time since then, it has remained undigested in the rest of the article: unnecessary, confused, & obscure. Anyone is welcome to try to rescue anything believed to be of value for the article. --Thomasmeeks 19:16, 19 September 2007 (UTC)

A major proposal
Please see talk:least squares for details. Petergans (talk) 17:41, 22 January 2008 (UTC)

References requested
An IP editor has requested that additional references be added to this article. I am leaving a note here in lieu of numerous tags on the article itself. &mdash; Carl (CBM · talk) 13:59, 6 February 2008 (UTC)

Example
Can somebody explain how the standard deviations of the betas were computed? I.e., where the 16, 4, and 0.05 come from. Tesi1700 (talk) 19:59, 17 February 2008 (UTC)
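The example's specific numbers are not reproduced here, but coefficient standard errors in OLS are conventionally the square roots of the diagonal of s^2 (X'X)^-1; a sketch on illustrative data (the data are assumptions for the demo):

```python
import numpy as np

# Standard errors of OLS coefficients: Var(beta_hat) = s^2 (X'X)^-1,
# where s^2 is the residual variance with n - p degrees of freedom.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, 30)
X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 30)

beta_hat, *_rest = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
s2 = resid @ resid / (len(y) - X.shape[1])        # residual variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))  # per-coefficient SEs
```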

Major revision
This article has been given a thorough going-over. The revision is based on the earlier version, but the order of sections has been extensively rearranged and material has been transferred between sections. Each section has been revised for improved clarity. Hopefully there is now no reason to flag the article as confusing or unclear. Petergans (talk) 12:42, 20 February 2008 (UTC)

Moved the example to linear regression Petergans (talk) 10:08, 22 February 2008 (UTC)

I edited the notation to make it more consistent with other pages and with standard regression textbooks. The $$ \alpha, \beta $$ notation is confusing when one moves to multiple regression. Also, I cleaned up the discussion a bit. EconProf86 (talk) 01:37, 10 March 2008 (UTC)