Talk:Linear least squares/Archive 2

Relevancy
03:33, 26 October 2005 Oleg Alexandrov m (I don't find the octave code particularly relevant;)

It's very relevant. Actually calculating this is interesting, not just knowing how to calculate it with pen and paper. --marvinXP (talk)


 * You are referring to this.


 * I am myself doing numerical analysis for a living, use Matlab (of which Octave is a clone) and have no bias against numerical programs. However, that piece of code is just rewriting the formula


 * $$ A^T \! A \mathbf{x} = A^T \mathbf{b}, $$


 * from the article in Octave's notation. It is a trivial exercise I would say, and not worth its place in the article. Oleg Alexandrov (talk) 06:04, 26 October 2005 (UTC)


 * I agree. Actually, I would have deleted the Octave fragment if I had time to do so before Oleg did. -- Jitse Niesen (talk) 11:10, 26 October 2005 (UTC)


 * I see your point, but I disagree. A non-mathematician may want to actually use the formula. It's a very practical application. It's not a matter of "this is how it's written in program X", but rather "if you have the values, you can plug them into this free program, and have the result presented to you". I'd say most people who'd want to calculate this (let's assume they know of least squares) may not know Octave, or Matlab, or any other math program besides a simple calculator. And may not afford Matlab. --marvinXP (talk), 16:24, 29 October 2005 (UTC)


 * Even if it is true that most people who know about least squares and want to calculate a least-squares solution, do not know that there are programs for doing so (which I doubt), it would be irrelevant for this article, because it's not a statement specifically about least squares. Many mathematical concepts can be calculated by programs, free or otherwise. I think it would be silly to add this fact to articles like matrix multiplication and matrix inversion. By the way, your Octave fragment is suboptimal. As a general rule, one should try to avoid inverting a matrix. The proper way is to use the slash, as follows:
 * A = [0,1; 2,1; 4,1; -1,1]
 * b = [3; 3; 4; 2]
 * x = A \ b
 * -- Jitse Niesen (talk) 17:51, 29 October 2005 (UTC)
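For readers following along in Python rather than Octave, here is a rough NumPy sketch of the two routes Jitse contrasts, using the same data as the fragment above (illustrative only; `lstsq` stands in for the backslash operator):

```python
import numpy as np

# The same small system as in the Octave fragment above.
A = np.array([[0., 1.], [2., 1.], [4., 1.], [-1., 1.]])
b = np.array([3., 3., 4., 2.])

# Normal-equations route: solve (A^T A) x = A^T b directly.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Preferred route (roughly what Octave's backslash does for a tall
# system): an orthogonalization-based least-squares solve.
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

# Both give the same answer on this well-conditioned problem.
assert np.allclose(x_normal, x_lstsq)
```

On ill-conditioned problems the two routes can differ noticeably, which is the reason for preferring the orthogonalization route.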


 * I'd consider matrix multiplication less directly applicable; it would just involve defining '*' in a given program. This example had more steps to it (well, it had before you showed the backslash operator; thanks, btw). But I guess I'll concede. --marvinXP (talk), 23:09, 29 October 2005 (UTC)

Normal Equation and Normal Matrix
The normal equation forms the matrix product of A-transpose and A. This forms a new, square matrix C. This new square matrix C is a normal matrix. A normal matrix has the property that its product with its transpose is the same whether pre- or post-multiplied. This makes the matrix symmetric and at least positive semidefinite (usually positive definite).

It is easy to confuse the form for the normal equation and the normal matrix if both refer to a generic matrix using the same symbol 'A'. It is the symmetric positive semidefinite property (and its consequences) that is 'normal'.

Philip Oakley (84.13.249.47) 22:37, 12 April 2006 (UTC)
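The properties Philip describes are easy to check numerically; a small NumPy sketch (the matrix A here is arbitrary random data, for illustration only):

```python
import numpy as np

# Arbitrary rectangular matrix, for illustration only.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))

C = A.T @ A  # the square "normal matrix" formed in the normal equations

# C is normal: it commutes with its transpose (trivially here,
# because C is symmetric).
assert np.allclose(C @ C.T, C.T @ C)
assert np.allclose(C, C.T)

# C is positive semidefinite: no negative eigenvalues (and positive
# definite whenever A has full column rank, as here).
assert np.all(np.linalg.eigvalsh(C) > 0)
```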

Help! We're being outgunned and overrun by helpful mathematicians...
In that this is an encyclopedia, i.e. a place where people go to understand concepts they currently don't, I feel that the "explanation" of this term is utterly too complex and filled with mathematical jargon. Not that it doesn't belong, but there needs to be introductory material geared towards newcomers to this material. If I wanted a mathematical proof or advanced applications I would probably consult a book on statistics.

As an example of a better and more appropriate introduction geared towards mathematical newcomers, the wikipedia entry for "Method of least squares" strikes me as well written and clear. —The preceding unsigned comment was added by 161.55.228.176 (talk) 18:17, 28 September 2006 (UTC).


 * Clarity should definitely come before technicalities, but I think linear least squares is a specific example of an application of the method of least squares, so it is only natural that the former should have more technical detail. The lead paragraph, though, could use some more background, as well as a link to method of least squares. --Zvika 17:02, 28 September 2006 (UTC)

I couldn't agree more with the immediately foregoing comments. One thing that would help greatly IMHO would be to use standard notation, consistent with the usual applications of these ideas, as shown in Regression analysis. It is usual to refer to the data or inputs as the X matrix, the unknown parameters as Beta or B, and the dependent or outputs as Y. Accordingly, the normal equations should be expressed in a format like:

$$ {}_{n}Y_{1} \;=\; {}_{n}X_{k}\,{}_{k}B_{1} \quad\Longleftrightarrow\quad {}_{k}B_{1} \;=\; \left({}_{k}X'_{n}\,{}_{n}X_{k}\right)^{-.}\,{}_{k}X'_{n}\,{}_{n}Y_{1} $$

where the subscripts are the row and column dimensions, identified to show how they must match, and -. indicates generalized inverse. --Mbhiii 18:48, 27 October 2006 (UTC)
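For concreteness, a NumPy sketch of the X-matrix form of the normal equations (the data here are hypothetical, chosen only for illustration; the Moore-Penrose pseudoinverse plays the role of the generalized inverse -.):

```python
import numpy as np

# Hypothetical data: one regressor plus an intercept column.
X = np.array([[1., 0.], [1., 2.], [1., 4.], [1., -1.]])  # n-by-k design matrix
Y = np.array([3., 3., 4., 2.])                           # n-by-1 response

# B = (X'X)^{-.} X'Y, with the Moore-Penrose pseudoinverse standing
# in for the generalized inverse (it reduces to the ordinary inverse
# when X'X is nonsingular, as here).
B = np.linalg.pinv(X.T @ X) @ (X.T @ Y)
```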


 * I have seen both the $${\mathbf y} = X {\boldsymbol \beta}$$ syntax and the $$A {\mathbf x} = {\mathbf b}$$ syntax used in books. As long as the article is consistent, I don't think it necessarily requires one syntax over another. This subscript notation that you suggest is new to me, though. Is this something you've seen before? I am afraid it might be more confusing than helpful if the reader is not familiar with the notation. --Zvika 21:41, 28 October 2006 (UTC)

Internal consistency should not be the only criterion, but external consistency, as much as possible, with the dominant practical uses of the day, should be very important, as well, because it increases recognizability and, therefore, the general utility of the article. The intersubjective standard of reference that Wikipedia represents, and is slowly increasing, requires no less. This argues for the X matrix form. I know the subscript notation is new, but it's very useful. It's a revision of Einstein's matrix notation. (He used one subscript and one superscript per matrix, which conflicted with exponents.) Though very gifted in abstract visualization, he was not always the best student, and noncommutative matrix multiplication gave him headaches in long calculations. This cleaned up version of his notation is gaining acceptance among those teaching matrix algebra and its applications. Simply identify the left subscript as the number of rows and the right one as number of columns, and all is clear. --Mbhiii 14:28, 31 October 2006 (UTC)


 * I don't understand how you can say we need external consistency, and then push for a new notation. By the way, Einstein's index notation does not show the dimensions of the matrices, so it's something completely different. -- Jitse Niesen (talk) 00:49, 1 November 2006 (UTC)

An X matrix form for the normal equations increases the article's external consistency. Revised Einstein matrix notation is for clarity, an aid to people like students or programmers who don't necessarily use matrix algebra all the time. By the way, Einstein matrix notation is not Einstein summation notation. --Mbhiii 13:55, 1 November 2006 (UTC)

Overexplaining?
Zvika, perhaps I am overexplaining, but as a master's in physics I didn't see the equality of the middle terms right away. It seemed useful - to me, and probably to some other users as well - to explain that step in some more detail. Pallas44 13:06, 16 November 2006 (UTC)


 * Sorry... it still seems pretty obvious to me. It's just a result of the fact that for any two vectors, $$a^T b = b^T a$$. Right in the next sentence we differentiate a vector equation, equate to zero and give the result - a far more complicated procedure which is given without any technical details. These are things the reader is assumed to be capable of doing by himself or herself, if necessary, or to just take our word for it if he/she is in a hurry. However, I will be happy to hear other people's opinions. We are talking about this edit. --Zvika 20:09, 20 November 2006 (UTC)

Taking the derivative
Could someone please explain in detail how:

$$\frac{d}{dx} [(A \mathbf{x})^T (A \mathbf{x})] = 2 A^T A \hat{\mathbf{x}}$$

Thanks, appzter 23:14, 15 January 2007 (UTC)


 * It follows from the fact that $$(Ax)^T (Ax) = x^T (A^T A) x$$, and $$\frac{d}{dx} x^T M x = 2 M x$$. --Zvika 10:24, 16 January 2007 (UTC)


 * P.S. See also for information on how to differentiate with respect to a vector. --Zvika 12:15, 16 January 2007 (UTC)
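Zvika's identity can also be checked numerically; a NumPy finite-difference sketch (M and x below are arbitrary illustrative data):

```python
import numpy as np

# Check d/dx (x^T M x) = 2 M x by central differences, for a
# symmetric M of the form A^T A (all values are arbitrary).
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
M = A.T @ A
x = rng.normal(size=3)

f = lambda v: v @ M @ v
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

assert np.allclose(grad_fd, 2 * M @ x, atol=1e-5)
```

Note that the factor 2 relies on M being symmetric; for general M the gradient is (M + M^T)x.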

Middle terms are equal?
hello, i have a problem understanding one aspect of this

on the previous page after the multiplication of (Ax-b)T.(Ax-b) it says "The two middle terms bT(Ax) and (Ax)Tb are equal"

I think this is incorrect, as

bT(Ax) = ((Ax)Tb)T, i.e. the two middle terms are transposes of each other. Can someone please clarify this? If I am correct, the derivation would be different.


 * I'm not sure what you mean, but I think that you understand that bT(Ax) = ((Ax)Tb)T, and that you are asking why bT(Ax) and (Ax)Tb are equal. The trick is that these are all scalars (1-by-1 matrices) and thus equal to their transpose, so ((Ax)Tb)T = (Ax)Tb. I hope that answers your question. -- Jitse Niesen (talk) 08:47, 19 May 2007 (UTC)

thanks! that is what i was asking about!
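Jitse's scalar-transpose argument can be checked in one line; a NumPy sketch with arbitrary illustrative data:

```python
import numpy as np

# Arbitrary illustrative data.
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 2))
x = rng.normal(size=2)
b = rng.normal(size=4)

# b^T (Ax) and (Ax)^T b are 1-by-1, i.e. scalars, so each equals
# its own transpose and the two coincide.
assert np.isclose(b @ (A @ x), (A @ x) @ b)
```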

Redirection from Normal equations
I was redirected from Normal equations to this page (Linear least squares) but the meaning of the Normal equations is not explained or defined here. Should the redirection be canceled? Kotecky (talk) 14:50, 11 January 2008 (UTC)
 * The normal equations are defined towards the end of the Definition section. There might be other uses for this term. Is this what you had in mind? --Zvika (talk) 05:35, 12 January 2008 (UTC)

A major proposal
Please see talk:least squares for details, which include a proposed extensive revision of this article. Petergans (talk) 17:32, 22 January 2008 (UTC)

The contents of this page have been replaced. For discussion see talk:least squares Petergans (talk) 09:22, 30 January 2008 (UTC)

Recent rewrite
I am very unhappy with the recent rewrite to the article. What used to be a simple topic about an overdetermined linear system and some calculus became a full-blown theoretical article on fitting experimental data. While it is the latter where least squares are most used, the way the article is now it is incomprehensible except to the specialist.

Ideally the first part of the article would be the original article, with the more complex stuff later. If the author of the rewrite is not willing to do that, however, I propose a wholesale revert. That may lose valuable information and the insights of a specialist, but it is better than the current incomprehensible thing. Oleg Alexandrov (talk) 15:51, 31 January 2008 (UTC)


 * See Wikipedia talk:WikiProject Mathematics. Oleg Alexandrov (talk) 16:01, 31 January 2008 (UTC)


 * To describe the article as incomprehensible is clearly a gross exaggeration. However, in the light of these and other comments on the talk page mentioned above, I have simplified the presentation of the theory and placed the example after the theory. The more technical details come after them. Petergans (talk) 20:26, 2 February 2008 (UTC)


 * While most folks will say that math is hard, in this case linear algebra is still a simpler concept than statistical data fitting and the standard deviations associated with each measurement. I reverted the first part of this article to the earlier (simpler) linear algebra formulation. The statistical derivation of linear least squares now comes after that.


 * This still preserves Petergans's work while making the article simpler. Comments? Oleg Alexandrov (talk) 03:15, 3 February 2008 (UTC)
 * This is unacceptable. Few scientists will know what a norm is, and many will not know any matrix algebra. The norm notation does not appear in any other article on least squares or regression analysis, so it is of no help to anyone seeking more detail when reading those articles. Furthermore, the notation in the first section is now inconsistent with the notation in the second section.
 * The revert is both careless and irresponsible, based on a POV that is not shared by many correspondents on the various talk pages, including Wikipedia talk:WikiProject Mathematics. Petergans (talk) 08:52, 3 February 2008 (UTC)
 * I have made every effort to keep the notation consistent. I see I failed to use bold X and y instead of italic. I fixed that now. Is there anything else I missed?


 * I replaced the "squared Euclidean norm of the residual" wording with "sum of squares of the components of the residual" (the text immediately after it had the full formula anyway, so those words don't matter much).


 * I am puzzled by your assertion that readers won't know linear algebra. All I did with my edit was bring the derivation of the normal equations a bit higher in the article, before you get into the statistical theory. You yourself wrote the normal equations using matrices. And you use matrix formalism everywhere in the other parts of article yourself. There's no other way when talking about fitting a multidimensional linear model. Oleg Alexandrov (talk) 17:49, 3 February 2008 (UTC)

Least squares: implementation of proposal
 * Least squares has been revised and expanded. That article should serve as an introduction and overview to the two subsidiary articles,
 * Linear least squares (revised, this article)
 * Non-linear least squares (created)
which contain more technical details, but it has sufficient detail to stand on its own.

In addition Gauss-Newton algorithm has been revised. The earlier article contained a serious error regarding the validity of setting second derivatives to zero. Points to notice include:
 * Adoption of a standard notation in all four articles mentioned above. This makes for easy cross-referencing. The notation also agrees with many of the articles on regression
 * New navigation template
 * Weighted least squares should be deleted. The first section is adequately covered in Linear least squares and Non-linear least squares. The second section (Linear Algebraic Derivation) is rubbish.

This completes the first phase of restructuring of the topic of least squares analysis. From now on I envisage only minor revision of related articles. May I suggest that comments relating to more than one article be posted on talk: least squares and that comments relating to a specific article be posted on the talk page of that article. This note is being posted on all four talk pages and Wikipedia talk:WikiProject Mathematics.

Petergans (talk) 09:43, 8 February 2008 (UTC)


 * It looks much better now. Only one comment: I am not so sure if QR is more expensive than forming the normal equations and then Choleski. The Q factor never needs to be computed - one can just go ahead and use the R - and the expensive computation of the normal matrix is not needed as well. In numerical analysis classes we tell students never go the normal matrix route. See for example the LAPACK documentation or a contemporary book on numerical linear algebra, such as Demmel or Trefethen and Bau. (Of course some engineering textbooks, written by people who are experts in engineering rather than numerical linear algebra, may differ....). Regarding SVD, Matlab switched from QR to SVD years ago because of higher reliability in spite of somewhat larger cost. The difference in cost is only a small constant factor anyway. Jmath666 (talk) 06:05, 9 February 2008 (UTC)


 * Trefethen and Bau say that normal + Cholesky costs $$mn^2 + \tfrac13n^3$$ flops and QR (using Householder reflections) costs $$2mn^2 - \tfrac23n^3$$, so if $$m \gg n$$ then QR is twice as expensive. So if you're smart enough to know that instability is not going to be an issue, you should go for normal equations. The problem is that the normal equations method is unstable in some quite natural situations.
 * The only thing I'm not so sure of is that QR works if X is not full rank. In that case, R is singular so it seems you can't solve $$R\hat\beta = Q^Ty$$. There are also some smaller things that I intend to change at some point; for instance, saying that having a triangular matrix facilitates solving the equation is a bit weak. -- Jitse Niesen (talk) 17:24, 9 February 2008 (UTC)
 * You do not need to create $$Q$$ at all, just solve $$\mathbf{ R^T R \hat \boldsymbol\beta = X^Ty} $$ using $$\mathbf{ R}$$ twice. (I hate those matrices in boldface. Engineering and stat books do that. Math rarely.) As to the singular case, then the least squares solution is not unique and I seem to recall that finding the minimum norm least squares solution between those is the same as applying the Moore-Penrose pseudoinverse. I suppose one could use QR to get some least squares solution, not necessarily the minimum norm one, and there is also the Rank-revealing QR decomposition that is used instead of SVD sometimes but I would have to research that. See Hansen ISBN 0-89871-403-6. Jmath666 (talk) 20:44, 9 February 2008 (UTC)
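A NumPy sketch of the "Q-less" route Jmath666 describes (illustrative only; `np.linalg.solve` stands in for the two triangular solves a production code would use):

```python
import numpy as np

# Arbitrary illustrative least-squares problem.
rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)

# Keep only the triangular factor R of X = QR (Q is never formed),
# then solve R^T R beta = X^T y using R twice.
R = np.linalg.qr(X, mode='r')
beta = np.linalg.solve(R, np.linalg.solve(R.T, X.T @ y))

# Agrees with a standard least-squares solve.
assert np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0])
```

This works because Q has orthonormal columns, so X^T X = R^T Q^T Q R = R^T R.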


 * The flop count I quoted is without creating Q. It comes from Trefethen and Bau, so I believe it unless I see some evidence it's wrong. I see that the singular case is not even mentioned in the article, so it is weird that it suddenly turns up in the computation section. You're right that Moore-Penrose computes the minimum-norm least squares solution. I wouldn't be surprised if you can adapt QR to this situation, but like you, I don't know. Generally, there is of course much more to say about the computation; whole books have been written about it. -- Jitse Niesen (talk) 17:43, 10 February 2008 (UTC)
 * Quite! May I suggest that a separate article on numerical solution of linear least squares or numerical methods for linear least squares might be in order? The computation section of this article is already more technical than the rest of it. I'm a trained statistician but I was never formally taught about the computational side of least squares. I think that what is most needed in this article is to say that direct numerical solution of the normal equations is not generally efficient or numerically stable, and to refer the interested reader to another article for more details. From the discussion here there seems to be enough knowledge and interest around to start one. Qwfp (talk) 19:16, 10 February 2008 (UTC)
 * I picked up what I know from websites such as Allan Miller's (not easy to navigate but a mine of info, esp. if you program in Fortran 90), Numerical Recipes and the LAPACK docs, mainly out of curiosity. I have a hunch that the linear least-squares routines of several statistical software packages are based on rank-revealing QR, which gives excellent results on the NIST test data sets for linear regression. Statisticians are rarely interested in the minimum norm solution when the system is (numerically) rank deficient - we'd rather drop variable(s) from the model, which is essentially what rank-revealing QR does. It's also useful if you want to add or delete observations without refitting the model - see this extract of Björck's 1996 book on Google books. Qwfp (talk) 19:16, 10 February 2008 (UTC)
 * If the data are truly random the matrices tend to be well conditioned, esp. for modest dimension, normal equations are just fine, and the singular case just does not happen. Statistics and data analysis is just one application of the least squares technique, yet this collection of articles seems to be dominated by the statistical perspective. Jmath666 (talk) 20:53, 10 February 2008 (UTC)
 * That depends what you're fitting. If you're doing straight-line regression, maybe that's true.  If you're fitting 50 radial basis functions, you may find that the singular value of the 50th (least well conditioned) singular vector is getting really rather singular indeed. Jheald (talk) 21:49, 10 February 2008 (UTC)

Ill-conditioning
I had moved QR and SVD to non-linear least squares before reading the discussion above. I apologise for that. I had not realized that ill-conditioning was as important in linear least squares as this discussion would seem to indicate that it is. Where do you folks suggest that they should be, bearing in mind that we don't want to repeat too much material in the linear and non-linear articles? I'm not in favour of creating a separate article. It's a pity that WP does not have a "small print" option for the more technical bits.

I think that it is necessary to distinguish two types of ill-conditioning: intrinsic and non-intrinsic. An example of intrinsic ill-conditioning comes with fitting high-order polynomials, where the design matrix is a Vandermonde matrix, which is intrinsically ill-conditioned for large n. In that case the remedy is to re-cast the problem in terms of fitting with orthogonal polynomials. With the example of 50 basis functions, is it sensible to determine all of them simultaneously? Surely a lot of them are "known" from simpler systems?

In non-intrinsic cases, either the model is inadequate or the data don't define the parameters, or both. In my opinion it is futile to seek a mathematical remedy for this kind of problem. If the data cannot be improved the parameters of that model cannot be determined. Petergans (talk) 16:24, 11 February 2008 (UTC)

Confidence Interval Example
I hope somebody can clarify the example. They jump from computing r1 = 0.42, r2 = -0.25, r3 = 0.07, r4 = -0.24, and S = 0.305 to saying that alpha = 2.6 +- 0.2 and beta = 0.3 +- 0.1. Where do 0.2 and 0.1 come from? Tesi1700 (talk) 00:16, 17 February 2008 (UTC)
 * This came about as a result of Oleg's careless reversion captioned "replace example with what I think was the simpler and the better written one from history". I have reinstated it because I want to refer to it in regression analysis where the expression
 * $$\hat{\beta}=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}$$
 * is given without proof.
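A quick NumPy check that the quoted closed-form slope agrees with the matrix formulation (the data here are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical straight-line data.
x = np.array([0., 2., 4., -1.])
y = np.array([3., 3., 4., 2.])

# Closed-form slope estimate quoted above.
beta_hat = (np.sum((x - x.mean()) * (y - y.mean()))
            / np.sum((x - x.mean()) ** 2))

# Same slope from the matrix least-squares formulation.
X = np.column_stack([np.ones_like(x), x])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.isclose(beta_hat, coef[1])
```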


 * Oleg, you should have more reason than personal preference for replacing a complete section. It appears that you were unaware that the straight line derivation was a part of the article before I began revision. It was right to retain it and it is logical to use it in this example. Petergans (talk) 15:09, 17 February 2008 (UTC)


 * Sorry I missed the part about standard deviations. The reason I cut out the straight line derivation is because currently the article is too hard to understand for people who never saw this stuff before. Right after the normal equations there should be a very simple example. Things are now more complicated by first doing the straight line derivation, worrying about standard deviations, etc. That straight line derivation is very simple if you already know linear least squares, and it just stays in the way of a simple example which can be done without it if you are not familiar with the topic. Oleg Alexandrov (talk) 16:30, 17 February 2008 (UTC)


 * Thanks, I finally understood where the standard error numbers came from. I hope somebody can do the same in the Regression analysis example, although I suspect the formulas will not be as neat as in this (simple) example.Tesi1700 (talk) 23:25, 17 February 2008 (UTC)

Whether or not something is "too hard to understand" is a matter of personal opinion. My object in re-casting the section on straight-line fitting was to give an example without the use of matrix algebra. Matrix algebra is, in my opinion, less accessible to non-mathematicians. It is essential to include standard deviations right from the start as, in data fitting, the parameter estimates are meaningless without them. Petergans (talk) 16:25, 18 February 2008 (UTC)


 * Well, linear algebra is standard in college, while people usually don't learn about standard deviations unless they take a specialized probability or statistics course.


 * But I see that both your goal and mine is to make the topic more accessible, even though we don't agree on how best to achieve that. I'll leave it at that. Oleg Alexandrov (talk) 05:40, 19 February 2008 (UTC)

It's not so much that we disagree, it's more that we have quite different backgrounds. What you say about standard deviations does not apply in quantitative sciences, where the topic of measurement errors and their consequences is of fundamental importance. The calibration curve illustrates the point. See how each datum is expressed as a value and an error bar. When an unknown analyte is measured the value is taken from the straight line parameters, and the error is propagated from the errors on the parameters and the correlation coefficient between them. You will find this treatment in elementary texts on analytical chemistry. Petergans (talk) 08:59, 19 February 2008 (UTC)

Calculations of sigma(alpha)
I have been through the "Straight Line Fitting" section over and over and I still do not understand something. If you apply an offset 'a' to the x variable, which is equivalent to shifting the data to the right or left, then the 1 sigma error estimates on alpha and beta should not change, right? This ends up being true for the sigma(beta) calculation since D,m and S do not change when you apply an offset to x. However, sigma(alpha) does change, since Sx2 changes when you change the x values. Is it possible that the Sx2 in the sigma(alpha) calculation should be the sum of the square of the deviation of x from the mean of x? Perhaps a reference to the derivation of this parameter would help. --Jonny5cents2 (talk) 09:08, 10 March 2008 (UTC)
 * It is true that the slope of the straight line is unaffected by a shift along the x-axis, but the intercept on the y-axis occurs at a different x-value. That is why the error on the intercept is not the same. Derivation of the expression for errors is in least squares (Least squares, regression analysis and statistics) Petergans (talk) 11:06, 10 March 2008 (UTC)
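Jonny5cents2's experiment is easy to reproduce numerically; a NumPy sketch (the data and the helper line_fit_se below are hypothetical, for illustration only):

```python
import numpy as np

def line_fit_se(x, y):
    """Fit y = alpha + beta*x; return the standard errors
    (sigma_alpha, sigma_beta) of the two estimates."""
    X = np.column_stack([np.ones_like(x), x])
    params = np.linalg.lstsq(X, y, rcond=None)[0]
    m, n = X.shape
    r = y - X @ params
    s2 = (r @ r) / (m - n)               # residual variance estimate
    cov = s2 * np.linalg.inv(X.T @ X)    # parameter covariance matrix
    return np.sqrt(np.diag(cov))

# Hypothetical data.
x = np.array([0., 2., 4., 6., 8.])
y = np.array([2.6, 3.1, 3.9, 4.2, 5.1])

se = line_fit_se(x, y)
se_shifted = line_fit_se(x - 3., y)      # shift the data along x

assert np.isclose(se[1], se_shifted[1])      # slope error: unchanged
assert not np.isclose(se[0], se_shifted[0])  # intercept error: changes
```

This matches Petergans's explanation: the intercept refers to a different x-value after the shift, so its error changes while the slope error does not.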

A picture?
I think a picture showing the connection between the least squares solution of a system $$ A\mathbf{x} = \mathbf{b}$$ and the orthogonal projection of $$\mathbf{b}$$ onto $$\operatorname{col} \mathbf{A}$$ would be helpful. For me at least, a picture like that was what made me understand the concept of a least-squares solution. —Preceding unsigned comment added by Veddan (talk • contribs) 16:51, 25 March 2008 (UTC)
 * I've added a footnote to this effect. Petergans (talk) 10:51, 26 March 2008 (UTC)


 * Good! That's essentially the same as what I had in mind for a picture.
 * Veddan (talk) 12:57, 26 March 2008 (UTC)
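The orthogonality in that footnote can be verified numerically; a NumPy sketch with arbitrary illustrative data:

```python
import numpy as np

# Arbitrary illustrative overdetermined system.
rng = np.random.default_rng(4)
A = rng.normal(size=(6, 2))
b = rng.normal(size=6)

x = np.linalg.lstsq(A, b, rcond=None)[0]
residual = b - A @ x

# The residual is orthogonal to every column of A: Ax is the
# orthogonal projection of b onto col(A).
assert np.allclose(A.T @ residual, 0)
```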

Consistency
To Peter: At non-linear least squares it is good to write things in a certain way to make the point about not being able to find the minimum in one step.

At this linear least squares article, this is not necessary. Going through the same motions as at the other article for the sake of consistency is not a good idea. The order you chose does not flow well. Oleg Alexandrov (talk) 16:42, 31 March 2008 (UTC)


 * Please don't patronize me in this way. You are implying I haven't thought it through or considered alternatives.
 * I suggest that you stop tinkering with the text and confine your edits to matters of substance rather than trivialities like sentence order. Petergans (talk) 13:37, 1 April 2008 (UTC)


 * The wording I suggest reads better. Since we are talking about linear least squares, the linearity of model functions should be stated before you go into the details of calculation.


 * I already added a picture to this article, and I've been planning to add another one, based on this, to address the request in the previous section. That will address your "trivialities" issue.


 * Lastly, Peter, if my flaw is that you perceive me to be patronizing (sorry about that), I'd suggest you work on addressing the issues raised rather than deflecting them with "don't tinker, you're patronizing, etc". The goal is to have well-written articles, let's focus on that. Oleg Alexandrov (talk) 15:18, 1 April 2008 (UTC)
 * There you go again! It is patronizing to tell me that the goal is a well-written article. Do you think I don't know that? You repeat for the third time your justification for an alteration of sentence order. I was not convinced the first time and your merely repeating it will not cause me to change my opinion.


 * The picture that you added is more appropriate to linear regression than to parameter determination. It can stay there until a better one is found. Have a look at a Gran plot which is used for potentiometric electrode calibration, for a possible alternative.


 * I've seen a diagram like that in Björck. Quite frankly it means nothing to me, even after reading his explanation. I had to find out what column space is as I had not come across the term before. I doubt if a non-mathematician will understand the significance of this orthogonality. I certainly don't. For these reasons the orthogonality statement was relegated to a footnote and the questioner seemed satisfied with that. Petergans (talk) 08:13, 2 April 2008 (UTC)
 * p.s. I've just found the article gran plot which I had searched for before, but did not find. Petergans (talk) 08:17, 2 April 2008 (UTC)
 * You have not commented on whether my intended edit (moving the mention of linearity of model functions earlier) would be an improvement or not. If you think it won't, please explain why. Oleg Alexandrov (talk) 15:04, 2 April 2008 (UTC)

Linear least squares is not simple linear regression
The picture may lead the reader to think that the dependence on x is linear. That is not necessarily so as now made clearer in this edit. Possibly the picture could be replaced by one that builds the solution as a curve not a straight line out of some basis functions. Jmath666 (talk) 16:22, 4 April 2008 (UTC)
 * Good point. I've replaced the picture and added a note to that effect. Oleg Alexandrov (talk) 06:42, 5 April 2008 (UTC)

Jmath, you are guilty of a commonplace confusion. Linear regression is linear regardless of whether the dependence on x is on the one hand linear or affine, or on the other hand more complicated. The term "linear" in "linear regression" is not about the nature of the dependence on x. If you fit a parabola
 * $$ y = ax^2 + bx + c\,$$

by ordinary least squares, then that's linear regression. The dependence of the vector of least-squares estimates of a, b, and c upon the vector of y-values is linear. Michael Hardy (talk) 15:01, 20 April 2008 (UTC)
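Michael Hardy's parabola example, as a NumPy sketch (the coefficients 1.5, -0.5, 2.0 are arbitrary illustrative values):

```python
import numpy as np

# Fitting y = a x^2 + b x + c is linear in the parameters a, b, c,
# so it is a linear least-squares problem even though the fitted
# curve is not a straight line.
x = np.array([-2., -1., 0., 1., 2., 3.])
y = 1.5 * x**2 - 0.5 * x + 2.0          # exact parabola, no noise

X = np.column_stack([x**2, x, np.ones_like(x)])  # design matrix
a, b, c = np.linalg.lstsq(X, y, rcond=None)[0]

assert np.allclose([a, b, c], [1.5, -0.5, 2.0])
```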

Approximate solution
It is quite wrong to suggest that the solution to a linear least squares problem is an approximation. When the Gauss-Markov-Aitken conditions apply it is a minimum variance solution. The variances on the parameters are part of the least squares solution. When the probability distribution of the derived parameters is known, uncertainty in them can be expressed in terms of confidence limits.

There are no approximations involved unless the probability distribution of the parameters is approximated, say, by a Student's t distribution, but that only affects the confidence limits; the standard deviations on the parameters are dependent only on the number and precision of the dependent variables and the values of the independent variable(s); they are independent of the probability distribution of the experimental errors.

In science the model function is an expression of a physical "law". In regression analysis the model function is in effect a postulate of an empirical relationship. In neither case is the model function an approximation except in the sense that the underlying "law" or relationship may be an approximation to reality. The residual vector is given as $$\mathbf{r=y-f(x,\boldsymbol\beta)}$$: the objective function is not an approximation. Petergans (talk) 09:14, 15 April 2008 (UTC)
 * This is a POV often used in math. An exact solution to the system of equations (that is, one with zero residuals) does not exist in general, so one has to look for an approximate solution, and minimizing the squares of the residuals is one approach to defining such an approximate solution. This is a different concept than the exact value of the solution of the regression problem. Perhaps this can be worded somewhat differently, to make clear what exactly is exact. Jmath666 (talk) 15:03, 15 April 2008 (UTC)
 * The bulk of readers will not be mathematicians and will therefore be confused by this technical usage which is contrary to the normal English meaning of the word "approximation" (check the definition in your dictionary). I strongly object to the use of this word in this context. Petergans (talk) 10:49, 16 April 2008 (UTC)
 * Approximate = nearly exact; not perfectly accurate or correct. So, an approximate solution of a system of equations does not satisfy those equations exactly, but only nearly so. What is wrong with that? Jmath666 (talk) 04:21, 17 April 2008 (UTC)
 * The concept is inappropriate because no approximation is involved in deriving the parameters. The relation $$\mathbf{r=y-f(x,\boldsymbol\beta)}$$ is exact. The parameter estimators have well-defined statistical properties, which is a different concept altogether. Saying that $$\mathbf{y=X\hat{\boldsymbol\beta}}$$ is an approximation sends out the wrong message. A consequence of the fact that it cannot be an exact relationship is that the parameters have non-zero standard deviations. So, parameter values are not known approximately, but statistical inferences can be made using the results of a least squares calculation. For example, confidence limits make no reference to what the value of a parameter might be, approximate or otherwise.
 * In short, it's a matter of maintaining a clear distinction between errors and residuals. $$\mathbf{y=X\hat{\boldsymbol\beta}}$$ is not exact because of experimental errors. Using the word approximation blurs this distinction. Petergans (talk) 08:07, 17 April 2008 (UTC)
 * Agreed. I think that the current version of the article is OK in that. Instead of "approximation" it clearly says "approximate solution" of a system of linear equations. Jmath666 (talk) 12:20, 17 April 2008 (UTC)


 * Inexactness is not always due to experimental error. Sometimes it is just not possible to find a model to fit very accurately given data, even if that data is known with high precision.


 * And all this argument is besides the point. The statement in the article does not say that the linear least squares problem is inexact, only that the solution to it approximately satisfies a linear system. Oleg Alexandrov (talk) 04:31, 18 April 2008 (UTC)


 * No, Oleg, it is you that is missing the point, which I now have to repeat: the non-mathematician will be confused by the use of the term approximate solution. It is clearly stated that the least squares solution is a best fit solution. To see it apparently described as something else a few sentences later is potentially confusing. A best fit solution is obviously not an exact solution. Petergans (talk) 09:52, 18 April 2008 (UTC)


 * There is no confusion. We are trying to approximately fit a curve through data points, or approximately solve a linear system of equations. The approach chosen minimizes the sum of squares of residuals. The latter problem is solved exactly. Yes, "approximate" in the article means "non-zero residuals", that is rather obvious from the text, and I don't think it will confuse anybody. Oleg Alexandrov (talk) 15:50, 18 April 2008 (UTC)

Sorry, Oleg, you and the other mathematicians are still missing the point. True, a least squares solution is an approximation from the mathematical point of view, but the experimentalist sees things differently - it is a best-fit solution. The difference arises from the fact that the experimenter has control over the number, precision and disposition of the data points, so that the matrix X is not simply a given quantity. No approximations are made in deriving the least squares solution, but the derived parameter values, errors and correlation coefficients will depend on the qualities of the measurements. For that reason it is wrong to treat the topic as a purely mathematical one and it is potentially confusing to call the best fit an approximate solution. Petergans (talk) 14:28, 20 April 2008 (UTC)


 * The problem of minimizing the sum of squares is of course exact. However, its solution will not satisfy the equality


 * $$\mathbf{X} \boldsymbol \beta = \mathbf{y}$$


 * exactly, as the residuals won't be zero.


 * In other words, the equality


 * $$\mathbf{\left(X^TX\right)\hat{\boldsymbol\beta}=X^Ty}$$


 * is exact, but the statement


 * $$\mathbf{X} \boldsymbol \beta = \mathbf{y}$$


 * is an approximation. Oleg Alexandrov (talk) 15:29, 15 April 2008 (UTC)


 * There is no unique solution to a set of over- or under-determined simultaneous equations. Solutions can be obtained by specifying criteria such as least squares, minimax, etc., but I can't see how these solutions are "approximate" in any dictionary sense. Petergans (talk) 10:49, 16 April 2008 (UTC)
 * They are approximate because if you plug the solutions back into the equations you don't get equalities. Oleg Alexandrov (talk) 15:25, 16 April 2008 (UTC)
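The distinction being argued here is easy to check numerically. A minimal pure-Python sketch (the three data points below are hypothetical, chosen only so that they are not collinear) confirms both halves: the normal equations are satisfied exactly, while the original system $$X\beta = y$$ is satisfied only approximately.

```python
# Fit y = b0 + b1*x to three non-collinear points (hypothetical data) by
# solving the 2x2 normal equations (X^T X) beta = X^T y in closed form.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 1.0]   # these points do not lie on any single straight line
m = len(xs)

# Entries of X^T X and X^T y for the design matrix X with rows [1, x_i].
s0, s1, s2 = m, sum(xs), sum(x * x for x in xs)
t0, t1 = sum(ys), sum(x * y for x, y in zip(xs, ys))

# Solve the 2x2 system by Cramer's rule.
det = s0 * s2 - s1 * s1
b0 = (s2 * t0 - s1 * t1) / det
b1 = (s0 * t1 - s1 * t0) / det

# The normal equations themselves hold exactly (up to rounding) ...
assert abs(s0 * b0 + s1 * b1 - t0) < 1e-12
assert abs(s1 * b0 + s2 * b1 - t1) < 1e-12

# ... but the original system X beta = y is only approximate:
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(residuals)   # nonzero residuals
```

The residuals are nonzero, yet their sum and their dot product with the x-values both vanish, which is precisely what the normal equations assert.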

Linear least squares as an overdetermined system of equations
Peter removed all information about the fact that fitting a linear model to data is the same as solving an overdetermined linear system. That's a pity, since I believe that it is very important to write the linear least squares problem in the language of linear algebra before using the machinery of linear algebra to solve it. The references support my point of view: the first several references from a Google Books search all have the linear system formulation. Oleg Alexandrov (talk) 03:11, 30 April 2008 (UTC)
 * Quote from introduction: "The parameters are overdetermined, that is, there are more observations than parameters, m>n".


 * I re-wrote the orthogonal decomposition section in order to emphasize the fact that minimization of the sum of squared residuals is what it's all about. The m equations for the residuals, $$\mathbf{r=y-X\boldsymbol\beta}$$, are satisfied exactly for all values of the parameters.


 * My objections to the use of terms implying approximation have been given previously. The least squares fit is not just any old approximate fit, but a unique "best" fit. The algebraic manipulations required to obtain that fit are clearly explained in a logical sequence.


 * Linear least squares is not an article about linear algebra. Petergans (talk) 11:09, 30 April 2008 (UTC)
 * On further thought, if you insist, $$\mathbf{y\approx X\boldsymbol\beta}$$ could go in the properties section as it has nothing to do with the derivation of the solution. Personally I would regard it as a statement of the obvious in that context. Petergans (talk) 15:33, 30 April 2008 (UTC)


 * The issue of "approximate solution" can be discussed separately.


 * Fitting a linear model in one independent variable is just one application of linear least squares. You can have the more general situation of $$\boldsymbol y=\boldsymbol f(\boldsymbol x, \boldsymbol \beta),$$ where all quantities are vectors. Or you can talk about the problem of fitting a hyperplane through a set of n-dimensional points. These are all different applications, what they have in common is that they all reduce to solving an overdetermined linear system.


 * The key to this article is the overdetermined linear system. This fact should be stated as visibly as possible. Oleg Alexandrov (talk) 15:40, 30 April 2008 (UTC)


 * This assertion is unsustainable: the values of the overdetermined parameters are obtained by minimizing the sum of squared residuals. That's why it is called "Least squares". In linear regression the model is stated as $$\mathbf{y = X\boldsymbol\beta + \boldsymbol\epsilon}.$$ These equations are exact. However, they cannot be solved as the experimental errors are unknown. Writing $$\mathbf{y = X\boldsymbol\beta} $$ or $$\mathbf{y \approx X\boldsymbol\beta}$$ will lead to confusion between errors and residuals. Petergans (talk) 09:44, 3 May 2008 (UTC)


 * When I wrote $$y=X\boldsymbol\beta,$$ I wrote right there that this is in the sense of minimizing the sum of squares. There is no confusion. And the statement can be made more clear if you wish. And once again, please, the issue of whether the system is solved exactly or not is not sufficient grounds to remove the information on the overdetermined linear system. Oleg Alexandrov (talk) 15:07, 3 May 2008 (UTC)
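To make the overdetermination concrete: each data point contributes one equation, so with more points than parameters the system generally has no exact solution. A small pure-Python sketch with hypothetical points:

```python
# Three data points (hypothetical) give three equations b0 + b1*x_i = y_i
# in only two unknowns: an overdetermined linear system.
pts = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]

# The first two equations alone determine a unique line ...
(x0, y0), (x1, y1) = pts[0], pts[1]
b1 = (y1 - y0) / (x1 - x0)
b0 = y0 - b1 * x0

# ... but that line violates the third equation, so the full system is
# inconsistent and only a best-fit (least-squares) solution is possible.
x2, y2 = pts[2]
print(b0 + b1 * x2, "!=", y2)
```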

Article structure
The methods of orthogonal decomposition do not use the normal equations at all, so it is wrong to place these methods as a subsection of "Solving the normal equations". I was at pains to re-organise the article so as to make this clear. I am reverting in order to restore that article structure. Petergans (talk) 09:44, 3 May 2008 (UTC)
 * Good point about this. However, the normal equations are important enough to deserve their own section, I believe. That even if you decide to circumvent them when solving the problem. Oleg Alexandrov (talk) 23:32, 3 May 2008 (UTC)

Oleg, the results of your recent tinkering are absolutely awful. It appears that you have not fully understood why the article needed to be restructured. This is the reason. The normal equations method and the orthogonal decomposition methods are different ways of minimizing the sum of squared residuals. The minimization is performed in order to best-fit the observed data, that is, to reduce the overall difference between observed and calculated data to its smallest value. I hope I have made this even clearer in the current revision.

It has been deemed necessary to simplify the structure of the article as a whole. I have taken the opportunity to make minor improvements in the later sections. Please look carefully at the article as a whole before you consider making any further changes.

BTW the second paragraph that you added to the lead-in is inappropriate for an article about linear least squares. That’s why it was removed. Petergans (talk) 11:23, 4 May 2008 (UTC)


 * I very much disagree with your removal of all information about solving the overdetermined linear system, for the reasons stated in the previous section.


 * I am not convinced by your current article structure; I believe my version reflected the material better.


 * I will take a few days off from this discussion, since the atmosphere here is getting a bit tense and that does not help anybody. Then I plan to resume the discussion on the overdetermined linear system. I prefer we discuss the issue of article structure afterwards, so that we deal with one issue at a time. Thanks. Oleg Alexandrov (talk) 04:41, 5 May 2008 (UTC)

Overdetermined systems
I agree that the point about a set of overdetermined equations is a good mathematical point. However, this article is about experimental data, not mathematics. If you know of a scientific or mathematical topic in which a set of overdetermined linear equations is generated other than by experimental measurements, then that topic would merit a completely different section, properly sourced to literature in the public domain. If no such topic exists, the point has only theoretical value and as such is not a useful part of the problem statement. I am open to persuasion by reasoned argument, but personal preference alone does not constitute a reason. Petergans (talk) 09:19, 5 May 2008 (UTC)


 * Well, this is not just about my personal preference; posing the linear least squares problem as the problem of finding a solution to an overdetermined linear system is widely used in the literature, see again if you wish here, here, and here.


 * To say that this article is not about mathematics does not do justice to what the article is about. Ten percent of this article is spent on converting the data fitting problem to the mathematical problem, and 90% of the article is about the mathematics used to solve the problem.


 * Lastly, the topic of linear least squares is, I believe, of wide interest to statisticians and mathematicians, in addition to practitioners like yourself. Stating the goal of this article in the language of linear algebra makes the article easier to read for people who do not have your background, and is also the natural thing to do, since, after all, linear algebra is the core machinery used to solve the problem. Oleg Alexandrov (talk) 03:44, 8 May 2008 (UTC)
 * The overdetermined equations are the equations in the residuals. These equations are generated by the adoption of the least squares criterion. I have modified the text slightly to make the issue of overdetermination more explicit.


 * I won't accept $$y\approx X\beta$$ as a system of overdetermined equations because i) of the unknown errors in y, ii) they are not equality relations, and iii) it applies the overdetermination statement to linear least squares, whereas overdetermination also applies to nonlinear least squares and regression analysis. The references that you give are unconvincing. Fairbrother writes "If the observations are made without error", which is an abstract idea, unobtainable in practice. Fairbrother then goes on to use this idea to justify the approximate relation. The other two references do not contain $$y\approx X\beta$$ explicitly.


 * The vast majority of readers of this article will not be mathematicians or statisticians, but practitioners interested in the problem of data fitting. Those readers need as simple an introduction as possible. The introduction sections should only contain essential material. That's why, for example, matrix notation is only introduced at the end of the normal equations method. The properties section contains material of a more "advanced" nature.


 * To coin a phrase, you can satisfy all of the readers some of the time, but you can't satisfy some of the readers all of the time. Petergans (talk) 10:07, 8 May 2008 (UTC)
 * Two of the three references contain the linear equations with the approximate sign, and the third one with equality, even if the equality is not attainable.


 * The system in the matrix notation complements the existing material, it does not replace it. It is at the end of the section, and it won't confuse people unfamiliar with matrices (which is unlikely anyway, as a linear algebra course is taken early on in college by all quantitative sciences students).


 * About the readership of this article, doing a google books search for this topic yields book titles like "Numerical Optimization", "Newton Methods for Nonlinear Problems", "Inverse Problems and Optimal Design in Electricity and Magnetism", "Artificial Neural Networks and Neural Information Processing". Linear least squares is a fundamental topic, and your claim that only "data fitting practitioners" will benefit from this article is unconvincing. Oleg Alexandrov (talk) 14:59, 8 May 2008 (UTC)


 * Your claim, "The vast majority of readers of this article will not be mathematicians or statisticians", supports that the basic mathematics behind this article should be explained. Of course, the mathematicians and statisticians already know this stuff. Tparameter (talk) 17:42, 8 May 2008 (UTC)


 * I agree that the overdetermined system should be mentioned very early in the article, certainly within the first #Problem statement and solutions section. I'm not sure it needs to be in the lead, as it has been in some versions. I haven't studied the history of this article in sufficient detail. — Arthur Rubin (talk) 18:33, 8 May 2008 (UTC)


 * Peter, even your favorite Björck book sees no conflict between the formulation in terms of an overdetermined linear system and the formulation in terms of a model function. In fact, the two perspectives reinforce each other, so that the reader is left with a fuller understanding. Oleg Alexandrov (talk) 04:58, 9 May 2008 (UTC)

That's because i) Björck is a mathematician and ii) he has the space to explain everything in detail. My point is that practitioners may be confused by unnecessary maths. $$y\approx X \beta$$ is not acceptable because it may appear to contradict the derivation of the normal equations etc. As you have written before, it is a property of the solution. Petergans (talk) 07:45, 9 May 2008 (UTC)


 * If you want to get rid of unnecessary maths, you would do much better to keep the linear algebra, which consolidates the subject in a nutshell, and lose (or subpage) the current vast quantities of tiresomely detailed calculation.   At the moment that is what IMO is seriously obscuring the wood for the trees.


 * Perhaps it's worth recalling that Gauss originally introduced linear least squares for solving an overdetermined system -- namely an over-determined system of triangulation equations in land surveying.


 * This is not an article about fitting experimental data. It is an article about linear least squares.  Jheald (talk) 08:06, 9 May 2008 (UTC)


 * What is least squares about if it's not about data fitting? Verifiable sources, please.


 * Consider Galerkin approximations, just as one example. Linear least squares is used in all sorts of mathematics to establish a formally "nearby" solution, in a particular algebraic sense -- even when there is no data.


 * The triangulation thing is an interesting example of parameter-free least squares. It is not an over-determined system, but a constrained system: the problem posed was how to adjust the three experimentally determined angles of a triangle so that they sum to exactly 180°. Gauss solved it using undetermined multipliers, later known as Lagrange multipliers, which is now a standard tool for constrained optimization. Petergans (talk) 09:40, 9 May 2008 (UTC)


 * And when you are locating points using more than a minimum number of angles, then your system is overdetermined. Jheald (talk) 09:59, 9 May 2008 (UTC)
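For what it's worth, the equal-weight case of the triangle adjustment described above has a simple closed form: minimizing the sum of squared corrections subject to the angles summing to 180° spreads the misclosure equally over the three angles. A sketch with hypothetical measurements:

```python
# Adjust three measured triangle angles (hypothetical values, in degrees)
# so that they sum to exactly 180, minimizing the sum of squared
# corrections. Minimizing sum((a_i - m_i)^2) subject to sum(a_i) = 180
# via a Lagrange multiplier gives a_i = m_i - r/3, where r is the
# misclosure sum(m_i) - 180.
measured = [59.95, 60.04, 60.07]

misclosure = sum(measured) - 180.0
adjusted = [a - misclosure / 3.0 for a in measured]
print(adjusted, sum(adjusted))   # the adjusted angles sum to 180
```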

Peter, if the "practitioners" get confused by $$y\approx X \beta$$ (which can be explained carefully, like Björck does), then this article will be as good as useless to them, as the mathematics becomes very heavy very quickly later on in the article.

Also, people whose bread and butter is heavy use of data fitting for problems they encounter in experiments will just use an Excel plugin, or something, without bothering to understand how the math in this article works (experimental people have better things to do with their time than understanding all low-level details of every tool they use).

The people who will truly benefit from this article are "theoreticians", whose concern is not to fit mountains of data quickly, but who develop the methods practitioners then use. Oleg Alexandrov (talk) 15:13, 9 May 2008 (UTC)


 * This is a pointless discussion. The article will be read by all sorts. A compromise is needed, not too mathematical, not too experimental. The current text is just such a compromise. Nothing essential is omitted. The term "best fit" is an absolutely clear indication that the fit is not perfect. It does not need to be re-stated in different words. Petergans (talk) 15:35, 9 May 2008 (UTC)

The additional short paragraph showing the formulation in terms of the linear system is not going to confuse anybody, since it will be at the end of the section, and that text is very simple (yes, even for experimental people). The formulation in terms of the linear system is very much supported by references, and as Jheald points out, not all linear least squares problems come from data fitting (but all linear least squares problems reduce to an overdetermined linear system).

Your claim that it will make people confuse errors and residuals is weak, the meaning of $$\approx$$ is very clear from the context, and besides, if anybody, data fitting people know very well not to confuse the two.

Lastly, your claim that people don't know about matrices is very weak too, if a science student takes any two math courses in college (and they will take probably more), those two courses will be calculus and linear algebra. Matrices are a very fundamental and simple concept in the sciences, and the natural setting in which to explain this article. Oleg Alexandrov (talk) 15:58, 9 May 2008 (UTC)


 * The opening sentence reads
 * A set of m data points ... is to be best-fitted using a model function ... which is a linear combination of n parameters
 * This is a statement in words of the equation $$y \approx X\beta$$ in which the concept of approximation is given a specific meaning. To write the equation at the end of the section can only be confusing as it is not different from the opening sentence. In the context of this WP article, the formulation in terms of the linear system is, at best, a footnote.
 * The statement "not all linear least squares problems come from data fitting" is unsubstantiated; I have requested that it be supported by a verifiable source describing just such a problem, but none has been forthcoming.
 * The majority of readers will not be mathematicians, as scientists, engineers, and other potential practitioners greatly outnumber mathematicians on the planet. This is a fact, not a personal opinion. Petergans (talk) 08:15, 12 May 2008 (UTC)


 * To say the same thing using different terminology at a later time is not confusing, it gives additional insights. The formulation in terms of linear system is natural for this article which makes heavy use of linear algebra.


 * A problem using linear least squares without data fitting has been pointed out to you -- over-determined system of triangulation equations in land surveying. Actual citations would be nice of course, but the example looks convincing enough. Oleg Alexandrov (talk) 15:06, 12 May 2008 (UTC)
 * Since the angles are measured quantities, this is a data-fitting problem, finding the point which best-fits the angles. Petergans (talk) 09:30, 13 May 2008 (UTC)
 * Ultimately any such problem can be stated as a data fitting problem if you wish. Even with the Galerkin problem, you can treat those input numbers as "data", even if that data does not come from experiments. You miss the point, which I am already tired of repeating. This article is about what goes under the hood of linear least squares, when you put on your gloves and open up the engine. Not all experimentalists have the desire or the background to go there. But if you do, the resulting discussion is necessarily mathematical, and it has to be mathematically complete. Your insistence on avoiding a mathematical clarification to a mathematical discussion is just bizarre. I am not talking about making the mathematics more complicated than it has to be, but the formulation in terms of a linear system is the natural formulation of the concept of linear least squares from data fitting in mathematical language. This formulation is applicable even in situations apart from data fitting. It is what shows the true nature of the problem. Oleg Alexandrov (talk) 15:39, 13 May 2008 (UTC)
 * That's your opinion. You are showing no respect for others' opinions. Petergans (talk)
 * Sorry, I did not mean to be uncivil. Oleg Alexandrov (talk) 19:10, 13 May 2008 (UTC)

My two cents: I think there is a constituency of people who know linear algebra but not statistics. For this constituency, the ideal explanation of linear least squares is as the solution of an overdetermined system y = M x, really meaning the solution of y' = M x, where y' is the projection of y onto the image of M, which is the "closest" solvable system (in the "least squares", a.k.a. Euclidean, sense) to the original one. That this approach is pedagogically reasonable is reinforced by its appearance in intro linear algebra texts such as Shifrin & Adams and Bretscher. More to the point, Wikipedia is a reference work, for people who know math as well as those who don't, and this is useful material that can be presented concisely, as fits a reference work. I strongly urge the editors of this article to keep a linear-algebraic explanation. Joshua R. Davis (talk) 03:03, 14 May 2008 (UTC)
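The projection view sketched above can be illustrated in a few lines of pure Python (hypothetical 3-point data): orthogonalize the columns of the design matrix, project y onto their span, and check that the residual is orthogonal to both columns.

```python
# The least-squares fitted vector y_hat is the orthogonal projection of y
# onto the column space of the design matrix (hypothetical 3-point data).
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

y  = [0.0, 1.0, 1.0]
c1 = [1.0, 1.0, 1.0]   # intercept column of the design matrix
c2 = [0.0, 1.0, 2.0]   # slope column (the x-values)

# Gram-Schmidt: orthogonalize c2 against c1.
u1 = c1
u2 = [b - dot(c2, u1) / dot(u1, u1) * a for a, b in zip(u1, c2)]

# Project y onto the orthogonal basis {u1, u2} of the column space.
y_hat = [dot(y, u1) / dot(u1, u1) * a + dot(y, u2) / dot(u2, u2) * b
         for a, b in zip(u1, u2)]

# The residual y - y_hat is orthogonal to both original columns.
r = [a - b for a, b in zip(y, y_hat)]
assert abs(dot(r, c1)) < 1e-12 and abs(dot(r, c2)) < 1e-12
print(y_hat)
```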


 * I agree. I believe the Björck book does a fine job without going too deep in either the mathematical or the data fitting side of things. I would welcome fresh attempts at editing this article. Its current history is basically a slow revert war between me and Petergans and at least in the near term I am too tired of this to attempt big changes. Oleg Alexandrov (talk) 04:20, 14 May 2008 (UTC)
 * One of my major concerns has been to make this article accessible to as wide a readership as possible, as the topic is one of interest to a wide variety of constituencies. Inevitably that means compromise. Petergans (talk) 08:57, 14 May 2008 (UTC)

Order of sections
In the sense of the comment above, I would now prefer the specific example to precede the general solution rather than follow it, though that would be a less logical order. It would delay the introduction of matrix notation which would be an advantage to those readers who are not familiar with it, such as (I imagine) biologists and others of that ilk. What is the general feeling? Petergans (talk) 08:57, 14 May 2008 (UTC)


 * I'm all for specific examples preceding general theory. Joshua R. Davis (talk) 13:04, 14 May 2008 (UTC)

Before we even discuss that, I would like to point out that the current article structure is not right.

* 1 Problem statement and solutions
  * 1.1 Normal equations method
    * 1.1.1 General solution
    * 1.1.2 Specific solution, straight line fitting, with example
  * 1.2 Orthogonal decomposition methods
* 2 Weighted linear least squares

The normal equations are not just a "method". That section establishes that the linear least squares problem
 * Has a solution
 * The solution is unique
 * The solution can be obtained in closed form

None of these hold for non-linear least squares, and this section is the foundation of the article. Without the "Normal equations" section nothing in this article makes any sense, including the "orthogonal decomposition methods". So, if we think of this article as a tree, the normal equations belong in the trunk, not in a branch.

Here is my proposed sectioning.

* The linear least squares problem
* Derivation of the normal equations
* Example
* Solving the linear least squares problem
  * Using the normal equations
  * Orthogonal decomposition methods
* Weighted linear least squares

Comments? Oleg Alexandrov (talk) 15:28, 14 May 2008 (UTC)
 * The orthogonal decomposition method also establishes that the solution exists, is unique, and can be found in closed form. It does this without reference to the normal equations.  The specific problem it solves is "for given A,b, minimize the L^2 norm of Ax-b over all x".  The proof that it does this is already in the article, and is quite a bit simpler than the normal equations to my mind. JackSchmidt (talk) 15:38, 14 May 2008 (UTC)

Good point, I did not realize that this method was proving the existence and uniqueness along the way. I still think though that the current order is quite convoluted. How about this:

* The linear least squares problem
* Example?
* Solving the linear least squares problem
  * Using the normal equations
  * Orthogonal decomposition methods
* Weighted linear least squares

This is along the lines of what Peter mentioned too. The example could be simplified from the theoretical situation with n points and a line (made particular later to 4 points). It would still use the normal equations, but in a particular case (and by the way, I think the normal equations are still easier to understand than the orthogonal decomposition, which requires serious matrix theory). Oleg Alexandrov (talk) 16:57, 14 May 2008 (UTC)


 * Sounds good to me, especially if the example only briefly mentions normal equations (enough to have its solution make sense, but no need to derive them; some people are just looking for formulas). Once this is done, I might be able to show why the QR method is simpler (to people with warped vision like myself) since it will just say that the shortest path from a point to a line is perpendicular to the line.  I'll have to dig up my NLA books to find the pretty picture that goes with it. JackSchmidt (talk) 17:06, 14 May 2008 (UTC)
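For comparison, here is a pure-Python sketch of the QR route for a tiny problem of the same kind (hypothetical straight-line data): factor the design matrix by classical Gram-Schmidt and back-substitute, never forming $$X^TX$$.

```python
# Solve a 3x2 least-squares problem by a thin QR factorization (classical
# Gram-Schmidt) and back-substitution, without the normal equations.
# Hypothetical straight-line data: columns c1 (intercept), c2 (x-values).
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

y  = [0.0, 1.0, 1.0]
c1 = [1.0, 1.0, 1.0]
c2 = [0.0, 1.0, 2.0]

# Thin QR: [c1 c2] = Q R with orthonormal columns q1, q2 and upper
# triangular R = [[r11, r12], [0, r22]].
r11 = sqrt(dot(c1, c1))
q1  = [a / r11 for a in c1]
r12 = dot(q1, c2)
u2  = [b - r12 * a for a, b in zip(q1, c2)]
r22 = sqrt(dot(u2, u2))
q2  = [a / r22 for a in u2]

# Back-substitute R [b0, b1]^T = Q^T y.
b1 = dot(q2, y) / r22
b0 = (dot(q1, y) - r12 * b1) / r11
print(b0, b1)   # the same minimizer the normal equations would give
```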


 * Oleg: "A little learning is a dangerous thing; drink deep, or taste not the Pierian spring: there shallow draughts intoxicate the brain, and drinking largely sobers us again." Alexander Pope, An Essay on Criticism, 1709. Petergans (talk) 10:07, 15 May 2008 (UTC)


 * Are you suggesting that Oleg knows dangerously little statistics, or are you talking about your intended readers, who know a little algebra but not linear algebra? I don't get it. Joshua R. Davis (talk) 13:03, 15 May 2008 (UTC)

Conformal vector?
In the section "Properties of the least-squares estimators" the term "conformal vector" is used. I could not find a definition here or at Least squares. I know what "conformal" means in geometry, but what does it mean here? Joshua R. Davis (talk) 13:03, 15 May 2008 (UTC)

Example
I am quite unhappy with the example in the article. First, it is very theoretical, with n points (an example should be as simple as possible, while keeping the essence). Second, it gives no insight into how the parameters alpha and beta are found; it just plugs numbers into the normal equations, which is just another lengthy calculation below. Third, the standard errors are too advanced a topic here; they belong at linear regression instead (the goal of this article is to minimize a sum of squares of linear functions). I propose to start with a line and three points to fit, do a simple re-derivation of the normal equations for this particular setting, and avoid the standard errors business. Comments? Oleg Alexandrov (talk) 15:41, 15 May 2008 (UTC)
 * I'd agree 100%. The article needs serious streamlining to bring out the main points, rather than (as at present) burying the key simple ideas under a weight of numbing calculative detail.  Jheald (talk) 08:56, 16 May 2008 (UTC)
 * It is obvious that I disagree entirely with these proposals, as I wrote the current text. Since Oleg is proposing a re-write rather than an incremental edit I suggest that he does this in a sandbox with a link on this page so that others may judge if the current text should be replaced. In the re-write the following points should be considered.


 * 1) The article should be addressed to as wide a readership as possible. This article has many goals as it addresses the concerns of many different kinds of reader.
 * 2) Any example should be realistic; a 3-point example of straight line fitting is not realistic. A simple rule of thumb is that to determine n parameters a minimum of n² data points is needed.
 * 3) A very simple example is the determination of a sample average: $$\bar y={\sum y_i\over m}$$. There is only one parameter, so no partial derivatives are involved, there is only one normal equation so its inverse is simply the reciprocal and so forth. Is this too simple to go at the head of the article?
 * 4) Parameter uncertainty is not an advanced topic. It is ESSENTIAL. Parameter values are meaningless without estimates of their uncertainty to indicate how many figures in the value are significant. In the numerical example $$\hat \alpha$$ = 152/59. It would be nonsense to write $$\hat \alpha$$ = 2.576271186, which is what my calculator gives. In the case of sample average the range is a reasonable estimate of the error when the number of measurements is small.
 * —Preceding unsigned comment added by Petergans (talk • contribs) 08:41, 16 May 2008
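Point 3) above is easy to verify in a couple of lines (the measurements are hypothetical): the sample mean is the one-parameter least-squares solution, and no nearby candidate does better.

```python
# The sample mean as the simplest least-squares problem: fit a single
# constant beta to observations y_i. The one normal equation is
# m * beta = sum(y_i), so beta is the mean (hypothetical measurements).
ys = [2.1, 1.9, 2.4, 2.0]
beta = sum(ys) / len(ys)

def ssr(b):
    # sum of squared residuals for a candidate constant b
    return sum((y - b) ** 2 for y in ys)

# The mean beats nearby candidate values.
assert ssr(beta) <= ssr(beta + 0.01) and ssr(beta) <= ssr(beta - 0.01)
print(beta)
```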


 * Peter, what you say is correct, but that applies to the linear regression article. This article is about linear least squares. Not all linear least squares problems come from data fitting, and parameter estimates are not part of linear least squares. And, a first example should be simple, to give the essence. If people understand a simple example, they will also be able to understand a realistic example. Before you learn how to walk you have to learn how to crawl. Oleg Alexandrov (talk) 16:07, 16 May 2008 (UTC)

Problem statement and solution
It seems to me that the article should start by saying that least squares aims to find a "best" solution to an overdetermined system
 * $$y \approx X\beta,$$

and that such a system can arise in many ways, for instance in trying to fit a set of basis functions $$\phi_j$$
 * $$y_i \approx \sum a_j \; \phi_j (x_i)$$

I find that a much more straightforward, easy to understand starting point, because the equations are less fussy and less cluttered; and I think because it gives a broader, more generally inclusive idea of what least squares actually is.

I reject the idea that starting with the matrix notation is too telescoped and too succinct. To people familiar with it (which, let's face it, is pretty much anyone who's done any maths after the age of eleven), the shorter, terser form is much easier for the brain to recognise and chunk quickly. And for people who are slightly rusty, the short terse form is immediately unpacked and made more concrete. And in a production version of the article, one would probably immediately make it more concrete still, by saying e.g.:
 * such as fitting a quadratic function to a set of data points
 * $$y_i \approx \beta_1 + \beta_2 x_i + \beta_3 x_i^2$$
 * or a straight line (a linear function)
 * $$y_i \approx \beta_1 + \beta_2 x_i$$

And then explain the least squares criterion for bestness...

That seems to me much closer to the pyramid structure an article should be aiming for: start with something quite short, simple and succinct; and then steadily unpack it and slowly add detail. I think it gives a much cleaner start, starting from a bird's-eye overview, than the present text, which throws you straight into the morass of


 * A set of m data points consisting of experimentally measured values, $$y_1, y_2,\dots, y_m$$, taken at m values of an independent variable, $$x_1, x_2,\dots, x_m$$ ($$x_i$$ may be a scalar or a vector), is to be best-fitted using a model function, with values $$f(x_i, \boldsymbol \beta)$$, which is a linear combination of n parameters, $$\beta_1, \beta_2,\dots, \beta_n$$.
 * $$f(x_i,\boldsymbol \beta)=\sum_{j=1}^{n} X_{ij}\beta_j$$

That's my opinion, anyway. Jheald (talk) 09:41, 16 May 2008 (UTC)
 * It depends on your background. I can see that for mathematicians this might be better, but I have to say that to data fitters in general it would be all but unintelligible. For them one has to start with the idea of fitting a model function to experimental data. I assert that this article should be "biased" towards them since there are far more physicists, chemists, biologists and other scientists, engineers, etc. than there are mathematicians.


 * It should also be taken into account that a general introduction to the topic is given in least squares, as indicated in the lead-in.


 * One cannot assume that all readers (globally!) will be familiar with matrices. The first part of the article should cater for those who are not matrix-literate. Petergans (talk) 13:18, 16 May 2008 (UTC)


 * We can assert that all readers using the concept should be matrix-literate, if not actually linear-algebra-literate. Writing $$\sum_{j=1}^{n} X_{ij}\beta_j$$ without noting that it is generally known as $$\mathrm {X} \ \mathrm {\beta}\,$$, and then using the more compact matrix notation in the sequel, is an attempt to "dumb down" the article to the point that it makes a simplified explanation of the concept which differs subtly from the more precise technical explanation, to the point that those who understand the more precise explanation (matrices) are confused.  — Arthur Rubin  (talk) 13:37, 16 May 2008 (UTC)


 * I agree with Arthur. Peter, I can't agree with your "unintelligibility" comment before that.


 * IMO, regarding the four approximation equations I set out at the top, each of them represents a likely starting-point for some of the people coming to this article. So the opening paragraph setting out the problem should accommodate all of them.  My view is that so long as people can see that one of the equations looks familiar to them, they will be able to then track back and relate it to the more general forms.


 * For example, as a physicist, the starting point which seems most natural to me is the second one (the basis function expansion). And I then appreciate how the matrix X can be explicitly identified with $$\phi_j(x_i)$$.  I think this is a very important link. (Note that above I've only intended to sketch out the sequence of development; a sentence or two more text would need to be added for a production version.)  Someone else might find the quadratic their natural entry point; they can then relate that to the more general basis function formula; and then to the overdetermined system formula, more general still.


 * The other thing I do think is valuable is to set out the approximation problem as above, in its various levels of generality, before applying the least squares requirement to it, which IMO should follow after a subparagraph break (subhead:"least squares criterion"). I find articles easier to assimilate the fewer the number of untethered entities flying around at any one moment.  That's why IMO I think the structured development above, from general to more specific, is helpful, rather than first throwing a bucketload of variables at people.  It's also why I think it makes sense to get people to assimilate just the approximation problem first, set out on its own in a first subparagraph, and only after that's done explicitly introduce the least squares criterion.


 * I appreciate, this is not the way you've been pushing this article; but I think it is a better way forward. I also think it's important to introduce - at least for one line - the linear least squares problem through its general form (as an overdetermined linear system, which could encompass eg overdetermined triangulation equations), rather than only as a basis function fitting problem.  Also, when you start from the general set-up, it's easy to see that the specific case is a special case of it.  But if you dive right into the details of the specifics, it's much harder then to pull back from that to the overview.  Jheald (talk) 15:21, 16 May 2008 (UTC)


 * Arthur Rubin is exactly right that, to a reader who knows matrices, the matrix treatment is vastly easier than the non-matrix treatment. However, I agree with Petergans that we should not assume matrix skills from the common reader. (I am basing this on my own experience teaching college math in the U.S. --- at "elite" colleges.) There should be an example with no matrices, and little-to-no subscripts either, since subscripts hugely confuse novices. Making an example this simple is difficult, but one has already been suggested: the fit-a-line-to-three-points idea above. As Petergans says, it's not realistic. But the first example doesn't have to be; it needs to be comprehensible. The educational research I've seen suggests that people learn better by generalizing from examples to theory, than by seeing abstract theory followed by examples. I advocate (1) fit-a-line-to-three-points to convey idea of approximate solution, (2) fit-a-line-to-N-points for a realistic problem without matrices, (3) the same fit-a-line-to-N-points problem in matrix notation, and then (4) general matrix formulation of the linear least squares problem. Joshua R. Davis (talk) 15:53, 16 May 2008 (UTC)
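 * Joshua's step (1), fitting a line to just three points to convey the idea of an approximate solution, can be sketched in a few lines. This is a hypothetical illustration, not from the article; the three points and the closed-form straight-line solution of the normal equations are chosen only to keep it minimal:

```python
# Least-squares fit of a line y = a + b*x to a handful of points, using
# the closed-form solution of the 2x2 normal equations for a straight line.
# The three data points below are made up purely for illustration.

def fit_line(points):
    """Return (intercept, slope) of the least-squares line through points."""
    m = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    det = m * sxx - sx * sx        # determinant of the 2x2 normal matrix
    intercept = (sy * sxx - sx * sxy) / det
    slope = (m * sxy - sx * sy) / det
    return intercept, slope

# Three points that do not lie on a common line:
a, b = fit_line([(0, 0), (1, 1), (2, 1)])   # a = 1/6, b = 1/2
```

Steps (2)-(4) of the proposed progression would then only generalize the sums inside `fit_line` to matrix products.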


 * I agree with many of the points stated here. This article is about a very simple concept, minimizing a sum of squares of linear functions. It is currently written as if it were a copy of linear regression, which is where the data fitting concepts belong fully. I taught linear least squares to undergraduates when I worked in academia. They would find this article, with its data fitting concepts and parameter estimates, incomprehensible.


 * Linear least squares is a much simpler concept than what this article makes it appear. Oleg Alexandrov (talk) 16:12, 16 May 2008 (UTC)

Proposed rewrite
Here is a draft proposed rewrite of the current article. I followed the Bjorck book. These are the assumptions that guided me:


 * The concept of linear system is inherently simpler than the concept of statistical data fitting. Linear systems are taught to children before high school (e.g., find the intersection of two lines).


 * Data fitting is the primary application of linear least squares, but data fitting is not the same as linear least squares. Linear least squares arises in other applications (such as Galerkin approximations).
 * The primary article dedicated to using linear least squares in data fitting problems is linear regression. Statistical analysis of model parameters belongs there. This article is about minimizing a sum of squares of linear functions, regardless of the origin and interpretation of such functions.

Comments? Oleg Alexandrov (talk) 03:34, 17 May 2008 (UTC)


 * Perhaps matrix notation could be mentioned in the introduction? #Normal equations seems a bit late for the first matrix.  — Arthur Rubin  (talk) 17:31, 17 May 2008 (UTC)
 * Perhaps in the "Problem statement" section? Oleg Alexandrov (talk) 00:12, 18 May 2008 (UTC)
 * I'm in a minority of one in this discussion, so I have to represent the needs of all the non-mathematicians who are interested in data fitting. They will be put off immediately by a mathematics-orientated presentation. Look at C.L. Lawson and R.J. Hanson, Solving Least Squares Problems, Prentice-Hall, 1974 for an authoritative presentation of their point of view, before possibly committing another blunder as with the proposals regarding the orthogonal decomposition method. Petergans (talk) 07:15, 18 May 2008 (UTC)


 * Petergans, I think my stance is even more reactionary than yours. Readers will be put off immediately by
 * $$\sum_{j=1}^{n} X_{ij}\beta_j = y_i,\ (i=1, 2, \dots, m)$$
 * which doesn't have any matrices (explicitly). That's why I suggest doing the example ("Specific solution, straight line fitting, with example") first, with a minimum of notation. Joshua R. Davis (talk) 14:42, 18 May 2008 (UTC)
 * A presentation in terms of linear systems, which 14-year-old children know about, is "mathematically-oriented"? No matter that the data-fitting application comes immediately after that? Peter, you can't be serious. Oleg Alexandrov (talk) 16:15, 18 May 2008 (UTC)

I copied the proposed version into the article. I think more work is needed, as the prose and article structure are clumsy in places. I also fully support Joshua's proposal to simplify the example and move it before the theory. Oleg Alexandrov (talk) 17:44, 18 May 2008 (UTC)

Motivational example
Any first year undergraduate would be severely reprimanded for writing
 * $$\alpha\approx 2.5763$$
 * $$\beta\approx 0.3390$$

since it is obvious that five significant figures are not merited by the data.

This is emphatically the wrong example to give to inexperienced readers since it gives the impression that very precise results can be obtained from poor data. As I have repeatedly stated, no result is meaningful without an estimate of error and this should be made clear even in the simplest example.

In this case, since the observed values have only one significant figure, the results have at most two significant figures. The precision of the parameters is a little greater than the precision of the observations because the parameters are overdetermined 2:1.

At the very least there should be a link to the determination of standard deviation, or a footnote giving the sd values. Petergans (talk) 09:11, 21 May 2008 (UTC)


 * Since we're making the example, we can instead add more significant figures to the data, so that they are justified in the results. I'm worried that, if we reduce the precision of the results, then the best-fit-ness of the line will be less convincing.


 * As for error, can we not just state the least squares error? And preferably show the error on the graph, in terms of line segments from the line to the data. Joshua R. Davis (talk) 13:15, 21 May 2008 (UTC)

I do agree that parameter estimates are very important to any real-world data fitting computation; however, that could be too much for a first example, and besides, parameter estimation is not part of the linear least squares problem itself, which is concerned only with finding the best-fit solution (parameter estimates belong to linear regression).

Joshua, I plan to later modify the data a bit (now almost the same numbers show both as x and y values which is confusing). Then I will also remake the picture and show the errors. I'll try to find some time for this later this week. Oleg Alexandrov (talk) 15:29, 21 May 2008 (UTC)
 * Peter, I don't mind a footnote giving the sd values. I'd like to first change the example a bit, for the reason mentioned earlier. Oleg Alexandrov (talk) 15:33, 21 May 2008 (UTC)


 * Peter, there is nothing wrong with saying
 * $$\alpha=\frac{152}{59}\approx 2.5763$$
 * $$\beta=\frac{20}{59}\approx 0.3390$$


 * As you mentioned many times earlier, the linear least squares problem is solved exactly. The discussion of the uncertainty in the input data and their effect on the values of the parameters should not be confused with the fact that the linear least squares problem has exactly one solution that can be determined very accurately given the input data. Oleg Alexandrov (talk) 16:03, 21 May 2008 (UTC)


 * There is everything wrong with giving figures that are meaningless (e.g. 763) in a result. Just as least squares is used to get parameter values, so also it is used to get parameter errors. A parameter value is a linear combination of the observed values. It follows by error propagation that the error is a linear combination of the observational errors; hence the derivation using significant figures. Statistics does not come into it. Parameter errors are an essential part of any least squares analysis of data, and the motivational example is an analysis of data. Petergans (talk) 18:16, 21 May 2008 (UTC)


 * If there are errors in the input parameters ($$x_i$$, $$X_{ij}$$, etc.), it takes the problem out of the linear least squares formulation, and puts it squarely into (a generalization of) linear regression. In fact, one might argue that calculating the parameter errors in $$\beta_j$$ is beyond the scope of this article, but I don't think I'd go that far.  — Arthur Rubin  (talk) 18:50, 21 May 2008 (UTC)
 * Just because an input number is 4, it does not mean this is accurate to one significant digit. It could be $$4\pm 1$$ or $$4\pm 0.001.$$ I don't believe you can talk about parameter errors without any assumption on the statistical properties of the input data. And that is not strictly part of the linear least squares problem. In linear least squares the input data is taken "as is", and the minimum is found exactly.


 * That being said, I don't mind a footnote at the end of the example, mentioning briefly some of the statistical aspects, even if that is not part of solving the linear least squares problem. Oleg Alexandrov (talk) 20:08, 21 May 2008 (UTC)
 * To be fair to Peter: If you mean $$4\pm 0.001$$, then you should say 4.000 rather than just 4. JRSpriggs (talk) 23:49, 21 May 2008 (UTC)

Let's get one thing straight. There is no real difference between linear least squares and linear regression when the latter uses the least squares method to obtain the parameters. The different names are an historical fact, resulting from the employment of the principle of least squares by different categories of user. In the WP context both articles must take experimental error into consideration.

Experimental error is not just a "statistical aspect" of the data. All measurements are subject to error. The fact of experimental error is fundamental to any problem of data fitting. The principal reason for collecting more data points than are absolutely needed, that is, for applying the least squares method in the first place, is to improve the precision of the derived parameters. I raise the issue of significant figures, not as a matter for debate, but to give the editor the opportunity to correct his mistake in his own way; otherwise I will have to do it myself. Petergans (talk) 07:06, 22 May 2008 (UTC)


 * Let's go down to the bottom of things. A brief note or a footnote at the end of the example touching upon the issue of experimental errors is welcome. That should not obscure the fact that the numbers alpha and beta are found exactly, given the input numbers, and it should be made clear that any errors in alpha and beta are due to errors in the input data rather than due to the process of minimization. Oleg Alexandrov (talk) 15:38, 22 May 2008 (UTC)
 * This is a correct distinction. It is also the difference between the number coming out of the calculating machine and what is published in the final report. So, $$\hat\alpha=\frac{152}{59}$$ is acceptable as an expression for the analytical solution, but to publish the result with more than two significant figures is not: $$\hat\alpha=2.57$$ is not acceptable because the seven may be replaced by any other digit without making a significant difference to the fit. When going from the analytical solution to the numerical result, the choice of the number of significant figures MUST be explained in the body of the text; the sd argument is more sophisticated and can go in a footnote at this stage. Petergans (talk) 09:23, 23 May 2008 (UTC)
 * I've just noticed that the new four data points appear to lie on a curve. I suggest instead (1,1) (2,5) (3,6) (4,10). This also gets round the significant figures issue because α = 56/20 = 2.8 and β = −30/20 = −1.5! Petergans (talk) 13:08, 23 May 2008 (UTC)
 * Pardon me, but the data is always on a curve. :) I'll take a look tonight at the new data and if they look good on the picture I'll regenerate it and modify the example. Oleg Alexandrov (talk) 15:28, 23 May 2008 (UTC)
 * Done. It was a nice idea to replace the numbers to get simpler answers. Oleg Alexandrov (talk) 04:57, 24 May 2008 (UTC)
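 * For anyone who wants to check Petergans' fractions by hand: a short sketch (not part of the article) solving the 2x2 normal equations for his replacement data by Cramer's rule, assuming the straight-line model $$y = \beta_1 + \beta_2 x$$. The `Fraction` type keeps the arithmetic exact:

```python
# Exact least-squares fit of a line to (1,1), (2,5), (3,6), (4,10),
# solving the 2x2 normal equations by Cramer's rule with exact rationals.
from fractions import Fraction

pts = [(1, 1), (2, 5), (3, 6), (4, 10)]
m = len(pts)
sx = sum(Fraction(x) for x, _ in pts)        # sum of x_i
sy = sum(Fraction(y) for _, y in pts)        # sum of y_i
sxx = sum(Fraction(x * x) for x, _ in pts)   # sum of x_i^2
sxy = sum(Fraction(x * y) for x, y in pts)   # sum of x_i*y_i

# Normal equations: [[m, sx], [sx, sxx]] . [b1, b2] = [sy, sxy]
det = m * sxx - sx * sx                  # determinant = 20
b1 = (sy * sxx - sx * sxy) / det         # intercept = -30/20 = -3/2
b2 = (m * sxy - sx * sy) / det           # slope     =  56/20 = 14/5
```

The two quoted fractions, 56/20 and −30/20, come out exactly (as the slope and intercept respectively; which one is called α and which β depends on how the article writes the model).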

Weighted least squares
The recent re-ordering is the wrong way round. Weighted least squares is more general, so I suggest the properties section should be modified to take account of this and then the old order makes more sense. The modification could be as simple as adding "assuming unit weights", but I would prefer expressions that include weights. Petergans (talk) 16:28, 25 June 2008 (UTC)
 * Fair enough. To be honest though, I'd cut all mentions of weights W from anywhere in the article except where they are truly relevant to the discussion. In many places in the article the weights play no role, and the formulas with weights can be obtained from the formulas without weights by trivial modifications to the matrix $$X$$ and the vectors $$y.$$ Getting rid of W in as many places as possible would make the exposition simpler without reducing the generality. Oleg Alexandrov (talk) 05:07, 26 June 2008 (UTC)
 * I agree it would be simpler from the mathematical point of view, but weights are very important to the experimentalist and I would prefer to see them nearer the beginning. There is a possible compromise. If the weight matrix is factored as $$\mathbf{W}=\mathbf{w}^T\mathbf{w}$$, then the weighted normal equations become $$\left(\mathbf{X}^T\mathbf{w}^T\mathbf{w}\mathbf{X}\right)\hat{\boldsymbol\beta}=\mathbf{X}^T\mathbf{w}^T\mathbf{w}\mathbf{y}$$, equivalent to $$\left(\mathbf{J}^T\mathbf{J}\right)\hat{\boldsymbol\beta}=\mathbf{J}^T\mathbf{y}'$$ with $$\mathbf{J}=\mathbf{wX}$$ and $$\mathbf{y}'=\mathbf{wy}$$. Once this is explained the weights need no longer be stated explicitly. This is useful theoretically but is not very practical when the weight matrix is not diagonal.
 * On the whole I would prefer to see the weights written explicitly. The crunch comes when evaluating the goodness of fit: the weighted sum of squares can be tested against a standard chi-squared distribution with m − n degrees of freedom. To perform this test with unit-weighted residuals one would have to supply an estimate of the variance of an observation, $$\sigma^2$$, which is equivalent to setting the weight matrix to $$\frac{1}{\sigma^2}\mathbf{I}$$. Petergans (talk) 21:48, 26 June 2008 (UTC)
 * It depends how far up you want to move verbose formulas containing the weights. I'd suggest below the "normal equations" and "orthogonal decomposition" sections, as those are rather introductory. Oleg Alexandrov (talk) 03:04, 28 June 2008 (UTC)
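 * Petergans' factoring argument is easy to verify numerically for a diagonal weight matrix: with $$\mathbf{W}=\mathbf{w}^T\mathbf{w}$$ and $$\mathbf{w}$$ the diagonal matrix of square roots of the weights, the weighted normal equations coincide with the ordinary normal equations of the scaled problem J = wX, y' = wy. A sketch (the data and weights below are made up for illustration):

```python
# Check that weighted least squares equals ordinary least squares on
# rows scaled by sqrt(w_i), for a straight-line model y = b1 + b2*x.
from math import sqrt

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 5.0, 6.0, 10.0]
ws = [1.0, 4.0, 1.0, 0.25]        # diagonal of the weight matrix W

def solve2(a11, a12, a22, r1, r2):
    """Solve the symmetric 2x2 system [[a11,a12],[a12,a22]] b = [r1,r2]."""
    det = a11 * a22 - a12 * a12
    return ((r1 * a22 - a12 * r2) / det, (a11 * r2 - a12 * r1) / det)

# (1) Weighted normal equations, with the weights written explicitly.
beta_W = solve2(sum(ws),
                sum(w * x for w, x in zip(ws, xs)),
                sum(w * x * x for w, x in zip(ws, xs)),
                sum(w * y for w, y in zip(ws, ys)),
                sum(w * x * y for w, x, y in zip(ws, xs, ys)))

# (2) Ordinary normal equations of the scaled problem J = wX, y' = wy,
#     whose columns are sqrt(w_i)*1 and sqrt(w_i)*x_i.
c1 = [sqrt(w) for w in ws]
c2 = [sqrt(w) * x for w, x in zip(ws, xs)]
yp = [sqrt(w) * y for w, y in zip(ws, ys)]
beta_J = solve2(sum(u * u for u in c1),
                sum(u * v for u, v in zip(c1, c2)),
                sum(v * v for v in c2),
                sum(u * t for u, t in zip(c1, yp)),
                sum(v * t for v, t in zip(c2, yp)))
```

Both routes give the same parameters, which is why, once the factoring is explained, the weights need not be carried through every formula.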

Inversion Warning?
The section about "Inverting the normal equations" warns that "An exception occurs...". As someone trying to learn from the article, this sentence seems unnecessarily vague. Someone (like me) who knows a little math, knows that if the matrix has an inverse, it is unique. Is the article trying to indicate a computational problem, like rounding, or something else? If you are saying that computing equipment cannot calculate the inverse to an acceptable precision, then say that. Stephensbell (talk) 18:20, 23 October 2008 (UTC)
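One concrete reading of the vague "an exception occurs" warning: if the columns of X are linearly dependent, $$X^TX$$ is singular and has no inverse at all; in floating point, *nearly* dependent columns cause the rounding trouble Stephensbell asks about. A toy demonstration of the singular case (matrix made up for illustration):

```python
# If one column of X is a multiple of another, the normal matrix X^T X
# is singular: its determinant is exactly zero and no inverse exists.

X = [[1.0, 2.0],
     [2.0, 4.0],
     [3.0, 6.0]]          # second column = 2 * first column

# Form the 2x2 normal matrix X^T X entry by entry.
a11 = sum(row[0] * row[0] for row in X)   # 14
a12 = sum(row[0] * row[1] for row in X)   # 28
a22 = sum(row[1] * row[1] for row in X)   # 56

det = a11 * a22 - a12 * a12   # determinant of X^T X: 14*56 - 28*28 = 0
```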

Notation
Certain editors are changing the symbol for transpose from T ($${}^{\mathrm T}$$) to ' (something which doesn't work at all properly within tags, so I'm not going to try). I think this requires consensus both here and at the math notation page. — Arthur Rubin (talk) 22:31, 6 November 2008 (UTC)


 * I am part of the editors who use the ' notation. I think it is less ambiguous than the T notation, simply because in some contexts the exponent T refers to multiplying the matrix by itself T times. Moreover, some authors use a T (sometimes a lower-case t) before the matrix to indicate transposition, which can be ambiguous too. When written after a matrix or a vector, the symbol ' has, as far as I know, no other use. In addition, it would be good to drop the convention of writing vectors and matrices in bold. First, most of the pages in Wikipedia already do not use it, and the equations look much cleaner (more readable) without bold everywhere. Finally, vectors and matrices are just elements of sets, and I have never seen, in contexts other than basic linear algebra, the use of bold to denote the elements of a set. Bold is rather used to write the sets themselves. Flavio Guitian (talk) 16:09, 7 November 2008 (UTC)

Writing
 * $$ A' \, $$

for the transpose of the matrix A is done far more frequently in the statistics literature than is the superscript-T notation for transpose. It's also done in Herstein's Topics in Algebra, but the superscript-T is used in Weisberg's Applied Linear Regression, so I privately, jocularly think of the superscript-T notation as the Weisberg notation and the prime notation as the Herstein notation. Michael Hardy (talk) 17:34, 7 November 2008 (UTC)


 * Hmmm. Well, the T's were there first, so this should probably be considered a style issue, which should not be changed without good cause.  Perhaps changing them to yet another notation used occasionally on Wikipedia,
 * $$ A^\top, \, $$
 * might be a reasonable substitute, annoying everyone equally. For what it's worth, I could, if I researched it sufficiently, produce an equation involving $$\beta '$$, the derivative of $$\beta$$ with respect to an unnamed independent variable.  — Arthur Rubin  (talk) 18:21, 7 November 2008 (UTC)
 * I'm actually not sure about the bolding issue. Perhaps further discussion could be done in the WT:MSM section on that issue.  — Arthur Rubin  (talk) 18:28, 7 November 2008 (UTC)


 * I'm personally a fan of  myself, but don't really care as long as it is consistent. I do however think the excessive bolding is very ugly: I think it should be removed (unless anyone objects, I may do this in the next couple of days). 3mta3 (talk) 17:14, 6 April 2009 (UTC)
 * I agree, the use of bold characters for matrices is probably one of the biggest stupidities I have seen in statistics/mathematics/econometrics. Flavio Guitian (talk) 18:34, 7 April 2009 (UTC)


 * Simple and clean notation aids in understanding the material. Imagine how the theory of relativity would have looked without Einstein's summation convention: ugly Σ's would have been all over the place. The same goes for matrix transposes. Of course $$A^T$$ might not be too bad (except that T is often reserved for the number of observations), but compare
 * $$\, (G'WG)^{-1}G'W\Omega^{-1}WG(G'WG)^{-1} $$   with
 * $$\, (\mathbf{G}^T\mathbf{WG})^{-1}\mathbf{G}^T\mathbf{W}\boldsymbol\Omega^{-1}\mathbf{WG}(\mathbf{G}^T\mathbf{WG})^{-1} $$
 * The first one looks much tidier and thus easier to understand. Even though the T's were there first, that's not a reason to keep them. // Stpasha (talk) 04:57, 1 July 2009 (UTC)

Difficulties following the logical flow of the article
As an MD with an interest in mathematics, and some background knowledge of linear algebra, I was trying to read this article from its beginning to the section entitled 'Inverting the normal equations'. Several points were decidedly unclear to me:
1. The alphas from the motivational example are gone in the section on 'The general problem'. I guess they are simply subtracted from the yi, but this was confusing.
2. The section on 'Uses in data fitting' ends by saying 'The problem then reduces to the overdetermined linear system mentioned earlier, with Xij = φj(xi).' This is unclear to me, because in the 'General problem' section it is said that the overdetermined linear system usually has no solution. The data fitting procedure, on the other hand, does come up with a solution. So, I would think that the fitting problem we start with is an overdetermined system, and the data fitting procedure comes up with the "best" solution. At the point where it is said that 'The problem then reduces to the overdetermined linear system mentioned earlier', in reality we have left behind that overdetermined linear system already, in order to find the approximate solution.
3. In the 'Derivation of the normal equations' section, and despite a little knowledge of matrix algebra, it was unclear to me how the normal equations are "translated" into matrix notation.

I apologize for my nonprofessional view of these matters, but, then, these encyclopedia articles are meant for people who do not know everything about the subject. So I thought it would be helpful to let you know how the article feels to a nonmathematician, nonphysicist reader. Frandege (talk) 21:47, 28 November 2008 (UTC)


 * First, I understand that this article is quite difficult to read for people who are not familiar with regression and linear algebra in general. In addition, I find the notation quite bad (this article needs a rewrite), and this article is not as introductory as the linear regression article. Now about your questions:


 * (1) The alphas are not gone, neither were they subtracted from the $$ y_i $$. It is customary to use $$ \alpha $$ and $$ \beta $$ when there is only one regressor. However, in the general case when we have more than two regressors, betas indexed by integers are used. Thus, a model including only one regressor and a constant can also be written as


 * $$ Y_i = \beta_1 + \beta_2 X_i + \varepsilon_i. \, $$


 * (2) An overdetermined system of equations usually has no solution, period. It means that there are more equations than unknowns, which seems natural in the case of regression because we expect more observations than parameters to estimate (the betas are the unknowns here). What we do in regression analysis is come up with a different criterion to still find an approximate solution to this system. One of the most popular criteria is to minimize the sum of squared errors, and there are many mathematical reasons for that. For example, it can be shown that if the errors are normally distributed the least squares criterion is also the maximum likelihood estimator. It can also be shown that geometrically the least squares estimation amounts to orthogonally projecting the observation vector y onto the column space of the matrix X.


 * (3) The translation in matrix notation directly follows from the definition of the matrix product. I can’t really help you much about that, since this comes from a definition. See matrix multiplication.


 * Flavio Guitian (talk) 16:53, 29 November 2008 (UTC)
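 * The orthogonal-projection picture in Flavio's point (2) can be checked numerically: the least-squares residual is orthogonal to every column of X, which for a straight-line fit means the residuals sum to zero and are uncorrelated with the x's. A sketch with made-up data (exact rationals so the orthogonality holds exactly):

```python
# Fit y = b1 + b2*x by least squares and verify that the residual vector
# is orthogonal to both columns of X (the all-ones column and the x's).
from fractions import Fraction

xs = [Fraction(v) for v in (0, 1, 2, 3)]
ys = [Fraction(v) for v in (1, 3, 4, 4)]
m = len(xs)

sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
det = m * sxx - sx * sx

b1 = (sy * sxx - sx * sxy) / det   # intercept
b2 = (m * sxy - sx * sy) / det     # slope

resid = [y - (b1 + b2 * x) for x, y in zip(xs, ys)]

# Orthogonality to the columns of X (this is exactly what the normal
# equations X^T(y - X*beta) = 0 say):
dot_ones = sum(resid)                              # residual . (1,...,1)
dot_x = sum(x * r for x, r in zip(xs, resid))      # residual . (x_1,...,x_m)
```

Both dot products come out exactly zero, which is the geometric content of the normal equations.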


 * Frandege, thanks for your constructive comments. I have made some edits that should make it flow better, please look again. Jmath666 (talk) 07:21, 30 November 2008 (UTC)

To JMath: Thank you for looking into this; I believe the article benefited a great deal from the changes. I am still not very sure about the alphas which have now become beta1. My difficulty is that the equation which starts the 'General problem' section, $$\sum_{j=1}^{n} X_{ij}\beta_j = y_i,\ (i=1, 2, \dots, m),$$ does not allow for terms without X-values. This stands in contrast to the section 'Uses in data fitting', where it is easy to conceive $$\phi_j(x)=1$$ for all x in the equation $$f(x, \boldsymbol \beta) = \sum_{j=1}^{n} \beta_j \phi_j(x)$$. I can see that this point bears little on the further derivation, but it would be preferable to get rid of this inconsistency. To Flavio Guitian: for (1) see my comment above, for (2) this is clear now, for (3) thank you for pointing this out - I should have seen that, but got jumbled up with the transpose sign. Frandege (talk) 19:41, 30 November 2008 (UTC)


 * Thank you. I am not sure I understand. Some of the coefficients $$X_{ij}$$ can be 1 that's what happens in the "Motivating example". I put explicit 1s there now for consistency. Jmath666 (talk) 02:38, 1 December 2008 (UTC)

Thank you again. The confusion arises from the use of $$X_{ij}$$, which I confounded with the x's used as abscissa in the second plot on the article page. I can see now that the $$X_{ij}$$ are nothing else than the $$\phi_j(x_i)$$, where the $$x_i$$ are the abscissas in the plot. As I stated yesterday, the alpha from the previous version (our current beta1) corresponds to $$\phi_j=1$$. In the example given in the first plot of the article, there would also be a beta2, corresponding to $$\phi_j(x_i)=x_i^2$$. I am still tempted to think that the article would be clearer if the 'general problem' were only described where and when it arises, i.e. in the data fitting.

I fixed a minor inconsistency (the alpha was still mentioned in one sentence). Frandege (talk) 19:24, 1 December 2008 (UTC)