Talk:Least-squares estimation of linear regression coefficients

What the hell's wrong with math tex codes in this article? All I see are red lines!!! --138.25.80.124 01:03, 8 August 2006 (UTC)

It's hard to know where to begin saying what's wrong with this truly horrible article...


 * Wherein we show that computing the Gauss-Markov estimation of the linear regression coefficients is exactly the same as projecting orthogonally on a subspace of linear functions.

There is no context-setting at all above. Nothing to tell the reader what subject this is on, unless the reader already knows.

The Gauss-Markov theorem states that projecting orthogonally onto a certain subspace is in a certain sense optimal if certain assumptions hold. That is explained in the article titled Gauss-Markov theorem. What, then, is different about the purpose of this article?


 * Given the Gauss-Markov hypothesis, we can find an explicit form for the function which lies the most closely to the dependent variable $$Y$$.

"an explicit form for the function which lies the most closely to the dependent variable $$Y$$." What does that mean?? This is one of the vaguest bits of writing I've seen in a while.


 * Let F be the space of all random variables $$(\omega,\mathcal{A})\rightarrow(\Gamma,S)$$ such that $$(F,d)$$ is a metric space.

The above is completely nonsensical. It purports to define some particular space F, but it does not. It does not say what &omega;, $$\mathcal{A}$$, &Gamma;, or S is, but those need to be defined before being referred to in this way. And what possible relevance to the topic does this stipulation of F have?


 * We can see $$\eta$$ as the projection of $$Y$$ on the subspace G of $$F$$ generated by $$(X_1,\cdots,X_p)$$.

What is &eta;?? It has not been defined. A subspace of F? F has also not been defined. What is $$(X_1,\cdots,X_p)$$? Not defined. Conventionally in this topic $$X_1,\cdots,X_p$$ would be column vectors in R^k and the response variable Y would also be in R^k. But that jars with the idea that $$X_1,\cdots,X_p$$ are in some space F of random variables, stated above.


 * Indeed, we know that by definition $$Y=\eta(X;\theta)+\varepsilon$$. As $$\varepsilon$$ and X are supposed to be independant, we have:

How do we know that? And what does it mean? And what is X? Conventionally X would be a "design matrix", and in most accounts, X is not random, so it is trivially independent of any random variable. (And it wouldn't hurt to spell "independent" correctly.)


 * $$\mathbb{E}(Y|X)=\eta(X;\theta)$$,

What does that have to do with independence of X and anything else, and what does this weird notation &eta;(X;&theta;) mean? I have a PhD in statistics, and I can't make any sense of this.


 * but $$Y\mapsto\mathbb{E}(Y|X)$$ is a projection!

I know a context within which that would make sense, but I don't see its relevance here. The sort of projection in Hilbert space usually contemplated when this sort of thing is asserted is really not relevant to this topic.


 * Hence, $$\eta$$ is a projection of Y.

This is just idiotic nonsense.


 * We will now show this projection is orthogonal. If we consider the Euclidean scalar product between two vectors (i.e. $$\langle u,v\rangle:=u^t v$$), we can build a scalar product in F with $$\langle X,Y\rangle:=\mathbb{E}[X^t Y]$$ (it is indeed a scalar product because if $$\mathbb{E}\|X\|^2=0$$, then $$X=0$$ almost everywhere).

User:Deimos, for $50 per hour I'll sit down with you and parse the above if you're able to do it. I will require your patience. You're writing crap.


 * For any $$X_j$$ ($$1\leq j\leq p$$), $$\langle X_j,\varepsilon\rangle=\langle X_j,Y\rangle-\langle X_j,\mathbb{E}[Y|X]\rangle=\mathbb{E}[X_j^t Y] - \mathbb{E}[X_j^t \mathbb{E}[Y|X]]=X_j^t(\mathbb{E}Y-\mathbb{E}[\mathbb{E}[Y|X]])=X_j^t(\mathbb{E}Y - \mathbb{E}Y)=0$$. Therefore, $$\varepsilon$$ is orthogonal to G which means the projection is orthogonal.

Some of the above might make some sense, but it is very vaguely written, to say the least. One concrete thing I can suggest: Please don't write


 * $$<X_j,\varepsilon>.\,$$

when you mean


 * $$\langle X_j,\varepsilon\rangle.\,$$


 * Therefore, $$X^t (\eta(X;\theta)-Y) = 0$$. As $$\eta(X;\theta)=X\theta$$, this equation yields to $$X^t X \theta = X^t Y$$.


 * If $$X$$ is of full rank, then so is $$X^t X$$. In that case,


 * $$\theta=(X^tX)^{-1} X^t Y$$. Given the realizations $$x$$ and $$y$$ of $$X$$ and $$Y$$, we choose
 * $$\hat{\theta}=(x^t x)^{-1}x^t y$$ and $$\eta(X;\hat{\theta}) = X\hat{\theta}$$.

Sigh..... Let's see .... I could ask why we should choose anything here.
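For what it's worth, the algebraic facts being quoted above can at least be checked numerically. Here is a minimal NumPy sketch (the data and variable names are my own, made up for illustration): with a full-rank design matrix, the coefficients solving the normal equations $$X^t X \theta = X^t Y$$ leave a residual orthogonal to every column of $$X$$.

```python
# Sanity check of the quoted derivation: solve the normal equations
# X^t X theta = X^t y, then verify the residual is orthogonal to
# every column of X (i.e. the fit is an orthogonal projection).
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
# Design matrix with an intercept column of ones; full rank almost surely.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

theta = np.linalg.solve(X.T @ X, X.T @ y)  # theta = (X^t X)^{-1} X^t y
residual = y - X @ theta

# X^t (y - X theta) = 0 up to floating-point error.
print(np.allclose(X.T @ residual, 0))  # True
```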

OK, looking through this carefully has convinced me that this article is 100% worthless. Michael Hardy 23:38, 5 February 2006 (UTC)

Recent edits
After the last round of edits, it is still completely unclear what is to be proved in this article, and highly implausible that it proves anything. Michael Hardy 00:51, 9 February 2006 (UTC)

OK, I'm back for the moment. The article contains this sentence:


 * In this article, we provide a proof for the general expression of this estimator (as seen for example in the article regression analysis):

$$\widehat{\theta}_n^{LS}=(X^t X)^{-1}X^t Y$$

What does that mean? Does it mean that the least-squares estimator actually is that particular matrix product? If so, the proof should not involve probability, but only linear algebra. Does it mean that the least-squares estimator is the one that satisfies some list of criteria? If so, which criteria? The Gauss-Markov assumptions? If it's the Gauss-Markov assumptions, then this would be a proof of the Gauss-Markov theorem. But I certainly don't think that's what it is. In the present state of the article, the reader can only guess what the writer intended to prove! Michael Hardy 03:19, 9 February 2006 (UTC)
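To illustrate the first reading (pure linear algebra, no probability): the claim that the least-squares estimator equals $$(X^t X)^{-1}X^t Y$$ can be checked against a generic minimizer of the sum of squares. A sketch, with made-up data:

```python
# If the statement is purely that the minimizer of ||y - X theta||^2
# is (X^t X)^{-1} X^t y, no probability is needed: compare the closed
# form against a direct least-squares solver.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))  # full column rank almost surely
y = rng.normal(size=30)

closed_form = np.linalg.inv(X.T @ X) @ X.T @ y
lstsq_solution, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(closed_form, lstsq_solution))  # True
```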

one bit at a time...
I'm going to dissect this slowly. The following is just the first step. The article says:


 * $$(\Omega,\mathcal{A}, P)$$ will denote a probability space and $$n\in\mathbb{N}^*$$ (called number of observations). $$\mathcal{B}_n$$ will be the n-dimensional Borel algebra. $$\Theta\subseteq\mathbb{R}$$ is a set of coefficients.


 * The response variable (or vector of observations) Y is a random variable, i.e. a measurable function $$Y:(\Omega,\mathcal{A})\rightarrow(\mathbb{R}^n,\mathcal{B}_n)$$.


 * Let $$p\in\mathbb{N}^*$$. $$p$$ is called number of factors. $$\forall i\in \{1,\cdots,p\}, X_i:(\Omega,\mathcal{A})\rightarrow(\mathbb{R}^n, \mathcal{B}_n)$$ is called a factor.


 * $$\forall\theta\in\Theta^{p+1}$$, let $$\eta(X;\theta):=\theta^0 + \sum_{j=1}^p \theta^j X_j$$.


 * We define the errors $$\varepsilon(\theta):=Y-\eta(X;\theta)$$ with $$\theta:=(\theta_0,\cdots,\theta_p)\in\Theta^{p+1}$$. We can now write:


 * $$\forall \theta\in\Theta, Y=\theta^0 + \sum_{j=1}^p \theta^j X_j+\varepsilon(\theta)$$

In simpler terms, what this says is the following:


 * Let Y be a random variable taking values in Rn, whose components we call observations, and having expected value


 * $$\eta=\theta_0 \mathbf{1}_n + \sum_{j=1}^p \theta_j X_j,$$


 * where
 * Xj &isin; Rn for j = 1, ..., p is a vector called a factor,
 * 1n is a column vector whose n components are all 1, and
 * &theta;j is a scalar, for j = 0, ..., p.


 * Define the vector of errors to be &epsilon; = Y &minus; &eta;.

The first version is badly written because
 * Explicit mention of the underlying probability space, and Borel measurability, are irrelevant clutter, occupying the reader's attention but not giving the reader anything. When, in the study of statistics, do you ever see a random vector that is not Borel-measurable?  Will the fact of measurability be used in the succeeding argument?  A link to expected value is quite relevant to the topic; a link to measurable function is not.
 * Saying "$$\Theta\subseteq\mathbb{R}$$ is a set of coefficients" makes no sense. The coefficients are the individual components of a vector &theta; somewhere within this parameter space.  If anything, &Theta; must be a subset of R^p in which the unobserved vector &theta; is known to lie.  If that subset is anything other than the whole of R^p, then I think you'll have trouble making the case that least-squares estimation of &theta; is appropriate, since the estimate presumably should be within the parameter space;
 * The column vector of n "1"s is missing;
 * It alternates between subscripts and superscripts on the letter &theta;, for no apparent reason;
 * Why in the world is &epsilon; asserted to depend on &theta;? Later the article brings the Gauss-Markov assumptions, which would conflict with that.
 * One should use mathematical notation when it serves a purpose, not just whenever one can. It is clearer to say "For every subset A of C" than to say "$$\forall A\in\mathcal{P}(C),$$ where $$\mathcal{P}(C)$$ is the set of all subsets of C."

OK, this is just one small point; the article has many similar problems, not the least of which is that its purpose is still not clear. I'll be back. Michael Hardy 00:25, 20 February 2006 (UTC)

Thanks
OK, this makes sense: I'll correct the article. Except for the "having expected value" part. The way I present it, you can always write $$y=\eta+\varepsilon$$. What the Gauss-Markov assumptions add is that there exists an optimal parameter $$\overline{\theta}$$ for which $$\varepsilon$$ has an expectation of 0 and that its components are independent. The advantage is that you do not have to suppose that the $$X_j$$'s are constants. In the case of randomized designs, this is important. Deimos 28 12:10, 20 February 2006 (UTC)

Aim of the article
I have now added to the introduction that I wish to give a motivation behind the criterion optimized in least-squares (seeing a regression as a projection on a linear space of random variables) and derive the expression of this estimator. One can differentiate the sum of squares and obtain the same result, but I think that the geometrical way of seeing the problem makes it easier to understand why we use the sum of squares (because of Pythagoras' theorem, i.e. $$\|Y\|^2_2=\|\eta(X;\overline{\theta})\|^2_2+\|\varepsilon(\overline{\theta})\|^2_2$$, where $$\|X\|^2_2:=\mathbb{E}[X^2]$$). To see the regression problem in this way requires the Gauss-Markov hypothesis (otherwise we cannot show that E(.|X) is an orthogonal projection). Regards, Deimos 28 08:56, 9 February 2006 (UTC)
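The Pythagorean identity invoked just above also holds in its finite-sample form, which is easy to verify numerically. A sketch with made-up data (the sample norm standing in for the $$\|\cdot\|_2$$ above): with $$\hat{y}$$ the orthogonal projection of $$y$$ onto the column space of $$X$$, $$\|y\|^2 = \|\hat{y}\|^2 + \|y-\hat{y}\|^2$$.

```python
# Finite-sample Pythagoras: the squared norm of y splits into the
# squared norm of its projection y_hat plus that of the residual.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)

y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)  # orthogonal projection of y
eps = y - y_hat

print(np.isclose(y @ y, y_hat @ y_hat + eps @ eps))  # True
```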


 * Looks like it has been 2 years since this article last received any significant attention. Back then most people who commented on this talk page were complaining that the article is a complete mess. The article was even nominated for deletion, though the proposal was rejected based upon (i) notability of the subject, (ii) faith that somebody would bring it into a reasonable shape, (iii) absence of other articles dedicated to LS methods. The author's idea was to derive the OLS method as a projection onto the space of regressors. And although nobody doubts such an approach is valid, in my opinion it is more counterintuitive than "easy to understand". Look at the picture on the right: it shows a simple linear regression, however it'd take a 4-dimensional space to represent it as a projection, and there aren't that many people in the world who can visualize things in high-dimensional spaces ...


 * Anyways, my point is: the article is abandoned, still a mess, without a clear idea what it is supposed to be about, and most of the "keep" arguments used in AfD discussion 2 years ago are no longer applicable. Maybe it's time to reopen the AfD discussion? // Stpasha (talk) 10:51, 5 July 2009 (UTC)