Talk:Hessian matrix

Initial discussion
This would really be a lot easier to understand if we could see a visual representation... something like:

$$H_f(x,y) = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial y\,\partial x} \\ \frac{\partial^2 f}{\partial x\,\partial y} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}$$

"Hessian matrices are used in large-scale optimization problems" looks incorrect to me: for high-dimensional problems, second-order methods are usually considered only when there is some known exploitable (sparse) structure. In general the Hessian matrix is too big to be stored, and first-order methods are the main choice. --Lostella (talk) 09:44, 10 July 2013 (UTC)

Is this correct?
Should the second sentence of 'Second derivative test' ('If the Hessian is positive definite...') not be 'If the determinant of the Hessian is positive definite...'?


 * A positive-definite matrix is a type of symmetric matrix. A determinant is just a real number, which may be positive or negative, but not positive definite.  Follow the link. -GTBacchus(talk) 23:12, 5 March 2006 (UTC)


 * Being positive-definite is not related to being symmetric. It just says that all eigenvalues of the matrix are positive, or that the bilinear form constructed with this matrix is positive (i.e. x^T A x > 0 for all x ≠ 0). The two terms simply go together so often that there is already a shorthand for the combination: s.p.d. Nonetheless, the terms are distinct. 134.169.77.186 (talk) 12:31, 6 April 2009 (UTC) (ezander)


 * Right. Consider a rotation matrix such as
 * R=$$\begin{bmatrix}0&1\\-1&0\end{bmatrix}$$
 * Note that it has determinant of 1 but is not symmetric and has complex eigenvalues. —Ben FrantzDale (talk) 13:02, 6 April 2009 (UTC)
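A quick numerical check of the example above (my own sketch in NumPy, not part of the original comment): the matrix has determinant 1, is not symmetric, and has purely imaginary eigenvalues.

```python
import numpy as np

# The 2x2 matrix R from the comment above: determinant 1, not symmetric,
# and with eigenvalues +i and -i (so not positive definite in any sense).
R = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

print(np.linalg.det(R))        # ~ 1.0
print(np.allclose(R, R.T))     # False: not symmetric
print(np.linalg.eigvals(R))    # purely imaginary pair, ~ [+1j, -1j]
```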

Del
With regard to the del operator, is it that
 * $$H=\nabla\otimes\nabla\cdot f$$?

Or am I just confused? —Ben FrantzDale 08:13, 28 March 2006 (UTC)


 * I think that is close, but you need to transpose one of the dels as well as write f as a diagonal matrix:


 * $$H=\nabla\otimes\nabla^T\cdot \mathrm{diag}(f) = \begin{bmatrix}\frac {\partial}{\partial x_1} \\ \frac {\partial}{\partial x_2} \\ \vdots \\ \frac {\partial}{\partial x_n}\end{bmatrix} \otimes   \begin{bmatrix}\frac {\partial}{\partial x_1} & \frac {\partial}{\partial x_2} & \cdots & \frac {\partial}{\partial x_n}\end{bmatrix} \cdot \mathrm{diag}(f)$$


 * $$H = \begin{bmatrix} \frac{\partial^2}{\partial x_1^2} & \frac{\partial^2}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2}{\partial x_1\partial x_n} \\ \frac{\partial^2}{\partial x_2\partial x_1} & \frac{\partial^2}{\partial x_2^2} & \cdots & \frac{\partial^2}{\partial x_2\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial x_n\partial x_1} & \frac{\partial^2}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2}{\partial x_n^2} \end{bmatrix} \cdot \begin{bmatrix} f & 0 & \cdots & 0 \\ 0 & f & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f\end{bmatrix} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n} \\ \frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n\partial x_1} & \frac{\partial^2 f}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$


 * I'm pretty sure this is right--hope it helps. 16:56, 3 Apr 2006 (UTC)


 * The transpose is redundant, as it is part of the definition of the dyadic product. The $$\cdot$$ shouldn't be there either, as that would make it a divergence, which is defined for vector functions, whereas f here is a scalar function.
 * $$H=\nabla\otimes\nabla f = \begin{bmatrix}\frac {\partial}{\partial x_1} \\ \frac {\partial}{\partial x_2} \\ \vdots \\ \frac {\partial}{\partial x_n}\end{bmatrix} \otimes   \begin{bmatrix}\frac {\partial f}{\partial x_1} \\ \frac {\partial f}{\partial x_2} \\ \vdots \\ \frac {\partial f}{\partial x_n}\end{bmatrix} = \begin{bmatrix}\frac {\partial}{\partial x_1} \\ \frac {\partial}{\partial x_2} \\ \vdots \\ \frac {\partial}{\partial x_n}\end{bmatrix} \begin{bmatrix}\frac {\partial f}{\partial x_1} & \frac {\partial f}{\partial x_2} & \cdots & \frac {\partial f}{\partial x_n}\end{bmatrix} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n} \\ \frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n\partial x_1} & \frac{\partial^2 f}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
 * Kaoru Itou (talk) 20:49, 28 January 2009 (UTC)
 * Also, diagonalising f before multiplying it makes no difference:

 * $$\begin{bmatrix} \frac{\partial^2}{\partial x_1^2} & \frac{\partial^2}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2}{\partial x_1\partial x_n} \\ \frac{\partial^2}{\partial x_2\partial x_1} & \frac{\partial^2}{\partial x_2^2} & \cdots & \frac{\partial^2}{\partial x_2\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial x_n\partial x_1} & \frac{\partial^2}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2}{\partial x_n^2} \end{bmatrix} \cdot \mathrm{diag} (f)= \begin{bmatrix} \frac{\partial^2}{\partial x_1^2} & \frac{\partial^2}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2}{\partial x_1\partial x_n} \\ \frac{\partial^2}{\partial x_2\partial x_1} & \frac{\partial^2}{\partial x_2^2} & \cdots & \frac{\partial^2}{\partial x_2\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial x_n\partial x_1} & \frac{\partial^2}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2}{\partial x_n^2} \end{bmatrix} \cdot \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1\end{bmatrix} f = \begin{bmatrix} \frac{\partial^2}{\partial x_1^2} & \frac{\partial^2}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2}{\partial x_1\partial x_n} \\ \frac{\partial^2}{\partial x_2\partial x_1} & \frac{\partial^2}{\partial x_2^2} & \cdots & \frac{\partial^2}{\partial x_2\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial x_n\partial x_1} & \frac{\partial^2}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2}{\partial x_n^2} \end{bmatrix} f$$
 * Kaoru Itou (talk) 22:14, 28 January 2009 (UTC)
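The identity in this thread can be sanity-checked numerically. A sketch of my own (the helper `hessian_fd` is my invention, not from the discussion): for the quadratic f(x) = ½ xᵀAx with symmetric A, the matrix of second partials is exactly A, so a central-difference estimate of it should recover A.

```python
import numpy as np

# Central-difference estimate of the matrix of second partials
# H[i, j] = d^2 f / (dx_i dx_j).
def hessian_fd(f, x, h=1e-5):
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

# For f(x) = 0.5 * x^T A x with symmetric A, the Hessian is A everywhere.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
f = lambda x: 0.5 * x @ A @ x

x0 = np.array([0.3, -0.7])
print(np.allclose(hessian_fd(f, x0), A, atol=1e-4))  # True
```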

Examples
It would be good to have at least one example of the use of Hessians in optimization problems, and perhaps a few words on the subject of applications of Hessians to statistical problems, e.g. maximization of parameters. --Smári McCarthy 16:01, 19 May 2006 (UTC)


 * I agree. BriEnBest (talk) 20:00, 28 October 2010 (UTC)


 * Me too. --Kvng (talk) 15:01, 3 July 2012 (UTC)

HUH??
The Hessian displayed is incorrect; it should be 1/2 of the second derivative matrix. Charlielewis 06:34, 11 December 2006 (UTC)

Bordered Hessian
It is not clear what a bordered Hessian with more than one constraint should look like. If I knew, I would fix it. --Marra 16:07, 19 February 2007 (UTC)


 * See the added "If there are, say, m constraints ...". Arie ten Cate 15:08, 6 May 2007 (UTC)

The definition of the bordered Hessian is extremely confusing. I suggest using the definition from Luenberger's book "Linear and Nonlinear Programming". I am adding it to my todo list and will correct it soon. --Max Allen G (talk) 19:31, 8 April 2010 (UTC)

In the Bordered Hessian should it not be the Hessian of the Lagrange function instead of (what is currently presented) the Hessian of f? -- Ben —Preceding unsigned comment added by 130.238.11.97 (talk) 11:27, 8 June 2010 (UTC)


 * I agree. This seems wrong to me. The article cites Fundamental Methods of Mathematical Economics, by Chiang. Chiang has this as the Hessian of the Lagrange function, as you described. I think it should be changed. — Preceding unsigned comment added by Blossomonte (talk • contribs) 15:07, 10 September 2013 (UTC)


 * Please can someone fix this. As it is, the property of being a maximum or minimum depends only on the gradient of the constraint function, which is clearly not correct.  The curvature of the surface defined by the constraint function must also come into it, through its Hessian (which appears through the Hessian of the Lagrangian).  Thus, what is written here is obviously wrong. — Preceding unsigned comment added by 150.203.215.137 (talk • contribs) 03:08, 6 May 2014‎
 * It seems that you have not correctly read the paragraph beginning with "specifically": all the minors that are considered depend not only on the constraint function, but also on the second derivatives of the function f. Also, please sign your comments on the talk page with four tildes (~). D.Lazard (talk) 14:56, 6 May 2014 (UTC)
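For reference, here is a sketch of the Hessian-of-the-Lagrangian form discussed in this thread (my own rendering of the convention; block placement and signs vary by author, so check against Chiang or Luenberger before editing the article). With $$m$$ constraints $$g(x) = 0$$, $$x \in \mathbb{R}^n$$, and Lagrangian $$L(x,\lambda) = f(x) - \lambda^{\mathsf T} g(x)$$, the bordered Hessian is the Hessian of $$L$$ in $$x$$, bordered by the constraint Jacobian:

$$H_B = \begin{bmatrix} 0_{m\times m} & Dg(x) \\ Dg(x)^{\mathsf T} & \nabla^2_{xx} L(x,\lambda) \end{bmatrix}, \qquad Dg(x) = \left[\frac{\partial g_i}{\partial x_j}\right] \in \mathbb{R}^{m\times n}.$$

Note that the lower-right block involves the second derivatives of the constraints (through $$\lambda$$) as well as those of $$f$$, which is the point raised above.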

The Theorem is Wrong
I had learned that Fxy = Fyx is Young's theorem, not Schwarz's —The preceding unsigned comment was added by Lachliggity (talk • contribs) 03:02, 16 March 2007 (UTC).

What if det H is zero?
It would be nice if someone could include what to do when the determinant of the hessian matrix is zero. I thought you had to check with higher order derivatives, but I'm not too sure. Aphexer 09:52, 1 June 2007 (UTC)
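A hedged illustration of the question above (my own example, not sourced from the article): when det H = 0 the second derivative test is inconclusive and higher-order information decides, since functions with the same degenerate Hessian can behave completely differently.

```python
import numpy as np

# f(x, y) = x**4 + y**4 and f(x, y) = x**4 - y**4 both have the zero matrix
# as Hessian at the origin (f_xx = 12x^2, f_yy = +/-12y^2, f_xy = 0 all vanish
# there), yet one has a strict minimum and the other a saddle at (0, 0).
H = np.zeros((2, 2))  # Hessian of both functions at the origin
print(np.linalg.det(H))  # 0.0 in both cases: the test cannot distinguish them

f_min = lambda x, y: x**4 + y**4
f_saddle = lambda x, y: x**4 - y**4
eps = 1e-2
print(f_min(eps, eps) > 0 and f_min(-eps, eps) > 0)   # True: strict minimum
print(f_saddle(eps, 0) > 0 and f_saddle(0, eps) < 0)  # True: saddle
```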

"Univariate" function?
Please note that "univariate" in the intro refers to a statistical concept, which I believe does not apply here. Even in function (mathematics) there is no mention of "univariate functions", which anyway suggests to me a function of one independent variable, and that is not what we are discussing. I'll be bold and remove it; please fix if you know what was meant. Thanks. 83.67.217.254 05:43, 9 August 2007 (UTC)

"Single-valued" perhaps? But do we really need to specify that? In the intro? I would just leave "function". 83.67.217.254 05:45, 9 August 2007 (UTC)


 * I think I made that change. My thought was that I wanted to differentiate "single valued" from (keeping in this notation's spirit) "multi-valued", or, to quote from the second sentence and the fifth section, "real valued" vs. "vector valued". I did not want the first sentence to leave any ambiguity that in general the Hessian is a matrix, which then has a tensor extension for vector-valued functions.


 * The term "univariate" does show my professional bias, and while I still think it's appropriate, "single-valued" is completely acceptable as well. I still have a concern that not qualifying the first sentence at all allows the tensor to be considered a case of a Hessian matrix, when I think it is better thought of as an extension of the concept, since it's not a matrix per se. However, I will not revert it and will chime in on any discussion and clarification here. Baccyak4H (Yak!) 14:15, 9 August 2007 (UTC)

Vector valued functions
"If f is instead vector-valued, ..., then the array of second partial derivatives is not a matrix, but a tensor of rank 3."

I think this is wrong. Wouldn't the natural extension of the Hessian to a 3-valued (i.e. $$\mathbb{R}^3$$-valued) function just be 3 Hessian matrices?

Is this sentence instead trying to generalize Hessian matrices to higher-order partial derivative tensors of single-valued functions?

68.107.83.19 07:17, 2 October 2007 (UTC)


 * I'm sure someone can point to a reference which will answer your question, but it would seem, by analogy with the Jacobian of a vector-valued function, which is a matrix and not (just) a set of derivative vectors, that a rank-3 tensor makes sense: one could take inner products with such a tensor, say in a higher-order term of a multivariate Taylor series. That operation doesn't make as much sense if all one has is a set of matrices. And it would seem one could always have one extent of the tensor index the elements of the vector of the function, with an arbitrary number of additional extents indexing arbitrary numbers of variables of differentiation. My $0.02. Baccyak4H (Yak!) 13:32, 2 October 2007 (UTC)


 * I can't see what you're getting at.


 * What I mean is that if $$f = (f_1, f_2, ..., f_n)\,\!$$ where f maps to R^n and each f_i maps to R, then isn't


 * $$H(f) = (H(f_1), H(f_2), ..., H(f_n))\,\!$$


 * And the only tensor-valued construction that makes sense is the one holding higher-order partial derivatives of a real-valued function g. E.g. if a rank-3 tensor T holds the 3rd-order partial derivatives of g, then:


 * $$T_{i,j,k} = \frac{\partial^3 g}{\partial x_i\partial x_j\partial x_k}\,\!$$


 * If you disagree with this, can you explicitly state what entry $$T_{ijk}\,\!$$ should be (in terms of f=(f1,f2,...,fn)) if T is supposed to be the "hessian" of a vector-valued function? 68.107.83.19 22:57, 3 October 2007 (UTC)
 * I thought $$H(f) = (H(f_1), H(f_2), ..., H(f_n))$$ was a tensor, with $$T_{ijk}= \frac{\partial^2 f_k}{\partial x_i\partial x_j}$$ Kaoru Itou (talk) 22:16, 4 February 2009 (UTC)
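A small numerical sketch of that stacking (my own code, with a hypothetical helper `vector_hessian`): the rank-3 array T[i, j, k] = ∂²f_k/∂x_i∂x_j is just the component Hessians laid along the last axis, estimated here by central differences.

```python
import numpy as np

# Rank-3 "Hessian" of a vector-valued map f: R^n -> R^m, with
# T[i, j, k] = d^2 f_k / (dx_i dx_j), via central finite differences.
def vector_hessian(f, x, h=1e-4):
    n = x.size
    m = f(x).size
    T = np.empty((n, n, m))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            T[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return T

# Example: f(x, y) = (x^2 y, x y), so d^2 f_0 / (dx dy) = 2x.
f = lambda x: np.array([x[0]**2 * x[1], x[0] * x[1]])
T = vector_hessian(f, np.array([1.5, 2.0]))
print(T.shape)      # (2, 2, 2): n x n x m
print(T[0, 1, 0])   # ~ 2 * 1.5 = 3.0
```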

Riemannian geometry
Can someone write on this topic from the point of view of Riemannian geometry? (there should be links e.g. to covariant derivative). Commentor (talk) 05:15, 3 March 2008 (UTC)

I think there's something wrong with the indices and the tensor product. We define $$Hess(f) := \nabla\nabla f = \nabla df$$. Then, i.e. in local coordinates, $$H_{ij}(f) = Hess(f)dx^{i} \otimes dx^{j}$$ (or, in terms of the nabla, $$H_{ij}(f) = \nabla \nabla f (dx^{i} \otimes dx^{j}) = \nabla df (dx^{i}\otimes dx^{j})$$) and thus $$H_{ij}(f) = (\nabla_{i} \partial_{j}f) = \frac{\partial^{2}f}{\partial x^{i} \partial x^{j}} - \Gamma^{k}_{ij} \frac{\partial f}{\partial x^{k}}$$. So once we have written the Hessian with indices, i.e. $$H_{ij}$$, we do not need to write any more tensor products $$dx^{i}\otimes dx^{j}$$. (Since we need to get a tensor of rank 2!) —Preceding unsigned comment added by 86.32.173.12 (talk) 19:14, 2 April 2011 (UTC)

Local polynomial expansion
As I understand it, the Hessian describes the second-order shape of a smooth function in a given neighborhood. So is this right?:
 * $$y=f(\mathbf{x}+\Delta\mathbf{x})\approx f(\mathbf{x}) + J(\mathbf{x})\Delta \mathbf{x} + \tfrac{1}{2}\Delta\mathbf{x}^\mathrm{T} H(\mathbf{x}) \Delta\mathbf{x}$$

(Noting that the Jacobian matrix is equal to the gradient for scalar-valued functions.) That seems like it should be the vector-domain equivalent of
 * $$y=f(x+\Delta x)\approx f(x) + f'(x) \Delta x + \tfrac{1}{2}f''(x) \Delta x^2$$

If that's right, I'll add that to the article. —Ben FrantzDale (talk) 04:30, 23 November 2008 (UTC)
 * That appears to be right, as discussed (crudely) in Taylor_series. —Ben FrantzDale (talk) 04:37, 23 November 2008 (UTC)
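A quick numerical check of this expansion (my own sketch, using the usual 1/2 on the quadratic term), for a function whose gradient and Hessian are known in closed form; the error should be third order in the step.

```python
import numpy as np

# f(x, y) = sin(x) + x*y^2, with analytic gradient and Hessian.
def f(v):
    x, y = v
    return np.sin(x) + x * y**2

def grad(v):
    x, y = v
    return np.array([np.cos(x) + y**2, 2 * x * y])

def hess(v):
    x, y = v
    return np.array([[-np.sin(x), 2 * y],
                     [2 * y, 2 * x]])

x0 = np.array([0.4, 0.8])
dx = np.array([1e-3, -2e-3])
# Second-order Taylor expansion: f(x + dx) ~ f + g.dx + 0.5 dx^T H dx
approx = f(x0) + grad(x0) @ dx + 0.5 * dx @ hess(x0) @ dx
print(abs(f(x0 + dx) - approx) < 1e-7)  # True: only third-order error remains
```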

Approximation with Jacobian
Some optimization algorithms (e.g., levmar) approximate the Hessian of a cost function (half the sum of squares of a residual) with $$J^\top J$$ where J is the Jacobian matrix of r with respect to x:
 * $$J_{ij}= \frac{\partial r_i}{\partial x_j}$$.

For reference, here's the derivation. I may add it to this page, since the approximation is important and has practical applications (source).

Here's my understanding:

If we have a sum-of-squares cost function:
 * $$f(x) = \frac{1}{2} \|r(x)\|^2 = \frac{1}{2} \sum_k r_k^2$$

then simple differentiation gives:
 * $$\frac{\partial f}{\partial x_i} = \sum_k \frac{\partial r_k}{\partial x_i} r_k = (J^\top r)_i$$.

Then using the product rule inside the summation:
 * $$ \frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_j}\left[\sum_k \frac{\partial r_k}{\partial x_i} r_k\right] = \sum_k \left[ \frac{\partial^2 r_k}{\partial x_i \partial x_j} r_k + \frac{\partial r_k}{\partial x_i}\frac{\partial r_k}{\partial x_j}\right]$$
 * $$= \sum_k \frac{\partial^2 r_k}{\partial x_i \partial x_j} r_k + (J^\top J)_{ij}$$
 * $$= (J^\top J)_{ij} + {}$$ H.O.T. for small $$r_k$$s.

With all that in mind, for small residuals we can approximate the entries of H with
 * $$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} \approx \sum_k \frac{\partial r_k}{\partial x_i}\frac{\partial r_k}{\partial x_j} = (J^\top J)_{ij}$$.

In words: for small residuals, the curvature of f (in any combination of directions) is approximated by the sum of squares of the rates of change of the components of the residuals in those same directions. We ignore the curvature of the residual components weighted by the residuals, since the residuals are small, don't have much room to curve, and are additionally downweighted by their (small) values. —Ben FrantzDale (talk) 13:24, 10 August 2010 (UTC)
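A small numerical check of the $$J^\top J$$ approximation (my own sketch; the toy residual `r` and the finite-difference helpers are my invention): near a point where the residuals are small, the Gauss–Newton matrix should closely match the true Hessian of f = ½‖r‖².

```python
import numpy as np

def r(x):
    # Toy residual with two components in two variables; both vanish at (1, 2).
    return np.array([x[0] - 1.0 + 0.1 * (x[1] - 2.0)**2,
                     x[1] - 2.0 + 0.1 * x[0] * (x[0] - 1.0)])

def jac(x, h=1e-6):
    # Forward-difference Jacobian J[i, j] = dr_i / dx_j.
    r0 = r(x)
    J = np.empty((r0.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size); e[j] = h
        J[:, j] = (r(x + e) - r0) / h
    return J

def hess_f(x, h=1e-4):
    # Central-difference Hessian of f(x) = 0.5 * ||r(x)||^2.
    f = lambda v: 0.5 * np.dot(r(v), r(v))
    H = np.empty((x.size, x.size))
    for i in range(x.size):
        for j in range(x.size):
            ei = np.zeros(x.size); ei[i] = h
            ej = np.zeros(x.size); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

x0 = np.array([1.01, 1.98])  # near the zero-residual point, so r is small
J = jac(x0)
print(np.allclose(J.T @ J, hess_f(x0), atol=1e-2))  # True: J^T J ~ H
```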

Notation
There is a problem with notation in the lead. H(f) and H(x) are used, but the arguments are very different. In the first case, f is a function, and in the second case, x is a vector. One should try to unify the notation or at least clarify the differences. Renato (talk) 22:04, 12 September 2011 (UTC)


 * Good point. I added clarification. —Ben FrantzDale (talk) 11:44, 13 September 2011 (UTC)

Hessian = (transpose of?) Jacobian of the gradient.
Given a smooth scalar function $$f: \mathbb R^n \to \mathbb R$$, the gradient will be a vector-valued function $$\operatorname{grad} f: \mathbb R^n \to \mathbb R^n$$. We can even write $$g := \operatorname{grad} f$$, a column vector whose first entry is the derivative of f with respect to $$x_1$$, and so forth.

Then, the Jacobian of $$g$$, as stated in Jacobian matrix and determinant, is a matrix where the i-th column has the derivatives of all components of $$g$$ with respect to $$x_i$$. This is the transpose of the Hessian, or is it not? Wisapi (talk) 17:04, 20 January 2016 (UTC)
 * The distinction is moot due to the symmetry of the Hessian if Clairaut's theorem holds for all the partial derivatives concerned, as in most elementary cases. If not, then consider the order of partial differentiation implied by the operators. Then one gets somewhat of a contradictory situation. Every source (Wolfram MathWorld among others) calls it simply the Jacobian of the gradient. But the same sources use the notation in the article, and $$\frac{\partial^2 f}{\partial x_i \partial x_j} := \frac{\partial}{\partial x_i} (\frac{\partial f}{\partial x_j})$$ for all i and j. I am in favor of the current statement, however, since the idea is that taking the total derivative twice yields a second derivative, without complications such as this.--Jasper Deng (talk) 17:48, 20 January 2016 (UTC)
 * I also came to the conclusion (much like @Wisapi) that the expression given in the article gives the transpose of the Hessian rather than the Hessian, though I see @Jasper Deng's point with regards to partial derivative notation. Einabla (talk) 06:28, 8 June 2023 (UTC)
 * I am pretty sure the article is inconsistent with the definitions of the Jacobian and Gradient given in their respective articles (it should be the Jacobian of the Gradient, with no transpose). Isn't it better to be consistent within Wikipedia than to be consistent with Wolfram MathWorld? 169.231.53.221 (talk) 01:02, 14 February 2024 (UTC)
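A numeric illustration of the point being debated (my own sketch): for a C² function the Jacobian of the gradient is symmetric by Schwarz/Clairaut, so the transpose distinction is invisible in practice; it only matters when the mixed partials genuinely differ.

```python
import numpy as np

def grad(v):
    # Analytic gradient of f(x, y) = exp(x) * y^3.
    x, y = v
    return np.array([np.exp(x) * y**3, 3 * np.exp(x) * y**2])

def jacobian(g, x, h=1e-6):
    # Forward differences: column j holds d g / d x_j.
    g0 = g(x)
    J = np.empty((g0.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size); e[j] = h
        J[:, j] = (g(x + e) - g0) / h
    return J

# The Jacobian of the gradient is the Hessian; for this C^2 function it is
# symmetric, so H and H^T agree and the convention question is moot.
H = jacobian(grad, np.array([0.2, 1.1]))
print(np.allclose(H, H.T, atol=1e-3))  # True
```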

Coordinate independent definition
I wonder if it would be worthwhile to add a section on the coordinate-independent definition of the Hessian (as a symmetric bilinear form). I know this article is "Hessian matrix," but since the Hessian matrix is just what you get if you pick local coordinates, I feel like such a section might belong. hiabc (talk) 05:48, 11 January 2021 (UTC)


 * I am actually on a bit of a nitpicking spree right now. The article says that the Hessian is symmetric if the second partial derivatives are continuous (since by Schwarz they then commute). But weaker assumptions suffice; for example, that each first partial derivative is differentiable (in the sense that there exists a linear map ...), as given in Symmetry of second derivatives#Sufficiency of twice-differentiability. Why do I mention this to you? The second partial derivatives can be well defined even if the function is not twice, or even once, differentiable (as far as I can tell); in that case the symmetric bilinear form you speak of is not necessarily well defined, but the Hessian as defined in the article is. Conversely, if the function is twice differentiable (not even continuously, just in the sense that there is a quadratic form approximating it), the Hessian is the coordinate representation of that form and is thus symmetric. To save another poor soul from the same ride (I have not seen a calc 2 class in some time and have forgotten some of the intricacies of the definitions), it could be useful to state the more general condition for when the Hessian is symmetric and then relate it, as you said, to the quadratic form when the function is twice differentiable. I hope this makes sense. Tom-Lukas Lübbeke (talk) 14:39, 6 July 2024 (UTC)

Missing section
I find this article to be very clearly written despite some material's being rather technical.

But one thing that seems to be missing (or vastly underemphasized) is the fact that the Hessian matrix defines the quadratic term in the Taylor series of a real-valued function defined on an open subset of some Euclidean space. In other words, the Hessian matrix is the matrix of the quadratic form that gives the best approximation to the function (assuming a horizontal tangent plane) at a given point.

I hope someone familiar with this and also with the appropriate version for Riemannian manifolds can add this topic to the article, because it seems to be extremely important. 2601:200:C082:2EA0:F956:E02:6231:F380 (talk) 05:44, 26 January 2023 (UTC)