Talk:Broyden–Fletcher–Goldfarb–Shanno algorithm

Article State
Large amounts of the article are presented without reference. Although this is probably ok for the general theory from the paper, it is not for parts of the algorithm (eg "usually obtained") and implementation sections. If you have contributed something to these, please add a citation rather than relying on the bibliography. If nothing is done within a month I will start pruning these. byo (talk) 03:19, 24 March 2016 (UTC)

Asterisk operator in initialization algorithm
Perhaps someone with more specific experience in numerical optimization can explain the meaning of the asterisk operator in the algorithm section in the equation $$B_0 = I * x $$. I'm assuming $$x$$ here is $$\mathbf{x}_0$$ (as there's no other indication as to what $$x$$ should be) and $$I$$ is the identity matrix. (DrKBall (talk) 14:25, 12 September 2013 (UTC))

I personally think that $$B_0 = I$$ rather than $$B_0 = I * x $$. It is this value of $$B_0$$ which results in the first step of this algorithm conforming with gradient descent. (Ian Davis)


 * The asterisk is just matrix-vector multiplication, which is usually suppressed unless it would be confusing otherwise. Regardless, it looks like it's been fixed with $$B_0=I$$ (correct) and not $$B_0 = I * x =x $$ (which isn't correct).--Numerical one (talk) 18:40, 26 July 2018 (UTC)

Role of y and s
I can't believe this, but the roles of s and y are completely changed from what they should be (God knows since when). I corrected the formula. Please double check.

As a matter of fact I'm running this program with the formula as I input it and it converges. Maybe most people use the inverse update bec. it's better and that's why no one caught this bug?


 * Right now this formula seems incorrect: $$B_{k+1} = B_k + \frac{\mathbf{s}_k \mathbf{s}_k^{\mathrm{T}}}{\mathbf{y}_k^{\mathrm{T}} \mathbf{s}_k} - \frac{B_k \mathbf{y}_k \mathbf{y}_k^{\mathrm{T}} B_k }{\mathbf{y}_k^{\mathrm{T}} B_k \mathbf{y}_k}.$$
 * It can be easily checked by inspecting units of measure.
 * The correct formula can be found here: http://en.wikipedia.org/wiki/Quasi-Newton_method#Description_of_the_method 109.174.112.67 (talk) 06:11, 2 May 2013 (UTC)


 * The formula above seems to be the right one.
 * I have checked in the reference : 978-3-540-35447-5.pdf (springer.com), on page 55. 194.42.11.46 (talk) 18:43, 24 October 2022 (UTC)

rank of matrix v. rank of tensor
What's the deal with the rank 2 matrix claim? Do we mean rank 2 tensor? It seems to me that if you take U = u v^T, then U is a matrix with rank of one (i.e. all rows/cols are linearly dependent). In other words, I think there is potential confusion between matrix rank and tensor rank. Birge 22:51, 28 March 2006 (UTC)
 * Two different rank 1 matrices are added; their sum is rank 2.

Relation to steepest decent (gradient descent)
Could someone explain what the relationship between BFGS and gradient descent is? I could (very well) be wrong, but isn't it pretty much doing an approximate newton step? My notes say that when the Hessian is approximated by an identity matrix then BFGS will take a gradient descent step. Perhaps there's another relationship that I don't get, but it confused me to read that BFGS is "derived" from gradient descent.129.59.43.172 23:45, 3 May 2006 (UTC)

My answer
Newton's Method is typically described as a means of finding assignment of variables x to a function f(x) so that f(x) = 0. We do not wish to know where f(x) is 0 but where f(x) is a minimum (or maximum). This is the case when $$f'(x) = 0$$. So instead of using : $$x_{1} = x_0 - \frac{f(x_0)}{f'(x_0)} \,.$$ we would use : $$x_{1} = x_0 - \frac{f'(x_0)}{f''(x_0)} \,.$$. The Jacobian J generalizes $$f'(x_0)$$ and the hessian B generalizes $$f''(x_0)$$. Taking the inverse of the hessian generalizes the division.

Now $$x_{1}$$ is geometrically the position on the x axis where the tangent intersects it. Gradient descent, as opposed to Newton's method would I think correspond to computing an $$x_{1}$$ somewhere on the tangent perhaps nearer to the solution depending on the step increment, but not having the property that it was where the tangent interested the x axis.

What I don't get is why one then reintroduces a step length into the calculation of how much to alter the values of $$f'(x_0)$$. Having any value for this length other than one seems to depart from the whole point of Newton's Method. — Preceding unsigned comment added by Ijdavis (talk • contribs) 05:04, 23 January 2014 (UTC)

too technical tag
this tag was first added when the article was just a stub; it was just one paragraph long. the article has considerably grown since then, and the stub tag was long ago removed. if the person who added the tag wants to add it back, please do but leave some suggestions on what details need to be better explained to a general audience. Lunch 03:58, 24 September 2006 (UTC)

under-explained math
I'm totally confused as to what the variables are in the equations. If they are explained in other topics (like Newton's Method, etc.), then links should be made.

I am particularly confused by step 4 of the algorithm. B is a matrix, y and s are vectors, correct? Then the first term on the right is a matrix, but the second is a scalar (i.e., quotient of two vector dot products). The third term is another scalar. So, what am I missing? —Preceding unsigned comment added by Dave.rudolf (talk • contribs) 18:06, 24 October 2008 (UTC)
 * The second and third terms are not scalars. This article seems to be using the convention that vectors are column vectors, so the denominators vTw are scalar products, but the numerators vwT are n-by-n matrices. Algebraist 12:16, 1 March 2009 (UTC)

Too Technical
It wasn't me that added the too technical tag originally, but I think it deserves it at the moment. I agree with the above post (under-explained math). The symbols are not explain - what is B? It is too technical because it is not currently possible for someone not familiar with the technical mathematics to use this algorithm. Other articles suggest that this is a very important algorithm for maximisation, therefore people may need to use it who are not experts in optimisation mathematics (which is how I ended up here - I'm going to have to look else where for this algo). Some pseudo-code for this algo would be a massive help. Thanks 81.98.250.190 (talk) 13:39, 11 January 2009 (UTC)

BFGS or DFP?
according to Nocedal and Wright the updating B version of the algorithm is due to Davidon, Fletcher and Powell, the updating H = B^-1 version is the BFGS algorithm. If that is the case, then step 1 should be compute sk = -dot(Hk, grad) and the updating is as per the 'practically' section at the end. —Preceding unsigned comment added by 192.150.10.200 (talk) 17:35, 6 April 2009 (UTC)
 * Any type of quasi-Newton algorithm can update either $$\mathbf{B}$$ or $$\mathbf{B}^{-1}$$. See the page named "Quasi-Newton method."

We should name DFP in the article, section Algorithm. Why? Right now the section shows the 5 steps of the DFP algorithm and they are followed with the changes that make it BFGS. But at first glance it looks like just a discussion about the 5 steps. This leads people to think, the 5 steps are BFGS. E.g. I found someone mistakingly using DFP instead of BFGS here: https://github.com/Sohl-Dickstein/Sum-of-Functions-Optimizer (explained in section 2.4.2 of their paper). Our article's layout should not suggest wrong things. Shouldn't we name the first 5 steps DFP explicitly? Especially as reading DFP first helps understanding BFGS? 2.247.248.176 (talk) 14:05, 15 June 2018 (UTC) Made a mistake, sorry. 2.247.248.176 (talk) 14:26, 15 June 2018 (UTC)

Sherman-Morrison or Woodbury?
According the the article, the update to the inverse of the hessian is via the Sherman–Morrison formula. However, this formula regards a rank one update. Because this is a rank two update, isn't this actually a special case of the Woodbury matrix identity? PDBailey (talk) 00:09, 12 April 2009 (UTC)

Just curious, if anyone successfully derive the update formula using Sherman–Morrison formula? I tried couple hours and was not successful. Hopefully someone can write up the detail. — Preceding unsigned comment added by 99.190.5.153 (talk) 22:44, 17 February 2020 (UTC)

Slight error?
I believe that $$\mathbf{\alpha}_k \mathbf{s}_k$$ should be used instead of $$\mathbf{s}_k$$ in the $$\mathbf{B}_{k+1}$$ and $$\mathbf{B}_{k+1}^{-1}$$ updates. Is this correct? It seems that the update should take into account how far the step size actually was. See also the page "Quasi-Newton method" which shows the updates to $$\mathbf{B}$$ and $$\mathbf{B^{-1}}$$ for several different quasi-Newton methods. Its formulas for BFGS are equivalent to those shown here, except that it uses actual step sizes instead of predicted step sizes. Having analyzed the formulas, this difference is not negligible.

Slight Error
Yes there was a slight error. The correct result is produced in one of the literature sources referenced, namely Numerical Optimization by Nocedal and Wright, section 8.1. The step size used in the update for $$ B_k $$ was incorrect. I corrected it by introducing $$ \mathbf{p}_k=\alpha_k \mathbf{s}_k $$ and used it to replace $$ \mathbf{s}_k $$ in the expression. I suspect that the expression for $$ B_k^{-1} $$ is also incorrect but I could not find an explicit reference to it in the literature, though it should be possible to derive it from the Sherman-Morrison-Woodbury formula from Appendix A.2 and section 8.1 of the same literature source if someone has some spare time =P. This is actually a pretty serious error since it significantly affects the accuracy of the estimate for the Hessian which can seriously mess up your day. —Preceding unsigned comment added by 129.100.226.147 (talk) 18:48, 9 September 2009 (UTC)

Slight Error
Yes the inverse formula was incorrect. However the notation $$ \mathbf{p}_k $$ and  $$ \mathbf{s}_k $$ is the opposite of the standard notation. I switched them so $$ \mathbf{s}_k=\alpha_k \mathbf{p}_k $$, which is standard in the literature. The inverse formula is now correct. —Preceding unsigned comment added by Drybid (talk • contribs) 21:46, 4 October 2009 (UTC)

Explanation needed
I think that in the formula $$\mathbf{B}_k \mathbf{p}_k = - \nabla \mathbf{f}(x_k)$$ is a mistake. $$\mathbf{B}_k$$ is a matrix, $$\mathbf{p}_k$$ is a vector, and $$\nabla \mathbf{f}(x_k)$$ is a matrix. Then on the left-hand side there is a vector (a product of a matrix and a vector) but on the right-hand side there is still a matrix. Is there a mistake or I just miss something? Alex.


 * f(x) is scalar valued, so its gradient is vector valued. 018 (talk) 15:26, 2 June 2010 (UTC)

But what if I have $$\mathbf{f}(x_k)$$ vector? What is the formula then? —Preceding unsigned comment added by 130.161.210.163 (talk) 09:52, 8 June 2010 (UTC)
 * Broyden's method. In line one, it states that BFGS solves argmin{f(x)} where f(x) is an optimization problem. Perhaps this implies that f(x) is a vector to scalar mapping, but this is not explicit. In the second sentence it says that BFGS approximates Newton's method, which is only unambiguously true in the univariate case. For the multivariate case, it can be used to solve scalar f(x), not vector f(x) functions and there are two classes of 'Quasi-Newton methods'. The article also compares BFGS with Broyden's method in "So equivalently, Uk and Vk construct a rank-two update matrix which is robust against the scale problem often suffered in the gradient descent searching (e.g., in Broyden's method)." To me this implies that BFGS and Broyden's method are in some sense comparable, while in fact they solve completely different classes of problem (unless I am still confused, which is entirely possible). I'm sure many students and engineers would greatly appreciate a page which explains what methods are appropriate for solving scalar-to-scalar, vector-to-scalar, vector-to-vector, bounded, unbounded, derivative-available and derivative-free, minimization and root-finding problems noting which focus on accelerating convergence, which focus on robustness and which are or are not subject to numerical precision issues (where unique solutions exist). Finding Brent's method, BFGS (where I take a numerical derivative, which may be sub-optimal) and Broyden's method (where I still have stability problems) within the pages of Wikipedia has been difficult and we'd greatly appreciate help with this. Doug (talk) 17:07, 24 August 2013 (UTC)
 * Correction, Broyden's method is for root finding, I'm not sure how you are defining the minimum of a vector function, but the length would give you a scalar function. Doug (talk) 19:44, 24 August 2013 (UTC)

Answer
$$\nabla \mathbf{f}(x_k)$$ is not a matrix. It is a vector containing the gradient at $$x_k$$. The hessian discussed later which contains second derivatives is a matrix because one can differentiate twice using different partial derivatives at each step.

Ian — Preceding unsigned comment added by Ijdavis (talk • contribs) 05:15, 22 January 2014 (UTC)

Differing formulae for incrementally approximating the inverse hessian..
This page was recently changed to give a different formula at the end for how the inverse hessian was to be computed. While the old and the new formula's appear to produce the same results (having implemented both), I think there is merit in showing both formulae, and discussing their differences.

The earlier formula can be computed using far fewer FLOPS, because it can be implemented without multiplying k by k matrices together, instead at worst computing vectors with k elements which are then multiplied with a k by k matrix. Further some parts of the computation reduce to scalars. It can also be computed using much less memory. In my implementation of the new formula I required two additional temporary k by k matrices, one to form the multiplying matrix and the other to handle the fact that the B^(-1) matrix cannot be easily updated in place. The old formula could easily be implemented using only temporary vectors, with B^(-1) simply being updated incrementally by addition of outer product vectors. A small benchmark suggests that while the old formula can be computed multiple times in about 5 seconds, the same number of computations using the new formulae takes about 2 minutes. That is a pretty stark difference.

The advantage of the new formula is that it is conceptually simpler, and highlights the fact that we are computing transpose(M) B^(-1) M + S. If floating point operations are more stable under multiplication than addition, it may offer greater stability, but given that far more FLOPS are being performed this seems somewhat suspect, since there will be more rounding errors introduced.

I would be interested in other peoples take on the merits of the old and new formulae.

Ian Davis (textserver.com@gmail.com) — Preceding unsigned comment added by Ijdavis (talk • contribs) 05:07, 22 January 2014 (UTC)

A further comment
Arguably all of the information about how one incrementally computes a hessian and its inverse under perturbations would be better placed under the page that describes hessians. It is not relevant only to the BFGS algorithm. — Preceding unsigned comment added by Ijdavis (talk • contribs) 05:09, 23 January 2014 (UTC)

Why do the formula's work
Nothing is really said to justify why the formulae given for computing the approximation to the hessian and its inverse actually do approximate the hessian. That part of this page seems a bit too close to black magic. Can someone explain what is going on here? — Preceding unsigned comment added by Ijdavis (talk • contribs) 05:14, 23 January 2014 (UTC)

External links modified
Hello fellow Wikipedians,

I have just modified 1 one external link on Broyden–Fletcher–Goldfarb–Shanno algorithm. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
 * Added archive https://web.archive.org/web/20101028140551/http://www.mathworks.com:80/help/toolbox/optim/ug/brnoxr7-1.html to http://www.mathworks.com/help/toolbox/optim/ug/brnoxr7-1.html#brnpcye

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at ).

Cheers.— InternetArchiveBot  (Report bug) 14:24, 9 November 2016 (UTC)

Another proof of B_k
Another proof of B_k is to minimize the function: $$min_\mathbf{B}||\mathbf{B-B_k}||$$ subject to: $$\mathbf{B=B^\top, B s_k=y_k}$$ Here the matrix norm is $$\mathbf{||A||_W \equiv ||W^{1/2}AW^{1/2}||_F}$$, where $$\mathbf{G}$$ is the average Hessian.

Let $$\mathbf{E}=\mathbf{B-B_k}; \gamma=y_k; \delta = s_k; \xi=\gamma-\mathbf{B}\delta$$, the Lagrangian function is:


 * $$\mathcal{L} = \frac{1}{4} trace(\mathbf{W E^\top WE})+trace(\mathbf{\Lambda^\top(E^\top-E})) - \lambda^\top\mathbf{W(E\delta-\xi)}$$

Such proof is also good to comprehend.--Dzhouaa (talk) 09:11, 25 January 2017 (UTC)