Talk:Levenberg–Marquardt algorithm

L-M is for nonlinear least squares problems only
The article needs to make clear that the L-M algorithm is for solving nonlinear least squares problems, not the more general class of nonlinear optimization problems. In fact, the introduction used to give the impression that the L-M algorithm is used to solve general nonlinear optimization problems, which is not true.

Functions vs. Values
The article talks about "the functions $$f(x_i,\boldsymbol \beta+\boldsymbol \delta)$$". This confused the heck out of me, until I realized that what is meant is the values of the function $$f(x,\boldsymbol \beta+\boldsymbol \delta)$$ at $$x_i$$. Similarly confused terminology is used at other points in the article, for example when calling $$J_i$$ the gradient. These kinds of terminology issues make the article hard to read. — Preceding unsigned comment added by 80.65.109.181 (talk) 12:09, 13 September 2013 (UTC)

Inconsistency
The article says that Levenberg's contribution was introducing lambda, but the references say that he was the first to publish the algorithm, in 1944 or so. This seems a bit contradictory to me. Ravn-hawk (talk) 15:38, 28 September 2011 (UTC)

Possible replacement article?
Could I please recommend that people interested in this article look at Sam Roweis' document on Levenberg Marquardt at . It seems to be a very complete description. If it were simply pasted in (pdf2html?) it would answer the questions below and add detail. This would have to be checked with the author, of course, which I will be happy to do. I am new to this, so my etiquette is probably short of the mark, but I hope this is useful.
 * I suggest that if you think there are valuable insights in Roweis' article, you might use the information there to improve the existing Wikipedia article. The questions below are not fundamentally important, but the article can be improved. Certainly no case for wholesale replacement. Billlion 09:49, 5 September 2006 (UTC)

correction
Should
 * $$(J^T J + \lambda) q = -J^T f$$

be
 * $$(J^T J + \lambda I) q = -J^T f$$?

—The preceding unsigned comment was added by 129.13.70.88 (talk • contribs) on 08:18, 29 August 2005.


 * Both formulas have the same meaning, namely $$J^T J q + \lambda q = -J^T f$$. As far as I am concerned, it is a matter of taste which you prefer, but the second is perhaps clearer. -- Jitse Niesen (talk) 20:46, 30 August 2005 (UTC)

All formulas are missing a negative sign. For example:
 * $$\mathbf{(J^{T}J)\boldsymbol \delta = J^{T} [y - f(\boldsymbol \beta)]} \!$$

should be:
 * $$\mathbf{(J^{T}J)\boldsymbol \delta = -J^{T} [y - f(\boldsymbol \beta)]} \!$$

Finding this bug took quite some time. The article should also explain how this formula is derived. --vogt31337 9:00 26 January 2012 (UTC) — Preceding unsigned comment added by Vogt31337 (talk • contribs)

@ Vogt31337: I think there is no sign missing, since J is the Jacobian of f, and f enters S with a negative sign. The goal is to minimize S, so we want to move in the direction of the negative gradient of S, which turns out to involve the negative Jacobian of −f, and thus −(−J) = J. -- Georg

@ Vogt31337: I agree with you. -- D — Preceding unsigned comment added by 192.55.54.42 (talk) 23:13, 28 June 2019 (UTC)
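
One way to check the sign numerically is a small sketch like the one below. It assumes a hypothetical model a·exp(b·x) with synthetic, noise-free data (neither comes from the article), takes J to be the Jacobian of f with respect to β, and solves the normal equations with the sign as currently written in the article. Since the gradient of S is $$-2\mathbf{J}^{T}[y - f(\boldsymbol\beta)]$$, the resulting δ should be a descent direction for S if the sign is right.

```python
import numpy as np

def model(x, beta):
    # Hypothetical model, chosen only for illustration: a * exp(b * x)
    a, b = beta
    return a * np.exp(b * x)

def jacobian(x, beta):
    # Jacobian of the model values with respect to beta (one row per x_i)
    a, b = beta
    e = np.exp(b * x)
    return np.column_stack([e, a * x * e])

def sum_sq(x, y, beta):
    # S(beta) = sum of squared residuals
    r = y - model(x, beta)
    return float(r @ r)

x = np.linspace(0.0, 1.0, 20)
y = model(x, np.array([2.0, -1.0]))   # synthetic data, no noise
beta = np.array([1.5, -0.5])          # arbitrary starting guess

J = jacobian(x, beta)
r = y - model(x, beta)
delta = np.linalg.solve(J.T @ J, J.T @ r)   # sign as written in the article
grad = -2.0 * (J.T @ r)                     # gradient of S at beta

t = 1e-3                                    # small step, so the linearization holds
S0 = sum_sq(x, y, beta)
S_plus = sum_sq(x, y, beta + t * delta)     # step along +delta
S_minus = sum_sq(x, y, beta - t * delta)    # step along -delta
print(S0, S_plus, S_minus)
```

In this sketch, moving along +δ lowers S and moving along −δ raises it, which is consistent with Georg's reading that no sign is missing as long as J is the Jacobian of f rather than of the residual y − f.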

Transpose
Where f and p are introduced:


 * $$f^T = (f_1, \ldots, f_m)$$, and $$p^T = (p_1, \ldots, p_n)$$.

they are presented with superscript T, which I think indicates that they are transposed. But f extends over 1..m, and p extends over 1..n, so I don't understand why both are transposed. Can someone please explain what I am missing here? Thanks. --CarlManaster 22:49, 15 December 2005 (UTC)


 * I think the reasoning is that $$(f_1, \ldots, f_m)$$ is a row vector. However, we want f to be a column vector. So the transpose is used to convert from rows to columns. It is perhaps clearer to write
 * $$f = (f_1, \ldots, f_m)^T$$, and $$p = (p_1, \ldots, p_n)^T$$.
 * To be honest, many mathematical texts do not bother to distinguish between row and column vectors, hoping that the reader can deduce from the context which one is meant.
 * Perhaps you know this, and your question is: why do we want to use a column vector? Well, f has to be a column vector because $$f^T f$$ further down the article should be an inner product, and p has to be a column vector because it's added to q, which is a column vector because otherwise the product $$Jq$$ does not make sense. -- Jitse Niesen (talk) 23:19, 15 December 2005 (UTC)
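
The shape argument above can be illustrated with a short NumPy sketch. The sizes m = 4 and n = 2 and the array contents are arbitrary assumptions made only for this illustration; the point is just which products are defined and what shapes they have when f and q are stored as columns.

```python
import numpy as np

m, n = 4, 2                                  # arbitrary sizes for illustration
f = np.arange(1.0, m + 1).reshape(m, 1)      # column vector, shape (m, 1)
q = np.ones((n, 1))                          # column vector, shape (n, 1)
J = np.ones((m, n))                          # Jacobian-shaped matrix, shape (m, n)

inner = f.T @ f    # (1, m) @ (m, 1): a 1x1 result, the inner product
outer = f @ f.T    # (m, 1) @ (1, m): an m x m matrix, not a scalar
Jq = J @ q         # (m, n) @ (n, 1): defined only because q is a column

print(inner.shape, outer.shape, Jq.shape)
```

With f as a row vector instead, the product written as f followed by its transpose would be the inner product and the article's notation would have to be flipped throughout, which is presumably why the column convention is used.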

Improvement by Marquardt
What about the improvement Marquardt made to the algorithm? He replaced I by diag[H], i.e. the diagonal of the (approximated) Hessian, to incorporate some local curvature estimation. This makes the algorithm go further in directions of smaller gradient to get out of narrow valleys on the error surface.

last entry by G.k. 13:46, 27 April 2006 (UTC)

I believe this is correct. It also has the important practical benefit of making λ dimensionless, so λ=1 is (I think) a sensible general starting point.


 * A way to address this might be to add a section called (say) "Improved solution" after the "Choice of damping parameter" section, with the new formula. The original "solution" section and this new one should probably be better referenced as well.  Thoughts?  I may give it a shot soon.  Baccyak4H (Yak!) 18:29, 25 July 2007 (UTC)

I believe Marquardt's suggested improvement was actually to scale the approximate Hessian matrix, $$J^{T}J$$. The scaling had an effect similar to replacing the identity matrix with the diagonal of the Hessian approximation. Marquardt suggests scaling the Hessian approximation by an amount that makes the diagonal elements ones. Adding a constant diagonal matrix to the scaled matrix is similar to adding a proportion of the diagonal elements to the unscaled matrix. The scaling applied to the other elements of the Hessian approximation improves the condition of the matrix. I suggest the following:

$$q = \Sigma_J[\hat{J}^T\hat{J} + \lambda{}I]^{-1}\hat{J}^T [y - f(p)]$$

where $$\Sigma_J$$ is the square diagonal matrix:

$$\Sigma_J = [\mbox{diag}[J^TJ]]^{-\frac{1}{2}}$$ and the scaled Jacobian, $$\hat{J}$$, is:

$$\hat{J} = J\Sigma_J$$

The square matrix, $$\hat{J}^T\hat{J}$$, is then a scaled version of the squared Jacobian, $$J^{T}J$$, in which each column of $$J$$ has been divided by its Euclidean norm (the square root of the corresponding diagonal element of $$J^{T}J$$). The result of the scaling is ones on all the diagonal elements of $$\hat{J}^T\hat{J}$$.

Note that more recent implementations that use the Levenberg-Marquardt tag do not include Marquardt's suggestion. It seems that clever choices of $$\lambda$$ result in reasonable robustness and fewer function evaluations. Stephen.scott.webb (talk) 22:29, 4 February 2008 (UTC)
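
The relationship between the scaled formula above and the diag-replacement form can be checked with a minimal sketch. It assumes a random, badly scaled Jacobian and a random residual vector standing in for y − f(p) (both invented for this illustration), and compares the step from the scaled formula with the step obtained by solving $$(J^TJ + \lambda\,\mbox{diag}[J^TJ])q = J^T[y - f(p)]$$. Expanding $$\hat{J}^T\hat{J} + \lambda I = \Sigma_J (J^TJ + \lambda\,\mbox{diag}[J^TJ]) \Sigma_J$$ suggests the two should agree up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical Jacobian with deliberately unequal column scales
J = rng.normal(size=(10, 3)) * np.array([1.0, 10.0, 0.1])
r = rng.normal(size=10)          # stands in for y - f(p)
lam = 0.5                        # arbitrary damping value

d = np.diag(J.T @ J)             # diagonal of the approximate Hessian
Sigma = np.diag(d ** -0.5)       # Sigma_J = diag(J^T J)^(-1/2)
J_hat = J @ Sigma                # scaled Jacobian

# Step from the scaled formula in the comment above
q_scaled = Sigma @ np.linalg.solve(J_hat.T @ J_hat + lam * np.eye(3), J_hat.T @ r)

# Step from replacing the identity with diag(J^T J) in the unscaled system
q_diag = np.linalg.solve(J.T @ J + lam * np.diag(d), J.T @ r)

unit_diag = np.diag(J_hat.T @ J_hat)   # diagonal of the scaled matrix
print(unit_diag, q_scaled, q_diag)
```

Under these assumptions the two steps match, so the difference between the formulations looks like one of numerical conditioning rather than of the step itself.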

Reference to nonlinear programming
Reference to nonlinear programming seems rather less appropriate here than reference to nonlinear regression. However, the article on the latter is inadequate (but developing). Dfarrar 09:08, 10 March 2007 (UTC)

explain terminology better
A short explanation (or a link to a page with an explanation) of the notation f(t_i|p) would be in order here. I don't think it is common usage outside of statistical groups. —Preceding unsigned comment added by 130.225.102.1 (talk) 15:26, 25 September 2007 (UTC)

A major proposal
Please see talk:least squares for details. Petergans (talk) 17:39, 22 January 2008 (UTC)

Choice of damping parameter
- should discuss the approach of Moré, the standard used in most production-quality numerical libraries. It tends to be both faster and more robust than earlier more naive methods. Jheald (talk) 22:26, 30 January 2008 (UTC)

Explanation of the poor performance of L-M in example
The data sequence has very high frequency components. The sparse sampling plan shown in the plots will necessarily contain huge aliasing error that is beyond any inverse algorithm. The way to correct this apparent poor performance of L-M is to sample the data train much more densely, e.g. with a step size of 0.001; L-M will then find the correct answer easily under a very wide range of initial conditions. It is really not L-M that is at fault here. --Xzhou.lumi (talk) 22:32, 2 July 2008 (UTC)
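
The aliasing point can be illustrated independently of the article's example with a minimal sketch. The sampling rate and the two frequencies below are assumptions chosen for illustration and are not the values used in the article. Two cosines whose frequencies differ by exactly the sampling rate give identical samples on the sparse grid, so no fitting algorithm could tell them apart from those samples alone, while a dense grid with step 0.001 separates them.

```python
import numpy as np

fs = 1.0                                 # hypothetical sampling rate (samples per unit x)
x_sparse = np.arange(0, 20) / fs         # sparse sample locations
f_low, f_high = 0.1, 0.1 + fs            # frequencies that alias onto each other

y_low = np.cos(2 * np.pi * f_low * x_sparse)
y_high = np.cos(2 * np.pi * f_high * x_sparse)
indistinguishable = np.allclose(y_low, y_high)   # same samples on the sparse grid

x_dense = np.arange(0, 20, 0.001)        # dense grid, step size 0.001
dense_differs = not np.allclose(
    np.cos(2 * np.pi * f_low * x_dense),
    np.cos(2 * np.pi * f_high * x_dense),
)
print(indistinguishable, dense_differs)
```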

Room for clarification
I think the article could do with some clarification; I know I could. Most optimization algorithms I have looked at focus on searching in $$\mathbb{R}^n$$ for the minimum of some scalar-valued function of $$\mathbb{R}^n$$. In this article, that function is clearly $$S(\boldsymbol{\beta})$$, but the article doesn't emphasize that as much as it emphasizes $$f(x_i,\boldsymbol{\beta})$$.

As I try to fully understand L–M, I keep seeing it as a trust region algorithm in which we do a second-order Taylor expansion of $$S(\boldsymbol{\beta}+\boldsymbol{\delta})$$ in the neighborhood of $$\boldsymbol{\delta}=\mathbf{0}$$. In that case we would have
 * $$S(\boldsymbol{\beta}+\boldsymbol{\delta}) \approx S(\boldsymbol{\beta}) + \operatorname{grad}(S) \cdot \boldsymbol{\delta} + \tfrac{1}{2}\boldsymbol{\delta}^\mathrm{T} \operatorname{Hessian}(S) \boldsymbol{\delta}$$

It looks like L–M uses $$\mathbf{J}^\mathrm{T}\mathbf{J}$$ for the Hessian. Is this the exact Hessian or just an approximation? (It looks like an approximation, but I can't picture the difference between the true Hessian and the approximation.)

Now, the gradient of S with respect to β is given in this article as
 * $$\operatorname{grad}(S) = (-2)(\mathbf{J}^{T} [y - f(\boldsymbol \beta) ] )^{T}$$.

With some hand waving, I can see this giving us the equation for the minimum of S of
 * $$\mathbf{(J^{T}J)\boldsymbol \delta = J^{T} [y - f(\boldsymbol \beta)]} \!$$

Then it looks like the damping parameter basically defines the radius of the trust region, and that it isn't a hard-edged trust region; rather, the $$\lambda I$$ term modifies the cost function to keep the local minimum from being too far away (basically adding a paraboloid, scaled by λ, to S). Is that about right?

Finally, I don't quite see the intuition for using $$\operatorname{diag}(\mathbf{D}^\mathrm{T}\mathbf{D})$$ instead of the identity.

Any answers/thoughts/corrections would be appreciated. Thanks. —Ben FrantzDale (talk) 22:04, 7 December 2008 (UTC)
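
On the Hessian question: $$\mathbf{J}^\mathrm{T}\mathbf{J}$$ is an approximation. Differentiating S twice gives $$2\mathbf{J}^\mathrm{T}\mathbf{J}$$ minus twice the sum of each residual times the Hessian of the corresponding $$f(x_i,\boldsymbol{\beta})$$, and that second term is dropped. The soft trust-region reading of λ can be explored with a small sketch like the one below. It assumes a random Jacobian and residual vector (invented for this illustration) and solves the damped system for increasing λ. Because the damped step minimizes the linearized sum of squares plus $$\lambda\|\boldsymbol{\delta}\|^2$$, its length should shrink as λ grows, and for very large λ it should approach $$\mathbf{J}^\mathrm{T}[y - f(\boldsymbol\beta)]/\lambda$$, a short gradient-descent step.

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.normal(size=(8, 3))      # hypothetical Jacobian
r = rng.normal(size=8)           # stands in for y - f(beta)
g = J.T @ r                      # proportional to the negative gradient of S

def lm_step(lam):
    # Damped step: solve (J^T J + lam * I) delta = J^T r
    return np.linalg.solve(J.T @ J + lam * np.eye(3), g)

lams = [0.0, 0.1, 1.0, 10.0, 1e6]
norms = [float(np.linalg.norm(lm_step(lam))) for lam in lams]

big = lams[-1]
# For large lam, lam * delta should be close to J^T r (steepest-descent direction)
rel_err = np.linalg.norm(big * lm_step(big) - g) / np.linalg.norm(g)
print(norms, rel_err)
```

In this sketch a larger λ always gives a shorter step, which matches the picture of λ setting an implicit radius rather than a hard boundary.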

-- In between the section Solution there is the sentence:
 * >> Note that the gradient of S with respect to δ equals $$-2(\mathbf{J}^{T} [\mathbf{y} - \mathbf{f}(\boldsymbol \beta) ] )^T $$ <<
 * Is it not supposed to be: ... gradient of S with respect to β? -Georg — Preceding unsigned comment added by 31.18.248.89 (talk) 20:45, 27 May 2014 (UTC)

Notes and references
There was a slight problem with the use of Notes and References. I have fixed it. --Михал Орела (talk) 20:50, 3 February 2009 (UTC)

Comparison of Efficacy and Performance with other similar Algorithms
Can anyone provide a section or information comparing the LMA with other algorithms such as the Conjugate-Gradient method? MxBuck (talk) 19:18, 23 September 2009 (UTC)

Jacobian J
I think there's a little mistake in the definition of J:

Shouldn't it be: J = dS(β)/dβ?

This would also correspond to this (from the external links): http://www.siam.org/books/textbooks/fr18_book.pdf

We're trying to minimize the error, so taking the Jacobian of the original function wouldn't make sense. — Preceding unsigned comment added by 129.132.154.39 (talk) 10:42, 20 October 2011 (UTC)

Covariance of the solution
There should be discussion of the covariance of the solution. I think it's that if J is left-multiplied by the inverse of the input covariance, then $$J^\top \Sigma^{-1} J$$ is the covariance of the output, or somesuch. —Ben FrantzDale (talk) 17:20, 26 October 2011 (UTC)
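
For the linear case, the standard generalized least squares result is that the parameter covariance is the inverse, $$(J^\top \Sigma^{-1} J)^{-1}$$, rather than $$J^\top \Sigma^{-1} J$$ itself. For a nonlinear fit the same expression, evaluated at the solution, is usually taken as a first-order approximation. The sketch below assumes a hypothetical straight-line design matrix in the role of J and made-up measurement standard deviations. It checks the formula by propagating the data covariance through the linear estimator.

```python
import numpy as np

# Hypothetical straight-line model: columns are [1, x]; plays the role of J
X = np.column_stack([np.ones(6), np.arange(6.0)])
sigma = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1])   # assumed measurement std devs
Sigma = np.diag(sigma ** 2)                         # input (data) covariance
Sigma_inv = np.diag(1.0 / sigma ** 2)

# Claimed parameter covariance: (J^T Sigma^-1 J)^-1
cov_beta = np.linalg.inv(X.T @ Sigma_inv @ X)

# Independent check: beta_hat = A y, so cov(beta_hat) = A Sigma A^T
A = cov_beta @ X.T @ Sigma_inv
cov_propagated = A @ Sigma @ A.T

print(cov_beta)
print(cov_propagated)
```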

External links modified
Hello fellow Wikipedians,

I have just modified 2 external links on Levenberg–Marquardt algorithm. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
 * Added archive https://web.archive.org/web/20110723020836/http://trac.astrometry.net/wiki/PyLevmar to http://trac.astrometry.net/wiki/PyLevmar
 * Added archive https://web.archive.org/web/20130722142233/http://www2.imm.dtu.dk/~hbni/Software/SMarquardt.m to http://www2.imm.dtu.dk/~hbni/Software/SMarquardt.m

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

Cheers.— InternetArchiveBot  (Report bug) 02:18, 22 December 2017 (UTC)

Broken images
Hi, just a note to say that the image links on this page appear to be broken. Thanks. — Preceding unsigned comment added by 2A01:110:8012:1012:0:0:0:82 (talk) 11:59, 17 January 2018 (UTC)