Talk:Gradient descent

Search in trial direction
I was under the impression that gradient descent did a line search in the trial direction (direction of steepest gradient). Can anyone else confirm this? njh 04:45, 25 November 2005 (UTC)
 * I guess you are saying that given a direction, one should find the optimal step in that direction. That is I guess also gradient descent, but from what I know, in practice it is very hard to do that. So one just takes some reasonable step in one direction, and then from there takes a new direction, as this article illustrates. Did I understand your question right? Oleg Alexandrov (talk) 05:59, 25 November 2005 (UTC)
 * Usually one finds a trial direction, then uses a one dimensional minimisation algorithm like brent's method along the vector thus defined. This is generally a better idea than making arbitrary small steps as you only need to compute the gradient at the minimum in each direction (of course it still suffers from zigzaging).  Conjugate gradient method extends this to ensure that zigzaging doesn't happen. njh 08:37, 25 November 2005 (UTC)
 * I myself used gradient descent only for infinite-dimensional optimization, when the variable of optimization is not a vector, but a function. Then it is really a pain to find the optimal step. :) And you don't need to use an arbitrarily small step size as you say, you can make it adaptive, so you can try to jump a bit more at each step than at the previous one, and scale back if you jump too much. Of course, in this way you are not guaranteed to find the true minimum, but those infinite-dimensional problems often times have an infinite number of local minima to start with, so it is not as if you lost something. Oleg Alexandrov (talk) 15:54, 25 November 2005 (UTC)

I updated the page with a new subsection on different ways to choose the trial direction and step size. Since this is an important problem in modern machine learning, I thought it would be worth it having its own section. This was my first edit on Wikipedia, so would definitely appreciate any feedback, or let me know if I did something wrong. P.S. I am a PhD student who has been researching this topic. (Jeremy Bernstein (talk) 00:50, 10 October 2020 (UTC)).

Someone needs to correct the definition
Doesn't sound logical to me: "algorythm that approaches a local minimumof a function" mean that the algorythm itself is approaching something. What?

I am interested in what I can do (or what I can find) with this algorythm.

Inyuki 21:46, 2 November 2006 (UTC)


 * You are right. That poor wording has been there from the very beginning. I rewrote the article at some point but failed to notice that.


 * This algorithm is used to find local minima of functions. Oleg Alexandrov (talk) 04:40, 3 November 2006 (UTC)

Understanding the algorythm
Excusme, maybe someone could tell me, if I got it right, I tried to write it in procedural code here: http://howto.wikia.com/wiki/Gradient_descent

Thank you.

Inyuki 23:33, 7 November 2006 (UTC)


 * Looks okay to me, for the very basic algorithm. One problem is that you're unlikely to land on a point where the gradient is exactly zero. Terminate the loop when the gradient is "small enough" (choosing what this means is the hard part). -- Jitse Niesen (talk) 01:10, 8 November 2006 (UTC)
 * In a many applications, the minimum derivative is a user-input variable. --Cowbert 05:04, 29 December 2006 (UTC)

I'm not sure why we have previous_step_size = 1/precision? Given that precision is set to be a small double (0.00001) we need only set previous_step_size to be greater than 0.0001 - why not just make it 1.0? For extremely high precision there is a problem of arithmetic overflow that can occur when computing 1/precision. BlackMetalStats (talk) 23:34, 14 May 2018 (UTC)

More elegant formulation?
I've usually seen gradient descent formulated more elegantly as a partial differential equation x'(t)=-\nabla f(x(t)). Once discretized, it takes the form that is given in the article. I think it would be nice to add this, does anyone object? —Preceding unsigned comment added by Winblows (talk • contribs) 22:09, 6 August 2008 (UTC)
 * This "gradient flow" interpretation is neat indeed, but harder to grasp than the "discrete" formulation used in the article, and it is the discrete version of gradient descent which is most used in applications. Some materials on this proposed continuous formulation would be OK, as long as it is further down the article. Oleg Alexandrov (talk) 02:27, 7 August 2008 (UTC)

Non-linear systems example introduces undefined symbols
As far as I can tell, the symbols z, z_0, and t are used without being introduced or defined earlier in the page. This makes the description very difficult to follow. —Preceding unsigned comment added by 216.244.31.162 (talk) 11:12, 25 February 2010 (UTC)

The description of using gradient descent as a way to solve a non-linear equation is impossible to follow because (as the previous comment suggested), basically none of the symbols are defined before they are used. What are "f" and "z" and everything else in that argument? —Preceding unsigned comment added by 69.196.185.126 (talk) 18:21, 13 May 2010 (UTC)

Please correct this article or kill it. June 13, 2010
The "talker" right above me is right. The author(s) skip(s) around a bit changing style and notation. It's a nice attempt but has gone awry. Try Google for something better. <! W. Watson !>


 * Okay, I hope that helps. 018 (talk) 01:03, 15 June 2010 (UTC)

It also uses the jacobian matrix of the scalar function F, which is nonsense. Lostella (talk) 09:50, 12 October 2010 (UTC)
 * Now this is fixed, I believe. We take the transpose of the Jacobian of G with respect to X, and multiply it by the G vector. Which is equivalent to taking the gradient of our F function. Though I believe there is a 0.5 multiple which is not accounted for. This is a scalar, of course, and since the gamma "step" multiplies the same term, I suppose it doesn't matter. But it should probably be added for completeness. Peryeat (talk) 05:32, 10 January 2011 (UTC)

I believe it also made an error in arithmetic. The first entry in the G(X0) matrix is computed as -1.5. I believe it is -2.5, as the cosine of 0 is 1. It took me an hour and a half to figure out that was wrong. ;-) I'll attempt to fix it.Peryeat (talk) 03:16, 10 January 2011 (UTC)
 * I think it's better now. Many of the calculations needed to be redone in light of this, and the optimization is less exciting, but I think it's correct. Peryeat (talk) 05:32, 10 January 2011 (UTC)

Introduction of animated example. January 16, 2011
I added an animation to try to aid in the understanding of this example. I know that there is great potential for something like this to work for explanation, but I don't have the proper software to edit the animation sufficiently. If anyone would like to clean it up, I'd appreciate it. I could give the Octave code to anyone who wants it. Peryeat (talk) 19:22, 16 January 2011 (UTC)

External links modified
Hello fellow Wikipedians,

I have just modified one external link on Gradient descent. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
 * Added archive https://web.archive.org/web/20140508200053/http://brahms.cpmc.columbia.edu/publications/momentum.pdf to http://brahms.cpmc.columbia.edu/publications/momentum.pdf

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

Cheers.— InternetArchiveBot  (Report bug) 21:43, 23 March 2017 (UTC)

Barely distinguishable colors
The new graph uses colors that are very hard to distinguish with red-green-color vision issues: https://en.wikipedia.org/wiki/Gradient_descent#/media/File:Steepest_descent.png

If there is anybody who could fix that in the original graphing packages, that would improve the situation a lot. This way a lot of graphs are just unusable. — Preceding unsigned comment added by 88.219.19.174 (talk) 17:35, 2 April 2019 (UTC)

Wrong Image Used
Hello,

The images used in Description > Examples showing the Zig-Zagging nature of the method show gradient ascent being used instead of gradient descent.

Images used are https://commons.wikimedia.org/wiki/File:Gradient_ascent_(contour).png and https://commons.wikimedia.org/wiki/File:Gradient_ascent_(surface).png are labelled as such.

If there is someone who could correct this with access to graphing software, it would help.

Thanks, Enchoc373 (talk) 20:18, 8 May 2019 (UTC)

The_Gradient_is_the_Steepest_Direction
Hi all, (I am a CS prof) Personally, I do not think that it is clear what it means that the direction of steepest ascent to ne the gradient. Even if one understands it, I dont think it is clear that it is true. I know that my undergrad and graduate students even in machine learning dont really get it. Today I wrote the following page Draft:The_Gradient_is_the_Steepest_Direction What are your thoughts in it? Thanks Jeff JeffAEdmonds (talk) 15:30, 18 May 2020 (UTC)


 * Page doesn't seem to exist/isn't publicly visible. Youarelovedfool (talk) 05:41, 28 January 2024 (UTC)

Remove steepest descent link
We should remove the redirect from steepest descent since that is a different optimization algorithm. DMH43 (talk) 14:55, 27 December 2023 (UTC)