Talk:Stochastic gradient descent

Optimisation?
Could someone please disambiguate the optimisation link? I'd love to do it, but I cannot figure out whether this should be mathematical optimisation (finding the min-max or balance of a function) or computer science optimisation (making a program resource-efficient)... --Tribaal 08:26, 12 January 2006 (UTC)

Does this article text actually mean anything?
I'm currently attempting to implement a stochastic hill-climbing algorithm into code, and I still don't know what the heck this page is on about. Could we perhaps edit it so the words actually mean something to someone? —Preceding unsigned comment added by 68.118.0.110


 * You may read Stochastic hill climbing. I think these two terms refer to different things. 129.186.93.219 (talk) 23:47, 15 April 2009 (UTC)

Back propagation training algorithm?
Could it not be said that the usual method of training a back propagation neural network (i.e., by picking a random training element at each step and adjusting the weight matrix based on the error it produces) is an instance of this algorithm? And probably by far the most common one? 92.41.75.253 (talk) 14:33, 24 October 2008 (UTC)

Yes, the standard backpropagation algorithm for multi-layer perceptron (MLP) neural networks  is a form of stochastic gradient descent  —Preceding unsigned comment added by 129.49.7.137 (talk) 15:43, 21 July 2009 (UTC)
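For readers following this thread, here is a minimal sketch of the "pick a random training element at each step" scheme described above. It uses a single linear unit rather than a full MLP, and the data, learning rate, step count and the function name `sgd_linear` are all assumptions made for illustration, not anything taken from the article.

```python
import random

def sgd_linear(data, lr=0.05, steps=2000, seed=0):
    """Fit y ~ w*x + b, updating on one randomly chosen example per step."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(data)   # pick a random training element
        err = (w * x + b) - y     # prediction error on that element alone
        w -= lr * err * x         # gradient of 0.5*err**2 with respect to w
        b -= lr * err             # gradient of 0.5*err**2 with respect to b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = sgd_linear(data)
```

In an MLP the two update lines would be replaced by the gradients that backpropagation computes for every weight, but the outer loop would keep the same shape.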

Move to "Stochastic gradient method"
"Oppose": I withdraw the motion. Further, I am ordering myself to take a "Wikibreak": Sorry for the confusion. Kiefer.Wolfowitz (talk) 22:37, 16 October 2010 (UTC)

(However, I did add some good references and expanded the information of the article. Thanks, 22:37, 16 October 2010 (UTC))

The title is misleading. The method need not find the minimum (even in theory) unless the problem is convex. (The "method" is not a proper iterative method in general, since it need not converge (even in theory) on nonconvex problems.) Kiefer.Wolfowitz (talk) 14:08, 16 October 2010 (UTC)


 * Oppose per WP:COMMONNAME. I've looked up the number of results currently returned by Google Scholar for several name variants: descent - 4250, algorithm - 3450, method - 1310, ascent - 763, optimization - 453, search - 452, incremental gradient descent/method/algorithm - 263. This shows that "descent" is the most commonly used name. -- X7q (talk) 17:23, 16 October 2010 (UTC)


 * Two replies. First, the method is not a descent method unless you narrow the problem class, so the title is misleading. Second, according to Google Scholar, the "ordered subsets" method is more common than "stochastic gradient descent" (sic). Other names include "incremental gradient" and "incremental subgradient" methods. I added references to articles on deterministic problems, which have a simpler convergence analysis than do stochastic approximation methods. Thanks, Kiefer.Wolfowitz (talk) 17:45, 16 October 2010 (UTC) 22:33, 16 October 2010 (UTC)


 * "First, the method is not a descent method." Well, gradient descent also doesn't necessarily decrease the objective function's value at each step (unless you do a line search for the step size, as in steepest descent). So is it not a descent method either? -- X7q (talk) 21:27, 16 October 2010 (UTC)


 * I expressed myself poorly, for which I apologize. Reasonable information-complexity results use convexity, specifically the property that at every step the (sub)gradient may be followed (for a sufficiently small stepsize) towards the minimum set (Judin & Nemirovskii). Without that property, which requires convexity, you can only prove convergence at really slow rates.
 * But again, in practice it is a bad idea to require descent with gradient methods: cf. the "spectral gradient" method of Barzilai & Jon Borwein. Thanks, Kiefer.Wolfowitz (talk) 22:15, 16 October 2010 (UTC)
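To make the "not a descent method" point in this thread concrete, here is a small sketch showing that a single stochastic step can increase the full objective even when started at its minimiser. The one-dimensional objective, the two targets and the step size are assumptions chosen purely for illustration.

```python
def full_objective(w, targets):
    """Q(w) = sum_i 0.5 * (w - a_i)**2 over all targets a_i."""
    return sum(0.5 * (w - a) ** 2 for a in targets)

def sgd_step(w, a, lr):
    """One step using only the gradient of Q_i(w) = 0.5 * (w - a)**2."""
    return w - lr * (w - a)

targets = [0.0, 10.0]
w0 = 5.0                                # minimiser of the full objective
w1 = sgd_step(w0, targets[0], lr=0.1)   # step on the first term only

before = full_objective(w0, targets)
after = full_objective(w1, targets)
print(before, after)                    # the objective goes up after the step
```

The step follows the negative gradient of one term, which pulls `w` away from the minimiser of the sum, so progress can only be expected on average and not at every iteration.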

Category:Convex optimization
Why is the article listed in Category:Convex optimization? This thing has been actively used for many years by neural networks guys and their problems are far from being convex. If the reason is that there are some convergence results for convex functions, then why isn't gradient descent listed in that category as well? -- X7q (talk) 17:43, 16 October 2010 (UTC)


 * I removed the category, following your objection.
 * Nonetheless, the article claims to have a descent method of minimization: The key words that concern me are "method" and "minimization". Perhaps only a local stationary point (which attracts a convergent subsequence) is intended? Convexity suffices for a proof of descent (progress at every "major step"), even without differentiability. What proof of convergence to a global minimum can you give without convexity? Kiefer.Wolfowitz (talk) 22:42, 16 October 2010 (UTC)


 * I re-added the category. Although SGD is used for non-convex problems, it is also a convex optimization algorithm, and in fact, in the case of a convex problem, it is guaranteed to find the global minimum if the learning rate is small enough. In all other cases, it is only a local optimization algorithm/heuristic. Q VVERTYVS (hm?) 18:33, 4 July 2015 (UTC)

Delta
In the definition of the gradient descent algorithm there is no definition for the quantity delta. — Preceding unsigned comment added by Eep1mp (talk • contribs) 02:28, 9 February 2011 (UTC)

Implicit updates (ISGD)
The section introduces an "x-prime" term without defining it, and it is not obvious what it should mean. Dmumme (talk) 21:17, 12 March 2019 (UTC)

Example section
Isn't the example in this article talking about batch gradient descent? Could someone please clarify? — Preceding unsigned comment added by Kindlychung (talk • contribs) 09:51, 13 April 2014 (UTC)

Formula in the background section

 * $$w := w - \alpha \nabla Q(w) = w - \alpha \sum_{i=1}^n \nabla Q_i(w),$$

I am a little confused here, the sum of gradients is a scalar, and w is a vector of coefficients, how can you subtract the one from the other? — Preceding unsigned comment added by Kindlychung (talk • contribs) 09:55, 13 April 2014 (UTC)


 * Gradient is a vector (http://en.wikipedia.org/wiki/Gradient), so sum of gradients is also a vector. 213.33.220.118 (talk) 12:35, 4 June 2014 (UTC)
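A short sketch may help with the shapes in the formula quoted above. It assumes, for illustration only, that each term is a squared error, Q_i(w) = 0.5 (w·x_i − y_i)²; the example data and the names `grad_qi` and `batch_update` are hypothetical. Each per-example gradient has one entry per coefficient, so their sum has the same length as w and can be subtracted from it.

```python
def grad_qi(w, x, y):
    """Gradient of Q_i(w) = 0.5*(w.x - y)**2 with respect to w (a vector)."""
    err = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [err * xj for xj in x]

def batch_update(w, examples, alpha):
    """One update w := w - alpha * sum_i grad Q_i(w), using every example."""
    total = [0.0] * len(w)
    for x, y in examples:
        g = grad_qi(w, x, y)
        total = [t + gj for t, gj in zip(total, g)]   # element-wise sum
    return [wj - alpha * tj for wj, tj in zip(w, total)]

examples = [([1.0, 2.0], 3.0), ([0.0, 1.0], 1.0)]
w = [0.0, 0.0]
w_new = batch_update(w, examples, alpha=0.1)
print(w_new)   # a list with the same length as w
```

Because the sum runs over every example, this is the batch form of the update; the stochastic version would apply `grad_qi` to a single randomly chosen example per step.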

History in optimization
A version of the stochastic gradient method dates back to Robbins and Monro's pioneering paper "A stochastic approximation method" in 1951 (cited more than 3500 times now). The current wiki page mainly focuses on the development of SGD in machine learning, but SGD seems to have a long history in optimization as well, though it is not as widely used there as LMS in signal processing and back propagation in neural networks.

Regularization
The article does not explain L2 regularization, which is a commonly used adaptation of the algorithm that improves convergence and keeps weights from exploding. Regularization in online optimization settings is an interesting topic in its own right, so it would be good to cover it here.

I'm working on adding a section for regularization to this article in my sandbox. — Preceding unsigned comment added by Groceryheist (talk • contribs) 16:07, 11 April 2016 (UTC)
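As a rough sketch of what such a section might describe: one common form adds a penalty (λ/2)‖w‖² to each per-example loss, so the update gains a term λw that shrinks the weights at every step. The squared-error loss, the parameter names `lr` and `lam`, and the example values below are assumptions for illustration.

```python
def sgd_l2_step(w, x, y, lr, lam):
    """One SGD step on 0.5*(w.x - y)**2 + 0.5*lam*||w||**2 for one example."""
    err = sum(wj * xj for wj, xj in zip(w, x)) - y
    # Data gradient is err * x; the L2 penalty contributes lam * w.
    return [wj - lr * (err * xj + lam * wj) for wj, xj in zip(w, x)]

w = [1.0, -2.0]
x, y = [0.0, 0.0], 0.0    # example with zero data gradient, to isolate the penalty
w_plain = sgd_l2_step(w, x, y, lr=0.1, lam=0.0)   # unchanged
w_reg = sgd_l2_step(w, x, y, lr=0.1, lam=0.5)     # shrunk by factor 1 - lr*lam
```

With a zero data gradient the regularized step simply multiplies each weight by `1 - lr * lam`, which is why this form is often called weight decay.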

Error in formulas for Adam
As per the paper (https://arxiv.org/abs/1412.6980), Algorithm 1 on page 2, the update rule for Adam uses bias-corrected versions of the first and second moment estimates, $$m$$ and $$v$$. The formulas in the article are missing the divisions by $$1-\beta_1^t$$ and $$1-\beta_2^t$$. — Preceding unsigned comment added by Andersgb1 (talk • contribs) 10:34, 14 June 2016 (UTC)
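For reference, here is a minimal scalar sketch of one Adam step including the two bias-correction divisions mentioned above, following Algorithm 1 of the cited paper. The function name `adam_step` and the example gradient are assumptions; the default hyperparameter values are the ones suggested in the paper.

```python
import math

def adam_step(theta, m, v, g, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g         # first moment estimate
    v = beta2 * v + (1 - beta2) * g * g     # second moment estimate
    m_hat = m / (1 - beta1 ** t)            # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)            # bias-corrected second moment
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=1.0, m=0.0, v=0.0, g=4.0, t=1)
```

On the first step the corrections undo the shrinkage caused by initialising `m` and `v` at zero, so the parameter moves by roughly `alpha` regardless of the gradient's magnitude. Without the divisions the first step here would be about `alpha * 0.4 / sqrt(0.016)`, roughly three times larger.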

Is citing the Mei et al. paper appropriate?
The first citation in this article, and the only citation in the introduction, is to a 2018 paper of Mei et al. Is this really appropriate? I have looked at the paper. It is of some specialized interest (although I can't say one way or the other whether it is technically correct), but it is hardly highly important, nor is it central to the SGD algorithm or its history. (By way of comparison, the original Robbins-Monro paper does not have a formal citation.)

I have seen citations to Mei et al. pop up in unexpected places. I think there is a possibility of this citation being academic spam. Would it make sense to check when this citation was added, and by whom?


 * No, it isn't. I edited the introduction to include citations that are much more relevant and established. It now cites the paper of Robbins-Monro and the paper by Tong Zhang (2004) which kickstarted the use of SGD in machine learning. Before, there were citations to Mei et al. and a business textbook, which were not as relevant (perhaps academic spamming). --Magicheader (talk) 22:43, 25 November 2019 (UTC)

Start at the beginning
Don't understand the maths 2603:8001:8100:C71D:50E4:A0EA:F266:835F (talk) 12:14, 4 April 2022 (UTC)


 * I agree that the structure of the article doesn't currently support any sort of 'gentle' or more generally helpful explanation, such as an analogy or salient visual. Does anyone have any ideas on how we might better structure it? I think a higher-level introduction would be better to start with, but I have no opinion on where the history section should go. VirtualVistas (talk) 18:09, 15 October 2023 (UTC)