Talk:Vanishing gradient problem

Uh... what is the problem itself?
Shouldn't the article define what the problem is? --Doradus (talk) 02:29, 23 January 2015 (UTC)


 * I made an attempt. It is difficult to explain this in a non-technical way. Bhny (talk) 17:12, 23 January 2015 (UTC)


 * Well, I am a student in ML, and I understand everything the article says, but it says nothing about what the problem actually is. Linguiloce (talk) 14:04, 1 October 2016 (UTC)


 * I just came back to this article, and found this quote: "The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value." Works for me.  --Doradus (talk) 16:35, 2 December 2018 (UTC)
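
A rough numerical illustration of the quoted sentence, under assumptions that are not taken from the article: a chain of scalar sigmoid units with every weight set to 1. By the chain rule, the gradient that reaches the first unit is a product of per-layer sigmoid derivatives, and each of those is at most 0.25, so the product shrinks quickly with depth. The function name and the default values are made up for this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def first_layer_gradient(depth, x=0.5, weight=1.0):
    """Derivative of the final output w.r.t. the first input of a chain of
    `depth` scalar sigmoid units (toy model, all weights equal)."""
    grad = 1.0
    activation = x
    for _ in range(depth):
        z = weight * activation
        s = sigmoid(z)
        # Chain rule: d sigmoid(w * a) / d a = w * s * (1 - s), with s * (1 - s) <= 0.25
        grad *= weight * s * (1.0 - s)
        activation = s
    return grad


for depth in (1, 5, 10, 20):
    print(depth, first_layer_gradient(depth))
```

With a gradient this small, a gradient-descent update to the earliest weights is negligible, which is the "preventing the weight from changing its value" part of the quote.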

Other solutions
Perhaps it would be useful to cite other solutions, such as:
 * Better weight initialization, for example using Xavier initialization (http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf): for each layer, draw the weights from a normal distribution with a standard deviation equal to 1 / sqrt(number of inputs).
 * Batch normalization (https://arxiv.org/abs/1502.03167), where the inputs of each layer are normalized.
 * Simply using a non-saturating activation function (typically ReLU, leaky ReLU) seems to help a lot.
 * Unsupervised pre-training: train the lowest layer (e.g. to reproduce its inputs using autoencoders), then proceed to the next layer, and so on. Finally, fine-tune with regular backpropagation.
 * Reuse the lower layers of a network that was trained on similar inputs.

What do you think? Miniquark (talk) —Preceding undated comment added 12:15, 9 August 2016 (UTC)
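
A minimal sketch of the initialization rule from the first bullet above, using only the Python standard library. It follows the rule exactly as stated in that bullet (zero mean, standard deviation 1 / sqrt(number of inputs)). The normalized variant in the linked Glorot and Bengio paper instead uses a variance of 2 / (inputs + outputs), so treat this as one common form rather than the only one. The function name, layer sizes and seed are arbitrary choices for illustration.

```python
import math
import random


def init_weights(n_inputs, n_outputs, seed=0):
    """Return an n_outputs x n_inputs weight matrix drawn from
    N(0, 1 / n_inputs), i.e. std = 1 / sqrt(n_inputs)."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(n_inputs)
    return [[rng.gauss(0.0, std) for _ in range(n_inputs)]
            for _ in range(n_outputs)]


weights = init_weights(n_inputs=256, n_outputs=64)

# Check that the empirical spread is close to the target 1 / sqrt(256) = 0.0625
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
empirical_std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
print(empirical_std, 1.0 / math.sqrt(256))
```

The point of scaling by the number of inputs is to keep the variance of each unit's pre-activation roughly independent of layer width, so that activations are less likely to start out in the saturated region of the nonlinearity.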


 * - Shouldn't "Faster hardware" be removed from the "Solutions" section, since that's supposed to have nothing to do with accuracy? Sz. (talk) 20:07, 30 March 2017 (UTC)
 * - Also: the "Unsupervised pre-training" item above seems to have been added now (ref.: multi-hierarchy, Schmidhuber). Sz. (talk) 20:07, 30 March 2017 (UTC)


 * I've added rectifiers. The section should be expanded. Wqwt (talk) 05:08, 5 April 2018 (UTC)
 * I have added a section on weight initialisation. I've not mentioned Xavier initialisation as we found it does not really work well in deep networks. Riccardopoli (talk) 06:43, 23 June 2022 (UTC)

Size of Problem?
How many nodes in an unfolded RNN are viable without LSTM? That is, where is the practical cutoff point up to which the gradient hasn't vanished? There must be some rule of thumb: if your patterns in time occur in fewer than N samples you can use a plain RNN, and if they span more than M samples you are better off with LSTM. robertbowerman (talk) 04:30, 9 February 2017 (UTC)
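
Not a rule of thumb from any source, only a back-of-the-envelope sketch under a strong assumption: every unrolled time step scales the backpropagated gradient by the same constant factor r with 0 < r < 1. In that idealised case the gradient after n steps is r^n, so it drops to a chosen threshold after about log(threshold) / log(r) steps. In a real RNN the per-step factor depends on the weights and activations and changes from step to step, so the numbers below are only indicative. The function name and the default threshold are made up for illustration.

```python
import math


def steps_until_below(factor, threshold=1e-6):
    """Smallest n with factor**n <= threshold, assuming a constant
    per-step gradient scaling factor strictly between 0 and 1."""
    if not 0.0 < factor < 1.0:
        raise ValueError("factor must be strictly between 0 and 1")
    return math.ceil(math.log(threshold) / math.log(factor))


for factor in (0.25, 0.5, 0.9, 0.99):
    print(factor, steps_until_below(factor))
```

The estimate is very sensitive to the per-step factor, which may be why a single cutoff N is hard to state in general.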

Suggested rename: extreme gradient problem
I really don't see the point of having both vanishing gradient and exploding gradient pages. We could just have two inbound redirects and bold both inbound terms in the lead. Should be fine IMO. -- MaxEnt 00:10, 21 May 2017 (UTC)

It is a well known problem in ML and pretty much everyone calls it the vanishing gradient problem. Sometimes they'll say vanishing/exploding gradient problem, but even that is rare. I've never heard it called the extreme gradient problem. Themumblingprophet (talk) 02:21, 15 April 2020 (UTC)

Fundamentally, the problem is about attractors in the parameter space of the error function; the problematic regions are stabilisers when you consider derivatives of this space parallel to various axes. This perspective is probably more abstract than the level at which most programmers operate, whereas "vanishing gradient" is reasonably concrete. 80.230.156.224 (talk) 08:55, 7 May 2024 (UTC)