Talk:Backpropagation

Wiki Education Foundation-supported course assignment
This article was the subject of a Wiki Education Foundation-supported course assignment, between 6 September 2020 and 7 December 2020. Further details are available on the course page. Student editor(s): Lipute17.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 15:12, 16 January 2022 (UTC)

Confusing notation
What does the notation y(x) mean when y is an R^n vector and x is an R^m vector? That notation is found in the loss function section, and I've been having a hard time inferring what it means. 2804:14C:8793:86F6:76D4:35FF:FE5B:6F91 (talk) 16:14, 20 February 2017 (UTC)


 * It refers to an output vector y corresponding to a training example vector x. The definition at the top of the section ($$ y, y' \in \R^n $$) has been implicitly altered to something more like $$ y : \R^m \to \R^n $$, a mapping from input vector to output vector. I agree it is confusing. Andars97 (talk) 01:46, 22 April 2017 (UTC)

Questions
Dear All,

Apologies if this is a very naïve question, but in section 6.1 shouldn't the reference be to all neurons directly affected by the neuron (ie L is the next layer), rather than all neurons (L is the total set of neurons lying between the input neuron and the output neurons).

Thx

NeilHyatt (talk) 11:08, 9 January 2017 (UTC)

Dear all, why doesn't this article mention that one needs to learn the biases of multilayer perceptrons? — Preceding unsigned comment added by 92.227.230.174 (talk) 15:10, 12 December 2014 (UTC) Using the neurons weights on it's incoming connections - What does this mean? - Zoe 06:22, 31 January 2003 (UTC)
 * Hopefully this has been clarified. - Eyrian 05:30, 22 Apr 2005 (UTC)


 * This is due to that in the structure of a Neural Network, the neuros are all connected. Each link that connects two neurons carries a weight. - BestDaNiu (talk) 13:37, 12 November 2014 (UTC)

Backpropagation usually allows quick convergence on local minima - What does this mean? - Zoe 06:22, 31 January 2003 (UTC)
 * I think that this is reasonable terminology for use in an article on optimization (though hopefully clarified somewhat). - Eyrian 05:30, 22 Apr 2005 (UTC)


 * In all optimization problems, the objective is to find the maximum or minimum value of a given function with or without constraints. Backprobagation can be viewed as an optimization problem, as it tries to minimize the cost function between the hypothesis outputs and the actual outputs. And as the objective function of backprobagation is non-convex, given a vector of initially random weights, it usually ends with a local minima. -BestDaNiu (talk) 13:37, 12 November 2014 (UTC)

Is $$t_{j}-o_{j}$$ in the 2nd to last equation in the "Finding the derivative of the error" supposed to be backward from (negative of) the y - t in $$\frac{\partial E}{\partial o_j} = \frac{\partial E}{\partial y} = \frac{\partial}{\partial y} \frac{1}{2}(t - y)^2 = y - t $$ ? Couldn't figure out why it would be. — Preceding unsigned comment added by 206.248.153.120 (talk) 03:18, 27 July 2014 (UTC)
 * The order doesn't matter because the difference between these two variables is being squared. The order was chosen for convenience. [Similarly, the 1/2 that appears in the error/loss expression is there for convenience to make the terms simpler after the derivatives are taken. The error expression needs to measures how far off the mark the training round was. I don't think the training's behavior is understood well enough to define a best form for the loss function in all scenarios.]Hozelda (talk) 08:11, 21 June 2019 (UTC)

Please translate from German to English
There seems to be much more information in the German version of this page. Please translate it to English. Thanks! --Jcarroll 23:19, 5 May 2006 (UTC)


 * Added the template to the article. Nicholas Léonard 02:05, 8 January 2013 (UTC) — Preceding unsigned comment added by Strife911 (talk • contribs)

History of the technique
I had heard at some point that a review of Gauss's work indicated that the technique we call backpropagation was actually developed by Gauss. Werbos would then be an independent re-discoverer of the technique. It may even have been Werbos who told me this; I wasn't making notes at the time (about 1990). But I'm not finding corroboration of this, so I thought I would at least broach the topic here. Until there is such corroboration, of course, there should be no change to the article. --Wesley R. Elsberry 11:58, 27 August 2006 (UTC)
 * I somehow remember that Yann LeCun is blamed for this technique. On his web page, I found an article "generalization using back-propagation" from 1987. (http://yann.lecun.com/exdb/publis/index.html)
 * The mathematical theory was there for a long time, so a historical rewiev is indeed interesting. —Preceding unsigned comment added by 83.216.234.177 (talk) 09:58, August 28, 2007 (UTC)


 * Any technique for function minimization (in this case, classification error minimization) that is not simple trial-and-error will predate to the works of Isaac Newton and/or Leibnitz. Perceptron and MLP training are not exceptions, most training algorithms will be variations of Gradient Descent or Newton-Raphson (like Levenberg-Marquadt). Historical review on this would be endless.

--Lucas Gallindo (talk) 19:45, 8 September 2009 (UTC)

Tautology?
"Backpropagation usually allows quick convergence on satisfactory local minima for error in the kind of networks to which it is suited." —The preceding unsigned comment was added by 129.170.31.11 (talk • contribs) 03:16, 16 January 2007 (UTC).
 * I don't think so. The sentence indicates that backpropagation works well on certain networks. --Eyrian 05:49, 16 January 2007 (UTC)

Alternatives
Actually i would like some links to alternative methods for teaching multi-layered networks. Maybe a see also-section —Preceding unsigned comment added by 80.164.105.90 (talk) 07:11, 8 September 2007 (UTC)

Algorithm
Citation: Material seems to be directly copied from http://www2.cs.uh.edu/~ceick/ai/nn.ppt (slide 6); if it is, a reference should be included.

Explanation: I'm not understanding how to compute the per-node error. How much blame should be subtracted from each connection's weight, and how much should be propagated back to the inputing nodes'? (I'm assuming that it backpropagates according to the proportion of error contributed by each earlier connection.) --Jesdisciple 03:54, 8 October 2007 (UTC)


 * I detail the error propagation algorithm in multilayer perceptron. Perhaps some of the math should be copied to here? SamuelRiv (talk) 13:04, 18 November 2007 (UTC)

The phase 1 description and pseudocode need to be more detailed. The key propagation step is briefly described in point 3 of phase 1 and line 7,8 of the pseudocode, but how exactly one does the propagation is not explicitly defined. I.e. you could not write C code from the given information as is. 202.112.90.192 (talk) 04:12, 28 November 2018 (UTC)

Connection to Biology
Someone added a section on backpropagation of action potential in biological neurons. I just backed this change out. The material is irrelevant in the context of this article. (In this article, backpropgation refers to reverse propagation of errors through the computation graph. As far as anyone knows, this has nothing to do with the potential jump of action potentials invading the proximal regions of the dendritic arbor) Barak (talk) 19:10, 17 November 2007 (UTC)
 * term "Backpropagation" is ambiguous


 * How about some tact here. I added 2000 bytes of information, so maybe there should be some discussion before wiping it. The idea of backpropagation as a biological model, as it has been applied by several research groups (IBNS for one) is useless without some justification. A computational model is effectively a neuroscience model, and any information added to this effect is useful. Anyway, if more context or a separate article is needed, then we can add that. But please don't just blank a good half-hour of factual and relevant information. SamuelRiv (talk) 21:14, 17 November 2007 (UTC)


 * I'm sorry if my change seemed abrupt. The material you added is both interesting and (I say this as someone knowledgeable in the field) entirely correct.  The reason I removed it is because it belongs in a different article.  Perhaps in an article about action potentials in neurons, or in one about Hebbian synaptic plasticity, or some place like that.  But unfortunately, it is not appropriate here: "back propagation" of potential from the soma into the dendritic arbor is a different sense of the term "backpropagation" than that being used here, which refers to a mathematical construction for an estimation problem.  Please do not take this personally though, and I am serious when I say that the text you wrote deserves a home.  It just doesn't fit here.  Barak (talk) 11:05, 18 November 2007 (UTC)


 * Thanks for the detailed response. I disagree, obviously, because from my perspective artificial and biological neural nets are inseparable concepts. The nonlinear activation function, for example, was partially inspired by biology, and in this case the success of backprop as an algorithm encouraged biologists to find evidence for it in biological systems. I don't think there is enough there for it to warrant a separate article (it still seems like a lot of speculation as to the algorithmic purpose of the retrograde signal). It belongs here because backpropagation is a feedforward algorithm, where as Hebbian learning, for example, has no such bias (it could work with a feedback loop to transmit errors, for instance). I don't know, personally I can't think of a more appropriate place for this information, especially as this article is pretty short already. I'll think about it more, though, and see if there might be a better fit somewhere. SamuelRiv (talk) 13:02, 18 November 2007 (UTC)


 * I'm sorry but what you're saying doesn't really support the idea that this particular material should be included in this particular article. You're summarizing a biology paper that happens to have the word "backpropagation" in its title.  But that paper is using that word in a completely irrelevant way.  Any other paper about some detailed electrophyiological measurement would be just as relevant.  This article is short, yes.  There are many relevant hunks of information that could be added.  Eg: a review of backprop-based work that has cast new light on a variety of biological phenomena, such as NETtalk and language acquisition, Qian et al's shape-from-shading model and the bar detector story.  Or some hooks into the literature of practical applications could be added: PAPnet is a billion-dollar thousands-of-lives success of backprop.  Credit card transaction validation.  Loan application testing.  ALVINN.  Face detection.  Facial gender discrimination.  Or hooks into the relevant theoretical literature: theory of generalization in such networks, VC dimension bounds, early stopping, optimal brain damage.  Convolutive networks and TDNNs enjoy wide application (leNet is the gold standard for handwritten digit recognition, for example.)  Pointers to implementations that are really useful, like "lush", so people can go out and build real systems.  But a summary of a bit of completely irrelevant biology data just doesn't belong here.  (You mention Leon Cooper's group at Brown as if it makes this relevant.  But that group doesn't work on backprop or related methods.) Barak (talk) 19:54, 18 November 2007 (UTC)


 * You still have not addressed why it is irrelevant. Here we have an algorithm that essentially cannot function without some method of backward propagation of signals (simple feedback loops don't work by nature of the correction algorithm). This is used in neural networks, which are almost by definition the best existing model of a functional brain. So as soon as you have an algorithm, you have a brain model. The Brown group works on brain models from both a top-down and bottom-up perspective, so they address both algorithms (mostly RBNs - granted I haven't seen any papers specifically working with backprop) and biological models. So as a model of the brain, which the backpropagation algorithm is, we would naturally want some evidence for it. That's where this addition comes into play.
 * Now I'd rather have a discussion than an edit war, but I fail to see why, in a dispute, you'd rather default to having less information visible in the article than more. SamuelRiv (talk) 20:15, 18 November 2007 (UTC)
 * Addendum: I have tagged this page on Third opinion. Additionally, I want to clarify that IBNS has done work on backprop algorithms, but no paper deals specifically with them. Most use multilayer perceptrons in comparison to RBNs. SamuelRiv (talk) 20:43, 18 November 2007 (UTC)


 * Let me answer your actual question above, which I paraphrase: "Here we have an algorithm that cannot function without some method of backward propagation of signals ... why is evidence for backward (i.e., into the dendritic arbour) propagation of action potentials not relevant?" The reason is that what is being retrograde transmitted into the dendrites are action potentials, which is the activity of the neurons.  It is not the error, which is what would need to be transmitted to implement backprop.  It is backward something, but the wrong thing!  It is not backward propagation of errors.  Barak (talk) 21:00, 18 November 2007 (UTC)


 * Okay, new paragraph because I'm sick of typing colons. In response, your point about error is fair, in that error or its energy is not necessarily what is being backpropagated. But it can be. Imagine a 3-layer perceptron that wants to learn to match its output to a memorized binary pattern, say. So its output layer connects in a linear 1-1 fashion with the memory, producing a spike frequency in the memory layer that increases with the amount of error in the output, that is, if the output approaches "0" frequency, the memory wants frequency "1" and spikes at full frequency to signal that error. Once the signals match, the memory stops spiking. So each of these error spikes backpropagates, and then you get the full LTP/LTD effects (this last sentence has not yet been observed). I'll see if I can find a paper on this mechanism when I get to the office. SamuelRiv (talk) 15:15, 19 November 2007 (UTC)


 * Addendum - see if you can open it for a paper outlining a biological backprop mechanism. Does information from this at least deserve to be in the article? SamuelRiv (talk) 19:04, 19 November 2007 (UTC)


 * That's not a bad paper, although there are two other efforts I'd look at first. One is a paper by Paul Werbos on implementing backprop in neural hardware.  The other is the "Recirculation Neural network" architecture (G. E. Hinton and J. L. McClelland, J. L., "Learning representations by recirculation", in D. Z. Anderson (ed), "Neural Information Processing Systems," pages 358-366, AIP, ), which was the start of a line of research that, in a modern probabilistic context, developed into the Helmholtz Machine and the Restricted Boltzmann Machine architectures.


 * A wikipedia article on Biologically Plausible NN Learning Algorithms would be a useful place to have pointers to these various strands of research, and to put them all in perspective. Barak (talk) 10:51, 26 November 2007 (UTC)


 * Yikes, another thing on my "to do" list. I'll explore making that article (I have to look at the literature on biological SVMs and CMs first). Meanwhile, thanks for the papers, except I can't find the Werbos one - do you remember the title or keywords? SamuelRiv (talk) 16:28, 26 November 2007 (UTC)

Request for comments on biological evidence section
The question is whether or not a section which was alternately added and removed is pertinent, relevant, and suitable for inclusion in this article. — Athaenara ✉  03:16, 19 November 2007 (UTC)


 * Responding to RfC. This article is about a computer algorithm, while the section in question describes a biological process that works in the same way. I don't believe the section should be included, as it is of little relevanse to what I believe the article should be about (the algorithm and its properties). But the information presented in the section is interesting and seems to be notable, so I beleive it should be added to a seperate article and linked to from this article. Labongo (talk) 11:58, 19 November 2007 (UTC)


 * My main problem is that every neural network algorithm doubles as a biological model for the time being, because we simply don't know for sure what algorithms the brain uses. We have some evidence for several different algorithms, so where do we put all that information? SamuelRiv (talk) 15:17, 19 November 2007 (UTC)


 * The information should probably be in the other article or some related article, since how the brain works is clearly outside the scope of this article. Labongo (talk) 16:00, 20 November 2007 (UTC)


 * I created Neural backpropagation, as per the third opinion. However, I frankly think this is ridiculous, as the literature only refers to the phenomenon as "backpropagation" and there is enough overlap between these topics that they deserve separate articles. I'd also like to say that repeatedly deleting large-scale factual and arguably relevant contributions to an article seems to me to violate the spirit of Wikipedia, as stated in its revert policies. Someone will inevitably come along and suggest these articles be merged, if anyone ever reads them. SamuelRiv (talk) 23:58, 20 November 2007 (UTC)


 * I added otheruses templates to both algorithms, to ensure that there is no confusion with regard to the term. Since the issues seems to be resolved I have also removed to RfC tag from the talkpage. Labongo (talk) 15:32, 22 November 2007 (UTC)

Poor summary?
It's a very good idea to have a summary of an algorithm like this, but I don't find the summary as it currently stands, to be very good. Thomas Tvileren (talk) 10:01, 19 January 2009 (UTC)

Much better now! Thanks to Strife911. Thomas Tvileren (talk) 17:42, 23 February 2011 (UTC)


 * Because this is an encyclopedia, I was hoping for an introduction that would give a general idea of the topic to non-specialists. The first sentence refers to the "objective function", but it is not linked or defined, and non-specialists will have no idea what it is. It's my view that the first sentence in any article should come as close as possible to a general description, even if not perfectly precise, without resorting to opaque jargon, and with a minimal number of words and concepts that the general reader has to look up in order to understand that description. I'd like to invite someone who's knowledgeable about the subject take on the challenge of making the whole introductory section more accessible to non-specialists. - Do c  t  orW  23:36, 23 June 2012 (UTC)


 * This is not my area, but let me take a stab at what I have in mind, in order to illustrate:
 * Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks. From a desired output, the network learns from many inputs, similar to the way a child learns to identify a dog from examples of dogs. [Perhaps a more exact and more technical, but nevertheless brief, explanation could comprise the next sentence.]
 * The introduction could then go on to talk about the supervised learning method, and then the delta rule.
 * The second paragraph could introduce the seminal figures (Bryson & Ho, Werbos, Rumelhart, and Hinton & Williams). - Do c  t  orW  00:19, 24 June 2012 (UTC)

There was no response to my suggestion from 6 months ago, so I was bold and took a crack at it, even though this is not my area. Please feel free to correct any errors I may have made and continue to make at least the introduction more accessible to non-specialists. This is an encyclopedia, after all. - Do c  t  orW  09:47, 24 December 2012 (UTC)

Invented by
The article attributed the first description to "Paul Werbos in 1974". According to my source this is incorrect, I have updated the text to cite Bryson and Ho 1969. I am mentioning this in Talk since I am surprised to see a long-standing error of such importance. 78.105.234.140 (talk) 17:01, 15 October 2009 (UTC)

If one actually reads the relevant section of Bryson and Ho (1969) it plainly (a) cites an earlier journal paper describing the idea, and (b) mentions that this method is an application of Pontryagin's minimum principle. Barak (talk) 22:35, 16 October 2011 (UTC)

Observer's comment: 87.205.28.164 (talk) 21:35, 1 January 2013 (UTC) As an active researcher in Backpropagation and Pontryagin's maximum principle, I can assure you that backpropagation and pontryagain's principle are not connected. Barak, Can you give the page number of the source where you read this, and I'll check it out?

Relationship to Delta Rule?
The sentence enclosing the link to the Delta rule for this page states:

"It [Back-propagation] is a supervised learning method, and is an implementation of the Delta rule."

Implying that back-propagation is a subset of Delta Rule, but for the link to Back-propagation from the Delta Rule page it states:

"It [The Delta Rule] is a special case of the more general backpropagation algorithm."

Can this be? Or is the wording simply a bit confusing? —Preceding unsigned comment added by 72.221.98.197 (talk) 02:44, 20 December 2010 (UTC)


 * If you look in books, it appears that the delta rule only works at outputs where you have a target, and the back-prop is a generalization of it. So the first quoted sentence should have "implementation" changed to "generalization", I think.  Dicklyon (talk) 03:17, 20 December 2010 (UTC)

If you read the original article by Rumelhart, Hinton & Williams, they explicitly call it the "generalized delta rule." The original delta rule applies to linear single-layer networks, although it can be shown to be identical to the perceptron algorithm. The generalization is to nonlinear networks with a squashing function, and to multilayer networks (hence, "backpropagation", which is unnecessary in single layer networks). -gary cottrell —Preceding unsigned comment added by 75.80.97.109 (talk) 10:00, 25 January 2011 (UTC)

It doesn't explain how the delta rule is generalised. It just duplicates the derivation of the delta rule and then gives up on explaining what happens to hidden nodes. Ptolom (talk) 11:06, 19 January 2014 (UTC)

A new algorithm proposal
Could we change the layout of the algorithm? Why not make each instruction a link to additional information? But should this information be put in the same page, or a one relating to the 'next level of detail', for the algorithm? —Preceding unsigned comment added by 139.80.206.172 (talk) 08:57, 1 May 2011 (UTC)

Equations
The article is missing the equation(s) for back propagation learning algorithms. Malithyapa (talk) 23:21, 31 August 2011 (UTC)

Meaning of Weight
I was unfamiliar with the term "weight" used in this context- I think it ought to be more clearly explained early on. Also, it might be good for someone more knowledgeable to add a quick reference to the Weight disambiguation page. Brauden (talk) 03:20, 7 May 2012 (UTC)
 * I added a bit at the disambig page. Dicklyon (talk) 04:22, 7 May 2012 (UTC)

Stochastic versus online?
In adding cross-references to the Modes of Learning section, I found an apparent contradiction between this article and that for stochastic gradient descent (SGD), where the latter (specifically, in the  Iterative method section) equates SGD with  online learning while this article distinguishes between them. Come anyone wiser than me please advise if these should be equated here or the other article corrected?

Also, the statement that "Stochastic goes through the data set in a random order in order to reduce its chances of getting stuck in local minima" seems quite far from the method and purpose described in the SGD article. Perhaps the Modes of Learning section would be better simply cross-referencing the others rather than trying to summarise them. Or have I entirely missed a point?

Many thanks. p.r.newman (talk) 13:33, 4 April 2013 (UTC)

Threshold? Derivative? Other activation functions?
1. What happens with the threshold? It is mentioned in othe articles, here is no word of it. In the German version, they say they replace the threshold with an "on-neuron", but even they do not venture to describe it.

2. How do you get the derivative? - I think this is very nicely explained here:

https://en.wikipedia.org/wiki/Delta_rule

3. It should be explained what happens with other activation functions. In particular, that the derivative may actually be 1 in case of a linear activation. It is elsewhere mentioned that you can use tanh as activation, too.

- Nino — Preceding unsigned comment added by 188.23.245.69 (talk) 05:38, 26 February 2014 (UTC)

misleading statements
It wasn't said or implied in the article that input layer isn't affected by error estimate. Corrected algorithm description. Rest of article still misleading. 2001:470:600D:DEAD:E94B:92C8:B4E1:F8C5 (talk) 11:57, 28 August 2014 (UTC)

Outdated
This article, or at least parts of it, are quite outdated. They follow the old definition of backpropagation, where this name refers to the entire gradient descent algorithm for squared-error feed-forward networks. The lede, by contrast, uses the more modern meaning, where backprop is the generic gradient computation algorithm based on the chain rule, which can be combined with any gradient-based optimizer (conjugate gradient, L-BFGS) and any differentiable loss (log-loss, squared-error).

I'm planning a rewrite of this entire article to bring it in line with Bishop's (2006) Pattern Recognition and Machine Learning instead of Rojas's (1996) Neural Networks, which (judging by the notation) is the main source for the current text. Q VVERTYVS (hm?) 20:34, 4 November 2014 (UTC)

Modes of learning
Statement "Yet batch learning will yield a much more stable descent to a local minimum since each update is performed based on all patterns." can be misleading. Batch learning requires one to go through all of the test cases, while online learning can cope with a adding a single one, or a few. But if you pass through all of the test cases with on-line learning, you shouldn't end up further away from solution than with a batch learning which should actually get completely abandoned, as it is nonesense. Refer to http://axon.cs.byu.edu/papers/Wilson.nn03.batch.pdf for explanation why Wiki should abandon any talks of batch learning entirely, due to it not being useful in any sort of way. --213.174.16.2 (talk) 14:38, 21 January 2015 (UTC)


 * Usefulness is not the criterion for covering something on WP. Also, while batch learning may not be optimal or even better for gradient descent, it also allows using algorithms like L-BFGS or conjugate gradient (see Bishop, Pattern Recognition and Machine Learning, p. 240). These are not easily recast as online or minibatch algorithms. Q VVERTYVS (hm?) 16:57, 21 January 2015 (UTC)

global convergence?
"However, convergence to the global minimum is said to be guaranteed using the adaptive stopping criterion.[3]" is an extraordinary claim. As such, it requires extraordinary proof. The referenced article provides only the claim. Mborg (talk) 15:25, 16 September 2015 (UTC)
 * It's not really an extraordinary claim. The quotation is not a claim; the quotation reports that "it is said". If you feel that's too strong, you could change it to "articles report..., but there is no proof for it".--Gciriani (talk) 15:52, 16 September 2015 (UTC)
 * "It is said" is certainly not a strong claim; it's far too vague a claim for a technical article. I've removed the claim and the reference because a poorly-cited article by some unknown researchers in an off-track journal that presents a ridiculous algorithm without proof of effectiveness or even an intuitive justification as to why its crazy formulas should work is not the kind of material on which to base an encyclopedic article. Q VVERTYVS (hm?) 20:09, 16 September 2015 (UTC)

Error in algorithm?
In its current form the algorithm has two nested loops initialize network weights (often small random values) do forEach training example ex        prediction = neural-net-output (network, ex)  // forward pass actual = teacher-output (ex) compute error (prediction - actual) at the output units compute $\Delta w_h$ for all weights from hidden layer to output layer // backward pass compute $\Delta w_i$ for all weights from input layer to hidden layer  // backward pass continued update network weights // input layer not modified by error estimate until all examples classified correctly or another stopping criterion satisfied return the network

The way I learned the algorithm, the network weights aren't updated for every training example, but rather once for each iteration of the outer loop. That is the inner loop is used to calculate the gradient and then, after the inner loop, the gradient is used to update the weights. So I suggest initialize network weights (often small random values) do forEach training example ex        prediction = neural-net-output (network, ex)  // forward pass actual = teacher-output (ex) compute error (prediction - actual) at the output units compute $\Delta w_h$ for all weights from hidden layer to output layer // backward pass compute $\Delta w_i$ for all weights from input layer to hidden layer  // backward pass continued end forEach update network weights // input layer not modified by error estimate until all examples classified correctly or another stopping criterion satisfied return the network

— Preceding unsigned comment added by 142.166.157.21 (talk • contribs)


 * It's explained (inadequately) below the pseudocode:

"Technically speaking, backpropagation calculates the gradient of the error of the network regarding the network's modifiable weights. This gradient is almost always used in a simple stochastic gradient descent algorithm to find weights that minimize the error."


 * The pseudocode is actually that of stochastic gradient descent, not backprop; backprop is the part labeled "backward pass" and practically glanced over. What you're suggesting is batch gradient descent. Q VVERTYVS (hm?) 15:24, 18 October 2015 (UTC)

Wrong Error Function Visualization and Description under the "Intuition" Heading
I believe that under the "Intuition" heading, given the current example of (x1,x2,t) = (1,1,0), the correct error function is NOT a parabolic bowl as written, but instead what Wolfram Alpha refers to as a parabolic cylinder. I believe the correct graphic should be of E= 0 - (x1*w1 + x2*w2)^2, and not the current graphic which models E= 0 - [(x1w1)^2 + (x2w2)^2].

I concur - this is a very confusing picture. If w1=-w2 then the error is zero. Henk.muller (talk) 10:37, 6 June 2016 (UTC)

Sources on the history of backpropagation
According to various sources, basics of continuous backpropagation were derived in the context of control theory by Henry J. Kelley in 1960 and by Arthur E. Bryson in 1961, using principles of dynamic programming. In 1962, Stuart Dreyfus published a simpler derivation based only on the chain rule.

In 1970, Seppo Linnainmaa finally published the general method for automatic differentiation (AD) of discrete connected networks of nested differentiable functions. This corresponds to the modern version of backpropagation which is efficient even when the networks are sparse.

In 1973, Stuart Dreyfus used backpropagation to adapt parameters of controllers in proportion to error gradients. In 1974, Paul Werbos mentioned the possibility of applying this principle to artificial neural networks, and in 1982, he applied Linnainmaa's AD method to neural networks in the way that is widely used today.

In 1986, David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams showed through computer experiments that this method can generate useful internal representations of incoming data in hidden layers of neural networks. In 1993, Eric A. Wan was the first to win an international pattern recognition contest through backpropagation.

One should insert the missing references in the article. Mathpriest (talk) 15:56, 28 March 2016 (UTC)

Remake of the algorithm section
I have written a candidate for a better description of the algorithm here: User:Esraiak/sandbox, and I am considering replacing the current Algorithm (and/or Summary) section with that one.

I will do the replacement in a few days unless someone objects.

Esraiak (talk) 14:25, 28 September 2016 (UTC)

Added content from Spanish/German articles. Improved for non-technical readers.
I added some sections from the German article (mostly the Extension section) and from the Spanish article (introduction).

I also tried to make the introduction and a few other sections better for non-technical readers. AlgorithmEdits (talk) 02:02, 12 December 2016 (UTC)

Buzzwords
Some buzzwords in the algorithm description: input thought, output activations, weight-synapse.

With a bit of critical thinking it's possible to infer what they refer to, but they are not defined here and inappropriate (subjective interpretations) for a purely technical description. — Preceding unsigned comment added by 88.177.31.169 (talk) 15:38, 26 December 2016 (UTC)

Core concepts in first sentences
I am unhappy with the first sentences in this article. I don't mean that they are wrong, only that they do not explain well what it is all about. Here is a better explanation that I found in Deep Learning by Goodfellow, Bengio and Courville, 2016. I think this is now the dominating course book on Deep Learning. It says on page 198: "The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi layer neural networks. Actually back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform the learning." That is, when tuning an ANN to perform well on a set of data, the weights in the network are changed so that a cost calculated from the difference between the network output for the data and the known correct output is minimized. This minimization algorithm calculates the gradient of the cost as a function of the weights. Back-propagation is a method to calculate that gradient. Therefore, in my view, backprop is a method to calculate a gradient that is needed in the calculation of the weights to be used in an artificial neural network. --Ettrig (talk) 08:50, 5 February 2018 (UTC)

Confusing discussion of gradient descent
Why is gradient descent mentioned in this wiki article? Backpropagation is an algorithm to find the gradients, then what you do with those gradients is an entirely different issue. Reading the wiki one may think that backpropagation also concerns itself with SGD or similar; see discussion about choosing an adaptive learning rate, etc.

What do you think? Should we delete those discussions? User:Daniel.Cardenas


 * Good point! For orientation, gradient descent should be mentioned as the most common application and the context in which it was developed. Currently gradient descent occurs 12 times. That is far too much. The subject is difficult as it is. We should not add to the confusion by adding off-topic material.--Ettrig (talk) 06:42, 28 November 2018 (UTC)
 * There are whole sections of such off-topic material. Should this be just removed, or dumped on articles where it belongs? --Ettrig (talk) 07:20, 28 November 2018 (UTC)
 * I moved it to artificial neural network. That is at least a small step in the right direction. But a lot of restructuring of ANN material is neede. --Ettrig (talk) 14:16, 28 November 2018 (UTC)
 * 10K bytes moved, out of 38K. Now, there is a lot to clean up in ANN, but at least it is in the right place. There is a lot of mathematical notation, which is cumbersome to produce. So it would be unfortunate to just lose it. I think it is mostly correct. --Ettrig (talk) 16:43, 28 November 2018 (UTC)

History section added
I added a history section. I consider it a poor attempt, but figured something was better than nothing. Please improve. Thanks! Daniel.Cardenas (talk) 16:14, 30 September 2018 (UTC)
 * There was a history at the bottom of the article. I merged yours with that one.--Ettrig (talk) 14:17, 28 November 2018 (UTC)

Confusing notation #2
Using $$o_k$$ for both the input and output of a neural network layer in the Derivation section is super confusing. Could this be changed (at least add some accent to distinguish the two)? Attila I Szabo (talk) 09:49, 11 April 2019 (UTC)
 * Looks like several people have cleaned up most of that Hozelda (talk) 13:10, 16 June 2019 (UTC)

"total derivative" just before eqn 5 is not correct explanation
There is a mention of "total derivative" presumably to justify a summation formula that follows, "taking the total derivative with respect to $$o_j$$, a recursive expression for the derivative is obtained: $$\frac{\partial E}{\partial o_j} = \sum_{\ell \in L} \left(\frac{\partial E}{\partial \text{net}_\ell}\frac{\partial \text{net}_\ell}{\partial o_j}\right) = \sum_{\ell \in L} \left(\frac{\partial E}{\partial o_\ell}\frac{\partial o_\ell}{\partial \text{net}_\ell}\frac{\partial \text{net}_\ell}{\partial o_j}\right) = \sum_{\ell \in L} \left(\frac{\partial E}{\partial o_\ell}\frac{\partial o_\ell}{\partial \text{net}_\ell}w_{j \ell}\right)$$ (eqn 5)."

That appeal seems incorrect. The total derivative comes into play if you wanted to know the approximate change in the error based on all the small $$\Delta w_{i,j}$$. As far as calculating any individual $$w_{i,j}$$ (the purpose of the section) the total derivative has nothing to do with it.

I am not presenting a proof of eqn 5, but I think it could be true because of at least 2 things: the error function is a sum of terms (one for each output neuron), and the inputs to every neuron are combined in a summation (before applying the logistic function). These summations of inputs make it so that the partial derivatives, after the chain rule applications, separate into summed components since the derivative of a sum is the sum of derivatives.

I am not sure what changes should be made, but suggesting the "total derivative" is why eqn 5 is true is not true imo. — Preceding unsigned comment added by Hozelda (talk • contribs) 08:26, 21 June 2019 (UTC)
 * I'm back 6 months later and just did a particular case on paper (not the generalized case) and got an answer consistent with $$\delta_j$$ as per this page, but the reason for the summation in the $$\delta_j$$ formula has nothing to do with "total derivative" as stated on the page. It is a result of working the partial of that loss function* all the way to the end for the relevant $$w_{ij}$$ that gives that form for $$\delta_j$$. If the loss function was something else then the result could be different for $$\delta_j$$. Similarly if the definition of the perceptron was other than the dot product of the inputs with their weights then $$\delta_j$$ could have a different form that might not even recurse in a compact way. Also, you guys should look up the meaning of total derivative. It doesn't even make sense how it is being used here. In summary, it is a coincidence that the correct answer for $$\delta_j$$ looks somewhat like a total derivative (summation of partial derivative factors). This happens for the typical loss functions and the definition of perceptron (which have linear aspects to them). [*] square error or any similar error function that separates each output perceptron contribution into a distinct "term". Hozelda (talk) 05:04, 4 January 2020 (UTC)
 * OK I am reinterpreting the meaning of total derivative so that instead of having several independent variables with respect to which we derive the various parts under the summation, that those don't have to be independent variables. In that case, the explanation more or less makes sense. Hozelda (talk) 11:44, 4 January 2020 (UTC)


 * Hi
 * Thanks for the analysis! Yes, the fundamental confusion is that backpropagation computes the gradient, not the total derivative. These are frequently conflated, because they have the same components, but are distinct: the gradient is computed backwards from the end, while the total derivative is computed forwards from the start (see the gradient page for more). The $$\delta_j$$ is the gradient at a specific level, interpreted as "(backwards) propagation of error", which is why computing it in terms of total derivatives is confusing: it's backwards.
 * I've rewritten the lead and written a "Overview" section in terms of gradient, notably in, but haven't rewritten other sections.
 * Are these new "Overview" and "Matrix multiplication" sections clearer?
 * —Nils von Barth (nbarth) (talk) 03:06, 5 January 2020 (UTC)

Incoherent introduction
It has already been mentioned several times in this page, but there is still a confusion in the introduction between “error backpropagation” the algorithm from Rumelhart et al. and LeCun to train a NN and “gradient computation by backpropagation” the more modern understanding of the term that comes from the fact that people have now decoupled the gradient computation and SGD parts of the original algorithm (and don't always use SGD anymore). As is, the first paragraph seem to use the first one and the second paragraph the second one, which is confusing. It would be better to explicit that there are several meanings to the term and explain why. I nobody is against it, I will try to rework it in the next days --Evpok (talk) 08:42, 18 September 2019 (UTC)
 * Sounds like a good move.  We might have a chicken-and-the egg situation here because the leads should be a summary of what is in the article.   But it sounds like a good step. North8000 (talk) 18:08, 18 September 2019 (UTC)
 * As a non-expert, I've gone through the first para and made a few changes to the wording. Please check. In my view, the language could be more precise in a few places. For example: "The term backpropagation strictly refers only to the algorithm for computing the gradient, but is often used loosely to refer to the entire learning algorithm, including how the gradient is used, such as by stochastic gradient descent." It's a winding, complex sentence, so could be kinder to less-informed readers like me. But critically, by this start to the second para, I'm still unsure whether "the algorithm" is a single identifiable algorithm for universal use, or a class of algorithms that are developed for particular contexts. I should not be unsure by that stage. Tony (talk)  23:46, 4 January 2020 (UTC)
 * Thanks, your perspective is very useful, and your tweaks and comments are appreciated!
 * I've tweaked a bit more, so it's hopefully more useful. The efficiency is the key point, but I realize now that it was mentioned only in passing, so I've elaborated and emphasized briefly.
 * —Nils von Barth (nbarth) (talk) 03:23, 5 January 2020 (UTC)
 * —Nils von Barth (nbarth) (talk) 03:23, 5 January 2020 (UTC)

I'm not expert enough to judge, but it looks good and thank you to both of you. This is the kind of work needed on these articles. North8000 (talk) 13:48, 7 January 2020 (UTC)

Sigmoid function?

 * Added title. —Nils von Barth (nbarth) (talk) 23:53, 8 March 2020 (UTC)

The lead makes no mention of the sigmoid function; but isn't that critical to understanding "e.g. 0.1, 0.7, 0.2", which occurs soon after the lead? Tony (talk)  02:47, 5 March 2020 (UTC)
 * Probably not critical. It used to be common to use the logistic sigmoid to get outputs between 0 and 1 for class probabilities, but nowadays the use of a "softmax" layer is more commom for that.  The sigmoid would be great for definiteness, but it's certainly not required or unique. Dicklyon (talk) 07:01, 5 March 2020 (UTC)
 * Thx. Tony (talk)  07:03, 5 March 2020 (UTC)

Imprecise language at the end of the section Finding the Derivative of the Error.
The text reads:

The new $$\Delta w_{ij}$$ is added to the old weight, and the product of the learning rate and the gradient, multiplied by $$-1$$ guarantees that $$w_{ij}$$ changes in a way that always decreases $$E$$.

This is false for any reasonable definition of "guarantees". E.g. consider if the present value of $$w_{ij}$$ is a local minimum. At such a location any change to $$w_{ij}$$ will result in an increase in error, while the gradient at that value will be zero.

The text further says:

In other words, in the equation immediately below, $$- \eta \frac{\partial E}{\partial w_{ij}}$$ always changes $$w_{ij}$$ in such a way that $$E$$ is decreased:

This is also incorrect for any definition of "always" since, with large values of $$\eta$$, it is possible to overshoot a local minimum an increase the error.

Let's tighten up this language.

Skremer (talk) 17:48, 18 May 2020 (UTC)

Problem with equation before equation 5
Hi All,

I think there is a problem with the equation $$ \frac{\partial E}{\partial w_{ij}} = o_i \delta_j$$ because the index $$i$$ and $$j$$ are interchaged. The correct expression could be $$ \frac{\partial E}{\partial w_{ij}} = o_j \delta_i$$

The source of the problem may be in Eq.1 $$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_{j}} \frac{\partial o_{j}}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial\text{net}_j} \frac{\partial \text{net}_j}{\partial w_{ij}}$$

And the correct expression could be $$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_{i}} \frac{\partial o_{i}}{\partial w_{ij}} = \frac{\partial E}{\partial o_i} \frac{\partial o_i}{\partial\text{net}_i} \frac{\partial \text{net}_i}{\partial w_{ij}}$$

Please, look at the book Bishop, C. M. (2006). Pattern recognition and machine learning (1st ed.). Springer-Verlag New York, p. 243 at the equation 5.53 --Memo26167 (talk) 02:08, 9 August 2021 (UTC)

Introduction is way, way, way too complicated
Wikipedia has this habit of writing introductions to mathematical topics as if the reader is a Math graduating student.

Why can't the introduction just start saying that backpropagation at its core uses iteractive methods like inverse gradient method:

Xk+1 = Xk - ε .ᐁf(Xk)

to atune the parameters X={x1, x2, ..., xn} of a neural neural network, so that it minimizes an error function f(X), and that the basis of it is rather simple, but implementation can be complicated due to troubles of calculating ᐁf(Xk)? Instead one gets right out of the bat: "stockastic gradient method" "g(X) = lenghty formula". These things are very confusing and discouraging to those like me that don't want to be mathematicians.

I read the article in Wikipedia first, understood nothing of it, then went to https://codesachin.wordpress.com/2015/12/06/backpropagation-for-dummies/, and it was completely clear to me, I wrote my first neural network the same day. That was almost 10 years ago. As time passes Wikipedia is becoming less and less of an asset to the public and more and more an obscure "mathematician's pub (and physicist's pub too, btw)". I write this with the greatest love, as I love Wikipedia and I think it is (and should increasingly be) an asset to the general population. 136.54.50.251 (talk) 04:49, 30 April 2024 (UTC)