User:AI456/sandbox

Derivation
Since backpropagation uses the gradient descent method, one needs to calculate the derivative of the squared error function with respect to the weights of the network. The squared error function is (the $$\frac{1}{2}$$ term is added to cancel the exponent when differentiating):
 * $$ E = \frac{1}{2}(t - y)^2 $$

where $$t$$ is the desired (target) output and $$y$$ is the actual output of the neuron.

Therefore the error, $$E$$, depends on the output $$y$$. However, the output $$y$$ depends on the weighted sum of all its inputs:
 * $$ y = net = \sum_{i=1}^{n} w_i x_i $$

The above formula only holds true for a neuron with a linear activation function (that is, the output is solely the weighted sum of the inputs). In general, a non-linear, differentiable activation function, $$\varphi$$, is used. Thus, more correctly:
 * $$ y = \varphi(net) = \varphi\left(\sum_{i=1}^{n} w_i x_i\right) $$

This lays the groundwork for calculating the partial derivative of the error with respect to a weight $$w_i$$ using the chain rule:
 * $$ \frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y} \frac{dy}{dnet} \frac{\partial net}{\partial w_i} $$

Since the weighted sum $$net$$ is just the sum over all products $$w_i x_i$$, the partial derivative of the sum with respect to a weight $$w_i$$ is just the corresponding input $$x_i$$. Similarly, the partial derivative of the sum with respect to an input value $$x_i$$ is just the weight $$w_i$$:
 * $$ \frac{\partial net}{\partial w_i} = x_i, \qquad \frac{\partial net}{\partial x_i} = w_i $$

The derivative of the output $$y$$ with respect to the weighted sum $$net$$ is simply the derivative of the activation function $$\varphi$$:
 * $$\frac{dy}{dnet} = \frac{d}{dnet}\varphi $$
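The partial derivatives above can be checked numerically. The sketch below (the weights and inputs are invented for illustration) confirms by finite differences that $$\partial net / \partial w_i = x_i$$:

```python
# Numerical check that d(net)/d(w_i) = x_i for net = sum_i w_i * x_i.
# The example weights and inputs here are made up for illustration.
def net(w, x):
    """Weighted sum of inputs: net = sum_i w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))

w = [0.5, -0.3, 0.8]
x = [1.0, 2.0, -1.5]
eps = 1e-6

for i in range(len(w)):
    w_plus = list(w)
    w_plus[i] += eps  # perturb only the i-th weight
    numeric = (net(w_plus, x) - net(w, x)) / eps
    # The finite-difference derivative should match the input x_i.
    assert abs(numeric - x[i]) < 1e-4
```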

This is the reason why backpropagation requires the activation function to be differentiable. A commonly used activation function is the logistic function:
 * $$ y = \frac{1}{1+e^{-z}}$$

which has a nice derivative of:
 * $$ \frac {dy}{dz} = y(1-y) $$

For example purposes, assume the network uses a logistic activation function, in which case the derivative of the output $$y$$ with respect to the weighted sum $$net$$ is the same as the derivative of the logistic function:
 * $$ \frac {dy}{dnet} = y(1-y) $$
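As a quick sanity check of the $$y(1-y)$$ form, the sketch below compares it against a central finite difference of the logistic function at an arbitrary point:

```python
import math

# Logistic activation and its derivative in the y(1 - y) form derived above.
def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_deriv(z):
    y = logistic(z)
    return y * (1.0 - y)

# Central finite-difference check at an arbitrary point z = 0.7.
z, eps = 0.7, 1e-6
numeric = (logistic(z + eps) - logistic(z - eps)) / (2 * eps)
assert abs(numeric - logistic_deriv(z)) < 1e-8
```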

Finally, the derivative of the error $$E$$ with respect to the output $$y$$ is:
 * $$ \frac{\partial E}{\partial y} = \frac{\partial}{\partial y} \frac{1}{2}(t - y)^2 = y - t $$

Putting it all together:
 * $$ \frac{\partial E}{\partial w_i} = (y - t)\, y (1 - y)\, x_i $$

If one were to use a different activation function, the only difference would be that the $$y (1 - y)$$ term is replaced by the derivative of the newly chosen activation function.
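The assembled gradient can be verified against a central finite difference of the error for a single logistic neuron. The weights, inputs, and target in this sketch are invented for illustration:

```python
import math

# Verify dE/dw_i = (y - t) * y*(1 - y) * x_i for one logistic neuron.
def forward(w, x):
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-net))

def error(w, x, t):
    return 0.5 * (t - forward(w, x)) ** 2

w, x, t = [0.2, -0.4], [1.0, 0.5], 1.0
y = forward(w, x)
analytic = [(y - t) * y * (1 - y) * xi for xi in x]

eps = 1e-6
for i in range(len(w)):
    w_hi = list(w); w_hi[i] += eps
    w_lo = list(w); w_lo[i] -= eps
    numeric = (error(w_hi, x, t) - error(w_lo, x, t)) / (2 * eps)
    # Analytic and numeric gradients should agree closely.
    assert abs(numeric - analytic[i]) < 1e-7
```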

To update the weight $$w_i$$ using gradient descent, one must choose a learning rate, $$\alpha$$. The change in weight is then the negative of the product of the learning rate and the gradient:
 * $$ \Delta w_i = - \alpha \frac{\partial E}{\partial w_i} = \alpha (t - y)\, y (1 - y)\, x_i $$

For a linear neuron, the derivative of the activation function $$ \varphi $$ is 1, which yields:
 * $$ \Delta w_i = \alpha (t - y) x_i $$

This is exactly the delta rule for perceptron learning, which is why the backpropagation algorithm can be seen as a generalization of the delta rule. In both backpropagation and perceptron learning, when the output $$y$$ matches the desired output $$t$$, the change in weight $$ \Delta w_i $$ is zero, which is exactly what is desired.
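The delta rule for a linear neuron can be sketched in a few lines. The learning rate, inputs, and target below are invented for illustration; repeated updates drive the output toward the target, and once $$y = t$$ the update is zero:

```python
# Sketch of the delta rule for a single linear neuron.
def delta_rule_step(w, x, t, alpha=0.1):
    """One gradient-descent update: w_i <- w_i + alpha * (t - y) * x_i."""
    y = sum(wi * xi for wi, xi in zip(w, x))  # linear activation: y = net
    return [wi + alpha * (t - y) * xi for wi, xi in zip(w, x)]

# Repeated updates shrink the error (t - y) toward zero.
w, x, t = [0.0, 0.0], [1.0, 2.0], 3.0
for _ in range(50):
    w = delta_rule_step(w, x, t)
y = sum(wi * xi for wi, xi in zip(w, x))
assert abs(y - t) < 1e-3  # output has converged to the target
```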

The result may converge to a local minimum
The "hill climbing" strategy of gradient descent is guaranteed to find the global minimum only if the error surface has a single minimum. However, the error surface often has many local minima and maxima. If the starting point of the gradient descent happens to lie between a local maximum and a local minimum, then following the direction of the most negative gradient will lead to that local minimum, which may not be the global one.

Solution: Scale the inputs to have zero mean over the training set
Consider the following training examples:
 * (101, 101) -> 2
 * (101, 99) -> 0
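Centering can be sketched directly on the two training pairs above: subtracting the per-dimension mean over the training set leaves small, zero-mean inputs, which keeps the weighted sums (and hence the gradients) well scaled:

```python
# Center each input dimension to zero mean over the training set.
# Training pairs taken from the example above.
inputs = [(101.0, 101.0), (101.0, 99.0)]
targets = [2.0, 0.0]

n = len(inputs)
# Per-dimension mean over the training set.
means = [sum(row[j] for row in inputs) / n for j in range(len(inputs[0]))]
# Subtract the mean from every input component.
centered = [tuple(v - m for v, m in zip(row, means)) for row in inputs]

assert means == [101.0, 100.0]
assert centered == [(0.0, 1.0), (0.0, -1.0)]
```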