User:Chakazul/AI

Functions and Partial Derivatives
$$\begin{array}{lcl} C = \displaystyle{1 \over m} \mathrm{sum}(\mathcal{L}(\hat Y, Y)) & \Rightarrow & \displaystyle{\partial C \over \partial \hat Y} = \hat Y - Y \\ A = g(Z) = \begin{cases} 1/(1+e^{-Z}) \\ \tanh(Z) \\ \max(0, Z) \\ e^Z / \sum e^Z \end{cases} & \Rightarrow & \displaystyle{\partial A \over \partial Z} = g'(Z) = \begin{cases} A (1 - A) & \mathsf{.. sigmoid} \\ 1 - A^2 & \mathsf{.. tanh} \\ 0 \text{ or } 1 & \mathsf{.. ReLU} \\ A (1 - A) & \mathsf{.. softmax\ (Jacobian\ diagonal)} \end{cases} \\ Z = W A_\ominus + b \, 1 & \Rightarrow & \displaystyle{\partial Z \over \partial A_\ominus} = W \quad \Bigm| \quad \displaystyle{\partial Z \over \partial W} = A_\ominus \quad \Bigm| \quad \displaystyle{\partial Z \over \partial b} = 1 \\ \end{array}$$
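The activation functions and their derivatives can be sketched in NumPy. This is a minimal sketch (function names are my own); each derivative is written in terms of the activation $A = g(Z)$ where the table does so:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def d_sigmoid(A):
    # g'(Z) expressed via the activation A = g(Z)
    return A * (1.0 - A)

def d_tanh(A):
    return 1.0 - A**2

def relu(Z):
    return np.maximum(0.0, Z)

def d_relu(Z):
    # 0 or 1, depending on the sign of Z
    return (Z > 0).astype(Z.dtype)

def softmax(Z):
    # column-wise softmax; shift by the column max for numerical stability
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)
```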

Chain Rule
$$\begin{array}{l} \displaystyle{\partial C \over \partial W} = \Bigl[ \Bigl[ \displaystyle{\partial C \over \partial \hat Y} \displaystyle{\partial \hat Y \over \partial Z_L} \Bigr] \cdots \displaystyle{\partial Z_\oplus \over \partial A} \displaystyle{\partial A \over \partial Z} \Bigr] \displaystyle{\partial Z \over \partial W} = \Bigl[ W_\oplus^T \cdots \Bigl[ (\hat Y - Y) \odot g'(Z_L) \Bigr] \cdots \odot g'(Z) \Bigr] A_\ominus^T \\ \displaystyle{\partial C \over \partial b} = \Bigl[ \Bigl[ \displaystyle{\partial C \over \partial \hat Y} \displaystyle{\partial \hat Y \over \partial Z_L} \Bigr] \cdots \displaystyle{\partial Z_\oplus \over \partial A} \displaystyle{\partial A \over \partial Z} \Bigr] \displaystyle{\partial Z \over \partial b} = \Bigl[ W_\oplus^T \cdots \Bigl[ (\hat Y - Y) \odot g'(Z_L) \Bigr] \cdots \odot g'(Z) \Bigr] \\ \end{array}$$
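The bracketed products can be evaluated as a single backward sweep over the layers. A minimal NumPy sketch (the list layout and the $1/m$ averaging over examples are my assumptions, not part of the formulas):

```python
import numpy as np

def backward(W, A, Y, dg):
    """Backward sweep over L layers.
    W: [W_1 .. W_L]; A: [A_0 .. A_L] from the forward pass (columns = examples);
    dg: activation derivative, written as a function of A = g(Z).
    Returns lists (dW, db) of per-layer gradients."""
    m = Y.shape[1]
    L = len(W)
    E = (A[L] - Y) * dg(A[L])                  # (Yhat - Y) . g'(Z_L)
    dW, db = [None] * L, [None] * L
    for l in range(L, 0, -1):
        dW[l-1] = (E @ A[l-1].T) / m           # [ .. ] A_prev^T
        db[l-1] = E.sum(axis=1, keepdims=True) / m
        if l > 1:
            E = (W[l-1].T @ E) * dg(A[l-1])    # W_next^T [ .. ] . g'(Z)
    return dW, db
```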

Weight / Bias Update (Gradient Descent)
$$\begin{array}{ll} \Delta W = - \alpha \displaystyle{\partial C \over \partial W} & \quad W = W + \Delta W \\ \Delta b = - \alpha \displaystyle{\partial C \over \partial b} & \quad b = b + \Delta b \\ \end{array}$$
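In code the update is one line per parameter. A minimal sketch (`gradient_step` and the default `alpha` are my own naming, with `alpha` the learning rate $\alpha$):

```python
import numpy as np

def gradient_step(W, b, dW, db, alpha=0.1):
    # Delta W = -alpha * dC/dW, then W = W + Delta W (likewise for b)
    return W - alpha * dW, b - alpha * db
```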

Examples
$$\begin{array}{l} {\partial C \over \partial W_2} = {\partial C \over \partial A_2} {\partial A_2 \over \partial Z_2} {\partial Z_2 \over \partial W_2} = \Bigl[ (A_2 - Y) \odot g'(Z_2) \Bigr] A_1^T \\ {\partial C \over \partial b_2} = {\partial C \over \partial A_2} {\partial A_2 \over \partial Z_2} {\partial Z_2 \over \partial b_2} = (A_2 - Y) \odot g'(Z_2) \\ {\partial C \over \partial W_1} = {\partial C \over \partial A_2} {\partial A_2 \over \partial Z_2} {\partial Z_2 \over \partial A_1} {\partial A_1 \over \partial Z_1} {\partial Z_1 \over \partial W_1} = \Bigl[ W_2^T \Bigl[ (A_2 - Y) \odot g'(Z_2) \Bigr] \odot g'(Z_1) \Bigr] A_0^T \\ {\partial C \over \partial b_1} = {\partial C \over \partial A_2} {\partial A_2 \over \partial Z_2} {\partial Z_2 \over \partial A_1} {\partial A_1 \over \partial Z_1} {\partial Z_1 \over \partial b_1} = W_2^T \Bigl[ (A_2 - Y) \odot g'(Z_2) \Bigr] \odot g'(Z_1) \\ \end{array}$$
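The two-layer example can be verified numerically. A minimal sketch, assuming a tanh activation, a squared-error loss $\mathcal{L} = {1 \over 2} \|\hat Y - Y\|^2$ (so that ${\partial C \over \partial \hat Y} \propto \hat Y - Y$), and $1/m$ averaging over the examples; the shapes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# columns are examples, as in Z = W A_prev + b 1
n0, n1, n2, m = 3, 4, 2, 5
A0 = rng.standard_normal((n0, m))
Y  = rng.standard_normal((n2, m))
W1, b1 = rng.standard_normal((n1, n0)), np.zeros((n1, 1))
W2, b2 = rng.standard_normal((n2, n1)), np.zeros((n2, 1))

g  = np.tanh
dg = lambda A: 1.0 - A**2        # g'(Z) in terms of A = g(Z)

# forward pass
Z1 = W1 @ A0 + b1; A1 = g(Z1)
Z2 = W2 @ A1 + b2; A2 = g(Z2)

# backward pass: the four example formulas, averaged over m examples
E2  = (A2 - Y) * dg(A2)                    # (A2 - Y) . g'(Z2)
dW2 = (E2 @ A1.T) / m
db2 = E2.sum(axis=1, keepdims=True) / m
E1  = (W2.T @ E2) * dg(A1)                 # W2^T [ .. ] . g'(Z1)
dW1 = (E1 @ A0.T) / m
db1 = E1.sum(axis=1, keepdims=True) / m
```

A finite-difference check on any single entry of $W_1$ or $W_2$ confirms the gradients.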

Remarks

 * $$\Box_\ominus \equiv \Box_{\ell-1}$$ is the matrix of the previous layer, $$\Box_\oplus \equiv \Box_{\ell+1}$$ is that of the next layer; otherwise $$\Box \equiv \Box_{\ell}$$ implicitly refers to the current layer
 * $$g$$ is the activation function (e.g. sigmoid, tanh, ReLU)
 * $$\odot$$ is the element-wise product
 * $$\Box^{\circ 2}$$ is the element-wise power
 * $$\mathrm{sum}(\Box)$$ is the sum of the matrix's elements
 * $${\partial \over \partial \Box}$$ is the matrix derivative
 * Variations:
   * All matrices transposed, matrix multiplications in reverse order (row vectors instead of column vectors)
   * $$W, b$$ combined into one parameter matrix $$\Theta$$
   * No $$\odot g'(Z_L)$$ term in the output-layer error $$E_L$$, e.g. when a softmax or sigmoid output is paired with the cross-entropy loss, so that $$E_L = \hat Y - Y$$ directly