
In mathematics, the softmax function, softargmax function, or normalized exponential function, frequently denoted by $$\sigma(\mathbf{z})$$, takes a vector of real numbers and normalizes it into a probability distribution. That is, after applying softmax, the relative order of the vector elements is preserved, but each element lies between 0 and 1 and the elements sum to 1. The term softmax comes from the fact that it is a continuous, or "soft", variant of the argmax function, which returns the index of the largest element of a vector. The standard (unit) softmax function applies the standard exponential function to each element and divides by the sum of all the exponentiated elements, which acts as a normalizing constant:


 * $$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}$$   for j = 1, …, K.

For example, the vector $$[3,-1,2]$$ becomes $$\sigma([3,-1,2])\approx [0.72,0.01,0.27]$$ after applying the standard softmax function. Softmax is often used in machine learning to map the output of a neural network to a vector of probabilities over the output classes.
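The worked example above can be reproduced with a minimal sketch in Python (the function name `softmax` is illustrative; subtracting the maximum is a standard trick and does not change the result):

```python
import math

def softmax(z):
    """Standard softmax: exponentiate each element, then normalize.

    Subtracting max(z) before exponentiating leaves the result unchanged
    (the common factor e^{-max(z)} cancels in the ratio) but avoids
    floating-point overflow for large inputs.
    """
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([3, -1, 2]))  # ≈ [0.72, 0.01, 0.27]
```

Note that the output preserves the ordering of the input: the largest score (3) receives the largest probability.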

Domain and range

 * $$\sigma\colon \mathbb{R}^K \to \left\{\sigma \in \mathbb{R}^K \,\middle|\, \sigma_i > 0, \sum_{i = 1}^K \sigma_i = 1 \right\} $$
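The stated range can be checked empirically: a sketch (assuming the `softmax` definition below is the standard one) that verifies every output component is strictly positive and the components sum to 1 for arbitrary real inputs:

```python
import math
import random

def softmax(z):
    # Standard softmax with the usual max-shift for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Any real vector maps into the interior of the probability simplex:
# every component strictly positive, components summing to 1.
random.seed(0)
for _ in range(100):
    z = [random.uniform(-50, 50) for _ in range(5)]
    p = softmax(z)
    assert all(pi > 0 for pi in p)
    assert abs(sum(p) - 1.0) < 1e-9
```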

Bases
Instead of $e$, a different base $b > 0$ can be used. Writing $$b = e^\beta$$ or $$b = e^{-\beta}$$ for some real $\beta$ recovers the exponential form below. Positive $\beta$ means the highest score receives the highest probability; this is called the "maximum convention" and is usual in machine learning. Negative $\beta$ (equivalently, using $e^{-\beta}$) corresponds to the minimum convention and is conventional in thermodynamics, where the lowest energy state has the highest probability; this matches the convention of the Gibbs distribution, interpreting $\beta$ as coldness. The notation $\beta$ comes from the thermodynamic beta, which is the inverse temperature: $$\beta = 1/T$$, $$T = 1/\beta$$. This yields the expressions:


 * $$\sigma(\mathbf{z})_j = \frac{e^{\beta z_j}}{\sum_{k=1}^K e^{\beta z_k}}$$   or    $$\sigma(\mathbf{z})_j = \frac{e^{-\beta z_j}}{\sum_{k=1}^K e^{-\beta z_k}}$$    for j = 1, …, K.

In some fields, the base is fixed, corresponding to a fixed scale, while in others the parameter $\beta$ (or $b$) is varied.
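The effect of the parameter $\beta$ can be seen in a short sketch (the function name `softmax_beta` is illustrative): larger positive $\beta$ concentrates probability mass on the largest score, while negative $\beta$ gives the minimum convention, favoring the smallest score.

```python
import math

def softmax_beta(z, beta=1.0):
    """Softmax with inverse temperature beta (maximum convention for beta > 0).

    beta = 1 recovers the standard softmax; beta < 0 gives the minimum
    convention, in which the smallest score gets the highest probability.
    """
    m = max(beta * v for v in z)          # shift for numerical stability
    exps = [math.exp(beta * v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [3, -1, 2]
print(softmax_beta(z, beta=1.0))    # the standard softmax, ≈ [0.72, 0.01, 0.27]
print(softmax_beta(z, beta=10.0))   # sharper: mass concentrates on the max score
print(softmax_beta(z, beta=-1.0))   # minimum convention: favors the smallest score
```

In the temperature picture, $\beta \to \infty$ (temperature $T \to 0$) approaches the hard argmax, while $\beta \to 0$ (high temperature) approaches the uniform distribution.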