Talk:Softmax function

Probability distribution
"to normalize the output of a network to a probability distribution over predicted output classes. " Maybe I am misunderstanding something but you cannot just normalize something to a probability distribution. Not everything that takes a set and assigns values in [0,1] which sum to 1 is a probability distribution. It is not even clear what the probability space would be. — Preceding unsigned comment added by SmnFx (talk • contribs) 07:56, 22 October 2020 (UTC)

Origin
To my knowledge, the softmax function was first proposed in
 * J. S. Bridle, “Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition,” in Neurocomputing, F. F. Soulié and J. Hérault, Eds. Springer Berlin Heidelberg, 1990, pp. 227–236.

and
 * J. S. Bridle, “Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters,” in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. Morgan-Kaufmann, 1990, pp. 211–217.

Would it be appropriate to mention these publications, or at least their author, in the article? --131.152.137.39 (talk) 13:40, 13 November 2015 (UTC)
 * Yes, appropriate to mention the publications, certainly. (No point mentioning the author without the publications.) --mcld (talk) 15:12, 8 March 2017 (UTC)

Family of functions
From reading Serre 2005, it sounds as though there are multiple definitions of softmax. Is this the case? What other representations are there? It looks like Riesenhuber and Poggio, 1999b and Yu et al., 2002, as referenced in Serre 2005, might give clues. JonathanWilliford (talk) 23:53, 11 August 2009 (UTC)
 * There may have been multiple definitions, but this is the only one I encounter in computer science literature, so perhaps it has achieved consensus name-wise. --mcld (talk) 15:13, 8 March 2017 (UTC)

Content
Is this acceptable content? 95% of the article is a direct copy of the content from. Jludwig (talk) 06:23, 10 May 2008 (UTC)


 * I agree with this concern. I am deleting most of the content of the article, per the concern of copyright violation. 128.197.81.32 (talk) 22:00, 30 July 2008 (UTC)

Furthermore, most of the terms are not defined. You can't just copy a bunch of equations into a document and not define the terms; it's terrible. People with a background in stats can guess what most of the variables mean, but regardless, I hope whoever wrote this will return and define every variable that appears in an expression. Chafe66 (talk) 22:40, 29 October 2015 (UTC)

Derivation
What about the derivation of the softmax function? —Preceding unsigned comment added by 78.34.250.44 (talk) 21:28, 13 March 2009 (UTC)

John D Cook's definition is different from all of these
I followed the external link to the description of softmax as a substitute for maximum by John D. Cook. There, the softmax is described as
 * $$ \log( \sum_{j=1}^n\exp(q_j) ) $$

not
 * $$ \frac{\exp(q_i)}{\sum_{j=1}^n\exp(q_j)} $$

as in Wikipedia. His version makes more sense to me. Can anyone corroborate this? I think the article needs fixing, but since there seem to be multiple definitions, it's hard to be clear. --mcld (talk) 15:31, 2 January 2014 (UTC)


 * I saw this on Cook's blog and I was highly surprised. Apparently this is a completely different function that is also called the softmax; I've never seen it in use. Q VVERTYVS (hm?) 15:28, 5 February 2014 (UTC)


 * I checked Cook's blog post again, and found that he doesn't even call his function softmax; he calls it "soft maximum". Removed the link as it's quite unrelated. Q VVERTYVS (hm?) 17:36, 5 February 2014 (UTC)

Cook's post is very informative on the smooth maximum. There seems to be no natural place for a smooth maximum subheading in the softmax article, so I am moving it to a new page. — Preceding unsigned comment added by Yodamaster1 (talk • contribs) 16:50, 20 February 2015 (UTC)
 * Thanks for moving that content to Smooth maximum. I also discovered the page LogSumExp, and I think those two should be merged.--mcld (talk) 15:27, 8 March 2017 (UTC)


 * Should someone mention that the gradient of the LogSumExp is the softmax?
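 * A quick numerical sanity check of that claim, for anyone who wants to verify it before adding it to the article. This is only a sketch: the function names, the test vector and the step size h are my own choices, not taken from any source. It compares a central-difference gradient of log(sum(exp(q_j))) against the softmax formula quoted above.

```python
import math

def logsumexp(q):
    """log(sum_j exp(q_j)), shifted by max(q) to avoid overflow."""
    m = max(q)
    return m + math.log(sum(math.exp(q_j - m) for q_j in q))

def softmax(q):
    """exp(q_i) / sum_j exp(q_j) for each i, with the same max shift."""
    m = max(q)
    exps = [math.exp(q_j - m) for q_j in q]
    total = sum(exps)
    return [e / total for e in exps]

def numerical_gradient(f, q, h=1e-6):
    """Central-difference estimate of the gradient of f at q (illustrative helper)."""
    grad = []
    for i in range(len(q)):
        up = q[:i] + [q[i] + h] + q[i + 1:]
        down = q[:i] + [q[i] - h] + q[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

q = [0.5, -1.0, 2.0]                      # arbitrary example input
grad = numerical_gradient(logsumexp, q)   # gradient of LogSumExp at q
sm = softmax(q)                           # softmax of the same q
max_gap = max(abs(g - s) for g, s in zip(grad, sm))
print(max_gap)
```

The two vectors should agree up to finite-difference error, which is what differentiating log(sum_j exp(q_j)) with respect to q_i by hand gives as well.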

Possible?
Could the explanation of how the function works possibly be any more incomprehensible?

Apparently, there's an extremely well-developed culture on Wikipedia in which everyone is expected to know a bunch of inscrutable variable-name conventions. Either that, or writers really are convinced that those are conventions as solidly established as "+", "-", "x", etc. I mean, not even "n" (usually used to mean "number of elements") is conventional enough in many cases (especially considering how often it is used with other meanings too).

I'm really fed up with this, and this article is among the poorest examples of this that I've found so far. — Preceding unsigned comment added by 151.227.23.87 (talk) 17:27, 1 August 2015 (UTC)

The definition of the softmax as provided on Wikipedia simply does not make sense. The output of the softmax cannot possibly be the cube (0,1)^k, as (0.8, 0.8, 0.8, 0.8, 0.8, ..., 0.8) is in the cube but is not the output of the softmax. Someone fix this.

The definition has been changed to an even more wrong version. How can \sigma be a function as well as a vector in R^N at the same time? This is the worst article on Wikipedia by far. — Preceding unsigned comment added by 111.224.214.171 (talk) 03:28, 28 May 2018 (UTC)

The hyperbolic tangent function is almost linear near the mean, but has a slope of half that of the sigmoid function.
The subject sentence does not appear to be correct. Near x=0, tanh(x) has a derivative of 1.0 and the sigmoid 1/(1+exp(-x)) has a derivative of 0.25, so the slope of tanh is four times that of the sigmoid, not half. — Preceding unsigned comment added by 129.34.20.23 (talk) 19:43, 14 June 2017 (UTC)
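The slopes at zero are easy to check numerically. The following is a minimal sketch; the helper names and the step size h are my own illustrative choices. It estimates both derivatives at x = 0 by central differences and also checks the identity tanh(x) = 2·sigmoid(2x) − 1, which is where the factor of four between the slopes comes from.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def central_diff(f, x, h=1e-6):
    """Central-difference estimate of f'(x) (illustrative helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)

tanh_slope = central_diff(math.tanh, 0.0)     # slope of tanh at 0
sigmoid_slope = central_diff(sigmoid, 0.0)    # slope of the sigmoid at 0
print(tanh_slope, sigmoid_slope)

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
x = 0.7                                       # arbitrary test point
identity_gap = abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0))
print(identity_gap)
```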

Possible Errors in the Second Equation on the Page
As of 2017-01-20, the page has:


 * $$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}$$   for j = 1, …, K.

On the left side, why is j outside the parentheses?

On the right side, underneath, why is the counter k instead of j?

Maybe the equation should read:


 * $$\sigma(\mathbf{z_j}) = \frac{e^{z_j}}{\sum_{j=1}^K e^{z_j}}$$   for j = 1, …, K.  — Preceding unsigned comment added by 216.10.188.57 (talk) 07:48, 20 January 2018 (UTC)


 * The original is correct. The left side means that the softmax function takes a vector as input and returns a vector, and that we are looking at the jth component of that output. The right side means: take e to the power of the jth component of the input, and divide it by the sum, over every component k of the input (the jth included), of e to the power of that component. The index k in the denominator has to differ from j because j is fixed while k runs over all K components. Hozelda (talk) 09:50, 8 September 2020 (UTC)
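 * If it helps, here is the original formula written out as a short sketch in code. The function name, the example vector and the max-subtraction trick are my own additions for illustration. The input z is a whole vector, the output is a whole vector, and j only picks out one component of that output, while k is the separate counter used for the sum in the denominator.

```python
import math

def softmax(z):
    """Map a vector z of K real numbers to a vector of K values that sum to 1."""
    # Subtracting max(z) does not change the result but avoids overflow in exp.
    m = max(z)
    exps = [math.exp(z_k - m) for z_k in z]   # e^{z_k} for k = 1, ..., K
    total = sum(exps)                         # the denominator: sum over all k
    return [e / total for e in exps]          # component j is e^{z_j} / total

z = [1.0, 2.0, 3.0]     # arbitrary example input with K = 3
out = softmax(z)        # sigma(z), a vector
j = 0                   # Python counts from 0, so this is the first component
print(out[j])           # sigma(z)_j, a single number
print(sum(out))         # the components always sum to 1 (up to rounding)
```

Because the components sum to 1, a point such as (0.8, 0.8, ..., 0.8) can never be an output, which is the objection raised in an earlier thread on this page.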

Flagged as too technical and lacking context
I agree with other readers who have noted that this article is close to incomprehensible. The references

https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d

and

https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax

are far more comprehensible. Until I paraphrase and integrate the content there with that on this page (with appropriate citations), I have flagged the problems with this article as a warning to those who actually hope to learn something from it. - Prakash Nadkarni (talk) 06:00, 10 December 2018 (UTC)

A function-weighted average, using exp
So, why does YOUR field need a whole journal? — Preceding unsigned comment added by 129.93.68.165 (talk) 18:42, 30 November 2021 (UTC)

Hierarchical softmax?
I realise that "hierarchical softmax" isn't actually softmax, but it is used to replace softmax for efficiency in machine learning contexts. Maybe there should be some clarification. I know I'm confused! akay (talk) 16:31, 23 December 2021 (UTC)