Talk:Multinomial logistic regression

2007
This page appears to fail to recognize the existence of ordered multinomial responses, no? Pdbailey 02:15, 12 June 2007 (UTC)

I corrected a mistake in the denominator
In the Model, we better have the probabilities adding to 1:
 * $$\sum_{j=0}^{J} Pr(y_i=j) = 1$$

That means the denominator needs to be:
 * $$1+\sum_{j=1}^{J}\exp(X_{i}\beta_{j})$$

and not just
 * $$\sum_{j=1}^{J}\exp(X_{i}\beta_{j})$$

The 1 here is the contribution from $$Pr(y_i=0)$$.

By the way, I'm used to thinking of $$X_i$$ as a vector, in which case $$X$$ is a matrix, $$\beta_j$$ is a vector, and $$X_i \beta_j$$ is a dot product. The way these equations are written implies that $$X_i$$ and $$\beta_j$$ are scalars, which is okay if you are predicting y from a single scalar X. I found that a bit confusing and had to stare at it a bit longer than usual to parse it. Some clarification about this might be helpful in the article.
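For anyone who wants to check the arithmetic, here is a minimal sketch of the base-category form with $$X_i$$ and $$\beta_j$$ treated as vectors. The function name, the regressor values and the coefficients are all made up for illustration; nothing here comes from the article.

```python
import math

def baseline_logit_probs(x, betas):
    """Return [Pr(y=0), Pr(y=1), ..., Pr(y=J)] for one observation.

    x     : regressor vector X_i (hypothetical values)
    betas : J coefficient vectors beta_1..beta_J; category 0 is the base,
            so its linear predictor is fixed at 0 and contributes exp(0) = 1.
    """
    # X_i beta_j as a dot product, one score per non-base category
    scores = [sum(xk * bk for xk, bk in zip(x, beta)) for beta in betas]
    exps = [math.exp(s) for s in scores]
    denom = 1.0 + sum(exps)  # the leading 1 is the base category's term
    return [1.0 / denom] + [e / denom for e in exps]

x_i = [1.0, 0.5]                    # constant term plus one regressor (made up)
betas = [[0.2, -1.0], [-0.3, 0.8]]  # J = 2 non-base categories (made up)
probs = baseline_logit_probs(x_i, betas)
print(probs, sum(probs))
```

With the 1 in the denominator the J+1 probabilities sum to one; if it is dropped while Pr(y_i=0) keeps numerator 1, the total becomes (1+S)/S, where S is the sum of the exponentials, which exceeds one.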

I believe the summation in the denominator is over an incorrect index. If there are K types of probabilistic events and J explanatory variables x1...xJ, the summation in the denominator should be over k, for k=2,3,...,K. The missing k (k=1) is the base probabilistic event. The implicit summation embedded in the vector multiplication XB is over j from j=1 to j=J. —Preceding unsigned comment added by 72.13.225.74 (talk) 17:13, 20 October 2008 (UTC)

Choosing a base does not have much to do with the probabilities adding to one, does it? I mean, if I add the stuff we had before it would still sum to one. I think the problem is one of identification. I will ask at the ref desk though. Brusegadi (talk) 08:17, 5 November 2008 (UTC)


 * The '1's most definitely need to be there, so I've reverted the change that removed them. They can be derived by assuming $$\log \frac{Pr(y_i = j)}{Pr(y_i = 0)} = x_i^\top\beta_j, j=1, \ldots, J$$. 209.2.237.95 (talk) 01:01, 2 March 2011 (UTC)
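For reference, the derivation mentioned in this comment can be sketched as follows. It uses only the log-ratio assumption stated there plus the requirement that the probabilities sum to one; the summation index k is introduced here just to avoid clashing with j.

```latex
\begin{align*}
\Pr(y_i = j) &= \Pr(y_i = 0)\,\exp(x_i^\top\beta_j), \qquad j = 1, \ldots, J \\
1 = \sum_{j=0}^{J} \Pr(y_i = j) &= \Pr(y_i = 0)\Bigl(1 + \sum_{k=1}^{J} \exp(x_i^\top\beta_k)\Bigr) \\
\Pr(y_i = 0) &= \frac{1}{1 + \sum_{k=1}^{J} \exp(x_i^\top\beta_k)}, \qquad
\Pr(y_i = j) = \frac{\exp(x_i^\top\beta_j)}{1 + \sum_{k=1}^{J} \exp(x_i^\top\beta_k)}
\end{align*}
```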

IIA
I changed "If the multinomial logit is used to model choices and the error terms are assumed to be independent, it can violate the assumption of independence of irrelevant alternatives (IIA)." The IIA stems from the independence of the error term. The IIA is violated not by the independence of the error term, but by the non-independence in practice. In any case, as there is no formulation of the MNL with an equation with errors and their distribution, I do not think we should describe the IIA in terms of the structure of the errors, as this will not make the article clearer. The bus example is better. Also, the nested logit does not allow for correlated errors but for different variances of the error term in different subgroups, within which the IIA still holds (see for instance Greene, Econometric Analysis, pp. 847-850). Gpeilon (talk) 20:22, 3 November 2009 (UTC)


 * I am not arguing against removing the bus example, I think we should keep it. I think we disagree about IIA however. I think it is a desirable property that a model can have for discrete choice (see for example the bus example), and that an iid multinomial logit does not have this property. However, allowing for correlated error terms, IIA can be satisfied. Do you agree with that? PDBailey (talk) 20:36, 3 November 2009 (UTC)


 * Hmm, I think there is some confusion here. The Wikipedia article on IIA states a definition that is at odds with, e.g., this article. In the bus example, the ratios of red bus to car change in fact, but in the model they do not. This means that in the model the IIA assumption is violated (some people move from car to bus despite having picked car before the blue bus was added), but the model is supposed to have the IIA property according to the linked preference. PDBailey (talk) 20:42, 3 November 2009 (UTC)
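To make the bus example concrete, here is a small sketch of what a plain multinomial logit predicts when an identical blue bus is added. The utilities are hypothetical (all set to zero so that car and red bus start out equally attractive), and `mnl_shares` is just an illustrative helper name.

```python
import math

def mnl_shares(utilities):
    """Multinomial logit choice probabilities from a dict of utilities."""
    exps = {name: math.exp(v) for name, v in utilities.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

# Hypothetical utilities: car and red bus equally attractive.
before = mnl_shares({"car": 0.0, "red bus": 0.0})
# Add a blue bus that is identical to the red bus in every respect.
after = mnl_shares({"car": 0.0, "red bus": 0.0, "blue bus": 0.0})

ratio_before = before["red bus"] / before["car"]
ratio_after = after["red bus"] / after["car"]
print(before)
print(after)
print(ratio_before, ratio_after)
```

In the model the red-bus-to-car ratio is the same before and after, so the car share has to fall from one half to one third. The intuitive outcome would be that car keeps one half and the two buses split the rest, which is the mismatch being discussed in this thread.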

Visual request
As a student, I think a visual representation might be nice. I would make one myself, but I am just starting to learn R. —Preceding unsigned comment added by Thedreamshaper (talk • contribs) 21:51, 28 October 2010 (UTC)

Re the proposed change to the beginning
There are several problems with the recent edit, so I am reverting it pending a discussion here.

First, please be careful to hit the "Show Preview" button so you can proofread before you save the page. The recent edit contains a stray {, a misspelled link that appears as a non-functioning redlink, and the incomplete sentence "Quite often, this external option is used as the benchmark upon which the comparison."

Regarding the paragraph


 * The MNL model may include an external, undefined option. This option is not assigned the same attribute values as the other outcomes. Quite often, this external option is used as the benchmark upon which the comparison. This is done by setting all attributes of this option to zero. However, it is possible to fix the attribute value of one of the fully characterized options instead. This is often done to avoid collinearity issues between two attributes.

I find this very unclear. What does "external" mean here? How about "undefined"? If we complete the sentence as "Quite often, this external option is used as the benchmark upon which the comparison [is based]", what is the antecedent of "the comparison"? And what does the sentence "However, it is possible to fix the attribute value of one of the fully characterized options instead." mean? For one thing, the term "attribute values" has not yet been defined in the article. Does it mean regressors?


 * The external option is some other choice that can be selected but is outside the scope of the MNL choice model. For example, if one were looking at the movie theater market, trying to study how the movies currently available affect customers' choice of which theater to visit (say there are 2 in a town or something), outside options include going to a sporting event instead or staying home. In economics all such options are lumped into a single choice called the "outside/external option" or "outside/external good", where "outside/external" refers to the fact that such options are outside the scope of the study. Sticking with the example above, sports are external to the movie theater market we're studying.


 * Because these external goods are often characterized by different attributes than the goods in a study, the outside good (generally) enters the model with a utility that is based on a different set of attributes than all other choices. Using the movie theater example, you'd likely want to control for factors that might affect which theater someone prefers, like the quality of the sound systems or the style of seating in the theater, so you can see the effect of the films clearly. These factors, as well as the films available, are all "attributes" of the theaters and serve as "regressors" in estimation. Clearly the combined outside option doesn't have attributes like "sound quality". Instead, the attractiveness of the outside good, i.e., the propensity for someone to choose the outside good, is generally modeled using demographic information about the person making the choice (say, their age) if you have customer segments. Alternatively, many modelers opt to assign the outside good an (expected) utility of "0", thereby anchoring the utilities of all other options to be the utility that good generates for a customer above (or below) consumption of the outside good. This specification of the utility as zero is what I mean by the outside good being a "benchmark".


 * Looking at the discussion of the "comparison" and the sentence you highlight there, I am speaking of the fact that you end up with a utility equation for each possible choice, including the external option, each of which is generally modeled with an intercept term. If you do not fix one of these intercepts to some pre-specified value, you will have infinitely many solutions to the equations you're trying to fit in your estimation and you won't be able to estimate the MNL model. Most people opt to set the external good's intercept to zero, but you could fix one of the other choices' intercepts to zero if you wanted. The "other choices" are those that are not external, referred to in my suggested paragraph as "fully characterized". If you do set the external option's intercept to zero, and that is the only term in your utility function for the outside good (aside from the error term), you will effectively set this good's utility to zero (see prev. para).


 * Your suggested completion of the sentence is great. Margaretpierson (talk) 23:30, 9 September 2012 (UTC)
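The identification point above (infinitely many solutions unless one intercept is fixed) can be illustrated with a short sketch. The intercept values and the shift of 5.0 are hypothetical, and `choice_probs` is an illustrative helper, not anything from the article.

```python
import math

def choice_probs(utilities):
    """Logit choice probabilities; subtracting the max only aids numerical stability."""
    m = max(utilities)
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical intercept-only utilities; the outside option comes first, fixed at 0.
intercepts = [0.0, 1.2, -0.4]
# Add the same constant to every intercept, including the outside option's.
shifted = [a + 5.0 for a in intercepts]

p = choice_probs(intercepts)
q = choice_probs(shifted)
print(p)
print(q)
```

Both sets of intercepts give the same choice probabilities, so the data cannot tell them apart. Pinning one intercept (here the outside option's, at zero) picks a single representative from that family.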

The passage


 * It can also be thought of as the frequency with which an outcome occurs relative to the others described. When multiplied by the number of times an outcome will be realized the MNL model gives the number of times outcome j is realized.

is unclearly worded. What is the semantic relationship between "an outcome occurs" and "an outcome is realized"?--are they intended to be the same, or somehow related? And the term "outcome j" pops up at the end without previously being referred to as such.

The sentence "The model is used in several applications such as marketing and machine learning." makes it look very much narrower in its areas of use than is true -- the lede is better in that regard.

Also, the section heading "Introduction" is not a good idea since the lede at the top of the page is supposed to be the introduction.

In your reason for the citation-needed tag you say


 * Citation needed|reason=please give a reliable source for this assertion. I use the mnl model quite frequently in econometric contexts and I'm not using regressions to estimate the parameters. However, it's quite likely that there is a more general way to view this from a pure-theory perspective.

I don't understand -- if you use it to predict the several possible outcomes based on "attributes", which I interpret to mean explanatory variables, then this is a regression. Can you explain it here?

Thanks for contributing to Wikipedia -- I see that it is your first edit -- and I hope my comments will help you revise your edit. Duoduoduo (talk) 21:05, 28 November 2011 (UTC)

deletion of assumptions/estimation of the intercept sections
An IP recently deleted the "assumptions" and "estimation of the intercept" sections. I strongly disagree that this makes the article better, so I undid this. Any objections? 018 (talk) 14:45, 27 December 2011 (UTC)

Unexplained notation
Some equations in the section Multinomial logistic regression have some parameters $$\beta'$$ which as far as I can see are not defined, and it appears to me that they ought to be just $$\beta$$. Any objection to my removing the prime signs? Duoduoduo (talk) 17:54, 5 February 2013 (UTC)

Clarifying the section "As a set of independent binary regressions"
Perhaps I'm missing something obvious, but the section "As a set of independent binary regressions" starts by talking about doing K-1 regressions and then presents formulas for the log probability ratios such as Yi=1 / Yi=K. If we were to run this as, say, a logistic regression, what would the dependent variable be? 1 if the outcome is 1 and 0 if the outcome is K, but what about Yi=2 to K-1? — Preceding unsigned comment added by Jetopal (talk • contribs) 01:10, 26 January 2014 (UTC)
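One possible reading, which is an assumption here and not something the article states, is that the regression for outcome k against the pivot K uses only the observations whose outcome is k or K, since the log ratio of Pr(Yi=k) to Pr(Yi=K) is the log odds of k conditional on the outcome being one of those two. A sketch of how the K-1 data sets would be built under that reading, with made-up data and a hypothetical helper name:

```python
def binary_subsets(X, y, K):
    """For each k in 1..K-1, keep rows with outcome k or K; code k as 1 and K as 0."""
    subsets = {}
    for k in range(1, K):
        subsets[k] = [
            (x, 1 if label == k else 0)
            for x, label in zip(X, y)
            if label in (k, K)  # rows with any other outcome are left out
        ]
    return subsets

# Made-up data: one regressor, outcomes coded 1..3 with K = 3 as the pivot.
X = [[0.1], [0.4], [0.9], [1.3], [1.8], [2.2]]
y = [1, 2, 3, 1, 3, 2]
subsets = binary_subsets(X, y, K=3)
for k, rows in subsets.items():
    print(k, rows)
```

Under this reading, observations with outcomes 2 to K-1 are simply absent from the regression for outcome 1; each of them appears in its own regression against K.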

Statement about softmax is incorrect
The stated function is not actually the softmax – it's missing x_i in the numerator – and so it approximates the indicator function 1(x_j = max_i x_i) rather than max_i x_i, as is stated in the article. — Preceding unsigned comment added by 162.129.251.86 (talk • contribs)


 * The function is correct, but the explanation is not. The softmax is supposed to approximate the argmax, not the max. I'll fix it. Q VVERTYVS (hm?) 14:36, 7 October 2014 (UTC)


 * I stumble on that paragraph as it stands now. While the softmax function approximates the indicator function, it is/was indeed correct to state that it can be used to construct a weighted average that approximates the max function: The dot product of the vector x and softmax(x). Elias (talk) 19:35, 8 March 2023 (UTC)
 * PS: While x dot softmax(x) is a smooth approximation of the max function, the LogSumExp function apparently is more common for this purpose. softmax is the gradient of the LogSumExp function. I suggest removing some of the low-quality text about softmax from this article and refer to Softmax instead. — Preceding unsigned comment added by Ehasl (talk • contribs) 10:11, 10 March 2023 (UTC)
 * PPS: Apparently, the approximation I describe has been studied. It is named the "Boltzmann operator", and includes a parameter α that uniformly scales the softmax inputs. With α→∞, the function converges to the maximum, and with α→-∞, to the minimum: x⋅softmax(αx) — Preceding unsigned comment added by Ehasl (talk • contribs) 10:35, 10 March 2023 (UTC)
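The quantities discussed in this thread are easy to compare numerically. The sketch below uses an arbitrary input vector; the function names are illustrative and written with the standard library only, not taken from any particular package.

```python
import math

def softmax(x):
    """Softmax with the max subtracted for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann(x, alpha):
    """x . softmax(alpha * x): a softmax-weighted average of the entries of x."""
    weights = softmax([alpha * v for v in x])
    return sum(v * w for v, w in zip(x, weights))

def logsumexp(x):
    """log(sum(exp(x))), another smooth approximation of max(x)."""
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

x = [1.0, 2.0, 3.5]  # arbitrary example input
for alpha in (1.0, 10.0, 100.0):
    print(alpha, boltzmann(x, alpha))  # approaches max(x) as alpha grows
print(boltzmann(x, -100.0))            # approaches min(x) for very negative alpha
print(logsumexp(x))                    # never below max(x)
```

The softmax vector itself concentrates on the position of the largest entry (the argmax indicator), while both x⋅softmax(αx) and LogSumExp return a single number close to the maximum.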

Categorical data?
I saw you just added this page to Category:Categorical data, but I'm not sure if that's appropriate. The only thing necessarily categorical in LR are the class labels (the outcomes/dependent variables), but that is true for any classification model. The inputs (independent variables/predictors/features) are real values. Perhaps Category:Classification algorithms should be a subcat of Category:Categorical data? Q VVERTYVS (hm?) 18:00, 26 May 2015 (UTC)


 * Okay, I'd have no objection to your doing that. I see that logistic regression is also under Category Categorical data (that's what led me to add the category here), so if you subcategorize it like that maybe you should also move logistic regression to category classification algorithms (and put probit there too). Loraof (talk) 19:57, 26 May 2015 (UTC) And multinomial probit too. Loraof (talk) 20:00, 26 May 2015 (UTC)


 * Ok, did it! Q VVERTYVS (hm?) 19:00, 27 May 2015 (UTC)

Why is there no worked example?
Why is there no worked example?

Mark W. Miller (talk) 11:36, 28 November 2015 (UTC)


 * That level of detail would be more useful for a standard toy problem, to get past introductory material. Let's agree on a standard problem, first. --Ancheta Wis (talk | contribs) 14:43, 28 November 2015 (UTC)

Not just for classification Comment
In the introduction, it says that multinomial logistic regression is a solution to a classification problem. This is correct, but not complete. As a regression method, it is also used to find out how the independent variables are related to the dependent variable, in this case by getting odds ratios. Sometimes classification is not a goal at all.

Where should that go? — Preceding unsigned comment added by PeterLFlomPhD (talk • contribs) 12:51, 24 August 2017 (UTC)