Talk:Probit model

Practical Example
Anyone willing to give a practical example for quick intuition? Math only-descriptions are only so helpful sometimes. 160.39.191.94 (talk) 20:42, 19 April 2018 (UTC)

Background
Can I get a bit more background? What is the probit model? What does it do? What was it developed for? How was it developed? As is, the article provides minimal explanatory text before launching into near gibberish. Theblindsage (talk) 08:19, 26 November 2013 (UTC)

Citation for Bliss
The citations for Bliss's papers are already in the article. I added McCullagh & Nelder, which claims this was developed by Bliss, and then removed the cite-needed tag. Baccyak4H (Yak!) 17:39, 11 June 2007 (UTC)


 * Where is the cite in M&N (i.e. what page?) Pdbailey 02:18, 12 June 2007 (UTC)


 * From the second edition (1989), top of pg 13 (beginning of section 1.2.5): "The technique known as probit analysis arose in connection with bioassay, and the modern method of analysis dates from Bliss (1935)." Baccyak4H (Yak!) 13:05, 12 June 2007 (UTC)


 * The same book (p 43) says that the method of solving the probit was in an appendix to the same article, written by Fisher. How can one be so sure that Fisher didn't influence Bliss? Pdbailey 13:10, 12 June 2007 (UTC)


 * I read that ref to imply Fisher's contribution was using the scoring method in the context of the model itself. However, without immediate access to the paper, that is just my best educated interpretation/guess.  If reading the paper in its entirety suggests Fisher could get co-credit (or even most or all of the credit) for the model itself, this article should reflect that.  If you can get a copy, that would be great.  17:50, 12 June 2007 (UTC)

X prime?
Is x prime correct? I've always seen it xB, not x'B for the classical linear model.Aaronchall (talk) 04:54, 8 January 2008 (UTC)


 * Have replaced the x' with x^T to represent vector/matrix transpose in a consistent manner throughout the article. Rysa98 (talk) 15:01, 10 May 2023 (UTC)



wording issues
This is a pretty dense little article. I think a gentler intro should be developed. But more specifically now, I take issue with some wordings that are ambiguous or potentially a bit haughty:


 * "Because the response is a series of binomial results, the likelihood is often assumed to follow the binomial distribution.": What does that mean?  I myself understand how there is an assumption of normally distributed unobservable errors in the latent variable formulation of this model, but how or where is there a binomial distribution assumption in the model?
 * I believe because: in the true relationship (as in, with no errors), the latent variable, Y*, falls between negative infinity and positive infinity, but usually close to one. Still in the true relationship, Y would always be one or zero for any Y*, based upon whether Y* is positive or negative.  Thus, adding "normally distributed unobservable errors" to Y*, there is a probability p that Y will be 0 or 1 based upon how far Y* is from 0 and the standard deviation of errors.  Accordingly, for some true value Y*0, the probability of observing 0 for Y is equal to the normal CDF of 0 given a mean of Y*0 and a standard deviation s.  In the sample, there will often be only one observation for each value of Y* observed, so for each Y* there is one Bernoulli trial.  However, as it is theoretically possible to get many observations which yield the same Y*, the distribution of observable Y would be binomial, as a binomial distribution is constructed of many Bernoulli trials.  Assuming homoskedastic errors, p is only related to Y* and something like a single binomial distribution can be inferred from all values of Y* in the results.  Can someone comment on this? 131.122.52.166 (talk) 04:16, 27 April 2011 (UTC)
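The latent-variable story in the comment above can be checked numerically. Below is a hypothetical sketch (arbitrary coefficients and sample size, not taken from the article): a latent Y* with standard normal errors is thresholded at zero, and the empirical frequency of Y=1 near a given x is compared with the model-implied Φ(x′β).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
beta0, beta1 = 0.5, 1.0   # arbitrary "true" coefficients for illustration
n = 100_000

x = rng.normal(size=n)
y_star = beta0 + beta1 * x + rng.normal(size=n)  # latent variable with N(0,1) error
y = (y_star > 0).astype(int)                     # observed binary outcome

# Among observations with x near 1, the empirical frequency of Y=1
# should approximate the model-implied probability Phi(beta0 + beta1 * 1)
mask = np.abs(x - 1.0) < 0.05
print(y[mask].mean())                # empirical P(Y=1 | x close to 1)
print(norm.cdf(beta0 + beta1 * 1.0)) # Phi(1.5), about 0.933
```

The two printed numbers should agree to about two decimal places, which is the sense in which each observed Y is a Bernoulli draw with success probability Φ(x′β).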


 * "The parameters &beta; are typically estimated by maximum likelihood." Actually, how else can they be understood or estimated?  I have the impression that there are different ways that this or any other maximum likelihood model's parameters can be found, numerically.  But it is a maximum likelihood model, so what is meant by the suggestion that this is "typically" but not always a maximum likelihood model?  I am clearly missing something, or the language is imprecise.


 * "While easily motivated without it, the probit model can be generated by a simple latent variable model." Easily motivated by whom / how?  I object to the "easily" word.


 * "Then it is easy to show that" should be changed to "Then it can be shown that". Easy is subjective, and I think it comes across wrong to general readers of the encyclopedia, the vast majority of whom will not find anything easy about showing that. doncram (talk) 23:40, 21 May 2009 (UTC)


 * What do you mean by saying that it's a "maximum likelihood model"? There's nothing in the model itself that says anything about maximum likelihood, and one can readily imagine methods other than maximum likelihood for estimating the parameters.  For example, if one has a prior probability distribution of the parameters, then one could use the posterior expected values of the parameters as estimates.  You are right to say that you're clearly missing something.  I don't think the language at that point is imprecise. Michael Hardy (talk) 00:43, 22 May 2009 (UTC)


 * Hmm, thanks for responding, that helps me a bit. As a reader, I am really already invested in understanding it as a maximum likelihood model.  Given data, I can't really absorb how you could (and why you would) choose any other method of estimating the model, besides trying to figure out what are the parameters that are most likely to have resulted in the observed data (given an assumption of normal errors in the latent variable model).  You suggest that I could also want to take into account a prior distribution.  But then, I absorb that only as a broader maximum likelihood problem:  there was previous data that is summarized in some informed priors, and then there is some new data.  I don't exactly know how to do this necessarily, but I would want to use a maximum likelihood approach to combine the priors and new data to come up with new estimates.  I wonder then:  Is there a non-MLE based approach (which would also have a Bayesian perspective extension)?  Is there some non-MLE approach to estimating the parameters of the model that has ever been used for practical purposes?  In a regular linear regression context, I do understand other alternatives, but here I do not. doncram (talk) 01:23, 22 May 2009 (UTC)


 * P.S. I see you edited the article to remove the sentence about the binomial distribution, and to remove the two "it is easy" assertions. Thanks!  However, the math display is all messed up now. doncram (talk) 01:27, 22 May 2009 (UTC)


 * Doncram, in addition to Frequentist statistics (which has things like the MLE) there is Bayesian statistics (which has things like priors). PDBailey (talk) 01:55, 22 May 2009 (UTC)


 * I don't see any "messed up" math displays. They're all working normally for me.
 * Posterior expected values are not "extended" maximum likelihood estimates. It is true that you would still use the likelihood function, but the estimates would not always correspond to points that maximize it.  "most likely to have resulted in the observed data" is an often heard but misleading phrase.  MLEs are not the parameter values that are most likely, given the data; they are the values that make the data more likely than the data would have been with any other parameter values.  Let's say you want to estimate the frequency with which car accidents happen on a certain highway.  You observe it for an hour.  No accidents happen.  Then the maximum likelihood estimate of the frequency is zero.  Why, then, might you want to use any other estimate?  Or consider the German tank problem: tanks are numbered 1, 2, 3, ..., N and you want to estimate N.  A sample of 20 tanks gives various numbers in the range from 1 to 9541.  The largest number observed is 9541.  What, then, do you take to be an estimate of the total number of tanks?  The MLE is exactly 9541.  But does it not seem likely that the very highest number, corresponding to the exact number of tanks, is not in the sample of 20, and therefore that number is probably higher than 9541?  That does not conflict with the fact that the data you actually observed are more probable if the total number of tanks is 9541 than if it is bigger than that.  Why, then, might you want to use any other estimate? Michael Hardy (talk) 02:16, 22 May 2009 (UTC)
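The German tank example above can be sketched in a few lines. This is a hypothetical illustration (the true N and the seed are arbitrary); the formula m(1 + 1/k) − 1 is the standard minimum-variance unbiased estimator for this problem, shown alongside the MLE:

```python
import random

random.seed(1)
N = 12_000   # hypothetical true number of tanks (unknown to the analyst)
k = 20       # sample size, as in the example above

sample = random.sample(range(1, N + 1), k)
m = max(sample)

mle = m                       # MLE: the largest serial number observed
mvue = m * (1 + 1 / k) - 1    # minimum-variance unbiased estimator

# The MLE can never exceed the truth and is almost always below it;
# the unbiased estimator inflates the maximum to compensate.
print(mle, mvue)
```

Running this shows the MLE sitting at the sample maximum while the unbiased estimator pushes the guess upward, which is exactly the intuition in the comment: the true N is probably larger than the largest serial number seen.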

Berkson's minimum chi-square method
In the section of the same name in the article, what the heck is going on? What is the beta value that is being estimated? The article now reads Its advantage is the presence of a closed-form formula for the estimator, and possibility to carry out analysis even when individual observations are not available, only their aggregated counts $$r_t$$, $$n_t$$, and $$x_{(t)}$$ (for example in the analysis of voting behavior). But why can't the MLE be found with tabular data in this same situation? What is this advantage over? PDBailey (talk) 02:50, 4 August 2009 (UTC)


 * A possible reference for Berkson’s method is «». The β being estimated is exactly the same β which was in the definition of the model: P[Y=1|X] = Φ(X′β). The presence of a closed-form solution is indeed an advantage, as it is generally more difficult to implement the maximization routine. As for the “applicability to tabular data” — I’m not sure about that, better to find the original article. ...  st pasha  » talk » 07:50, 4 August 2009 (UTC)


 * Thanks, so the advantage is a disadvantage then and I just added a fact tag to the claims about the estimator. If you know the book has these claims in it, you could add it. PDBailey (talk) 01:21, 5 August 2009 (UTC)
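For concreteness, the closed-form character of Berkson's minimum chi-square estimator can be sketched for grouped data. This is a hypothetical illustration with made-up counts (r_t successes out of n_t trials at each covariate value x_t): transform the observed proportions to empirical probits z_t = Φ⁻¹(r_t/n_t), then run weighted least squares with the usual minimum chi-square weights.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical grouped data: at each dose x_t, n_t subjects, r_t responders
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
n = np.array([200, 200, 200, 200, 200])
r = np.array([13, 45, 98, 165, 191])

p_hat = r / n
z = norm.ppf(p_hat)                              # empirical probits Phi^{-1}(r_t / n_t)
w = n * norm.pdf(z) ** 2 / (p_hat * (1 - p_hat)) # minimum chi-square weights

# Weighted least squares of z on (1, x): a closed-form estimate of beta,
# no iterative maximization required
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
print(beta)   # (intercept, slope) estimates
```

Note the contrast with maximum likelihood, which requires an iterative optimizer; here a single linear solve produces the estimate, which is the "advantage" the article text refers to. (As the thread notes, the MLE can also be computed from aggregated counts, so the advantage is computational rather than one of applicability.)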

Dr. Winkelmann's comment on this article
Dr. Winkelmann has reviewed this Wikipedia page, and provided us with the following comments to improve its quality:

""A probit model is a popular specification for an ordinal[2] or a binary response model. " Drop reference to "ordinal". This is confusing.

Keep consistent notation: Transpose is first denoted by "T", later by " ' ".

The maximum likelihood estimator need not exist if there is perfect separation.

There should be a subsection on how to interpret probit results: non-constant marginal, or partial, effects: they go to zero as Pr approaches 1 or 0."

We hope Wikipedians on this talk page can take advantage of these comments and improve the quality of the article accordingly.

We believe Dr. Winkelmann has expertise on the topic of this article, since he has published relevant scholarly research:


 * Reference : Rainer Winkelmann, 2009. "Copula-based bivariate binary response models," SOI - Working Papers 0913, Socioeconomic Institute - University of Zurich.

ExpertIdeasBot (talk) 16:57, 27 July 2016 (UTC)
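Dr. Winkelmann's last point (non-constant marginal effects) can be illustrated with a small sketch using hypothetical coefficients: in a probit model the partial effect of a regressor x_j on P(Y=1) is φ(x′β)·β_j, and since the normal density φ vanishes in the tails, the effect shrinks toward zero as the predicted probability approaches 0 or 1.

```python
import numpy as np
from scipy.stats import norm

beta = np.array([0.5, 1.0])   # hypothetical (intercept, slope)

for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    index = beta[0] + beta[1] * x
    prob = norm.cdf(index)                 # P(Y=1 | x)
    marginal = norm.pdf(index) * beta[1]   # dP/dx = phi(x'beta) * beta_1
    print(f"x={x:5.1f}  P={prob:.4f}  dP/dx={marginal:.4f}")
```

The printed table shows the marginal effect peaking where the predicted probability is near 0.5 and collapsing toward zero in both tails, which is the interpretation subsection the review asks for.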

Differences vs. Logistic Regression Model
I think it would be very helpful to mention the distinctions between a probit model and logistic regression, which is the most popular classification model in machine learning. The picture I have is:

Logistic regression models the log odds as a linear function of the predictors; probit models Φ⁻¹ of the probability (the z-score, not the odds) as a linear function of the predictors

Logistic regression creates a linear decision boundary or hyperplane; Probit creates a nonlinear decision boundary (? of what functional form? or is it simply a different linear decision boundary? If probit is a generative function that returns a probability for any set of predictors, what would be the discriminant function that returns a binary indicator?)

When would one use a probit model vs. logistic regression? When one has a very clear picture that odds, not log-odds, are a linear function of the predictors?

98.14.249.94 (talk) 12:00, 20 June 2018 (UTC)
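On the questions above, a short numerical sketch may help (an illustration, not article content): the two models differ only in the link function, and since both classify ŷ=1 exactly when the linear index x′β exceeds 0, both imply a linear decision boundary at p = 0.5. Moreover the standard normal CDF is well approximated by a logistic CDF with the index rescaled by about 1.7, so in practice the fitted probabilities are very close.

```python
import numpy as np
from scipy.stats import norm

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-4, 4, 81)

# Phi(t) is close to logistic(1.702 * t) everywhere (max gap about 0.01),
# so logit and probit fits differ mainly by a constant rescaling of beta.
print(np.max(np.abs(norm.cdf(t) - logistic(1.702 * t))))

# Both links give p = 0.5 exactly at index 0: the decision boundary
# {x : x'beta = 0} is a hyperplane for probit just as for logit.
print(norm.cdf(0.0), logistic(0.0))
```

So the practical choice between them rarely turns on the decision boundary; probit is traditional when the binary outcome is thought of as a thresholded latent variable with normal errors (as in bioassay), while logit offers odds-ratio interpretations.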

Merge from Binary response model with latent variable
The article Binary response model with latent variable is short and poorly written, but it contains some content about dealing with heteroskedasticity and misspecification in probit models which might be useful here (for example, a 'Heteroskedasticity' section). I propose that the useful content from the other article be merged here. (Then the article can be redirected to binary choice model or binomial regression). Wikiacc (¶) 22:42, 21 May 2019 (UTC)
 * Instead of merging, I just copied the text on misspecification here. It requires some cleanup. Wikiacc (¶) 18:20, 16 June 2019 (UTC)