Talk:Statistical model

Could this be explained for the layman?

Introduction
The Introduction to this article should be generally readable by people who have little training in statistics. The Introduction was previously in need of improvement.

Then someone (Kri) made a change so that the Introduction began with this sentence: "A statistical model is a formalization of stochastic relationships between variables in the form of mathematical equations". The sentence is incomprehensible to most people, because they do not know what stochastic means. The justification for the change was that the term is "explained later in the paragraph". It is didactically awful to use a technical term and define the term later; a term should first be defined, at least intuitively, and then used.

I have reverted the change. I have also made an edit to hopefully improve clarity, as well as to correct an error (it is not necessary, or even usual, that the true model is in P). Further work is needed, though. 86.149.160.165 (talk) 17:02, 6 November 2014 (UTC)


 * Thank you for explaining why you reverted my edit this time; reverting someone's edits without explanation is usually not a good idea, as it can easily be seen as destructive by the one who made the first edit, since he obviously thought that he was doing something constructive himself.


 * As for my justification for the edit, I didn't mean that the term stochastic was explained later in the paragraph; what I meant was that the fact that the relationships are stochastic was stated later in the paragraph (although I used the word "explained" instead of "stated"). So I thought, why not make that statement about the relationships already the first time they are mentioned? But if you thought it was incomprehensible to most people, then maybe it was. —Kri (talk) 21:59, 6 November 2014 (UTC)


 * I should have explained the first time, I definitely agree, and will do so in the future. And I really appreciate your elaborating. [I'm the same editor as before.] 86.152.238.35 (talk) 13:31, 7 November 2014 (UTC)

Proposed merge with Statistical assumption
The article Statistical assumption makes very little sense. It claims that there are "non-modelling assumptions". Yet the set of statistical assumptions is the statistical model.

The reference given in the article [McPherson, 1990 (Section 3.3)] states the following. The vast majority of statistical models require the assumption that the sample which provides the data has been selected by a process of random selection…. The importance of this assumption is made apparent in Chapter 5. Where the sample members are not independently selected, there is a need to assume a structure or mechanism by which observations made on the sample members are connected. Generally, the statistical description is difficult even though the experimental description may be simple. For example, where plants are competing for light, moisture or nutrients, a strong growing plant is likely to be surrounded by weaker plants because of competition. Attempting to model the effects of this competition is not an easy task and frequently leads to the introduction of parameters of unknown value into the model.

Note the repeated use of the term "model".

The valid parts of the article Statistical assumption should be merged into the article Statistical model. FlagrantUsername (talk) 17:44, 8 November 2014 (UTC)
 * It has now been over two months since the merge was proposed, and there are no comments. So, I have left the article Statistical assumption unmerged, but substantially edited the article to make things clearer.   FlagrantUsername (talk) 15:18, 17 January 2015 (UTC)

Re Grey box completion and validation
The “See also” entry “Grey box completion and validation” has been removed anonymously without explanation from this and several other topics. Following advice from Wikipedia, if there are no objections (please provide your name and reasons), I plan to reinstate the reference in a week's time.

The removed reference provides information on a general method of developing models where part of the model structure is known. In particular, most models are incomplete (i.e. a grey box) and thus need completion and validation. This reference seems to be within the appropriate content of the “See also” section; see Wikipedia:Manual_of_Style/Layout#See_also_section.

BillWhiten (talk) 05:30, 22 March 2015 (UTC)
 * My suggestion is to rename Grey box completion and validation to "Grey box model", and revise the article appropriately. Having an article about completion and validation makes little sense, and doing so while not having an article for Grey box models generally makes no sense at all.
 * It is also highly questionable whether you should be doing something like this that promotes your own work.
 * SolidPhase (talk) 21:56, 30 March 2015 (UTC)

Example
The example is somewhat useful but loses its tractability when things get formal. The sample space $$S$$ is fine, defined like that, as is $$\Theta$$. What is $$P_{\theta}$$, though? Without that information (i.e. a formula), a novice reader is lost. Neither can the mapping $$P_{\theta}$$ be derived from information contained in the example (indeed, one would need to know the distribution of ages at the very least), nor can $$\mathcal{P}$$ be determined in the absence of the mapping. Can anyone work out the example in more detail? Athenray (talk) 08:49, 4 September 2015 (UTC)
 * I have revised the example, so that it includes a distribution of the ages. FlagrantUsername (talk) 10:39, 28 October 2017 (UTC)
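
To make the question about $$P_{\theta}$$ concrete, here is a minimal sketch of one way the mapping could be written down for the children's heights example. Everything specific in it is an assumption made for illustration and is not taken from the article: ages are taken to be uniform on a hypothetical range, height is taken to be a straight line in age plus Gaussian noise, and the names `sample_from_p_theta`, `b0`, `b1`, `sigma` and `age_range` are invented here. Under those assumptions, $$\theta = (b_0, b_1, \sigma)$$ with $$\sigma > 0$$, and $$P_{\theta}$$ is the joint distribution of (age, height) that the function below draws from.

```python
import random


def sample_from_p_theta(theta, n, seed=0, age_range=(1.0, 17.0)):
    """Draw n (age, height) pairs from the distribution P_theta.

    theta = (b0, b1, sigma) is one point of the parameter set Theta.
    Assumptions (illustrative only): age is uniform on age_range, and
    height = b0 + b1 * age + eps with eps ~ Normal(0, sigma^2).
    """
    b0, b1, sigma = theta
    if sigma <= 0:
        raise ValueError("sigma must be positive for theta to lie in Theta")
    rng = random.Random(seed)  # local generator, so results are reproducible
    pairs = []
    for _ in range(n):
        age = rng.uniform(*age_range)   # the assumed distribution of ages
        eps = rng.gauss(0.0, sigma)     # the stochastic error term
        pairs.append((age, b0 + b1 * age + eps))
    return pairs


if __name__ == "__main__":
    # Two different parameter values give two different members of the family.
    for theta in [(75.0, 6.0, 5.0), (80.0, 5.0, 2.0)]:
        print(theta, sample_from_p_theta(theta, n=3))
```

In this reading, the model $$\mathcal{P}$$ is the whole collection of distributions obtained by letting `theta` range over the parameter set, not any single call to the function.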

Removed quotation
I have removed the highlighted part in the following statement, which was later restored by an IP user.


 * A statistical model is usually specified by mathematical equations that relate one or more random variables and possibly other non-random variables. As such, "a model is a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).

My reason for removing it is that I can't see the connection between the quotation and the preceding sentence. There is no mention of statistical models, random variables, or non-random variables in the quotation. If this should be kept in the article, it needs to be explained better. Elektrolurch (talk) 10:17, 20 June 2017 (UTC)


 * The quotation is indeed about statistical models, but that was not clear from the way that it was quoted. I have restored the quotation, but revised the way that it is quoted, so as to make that clear. Additionally, the citation note now includes a direct link to the page of the book that contains the quote. About the connection with the preceding sentence, mathematical equations are a formal representation. 86.183.239.72 (talk) 11:09, 24 July 2017 (UTC)

General remarks section
"A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the above example with children's heights, ε is a stochastic variable; without that stochastic variable, the model would be deterministic."

The above statement is confusing. Even without the error ε, the age and the height remain random variables. So the presence of random variables in the model does not turn a mathematical model into a statistical model. But what does? AVM2019 (talk) 20:26, 22 December 2019 (UTC)

Seems inconsistent with itself and with the definition
"A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if $$k$$ is the dimension of $$\Theta$$ and $$n$$ is the number of samples, both semiparametric and nonparametric models have $$k \rightarrow \infty$$ as $$n \rightarrow \infty$$. If $$k/n \rightarrow 0$$ as $$n \rightarrow \infty$$, then the model is semiparametric; otherwise, the model is nonparametric."

1) Why should $$k$$ change when the size of the sample changes? It seems that a model should have a fixed definition, and the statistician can take as many sample points as he wants in order to test the model or estimate its parameters.

If there is some pragmatic principle whereby the statistician chooses larger models when he anticipates larger samples, this should be explained in advance, not implicitly assumed.

In any case, from a mathematical perspective, in the cited text, the property semiparametric is not a property of a statistical model, but of an infinite sequence of pairs (model, sample size).

2) The two characterizations


 * A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters.

and


 * If $$k/n \rightarrow 0$$ as $$n \rightarrow \infty$$, then the model is semiparametric; otherwise, the model is nonparametric.

are not consistent with each other. It is true that something is finite and something is infinite in both cases; that is all they have in common.

2A02:1210:2642:4A00:C9E4:9F6:5E99:526E (talk) 18:12, 23 June 2023 (UTC)
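
Point 1 above can be illustrated with a small sketch. It assumes, purely for illustration, two hypothetical rules by which a statistician might let the number of parameters $$k$$ depend on the sample size $$n$$: a histogram whose number of bins grows like $$n^{1/3}$$, and a scheme with one parameter per observation. Neither rule comes from the article; the function names are invented here. The sketch only shows that the quoted $$k/n$$ criterion is evaluated on a rule $$n \mapsto k(n)$$, i.e. on a sequence of models, rather than on one fixed model.

```python
import math


def k_cube_root(n):
    """Hypothetical rule: number of histogram bins grows like n^(1/3)."""
    return max(1, math.ceil(n ** (1.0 / 3.0)))


def k_per_observation(n):
    """Hypothetical rule: one free parameter for every observation."""
    return n


def ratios(k_of_n, sample_sizes):
    """Return k(n) / n for each sample size, for inspection of the trend."""
    return [k_of_n(n) / n for n in sample_sizes]


if __name__ == "__main__":
    sizes = [10, 1_000, 100_000, 10_000_000]
    # Under the quoted criterion, the first rule has k/n shrinking towards 0,
    # while the second keeps k/n fixed at 1.
    print("cube-root bins :", ratios(k_cube_root, sizes))
    print("one per sample :", ratios(k_per_observation, sizes))
```

With a fixed model, $$k$$ would be constant in $$n$$, so neither branch of the quoted criterion (which presupposes $$k \rightarrow \infty$$) would apply to it.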

The difficulty of a universal definition
I've tried to do an edit to improve this article, but I'm unhappy with the result. I'm thus trying to put my difficulties in writing so that maybe they can be addressed in the future.

I think my problem stems from the fact that there is a large diversity of statistical frameworks. They have common elements, for example probability theory, on top of which they add their own particularities. I'm not sure if there is a universal definition of a statistical model that would apply throughout all of these frameworks.

For example, consider the problem of finding the center and scale of a Gaussian in a Bayesian framework and in a frequentist framework. Superficially, the two models are similar: we will consider the ensemble of all Gaussians. But:

- in the Bayesian case, there is a single ordinary probabilistic model, to which we apply Bayes' rule.
- in the frequentist case, we instead treat every Gaussian in the set as a valid model, and look for a procedure that is valid for all of these models at once.

When we look at the details, these are now two very different models. And the situation becomes even more complex once we consider other frameworks: likelihood-based, robust, empirical Bayes.

Perhaps, it would be better to be explicit about this difficulty? Guillaume P Dehaene (talk) 14:34, 2 July 2024 (UTC)
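
The contrast drawn above can be sketched in a few lines. This is only an illustration under stated assumptions: maximum likelihood stands in for "some procedure" in the frequentist case, and in the Bayesian case the scale `sigma` is treated as known and the center gets a Gaussian prior (`prior_mean`, `prior_sd`), so that Bayes' rule has a closed form. Those simplifications, and all function and parameter names, are choices made here and are not part of the comment above.

```python
import math


def frequentist_mle(data):
    """Estimate center and scale by maximum likelihood.

    No prior is involved: the same formula is applied whichever Gaussian
    in the family generated the data.
    """
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n  # MLE divides by n
    return mean, math.sqrt(variance)


def bayesian_posterior_for_center(data, sigma, prior_mean, prior_sd):
    """Posterior of the center in one joint model (prior times likelihood).

    Assumes a known scale sigma and a Normal(prior_mean, prior_sd^2) prior;
    the posterior is then Gaussian, with the mean and sd returned here.
    """
    n = len(data)
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_mean * prior_precision + sum(data) / sigma ** 2)
    return post_mean, math.sqrt(post_var)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print("MLE (center, scale):", frequentist_mle(data))
    print("Posterior (mean, sd):",
          bayesian_posterior_for_center(data, sigma=1.0, prior_mean=0.0, prior_sd=1.0))
```

The two functions take different inputs: the Bayesian one cannot be called without specifying a prior, which is one concrete sense in which the two "models" are different objects even though both refer to the same family of Gaussians.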