Talk:Minimum description length

Comments
The following comment pertains to the previous version of this page, of which only traces remain in the current version. Any new comments are welcome! I'll check this discussion page now and then for the next couple of weeks (september 2004).

see the disccussion Minimum message length which may be relevant to stress the difference between MML and MDL (perhaps the easiest is in the historical sense / pratical issues) - Meduz 12:15, 4 August 2006 (UTC)

This page lacks mathematics. What _exactly_ is the definition of MDL? This should be front and centre, with various fun sounding things interpretations presented afterwards. — Preceding unsigned comment added by 193.60.86.75 (talk) 15:55, 28 June 2017 (UTC)

I'm sure this means something to somebody, but it needs more context to help out those of us that aren't in whatever field this is (the theory of computation, maybe?).

For example:

- what kind of "model" are we talking about?

- what does "in the light of data" mean?

- ''Who are Wallace and Boulton? Shannon?''

- Lots of stuff needs to be wikified (i.e. turned into Wikipedia links)

- ''Can the second paragraph be clarified? What's the take-home message that it's trying to say?''

- What's the "MDL community"?

One more question:

Is it really true that the same priors tend to be favored in "objective Bayesian" analysis as in MDLtheory?

Yes, this is true: The problem with Bayesian probability is that, since the prior is subjective, any probabilities you compute no longer necessarily correspond to frequencies in a repeated experiment, as in classical frequentist probability theory. Objective Bayesians try to get the best of both worlds by using priors that, while not in themselves corresponding to frequencies of any kind, converge as quickly as possible to posterior distributions under which any predictions closely correspond to actual frequencies that you observe in real life experiments. This means that priors should use as little assumptions about the data as possible. Also the prior should be invariant under the functional form of the distribution function, for example: you could parameterize the normal distribution using $$\mu$$ and $$\exp(\sigma^2)$$, but that shouldn't change which distributions are more likely according to the prior. For these reasons, Jeffreys' prior is often advocated by Objective Bayesians. It turns out that the Bayesian probability using Jeffreys' prior behaves asymptotically the same as the NML probability that is favoured by MDL people (read: Rissanen).

All this is quite involved and I don't think people will be helped if stuff like this is included on the main page. Also, my apologies for the lack of references, I will try to provide a few on request...

I have slightly changed the introductory part of this article: it suggested, incorrectly, that MDL is basically the same thing as inference based on Kolmogorov complexity. While it is inspired on Solomonoff's theory, one of the basic properties of MDL is that it is computable, which it would not be if general computer programs were used to model the data.

fix link to function to be function (mathematics)
This text in the article: "Central to MDL theory is the 1-1 correspondence between code length functions and probability distributions. (The lemma involved is the Kraft-McMillan inequality.)" linked to the function disambiguation page. I altered the link to point to function (mathematics). I'm fairly certain this is correct, but since the article does reference Turing machines, I'm hoping the text does not refer to subroutines by the word "function." Correct me if I am wrong. Volfy 02:07, 23 November 2005 (UTC)

functions
No, I really meant ordinary mathematical functions, so the cross reference is correct.

latest modifications
I made a number of new changes to the page. Most changes are minor. The most important changes are: I included a bit about MDL's built-in protection from overfitting and I cleared up the relation to Bayesian inference and MML. I removed the cross-reference to 'reduction (mathematics)' because the phrase 'reduces to' was not intended to be taken in a technical-mathematical way. I also removed the suggestion to merge the topic with MML: the page already contains a very clear reference and the two subjects really are not the same thing. Although I do feel that the exact differences should be explained in more detail. I'll get one of my colleagues to help with this shortly.

another minor change
I added a little more on the differences between MML and MDL. I also reverted the first sentence to its previous state. Occam's razor is a general principle in the philosophy of science which has nothing to with Bayesian inference so I object to the term 'Bayesian Occam's razor', which suggests that MDL implements a specific kind of Occam's razor which has something to do with Bayes' rule. This is not the case. MDL is proposed directly as an implementation of Occam's razor and developed independently of Bayesian statistics. It is true that there is a very strong connection between MDL and Bayes; both implement Occam's razor for instance, but that doesn't mean that MDL should be defined in terms of Bayesian statistics. I welcome real improvements to the text, which is obviously far from perfect, but this doesn't really help. Finally, as a courtesy to me, if you change the first line, would you please consider motivating your changes on the discussion page?

Coin example, never selecting a biased coin model?
I think this section needs a citation, clarification, or confirmation by an expert (preferably by citation). It looks like original research as it stands, and I tend to think the conclusion is incorrect.

The simplest counter-example is that of a loss-less graphic compression algorithm. These map pixel data into a computably-feasible model space, and using techniques such as run-length encoding often effectively exploit the fact that not all images appear with equal likelihood, and that color bias and spacial consistency often allow a reduction. This amounts to a model selection. Even a random image of 1000 pixels which are selected according to a severe bias (95% black and 5% white) will be compressible to less than 1000 bits, because at those odds run-length encoding works well, as does an "all are black but these" model. With those languages, should you encounter a picture that is 50/50 random, you may need give up a constant number of bits to say "no better compression possible", but making these practical trade-offs is one of the advantages of MDL versus Kolomorov computation: as the article says, in MDL the actual constants matter.

12.232.236.2 (talk) 16:36, 5 July 2011 (UTC)

External links modified
Hello fellow Wikipedians,

I have just added archive links to 1 one external link on Minimum description length. Please take a moment to review my edit. If necessary, add after the link to keep me from modifying it. Alternatively, you can add to keep me off the page altogether. I made the following changes:
 * Added archive http://web.archive.org/web/20100218134039/http://www.mdl-research.org:80/ to http://www.mdl-research.org/

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at ).

Cheers.—cyberbot II  Talk to my owner :Online 00:18, 19 March 2016 (UTC)

External link defunct
http://www.mdl-research.org (edited by the bot) appears to have been taken over by some chinese entity. http://www.mdl-research.net/ still exists (and refers to the .org domain).

External links modified (February 2018)
Hello fellow Wikipedians,

I have just modified one external link on Minimum description length. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
 * Added archive https://web.archive.org/web/20100616234207/http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11155 to http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11155

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

Cheers.— InternetArchiveBot  (Report bug) 02:14, 1 February 2018 (UTC)

Adding an example to "MDL in Machine Learning"

 * Specific text to be added or removed: In practice, when finding a machine learning model that best fits the data according to the MDL principle, one selects the model that minimizes the length of a two-part code (program). As an example, in the supervised case of learning the best fitting rule list (the probabilistic form of a decision list) from a dataset, one chooses the rule list that minimizes the two-part code composed of: 1) the code that uniquely defines a rule list given all possible rule lists that could be generated from that dataset, i.e., taking into account the number of variables and their possible combinations; and 2) the code that returns the dataset given the specific rule list as defined in the first code. The first code is specifically made to avoid overfitting through the penalization of more complex rule lists that would naturally be more prone to overfitting. Note that using a two-part code is essential when one needs to distinguish the different models, such as in the case of finding the model and its parameters that best fit the data. This contrasts with the task of merely wanting to know the size of the model (e.g., the number of rules in a rule list) necessary for the best performance.


 * Reason for the change: The MDL in Machine Learning section is quite poor and misses the most common problem for beginners in the use of MDL for ML, i.e., how to pass from its abstract form to a concrete problem. I decided to add an example based on a recent work that I know well and that is accessible to the machine learning practitioner and found some level of success. Given COI, I cannot be a partial judge but believe this to add some significant clarification regarding what it means in practice to use MDL in machine learning and what does it consist of. As a last note, even if this edit is not approved I would suggest for someone to add an example of this in practice. A significant extension would be to add concrete equations and formulas to help. Afternoonfriend (talk) 17:11, 1 January 2022 (UTC)


 * Red information icon with gradient background.svg Not done: please provide reliable sources that support the change you want to be made.   Thinker78  (talk) 00:23, 3 December 2022 (UTC)