User:Thepigdog/Inductive Inference

Purpose of web page
This page is my attempt to put together a simple description of inductive inference.

Inductive inference is the theory of learning, in a general form.

Inductive Inference
I wish to describe the general theory, called inductive inference, by which probabilities may be assigned to events based on the available data history.

The theory does not make assumptions about the nature of the world, the fairness of coins, or other things. Instead, the theory should be able to assign probabilities based solely on the data.

The problem may be simplified as follows: given the first N bits of a sequence, assign probabilities to the bits that come after it.

In general the input data may not be a sequence of bits. It may be a sequence of real numbers, or an array of data representing images. But all such data may be reduced to a sequence of bits, so for this discussion the simplification of the problem is adequate.

The theory assigns probabilities to models of the data. Each model is a function that maps model parameters to a data sequence. In addition a model may take a series of code bits as input.
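As a concrete illustration, here is a minimal sketch of such a model function in Python. The model, its parameter (a bit pattern), and the unused code-bits argument are all hypothetical, chosen only to show the shape of the mapping from parameters to a data sequence.

```python
# A hypothetical toy model: "repeat a fixed pattern". Its parameter is the
# pattern itself; the code-bits input is unused because this model is
# deterministic once the pattern is fixed.
def repeat_model(pattern, length, code_bits=""):
    """Map model parameters (a bit pattern) to a data sequence of `length` bits."""
    out = ""
    while len(out) < length:
        out += pattern
    return out[:length]

print(repeat_model("01", 8))  # → 01010101
```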

The theory is applicable to machine intelligence at a theoretical level. In practice there are computability problems.

Probabilities based on Message Length
In general, the shorter the description of a theory that describes some data, the more probable the theory. This principle is called Occam's razor. This is described in,


 * Kolmogorov complexity
 * Minimum message length
 * Minimum description length

All these theories give ways of measuring how good a model is: the shortest description (model plus code) that matches the data is the best.

These principles are based on Information Theory which started with Shannon. Kraft's Inequality plays a key role.


 * Information Theory
 * Claude Shannon
 * Kraft's inequality

But you don't need to go off and read all those wiki pages, unless you want to. The core principles are straightforward enough. In information theory terms, for a message x with probability p(x), the optimal encoding uses length l(x) bits.


 * $$p(x) = 2^{-l(x)}. \! $$

or,
 * $$l(x) = -\log_2(p(x)).$$

where l(x) is the length of x. This can be seen very simply: a string of l(x) bits has $$2^{l(x)}$$ possible states. If we encode the data efficiently we want each state to have the same probability. So,


 * $$p(x) = \frac{1}{2^{l(x)}}.$$

This is also the probability of choosing a message of length l(x) randomly.
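The relation between probability and optimal code length can be checked in a couple of lines of Python. A sketch; the function names are mine:

```python
import math

def code_length(p):
    """Optimal code length in bits for a message of probability p: l(x) = -log2 p(x)."""
    return -math.log2(p)

def probability(l):
    """Probability implied by a code of length l bits: p(x) = 2^-l(x)."""
    return 2.0 ** -l

# A message with probability 1/8 gets a 3-bit code, and vice versa.
print(code_length(1 / 8))  # → 3.0
print(probability(3))      # → 0.125
```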

Models + Codes
In compressing data we want an encoding function that chooses a representation for a message based on its probability: raw data D --> encoding function e --> compressed data (the code) C.


 * $$C = e(D, M). \!$$

The encoding function e must know the probability of the message D in order to compress it to its optimal encoding. The probability function is called the model M.

The model function is needed to uncompress the code. So the model function is part of the code. It must be represented with a series of bits of length l(M).


 * $$l(M+C) = l(M) + l(C). \!$$
 * $$p(M+C) = 2^{-( l(M) + l(C) )}. \!$$
 * $$p(M+C) = p(M) * p(C). \!$$

The length $$l(M+C)$$ is what Occam's Razor tells us to minimise.
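A minimal sketch of this minimisation, with made-up candidate models and made-up lengths in bits:

```python
# Hypothetical candidates: model name -> (model length in bits, compressed-data length in bits).
candidates = {
    "constant": (4, 90),   # tiny model, compresses the data poorly
    "periodic": (12, 30),  # small model, compresses well
    "lookup":   (95, 0),   # the model memorises the data outright
}

# Occam's razor: pick the model minimising the total two-part length.
best = min(candidates, key=lambda m: sum(candidates[m]))
print(best)  # → periodic (12 + 30 = 42 bits is the shortest total)
```

Note that the "lookup" model fits the data perfectly yet loses: its own description is as long as the data it explains.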

Bayes Theorem
Bayes' Theorem can be extended to cover more than one event. This is not difficult once conditional probabilities are understood. This is explained in Bayes' theorem alternate form. See also Bayes' theorem.

Given a set of alternatives, then,
 * $$A_i$$ are mutually exclusive sets of events (a Partition), where i ranges over an index set.
 * $$B$$ is a set of events.
 * $$P(B|A_i)$$ is the probability of $$B$$ given $$A_i$$.
 * $$P(A_i|B)$$ is the probability of $$A_i$$ given $$B$$.
 * $$P(A_i|B) = \frac{P(B | A_i)\, P(A_i)}{P(B)} = \frac{P(B | A_i)\, P(A_i)}{\sum_j P(B|A_j)\,P(A_j)}  \!$$.
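The alternate form is easy to compute directly. A sketch with illustrative numbers (not taken from the text):

```python
def posterior(prior, likelihood):
    """Bayes' theorem over a partition: P(A_i|B) = P(B|A_i) P(A_i) / Σ_j P(B|A_j) P(A_j)."""
    joint = {i: likelihood[i] * prior[i] for i in prior}
    p_b = sum(joint.values())  # P(B), by the law of total probability
    return {i: joint[i] / p_b for i in joint}

prior = {"A1": 0.5, "A2": 0.5}       # P(A_i)
likelihood = {"A1": 0.8, "A2": 0.2}  # P(B|A_i)
print(posterior(prior, likelihood))  # → {'A1': 0.8, 'A2': 0.2}
```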

Probability Theory Terminology
An Experiment has a set of equally likely Outcomes. These outcomes are classified into sets called Events. In the limit, as the experiment is repeated many times, the relative frequency of each outcome converges to its probability.

The Universe Generator
Assume that the data sequence was generated from some model. A model assigns a probability to each data sequence.


 * A data sequence, tagged with the model it is generated from is an outcome.
 * A model is an event.
 * A data sequence prefix D is an event.

We can think of the experiment as a two step process.


 * 1 - A model $$M_i$$ is created with probability,
 * $$P(M_i) = 2^{-l(M_i)}\!$$


 * 2 - A data sequence is generated from the model.

This outcome belongs to the event D of outcomes with this prefix. The model has a probability of generating this data set,


 * $$P(D|M_i) = 2^{-l(e(D, M_i))}\!$$

where $$e(D, M_i)$$ represents the encoding of D with the model $$M_i$$.

Bayes' Theorem may be applied to this experiment. Each outcome is tagged with the model it is created from, so the models $$M_i$$ are a Partition of the set of all outcomes.


 * $$P(M_i|D) = \frac{P(D | M_i)\, P(M_i)}{P(D)} = \frac{P(D | M_i)\, P(M_i)}{\sum_j P(D|M_j)\,P(M_j)}  \!$$.

This gives the probability that the outcome was generated by the model given a data prefix D.

Predictive Models
Some models will compress the data,


 * $$P(D|M_i) * P(M_i) > 2^{-l(D)}$$
 * $$l(e(D, M_i)) + l(M_i) < l(D)$$

These are the models that provide information about the behaviour of D, by compressing the data. They are predictive models.

Other models don't compress the data. These are unpredictive models. Rather than deal with each unpredictive model separately we want to group them together and handle them all as the set U of models that don't compress the data. Two sets of indexes J and K are created for the predictive and unpredictive models,


 * $$J = \{j : l(e(D, M_j)) + l(M_j) < l(D)\}\!$$

is the set of indexes of models that compress the data.


 * $$K = \{k : l(e(D, M_k)) + l(M_k) \ge l(D)\}\!$$

is the set of indexes of models that don't compress the data.


 * $$U = \bigcup_k M_k\!$$

Then we need to know,
 * $$P(D|U) = 2^{-l(D)}\!$$
 * $$P(U) = \text{constant}\!$$

Bayes' law, with all the models that don't compress the data merged together, becomes,


 * $$P(M_i|D) = \frac{P(D|M_i) * P(M_i)}{s}\!$$
 * $$P(U|D) = \frac{P(D|U) * P(U)}{s}\!$$

where
 * $$s = P(D|U) * P(U) + {\sum_j P(D|M_j) * P(M_j)}\!$$

This summarises the probabilities assigned to each predictive model and to the merged model U.
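A sketch of the merged-model calculation in Python, with invented code lengths; P(U) is simply assumed constant, as above:

```python
# Hypothetical lengths in bits: raw data l(D), and (l(M_i), l(e(D, M_i))) per model.
l_D = 100
models = {"M1": (10, 60), "M2": (25, 50)}
p_U = 0.5  # assumed constant prior for the merged unpredictive models

# P(D|M_i) * P(M_i) = 2^-(l(e(D, M_i)) + l(M_i));  P(D|U) = 2^-l(D).
weights = {m: 2.0 ** -(l_m + l_c) for m, (l_m, l_c) in models.items()}
weights["U"] = (2.0 ** -l_D) * p_U
s = sum(weights.values())

post = {m: w / s for m, w in weights.items()}  # P(M_i|D) and P(U|D)
```

With these numbers M1 dominates, and the merged set U contributes almost nothing: its weight 2^-100 is dwarfed by the models that compress the data.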

Predicting the Future
Based on the probabilities of the models found by the methods above we can find a probability distribution for future events.

Each model i predicts a probability $$x(M_i, j)\!$$ for the j-th bit of future data. The probability of bit j being set in the outcome O is,

$$P(O_j) = P(U|D)/2 + {\sum_i P(M_i|D) * x(M_i, j)}\!$$
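A sketch of this mixture prediction, with hypothetical posterior weights and per-model predictions:

```python
p_U_given_D = 0.1                     # hypothetical P(U|D)
p_M_given_D = {"M1": 0.7, "M2": 0.2}  # hypothetical P(M_i|D); all weights sum to 1
x_j = {"M1": 0.9, "M2": 0.5}          # x(M_i, j): each model's prediction for bit j

# P(O_j) = P(U|D)/2 + Σ_i P(M_i|D) * x(M_i, j); U predicts each bit at 1/2.
p_bit = p_U_given_D / 2 + sum(p_M_given_D[m] * x_j[m] for m in p_M_given_D)
print(p_bit)  # → 0.78
```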

Prior Probabilities
There are implicit prior probabilities built into the way that models are encoded. These prior probabilities are determined by the language the models are described in. However, the length in bits under one language differs from the length under any other by at most a fixed constant: the length of a translator between the two languages.

Perhaps it is not correct to think of probability as an absolute and immutable value. Probabilities are determined by,
 * The language used for encoding models (the a-priori knowledge).
 * The data history.
 * Computational limitations.

Probability is relative to these factors.

All real world probability uses past experience to predict future events based on models of the data. So all actual probability is relative.

Only theoretical probability can be regarded as absolute. The toss of an unbiased coin gives probabilities we can all agree on. But real world probability must consider all possible models for the coin's behaviour (not just the unbiased-coin model).

Examples
This section gives examples.

Two Theories
Suppose there is a murder mystery and we know that only Jane had the opportunity to murder the victim. Then we feel we have good evidence that Jane did the deed. But if later on we find that John also had the opportunity, then the probability that Jane did it suddenly halves.

If we only have "Jane did it" as a theory,
 * $$s \approx P(D|Jane) * P(Jane)\!$$

And so
 * $$P(Jane|D) \approx \frac{P(D|Jane) * P(Jane)}{P(D|Jane) * P(Jane)}\!$$
 * $$P(Jane|D) \approx 1\!$$

But if "John did it" and "Jane did it" are equally complex,


 * $$s \approx P(D|Jane) * P(Jane) + P(D|John) * P(John)\!$$

And so
 * $$P(Jane|D) = \frac{P(D|Jane) * P(Jane)}{s}\!$$
 * $$P(John|D) = \frac{P(D|John) * P(John)}{s}\!$$
 * $$P(Jane|D) \approx 0.5\!$$
 * $$P(John|D) \approx 0.5\!$$
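The halving effect is easy to reproduce numerically. A sketch, with an arbitrary common weight for the two equally complex theories:

```python
def posterior(weights):
    """Normalise P(D|A) * P(A) weights into posterior probabilities."""
    s = sum(weights.values())
    return {k: v / s for k, v in weights.items()}

w = 0.01  # hypothetical common value of P(D|suspect) * P(suspect)

print(posterior({"Jane": w}))             # only Jane had the opportunity → {'Jane': 1.0}
print(posterior({"Jane": w, "John": w}))  # John also → {'Jane': 0.5, 'John': 0.5}
```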

Where There Is Less Data the Probability Distribution Should Spread Out
For the second intuition, suppose we have a regression based on some data where the points are all grouped together on the x-axis. Traditional regression theory fits a single regression model to it. Clearly this model will be more accurate near where we have data on the x-axis, but traditional regression theory doesn't tell us that. By considering all theories we get the fan-out of probabilities, further away from the actual data, that we expect.

<< Need an actual regression test example here >>
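As a stand-in for a full test, here is a rough sketch using Bayesian linear regression (my assumption, not the all-models treatment described above): the predictive standard deviation is smallest near the clustered data and fans out away from it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(4.0, 5.0, 20)              # points grouped together on the x-axis
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 20)

X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
alpha, sigma2 = 1e-3, 0.01                 # prior precision, noise variance

# Posterior over the regression weights is Gaussian with this covariance and mean.
cov = np.linalg.inv(alpha * np.eye(2) + X.T @ X / sigma2)
mean = cov @ (X.T @ y) / sigma2

def predictive_std(x_new):
    """Predictive standard deviation at x_new: widens away from the data."""
    phi = np.array([1.0, x_new])
    return float(np.sqrt(sigma2 + phi @ cov @ phi))

# Least uncertainty where the data sits (x ≈ 4.5); more and more further away.
print(predictive_std(4.5) < predictive_std(0.0) < predictive_std(-10.0))  # → True
```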