User:Jmath666/Conditional probability and expectation

Elementary description
If $$\textstyle A,$$ $$\textstyle B$$ are events such that $$\textstyle P\left( B\right)  >0$$, the conditional probability of the event $$\textstyle A$$ given $$\textstyle B$$ is defined by


 * $$ P\left( A|B\right)  =\frac{P\left(  A\cap B\right)  }{P\left(  B\right)  }. $$

If $$\textstyle B$$ is fixed, the mapping $$\textstyle A\mapsto P\left( A|B\right)  $$ is a conditional probability distribution given the event $$\textstyle B$$.

If also $$\textstyle P\left( A\right)  >0$$, then also


 * $$ P\left( B|A\right)  =\frac{P\left(  A\cap B\right)  }{P\left(  A\right)  }$$

and so


 * $$\begin{align} P\left( A|B\right)    & =\frac{P\left(  A\cap B\right)  }{P\left(  B\right) }=\frac{P\left(  A\cap B\right)  }{P\left(  A\right)  }\frac{P\left( A\right)  }{P\left(  B\right)  }\\ & =\frac{P\left(  B|A\right)  P\left(  A\right)  }{P\left(  B\right)  }, \end{align}$$

which is known as Bayes' theorem.
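
As a quick numerical sanity check (a minimal sketch, not part of the derivation above; all probabilities below are made-up numbers), the identity can be verified in Python:

```python
# Hypothetical two-event example; all probabilities are assumed values.
P_A = 0.3             # assumed P(A)
P_B_given_A = 0.5     # assumed P(B|A)
P_B_given_notA = 0.2  # assumed P(B|A complement)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c).
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)

# P(A and B) = P(B|A) P(A), by the definition of conditional probability.
P_AB = P_B_given_A * P_A

print(P_AB / P_B)               # P(A|B) from the definition
print(P_B_given_A * P_A / P_B)  # P(A|B) from Bayes' theorem; same number
```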

Conditioning of discrete random variables
If $$\textstyle Y$$ is a discrete real random variable (that is, attaining only values $$\textstyle y_{j}$$, $$\textstyle j=1,2,\ldots$$), then the conditional probability of an event $$\textstyle A$$ given that $$\textstyle Y=y_{j}$$ is


 * $$ P\left( A|Y=y_{j}\right)  =\frac{P\left(  A\wedge Y=y_{j}\right)  }{P\left( Y=y_{j}\right)  }. $$

The mapping $$\textstyle A\mapsto P\left( A|Y=y_{j}\right)  $$ defines a conditional probability distribution given that $$\textstyle Y=y_{j}$$.

Note that $$\textstyle P\left( A|Y=y_{j}\right)  $$ is a number, that is, a deterministic quantity. If we allow $$\textstyle y_{j}$$ to be a realization of the random variable $$\textstyle Y$$, we obtain the conditional probability of the event $$\textstyle A$$ given the random variable $$\textstyle Y$$, denoted by $$\textstyle P\left( A|Y\right)  $$, which is a random variable itself. The conditional probability $$\textstyle P\left( A|Y\right) $$ attains the value $$\textstyle P\left(  A|Y=y_{j}\right)  $$ with probability $$\textstyle P\left(  Y=y_{j}\right)  $$.

Now suppose $$\textstyle X$$ and $$\textstyle Y$$ are two discrete real random variables with a joint distribution. Then the conditional probability distribution of $$\textstyle X$$ given $$\textstyle Y=y_{j}$$ is


 * $$ P\left( X=x_{i}|Y=y_{j}\right)  =\frac{P\left(  X=x_{i}\wedge Y=y_{j}\right) }{P\left(  Y=y_{j}\right)  }. $$

If we allow $$\textstyle y_{j}$$ to be a realization of the random variable $$\textstyle Y$$, we obtain the conditional distribution $$\textstyle P\left( X|Y\right)  $$ of the random variable $$\textstyle X$$ given the random variable $$\textstyle Y$$. Given $$\textstyle x_{i}$$, $$\textstyle P\left( X=x_{i}|Y\right)  $$ is the random variable that attains the value $$\textstyle P\left( X=x_{i}|Y=y_{j}\right)  $$ with probability $$\textstyle P\left(  Y=y_{j}\right)  $$.
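
The computation is mechanical; the following sketch (with an assumed joint pmf, purely for illustration) obtains every conditional distribution $$\textstyle P\left( X=x_{i}|Y=y_{j}\right) $$ from a joint table by dividing out the marginal:

```python
import numpy as np

# Assumed joint pmf P(X = x_i, Y = y_j): rows index x_i, columns index y_j.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_Y = joint.sum(axis=0)   # marginal P(Y = y_j), summing over x_i
cond = joint / p_Y        # column j holds P(X = x_i | Y = y_j)
print(cond)               # each column sums to 1
```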

The random variables $$\textstyle X$$ and $$\textstyle Y$$ are independent when the events $$\textstyle X=x_{i}$$ and $$\textstyle Y=y_{j}$$ are independent for all $$\textstyle x_{i}$$ and $$\textstyle y_{j}$$, that is,


 * $$ P\left( X=x_{i}\wedge Y=y_{j}\right)  =P\left(  X=x_{i}\right)  P\left( Y=y_{j}\right)  . $$

Clearly, this is equivalent to


 * $$ P\left( X=x_{i}|Y=y_{j}\right)  =P\left(  X=x_{i}\right)  . $$

The conditional expectation of $$\textstyle X$$ given the value $$\textstyle Y=y_{j}$$ is


 * $$\begin{align} E\left( X|Y=y_{j}\right)   &  =\sum_{i}x_{i}P\left(  X=x_{i}|Y=y_{j}\right) \\ &  =\sum_{i}x_{i}\frac{P\left(  X=x_{i}\wedge Y=y_{j}\right)  }{P\left( Y=y_{j}\right)  }, \end{align}$$

which is defined whenever the marginal probability


 * $$ P\left( Y=y_{j}\right)  =\sum_{i}P\left(  X=x_{i}\wedge Y=y_{j}\right)  >0. $$

This is a description common in statistics. Note that $$\textstyle E\left( X|Y=y_{j}\right)  $$ is a number, that is, a deterministic quantity, and the particular value of $$\textstyle y_{j}$$ does not matter; only the probabilities $$\textstyle P\left(  X=x_{i}\wedge Y=y_{j}\right)  $$ do.

If we allow $$\textstyle y_{j}$$ to be a realization of the random variable $$\textstyle Y$$, we obtain the conditional expectation of the random variable $$\textstyle X$$ given the random variable $$\textstyle Y$$, denoted by $$\textstyle E\left( X|Y\right)  $$. This form is closer to the mathematical form favored by probabilists (described in more detail below), and it is a random variable itself. The conditional expectation $$\textstyle E\left( X|Y\right) $$ attains the value $$\textstyle E\left(  X|Y=y_{j}\right)  $$ with probability $$\textstyle P\left(  Y=y_{j}\right)  $$.
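
Continuing the same kind of toy example (again with an assumed joint pmf), the numbers $$\textstyle E\left( X|Y=y_{j}\right) $$ and the distribution of the random variable $$\textstyle E\left( X|Y\right) $$ can be computed as follows; the last two lines illustrate that the mean of $$\textstyle E\left( X|Y\right) $$ equals $$\textstyle E\left( X\right) $$:

```python
import numpy as np

# Assumed joint pmf of (X, Y): rows are values x_i of X, columns values y_j of Y.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])
x_vals = np.array([0.0, 1.0])

p_Y = joint.sum(axis=0)                                      # P(Y = y_j)
E_X_given_yj = (x_vals[:, None] * joint).sum(axis=0) / p_Y   # E(X | Y = y_j)
print(E_X_given_yj)

# E(X|Y) attains E(X|Y=y_j) with probability P(Y=y_j); its mean is E(X).
print((E_X_given_yj * p_Y).sum())          # E(E(X|Y))
print((x_vals * joint.sum(axis=1)).sum())  # E(X) directly; the same number
```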

Conditioning of continuous random variables
For continuous random variables $$\textstyle X$$, $$\textstyle Y$$ with joint density $$\textstyle p_{X,Y}\left( x,y\right) $$, the conditional probability density of $$\textstyle X$$ given that $$\textstyle Y=y$$ is


 * $$ p_{X|Y}\left( x,y\right)  =\frac{p_{X,Y}\left(  x,y\right)  }{p_{Y}\left( y\right)  }, $$

where


 * $$ p_{Y}\left( y\right)  =\int p_{X,Y}\left(  x,y\right)  dx $$

is the marginal density of $$\textstyle Y$$. The conventional notation $$\textstyle p_{X|Y}\left( x|y\right) $$ is often used to mean the same as $$\textstyle p_{X|Y}\left(  x,y\right)  $$, that is, the function $$\textstyle p_{X|Y}$$ of two variables $$\textstyle x$$ and $$\textstyle y$$. The notation $$\textstyle p\left( x|y\right)  $$, often used in practice, is ambiguous, because if $$\textstyle x$$ and $$\textstyle y$$ are substituted for by something else (such as specific numbers), the information about what $$\textstyle p$$ means is lost.

The continuous random variables $$\textstyle X$$ and $$\textstyle Y$$ are independent if, for all $$\textstyle x$$ and $$\textstyle y$$, the events $$\textstyle \left\{ X\leq x\right\}  $$ and $$\textstyle \left\{  Y\leq y\right\}  $$ are independent, which can be proved to be equivalent to


 * $$ p_{X,Y}\left( x,y\right)  =p_{X}\left(  x\right)  p_{Y}\left(  y\right)  . $$

This is clearly equivalent to


 * $$ p_{X|Y}\left( x,y\right)  =p_{X}\left(  x\right)  . $$

The conditional probability density of $$\textstyle X$$ given $$\textstyle Y$$ is the random function $$\textstyle p_{X|Y}\left( x,Y\right)  $$. The conditional expectation of $$\textstyle X$$ given the value $$\textstyle Y=y$$ is


 * $$ E\left( X|Y=y\right)  =\int xp_{X|Y}\left(  x|y\right)  dx $$

and the conditional expectation of $$\textstyle X$$ given $$\textstyle Y$$ is the random variable


 * $$ E\left( X|Y\right)  =\int xp_{X|Y}\left(  x|Y\right)  dx, $$

dependent on the values of $$\textstyle Y$$.
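
As a numerical sketch (the bivariate normal model and all parameter values below are assumptions, chosen because the answer is known in closed form), one can approximate $$\textstyle E\left( X|Y=y\right) $$ by quadrature and compare with the exact value $$\textstyle \rho y$$ for a standard bivariate normal with correlation $$\textstyle \rho$$:

```python
import numpy as np

rho = 0.8   # assumed correlation of a standard bivariate normal
y = 0.5     # condition on Y = y

x = np.linspace(-10.0, 10.0, 4001)
# Joint density p_{X,Y}(x, y) of a standard bivariate normal, correlation rho.
p_xy = np.exp(-(x**2 - 2*rho*x*y + y**2) / (2*(1 - rho**2))) \
       / (2*np.pi*np.sqrt(1 - rho**2))
p_y = np.trapz(p_xy, x)   # marginal p_Y(y) = integral of p_{X,Y}(x, y) dx
p_cond = p_xy / p_y       # conditional density p_{X|Y}(x, y)

print(np.trapz(x * p_cond, x))  # numeric E(X | Y = y)
print(rho * y)                  # exact value for this model
```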

Warning
Unfortunately, in the literature, especially in more elementary statistics texts, authors do not always distinguish properly between conditioning on the value of a random variable (the result is a number) and conditioning on the random variable itself (the result is a random variable), so, confusingly enough, the words “given the random variable” can mean either.

Mathematical synopsis
This section follows. In probability theory, a conditional expectation (also known as conditional expected value or conditional mean) is the expected value of a random variable with respect to a conditional probability distribution, defined as follows.

If $$\textstyle X$$ is a real random variable, and $$\textstyle A$$ is an event with positive probability, then the conditional probability distribution of $$\textstyle X$$ given $$\textstyle A$$ assigns a probability $$\textstyle P(X\in B|A)$$ to the Borel set $$\textstyle B$$. The mean (if it exists) of this conditional probability distribution of $$\textstyle X$$ is denoted by $$\textstyle E(X|A)$$ and called the conditional expectation of $$\textstyle X$$ given the event $$\textstyle A$$.

If $$\textstyle Y$$ is another random variable, then the conditional expectation $$\textstyle E(X|Y=y)$$ of $$\textstyle X$$ given that $$\textstyle Y=y$$ is a function of $$\textstyle y$$, say $$\textstyle g(y)$$. An argument using the Radon-Nikodym theorem is needed to define $$\textstyle g$$ properly, because the event that $$\textstyle Y=y$$ may have probability zero. Also, $$\textstyle g$$ is defined only for almost all $$\textstyle y$$, with respect to the distribution of $$\textstyle Y$$. The conditional expectation of $$\textstyle X$$ given the random variable $$\textstyle Y$$, denoted by $$\textstyle E(X|Y)$$, is the random variable $$\textstyle g(Y)$$.

It turns out that the conditional expectation $$\textstyle E(X|Y)$$ is a function only of the $$\textstyle \sigma$$-algebra, say $$\textstyle \mathcal{A}$$, generated by the events $$\textstyle Y\in B$$ for Borel sets $$\textstyle B$$, rather than of the particular values of $$\textstyle Y$$. For a $$\textstyle \sigma $$-algebra $$\textstyle \mathcal{A}$$, the conditional expectation $$\textstyle E(X|\mathcal{A})$$ of $$\textstyle X$$ given the $$\textstyle \sigma$$-algebra $$\textstyle \mathcal{A}$$ is a random variable that is $$\textstyle \mathcal{A}$$-measurable and whose integral over any $$\textstyle \mathcal{A}$$-measurable set is the same as the integral of $$\textstyle X$$ over the same set. The existence of this conditional expectation follows from the Radon-Nikodym theorem. If $$\textstyle X$$ happens to be $$\textstyle \mathcal{A}$$-measurable, then $$\textstyle E(X|\mathcal{A})=X$$.

If $$\textstyle X$$ has an expected value, then the conditional expectation $$\textstyle E(X|Y)$$ also has an expected value, which is the same as that of $$\textstyle X$$. This is the law of total expectation.
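
A Monte Carlo sanity check of the law of total expectation (a sketch under an assumed model: $$\textstyle Y$$ uniform on $$\textstyle \left\{ 0,1,2\right\} $$ and $$\textstyle X|Y\sim\mathrm{Normal}\left( Y,1\right) $$, so that $$\textstyle E\left( X|Y\right) =Y$$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
y = rng.integers(0, 3, size=n)    # Y uniform on {0, 1, 2}
x = rng.normal(loc=y, scale=1.0)  # X | Y = y ~ Normal(y, 1), so E(X|Y) = Y

print(x.mean())  # E(X), estimated; about 1
print(y.mean())  # E(E(X|Y)) = E(Y), estimated; also about 1
```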

For simplicity, the presentation here is done for real-valued random variables, but generalization to probability on more general spaces, such as $$\textstyle \mathbb{R}^{n}$$ or normed metric spaces equipped with a probability measure, is immediate.

Mathematical prerequisites
Recall that a probability space is $$\textstyle \left( \Omega,\Sigma,P\right)  $$, where $$\textstyle \Sigma$$ is a $$\textstyle \sigma$$-algebra of subsets of $$\textstyle \Omega$$ and $$\textstyle P$$ is a probability measure defined on the measurable sets in $$\textstyle \Sigma$$. A random variable on the space $$\textstyle \left( \Omega,\Sigma,P\right)  $$ is a $$\textstyle \Sigma$$-measurable function. $$\textstyle \mathcal{B}\left( \mathbb{R}\right)  $$ is the $$\textstyle \sigma$$-algebra of all Borel sets in $$\textstyle \mathbb{R}$$. If $$\textstyle A$$ is a set and $$\textstyle X$$ a random variable, $$\textstyle X\in A$$ or $$\textstyle \left\{ X\in A\right\}  $$ are common shorthands for the event $$\textstyle \left\{ \omega:X\left(  \omega\right)  \in A\right\}  =X^{-1}\left(  A\right) \in\Sigma.$$

Probability conditional on the value of a random variable
Let $$\textstyle \left( \Omega,\Sigma,P\right)  $$ be a probability space, $$\textstyle Y$$ a $$\textstyle \Sigma $$-measurable random variable with values in $$\textstyle \mathbb{R}$$, $$\textstyle A\in\Sigma$$ (i.e., an event, not necessarily independent of $$\textstyle Y$$), and $$\textstyle B\in\mathcal{B}\left( \mathbb{R}\right)  $$. If $$\textstyle P\left( Y\in B\right)  >0$$, the conditional probability of $$\textstyle A$$ given $$\textstyle Y\in B$$ is by definition


 * $$ P\left( A|Y\in B\right)  =\frac{P\left(  A\cap\left\{  Y\in B\right\} \right)  }{P\left(  Y\in B\right)  }. $$

We wish to attach a meaning to the conditional probability of $$\textstyle A$$ given $$\textstyle Y=y$$ even when $$\textstyle P\left( Y=y\right)  =0$$. The following argument follows Wilks, who attributes it to Kolmogorov. Fix $$\textstyle A$$ and define


 * $$ Q\left( B\right)  =P\left(  A\cap\left\{  Y\in B\right\}  \right)  =P\left( A\cap Y^{-1}\left(  B\right)  \right)  . $$

Since $$\textstyle Y$$ is $$\textstyle \Sigma$$-measurable, the set function $$\textstyle Q$$ is a measure on the Borel sets $$\textstyle \mathcal{B}\left( \mathbb{R}\right)  $$. Define another measure $$\textstyle R$$ on $$\textstyle \mathcal{B}\left( \mathbb{R}\right)  $$ by


 * $$ R\left( B\right)  =P\left(  \left\{  Y\in B\right\}  \right)  \quad\forall B\in\mathcal{B}\left(  \mathbb{R}\right)  . $$

Clearly,


 * $$ 0\leq Q\left( B\right)  \leq R\left(  B\right)  \quad\forall B\in \mathcal{B}\left(  \mathbb{R}\right) $$

and hence $$\textstyle R\left( B\right)  =0$$ implies $$\textstyle Q\left(  B\right)  =0$$. Thus the measure $$\textstyle Q$$ is absolutely continuous with respect to the measure $$\textstyle R$$, and by the Radon-Nikodym theorem there exists a real-valued $$\textstyle \mathcal{B}\left( \mathbb{R}\right)  $$-measurable function $$\textstyle f$$ such that


 * $$ Q\left( B\right)  =\int_{B}f\left(  y\right)  dR\left(  y\right) \quad\forall B\in\mathcal{B}\left(  \mathbb{R}\right)  . $$

We interpret the function $$\textstyle f$$ as the conditional probability of $$\textstyle A$$ given $$\textstyle Y=y$$,


 * $$ f\left( y\right)  =P\left(  A|Y=y\right)  . $$

Once the conditional probability is defined, other concepts of probability follow, such as expectation and density.

One way to justify this interpretation is to view $$\textstyle f\left( y\right)  $$, the conditional probability of $$\textstyle A$$ given $$\textstyle Y=y$$, as the limit of probabilities conditioned on the value of $$\textstyle Y$$ being in a small neighborhood of $$\textstyle y$$. Set $$\textstyle B=N_{\varepsilon}\left( y\right) $$ (a neighborhood of $$\textstyle y$$ with radius $$\textstyle \varepsilon$$) to get


 * $$ Q\left( N_{\varepsilon}\left(  y\right)  \right)  =P\left(  A\cap Y^{-1}\left(  N_{\varepsilon}\left(  y\right)  \right)  \right) $$

and using the fact that $$\textstyle P\left( Y\in N_{\varepsilon}\left(  y\right) \right)  =\int_{N_{\varepsilon}\left(  y\right)  }dR$$, we have


 * $$ Q\left( N_{\varepsilon}\left(  y\right)  \right)  =\int_{N_{\varepsilon }\left(  y\right)  }fdR=\frac{\int_{N_{\varepsilon}\left(  y\right)  }fdR}{\int_{N_{\varepsilon}\left(  y\right)  }dR}P\left(  Y\in N_{\varepsilon }\left(  y\right)  \right) , $$

so


 * $$ P\left( A|Y\in N_{\varepsilon}\left(  y\right)  \right)  =\frac{P\left( A\cap Y\in N_{\varepsilon}\left(  y\right)  \right)  }{P\left(  Y\in N_{\varepsilon}\left(  y\right)  \right)  }=\frac{\int_{N_{\varepsilon}\left( y\right)  }fdR}{\int_{N_{\varepsilon}\left(  y\right)  }dR}\rightarrow f\left(  y\right)  ,\quad\varepsilon\rightarrow0, $$

for almost all $$\textstyle y$$ in the measure $$\textstyle R$$. (I do not know how to prove this without additional assumptions on $$\textstyle f$$, such as continuity. Wilks claims the limit a.e. “can” be proved, though he does not proceed this way, and neglects to mention that the a.e. is in the measure $$\textstyle R$$.)
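
The shrinking-neighborhood limit can at least be observed numerically (a Monte Carlo sketch under an assumed model: $$\textstyle Y\sim\mathrm{Normal}\left( 0,1\right) $$ and $$\textstyle A=\left\{ Z<Y\right\} $$ with $$\textstyle Z\sim\mathrm{Normal}\left( 0,1\right) $$ independent of $$\textstyle Y$$, so that $$\textstyle f\left( y\right) =P\left( A|Y=y\right) =\Phi\left( y\right) $$):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
n = 10**7
Y = rng.normal(size=n)
Z = rng.normal(size=n)
A = Z < Y   # the event A, as a boolean array; P(A | Y = y) = Phi(y)

y0 = 0.5
for eps in (0.5, 0.1, 0.02):
    in_nbhd = np.abs(Y - y0) < eps   # samples with Y in N_eps(y0)
    print(eps, A[in_nbhd].mean())    # estimate of P(A | Y in N_eps(y0))

print(0.5 * (1.0 + erf(y0 / sqrt(2.0))))  # Phi(y0), the limit f(y0)
```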

As another illustration and justification for understanding $$\textstyle f$$ as the conditional probability of $$\textstyle A$$ given $$\textstyle Y=y$$, we now show what happens when the random variable $$\textstyle Y$$ is discrete. Suppose $$\textstyle Y$$ attains only values $$\textstyle y_{j}$$, $$\textstyle j=1,2,\ldots$$, with $$\textstyle P\left( Y=y_{j}\right)  >0$$. Then


 * $$ R\left( B\right)  =P\left(  Y\in B\right)  =\sum_{y_{j}\in B}P\left( Y=y_{j}\right)  ,\quad\forall B\in\mathcal{B}\left(  \mathbb{R}\right)  . $$

Choose $$\textstyle y_{j}$$ and $$\textstyle B$$ as a neighborhood $$\textstyle N_{\varepsilon}\left( y_{j}\right) $$ of $$\textstyle y_{j}$$ with radius $$\textstyle \varepsilon>0$$ so small that $$\textstyle N_{\varepsilon}\left( y_{j}\right)  $$ does not contain any other $$\textstyle y_{k}$$, $$\textstyle k\neq j$$. Then for any $$\textstyle A\in\Sigma$$,


 * $$ Q\left( N_{\varepsilon}\left(  y_{j}\right)  \right)  =P\left(  A\cap\left\{ Y\in N_{\varepsilon}\left(  y_{j}\right)  \right\}  \right)  =P\left(  A\cap\left\{  Y=y_{j}\right\}  \right) $$

by the definition of $$\textstyle Q$$, and from the definition of $$\textstyle f$$ as the Radon-Nikodym derivative,


 * $$ Q\left( N_{\varepsilon}\left(  y_{j}\right)  \right)  =\int_{N_{\varepsilon}\left(  y_{j}\right)  }f\left(  y\right)  dR\left(  y\right)  =f\left(  y_{j}\right)  P\left( Y=y_{j}\right)  . $$

This gives, for $$\textstyle y=y_{j}$$,


 * $$\begin{align} f\left( y\right)   &  =\lim_{\varepsilon\rightarrow0}\frac{P\left( A\cap\left\{  Y\in N_{\varepsilon}\left(  y\right)  \right\}  \right) }{P\left(  Y\in N_{\varepsilon}\left(  y\right)  \right)  }=\lim _{\varepsilon\rightarrow0}P\left(  A|Y\in N_{\varepsilon}\left(  y\right) \right) \\ &  =\frac{P\left(  A\cap\left\{  Y=y\right\}  \right)  }{P\left(  Y=y\right) }=P\left(  A|Y=y\right) , \end{align}$$

by the definition of conditional probability. The function $$\textstyle f\left( y\right)  $$ is defined only on the set $$\textstyle \left\{  y_{1},y_{2},\ldots\right\}  $$, but because that is where the distribution of $$\textstyle Y$$ is concentrated, $$\textstyle f$$ is defined almost surely.

Expectation conditional on the value of a random variable
Suppose that $$\textstyle X$$ and $$\textstyle Y$$ are random variables, with $$\textstyle X$$ integrable. Define again the measure on $$\textstyle \mathcal{B}\left( \mathbb{R}\right)  $$ generated by the random variable $$\textstyle Y$$,


 * $$ R\left( B\right)  =P\left(  Y\in B\right)  =P\left(  Y^{-1}\left(  B\right) \right) , $$

and a signed finite measure on $$\textstyle \mathcal{B}\left( \mathbb{R}\right)  $$,


 * $$ Q\left( B\right)  =E\left(  X\mathbf{1}_{Y\in B}\right)  =\int_{\omega :Y\left(  \omega\right)  \in B}X\left(  \omega\right)  P\left(  d\omega \right)  =\int_{Y^{-1}\left(  B\right)  }X\left(  \omega\right)  P\left( d\omega\right)  . $$

Here, $$\textstyle \mathbf{1}_{Y\in B}$$ is the indicator function of the event $$\textstyle Y\in B$$, so $$\textstyle \left( X\mathbf{1}_{Y\in B}\right)  \left(  \omega\right)  =X\left( \omega\right)  $$ if $$\textstyle Y\left(  \omega\right)  \in B$$ and zero otherwise. Since


 * $$ \left\vert Q\left( B\right)  \right\vert  \leq\int_{Y^{-1}\left(  B\right)  }\left\vert X\left(  \omega\right)  \right\vert  P\left(  d\omega\right)  \leq E\left(  \left\vert X\right\vert \right)  <+\infty $$

and the integral of the integrable function $$\textstyle \left\vert X\right\vert $$ over a set of $$\textstyle P$$-measure zero vanishes, we have that $$\textstyle R\left(  B\right)  =P\left( Y^{-1}\left( B\right) \right) =0\Longrightarrow Q\left(  B\right)  =0$$, so $$\textstyle Q$$ is a finite signed measure that is absolutely continuous with respect to $$\textstyle R$$. Consequently, there exists a Radon-Nikodym derivative $$\textstyle f$$ such that


 * $$ Q\left( B\right)  =\int_{B}f\left(  y\right)  R\left(  dy\right) ,\quad\forall B\in\mathcal{B}\left(  \mathbb{R}\right)  . $$

The value $$\textstyle f\left( y\right)  $$ is the conditional expectation of $$\textstyle X$$ given $$\textstyle Y=y$$, denoted by $$\textstyle E\left(  X|Y=y\right)  $$. The result can then be written as


 * $$ E\left( X\mathbf{1}_{Y\in B}\right)  =\int_{B}E\left(  X|Y=y\right)  P\left( Y\in dy\right) , $$

for almost all $$\textstyle y$$ in the measure $$\textstyle P\left( Y\in dy\right)  $$ generated by the random variable $$\textstyle Y$$.
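
The defining identity can be checked by simulation (a sketch; the model $$\textstyle Y\sim\mathrm{Uniform}\left( 0,1\right) $$, $$\textstyle X=Y^{2}+\text{noise}$$ is an assumption chosen so that $$\textstyle E\left( X|Y=y\right) =y^{2}$$ is known):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10**6
Y = rng.uniform(size=n)
X = Y**2 + rng.normal(size=n)   # so E(X | Y = y) = y**2

B_low, B_high = 0.2, 0.7        # the Borel set B = (0.2, 0.7)
in_B = (Y > B_low) & (Y < B_high)

# Left side: E(X 1_{Y in B}), estimated by Monte Carlo.
print((X * in_B).mean())
# Right side: integral over B of E(X|Y=y) P(Y in dy) = int_0.2^0.7 y^2 dy.
print((B_high**3 - B_low**3) / 3)
```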

This definition is consistent with that of conditional probability: the conditional probability of $$\textstyle A$$ given $$\textstyle Y=y$$ is the same as the conditional mean of the indicator function of $$\textstyle A$$ given $$\textstyle Y=y$$, and the proof is exactly the same. In fact, we did not need to treat conditional probability separately at all; it is simply a special case of conditional expectation.

Expectation conditional on a random variable and on a $$\textstyle \sigma $$-algebra
Let $$\textstyle g\left( y\right)  =E\left(  X|Y=y\right)  $$ be the conditional expectation of the random variable $$\textstyle X$$ given that $$\textstyle Y=y$$. Here $$\textstyle y$$ is a fixed, deterministic value. Now let $$\textstyle y$$ be random, namely the value of the random variable $$\textstyle Y$$, that is, $$\textstyle y=Y\left( \omega\right)  $$. The result is called the conditional expectation of $$\textstyle X$$ given $$\textstyle Y$$; it is the random variable


 * $$ E\left( X|Y\right)  \left(  \omega\right)  =E\left(  X|Y=Y\left( \omega\right)  \right)  =g\left(  Y\left(  \omega\right)  \right)  . $$

So now we have the conditional expectation given in terms of the sample space $$\textstyle \Omega$$ rather than in terms of $$\textstyle \mathbb{R}$$, the range space of the random variable $$\textstyle Y$$. It will turn out that after the change of the independent variable, the particular values attained by the random variable $$\textstyle Y$$ do not matter that much; rather, it is the granularity of $$\textstyle Y$$ that is important. The granularity of $$\textstyle Y$$ can be expressed in terms of the $$\textstyle \sigma$$-algebra generated by the random variable $$\textstyle Y$$, which is


 * $$ \mathcal{A}=\left\{ Y^{-1}\left(  B\right)  :B\in\mathcal{B}\left( \mathbb{R}\right)  \right\}  . $$

By substitution, the conditional expectation $$\textstyle g$$ satisfies


 * $$ E\left( X\mathbf{1}_{\omega\in Y^{-1}\left(  B\right)  }\right) =\int_{Y^{-1}\left(  B\right)  }g\left(  Y\left(  \omega\right)  \right) P\left(  d\omega\right)  ,\quad\forall B\in\mathcal{B}\left(  \mathbb{R}\right)  , $$

which, by writing


 * $$ C=Y^{-1}\left( B\right)  ,\quad h\left(  \omega\right)  =g\left(  Y\left( \omega\right)  \right) , $$

is seen to be the same as


 * $$ \int_{C}X\left( \omega\right)  P\left(  d\omega\right)  =\int_{C}h\left( \omega\right)  P\left(  d\omega\right)  ,\quad\forall C\in\mathcal{A}. $$

It can be proved that for any $$\textstyle \sigma$$-algebra $$\textstyle \mathcal{A}\subset\Sigma$$, the random variable $$\textstyle h$$ exists and is defined by this equation uniquely, up to equality a.e. in $$\textstyle P$$. The random variable $$\textstyle h$$ is called the conditional expectation of $$\textstyle X$$ given the $$\textstyle \sigma$$-algebra $$\textstyle \mathcal{A}$$. It can be interpreted as a sort of averaging of the random variable $$\textstyle X$$ to the granularity given by the $$\textstyle \sigma$$-algebra $$\textstyle \mathcal{A}$$.
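
For the $$\textstyle \sigma$$-algebra generated by a finite partition, this averaging is literal: $$\textstyle E\left( X|\mathcal{A}\right) $$ is constant on each block of the partition and equals the probability-weighted average of $$\textstyle X$$ there. A minimal discrete sketch (the sample space, probabilities, and partition are all assumptions):

```python
import numpy as np

# Finite sample space Omega = {0,...,5} with assumed probabilities P({omega}).
P = np.array([0.1, 0.2, 0.1, 0.2, 0.3, 0.1])
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # a random variable on Omega

# The sigma-algebra A generated by the partition {0,1}, {2,3}, {4,5}.
blocks = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

h = np.empty_like(X)  # h = E(X | A) is constant on each block
for blk in blocks:
    h[blk] = (X[blk] * P[blk]).sum() / P[blk].sum()
print(h)              # X averaged to the granularity of A

# Defining property: the integrals of X and of h agree on every set in A.
for blk in blocks:
    print((X[blk] * P[blk]).sum(), (h[blk] * P[blk]).sum())
```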

The conditional probability $$\textstyle h=P\left( A|\mathcal{A}\right)  $$ of an event (that is, a set) $$\textstyle A\in\Sigma$$ given the $$\textstyle \sigma $$-algebra $$\textstyle \mathcal{A}$$ is obtained by substituting $$\textstyle X=\mathbf{1}_{\omega\in A}$$, which gives


 * $$ P\left( A\cap C\right)  =\int_{C}h\left(  \omega\right)  P\left( d\omega\right)  ,\quad\forall C\in\mathcal{A}. $$

An event $$\textstyle A\in\Sigma$$ is defined to be independent of a $$\textstyle \sigma $$-algebra $$\textstyle \mathcal{A}\subset\Sigma$$ if $$\textstyle A$$ and any $$\textstyle C\in\mathcal{A}$$ are independent. It is easy to see that $$\textstyle A\in\Sigma$$ is independent of the $$\textstyle \sigma$$-algebra $$\textstyle \mathcal{A}$$ if and only if


 * $$ P\left( A\cap C\right)  =P\left(  A\right)  P\left(  C\right)  =\int _{C}P\left(  A\right)  P\left(  d\omega\right)  ,\quad\forall C\in \mathcal{A}, $$

that is, if and only if $$\textstyle P\left( A|\mathcal{A}\right)  =P\left(  A\right)  $$ a.s. (which is a particularly obscure way to write independence given how complicated the definitions are).

Two random variables $$\textstyle X$$, $$\textstyle Y$$ are said to be independent if


 * $$ P\left( X\in A\wedge Y\in B\right)  =P\left(  X\in A\right)  P\left(  Y\in B\right)  ,\quad\forall A,B\in\mathcal{B}\left(  \mathbb{R}\right) , $$

which is now seen to be the same as


 * $$ P\left( X\in A|Y\right)  =P\left(  X\in A\right)  ,\quad\forall A\in\mathcal{B}\left(  \mathbb{R}\right)  . $$

Properties of conditional expectation
To be done.

Conditional density and likelihood
Now that we have $$\textstyle P\left( A|Y=y\right)  $$ for an arbitrary event $$\textstyle A$$, we can define the conditional probability $$\textstyle P\left(  X\in F|Y=y\right)  $$ for a random variable $$\textstyle X$$ and a Borel set $$\textstyle F$$. Thus we can define the conditional density $$\textstyle p_{X|Y}\left( x,y\right)  $$ as the Radon-Nikodym derivative satisfying


 * $$ P\left( X\in F|Y=y\right)  =\int_{F}p_{X|Y}\left(  x,y\right)  d\mu\left( x\right) $$

where $$\textstyle \mu$$ is the Lebesgue measure. In the conditional density $$\textstyle p_{X|Y}\left( x,y\right)  $$, $$\textstyle X$$ and $$\textstyle Y$$ are random variables that identify the density function, and $$\textstyle x$$ and $$\textstyle y$$ are the arguments of the density function.

Note that in general $$\textstyle p_{X|Y}\left( x,y\right)  $$ is defined only for almost all $$\textstyle x$$ (in Lebesgue measure) and almost all $$\textstyle y$$ (in the measure $$\textstyle R$$ generated by the random variable $$\textstyle Y$$). Under reasonable additional conditions (for example, it is enough to assume that the joint density $$\textstyle p_{X,Y}$$ is continuous at $$\textstyle \left(  x,y\right)  $$ and $$\textstyle p_{Y}\left(  y\right)  >0$$), the density of $$\textstyle X$$ conditional on $$\textstyle Y=y$$ satisfies


 * $$\begin{align} p_{X|Y}\left( x,y\right)   &  =\lim_{\varepsilon\rightarrow0}\frac{P\left( X\in N_{\varepsilon}\left(  x\right)  |Y\in N_{\varepsilon}\left(  y\right) \right)  }{\mu\left(  N_{\varepsilon}\left(  x\right)  \right)  }\\ &  =\lim_{\varepsilon\rightarrow0}\frac{P\left(  X\in N_{\varepsilon}\left( x\right)  \cap Y\in N_{\varepsilon}\left(  y\right)  \right)  }{\mu\left( N_{\varepsilon}\left(  x\right)  \right)  P\left(  Y\in N_{\varepsilon}\left( y\right)  \right)  }\\ &  =\lim_{\varepsilon\rightarrow0}\frac{P\left(  X\in N_{\varepsilon}\left( x\right)  \cap Y\in N_{\varepsilon}\left(  y\right)  \right)  }{\mu\left( N_{\varepsilon}\left(  x\right)  \right)  \mu\left(  N_{\varepsilon}\left( y\right)  \right)  }\frac{\mu\left(  N_{\varepsilon}\left(  y\right)  \right) }{P\left(  Y\in N_{\varepsilon}\left(  y\right)  \right)  }\\ &  =\frac{p_{X,Y}\left(  x,y\right)  }{p_{Y}\left(  y\right)  }. \end{align}$$

Note that this density is a deterministic function.
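
The double-neighborhood limit can be checked by Monte Carlo (a sketch; the bivariate normal model with correlation $$\textstyle \rho$$ is an assumption, chosen because the exact conditional density of $$\textstyle X$$ given $$\textstyle Y=y$$ is $$\textstyle \mathrm{Normal}\left( \rho y,1-\rho^{2}\right) $$):

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n = 0.8, 10**7
Y = rng.normal(size=n)
X = rho * Y + np.sqrt(1 - rho**2) * rng.normal(size=n)  # correlation rho

x0, y0, eps = 0.3, 0.5, 0.05
near_y = np.abs(Y - y0) < eps
near_xy = near_y & (np.abs(X - x0) < eps)

# P(X in N_eps(x0) and Y in N_eps(y0)) / (mu(N_eps(x0)) P(Y in N_eps(y0)))
print(near_xy.mean() / (2 * eps * near_y.mean()))

# Exact conditional density: Normal(rho*y0, 1 - rho^2) evaluated at x0.
var = 1 - rho**2
print(np.exp(-(x0 - rho * y0)**2 / (2 * var)) / np.sqrt(2 * np.pi * var))
```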

The density of a random variable $$\textstyle X$$ conditional on a random variable $$\textstyle Y$$ is


 * $$ p_{X|Y}\left( x,Y\right)  =\frac{p_{X,Y}\left(  x,Y\right)  }{p_{Y}\left(  Y\right) }. $$

It is a function-valued random variable obtained from the deterministic function $$\textstyle p_{X|Y}\left( x,y\right)  $$ by taking $$\textstyle y$$ to be the value of the random variable $$\textstyle Y$$.

A common shorthand for the conditional density is


 * $$ p_{X|Y}\left( x,y\right)  =p\left(  x|y\right)  . $$

This abuse of notation identifies a function by the symbols used for its arguments, which is incorrect. Imagine that we wish to evaluate the conditional density of $$\textstyle X$$ at $$\textstyle 2$$ given $$\textstyle Y=1$$; then $$\textstyle p\left( x|y\right) $$ becomes $$\textstyle p\left(  2|1\right)  $$, which is nonsense.

When the value of $$\textstyle y$$ is held constant, the function $$\textstyle x\longmapsto p\left( x|y\right) $$ is a probability density function (the conditional density of $$\textstyle X$$ given $$\textstyle Y=y$$). When the value of $$\textstyle x$$ is held constant, the function $$\textstyle y\longmapsto p\left( x|y\right)  $$ is called the likelihood function.
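
The distinction matters in practice: read in $$\textstyle x$$, the formula is a density and integrates to one; read in $$\textstyle y$$, it is a likelihood and in general does not. A small sketch (the exponential model $$\textstyle X|Y=y\sim\mathrm{Exponential}\left( y\right) $$ is an assumption):

```python
import numpy as np

def p_x_given_y(x, y):
    """Assumed conditional density: X | Y = y ~ Exponential(rate y)."""
    return y * np.exp(-y * x)

grid = np.linspace(1e-6, 50.0, 200001)
print(np.trapz(p_x_given_y(grid, 2.0), grid))  # density in x: about 1
print(np.trapz(p_x_given_y(3.0, grid), grid))  # likelihood in y: about 1/9;
# the likelihood y -> p(x|y) is in general not a probability density in y.
```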