User:InfoTheorist/Shannon's theorem

In information theory, a result is known as a converse if it provides an upper bound on the achievable rate for a given channel. Thus a converse, combined with a matching achievability result (lower bound), determines the capacity of a channel. In general, two types of converse results have been studied. The first type, known as a weak converse, provides an upper bound on the achievable rate for a given probability of error $\epsilon$. This implies that if a transmitter sends at a rate higher than the bound provided by the weak converse, the probability of error will be greater than $\epsilon$. A strong converse, on the other hand, provides a much stronger result: if a transmitter sends at a rate higher than the given bound, the probability of error will not only be greater than $\epsilon$, but will converge to one as codes with larger blocklengths are used.

In Shannon's noisy-channel coding theorem, Fano's inequality is used to obtain a weak converse. However, while Fano's inequality provides a non-vanishing lower bound for the probability of error when the transmitter's rate is above the channel capacity, it is not sufficient to prove that the probability of error converges to one when the blocklength goes to infinity.

Problem Formulation
Let $$\mathcal{X}$$ and $$\mathcal{Y}$$ be sets. A channel, with input alphabet $$\mathcal{X}$$ and output alphabet $$\mathcal{Y}$$, is a sequence of conditional probability distributions $$\{p^{(n)}(y^{n}|x^{n})\}_{n=1}^{\infty}$$ where

$$ p^{(n)}(y^{n}|x^{n}):\mathcal{Y}^{n}\times\mathcal{X}^{n}\rightarrow\mathbb{R}_{\geq 0}. $$

The channel is said to be discrete if both $$\mathcal{X}$$ and $$\mathcal{Y}$$ are finite sets. A channel is called memoryless if for every positive integer $$n$$ we have

$$ p^{(n)}(y^{n}|x^{n})=\prod_{i=1}^{n}p_{i}(y_{i}|x_{i}). $$

We say a channel is stationary if every stationary input results in a stationary output. A memoryless channel is stationary if the conditional distributions $$p_{i}(\cdot\,|\,\cdot)$$ are the same for all $$i$$.

Therefore a stationary memoryless channel can be simply represented as the triple

$$ (\mathcal{X},p(y|x),\mathcal{Y}). $$
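As a concrete illustration (not part of the formal development), such a channel can be represented in code by a row-stochastic transition matrix. The following minimal Python sketch assumes a binary symmetric channel with an arbitrary crossover probability of 0.1; the names `W` and `p_n` are purely illustrative.

```python
import numpy as np

# A stationary memoryless channel (X, p(y|x), Y) stored as a row-stochastic
# matrix W, with W[x, y] = p(y|x).  Here: a binary symmetric channel BSC(0.1).
crossover = 0.1
W = np.array([[1 - crossover, crossover],
              [crossover, 1 - crossover]])

# Memorylessness: the n-letter law factorizes, p(y^n | x^n) = prod_i p(y_i | x_i).
def p_n(y_seq, x_seq):
    return np.prod([W[x, y] for x, y in zip(x_seq, y_seq)])

print(p_n([0, 1, 0], [0, 0, 0]))  # 0.9 * 0.1 * 0.9 = 0.081
```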

A $(2^{nR},n)$ code consists of a message set

$$\mathcal{W}=\{1,\dots,\lceil 2^{nR}\rceil\},$$

an encoder

$$f_{n}:\mathcal{W}\rightarrow\mathcal{X}^{n},$$

a decoder

$$g_{n}:\mathcal{Y}^{n}\rightarrow\mathcal{W}.$$

The average probability of error of the code is given by

$$ P_{e}^{(n)} = \frac{1}{M}\sum_{(w,y^{n}):g_{n}(y^{n})\neq w}p(y^{n}|f_{n}(w)), $$

where $$M=\lceil 2^{nR}\rceil$$ denotes the number of messages.

The value of n is known as the blocklength of the code.
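As a toy example of these definitions (reusing the hypothetical BSC(0.1) from the sketch above), the following computes $$P_{e}^{(n)}$$ by brute force for the blocklength-3 repetition code with majority-vote decoding; `codewords` and `decode` play the roles of $$f_{n}$$ and $$g_{n}$$.

```python
import itertools
import numpy as np

# Rate-1/3 repetition code over BSC(0.1): M = 2 messages, blocklength n = 3.
W = np.array([[0.9, 0.1], [0.1, 0.9]])
codewords = {0: (0, 0, 0), 1: (1, 1, 1)}      # encoder f_n
decode = lambda y: int(sum(y) >= 2)           # decoder g_n (majority vote)

M, n = len(codewords), 3
P_e = 0.0
for w, x in codewords.items():                # average over messages w
    for y in itertools.product((0, 1), repeat=n):
        if decode(y) != w:                    # decoding error event
            P_e += np.prod([W[xi, yi] for xi, yi in zip(x, y)])
P_e /= M
print(P_e)  # 3 * 0.1^2 * 0.9 + 0.1^3 = 0.028
```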

A rate $R$ (a nonnegative real number) is said to be achievable if there exists a sequence of $(2^{nR},n)$ codes with $P^{(n)}_{e}$ going to zero as $n$ goes to infinity. The noisy-channel coding theorem states that a rate $R$ is achievable if and only if $R$ is smaller than the capacity $C$ of the channel, where

$$C=\max_{p(x)}I(X;Y),$$

and the maximum is taken over all input distributions $p(x)$. Here and throughout, logarithms are taken to base 2, so that rates and capacities are measured in bits.
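As a numerical sketch of this maximization (again for the hypothetical BSC(0.1)), one can simply grid-search over input distributions; the Blahut–Arimoto algorithm would be the standard choice for larger alphabets. The result should match the closed-form value $$1-H_{2}(0.1)\approx 0.531$$ bits.

```python
import numpy as np

W = np.array([[0.9, 0.1], [0.1, 0.9]])         # BSC(0.1) transition matrix

def mutual_information(p_x, W):
    # I(X;Y) = sum_{x,y} p(x) p(y|x) log2( p(y|x) / p(y) )
    p_y = p_x @ W
    return np.sum(p_x[:, None] * W * np.log2(W / p_y))

qs = np.linspace(1e-6, 1 - 1e-6, 10001)        # candidate input distributions (q, 1-q)
C = max(mutual_information(np.array([q, 1 - q]), W) for q in qs)
print(C)  # approximately 0.531 bits
```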

Wolfowitz's theorem states that for any discrete memoryless channel with capacity $C$ and any $(2^{nR},n)$ code with rate $R>C$,

$$ P_{e}^{(n)}\geq 1-\frac{4A}{n(R-C)^{2}}-2^{-\frac{n(R-C)}{2}} $$

for some positive constant $A$ that depends only on the channel and not on $n$ or $M$. The proof that follows is based on Gallager's book.

Proof
For the proof we first require a lemma. This lemma is essentially a special case of the method of Lagrange multipliers for a concave function defined on the standard simplex $$\Delta^{n}$$. It is then followed by a corollary which simply applies the lemma to the mutual information.

Lemma
Let $$f:\Delta^{n}\rightarrow\mathbb{R}$$ be a concave function. Suppose $f$ has continuous partial derivatives on its domain. Then $$\alpha^{*}=(\alpha_{k}^{*})_{k=1}^{n}\in\Delta^{n}$$ maximizes $$f$$ iff there exists some real $$\lambda$$ such that for every $$k\in\mathrm{supp}(\alpha^{*}),$$

$$ \frac{\partial f}{\partial \alpha_{k}}\bigg|_{\alpha=\alpha^{*}}=\lambda, $$

and for every $$k\notin\mathrm{supp}(\alpha^{*}),$$

$$ \frac{\partial f}{\partial \alpha_{k}}\bigg|_{\alpha=\alpha^{*}}\leq\lambda. $$
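Before turning to the proof, here is a quick numerical illustration of these conditions (a sanity-check sketch only, not part of the argument), using the entropy function as the concave $$f$$: the uniform distribution maximizes entropy on the simplex, and indeed all partial derivatives there share a common value $$\lambda$$.

```python
import numpy as np

# Entropy H(alpha) = -sum_k alpha_k log2(alpha_k) is concave on the simplex.
# At the maximizer (the uniform distribution) every partial derivative
# dH/dalpha_k = -log2(alpha_k) - log2(e) takes the same value, playing the role of lambda.
alpha_star = np.full(4, 0.25)
partials = -np.log2(alpha_star) - np.log2(np.e)
print(partials)                                   # all entries equal (about 0.557)
print(-np.sum(alpha_star * np.log2(alpha_star)))  # maximum entropy log2(4) = 2 bits
```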

Proof of Lemma
Suppose $$\alpha^{*}\in\Delta^{n}$$ satisfies the above conditions. We show that $$f$$ achieves its maximum at $$\alpha^{*}$$. Let $$\alpha$$ be any element of $$\Delta^{n}$$. By the concavity of $$f$$, for any $$\theta\in [0,1]$$ we have

$$ \theta f(\alpha) + (1-\theta)f(\alpha^{*}) \leq f(\theta\alpha+(1-\theta)\alpha^{*}), $$

thus

$$ f(\alpha)-f(\alpha^{*})\leq\frac{f(\theta\alpha+(1-\theta)\alpha^{*})-f(\alpha^{*})}{\theta}. $$

Allowing $$\theta\rightarrow 0^{+}$$ and making use of the continuity of partial derivatives results in

$$ \begin{align} f(\alpha)-f(\alpha^{*}) &\leq \frac{df(\theta\alpha+(1-\theta)\alpha^{*})}{d\theta}\big|_{\theta=0}\\ &= \sum_{k=1}^{n}\frac{\partial f}{\partial \alpha_{k}}\big|_{\alpha=\alpha^{*}}(\alpha_{k}-\alpha^{*}_{k})\\ &\leq \lambda\sum_{k=1}^{n}(\alpha_{k}-\alpha^{*}_{k})=0. \end{align} $$

For the other direction suppose $$\alpha^{*}$$ maximizes $$f$$. Then for every $$\alpha\in\Delta^{n}$$ and every $$\theta\in [0,1]$$,

$$ f(\theta\alpha + (1-\theta)\alpha^{*})-f(\alpha^{*})\leq 0. $$

This implies

$$  \frac{df(\theta\alpha+(1-\theta)\alpha^{*})}{d\theta}\big|_{\theta=0^{+}}\leq 0, $$

and by the continuity of the partial derivatives,

$$ \sum_{k=1}^{n}\frac{\partial f}{\partial \alpha_{k}}\bigg|_{\alpha=\alpha^{*}}(\alpha_{k}-\alpha^{*}_{k})\leq 0. \qquad (*) $$

Since $$\alpha^{*}\in\Delta^{n}$$, at least one of its components, say $$\alpha^{*}_{1}$$, is strictly positive. Now let $$j$$ be an arbitrary element of $$\{2,\dots,n\}$$. Furthermore, for every $$k\in\{1,\dots,n\}$$, let $$e_{k}$$ denote the element of $$\Delta^{n}$$ whose entries are all zero except for a single one in the $$k^\text{th}$$ position. Define

$$ \alpha = \alpha^{*}+\alpha^{*}_{1}(e_{j}-e_{1}). $$

Then inequality (*) simplifies to

$$ \frac{\partial f}{\partial \alpha_{j}}\big|_{\alpha^{*}}\leq \frac{\partial f}{\partial \alpha_{1}}\big|_{\alpha^{*}}. $$

In addition, if $$\alpha^{*}_{j}>0$$ and we define

$$ \alpha = \alpha^{*}+\alpha^{*}_{j}(e_{1}-e_{j}), $$

then (*) results in

$$ \frac{\partial f}{\partial \alpha_{1}}\big|_{\alpha^{*}}\leq \frac{\partial f}{\partial \alpha_{j}}\big|_{\alpha^{*}}. $$

Thus if we define $$\lambda$$ as

$$ \lambda = \frac{\partial f}{\partial \alpha_{1}}\big|_{\alpha^{*}}, $$

then the two inequalities above show that $$\frac{\partial f}{\partial \alpha_{j}}\big|_{\alpha^{*}}=\lambda$$ for every $$j\in\mathrm{supp}(\alpha^{*})$$ and $$\frac{\partial f}{\partial \alpha_{j}}\big|_{\alpha^{*}}\leq\lambda$$ for every $$j\notin\mathrm{supp}(\alpha^{*})$$, as claimed.

Corollary
For any discrete memoryless channel the distribution $$p^{*}(x)$$ achieves capacity iff there exists a real number $$C$$ such that $$I^{*}(x;Y)=C$$ for $$x\in\text{supp}(p^{*})$$, and $$I^{*}(x;Y)\leq C$$ for $$x\notin\text{supp}(p^{*})$$, where

$$I^{*}(x;Y)=\sum_{y}p(y|x)\log\frac{p(y|x)}{\sum_{x'}p^{*}(x')p(y|x')}$$.

Furthermore, $$C$$ is the capacity of the channel.

Proof of Corollary
The proof of the corollary follows directly from the lemma. To see this, note that for any $$x,x'\in\mathcal{X}$$,

$$ \frac{\partial I(x';Y)}{\partial p(x)} = -\log e\sum_{y}\frac{p(y|x')p(y|x)}{p(y)}, $$

where $$p(y)=\sum_{x''}p(x'')p(y|x'')$$.

Since

$$ I(X;Y) = \sum_{x'}p(x')I(x';Y), $$

this implies

$$ \begin{align} \frac{\partial I(X;Y)}{\partial p(x)} &= I(x;Y) + \sum_{x'}p(x')\frac{\partial I(x';Y)}{\partial p(x)}\\ &= I(x;Y) - \log e\sum_{y}p(y|x)\sum_{x'}\frac{p(x')p(y|x')}{p(y)}\\ &= I(x;Y)-\log e. \end{align} $$

Applying the lemma with $$\lambda=C-\log e$$, and noting that $$I(x;Y)$$ evaluated at $$p^{*}$$ is precisely $$I^{*}(x;Y)$$, the claimed conditions follow. Finally, since $$I^{*}(x;Y)=C$$ for every $$x\in\text{supp}(p^{*})$$, we have $$\sum_{x}p^{*}(x)I^{*}(x;Y)=C$$; the left-hand side is $$I(X;Y)$$ evaluated at $$p^{*}$$, so $$C$$ is indeed the capacity of the channel.
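As a sanity check of the corollary (not part of the proof), consider once more the hypothetical BSC(0.1), for which the uniform input is known to achieve capacity; the quantities $$I^{*}(x;Y)$$ computed below coincide for both inputs and equal the capacity.

```python
import numpy as np

W = np.array([[0.9, 0.1], [0.1, 0.9]])   # BSC(0.1)
p_star = np.array([0.5, 0.5])            # capacity-achieving input distribution
q_star = p_star @ W                      # induced output distribution

# I*(x;Y) = sum_y p(y|x) log2( p(y|x) / q*(y) ) for each input symbol x
I_star = [np.sum(W[x] * np.log2(W[x] / q_star)) for x in range(2)]
print(I_star)  # both approximately 0.531 bits = capacity of BSC(0.1)
```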

Proof of the Strong Converse
For any two random variables $$X$$ and $$Y$$, define the information density as

$$i(X,Y)=\log\frac{p(Y|X)}{p(Y)}.$$

Note that

$$I(x;Y)=\mathbb{E}[i(X,Y)|X=x],$$

and

$$I(X;Y)=\mathbb{E}[i(X,Y)].$$

Let $$p^{*}(y)$$ be the capacity-achieving output distribution. For any positive integer $$n$$, define

$$ p^{*}(y^{n})=\prod_{i=1}^{n}p^{*}(y_{i}). $$

For any $$(x^{n},y^{n})\in\mathcal{X}^{n}\times\mathcal{Y}^{n}$$, define

$$ i(x^{n},y^{n})=\log\frac{p(y^{n}|x^{n})}{p^{*}(y^{n})}=\sum_{i=1}^{n}i(x_{i},y_{i}), $$

where

$$ i(x_{i},y_{i})=\log\frac{p(y_{i}|x_{i})}{p^{*}(y_{i})}. $$
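The sketch below (continuing the hypothetical BSC(0.1) example) computes the single-letter and $$n$$-letter information densities with respect to the capacity-achieving output distribution, illustrating the additive decomposition above.

```python
import numpy as np

W = np.array([[0.9, 0.1], [0.1, 0.9]])   # BSC(0.1)
q_star = np.array([0.5, 0.5])            # capacity-achieving output distribution

def i_single(x, y):
    # i(x, y) = log2( p(y|x) / p*(y) )
    return np.log2(W[x, y] / q_star[y])

def i_seq(x_seq, y_seq):
    # i(x^n, y^n) = sum_i i(x_i, y_i), by memorylessness and the product form of p*
    return sum(i_single(x, y) for x, y in zip(x_seq, y_seq))

print(i_seq([0, 0, 0], [0, 1, 0]))  # 2*log2(1.8) + log2(0.2), about -0.63
```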

Consider a $(2^{nR},n)$ code with codewords $$\{x_{m}^{n}\}_{m=1}^{M}$$ and decoding regions $$\{D_{m}\}_{m=1}^{M}$$. Then the average probability of correct decoding is given by

$$ P_{c}=\frac{1}{M}\sum_{m=1}^{M}\sum_{y^{n}\in D_{m}}p(y^{n}|x_{m}^{n}). $$

Fix $\epsilon>0$. For every $m$, define the set

$$ B_{m}=\{y^{n}:i(x_{m}^{n},y^{n})>n(C+\epsilon)\}. $$

Then

$$ P_{c}=\frac{1}{M}\sum_{m=1}^{M}\sum_{y^{n}\in D_{m}\cap B_{m}}p(y^{n}|x_{m}^{n}) +\frac{1}{M}\sum_{m=1}^{M}\sum_{y^{n}\in D_{m}\cap B_{m}^{c}} p(y^{n}|x_{m}^{n}). $$

Based on the definition of $B_{m}$, for every $$y^{n}\in B_{m}^{c}$$ we have $$p(y^{n}|x_{m}^{n})\leq 2^{n(C+\epsilon)}p^{*}(y^{n})$$, so the second sum can be upper bounded as

$$  \begin{align} \frac{1}{M}\sum_{m=1}^{M}\sum_{y^{n}\in D_{m}\cap B_{m}^{c}} p(y^{n}|x_{m}^{n}) &\leq \frac{1}{M}\sum_{m=1}^{M}\sum_{y^{n}\in D_{m}\cap B_{m}^{c}} 2^{n(C+\epsilon)} p^{*}(y^{n})\\ &=\frac{2^{n(C+\epsilon)}}{M}\sum_{m=1}^{M}\sum_{y^{n}\in D_{m}\cap B_{m}^{c}}p^{*}(y^{n}) \leq\frac{1}{M}2^{n(C+\epsilon)}, \end{align} $$

where the last inequality holds because the decoding regions $$D_{m}$$ are disjoint.

Using Chebyshev's inequality, together with the fact that $$\mathbb{E}[i(X_{i},Y_{i})|X_{i}=x_{mi}]=I^{*}(x_{mi};Y)\leq C$$ (by the corollary), we can upper bound the first sum:

$$ \begin{align} \sum_{y^{n}\in D_{m}\cap B_{m}}p(y^{n}|x_{m}^{n}) &\leq \sum_{y^{n}\in B_{m}}p(y^{n}|x_{m}^{n}) \\ &=\mathrm{Pr}\big[i(X^{n},Y^{n})>n(C+\epsilon)\,\big|\,X^{n}=x_{m}^{n}\big] \\ &=\mathrm{Pr}\Big[\sum_{i=1}^{n}i(X_{i},Y_{i})>n(C+\epsilon)\,\Big|\,X^{n}=x_{m}^{n}\Big]\\ &\leq \frac{\sum_{i=1}^{n}\mathrm{Var}(i(X_{i},Y_{i})|X_{i}=x_{mi})} {\left(n(C+\epsilon)-\sum_{i=1}^{n}\mathbb{E}[i(X_{i},Y_{i})|X_{i}=x_{mi}]\right)^{2}}\\ &= \frac{\sum_{i=1}^{n}\mathrm{Var}(i(X_{i},Y_{i})|X_{i}=x_{mi})} {\left(n(C+\epsilon)-\sum_{i=1}^{n}I^{*}(x_{mi};Y)\right)^{2}}\\ &\leq \frac{1}{n^{2}\epsilon^{2}}\sum_{i=1}^{n}\mathrm{Var}(i(X_{i},Y_{i})|X_{i}=x_{mi}), \end{align} $$

where

$$ \mathrm{Var}(i(X_{i},Y_{i})|X_{i}=x_{mi})=\sum_{y}p(y|x_{mi})\Big(\log\frac{p(y|x_{mi})}  {p^{*}(y)}\Big)^{2}-\Big(\sum_{y}p(y|x_{mi})\log\frac{p(y|x_{mi})}{p^{*}(y)}\Big)^{2}. $$

If we define

$$A=\max_{x\in\mathcal{X}}\mathrm{Var}(i(X,Y)|X=x),$$

(note that $A$ depends only on the channel and is independent of $n$ and $M$), then

$$\sum_{y^{n}\in B_{m}}p(y^{n}|x_{m}^{n})\leq\frac{A}{n\epsilon^{2}}.$$
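As an aside, the constant $$A$$ is easy to evaluate for a specific channel. The following sketch computes it for the hypothetical BSC(0.1), using the capacity-achieving output distribution from the earlier examples; by symmetry both inputs give the same conditional variance.

```python
import numpy as np

W = np.array([[0.9, 0.1], [0.1, 0.9]])   # BSC(0.1)
q_star = np.array([0.5, 0.5])            # capacity-achieving output distribution

def var_info_density(x):
    # Var(i(X,Y) | X = x) = E[i^2 | X=x] - (E[i | X=x])^2 with Y ~ p(.|x)
    i_vals = np.log2(W[x] / q_star)
    mean = np.sum(W[x] * i_vals)         # this is I*(x;Y)
    return np.sum(W[x] * i_vals**2) - mean**2

A = max(var_info_density(x) for x in range(2))
print(A)  # approximately 0.904
```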

Therefore

$$P_{c}\leq\frac{A}{n\epsilon^{2}}+\frac{1}{M}2^{n(C+\epsilon)}.$$

Since $$M=\lceil 2^{nR}\rceil\geq 2^{nR}$$ and $$P_{e}^{(n)}=1-P_{c}$$, we get

$$ \begin{align} P_{e}^{(n)} &\geq 1-\frac{A}{n\epsilon^{2}}-\frac{1}{M}2^{n(C+\epsilon)}\\ &\geq 1-\frac{A}{n\epsilon^{2}}-2^{n(C-R+\epsilon)}. \end{align} $$

If $R>C$, then setting $\epsilon=\frac{R-C}{2}$ results in

$$P_{e}^{(n)}\geq 1-\frac{4A}{n(R-C)^{2}}-2^{-\frac{n}{2}(R-C)},$$

which completes the proof.
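To see the strong converse at work numerically, the sketch below evaluates this lower bound for the hypothetical BSC(0.1) (with $$C\approx 0.531$$ bits and $$A\approx 0.904$$ from the earlier sketches) at a rate $$R=0.8>C$$; the bound approaches one as the blocklength grows, exactly as the theorem predicts.

```python
# Strong converse lower bound 1 - 4A/(n (R-C)^2) - 2^{-n(R-C)/2} for BSC(0.1) at R = 0.8.
C, A, R = 0.531, 0.904, 0.8
for n in [100, 1000, 10000]:
    bound = 1 - 4 * A / (n * (R - C) ** 2) - 2.0 ** (-n * (R - C) / 2)
    print(n, max(bound, 0.0))  # roughly 0.50, 0.95, 0.995
```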