User:PAR/sandbox

Quantal CMF
For theoretical purposes, it is often convenient to characterize radiation in terms of photons rather than energy. The energy E of a photon is given by the Planck relation


 * $$E = h \nu = h c/\lambda$$

where E is the energy per photon, h is Planck's constant, c is the speed of light, &nu; is the frequency of the radiation and &lambda; is the wavelength. A spectral radiative quantity in terms of energy, JE(&lambda;), is converted to its quantal form JQ(&lambda;) by dividing by the energy per photon:


 * $$JQ(\lambda) = JE(\lambda) (\lambda/hc)$$

For example, if JE(&lambda;) is spectral radiance with units of watts/m2/sr/m, then the quantal equivalent JQ(&lambda;) characterizes that radiation with units of photons/sec/m2/sr/m.
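As a quick numerical illustration of this conversion (a sketch, not part of the derivation above; the helper name is hypothetical, and h and c are CODATA values):

```python
# A minimal sketch of the energy-to-quantal conversion JQ(lambda) = JE(lambda)*lambda/(h*c).
h = 6.62607015e-34    # Planck constant, J*s
c = 2.99792458e8      # speed of light, m/s

def quantal_from_energy(JE, wavelength):
    # JE in energy units (e.g. watts/m^2/sr/m), wavelength in meters;
    # returns the quantal form (e.g. photons/sec/m^2/sr/m).
    return JE*wavelength/(h*c)

# 1 W/m^2/sr/m of spectral radiance at 555 nm:
JQ = quantal_from_energy(1.0, 555e-9)
print(JQ)   # roughly 2.79e18 photons/sec/m^2/sr/m
```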

If CE&lambda;i(&lambda;) (i=1,2,3) are the three energy-based color matching functions for a particular color space (LMS color space for the purposes of this article), then the tristimulus values may be expressed in terms of the quantal radiative quantity by:


 * $$CE_i = \int_0^\infty JE(\lambda) CE_{\lambda i}(\lambda) d\lambda = \int_0^\infty JQ(\lambda)(hc/\lambda) CE_{\lambda i}(\lambda) d\lambda$$

Define the quantal color matching functions:


 * $$CQ_{\lambda i}(\lambda) = (CE_{\lambda i}(\lambda)/\lambda)/(CE_{\lambda i}(\lambda_{max i})/\lambda_{max i})$$

where &lambda;max i is the wavelength at which CE&lambda;i(&lambda;)/&lambda; is maximized. Define the quantal tristimulus values:


 * $$CQ_i = \int_0^\infty JQ(\lambda) CQ_{\lambda i}(\lambda) d\lambda$$

Note that, as with the energy-based functions, the peak value of CQ&lambda;i(&lambda;) will be equal to unity. Using the above equation for the energy tristimulus values CEi:


 * $$CE_i = (hc/\lambda_{max i})\,CE_{\lambda i}(\lambda_{max i})\, CQ_i$$

For the LMS color space, $$\lambda_{max i}$$ = {566,541,441} nm and


 * $$CE_i/CQ_i = \{3.49694,\,3.1253,\,0.144944\} \times 10^{-19}$$

in units of joules per photon.

A more general proof
Suppose we are given four equal-length lists of field elements $$n_i$$, $$z_i$$, $$n_i'$$, $$z_i'$$ from which we may define $$w_i=n_i'/n_i$$. $$n_i$$ and $$z_i$$ will be called the parent population numbers and characteristics associated with each index i. Likewise $$n_i'$$ and $$z_i'$$ will be called the child population numbers and characteristics. (Equivalently, we could have been given $$n_i$$, $$z_i$$, $$w_i$$, $$z_i'$$ with $$n_i'=w_i n_i$$, where $$w_i$$ is referred to as the fitness associated with index i.) Define the parent and child population totals:
 * $$n\;\stackrel{\mathrm{def}}{=}\;\sum_i n_i \qquad\qquad n'\;\stackrel{\mathrm{def}}{=}\;\sum_i n_i'$$

and the probabilities (or frequencies):

 * $$q_i\;\stackrel{\mathrm{def}}{=}\;n_i/n \qquad\qquad q_i'\;\stackrel{\mathrm{def}}{=}\;n_i'/n'$$

Note that these are of the form of probability mass functions in that $$\sum_i q_i = \sum_i q_i' = 1$$ and are in fact the probabilities that a random individual drawn from the parent population has characteristic $$z_i$$, and likewise for the child population. Define the fitnesses:

 * $$w_i\;\stackrel{\mathrm{def}}{=}\;n_i'/n_i$$

The average of any list $$x_i$$ is given by:
 * $$E(x_i)=\sum_i q_i x_i$$

so the average characteristics are defined as:

 * $$z\;\stackrel{\mathrm{def}}{=}\;\sum_i q_i z_i \qquad\qquad z'\;\stackrel{\mathrm{def}}{=}\;\sum_i q_i' z_i'$$

and the average fitness is:

 * $$w\;\stackrel{\mathrm{def}}{=}\;\sum_i q_i w_i$$

A simple theorem can be proved: $$q_i w_i = \left(\frac{n_i}{n}\right)\left(\frac{n_i'}{n_i}\right) = \left(\frac{n_i'}{n'}\right) \left(\frac{n'}{n}\right)=q_i'\left(\frac{n'}{n}\right)$$ so that:
 * $$w=\frac{n'}{n}\sum_i q_i' = \frac{n'}{n}$$

and
 * $$q_i w_i = w\,q_i'$$

The covariance of $$w_i$$ and $$z_i$$ is defined by:
 * $$\operatorname{cov}(w_i,z_i)\;\stackrel{\mathrm{def}}{=}\;E(w_i z_i)-E(w_i)E(z_i) = \sum_i q_i w_i z_i - w z$$

Defining $$\Delta z_i \;\stackrel{\mathrm{def}}{=}\; z_i'-z_i$$, the expectation value of $$w_i \Delta z_i$$ is
 * $$E(w_i \Delta z_i) = \sum_i q_i w_i (z_i'-z_i) = \sum_i q_i w_i z_i' - \sum_i q_i w_i z_i$$

The sum of the two terms is:
 * $$\operatorname{cov}(w_i,z_i)+E(w_i \Delta z_i) = \sum_i q_i w_i z_i - w z + \sum_i q_i w_i z_i' - \sum_i q_i w_i z_i = \sum_i q_i w_i z_i' - w z $$

Using the above mentioned simple theorem, the sum becomes
 * $$\operatorname{cov}(w_i,z_i)+E(w_i \Delta z_i) = w\sum_i q_i' z_i' - w z = w z'-wz = w\Delta z$$

where $$\Delta z\;\stackrel{\mathrm{def}}{=}\;z'-z$$.
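The derived identity $$w\,\Delta z = \operatorname{cov}(w_i,z_i)+E(w_i \Delta z_i)$$ can be checked numerically; the sketch below uses arbitrary illustrative population numbers (all values are made up):

```python
# Numerical check of w*(z' - z) = cov(w_i, z_i) + E(w_i * dz_i),
# with arbitrary illustrative parent/child numbers and characteristics.
n_i  = [10.0, 20.0, 30.0]     # parent population numbers n_i
z_i  = [1.0, 2.0, 4.0]        # parent characteristics z_i
np_i = [15.0, 10.0, 45.0]     # child population numbers n_i'
zp_i = [1.5, 2.0, 3.0]        # child characteristics z_i'

n, n_prime = sum(n_i), sum(np_i)
q   = [a/n for a in n_i]                  # parent frequencies q_i
qp  = [a/n_prime for a in np_i]           # child frequencies q_i'
w_i = [b/a for a, b in zip(n_i, np_i)]    # fitnesses w_i = n_i'/n_i

w  = sum(qi*wi for qi, wi in zip(q, w_i))    # average fitness (= n'/n)
z  = sum(qi*zi for qi, zi in zip(q, z_i))    # average parent characteristic
zp = sum(qi*zi for qi, zi in zip(qp, zp_i))  # average child characteristic

cov_wz = sum(qi*wi*zi for qi, wi, zi in zip(q, w_i, z_i)) - w*z
e_w_dz = sum(qi*wi*(zpi - zi) for qi, wi, zi, zpi in zip(q, w_i, z_i, zp_i))

lhs = w*(zp - z)          # w * Delta z
rhs = cov_wz + e_w_dz
print(lhs, rhs)
```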

Logistic Regression and maximum entropy

If $$z=xy$$ is the product of two independent random variables $$x$$ and $$y$$ with probability densities $$f_x$$ and $$f_y$$, the density of $$z$$ is:

 * $$f_z(z) = \int_0^\infty f_x(x)f_y(z/x)\frac{dx}{x} -\int_{-\infty}^0 f_x(x)f_y(z/x)\frac{dx}{x}$$


Substituting $$x \to -x$$ in the second integral:

 * $$f_z(z) = \int_0^\infty f_x(x)f_y(z/x)\frac{dx}{x} +\int_0^\infty f_x(-x)f_y(-z/x)\frac{dx}{x}$$

Defining $$x=e^p$$, so that $$p=\ln(x)$$ and $$dp=dx/x$$, and also defining $$p_z=\ln(z)$$ (taking $$z>0$$), the two integrals become:


 * $$f_z(z) = \int_{-\infty}^\infty f_x(e^p)f_y(e^{p_z-p})dp +\int_{-\infty}^\infty f_x(-e^p)f_y(-e^{p_z-p})dp $$


Define the complex-valued functions:

 * $$h_x(p)=f_x(e^p)+i f_x(-e^p)$$ and $$h_y(p)=f_y(e^p)+i f_y(-e^p)$$

and consider the product:


 * $$h_x(p)h_y^*(p_z-p) = A+iB$$

Since $$f_x$$ and $$f_y$$ are real, the real and imaginary parts of this product are:


 * $$A = f_x(e^p)f_y(e^{p_z-p})+f_x(-e^p)f_y(-e^{p_z-p})$$


 * $$B = f_x(-e^p)f_y(e^{p_z-p})-f_x(e^p)f_y(-e^{p_z-p})$$


 * $$f_z(z) = \int_{-\infty}^\infty A\,dp = \Re\left(\int_{-\infty}^\infty h_x(p)h_y^*(p_z-p)dp \right)$$
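The equality of the direct two-integral form and the real-part correlation form can be checked numerically. The sketch below is illustrative only: both densities are taken to be standard normal, the integrals are approximated with midpoint sums, and $$z=1$$:

```python
import math

def f(t):
    # Standard normal density, used for both f_x and f_y (an illustrative choice).
    return math.exp(-t*t/2.0)/math.sqrt(2.0*math.pi)

def fz_direct(z, n=20000, xmax=30.0):
    # f_z(z) = int_0^inf f_x(x) f_y(z/x) dx/x + int_0^inf f_x(-x) f_y(-z/x) dx/x
    dx = xmax/n
    total = 0.0
    for i in range(n):
        x = (i + 0.5)*dx                     # midpoint rule avoids x = 0
        total += (f(x)*f(z/x) + f(-x)*f(-z/x))*dx/x
    return total

def fz_correlation(z, n=20000, pmax=15.0):
    # f_z(z) = Re int h_x(p) h_y*(p_z - p) dp, with h(p) = f(e^p) + i f(-e^p)
    pz = math.log(z)                         # valid for z > 0
    dp = 2.0*pmax/n
    total = 0.0
    for i in range(n):
        p = -pmax + (i + 0.5)*dp
        hx = complex(f(math.exp(p)), f(-math.exp(p)))
        hy = complex(f(math.exp(pz - p)), f(-math.exp(pz - p)))
        total += (hx*hy.conjugate()).real*dp
    return total

print(fz_direct(1.0), fz_correlation(1.0))
```

The two results should agree to within the discretization error of the sums.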

Of all the functional forms used for estimating the probabilities of a particular categorical outcome which optimize the fit by maximizing the likelihood function (e.g. probit regression), the logistic regression solution is unique in that it is a maximum entropy solution.

In order to show this, we use the method of Lagrange multipliers. The Lagrangian is equal to the entropy plus the sum of the products of Lagrange multipliers times various constraint expressions. The general multinomial case will be considered, since the proof is not made that much simpler by considering simpler cases. Equating the derivative of the Lagrangian with respect to the various probabilities to zero yields a functional form for those probabilities which corresponds to those used in logistic regression.

As in the above section on multinomial logistic regression, we will consider M+1 explanatory variables denoted xm, which include x0=1. There will be a total of K data points, indexed by k={1,2,...,K}, and the data points are given by xmk and yk. The xmk will also be represented as an (M+1)-dimensional vector $$\boldsymbol{x}_k = \{x_{0k},x_{1k},...,x_{Mk}\}$$. There will be N+1 possible values of the categorical variable y ranging from 0 to N.

Let pn(x) be the probability, given explanatory variable vector x, that the outcome will be y=n. Define $$p_{nk}=p_n(\boldsymbol{x}_k)$$ which is the probability that for the k-th measurement, the categorical outcome is n.

The Lagrangian will be expressed as a function of the probabilities pnk and will be minimized by equating the derivatives of the Lagrangian with respect to these probabilities to zero. An important point is that the probabilities are treated equally and the fact that they sum to unity is part of the Lagrangian formulation, rather than being assumed from the beginning.

The first contribution to the Lagrangian is the entropy:


 * $$\mathcal{L}_{ent}=-\sum_{k=1}^K\sum_{n=0}^N p_{nk}\ln(p_{nk})$$

The log-likelihood is:


 * $$\ell=\sum_{k=1}^K\sum_{n=0}^N \Delta(n,y_k)\ln(p_{nk})$$

Assuming the multinomial logistic function, the derivative of the log-likelihood with respect to the beta coefficients was found to be:


 * $$\frac{\partial \ell}{\partial  \beta_{nm}}=\sum_{k=1}^K \left(p_{nk}x_{mk}-\Delta(n,y_k)x_{mk}\right)$$

A very important point here is that this expression is (remarkably) not an explicit function of the beta coefficients. It is only a function of the probabilities pnk and the data. Rather than being specific to the assumed multinomial logistic case, it is taken to be a general statement of the condition at which the log-likelihood is maximized and makes no reference to the functional form of pnk. There are then (M+1)(N+1) fitting constraints and the fitting constraint term in the Lagrangian is then:


 * $$\mathcal{L}_{fit}=\sum_{n=0}^N\sum_{m=0}^M \lambda_{nm}\sum_{k=1}^K \left(p_{nk}x_{mk}-\Delta(n,y_k)x_{mk}\right)$$

where the &lambda;nm are the appropriate Lagrange multipliers. There are K normalization constraints which may be written:


 * $$\sum_{n=0}^N p_{nk}=1$$

so that the normalization term in the Lagrangian is:


 * $$\mathcal{L}_{norm}=\sum_{k=1}^K \alpha_k \left(1-\sum_{n=0}^N p_{nk}\right) $$

where the &alpha;k are the appropriate Lagrange multipliers. The Lagrangian is then the sum of the above three terms:


 * $$\mathcal{L}=\mathcal{L}_{ent} + \mathcal{L}_{fit} + \mathcal{L}_{norm}$$

Setting the derivative of the Lagrangian with respect to one of the probabilities to zero yields:


 * $$\frac{\partial \mathcal{L}}{\partial p_{n'k'}}=0=-\ln(p_{n'k'})-1+\sum_{m=0}^M \lambda_{n'm}x_{mk'}-\alpha_{k'}$$

Using the more condensed vector notation:


 * $$\sum_{m=0}^M \lambda_{nm}x_{mk} = \boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k$$

and dropping the primes on the n and k indices, and then solving for $$p_{nk}$$ yields:


 * $$p_{nk}=A_k e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}$$

where:


 * $$A_k=e^{-(1+\alpha_k)}$$

Imposing the normalization constraint, we can write the probabilities as:


 * $$p_{nk}=\frac{e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}}{\sum_{u=0}^N e^{\boldsymbol{\lambda}_u\cdot\boldsymbol{x}_k}}$$

If we substitute this expression back into the log-likelihood and maximize it with respect to the &lambda;nm in order to find the appropriate &lambda;nm for our data, we will find that the maximum is not attained at a single point but rather on an (M+1)-dimensional subspace of the (M+1)(N+1)-dimensional space of the &lambda;nm. In other words, there are an infinite number of equally valid choices of the &lambda;nm. We can choose which &lambda;nm to use in any number of ways; the method chosen in the multinomial logistic regression section above was to set &lambda;0m=0 (which are M+1 in number) and identify the beta coefficients as &beta;nm=&lambda;nm for all n except n=0. This recovers the results from that section.
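The resulting functional form, and the degeneracy in the Lagrange multipliers just described, can be illustrated with a short sketch (all names and numbers below are illustrative):

```python
import math

def softmax_probs(lams, x):
    # p_n = exp(lambda_n . x) / sum_u exp(lambda_u . x)
    scores = [sum(l*xi for l, xi in zip(lam, x)) for lam in lams]
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e/total for e in exps]

x = [1.0, 0.5, -2.0]                         # x_0 = 1 plus two explanatory variables
lams = [[0.0, 0.0, 0.0],                     # lambda vector for n = 0, pinned to zero
        [0.3, -1.0, 0.2],
        [-0.5, 0.7, 0.1]]
p = softmax_probs(lams, x)

# Degeneracy: adding a common vector c to every lambda_n leaves the
# probabilities unchanged, which is why one lambda vector may be pinned to zero.
c = [10.0, -3.0, 1.0]
shifted = [[l + ci for l, ci in zip(lam, c)] for lam in lams]
p_shifted = softmax_probs(shifted, x)
print(p, p_shifted)
```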

Other approaches
In machine learning applications where logistic regression is used for binary classification, the maximum likelihood estimate minimizes the cross-entropy loss function.

Logistic regression is an important machine learning algorithm. The goal is to model the probability of a random variable $$Y$$ being 0 or 1 given experimental data.

Consider a generalized linear model function parameterized by $$\theta$$,

 * $$h_\theta(X) = \frac{1}{1 + e^{-\theta^TX}} = \Pr(Y=1 \mid X; \theta)$$

Therefore,

 * $$\Pr(Y=0 \mid X; \theta) = 1 - h_\theta(X)$$

and since $$ Y \in \{0,1\}$$, we see that $$ \Pr(y\mid X;\theta) $$ is given by

 * $$\Pr(y \mid X; \theta) = h_\theta(X)^y(1 - h_\theta(X))^{(1-y)}.$$

We now calculate the likelihood function, assuming that all the observations in the sample are independent Bernoulli random variables:
 * $$\begin{align}
L(\theta \mid y; x) &= \Pr(Y \mid X; \theta) \\
&= \prod_i \Pr(y_i \mid x_i; \theta) \\
&= \prod_i h_\theta(x_i)^{y_i}(1 - h_\theta(x_i))^{(1-y_i)}
\end{align}$$

Typically, the log likelihood is maximized:

 * $$N^{-1} \log L(\theta \mid y; x) = N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta)$$

which is maximized using optimization techniques such as gradient descent.
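A minimal sketch of this maximization by gradient ascent on the log-likelihood (equivalently, gradient descent on its negative); the toy data and hyperparameters are illustrative, not from the text:

```python
import math

def sigmoid(t):
    return 1.0/(1.0 + math.exp(-t))

def log_likelihood(theta, X, y):
    # sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t*v for t, v in zip(theta, xi)))
        total += yi*math.log(h) + (1 - yi)*math.log(1 - h)
    return total

def fit(X, y, lr=0.5, steps=2000):
    # Gradient ascent: d(log L)/d(theta_m) = sum_i (y_i - h(x_i)) x_mi
    theta = [0.0]*len(X[0])
    for _ in range(steps):
        grad = [0.0]*len(theta)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(t*v for t, v in zip(theta, xi)))
            for m, v in enumerate(xi):
                grad[m] += err*v
        theta = [t + lr*g/len(X) for t, g in zip(theta, grad)]
    return theta

# Toy data (illustrative): x = (1, feature); the label is 1 when the feature is positive.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]
theta = fit(X, y)
print(theta, log_likelihood(theta, X, y))
```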

Assuming the $$(x, y)$$ pairs are drawn uniformly from the underlying distribution, then in the limit of large N,

 * $$\begin{align}
& \lim \limits_{N \rightarrow +\infty} N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \log \Pr(Y=y \mid X=x; \theta) \\[6pt]
= {} & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \left( - \log\frac{\Pr(Y=y \mid X=x)}{\Pr(Y=y \mid X=x; \theta)} + \log \Pr(Y=y \mid X=x) \right) \\[6pt]
= {} & - D_\text{KL}( Y \parallel Y_\theta ) - H(Y \mid X)
\end{align}$$

where $$H(Y\mid X)$$ is the conditional entropy and $$D_\text{KL}$$ is the Kullback–Leibler divergence. This leads to the intuition that by maximizing the log-likelihood of a model, you are minimizing the KL divergence of your model from the true conditional distribution; intuitively, you are searching for the model that makes the fewest assumptions in its parameters.
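The final decomposition can be verified exactly on a small discrete example; all probabilities below are made up for illustration:

```python
import math

# Illustrative joint distribution Pr(X=x, Y=y) and a model q(y|x) = Pr(Y=y | X=x; theta).
Pxy = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.5}
q   = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.7}

Px  = {x: sum(p for (xx, yy), p in Pxy.items() if xx == x) for x in (0, 1)}
Pyx = {(x, y): p/Px[x] for (x, y), p in Pxy.items()}   # true conditional Pr(Y=y | X=x)

expected_ll = sum(p*math.log(q[k]) for k, p in Pxy.items())        # E[log Pr(y|x; theta)]
H_cond = -sum(p*math.log(Pyx[k]) for k, p in Pxy.items())          # H(Y|X)
KL = sum(p*math.log(Pyx[k]/q[k]) for k, p in Pxy.items())          # D_KL(Y || Y_theta)
print(expected_ll, -KL - H_cond)   # the two sides of the identity
```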