
=Historical Perspective=

The history of learning vector-valued functions is closely linked to transfer learning, a broad term that refers to systems that learn by transferring knowledge between different domains. The fundamental motivation for transfer learning in machine learning was discussed in a NIPS-95 workshop on "Learning to Learn," which focused on the need for lifelong machine learning methods that retain and reuse previously learned knowledge. Research on transfer learning has attracted much attention since 1995 under different names: learning to learn, life-long learning, knowledge transfer, inductive transfer, multitask learning, knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, metalearning, and incremental/cumulative learning. Interest in learning vector-valued functions was particularly sparked by multitask learning, a framework that attempts to learn multiple, possibly different tasks simultaneously.

Much of the initial research on multitask learning in the machine learning community was algorithmic in nature, applied during the 1990s to methods such as neural networks, decision trees and $$k$$-nearest neighbors. The use of probabilistic models and Gaussian processes was pioneered and largely developed in the context of geostatistics, where prediction over vector-valued output data is known as cokriging. Geostatistical approaches to multivariate modeling are mostly formulated around the linear model of coregionalization (LMC), which can be considered a generative approach for developing valid covariance functions. The regularization and kernel-theory literature for vector-valued functions followed in the 2000s. While the Bayesian and regularization perspectives were developed independently, they are in fact closely related.

=Gaussian Process Perspective=

The estimator derived in the regularization framework can also be derived from a Bayesian viewpoint using Gaussian process methods. The Gaussian process specifies our prior beliefs about the properties of the function being modeled. These beliefs are updated in the presence of data by means of a likelihood function that relates the prior assumptions to the actual observations, leading to an updated, or posterior, distribution that is used for prediction at test inputs. In this setup the vector-valued function $$\textbf{f}$$, consisting of $$D$$ outputs $$\left\{f_d\right\}_{d=1}^D$$, is assumed to follow a Gaussian process:

$$\begin{align}\textbf{f} \sim \mathcal{GP}(\textbf{m},\textbf{K}), \end{align}$$

where $$\textbf{m} \in \textbf{R}^D$$ is a vector whose components are the mean functions $$\left\{m_d(\textbf{x})\right\}_{d=1}^D$$ of the outputs and $$\textbf{K}$$ is a positive semidefinite matrix-valued function, with entry $$(\textbf{K}(\textbf{x},\textbf{x}'))_{d,d'}$$ giving the covariance between the outputs $$f_d(\textbf{x})$$ and $$f_{d'}(\textbf{x}')$$.

For a set of inputs $$\textbf{X}$$, the prior distribution over the vector $$\textbf{f}(\textbf{X})$$ is given by $$\mathcal{N}(\textbf{m}(\textbf{X}),\textbf{K}(\textbf{X},\textbf{X}))$$, where $$\textbf{m}(\textbf{X})$$ is a vector that concatenates the mean vectors associated with the outputs and $$\textbf{K}(\textbf{X},\textbf{X})$$ is a block-partitioned matrix. In regression, the likelihood function for the outputs is often taken to be Gaussian:

$$\begin{align} p(\textbf{y}|\textbf{f},\textbf{x}, \Sigma) = \mathcal{N}(\textbf{f}(\textbf{x}),\Sigma) \end{align}$$

where $$\Sigma \in \textbf{R}^{D \times D}$$ is a diagonal matrix with elements $$\left\{\sigma_d^2\right\}_{d=1}^{D}$$ specifying the noise variance for each output. Using this form for the likelihood, the predictive distribution for a new input vector $$\textbf{x}_*$$ is:

$$\begin{align}p(\textbf{f}(\textbf{x}_*)|\textbf{S},\textbf{f},\textbf{x}_*,\phi) = \mathcal{N}(\textbf{f}_*(\textbf{x}_*),\textbf{K}_*(\textbf{x}_*,\textbf{x}_*))\end{align}$$

where $$\textbf{S}$$ is the training data, and $$\phi$$ is a set of hyperparameters for $$\textbf{K}(\textbf{x},\textbf{x}')$$ and $$\Sigma$$. Equations for $$\textbf{f}_*$$ and $$\textbf{K}_*$$ can then be obtained:

$$\begin{align}\textbf{f}_*(\textbf{x}_*) &= \textbf{K}_{\textbf{x}_*}(\textbf{K}(\textbf{X},\textbf{X}) + \boldsymbol\Sigma)^{-1}\bar{\textbf{y}}\\ \textbf{K}_*(\textbf{x}_*,\textbf{x}_*) &= \textbf{K}(\textbf{x}_*,\textbf{x}_*) - \textbf{K}_{\textbf{x}_*}(\textbf{K}(\textbf{X},\textbf{X}) + \boldsymbol\Sigma)^{-1}\textbf{K}_{\textbf{x}_*}^T\end{align}$$

where $$\boldsymbol\Sigma = \Sigma \otimes \textbf{I}_N$$ and $$\textbf{K}_{\textbf{x}_*} \in \textbf{R}^{D \times ND}$$ has entries $$(\textbf{K}(\textbf{x}_*,\textbf{x}_j))_{d,d'}$$ for $$j = 1,\cdots,N$$ and $$d,d' = 1,\cdots,D$$. Note that the predictor $$\textbf{f}_*$$ is identical to the predictor derived in the regularization framework. For non-Gaussian likelihoods there is no closed-form solution, and methods such as the Laplace approximation or variational methods are needed to approximate the estimators.
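
To make these equations concrete, the following is a minimal NumPy sketch of the multi-output predictive mean and covariance. The function name and the assumption that the block covariance matrices have already been evaluated are choices made for this illustration, not part of the formulation above.

```python
import numpy as np

def mo_gp_predict(K_xx, K_star, K_starstar, Sigma, y_bar):
    """Multi-output GP predictive mean and covariance.

    K_xx       : (N*D, N*D) block covariance matrix K(X, X)
    K_star     : (D, N*D)   cross-covariance K_{x_*} between x_* and the training inputs
    K_starstar : (D, D)     prior covariance K(x_*, x_*)
    Sigma      : (N*D, N*D) noise covariance, e.g. Sigma kron I_N for per-output noise
    y_bar      : (N*D,)     stacked vector of training outputs
    """
    A = K_xx + Sigma
    # Solve linear systems instead of forming the inverse explicitly.
    alpha = np.linalg.solve(A, y_bar)
    mean = K_star @ alpha                                      # f_*(x_*)
    cov = K_starstar - K_star @ np.linalg.solve(A, K_star.T)   # K_*(x_*, x_*)
    return mean, cov
```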

=Separable Kernels=

==The Linear Model of Coregionalization (LMC)==
LMC has been used in machine learning in the context of Gaussian processes for multivariate regression and in statistics for computer emulation of expensive multivariate computer codes. In LMC, outputs are expressed as linear combinations of independent random functions such that the resulting covariance function (over all inputs and outputs) is a valid positive semidefinite function. Assuming $$D$$ outputs $$\left\{f_d(\textbf{x})\right\}_{d=1}^D$$ with $$\textbf{x} \in \textbf{R}^p$$, each $$f_d$$ is expressed as:

$$\begin{align} f_d(\textbf{x}) = \sum_{q=1}^Q{a_{d,q}u_q(\textbf{x})}, \end{align}$$

where $$a_{d,q}$$ are scalar coefficients and the independent functions $$u_q(\textbf{x})$$ have zero mean and covariance $$\text{cov}[u_q(\textbf{x}),u_{q'}(\textbf{x}')] = k_q(\textbf{x},\textbf{x}')$$ if $$q=q'$$, and 0 otherwise. An equivalent expression for $$f_d(\textbf{x})$$ can be written by grouping the functions $$u_q^i(\textbf{x})$$ that share the same covariance $$k_q(\textbf{x},\textbf{x}')$$:

$$\begin{align} f_d(\textbf{x}) = \sum_{q=1}^Q{\sum_{i=1}^{R_q}{a_{d,q}^iu_q^i(\textbf{x})}} \end{align}$$

where the functions $$u_q^i(\textbf{x})$$, with $$q=1,\cdots,Q$$ and $$i=1,\cdots,R_q$$, have zero mean and covariance $$\text{cov}[u_q^i(\textbf{x}),u_{q'}^{i'}(\textbf{x}')] = k_q(\textbf{x},\textbf{x}')$$ if $$i=i'$$ and $$q=q'$$, and 0 otherwise. The cross covariance between any two functions $$f_d(\textbf{x})$$ and $$f_{d'}(\textbf{x}')$$ can then be written as:

$$\begin{align} \text{cov}[f_d(\textbf{x}),f_{d'}(\textbf{x}')] = \sum_{q=1}^Q{\sum_{i=1}^{R_q}{a_{d,q}^ia_{d',q}^{i}k_q(\textbf{x},\textbf{x}')}} = \sum_{q=1}^Q{b_{d,d'}^qk_q(\textbf{x},\textbf{x}')} \end{align}$$

But $$\text{cov}[f_d(\textbf{x}),f_{d'}(\textbf{x}')]$$ is given by $$(\textbf{K}(\textbf{x},\textbf{x}'))_{d,d'}$$. Thus the kernel $$\textbf{K}(\textbf{x},\textbf{x}')$$ can now be expressed as

$$\begin{align} \textbf{K}(\textbf{x},\textbf{x}') = \sum_{q=1}^Q{\textbf{B}_qk_q(\textbf{x},\textbf{x}')} \end{align}$$

where each $$\textbf{B}_q \in \textbf{R}^{D \times D}$$ is known as a coregionalization matrix. Therefore, the kernel derived from LMC is a sum of the products of two covariance functions, one that models the dependence between the outputs, independently of the input vector $$\textbf{x}$$ (the coregionalization matrix $$\textbf{B}_q$$), and one that models the input dependence, independently of $$\left\{f_d(\textbf{x})\right\}_{d=1}^D$$ (the covariance function $$k_q(\textbf{x},\textbf{x}')$$).
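
To make the separable structure concrete, the following NumPy sketch (an illustration, not taken from the source) assembles the full covariance matrix over $$N$$ training inputs as $$\textbf{K}(\textbf{X},\textbf{X}) = \sum_{q}{\textbf{B}_q \otimes k_q(\textbf{X},\textbf{X})}$$, with outputs stacked block-wise; the squared-exponential form chosen for each $$k_q$$ is an assumption of the example.

```python
import numpy as np

def rbf(X, Xp, lengthscale):
    """Squared-exponential covariance k_q(x, x') -- an illustrative choice."""
    d2 = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def lmc_covariance(X, B_list, k_list):
    """K(X, X) = sum_q B_q kron k_q(X, X), with outputs stacked block-wise."""
    N, D = X.shape[0], B_list[0].shape[0]
    K = np.zeros((N * D, N * D))
    for B_q, k_q in zip(B_list, k_list):
        K += np.kron(B_q, k_q(X, X))
    return K

# Example: D = 2 outputs, Q = 2 latent covariances with different length-scales.
X = np.random.rand(20, 3)
A1, A2 = np.random.randn(2, 2), np.random.randn(2, 2)
B_list = [A1 @ A1.T, A2 @ A2.T]   # positive semidefinite coregionalization matrices
k_list = [lambda X, Xp: rbf(X, Xp, 0.5), lambda X, Xp: rbf(X, Xp, 2.0)]
K = lmc_covariance(X, B_list, k_list)   # (40, 40) multi-output covariance matrix
```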

==Intrinsic Coregionalization Model (ICM)==
The ICM is a simplified version of the LMC in which the elements $$b_{d,d'}^q$$ of every coregionalization matrix $$\textbf{B}_q$$ can be written as $$b_{d,d'}^q = v_{d,d'}b_q$$, for some suitable coefficients $$v_{d,d'}$$; as shown below, the resulting kernel is equivalent to an LMC with $$Q=1$$. With this form for $$b_{d,d'}^q$$, we have

$$\begin{align} \text{cov}[f_d(\textbf{x}),f_{d'}(\textbf{x}')] = \sum_{q=1}^Q{v_{d,d'}b_qk_q(\textbf{x},\textbf{x}')} = v_{d,d'}\sum_{q=1}^Q{b_qk_q(\textbf{x},\textbf{x}')} =  v_{d,d'}k(\textbf{x},\textbf{x}') \end{align}$$

where $$k(\textbf{x},\textbf{x}') = \sum_{q=1}^Q{b_qk_q(\textbf{x},\textbf{x}')}$$. In this case, the coefficients $$v_{d,d'} = \sum_{i=1}^{R_1}{a_{d,1}^ia_{d',1}^i} = b_{d,d'}^1$$ and the kernel matrix for multiple outputs becomes $$\textbf{K}(\textbf{x},\textbf{x}') = k(\textbf{x},\textbf{x}')\textbf{B}$$. ICM corresponds to the special separable kernel often used in the context of regularization. ICM is much more restrictive than the LMC since it assumes that each basic covariance $$k_q(\textbf{x},\textbf{x}')$$ contributes equally to the construction of the autocovariances and cross covariances for the outputs. However, the computations required for the inference are greatly simplified.
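
The saving can be seen directly from the Kronecker structure. In the noise-free case (a simplifying assumption for this sketch), $$(\textbf{B} \otimes k(\textbf{X},\textbf{X}))^{-1} = \textbf{B}^{-1} \otimes k(\textbf{X},\textbf{X})^{-1}$$, so only a $$D \times D$$ and an $$N \times N$$ matrix need to be inverted instead of an $$ND \times ND$$ one:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 30, 3
X = rng.random((N, 2))
A = rng.standard_normal((D, D))
B = A @ A.T + np.eye(D)                                   # well-conditioned coregionalization matrix
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Kx = np.exp(-0.5 * d2) + 1e-2 * np.eye(N)                 # k(X, X) with a small nugget

K_full = np.kron(B, Kx)                                   # ICM covariance: B kron k(X, X)
inv_direct = np.linalg.inv(K_full)                        # costs O(N^3 D^3)
inv_kron = np.kron(np.linalg.inv(B), np.linalg.inv(Kx))   # costs O(D^3) + O(N^3)
assert np.allclose(inv_direct, inv_kron)
```

With a noise term the inverse no longer factors exactly, but eigendecompositions of $$\textbf{B}$$ and $$k(\textbf{X},\textbf{X})$$ can be exploited in a similar way.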

==Semiparametric Latent Factor Model (SLFM)==
Another simplified version of the LMC is the semiparametric latent factor model (SLFM), which corresponds to setting $$R_q = 1$$ (instead of $$Q = 1$$ as in ICM). Each latent function $$u_q$$ then has its own covariance $$k_q$$, while each coregionalization matrix has rank one, $$\textbf{B}_q = \textbf{a}_q\textbf{a}_q^T$$ with $$\textbf{a}_q = [a_{1,q},\cdots,a_{D,q}]^T$$.

=Non-Separable Kernels=

==Process Convolution==
LMC produces a separable kernel because the output functions evaluated at a point $$\textbf{x}$$ only depend on the values of the latent functions at $$\textbf{x}$$. A non-trivial way to mix the latent functions is by convolving a base process with a smoothing kernel. If the base process is a Gaussian process, the convolved process is Gaussian as well. We can therefore exploit convolutions to construct covariance functions. Process convolutions were introduced for multiple outputs in the machine learning community as "dependent Gaussian processes".
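
In one common formulation (a sketch of the standard construction, assuming a single latent base process $$u$$ with covariance $$k(\textbf{z},\textbf{z}')$$ and an output-specific smoothing kernel $$G_d$$ for each output), each output is obtained as

$$\begin{align} f_d(\textbf{x}) = \int{G_d(\textbf{x} - \textbf{z})u(\textbf{z})d\textbf{z}}, \end{align}$$

so that the cross covariance becomes

$$\begin{align} \text{cov}[f_d(\textbf{x}),f_{d'}(\textbf{x}')] = \int{\int{G_d(\textbf{x} - \textbf{z})G_{d'}(\textbf{x}' - \textbf{z}')k(\textbf{z},\textbf{z}')d\textbf{z}'}d\textbf{z}}. \end{align}$$

Because each output sees the latent process through its own smoothing kernel, this covariance does not in general factor into an output-dependent term times an input-dependent term, i.e. it is non-separable.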

=Implementation=

==Parameter Estimation==
There are many works related to parameter estimation for Gaussian processes. Some methods, such as maximization of the marginal likelihood (also known as evidence approximation, type II maximum likelihood, or empirical Bayes) and least squares, give point estimates of the parameter vector $$\phi$$. There are also works employing full Bayesian inference by assigning priors to $$\phi$$ and computing the posterior distribution through a sampling procedure. For non-Gaussian likelihoods, there is no closed-form solution for the posterior distribution or for the marginal likelihood. However, the marginal likelihood can be approximated under Laplace, variational Bayes, or expectation propagation (EP) approximation frameworks for multiple-output classification, and used to find estimates for the hyperparameters.
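
As a minimal illustration of type II maximum likelihood (the ICM kernel with a squared-exponential $$k$$, the shared noise variance, and all names below are assumptions made for this sketch, not a prescribed implementation), the hyperparameters can be estimated by numerically minimizing the negative log marginal likelihood of a zero-mean multi-output GP:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(params, X, y, D):
    """Negative log marginal likelihood of a zero-mean multi-output GP.

    params = [log length-scale, log noise variance, flattened Cholesky factor of B].
    """
    N = X.shape[0]
    ell, noise = np.exp(params[0]), np.exp(params[1])
    L_B = np.tril(params[2:].reshape(D, D))
    B = L_B @ L_B.T                                   # coregionalization matrix
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Kx = np.exp(-0.5 * d2 / ell**2)
    A = np.kron(B, Kx) + noise * np.eye(N * D)        # K(X, X) plus noise
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * N * D * np.log(2 * np.pi)

def fit_hyperparameters(X, y, D):
    """Point estimate of phi by maximizing the marginal likelihood (type II ML)."""
    x0 = np.concatenate([[0.0, -1.0], np.eye(D).ravel()])
    res = minimize(neg_log_marginal_likelihood, x0, args=(X, y, D), method="L-BFGS-B")
    return res.x
```

Here `y` is the vector of all training outputs stacked output-by-output (length $$ND$$), matching the block ordering of the covariance matrix.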

==Reducing Computational Complexity==
The main computational problem is the same as the one appearing in regularization theory: inverting the matrix $$\overline{\textbf{K}(\textbf{X},\textbf{X})} = \textbf{K}(\textbf{X},\textbf{X}) + \boldsymbol\Sigma$$, which in general requires $$O(N^3D^3)$$ operations. This step is necessary for computing the marginal likelihood and the predictive distribution. For most of the approximation methods proposed to reduce computation, the computational efficiency gained is independent of the particular approach (e.g. LMC, process convolution) used to compute the multi-output covariance matrix. A summary of different methods for reducing computational complexity in multi-output Gaussian processes is given in the review literature.