
In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling, belonging to the families of probabilistic graphical models and variational Bayesian methods.

It is often associated with the autoencoder model because of its architectural affinity, but there are significant differences both in the goal and in the mathematical formulation. Variational autoencoders are meant to compress the input information into a constrained multivariate latent distribution (encoding) in order to reconstruct it as accurately as possible (decoding). Although this type of model was initially designed for unsupervised learning, its effectiveness has also been demonstrated in other domains of machine learning such as semi-supervised learning or supervised learning.

Architecture
Variational autoencoders are variational Bayesian methods with a multivariate distribution as prior over the latent variables and a posterior approximated by an artificial neural network, forming the so-called variational encoder-decoder structure.

A vanilla encoder is an artificial neural network that reduces its input information into a bottleneck representation called the latent space. It constitutes the first half of the architecture of both the autoencoder and the variational autoencoder. For the former, the output is a fixed-size vector of artificial neurons. For the latter, the outgoing information is compressed into a probabilistic latent space that is still composed of artificial neurons. In the variational autoencoder architecture, however, these neurons are treated as two distinct vectors of the same dimension, representing the vector of means and the vector of standard deviations, respectively.

A vanilla decoder is likewise an artificial neural network, designed as the mirror architecture of the encoder. It takes as input the compressed information coming from the latent space and then expands it to produce an output that is as close as possible to the encoder's input. While for an autoencoder the decoder input is simply a fixed-length vector of real values, for a variational autoencoder an intermediate step is necessary. Given the probabilistic nature of the latent space, it can be considered as a multivariate Gaussian vector. With this assumption, and through the technique known as the reparameterization trick, it is possible to sample points from this latent space and treat them exactly as a fixed-length vector of real values.
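The following is a minimal illustrative sketch, not taken from the original paper, of such an encoder-decoder pair written as two single-hidden-layer networks in NumPy; the layer sizes, the tanh and sigmoid activations, and the choice of outputting log-variances instead of standard deviations are assumptions made only for the example.

```python
import numpy as np

# Hypothetical dimensions: flattened 28x28 inputs, 128 hidden units, 2-D latent space.
rng = np.random.default_rng(0)
x_dim, h_dim, z_dim = 784, 128, 2

# Encoder weights: input -> hidden -> (means, log-variances).
W_h  = rng.normal(0, 0.01, (x_dim, h_dim))
W_mu = rng.normal(0, 0.01, (h_dim, z_dim))
W_lv = rng.normal(0, 0.01, (h_dim, z_dim))

# Decoder weights: latent sample -> hidden -> reconstruction.
W_h2  = rng.normal(0, 0.01, (z_dim, h_dim))
W_out = rng.normal(0, 0.01, (h_dim, x_dim))

def encode(x):
    """Map the input to the two latent vectors: means and log-variances."""
    h = np.tanh(x @ W_h)
    return h @ W_mu, h @ W_lv

def decode(z):
    """Map a latent sample back to a reconstruction of the input."""
    h = np.tanh(z @ W_h2)
    return 1.0 / (1.0 + np.exp(-(h @ W_out)))  # sigmoid output for data in [0, 1]
```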

From a systemic point of view, both the vanilla autoencoder and the variational autoencoder receive as input a set of high-dimensional data. They adaptively compress it into a latent space (encoding) and finally try to reconstruct it as accurately as possible (decoding). Given the nature of its latent space, the variational autoencoder is characterized by a slightly different objective function: like the vanilla autoencoder, it has to minimize a reconstruction loss, but it also takes into account the Kullback–Leibler divergence between the latent space and a standard multivariate Gaussian.

Formulation
From a formal perspective, given an input dataset $$\mathbf{x}$$ characterized by an unknown probability function $$P(\mathbf{x})$$ and a multivariate latent encoding vector $$\mathbf{z}$$, we want to model the data as a distribution $$p_\theta(\mathbf{x})$$, with $$\theta$$ defined as the set of the network parameters.

It is possible to formalize this distribution as

$$p_\theta(\mathbf{x}) = \int_{\mathbf{z}}p_\theta(\mathbf{x,z})d\mathbf{z} $$

where $$p_\theta(\mathbf{x})$$ is the evidence of the model's data, obtained by marginalizing over the unobserved latent variables, and $$p_\theta(\mathbf{x,z})$$ represents the joint distribution of the input data and its latent representation according to the network parameters $$\theta$$.

By the chain rule of probability, the equation can be rewritten as

$$p_\theta(\mathbf{x}) = \int_{\mathbf{z}}p_\theta(\mathbf{x|z})p_\theta(\mathbf{z})d\mathbf{z}$$

In the vanilla variational autoencoder, $$\mathbf{z}$$ is taken to be a finite-dimensional vector of real values and $$p_\theta(\mathbf{x|z})$$ a Gaussian distribution, so that $$p_\theta(\mathbf{x})$$ is a (continuous) mixture of Gaussian distributions.
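For illustration, if one additionally assumes a standard normal prior $$p_\theta(\mathbf{z}) = \mathcal{N}(\mathbf{0},\mathbf{I})$$ and lets the decoder network produce the mean and variance of the likelihood (a common choice, not required by the general formulation), the evidence reads

$$p_\theta(\mathbf{x}) = \int_{\mathbf{z}} \mathcal{N}\left(\mathbf{x};\, \boldsymbol{\mu}_\theta(\mathbf{z}),\, \boldsymbol{\sigma}^2_\theta(\mathbf{z})\right)\, \mathcal{N}(\mathbf{z};\, \mathbf{0}, \mathbf{I})\, d\mathbf{z}$$

that is, a continuous mixture of Gaussian components, one for each value of $$\mathbf{z}$$, weighted by the prior.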

It is now possible to define the set of the relationships between the input data and its latent representation as
 * Prior $$p_\theta(\mathbf{z})$$
 * Likelihood $$p_\theta(\mathbf{x}|\mathbf{z})$$
 * Posterior $$p_\theta(\mathbf{z}|\mathbf{x})$$

Unfortunately, the computation of $$p_\theta(\mathbf{x})$$ is very expensive and in most cases intractable. To speed up the computation and make it feasible, it is necessary to introduce a further function that approximates the posterior distribution as

$$q_\Phi(\mathbf{z|x}) \approx p_\theta(\mathbf{z|x})$$

with $$\Phi$$ defined as the set of real values that parametrize $$q$$.

In this way, the overall problem can be easily translated into the autoencoder domain, in which the approximated posterior distribution $$q_\Phi(\mathbf{z|x})$$ is computed by the probabilistic encoder, while the conditional likelihood distribution $$p_\theta(\mathbf{x}|\mathbf{z})$$ is computed by the probabilistic decoder.

ELBO loss function
As in every deep learning problem, it is necessary to define a differentiable loss function in order to update the network weights through backpropagation.

For variational autoencoders, the idea is to jointly optimize the generative model parameters $$\theta$$, to reduce the reconstruction error between the input and the output of the network, and the variational parameters $$\Phi$$, to make $$q_\Phi(\mathbf{z|x})$$ as close as possible to $$p_\theta(\mathbf{z}|\mathbf{x})$$.

As reconstruction loss, mean squared error and cross entropy are common choices.

As distance loss between the two distributions, the reverse Kullback–Leibler divergence $$D_{KL}(q_\Phi(\mathbf{z|x})||p_\theta(\mathbf{z|x}))$$ is a good choice, since minimizing it squeezes $$q_\Phi(\mathbf{z|x})$$ under $$p_\theta(\mathbf{z}|\mathbf{x})$$, that is, it forces the approximate posterior to place its mass only where the true posterior does.

The distance loss just defined is expanded as

$$\begin{align} D_{KL}(q_\Phi(\mathbf{z|x})||p_\theta(\mathbf{z|x})) &= \int q_\Phi(\mathbf{z|x}) \log \frac{q_\Phi(\mathbf{z|x})}{p_\theta(\mathbf{z|x})} d\mathbf{z}\\ &= \int q_\Phi(\mathbf{z|x}) \log \frac{q_\Phi(\mathbf{z|x})p_\theta(\mathbf{x})}{p_\theta(\mathbf{z,x})} d\mathbf{z}\\ &= \int q_\Phi(\mathbf{z|x}) \left( \log (p_\theta(\mathbf{x})) + \log \frac{q_\Phi(\mathbf{z|x})}{p_\theta(\mathbf{z,x})}\right) d\mathbf{z}\\ &= \log (p_\theta(\mathbf{x})) + \int q_\Phi(\mathbf{z|x}) \log \frac{q_\Phi(\mathbf{z|x})}{p_\theta(\mathbf{z,x})} d\mathbf{z}\\ &= \log (p_\theta(\mathbf{x})) + \int q_\Phi(\mathbf{z|x}) \log \frac{q_\Phi(\mathbf{z|x})}{p_\theta(\mathbf{x|z})p_\theta(\mathbf{z})} d\mathbf{z}\\ &= \log (p_\theta(\mathbf{x})) + E_{\mathbf{z} \sim q_\Phi(\mathbf{z|x})}(\log \frac{q_\Phi(\mathbf{z|x})}{p_\theta(\mathbf{z})} - \log(p_\theta(\mathbf{x|z})))\\ &= \log (p_\theta(\mathbf{x})) + D_{KL}(q_\Phi(\mathbf{z|x}) || p_\theta(\mathbf{z})) - E_{\mathbf{z} \sim q_\Phi(\mathbf{z|x})}(\log(p_\theta(\mathbf{x|z}))) \end{align}$$

At this point, it is possible to rewrite the equation as

$$\log (p_\theta(\mathbf{x})) - D_{KL}(q_\Phi(\mathbf{z|x})||p_\theta(\mathbf{z|x})) = E_{\mathbf{z} \sim q_\Phi(\mathbf{z|x})}(\log(p_\theta(\mathbf{x|z}))) - D_{KL}(q_\Phi(\mathbf{z|x}) || p_\theta(\mathbf{z}))$$

The goal is to maximize the left-hand side of the equation: maximizing the log-likelihood of the data improves the quality of the generated samples, while minimizing the Kullback–Leibler divergence reduces the distance between the true posterior and its approximation.

Maximizing this quantity is equivalent to minimizing its negative, which is common practice in optimization problems.

The loss function so obtained, also known as the evidence lower bound loss function, or ELBO for short, can be written as

$$L_{\theta,\Phi} = -\log (p_\theta(\mathbf{x})) + D_{KL}(q_\Phi(\mathbf{z|x})||p_\theta(\mathbf{z|x})) = -E_{\mathbf{z} \sim q_\Phi(\mathbf{z|x})}(\log(p_\theta(\mathbf{x|z}))) + D_{KL}(q_\Phi(\mathbf{z|x}) || p_\theta(\mathbf{z})) $$
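Under the assumption, made explicit in the reparameterization trick below, that $$q_\Phi(\mathbf{z|x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$$ is a diagonal Gaussian and that the prior $$p_\theta(\mathbf{z})$$ is a standard normal $$\mathcal{N}(\mathbf{0},\mathbf{I})$$, the divergence term admits the closed form

$$D_{KL}(q_\Phi(\mathbf{z|x}) || p_\theta(\mathbf{z})) = \frac{1}{2}\sum_{j=1}^{d}\left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)$$

where $$d$$ is the dimension of the latent space.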

Given the non-negative property of the Kullback–Leibler divergence, it is correct to assert that

$$-L_{\theta,\Phi} = \log (p_\theta(\mathbf{x})) - D_{KL}(q_\Phi(\mathbf{z|x})||p_\theta(\mathbf{z|x})) \leq \log (p_\theta(\mathbf{x}))     $$

The optimal parameters are the ones that minimize this loss function. The problem can be summarized as

$$\theta^*,\Phi^* = \underset{\theta,\Phi}{\operatorname{arg\,min}} \, L_{\theta,\Phi} $$

The main advantage of this formulation lies in the possibility of jointly optimizing with respect to the parameters $$\theta $$ and $$\Phi $$.
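As an illustration only, the loss can be sketched in NumPy using the hypothetical `encode`, `decode` and `rng` objects from the example in the Architecture section, with a mean squared error reconstruction term and the closed-form Gaussian divergence term given above; the sampling step inside the function anticipates the reparameterization trick described in the next section.

```python
def elbo_loss(x):
    """Negative ELBO for a batch of inputs: reconstruction error plus divergence term.

    Sketch under the Gaussian assumptions above; `encode`, `decode` and `rng`
    are the hypothetical objects defined in the earlier example.
    """
    mu, log_var = encode(x)                    # parameters of q_Phi(z|x)
    eps = rng.normal(size=mu.shape)            # reparameterization trick (next section)
    z = mu + np.exp(0.5 * log_var) * eps
    x_hat = decode(z)

    recon = np.sum((x - x_hat) ** 2, axis=-1)  # mean squared error reconstruction loss
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)
    return np.mean(recon + kl)                 # average over the batch
```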

Before the ELBO loss function can be used in an optimization problem to backpropagate the gradient, it must be made differentiable, which is achieved by applying the so-called reparameterization trick to remove the stochastic sampling from the formulation.

Reparameterization trick
To make the ELBO formulation suitable for training purposes, it is necessary to introduce a further minor modification to the formulation of the problem as well as to the structure of the variational autoencoder.

Stochastic sampling is the non-differentiable operation through which the latent space is sampled to feed the probabilistic decoder. To make gradient-based training through backpropagation feasible, for example with stochastic gradient descent, the reparameterization trick is introduced.

The main assumption about the latent space is that, for every input, it can be modelled as a multivariate Gaussian distribution, and can thus be described as

$$\mathbf{z} \sim q_\Phi(\mathbf{z}\vert\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$$

Given $$\boldsymbol{\epsilon} \sim \mathcal{N}(0, \boldsymbol{I})$$ and $$\odot$$ defined as the element-wise product, the reparameterization trick modifies the above equation as

$$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon} $$

Thanks to this transformation, which can also be extended to distributions other than the Gaussian, the variational autoencoder becomes trainable: the probabilistic encoder learns how to map a compressed representation of the input into the two latent vectors $$\boldsymbol{\mu} $$ and $$\boldsymbol{\sigma} $$, while the stochasticity remains excluded from the updating process and is injected into the latent space as an external input through the random vector $$\boldsymbol{\epsilon} $$.
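A minimal sketch of the trick in isolation, again in NumPy and under the same hypothetical naming as the earlier examples, is the following.

```python
def reparameterize(mu, log_var, rng=np.random.default_rng()):
    """Draw z = mu + sigma * eps, keeping the randomness outside the learned parameters."""
    eps = rng.normal(size=np.shape(mu))        # external noise, eps ~ N(0, I)
    sigma = np.exp(0.5 * np.asarray(log_var))  # standard deviations from the log-variances
    return mu + sigma * eps
```

Because $$\boldsymbol{\epsilon}$$ does not depend on the network parameters, the gradient of the loss can flow through $$\boldsymbol{\mu}$$ and $$\boldsymbol{\sigma}$$ during backpropagation.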

Applications
There are many applications and extensions of variational autoencoders that adapt the architecture to different domains and improve its performance.

$$\beta$$-VAE is an implementation with a weighted Kullback–Leibler divergence term, intended to automatically discover and interpret factorised latent representations. With this implementation it is possible to force manifold disentanglement for $$\beta$$ values greater than one. The authors demonstrate the ability of this architecture to generate high-quality synthetic samples.
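Concretely, the $$\beta$$-VAE objective differs from the ELBO loss derived above only in the weight placed on the divergence term,

$$L_{\theta,\Phi} = -E_{\mathbf{z} \sim q_\Phi(\mathbf{z|x})}(\log(p_\theta(\mathbf{x|z}))) + \beta \, D_{KL}(q_\Phi(\mathbf{z|x}) || p_\theta(\mathbf{z}))$$

with $$\beta = 1$$ recovering the standard variational autoencoder.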

Another implementation, named conditional variational autoencoder (CVAE), inserts label information into the latent space, forcing a deterministic, constrained representation of the learned data.
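In one common formulation, denoting the label by $$\mathbf{c}$$ (a symbol introduced here only for illustration), both the encoder and the decoder receive the label, so that the loss of the previous section becomes

$$L_{\theta,\Phi} = -E_{\mathbf{z} \sim q_\Phi(\mathbf{z|x,c})}(\log(p_\theta(\mathbf{x|z,c}))) + D_{KL}(q_\Phi(\mathbf{z|x,c}) || p_\theta(\mathbf{z|c}))$$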

Some structures directly deal with the quality of the generated samples or implement more than one latent space to further improve the representation learning.

Some architectures mix the structures of variational autoencoders and generative adversarial networks to obtain hybrid models with high generative capabilities.