User:Sergey WereWolf/DLA

Introduction
There is a huge number of variants of deep architectures; however, most of them branch from a few original parent architectures. It is not always possible to compare the performance of multiple architectures directly, since they are not all evaluated on the same data sets. Deep learning is also a fast-growing field, in which new architectures, variants, or algorithms appear every few weeks.

Deep Boltzmann Machines
A Deep Boltzmann Machine (DBM) is a type of binary pairwise Markov random field (an undirected probabilistic graphical model) with multiple layers of hidden random variables. It is a network of symmetrically coupled stochastic binary units. It comprises a set of visible units $$\boldsymbol{\nu} \in \{0,1\}^D$$ and a series of layers of hidden units $$\boldsymbol{h}^{(1)} \in \{0,1\}^{F_1}, \boldsymbol{h}^{(2)} \in \{0,1\}^{F_2}, \ldots, \boldsymbol{h}^{(L)} \in \{0,1\}^{F_L}$$. As in an RBM, there are no connections between units of the same layer. For a DBM with three hidden layers, the probability assigned to a visible vector $$\boldsymbol{\nu}$$ is:

$$p(\boldsymbol{\nu}) = \frac{1}{Z}\sum_h e^{\sum_{ij}W_{ij}^{(1)}\nu_i h_j^{(1)} + \sum_{jl}W_{jl}^{(2)}h_j^{(1)}h_l^{(2)}+\sum_{lm}W_{lm}^{(3)}h_l^{(2)}h_m^{(3)}},$$

where $$\boldsymbol{h} = \{\boldsymbol{h}^{(1)}, \boldsymbol{h}^{(2)}, \boldsymbol{h}^{(3)} \}$$ is the set of hidden units and $$\theta = \{\boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}, \boldsymbol{W}^{(3)} \} $$ are the model parameters, representing the visible-hidden and hidden-hidden interactions, which are symmetric because the links are undirected. Setting $$\boldsymbol{W}^{(2)} = 0$$ and $$\boldsymbol{W}^{(3)} = 0$$ recovers the well-known restricted Boltzmann machine.
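For a very small model, the probability above can be evaluated by brute-force enumeration, which makes the role of the partition function $$Z$$ concrete. The following is only an illustrative sketch: the function names, the layer sizes, and the random weights are assumptions, biases are omitted as in the formula, and the approach is hopeless for realistic sizes because the number of terms grows exponentially.

```python
import itertools
import numpy as np

def binary_vectors(n):
    """All 2**n binary vectors of length n, as float arrays."""
    return [np.array(bits, dtype=float)
            for bits in itertools.product([0, 1], repeat=n)]

def unnormalized_p(v, W1, W2, W3):
    """Sum over all hidden states of exp(v'W1 h1 + h1'W2 h2 + h2'W3 h3)."""
    total = 0.0
    for h1 in binary_vectors(W1.shape[1]):
        for h2 in binary_vectors(W2.shape[1]):
            for h3 in binary_vectors(W3.shape[1]):
                total += np.exp(v @ W1 @ h1 + h1 @ W2 @ h2 + h2 @ W3 @ h3)
    return total

def dbm_distribution(W1, W2, W3):
    """Return all visible vectors and p(v) for each; tiny models only."""
    vs = binary_vectors(W1.shape[0])
    scores = np.array([unnormalized_p(v, W1, W2, W3) for v in vs])
    return vs, scores / scores.sum()  # scores.sum() plays the role of Z

# Hypothetical toy model: 3 visible units and 2 units per hidden layer.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 2))
W2 = rng.normal(scale=0.5, size=(2, 2))
W3 = rng.normal(scale=0.5, size=(2, 2))
vs, p = dbm_distribution(W1, W2, W3)
```

With `W2` and `W3` set to zero, the same function returns the marginal of a restricted Boltzmann machine with weights `W1`, in line with the remark above.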

The top two layers of a DBN form an undirected graph and the remaining layers form a belief net with directed, top-down connections. For a DBM, all the connections are undirected.

Several reasons motivate the use of deep Boltzmann machine architectures. Like DBNs, they can learn complex and abstract internal representations of the input in tasks such as object or speech recognition, using a limited amount of labeled data to fine-tune representations built from a large supply of unlabeled sensory input. However, unlike DBNs and deep convolutional neural networks, they perform inference and training in both directions, with a bottom-up and a top-down pass, which enables DBMs to better capture the representations of ambiguous and complex input structures. Another important advantage of DBMs is the joint optimization of all layers using the approximate gradient of a variational lower bound on the likelihood function, which greatly benefits the learning of generative models.

Since exact maximum likelihood learning is intractable for DBMs, approximate maximum likelihood learning is performed instead. This algorithm is rather slow, especially for architectures with multiple layers of hidden units, where the upper layers are quite remote from the visible layer. Another possibility is to use mean-field inference to estimate the data-dependent expectations, in combination with a Markov chain Monte Carlo (MCMC) based stochastic approximation technique to approximate the expected sufficient statistics of the model. There remains the problem of random weight initialization, since this learning procedure performs quite poorly for DBMs that start from random weights. A layer-wise pre-training algorithm was introduced to alleviate this problem.

The difference between a DBN and a DBM can now be stated. In a DBN, the top two layers form a restricted Boltzmann machine, which is an undirected graphical model, whereas the lower layers form a directed generative model. A greedy layer-wise unsupervised learning algorithm has been introduced for pre-training.

The idea behind the pre-training algorithm is straightforward. When learning the parameters of the first-layer RBM, the visible-hidden weights are tied and the bottom-up weights are constrained to be twice the top-down weights. Intuitively, using twice the weights when inferring the states of the hidden units $$\boldsymbol{h}^{(1)}$$ compensates for the initial lack of top-down feedback. Conversely, when pre-training the last RBM in the stack, the top-down weights are constrained to be twice the bottom-up weights. For all intermediate RBMs, the weights are halved in both directions when the RBMs are composed to form a DBM, so that the resulting weights are symmetric. This trick eliminates the double-counting problem once top-down and bottom-up inferences are subsequently combined. In this modified RBM with tied parameters, the conditional distributions over the hidden and visible states are defined as:

$$p(h_j^{(1)} = 1 | \boldsymbol{\nu}) = \sigma\Big(\sum_i W_{ij}^{(1)}\nu_i + \sum_i W_{ij}^{(1)}\nu_i\Big), \qquad p(\nu_i = 1 | \boldsymbol{h}^{(1)}) = \sigma\Big(\sum_j W_{ij}^{(1)}h_j^{(1)}\Big),$$

where $$\sigma(x) = 1/(1+e^{-x})$$ is the logistic function.
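The weight-doubling rule described for the first and last RBMs of the stack can be written down in a few lines. This is a minimal sketch under stated assumptions: biases are omitted, the function names are hypothetical, and only the conditionals are shown, not the contrastive training loop.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def first_layer_hidden_probs(v, W):
    """First RBM: the bottom-up input is counted twice."""
    return sigmoid(2.0 * (v @ W))

def first_layer_visible_probs(h, W):
    """First RBM: the top-down pass uses the weights once."""
    return sigmoid(W @ h)

def last_layer_hidden_probs(h_below, W):
    """Last RBM: the bottom-up pass uses the weights once."""
    return sigmoid(h_below @ W)

def last_layer_visible_probs(h_top, W):
    """Last RBM: the top-down input is counted twice."""
    return sigmoid(2.0 * (W @ h_top))
```

Here `W` has one row per unit of the lower layer and one column per unit of the upper layer.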

This heuristic pre-training algorithm works surprisingly well in practice. However, it is solely motivated by the need to end up with a model that has symmetric weights, and it does not provide any useful insight into what is happening during the pre-training stage. Furthermore, unlike the pre-training algorithm for Deep Belief Networks (DBNs), it lacks a proof that the variational bound improves each time a layer is added to the DBM.

Despite the advantages discussed so far, DBMs have a crucial disadvantage that limits the performance and functionality of this kind of architecture. Approximate inference, which is based on the mean-field method, is about 25 to 50 times slower than a single bottom-up pass in a DBN. This time-consuming step makes the joint optimization described above quite impractical for large data sets, and it also seriously restricts the use of DBMs in tasks such as feature representation, because mean-field inference has to be performed for each new test input.
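The cost of mean-field inference is easier to see in code. The sketch below assumes the three-hidden-layer DBM without biases used earlier, a fully factorized approximating distribution, and a fixed number of sweeps; the function name and the default of 50 iterations are illustrative choices, not values taken from the literature. Each sweep costs roughly as much as one bottom-up pass plus one top-down pass, and many sweeps are needed per input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field(v, W1, W2, W3, n_iters=50):
    """Iterate the mean-field fixed-point equations for a 3-hidden-layer DBM.

    Returns the approximate posterior means of h1, h2 and h3 given v.
    """
    mu1 = np.full(W1.shape[1], 0.5)
    mu2 = np.full(W2.shape[1], 0.5)
    mu3 = np.full(W3.shape[1], 0.5)
    for _ in range(n_iters):
        # Each middle layer receives input from the layer below and above.
        mu1 = sigmoid(v @ W1 + W2 @ mu2)
        mu2 = sigmoid(mu1 @ W2 + W3 @ mu3)
        mu3 = sigmoid(mu2 @ W3)
    return mu1, mu2, mu3
```

If `W2` is zero, the first hidden layer receives no top-down signal and the result for `mu1` coincides with a single bottom-up pass.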

Stacked (Denoising) Auto-Encoders
In the context of neural networks, it is commonly believed that several layers of non-linear processing make it possible to model complex functions efficiently and to generalize well on difficult recognition tasks. This belief was inspired both by theoretical results and by discoveries related to biological models of the human brain, such as the visual cortex. The non-convex nature of the optimization in MLPs long prevented scientists and engineers from using more than two layers of hidden units. Consequently, much research was carried out on shallow architectures, so as to avoid the optimization problem and work with convex functions.

The auto-encoder idea is motivated by the concept of a good representation: not all representations or features of the input are suitable for a specific task such as classification. There is therefore a need for a clear understanding of what a good representation is and how it can be recognized. For a classifier, for instance, a good representation can be defined as one that yields a better-performing classifier. Whether or not this philosophy is true, it can be regarded as a pre-training stage with a defined criterion.

Following the usual definition, an encoder is a deterministic mapping $$f_\theta$$ that transforms an input vector $$\boldsymbol{x}$$ into a hidden representation $$\boldsymbol{y}$$, where $$\theta = \{\boldsymbol{W}, b\}$$, $$\boldsymbol{W}$$ is the weight matrix, and $$b$$ is an offset vector (bias). Conversely, a decoder maps the hidden representation $$\boldsymbol{y}$$ back to a reconstructed input $$\boldsymbol{z}$$ via $$g_\theta$$. Auto-encoding compares this reconstructed input with the original, which requires an error criterion, and minimizes that error so that the reconstruction is as close as possible to the original.

Denoising auto-encoders also focus on finding a good representation. In this context, we carefully seek the features that are useful for the specific task; the rest (the corrupting features) are not useful for the task at hand and are to be removed (denoised). There are different strategies for identifying good representations (features), such as restricting the representation by means of a traditional bottleneck or a sparse representation, or maximizing the mutual information. In stacked denoising auto-encoders, however, a partially corrupted input is cleaned (denoised). This approach comes with a specific definition: a good representation is one that can be obtained robustly from a corrupted input and that will be useful for recovering the corresponding clean input. Implicit in this definition are the following ideas:


 * The higher level representations are relatively stable and robust to the corruption of the input;
 * It is required to extract features that are useful for representation of the input distribution.

The algorithm requires only small changes to the basic auto-encoder described above. It consists of several steps and starts with a stochastic mapping of $$\boldsymbol{x}$$ to $$\tilde{\boldsymbol{x}}$$ through $$q_D(\tilde{\boldsymbol{x}}|\boldsymbol{x})$$; this is the corrupting step. The corrupted input $$\tilde{\boldsymbol{x}}$$ then passes through a basic auto-encoder and is mapped to a hidden representation $$\boldsymbol{y} = f_\theta(\tilde{\boldsymbol{x}}) = s(\boldsymbol{W}\tilde{\boldsymbol{x}}+b)$$. From this hidden representation we can reconstruct $$\boldsymbol{z} = g_\theta(\boldsymbol{y})$$. In the last stage, a minimization algorithm is run to make $$\boldsymbol{z}$$ as close as possible to the uncorrupted input $$\boldsymbol{x}$$. The only difference from the conventional auto-encoder is that $$\boldsymbol{z}$$ is a deterministic function of the corrupted input rather than of the uncorrupted input. The reconstruction error $$L_H(\boldsymbol{x}, \boldsymbol{z})$$ may be either the cross-entropy loss with an affine-sigmoid decoder or the squared-error loss with an affine decoder. In this architecture, the parameters are initialized randomly and adjusted using the stochastic gradient descent algorithm.
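The steps above (corrupt, encode, decode, compare with the clean input) can be sketched as follows. Several choices here are assumptions made for illustration only: masking noise is used for $$q_D$$, the decoder has its own parameters `W_dec` and `b_dec`, $$s$$ is the logistic sigmoid, and the loss is the cross-entropy variant. The gradient step is left out.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(x, drop_prob, rng):
    """Masking noise (assumed q_D): zero each entry with probability drop_prob."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep

def encode(x_tilde, W, b):
    """y = s(W x_tilde + b)."""
    return sigmoid(W @ x_tilde + b)

def decode(y, W_dec, b_dec):
    """Affine-sigmoid decoder producing the reconstruction z."""
    return sigmoid(W_dec @ y + b_dec)

def cross_entropy(x, z, eps=1e-12):
    z = np.clip(z, eps, 1.0 - eps)
    return -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z))

def dae_loss(x, W, b, W_dec, b_dec, drop_prob, rng):
    """Reconstruction error of a denoising auto-encoder for one input."""
    x_tilde = corrupt(x, drop_prob, rng)   # corrupting step
    y = encode(x_tilde, W, b)              # hidden representation
    z = decode(y, W_dec, b_dec)            # reconstruction
    return cross_entropy(x, z)             # compared with the CLEAN input x
```

The key line is the last one: the loss is measured against `x`, not against `x_tilde`, which is what forces the representation to be robust to the corruption.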

To build a deep architecture out of denoising auto-encoders, we stack them one on top of another, similar to what was described for the RBMs in DBNs or to what is done for conventional auto-encoders. Once the encoding function $$f_\theta$$ of the first denoising auto-encoder has been learned, it is applied to the clean (uncorrupted) input, and the resulting representation is used to train the second-level auto-encoder.

Once the stacked auto-encoder is trained, its output can be used as the input to a supervised learning algorithm such as a support vector machine classifier or multiclass logistic regression.

Deep Stacking Networks
One of the more recently introduced deep architectures, based on building hierarchies from blocks of simplified neural network modules, is the deep convex network. It is called convex because learning the upper-layer weights is formulated as a convex optimization problem with a closed-form solution, while the lower-layer weights are initialized with a fixed RBM. The network was later renamed the deep stacking network (DSN), to emphasize that it uses a mechanism similar to stacked generalization. Along with the new name came a change in the learning process: for better performance in tasks such as classification, the lower-layer weights are also learned, so the overall learning problem is no longer convex.

The DSN blocks, each consisting of a simple, easy-to-learn module, are stacked to form the overall deep network. The network can be trained block by block in a supervised fashion, without back-propagation through the entire stack of blocks.

Each block consists of a simplified MLP with a single hidden layer. It comprises a weight matrix $$\boldsymbol{U}$$ connecting the logistic sigmoidal units of the hidden layer $$\boldsymbol{h}$$ to the linear output layer $$\boldsymbol{y}$$, and a weight matrix $$\boldsymbol{W}$$ connecting the input of the block to its hidden layer. Let the target vectors $$\boldsymbol{t}$$ be arranged to form the columns of the target matrix $$\boldsymbol{T}$$, let the input data vectors $$\boldsymbol{x}$$ be arranged to form the columns of $$\boldsymbol{X}$$, let $$\boldsymbol{H} = \sigma(\boldsymbol{W}^T\boldsymbol{X})$$ denote the matrix of hidden units, and assume that the lower-layer weights $$\boldsymbol{W}$$ are known (training proceeds layer by layer). The function $$\sigma$$ performs the element-wise logistic sigmoid operation. Learning the upper-layer weight matrix $$\boldsymbol{U}$$ can then be formulated as a convex optimization problem:

$$\min_{U^T} f = ||\boldsymbol{U}^T \boldsymbol{H} - \boldsymbol{T}||^2_F,$$

which has a closed-form solution. The input $$\boldsymbol{X}$$ to the first block contains only the original data; in the upper blocks, the input contains, in addition to this original (raw) data, a copy of the output $$\boldsymbol{y}$$ of the lower block(s). The lower-layer weights (the input-hidden links) can be optimized using accelerated gradient descent, by minimizing the squared-error objective. In this step, the upper-layer weights $$\boldsymbol{U}$$ are assumed known (the optimal solution from the previous step) and $$\boldsymbol{W}$$ is updated with an iterative gradient algorithm; $$\boldsymbol{U}$$ is then recomputed with the new values.
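The closed-form step is an ordinary linear least-squares problem: setting the gradient of $$||\boldsymbol{U}^T \boldsymbol{H} - \boldsymbol{T}||^2_F$$ to zero gives $$\boldsymbol{H}\boldsymbol{H}^T\boldsymbol{U} = \boldsymbol{H}\boldsymbol{T}^T$$. A minimal sketch follows, with hypothetical function names. It uses NumPy's least-squares solver rather than an explicit matrix inverse, which is an implementation choice made here and not something prescribed by the method.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hidden_units(W, X):
    """H = sigma(W^T X); columns of X are input vectors."""
    return sigmoid(W.T @ X)

def solve_upper_weights(H, T):
    """U minimizing ||U^T H - T||_F^2, i.e. the least-squares solution of H^T U = T^T."""
    U, *_ = np.linalg.lstsq(H.T, T.T, rcond=None)
    return U

def objective(U, H, T):
    """Squared Frobenius norm of the residual U^T H - T."""
    return np.sum((U.T @ H - T) ** 2)
```

With `X` of shape `(d, N)`, `W` of shape `(d, n_hidden)` and `T` of shape `(c, N)`, the returned `U` has shape `(n_hidden, c)`.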

In each block, an estimate of the same final label class $$\boldsymbol{y}$$ is produced, and this estimated label is concatenated with the original input to form the expanded input for the block above. In contrast with other deep architectures such as DBNs, the goal here is not to discover a transformed feature representation. The structure of the hierarchy makes parallel training possible. In purely discriminative tasks, the DSN has been reported to perform better than the conventional DBN. Only in the final block is the output $$\boldsymbol{y}$$ used for the classification task; in the other blocks, the output is used only to form the expanded input for the blocks above.
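The stacking rule can be sketched in a few lines. The simplifications are assumptions of this example: the lower-layer weights `W` of each block are drawn at random and not fine-tuned, only the output of the immediately preceding block is appended to the raw input, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_block(X_in, T, n_hidden, rng):
    """One DSN block: random lower-layer weights, closed-form upper-layer weights."""
    W = rng.normal(scale=0.1, size=(X_in.shape[0], n_hidden))
    H = sigmoid(W.T @ X_in)
    U, *_ = np.linalg.lstsq(H.T, T.T, rcond=None)
    return W, U

def block_output(X_in, W, U):
    """Linear output y = U^T sigma(W^T x) for every column of X_in."""
    return U.T @ sigmoid(W.T @ X_in)

def train_dsn(X, T, n_blocks, n_hidden, rng):
    blocks, X_in = [], X
    for _ in range(n_blocks):
        W, U = train_block(X_in, T, n_hidden, rng)
        blocks.append((W, U))
        Y = block_output(X_in, W, U)
        X_in = np.vstack([X, Y])   # raw input plus this block's estimate
    return blocks

def predict_dsn(X, blocks):
    X_in = X
    for W, U in blocks:
        Y = block_output(X_in, W, U)
        X_in = np.vstack([X, Y])
    return Y                       # only the final block's output is used
```

Every block is trained against the same target matrix `T`, and only the last `Y` is returned for classification, as described above.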

Tensor Deep Stacking Networks (TDSN)
The architecture discussed here is an extension of the DSN, called the tensor deep stacking network (TDSN). It improves on the DSN in two important ways: it uses higher-order information by means of covariance statistics, and it transforms the non-convex problem of the lower layer into a convex sub-problem of the upper layer.

Unlike the DSN, the TDSN exploits the covariance statistics of the data through a bilinear mapping from two distinct sets of hidden units in the same layer to the predictions, via a third-order tensor. Using the covariance structure of the data was also proposed in work on the mean-covariance RBM (mcRBM) architecture; there, however, it is applied to the raw data rather than to binary hidden feature layers as in the TDSN. In the mcRBM, the higher-order structure is represented in the visible data, while in the TDSN the hidden units are responsible for this representation. Learning in mcRBM models is complex because of the factorization required to reduce the cubic growth in the number of weight parameters, so the mcRBM is difficult to use in deeper layers and is usually used only in the bottom layer of a deep architecture. These difficulties of the factorization, together with the high cost of Hybrid Monte Carlo in learning, are the main factors that keep the mcRBM from scaling up to very large data sets. The TDSN removes them: thanks to its specific architecture, parallel training and a closed-form solution for the upper-layer convex problems are straightforward, and it adopts small hidden layers so as to eliminate the factorization process. There are other differences between the mcRBM and the TDSN. The mcRBM is a generative model optimizing a maximum likelihood objective, while the TDSN is a discriminative model optimizing a least squares objective. Further advantages of the TDSN over other architectures are discussed in the literature.
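The bilinear mapping through a third-order tensor can be illustrated directly. In this sketch the names `h1`, `h2` and `T3` are assumptions, standing for the two sets of hidden units and the tensor of upper-layer weights. The second function shows that, once the outer product of the two hidden vectors is formed, the prediction is linear in the tensor entries, which is why the upper-layer problem remains a convex least-squares problem.

```python
import numpy as np

def bilinear_predict(h1, h2, T3):
    """y_k = sum over i, j of T3[i, j, k] * h1[i] * h2[j]."""
    return np.einsum('i,j,ijk->k', h1, h2, T3)

def bilinear_as_linear(h1, h2, T3):
    """Same prediction, written as a linear map applied to the outer product."""
    outer = np.outer(h1, h2).ravel()            # length len(h1) * len(h2)
    return outer @ T3.reshape(-1, T3.shape[2])  # unfolded tensor as a matrix
```

The number of tensor entries grows with the product of the two hidden-layer sizes, which is consistent with the preference for small hidden layers noted above.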

Scalability and parallelization are two important aspects of learning algorithms that are not seriously considered in conventional DNNs. All learning for the DSN (and for the TDSN as well) is done in batch mode, so as to make parallelization possible on a cluster of CPU and/or GPU nodes. Parallelization makes it possible to scale the design up to larger (deeper) architectures and data sets, in a way different from what is done for deep sparse auto-encoders.

It is important to note that the basic architecture is suitable for tasks such as classification or regression. However, if it is used as part of a hybrid architecture with an HMM, a softmax layer that produces posterior probabilities is desirable.

Spike-and-Slab RBMs (ssRBMs)
The need to model real-valued inputs, as is done in Gaussian RBMs (GRBMs), has motivated the search for new methods. One of these methods is the spike-and-slab RBM (ssRBM), which models continuous-valued inputs with strictly binary latent variables.

Like the basic RBM and its variants, the spike-and-slab RBM is a bipartite graph. As in the GRBM, the visible units (input) are real-valued. The difference arises in the hidden layer, where each hidden unit comes with a binary spike variable and a real-valued slab variable. These terms (spike and slab) come from the statistics literature and refer to a prior that is a mixture of two components: one is a discrete probability mass at zero, called the spike, and the other is a density over a continuous domain, called the slab.
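The two-component prior can be illustrated by sampling from it. This sketch covers only the statistical notion of a spike-and-slab variable, not the ssRBM energy function. The Gaussian slab and the parameter names are assumptions made for the example. In ssRBM terms, the binary variable gates the real-valued one, and the product is what the hidden unit contributes.

```python
import numpy as np

def sample_spike_and_slab(n, p_on, slab_mean, slab_std, rng):
    """Draw n values that are exactly zero with probability 1 - p_on
    and Gaussian (the slab) otherwise."""
    spike = rng.random(n) < p_on                     # binary spike variable
    slab = rng.normal(slab_mean, slab_std, size=n)   # real-valued slab variable
    return spike * slab                              # gated product
```

Unlike a purely continuous prior, this one assigns positive probability to the value zero itself, which is what allows a hidden unit to be switched off exactly.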

There is also an extension of the ssRBM model, called the µ-ssRBM. This variant provides extra modeling capacity through additional terms in the energy function. One of these terms enables the model to form a conditional distribution of the spike variables by marginalizing out the slab variables given an observation, which is quite similar to the conditional of the mcRBM and of the mPoT model. Given both the observed and the spike variables, the slab variables are Gaussian with diagonal covariance. Given both the spike and slab variables, the observations are Gaussian with diagonal covariance. The µ-ssRBM is therefore amenable to Gibbs sampling. These properties make the model a good choice as a building block of deep structures such as the DBM. However, there is no guarantee that the resulting model produces a valid density over the whole real-valued data space; this is one of the main shortcomings of ssRBM architectures.

Compound Hierarchical-Deep Models
Compared with current state-of-the-art artificial systems, the human brain needs fewer examples to categorize novel instances and even to extend already existing categories to them (generalization). This is the main motivation of this subsection: learning abstract knowledge from the data and using it for novel cases in the future.

The class of these new architectures is called compound HD models, where HD stands for Hierarchical-Deep. They are structured as a composition of non-parametric Bayesian models with deep networks. The features learned by deep architectures such as DBNs, DBMs, deep auto-encoders, convolutional variants, ssRBMs, deep coding networks, DBNs with sparse feature learning, recursive neural networks, conditional DBNs, and denoising auto-encoders provide better representations for more rapid and accurate classification with high-dimensional training data sets. However, on their own they are not very powerful at learning novel classes from few examples. In these architectures, all units throughout the network are involved in representing the input (distributed representations), and they have to be adjusted together (a high number of degrees of freedom). If we limit the degrees of freedom, it becomes easier for the model to learn new classes from few training samples (fewer parameters to learn). Hierarchical Bayesian (HB) models allow learning from few examples, as shown in work on computer vision, statistics, and cognitive science. These HB models are based on categorizing the training examples and generalizing to the new classes at hand. However, they are all fed with hand-crafted features, such as GIST or SIFT features in computer vision and MFCC features in speech perception. Another shortcoming of HB models is their fixed architecture: they do not discover the representation and the links between different parameters in an unsupervised fashion.

Several methods address the subject of learning from few examples, such as the use of several boosted detectors in a multi-task setting and the cross-generalization approach, which are both discriminative, as well as boosting methods and the HB approach.

Compound HD architectures try to integrate the characteristics of both HB models and deep networks. Here we introduce the compound HDP-DBM architecture, in which a hierarchical Dirichlet process (HDP) serves as the hierarchical model and is combined with a DBM architecture. It is a full generative model, generalized from abstract concepts flowing through the layers of the model, and it is able to synthesize new examples in novel classes that look reasonably natural. All the levels are learned jointly by maximizing a joint log-probability score.

Consider a DBM with three hidden layers. The probability of a visible input $$\boldsymbol{\nu}$$ is:

$$p(\boldsymbol{\nu}; \psi) = \frac{1}{Z}\sum_h e^{\sum_{ij}W_{ij}^{(1)}\nu_i h_j^{1} + \sum_{jl}W_{jl}^{(2)}h_j^{1}h_l^{2}+\sum_{lm}W_{lm}^{(3)}h_l^{2}h_m^{3}},$$

where $$\boldsymbol{h} = \{\boldsymbol{h}^{1}, \boldsymbol{h}^{2}, \boldsymbol{h}^{3} \}$$ is the set of hidden units and $$\psi = \{\boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}, \boldsymbol{W}^{(3)} \} $$ are the model parameters, representing the visible-hidden and hidden-hidden symmetric interaction terms.

After a DBM model has been learned, we have an undirected model that defines the joint distribution $$P(\nu, h^1, h^2, h^3)$$. One way to express what has been learned is through the conditional model $$P(\nu, h^1, h^2|h^3)$$ and a prior term $$P(h^3)$$. We can therefore rewrite the variational bound as:

$$\log P(\nu) \geq \sum_{h^1, h^2, h^3} Q(h|\nu; \mu) \log P(\nu, h^1, h^2|h^3) + \mathcal{H}(Q) + \sum_{h^3}Q(h^3|\nu;\mu)\log P(h^3)$$

This particular decomposition lies at the core of the greedy recursive pre-training algorithm: we keep the learned conditional model $$P(\nu, h^1, h^2|h^3)$$, but maximize the variational lower bound with respect to the last term. Instead of adding an additional undirected layer (e.g., a restricted Boltzmann machine) to model $$P(h^3)$$, we can place a hierarchical Dirichlet process prior over $$h^3$$, which will allow us to learn category hierarchies and, more importantly, useful representations of classes that contain few training examples. The part we keep, $$P(\nu, h^1, h^2|h^3)$$, represents a conditional DBM model, which can be viewed as a two-layer DBM with bias terms given by the states of $$h^3$$:

$$P(\nu, h^1, h^2|h^3) = \frac{1}{Z(\psi, h^3)}e^{\sum_{ij}W_{ij}^{(1)}\nu_i h_j^1 + \sum_{jl}W_{jl}^{(2)}h_j^{1}h_l^{2}+\sum_{lm}W_{lm}^{(3)}h_l^{2}h_m^{3}}.$$

Deep Coding Networks
There are several advantages to having a model that can actively update itself according to the context in the data. One such approach arises from the idea of a model that is able to adjust its prior knowledge dynamically according to the context of the data. The deep predictive coding network (DPCN) is a predictive coding scheme in which top-down information is used to empirically adjust the priors needed for the bottom-up inference procedure, by means of a deep locally connected generative model. It is based on extracting sparse features from time-varying observations using a linear dynamical model. As an extension to this feature extraction block, a pooling strategy is then employed in order to learn invariant feature representations. As in other deep architectures, these blocks are the building elements of a deeper architecture, in which greedy layer-wise unsupervised learning is used. The layers constitute a kind of Markov chain, such that the states at any layer depend only on the succeeding and preceding layers. The name reflects the fact that the network predicts the representation of a layer by means of a top-down approach, using the information in the upper layer as well as temporal dependencies on the previous states.

Statistical models have also been used to explain cortical functions in the mammalian brain, and those models are closely related to the DPCN introduced here. One of them uses an update procedure similar to the Kalman filter for inference, and another uses a general framework that includes all higher-order moments. The important problem with these methods is the lack of a discriminative representation (a sparse and invariant representation) that is helpful for tasks such as object recognition. In the DPCN, however, an efficient inference algorithm is employed to extract locally invariant representations of the image sequences and more abstract information in the higher layers. It is also possible to extend the DPCN to form a convolutional network.

Deep Kernel Machines
Artificial neural networks are not the only area conquered by the concept of depth. The Multilayer Kernel Machine (MKM) is one of the deep architectures that lie outside the field of neural networks, although it is quite relevant to it. It is a way of learning highly nonlinear functions through the iterative application of weakly nonlinear kernels. It uses kernel principal component analysis (KPCA) as the method for the unsupervised greedy layer-wise pre-training step of the deep learning architecture.
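A single KPCA layer, and the way layers are chained, can be sketched as follows. The RBF kernel, the `gamma` parameter, and the function names are assumptions chosen for brevity and are not necessarily the kernel used in the MKM. The sketch only projects the training points themselves and leaves out any supervised feature selection between layers.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kpca_layer(X, n_components, gamma=1.0):
    """Project the rows of X onto the leading kernel principal components."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[order], eigvecs[:, order]
    # The projection of training point i on component k is sqrt(lam_k) * alpha[i, k].
    return alpha * np.sqrt(np.maximum(lam, 0.0))

def mkm_features(X, layer_sizes, gamma=1.0):
    """Greedy stacking: each layer applies KPCA to the previous layer's output."""
    out = X
    for n_l in layer_sizes:
        out = kpca_layer(out, n_l, gamma)
    return out
```

Here the rows of `X` are samples, and `layer_sizes` lists the number of components kept at each layer.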

In this method, layer $$l+1$$ learns the representation of the previous layer $$l$$ by extracting the $$n_l$$ principal components (PCs) of the projection of the layer-$$l$$ output in the feature domain induced by the kernel. To reduce the dimensionality of the updated representation in each layer, a supervised strategy is proposed for selecting the most informative features among those extracted by KPCA. The process can be enumerated as follows:


 * rank the $$n_l$$ features according to their mutual information with the class labels;
 * for different values of $$K$$ and $$m_l \in\{1, \ldots, n_l\}$$, compute the classification error rate of a K-nearest neighbor (K-NN) classifier using only the $$m_l$$ most informative features on a validation set;
 * retain the number of features $$m_l$$ for which the classifier reaches the lowest error rate.
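The three steps listed above can be sketched as a small selection routine. Several details are assumptions made for this illustration: mutual information is estimated from a simple histogram of each feature, class labels are non-negative integers, the K-NN classifier uses Euclidean distance with majority voting, and all function names are hypothetical.

```python
import numpy as np

def mutual_information(feature, labels, n_bins=8):
    """Histogram estimate of the mutual information between a feature and labels."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    binned = np.digitize(feature, edges[1:-1])      # bin index in 0 .. n_bins-1
    classes = np.unique(labels)
    joint = np.zeros((n_bins, classes.size))
    for c_idx, c in enumerate(classes):
        joint[:, c_idx] = np.bincount(binned[labels == c], minlength=n_bins)
    joint /= joint.sum()
    indep = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / indep[nz])))

def knn_error(X_tr, y_tr, X_va, y_va, k):
    """Validation error rate of a K-NN classifier with majority voting."""
    d = ((X_va[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d, axis=1)[:, :k]
    pred = np.array([np.bincount(row).argmax() for row in y_tr[nearest]])
    return float(np.mean(pred != y_va))

def select_features(X_tr, y_tr, X_va, y_va, ks=(1, 3, 5)):
    """Return (indices of retained features, best K, validation error)."""
    mi = np.array([mutual_information(X_tr[:, j], y_tr)
                   for j in range(X_tr.shape[1])])
    ranking = np.argsort(mi)[::-1]                  # step 1: rank by MI
    best_err, best_m, best_k = np.inf, None, None
    for k in ks:                                    # step 2: scan K and m_l
        for m in range(1, ranking.size + 1):
            cols = ranking[:m]
            err = knn_error(X_tr[:, cols], y_tr, X_va[:, cols], y_va, k)
            if err < best_err:
                best_err, best_m, best_k = err, m, k
    return ranking[:best_m], best_k, best_err       # step 3: keep the best m_l
```

The double loop over `K` and `m_l` is the part that makes this selection stage costly, since every combination requires a full pass of the K-NN classifier over the validation set.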

There are some drawbacks to using KPCA as the building block of an MKM. It is time consuming, as the cross-validation stage of the feature selection process takes quite a long time. To circumvent this shortcoming, a more efficient kernel method, kernel partial least squares (KPLS), has been used instead. This approach removes the cross-validation stage and merges the selection process into the projection strategy. The features are selected in an iterative supervised fashion: at each iteration $$j$$, KPLS selects the $$j$$-th feature, the one most correlated with the class labels, by solving an updated eigen-problem. The eigenvalue $$\lambda_j$$ of the extracted feature indicates the discriminative importance of this feature. The number of iterations equals the number of features to be extracted by KPLS and is determined by thresholding $$\lambda_j$$.