User:Lizhongzheng/sandbox

Why Are Neural Networks So Powerful?
Well, that is a difficult question that has haunted us for about 30 years. The more successes we see from the applications of deep learning, the more puzzled we are. Why is it that we can plug this “thing” into pretty much any problem, classification or prediction, and with just a limited amount of tuning almost always get good results? While some of us are amazed by this power, particularly its universal applicability, others find it hard to be convinced without a deeper understanding.

Now here is an answer! Ready?
In short, neural networks extract from the data the most relevant part of the information that describes the statistical dependence between the features and the labels. In other words, the size of a neural network specifies a data structure that we can compute and store, and the result of training the network is the best approximation of the statistical relationship between the features and the labels that can be represented by this data structure.

I know you have two questions right away: '''REALLY? WHY?'''

The “why” part is a bit involved. We have a new paper that covers this. Briefly, we first need to define a metric that quantifies how valuable a piece of partial information is for a specific inference task, and then we can show that neural networks actually draw the most valuable part of the information from the data. As a bonus, the same argument can also be used to understand and compare other learning algorithms, PCA, compressed sensing, etc. So everything ends up in the same picture. Pretty cool, huh? (Be aware, it is a loooong paper.)

On this page, we try to answer the “Really?” question. One way to do that is to write a mathematical proof, which is included in the paper. Here, let’s just do some experiments.

Here is the Plan:
We will generate some data samples, $$(\underline{x}, y)$$ pairs, where $$\underline{x}$$ is the real-valued feature vector and $$y$$ is the label. We will use these data samples to train a simple neural network so that it can classify the feature vectors. After the training, we will take out some of the weights on the edges of the network and show that these weights actually are the empirical conditional expectations $${\mathbb E} [\underline{X}|Y=y]$$ for the different values of $$y$$.

So why are these conditional expectations worth computing? This goes all the way back to Alfréd Rényi and the notion of HGR correlation. In our paper, we show that this conditional expectation, as a function of $$y$$, is in fact the function that is the most relevant to the classification problem.

Well, if the HGR thing is too abstract here, you can think of it this way: there is a “correct” function that statisticians would like to compute. This is somewhat different from the conventional picture, where learning means learning the statistical model that governs $$(\underline{X}, Y)$$. Since the features are often very high-dimensional vectors or have other complex forms, learning this complete model can be difficult. The structure of a neural network puts a constraint on the number of parameters that we can learn, which is often much smaller than the number of parameters needed to specify the full statistical model of $$(\underline{X}, Y)$$. Thus, we can only hope to learn a good approximate model that can be represented by these parameters. What the HGR business and our paper say is that there is a theoretical way to identify the best approximation to the full model with only this number of parameters; and what we are demonstrating here is that, at least for this specific example, a neural network is computing exactly that!

Amazing, right? Imagine how extensions of this can help you use neural networks in your own problems!

To follow the experiments:
You can just read the code and comments on this page and trust me on the results, or you can run it yourself. To do that, you will need a standard Python environment, including NumPy, Matplotlib, etc. You will also need a standard neural network package. For that, I use Keras and run it with TensorFlow. You can follow this link to get them installed. I recommend using Anaconda to install them, which takes, if you do it right, less than 10 minutes. Trust me, it’s worth the effort. These packages are really well made and powerful. Of course, you are also welcome to just sit back, relax and enjoy the journey, as I will show you everything that you are supposed to see from these experiments.

To start
You need to have the following lines to initialize.

If you receive no error message after these, then congratulations, you have installed your packages right.

Generate Data Samples
Now let’s generate some data. To start with, let’s generate the feature vectors and labels $$ (\underline{x}_i, y_i), i=1, \ldots, N $$, from a Gaussian mixture model:

$$ p_{\underline{X}|Y}(\underline{x} | j) = {\cal N}(\underline{\mu}_j, \sigma^2I) $$

This is almost cheating, as we know right away that $${\mathbb E}[\underline{X}|Y=j] = \underline{\mu}_j $$, so we only need to look for these $$\underline{\mu}_j$$ values after the model is trained, and hope that they magically show up as weights somewhere in the network. Simple!

To make the story a little bit more interesting, we will pick the $$\underline{\mu}_j$$'s randomly. We will pick the probability $$p_Y$$ randomly too. Here is the code.
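A minimal sketch of this sampling step in NumPy might look like the following. The sizes `Cy`, `Dx` and `N`, the noise level `sigma`, the seed, and the particular way of drawing a random `mu` and `pY` are assumptions made for illustration, not values taken from the original experiment.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so runs are repeatable

Cy = 4        # number of classes |Y| (illustrative)
Dx = 2        # feature dimension (illustrative)
N = 20000     # number of samples (illustrative)
sigma = 0.5   # common standard deviation of every class (illustrative)

# Random class centers mu_j (one row per class) and a random prior p_Y.
mu = rng.normal(size=(Cy, Dx))
pY = rng.uniform(0.5, 1.5, size=Cy)
pY = pY / pY.sum()

# Draw each label from p_Y, then its feature vector from N(mu_y, sigma^2 I).
y = rng.choice(Cy, size=N, p=pY)
x = mu[y] + sigma * rng.normal(size=(N, Dx))

# Empirical conditional expectations E[X | Y = j], to compare against mu.
emp_mean = np.stack([x[y == j].mean(axis=0) for j in range(Cy)])
```

With this many samples, the rows of `emp_mean` should sit close to the rows of `mu`, which is the "almost cheating" part: we already know what we hope to find in the trained weights.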

We can plot the data as something like this:



One can pretty much eyeball the different classes. Some classes might be easier to separate, and some might be too close to separate well.

Use Neural Network for Classification
Now let’s make a neural network to nail these data. The network we would like to make has a single layer of nodes, not exactly deep learning, but a good starting point. It looks like this:



The network trains some weight parameters $$(\underline{v}_j, b_j)$$, which form linear transforms of the feature vector, denoted as $$Z_j = \underline{v}_j^T \cdot \underline{x} + b_j$$, one for each node, or neuron. The SoftMax output unit computes a distribution based on these weights,

$$Q^{(v,b)}_{Y|\underline{X}}(j | \underline{x}) = \frac{e^{Z_j}}{\sum_i e^{Z_i}}$$

and maximizes the likelihood with the given collection of samples

$$\max_{v, b} \sum_{i=1}^N \log Q_{Y|\underline{X}}^{(v,b)}(y_i| \underline{x}_i) $$
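To make the two formulas above concrete, here is a rough NumPy sketch of the softmax distribution, the log-likelihood objective, and a plain gradient-ascent loop that maximizes it. This is a library-free stand-in chosen for illustration, not the Keras model used in the experiments; the function names (`softmax_probs`, `log_likelihood`, `fit_softmax`), the weight layout, the step size and the number of steps are all assumptions.

```python
import numpy as np

def softmax_probs(x, v, b):
    """Q(j | x) for every row of x; v has shape (Dx, Cy), b has shape (Cy,)."""
    z = x @ v + b                          # Z_j = v_j^T x + b_j
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood(x, y, v, b):
    """Sum over samples of log Q(y_i | x_i)."""
    q = softmax_probs(x, v, b)
    return float(np.log(q[np.arange(len(y)), y]).sum())

def fit_softmax(x, y, num_classes, steps=500, lr=0.5):
    """Gradient ascent on the average log-likelihood, starting from zero weights."""
    n, dx = x.shape
    v = np.zeros((dx, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y]
    for _ in range(steps):
        resid = onehot - softmax_probs(x, v, b)   # gradient with respect to Z
        v += lr * (x.T @ resid) / n
        b += lr * resid.mean(axis=0)
    return v, b
```

Each column of the returned `v` plays the role of one $$\underline{v}_j$$, and each entry of `b` the role of one $$b_j$$.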

The only thing we need to specify in this process is the number of nodes, which we choose as the number of possible values of the labels, $$Cy = |{\cal Y}|$$ in the code. The code using Keras to implement this network is as simple as follows:

The resulting weights can be accessed by calling model.get_weights, where we get the following results, in comparison with the centers of each class:

They do not look all that similar, right? The trick is that we need to regulate them: each row vector above, viewed as a function of $$y$$, should have zero mean and unit variance with respect to $$p_Y$$. To do that, we make the following regulating function:
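A minimal sketch of such a regulating function could look like this. The name `regulate` and the array layout (one row per feature coordinate, one column per label value) are assumptions for illustration.

```python
import numpy as np

def regulate(f, pY):
    """Shift and scale each row of f, read as a function of y,
    to have zero mean and unit variance under the distribution pY."""
    f = np.asarray(f, dtype=float)
    pY = np.asarray(pY, dtype=float)
    mean = f @ pY                        # weighted mean of each row
    centered = f - mean[:, None]
    var = (centered ** 2) @ pY           # weighted variance of each row
    return centered / np.sqrt(var)[:, None]
```

Because the output is unchanged when a row is shifted by a constant or scaled by a positive factor, two arrays that differ only by such a shift and scale come out identical, which makes the weights and the class centers directly comparable.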

Finally, we can make the plots

and here are the results, comparing the empirical conditional expectation and the weights from the neural network:

A Non-Gaussian Example
Well, this is not so terribly surprising for the Gaussian mixture case. The MAP classifier would compute the distance from a sample to each conditional mean and weight it according to the prior. With all the scaling and shifting, it is not unthinkable that the procedure becomes an inner product with the conditional mean. In fact, I wish someone would make an animation of this, to see how the decision regions vary with the parameters, which could be a good demo for teaching classical decision theory. (I didn’t say “classical” in any condescending way.)
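One way to see this is to write out the posterior directly. This is the standard textbook calculation in the notation used above, not a summary of the paper's argument. For the Gaussian model with common variance $$\sigma^2$$,

$$ \log \left( p_Y(j) \, p_{\underline{X}|Y}(\underline{x} | j) \right) = \frac{1}{\sigma^2} \underline{\mu}_j^T \underline{x} + \left( \log p_Y(j) - \frac{\|\underline{\mu}_j\|^2}{2\sigma^2} \right) - \frac{\|\underline{x}\|^2}{2\sigma^2} + c $$

where $$c$$ is the Gaussian normalization constant. The last two terms do not depend on $$j$$, so they cancel when we normalize over the classes, and the true posterior has exactly the SoftMax form

$$ P_{Y|\underline{X}}(j | \underline{x}) = \frac{e^{Z_j}}{\sum_i e^{Z_i}}, \qquad Z_j = \underline{v}_j^T \underline{x} + b_j, \qquad \underline{v}_j = \frac{\underline{\mu}_j}{\sigma^2}, \qquad b_j = \log p_Y(j) - \frac{\|\underline{\mu}_j\|^2}{2\sigma^2} $$

Adding the same vector to every $$\underline{v}_j$$ leaves the SoftMax output unchanged, so the trained weights can only be expected to match the $$\underline{\mu}_j$$ up to a common shift and a positive scale. That is exactly what the zero-mean, unit-variance regulation removes.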

So how about we try a non-Gaussian case, say, samples with a Dirichlet distribution? Why Dirichlet? Because Python generates it, and I cannot remember the mean of this distribution.
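A rough sketch of how such samples could be generated with NumPy is below. The concentration parameters `alpha`, their range, the sample sizes, the seed, and the projection to 2-D by simply dropping the last coordinate are all assumptions made for illustration; the original experiment may have used different choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability

Cy = 3      # number of classes (illustrative)
N = 20000   # number of samples (illustrative)

# Random prior p_Y and one random concentration vector per class.
# The name `alpha` and its range are assumptions made for this sketch.
pY = rng.uniform(0.5, 1.5, size=Cy)
pY = pY / pY.sum()
alpha = rng.uniform(1.0, 5.0, size=(Cy, 3))

# Draw labels, then a Dirichlet sample on the simplex in 3-D for each label.
y = rng.choice(Cy, size=N, p=pY)
x3 = np.empty((N, 3))
for j in range(Cy):
    idx = (y == j)
    x3[idx] = rng.dirichlet(alpha[j], size=int(idx.sum()))

# Project to 2-D by dropping the last coordinate (an assumed projection).
x = x3[:, :2]

# Empirical conditional expectations E[X | Y = j] of the 2-D features.
emp_mean = np.stack([x[y == j].mean(axis=0) for j in range(Cy)])
```

The nice part is that `emp_mean` is computed from the samples alone, so nobody has to remember the formula for the mean.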

Here is what the generated data samples look like. I had to add in the colors, as otherwise it is really hard to see. Not a very clear clustering problem, is it? The strange triangle shape comes from the fact that the Dirichlet distribution is supported on a simplex. We generate samples on a simplex in 3-D space and project them down to 2-D.



And here is the comparison between the conditional expectation and the weights: