User:Dnlbreen/Maximally Informative Dimensions (MID)

Maximally Informative Dimensions, or MID, refers to a computational method based on information-theoretic principles used to calculate a neuron's receptive field. Maximizing information here means maximizing the Kullback–Leibler divergence between two probability distributions - the prior distribution of reduced stimuli and the distribution of reduced stimuli conditional on the neuronal response - by appropriate selection of the neuron's receptive field. When analyzing neural responses, Shannon's mutual information is useful because it provides a rigorous way of comparing these two probability distributions. As an optimization scheme, MID has the advantage that it does not depend on any specific statistical properties of the stimulus ensemble.

Background
Sensory neurons must transduce many aspects of physical objects into electrical signals. Sensitivity to the time evolution of these objects' physical properties is crucial to a biological organism's survival, compounding the complexity: a sensory neuron must bring order to this plethora of input. In the language of computational neuroscience, sensory stimuli from the external environment are inherently high dimensional. For instance, visual neurons, like cameras, receive input from planar grids containing hundreds of stimulus pixels or more. Time dependence increases this dimensionality by orders of magnitude.

Neurons fire action potentials selectively when interacting with their external environment. Sensory neurons fire selectively in response to raw physical stimuli - visual, olfactory, or auditory input. Less peripheral neurons - secondary sensory neurons and CNS neurons - receive input from other neurons through connections called synapses. Here, too, neurons remain particular about what causes them to fire, weighting input from some neurons lightly and from others heavily.

Mathematically, this means assuming that the probability of response can be described by a small number of receptive fields. Each of these receptive fields is interpreted as a coordinate axis of a Euclidean manifold embedded within a high-dimensional space, called the stimulus space. This means that the receptive field is assumed to respond linearly to the intensity of the spatiotemporal physical properties composing the total incoming signal. Each physical property is measured on a continuum and assigned its own coordinate axis within the total stimulus space. The assumption, then, is that the neuron responds only to particular characteristics of the input signal. Because the input signal is a vector and the neuron's receptive field is a linear subspace within this stimulus vector space, the neuron is said to respond to the incoming signal's summed projection onto the receptive field's coordinate axes. This is a well known functional model, with

(1) The neuron's receptive field can be characterized as a linear subspace within the total stimulus space

(2) The neuron's instantaneous spike rate is obtained by passing the output of this filter through a nonlinear function.

as two of the main premises of the linear-nonlinear model of neural spiking.

In the linear-nonlinear model, the linear filtering stage performs dimensionality reduction, reducing the high-dimensional spatiotemporal stimulus space $$ (s^1,s^2,...s^D) $$ to a low dimensional feature space $$ (x^1,x^2,...x^K) $$. This common theme recurs in statistics under the name dimension reduction. In neuroscience, this goes by the name feature detection, meaning that a neuron selects only a small number of dimensions in the input space which will influence the output response. A simple case would consist of a one-dimensional relevant subspace, requiring only the projection x of the stimulus s onto a single special direction. In a more general case,


 * $$ x^i = \mathbf{e}^i \cdot \mathbf{s}, \quad i = 1,...,K $$

are the set of projections of a stimulus $$ \mathbf{s} $$ onto special directions $$ \mathbf{e}^i $$. Maximizing the Kullback-Leibler divergence between $$ P(\mathbf{x}|\text{spike}) $$, the probability of observing the reduced stimulus $$ \mathbf{x} $$ conditional on a spike, and $$ P(\mathbf{x}) $$, the reduced stimulus prior probability distribution, will give a linear subspace corresponding to the neuron's receptive field. Once the relevant subspace is found, the focus turns towards characterizing what happens to the neural response as the position $$ (x^1,x^2,...x^K) $$ within the relevant subspace is varied. This knowledge is quantified by the input-output function, which in general can be arbitrarily nonlinear, meaning that the neuronal response as a function of position in stimulus space is complex. This input-output function can be experimentally obtained by sampling the probability distributions $$ P(\text{spike}|\mathbf{x}) $$ and $$ P(\text{spike}) $$ within the relevant subspace and taking their quotient.


 * $$ f(x^1,x^2,...x^K)= \frac{P(\text{spike}|\mathbf{x})}{P(\text{spike})}

$$

$$ P(\text{spike}) $$ is the probability of observing a spike, and $$ f(x^1,x^2,...x^K) $$ is the input-output function. One can use Bayes' theorem to rewrite this ratio in terms of $$ P(\mathbf{x}|\text{spike}) $$ and $$ P(\mathbf{x}) $$:


 * $$ f(x^1,x^2,...x^K)= \frac{P(\mathbf{x}|\text{spike})}{P(\mathbf{x})}

$$
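As a concrete illustration, this ratio of histograms can be computed for a simulated neuron. Everything in this sketch - the dimensionality, the single relevant direction, and the sigmoidal spike probability - is an assumed toy model, not part of the method itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model neuron: 20-dimensional white-noise stimuli and a sigmoidal
# spike probability along a single (assumed) relevant direction.
D, n = 20, 50_000
s = rng.standard_normal((n, D))
e = np.zeros(D)
e[0] = 1.0                                   # the assumed relevant direction

x = s @ e                                    # projections x = e . s
spikes = rng.random(n) < 1 / (1 + np.exp(-3 * (x - 1)))

# Estimate f(x) = P(x|spike) / P(x) from two normalized histograms.
bins = np.linspace(-4, 4, 33)
p_x, _ = np.histogram(x, bins=bins, density=True)
p_x_spike, _ = np.histogram(x[spikes], bins=bins, density=True)
f = np.divide(p_x_spike, p_x, out=np.zeros_like(p_x), where=p_x > 0)
```

For a real recording, `x` would come from projecting measured stimuli onto a candidate direction and `spikes` from the observed spike train; here the recovered `f` simply rises with `x`, mirroring the sigmoid built into the simulation.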

Information as an Objective Function
Feature detection implies that only K dimensions $$ (x^1,x^2,...,x^K) $$ of the total stimulus space will be relevant. We assume the sampling of the probability distribution P(x) is ergodic and stationary across repetitions and choose a single spike as the response of interest. Projecting the stimulus onto a trial direction $$ \mathbf{v}^i $$, so that $$ x^i = \mathbf{v}^i \cdot \mathbf{s} $$, and constructing the probability distributions along this direction gives


 * $$ \text{I}_\text{i,spike} = \int P(x^i|\text{spike})\log_2\!\left(\frac{P(x^i|\text{spike})}{P(x^i)}\right)\, \text{d}x^i

$$

Both probability distributions are obtained in practice as normalized histograms. The information carried by a spike within the full stimulus space is recovered only if the neuron's response depends on just one direction in that space. If the relevant subspace is multidimensional, the information $$ \text{I}_\text{i,spike} $$ along any single direction will fall below this value; finding all K relevant vectors and evaluating the information over the entire subspace recovers the limiting value $$ \text{I}_\text{spike} $$. Estimating this maximum average information per spike is an important step in the method of MID. Mathematically, the maximum average information per spike is
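The histogram construction of $$ \text{I}_\text{i,spike} $$ can be sketched directly. The simulated neuron below - white-noise stimuli, one relevant direction, a sigmoidal nonlinearity - is again an illustrative assumption; the point is that the relevant direction carries far more information per spike than a randomly chosen one:

```python
import numpy as np

rng = np.random.default_rng(1)

def info_per_spike(x, spikes, n_bins=32):
    """Histogram estimate of D_KL( P(x|spike) || P(x) ) in bits."""
    edges = np.histogram_bin_edges(x, bins=n_bins)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(x[spikes], bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    m = (p > 0) & (q > 0)
    return float(np.sum(q[m] * np.log2(q[m] / p[m])))

# Toy model neuron: one relevant direction in a 20-D stimulus space.
D, n = 20, 50_000
s = rng.standard_normal((n, D))
v_true = np.zeros(D)
v_true[0] = 1.0
spikes = rng.random(n) < 1 / (1 + np.exp(-3 * (s @ v_true - 1)))

I_relevant = info_per_spike(s @ v_true, spikes)

v_rand = rng.standard_normal(D)
I_random = info_per_spike(s @ (v_rand / np.linalg.norm(v_rand)), spikes)
```

`I_random` is close to zero (apart from the finite-sampling bias discussed below), while `I_relevant` approaches the limiting information per spike for this one-dimensional model.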


 * $$ \text{I}_\text{spike} = \int P(\mathbf{s}|\text{spike})\log_2(\frac{P(\mathbf{s}|\text{spike})}{P(\mathbf{s})})\, \text{d}\mathbf{s}

$$

where the integral is now evaluated over the entire spatiotemporal space instead of the smaller feature space. The probability distributions can be inverted using Bayes' rule, obtaining


 * $$ \text{I}_\text{spike} = \int P(\mathbf{s})\frac{P(\text{spike}|\mathbf{s})}{P(\text{spike})}\ \log_2(\frac{P(\text{spike}|\mathbf{s})}{P(\text{spike})})\, \text{d}\mathbf{s}

$$

which is useful in practice because the ergodic condition allows the ensemble average to be replaced by a time average. $$ P(\text{spike}|\mathbf{s}) $$ can be replaced with a time-dependent spike rate, while $$ P(\text{spike}) $$ can be replaced with the average neuronal firing rate.


 * $$ P(\text{spike}|\mathbf{s}) \longrightarrow \text{r(t)}

$$


 * $$ P(\text{spike}) \longrightarrow \bar{r}

$$


 * $$ P(\mathbf{s}) \longrightarrow \frac{1}{\text{T}}

$$

giving


 * $$ \text{I}_\text{spike} = \frac{1}{\text{T}} \int \frac{\text{r(t)}}{\bar{r}}\log_2(\frac{\text{r(t)}}{\bar{r}})\, \text{d}\text{t}

$$

This gives an estimate of the total information per spike, which serves as the target value when maximizing the information as a function of direction in stimulus space. Note that, because the number of recorded spikes is finite, the true value of the information per spike, corresponding to an infinite number of repetitions, is found by subtracting an expected bias that arises from binning spikes into finite-width, rather than infinitesimal, bins. This bias value is
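The time-average formula above can be evaluated directly on simulated data. The sinusoidal rate profile, bin width, and trial count below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical recording: a time-varying firing rate sampled in 5 ms bins,
# with Poisson spike counts over many repeats of a frozen stimulus.
T_bins, n_trials, dt = 1000, 200, 0.005
rate = 5.0 + 4.0 * np.sin(np.linspace(0, 8 * np.pi, T_bins))   # spikes/s
counts = rng.poisson(rate * dt, size=(n_trials, T_bins))

r_t = counts.mean(axis=0) / dt          # estimated time-dependent rate r(t)
r_bar = r_t.mean()                      # average firing rate

# I_spike = (1/T) * integral of (r/r_bar) log2(r/r_bar) dt, as a time average.
ratio = r_t / r_bar
terms = np.zeros_like(ratio)
nz = ratio > 0
terms[nz] = ratio[nz] * np.log2(ratio[nz])
I_raw = terms.mean()                    # raw estimate, bits per spike
```

Because the per-bin rate estimates are noisy, `I_raw` overshoots the true value; the bias correction described next accounts for this.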


 * $$ \text{I}_\text{bias} = \frac{1}{2\,\text{N}\,P(\text{spike})\ln(2)}

$$

or, expressing the same bias in terms of the recording duration T and the number of recorded spikes N,


 * $$ \text{I}_\text{bias} = \frac{\text{T}}{2\,\text{N}\ln(2)}

$$
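A numerical illustration of the correction, with hypothetical values assumed for the total spike count N, the number of time bins T, and the raw estimate:

```python
import numpy as np

# Worked example of the bias formula above (all values hypothetical).
N = 5000          # total number of recorded spikes
T = 1000          # number of time bins in the recording
I_raw = 0.40      # raw information estimate, bits per spike

I_bias = T / (2 * N * np.log(2))       # ≈ 0.144 bits
I_corrected = I_raw - I_bias
```

The correction shrinks toward zero as the number of recorded spikes grows relative to the number of time bins.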

Optimization Algorithm
Maximizing the information as a function of stimulus direction requires a way to compute the information. A starting estimate of the preferred direction, usually the spike-triggered average, is given as input. Because the information is a smooth function of the direction, its gradient can be computed and the information iteratively maximized by ascending this gradient. If the value of the information at the stationary point does not match the estimated total information per spike, additional dimensions are added and the process is repeated. The gradient can be computed as


 * $$ \nabla_i\text{I}_\text{spike} = \int P(x^i)\left(\langle\mathbf{s}|x^i\rangle_\text{spike}-\langle\mathbf{s}|x^i\rangle\right)\frac{\text{d}}{\text{d}x^i}\!\left[\frac{P(x^i|\text{spike})}{P(x^i)}\right]\, \text{d}x^i

$$

where


 * $$ \langle\mathbf{s}|x^i\rangle_\text{spike} = \frac{1}{P(x^i|\text{spike})} \int \mathbf{s}\,\delta(x^i-\mathbf{s} \cdot \mathbf{v}^i)\,P(\mathbf{s}|\text{spike})\, \text{d}\mathbf{s}

$$

Once all the relevant dimensions have been found, the spike-conditional probability distribution over the full spatiotemporal space carries no information beyond the spike-conditional distribution within the feature space, and the gradient vanishes. At a basic level, this method is a form of gradient ascent.
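The whole loop can be sketched as follows. This is a minimal illustration on a simulated neuron (assumed white-noise stimuli, one relevant direction, sigmoidal nonlinearity), using a discretized version of the gradient formula above with a fixed step size; published implementations typically add line searches and stochastic steps such as simulated annealing to escape local maxima:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model neuron: one relevant direction in a 10-D white-noise space.
D, n = 10, 40_000
s = rng.standard_normal((n, D))
v_true = np.zeros(D)
v_true[0] = 1.0
spikes = rng.random(n) < 1 / (1 + np.exp(-3 * (s @ v_true - 1)))

def mid_gradient(v, n_bins=24):
    """Discretized estimate of the information gradient along direction v."""
    x = s @ v
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    p = np.bincount(idx, minlength=n_bins) / n                     # P(x)
    q = np.bincount(idx[spikes], minlength=n_bins) / spikes.sum()  # P(x|spike)
    mean_all = np.zeros((n_bins, D))        # conditional means <s|x>
    mean_spk = np.zeros((n_bins, D))        # conditional means <s|x>_spike
    for k in range(n_bins):
        in_k = idx == k
        if in_k.any():
            mean_all[k] = s[in_k].mean(axis=0)
        if (in_k & spikes).any():
            mean_spk[k] = s[in_k & spikes].mean(axis=0)
    ratio = np.divide(q, p, out=np.zeros_like(q), where=p > 0)
    d_ratio = np.gradient(ratio, edges[1] - edges[0])
    return ((p * d_ratio)[:, None] * (mean_spk - mean_all)).sum(axis=0)

# Initialize with the spike-triggered average, then ascend the gradient.
v = s[spikes].mean(axis=0)
v /= np.linalg.norm(v)
for _ in range(30):
    v = v + 0.2 * mid_gradient(v)
    v /= np.linalg.norm(v)

overlap = abs(v @ v_true)   # approaches 1 as the direction is recovered
```

Here the learned direction is compared with the known one through their overlap; for real data, where no ground truth exists, the stopping criterion is instead the comparison with the estimated total information per spike.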

Summary
MID maximizes, as a function of direction in stimulus space, the Kullback–Leibler divergence between the distribution of stimulus projections conditional on the neural response and the prior distribution of those projections. The central assumption of MID is that the maximizing directions capture the special aspects of an input signal that a neuron is attuned to. Vectors which maximize the information, accounting for the total information per response of interest, span the relevant subspace. Reconstruction of the relevant subspace is done without ever assuming a form for the input-output function, which can be strongly nonlinear within the relevant subspace and is estimated from experimental histograms for each trial direction independently. Most significantly, this method is applicable to any stimulus ensemble whatsoever, including strongly non-Gaussian distributions such as natural signals.

Advantages and Disadvantages of MID
Although MID possesses advantages, such as its applicability to correlated natural stimuli, it also has problems as an optimization method. The number of histogram bins grows exponentially with the dimension of the relevant subspace. Because the data set is finite, this eventually corresponds to an exponential growth in noise. Information measures fluctuations from the average firing rate, and in high dimensions the contribution of noise fluctuations becomes large, so the information content of stimuli can be spuriously measured as large even for randomly drawn stimuli. Because of this limitation, it is difficult to perform MID analysis for cells with a large number of relevant dimensions, since sampling errors become too large.