User:Thepigdog/Vision systems

A vision system compares two images to determine whether they show the same 3 dimensional object, and to give the relative orientation of the object as seen in the two images.

This may be achieved by creating a mapping array, which joins locations in the two images that correspond to the same location on the 3 dimensional object.

Information theory tells us that the representation of the data that has the minimum amount of information is the most likely to be correct. The mapping should be chosen to minimize the combined information of the two images and the mapping.

Mapping of a pixel to a pixel
The mapping vector is a two dimensional vector which gives the relative position of the pixel in the first image to the pixel in the second image. The vector has neighboring vectors, left and right, and above and below. The average of these vectors gives an estimate for this vector. The variation of the vector may be assumed to follow a normal distribution, so a vector close to the average will have a high probability, and low information content.
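The neighbor-average estimate can be illustrated with a short sketch. It assumes the mapping is stored as a grid of `(dx, dy)` tuples, that each component varies independently, and that a single standard deviation `sigma` applies to both components; these choices and the function names are illustrative only, and the grid must hold at least two pixels.

```python
import math

def neighbor_average(field, row, col):
    """Average the mapping vectors left, right, above and below (row, col)."""
    rows, cols = len(field), len(field[0])
    near = [
        field[r][c]
        for r, c in ((row, col - 1), (row, col + 1), (row - 1, col), (row + 1, col))
        if 0 <= r < rows and 0 <= c < cols
    ]
    n = len(near)
    return (sum(v[0] for v in near) / n, sum(v[1] for v in near) / n)

def vector_information_bits(vector, estimate, sigma=1.0):
    """Bits for `vector` under a normal density centred on `estimate`.

    Assumption: the two components are independent with the same sigma.
    """
    bits = 0.0
    for value, mean in zip(vector, estimate):
        density = math.exp(-((value - mean) ** 2) / (2 * sigma ** 2)) / (
            sigma * math.sqrt(2 * math.pi)
        )
        bits += -math.log2(density)
    return bits

# A smooth field with one outlier in the centre.
field = [[(2.0, 0.0)] * 3 for _ in range(3)]
field[1][1] = (5.0, 0.0)
estimate = neighbor_average(field, 1, 1)
outlier_bits = vector_information_bits(field[1][1], estimate)
agreeing_bits = vector_information_bits((2.0, 0.0), estimate)
```

The outlier costs more bits than a vector that agrees with its neighbors, which is the behavior the text describes.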

The relative colors of the two pixels may also follow a consistent pattern because of the different lighting, or angle of the light in the two images. Again the color variation vector may be treated in the same way as the position mapping vector.

The combined information content of the quantities above gives a measure of the information needed to obtain the second pixel from the first. Summing over all pixels gives the information content required to obtain the second image from the first. This is the measure that must be minimized in order to determine the best fit.

Measure of color information between images
Differences in colors between pixels may be assumed to follow a normal distribution. If the colors are similar the normal distribution will give a high probability, which corresponds to a small information content.

The normal distribution,


 * $$p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{ -\frac{(x-\mu)^2}{2\sigma^2} } $$

The information needed to encode an event with probability p(x) is,


 * $$i(x) = - \log_2 p(x) $$

So the information content for the normal distribution is


 * $$i(x) = \log_2 \left( \sigma\sqrt{2\pi} \right) + \frac{(x-\mu)^2}{2\sigma^2} \log_2 e $$

To find the best fit the information content needs to be minimized. The information content is the sum of the information for each pixel. If the $$\sigma_i$$ are held fixed, the logarithmic terms are constant, so minimizing the information content amounts to minimizing the (weighted) sum of squares of differences (ssd),


 * $$ssd = \sum{ \frac{(x_i-\mu_i)^2}{\sigma_i^2} } $$
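The link between the information content and the weighted sum of squares can be checked numerically. The sketch below uses made-up sample values; converting the exponent of the normal density to bits multiplies the squared term by $$\log_2 e$$, so the total information equals a constant plus $$\tfrac{1}{2} \log_2 e$$ times the ssd.

```python
import math

def information_bits(x, mu, sigma):
    """-log2 of the normal density at x."""
    density = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi)
    )
    return -math.log2(density)

def weighted_ssd(xs, mus, sigmas):
    """Sum of squared differences, each weighted by 1 / sigma^2."""
    return sum(((x - m) / s) ** 2 for x, m, s in zip(xs, mus, sigmas))

# Illustrative values only.
xs = [10.0, 12.0, 9.0]
mus = [10.0, 10.0, 10.0]
sigmas = [1.0, 2.0, 1.0]

total_bits = sum(information_bits(x, m, s) for x, m, s in zip(xs, mus, sigmas))
constant_bits = sum(math.log2(s * math.sqrt(2 * math.pi)) for s in sigmas)
ssd = weighted_ssd(xs, mus, sigmas)

# total_bits == constant_bits + 0.5 * log2(e) * ssd, up to rounding
difference = total_bits - (constant_bits + 0.5 * math.log2(math.e) * ssd)
```

Because `constant_bits` does not depend on the $$x_i$$, the fit with the smallest ssd is also the fit with the smallest information content.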

The information to be recorded is the colors of the pixels. We wish to estimate and get a probability for each pixel in terms of the proximity of the other pixels. From the probabilities, information quantity is calculated. The following formula has been constructed to give a measure of the information content of the image I.



 * $$ I = c_s + c_c $$

The above formula sums the information content of each pixel i.

Every other pixel in the same image may be considered as an estimate for this pixel. Pixels close to this pixel will be more strongly correlated with it. In this model the color of the estimating pixel is taken as the mean, and the variance is the square of the distance between the two pixels. Estimates from each pixel are summed and then normalized.

$$c_s$$ is the information content, and the normalizing factor obtained from considering every other pixel from this image in estimating this pixel.



 * $$ c_s = \sum_{i} \sum_{j: j \ne i} e^{-((x_i-x_j)^2 + (y_i-y_j)^2 + (z_i-z_j)^2)} \frac{(r_i-r_j)^2 + (g_i-g_j)^2 + (b_i-b_j)^2} { (x_i-x_j)^2 + (y_i-y_j)^2 + (z_i-z_j)^2} $$
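A direct, unoptimized evaluation of this double sum might look like the following sketch. It assumes each pixel is a tuple `(x, y, z, r, g, b)`; that layout, the function name, and the decision to skip coincident points (to avoid division by zero) are assumptions made for illustration.

```python
import math

def self_information(pixels):
    """Evaluate the c_s double sum over all ordered pairs of distinct pixels."""
    total = 0.0
    for i, (xi, yi, zi, ri, gi, bi) in enumerate(pixels):
        for j, (xj, yj, zj, rj, gj, bj) in enumerate(pixels):
            if i == j:
                continue
            dist2 = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
            if dist2 == 0:
                continue  # assumption: ignore coincident points
            color2 = (ri - rj) ** 2 + (gi - gj) ** 2 + (bi - bj) ** 2
            # Nearby pixels get a large weight, distant ones almost none.
            total += math.exp(-dist2) * color2 / dist2
    return total

flat = [(0, 0, 0, 5, 5, 5), (1, 0, 0, 5, 5, 5), (0, 1, 0, 5, 5, 5)]
edge = [(0, 0, 0, 0, 0, 0), (1, 0, 0, 1, 0, 0)]

flat_cost = self_information(flat)  # uniform color contributes nothing
edge_cost = self_information(edge)  # two ordered pairs, each exp(-1) * 1 / 1
```

The cost grows with color differences between nearby pixels and is zero for a uniformly colored patch. The loop is quadratic in the number of pixels, so a practical version would probably restrict `j` to a small window around `i`.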

$$c_c$$ is the information content obtained from considering every pixel in the second image of the stereoscopic pair as an estimate for a pixel in the first image. The mapping m gives the coordinates of the pixel in the second image that receives light from the same point on the object. $$\sigma_s $$ represents the variation in color between pixels for the same point in the two images. The capital letters R, G, B represent the colors from the second image, while X, Y, and Z represent the coordinates of the pixel. z and Z represent depth, which for stereoscopic images is related to the X coordinate of the mapping vector.



 * $$ c_c = \sum_{i} \sum_{j} e^{-\left((x_i-X_{m(j)})^2 + (y_i-Y_{m(j)})^2 + (z_i-Z_{m(j)})^2\right) \cdot \sigma_s^2} \frac{(r_i-R_{m(j)})^2 + (g_i-G_{m(j)})^2 + (b_i-B_{m(j)})^2} { \left((x_i-X_{m(j)})^2 + (y_i-Y_{m(j)})^2 + (z_i-Z_{m(j)})^2\right) \cdot \sigma_s^2} $$

A simple algorithm
The algorithm proceeds by iterating through a number of steps until stability is found. For each step every pixel from the first image is visited.

Start up
An initial estimate of the vector must be provided. Then, in a neighborhood of the location given by this initial vector, search for the position of the vector that gives the minimum information.

Deltas
Determine the rate of change of information if the vectors are moved away from the minimum, in the directions left, right, up or down.

Iteration step
Modify each component of the vector, taking into account the neighboring deltas and the rate of change at this pixel to give a new vector value. Again calculate the deltas.
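The sketch below shows one way the iteration could be organized. It is not the delta-based update described above: for brevity it swaps in a discrete local search, in which each pixel tries the nine vectors within one step of its current vector and keeps the cheapest. The cost is a color term plus a smoothness term measured against the neighbor average. Grayscale images stored as lists of rows, the names, and the two sigma values are all assumptions.

```python
def refine_mapping(first, second, sigma_color=1.0, sigma_map=4.0, max_sweeps=20):
    """Sweep over the first image until no mapping vector changes."""
    rows, cols = len(first), len(first[0])
    field = [[(0, 0)] * cols for _ in range(rows)]  # start up: zero displacement

    def neighbor_mean(r, c):
        near = [field[rr][cc]
                for rr, cc in ((r, c - 1), (r, c + 1), (r - 1, c), (r + 1, c))
                if 0 <= rr < rows and 0 <= cc < cols]
        return (sum(v[0] for v in near) / len(near),
                sum(v[1] for v in near) / len(near))

    def cost(r, c, vec):
        dx, dy = vec  # dx is a column offset, dy a row offset
        rr, cc = r + dy, c + dx
        if not (0 <= rr < rows and 0 <= cc < cols):
            return float("inf")  # vector points outside the second image
        mx, my = neighbor_mean(r, c)
        color = (first[r][c] - second[rr][cc]) ** 2 / (2 * sigma_color ** 2)
        smooth = ((dx - mx) ** 2 + (dy - my) ** 2) / (2 * sigma_map ** 2)
        return color + smooth

    for _ in range(max_sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                dx, dy = field[r][c]
                candidates = [(dx + ex, dy + ey)
                              for ex in (-1, 0, 1) for ey in (-1, 0, 1)]
                best = min(candidates, key=lambda v: cost(r, c, v))
                if best != field[r][c]:
                    field[r][c] = best
                    changed = True
        if not changed:
            break  # stability reached
    return field

# Second image is the first shifted one pixel to the right.
first = [[10 * r + c for c in range(6)] for r in range(4)]
second = [[-100] + row[:-1] for row in first]
field = refine_mapping(first, second)
```

In this toy case every pixel that has a true match in the second image ends up with the vector `(1, 0)`. Pixels in the last column have no match and settle on whatever costs least.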

The meta algorithm
The above algorithm may be applied first to a lower resolution image, giving initial estimates for mapping vectors. Then at each step in the increase in resolution the vectors from the previous step may be used as estimates for the next step.
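A coarse-to-fine driver could be sketched as follows. The 2x2 averaging, the doubling of vectors when moving to a finer level, and the `refine(first, second, field)` callback are assumptions made for illustration; any per-level matcher that accepts an initial field could be passed in.

```python
def downsample(image):
    """Halve the resolution by averaging 2x2 blocks (odd edges are dropped)."""
    return [[(image[r][c] + image[r][c + 1]
              + image[r + 1][c] + image[r + 1][c + 1]) / 4
             for c in range(0, len(image[0]) - 1, 2)]
            for r in range(0, len(image) - 1, 2)]

def upsample_field(field, rows, cols):
    """Spread coarse vectors over a finer grid, doubling their length."""
    max_r, max_c = len(field) - 1, len(field[0]) - 1
    fine = []
    for r in range(rows):
        row = []
        for c in range(cols):
            dx, dy = field[min(r // 2, max_r)][min(c // 2, max_c)]
            row.append((2 * dx, 2 * dy))
        fine.append(row)
    return fine

def coarse_to_fine(first, second, refine, levels=3):
    """Run `refine` from the coarsest level up to full resolution."""
    pyramid = [(first, second)]
    for _ in range(levels - 1):
        a, b = pyramid[-1]
        if len(a) < 4 or len(a[0]) < 4:
            break  # too small to halve again
        pyramid.append((downsample(a), downsample(b)))
    a, b = pyramid[-1]
    field = refine(a, b, [[(0, 0)] * len(a[0]) for _ in a])
    for a, b in reversed(pyramid[:-1]):
        field = upsample_field(field, len(a), len(a[0]))
        field = refine(a, b, field)
    return field

# Placeholder matcher that returns its initial estimate unchanged.
image = [[float(r + c) for c in range(8)] for r in range(8)]
result = coarse_to_fine(image, image, lambda a, b, f: f)
```

Vectors are doubled on the way up because a displacement of one pixel at a coarse level corresponds to two pixels at the next finer level.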

Alternate algorithm
The above algorithm is not necessarily optimal, but its relative simplicity makes it easy to implement. By optimizing the processing at the pixel level it is possible that better performance may be obtained than from a more sophisticated algorithm.

An alternate approach would be to classify the pixels in the second image by the information content in the neighborhood, and by color. This information would be added to a tree structure for efficient retrieval.

The pixels in the first image would then be sorted by local information content in a neighborhood. Starting from the pixel with the most information look up the tree for the second image and locate a group of pixels similar to this one. Then choose the best one based on closeness to prior vectors. After choosing each vector adjust the surrounding vectors in the neighborhood in a similar manner to the simple algorithm.

This alternate algorithm allows processing by filling in points of interest first while not wasting time analyzing blank parts of the image. But the benefits may be outweighed by the more complex data structures and processing.
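A much-simplified sketch of this idea follows. It makes several substitutions that should be read as assumptions: a dictionary of color buckets stands in for the tree structure, local contrast against the four neighbors stands in for local information content, images are grayscale, and the "prior vector" is simply the most recently chosen one. The adjustment of surrounding vectors is omitted.

```python
from collections import defaultdict

def local_contrast(image, r, c):
    """Stand-in for local information: squared differences to 4 neighbors."""
    rows, cols = len(image), len(image[0])
    return sum((image[r][c] - image[rr][cc]) ** 2
               for rr, cc in ((r, c - 1), (r, c + 1), (r - 1, c), (r + 1, c))
               if 0 <= rr < rows and 0 <= cc < cols)

def build_index(image, bucket=8):
    """Group pixel positions of the second image by quantized color."""
    index = defaultdict(list)
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            index[int(value // bucket)].append((r, c))
    return index

def match_by_interest(first, second, bucket=8):
    """Match the most informative pixels first, favoring the prior vector."""
    index = build_index(second, bucket)
    order = sorted(((r, c) for r in range(len(first))
                    for c in range(len(first[0]))),
                   key=lambda p: local_contrast(first, *p), reverse=True)
    vectors = {}
    prior = (0, 0)
    for r, c in order:
        candidates = index.get(int(first[r][c] // bucket), [])
        if not candidates:
            continue  # no pixel of similar color in the second image
        rr, cc = min(candidates,
                     key=lambda q: (q[1] - c - prior[0]) ** 2
                                   + (q[0] - r - prior[1]) ** 2)
        prior = (cc - c, rr - r)
        vectors[(r, c)] = prior
    return vectors

# One bright point that moves one pixel to the right.
first = [[0] * 4 for _ in range(3)]
second = [[0] * 4 for _ in range(3)]
first[1][1] = 200
second[1][2] = 200
vectors = match_by_interest(first, second)
```

The bright point is matched first and receives the vector `(1, 0)`. The featureless background is handled afterwards and leans on that prior.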

Measure of similarity
If the total information content of the mapping is larger than the information content of the second image then the mapping has failed. Possibly the two images are unrelated.

The size of the information content of the mapping is a measure of the similarity between the two objects. Small size indicates similar objects.

Stereoscopy
In stereoscopic images, the mapping vector reveals 3 dimensional information about the object. The size of the X component of the mapping vector at a point is related to the distance Z from the eye or camera.


 * $$ Z = W / M_x $$

The mapping is normalized so that $$M_x = 0$$ when $$Z$$ is far away (at infinity).
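As a small worked example, the relation can be applied to a few values of $$M_x$$. The value chosen for $$W$$ is arbitrary and only for illustration. In the usual stereo camera model, the corresponding constant is the focal length multiplied by the distance between the two cameras.

```python
import math

def depth_from_disparity(m_x, w=1.0):
    """Z = W / M_x, with zero disparity mapped to infinite distance."""
    if m_x == 0:
        return math.inf
    return w / m_x

# Larger horizontal displacement means a closer point.
depths = [depth_from_disparity(m, w=100.0) for m in (0, 2, 5, 20)]
```

With `w=100.0` the four displacements give distances of infinity, 50, 20 and 5 in whatever units `w` implies.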

Relative orientation
For images of the same or similar objects taken at different times, patterns of change in the mapping vectors reveal rotation of the object around 3 axes. The 3 axes are,
 * X axis - Horizontal axis (relative to the camera or eye).
 * Y axis - Vertical axis (relative to the camera).
 * Z axis - Distance away from the camera.

In the following summary consider a traversal of the points in the images along an axis comparing the previous and the current mapping vectors.


 * X axis. Left to right change in vectors in the X direction indicates left right rotation.
 * $$ {dm_x \over dx} $$
 * Y axis. Down to up change in vectors in the Y direction indicates up down rotation.
 * $$ {dm_y \over dy} $$
 * Any axis. Change in the vectors perpendicular to the traversal indicates Z axis rotation.
 * $$ ({dm_x \over dy}, {dm_y \over dx}) $$ is a tangent vector to an ellipse around the axis of rotation

From this information the relative orientation of the object in the two images may be calculated.
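The derivatives listed above can be estimated from a mapping field by finite differences. The sketch assumes the field is a grid of `(m_x, m_y)` tuples with rows running along Y and columns along X, and uses central differences at an interior point; the names are illustrative. The example field is a small rotation about the Z axis through the central pixel.

```python
def field_gradients(field, r, c):
    """Central-difference derivatives of the mapping at interior point (r, c)."""
    return {
        "dmx_dx": (field[r][c + 1][0] - field[r][c - 1][0]) / 2,
        "dmy_dy": (field[r + 1][c][1] - field[r - 1][c][1]) / 2,
        "dmx_dy": (field[r + 1][c][0] - field[r - 1][c][0]) / 2,
        "dmy_dx": (field[r][c + 1][1] - field[r][c - 1][1]) / 2,
    }

# Small rotation by theta about the Z axis: m = theta * (-(y - 1), x - 1).
theta = 0.1
field = [[(-theta * (r - 1), theta * (c - 1)) for c in range(3)]
         for r in range(3)]
grads = field_gradients(field, 1, 1)
```

For this field the two diagonal terms are zero, and the cross terms `dmx_dy` and `dmy_dx` are equal and opposite with magnitude `theta`, which is the signature of Z axis rotation described in the list.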