
A semantic texton forest is a feature extraction technique in computer vision that can be used to classify individual pixels in an image into object categories using a small window of local context around each pixel. Semantic texton forests (STF) have been used successfully in higher-level techniques to achieve real-time semantic segmentation with near state-of-the-art accuracy.

General
Semantic texton forests are random forests constructed from pixel-wise comparisons within a small image patch of $$d \times d$$ pixels centred on a given pixel in the image, where $$d$$ is typically between 5 and 30 pixels.

The semantic textons in semantic texton forests are prototypical patches of texture that are associated with classes of objects found in a set of images. Semantic textons are so called because they are trained to be discriminative of image object categories, unlike object-class-neutral features such as edges.

Semantic textons are represented in semantic texton forests as the leaf nodes of trees. The set of semantic textons for a tree is chosen so as to make each semantic texton maximally predictive of the image classes in which that texton appears. At test time, a pixel is assigned a class distribution by recursively applying the learned split tests of each tree in the forest to pixels inside the pixel's $$d \times d$$ neighbourhood, yielding $$T$$ leaf nodes that correspond to different estimates of the distribution $$P(\text{class}|\text{leaf node})$$. All class distributions are averaged together, and the estimated class for the pixel is predicted to be the mode of the resulting distribution.
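The test-time procedure above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Node` class, its fields, and the traversal helper are assumptions made for the sketch.

```python
import numpy as np

# Hypothetical minimal tree node: internal nodes hold a split test,
# leaf nodes hold an estimated class distribution P(class | leaf).
class Node:
    def __init__(self, split=None, left=None, right=None, dist=None):
        self.split = split           # callable: patch -> bool (True = go left)
        self.left, self.right = left, right
        self.dist = dist             # class distribution at a leaf, else None

def leaf_distribution(tree, patch):
    """Route a d x d patch down one tree to its leaf distribution."""
    node = tree
    while node.dist is None:
        node = node.left if node.split(patch) else node.right
    return node.dist

def classify_pixel(forest, patch):
    """Average P(class | leaf) over all T trees and return the mode."""
    avg = np.mean([leaf_distribution(t, patch) for t in forest], axis=0)
    return int(np.argmax(avg))
```

For example, a two-tree forest whose leaf distributions average to $$[0.7, 0.3]$$ would predict class 0 for that pixel.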

At the core of semantic texton forests are the split functions used to construct the individual trees of semantic textons. In general, the split functions are operations between two pixels $$p_{x_1,y_1,z_1}$$ and $$p_{x_2,y_2,z_2}$$, for any positions $$(x_i, y_i)$$ in the local window and any colour channels $$z_i$$ (the two positions and channels may coincide). Commonly used split functions include the sum, the difference, and the absolute difference of the two pixel values. More general functions, including Haar-like features, have also been evaluated for use in split tests.
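The three pixel-pair split tests can be sketched as below. The function names, the argument order, and the comparison against a threshold are illustrative assumptions; `patch` is taken to be a $$d \times d \times C$$ array of pixel values.

```python
import numpy as np

# Sketch of pixel-pair split tests over a d x d x C patch.
# (x1, y1, z1) and (x2, y2, z2) index positions and colour channels
# inside the patch; names and the threshold form are illustrative.
def split_sum(patch, x1, y1, z1, x2, y2, z2, threshold):
    return patch[y1, x1, z1] + patch[y2, x2, z2] > threshold

def split_diff(patch, x1, y1, z1, x2, y2, z2, threshold):
    return patch[y1, x1, z1] - patch[y2, x2, z2] > threshold

def split_absdiff(patch, x1, y1, z1, x2, y2, z2, threshold):
    return abs(patch[y1, x1, z1] - patch[y2, x2, z2]) > threshold
```

Each test sends the patch to the left or right child of a tree node depending on the boolean outcome.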



Advantages

 * Speed: STF requires only $$O(TD)$$ operations per input image pixel for classification, where $$T$$ is the number of trees per forest (typically between 30 and 100) and $$D$$ is the maximum tree depth (typically between 5 and 15). STF can hence be used in real time in some applications.
 * Simplicity: STF is implemented using pixel-level comparisons in random forests.

Limitations

 * STF does not model inter-pixel relationships in an image, so by itself it cannot enforce local structural coherence in a segmentation.
 * STF is discriminative and requires fully supervised training, limiting some of its potential uses (under weak supervision, labels are pre-filled into the image by randomly sampling the class distribution $$P(\text{class}|\text{topic})$$).
 * STF can only discriminate between a fixed number of classes and hence does not easily adapt to continuously growing or large datasets.

Training
Semantic texton forests are trained using variants of the random forests algorithm. Each tree is trained using a fixed-size random subset of the training set instead of a bootstrap sample. Some variants use a generalisation of random forests called extremely randomised trees, or Extra-Trees.

There are three ways to train semantic texton forests, depending on the type of labels available for an image. These are:
 * 1) Full supervision,
 * 2) Weak supervision, and
 * 3) No supervision (unsupervised)

In full supervision, each pixel in the images of the training set is labelled with some object class (or none).

At each step of the training algorithm, a set of randomly selected candidate split tests (described above) is considered, and the one that maximises the information gain in the class distribution $$P(\text{class}|\text{node})$$ is chosen. The tree is grown until the maximum depth $$D$$ is reached, the information gain is zero, or the number of training instances falls below a set threshold. The distribution $$P(\text{class}|\text{leaf node})$$ is estimated from the observed class frequencies at that node. To smooth the resulting distribution, a small Dirichlet prior can be used.
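A minimal sketch of this split-selection step, assuming per-example class labels and a list of callable candidate tests (all function names and signatures are illustrative):

```python
import numpy as np

def entropy(labels, n_classes):
    """Shannon entropy of the empirical class distribution."""
    counts = np.bincount(labels, minlength=n_classes)
    p = counts / max(len(labels), 1)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(labels, mask, n_classes):
    """Gain of splitting `labels` by the boolean `mask`."""
    left, right = labels[mask], labels[~mask]
    n = len(labels)
    return (entropy(labels, n_classes)
            - len(left) / n * entropy(left, n_classes)
            - len(right) / n * entropy(right, n_classes))

def best_split(patches, labels, candidate_tests, n_classes):
    """Return the candidate test with the highest information gain."""
    gains = []
    for test in candidate_tests:
        mask = np.array([bool(test(p)) for p in patches])
        gains.append(information_gain(labels, mask, n_classes))
    return candidate_tests[int(np.argmax(gains))]

def leaf_distribution(labels, n_classes, alpha=1.0):
    """Class frequencies smoothed by a symmetric Dirichlet prior."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return (counts + alpha) / (counts.sum() + alpha * n_classes)
```

A test that separates the classes perfectly gets the maximal gain, and the Dirichlet prior keeps unseen classes at a small non-zero probability in $$P(\text{class}|\text{leaf node})$$.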

In weak supervision, none of the pixels have labels attached to them. Instead, the image as a whole has a set of topics, or tags, attached to it. A topic can be considered to be the underlying meaning generating object classes within an image (an image with the topic "city" would tend to generate classes corresponding to buildings and cars). In other words, we have a distribution $$P(\text{class }c | \text{topic }t)$$.

To train a classifier to recognise classes of objects under weak supervision, first pre-process the image as follows:
 * Sample labels for all pixels $$p$$ in the image from the distribution $$P(p|t) = P(p|c)P(c|t)$$. If $$P(p|c)$$ is the uniform distribution, the labels are sampled from the class distribution $$P(c|t)$$ directly.
 * Use the fully supervised training method outlined above on the automatically labelled data.
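The label-sampling step above can be sketched as follows. The topic-to-class table is an invented illustration of $$P(\text{class}|\text{topic})$$, and the function name is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative topic-to-class table standing in for P(class | topic);
# the classes for "city" might be building, car, and road.
p_class_given_topic = {
    "city": np.array([0.5, 0.3, 0.2]),
}

def sample_pixel_labels(height, width, topic):
    """Fill an image-sized label map by sampling P(class | topic)."""
    dist = p_class_given_topic[topic]
    return rng.choice(len(dist), size=(height, width), p=dist)

labels = sample_pixel_labels(4, 4, "city")
```

The resulting label map is then fed to the fully supervised training procedure as if it were ground truth.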

In unsupervised mode, a semantic texton forest no longer acts as a classifier but as a clusterer. Here, the split tests used for tree construction are chosen so as to split the training data as evenly as possible into distinct classes.
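Choosing the most even split can be sketched as below; the imbalance measure and function names are illustrative assumptions.

```python
# Sketch of unsupervised split selection: pick the candidate test that
# divides the training patches most evenly between the two children.
def most_balanced_split(patches, candidate_tests):
    def imbalance(test):
        n_left = sum(bool(test(p)) for p in patches)
        return abs(2 * n_left - len(patches))  # 0 means a perfectly even split
    return min(candidate_tests, key=imbalance)
```

Repeating this choice at every node grows trees whose leaves partition the data into clusters of roughly equal size.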

Example
In order to use semantic texton forests to perform segmentation and classification, we would:
 * 1) Train a semantic texton forest on a set of images with labels (such as the VOC 2007 dataset).
 * 2) Since STF uses image pixels directly, it does not inherently possess many desirable invariance properties, such as rotational invariance or scale invariance. To achieve good generalisation performance, transformed copies of the training images should be added to the training set.
 * 3) To add the desired levels of invariance to STF, typical transformations applied to the training images include combinations of the following: scaling the images up or down by a small factor (up to around 1.2), mirroring the images along the vertical axis, and rotating them by a small angle.
 * 4) Classify each pixel of an unlabelled image using the trained STF. The predicted class $$c_*$$ is the mode of the averaged class distributions of the trees in the forest: $$c_* = \arg\max_{c_i} P(c_i | L) = \arg\max_{c_i} \frac{1}{T} \sum_{t=1}^{T} P(c_i | l_t)$$, where $$l_t$$ is the leaf reached in tree $$t$$ and $$L$$ is the set of all $$T$$ leaves.
 * 5) The pixel labels produced by STF are unlikely to correspond exactly to the boundaries of any object (i.e. they contain noise). One way to obtain a more robust segmentation of the image is to decrease the resolution of the segmentation drastically, summarising the lower-level labels. The STF-produced segmentation can additionally be embedded in a conditional random field model to achieve globally more coherent results, or to refine the low-resolution grid segmentation down to individual pixels.
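The low-resolution summarisation in step 5 could be sketched as a majority vote over fixed-size cells; the cell size $$k$$ and the function name are illustrative assumptions.

```python
import numpy as np

# Sketch: downsample a noisy per-pixel label map by taking the
# majority label within each k x k cell of the image.
def coarse_segmentation(label_map, k, n_classes):
    h, w = label_map.shape
    out = np.empty((h // k, w // k), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            cell = label_map[i*k:(i+1)*k, j*k:(j+1)*k].ravel()
            out[i, j] = np.bincount(cell, minlength=n_classes).argmax()
    return out
```

Each coarse cell then carries the dominant label of its pixels, suppressing isolated misclassifications at the cost of boundary precision.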