Semi-global matching

Semi-global matching (SGM) is a computer vision algorithm for the estimation of a dense disparity map from a rectified stereo image pair, introduced in 2005 by Heiko Hirschmüller while working at the German Aerospace Center. Given its predictable run time, its favourable trade-off between quality of the results and computing time, and its suitability for fast parallel implementation in ASIC or FPGA, it has seen wide adoption in real-time stereo vision applications such as robotics and advanced driver assistance systems.

Problem
Pixelwise stereo matching allows real-time calculation of disparity maps by measuring the similarity of each pixel in one stereo image to each pixel within a subset of the other stereo image. Given a rectified stereo image pair, for a pixel with coordinates $$(x, y)$$ the set of candidate pixels in the other image is usually selected as $$\{ (\hat{x},y) \mid x \le \hat{x} \le x + D \}$$, where $$D$$ is the maximum allowed disparity shift.
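A minimal numpy sketch of this brute-force pixelwise search (the use of absolute intensity difference as the similarity measure is an illustrative assumption; this is the spurious-match-prone baseline, not SGM itself):

```python
import numpy as np

def naive_disparity(img1, img2, max_disp):
    """Brute-force pixelwise matching: for each pixel (x, y) in img1,
    pick the shift d in [0, max_disp] minimising the absolute intensity
    difference against pixel (x + d, y) in img2, following the search
    set defined above."""
    h, w = img1.shape
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        # compare img1 at x with img2 at x + d (valid for x < w - d)
        cost[d, :, :w - d] = np.abs(img1[:, :w - d] - img2[:, d:])
    return cost.argmin(axis=0)
```

On a synthetic pair where one image is a shifted copy of the other, the known shift is recovered wherever the search window is valid.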

A simple search for the best matching pixel produces many spurious matches, and this problem can be mitigated with the addition of a regularisation term that penalises jumps in disparity between adjacent pixels, with a cost function in the form
 * $$E(\boldsymbol{d}) = \sum_p D(p, d_p) + \sum_{p,q \in \mathcal{N}} R(p, d_p, q, d_q)$$

where $$D(p, d_p)$$ is the pixel-wise dissimilarity cost at pixel $$p$$ with disparity $$d_p$$, and $$R(p, d_p, q, d_q)$$ is the regularisation cost between pixels $$p$$ and $$q$$ with disparities $$d_p$$ and $$d_q$$ respectively, for all pairs of neighbouring pixels $$\mathcal{N}$$. This constraint can be efficiently enforced on a per-scanline basis by using dynamic programming (e.g. the Viterbi algorithm), but such a limitation can still introduce streaking artefacts in the depth map, because little or no regularisation is performed across scanlines.

A possible solution is to perform global optimisation in 2D, which is however an NP-complete problem in the general case. For some families of cost functions (e.g. submodular functions) a solution with strong optimality properties can be found in polynomial time using graph cut optimisation; however, such global methods are generally too expensive for real-time processing.

Algorithm
The idea behind SGM is to perform line optimisation along multiple directions, computing an aggregated cost $$S(p, d)$$ by summing the costs to reach pixel $$p$$ with disparity $$d$$ from each direction. The number of directions affects the run time of the algorithm, and while 16 directions usually ensure good quality, a lower number can be used to achieve faster execution. A typical 8-direction implementation of the algorithm computes the cost in two passes: a forward pass accumulating the cost from the left, top-left, top, and top-right, and a backward pass accumulating the cost from the right, bottom-right, bottom, and bottom-left. A single-pass algorithm can be implemented with only five directions.
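The two passes can be summarised by their direction vectors; a small sketch (the (dy, dx) coordinate convention is an assumption made here for illustration):

```python
# Direction vectors (dy, dx) for a two-pass, 8-direction SGM sweep.
# The forward pass visits pixels in raster order and accumulates costs
# from neighbours that have already been processed; the backward pass
# visits pixels in reverse raster order and mirrors those directions.
FORWARD = [(0, -1), (-1, -1), (-1, 0), (-1, 1)]   # left, top-left, top, top-right
BACKWARD = [(-dy, -dx) for dy, dx in FORWARD]     # right, bottom-right, bottom, bottom-left
print(BACKWARD)  # prints [(0, 1), (1, 1), (1, 0), (1, -1)]
```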

The cost is composed of a matching term $$D(p, d)$$ and a binary regularisation term $$R(d_p, d_q)$$. The former can in principle be any local image dissimilarity measure; commonly used functions are absolute or squared intensity difference (usually summed over a window around the pixel, and after applying a high-pass filter to the images to gain some illumination invariance), the Birchfield–Tomasi dissimilarity, the Hamming distance of the census transform, and Pearson correlation (normalised cross-correlation). Even mutual information can be approximated as a sum over the pixels, and thus used as a local similarity metric. The regularisation term has the form

 * $$R(d_p, d_q) = \begin{cases} 0 \quad &d_p = d_q \\ P_1 &|d_p - d_q| = 1 \\ P_2 &|d_p - d_q| > 1 \end{cases}$$

where $$P_1$$ and $$P_2$$ are two constant parameters, with $$P_1 < P_2$$. The three-way comparison allows assigning a smaller penalty to unitary changes in disparity, thus allowing smooth transitions corresponding e.g. to slanted surfaces, and penalising larger jumps while preserving discontinuities, thanks to the constant penalty term. To further preserve discontinuities, the gradient of the intensity can be used to adapt the penalty term, because discontinuities in depth usually correspond to discontinuities in image intensity $$I$$, by setting
 * $$P_2 = \max \left\{ P_1, \frac{\hat{P}_2}{|I(p) - I(q)|} \right\}$$

for each pair of pixels $$p$$ and $$q$$.
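As a concrete illustration of both cost components, here is a minimal numpy sketch of a census-based matching term and the intensity-adaptive penalty above (the 3×3 window size, the wrap-around border handling, and the zero-gradient fallback are assumptions made for brevity, not details from the text):

```python
import numpy as np

def census_3x3(img):
    """3x3 census transform: each pixel becomes an 8-bit code recording
    which neighbours are darker than the centre (borders wrap around,
    a simplification for brevity)."""
    code = np.zeros(img.shape, dtype=np.uint8)
    bit = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dy, dx) == (0, 0):
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            code |= (shifted < img).astype(np.uint8) << bit
            bit += 1
    return code

def hamming_cost(code1, code2):
    """Matching term D: number of differing bits between two census codes."""
    return np.unpackbits((code1 ^ code2)[..., None], axis=-1).sum(axis=-1)

def adaptive_p2(p1, p2_hat, i_p, i_q):
    """Intensity-adaptive penalty P2 = max(P1, P2_hat / |I(p) - I(q)|).
    The zero-gradient case (division by zero) is handled by returning
    P2_hat unscaled, an assumption not spelled out by the formula."""
    grad = abs(i_p - i_q)
    return max(p1, p2_hat / grad) if grad > 0 else p2_hat
```

A strong intensity edge between two pixels lowers the large-jump penalty (down to the floor $$P_1$$), making a depth discontinuity cheaper exactly where the image suggests one.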

The accumulated cost $$S(p, d) = \sum_r L_r(p, d)$$ is the sum of all costs $$L_r(p, d)$$ to reach pixel $$p$$ with disparity $$d$$ along direction $$r$$. Each term can be expressed recursively as

 * $$L_r(p,d) = D(p,d) + \min \left\{ L_r(p - r, d),\ L_r(p - r, d - 1) + P_1,\ L_r(p - r, d + 1) + P_1,\ \min_i L_r(p - r, i) + P_2 \right\} - \min_k L_r(p - r, k)$$

where the minimum cost at the previous pixel $$\min_k L_r(p - r, k)$$ is subtracted for numerical stability, since it is constant for all values of disparity at the current pixel and therefore does not affect the optimisation.
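A minimal numpy sketch of this recursion for a single direction (left to right along each scanline; the penalty values and the (H, W, D) cost-volume layout are illustrative assumptions):

```python
import numpy as np

def aggregate_left_to_right(cost, P1=10, P2=150):
    """Accumulate the SGM path cost L_r along one direction (left to
    right along each scanline). `cost` is the matching cost volume
    D(p, d) with shape (H, W, D); P1 and P2 are illustrative defaults."""
    h, w, d = cost.shape
    L = np.empty((h, w, d), dtype=np.float64)
    L[:, 0, :] = cost[:, 0, :]  # no predecessor at the image border
    for x in range(1, w):
        prev = L[:, x - 1, :]
        prev_min = prev.min(axis=1, keepdims=True)  # min_k L_r(p - r, k)
        # Transition costs: same disparity, +/-1 step with penalty P1,
        # arbitrary jump with penalty P2.
        up = np.full_like(prev, np.inf)
        up[:, 1:] = prev[:, :-1]       # L_r(p - r, d - 1)
        down = np.full_like(prev, np.inf)
        down[:, :-1] = prev[:, 1:]     # L_r(p - r, d + 1)
        best = np.minimum(np.minimum(prev, prev_min + P2),
                          np.minimum(up + P1, down + P1))
        # Subtract prev_min for numerical stability (constant in d).
        L[:, x, :] = cost[:, x, :] + best - prev_min
    return L
```

A full implementation would run the same update for each of the chosen directions and sum the resulting $$L_r$$ volumes into $$S$$.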

The value of disparity at each pixel is given by $$d^*(p) = \operatorname{argmin}_d S(p, d)$$, and sub-pixel accuracy can be achieved by fitting a curve through $$d^*(p)$$ and its neighbouring costs and taking the minimum along the curve. Since the two images in the stereo pair are not treated symmetrically in the calculations, a consistency check can be performed by computing the disparity a second time in the opposite direction, swapping the roles of the left and right image, and invalidating the pixels where the two results differ. Further post-processing techniques for the refinement of the disparity image include morphological filtering to remove outliers, intensity consistency checks to refine textureless regions, and interpolation to fill in pixels invalidated by consistency checks.
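A sketch of the left-right consistency check, under an assumed sign convention (a disparity of $$d$$ in the left map points $$d$$ pixels to the left in the right map; actual implementations may differ):

```python
import numpy as np

def consistency_check(disp_left, disp_right, threshold=1):
    """Invalidate (set to -1) pixels where the disparity computed with
    the left image as reference disagrees, by more than `threshold`,
    with the disparity computed with the right image as reference."""
    h, w = disp_left.shape
    out = disp_left.copy()
    for y in range(h):
        for x in range(w):
            d = disp_left[y, x]
            xr = x - d  # corresponding pixel in the right-reference map
            if xr < 0 or xr >= w or abs(d - disp_right[y, xr]) > threshold:
                out[y, x] = -1
    return out
```

Invalidated pixels can then be filled in by the interpolation step mentioned above.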

The cost volume $$C(p,d)$$ for all values of $$p = (x, y)$$ and $$d$$ can be precomputed. In an implementation of the full algorithm with $$D$$ possible disparity shifts and $$R$$ directions, each pixel is subsequently visited $$R$$ times, therefore the computational complexity of the algorithm for an image of size $$W \times H$$ is $$O(WHD)$$, since the number of directions $$R$$ is a constant.

Memory efficient variant
The main drawback of SGM is its memory consumption. An implementation of the two-pass 8-direction version of the algorithm requires storing $$W \times H \times D + 3 \times W \times D + D$$ elements, since the accumulated cost volume has a size of $$W \times H \times D$$ and, to compute the cost for a pixel during each pass, it is necessary to keep track of the $$D$$ path costs of its left or right neighbour along one direction and of the $$W \times D$$ path costs of the pixels in the row above or below along the other three directions. One solution to reduce memory consumption is to compute SGM on partially overlapping image tiles, interpolating the values over the overlapping regions. This method also makes it possible to apply SGM to very large images that would not fit in memory in the first place.
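As a worked example with hypothetical figures (a 1920×1080 image, 128 disparity levels, 16-bit cost values; none of these numbers come from the text), the element count above translates to roughly half a gigabyte:

```python
# Hypothetical figures: 1920x1080 image, 128 disparity levels, 2-byte costs.
W, H, D = 1920, 1080, 128
elements = W * H * D + 3 * W * D + D  # element count from the formula above
bytes_needed = 2 * elements           # assuming 16-bit cost values
print(bytes_needed)                   # prints 532316416, i.e. ~0.5 GiB
```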

A memory-efficient approximation of SGM stores, for each pixel, only the costs for the disparity values that represent a minimum along some direction, instead of all possible disparity values. The true minimum is highly likely to be among the minima along the eight directions, thus yielding similar quality of results. The algorithm uses eight directions and three passes; during the first pass it stores for each pixel the cost of the optimal disparity along the four top-down directions, plus the two closest lower and higher cost values (for sub-pixel interpolation). Since the cost volume is stored in a sparse fashion, the four values of optimal disparity also need to be stored. In the second pass, the other four bottom-up directions are computed, completing the calculations for the four disparity values selected in the first pass, which have now been evaluated along all eight directions. An intermediate value of cost and disparity is computed from the output of the first pass and stored, and the memory for the four outputs of the first pass is reused for the four optimal disparity values and their costs from the directions of the second pass. A third pass goes again along the same directions used in the first pass, completing the calculations for the disparity values from the second pass. The final result is then selected among the four minima from the third pass and the intermediate result computed during the second pass.

In each pass four disparity values are stored, together with three cost values each (the minimum and its two closest neighbouring costs), plus the disparity and cost values of the intermediate result, for a total of eighteen values for each pixel, making the total memory consumption equal to $$18 \times W \times H + 3 \times W \times D + D$$, at the cost in time of an additional pass over the image.
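With hypothetical figures (a 1920×1080 image and 128 disparity levels, assumed for illustration only), the two element-count formulas can be compared directly:

```python
# Hypothetical figures: 1920x1080 image, 128 disparity levels.
W, H, D = 1920, 1080, 128
full = W * H * D + 3 * W * D + D      # full two-pass SGM
reduced = 18 * W * H + 3 * W * D + D  # memory-efficient variant
print(round(full / reduced, 1))       # prints 7.0, i.e. ~7x less memory
```

The saving grows with $$D$$, since the variant's per-pixel storage is a constant 18 values regardless of the disparity range.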