User:Spectral Decomposition/sandbox

Multidimensional digital signal processing (mD DSP) is fundamental to many application areas such as digital image and video processing, medical imaging, geophysical signal analysis, sonar, radar, lidar, array processing, computer vision, computational photography, and augmented and virtual reality. However, as the number of dimensions of a signal increases, the computational complexity of operating on the signal increases rapidly. This relationship between the number of dimensions and the time and space complexity, as studied in the field of algorithm analysis, is analogous to the concept of the curse of dimensionality. This large complexity generally results in very long execution times for a given mD DSP application, rendering it impractical for many uses, especially real-time applications.

Motivation, Problem Statement, Basic Concepts, and Applications
Due to the end of frequency scaling of processors, which is largely attributed to the breakdown of Dennard scaling around the year 2005, processor manufacturers have instead continued to exploit Moore's law by increasing the number of processing cores on a single chip, creating what are termed manycore processors as opposed to uniprocessors.

The challenging aspect of implementing mD DSP algorithms is that the large complexity described in the previous section makes implementation difficult, primarily with respect to run-time and power consumption. This article primarily addresses basic parallel concepts used to reduce the run-time of these algorithms. The concept of parallel computing can be applied to mD DSP applications: if a problem can be expressed in a parallel algorithmic form, then the methodologies of parallel programming and multiprocessing can be used in an attempt to increase the computational throughput of the mD DSP procedure on a given hardware platform. An increase in computational throughput can result in decreased run-time, i.e. a speedup of a specific mD DSP algorithm.

Increasing throughput can be beneficial to strong scaling of a given mD DSP algorithm.

In addition to increasing computational throughput, a generally equally important goal is to maximally utilize the memory bandwidth of a given computing memory architecture. The balance between computational throughput and memory bandwidth usage is captured by the concept of operational intensity, which is summarized by the roofline model. Operational intensity and the roofline model have recently become popular methods of quantifying the performance of mD DSP algorithms.
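As a concrete illustration, the roofline model bounds attainable performance by the minimum of peak compute throughput and operational intensity times peak memory bandwidth. The sketch below uses hypothetical machine numbers chosen only for illustration:

```python
# Roofline sketch: attainable performance = min(peak FLOP/s, OI * peak bandwidth).
# The machine numbers below are hypothetical, chosen only for illustration.

PEAK_FLOPS = 500e9      # hypothetical peak compute: 500 GFLOP/s
PEAK_BW = 50e9          # hypothetical peak memory bandwidth: 50 GB/s

def operational_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def attainable(oi, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Roofline model: performance is capped either by compute or by bandwidth."""
    return min(peak_flops, oi * peak_bw)

# Example: a stencil doing 9 multiply-adds (18 FLOPs) per 8 bytes of traffic.
oi = operational_intensity(18, 8)   # 2.25 FLOP/byte
perf = attainable(oi)               # bandwidth-bound on this hypothetical machine
```

A kernel with low operational intensity sits under the sloped, bandwidth-limited part of the roofline; raising its operational intensity moves it toward the flat, compute-limited region.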

Another possible benefit of increasing operational intensity is improved weak scaling, which allows the mD DSP procedure to operate on larger data sets. This is important for application areas such as data mining and the training of deep neural networks using big data.

It should be noted that the goal of parallelizing an algorithm is not always to decrease its complexity in the traditional sense, because the term complexity in this context typically refers to the RAM abstract machine model, which is by definition serial. Parallel abstract machine models, such as PRAM, have been proposed to describe the complexity of parallel algorithms such as mD signal processing algorithms.

Another factor that is important to the performance of mD DSP algorithm implementations is energy consumption.

Parallel Implementations of Multidimensional Discrete Fourier Transforms
As a simple example of an mD DSP algorithm that is commonly decomposed into a parallel form, consider the parallelization of the discrete Fourier transform (DFT), which is generally implemented using a form of the fast Fourier transform (FFT). There are hundreds of available software libraries that offer optimized FFT algorithms, many of which offer parallelized versions of mD FFT algorithms, such as the parallel versions of FFTW.

The most straightforward method of parallelizing the DFT is to utilize the row-column decomposition method. The following derivation closely paraphrases the classical text Multidimensional Digital Signal Processing. Row-column decomposition can be applied to an arbitrary number of dimensions, but for illustrative purposes the 2D row-column DFT will be described first. The 2D DFT is defined as
 * $$X \left( k_1, k_2 \right) = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x \left(  n_1, n_2 \right) W_{N_2}^{n_2 k_2} W_{N_1}^{n_1 k_1}$$

where the term $$W_{N} \stackrel{\text{def}}{=}\ \exp \left( -j \frac{2\pi}{N} \right)$$ is commonly referred to as the twiddle factor of the DFT in the signal processing literature.
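The twiddle factor is periodic in its exponent, $$W_N^{k+N} = W_N^k$$, which is the key property FFT algorithms exploit. A quick numerical check (a minimal sketch using only the standard library):

```python
import cmath

def twiddle(N, k=1):
    """W_N^k = exp(-j * 2*pi*k / N), the DFT twiddle factor."""
    return cmath.exp(-1j * 2 * cmath.pi * k / N)

# Periodicity: W_N^(k+N) == W_N^k, and W_N^N == 1 (up to rounding error).
assert abs(twiddle(8, 8) - 1) < 1e-12
assert abs(twiddle(8, 3) - twiddle(8, 11)) < 1e-12
```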

Next, we rewrite the DFT equation in the form


 * $$X \left( k_1, k_2 \right) = \sum_{n_1=0}^{N_1-1} \left[ \sum_{n_2=0}^{N_2-1} \left( x \left(  n_1, n_2 \right) W_{N_2}^{n_2 k_2} \right) \right] W_{N_1}^{n_1 k_1}$$ where the quantity inside the brackets is a 2D sequence which we will denote as $$G \left( n_1, k_2 \right)$$.

We can then express the above equation as the pair of relations


 * $$G \left( n_1, k_2 \right) = \sum_{n_2=0}^{N_2-1} x \left( n_1, n_2 \right) W_{N_2}^{n_2 k_2} $$


 * $$X \left( k_1, k_2 \right) = \sum_{n_1=0}^{N_1-1} G \left( n_1, k_2 \right) W_{N_1}^{n_1 k_1} $$

Each column of $$ G $$ is the 1D DFT of the corresponding column of $$ x $$. Each row of $$ X $$ is the 1D DFT of the corresponding row of $$ G $$.

Expressing the 2D DFT in the above form shows that we can compute a 2D DFT by decomposing it into row and column DFTs. The DFT of each column of $$ x $$ is first computed, and the results are placed into an intermediate array. We then compute the DFT of each row of the intermediate array.
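The two-pass procedure above can be sketched in plain Python (a minimal illustration with direct, unoptimized 1D DFTs; the function names are our own). The result is verified against the double-sum definition:

```python
import cmath

def dft_1d(seq):
    """Direct 1D DFT of a list of (complex) samples."""
    N = len(seq)
    return [sum(seq[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def dft_2d_row_column(x):
    """2D DFT via row-column decomposition, following the pair of relations above."""
    N1, N2 = len(x), len(x[0])
    # First pass: G(n1, k2) = sum over n2 of x(n1, n2) * W_{N2}^{n2 k2}
    G = [dft_1d(x[n1]) for n1 in range(N1)]
    # Second pass: X(k1, k2) = sum over n1 of G(n1, k2) * W_{N1}^{n1 k1}
    cols = [dft_1d([G[n1][k2] for n1 in range(N1)]) for k2 in range(N2)]
    return [[cols[k2][k1] for k2 in range(N2)] for k1 in range(N1)]

def dft_2d_direct(x):
    """Direct evaluation of the double-sum 2D DFT definition, for comparison."""
    N1, N2 = len(x), len(x[0])
    return [[sum(x[n1][n2]
                 * cmath.exp(-2j * cmath.pi * n2 * k2 / N2)
                 * cmath.exp(-2j * cmath.pi * n1 * k1 / N1)
                 for n1 in range(N1) for n2 in range(N2))
             for k2 in range(N2)]
            for k1 in range(N1)]

x = [[1, 2], [3, 4]]
A, B = dft_2d_row_column(x), dft_2d_direct(x)
assert all(abs(A[k1][k2] - B[k1][k2]) < 1e-9 for k1 in range(2) for k2 in range(2))
```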

This row-column decomposition process can easily be extended to compute an mD DFT. First, the 1D DFT is computed with respect to one of the independent variables, say $$ n_M $$, for each value of the remaining variables.

Next, 1D DFTs are computed with respect to the variable $$ n_{M-1} $$ for all values of the $$ \left( M-1 \right) $$-tuple $$ \left( n_1, n_2,...,n_{M-2},k_M \right) $$. We continue in this fashion until all 1D DFTs have been evaluated with respect to all the spatial variables.

The row-column decomposition of the DFT is parallelized most simply by noting that the row and column computations are independent of each other and can therefore be performed on separate processors in parallel. The 1D DFT computations on each processor can then utilize the FFT algorithm for further optimization. One large advantage of this specific method of parallelizing an mD DFT is that each of the 1D FFTs being performed in parallel on separate processors can additionally be performed in a concurrent fashion on shared-memory multithreaded SIMD processors.
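As a sketch of this idea on a shared-memory machine, the independent 1D transforms of each pass can be mapped onto a pool of workers. Here Python threads stand in for separate processors, and a direct DFT stands in for an optimized FFT; the names are our own:

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def dft_1d(seq):
    """Direct 1D DFT (a real implementation would call an optimized FFT here)."""
    N = len(seq)
    return [sum(seq[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def dft_2d_parallel(x, workers=4):
    """Row-column 2D DFT with each pass's independent 1D DFTs mapped to a pool."""
    N1, N2 = len(x), len(x[0])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # First pass: the N1 transforms along n2 are independent of one another.
        G = list(pool.map(dft_1d, x))
        # Second pass: the N2 transforms along n1 are likewise independent.
        cols = list(pool.map(dft_1d,
                             ([G[n1][k2] for n1 in range(N1)] for k2 in range(N2))))
    return [[cols[k2][k1] for k2 in range(N2)] for k1 in range(N1)]

X = dft_2d_parallel([[1, 2], [3, 4]])
assert abs(X[0][0] - 10) < 1e-9   # DC term is the sum of all samples
```

Each `pool.map` call corresponds to one pass of the decomposition; the synchronization between the two passes is the only point where the workers must exchange data.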

A specific hardware platform to which this simultaneous parallel and concurrent DFT implementation technique is highly amenable is the GPU, because common GPUs have both a set of separate multithreaded SIMD processors (referred to as "streaming multiprocessors" in the CUDA programming model and "compute units" in the OpenCL language) and individual SIMD lanes (commonly referred to loosely as a "core", or more specifically as a CUDA "thread processor" or an OpenCL "processing element") within each multithreaded SIMD processor.

A disadvantage of this technique of applying a separate FFT on each shared-memory multiprocessor is the required interleaving of the data among the shared memory. One of the most popular libraries that utilizes this basic form of concurrent FFT computation is the shared-memory version of the FFTW library.

Implementations of Multidimensional Discrete Convolution via Shift Registers on an FPGA
Convolution on mD signals lends itself well to pipelining because each single-output convolution operation is independent of every other one. Due to this data independence between the convolution operations of the filter's impulse response with the signal, a new set of calculations may begin the instant the first convolution operation is finished. A common method of performing mD convolution in a raster-scan fashion (including dimensions greater than 2) on a traditional general-purpose CPU, or even a GPU, is to cache the set of data from each scan line of each independent dimension in the local cache. By utilizing the unique custom re-configurable architecture of a field-programmable gate array (FPGA), we can optimize this procedure dramatically by customizing the cache structure.

In the illustrative example from which this description is derived, the discussion is restricted to two-dimensional signals. In the example, we perform a set of convolution operations between a general 2D signal and a 3x3 filter kernel. As the sequence of convolution operations proceeds along each raster line, the filter kernel is slid across one dimension of the input signal and the data read from memory is cached. The first pass loads three new lines of data into the cache. In the original presentation, this procedure is expressed in OpenCL.
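Since the original OpenCL listing is not reproduced here, the following plain-Python sketch illustrates the same three-line caching idea. The names and structure are our own, and the window sum is computed as a correlation, which coincides with convolution for symmetric kernels:

```python
def convolve_raster(image, kernel):
    """3x3 windowed filtering (boundary rows/columns skipped) using a
    three-line cache; only one cache line is replaced per scan line."""
    H, W = len(image), len(image[0])
    out = []
    # Initial pass: load the first three lines into the cache.
    cache = [image[0], image[1], image[2]]
    for y in range(1, H - 1):
        row = []
        for x in range(1, W - 1):
            # Read the 3x3 window entirely from the cached lines.
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += cache[ky][x - 1 + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
        if y + 2 < H:
            # Advance to the next scan line: drop the oldest cached line,
            # keep the two newest, and read in a single new line.
            cache = [cache[1], cache[2], image[y + 2]]
    return out

img = [[1] * 5 for _ in range(5)]
box = [[1] * 3 for _ in range(3)]
assert convolve_raster(img, box) == [[9, 9, 9], [9, 9, 9], [9, 9, 9]]
```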

This caching technique is used to hide a memory access pattern with poor coalescing efficiency; with each successive loop iteration only a single cache line is updated. If we make the reasonable assumption of a performance point of one pixel per cycle, applying this caching technique to an FPGA results in a cache requirement of nine reads and one write per cycle. Utilizing this caching technique on an FPGA yields inefficient performance in terms of both power consumption and memory footprint, because a great deal of redundant data is read into the cache. With an FPGA we can customize the cache structure to achieve a much more efficient result.

The method proposed in the corresponding literature to alleviate this poor performance in an FPGA implementation is to customize the cache architecture through the re-configurable hardware resources of the FPGA. The important attribute to note here is that the FPGA creates hardware based on the code that we write, as opposed to writing code to run on a fixed architecture with a fixed instruction set.

A description of how to modify the implementation to optimize the cache architecture on an FPGA will now be given. Again, let's begin with an initial 2D signal and assume it is of size $$W \times W$$. We first remove all of the lines that are not in the neighborhood of the window, then linearize this restricted segment of the 2D signal into a 1D array via a simple row-major data layout. Under the assumption that we are not concerned with the boundary conditions of the convolution, we can discard any pixels that contribute only to boundary computations, which is common in many practical applications. Now, when we slide the window over each array element to perform the next computation, we have effectively created a shift register. In the original presentation, the shift-register implementation of this 2D filtering is likewise expressed in OpenCL.
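The shift-register behavior can be sketched in plain Python as follows. This is an illustrative stand-in for the FPGA implementation, not the original OpenCL listing; the names are our own, and windows that wrap around a row edge are simply discarded:

```python
from collections import deque

def convolve_shift_register(image, kernel):
    """3x3 filtering over the row-major linearized image using a shift
    register of length 2*W + 3; boundary (wrapping) windows are discarded."""
    H, W = len(image), len(image[0])
    flat = [p for row in image for p in row]       # row-major linearization
    reg = deque(maxlen=2 * W + 3)                  # the shift register
    out = []
    for i, pixel in enumerate(flat):
        reg.append(pixel)                          # one new pixel shifts in per cycle
        if len(reg) < reg.maxlen:
            continue                               # register still filling up
        # reg[0] is now the top-left pixel of a 3x3 window anchored at (y, x);
        # the nine window taps sit at fixed register offsets r*W + c.
        y, x = divmod(i - (2 * W + 2), W)
        if x <= W - 3:                             # skip windows that wrap a row edge
            out.append(sum(reg[r * W + c] * kernel[r][c]
                           for r in range(3) for c in range(3)))
    return out

img = [[1] * 4 for _ in range(4)]
box = [[1] * 3 for _ in range(3)]
assert convolve_shift_register(img, box) == [9, 9, 9, 9]   # the 2x2 valid outputs
```

Because the window taps are fixed offsets into the register, each cycle requires only one new read; this is the redundancy the customized FPGA cache eliminates.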

By performing the aforementioned steps, the goal is to manage data movement to match the FPGA's architectural strengths and thus achieve the highest performance. These strengths allow a custom implementation based on the work being done, as opposed to leveraging the fixed operations, fixed data paths, and fixed widths of a general-purpose processor.

Parallel Implementations of Multidimensional FIR Filter Structures
This section describes a method of implementing an mD digital finite impulse response (FIR) filter in a completely parallel realization. The proposed method achieves a completely parallel realization of a general FIR filter through the use of a combination of parallel sections consisting of cascaded 1D digital filters.

Consider the general desired ideal finite extent mD FIR digital filter in the complex $$z$$-domain, given as


 * $$H_d(z_1,z_2,...z_M) = \sum_{n_1=0}^{N_1} \sum_{n_2=0}^{N_2} \cdots \sum_{n_M=0}^{N_M} a(n_1,n_2,...n_M) z_1^{-n_1} z_2^{-n_2} \cdots z_M^{-n_M}$$

Placing the coefficients $$a(n_1,n_2,...n_M)$$ into an array $$C=[c(l_1,l_2,...,l_M)]$$ and performing some algebraic manipulation as described in the corresponding literature, we arrive at an expression that allows us to decompose the filter into a filter bank, given as


 * $$H_d(z_1,z_2,...z_M) \approx H(z_1,z_2,...z_M) = \sum_{i=1}^{r} E_i(z_1) F_i(z_2) \cdots G_i(z_M)$$

where
 * $$E_i(z_1) = \sum_{n_1=0}^{N_1}C_{i1}(n_1+1) z_1^{-n_1}$$


 * $$F_i(z_2) = \sum_{n_2=0}^{N_2}C_{i2}(n_2+1) z_2^{-n_2}$$


 * $$G_i(z_M) = \sum_{n_M=0}^{N_M}C_{iM}(n_M+1) z_M^{-n_M}$$

Therefore, the original mD digital filter $$H_d(z_1,z_2,...z_M)$$ is approximately decomposed into a parallel filter-bank realization composed of a set of separable parallel filters $$E_i(z_1), F_i(z_2), G_i(z_M)$$, such that $$H_d(z_1,z_2,...z_M) \approx H(z_1,z_2,...z_M)$$. This proposed parallel FIR filter realization is represented by the block diagram seen in Figure 1.
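The decomposition can be checked numerically in the 2D case: summing $$r$$ separable sections, each a 1D filter applied along $$n_1$$ cascaded with a 1D filter along $$n_2$$, reproduces direct 2D convolution with the kernel whose coefficients are $$\sum_i e_i(p)\,f_i(q)$$. A minimal sketch (function names are our own):

```python
def fir_1d(x, taps):
    """Causal 1D FIR convolution (full output, length len(x) + len(taps) - 1)."""
    y = [0.0] * (len(x) + len(taps) - 1)
    for n, xn in enumerate(x):
        for k, tk in enumerate(taps):
            y[n + k] += xn * tk
    return y

def separable_sections_2d(x, sections):
    """Sum of r separable sections: E_i along n1, then F_i along n2."""
    N1, N2 = len(x), len(x[0])
    P = max(len(e) for e, f in sections)
    Q = max(len(f) for e, f in sections)
    out = [[0.0] * (N2 + Q - 1) for _ in range(N1 + P - 1)]
    for e, f in sections:
        # Filter each column (n1 direction) with E_i ...
        cols = [fir_1d([x[n1][j] for n1 in range(N1)], e) for j in range(N2)]
        # ... then each resulting row (n2 direction) with F_i, and accumulate.
        for i in range(len(cols[0])):
            for j, v in enumerate(fir_1d([cols[jj][i] for jj in range(N2)], f)):
                out[i][j] += v
    return out

def conv2d_direct(x, K):
    """Direct 2D convolution with kernel K, for verification."""
    N1, N2, P, Q = len(x), len(x[0]), len(K), len(K[0])
    out = [[0.0] * (N2 + Q - 1) for _ in range(N1 + P - 1)]
    for n1 in range(N1):
        for n2 in range(N2):
            for p in range(P):
                for q in range(Q):
                    out[n1 + p][n2 + q] += x[n1][n2] * K[p][q]
    return out

# r = 2 sections whose outer products sum to the kernel [[1, 1], [1, -1]].
sections = [([1, 1], [1, 0]), ([1, -1], [0, 1])]
K = [[sum(e[p] * f[q] for e, f in sections) for q in range(2)] for p in range(2)]
x = [[1.0, 2.0], [3.0, 4.0]]
A, B = separable_sections_2d(x, sections), conv2d_direct(x, K)
assert all(abs(A[i][j] - B[i][j]) < 1e-9 for i in range(3) for j in range(3))
```

Each section in the loop is independent of the others, which is what allows the sections to be assigned to separate processors in the realization of Figure 1.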

The completely parallel realization seen in the Figure 1 block diagram can be implemented in hardware by noting that block diagrams, and their corresponding signal-flow graphs (SFGs), are a common method of graphically representing any DSP algorithm that can be expressed as a linear constant-coefficient difference equation. SFGs allow an easy transition from a difference equation to a hardware implementation by letting one visualize the difference equation in terms of digital logic components such as shift registers and basic ALU circuit elements such as adders and multipliers. For this specific parallel realization, one could place each parallel section on a separate processor, allowing each section to be implemented in a completely task-parallel fashion.

Using the fork–join model of task parallelism, a 'fork' may be applied at the first pickoff point in the figure, and the summing junction can be implemented during a synchronization with a 'join' operation. Implementing an mD FIR filter in this fashion also lends itself well to the MapReduce general programming model.
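Under the fork–join model, the fan-out to the parallel sections and the final summing junction map directly onto a fork and a join. A minimal sketch (illustrative names; Python threads stand in for the parallel processors):

```python
from concurrent.futures import ThreadPoolExecutor

def fir_1d(x, taps):
    """Causal 1D FIR convolution (full output)."""
    y = [0.0] * (len(x) + len(taps) - 1)
    for n, xn in enumerate(x):
        for k, tk in enumerate(taps):
            y[n + k] += xn * tk
    return y

def section(x, cascade):
    """One parallel branch: a cascade of 1D FIR filters applied in sequence."""
    for taps in cascade:
        x = fir_1d(x, taps)
    return x

def filterbank(x, cascades, workers=4):
    """Fork: run every section concurrently.  Join: sum the branch outputs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        branches = list(pool.map(lambda c: section(x, c), cascades))
    n = max(map(len, branches))
    return [sum(b[i] for b in branches if i < len(b)) for i in range(n)]

# Two single-filter branches; their summed response equals filtering by [2, 0].
y = filterbank([1, 2, 3], [[[1, 1]], [[1, -1]]])
assert y == [2, 4, 6, 0]
```

The `pool.map` call is the fork, and the final summation after the pool closes is the join; in a MapReduce formulation the sections are the map tasks and the summing junction is the reduce.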