User:Leakec/sandbox

The $$k$$-vector (also known as the one-dimensional $$k$$-vector) is a data structure used to organize the elements of a one-dimensional database. It underlies the $$k$$-vector range searching algorithm, which can be used for any application that involves range searching, such as star identification or isosurface identification for level set analysis. The algorithm can also be used for a handful of other applications, such as function inversion.

Historical background
The $$k$$-vector was originally developed and used in star identification algorithms, most notably the Pyramid star identification algorithm, and more recently in the Super k-ID algorithm. Star identification algorithms perform range searches on a large, one-dimensional database of interstellar angles. For the star identification and eventual attitude estimation to be useful, the range searches on the database of interstellar angles must be performed very quickly. Thus, the $$k$$-vector range searching technique was devised to search large, one-dimensional databases by performing only two calculations: one to retrieve the upper index and a second to retrieve the lower index of the elements in the sorted database that meet the search criteria. Hence, the number of operations needed to perform a range search is small and independent of the size of the database. Consequently, the $$k$$-vector is able to search large, one-dimensional databases very quickly.

Informal description
The k-vector is a vector of integers that organizes elements of the sorted database into bins. The image to the right gives an example of the k-vector for a sample database of ten elements. The ten elements, each represented by a star in the image, have been sorted and plotted in ascending order. The y-axis of the image gives the value of each of the elements. The x-axis of the image has two sets of numbers. The upper set is a counting index that begins at one and ends at ten. The lower set is the k-vector.

The elements of the k-vector give the number of elements in the sorted database below the associated bin. For example, the first element of the k-vector has a value of zero because there are zero elements below the first bin, and the second element of the k-vector has a value of two because there are two elements below the second bin. Organizing the database elements using the k-vector in this way allows range searches to be performed on the database using a number of operations that is independent of the database size. In other words, the order of the algorithm is independent of the database size. This feature of the $$k$$-vector does not exist in other one-dimensional search methods, such as the binary search tree, whose search cost scales as $$O(\log n)$$, where $$n$$ is the number of elements in the database.

Searching algorithm description
The searching algorithm is broken down into two steps: pre-processing and searching. The pre-processing step is only done once, and must be completed before the searching step. The pre-processing step assembles the k-vector for use in the searching step.

Pre-processing
During the pre-processing step the k-vector is assembled. In addition, an index that maps the sorted database to the original database is created and stored.

Let $$D$$ be a user-specified one-dimensional database with $$n$$ elements. Then, a sorted database, $$S$$, and an index, $$I$$, are created by sorting the database in ascending order such that:

$$S = D(I)$$

Then, a mapping function is created that will be used to define the location of the k-vector bins. Typically, this mapping function is a line, because a line is easily inverted. Let the mapping function have a slope, $$m$$, and intercept, $$q$$, defined by:

$$ m = \dfrac{D_{\max} - D_{\min} + 2\delta\varepsilon}{N - 1} $$

$$ q = D_{\min} - \delta\varepsilon $$

where $$D_{\max}$$ is the largest element in the database, $$D_{\min}$$ is the smallest element in the database, $$N$$ is the number of k-vector bins, and $$\delta\varepsilon = (n-1)\varepsilon$$. The symbol $$\varepsilon$$ is the relative machine precision ($$2.22\times 10^{-16}$$ for double precision numbers). The value of $$N$$, which the user chooses, is a trade-off between memory use and the performance of the searching algorithm. If $$N$$ is larger, the algorithm will take less time to perform a search, but will use more memory. If $$N$$ is smaller, the algorithm will take more time to search, but will use less memory. Generally, a good choice for the value of $$N$$ is the number of elements in the database, $$n$$.

The addition of $$\delta\varepsilon$$ into the slope and intercept equations ensures that the bottom bin will be below the smallest element in the database, and the top bin will be above the largest element in the database. This is necessary for all of the elements in the database to be searchable. Now the mapping function, $$z$$, is defined as:

$$z(i) = m(i-1)+q \quad \text{where} \quad i \in [1,2,...,N]$$
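As a quick check of the endpoints, substituting the slope and intercept defined above into the mapping function at the first and last bins gives:

$$z(1) = q = D_{\min} - \delta\varepsilon$$

$$z(N) = m(N-1) + q = (D_{\max} - D_{\min} + 2\delta\varepsilon) + (D_{\min} - \delta\varepsilon) = D_{\max} + \delta\varepsilon$$

so the bin boundaries are evenly spaced by $$m$$ and bracket every database element with a margin of $$\delta\varepsilon$$ on each side.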

Now that the mapping function is defined, the $$k$$-vector is built using:

$$K(i) = j \quad | \quad S(j)<z(i)<S(j+1) \quad \text{where} \quad i \in [2,3,...,N-1] \quad \text{and} \quad j \in [1,2,...,n-1] $$

$$K(1) = 0 \quad \text{and} \quad K(N) = n$$
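The pre-processing step can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: it assumes NumPy, uses zero-based arrays (so `z[i]` in the code corresponds to $$z(i+1)$$ in the notation above), and the function name `build_kvector` is hypothetical. `np.searchsorted` with `side="left"` is used to count the sorted elements that lie strictly below each bin boundary.

```python
import numpy as np

def build_kvector(D, N=None):
    """Sketch of k-vector pre-processing (hypothetical helper name).

    Returns the sorted database S, the sort index I (so that S = D[I]),
    the k-vector K, and the slope m and intercept q of the mapping line.
    """
    D = np.asarray(D, dtype=float)
    n = D.size
    if N is None:
        N = n                      # the text suggests N = n as a good default
    if N < 2:
        raise ValueError("at least two bins are needed to define the slope")

    I = np.argsort(D, kind="stable")   # index mapping sorted -> original
    S = D[I]                           # sorted database, S = D(I)

    d_eps = (n - 1) * np.finfo(float).eps      # delta-epsilon from the text
    m = (S[-1] - S[0] + 2.0 * d_eps) / (N - 1)
    q = S[0] - d_eps

    # z[i] = m*i + q for i = 0..N-1, i.e. z(i) = m(i-1) + q in one-based form
    z = m * np.arange(N) + q

    # K[i] = number of sorted elements strictly below z[i]
    K = np.searchsorted(S, z, side="left")
    K[0] = 0       # K(1) = 0
    K[-1] = n      # K(N) = n
    return S, I, K, m, q


S, I, K, m, q = build_kvector([0.9, 0.1, 0.4, 0.35, 0.6])
print(K.tolist())   # [0, 1, 3, 4, 5]
```

In this small example the bin boundaries fall near 0.1, 0.3, 0.5, 0.7 and 0.9, so one element lies below the second boundary, three below the third, and four below the fourth.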

Searching
Once the pre-processing step has been completed, the database can be searched. Let the user-defined search range be given by $$[y_a,y_b]$$. The steps of the searching algorithm that return the indices of the elements meeting the search criteria are:

$$k_a = K\Bigg(\left\lfloor\dfrac{y_a - q}{m}\right\rfloor + 1\Bigg) + 1$$

$$k_b = K\Bigg(\left\lceil\dfrac{y_b - q}{m}\right\rceil + 1\Bigg)$$

$$J = I(i) \quad \text{where} \quad i \in [k_a, ..., k_b]$$,

where $$J$$ are the indices of the elements in the original database, $$D$$, that meet the search criteria. Note, in the equations above the $$\lfloor \ \rfloor$$ symbol is used to denote the floor function, and $$\lceil \ \rceil$$ symbol is used to denote the ceiling function.
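The search step can be sketched in Python as follows. This is an illustrative sketch, not a definitive implementation: it assumes NumPy, the function name `kvector_search` is hypothetical, and the arrays are zero-based, so the positions into `K` are computed as `floor((y_a - q) / m)` and `ceil((y_b - q) / m)` and the result is taken as a half-open slice. Clamping the bin positions to the valid range is an added assumption so that search ranges extending beyond the database do not index outside `K`.

```python
import math
import numpy as np

def kvector_search(K, I, m, q, y_a, y_b):
    """Return original-database indices of candidates for [y_a, y_b].

    K, I, m, q are the k-vector, sort index, slope and intercept from
    pre-processing. The result may contain a few extraneous elements.
    """
    N = len(K)
    # Zero-based bin positions, clamped to [0, N-1] (added safeguard).
    j_b = min(max(math.floor((y_a - q) / m), 0), N - 1)
    j_t = min(max(math.ceil((y_b - q) / m), 0), N - 1)
    start, stop = K[j_b], K[j_t]    # half-open range in the sorted database
    return I[start:stop]


# Small example; the pre-processing is written inline to keep this runnable.
D = np.array([0.9, 0.1, 0.4, 0.35, 0.6])
n = N = D.size
I = np.argsort(D, kind="stable")
S = D[I]
d_eps = (n - 1) * np.finfo(float).eps
m = (S[-1] - S[0] + 2.0 * d_eps) / (N - 1)
q = S[0] - d_eps
K = np.searchsorted(S, m * np.arange(N) + q, side="left")
K[0], K[-1] = 0, n

J = kvector_search(K, I, m, q, 0.32, 0.65)
print(sorted(J.tolist()))   # [2, 3, 4] -> D values 0.4, 0.35, 0.6
```

Only two lookups into `K` are needed per query, regardless of how many elements the database holds.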

The search algorithm in the form given above may return elements outside the search range; however, those elements will be the database elements closest to the user-defined search range. For databases that are approximately linear when sorted and whose $$k$$-vector has $$N=n$$ bins, the approximate number of extraneous elements for any search range is $$\frac{n}{n-1}$$. Thus, for large databases that meet the aforementioned criteria, there will be approximately one extraneous element per search. If the extraneous elements cannot be tolerated, a simple linear search can be implemented as follows.

For $$k_a$$,

$$\text{while} \ (D(I(k_a)) < y_a)$$

$$ \quad \quad k_a = k_a + 1$$,

and for $$k_b$$,

$$\text{while} \ (D(I(k_b)) > y_b)$$

$$ \quad \quad k_b = k_b - 1$$.
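The two trimming loops can be sketched in Python as follows. This is a hedged illustration: the function name `trim_range` is hypothetical, the sorted database `S` is indexed directly (equivalent to $$D(I(k))$$), and the candidate range is expressed as a zero-based half-open interval. The bounds checks in the loop conditions are an added assumption so that the loops stop when the candidate range becomes empty.

```python
def trim_range(S, start, stop, y_a, y_b):
    """Remove extraneous candidates from the sorted slice S[start:stop].

    Returns a (start, stop) pair such that every element of
    S[start:stop] lies inside [y_a, y_b].
    """
    # Advance the lower end past elements below the search range.
    while start < stop and S[start] < y_a:
        start += 1
    # Pull the upper end back past elements above the search range.
    while stop > start and S[stop - 1] > y_b:
        stop -= 1
    return start, stop


S = [0.1, 0.35, 0.4, 0.6, 0.9]
print(trim_range(S, 1, 3, 0.36, 0.45))   # (2, 3) -> only 0.4 remains
```

Because the extraneous elements are always adjacent to the ends of the candidate range, each loop is expected to run only a small number of iterations.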