Compressed cover tree

The compressed cover tree is a data structure designed to speed up k-nearest neighbor search in finite metric spaces. It is a simplified version of the explicit representation of a cover tree, motivated by past issues in the proofs of time complexity results for cover trees. The compressed cover tree was designed to achieve the time complexities claimed for cover trees in a mathematically rigorous way.

Problem statement
In the modern formulation, the k-nearest neighbor problem is to find the $$ k\geq 1 $$ nearest neighbors in a given reference set R for every point of another given query set Q. Both sets belong to a common ambient space X with a distance metric d satisfying all metric axioms.
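As a point of comparison, the problem statement can be solved directly by scanning R for every query point. The sketch below is only this naive baseline, not the compressed cover tree algorithm; the function name `brute_force_knn`, the Euclidean metric and the sample points are illustrative assumptions.

```python
import heapq
import math


def brute_force_knn(Q, R, k, d):
    """For each q in Q, return the k points of R nearest to q, closest first."""
    if not 1 <= k <= len(R):
        raise ValueError("k must satisfy 1 <= k <= |R|")
    # One full scan of R per query point: O(|Q| * |R| * log k) distance work.
    return {q: heapq.nsmallest(k, R, key=lambda p: d(q, p)) for q in Q}


def euclidean(p, q):
    # Assumed metric for the example; any function satisfying the
    # metric axioms can be passed instead.
    return math.dist(p, q)


R = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (9.0, 0.0)]  # reference set
Q = [(0.9, 0.0), (3.0, 0.0)]                          # query set
neighbors = brute_force_knn(Q, R, 2, euclidean)
```

Every query here costs a full pass over R, which is the cost that tree-based structures such as the compressed cover tree aim to reduce.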

Compressed cover tree
Let (R,d) be a finite metric space. A compressed cover tree $$\mathcal{T}(R)$$ has the vertex set R with a root $$r \in R $$ and a level function $$l:R \rightarrow \mathbb{Z} $$ satisfying the conditions below:


 * Root condition: the level of the root node r satisfies $$l(r) \geq 1 + \max\limits_{p \in R \setminus \{r\}}l(p)$$
 * Covering condition: every node $$ q \in R\setminus \{r\} $$ has a unique parent p and a level l(q) such that $$ d(q,p) \leq 2^{l(q)+1} $$ and $$ l(q) < l(p) $$. This parent node p has a single link to its child node q.
 * Separation condition: for every $$ i \in \mathbb{Z} $$, the cover set $$ C_i = \{p \in R \mid l(p) \geq i\} $$ satisfies $$ d_{\min}(C_i) = \min\limits_{p \in C_{i}}\min\limits_{q \in C_{i}\setminus \{p\}} d(p,q) > 2^{i} $$.
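The three conditions can be checked by brute force for a small example. The following is a minimal sketch, assuming the tree is given as two dictionaries, `parent` and `level`; these names, the function `is_compressed_cover_tree` and the sample points on the real line are hypothetical and do not come from any particular implementation.

```python
from itertools import combinations


def is_compressed_cover_tree(points, root, parent, level, d):
    """Check the root, covering and separation conditions by brute force."""
    others = [p for p in points if p != root]

    # Root condition: l(r) >= 1 + max level of all other points.
    if others and level[root] < 1 + max(level[p] for p in others):
        return False

    # Covering condition: each non-root q has a parent p on a strictly
    # higher level with d(q, p) <= 2^(l(q) + 1).
    for q in others:
        p = parent[q]
        if level[q] >= level[p] or d(q, p) > 2.0 ** (level[q] + 1):
            return False

    # Separation condition: p and q both lie in C_i exactly when
    # i <= min(l(p), l(q)), so the strictest test is i = min(l(p), l(q)).
    for p, q in combinations(points, 2):
        if d(p, q) <= 2.0 ** min(level[p], level[q]):
            return False
    return True


def line_metric(a, b):
    # Assumed metric for the example: absolute difference on the real line.
    return abs(a - b)


points = [0, 8, 4, 2]
parent = {8: 0, 4: 0, 2: 0}
good_level = {0: 3, 8: 2, 4: 1, 2: 0}
# Raising point 2 to level 1 breaks separation: d(0, 2) = 2 is not > 2^1.
bad_level = {0: 3, 8: 2, 4: 1, 2: 1}

valid = is_compressed_cover_tree(points, 0, parent, good_level, line_metric)
invalid = is_compressed_cover_tree(points, 0, parent, bad_level, line_metric)
```

In this example `valid` is true and `invalid` is false. The check is quadratic in |R| and is meant only to illustrate the definition, not to be used inside a search routine.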

Expansion constants
In a metric space, let $$ \bar B(p,t) $$ be the closed ball with a center p and a radius $$ t\geq 0 $$. The notation $$|\bar B(p,t)|$$ denotes the number (if finite) of points in the closed ball.

The expansion constant $$ c(R) $$ is the smallest real number $$ c(R)\geq 2 $$ such that $$|\bar{B}(p,2t)|\leq c(R) \cdot |\bar{B}(p,t)| $$ for any point $$ p\in R $$ and any radius $$ t\geq 0 $$.
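For a finite set R, the ratio $$ |\bar{B}(p,2t)| / |\bar{B}(p,t)| $$ is a step function of t that changes only at the radii $$ t = d(p,q) $$ and $$ t = d(p,q)/2 $$, so the expansion constant can be computed by testing those radii. The sketch below does this by brute force in cubic time; the function name `expansion_constant` and the sample points are illustrative assumptions.

```python
def expansion_constant(R, d):
    """Smallest c >= 2 with |B(p, 2t)| <= c * |B(p, t)| for all p in R, t >= 0."""
    best = 2.0  # the definition requires c(R) >= 2
    for p in R:
        dists = [d(p, q) for q in R]
        # Ball sizes only change at t = d(p, q) and t = d(p, q) / 2,
        # so these radii realise every value the ratio can take.
        radii = set(dists) | {x / 2 for x in dists}
        for t in radii:
            small = sum(1 for x in dists if x <= t)      # |B(p, t)|, at least 1
            big = sum(1 for x in dists if x <= 2 * t)    # |B(p, 2t)|
            best = max(best, big / small)
    return best


# Four equally spaced points on the real line with d(a, b) = |a - b|.
c = expansion_constant([0, 1, 2, 3], lambda a, b: abs(a - b))
```

Here `c` equals 3.0: the ball of radius 0.5 around the point 1 contains only that point, while the ball of radius 1 contains three points.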

The minimized expansion constant $$c_m $$ is a discrete analog of the doubling dimension (see Navigating nets): $$ c_m(R) = \lim\limits_{\xi \rightarrow 0^{+}}\inf\limits_{R\subseteq A\subseteq X}\sup\limits_{p \in A,t > \xi}\dfrac{|\bar{B}(p,2t) \cap A|}{|\bar{B}(p,t) \cap A|} $$, where A ranges over locally finite sets that cover R.

Note that $$ c_m(R) \leq c(R) $$ for any finite metric space (R,d).

Aspect ratio
For any finite set R with a metric d, the diameter is $$ \mathrm{diam}(R) = \max_{p \in R}\max_{q \in R}d(p,q) $$. The aspect ratio is $$ \Delta(R) = \dfrac{\mathrm{diam}(R)}{d_{\min}(R)} $$, where $$ d_{\min}(R) $$ is the shortest distance between points of R.
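Both quantities in the aspect ratio come from the pairwise distances, so a direct computation is short. This is a minimal sketch assuming R contains at least two distinct points; the function name `aspect_ratio` and the sample points are illustrative assumptions.

```python
from itertools import combinations


def aspect_ratio(R, d):
    """Return diam(R) / d_min(R) for a finite set R of distinct points."""
    pair_dists = [d(p, q) for p, q in combinations(R, 2)]
    if not pair_dists or min(pair_dists) == 0:
        raise ValueError("R must contain at least two distinct points")
    # diam(R) is the largest pairwise distance, d_min(R) the smallest.
    return max(pair_dists) / min(pair_dists)


# Points on the real line with d(a, b) = |a - b|: diam = 10, d_min = 1.
delta = aspect_ratio([0, 1, 4, 10], lambda a, b: abs(a - b))
```

For these points `delta` is 10.0, so a bound involving $$ \log_2 \Delta(R) $$ would contribute a factor of about 3.3.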

Insert
Although cover trees provide faster searches than the naive approach, this advantage must be weighed against the additional cost of maintaining the data structure. In a naive approach, adding a new point to the dataset is trivial because no order needs to be preserved. In a compressed cover tree, the time to insert a point is bounded as follows:
 * using expansion constant: $$ O(c(R)^{10} \cdot \log|R|) $$.
 * using minimized expansion constant / doubling dimension: $$ O(c_m(R)^{8} \cdot \log\Delta(R)) $$.

K-nearest neighbor search
Let Q and R be finite subsets of a metric space (X,d). Once all points of R are inserted into a compressed cover tree $$\mathcal{T}(R) $$, the tree can be used to answer find-queries for the points of the query set Q. The following time complexities have been proven for finding the k nearest neighbors of a query point $$ q \in Q $$ in the reference set R:
 * using expansion constant: $$ O\Big ( c(R \cup \{q\})^2 \cdot \log_2(k) \cdot \big((c_m(R))^{10} \cdot \log_2(|R|) + c(R \cup \{q\}) \cdot k\big) \Big) $$.
 * using minimized expansion constant / doubling dimension: $$ O\Big ((c_m(R))^{10} \cdot \log_2(k) \cdot \log_2(\Delta(R)) + |\bar{B}(q, 5d_k(q,R))| \cdot \log_2(k) \Big ) $$, where $$ |\bar{B}(q, 5d_k(q,R))| $$ is the number of points inside the closed ball around q whose radius is 5 times the distance from q to its k-th nearest neighbor.

Space
The compressed cover tree constructed on a finite metric space R requires O(|R|) space, both during construction and during execution of the Find algorithm.

Using doubling dimension as hidden factor
Tables below show time complexity estimates which use the minimized expansion constant $$ c_m(R) $$ or the dimensionality constant $$2^{\text{dim}}$$ related to the doubling dimension. Note that $$ \Delta $$ denotes the aspect ratio.


 * Results for building data structures

Results for exact k-nearest neighbors of one query point $$q \in Q$$ in reference set R assuming that all data structures are already built. Below we denote the distance between a query point q and the reference set R as $$ d(q,R) $$ and distance from a query point q to its k-nearest neighbor in set R as $$ d_k(q,R) $$:

Using expansion constant as hidden factor
Tables below show time complexity estimates which use the expansion constant $$c(R)$$ or the KR-type constant $$2^{\text{dim}_{KR}}$$ as a hidden factor. Note that the dimensionality factor $$2^{\text{dim}_{KR}}$$ is equivalent to $$c(R)^{O(1)}$$.


 * Results for building data structures

Results for exact k-nearest neighbors of one query point $$q \in X$$ assuming that all data structures are already built.