
In machine learning, adaptive concentration is a framework introduced by Stefan Wager and Guenther Walther for analyzing the convergence of random forests. The basic idea is to separate random forest training into a model selection stage and a model fitting stage. As the authors note, this treatment is conceptually motivated by the valid post-selection inference framework.

History
Since their introduction by Leo Breiman in 2001, random forests have proven difficult to analyze theoretically. Successes include analyses of simplified models and connections drawn between random forests and other statistical methods, such as the $$k$$-nearest neighbors algorithm and kernel random forests. In 2014, Scornet, Biau, and Vert proved an asymptotic consistency result for random forests under the assumption of an additive regression model.

Introduced in 2015, the adaptive concentration framework considers a model selection stage and a model fitting stage within random forest training. Adaptive concentration analyzes the convergence of fitted forests to the optimal forest arising from a particular choice of model, bounding the rate of this convergence uniformly over all valid models.

Background and Notation
Consider training data with $$d$$-dimensional features $$\{x^{(1)}, x^{(2)}, \dots, x^{(n)}\} \subset [0, 1]^d$$ sampled from a random variable $$X$$, together with corresponding responses $$\{y^{(1)}, y^{(2)}, \dots, y^{(n)}\} \subset [-{M}/{2}, {M}/{2}]$$ sampled from a response variable $$Y$$ bounded by some $$M > 0$$. Because each tree in the random forest algorithm arises from a recursive splitting of the feature space, each tree gives rise to a partition of $$[0, 1]^d$$.
 * A partition $$V$$ of $$[0, 1]^d$$ is valid if it comes from such a recursive splitting.
 * For $$\alpha \in (0, {1}/{2})$$ and $$k \in \mathbb{N}$$, a valid partition $$V$$ is $$\{\alpha, k\}$$-valid if
 * 1) each side of every split contains at least a fraction $$\alpha$$ of the data points available at the parent node, and
 * 2) each block of the partition $$V$$ contains at least $$k$$ data points.


 * For $$x \in [0, 1]^d$$, let $$V(x)$$ denote the block of $$V$$ which contains $$x$$, and let $$\#V(x)$$ denote the number of data points in this block.


 * To a collection $$\mathcal{F}$$ of $$m$$ valid partitions of $$[0, 1]^d$$, we associate a forest $$ H: [0, 1]^d \longrightarrow [-{M}/{2}, {M}/{2}] $$ defined by
 * $$H(x) = \frac{1}{m} \sum_{V \in \mathcal{F}}  \frac{1}{\# V(x)} \sum_{x^{(i)} \in V(x)} y^{(i)} $$ for $$x \in [0, 1]^d$$.


 * We denote by $$\mathcal{H}_{\alpha, k}$$ the set of all forests arising from $$\{\alpha, k\}$$-valid partitions.
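
The forest estimator $$H$$ defined above admits a direct computational description. The sketch below is not taken from Wager and Walther's paper; it assumes that a partition is represented as a list of axis-aligned blocks, each given by its lower and upper corner coordinates, and that the training data are stored in NumPy arrays. The function names are hypothetical.

 import numpy as np
 
 def block_containing(partition, x):
     # Return the index of the block of the partition that contains the point x.
     # A partition is assumed to be a list of (lower, upper) corner pairs of
     # axis-aligned boxes covering [0, 1]^d, as produced by a recursive
     # splitting of the feature space.
     for j, (lower, upper) in enumerate(partition):
         if np.all(x >= lower) and np.all(x <= upper):
             return j
     raise ValueError("x is not covered by the partition")
 
 def forest_prediction(partitions, X_train, y_train, x):
     # Forest estimate H(x): the average, over the m partitions in the
     # collection F, of the mean response of the training points that fall
     # in the block V(x) containing x.  For an {alpha, k}-valid partition,
     # each block contains at least k training points, so the mean exists.
     block_means = []
     for partition in partitions:
         lower, upper = partition[block_containing(partition, x)]
         in_block = np.all((X_train >= lower) & (X_train <= upper), axis=1)
         block_means.append(y_train[in_block].mean())
     return np.mean(block_means)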

The Adaptive Concentration Framework
The adaptive concentration framework considers random forest fitting to occur in two stages:


 * 1) Model Selection: Choosing a collection $$\mathcal{F}$$ of $$m$$ valid partitions of $$[0, 1]^d$$.
 * 2) Model Fitting: Approximating the partition-optimal forest
 * $$H^*(x) = \frac{1}{m} \sum_{V \in \mathcal{F}} \mathbb{E}[Y \mid X \in V(x)]$$
by the forest
 * $$H(x) = \frac{1}{m} \sum_{V \in \mathcal{F}} \frac{1}{\# V(x)} \sum_{x^{(i)} \in V(x)} y^{(i)}$$

In this framework, one then analyzes how well $$H$$ approximates $$H^*$$; as detailed below, for fixed $$\alpha$$ and $$k$$ this can be done uniformly over all forests in $$\mathcal{H}_{\alpha, k}$$.
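
As a purely illustrative companion to the two stages above, the partition-optimal forest $$H^*$$ can be approximated by Monte Carlo whenever the distribution of $$(X, Y)$$ is known, for instance when $$X$$ is uniform on $$[0, 1]^d$$ and $$Y = \mu(X) + \varepsilon$$ with centered noise, so that $$\mathbb{E}[Y \mid X \in B] = \mathbb{E}[\mu(X) \mid X \in B]$$ for every block $$B$$. The sketch below reuses the hypothetical helper block_containing from the previous section and assumes that mu maps an array of feature vectors to an array of regression values.

 import numpy as np
 
 def optimal_forest_prediction(partitions, x, mu, d, n_mc=200000, seed=0):
     # Monte Carlo estimate of the partition-optimal forest H*(x) when X is
     # uniform on [0, 1]^d and Y = mu(X) + centered noise, so that
     # E[Y | X in B] reduces to E[mu(X) | X in B] for any block B.
     rng = np.random.default_rng(seed)
     X_mc = rng.uniform(size=(n_mc, d))  # fresh draws of X, independent of the training data
     values = []
     for partition in partitions:
         lower, upper = partition[block_containing(partition, x)]
         in_block = np.all((X_mc >= lower) & (X_mc <= upper), axis=1)
         values.append(mu(X_mc[in_block]).mean())  # approximates E[Y | X in V(x)]
     return np.mean(values)

The gap $$|H(x) - H^*(x)|$$ between the outputs of this function and of the forest estimate above is the quantity that the theorem in the next section bounds uniformly over $$\mathcal{H}_{\alpha, k}$$.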

Uniform Convergence of Random Forests
By treating random forests in the framework of adaptive concentration, Wager and Walther prove the following uniform convergence result:

Assume:
 * 1) The features are uniformly distributed on $$[0, 1]^d$$,
 * 2) $$\lim_{n, k \to \infty} \frac{(\log n)^2}{k} = 0$$, and
 * 3) $$\log(d) = \Theta(\log(n))$$, where $$\Theta(\cdot)$$ is the Bachmann–Landau big-$$\Theta$$ notation.

Then


 * $$ \lim_{n, d, k \to \infty} \mathbb{P}\left[\sup_{x \in [0, 1]^d, H \in \mathcal{H}_{\alpha, k}} |H(x) - H^*(x)| \le 6f(n, d, k) \right] = 1 $$

where


 * $$f(n, d, k) = M\sqrt{\frac{\log(n)\log(d)}{k\log(1/(1-\alpha))}}$$.
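
To see how the bound behaves numerically, $$f(n, d, k)$$ can simply be evaluated; the snippet below uses arbitrarily chosen values of $$n$$, $$d$$, $$\alpha$$, and $$M$$ (illustrative assumptions, not values from the paper) and shows the bound shrinking as $$k$$ grows faster than $$\log(n)\log(d)$$.

 import numpy as np
 
 def convergence_rate(n, d, k, alpha, M):
     # The rate f(n, d, k) from the theorem; 6 * f(n, d, k) bounds
     # sup |H(x) - H*(x)| with probability tending to one.
     return M * np.sqrt(np.log(n) * np.log(d) / (k * np.log(1.0 / (1.0 - alpha))))
 
 # Illustrative values only: n = 10^6, d = 20, alpha = 0.1, M = 1.
 for k in (10**3, 10**4, 10**5):
     print(k, 6 * convergence_rate(n=10**6, d=20, k=k, alpha=0.1, M=1.0))

For moderate $$k$$ the resulting bound can exceed the range $$M$$ of the responses; as with many concentration results, the content of the theorem lies in the scaling with $$n$$, $$d$$, $$k$$, and $$\alpha$$ rather than in the constant.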

Remark
This theorem assumes that the training data are not bagged, but Wager and Walther consider an analysis that allows for bagging to be "a promising avenue for future work."