Bootstrapping populations

The method starts from an observed sample and infers the distribution law of the parameters of the populations compatible with that sample.

Method
Given a sample $$\boldsymbol x=\{x_1,\ldots,x_m\}$$ of a random variable X and a sampling mechanism $$(g_{\boldsymbol\theta},Z)$$ for X, we have $$\boldsymbol x=\{g_{\boldsymbol\theta}(z_1),\ldots,g_{\boldsymbol\theta}(z_m)\}$$, with $$\boldsymbol\theta=(\theta_1,\ldots,\theta_k)$$. Focusing on well-behaved statistics $$s_j=h_j(x_1,\ldots,x_m)$$, $$j=1,\ldots,k$$,

for their parameters, the master equations read

 * $$s_j=h_j(g_{\boldsymbol\theta}(z_1),\ldots,g_{\boldsymbol\theta}(z_m)),\quad j=1,\ldots,k. \qquad (1)$$

For each sample seed $$\{z_1,\ldots,z_m\}$$ you obtain a vector of parameters $$\boldsymbol\theta$$ from the solution of the above system with the $$s_i$$ fixed to the observed values. Having computed a large set of compatible vectors, say N of them, you obtain the empirical marginal distribution of $$\Theta_j$$ by:

 * $$\hat F_{\Theta_j}(\theta)=\frac{1}{N}\sum_{i=1}^{N}I_{(-\infty,\theta]}(\breve\theta_{j,i}), \qquad (2)$$

denoting by $$\breve\theta_{j,i}$$ the j-th component of the generic solution of (1) and by $$I_{(-\infty,\theta]}(\breve\theta_{j,i})$$ the indicator function of $$\breve\theta_{j,i}$$ in the interval $$(-\infty,\theta]$$. Some indeterminacies remain when X is discrete; we will consider these shortly. The whole procedure may be summed up in the form of the following Algorithm, where the index $$\boldsymbol\Theta$$ of $$\boldsymbol s_{\boldsymbol\Theta}$$ denotes the parameter vector to which the statistics vector refers.

Algorithm
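A minimal Python sketch of the procedure, with illustrative names (a generic routine `inv` plays the role of $$\text{Inv}(\boldsymbol s_{\boldsymbol\Theta},\boldsymbol u_i)$$ used below), could look like this:

```python
import random

def bootstrap_population(s_obs, inv, m, N=10000, rng=None):
    """Draw N parameter vectors compatible with the observed statistic s_obs.

    inv(s_obs, u) solves the master equations (1) for the parameters,
    given the observed statistic and a seed vector u of length m.
    """
    rng = rng or random.Random()
    population = []
    for _ in range(N):
        u = [rng.random() for _ in range(m)]  # uniform seeds in [0, 1)
        population.append(inv(s_obs, u))      # candidate parameter
    return population

def empirical_cdf(population, theta):
    # Equation (2): fraction of candidate parameters not exceeding theta
    return sum(1 for t in population if t <= theta) / len(population)
```

The accuracy of the resulting distribution is governed by N, the number of seed draws, not by the sample size m, in line with the Remark below.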
You may easily see from the Table of sufficient statistics that we obtain the curve in the picture on the left by computing the empirical distribution (2) on the population obtained through the above algorithm when: i) X is an Exponential random variable, ii) $$ s_\Lambda= \sum_{j=1}^m x_j $$, and
 * $$\text{iii) Inv}(s_\Lambda,\boldsymbol u_i) =\sum_{j=1}^m(-\log u_{ij})/s_\Lambda$$,

and the curve in the picture on the right when: i) X is a Uniform random variable in $$[0,a] $$, ii) $$ s_A= \max_{j=1, \ldots, m} x_j $$, and
 * $$\text{iii) Inv}(s_A,\boldsymbol u_i) =s_A/\max_{j=1,\ldots,m}\{u_{ij}\}$$.
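As a sketch (assuming the statistics $$s_\Lambda$$ and $$s_A$$ have already been computed from the sample, and with illustrative function names), the two inversion formulas above might be coded as:

```python
import math
import random

def inv_exponential(s_lambda, u):
    # iii) Inv(s_Lambda, u_i) = sum_j (-log u_ij) / s_Lambda
    return sum(-math.log(uj) for uj in u) / s_lambda

def inv_uniform(s_a, u):
    # iii) Inv(s_A, u_i) = s_A / max_j u_ij
    return s_a / max(u)

def draw_population(inv, s_obs, m, N=10000, rng=None):
    # One candidate parameter per seed vector u_i = (u_i1, ..., u_im)
    rng = rng or random.Random()
    return [inv(s_obs, [rng.random() for _ in range(m)]) for _ in range(N)]
```

Sorting either population and plotting its empirical distribution (2) reproduces the curves described above.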

Remark
Note that the accuracy with which we obtain the distribution law of the parameters of populations compatible with a sample is not a function of the sample size. Instead, it is a function of the number of seeds we draw. This number is purely a matter of computational time and does not require any extension of the observed data. With other bootstrapping methods that focus on generating sample replicas, the accuracy of the estimated distributions does depend on the sample size.

Example
For $$\boldsymbol x$$ expected to represent a Pareto distribution, whose specification requires values for the parameters $$a$$ and $$k$$, the cumulative distribution function reads:


 * $$F_X(x)=1-\left(\frac{k}{x}\right)^a \quad\text{for }x\ge k$$.

A sampling mechanism $$(g_{(a,k)}, U)$$ has a uniform seed $$U$$ on $$[0,1]$$ and an explaining function $$g_{(a,k)}$$ described by:


 * $$x= g_{(a,k)}(u)=(1 - u)^{-\frac{1}{a}}\, k$$

A relevant statistic $$\boldsymbol s_{\boldsymbol\Theta}$$ is constituted by the pair of joint sufficient statistics for $$A$$ and $$K$$, respectively $$s_1=\sum_{i=1}^m \log x_i$$ and $$s_{2}=\min_{i=1,\ldots,m}\{x_i\}$$. The master equations read


 * $$s_1=\sum_{i=1}^m -\frac{1}{a}\log(1 - u_i)+m \log k$$


 * $$s_{2}=(1 - u_{\min})^{-\frac{1}{a}} k$$

with $$u_{\min}=\min\{u_i\}$$.
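Substituting the second master equation into the first gives the closed-form solution $$\breve a=\big(m\log(1-u_{\min})-\sum_{i=1}^m\log(1-u_i)\big)\big/\big(s_1-m\log s_2\big)$$ and then $$\breve k=s_2\,(1-u_{\min})^{1/\breve a}$$. A minimal Python sketch of the corresponding population draw, with illustrative names, is:

```python
import math
import random

def inv_pareto(s1, s2, u):
    """Solve the two master equations for (a, k) given a seed vector u."""
    m = len(u)
    log1mu = [math.log(1.0 - uj) for uj in u]
    lmax = max(log1mu)                       # log(1 - u_min)
    # a from s1 after substituting k; k from s2 = (1 - u_min)^(-1/a) * k
    a = (m * lmax - sum(log1mu)) / (s1 - m * math.log(s2))
    k = s2 * math.exp(lmax / a)              # k = s2 * (1 - u_min)^(1/a)
    return a, k

def pareto_population(s1, s2, m, N=10000, rng=None):
    rng = rng or random.Random()
    return [inv_pareto(s1, s2, [rng.random() for _ in range(m)])
            for _ in range(N)]
```

Each candidate pair satisfies $$\breve a>0$$ and $$0<\breve k\le s_2$$, since every seed contributes $$\log\big((1-u_{\min})/(1-u_i)\big)\ge 0$$ to the numerator of $$\breve a$$.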

The figure on the right reports the three-dimensional plot of the empirical cumulative distribution function (2) of $$(A,K)$$.