User:Pugget/pstest

Peer sampling is the process of selecting a random peer from a peer-to-peer (P2P) network. A peer sampling process consists of one or more distributed algorithms responsible for selecting a peer randomly, and, in some algorithms, a cache of sampled peers to reduce latency. Peer sampling is used to support more complex P2P algorithms and services, including topology maintenance, search and replication, epidemic gossip protocols, and network monitoring.

Overview
The goal in peer sampling is to return a reference to a single random peer from the set of all peers participating in the sampled P2P network. A returned peer reference contains the information necessary to contact the peer, such as the IP address and TCP port number. In cases where the sampling process is interested in peer properties rather than peer references, the value of a given peer property (e.g., bandwidth usage) may be returned directly instead of a peer reference.

We can imagine a discrete probability distribution $$P$$ over the peers in the network defining the distribution from which we desire to randomly sample. Many peer sampling applications assume that $$P$$ is uniform, $$P(X=v) = 1/|\mathbf{V}|$$, where $$\mathbf{V}$$ is the set of all peers, and $$v$$ is a single peer. Other applications desire more complex distributions based on peer popularity or other metrics. It is the goal of the sampling algorithm to sample from the desired distribution with as little error and systematic bias as possible.
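The distinction between a uniform target distribution and a weighted one can be made concrete with a small sketch. The peer names and popularity scores below are purely illustrative:

```python
# Illustrative peer set; in practice these would be peer references.
peers = ["A", "B", "C", "D"]

# Uniform target distribution: every peer has probability 1/|V|.
P_uniform = {v: 1.0 / len(peers) for v in peers}

# A non-uniform alternative, weighting peers by a (hypothetical) popularity score.
popularity = {"A": 4, "B": 2, "C": 1, "D": 1}
total = sum(popularity.values())
P_popularity = {v: popularity[v] / total for v in peers}

print(P_uniform["A"])     # 0.25
print(P_popularity["A"])  # 0.5
```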

Algorithms
Peer sampling algorithms provide a simple interface consisting of a single method, getPeer, which returns a single random (see Properties) peer reference from the entire P2P network. The reference contains the necessary information to contact the peer. Two main classes of algorithms are used to select which peer to return: random walks and gossip.
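The interface can be sketched as follows. This is a minimal illustration, not a real implementation: the class and field names are assumptions, and the sampler draws from a local list rather than running a random walk or gossip round:

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class PeerReference:
    """Contact information for a peer (field names are illustrative)."""
    ip: str
    port: int

class PeerSampler:
    """Minimal peer-sampling interface: getPeer returns one random peer reference."""
    def __init__(self, known_peers):
        self._peers = list(known_peers)

    def getPeer(self) -> PeerReference:
        # A real implementation would run a random walk or gossip exchange;
        # here we draw uniformly from a local list for illustration only.
        return random.choice(self._peers)

sampler = PeerSampler([PeerReference("10.0.0.1", 9000),
                       PeerReference("10.0.0.2", 9000)])
peer = sampler.getPeer()
print(peer.ip, peer.port)
```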

Random Walks
As a random walk progresses, it becomes statistically decorrelated from its starting location. Sampling algorithms exploit this property by running random walks of sufficient length that the last peer visited in the walk is decorrelated from the starting peer.

The sample distribution generated by unbiased random walks, in which all possible next steps are weighted equally, is typically not desirable. This is because the underlying stationary distribution of the graph representing the P2P topology depends on vertex degree. For a P2P network with bidirectional links, the underlying topology graph $$G = (\mathbf{V},\mathbf{E})$$ is undirected, and each peer $$v$$ has probability $$p(v) = d(v)/(2|\mathbf{E}|)$$ of being sampled, where $$d(v)$$ is the degree of $$v$$. Unless the desired sample distribution $$P$$ takes the same form, a sampling algorithm must bias the probability to transition from peer to peer before running any sample walks. A particular set of transition probabilities is called a weighting.
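The degree dependence can be checked numerically. The sketch below, on an illustrative four-peer topology, applies the unbiased transition matrix repeatedly (power iteration) and compares the result against $$d(v)/(2|\mathbf{E}|)$$:

```python
# Illustrative undirected topology as an adjacency list.
adj = {
    "A": ["B", "C", "D"],   # d(A) = 3
    "B": ["A", "C"],        # d(B) = 2
    "C": ["A", "B"],        # d(C) = 2
    "D": ["A"],             # d(D) = 1
}
num_edges = sum(len(n) for n in adj.values()) // 2   # |E| = 4

# Power iteration: repeatedly apply the unbiased transition matrix,
# where each neighbor of v is chosen with probability 1/d(v).
dist = {v: 1.0 / len(adj) for v in adj}
for _ in range(1000):
    nxt = {v: 0.0 for v in adj}
    for v, mass in dist.items():
        for u in adj[v]:
            nxt[u] += mass / len(adj[v])
    dist = nxt

# The stationary probability of each peer matches d(v) / (2|E|).
for v in adj:
    print(v, round(dist[v], 4), round(len(adj[v]) / (2 * num_edges), 4))
```

High-degree peer A ends up sampled three times as often as low-degree peer D, which is exactly the bias the weighting schemes below are designed to remove.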

Note that a single weighting corresponds to a single sample distribution. If an application or algorithm needs to sample from multiple sample distributions, multiple weightings must be created. Most random walk sampling algorithms can only create a weighting for a single sample distribution, so multiple sampling algorithms may be needed in some cases.

Random walk peer sampling algorithms work in two phases, first determining the link-transition probabilities of the random walk, and then running biased random walks using the new transition probabilities. There are a number of ways to determine the new link-transition probabilities in the first phase. Denoting $$q(x,y)$$ as the probability to transition from peer $$x$$ to peer $$y$$, we can list some of the major weighting schemes:

 * Maximum-degree: Biases random walks such that they sample uniformly. Letting $$d_{max} = \max_{v \in \mathbf{V}} d(v)$$, the transition probabilities are set such that
$$ q(x,y) = \begin{cases} 1/d_{max}         &  \mbox{if } x \ne y, \\ 1 - d(x)/d_{max} &  \mbox{if } x = y. \end{cases} $$
 * Random weight distribution: Biases random walks such that they sample uniformly. Similar in nature to maximum-degree, random weight distribution chooses some $$\rho > d_{max}$$, with the advantage that $$\rho$$ does not need to be calculated on-line, as long as it is guaranteed to be larger than $$d_{max}$$. The transition probabilities are biased similarly to those in maximum-degree:
$$ q(x,y) = \begin{cases} 1/\rho         &  \mbox{if } x \ne y, \\ 1 - d(x)/\rho &  \mbox{if } x = y. \end{cases} $$
 * Metropolis-Hastings: Biases random walks such that they sample from a particular target distribution $$P$$. Unlike the other schemes, $$P$$ can be any desired sample distribution, including uniform. Metropolis-Hastings makes use of detailed balance, which states that $$P(x)q(x,y) = P(y)q(y,x)$$ if a Markov chain is time-reversible. As P2P topologies that are strongly connected and bidirectional are time-reversible when represented as Markov chains, detailed balance can be used to re-weight an existing network. If $$w(x,y)$$ is the current transition probability from peer $$x$$ to peer $$y$$, the new set of transition probabilities $$q(x,y)$$ can be found from the ratio of the detailed balance equation:
$$ q(x,y) = \begin{cases} w(x,y)\, \text{min}\left(\frac{P(y)w(y,x)}{P(x)w(x,y)}, 1\right)    & \mbox{if }x \ne y, \\ 1 - \sum_{x \ne y}{q(x,y)}                                        & \mbox{if }x = y. \end{cases} $$
 * Doubly stochastic converge: Iteratively adjusts link weights until the transition matrix is doubly stochastic (each row and each column sums to one); a doubly stochastic transition matrix has a uniform stationary distribution.
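As one worked example, the Metropolis-Hastings re-weighting for a uniform target can be sketched in a few lines. The topology is illustrative, and the unbiased starting weights are taken to be $$w(x,y) = 1/d(x)$$:

```python
# Illustrative undirected topology as an adjacency list.
adj = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B"],
    "D": ["A"],
}
P = {v: 1.0 / len(adj) for v in adj}   # uniform target distribution

def w(x, y):
    """Unbiased transition probability from x to y: 1/d(x) per neighbor."""
    return 1.0 / len(adj[x]) if y in adj[x] else 0.0

q = {}
for x in adj:
    # Off-diagonal entries: q(x,y) = w(x,y) * min(P(y)w(y,x) / (P(x)w(x,y)), 1).
    for y in adj[x]:
        q[(x, y)] = w(x, y) * min(P[y] * w(y, x) / (P[x] * w(x, y)), 1.0)
    # Self-loop absorbs the remaining probability mass.
    q[(x, x)] = 1.0 - sum(q[(x, y)] for y in adj[x])

# Each row of q sums to 1, and detailed balance P(x)q(x,y) = P(y)q(y,x) holds.
print(round(q[("A", "D")], 4), round(q[("D", "A")], 4))
```

With a uniform $$P$$, the ratio reduces to $$\min(w(y,x)/w(x,y), 1)$$, so every edge ends up with the symmetric weight $$\min(1/d(x), 1/d(y))$$ and low-degree peers accumulate large self-loops.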

In all of the listed schemes, transition probability is removed only from links between distinct peers. To make up for this lost transition probability, each scheme introduces a self-loop and assigns the remaining probability mass to it. Intuitively, the self-loop means a random walk has some probability of not transitioning to a neighboring peer on a given step. Peers with an initially low probability of being sampled receive larger self-loops after balancing, which raises their sample probability to the desired level.

After transition probabilities are determined, random walks can be run to generate individual peer samples. Two main factors determine the length of the random walk necessary: the topology of the P2P network, and the transition probabilities, particularly the self-loops. In general, the higher the self-loops, the longer a sample walk must be until it becomes uncorrelated from its starting location. Topologies with large diameter also require more hops.
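The second phase can be checked empirically. The sketch below runs many maximum-degree-weighted walks of a fixed length on an illustrative four-peer topology (the walk length and sample count are arbitrary choices) and shows the resulting samples are close to uniform:

```python
import random
random.seed(42)

# Illustrative undirected topology as an adjacency list.
adj = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B"], "D": ["A"]}
d_max = max(len(n) for n in adj.values())   # maximum degree, here 3

def step(v):
    """One maximum-degree-weighted step: each neighbor with probability
    1/d_max; otherwise stay at v (self-loop probability 1 - d(v)/d_max)."""
    r = random.random()
    for i, u in enumerate(adj[v]):
        if r < (i + 1) / d_max:
            return u
    return v

def sample(start, walk_length):
    v = start
    for _ in range(walk_length):
        v = step(v)
    return v

# Run many independent walks and tally which peer each one ends on.
counts = {v: 0 for v in adj}
for _ in range(20000):
    counts[sample("A", walk_length=50)] += 1

# Each peer should be sampled close to 25% of the time.
print({v: round(c / 20000, 2) for v, c in counts.items()})
```

Shorter walks would still show a bias toward the starting peer's neighborhood, which is why walk length must grow with self-loop size and network diameter.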