Top-p sampling

Top-p sampling, also called nucleus sampling, is a decoding technique for autoregressive language models proposed by Ari Holtzman and colleagues in 2019. Before its introduction, maximum-likelihood decoding and beam search were the standard techniques for text generation, but both tend to produce text that is repetitive and otherwise unnatural. Top-p sampling avoids this by setting a threshold $p$ and restricting sampling to the smallest set of most probable tokens whose cumulative probability is at least $p$; the probabilities of the retained tokens are then renormalized before sampling.
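The procedure can be illustrated with a minimal Python sketch. It assumes the common formulation in which the nucleus is the smallest set of top-ranked tokens whose cumulative probability reaches $p$, followed by renormalization. The function names `top_p_filter` and `top_p_sample` are hypothetical and chosen for illustration; they are not taken from any particular library, and the input is assumed to be a plain list of next-token probabilities.

```python
import random


def top_p_filter(probs, p):
    """Return {token_index: renormalized probability} for the nucleus.

    Assumes `probs` is a list of next-token probabilities summing to about 1.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        # Stop as soon as the retained mass reaches the threshold.
        if cumulative >= p:
            break
    # Renormalize so the retained probabilities sum to 1.
    return {i: probs[i] / cumulative for i in nucleus}


def top_p_sample(probs, p, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Illustrative distribution over a four-token vocabulary (made-up values).
probs = [0.5, 0.3, 0.15, 0.05]
kept = top_p_filter(probs, 0.9)  # retains tokens 0, 1 and 2
token = top_p_sample(probs, 0.9)
```

With $p = 0.9$, the first two tokens cover only 0.8 of the mass, so the third is included as well and the least likely token is excluded from sampling.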

Top-k sampling is similar, except that the sample is drawn from the $k$ most probable tokens regardless of their cumulative probability. The advantage of top-p sampling is that it avoids the difficult problem of choosing an optimal value of $k$, which can vary with the shape of the output distribution and with the particular task and dataset.
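The difference can be seen by comparing how many tokens each method retains on a peaked and a flat distribution. The following is a rough sketch: the helper names `top_k_indices` and `nucleus_size` and the two example distributions are hypothetical, and the nucleus is again assumed to be the smallest top-ranked set whose cumulative probability reaches $p$.

```python
def top_k_indices(probs, k):
    """Indices of the k most probable tokens, whatever mass they cover."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]


def nucleus_size(probs, p):
    """Number of top-ranked tokens needed to reach cumulative probability p."""
    cumulative = 0.0
    for size, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return size
    return len(probs)


# Made-up distributions: one sharply peaked, one uniform.
peaked = [0.9, 0.05, 0.03, 0.01, 0.01]
flat = [0.2, 0.2, 0.2, 0.2, 0.2]

# A fixed k keeps three tokens in both cases.
print(len(top_k_indices(peaked, 3)), len(top_k_indices(flat, 3)))

# A fixed p adapts: one token for the peaked case, four for the flat one.
print(nucleus_size(peaked, 0.7), nucleus_size(flat, 0.7))
```

In the peaked case a fixed $k = 3$ admits two very unlikely tokens, while in the flat case it discards equally plausible ones; the nucleus size adjusts to the distribution in both cases.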

Top-p sampling is used in popular large language model applications such as ChatGPT and is implemented in language modeling libraries and platforms such as Hugging Face Transformers and Cohere.