Random binary tree

In computer science and probability theory, a random binary tree is a binary tree selected at random from some probability distribution on binary trees. Different distributions have been used, leading to different properties for these trees.

Random binary trees have been used for analyzing the average-case complexity of data structures based on binary search trees. For this application it is common to use random trees formed by inserting nodes one at a time according to a random permutation. The resulting trees are very likely to have logarithmic depth and logarithmic Strahler number. The treap and related balanced binary search trees use update operations that maintain this random structure even when the update sequence is non-random.

Other distributions on random binary trees include the uniform discrete distribution in which all distinct trees are equally likely, distributions on a given number of nodes obtained by repeated splitting, binary tries and radix trees for random data, and trees of variable size generated by branching processes.

For random trees that are not necessarily binary, see random tree.

Background
A binary tree is a rooted tree in which each node may have up to two children (the nodes directly below it in the tree), and those children are designated as being either left or right. It is sometimes convenient instead to consider extended binary trees in which each node is either an external node with zero children, or an internal node with exactly two children. A binary tree that is not in extended form may be converted into an extended binary tree by treating all its nodes as internal, and adding an external node for each missing child of an internal node. In the other direction, an extended binary tree with at least one internal node may be converted back into a non-extended binary tree by removing all its external nodes. In this way, these two forms are almost entirely equivalent for the purposes of mathematical analysis, except that the extended form allows a tree consisting of a single external node, which does not correspond to anything in the non-extended form. For the purposes of computer data structures, the two forms differ, as the external nodes of the extended form may be represented explicitly as objects in a data structure.

In a binary search tree the internal nodes are labeled by numbers or other ordered values, called keys, arranged so that an inorder traversal of the tree lists the keys in sorted order. The external nodes remain unlabeled. Binary trees may also be studied with all nodes unlabeled, or with labels that are not given in sorted order. For instance, the Cartesian tree data structure uses labeled binary trees that are not necessarily binary search trees.

A random binary tree is a random tree drawn from a certain probability distribution on binary trees. In many cases, these probability distributions are defined using a given set of keys, and describe the probabilities of binary search trees having those keys. However, other distributions are possible, not necessarily generating binary search trees, and not necessarily giving a fixed number of nodes.

From random permutations
For any sequence of distinct ordered keys, one may form a binary search tree in which each key is inserted in sequence as a leaf of the tree, without changing the structure of the previously inserted keys. The position for each insertion can be found by a binary search in the previous tree. The random permutation model, for a given set of keys, is defined by choosing the sequence randomly from the permutations of the set, with each permutation having equal probability.

For instance, if the three keys 1, 3, 2 are inserted into a binary search tree in that sequence, the number 1 will sit at the root of the tree, the number 3 will be placed as its right child, and the number 2 as the left child of the number 3. There are six different permutations of the keys 1, 2, and 3, but only five trees may be constructed from them. That is because the permutations 2, 1, 3 and 2, 3, 1 form the same tree. Thus, this tree has probability $$\tfrac26=\tfrac13$$ of being generated, whereas the other four trees each have probability $$\tfrac16$$.
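The three-key example can be reproduced with a short Python sketch. The tuple representation `(left, key, right)` and the helper names `insert` and `build` are choices made for this illustration only.

```python
from collections import Counter
from itertools import permutations


def insert(tree, key):
    """Insert key as a new leaf; a tree is None or a tuple (left, key, right)."""
    if tree is None:
        return (None, key, None)
    left, root, right = tree
    if key < root:
        return (insert(left, key), root, right)
    return (left, root, insert(right, key))


def build(sequence):
    """Binary search tree obtained by inserting the keys in the given order."""
    tree = None
    for key in sequence:
        tree = insert(tree, key)
    return tree


# Count how many of the 3! = 6 insertion orders produce each tree.
counts = Counter(build(order) for order in permutations([1, 2, 3]))
for tree, count in counts.items():
    print(count, tree)
```

Five distinct trees are printed, and only the tree rooted at 2 is reached by two insertion orders.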

Expected depth of a node
For any key $$x$$ in a given set of $$n$$ keys, the expected value of the length of the path from the root to $$x$$ in a random binary search tree is at most $$2\log n+O(1)$$, where "$$\log$$" denotes the natural logarithm function and the $$O$$ introduces big O notation. By linearity of expectation, the expected number of ancestors of $$x$$ equals the sum, over other keys $$y$$, of the probability that $$y$$ is an ancestor of $$x$$. A key $$y$$ is an ancestor of $$x$$ exactly when $$y$$ is the first key to be inserted from the interval $$[x,y]$$. Because each key in the interval is equally likely to be first, this happens with probability equal to the reciprocal of the number of keys in the interval. Thus, the keys that are adjacent to $$x$$ in the sorted sequence of keys have probability $$\tfrac12$$ of being an ancestor of $$x$$, the keys one step away have probability $$\tfrac13$$, etc. The sum of these probabilities forms two copies of the harmonic series extending away from $$x$$ in both directions in the sorted sequence, giving the $$2\log n+O(1)$$ bound above. This bound also holds for the expected search path length for a value $$x$$ that is not one of the given keys.
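The harmonic-sum argument can be checked exactly for small $$n$$. The following Python sketch compares the sum of ancestor probabilities with a brute-force average over all insertion orders; the function names are hypothetical, and keys are assumed to be the integers 1 to $$n$$.

```python
from fractions import Fraction
from itertools import permutations
from math import log


def expected_depth(n, i):
    """Exact expected depth of the i-th smallest of n keys (1-based rank)."""
    # Key j is an ancestor of key i with probability 1 / (number of keys from i to j).
    return sum(Fraction(1, abs(i - j) + 1) for j in range(1, n + 1) if j != i)


def depth_of(x, sequence):
    """Depth of key x in the tree built by inserting `sequence` in order."""
    low, high = float("-inf"), float("inf")  # open interval still containing x
    depth = 0
    for key in sequence:
        if key == x:
            return depth
        if low < key < high:  # no earlier key separates key from x: an ancestor
            depth += 1
            if key < x:
                low = key
            else:
                high = key
    raise ValueError("x does not occur in the sequence")


def brute_force_depth(n, i):
    """Average depth of key i over all n! insertion orders."""
    orders = list(permutations(range(1, n + 1)))
    return Fraction(sum(depth_of(i, order) for order in orders), len(orders))


n = 6
for i in range(1, n + 1):
    print(i, expected_depth(n, i), brute_force_depth(n, i), 2 * log(n))
```

For each rank the two exact values agree, and both stay below $$2\log n$$.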

The longest path
The longest root-to-leaf path, in a random binary search tree, is longer than the expected path length, but only by a constant factor. Its length, for a tree with $$n$$ nodes, is with high probability approximately

$$\frac{1}{\beta}\log n\approx 4.311\log n,$$

where $$\beta$$ is the unique number in the range $$0<\beta<1$$ satisfying the equation

$$2\beta e^{1-\beta}=1.$$

Expected number of leaves
In the random permutation model, each key except the smallest and largest has probability $$\tfrac13$$ of being a leaf in the tree. This is because it is a leaf exactly when it is inserted after its two neighbors, which happens for two out of the six orderings of it and its two neighbors, all of which are equally likely. By similar reasoning, the smallest and largest key each have probability $$\tfrac12$$ of being a leaf. Therefore, the expected number of leaves is the sum of these probabilities, which for $$n\ge 2$$ is exactly $$(n-2)\cdot\tfrac13+2\cdot\tfrac12=(n+1)/3$$.
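The $$(n+1)/3$$ formula can be confirmed by exhaustive enumeration for small $$n$$. This Python sketch builds the tree for every insertion order and averages the number of childless nodes; the helper names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations


def insert(tree, key):
    """Insert key as a new leaf; a tree is None or a tuple (left, key, right)."""
    if tree is None:
        return (None, key, None)
    left, root, right = tree
    if key < root:
        return (insert(left, key), root, right)
    return (left, root, insert(right, key))


def count_leaves(tree):
    """Number of nodes that have no children."""
    if tree is None:
        return 0
    left, _, right = tree
    if left is None and right is None:
        return 1
    return count_leaves(left) + count_leaves(right)


def average_leaves(n):
    """Exact expected number of leaves over all n! insertion orders."""
    orders = list(permutations(range(n)))
    total = 0
    for order in orders:
        tree = None
        for key in order:
            tree = insert(tree, key)
        total += count_leaves(tree)
    return Fraction(total, len(orders))


for n in range(2, 7):
    print(n, average_leaves(n), Fraction(n + 1, 3))
```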

Strahler number
The Strahler number of vertices in any tree is a measure of the complexity of the subtrees under those vertices. A leaf (external node) has Strahler number one. For any other node, the Strahler number is defined recursively from the Strahler numbers of its children. In a binary tree, if two children have different Strahler numbers, the Strahler number of their parent is the larger of the two child numbers. But if two children have equal Strahler numbers, their parent has a number that is greater by one. The Strahler number of the whole tree is the number at the root node. For $$n$$-node random binary search trees, simulations suggest that the expected Strahler number is $$\log_3 n + O(1)$$. A weaker upper bound $$\log_3 n + o(\log n)$$ has been proven.
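The recursive definition translates directly into code. In this Python sketch an extended binary tree is represented, purely as an assumption of the example, by `None` for an external node and a pair `(left, right)` for an internal node.

```python
def strahler(tree):
    """Strahler number of an extended binary tree.

    None stands for an external node; a pair (left, right) is an internal node.
    """
    if tree is None:
        return 1  # external nodes have Strahler number one
    a, b = strahler(tree[0]), strahler(tree[1])
    # Equal child numbers increase by one; otherwise the larger one is inherited.
    return a + 1 if a == b else max(a, b)


cherry = (None, None)                     # a single internal node
path = (((None, None), None), None)       # three internal nodes in a chain
balanced = ((None, None), (None, None))   # three internal nodes, complete

print(strahler(cherry), strahler(path), strahler(balanced))  # prints 2 2 3
```

The chain and the complete tree have the same number of nodes but different Strahler numbers, which is why the number is read as a measure of branching complexity rather than size.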

Treaps and randomized binary search trees
In applications of binary search tree data structures, it is rare for the keys to be inserted without deletion in a random order, limiting the direct applications of random binary trees. However, algorithm designers have devised data structures that allow arbitrary insertions and deletions to preserve the property that the shape of the tree is random, as if the keys had been inserted randomly.

If a given set of keys is assigned numeric priorities (unrelated to their values), these priorities may be used to construct a Cartesian tree for the numbers, the binary search tree that would result from inserting the keys in priority order. By choosing the priorities to be independent random real numbers in the unit interval, and by maintaining the Cartesian tree structure using tree rotations after any insertion or deletion of a node, it is possible to maintain a data structure that behaves like a random binary search tree. Such a data structure is known as a treap or a randomized binary search tree.

Variants of the treap, including the zip tree and zip-zip tree, replace the tree rotations by "zipping" operations that split and merge trees, and that limit the number of random bits that need to be generated and stored alongside the keys. The result of these optimizations is still a tree with a random structure, but one that does not exactly match the random permutation model.

Uniformly random binary trees
The number of binary trees with $$n$$ nodes is a Catalan number. For $$n=1,2,3,\dots$$ these numbers of trees are

$$1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, \dots$$

Thus, if one of these trees is selected uniformly at random, its probability is the reciprocal of a Catalan number. Trees drawn from this distribution are sometimes called random binary Catalan trees. They have expected depth proportional to the square root of $$n$$, rather than to the logarithm. More precisely, the expected depth of a randomly chosen node in an $$n$$-node tree of this type is

$$\sqrt{\pi n}-3+O\left(\frac{1}{\sqrt n}\right).$$

The expected Strahler number of a uniformly random $$n$$-node binary tree is $$\log_4 n+O(1)$$, lower than the expected Strahler number of random binary search trees.
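The Catalan numbers that count $$n$$-node binary trees can be computed in two independent ways, which gives a quick consistency check. The sketch below (function names are illustrative) counts trees by the size of the left subtree and compares the result with the closed form $$\tfrac{1}{n+1}\tbinom{2n}{n}$$.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_trees(n):
    """Number of binary trees with n nodes, split by the size k of the left subtree."""
    if n == 0:
        return 1
    return sum(count_trees(k) * count_trees(n - 1 - k) for k in range(n))


def catalan(n):
    """Closed form for the n-th Catalan number."""
    return comb(2 * n, n) // (n + 1)


print([count_trees(n) for n in range(1, 11)])
print([catalan(n) for n in range(1, 11)])
```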

Due to their large heights, this model of equiprobable random trees is not generally used for binary search trees. However, it has other applications, including:
 * Modeling the parse trees of algebraic expressions in compiler design. Here the internal nodes of the tree represent binary operations in an expression and the external nodes represent the variables or constants on which the expressions operate. The bound on Strahler number translates into the number of registers needed to evaluate an expression.
 * Modeling river networks, the original application for which the Strahler number was developed.
 * Modeling possible evolutionary trees for a fixed number of species. In this application, an extended binary tree is used, with the species at its external nodes.

An algorithm of Jean-Luc Rémy generates a uniformly random binary tree of a specified size in time linear in the size, by the following process. Start with a tree consisting of a single external node. Then, while the current tree has not reached the target size, repeatedly choose one of its nodes (internal or external) uniformly at random. Replace the chosen node by a new internal node, having the chosen node as one of its children (equally likely left or right), and having a new external node as its other child. Stop when the target size is reached.
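The process described above can be sketched in Python as follows. The array layout (`children` and `parent` lists indexed by node number) and the function names are assumptions of this example rather than part of Rémy's original presentation; each step takes constant time, so the total time is linear.

```python
import random


def remy(n, rng=random):
    """Grow a random extended binary tree with n internal nodes.

    Returns (children, root): children[v] is None for an external node and a
    two-element list [left, right] of node indices for an internal node.
    """
    children = [None]  # node 0 starts as the single external node
    parent = [None]
    root = 0
    for _ in range(n):
        chosen = rng.randrange(len(children))  # any node, internal or external
        new_internal = len(children)
        new_external = new_internal + 1
        pair = [chosen, new_external]
        if rng.random() < 0.5:  # chosen node becomes left or right child
            pair.reverse()
        children.append(pair)   # entry for new_internal
        children.append(None)   # entry for new_external
        above = parent[chosen]
        parent.append(above)         # new_internal takes the place of chosen
        parent.append(new_internal)  # parent of new_external
        if above is None:
            root = new_internal
        else:
            slot = children[above].index(chosen)
            children[above][slot] = new_internal
        parent[chosen] = new_internal
    return children, root


def shape(children, v):
    """Nested-tuple shape of the subtree at v (None for an external node)."""
    if children[v] is None:
        return None
    left, right = children[v]
    return (shape(children, left), shape(children, right))


print(shape(*remy(3)))
```

Here the size is measured by the number of internal nodes, so a call with `n` produces `2 * n + 1` nodes in total.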

Branching processes
The Galton–Watson process describes a family of distributions on trees in which the number of children at each node is chosen randomly, independently of other nodes. For binary trees, two versions of the Galton–Watson process are in use, differing only in whether an extended binary tree with only one node, an external root node, is allowed:
 * In the version where the root node may be external, it is chosen to be internal with some specified probability $$p$$ or external with probability $$1-p$$. If it is internal, its two children are trees generated recursively by the same process.
 * In the version where the root node must be internal, its left and right children are determined to be internal with probability $$p$$ or external with probability $$1-p$$, independently of each other. In the case where they are internal, they are the roots of trees that are generated recursively by the same process.

Trees generated in this way have been called binary Galton–Watson trees. In the special case where $$p=\tfrac12$$ they are called critical binary Galton–Watson trees.
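Both versions can be sketched with one Python function. The `root_internal` flag, the `max_nodes` cutoff (needed because the tree may be infinite), and the array representation are assumptions made for this example.

```python
import random


def galton_watson(p, root_internal=False, max_nodes=10_000, rng=random):
    """Generate a binary Galton-Watson tree, or None if it outgrows max_nodes.

    children[v] is None for an external node and (left, right) for an
    internal node; node 0 is the root.
    """
    children = [None]
    stack = [0]            # nodes whose type has not been decided yet
    force = root_internal  # only the root can be forced to be internal
    while stack:
        v = stack.pop()
        if force or rng.random() < p:
            if len(children) + 2 > max_nodes:
                return None  # give up: the tree is too large (possibly infinite)
            left, right = len(children), len(children) + 1
            children.extend([None, None])
            children[v] = (left, right)
            stack.extend([right, left])  # left child is decided first
        force = False
    return children


tree = galton_watson(0.45)
print(None if tree is None else len(tree))
```

The explicit stack avoids deep recursion, which matters near $$p=\tfrac12$$, where occasional very large trees are expected.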

Analysis
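The probability $$p=\tfrac12$$ marks a phase transition for the binary Galton–Watson process: for $$p\le\tfrac12$$ the resulting tree is almost certainly finite, whereas for $$p>\tfrac12$$ it is infinite with positive probability. More precisely, for any $$p$$, the probability that the tree remains finite is

$$\min\left\{1,\frac{1-p}{p}\right\}$$

in the version where the root may be external; this is the smallest non-negative solution $$q$$ of the equation $$q=(1-p)+pq^2$$.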

Another way to generate the same trees is to make a sequence of coin flips, with probability $$p$$ of heads and probability $$1-p$$ of tails. The sequence stops at the first flip at which the number of tails exceeds the number of heads (for the model in which an external root is allowed) or exceeds one plus the number of heads (when the root must be internal). The flips are then used to determine the choices made by the recursive generation process, in depth-first order, with heads producing an internal node and tails an external node.
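The coin-flip description can be illustrated with a small Python sketch for the model in which an external root is allowed. Heads (`True`) is read as "internal node" and tails (`False`) as "external node"; the function names and the `max_flips` cutoff are assumptions of the example.

```python
import random


def flip_sequence(p, rng=random, max_flips=500):
    """Flip until tails outnumber heads; None if max_flips is reached first."""
    flips = []
    balance = 0  # heads minus tails so far
    while balance >= 0:
        if len(flips) >= max_flips:
            return None
        heads = rng.random() < p
        flips.append(heads)
        balance += 1 if heads else -1
    return flips


def tree_from_flips(flips):
    """Decode a complete flip sequence in depth-first (preorder) order.

    Returns None for an external node and (left, right) for an internal node.
    """
    remaining = iter(flips)

    def build():
        if next(remaining):  # heads: internal node with two subtrees
            left = build()
            right = build()
            return (left, right)
        return None          # tails: external node

    return build()


print(tree_from_flips([True, False, True, False, False]))  # prints (None, (None, None))
flips = flip_sequence(0.4)
print(None if flips is None else tree_from_flips(flips))
```

A complete sequence always contains exactly one more tails than heads, matching the fact that an extended binary tree has one more external node than internal nodes.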

Because the number of internal nodes equals the number of heads in this coin flip sequence, all trees with a given number $$n$$ of nodes are generated from (unique) coin flip sequences of the same length, and are equally likely, regardless of $$p$$. That is, the choice of $$p$$ affects the variation in the size of trees generated by this process, but for a given size the trees are generated uniformly at random. For values of $$p$$ below the critical probability $$p=\tfrac12$$, smaller values of $$p$$ will produce trees with a smaller expected size, while larger values of $$p$$ will produce trees with a larger expected size. At the critical probability $$p=\tfrac12$$ there is no finite bound on the expected size of trees generated by this process. More precisely, for any $$p$$, the expected number of nodes at depth $$i$$ in the tree is $$(2p)^i$$, and the expected size of the tree can be obtained by summing the expected numbers of nodes at each depth. For $$p<\tfrac12$$ this gives a geometric series

$$1+2p+4p^2+8p^3+\cdots=\frac{1}{1-2p}$$

for the expected tree size, but for $$p=\tfrac12$$ this gives 1 + 1 + 1 + 1 + ⋯, a divergent series.

For $$p=\tfrac12$$, any particular tree with $$n$$ internal nodes is generated with probability $$1/2^{2n+1}$$, and the probability that a random tree has this size is this probability multiplied by a Catalan number,

$$\frac{1}{2^{2n+1}}\cdot\frac{1}{n+1}\binom{2n}{n}\approx\frac{1}{\sqrt{4\pi}\,n^{3/2}}.$$

Applications
Galton–Watson processes were originally developed to study the spread and extinction of human surnames, and have been widely applied more generally to the dynamics of human or animal populations. These processes have been generalized to models where the probability of being an internal or external node at a given level of the tree (a generation, in the population dynamics application) is not fixed, but depends on the number of nodes at the previous level. A version of this process, with the critical probability $$\tfrac12$$, has been studied as a model for speciation, where it is known as the critical branching process. In this process, each species has an exponentially distributed lifetime, and over the course of its lifetime produces child species at a rate equal to the reciprocal of its expected lifetime. When a child is produced, the parent continues as the left branch of the evolutionary tree, and the child becomes the right branch.

Another application of critical Galton–Watson trees (in the version where the root must be internal) arises in the Karger–Stein algorithm for finding minimum cuts in graphs, using a recursive edge contraction process. This algorithm calls itself twice recursively, with each call having probability at least $$\tfrac12$$ of preserving the correct solution value. The random tree models the subtree of correct recursive calls. The algorithm succeeds on a graph of $$n$$ vertices whenever this random tree of correct recursive calls has a branch of depth at least $$2\log_2 n$$, reaching the base case of its recursion. The success probability is $$\Omega(1/\log n)$$, producing one of the logarithmic factors in the algorithm's $$O(n^2\log^3 n)$$ runtime.

Yule process
Devroye and Robson consider a related continuous-time random process in which each external node is eventually replaced by an internal node with two external children, at an exponentially distributed time after its first appearance as an external node. The number of external nodes in the tree, at any time, is modeled by a simple birth process or Yule process in which the members of a population give birth at a constant rate: giving birth to one child, in the Yule process, corresponds to being replaced by two children, in Devroye and Robson's model. If this process is stopped at any fixed time, the result is a binary tree of a random size (depending on the stopping time), distributed according to the random permutation model for that size. Devroye and Robson use this model as part of an algorithm to quickly generate trees in the random permutation model, described by their numbers of nodes at each depth rather than by their exact structure. A discrete variant of this process starts with a tree consisting of a single external node, and repeatedly replaces an external node chosen uniformly at random by an internal node with two external children. Again, if this is stopped at a fixed time (with a fixed size), the resulting tree is distributed according to the random permutation model for that size.
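The discrete variant is straightforward to sketch in Python. The array representation and the names `grow_tree` and `shape` are assumptions of this example; it is not Devroye and Robson's profile-generating algorithm, only the basic growth step.

```python
import random


def grow_tree(n, rng=random):
    """Replace a uniformly random external node by an internal node, n times.

    children[v] is None for an external node and (left, right) for an
    internal node; node 0 is the root.
    """
    children = [None]
    externals = [0]  # indices of the current external nodes
    for _ in range(n):
        i = rng.randrange(len(externals))
        v = externals[i]
        left, right = len(children), len(children) + 1
        children.extend([None, None])
        children[v] = (left, right)
        externals[i] = left      # v is no longer external; reuse its slot
        externals.append(right)
    return children


def shape(children, v=0):
    """Nested-tuple shape of the subtree at v (None for an external node)."""
    if children[v] is None:
        return None
    left, right = children[v]
    return (shape(children, left), shape(children, right))


print(shape(grow_tree(3)))
```

For three internal nodes, two of the six equally likely choice sequences lead to the balanced shape and one to each of the other four shapes, the same probabilities as for the random permutation model on three keys.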

Binary tries
Another form of binary tree, the binary trie or digital search tree, has a collection of binary numbers labeling some of its external nodes. The internal nodes of the tree represent prefixes of their binary representations that are shared by two or more of the numbers. The left and right children of an internal node are obtained by extending the corresponding prefix by one more bit, a zero or a one bit respectively. If this extension does not match any of the given numbers, or it matches only one of them, the result is an external node; otherwise it is another internal node. Random binary tries have been studied, for instance for sets of random real numbers generated independently in the unit interval. Despite the fact that these trees may have some empty external nodes, they tend to be better balanced than random binary search trees. For $$n$$ uniformly random real numbers in the unit interval, or more generally for any square-integrable probability distribution on the unit interval, the average depth of a node is asymptotically $$\log_2 n$$, and the average height of the whole tree is asymptotically $$2\log_2 n$$. The analysis of these trees can be applied to the computational complexity of trie-based sorting algorithms.
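The construction of a binary trie can be sketched in Python as follows. As assumptions of the example, keys are distinct bit strings of equal length (here 32 random bits standing in for random reals), `None` is an empty external node, a string is an external node holding that key, and a pair is an internal node.

```python
import random


def build_trie(keys, bit=0):
    """Binary trie for a list of distinct, equal-length bit strings."""
    if not keys:
        return None     # empty external node
    if len(keys) == 1:
        return keys[0]  # external node labeled by its only key
    zeros = [k for k in keys if k[bit] == "0"]
    ones = [k for k in keys if k[bit] == "1"]
    return (build_trie(zeros, bit + 1), build_trie(ones, bit + 1))


def leaf_depths(trie, depth=0):
    """Depths of the external nodes that hold a key."""
    if trie is None:
        return []
    if isinstance(trie, str):
        return [depth]
    return leaf_depths(trie[0], depth + 1) + leaf_depths(trie[1], depth + 1)


print(build_trie(["000", "001", "110"]))

# A set removes the (unlikely) duplicate draws before building the trie.
keys = list({format(random.getrandbits(32), "032b") for _ in range(1000)})
depths = leaf_depths(build_trie(keys))
print(sum(depths) / len(depths), max(depths))
```

In the three-key example the keys `000` and `001` share the prefix `00`, so the internal node for prefix `0` has an empty external node as its right child.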

A variant of the trie, the radix tree or compressed trie, eliminates empty external nodes and their parent internal nodes. The remaining internal nodes correspond to prefixes for which both possible extensions, by a zero or a one bit, are used by at least one of the randomly chosen numbers. For a radix tree for $$n$$ uniformly distributed binary numbers, the shortest leaf-root path has length $$\log_2 n-\log_2\log n+o(\log\log n)$$ and the longest leaf-root path has length $$\log_2 n+\sqrt{2\log_2 n}+o(\sqrt{\log n}),$$ both with high probability.

Random split trees
Luc Devroye and Paul Kruszewski describe a recursive process for constructing random binary trees with $$n$$ nodes. It generates a real-valued random variable $$x$$ in the unit interval $$(0,1)$$, assigns the first $$xn$$ nodes (rounded down to an integer number of nodes) to the left subtree, the next node to the root, and the remaining nodes to the right subtree. Then, it continues recursively using the same process in the left and right subtrees. If $$x$$ is chosen uniformly at random in the interval, the result is the same as the random binary search tree generated by a random permutation of the nodes, as any node is equally likely to be chosen as root. However, this formulation allows other distributions to be used instead. For instance, in the uniformly random binary tree model, once a root is fixed each of its two subtrees must also be uniformly random, so the uniformly random model may also be generated by a different choice of distribution (depending on $$n$$) for $$x$$. As they show, by choosing a beta distribution on $$x$$ and by using an appropriate choice of shape to draw each of the branches, the mathematical trees generated by this process can be used to create realistic-looking botanical trees.
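The recursive splitting process can be sketched in a few lines of Python. The `draw` parameter (a function returning the split variable $$x$$), the tuple representation, and the beta parameters in the usage line are assumptions of this example, not choices taken from Devroye and Kruszewski.

```python
import random


def split_tree(n, draw=random.random):
    """Random binary tree with n nodes built by recursive splitting.

    Returns None for an empty tree and (left, right) for a node with the
    given subtrees; draw() supplies the split variable x in the unit interval.
    """
    if n == 0:
        return None
    # floor(x * n) nodes go left; the guard keeps one node for the root.
    left_size = min(int(draw() * n), n - 1)
    return (split_tree(left_size, draw), split_tree(n - 1 - left_size, draw))


def size(tree):
    """Number of nodes of a tree produced by split_tree."""
    return 0 if tree is None else 1 + size(tree[0]) + size(tree[1])


uniform_tree = split_tree(200)  # uniform x: random permutation model
beta_tree = split_tree(200, lambda: random.betavariate(2, 2))  # a beta-distributed x
print(size(uniform_tree), size(beta_tree))
```

Passing a different `draw` function is the only change needed to switch between split distributions, which is the flexibility that this formulation is meant to provide.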