Leader election

In distributed computing, leader election is the process of designating a single process as the organizer of some task distributed among several computers (nodes). Before the task has begun, all network nodes are either unaware which node will serve as the "leader" (or coordinator) of the task, or unable to communicate with the current coordinator. After a leader election algorithm has been run, however, each node throughout the network recognizes a particular, unique node as the task leader.

The network nodes communicate among themselves in order to decide which of them will get into the "leader" state. For that, they need some method in order to break the symmetry among them. For example, if each node has unique and comparable identities, then the nodes can compare their identities, and decide that the node with the highest identity is the leader.

The definition of this problem is often attributed to LeLann, who formalized it as a method to create a new token in a token ring network in which the token has been lost.

Leader election algorithms are designed to be economical in terms of total bytes transmitted, and time. The algorithm suggested by Gallager, Humblet, and Spira for general undirected graphs has had a strong impact on the design of distributed algorithms in general, and won the Dijkstra Prize for an influential paper in distributed computing.

Many other algorithms have been suggested for different kinds of network graphs, such as undirected rings, unidirectional rings, complete graphs, grids, directed Euler graphs, and others. A general method that decouples the issue of the graph family from the design of the leader election algorithm was suggested by Korach, Kutten, and Moran.

Definition
The problem of leader election is for each processor eventually to decide whether it is a leader or not, subject to the constraint that exactly one processor decides that it is the leader. An algorithm solves the leader election problem if:
 * 1)  States of processors are divided into elected and not-elected states. Once elected, it remains as elected (similarly if not elected).
 * 2)  In every execution, exactly one processor becomes elected and the rest determine that they are not elected.

A valid leader election algorithm must meet the following conditions:
 * 1)  Termination: the algorithm should finish within a finite time once the leader is selected. In randomized approaches this condition is sometimes weakened (for example, requiring termination with probability 1).
 * 2)  Uniqueness: there is exactly one processor that considers itself as leader.
 * 3) Agreement: all other processors know who the leader is.

An algorithm for leader election may vary in the following aspects:
 * Communication mechanism: the processors are either synchronous in which processes are synchronized by a clock signal or asynchronous where processes run at arbitrary speeds.
 * Process names: whether processes have a unique identity or are indistinguishable (anonymous).
 * Network topology: for instance, ring, acyclic graph or complete graph.
 * Size of the network: the algorithm may or may not use knowledge of the number of processes in the system.

Leader election in rings
A ring network is a connected-graph topology in which each node is exactly connected to two other nodes, i.e., for a graph with n nodes, there are exactly n edges connecting the nodes. A ring can be unidirectional, which means processors only communicate in one direction (a node could only send messages to the left or only send messages to the right), or bidirectional, meaning processors may transmit and receive messages in both directions (a node could send messages to the left and right).

Anonymous rings
A ring is said to be anonymous if every processor is identical. More formally, the system has the same state machine for every processor. There is no deterministic algorithm to elect a leader in anonymous rings, even when the size of the network is known to the processes. This is due to the fact that there is no possibility of breaking symmetry in an anonymous ring if all processes run at the same speed. The state of processors after some steps only depends on the initial state of neighbouring nodes. So, because their states are identical and execute the same procedures, in every round the same messages are sent by each processor. Therefore, each processor state also changes identically and as a result if one processor is elected as a leader, so are all the others.

For simplicity, here is a proof in anonymous synchronous rings. It is a proof by contradiction. Consider an anonymous ring R with size n>1. Assume there exists an algorithm "A" to solve leader election in this anonymous ring R.


 * Lemma: after round $$k$$ of the admissible execution of A in R, all the processes have the same states.

Proof. Proof by induction on $$k$$.

Base case: $$k=0$$: all the processes are in the initial state, so all the processes are identical.

Induction hypothesis: assume the lemma is true for $$k-1$$ rounds.

Inductive step: in round $$k$$, every process send the same message $$m_r$$ to the right and send the same message $$m_l$$ to the left. Since all the processes are in the same state after round $$k-1$$, in round k, every process will receive the message $$m_r$$ from the left edge, and will receive the message $$m_l$$ from the right edge. Since all processes are receiving the same messages in round $$k$$, they are in the same state after round $$k$$.

The above lemma contradicts the fact that after some finite number of rounds in an execution of A, one process entered the elected state and other processes entered the non-elected state.

Randomized (probabilistic) leader election
A common approach to solve the problem of leader election in anonymous rings is the use of probabilistic algorithms. In such approaches, generally processors assume some identities based on a probabilistic function and communicate it to the rest of the network. At the end, through the application of an algorithm, a leader is selected (with high probability).

Asynchronous ring


Since there is no algorithm for anonymous rings (proved above), the asynchronous rings would be considered as asynchronous non-anonymous rings. In non-anonymous rings, each process has a unique $$id$$, and they don't know the size of the ring. Leader election in asynchronous rings can be solved by some algorithm with using $$O(n^2)$$ messages or $$O(n \log n)$$ messages.

In the $$O(n^2)$$ algorithm, every process sends a message with its $$id$$ to the left edge. Then waits until a message from the right edge. If the $$id$$ in the message is greater than its own $$id$$, then forwards the message to the left edge; else ignore the message, and does nothing. If the $$id$$ in the message is equal to its own $$id$$, then sends a message to the left announcing myself is elected. Other processes forward the announcement to the left and turn themselves to non-elected. It is clear that the upper bound is $$O(n^2)$$ for this algorithm.

In the $$O(n \log n)$$ algorithm, it is running in phases. On the $$k$$th phase, a process will determine whether it is the winner among the left side $$2^k$$ and right side $$2^k$$ neighbors. If it is a winner, then the process can go to next phase. In phase $$0$$, each process $$P$$ needs to determine itself is a winner or not by sending a message with its $$id$$ to the left and right neighbors (neighbor do not forward the message). The neighbor replies an $$ACK$$ only if the $$id$$ in the message is larger than the neighbor's $$id$$, else replies an $$ACK_{fault}$$. If $$P$$ receives two $$ACK$$s, one from the left, one from the right, then $$P$$ is the winner in phase $$0$$. In phase $$k$$, the winners in phase $$k-1$$ need to send a message with its $$id$$ to the $$2^k$$ left and $$2^k$$ right neighbors. If the neighbors in the path receive the $$id$$ in the message larger than their $$id$$, then forward the message to the next neighbor, otherwise reply an $$ACK_{fault}$$. If the $$2^k$$th neighbor receives the $$id$$ larger than its $$id$$, then sends back an $$ACK$$, otherwise replies an $$ACK_{fault}$$. If the process receives two $$ACK$$s, then it is the winner in phase $$k$$. In the last phase, the final winner will receive its own $$id$$ in the message, then terminates and send termination message to the other processes. In the worst case, each phase there are at most $$\frac{n}{2^k+1}$$ winners, where $$k$$ is the phase number. There are $$\lceil \log (n-1) \rceil$$ phases in total. Each winner sends in the order of $$2^k$$ messages in each phase. So, the messages complexity is $$O(n \log n)$$.

Synchronous ring
In Attiya and Welch's Distributed Computing book, they described a non-uniform algorithm using $$O(n)$$ messages in synchronous ring with known ring size $$n$$. The algorithm is operating in phases, each phase has $$n$$ rounds, each round is one time unit. In phase $$0$$, if there is a process with $$id=0$$, then process $$0$$ sends termination message to the other processes (sending termination messages cost $$n$$ rounds). Else, go to the next phase. The algorithm will check if there is a phase number equals to a process $$id$$, then does the same steps as phase $$0$$. At the end of the execution, the minimal $$id$$ will be elected as the leader. It used exactly $$n$$ messages and $$n(minimum\_id +1)$$ rounds.

Itai and Rodeh introduced an algorithm for a unidirectional ring with synchronized processes. They assume the size of the ring (number of nodes) is known to the processes. For a ring of size n, a≤n processors are active. Each processor decides with probability of a^(-1) whether to become a candidate. At the end of each phase, each processor calculates the number of candidates c and if it is equal to 1, it becomes the leader. To determine the value of c, each candidate sends a token (pebble) at the start of the phase which is passed around the ring, returning after exactly n time units to its sender. Every processor determines c by counting the number of pebbles which passed through. This algorithm achieves leader election with expected message complexity of O(nlogn). A similar approach is also used in which a time-out mechanism is employed to detect deadlocks in the system. There are also algorithms for rings of special sizes such as prime size and odd size.

Uniform algorithm
In typical approaches to leader election, the size of the ring is assumed to be known to the processes. In the case of anonymous rings, without using an external entity, it is not possible to elect a leader. Even assuming an algorithm exists, the leader could not estimate the size of the ring. i.e. in any anonymous ring, there is a positive probability that an algorithm computes a wrong ring size. To overcome this problem, Fisher and Jiang used a so-called leader oracle Ω? that each processor can ask whether there is a unique leader. They show that from some point upward, it is guaranteed to return the same answer to all processes.

Rings with unique IDs
In one of the early works, Chang and Roberts proposed a uniform algorithm in which a processor with the highest ID is selected as the leader. Each processor sends its ID in a clockwise direction. A process receiving a message and compares it with its own. If it is bigger, it passes it through, otherwise it will discard the message. They show that this algorithm uses at most $$O(n^2)$$ messages and $$O(n \log n)$$ in the average case. Hirschberg and Sinclair improved this algorithm with $$O(n \log n)$$ message complexity by introducing a 2 directional message passing scheme allowing the processors to send messages in both directions.

Leader election in a mesh
The mesh is another popular form of network topology, especially in parallel systems, redundant memory systems and interconnection networks. In a mesh structure, nodes are either corner (only two neighbours), border (only three neighbours) or interior (with four neighbours). The number of edges in a mesh of size a x b is m=2ab-a-b.

Unoriented mesh
A typical algorithm to solve the leader election in an unoriented mesh is to only elect one of the four corner nodes as the leader. Since the corner nodes might not be aware of the state of other processes, the algorithm should first wake up the corner nodes. A leader can be elected as follows. The message complexity is at most $$6(a + b) - 16$$, and if the mesh is square-shaped, $$O(\sqrt{n})$$.
 * 1) Wake-up process: in which $$k$$ nodes initiate the election process. Each initiator sends a wake-up message to all its neighbouring nodes. If a node is not initiator, it simply forwards the messages to the other nodes. In this stage at most $$3n+k$$ messages are sent.
 * 2) Election process: the election in outer ring takes two stages at most with $$6(a+b)-16$$ messages.
 * 3) Termination: leader sends a terminating message to all nodes. This requires at most 2n messages.

Oriented mesh
An oriented mesh is a special case where port numbers are compass labels, i.e. north, south, east and west. Leader election in an oriented mesh is trivial. We only need to nominate a corner, e.g. "north" and "east" and make sure that node knows it is a leader.

Torus
A special case of mesh architecture is a torus which is a mesh with "wrap-around". In this structure, every node has exactly 4 connecting edges. One approach to elect a leader in such a structure is known as electoral stages. Similar to procedures in ring structures, this method in each stage eliminates potential candidates until eventually one candidate node is left. This node becomes the leader and then notifies all other processes of termination. This approach can be used to achieve a complexity of O(n). There also more practical approaches introduced for dealing with presence of faulty links in the network.

Election in hypercubes
A Hypercube $$H_k$$ is a network consisting of $$n=2^k$$ nodes, each with degree of $$k$$ and $$O(n \log n)$$ edges. A similar electoral stages as before can be used to solve the problem of leader election. In each stage two nodes (called duelists) compete and the winner is promoted to the next stage. This means in each stage only half of the duelists enter the next stage. This procedure continues until only one duelist is left, and it becomes the leader. Once selected, it notifies all other processes. This algorithm requires $$O(n)$$ messages. In the case of unoriented hypercubes, a similar approach can be used but with a higher message complexity of $$O(n \log \log n)$$.

Election in complete networks
Complete networks are structures in which all processes are connected to one another, i.e., the degree of each node is n-1, n being the size of the network. An optimal solution with O(n) message and space complexity is known. In this algorithm, processes have the following states:
 * 1) Dummy: nodes that do not participate in the leader election algorithm.
 * 2) Passive: the initial state of processes before start.
 * 3) Candidate: the status of nodes after waking up. The candidate nodes will be considered to become the leader.

There is an assumption that although a node does not know the total set of nodes in the system, it is required that in this arrangement every node knows the identifier of its single successor, which is called neighbor, and every node is known by another one.

All processors initially start in a passive state until they are woken up. Once the nodes are awake, they are candidates to become the leader. Based on a priority scheme, candidate nodes collaborate in the virtual ring. At some point, candidates become aware of the identity of candidates that precede them in the ring. The higher priority candidates ask the lower ones about their predecessors. The candidates with lower priority become dummies after replying to the candidates with higher priority. Based on this scheme, the highest priority candidate eventually knows that all nodes in the system are dummies except itself, at which point it knows it is the leader.

The above algorithm is not correct &mdash; it needs further improvement.

Universal leader election techniques
As the name implies, these algorithms are designed to be used in every form of process networks without any prior knowledge of the topology of a network or its properties, such as its size.

Shout
Shout (protocol) builds a spanning tree on a generic graph and elects its root as leader. The algorithm has a total cost linear in the edges cardinality.

Mega-Merger
This technique in essence is similar to finding a Minimum Spanning Tree (MST) in which the root of the tree becomes the leader. The basic idea in this method is individual nodes merge with each other to form bigger structures. The result of this algorithm is a tree (a graph with no cycle) whose root is the leader of entire system. The cost of mega-merger method is $$O(m+n \log n)$$ where m is the number of edges and n is the number of nodes.

Yo-yo
Yo-yo (algorithm) is a minimum finding algorithm consisting of two parts: a preprocessing phase and a series of iterations. In the first phase or setup, each node exchanges its id with all its neighbours and based on the value it orients its incident edges. For instance, if node x has a smaller id than y, x orients towards y. If a node has a smaller id than all its neighbours it becomes a source. In contrast, a node with all inward edges (i.e., with id larger than all of its neighbours) is a sink. All other nodes are internal nodes. Once all the edges are oriented, the iteration phase starts. Each iteration is an electoral stage in which some candidates will be removed. Each iteration has two phases: YO- and –YO. In this phase sources start the process to propagate to each sink the smallest values of the sources connected to that sink.

Yo- -yo
 * 1) A source (local minima) transmits its value to all its out-neighbours
 * 2) An internal node waits to receive a value from all its in-neighbours. It calculates the minimum and sends it to out-neighbour.
 * 3) A sink (a node with no outgoing edge) receives all the values and compute their minimum.
 * 1) A sink sends YES to neighbours from which saw the smallest value and NO to others
 * 2) An internal node sends YES to all in-neighbours from which it received the smallest value and NO to others. If it receives only one NO, it sends NO to all.
 * 3) A source waits until it receives all votes. If all YES, it survives and if not, it is no longer a candidate.
 * 4) When a node x sends NO to an in-neighbour y, the logical direction of that edge is reversed.
 * 5) When a node y receives NO from an out-neighbour, it flips the direction of that link.

After the final stage, any source who receives a NO is no longer a source and becomes a sink. An additional stage, pruning, also is introduced to remove the nodes that are useless, i.e. their existence has no impact on the next iterations. This method has a total cost of O(mlogn) messages. Its real message complexity including pruning is an open research problem and is unknown.
 * 1) If a sink is leaf, then it is useless and therefore is removed.
 * If, in the YO- phase the same value is received by a node from more than one in-neighbour, it will ask all but one to remove the link connecting them.

Radio networks
In radio network protocols, leader election is often used as a first step to approach more advanced communication primitives, such as message gathering or broadcasts. The very nature of wireless networks induces collisions when adjacent nodes transmit at the same time; electing a leader allows to better coordinate this process. While the diameter D of a network is a natural lower bound for the time needed to elect a leader, upper and lower bounds for the leader election problem depend on the specific radio model studied.

Models and runtime
In radio networks, the n nodes may in every round choose to either transmit or receive a message. If no collision detection is available, then a node cannot distinguish between silence or receiving more than one message at a time. Should collision detection be available, then a node may detect more than one incoming message at the same time, even though the messages itself cannot be decoded in that case. In the beeping model, nodes can only distinguish between silence or at least one message via carrier sensing.

Known runtimes for single-hop networks range from a constant (expected with collision detection) to O(n log n) rounds (deterministic and no collision detection). In multi-hop networks, known runtimes differ from roughly O((D+ log n)(log2 log n)) rounds (with high probability in the beeping model), O(D log n) (deterministic in the beeping model), O(n) (deterministic with collision detection) to O(n log3/2 n (log log n)0.5) rounds (deterministic and no collision detection).