Yannakakis algorithm

The Yannakakis algorithm is an algorithm in database theory for computing the output of an (alpha-)acyclic conjunctive query. The algorithm is named after Mihalis Yannakakis.

High-level description
The algorithm relies on a join tree of the query, which is guaranteed to exist and can be computed in linear time for any acyclic query. A join tree is a tree whose nodes are the query atoms and which satisfies the connectedness (or running intersection) property: for every query variable, the tree nodes that contain that variable form a connected subgraph. The tree can be rooted at any node.
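The connectedness property can be checked mechanically. The following Python sketch is illustrative only: the function name `has_running_intersection` and the encoding of atoms as sets of variable names are assumptions made for this example. It tests whether a given tree over the atoms of the path query R(a, b), S(b, c), T(c, d) is a valid join tree.

```python
from collections import defaultdict

def has_running_intersection(atoms, edges):
    """atoms: dict mapping node name -> set of variables.
    edges: list of (u, v) pairs forming a tree over the nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    variables = set().union(*atoms.values())
    for x in variables:
        # Nodes whose atom mentions the variable x.
        holders = {n for n, vs in atoms.items() if x in vs}
        # Depth-first search that only walks through nodes containing x.
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            n = stack.pop()
            for m in adj[n] & holders:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        if seen != holders:
            return False  # the nodes containing x are not connected
    return True

# Path query R(a, b), S(b, c), T(c, d).
atoms = {"R": {"a", "b"}, "S": {"b", "c"}, "T": {"c", "d"}}
good = has_running_intersection(atoms, [("R", "S"), ("S", "T")])
bad = has_running_intersection(atoms, [("R", "T"), ("T", "S")])
```

In the second tree, R and S both contain the variable b but are separated by T, which does not, so that tree is not a join tree.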

The algorithm materializes a relation for each query atom (this is necessary because the same input relation may be referenced by multiple query atoms) and performs two sweeps over the join tree: one bottom-up (from the leaves to the root) and one top-down (from the root to the leaves). In the bottom-up sweep, the relation of each visited node is semi-joined with the relations of its children; in the top-down sweep, the relation of each visited node is semi-joined with the relation of its parent. After these two sweeps, all spurious tuples, that is, tuples that do not participate in producing any query answer, are guaranteed to have been removed from the relations. A final pass over the relations, performing joins and early projections, produces the query output.
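The two sweeps and the final join pass can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it handles only queries without projection, represents each tuple as a dictionary from variable names to values, and takes the rooted join tree as a `children` mapping. The names `semijoin`, `join`, and `yannakakis_full` are chosen for this example.

```python
def semijoin(left, right):
    """Keep the tuples of `left` that agree with some tuple of `right`
    on the variables the two relations share."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    keys = {tuple(t[v] for v in shared) for t in right}
    return [t for t in left if tuple(t[v] for v in shared) in keys]

def join(left, right):
    """Natural join of two relations using a hash index on `right`."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    index = {}
    for t in right:
        index.setdefault(tuple(t[v] for v in shared), []).append(t)
    return [{**l, **r}
            for l in left
            for r in index.get(tuple(l[v] for v in shared), [])]

def yannakakis_full(relations, children, root):
    # Materialize one relation per atom so the inputs are not modified.
    rels = {name: list(tuples) for name, tuples in relations.items()}
    # Pre-order listing of the tree: every parent precedes its children.
    order, stack = [], [root]
    while stack:
        n = stack.pop()
        order.append(n)
        stack.extend(children.get(n, []))
    # Bottom-up sweep: reduce each node by its (already reduced) children.
    for n in reversed(order):
        for c in children.get(n, []):
            rels[n] = semijoin(rels[n], rels[c])
    # Top-down sweep: reduce each child by its (already reduced) parent.
    for n in order:
        for c in children.get(n, []):
            rels[c] = semijoin(rels[c], rels[n])
    # Final pass: join each node with the results of its subtrees.
    result = {}
    for n in reversed(order):
        acc = rels[n]
        for c in children.get(n, []):
            acc = join(acc, result[c])
        result[n] = acc
    return result[root]

# Path query R(a, b), S(b, c), T(c, d) with join tree R - S - T rooted at R.
relations = {
    "R": [{"a": 1, "b": 1}, {"a": 2, "b": 2}],
    "S": [{"b": 1, "c": 1}, {"b": 2, "c": 5}, {"b": 3, "c": 1}],
    "T": [{"c": 1, "d": 7}, {"c": 1, "d": 8}],
}
children = {"R": ["S"], "S": ["T"]}
answers = yannakakis_full(relations, children, "R")
```

On this input the sweeps discard the tuples (a=2, b=2), (b=2, c=5), and (b=3, c=1) before any join is computed, so the final pass only touches tuples that contribute to an answer.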

Complexity
Let $$|D|$$ be the size of the database (i.e., the total number of tuples across all input relations), $$|Q|$$ the size of the query, and $$|OUT|$$ the number of tuples in the query output.

If the query does not project out any variables (such a query is called a full conjunctive query, a join query, or a conjunctive query with no existential quantifiers), then the complexity of the algorithm is $$O(|Q|(|D| + |OUT|))$$. Assuming a fixed query $$Q$$ (a setting referred to as data complexity), this means that the algorithm's worst-case running time is asymptotically the same as reading the input and writing the output, which is a natural lower bound.

If some variables are projected out in the query, then there is an additional $$|D|$$ factor, making the complexity $$O(|Q||D||OUT|)$$.

Connections to other problems
The algorithm has been influential in database theory, and its core ideas are found in algorithms for other tasks such as enumeration and aggregate computation. An important realization is that the algorithm implicitly operates on the Boolean semiring (the elimination of a tuple corresponds to a False value in the semiring), but its correctness is maintained if we use any other commutative semiring. For example, using the natural numbers semiring, where the operations are addition and multiplication, the same algorithm computes the total number of query answers.
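The counting variant can be illustrated with a short Python sketch. It assumes a query without projection, relations without duplicate tuples, tuples encoded as dictionaries from variable names to values, and a rooted join tree given as a `children` mapping; the function name `count_answers` is chosen for this example. Instead of keeping or dropping a tuple, a single bottom-up pass assigns each tuple a natural number: the number of ways it extends to the atoms in its subtree.

```python
def count_answers(relations, children, root):
    def weights(node):
        rel = relations[node]
        # Every tuple starts with the multiplicative identity 1.
        w = [1] * len(rel)
        for c in children.get(node, []):
            child_rel, child_w = relations[c], weights(c)
            if not rel or not child_rel:
                return [0] * len(rel)
            shared = sorted(set(rel[0]) & set(child_rel[0]))
            # Semiring addition: sum child weights per join key.
            totals = {}
            for t, wt in zip(child_rel, child_w):
                key = tuple(t[v] for v in shared)
                totals[key] = totals.get(key, 0) + wt
            # Semiring multiplication: combine with the parent tuple's weight.
            w = [wi * totals.get(tuple(t[v] for v in shared), 0)
                 for t, wi in zip(rel, w)]
        return w
    # The answer count is the sum of the weights at the root.
    return sum(weights(root))

# Path query R(a, b), S(b, c), T(c, d) with join tree R - S - T rooted at R.
relations = {
    "R": [{"a": 1, "b": 1}, {"a": 2, "b": 2}],
    "S": [{"b": 1, "c": 1}, {"b": 2, "c": 5}, {"b": 3, "c": 1}],
    "T": [{"c": 1, "d": 7}, {"c": 1, "d": 8}],
}
children = {"R": ["S"], "S": ["T"]}
total = count_answers(relations, children, "R")
```

A weight of 0 plays the role that False plays in the Boolean case: a tuple with weight 0 would have been eliminated by the semi-join. The count is obtained without materializing the join result.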