E-graph

In computer science, an e-graph is a data structure that stores an equivalence relation over terms of some language.

Definition and operations
Let $$\Sigma$$ be a set of uninterpreted functions, where $$\Sigma_n$$ is the subset of $$\Sigma$$ consisting of functions of arity $$n$$. Let $$\mathbb{id}$$ be a countable set of opaque identifiers that may be compared for equality, called e-class IDs. The application of $$f\in\Sigma_n$$ to e-class IDs $$i_1, i_2, \ldots, i_n\in\mathbb{id}$$ is denoted $$f(i_1, i_2, \ldots, i_n)$$ and called an e-node.

The e-graph then represents equivalence classes of e-nodes, using the following data structures:


 * A union-find structure $$U$$ representing equivalence classes of e-class IDs, with the usual operations $$\mathrm{find}$$, $$\mathrm{add}$$ and $$\mathrm{merge}$$. An e-class ID $$e$$ is canonical if $$\mathrm{find}(U, e) = e$$; an e-node $$f(i_1,\ldots,i_n)$$  is canonical if each $$i_j$$ is canonical ($$j$$ in $$1,\ldots,n$$).
 * An association of e-class IDs with sets of e-nodes, called e-classes. This consists of
 * a hashcons $$H$$ (i.e. a mapping) from canonical e-nodes to e-class IDs, and
 * an e-class map $$M$$ that maps e-class IDs to e-classes, such that $$M$$ maps equivalent IDs to the same set of e-nodes: $$\forall i, j\in\mathbb{id},M[i]=M[j]\Leftrightarrow \mathrm{find}(U,i)=\mathrm{find}(U,j)$$

Invariants
In addition to the above structure, a valid e-graph conforms to several data structure invariants. Two e-nodes are equivalent if they are in the same e-class. The congruence invariant states that an e-graph must ensure that equivalence is closed under congruence, where two e-nodes $$f(i_1,\ldots,i_n),f(j_1,\ldots,j_n)$$ are congruent when $$\mathrm{find}(U, i_k)=\mathrm{find}(U, j_k),k\in \{1,\ldots,n\}$$. The hashcons invariant states that the hashcons maps canonical e-nodes to their e-class ID.

Operations
E-graphs expose wrappers around the $$\mathrm{add}$$, $$\mathrm{find}$$, and $$\mathrm{merge}$$ operations from the union-find that preserve the e-graph invariants. The last operation, e-matching, is described below.

E-matching
Let $$V$$ be a set of variables and let $$\mathrm{Term}(\Sigma, V)$$ be the smallest set that includes the 0-arity function symbols (also called constants), includes the variables, and is closed under application of the function symbols. In other words, $$\mathrm{Term}(\Sigma, V)$$ is the smallest set such that $$V\subset\mathrm{Term}( \Sigma, V)$$, $$\Sigma_0\subset\mathrm{Term}(\Sigma, V)$$, and when $$x_1, \ldots, x_n\in \mathrm{Term}(\Sigma, V)$$ and $$f\in\Sigma_n$$, then $$f(x_1,\ldots,x_n)\in\mathrm{Term}(\Sigma, V)$$. A term containing variables is called a pattern, a term without variables is called ground.

An e-graph $$E$$ represents a ground term $$t\in\mathrm{Term}(\Sigma, \emptyset)$$ if one of its e-classes represents $$t$$. An e-class $$C$$ represents $$t$$ if some e-node $$f(i_1,\ldots,i_n)\in C$$ does. An e-node $$f(i_1,\ldots,i_n)\in C$$ represents a term $$g(j_1,\ldots,j_n)$$ if $$f=g$$ and each e-class $$M[i_k]$$ represents the term $$j_k$$ ($$k$$ in $$1,\ldots,n$$).

e-matching is an operation that takes a pattern $$p\in\mathrm{Term}(\Sigma, V)$$ and an e-graph $$E$$, and yields all pairs $$(\sigma, C)$$ where $$\sigma\subset V\times\mathbb{id}$$ is a substitution mapping the variables in $$p$$ to e-class IDs and $$C\in\mathbb{id}$$ is an e-class ID such that each term $$\sigma(p)$$ is represented by $$C$$. There are several known algorithms for e-matching, the relational e-matching algorithm is based on worst-case optimal joins and is worst-case optimal.

Complexity

 * An e-graph with n equalities can be constructed in O(n log n) time.

Equality saturation
Equality saturation is a technique for building optimizing compilers using e-graphs. It operates by applying a set of rewrites using e-matching until the e-graph is saturated, a timeout is reached, an e-graph size limit is reached, a fixed number of iterations is exceeded, or some other halting condition is reached. After rewriting, an optimal term is extracted from the e-graph according to some cost function, usually related to AST size or performance considerations.

Applications
E-graphs are used in automated theorem proving. They are a crucial part of modern SMT solvers such as Z3 and CVC4, where they are used to decide the empty theory by computing the congruence closure of a set of equalities, and e-matching is used to instantiate quantifiers. In DPLL(T)-based solvers that use conflict-driven clause learning (also known as non-chronological backtracking), e-graphs are extended to produce proof certificates. E-graphs are also used in the Simplify theorem prover of ESC/Java.

Equality saturation is used in specialized optimizing compilers, e.g. for deep learning and linear algebra. Equality saturation has also been used for translation validation applied to the LLVM toolchain.

E-graphs have been applied to several problems in program analysis, including fuzzing, abstract interpretation, and library learning.