Double Cut and Join Model



The double cut and join (DCJ) model is a model of genome rearrangement used to define an edit distance between genomes based on gene order and orientation rather than nucleotide sequence. It takes the fundamental units of a genome to be synteny blocks, which are maximal sections of DNA conserved between genomes. It focuses on changes due to genome rearrangement operations such as inversions and translocations, as well as the creation and absorption of circular intermediates.

A genome is described as a directed, edge-labeled graph in which each vertex has degree 1 or 2. Edges are labeled with synteny blocks, vertices of degree 1 represent telomeres, and vertices of degree 2 represent adjacencies between blocks. It follows that the genome consists of cycles and paths. Each such component is called a chromosome. The beginning of each edge is called the tail and the end of each edge is called the head; together, heads and tails are known as extremities. Vertices are described by their roles as heads and tails of blocks. For instance, in the figure, the adjacency formed by the head of marker 1 and the tail of marker 2 is labelled (h1, t2), and the telomere formed by the head of marker 2 is labelled (h2).

A double cut and join (DCJ) operation consists of one of the following four transformations:


 * (i) breaking two adjacencies (a, b) and (c, d) to create two new adjacencies, (a, c) and (b, d)
 * (ii) taking an adjacency (a, b) and a telomere (c) to create a new adjacency and a new telomere, either (a, c) and (b), or (b, c) and (a)
 * (iii) taking two telomeres (a) and (b) and creating a new adjacency (a, b)
 * (iv) breaking an adjacency (a, b) to create the two telomeres (a) and (b)
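
The four operations above can be sketched in Python. This is only an illustrative sketch under stated assumptions: a genome is stored as a frozenset of vertices, each vertex is a frozenset of extremity names such as "h1" or "t2", and the helper names (`adj`, `tel`, `dcj_i` to `dcj_iv`) are hypothetical, not part of any standard library.

```python
def adj(x, y):
    """An adjacency: a vertex holding two extremities."""
    return frozenset((x, y))

def tel(x):
    """A telomere: a vertex holding a single extremity."""
    return frozenset((x,))

def _swap(genome, old, new):
    """Return a copy of `genome` with the vertices `old` replaced by `new`."""
    old, new = set(old), set(new)
    if not old <= genome:
        raise ValueError(f"vertices not in genome: {old - genome}")
    return frozenset((genome - old) | new)

def dcj_i(genome, a, b, c, d):
    """(i): (a, b), (c, d) -> (a, c), (b, d)."""
    return _swap(genome, [adj(a, b), adj(c, d)], [adj(a, c), adj(b, d)])

def dcj_ii(genome, a, b, c):
    """(ii): (a, b), (c) -> (a, c), (b). Swap a and b for the other outcome."""
    return _swap(genome, [adj(a, b), tel(c)], [adj(a, c), tel(b)])

def dcj_iii(genome, a, b):
    """(iii): (a), (b) -> (a, b)."""
    return _swap(genome, [tel(a), tel(b)], [adj(a, b)])

def dcj_iv(genome, a, b):
    """(iv): (a, b) -> (a), (b)."""
    return _swap(genome, [adj(a, b)], [tel(a), tel(b)])

# A linear chromosome with blocks 1, 2, 3, all in forward orientation.
linear = frozenset([tel("t1"), adj("h1", "t2"), adj("h2", "t3"), tel("h3")])

# Operation (i) on the two adjacencies flanking block 2 inverts that block.
inverted = dcj_i(linear, "h1", "t2", "h2", "t3")

# Operation (iii) on the two telomeres closes the chromosome into a circle.
circular = dcj_iii(linear, "t1", "h3")
```

In this sketch, applying `dcj_i(inverted, "h1", "h2", "t2", "t3")` restores the original genome, which illustrates that an operation of type (i) is undone by another operation of type (i).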

An edit distance, the double cut and join distance $$d_{DCJ}(G_1, G_2)$$, is defined between two genomes $$G_1$$ and $$G_2$$ with the same set of edges (synteny blocks) as the minimum number of DCJ operations needed to transform $$G_1$$ into $$G_2$$.

Mathematical Results
The DCJ distance defines a metric on the set of genomes with a given set of synteny blocks. To verify this, we first note that $$d_{DCJ}(G, G)=0$$, since no operations are needed to change $$G$$ into itself, and that if $$G_1\neq G_2$$, then $$d_{DCJ}(G_1, G_2)>0$$, since at least one operation is needed to transform $$G_1$$ into any other genome. (A proof that $$d_{DCJ}(G_1, G_2)$$ is always defined whenever $$G_1$$ and $$G_2$$ are genomes with the same edges will follow.) Next, each operation has an inverse: the inverse of an operation of type (i) or (ii) is again an operation of the same type, and operations of types (iii) and (iv) are inverses of each other. Reversing a shortest sequence of operations from $$G_1$$ to $$G_2$$ therefore gives a sequence of the same length from $$G_2$$ to $$G_1$$, so $$d_{DCJ}(G_1, G_2)=d_{DCJ}(G_2, G_1)$$. Finally, the triangle inequality $$d_{DCJ}(G_1, G_3)\leq d_{DCJ}(G_1, G_2)+d_{DCJ}(G_2, G_3)$$ holds because a series of DCJ operations transforming $$G_1$$ into $$G_2$$ followed by a series of operations transforming $$G_2$$ into $$G_3$$ transforms $$G_1$$ into $$G_3$$, so the minimal number of operations needed to transform $$G_1$$ into $$G_3$$ can be no greater than this sum.

To compute the DCJ distance between two genomes $$G_1$$ and $$G_2$$ with the same set of synteny blocks, we construct a bipartite multigraph known as the adjacency graph $$A=(V(G_1), V(G_2), E)$$ of the genomes. The vertex set of the adjacency graph is the disjoint union of $$V(G_1)$$ and $$V(G_2)$$, where $$V(G_1)$$ is the vertex set of $$G_1$$ and $$V(G_2)$$ is the vertex set of $$G_2$$. For $$a\in V(G_1)$$ and $$b\in V(G_2)$$, we have $$(a, b)\in E$$ if $$a$$ and $$b$$ share an extremity, that is, if the same extremity of some synteny block belongs to both. If $$a$$ and $$b$$ share two extremities, we add two edges between $$a$$ and $$b$$ to $$A$$.
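
The construction of the adjacency graph can be sketched as follows. The representation is an assumption made for illustration: each genome is a list of vertices, each vertex is a tuple of extremity names, and the hypothetical function `adjacency_graph_edges` returns one edge per shared extremity as a pair of vertex indices.

```python
def adjacency_graph_edges(g1, g2):
    """Edges (i, j) of the adjacency graph of g1 and g2.

    Vertex i of g1 and vertex j of g2 are joined once for every
    extremity they share, so a pair sharing two extremities
    contributes two parallel edges.
    """
    # Map each extremity to the index of the g2 vertex that contains it.
    where = {ext: j for j, vertex in enumerate(g2) for ext in vertex}
    return [(i, where[ext]) for i, vertex in enumerate(g1) for ext in vertex]

# Blocks 1, 2 in forward orientation on one linear chromosome.
g1 = [("t1",), ("h1", "t2"), ("h2",)]
# The same blocks with block 2 inverted.
g2 = [("t1",), ("h1", "h2"), ("t2",)]

edges = adjacency_graph_edges(g1, g2)
# [(0, 0), (1, 1), (1, 2), (2, 1)]

same = adjacency_graph_edges(g1, g1)
# [(0, 0), (1, 1), (1, 1), (2, 2)]: the shared adjacency gives a double edge
```

The sketch assumes both genomes contain exactly the same extremities; a `KeyError` is raised if an extremity of `g1` is missing from `g2`.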

If $$G_1=G_2$$, the adjacency graph is composed entirely of paths of length 1, each connecting two telomeres, and cycles of length 2, each connecting two adjacencies. We can use this fact to calculate $$d_{DCJ}$$. Let $$N$$ be the number of synteny blocks in genomes $$G_1$$ and $$G_2$$, $$C$$ be the number of cycles in their adjacency graph, and $$I$$ be the number of paths of odd length (paths with an odd number of edges) in their adjacency graph. Then $$d_{DCJ}(G_1, G_2)=N-(C+I/2).$$ When $$G_1=G_2$$, every adjacency contributes one cycle and every telomere one odd path, so $$C+I/2=N$$ and the distance is 0, as expected. The proof shows that each DCJ operation can decrease $$N-(C+I/2)$$ by no more than 1, and that if $$G_1\neq G_2$$, there exists an operation decreasing $$N-(C+I/2)$$ by 1. This proves that $$d_{DCJ}$$ is always defined, and gives a method for its calculation. Since the adjacency graph can be built and its cycles and paths counted in time linear in the number of synteny blocks, $$d_{DCJ}$$ can be found in linear time.
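
The calculation can be illustrated with a short Python sketch. It assumes each genome is given as a list of vertices, each vertex being a tuple of extremity names, and that both genomes contain the same extremities; the function name `dcj_distance` is hypothetical. Components are found here with a union-find structure, which is near-linear rather than strictly linear, and a component is classified by comparing its edge count with its vertex count: a cycle has as many edges as vertices, a path has one edge fewer.

```python
from collections import Counter

def dcj_distance(g1, g2):
    """Evaluate N - (C + I/2) on the adjacency graph of g1 and g2."""
    verts = [(0, i) for i in range(len(g1))] + [(1, j) for j in range(len(g2))]
    parent = {v: v for v in verts}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # One edge per extremity, joining the g1 vertex and the g2 vertex holding it.
    where2 = {ext: (1, j) for j, vertex in enumerate(g2) for ext in vertex}
    edges = [((0, i), where2[ext]) for i, vertex in enumerate(g1) for ext in vertex]
    for u, w in edges:
        parent[find(u)] = find(w)

    n_vertices = Counter(find(v) for v in verts)
    n_edges = Counter(find(u) for u, _ in edges)

    cycles = sum(1 for r in n_vertices if n_edges[r] == n_vertices[r])
    odd_paths = sum(
        1 for r in n_vertices
        if n_edges[r] == n_vertices[r] - 1 and n_edges[r] % 2 == 1
    )
    n_blocks = len(edges) // 2  # every block has two extremities
    return n_blocks - cycles - odd_paths // 2

# Blocks 1, 2, 3 on a linear chromosome, and the same genome with block 2 inverted.
forward = [("t1",), ("h1", "t2"), ("h2", "t3"), ("h3",)]
inverted = [("t1",), ("h1", "h2"), ("t2", "t3"), ("h3",)]

d_same = dcj_distance(forward, forward)    # N=3, C=2, I=2
d_inv = dcj_distance(forward, inverted)    # N=3, C=1, I=2
```

For the inverted genome the adjacency graph has one cycle through the four adjacencies and two odd paths joining the matching telomeres, so the formula gives 3 - (1 + 2/2) = 1, matching the single inversion that separates the two genomes.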