Graphlets

Graphlets in mathematics are induced subgraph isomorphism classes in a graph, i.e. two graphlet occurrences are isomorphic, whereas two graphlets are non-isomorphic. Graphlets differ from network motifs in a statistical sense, network motifs are defined as over- or under-represented graphlets with respect to some random graph null model.

Relative graphlet frequency distance
RGF-distance compares the frequencies of the appearance of all 3-5-node graphlets in two networks. Let Ni(G) be the number of graphlets of type $$i$$ ($$i \in \{1,\ldots,29\}$$) in network G, and let $$T(G) = \sum_{i=1}^{29} N_i(G)$$ be the total number of graphlets of G. The "similarity" between two graphs should be independent of the total number of nodes or edges, and should depend only upon the differences between relative frequencies of graphlets. Thus, relative graphlet frequency distance D(G,H) between two graphs G and H is defined as:

$$D(G,H) = \sum_{i=1}^{29} | F_i(G) - F_i(H) |$$,

where $$F_i(G) = -\log(N_i(G)/T(G))$$. The logarithm of the graphlet frequency is used because frequencies of different graphlets can differ by several orders of magnitude and the distance measure should not be entirely dominated by the most frequent graphlets.

Graphlet degree distribution agreement
GDD-agreement generalizes the notion of the degree distribution to the spectrum of graphlet degree distributions (GDDs) in the following way. The degree distribution measures the number of nodes of degree k in graph G, i.e., the number of nodes "touching" k edges, for each value of k. Note that an edge is the only graphlet with two nodes. GDDs generalize the degree distribution to other graphlets: they measure for each 2-5-node graphlet Gi, $$i = 0, 1,..., 29$$, such as a triangle or a square, the number of nodes "touching" k graphlets Gi at a particular node. A node at which a graphlet is "touched" is topologically relevant, since it allows us to distinguish between nodes "touching", for example, a three node path at an end node or at the middle node. This is summarized by automorphism orbits (or just orbits, for brevity): by taking into account the "symmetries" between nodes of a graphlet, there are 73 different orbits across all 2-5-node graphlets (see [Pržulj, 2007] for details).

For each orbit j, one needs to measure the jth GDD, dGj(k), i.e., the distribution of the number of nodes in G "touching" the corresponding graphlet at orbit j k times. Clearly, the degree distribution is the 0th GDD. dGj(k) is scaled as $$S_G^j(k) = \frac{d_G^j(k)}{k}$$ to decrease the contribution of larger degrees in a GDD and then normalized with respect to its total area $$T_G^j = \sum_{k=1}^\infty S_G^j(k)$$ giving the "normalized distribution" $$N_G^j(k) = \frac{S_G^j(k)}{T_G^j}$$.

The jth GDD-agreement compares the jth GDDs of two networks. For two networks G and H and a particular orbit j, the "distance" Dj(G,H) between their normalized jth GDDs is:

$$D^{j}(G,H)=\frac{1}{\sqrt{2}}\left(\sum_{k=1}^{\infty}[N_{G}^{j}(k)-N_{H}^{j}(k)]^{2}\right)^{\frac{1}{2}}$$.

The distance is between 0 and 1, where 0 means that G and H have identical jth GDDs, and 1 means that their jth GDDs are far away. Next, Dj(G,H) is reversed to obtain the jth GDD-agreement:

$$A^j(G,H) = 1 - D^j(G,H)$$, for $$j \in \{0,1,\ldots,72\}$$.

The total GDD-agreement between two networks G and H is the arithmetic or the geometric average of the jth GDD-agreements over all j, i.e.,

$$A_{arith}(G,H) = \frac{1}{73} \sum_{j=0}^{72}A^j(G,H)$$,

and

$$A_{geo}(G,H) = \left(\prod_{j=0}^{72}A^j(G,H)\right)^{\frac{1}{73}}$$,

respectively. GDD-agreement is scaled to always be between 0 and 1, where 1 means that two networks are identical with respect to this property. (See [Pržulj, 2007] for details.)

Graphlet degree vectors (signatures) and signature similarities
This method generalizes the degree of a node, which counts the number of edges that the node touches, into the vector of graphlet degrees, or graphlet degree signature, counting the number of graphlets that the node touches at a particular orbit, for all graphlets on 2 to 5 nodes. The resulting vector of 73 coordinates is the signature of a node that describes the topology of node's neighborhood and captures its interconnectivities out to a distance of 4 (see [Milenković and Pržulj, 2008] for details). The graphlet degree signature of a node provides a highly constraining measure of local topology in its vicinity and comparing the signatures of two nodes provides a highly constraining measure of local topological similarity between them.

The signature similarity is computed as follows. For a node u in graph G, ui denotes the ith coordinate of its signature vector, i.e., ui is the number of times node u is touched by an orbit i in G. The distance Di(u,v) between the ith orbits of nodes u and v is defined as:

$$D_i(u,v) = w_i \times \frac{|\log(u_i + 1) - \log(v_i + 1)|}{\log(\max\{u_i, v_i\} + 2)}$$,

where wi is the weight of orbit i that accounts for dependencies between orbits (see [Milenković and Pržulj, 2008] for details). The total distance D(u,v) between nodes u and v is defined as:

$$D(u,v) = \frac{\sum_{i=0}^{72}D_i}{\sum_{i=0}^{72}w_i}$$.

The distance D(u,v) is in [0, 1), where distance 0 means that signatures of nodes u and v are identical. Finally, the signature similarity, S(u,v), between nodes u and v is:

$$S(u,v) = 1 - D(u,v)$$.

Clearly, a higher signature similarity between two nodes corresponds to a higher topological similarity between their extended neighborhoods (out to distance 4).

Application of graphlet-based network properties
RGF-distance and GDD-agreement were used to evaluate the fit of various network models to real-world networks and to discover a new, well-fitting, geometric random graph model for protein-protein interaction networks,  as well as other types of biological networks, such as protein structure networks, also called residue interaction graphs. These graphlet-based network properties are implemented in GraphCrunch, a software tool for large network analyses and modeling. Alternatively, a parallel implementation is provided in PGD, a software library for computing graphlet-based network properties in large and massive networks.

Graphlet degree vectors (signatures) and signature similarities were applied to biological networks to identify groups (or clusters) of topologically similar nodes in a network and predict biological properties of yet uncharacterized nodes based on known biological properties of characterized nodes. Specifically, they were applied to protein function prediction, cancer gene identification, and discovery of pathways underlying certain biological processes, such as melanogenesis or protein degradation. Additionally, GRAph ALigner (GRAAL), a global network alignment method, used graphlet degree vectors and signature similarities to produce topological alignments of biological networks, without using any information external to network topology.