Wikipedia:Reference desk/Archives/Mathematics/2011 March 7

= March 7 =

== Detecting clumps in weighted graph ==
I have a complete weighted graph. I'd like to run an algorithm to detect 'clumps', but I'm not sure how to define what I'm looking for. For example, consider a complete graph with six nodes arranged in two equilateral triangles ABC, DEF. The distances are d(A,B)=d(A,C)=d(B,C)=1, and likewise for the triangle DEF. Between the two triangles, distances are large, say d(A,D)=d(B,E)=100 (and so on for all other edges between the two triangles). I'd like to conclude that this graph has two clumps, in the sense that, within each triangle, nodes are closer to members of that triangle than to any other node. I am aware of connected component and modular decomposition algorithms, but these don't take weighting into account, and would just return one component / module for my complete graph. Is there some other terminology to describe my clumps? Bonus points if an algorithm is available in an R package. Thanks, SemanticMantis (talk) 14:52, 7 March 2011 (UTC)
 * See Cluster analysis. I have never worked with R but I'm sure it has many clustering algorithms available. If you describe your data I might be able to recommend an algorithm. -- Meni Rosenfeld (talk) 15:15, 7 March 2011 (UTC)
 * Ah, thanks. I was too hung up on graphs to see further. The question is motivated ecologically. Here's more description of the problem if you care to think about it. We have an unweighted bipartite graph of insect nodes in a set I and plant nodes in a set P. Edges in this graph represent pollination. The question we consider is, "Are more closely related insects more likely to interact with more closely related plants?". We have taxonomic distance information for each side of the bipartite graph. My thought was to partition the sets I and P into clusters I_j and P_k, based on this distance, then examine whether the edges from a particular cluster I_j are more likely to land within or between clusters P_k. The clustering part doesn't have much to do with graph theory per se, but the bigger question does. As far as what type of clustering would be appropriate, assume distances within I or P are integers in [1,6], and that there are on the order of hundreds of nodes on each side of the bipartite graph. SemanticMantis (talk) 15:53, 7 March 2011 (UTC)
 * Also, it would be preferable to have the number of clusters not be determined a priori. SemanticMantis (talk) 15:58, 7 March 2011 (UTC)
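One way to make the "clump" idea concrete without fixing the number of clusters in advance is to cut a single-linkage clustering at a distance threshold. That is equivalent to taking connected components of the graph after dropping every edge longer than the cutoff. The following is only a minimal sketch under that assumption: the function name `threshold_clusters`, the helper `dist`, and the cutoff value 10 are illustrative choices, not something proposed in the thread. In R, `hclust` with `method = "single"` followed by `cutree` with a height `h` should give the same grouping.

```python
from itertools import combinations


def threshold_clusters(nodes, dist, cutoff):
    """Single-linkage clusters at height `cutoff`.

    Equivalent to the connected components of the graph that keeps
    only the edges whose distance is <= cutoff.
    """
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in combinations(nodes, 2):
        if dist(u, v) <= cutoff:
            parent[find(u)] = find(v)

    groups = {}
    for v in nodes:
        groups.setdefault(find(v), []).append(v)
    return sorted(sorted(g) for g in groups.values())


# The six-node example from the question: two triangles ABC and DEF.
nodes = list("ABCDEF")
triangle = {v: 0 for v in "ABC"}
triangle.update({v: 1 for v in "DEF"})


def dist(u, v):
    # 1 inside a triangle, 100 between triangles (values from the question).
    return 1 if triangle[u] == triangle[v] else 100


clusters = threshold_clusters(nodes, dist, cutoff=10)  # cutoff is an assumption
print(clusters)  # [['A', 'B', 'C'], ['D', 'E', 'F']]
```

The number of clusters falls out of the cutoff rather than being chosen beforehand. With integer taxonomic distances in [1,6], each cutoff from 1 to 6 gives one level of a nested hierarchy, which also exposes any sub-clusters.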


 * If I wanted to determine if there were clumps in the graph, I would do it "by visual inspection", rather than mathematically, since pattern recognition is one thing our brains are quite good at. However, since that won't constitute scientific proof, you may need to use mathematical methods for that, if proof is required.  One thing to consider is that there may be "sub-clumps".  For example, you may see three main clusters of data, but note that one of the clusters has two sub-clusters within it, plus a number of other data points not in either sub-cluster.  Finding an analysis method that could determine all this might be tricky. StuRat (talk) 22:05, 8 March 2011 (UTC)


 * In your particular case, another question comes up: "How do you determine how closely related a group of insects or plants are?". I would think percentage of DNA in common might work, except that closely genetically related organisms sometimes are very different, due to the ability of a small genetic change to activate or deactivate much larger chunks of DNA. StuRat (talk) 22:09, 8 March 2011 (UTC)
 * For the moment I'm not too worried about relatedness. We're just operating under the assumption that congeneric species are more closely related to each other than species in different genera (and so on for higher-order taxa). While genetic analysis would be nice, it would also be prohibitively expensive and time-consuming. SemanticMantis (talk) 13:47, 9 March 2011 (UTC)


 * I would be skeptical of such a study due to the lack of weights on the insect-plant edges. It seems that this would be able to hide all sorts of unknown biases in the raw data. Imagine, for example, that the grad student who counted pollinators of legumes happened to be particularly good at identifying bee subspecies, whereas the one who handled orchids just tended to jot down "honeybee" or "bumblebee" and leave it at that. That could create an overabundance of edges between the "bee" cluster and the "legume" cluster, compared to the number of edges between "bee" and "orchid" -- even if (under the null hypothesis) the bees themselves have no preference.
 * That's an issue of taxonomic resolution, and we're pretty good on that: we have species-level ID on most, and everything is identified to genus. If we want, we can truncate everything to genus to remove any bias from taxonomic resolution. The weights on insect-plant edges would be frequency of visitation. Actually, we do have qualitative info on that, i.e. 'often', 'common', 'rare', etc. We may try to quantify those, but it might not be necessary to get interesting results. SemanticMantis (talk) 13:47, 9 March 2011 (UTC)
 * That being said, I wonder whether it's necessary to identify clusters at all. How about taking all pairs of edges, plotting the distance between the insects versus the distance between the plants for each pair, and then computing the correlation coefficient in the resulting scatterplot? –Henning Makholm (talk) 22:39, 8 March 2011 (UTC)
 * This is very interesting, but I can't quite see what you're describing. Can you please elaborate? How would the presence or absence of a given insect-plant edge be incorporated into the scatterplot? SemanticMantis (talk) 13:48, 9 March 2011 (UTC)
 * I think biclustering is related to your problem. -- Meni Rosenfeld (talk) 11:29, 9 March 2011 (UTC)
 * This also looks promising, thanks. SemanticMantis (talk) 13:47, 9 March 2011 (UTC)
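One possible reading of the pairs-of-edges suggestion above, offered only as a sketch: each observed insect-plant edge is a point of data, and each unordered pair of edges contributes one scatterplot point (distance between the two insects, distance between the two plants). Absent edges never enter the plot. The toy species names, the three-level distance function `tax_dist`, and the helper `pearson` are all hypothetical and stand in for the real taxonomic distances.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient (undefined if either list is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data: genus of each insect and each plant.
insect_genus = {"i1": "Apis", "i2": "Apis", "i3": "Bombus"}
plant_genus = {"p1": "Vicia", "p2": "Vicia", "p3": "Ophrys"}


def tax_dist(a, b, genus):
    # Assumed toy distance: 0 same species, 1 same genus, 2 otherwise.
    if a == b:
        return 0
    return 1 if genus[a] == genus[b] else 2


# Observed pollination edges (insect, plant); hypothetical.
edges = [("i1", "p1"), ("i2", "p2"), ("i3", "p3")]

# One scatterplot point per unordered pair of edges.
points = [
    (tax_dist(i1, i2, insect_genus), tax_dist(p1, p2, plant_genus))
    for (i1, p1), (i2, p2) in combinations(edges, 2)
]
r = pearson([x for x, _ in points], [y for _, y in points])
print(points, r)
```

A positive `r` would suggest that closely related insects tend to visit closely related plants. The points are not independent, since each edge appears in many pairs, so an ordinary significance test for `r` would not apply directly.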