
Kcut is an efficient spectral algorithm for discovering communities in networks by optimizing modularity, a measure of community quality whose optimization is an NP-hard problem. Kcut was first introduced by Jianhua Ruan and Weixiong Zhang. The algorithm combines direct k-way partitioning with recursive 2-way partitioning, finding community structures in large networks efficiently while maintaining their quality. The parameter l denotes the maximal number of partitions considered for each sub-network; l is restricted to small integers to keep the computational cost low. Running the algorithm with small values of l, such as 3 or 4, can improve modularity over the standard bi-partitioning strategy.

History
In the study of complex networks, the detection of community structures has received much attention recently. Social networks, biological networks, and information networks are examples of complex networks. Discovering community structures in these networks helps reveal new characteristics and functionalities, such as hidden communities on the Internet that might engage in harmful activities or pose security threats. Although community discovery resembles the graph partitioning problem, in which groups of vertices are placed in different clusters, its mechanism differs. In graph partitioning, the number or size of partitions is known beforehand, and the partitioning algorithm produces results accordingly; a community discovery algorithm, in contrast, may produce no result if no good communities are found. It is therefore challenging to find the best communities in a complex network without knowing in advance how many communities there are. To assess the quality of a community structure, Newman and Girvan proposed a measure called modularity (Q).

Jianhua Ruan and Weixiong Zhang proposed this spectral algorithm (Kcut), which finds high-quality communities effectively. The algorithm adopts a direct k-way partition strategy, as in the WS algorithm, and computes the best k-way partition using the NJW algorithm, selecting the k that gives the highest Q value.

Algorithm
Kcut follows a greedy approach. The algorithm searches for communities of high modularity, checking at each step whether the Q value increases; if modularity cannot be improved, the network or sub-network is not divided further. Let Γ_k be a partition of a network dividing its vertices into k communities. The modularity is then Q(Γ_k) = Σ_{i=1}^{k} (e_ii/c − (a_i/c)²), where e_ii is the number of edges with both endpoints in community i, a_i is the number of edges with one or both endpoints in community i, and c is the total number of edges. In other words, Q is the fraction of edges that fall within communities minus the expected such fraction if edges were dispersed at random. Higher values of Q indicate stronger, higher-quality communities; empirically, most real-world complex networks have modularity Q > 0.3. The Q function is a quantitative measure for discovering qualitative community structures, and various algorithms find communities in a complex network by optimizing Q. White and Smyth proposed a spectral algorithm (WS) that works efficiently on small networks. It directly applies a k-way spectral algorithm, with k ranging from a minimum K_min to a maximum K_max number of communities, both supplied by the user as input; the value of k that yields the highest Q is taken as the most appropriate number of communities. The original WS algorithm uses the second through k-th largest eigenvectors of a transition matrix P = D⁻¹A, ignoring the first eigenvector because it is a constant vector, 1; its underlying spectral algorithm is therefore equivalent to NJW except for the omission of that first eigenvector. In the minimally modified version considered here, NJW itself is used as the underlying spectral algorithm.
Although the WS algorithm efficiently finds good community structures, it performs poorly on large networks, because it must execute k-means up to K_max times. Kcut is a spectral algorithm for discovering communities that scales well to large networks while remaining effective at finding good communities. The minimally modified WS algorithm is as follows:
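As a concrete illustration, the modularity of a given partition can be computed directly from the definition above. This is a minimal sketch, not the authors' implementation; the edge-list and dictionary representations are illustrative choices.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Q = sum over communities i of (e_ii/c - (a_i/c)^2), where e_ii counts
    edges with both endpoints in community i, a_i counts edges with one or
    both endpoints in community i, and c is the total number of edges."""
    c = len(edges)
    e = defaultdict(int)   # intra-community edge counts
    a = defaultdict(int)   # incident edge counts per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu == cv:
            e[cu] += 1
            a[cu] += 1
        else:
            a[cu] += 1
            a[cv] += 1
    return sum(e[i] / c - (a[i] / c) ** 2 for i in a)

# Two triangles joined by a single bridge edge:
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(modularity(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}))  # 10/49 ≈ 0.204
```

Grouping the six vertices into the two triangles gives Q = 10/49 ≈ 0.204, while the trivial single-community partition gives Q = 0.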
 * 1) For each k with K_min ≤ k ≤ K_max, the NJW algorithm is applied to find a k-way partition, denoted Γ_k.
 * 2) The number of communities is taken to be k* = arg max_k Q(Γ_k), and the best community structure is Γ* = Γ_{k*}.
The Kcut algorithm then proceeds as follows:
 * 1) Γ is initialized as a single cluster containing all vertices, and the modularity Q is set to zero.
 * 2) For each cluster in Γ, let g be the sub-network of G induced by the cluster's vertices.
 * 3) For each integer k from 2 to l, NJW is applied to find a k-way partitioning of g, and the new Q value of the entire network is computed.
 * 4) The k that gives the best (highest) Q value is selected; if partitioning g increases the Q value of the entire network, the new partition is accepted, otherwise it is rejected. The algorithm continues with the next cluster in Γ.

Pseudocode
Suppose G is a network and l is a small integer giving the maximal number of partitions considered in each sub-network g. The Kcut algorithm then executes the following steps.

1  Initialize Γ to be a single cluster with all vertices, and set Q = 0.
2  For each cluster P in Γ:
   (a) Let g be the sub-network of G containing the vertices in P.
   (b) For each integer k from 2 to l:
       i.  Apply NJW to find a k-way partitioning of g, denoted by Γ_k^g.
       ii. Compute the new Q value of the network as Q'_k = Q((Γ \ P) ∪ Γ_k^g).
   (c) Find the k that gives the best Q value, i.e., k* = arg max_k Q'_k.
   (d) If Q'_{k*} > Q, accept the partition by replacing P with Γ_{k*}^g, i.e., Γ = (Γ \ P) ∪ Γ_{k*}^g, and set Q = Q'_{k*}.
   (e) Advance to the next cluster in Γ, if there is any.
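The pseudocode above can be sketched in Python. This is a hedged illustration, not the authors' code: `njw_partition` is a simplified stand-in for the NJW spectral algorithm (normalized affinity matrix, top-k eigenvectors, row normalization, then a tiny k-means with deterministic farthest-point initialization), and the node-list/edge-list graph representation is an arbitrary choice.

```python
import numpy as np

def modularity(edges, comm):
    # Q = sum_i (e_ii/c - (a_i/c)^2), as defined in the Algorithm section.
    c = len(edges)
    e, a = {}, {}
    for u, v in edges:
        cu, cv = comm[u], comm[v]
        if cu == cv:
            e[cu] = e.get(cu, 0) + 1
            a[cu] = a.get(cu, 0) + 1
        else:
            a[cu] = a.get(cu, 0) + 1
            a[cv] = a.get(cv, 0) + 1
    return sum(e.get(i, 0) / c - (a[i] / c) ** 2 for i in a)

def njw_partition(members, edges, k, iters=50):
    # Simplified NJW stand-in: embed vertices with the top-k eigenvectors of
    # the normalized affinity matrix, row-normalize, then run a small k-means
    # with deterministic farthest-point initialization.
    idx = {u: i for i, u in enumerate(members)}
    n = len(members)
    A = np.zeros((n, n))
    for u, v in edges:
        if u in idx and v in idx:
            A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0
    d = A.sum(axis=1)
    d[d == 0] = 1.0
    Dinv = np.diag(d ** -0.5)
    w, vecs = np.linalg.eigh(Dinv @ A @ Dinv)   # eigenvalues ascending
    X = vecs[:, -k:]                             # top-k eigenvectors
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    centers = [X[0]]                             # farthest-point init
    for _ in range(1, k):
        dist = ((X[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(X[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(iters):
        lab = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (lab == j).any():
                centers[j] = X[lab == j].mean(axis=0)
    return {u: int(lab[idx[u]]) for u in members}

def kcut(nodes, edges, partition_fn, l=3):
    # Greedy Kcut driver: repeatedly try to split each cluster into k parts
    # (2 <= k <= l) and keep the split only if it raises the global Q.
    comm = {u: 0 for u in nodes}      # Γ: start with a single cluster
    queue, next_id = [0], 1
    while queue:
        p = queue.pop()
        members = [u for u in nodes if comm[u] == p]
        if len(members) < 2:
            continue
        base_q, best = modularity(edges, comm), None
        for k in range(2, min(l, len(members)) + 1):
            labels = partition_fn(members, edges, k)
            trial, ids = dict(comm), {}
            for u in members:
                ids.setdefault(labels[u], next_id + len(ids))
                trial[u] = ids[labels[u]]
            q = modularity(edges, trial)
            if q > base_q and (best is None or q > best[0]):
                best = (q, trial, len(ids))
        if best:                       # accept only Q-improving splits
            _, comm, used = best
            queue.extend(range(next_id, next_id + used))
            next_id += used
    return comm
```

On a toy graph of two triangles joined by one bridge edge, `kcut(list(range(6)), edges, njw_partition)` separates the two triangles and then stops, since any further split of a triangle lowers Q.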

Proof of correctness
The loop in step 2(b) of the Kcut algorithm is essentially the same as the first step of the WS algorithm; the only difference is in step 2(b)(ii), where the modularity of the whole network G is computed. Re-computing Q does not require iterating over all communities in the network: by the definition of modularity, each community's contribution to the Q of the whole network is independent of the other communities. Therefore, when new communities are created by partitioning g, Q is updated using only those newly created communities. Step 2(c) is crucial, because it finds the partitioning of g that improves Q the most, which is how the algorithm identifies good community structures in the entire network. The next step checks whether partitioning g actually increases the modularity Q; if it does, the partition is accepted. The algorithm terminates when no cluster remains whose partitioning can further improve Q, at which point Γ contains the best community structure found.
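The independence argument can be checked directly: each community's term in Q depends only on that community's own internal and incident edge counts, so re-partitioning one cluster leaves every other community's contribution unchanged. A small sketch (the edge-list and dictionary representations are illustrative, not from the paper):

```python
def contributions(edges, comm):
    """Per-community terms e_ii/c - (a_i/c)^2; modularity Q is their sum."""
    c = len(edges)
    e, a = {}, {}
    for u, v in edges:
        cu, cv = comm[u], comm[v]
        if cu == cv:
            e[cu] = e.get(cu, 0) + 1
            a[cu] = a.get(cu, 0) + 1
        else:
            a[cu] = a.get(cu, 0) + 1
            a[cv] = a.get(cv, 0) + 1
    return {i: e.get(i, 0) / c - (a[i] / c) ** 2 for i in a}

# Two triangles joined by a bridge; split the second triangle in two.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
before = contributions(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
after = contributions(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2})
# Community 0 was untouched, so its term is identical in both partitions:
print(before[0] == after[0])  # True
```

This is why step 2(b)(ii) can update Q using only the terms of P and of the new sub-communities of g.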

Complexity
The inner loop of the Kcut algorithm is the WS algorithm, except for the slight difference in the computation of Q. The WS algorithm has two major parts: computing eigenvectors, and executing k-means to partition the network.

Although WS calls NJW multiple times, all K_max eigenvectors can be obtained by solving the eigen-problem just once. The overall time complexity of WS is thus O(mK + nK²), which can be close to O(n³), because the maximal number of communities K in a sparse network may be linear in n.

The running time of the Kcut algorithm is governed by the depth of its recursive calls. In the worst case the partitions can be highly imbalanced and the recursion depth is simply K, the number of partitions created; on average, however, the depth is close to log_l K, where l is the maximal number of partitions considered by the NJW algorithm. The eigenvector computations then take O(mlh log_l K) time, where h is the number of iterations of the eigen-solver, and the k-means steps take O(nl²e log_l K) time, where e is the number of k-means iterations, for a total of O((mlh + nl²e) log_l K). Since l is small and generally m > nl, this simplifies to O(mlh log_l K) = O(mh · l ln K / ln l), which the experimental results also support. With h a constant, l small, and K = O(n), the total complexity is O(m log n), much smaller than the O(n³) running time of the WS algorithm. The space complexity of Kcut is O(m), linear in the number of edges.
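The simplification above can be written out step by step, using the same symbols (m edges, n vertices, K communities, h eigen-solver iterations, e k-means iterations):

```latex
O\big((mlh + nl^{2}e)\log_l K\big)
  = O\big(mlh \log_l K\big) \quad \text{(since $l$ is small and $m > nl$)}
  = O\!\left(mh \cdot \tfrac{l \ln K}{\ln l}\right)
  = O(m \log n) \quad \text{(with $h = O(1)$, $l = O(1)$, $K = O(n)$)}
```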

Applications
Community discovery algorithms have a vast range of applications across many disciplines. Examples include an acquaintance network in a karate club, the opponent network of American NCAA Division I college football teams in the year 2000, a co-performing network of jazz bands, a protein–protein interaction network of E. coli, the Autonomous Systems topology of the Internet, a collaboration network of physicists, communities in gene networks, automatic detection of tumor types, and communities in newsgroups. In the newsgroup application, the key criterion is to identify groups of users with common interests without looking at the actual content of their messages; to detect such groups, a message-replying network is used in which each vertex is a unique user and an edge between two users indicates that one has replied to a message of the other. In the tumor-type detection application, the Kcut algorithm is used to classify tumors based on their genetic profiles.

Related algorithms
Apart from Kcut, several other algorithms have been developed for discovering communities by modularity optimization. Newman proposed a recursive spectral bi-partitioning algorithm that runs recursively on each sub-network and stops when no further improvement to Q is possible. It is faster on small networks because no k-means is performed; however, the modularity matrix is very dense, so the algorithm has a memory complexity of O(n²) even for sparse networks, whereas Kcut has a memory complexity of O(m). Moreover, its time complexity is O(n² log n), so it does not scale well to large networks. White and Smyth also proposed a fast greedy algorithm, Spectral-2, alongside their optimal WS algorithm; Spectral-2 is recursive and uses spectral bi-partitioning to optimize Q. However, for very large networks it is difficult to estimate the number of communities in advance, and expensive to compute a large number of eigenvectors. There are also several methods that are not spectral-based, such as the edge-betweenness algorithm and the extremal optimization algorithm, but they are quite slow, with time complexities of O(n³) and O(n² log² n), respectively. The CNM algorithm is another greedy approach with much the same time complexity as Kcut, O(m log² n), but it sometimes returns poor-quality communities.