Finding Communities via Edge-Deletion

1 Community Detection Using the Edge-Betweenness

Another approach to community detection with good sociological bona fides relies on edge betweenness, a concept we covered on the lecture notes on edge centralities. An edge has high betweenness if many other pairs of nodes need to use it to reach one another via shortest paths.

It follows that if we iteratively identify high betweenness edges, removing the top one, recalculate the edge betweenness on the resulting edge-deleted subgraph, and remove that top one (or flip a coin if two are tied), we have an algorithm for identifying communities as those subgraphs that end up being disconnected after removing the bridges (Girvan and Newman 2002). Note that, like the leading eigenvector method, this approach is divisive with all nodes starting in a single cluster, then that cluster splits into two, and so forth.

For instance, the function below implements a simple version of the Girvan-Newman edge-betweenness community partition algorithm.

Code
 eb.comms <- function(h, c, s = 123) {
 d <- components(h)$no #computing graph components
 set.seed(s)
 while(d < c) {
 eb <- edge_betweenness(h, directed = FALSE) #computing edge-betweenness vector
 e <- which(eb == max(eb)) #picking largest value
 if(length(e) > 1) {e <- sample(e, 1)} #grabbing an edge at random in case of tie 
 h <- delete_edges(h, e) #deleting to betwenness edge
 comps <- components(h) #computing graph components
 d <- comps$no #number of components
 }
 return(comps$membership) #returning membership vector
 }

The function takes two inputs: a graph h and the desired number of community partitions, c. Line two initializes the counter d which keeps track of the number of components in the graph (the number of disconnected clumps). Then we enter a while loop on lines 3-11. In line 4, we compute the edge betweenness of each edge in the graph (using the igraph function edge_betweenness), line 5 checks to see which one is the biggest, and lines 6-7 are designed to keep a random one in case of a tie (hence we need to set the seed for reproducibility). Line 8 deletes the selected edge, line 9 computes the number of components in the graph (using the igraph function components which returns a list of results), and line 10 sets the value of d to the number of components in the new edge-deleted subgraph (stored in the object comp$no). The while loop stops when the number of components comm$no is equal to the function argument c. Line 12 returns a vector of community membership assignments.

And here’s the function in action in the law firm friendship nomination network, requesting a partition into eight groups:

Code
 library(networkdata)
 library(igraph)
 g <- law_friends
 g <- subgraph(g, degree(g) >= 2)
 ug <- as_undirected(g, mode = "collapse")
 c8 <- eb.comms(ug, c = 8)
 c8
 [1] 1 1 2 1 1 3 2 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 5 1 6 1 1 1 1 1 1 1 7 1 1 8 8 8 8 8 8

We can check the modularity of this partition using the igraph function modularity.

Code
 modularity(ug, c8)
[1] 0.07147862

The resulting partition is shown in

Note that edge betweenness provides a different picture of the network’s community structure than the eigenvector-based partitioning method. This is not surprising, as it is based on graph-theoretic principles rather than null-model principles such as modularity. The picture here is of two broad communities (in the middle) surrounded by many smaller islands, including some singletons.

In this respect, we can see that the edge-betweenness algorithm begins by selecting the edges, separating pendant nodes and other peripheral nodes into their own communities, which makes sense, since these are the easiest to disconnect (and the edges connecting them to the graph have the highest betweenness). To reach the “middle” or core of the graph, we need to request a partition into more communities. shows the result of using the eb.comms function above to partition the network into sixteen communities. Which looks a bit better, uncovering some actual groups of nodes in the periphery of the network that nominate one another as friends.

2 Combining Modularity Optimization and Edge-Betweenness

Of course, it is possible to combine the modularity optimization idea with the edge-betweenness partition idea. This is helpful because we usually do not know the actual number of communities in the network; we want the algorithm to tell us. So the trick is to modify the function eb.comms above to compute the modularity of a given partition at each step and record it in a vector. That way, we can pick the biggest number after going through all of the edges.

A function that does that looks like this:

Code
 eb.comms2 <- function(h, s = 123) {
 k <- 1 #initializing counter
 qv <- 0 #initializing modularity vector
 C <- list() #initializing membership vector list
 og <- h #original graph before deleting edges
 set.seed(s) #setting seed
 while(ecount(h) > 0) { #do until there are no more edges to delete
 eb <- edge_betweenness(h, directed = FALSE) #computing edge betweenness
 top.e <- which(eb == max(eb)) #grabbing top betweenness edges
 if(length(top.e) > 1) {top.e <- sample(top.e, 1)} #grabbing an edge at random in case of multi-way tie
 h <- delete_edges(h, top.e) #deleting top betweenness edge
 comp <- components(h) #computing number of components of edge-deleted graph
 C[[k]] <- comp$membership #creating new community membership vector based on component membership
 qv[k] <- modularity(og, comp$membership, directed = FALSE) #recomputing modularity of original graph
 k <- k + 1 #increasing counter
 }
 maxqpos <- min(which(qv == max(qv))) #picking partition with maximum modularity
 q <- qv[maxqpos] #picking partition with maximum modularity
 membership <- C[[maxqpos]] #picking membership vector with maximum modularity
 nc <- max(membership)
 return(list(q = q, nc = nc, membership = membership))
 }

The main additions here are lines 13 and 14. Line 13 puts the membership vector for the graph’s components into a list object called C at each step (the list is initialized to an empty list in line 4). Line 14 computes the modularity (using the igraph function modularity) on the original graph (not the vertex-deleted one), called og (get it?), initialized in line 5, but using the membership vector from the components of the vertex-deleted subgraph at a given step as community assignments. After the while loop ends (going through all the edges), lines 17-20 grab the community assignments corresponding to the maximum number stored in the modularity vector qv, and line 21 returns a list containing membership vector stored in the object membership, a scalar recording the number of communities (the maximum number in the membership vector) stored in the object nc, and a scalar recording the maximum modularity value stored in the object q.

Let’s see the new and improved algorithm in action:

Code
 c.mod <- eb.comms2(ug)
 c.mod
$q
[1] 0.2078357

$nc
[1] 23

$membership
 [1]  1  1  2  1  3  4  2  1  1  1  1  1  1  5  6  1  1  3  7  1  1  1  1  1  1
[26]  1  1  3  1  8  1  3  3  1  3  1  1  1  1  1  1  1  9  7 10 11  3  3 12 13
[51] 14 15 16 17  3 18  7 19 20 21 22  1 23 23 23 23 23 23

Which tells us that the optimal partition according to the edge betweenness is one that splits the graph into 23 communities, with a maximum modularity of Q = 0.208.

Of course, all of these steps are packaged in the igraph function called cluster_edge_betweenness. Let’s see how that works.

Code
 eb.res <- cluster_edge_betweenness(ug, directed = FALSE)

Note that the cluster_edge_betweenness function takes a graph as input. We have to tell it that it is an undirected graph by setting the argument directed to FALSE.

And as we can see, the results are exactly the same we obtained earlier, dividing the nodes in the network into 23 communities.
Code
 eb.res$membership
 [1]  1  1  2  1  3  4  2  1  1  1  1  1  1  5  6  1  1  3  7  1  1  1  1  1  1
[26]  1  1  3  1  8  1  3  3  1  3  1  1  1  1  1  1  1  9  7 10 11  3  3 12 13
[51] 14 15 16 17  3 18  7 19 20 21 22  1 23 23 23 23 23 23

Interestingly, even though the Girvan-Newman approach is clearly a divisive algorithm, the results can be presented as a dendrogram that stores top-down divisions rather than bottom-up merges. This is shown in

Code
 hc <- as.hclust(eb.res)
 par(cex = 0.5) #smaller labels
 plot(hc)
Figure 1: Dendrogram of Community Partition of the Law Friends Network Using the Edge-Betweenness.

Another thing we can do with the edge_betweenness function output is highlight the edges identified as bridges during the divisive process, since this information is stored in the bridges object. We can use this vector of edge IDs to create an edge-color vector that highlights the bridges in red. This is shown in .

3 Community Detection Using the Edge Clustering Coefficient

Radicchi et al. (2004) propose another version of an edge-deletion approach to community detection based on an edge-level metric they call the edge clustering coefficient \(C(e_{ij})\). This is essentially a measure of the extent to which an edge connects nodes that themselves share a lot of neighbors, normalized by the minimum of the degrees of the nodes minus one:

\[ C(e_{ij}) = \frac{z_{ij} + 1}{min\left[(k_i - 1), (k_j - 1)\right]} \tag{1}\]

Where \(z_{ij}\) is the number of common neighbors shared by nodes \(i\) and \(j\) incident to edge \(e_{ij}\), and \(k_i\) and \(k_j\) are their respective degrees. Radicchi et al. (2004) add one to the numerator of to account for the fact that some connected nodes may not share any neighbors.

The logic of the Radicchi et al. (2004) approach is that edges between well-clustered communities should have a low edge clustering coefficient (e.g., connecting nodes that themselves don’t share many neighbors and therefore belong to different communities). Meanwhile, edges with a high clustering coefficient should be more likely to link nodes that belong to the same community.

Ergo, deleting the edges with low edge clustering coefficient (like we did with the ones that had high edge betweenness) should partition a graph into well-defined communities.

To implement this approach, we need (1) a function to compute the clustering coefficient for each edge according to , and (2) a modification of the eb.comms2 function above that incorporates the idea of removing edges with minimum edge clustering scores.

To complete step 1, here’s a function to compute the edge clustering coefficient as defined by Radicchi et al. (2004):

Code
 edge.clust <- function(h) {
 d <- degree(h) #degree vector
 edgemat <- ends(h, E(h)) #edge list matrix
 w <- as.matrix(as_adjacency_matrix(h)) #graph adjacency matrix
 m <- w %*% w #commons neighbors matrix
 ec <- apply(edgemat, 1, function(row) { #applies function to each row of matrix edgemat
 i <- row[1] #first node
 j <- row[2] #second node
 z <- m[i, j] #common neighbors
 return((z + 1) / min((d[i] - 1), (d[j] - 1))) #edge clustering coefficient
 }
 )
 return(ec)
 }

This function takes some graph h as input and returns a vector of length \(|E|\) (number of edges in the graph) containing the estimated clustering coefficient of each edge as components. Note that we compute the number of shared neighbors by taking the second power of the adjacency matrix in lines 4 and 5. The apply call then iterates over each row of the edgelist (with the second argument set to 1) and applies the custom edge clustering coefficient function in the third argument.

And to complete step 2, here’s a modification of our edge-deletion function above to incorporate the edge clustering coefficient as the criterion for edge deletion:

Code
 ec.comms <- function(h) {
 qv <- 0 #initializing modularity vector
 C <- list() #initializing membership vector list
 og <- h #original graph before deleting edges
 k <- 1 #initializing counter
 while(ecount(h) > 0) { #do until there are no more edges to delete
 ec <- edge.clust(h) #computing edge clustering coefficient vector
 min.e <- which(ec == min(ec)) #grabbing minimum clustering edges
 h <- delete_edges(h, min.e) #deleting minimum clustering edges
 comp.vec <- components(h)$membership #component membership vector of edge-deleted graph
 C[[k]] <- comp.vec #storing membership vector on list
 qv[k] <- modularity(og, comp.vec, directed = FALSE) #recomputing modularity of original graph using membership vector
 k <- k + 1 #increasing counter
 }
 maxqpos <- min(which(qv == max(qv))) #picking partition with maximum modularity
 q <- qv[maxqpos] #picking partition with maximum modularity
 membership <- C[[maxqpos]] #picking membership vector with maximum modularity
 nc <- max(membership)
 return(list(q = q, nc = nc, membership = membership))
 }

Note that the main difference between this function and the one we used for the edge betweenness is that instead of flipping a coin and selecting a random edge in case of a tie, we now deterministically kill all edges that have minimum clustering (usually zero or a number close to zero) in line 9.

And here are the results for the law_friends network:

Code
 ec <- ec.comms(ug)
 ec
$q
[1] 0.3216964

$nc
[1] 6

$membership
 [1] 1 1 2 2 2 2 2 1 1 1 1 1 1 2 1 1 1 2 3 1 1 1 1 1 1 4 1 2 1 4 2 2 2 4 2 4 1 4
[39] 1 3 3 4 3 3 4 3 2 2 3 5 3 3 3 3 2 2 3 3 3 2 3 3 6 6 6 6 6 6
(a) Community Partition According to the Edge Betweenness (C = 8).
(b) Community Partition According to the Edge Betweenness (C = 16).
(c) Community Partition According to the Edge Betweenness with Maximum Modularity (C = 23).
(d) Community Partition According to the Edge Clustering with Maximum Modularity (C = 6).
Figure 2: Undirected Law Firm Friendship Network.

Which suggests a partition into 6 communities with a modularity score of 0.322 outperforming the best edge-betweenness partition by this metric.

shows the resulting communities obtained from the edge-clustering coefficient approach. As shown in the Figure, the edge-clustering coefficient approach produces strongly defined, intuitive communities, grouping together densely connected groups of nodes.

References

Girvan, Michelle, and Mark EJ Newman. 2002. “Community Structure in Social and Biological Networks.” Proceedings of the National Academy of Sciences 99 (12): 7821–26.
Radicchi, Filippo, Claudio Castellano, Federico Cecconi, Vittorio Loreto, and Domenico Parisi. 2004. “Defining and Identifying Communities in Networks.” Proceedings of the National Academy of Sciences 101 (9): 2658–63.