Finding Overlapping Communities by Optimizing Local Fitness Functions

Lancichinetti et al. (2009) describe a method for detecting overlapping communities in networks by maximizing a local quantity at the level of subgraphs of the larger graph. Let’s see how this works. First, let’s load up the friendship nomination network from Lazega (2001), constrained to be undirected:

Code

library(networkdata)
library(igraph)
g <- law_friends
g <- subgraph(g, degree(g) >= 2)
ug <- as_undirected(g, mode = "collapse")
V(ug)$name <- as.character(1:vcount(ug))

1 The Local Fitness Function

Lancichinetti et al. (2009) propose that we optimize the following function at the subgraph level:

\[ f(G') = \frac{k^{in}}{k^{in} + k^{out}} \tag{1}\]

Here $k^{in}$ is the total internal degree of the subgraph (the sum, over the nodes in the subgraph, of each node’s ties to other nodes in the subgraph, so that every internal edge is counted twice) and $k^{out}$ is the total external degree (the sum, over the same nodes, of each node’s ties to nodes outside the subgraph).
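For reference, the fitness function in Lancichinetti et al. (2009) carries a resolution exponent, written here as $\alpha$ (a symbol not otherwise used in this walkthrough). Equation 1 is the special case $\alpha = 1$:

\[ f(G') = \frac{k^{in}}{\left(k^{in} + k^{out}\right)^{\alpha}} \]

Larger values of $\alpha$ favor smaller communities and smaller values favor larger ones. Everything that follows keeps $\alpha = 1$.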

Here’s an R function that computes the fitness in Equation 1 for any subgraph:

Code

sg.fitness <- function(graph, sub.nodes) {
  degrees.fg <- degree(graph) # degrees in the full graph
  sg <- subgraph(graph, sub.nodes) # subgraph induced by sub.nodes
  degrees.sg <- degree(sg) # degrees counting only within-subgraph ties
  k.i <- sum(degrees.sg) # internal degree
  k.o <- sum(degrees.fg[sub.nodes] - degrees.sg[sub.nodes]) # external degree
  fsg <- k.i / (k.i + k.o)
  return(fsg)
}

And here is the sg.fitness function in action:

Code

fg <- sg.fitness(ug, c("61", "62", "65", "67"))
fg

[1] 0.1333333

Which says that the fitness of the subgraph formed by nodes 61, 62, 65, and 67 is 0.13.
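As a sanity check on Equation 1, the same quantity can be computed by hand from an adjacency matrix. The following is a minimal sketch in base R on a made-up five-node graph. The matrix `A` and the helper `toy.fitness` are illustrative names and are not part of the analysis above.

```r
# Hypothetical toy graph: a triangle (1, 2, 3) with a tail 3-4-5
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5))
A <- matrix(0, 5, 5)
A[edges] <- 1 # fill the upper triangle of the adjacency matrix
A[edges[, 2:1]] <- 1 # mirror it so the graph is undirected

toy.fitness <- function(A, sub) {
  k.all <- sum(A[sub, , drop = FALSE]) # total degree of the subgraph's nodes
  k.in <- sum(A[sub, sub, drop = FALSE]) # ties staying inside (each edge counted twice)
  k.out <- k.all - k.in # ties leaving the subgraph
  k.in / (k.in + k.out)
}

toy.f <- toy.fitness(A, c(1, 2, 3))
toy.f # 6 / 7: three internal edges counted twice, one edge (3-4) leaving
```

The triangle scores close to one because only a single tie leaves it. A subgraph with more outgoing ties than internal ones, such as the one formed by nodes 61, 62, 65, and 67 above, scores much lower.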

2 Relative Node Fitness

Lancichinetti et al. (2009) define the fitness of a node $v$ relative to a given subgraph $G'$ as the difference in the subgraph’s fitness with that node included versus the subgraph’s fitness with that node excluded:

\[ f^v_{G'} = f(G' + v) - f(G' - v) \tag{2}\]

We can write a function to determine the fitness of a node relative to a subgraph according to Equation 2, using our previous sg.fitness function, like this:

Code

node.fit <- function(graph, subgraph, node) {
  plus <- unique(c(subgraph, node))
  minus <- subgraph[!subgraph %in% node]
  fit.i <- sg.fitness(graph, plus)
  fit.o <- sg.fitness(graph, minus)
  return(fit.i - fit.o)
}

So if we wanted to figure out the fitness of node 58 relative to the subgraph formed by nodes 61, 62, 65, and 67 we can just type:

Code

node.fit(ug, c("65", "67", "61", "62"), "58")

[1] 0.02539683

We can see that node 58’s fitness relative to this subgraph is positive, indicating that including it increases the subgraph’s fitness.
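Equation 2 applies both to nodes outside the subgraph (candidates for inclusion) and to nodes already inside it (candidates for removal). A small base R sketch on a hypothetical five-node graph shows both cases. The names `A`, `toy.fitness`, and `toy.node.fit` are made up for illustration.

```r
# Hypothetical toy graph: a triangle (1, 2, 3) with a tail 3-4-5
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5))
A <- matrix(0, 5, 5)
A[rbind(edges, edges[, 2:1])] <- 1 # symmetric adjacency matrix

toy.fitness <- function(A, sub) {
  k.all <- sum(A[sub, , drop = FALSE]) # total degree of the subgraph's nodes
  if (k.all == 0) return(0) # empty subgraph: treat fitness as 0
  sum(A[sub, sub, drop = FALSE]) / k.all
}

toy.node.fit <- function(A, sub, v) {
  # fitness with v included minus fitness with v excluded
  toy.fitness(A, union(sub, v)) - toy.fitness(A, setdiff(sub, v))
}

fit.out <- toy.node.fit(A, c(1, 2, 3), 4) # outside node: 8/9 - 6/7 = 2/63
fit.in <- toy.node.fit(A, c(1, 2, 3), 3) # inside node: 6/7 - 1/2 = 5/14
```

Node 4 adds only a little to the triangle, while removing node 3 would cost the triangle most of its fitness, so node 3’s fitness is much larger.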

3 The Natural Community of Each Node

Lancichinetti et al. (2009) describe an algorithm to determine what they call the “natural community” of a given node. The idea is to start with that node as its own subgraph, add the neighbor with the highest fitness relative to the starting subgraph, and continue adding the subgraph neighbor with the highest fitness relative to the growing subgraph until all remaining subgraph neighbors have a node fitness no greater than $\epsilon$, where $\epsilon$ is a small number (e.g., $\epsilon = 0.01$). After each addition, the fitness of every current member is recomputed and members whose fitness has turned negative are dropped. The final subgraph thus constructed is the seed node’s “natural community.”

How do we do that? First, let’s write a function that takes a list of nodes specifying a subgraph of a larger graph as input and returns all unique neighbors of those nodes as output.

Code

sub.nei <- function(graph, subgraph) {
  nei.list <- lapply(subgraph, function(x) {
    neighbors(graph, x)
  })
  nei.list <- unique(unlist(nei.list))
  nei.list <- nei.list[!nei.list %in% subgraph]
  return(nei.list)
}

So, for instance, if we wanted to see all the neighbors of the earlier subgraph, we would type:

Code

sub.nei(ug, c("65", "67", "61", "62"))

 [1] 40 51 53 63 64 66 68 43 38 41 42 44 45 49 50 54 55 56 57 58 59  4  8 11 13
[26] 24 26 27 31 36 46

Great. Now that we have this, all we need to do is loop through each of the focal node’s neighbors, figure out which one increases the fitness of the subgraph composed of the union of the original node and that neighbor the most, and keep adding neighbors (and neighbors of neighbors), until the remaining nodes don’t add much to the fitness.

Here’s a function that does that:

Code

node.comm <- function(graph, node, s = 123456, e = 0.01) {
  fit.vec <- 1 # initializing node fitness vector
  sub <- node # initializing subgraph
  set.seed(s) # setting seed
  while (sum(fit.vec <= e) != length(fit.vec)) {
    nei <- sub.nei(graph, sub)
    fit.vec <- sapply(nei, function(x) {
      node.fit(graph, sub, x)
    })
    names(fit.vec) <- nei
    max.fit <- names(which(fit.vec == max(fit.vec)))
    if (length(max.fit) > 1) {
      max.fit <- sample(max.fit, 1)
    }
    sub <- unique(c(sub, max.fit))
    sub.fit.vec <- sapply(sub, function(x) {
      node.fit(graph, sub, x)
    })
    low.fit <- names(which(sub.fit.vec < 0))
    if (length(low.fit) != 0) {
      sub <- sub[!sub %in% low.fit]
    }
  }
  sub <- sub[!sub %in% node]
  return(sub)
}

And here are the nodes that belong to node 31’s natural community:

Code

nc31 <- node.comm(ug, "31")
nc31

 [1] "60" "33" "32" "35" "28" "48" "18" "47" "55" "56" "30"

We can see from panel (a) of the figure that this set of nodes (pictured in blue) indeed belongs to a cohesive subgroup proximate to node 31 (pictured in red).
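The greedy growth step can also be sketched in base R on a graph small enough to trace by hand. This is a simplified variant and not the node.comm function above: it checks the stopping rule before adding a node, breaks ties by taking the first maximum, and omits the pruning of negative-fitness members. The graph and the names `toy.fitness` and `toy.grow` are hypothetical.

```r
# Hypothetical toy graph: two triangles (1, 2, 3) and (4, 5, 6) joined by edge 3-4
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5), c(4, 6), c(5, 6))
A <- matrix(0, 6, 6)
A[rbind(edges, edges[, 2:1])] <- 1 # symmetric adjacency matrix

toy.fitness <- function(A, sub) {
  k.all <- sum(A[sub, , drop = FALSE]) # total degree of the subgraph's nodes
  if (k.all == 0) return(0)
  sum(A[sub, sub, drop = FALSE]) / k.all
}

toy.grow <- function(A, seed, e = 0.01) {
  sub <- seed
  repeat {
    # candidates: nodes adjacent to the subgraph but not yet in it
    nei <- setdiff(which(colSums(A[sub, , drop = FALSE]) > 0), sub)
    if (length(nei) == 0) break
    gain <- sapply(nei, function(v) {
      toy.fitness(A, c(sub, v)) - toy.fitness(A, sub)
    })
    if (max(gain) <= e) break # no candidate improves fitness enough
    sub <- c(sub, nei[which.max(gain)]) # add the best candidate
  }
  sort(sub)
}

toy.nc1 <- toy.grow(A, 1) # grows to the first triangle: 1 2 3
toy.nc4 <- toy.grow(A, 4) # grows to the second triangle: 4 5 6
```

Starting from node 1, the subgraph grows to {1, 2} and then {1, 2, 3}. Adding node 4 would lower the fitness from 6/7 to 8/10, so growth stops at the triangle.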

4 An Algorithm to Detect Overlapping Communities

To build overlapping communities, Lancichinetti et al. (2009) recommend that we pick a node at random, compute its natural community, then pick another node at random that is not yet assigned to a community, and repeat until each node has been assigned to at least one community. In R, we can do that by gradually building a list object that contains the nodes assigned to a randomly chosen node’s “natural community” as determined by the node.comm function above across the various iterations. Here’s a function that does that:

Code

fit.comm <- function(graph, s = 123) {
  node.list <- V(graph)$name # initial node list
  comm.list <- list() # initializing the overlapping communities list
  set.seed(s) # setting the seed
  k <- 1
  while (length(node.list) != 0) {
    v <- sample(node.list, 1) # randomly selecting a node from the list
    nc <- node.comm(graph, v) # determining the natural community of the node
    comm.list[[k]] <- c(nc, v) # adding nodes to the list
    node.list <- node.list[!node.list %in% unique(c(unlist(comm.list)))] # deleting the nodes already assigned to a community from the list
    k <- k + 1
  }
  return(comm.list)
}

And here’s the function in action:

Code

fc <- fit.comm(ug)
fc

[[1]]
 [1] "60" "33" "32" "35" "28" "48" "18" "47" "55" "56" "30" "31"

[[2]]
 [1] "56" "55" "48" "47" "18" "35" "28" "32" "6"  "14" "51"

[[3]]
[1] "58" "57" "44" "19" "35" "47" "54"

[[4]]
 [1] "59" "43" "54" "41" "51" "49" "61" "53" "55" "46" "40" "63" "65" "52" "67"
[16] "68" "66" "64" "58" "62"

[[5]]
 [1] "8"  "10" "12" "9"  "21" "27" "17" "20" "13" "24" "4"  "23" "22" "2"  "1" 
[16] "26" "25" "16" "29" "37" "39" "15" "14" "11"

[[6]]
[1] "7"  "5"  "33" "3" 

[[7]]
 [1] "59" "54" "61" "58" "57" "44" "19" "50" "35" "42"

[[8]]
 [1] "43" "29" "34" "38" "42" "59" "36" "39" "26" "37" "41" "40" "46" "62" "54"
[16] "61" "49" "53" "51" "31" "63" "24" "55" "52" "65" "13" "45"

We can see that the algorithm returns a list object containing 8 communities.
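The covering loop is what produces overlap: a node is removed from the candidate list once it is covered, but it can still be absorbed into the natural community of a later seed. The sketch below isolates that loop in base R. It assumes a hypothetical lookup table `toy.nat` in place of a real natural-community routine, and it picks the first uncovered node instead of sampling so that the result is easy to trace.

```r
# Hypothetical lookup standing in for a natural-community routine;
# each entry must contain its own seed or the loop would never finish
toy.nat <- list(
  a = c("a", "b", "c"),
  b = c("a", "b", "c"),
  c = c("b", "c", "d"),
  d = c("c", "d", "e"),
  e = c("d", "e")
)

toy.cover <- function(nodes, nat.fun) {
  comms <- list()
  left <- nodes # nodes not yet covered by any community
  while (length(left) > 0) {
    v <- left[1] # deterministic pick (fit.comm samples at random instead)
    comms[[length(comms) + 1]] <- nat.fun(v)
    left <- setdiff(left, unlist(comms)) # drop every node covered so far
  }
  comms
}

toy.fc <- toy.cover(names(toy.nat), function(v) toy.nat[[v]])
toy.fc # two communities, {a, b, c} and {c, d, e}, overlapping at "c"
```

Only two of the five possible seeds are ever used, because the first community already covers "b" and "c" and the second covers "e".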

We can now create the bi-adjacency matrix corresponding to the two-mode person-by-community network, as we did in the case of the clique percolation method for overlapping community detection:

Code

Z <- matrix(0, vcount(ug), length(fc))
for (i in seq_len(length(fc))) {
  Z[as.numeric(fc[[i]]), i] <- 1
}

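Figure 1: Undirected Law Firm Friendship Network. (a) Plot showing Characteristic Community (Blue Nodes) of Node 31 (Red Node). (b) Node Pie Chart Plot Showing Overlapping Community Memberships.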

Here are the entries for the first 10 people:

Code

Z[1:10, ]

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
 [1,]    0    0    0    0    1    0    0    0
 [2,]    0    0    0    0    1    0    0    0
 [3,]    0    0    0    0    0    1    0    0
 [4,]    0    0    0    0    1    0    0    0
 [5,]    0    0    0    0    0    1    0    0
 [6,]    0    1    0    0    0    0    0    0
 [7,]    0    0    0    0    0    1    0    0
 [8,]    0    0    0    0    1    0    0    0
 [9,]    0    0    0    0    1    0    0    0
[10,]    0    0    0    0    1    0    0    0

And here’s the community overlap matrix:

Code

t(Z) %*% Z

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]   12    8    2    1    0    1    1    2
[2,]    8   11    2    2    1    0    1    2
[3,]    2    2    7    2    0    0    6    1
[4,]    1    2    2   20    0    0    4   15
[5,]    0    1    0    0   24    0    0    6
[6,]    1    0    0    0    0    4    0    0
[7,]    1    1    6    4    0    0   10    4
[8,]    2    2    1   15    6    0    4   27
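To read the overlap matrix: each diagonal entry is the size of a community, and each off-diagonal entry is the number of members two communities share. A minimal base R sketch on a made-up cover of six people (the list `toy.fc` is hypothetical) makes this easy to verify by eye:

```r
# Hypothetical cover of six people by three communities
toy.fc <- list(c(1, 2, 3), c(3, 4, 5), c(5, 6))

Z <- matrix(0, 6, length(toy.fc))
for (i in seq_along(toy.fc)) {
  Z[toy.fc[[i]], i] <- 1 # mark each member's row in community i's column
}

overlap <- crossprod(Z) # same as t(Z) %*% Z
n.memb <- rowSums(Z) # number of communities each person belongs to
bridges <- which(n.memb > 1) # people sitting in more than one community
```

Here `overlap` has 3, 3, and 2 on the diagonal, a 1 for the pairs (1, 2) and (2, 3), and a 0 for the pair (1, 3). The row sums of `Z` identify persons 3 and 5 as the overlapping members.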

And now we can use fancy igraph pie-chart plotting to visualize the overlapping communities. This is shown in panel (b) of the figure.

We can see that community 5 (in blue) is the second-largest community and has little overlap with other communities (except 8). Community 8 (in gray) is the largest but overlaps heavily with others, especially community 4, with which it shares 15 members. Community 6 (in dark orange) is a small, nearly self-contained community that shares a single member with community 1.

References

Lancichinetti, Andrea, Santo Fortunato, and János Kertész. 2009. “Detecting the Overlapping and Hierarchical Community Structure in Complex Networks.” New Journal of Physics 11 (3): 033015.

Lazega, Emmanuel. 2001. The Collegial Phenomenon: The Social Mechanisms of Cooperation Among Peers in a Corporate Law Partnership. Oxford University Press.
