Role Similarities

1 The Role Problem

So far, we have defined distances and similarities primarily as functions of the number of shared neighbors between two nodes. Structural equivalence is an idealized version of this, obtaining whenever people share exactly the same set of neighbors.

Yet, similarities based on local neighborhood information have only been criticized (e.g., by Borgatti and Everett (1992)) for not quite capturing the sociological intuition behind the idea of a role, which is usually what they are deployed for.

That is, two doctors don’t occupy the same role because they treat the same set of patients (in fact, this would be weird); instead, the set of doctors occupy the same role to the extent they treat a set of actors in the same (or similar) role: Namely, patients, regardless of whether those actors are literally the same or not. Note that all the similarity notions studied in the local similarity lecture have structural equivalence as the limiting case of maximum similarity. So they are still based on the idea that nodes are similar to the extent they connect to the same others.

This worry led social network analysts to attempt to generalize the concept of structural equivalence using a less stringent algebraic definition, resulting in a proliferation of different types of equivalence, such as automorphic or regular equivalence (Borgatti and Everett 1992). Unfortunately, none of this work (unlike the work on structural equivalence) led to useful or actionable tools for data analysis, with some even concluding that certain definitions of equivalence (such as regular equivalence) are unlikely to be found in any interesting substantive setting.

A better approach here is to use a more relaxed notion of similarity to arrive at a more general definition that captures our role-based intuitions. One strategy is to define similarity in a more “generalized” way as follows: nodes are similar to the extent they connect to similar others, with the restriction that we can only use endogenous (structural connectivity) information—like with structural equivalence or common-neighbor approaches—to define everyone’s similarity (no exogenous attribute stuff).

2 Jeh and Widom’s SimRank

As Jeh and Widom (2002) note, this approach does lead to a coherent generalization of the idea of similarity for nodes in graphs because we can just define similarity recursively and iterate through the graph until the similarity scores stop changing.¹

More specifically, they propose to measure the similarity between two nodes $i$ and $j$ at each iteration step $t$ ($s_{ij}(t)$) using the formula:

\[ s_{ij}(t) = \frac{\alpha}{d_i d_j} \sum_{k \in N(i)} \sum_{l \in N(j)} s_{kl}(t-1) \tag{1}\]

So the similarity between two nodes at step $t$ is just the sum of the pairwise similarities between each of their neighbors (computed in the previous step, $t-1$), weighted by the ratio of a free parameter $\alpha$ (a number between zero and one) to the product of their in-degrees (to take a weighted average).

This measure nicely captures the idea that nodes are similar to the extent that they are both connected to people who are themselves similar. Note that it doesn’t matter whether these in-neighbors are shared between the two nodes (the summation occurs over each pair of nodes formed by crossing the set $p$ + $q$ against $p$ + $r$ as defined in the local similarity lecture), whether they are themselves neighbors, which deals with the doctor/patient problem we referred to earlier.

A function that implements this idea looks like:

Code

SimRank <- function(w, alpha = 0.8, beta = 1) {
  do <- rowSums(w) # outdegree vector
  di <- colSums(w) # indegree vector
  n <- nrow(w) # number of nodes in the graph
  z <- diag(1, n, n) # diagonal matrix (nodes are maximally similar to themselves)
  if (is.null(rownames(w)) == TRUE) {
    rownames(w) <- 1:nrow(w)
  } # checking if matrix has row names
  if (is.null(colnames(w)) == TRUE) {
    colnames(w) <- 1:nrow(w)
  } # checking if matrix has column names
  rownames(z) <- rownames(w) # naming similarity matrix rows
  colnames(z) <- colnames(w) # naming similarity matrix columns
  onei <- apply(w, 1, function(x) {
    which(x == 1)
  }, simplify = FALSE) # out-neighbors list
  inei <- apply(w, 2, function(x) {
    which(x == 1)
  }, simplify = FALSE) # in-neighbors list
  delta <- 1
  while (delta > 1e-10) { # run while delta is bigger than a tiny number
    z.o <- z # old similarity matrix at t-1
    for (i in 1:n) {
      for (j in 1:n) {
        if (i == j) {
          next
        } # skip when i and j are the same
        min.di <- min(di[i], di[j])
        min.do <- min(do[i], do[j])
        if (min.di != 0 & min.do != 0) { # both nodes have in and out neighbors
          a <- alpha / (di[i] * di[j]) # indegree weight
          b <- sum(as.matrix(z.o[inei[[i]], inei[[j]]])) # sum of similarity scores of in-neighbors in previous iteration
          c <- alpha / (do[i] * do[j]) # outdegree weight
          d <- sum(as.matrix(z.o[onei[[i]], onei[[j]]])) # sum of similarity scores of out-neighbors in previous iteration
          z[i, j] <- beta * a * b + (1 - beta) * c * d # updating current similarity matrix at t
        }
        if (min.di != 0 & min.do == 0) { # one node has no out-neighbors
          a <- alpha / (di[i] * di[j]) # indegree weight
          b <- sum(as.matrix(z.o[inei[[i]], inei[[j]]])) # sum of similarity scores of in-neighbors in previous iteration
          z[i, j] <- beta * a * b # updating current similarity matrix at t
        }
        if (min.di == 0 & min.do != 0) { # one node has no in-neighbors
          c <- alpha / (do[i] * do[j]) # outdegree weight
          d <- sum(as.matrix(z.o[onei[[i]], onei[[j]]])) # sum of similarity scores of out-neighbors in previous iteration
          z[i, j] <- (1 - beta) * c * d # updating current similarity matrix at t
        }
      } # end j for loop
    } # end i for loop
    delta <- abs(sum(abs(z)) - sum(abs(z.o))) # difference between similarity matrices at t and t-1
  } # end while loop
  return(z) # return similarity matrix after convergence
} # end function

This function takes an adjacency matrix (w) as input and returns a generalized similarity matrix (z) as output. Note that at the initial time step $t = 0$, the similarity matrix z is initialized to the identity matrix $\mathbf{I}_{n \times n}$ of the same dimensions as the adjacency matrix, saying that initially all nodes are maximally similar to themselves (which makes sense).

The above is an implementation of the generalized version of SimRank developed by Zhao et al. (2009), called P-Rank, which accounts for similarities between in- and out-neighbors (weighted by the $\beta$ parameter). Jeh and Widom’s (2002) original version of SimRank (which only takes into account in-neighbor similarities) is obtained when $\beta = 1$ (the default in the SimRank function above).

Let us see how it works using the toy example that Jeh and Widom (2002, 540, Figure 2) used in their original paper, shown in Figure 1, which is intended to represent a set of web pages at a university. There is a link from the main University page to the two professors ’ pages, and a link from the professors to the students’ pages. In addition, “Student B” has a link on their webpage to the corresponding professor, and “Student A” has a link to the main university page on their webpage.

We can now test the SimRank function on the toy example:

Code

A <- as.matrix(as_adjacency_matrix(g))
S <- SimRank(A)
round(S, 3)

          Univ ProfA ProfB StudentA StudentB
Univ     1.000 0.000 0.132    0.000    0.034
ProfA    0.000 1.000 0.414    0.000    0.106
ProfB    0.132 0.414 1.000    0.042    0.088
StudentA 0.000 0.000 0.042    1.000    0.331
StudentB 0.034 0.106 0.088    0.331    1.000

Which reproduces the pairwise similarity numbers reported by Jeh and Widom (2002, 540, Figure 2) in their original paper (with $\alpha = 0.8$).

We can transform the similarity matrix into a distance matrix and then perform a hierarchical clustering to obtain the roles:

Code

D <- as.dist(1 - S)
h.res <- hclust(D, method = "ward.D") # hierarchical clustering
plot(h.res)

Which, as we can see, recovers the Professor, Student, and University Roles in the network.

Let’s try it out on more realistic data using the Pulp Fiction graph:

Code

library(networkdata)
library(stringr) # using stringr to change names from all caps to title case
g <- movie_559
V(g)$name <- str_to_title(V(g)$name)
V(g)$name[which(V(g)$name == "Esmarelda")] <- "Esmeralda" # fixing misspelled name
g <- delete_vertices(g, degree(g) < 2) # deleting low degree vertices
A <- as.matrix(as_adjacency_matrix(g))
S <- SimRank(A)
round(S[1:5, 1:5], 2)

            Brett Buddy Butch Capt Koons Ed Sullivan
Brett        1.00  0.17  0.17       0.17        0.17
Buddy        0.17  1.00  0.17       0.27        0.47
Butch        0.17  0.17  1.00       0.17        0.17
Capt Koons   0.17  0.27  0.17       1.00        0.27
Ed Sullivan  0.17  0.47  0.17       0.27        1.00

We can transform the generalized similarities to distances and plot:

Code

D <- as.dist(1 - S)
h.res1 <- hclust(D, method = "ward.D2") # hierarchical clustering
plot(h.res1)

The plot suggests a division into five major node clusters. On the left, we have the people from the diner scene (e.g., Pumpkin and Honey Bunny), followed by a small cluster from the sex dungeon scene (Maynard, The Gimp, and Zed). We have the main character cluster in the middle (Jules, Mia, Vincent, Butch, Marsellus), then the cleanup scene (Jimmie, The Wolf), followed by a cluster of smaller characters.

3 Does SimRank Escape the Role Problem?

Recall that we turned to SimRank because it promised to solve Borgatti and Everett’s (1992) role problem, or the idea that nodes can play the same role in the network even when they don’t share any neighbors.

Now, we know from Equation 1 that SimRank uses information about the neighborhoods of the two nodes, but relaxes the local structural-similarity assumption that these neighbors need to be shared.

However, it is clear that nodes that have completely disjoint neighborhoods would be considered by SimRank to be low similarity even though they could be playing the same roles.² So maybe SimRank doesn’t solve the doctor/patient problem after all.

To build some intuition, consider the graph in Figure 2, which is a directed version of a toy example from Borgatti and Everett (1992, 14, Figure 7(a)). Going by the standard of structural equivalence criterion (see here), we would say that the sets of structurally equivalent nodes are {J, H}, {G, I}, {D, B}, {A, C}, because each pair in brackets has identical in and out neighborhoods. Nodes {E} and {F} should be in their own singleton sets, as they have distinct out-neighborhoods even though they share the same in-neighborhood (each other).

We can ascertain this by reviving our old Euclidean distance function:

Code

d.euclid <- function(w) {
  n <- nrow(w)
  z <- matrix(0, n, n)
  for (i in 1:n) {
    for (j in i:n) {
      if (i == j) {
        next
      }
      d.ij <- 0
      for (k in 1:n) {
        if (i != k & j != k) {
          d.ij <- d.ij + ((w[i, k] - w[j, k])^2 + (w[k, i] - w[k, j])^2)
        }
      }
      z[i, j] <- sqrt(d.ij)
      z[j, i] <- sqrt(d.ij)
    }
  }
  rownames(z) <- rownames(w)
  colnames(z) <- colnames(w)
  return(z)
}

And then using it to cluster the nodes in the Borgatti-Everett graph:

Code

A <- as.matrix(as_adjacency_matrix(g))
E <- d.euclid(A)
E <- as.dist(E) # transforming E to a dist object
h.res <- hclust(E, method = "ward.D2")
plot(h.res)

Figure 3: Structural Equivalence Partition.

We can see in that, indeed, the good old Euclidean distance finds the set of structurally equivalent nodes in the graph (Burt 1976), although it assigns nodes $E$ and $F$ to the same cluster when they should be in their own cluster.

Yet, going by our intuitive idea that two nodes occupy the same position if they are structurally similar (regardless of whether they share the same set of neighbors) we want to say that nodes {J, H, A, C} occupy the same role, as do nodes {G, I, D, B} and nodes {E, F}. So, going by this abstract idea of role, we would expect there to be three role positions in the graph, not five as suggested by the structural equivalence approach.

However, standard approaches to structural equivalence, such as Euclidean distance, will not capture these higher-level similarities; they will only capture the lower-level similarity based on local neighborhood overlap that structural equivalence algorithms are designed to extract.

Can SimRank do better? Let’s see:

Code

S <- SimRank(A, beta = 0.5)
D <- as.dist(1 - S)
h.res <- hclust(D, method = "ward.D2")
plot(h.res)

We specify $\beta = 0.5$ so that SimRank gives equal weight to in and out-neighbors. However, we can see in Figure 4 that Zhao et al.’s (2009) generalized version of SimRank does not identify the more general role classes in the Borgatti-Everett graph. It does correctly assigns {E}, {F} to their separate group, but it creates two larger similarity clusters containing nodes {A, B, C, D} and {G, H, I, J} which make sense from the point of view of proximity or cohesion but not from the point of view of role similarity independently of proximity (Borgatti and Everett 1992).

4 RoleSim

We can see that the problem is that nodes like {J, H, A, C} in Figure 2 (a) are not structurally equivalent (since J and H don’t share any neighbors with A and C). Instead, they are automorphically equivalent (see Borgatti and Everett (1992) for relevant discussion).

While automorphic equivalence is a terrible name coming from the depths of graph theory, it has a simple explanation. Two nodes are automorphically equivalent if we could swap the labels of any pair (e.g., J for A and A for J) in the relevant set in Figure 2 (a), and nothing would change in the graph. That is, every graph theoretic quantity we could compute would remain the same (density, average shortest path length, diameter, etc.).

It is easy to see that nodes inside the set {A, C, J, H} pass this test in Figure 2 (a). For instance, the graph in Figure 2 (b) has the same properties as Figure 2 (a), even though J is now a neighbor of B and D, and A is a neighbor of G and I. Such an evil twin graph is called an automorphism of the original graph.

Of course, this strict graph theoretic notion of automorphic equivalence is too strong for most practical applications outside of a toy graph like Figure 2, since it only returns binary outcomes: Either two nodes are in the same automorphically equivalent set ($s_{ij} = 1$), or they are not ($s_{ij} = 0$), with nothing in between.

What we need instead is a metric of automorphic similarity that admits to gradations so that two nodes could be more or less automorphically similar, with the idea being that the more automorphically similar two nodes are, the less disruption there is to the graph structure when they are exchanged, and with the Platonic ideal case being no disruption whatsoever for strictly automorphically equivalent nodes.

So the problem with a SimRank style measure is that it is not sensitive to automorphic similarity, as it is only sensitive to structural similarity, and we need a measure that is.

Fortunately, such a measure exists, and it’s called RoleSim (Jin et al. 2014). It can be seen as a generalization of SimRank with the difference being that rather than computing the similarity between $i$ and $j$ at iteration $t$ as equivalent to the average similarity of $i$ and $j$’s neighbors at iteration $t-1$, it instead computes the similarity between $i$ and $j$ at iteration $t$ as the matching between the most similar neighbor pairs at iteration $t-1$.

Such a maximum matching is defined as finding the top $k$ similarity scores between $i$ and $j$’s neighbors at $t-1$ (where $k$ is the minimum of the degrees of $i$ and $j$) and adding them up.

In formulese:

\[ s_{ij}(t) = \beta \frac{\sum_{(k,l) \in M_{i,j}} s_{kl}(t-1)}{max(d_i, d_j)} + \left(1 - \beta\right) \tag{2}\]

Where $\beta$ is a parameter chosen by the researcher with the restriction that $0 \geq \beta \leq 1$.

Note that the sum of neighbor similarities in happens only among similarities in the matching set $M_{i, j}$. The idea of a matching set between two pairs of nodes is simple.

Imagine a graph in which node A is connected to nodes {B, C, D} and node H is connected to nodes {I, J, K, L}. Now imagine that the hypothetical matrix below is a running similarity matrix at some iteration $t$ between A’s and H’s neighbors:

     I    J    K    L
B 0.29 0.88 0.53 0.46
C 0.79 0.94 0.89 0.96
D 0.41 0.05 0.55 0.45

A maximal matching selects the maximum numbers across the smallest dimension of the matrix (which in this case is the row dimension, as the matrix has three rows and four columns) and then adds them up.

Code

row.max <- apply(s, 1, max)
round(row.max, 2)

   B    C    D 
0.88 0.96 0.55

Code

match.AH <- sum(row.max)
round(match.AH, 2)

[1] 2.39

We can see that the maximum in row one, corresponding to node B, is 0.88 (the similarity between B and J), the maximum in row two, corresponding to node C, is 0.96 (the similarity between C and J), and the maximum in row three, corresponding to node D, is 0.55 (the similarity between D and K).

This means that the maximum matching for the neighbors of nodes A and H in this hypothetical example is $0.88 + 0.96 + 0.55 = 2.39$, which would be the equivalent of the numerator in the fraction of Equation 2.

Because in a directed graph we have to worry about similarities between in and out neighbors, we modify Equation 2 like this:

\[ s_{ij}(t) = \beta \frac{\sum_{(k,l) \in M^{out}_{i,j}} s_{kl}(t-1) + \sum_{(k,l) \in M^{in}_{i,j}} s_{kl}(t-1)}{max(d^{out}_i, d^{out}_j) + max(d^{in}_i, d^{in}_j)} + \left(1 - \beta\right) \tag{3}\]

Where $M^{out}_{ij}$ is the maximum matching set between $i$ and $j$’s out neighbors and $M^{in}_{ij}$ is the matching set between $i$ and $j$’s in-neighbors. This is the implementation of RoleSim found in Shao et al. (2019).

Here’s a modification of the SimRank function that does all of the above:

Code

RoleSim <- function(w, beta = 0.5) {
  do <- rowSums(w) # outdegree vector
  di <- colSums(w) # indegree vector
  n <- nrow(w) # number of nodes in the graph
  z <- matrix(1, n, n) # all ones matrix (nodes are maximally similar to themselves)
  if (is.null(rownames(w)) == TRUE) {
    rownames(w) <- 1:nrow(w)
  } # checking if matrix has row names
  if (is.null(colnames(w)) == TRUE) {
    colnames(w) <- 1:nrow(w)
  } # checking if matrix has column names
  rownames(z) <- rownames(w) # naming similarity matrix rows
  colnames(z) <- colnames(w) # naming similarity matrix columns
  onei <- apply(w, 1, function(x) {
    which(x == 1)
  }, simplify = FALSE) # out-neighbors list
  inei <- apply(w, 2, function(x) {
    which(x == 1)
  }, simplify = FALSE) # in-neighbors list
  delta <- 1
  while (delta > 1e-10) { # run while delta is bigger than a tiny number
    z.o <- z # old similarity matrix at t-1
    for (i in 1:n) {
      for (j in i:n) {
        if (i == j) {
          next
        } # skip if i = j
        min.di <- min(di[i], di[j]) # minimum indegree between i an j
        max.di <- max(di[i], di[j]) # maximum indegree between i an j
        min.do <- min(do[i], do[j]) # minimum outdegree between i an j
        max.do <- max(do[i], do[j]) # maximum outdegree between i an j
        if (min.di != 0 & min.do != 0) { # both nodes have in and out-neighbors
          k <- z.o[inei[[i]], inei[[j]], drop = FALSE] # creating sub-matrix of similarities between i and j's in-neighbors
          d <- as.numeric(which(c(di[i], di[j]) == min.di)[1]) # selecting minimum of rows or columns of k
          mu.i <- sum(apply(k, d, max)) # maximum in-neighbor similarity matching sum
          k <- z.o[onei[[i]], onei[[j]], drop = FALSE] # creating sub-matrix of similarities between i and j's out-neighbors
          d <- as.numeric(which(c(do[i], do[j]) == min.do)[1]) # selecting minimum of rows or columns of k
          mu.o <- sum(apply(k, d, max)) # maximum out-neighbor similarity matching sum
          z[i, j] <- (beta * (mu.o + mu.i) / (max.do + max.di)) + (1 - beta) # updating similarity matrix
        }
        if (min.di != 0 & min.do == 0) { # one node has no out-neighbors
          k <- z.o[inei[[i]], inei[[j]], drop = FALSE] # creating sub-matrix of similarities between i and j's in-neighbors
          d <- as.numeric(which(c(di[i], di[j]) == min.di)[1]) # selecting minimum of rows or columns of k
          mu.i <- sum(apply(k, d, max)) # maximum in-neighbor similarity matching sum
          z[i, j] <- (beta * (mu.i / max.di)) + (1 - beta) # updating similarity matrix
        }
        if (min.di == 0 & min.do != 0) { # one node has no in-neighbors
          k <- z.o[onei[[i]], onei[[j]], drop = FALSE] # creating sub-matrix of similarities between i and j's out-neighbors
          d <- as.numeric(which(c(do[i], do[j]) == min.do)[1]) # selecting minimum of rows or columns of k
          mu.o <- sum(apply(k, d, max)) # maximum out-neighbor similarity matching sum
          z[i, j] <- (beta * (mu.o / max.do)) + (1 - beta) # updating similarity matrix
        }
      } # end j for loop
    } # end i for loop
    z[lower.tri(z)] <- t(z)[lower.tri(z)] # copying upper triangle into lower triangle
    delta <- sum(abs(abs(z) - abs(z.o))) # difference between similarity matrices at t and t-1
  } # end while loop
  return(z) # return similarity matrix after convergence
} # end function

And now we apply it to the Borgatti/Everett graph:

Code

S <- RoleSim(A)
D <- as.dist(1 - S)
h.res <- hclust(D, method = "ward.D") # hierarchical clustering
plot(h.res)

And we can see in that, indeed, the RoleSim algorithm groups nodes in terms of their automorphic similarity rather than their structural similarity, with nodes {A, C, J, H} being put in the same class, as are nodes {B, D, G, I} and {E, F}.

We can see this clearly if we cut the dendrogram to return three clusters:

Here is the five cluster partition of the dendrogram in Figure 5:

Code

blocks <- cutree(h.res, k = 3)
blocks

A B C D E F G H I J 
1 2 1 2 3 3 2 1 2 1

4.1 Undirected Case

Hanneman & Riddle use the undirected graph in Figure 6 as another example where automorphic equivalence is relevant (the graph is taken from Wasserman and Faust (1994)). Using a (costly) algorithm that identifies the “orbits” of a graph (sets of automorphically equivalent nodes) they note that there are five such positions in Figure 6: Actors $\{A\}$ (the “top” node), $\{C\}$, and $\{G\}$ occupying unique (singleton) positions, with $\{B, D\}$ and $\{E, F, H, I\}$ being sets of automorphically equivalent (exchangeable) nodes.

Figure 7 shows the dendrogram corresponding to the rolesim partition for , with $\beta = 1.0$ (since the graph is undirected).

Code

plot(h.res)

Figure 7: RoleSim Partition of an undirected graph.

We can see that indeed, rolesim identifies the automorphically equivalent classes, assigning $\{A, C, G\}$ to unique positions and grouping the relevant automorphically equivalent nodes into the expected classes. Here is the five cluster partition of the dendrogram in 2:

Code

blocks <- cutree(h.res, k = 5)
blocks

A B C D E F G H I 
1 2 3 2 4 4 5 4 4

Role problem solved!

5 CentSim

As we noted, automorphically equivalent nodes in a graph are completely exchangeable, meaning that we can swap them without changing anything we can compute at the graph level.

The same goes for any quantity we can compute at the node level, which means that automorphically equivalent nodes have the property of being graph-theoretically indistinguishable. That means that they will have the same score on any centrality measure we could compute on them.

So another approach to finding automorphically similar pairs of nodes in a graph is just to compute a bunch of node-level quantities, such as centrality measures, and count nodes as similar if their centrality measures are similar.

Li et al. (2015) built on this observation to compute a role similarity measure called CentSim on each pair of nodes in the graph, by comparing their centrality scores across various measures. Li et al.’s (2015) proposal is that the similarity between $i$ and $j$ is given by:

\[ s_{ij} = \frac{\sum_{k=1}^l \theta_k(i,j)}{l} \tag{4}\]

Where the function $\theta_k$ defined on each pair of nodes $i$, $j$, is given by:

\[ \theta_k(i,j) = \frac{min\left[c_k(i), c_k(j)\right]}{max\left[c_k(i), c_k(j)\right]} \tag{5}\]

Where $c_k(i)$ is the $k^{th}$ centrality score computed on node $i$.

So the idea is to compute a bunch of different centrality scores for each node, then compute the average ratio of the minimum to the maximum score for each pair.

Here’s a function that does that, using degree, closeness, betweenness, and eigenvector centrality:

Code

cent.sim <- function(x) {
  library(igraph)
  c1 <- degree(x)
  c2 <- harmonic_centrality(x)
  c3 <- betweenness(x)
  library(igraph)
  c1 <- degree(x)
  c2 <- harmonic_centrality(x)
  c3 <- betweenness(x)
  c4 <- eigen_centrality(x)$vector
  n <- vcount(x)

  min.max <- function(a, b) {
    if (max(a, b) != 0) {
      r <- min(a, b) / max(a, b)
    } else {
      r <- 1
    }
    return(r)
  }

  z <- diag(1, n, n)
  rownames(z) <- V(x)$name
  colnames(z) <- V(x)$name

  for (i in 1:n) {
    for (j in i:n) {
      w1 <- min.max(c1[i], c1[j])
      w2 <- min.max(c2[i], c2[j])
      w3 <- min.max(c3[i], c3[j])
      w4 <- min.max(c4[i], c4[j])
      z[i, j] <- (w1 + w2 + w3 + w4) / 4
    }
  }

  z[lower.tri(z)] <- t(z)[lower.tri(z)] # copying upper triangle into lower triangle
  return(z)
}

The cent.sim function takes an igraph graph object as input and returns a similarity matrix z as output.

Can CentSim find the automorphically equivalent classes in that we are looking for? Let’s find out:

Code

S <- cent.sim(g)
D <- as.dist(1 - S)
h.res <- hclust(D, method = "ward.D") # hierarchical clustering
plot(h.res)

Indeed it does! As we can see in Figure 8, nodes {A, C, J, H} are put in the same class, as are nodes {B, D, G, I} and {E, F}.

Let us compare the clustering that we obtain in the Pulp Fiction network from a measure that is sensitive to automorphic similarity (like RoleSim) to the one we obtained using SimRank:

Code

g <- movie_559
V(g)$name <- str_to_title(V(g)$name)
V(g)$name[which(V(g)$name == "Esmarelda")] <- "Esmeralda" # fixing misspelled name
g <- delete_vertices(g, degree(g) < 2) # deleting low degree vertices
A <- as.matrix(as_adjacency_matrix(g))
S <- RoleSim(A)
D <- as.dist(1 - S)
h.res2 <- hclust(D, method = "ward.D") # hierarchical clustering
plot(h.res2)

We can see that the RoleSim partition is different from the one offered by SimRank. While the main clusters in the SimRank partition are organized by scene (e.g., all the characters in the diner scene appear together), the RoleSim partition actually organizes the characters according to more abstract roles. So the main characters {Jules, Vincent, Butch} form their own cluster, while {Marsellus, Mia, Pumpkin, Honey Bunny} are also assigned to a similarity cluster, despite not appearing in the same scenes.

References

Borgatti, Stephen P, and Martin G Everett. 1992. “Notions of Position in Social Network Analysis.” Sociological Methodology 22: 1–35.

Burt, Ronald S. 1976. “Positions in Networks.” Social Forces 55 (1): 93–122.

Jeh, Glen, and Jennifer Widom. 2002. “Simrank: A Measure of Structural-Context Similarity.” Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 538–43.

Jin, Ruoming, Victor E Lee, and Longjie Li. 2014. “Scalable and Axiomatic Ranking of Network Role Similarity.” ACM Transactions on Knowledge Discovery from Data (TKDD) 8 (1): 1–37.

Kusumoto, Mitsuru, Takanori Maehara, and Ken-ichi Kawarabayashi. 2014. “Scalable Similarity Search for SimRank.” Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 325–36.

Li, Longjie, Lvjian Qian, Victor E. Lee, Mingwei Leng, Mei Chen, and Xiaoyun Chen. 2015. “Fast and Accurate Computation of Role Similarity via Vertex Centrality.” In Web-Age Information Management, edited by Xin Luna Dong, Xiaohui Yu, Jian Li, and Yizhou Sun. Springer International Publishing.

Shao, Yingxia, Jialin Liu, Shuyang Shi, Yuemei Zhang, and Bin Cui. 2019. “Fast de-Anonymization of Social Networks with Structural Information.” Data Science and Engineering 4 (1): 76–92.

Wasserman, Stanley, and Katherine Faust. 1994. Social Network Analysis: Methods and Applications.

Zhao, Peixiang, Jiawei Han, and Yizhou Sun. 2009. “P-Rank: A Comprehensive Structural Similarity Measure over Information Networks.” Proceedings of the 18th ACM Conference on Information and Knowledge Management, 553–62.

Footnotes

Note the similarity (heh, heh) between this idea and the various status ranking algorithms for computing prestige scores.↩︎
In fact, Kusumoto et al. (2014) show that the SimRank score of two nodes tends towards zero as their geodesic distance increases.↩︎

--- title: "Role Similarities" --- ## The Role Problem So far, we have defined distances and similarities primarily as functions of the number of shared neighbors between two nodes. Structural equivalence is an idealized version of this, obtaining whenever people share exactly the same set of neighbors. Yet, similarities based on local neighborhood information have only been criticized (e.g., by @borgatti_everett92) for not quite capturing the sociological intuition behind the idea of a *role*, which is usually what they are deployed for. That is, two doctors don't occupy the same role because they treat the *same* set of patients (in fact, this would be weird); instead, the set of doctors occupy the same role to the extent they treat a set of actors in the same (or similar) role: Namely, patients, regardless of whether those actors are literally the same or not. Note that all the similarity notions studied in the [local similarity lecture](sim-loc.qmd) have structural equivalence as the limiting case of maximum similarity. So they are still based on the idea that *nodes are similar to the extent they connect to the same others*. This worry led social network analysts to attempt to generalize the concept of structural equivalence using a less stringent algebraic definition, resulting in a proliferation of different types of equivalence, such as **automorphic** or **regular equivalence** [@borgatti_everett92]. Unfortunately, none of this work (unlike the work on structural equivalence) led to useful or actionable tools for data analysis, with some even concluding that certain definitions of equivalence (such as regular equivalence) are unlikely to be found in any interesting substantive setting. A better approach here is to use a more relaxed notion of similarity to arrive at a more general definition that captures our role-based intuitions. One strategy is to define similarity in a more "generalized" way as follows: *nodes are similar to the extent they connect to similar others*, with the restriction that we can only use endogenous (structural connectivity) information---like with structural equivalence or common-neighbor approaches---to define everyone's similarity (no exogenous attribute stuff). ## Jeh and Widom's SimRank As @jeh_widom02 note, this approach does lead to a coherent generalization of the idea of similarity for nodes in graphs because we can just define similarity *recursively* and iterate through the graph until the similarity scores stop changing.^[Note the similarity (heh, heh) between this idea and the various status ranking algorithms for computing [prestige scores](prest-tm-gen.qmd).] More specifically, they propose to measure the similarity between two nodes $i$ and $j$ at each iteration step $t$ ($s_{ij}(t)$) using the formula: $$ s_{ij}(t) = \frac{\alpha}{d_i d_j} \sum_{k \in N(i)} \sum_{l \in N(j)} s_{kl}(t-1) $$ {#eq-simrank} So the similarity between two nodes at step $t$ is just the sum of the pairwise similarities between each of their neighbors (computed in the previous step, $t-1$), weighted by the ratio of a free parameter $\alpha$ (a number between zero and one) to the product of their in-degrees (to take a weighted average). This measure nicely captures the idea that nodes are similar to the extent that they are both connected to people who are themselves similar. Note that it doesn't matter whether these in-neighbors are shared between the two nodes (the summation occurs over each pair of nodes formed by crossing the set $p$ + $q$ against $p$ + $r$ as defined in the [local similarity](sim-loc.qmd) lecture), whether they are themselves neighbors, which deals with the doctor/patient problem we referred to earlier. A function that implements this idea looks like: ```{r} SimRank <- function(w, alpha = 0.8, beta = 1) { do <- rowSums(w) # outdegree vector di <- colSums(w) # indegree vector n <- nrow(w) # number of nodes in the graph z <- diag(1, n, n) # diagonal matrix (nodes are maximally similar to themselves) if (is.null(rownames(w)) == TRUE) { rownames(w) <- 1:nrow(w) } # checking if matrix has row names if (is.null(colnames(w)) == TRUE) { colnames(w) <- 1:nrow(w) } # checking if matrix has column names rownames(z) <- rownames(w) # naming similarity matrix rows colnames(z) <- colnames(w) # naming similarity matrix columns onei <- apply(w, 1, function(x) { which(x == 1) }, simplify = FALSE) # out-neighbors list inei <- apply(w, 2, function(x) { which(x == 1) }, simplify = FALSE) # in-neighbors list delta <- 1 while (delta > 1e-10) { # run while delta is bigger than a tiny number z.o <- z # old similarity matrix at t-1 for (i in 1:n) { for (j in 1:n) { if (i == j) { next } # skip when i and j are the same min.di <- min(di[i], di[j]) min.do <- min(do[i], do[j]) if (min.di != 0 & min.do != 0) { # both nodes have in and out neighbors a <- alpha / (di[i] * di[j]) # indegree weight b <- sum(as.matrix(z.o[inei[[i]], inei[[j]]])) # sum of similarity scores of in-neighbors in previous iteration c <- alpha / (do[i] * do[j]) # outdegree weight d <- sum(as.matrix(z.o[onei[[i]], onei[[j]]])) # sum of similarity scores of out-neighbors in previous iteration z[i, j] <- beta * a * b + (1 - beta) * c * d # updating current similarity matrix at t } if (min.di != 0 & min.do == 0) { # one node has no out-neighbors a <- alpha / (di[i] * di[j]) # indegree weight b <- sum(as.matrix(z.o[inei[[i]], inei[[j]]])) # sum of similarity scores of in-neighbors in previous iteration z[i, j] <- beta * a * b # updating current similarity matrix at t } if (min.di == 0 & min.do != 0) { # one node has no in-neighbors c <- alpha / (do[i] * do[j]) # outdegree weight d <- sum(as.matrix(z.o[onei[[i]], onei[[j]]])) # sum of similarity scores of out-neighbors in previous iteration z[i, j] <- (1 - beta) * c * d # updating current similarity matrix at t } } # end j for loop } # end i for loop delta <- abs(sum(abs(z)) - sum(abs(z.o))) # difference between similarity matrices at t and t-1 } # end while loop return(z) # return similarity matrix after convergence } # end function ``` This function takes an adjacency matrix (`w`) as input and returns a generalized similarity matrix (`z`) as output. Note that at the initial time step $t = 0$, the similarity matrix `z` is initialized to the identity matrix $\mathbf{I}_{n \times n}$ of the same dimensions as the adjacency matrix, saying that initially all nodes are maximally similar to themselves (which makes sense). The above is an implementation of the generalized version of `SimRank` developed by @zhao_etal09, called `P-Rank`, which accounts for similarities between in- and out-neighbors (weighted by the $\beta$ parameter). Jeh and Widom's [-@jeh_widom02] original version of `SimRank` (which only takes into account in-neighbor similarities) is obtained when $\beta = 1$ (the default in the `SimRank` function above). Let us see how it works using the toy example that @jeh_widom02 [p. 540, Figure 2] used in their original paper, shown in @fig-jw, which is intended to represent a set of web pages at a university. There is a link from the main University page to the two professors ' pages, and a link from the professors to the students' pages. In addition, "Student B" has a link on their webpage to the corresponding professor, and "Student A" has a link to the main university page on their webpage. ```{r} #| label: fig-jw #| fig-cap: "Jeh and Widom's Toy Example." #| fig-height: 6 #| fig-width: 6 #| echo: false library(igraph) g <- make_empty_graph(5, directed = TRUE) V(g)$name <- c("Univ", "ProfA", "ProfB", "StudentA", "StudentB") g <- g + edge("Univ", "ProfA") g <- g + edge("Univ", "ProfB") g <- g + edge("ProfA", "StudentA") g <- g + edge("ProfB", "StudentB") g <- g + edge("StudentB", "ProfB") g <- g + edge("StudentA", "Univ") plot(g, vertex.size = 14, vertex.frame.color = "lightgray", vertex.label = V(g)$name, edge.arrow.size = 1, layout = layout.auto, vertex.label.cex = 1, vertex.label.dist = 2 ) ``` We can now test the `SimRank` function on the toy example: ```{r} A <- as.matrix(as_adjacency_matrix(g)) S <- SimRank(A) round(S, 3) ``` Which reproduces the pairwise similarity numbers reported by @jeh_widom02 [p. 540, Figure 2] in their original paper (with $\alpha = 0.8$). We can transform the similarity matrix into a distance matrix and then perform a hierarchical clustering to obtain the roles: ```{r} D <- as.dist(1 - S) h.res <- hclust(D, method = "ward.D") # hierarchical clustering plot(h.res) ``` Which, as we can see, recovers the Professor, Student, and University Roles in the network. Let's try it out on more realistic data using the *Pulp Fiction* graph: ```{r} library(networkdata) library(stringr) # using stringr to change names from all caps to title case g <- movie_559 V(g)$name <- str_to_title(V(g)$name) V(g)$name[which(V(g)$name == "Esmarelda")] <- "Esmeralda" # fixing misspelled name g <- delete_vertices(g, degree(g) < 2) # deleting low degree vertices A <- as.matrix(as_adjacency_matrix(g)) S <- SimRank(A) round(S[1:5, 1:5], 2) ``` We can transform the generalized similarities to distances and plot: ```{r} #| fig-height: 10 #| fig-width: 10 D <- as.dist(1 - S) h.res1 <- hclust(D, method = "ward.D2") # hierarchical clustering plot(h.res1) ``` The plot suggests a division into five major node clusters. On the left, we have the people from the diner scene (e.g., Pumpkin and Honey Bunny), followed by a small cluster from the sex dungeon scene (Maynard, The Gimp, and Zed). We have the main character cluster in the middle (Jules, Mia, Vincent, Butch, Marsellus), then the cleanup scene (Jimmie, The Wolf), followed by a cluster of smaller characters. ## Does SimRank Escape the Role Problem? Recall that we turned to SimRank because it promised to solve Borgatti and Everett's [-@borgatti_everett92] *role problem*, or the idea that nodes can play the same role in the network even when they don't share *any* neighbors. Now, we know from @eq-simrank that *SimRank* uses information about the neighborhoods of the two nodes, but relaxes the local structural-similarity assumption that these neighbors need to be *shared*. However, it is clear that nodes that have *completely disjoint* neighborhoods would be considered by *SimRank* to be low similarity even though they could be playing the same *roles*.^[In fact, @kusumoto_etal14 show that the `SimRank` score of two nodes tends towards zero as their geodesic distance increases.] So maybe *SimRank* doesn't solve the doctor/patient problem after all. ```{r} #| label: fig-be #| fig-height: 8 #| fig-width: 8 #| fig-cap: "Borgatti and Everett's Thinking Graph." #| fig-subcap: #| - "Original Graph." #| - "An Automorphism of the Original Graph." #| layout-ncol: 2 #| echo: false g <- make_empty_graph(10, directed = TRUE) V(g)$name <- LETTERS[1:10] g <- g + edge("A", "B") g <- g + edge("A", "C") g <- g + edge("A", "D") g <- g + edge("B", "C") g <- g + edge("B", "E") g <- g + edge("C", "A") g <- g + edge("C", "B") g <- g + edge("C", "D") g <- g + edge("D", "A") g <- g + edge("D", "E") g <- g + edge("E", "F") g <- g + edge("F", "E") g <- g + edge("G", "F") g <- g + edge("I", "F") g <- g + edge("J", "I") g <- g + edge("H", "G") g <- g + edge("H", "J") g <- g + edge("J", "H") g <- g + edge("J", "G") g <- g + edge("H", "I") g <- g + edge("I", "H") g <- g + edge("G", "J") set.seed(123) plot(g, vertex.size = 12, vertex.frame.color = "lightgray", layout = layout.auto, vertex.label.cex = 1.5, edge.arrow.size = 1 ) h <- g V(h)$name <- c("J", "B", "H", "D", "E", "F", "G", "C", "I", "A") set.seed(123) plot(h, vertex.size = 12, vertex.frame.color = "lightgray", layout = layout.auto, vertex.label.cex = 1.5, edge.arrow.size = 1 ) ``` To build some intuition, consider the graph in @fig-be, which is a directed version of a toy example from @borgatti_everett92 [p. 14, Figure 7(a)]. Going by the standard of *structural equivalence* criterion (see [here](sim-structequiv.qmd)), we would say that the sets of structurally equivalent nodes are {J, H}, {G, I}, {D, B}, {A, C}, because each pair in brackets has *identical* in and out neighborhoods. Nodes {E} and {F} should be in their own singleton sets, as they have distinct out-neighborhoods even though they share the same in-neighborhood (each other). We can ascertain this by reviving our old Euclidean distance function: ```{r} d.euclid <- function(w) { n <- nrow(w) z <- matrix(0, n, n) for (i in 1:n) { for (j in i:n) { if (i == j) { next } d.ij <- 0 for (k in 1:n) { if (i != k & j != k) { d.ij <- d.ij + ((w[i, k] - w[j, k])^2 + (w[k, i] - w[k, j])^2) } } z[i, j] <- sqrt(d.ij) z[j, i] <- sqrt(d.ij) } } rownames(z) <- rownames(w) colnames(z) <- colnames(w) return(z) } ``` And then using it to cluster the nodes in the Borgatti-Everett graph: ```{r} #| label: fig-se #| fig-height: 6 #| fig-width: 6 #| fig-cap: "Structural Equivalence Partition." A <- as.matrix(as_adjacency_matrix(g)) E <- d.euclid(A) E <- as.dist(E) # transforming E to a dist object h.res <- hclust(E, method = "ward.D2") plot(h.res) ``` We can see in that, indeed, the good old Euclidean distance finds the set of structurally equivalent nodes in the graph [@burt76], although it assigns nodes $E$ and $F$ to the same cluster when they should be in their own cluster. Yet, going by our intuitive idea that two nodes occupy the same position if they are structurally *similar* (regardless of whether they share the same set of neighbors) we want to say that nodes {J, H, A, C} occupy the same *role*, as do nodes {G, I, D, B} and nodes {E, F}. So, going by this abstract idea of role, we would expect there to be *three* role positions in the graph, not five as suggested by the structural equivalence approach. However, standard approaches to structural equivalence, such as Euclidean distance, will not capture these higher-level similarities; they will only capture the lower-level similarity based on local neighborhood overlap that structural equivalence algorithms are designed to extract. Can `SimRank` do better? Let's see: ```{r} #| label: fig-sr #| fig-height: 6 #| fig-width: 6 #| fig-cap: "SimRank Partition." S <- SimRank(A, beta = 0.5) D <- as.dist(1 - S) h.res <- hclust(D, method = "ward.D2") plot(h.res) ``` We specify $\beta = 0.5$ so that `SimRank` gives equal weight to in and out-neighbors. However, we can see in @fig-sr that Zhao et al.'s [-@zhao_etal09] generalized version of `SimRank` does not identify the more general role classes in the Borgatti-Everett graph. It does correctly assigns {E}, {F} to their separate group, but it creates two larger similarity clusters containing nodes {A, B, C, D} and {G, H, I, J} which make sense from the point of view of *proximity* or *cohesion* but not from the point of view of role similarity independently of proximity [@borgatti_everett92]. ## RoleSim We can see that the problem is that nodes like {J, H, A, C} in @fig-be-1 are not structurally equivalent (since J and H don't share any neighbors with A and C). Instead, they are **automorphically equivalent** (see @borgatti_everett92 for relevant discussion). While automorphic equivalence is a terrible name coming from the depths of graph theory, it has a simple explanation. Two nodes are automorphically equivalent if we could swap the labels of any pair (e.g., J for A and A for J) in the relevant set in @fig-be-1, and *nothing* would change in the graph. That is, *every* graph theoretic quantity we could compute would remain the same (density, average shortest path length, diameter, etc.). It is easy to see that nodes inside the set {A, C, J, H} pass this test in @fig-be-1. For instance, the graph in @fig-be-2 has the same properties as @fig-be-1, even though J is now a neighbor of B and D, and A is a neighbor of G and I. Such an evil twin graph is called an **automorphism** of the original graph. Of course, this strict graph theoretic notion of automorphic equivalence is too strong for most practical applications outside of a toy graph like @fig-be, since it only returns binary outcomes: Either two nodes are in the same automorphically equivalent set ($s_{ij} = 1$), or they are not ($s_{ij} = 0$), with nothing in between. What we need instead is a metric of **automorphic similarity** that admits to gradations so that two nodes could be *more or less* automorphically similar, with the idea being that the more automorphically similar two nodes are, the less disruption there is to the graph structure when they are exchanged, and with the Platonic ideal case being *no* disruption whatsoever for strictly automorphically equivalent nodes. So the problem with a `SimRank` style measure is that it is not sensitive to automorphic similarity, as it is only sensitive to structural similarity, and we need a measure that is. Fortunately, such a measure exists, and it's called `RoleSim` [@jin_etal14]. It can be seen as a generalization of `SimRank` with the difference being that rather than computing the similarity between $i$ and $j$ at iteration $t$ as equivalent to the *average* similarity of $i$ and $j$'s neighbors at iteration $t-1$, it instead computes the similarity between $i$ and $j$ at iteration $t$ as the *matching* between the most similar neighbor pairs at iteration $t-1$. Such a maximum matching is defined as finding the top $k$ similarity scores between $i$ and $j$'s neighbors at $t-1$ (where $k$ is the minimum of the degrees of $i$ and $j$) and adding them up. In formulese: $$ s_{ij}(t) = \beta \frac{\sum_{(k,l) \in M_{i,j}} s_{kl}(t-1)}{max(d_i, d_j)} + \left(1 - \beta\right) $$ {#eq-rolesim} Where $\beta$ is a parameter chosen by the researcher with the restriction that $0 \geq \beta \leq 1$. Note that the sum of neighbor similarities in happens only among similarities in the matching set $M_{i, j}$. The idea of a matching set between two pairs of nodes is simple. Imagine a graph in which node A is connected to nodes {B, C, D} and node H is connected to nodes {I, J, K, L}. Now imagine that the hypothetical matrix below is a running similarity matrix at some iteration $t$ between A's and H's neighbors: ```{r} #| echo: false set.seed(123) s <- matrix(runif(12), 3, 4) rownames(s) <- c("B", "C", "D") colnames(s) <- c("I", "J", "K", "L") round(s, 2) ``` A *maximal matching* selects the *maximum numbers* across the smallest dimension of the matrix (which in this case is the row dimension, as the matrix has three rows and four columns) and then adds them up. ```{r} row.max <- apply(s, 1, max) round(row.max, 2) match.AH <- sum(row.max) round(match.AH, 2) ``` We can see that the maximum in row one, corresponding to node B, is 0.88 (the similarity between B and J), the maximum in row two, corresponding to node C, is 0.96 (the similarity between C and J), and the maximum in row three, corresponding to node D, is 0.55 (the similarity between D and K). This means that the maximum matching for the neighbors of nodes A and H in this hypothetical example is $0.88 + 0.96 + 0.55 = 2.39$, which would be the equivalent of the numerator in the fraction of @eq-rolesim. Because in a directed graph we have to worry about similarities between in and out neighbors, we modify @eq-rolesim like this: $$ s_{ij}(t) = \beta \frac{\sum_{(k,l) \in M^{out}_{i,j}} s_{kl}(t-1) + \sum_{(k,l) \in M^{in}_{i,j}} s_{kl}(t-1)}{max(d^{out}_i, d^{out}_j) + max(d^{in}_i, d^{in}_j)} + \left(1 - \beta\right) $$ {#eq-rolesimplus} Where $M^{out}_{ij}$ is the maximum matching set between $i$ and $j$'s out neighbors and $M^{in}_{ij}$ is the matching set between $i$ and $j$'s in-neighbors. This is the implementation of `RoleSim` found in @shao_etal19. Here's a modification of the `SimRank` function that does all of the above: ```{r} RoleSim <- function(w, beta = 0.5) { do <- rowSums(w) # outdegree vector di <- colSums(w) # indegree vector n <- nrow(w) # number of nodes in the graph z <- matrix(1, n, n) # all ones matrix (nodes are maximally similar to themselves) if (is.null(rownames(w)) == TRUE) { rownames(w) <- 1:nrow(w) } # checking if matrix has row names if (is.null(colnames(w)) == TRUE) { colnames(w) <- 1:nrow(w) } # checking if matrix has column names rownames(z) <- rownames(w) # naming similarity matrix rows colnames(z) <- colnames(w) # naming similarity matrix columns onei <- apply(w, 1, function(x) { which(x == 1) }, simplify = FALSE) # out-neighbors list inei <- apply(w, 2, function(x) { which(x == 1) }, simplify = FALSE) # in-neighbors list delta <- 1 while (delta > 1e-10) { # run while delta is bigger than a tiny number z.o <- z # old similarity matrix at t-1 for (i in 1:n) { for (j in i:n) { if (i == j) { next } # skip if i = j min.di <- min(di[i], di[j]) # minimum indegree between i an j max.di <- max(di[i], di[j]) # maximum indegree between i an j min.do <- min(do[i], do[j]) # minimum outdegree between i an j max.do <- max(do[i], do[j]) # maximum outdegree between i an j if (min.di != 0 & min.do != 0) { # both nodes have in and out-neighbors k <- z.o[inei[[i]], inei[[j]], drop = FALSE] # creating sub-matrix of similarities between i and j's in-neighbors d <- as.numeric(which(c(di[i], di[j]) == min.di)[1]) # selecting minimum of rows or columns of k mu.i <- sum(apply(k, d, max)) # maximum in-neighbor similarity matching sum k <- z.o[onei[[i]], onei[[j]], drop = FALSE] # creating sub-matrix of similarities between i and j's out-neighbors d <- as.numeric(which(c(do[i], do[j]) == min.do)[1]) # selecting minimum of rows or columns of k mu.o <- sum(apply(k, d, max)) # maximum out-neighbor similarity matching sum z[i, j] <- (beta * (mu.o + mu.i) / (max.do + max.di)) + (1 - beta) # updating similarity matrix } if (min.di != 0 & min.do == 0) { # one node has no out-neighbors k <- z.o[inei[[i]], inei[[j]], drop = FALSE] # creating sub-matrix of similarities between i and j's in-neighbors d <- as.numeric(which(c(di[i], di[j]) == min.di)[1]) # selecting minimum of rows or columns of k mu.i <- sum(apply(k, d, max)) # maximum in-neighbor similarity matching sum z[i, j] <- (beta * (mu.i / max.di)) + (1 - beta) # updating similarity matrix } if (min.di == 0 & min.do != 0) { # one node has no in-neighbors k <- z.o[onei[[i]], onei[[j]], drop = FALSE] # creating sub-matrix of similarities between i and j's out-neighbors d <- as.numeric(which(c(do[i], do[j]) == min.do)[1]) # selecting minimum of rows or columns of k mu.o <- sum(apply(k, d, max)) # maximum out-neighbor similarity matching sum z[i, j] <- (beta * (mu.o / max.do)) + (1 - beta) # updating similarity matrix } } # end j for loop } # end i for loop z[lower.tri(z)] <- t(z)[lower.tri(z)] # copying upper triangle into lower triangle delta <- sum(abs(abs(z) - abs(z.o))) # difference between similarity matrices at t and t-1 } # end while loop return(z) # return similarity matrix after convergence } # end function ``` And now we apply it to the Borgatti/Everett graph: ```{r} #| label: fig-rs #| fig-height: 6 #| fig-width: 6 #| fig-cap: "RoleSim Partition." S <- RoleSim(A) D <- as.dist(1 - S) h.res <- hclust(D, method = "ward.D") # hierarchical clustering plot(h.res) ``` And we can see in that, indeed, the `RoleSim` algorithm groups nodes in terms of their automorphic similarity rather than their structural similarity, with nodes {A, C, J, H} being put in the same class, as are nodes {B, D, G, I} and {E, F}. We can see this clearly if we cut the dendrogram to return three clusters: Here is the five cluster partition of the dendrogram in @fig-rs: ```{r} blocks <- cutree(h.res, k = 3) blocks ``` ### Undirected Case [Hanneman & Riddle](https://faculty.ucr.edu/~hanneman/nettext/C14_Automorphic_Equivalence.html) use the undirected graph in @fig-wf as another example where automorphic equivalence is relevant (the graph is taken from @wasserman_faust94). Using a (costly) algorithm that identifies the "orbits" of a graph (sets of automorphically equivalent nodes) they note that there are five such positions in @fig-wf: Actors $\{A\}$ (the "top" node), $\{C\}$, and $\{G\}$ occupying unique (singleton) positions, with $\{B, D\}$ and $\{E, F, H, I\}$ being sets of automorphically equivalent (exchangeable) nodes. ```{r} #| label: fig-wf #| fig-height: 8 #| fig-width: 8 #| fig-cap: "An undirected graph." #| echo: false h <- make_empty_graph(9, directed = FALSE) V(h)$name <- LETTERS[1:9] h <- h + edge("A", "B") h <- h + edge("A", "C") h <- h + edge("A", "D") h <- h + edge("B", "E") h <- h + edge("B", "F") h <- h + edge("C", "G") h <- h + edge("D", "H") h <- h + edge("D", "I") plot(h, vertex.size = 18, vertex.frame.color = "lightgray", layout = layout_as_tree(h, root = "A"), vertex.label.cex = 1.5, edge.arrow.size = 1 ) B <- as.matrix(as_adjacency_matrix(h)) S <- RoleSim(B, beta = 1) D <- as.dist(1 - S) h.res <- hclust(D, method = "ward.D") # hierarchical clustering ``` @fig-rs2 shows the dendrogram corresponding to the `rolesim` partition for , with $\beta = 1.0$ (since the graph is undirected). ```{r} #| label: fig-rs2 #| fig-height: 6 #| fig-width: 6 #| fig-cap: "RoleSim Partition of an undirected graph." plot(h.res) ``` We can see that indeed, `rolesim` identifies the automorphically equivalent classes, assigning $\{A, C, G\}$ to unique positions and grouping the relevant automorphically equivalent nodes into the expected classes. Here is the five cluster partition of the dendrogram in 2: ```{r} blocks <- cutree(h.res, k = 5) blocks ``` Role problem solved! ## CentSim As we noted, automorphically equivalent nodes in a graph are completely exchangeable, meaning that we can swap them without changing anything we can compute at the graph level. The same goes for any quantity we can compute at the node level, which means that automorphically equivalent nodes have the property of being *graph-theoretically indistinguishable*. That means that they will have the same score on *any* centrality measure we could compute on them. So another approach to finding automorphically similar pairs of nodes in a graph is just to compute a bunch of node-level quantities, such as centrality measures, and count nodes as similar if their centrality measures are similar. @li_etal15 built on this observation to compute a role similarity measure called `CentSim` on each pair of nodes in the graph, by comparing their centrality scores across various measures. Li et al.'s [-@li_etal15] proposal is that the similarity between $i$ and $j$ is given by: $$ s_{ij} = \frac{\sum_{k=1}^l \theta_k(i,j)}{l} $$ {#eq-li1} Where the function $\theta_k$ defined on each pair of nodes $i$, $j$, is given by: $$ \theta_k(i,j) = \frac{min\left[c_k(i), c_k(j)\right]}{max\left[c_k(i), c_k(j)\right]} $$ {#eq-li1} Where $c_k(i)$ is the $k^{th}$ centrality score computed on node $i$. So the idea is to compute a bunch of different centrality scores for each node, then compute the average ratio of the minimum to the maximum score for each pair. Here's a function that does that, using degree, closeness, betweenness, and eigenvector centrality: ```{r} cent.sim <- function(x) { library(igraph) c1 <- degree(x) c2 <- harmonic_centrality(x) c3 <- betweenness(x) library(igraph) c1 <- degree(x) c2 <- harmonic_centrality(x) c3 <- betweenness(x) c4 <- eigen_centrality(x)$vector n <- vcount(x) min.max <- function(a, b) { if (max(a, b) != 0) { r <- min(a, b) / max(a, b) } else { r <- 1 } return(r) } z <- diag(1, n, n) rownames(z) <- V(x)$name colnames(z) <- V(x)$name for (i in 1:n) { for (j in i:n) { w1 <- min.max(c1[i], c1[j]) w2 <- min.max(c2[i], c2[j]) w3 <- min.max(c3[i], c3[j]) w4 <- min.max(c4[i], c4[j]) z[i, j] <- (w1 + w2 + w3 + w4) / 4 } } z[lower.tri(z)] <- t(z)[lower.tri(z)] # copying upper triangle into lower triangle return(z) } ``` The `cent.sim` function takes an igraph graph object as input and returns a similarity matrix `z` as output. Can `CentSim` find the automorphically equivalent classes in that we are looking for? Let's find out: ```{r} #| label: fig-cs #| fig-height: 6 #| fig-width: 6 #| fig-cap: "CentSim Partition." S <- cent.sim(g) D <- as.dist(1 - S) h.res <- hclust(D, method = "ward.D") # hierarchical clustering plot(h.res) ``` Indeed it does! As we can see in @fig-cs, nodes {A, C, J, H} are put in the same class, as are nodes {B, D, G, I} and {E, F}. Let us compare the clustering that we obtain in the *Pulp Fiction* network from a measure that is sensitive to automorphic similarity (like `RoleSim`) to the one we obtained using `SimRank`: ```{r} #| fig-height: 10 #| fig-width: 10 g <- movie_559 V(g)$name <- str_to_title(V(g)$name) V(g)$name[which(V(g)$name == "Esmarelda")] <- "Esmeralda" # fixing misspelled name g <- delete_vertices(g, degree(g) < 2) # deleting low degree vertices A <- as.matrix(as_adjacency_matrix(g)) S <- RoleSim(A) D <- as.dist(1 - S) h.res2 <- hclust(D, method = "ward.D") # hierarchical clustering plot(h.res2) ``` We can see that the `RoleSim` partition is different from the one offered by `SimRank`. While the main clusters in the `SimRank` partition are organized by scene (e.g., all the characters in the diner scene appear together), the `RoleSim` partition actually organizes the characters according to more abstract roles. So the main characters {Jules, Vincent, Butch} form their own cluster, while {Marsellus, Mia, Pumpkin, Honey Bunny} are also assigned to a similarity cluster, despite not appearing in the same scenes.