Global Vertex Similarities

Global Similarity Indices

As we saw in the lecture notes on local similarity metrics, the most important ingredient of structural similarity measures between pairs of nodes is the number of common neighbors (followed by the degrees of each node), and this quantity is given by the square of the adjacency matrix \(\mathbf{A}^2\). So we can say that this matrix gives us a basic similarity measure between nodes, namely, the common neighbors similarity:

\[ \mathbf{S} = \mathbf{A}^2 \]

Another way of seeing this is that a common neighbor defines a path of length two between a pair of nodes. So the number of common neighbors between two nodes is equivalent to the number of paths of length two between them. We are thus saying that the similarity between two nodes increases as the number of paths of length two between them increases, and that info is also recorded in the \(\mathbf{A}^2\) matrix.
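To make this concrete, here is a tiny made-up example: a four-node graph with edges a-b, a-c, b-c, and c-d, where the only common neighbor of a and d is c. Squaring the adjacency matrix recovers exactly that count:

   #a tiny made-up graph with edges a-b, a-c, b-c, and c-d
   A.toy <- matrix(c(0, 1, 1, 0,
                     1, 0, 1, 0,
                     1, 1, 0, 1,
                     0, 0, 1, 0),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(letters[1:4], letters[1:4]))
   (A.toy %*% A.toy)["a", "d"] #1: one common neighbor (c), i.e., one path of length two
   (A.toy %*% A.toy)["a", "a"] #2: the diagonal of A^2 gives each node's degree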

The Local Path Similarity Index

But if the similarity between node pairs increases in the number of paths of length two between them, wouldn’t nodes be even more similar if they also have a bunch of paths of length three between them?

Lü, Jin, and Zhou (2009) asked themselves the same question and proposed the following as a similarity metric based on paths:

\[ \mathbf{S} = \mathbf{A}^2 + \alpha \mathbf{A}^3 \tag{1}\]

This is the so-called local path similarity index (Lü, Jin, and Zhou 2009). Obviously, structurally equivalent nodes will be counted as similar by this metric (lots of paths of length two between them), but so will nodes indirectly connected by many paths of length three, albeit to a lesser extent given the discount parameter \(\alpha\) (a number between zero and one).

A function that does this is:

   local.path <- function(w, alpha = 0.5) {
      w2 <- w %*% w #paths of length two (common neighbors)
      s <- w2 + alpha * (w2 %*% w) #add discounted paths of length three
      return(s)
   }

Here’s how the local path similarity looks in the Flintstones network:

   library(networkdata)
   library(igraph)
   library(stringr) #using stringr to change names from all caps to title case
   g <- movie_267
   V(g)$name <- str_to_title(V(g)$name)
   g <- delete_vertices(g, degree(g) <= 3) #deleting low degree vertices
   A <- as.matrix(as_adjacency_matrix(g))
   S <- local.path(A)
   S[1:10, 1:10]
             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         20.0   29.5  30.5      6.5 31.5         12.5        9.5 25.0
Barney          29.5   50.0  51.0     13.0 52.0         21.0       21.0 46.5
Betty           30.5   51.0  52.0     11.0 53.0         24.0       18.0 46.0
Feldspar         6.5   13.0  11.0      3.0 13.5          4.5        5.0 10.0
Fred            31.5   52.0  53.0     13.5 55.0         21.5       21.5 47.5
Headmistress    12.5   21.0  24.0      4.5 21.5         13.0        6.5 21.5
Herdmaster       9.5   21.0  18.0      5.0 21.5          6.5       10.0 17.0
Lava            25.0   46.5  46.0     10.0 47.5         21.5       17.0 42.0
Leach            9.5   15.5  16.0      3.5 17.5          7.5        5.5 15.5
Morris           8.5   13.0  15.5      2.5 13.0          8.0        3.5 12.5
             Leach Morris
Bam-Bam        9.5    8.5
Barney        15.5   13.0
Betty         16.0   15.5
Feldspar       3.5    2.5
Fred          17.5   13.0
Headmistress   7.5    8.0
Herdmaster     5.5    3.5
Lava          15.5   12.5
Leach          6.0    4.0
Morris         4.0    6.0

As \(\alpha\) approaches zero, the local path measure reduces to the number of common neighbors, while values of \(\alpha\) closer to one give paths of length three more weight.
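A quick sanity check of the limiting case, using the local.path function and adjacency matrix A defined above:

   all(local.path(A, alpha = 0) == A %*% A) #TRUE: with alpha = 0 we recover common neighbors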

Another thing people may wonder is why not keep going and add paths of length four:

\[ \mathbf{S} = \mathbf{A}^2 + \alpha \mathbf{A}^3 + \alpha^2 \mathbf{A}^4 \]

Or paths of length whatever:

\[ \mathbf{S} = \mathbf{A}^2 + \alpha \mathbf{A}^3 + \alpha^2 \mathbf{A}^4 + \ldots + \alpha^{k-2} \mathbf{A}^k \]

Where \(k\) is the length of the maximum path considered. Lü, Jin, and Zhou (2009) argue that these higher-order paths add little information relative to their computational cost, so maybe we don’t have to worry about them.
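If we did want to keep going, a sketch of the general version might look like this (my own generalization for illustration; local.path.k is a made-up name, and k = 3 reproduces Equation 1):

   local.path.k <- function(w, alpha = 0.5, k = 4) {
      s <- w %*% w #start with A^2 (common neighbors)
      w.pow <- s #running power of the adjacency matrix
      if (k >= 3) {
         for (p in 3:k) {
            w.pow <- w.pow %*% w #A^p
            s <- s + alpha^(p - 2) * w.pow #add discounted paths of length p
         }
      }
      return(s)
   }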

The Katz Similarity

Another issue is that there was already an all-paths similarity measure in existence, one developed by the mathematical social scientist Leo Katz (1953) in the 1950s (!).

The basic idea was to use linear algebra tricks to solve:

\[ \mathbf{S} = \sum_{k=1}^{\infty} \alpha^k \mathbf{A}^k \]

Which would theoretically count all the paths of all possible lengths between two nodes while discounting the contribution of longer paths in proportion to their length (as \(k\) gets bigger, with \(\alpha\) a number between zero and one, \(\alpha^k\) gets smaller and smaller).

Linear algebra hocus-pocus (non-technical explainer here) turns the above infinite sum into the more tractable:

\[ \mathbf{S} = (\mathbf{I} - \alpha \mathbf{A})^{-1} \tag{2}\]

Where \(\mathbf{I}\) is the identity matrix (a matrix of all zeros except that it has the number one in each diagonal cell) of the same dimensions as the original. Raising the result of the subtraction in parentheses to minus one indicates the matrix inverse operation (most matrices are invertible, unless they are singular). The trick is the matrix version of the geometric series: \((\mathbf{I} - \alpha \mathbf{A})^{-1} = \mathbf{I} + \alpha \mathbf{A} + \alpha^2 \mathbf{A}^2 + \ldots\), which converges whenever \(\alpha\) is smaller than the reciprocal of the largest eigenvalue of \(\mathbf{A}\). The extra identity term at the start of the series only shifts the diagonal, so it makes no difference when comparing distinct pairs of nodes.
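We can convince ourselves of this numerically (an illustration using the Flintstones adjacency matrix A from above, with a conservative \(\alpha\) well inside the convergence region):

   I <- diag(nrow(A))
   alpha <- 0.5/eigen(A)$values[1] #safely below the convergence bound
   S.sum <- I
   term <- I
   for (k in 1:50) {
      term <- alpha * (term %*% A) #alpha^k A^k
      S.sum <- S.sum + term #running sum of the geometric series
   }
   max(abs(solve(I - alpha * A) - S.sum)) #effectively zero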

A function to compute the Katz similarity between all node pairs looks like:

   katz.sim <- function(w) {
      I <- diag(nrow(w)) #creating identity matrix
      alpha <- 1/eigen(w)$values[1] - 1e-10 #reciprocal of first eigenvalue of adjacency matrix minus a tiny number
      S <- solve(I - alpha * w) 
      return(S)
   }

This function takes the network’s adjacency matrix (w) as input and returns the Katz similarity matrix (S) as output.

For technical reasons (e.g., to guarantee that the infinite sum converges) we need to choose \(\alpha\) to be a number larger than zero but smaller than the reciprocal of the first eigenvalue of the matrix. Here we just pick the reciprocal of the largest (first) eigenvalue minus a very small number (\(10^{-10}\)).
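For instance, in the Flintstones network we can check that this choice of \(\alpha\) respects the convergence bound:

   lambda1 <- eigen(A)$values[1] #largest (first) eigenvalue
   alpha <- 1/lambda1 - 1e-10
   alpha * lambda1 < 1 #TRUE: the infinite sum converges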

The fourth line of the function computes the actual Katz similarity using the native R function solve to find the relevant matrix inverse.1

In the Flintstones network the Katz similarity looks like:

   set.seed(456)
   S <- katz.sim(A)
   round(S[1:10, 1:10], 2)
              Bam-Bam    Barney     Betty Feldspar      Fred Headmistress
Bam-Bam      43912268  75530407  76630126 17269355  78596698     34072232
Barney       75530407 129914551 131806098 29703805 135188661     58605253
Betty        76630126 131806098 133725188 30136290 137157000     59458542
Feldspar     17269355  29703805  30136290  6791511  30909683     13399569
Fred         78596698 135188661 137157000 30909683 140676885     60984437
Headmistress 34072232  58605253  59458542 13399569  60984437     26437192
Herdmaster   28235263  48565494  49272605 11104068  50537096     21908193
Lava         68595891 117986974 119704858 26976671 122776865     53224651
Leach        24107266  41465215  42068946  9480652  43148569     18705214
Morris       21013980  36144672  36670936  8264157  37612029     16305084
             Herdmaster      Lava    Leach   Morris
Bam-Bam        28235263  68595891 24107266 21013980
Barney         48565494 117986974 41465215 36144672
Betty          49272605 119704858 42068946 36670936
Feldspar       11104068  26976671  9480652  8264157
Fred           50537096 122776865 43148569 37612029
Headmistress   21908193  53224651 18705214 16305084
Herdmaster     18155067  44106651 15500794 13511834
Lava           44106651 107154482 37658255 32826196
Leach          15500794  37658255 13234577 11536403
Morris         13511834  32826196 11536403 10056129

Which are pretty big numbers! This makes sense, since these are discounted counts of the paths of every possible length between each pair of nodes in the network. We can get more tractable figures by normalizing the matrix by its maximum:

   S <- S/max(S)
   round(S[1:10, 1:10], 2)
             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         0.31   0.54  0.54     0.12 0.56         0.24       0.20 0.49
Barney          0.54   0.92  0.94     0.21 0.96         0.42       0.35 0.84
Betty           0.54   0.94  0.95     0.21 0.97         0.42       0.35 0.85
Feldspar        0.12   0.21  0.21     0.05 0.22         0.10       0.08 0.19
Fred            0.56   0.96  0.97     0.22 1.00         0.43       0.36 0.87
Headmistress    0.24   0.42  0.42     0.10 0.43         0.19       0.16 0.38
Herdmaster      0.20   0.35  0.35     0.08 0.36         0.16       0.13 0.31
Lava            0.49   0.84  0.85     0.19 0.87         0.38       0.31 0.76
Leach           0.17   0.29  0.30     0.07 0.31         0.13       0.11 0.27
Morris          0.15   0.26  0.26     0.06 0.27         0.12       0.10 0.23
             Leach Morris
Bam-Bam       0.17   0.15
Barney        0.29   0.26
Betty         0.30   0.26
Feldspar      0.07   0.06
Fred          0.31   0.27
Headmistress  0.13   0.12
Herdmaster    0.11   0.10
Lava          0.27   0.23
Leach         0.09   0.08
Morris        0.08   0.07

Which looks better. As we would expect, Fred is very similar to Barney and Betty. We can then do the standard hierarchical clustering to see how this similarity measure groups nodes:

   D <- dist(1 - S) #converting similarities into distances
   h.res <- hclust(D, method = "ward.D2") #hierarchical clustering
   plot(h.res)

   blocks  <- cutree(h.res, k = 6)
   blocks
        Bam-Bam          Barney           Betty        Feldspar            Fred 
              1               2               2               3               2 
   Headmistress      Herdmaster            Lava           Leach          Morris 
              4               4               5               3               3 
      Mrs Slate         Pebbles Pebbles Bam-Bam        Piltdown      Poindexter 
              1               1               6               6               6 
         Pyrite           Slate           Wilma 
              6               5               5 

The Katz similarity (correctly) assigns Fred, Barney, Betty, Wilma (along with Slate and Lava) to the same group, and puts the kids in a separate block.

Degree-Weighted Katz Similarity (AKA Leicht-Holme-Newman)

Leicht, Holme, and Newman (2006) argue that the Katz approach is fine and dandy as a similarity measure, but note that it is an unweighted index (like the raw number of common neighbors). This means that nodes with large degree will end up being “similar” to a bunch of other nodes in the graph, just because they have lots of paths of length whatever between them and those nodes.

Leicht, Holme, and Newman (2006) propose a “fix” for this weakness in the Katz similarity, resulting in the matrix linear algebra equivalent of a degree-normalized similarity measure like the Jaccard or Cosine.

So instead of Katz they suggest we compute:

\[ \mathbf{S} = \mathbf{D}^{-1} \left( \frac{\alpha \mathbf{A}}{\lambda_1} \right)^{-1} \mathbf{D}^{-1} \]

Here \(\mathbf{D}\) is a matrix containing the degrees of each node along the diagonal. The inverse of this matrix, \(\mathbf{D}^{-1}\), contains the reciprocal of each degree, \(1/k_i\), along the diagonal. \(\lambda_1\), on the other hand, is just the first eigenvalue of the adjacency matrix.

So, the LHN similarity is just the Katz similarity divided by the degrees of the sender and receiver nodes, which further discounts paths featuring high-degree nodes at either or both ends.

Which leads to the function:

   LHN.sim <- function(A, alpha = 0.9) {
      D <- solve(diag(rowSums(A))) #inverse of degree matrix
      lambda <- eigen(A)$values[1] #first eigenvalue of adjacency matrix
      S <- D %*% solve((alpha * A)/lambda) %*% D #LHN index
      rownames(S) <- rownames(A)
      colnames(S) <- colnames(A)
      return(S)
   }

And the Flintstones result:

   S <- LHN.sim(A)
   round(S[, 1:5], 2)
                Bam-Bam Barney Betty Feldspar  Fred
Bam-Bam           -0.09   0.01  0.01    -0.16 -0.01
Barney             0.01  -0.02  0.00     0.11  0.01
Betty              0.01   0.00 -0.04    -0.48  0.00
Feldspar          -0.16   0.11 -0.48    -2.55  0.25
Fred              -0.01   0.01  0.00     0.25 -0.01
Headmistress       0.20   0.05 -0.13    -2.96 -0.05
Herdmaster         0.16   0.08  0.10     0.07 -0.07
Lava               0.00  -0.02 -0.02    -0.42  0.02
Leach             -0.27  -0.10  0.04     2.01  0.09
Morris            -0.25  -0.03  0.12     1.90  0.03
Mrs Slate         -0.05   0.01  0.04     0.74 -0.01
Pebbles            0.19   0.01  0.01    -0.16 -0.01
Pebbles Bam-Bam    0.03   0.00  0.03     0.16  0.00
Piltdown           0.00   0.00  0.00    -0.35  0.00
Poindexter         0.01   0.03  0.04     0.66 -0.02
Pyrite             0.00   0.00  0.00    -0.35  0.00
Slate             -0.08  -0.01  0.04     0.91  0.01
Wilma              0.01   0.00  0.02     0.09  0.00

Note that, as Leicht, Holme, and Newman (2006) discuss, the entries in the LHN version of the similarity \(\mathbf{S}\) can be either positive or negative. Negative entries mark node pairs that are surprisingly dissimilar given their degrees, and positive entries mark node pairs that are surprisingly similar. Numbers closer to zero are node pairs that are neither similar nor dissimilar given their degrees.

Here we can see that Barney and Fred are not that similar to one another (after we take into account their very high degrees) and that, among the columns shown, Barney is most similar to Feldspar (and Fred even more so).
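One way to check readings like this across the whole matrix is to pull out each node’s most similar alter, ignoring the diagonal (a small helper sketch; S.alt and top are made-up names):

   S.alt <- S
   diag(S.alt) <- NA #ignore self-similarities
   top <- colnames(S)[apply(S.alt, 1, which.max)] #most similar alter per node
   names(top) <- rownames(S)
   top["Barney"] #which alter is Barney closest to under LHN?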

Because the LHN similarities take both negative and positive values, we can feed them directly to a distance computation (much like the correlation distance). So if we wanted to find blocks of actors using this similarity criterion, all we need to do is:

   D <- dist(S)
   h.res <- hclust(D, method = "ward.D2")
   plot(h.res)

   blocks <- cutree(h.res, k = 8)
   blocks
        Bam-Bam          Barney           Betty        Feldspar            Fred 
              1               2               1               3               2 
   Headmistress      Herdmaster            Lava           Leach          Morris 
              4               5               1               6               7 
      Mrs Slate         Pebbles Pebbles Bam-Bam        Piltdown      Poindexter 
              8               1               2               1               8 
         Pyrite           Slate           Wilma 
              1               8               2 

It seems like this approach is geared towards finding smaller, more fine-grained clusters of similar actors in the data.

References

Katz, Leo. 1953. “A New Status Index Derived from Sociometric Analysis.” Psychometrika 18 (1): 39–43.
Leicht, Elizabeth A., Petter Holme, and Mark E. J. Newman. 2006. “Vertex Similarity in Networks.” Physical Review E 73 (2): 026120.
Lü, Linyuan, Ci-Hang Jin, and Tao Zhou. 2009. “Similarity Index Based on Local Paths for Link Prediction of Complex Networks.” Physical Review E 80: 046122.

Footnotes

  1. See here for an explainer of the matrix inverse in general and its computation in R.