Global Vertex Similarities

Global Similarity Indices

As we saw in the lecture notes on local similarity metrics, the most important ingredient of structural similarity measures between pairs of nodes is the number of common neighbors (followed by the degrees of each node), and this quantity is given by the square of the adjacency matrix \(\mathbf{A}^2\). So we can say that this matrix gives us a basic similarity measure between nodes, namely, the common neighbors similarity:

\[ \mathbf{S} = \mathbf{A}^2 \]

Another way of seeing this is that a common neighbor defines a path of length two between a pair of nodes. So the number of common neighbors between two nodes is equivalent to the number of paths of length two between them. We are thus saying that the similarity between two nodes increases as the number of paths of length two between them increases, and that info is also recorded in the \(\mathbf{A}^2\) matrix.
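To make this concrete, here is a tiny made-up example: a four-node graph with edges a-b, a-c, b-c, and c-d, where the only common neighbor of a and d is c. Squaring the adjacency matrix recovers exactly that count:

   #a tiny made-up graph with edges a-b, a-c, b-c, and c-d
   A.toy <- matrix(c(0, 1, 1, 0,
                     1, 0, 1, 0,
                     1, 1, 0, 1,
                     0, 0, 1, 0),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(letters[1:4], letters[1:4]))
   (A.toy %*% A.toy)["a", "d"] #1: one common neighbor (c), i.e., one path of length two
   (A.toy %*% A.toy)["a", "a"] #2: the diagonal of A^2 gives each node's degree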

The Local Path Similarity Index

But if the similarity between node pairs increases in the number of paths of length two between them, wouldn’t nodes be even more similar if they also have a bunch of paths of length three between them?

Lü, Jin, and Zhou (2009) asked themselves the same question and proposed the following as a similarity metric based on paths:

\[ \mathbf{S} = \mathbf{A}^2 + \alpha \mathbf{A}^3 \tag{1}\]

This is the so-called local path similarity index (Lü, Jin, and Zhou 2009). Obviously, structurally equivalent nodes will be counted as similar by this metric (lots of paths of length two between them), but so will nodes indirectly connected by many paths of length three, albeit to a lesser extent given the discount parameter \(\alpha\) (a number between zero and one).

A function that does this is:

   local.path <- function(w, alpha = 0.5) {
      w2 <- w %*% w #paths of length two (common neighbors)
      s <- w2 + alpha * (w2 %*% w) #add discounted paths of length three
      return(s)
   }

Here’s how the local path similarity looks in the Flintstones network:

   library(networkdata)
   library(igraph)
   library(stringr) #using stringr to change names from all caps to title case
   g <- movie_267
   V(g)$name <- str_to_title(V(g)$name)
   g <- delete_vertices(g, degree(g) <= 3) #deleting low degree vertices
   A <- as.matrix(as_adjacency_matrix(g))
   S <- local.path(A)
   S[1:10, 1:10]
             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         20.0   29.5  30.5      6.5 31.5         12.5        9.5 25.0
Barney          29.5   50.0  51.0     13.0 52.0         21.0       21.0 46.5
Betty           30.5   51.0  52.0     11.0 53.0         24.0       18.0 46.0
Feldspar         6.5   13.0  11.0      3.0 13.5          4.5        5.0 10.0
Fred            31.5   52.0  53.0     13.5 55.0         21.5       21.5 47.5
Headmistress    12.5   21.0  24.0      4.5 21.5         13.0        6.5 21.5
Herdmaster       9.5   21.0  18.0      5.0 21.5          6.5       10.0 17.0
Lava            25.0   46.5  46.0     10.0 47.5         21.5       17.0 42.0
Leach            9.5   15.5  16.0      3.5 17.5          7.5        5.5 15.5
Morris           8.5   13.0  15.5      2.5 13.0          8.0        3.5 12.5
             Leach Morris
Bam-Bam        9.5    8.5
Barney        15.5   13.0
Betty         16.0   15.5
Feldspar       3.5    2.5
Fred          17.5   13.0
Headmistress   7.5    8.0
Herdmaster     5.5    3.5
Lava          15.5   12.5
Leach          6.0    4.0
Morris         4.0    6.0

As \(\alpha\) approaches zero, the local path measure reduces to the number of common neighbors, while values of \(\alpha\) closer to one give paths of length three more weight.
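A quick sanity check of the limiting case, using the local.path function and adjacency matrix A defined above:

   all(local.path(A, alpha = 0) == A %*% A) #TRUE: with alpha = 0 we recover common neighbors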

Another thing people may wonder is why not keep going and add paths of length four:

\[ \mathbf{S} = \mathbf{A}^2 + \alpha \mathbf{A}^3 + \alpha^2 \mathbf{A}^4 \]

Or paths of length whatever:

\[ \mathbf{S} = \mathbf{A}^2 + \alpha \mathbf{A}^3 + \alpha^2 \mathbf{A}^4 + \ldots + \alpha^{k-2} \mathbf{A}^k \]

Where \(k\) is the length of the maximum path considered. Lü, Jin, and Zhou (2009) argue that these higher-order paths add little information relative to their computational cost, so maybe we don’t have to worry about them.
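If we did want to keep going, a sketch of the general version might look like this (my own generalization for illustration; local.path.k is a made-up name, and k = 3 reproduces Equation 1):

   local.path.k <- function(w, alpha = 0.5, k = 4) {
      s <- w %*% w #start with A^2 (common neighbors)
      w.pow <- s #running power of the adjacency matrix
      if (k >= 3) {
         for (p in 3:k) {
            w.pow <- w.pow %*% w #A^p
            s <- s + alpha^(p - 2) * w.pow #add discounted paths of length p
         }
      }
      return(s)
   }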

The Katz Similarity

Another issue is that there was already an all-paths similarity measure in existence, one developed by the mathematical social scientist Leo Katz (1953) in the 1950s (!).

The basic idea was to use linear algebra tricks to solve:

\[ \mathbf{S} = \sum_{k=1}^{\infty} \alpha^k \mathbf{A}^k \]

Which would theoretically count all the paths of all possible lengths between two nodes while discounting the contribution of longer paths in proportion to their length (as \(k\) gets bigger, with \(\alpha\) a number between zero and one, \(\alpha^k\) gets smaller and smaller).

Linear algebra hocus-pocus (non-technical explainer here) turns the above infinite sum into the more tractable:

\[ \mathbf{S} = (\mathbf{I} - \alpha \mathbf{A})^{-1} \tag{2}\]

Where \(\mathbf{I}\) is the identity matrix (a matrix of all zeros except that it has the number one in each diagonal cell) of the same dimensions as the original. Raising the result of the subtraction in parentheses to minus one indicates the matrix inverse operation (most matrices are invertible, unless they are singular). The trick is the matrix version of the geometric series: \((\mathbf{I} - \alpha \mathbf{A})^{-1} = \mathbf{I} + \alpha \mathbf{A} + \alpha^2 \mathbf{A}^2 + \ldots\), which converges whenever \(\alpha\) is smaller than the reciprocal of the largest eigenvalue of \(\mathbf{A}\). The extra identity term at the start of the series only shifts the diagonal, so it makes no difference when comparing distinct pairs of nodes.
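We can convince ourselves of this numerically (an illustration using the Flintstones adjacency matrix A from above, with a conservative \(\alpha\) well inside the convergence region):

   I <- diag(nrow(A))
   alpha <- 0.5/eigen(A)$values[1] #safely below the convergence bound
   S.sum <- I
   term <- I
   for (k in 1:50) {
      term <- alpha * (term %*% A) #alpha^k A^k
      S.sum <- S.sum + term #running sum of the geometric series
   }
   max(abs(solve(I - alpha * A) - S.sum)) #effectively zero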

A function to compute the Katz similarity between all node pairs looks like:

   katz.sim <- function(w) {
      I <- diag(nrow(w)) #creating identity matrix
      alpha <- 1/eigen(w)$values[1] - 1e-10 #reciprocal of first eigenvalue of adjacency matrix minus a tiny number
      S <- solve(I - alpha * w) 
      return(S)
   }

This function takes the network’s adjacency matrix (w) as input and returns the Katz similarity matrix (S) as output.

For technical reasons (e.g., to guarantee that the infinite sum converges) we need to choose \(\alpha\) to be a number larger than zero but smaller than the reciprocal of the first eigenvalue of the matrix. Here we just pick the reciprocal of the largest (first) eigenvalue minus a very small number (\(10^{-10}\)).
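For instance, in the Flintstones network we can check that this choice of \(\alpha\) respects the convergence bound:

   lambda1 <- eigen(A)$values[1] #largest (first) eigenvalue
   alpha <- 1/lambda1 - 1e-10
   alpha * lambda1 < 1 #TRUE: the infinite sum converges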

The fourth line of the function computes the actual Katz similarity using the native R function solve to find the relevant matrix inverse.1

In the Flintstones network the Katz similarity looks like:

   set.seed(456)
   S <- katz.sim(A)
   round(S[1:10, 1:10], 2)
              Bam-Bam    Barney     Betty Feldspar      Fred Headmistress
Bam-Bam      43912268  75530407  76630126 17269355  78596698     34072232
Barney       75530407 129914551 131806098 29703805 135188661     58605253
Betty        76630126 131806098 133725188 30136290 137157000     59458542
Feldspar     17269355  29703805  30136290  6791511  30909683     13399569
Fred         78596698 135188661 137157000 30909683 140676885     60984437
Headmistress 34072232  58605253  59458542 13399569  60984437     26437192
Herdmaster   28235263  48565494  49272605 11104068  50537096     21908193
Lava         68595891 117986974 119704858 26976671 122776865     53224651
Leach        24107266  41465215  42068946  9480652  43148569     18705214
Morris       21013980  36144672  36670936  8264157  37612029     16305084
             Herdmaster      Lava    Leach   Morris
Bam-Bam        28235263  68595891 24107266 21013980
Barney         48565494 117986974 41465215 36144672
Betty          49272605 119704858 42068946 36670936
Feldspar       11104068  26976671  9480652  8264157
Fred           50537096 122776865 43148569 37612029
Headmistress   21908193  53224651 18705214 16305084
Herdmaster     18155067  44106651 15500794 13511834
Lava           44106651 107154482 37658255 32826196
Leach          15500794  37658255 13234577 11536403
Morris         13511834  32826196 11536403 10056129

Which are pretty big numbers! This makes sense, since these are discounted counts of the paths of every possible length between each pair of nodes in the network. We can get more tractable figures by normalizing the matrix by its maximum:

   S <- S/max(S)
   round(S[1:10, 1:10], 2)
             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         0.31   0.54  0.54     0.12 0.56         0.24       0.20 0.49
Barney          0.54   0.92  0.94     0.21 0.96         0.42       0.35 0.84
Betty           0.54   0.94  0.95     0.21 0.97         0.42       0.35 0.85
Feldspar        0.12   0.21  0.21     0.05 0.22         0.10       0.08 0.19
Fred            0.56   0.96  0.97     0.22 1.00         0.43       0.36 0.87
Headmistress    0.24   0.42  0.42     0.10 0.43         0.19       0.16 0.38
Herdmaster      0.20   0.35  0.35     0.08 0.36         0.16       0.13 0.31
Lava            0.49   0.84  0.85     0.19 0.87         0.38       0.31 0.76
Leach           0.17   0.29  0.30     0.07 0.31         0.13       0.11 0.27
Morris          0.15   0.26  0.26     0.06 0.27         0.12       0.10 0.23
             Leach Morris
Bam-Bam       0.17   0.15
Barney        0.29   0.26
Betty         0.30   0.26
Feldspar      0.07   0.06
Fred          0.31   0.27
Headmistress  0.13   0.12
Herdmaster    0.11   0.10
Lava          0.27   0.23
Leach         0.09   0.08
Morris        0.08   0.07

Which looks better. As we would expect, Fred is very similar to Barney and Betty. We can then do the standard hierarchical clustering to see how this similarity measure groups nodes:

   D <- dist(1 - S) #converting similarities into distances
   h.res <- hclust(D, method = "ward.D2") #hierarchical clustering
   plot(h.res)

   blocks  <- cutree(h.res, k = 6)
   blocks
        Bam-Bam          Barney           Betty        Feldspar            Fred 
              1               2               2               3               2 
   Headmistress      Herdmaster            Lava           Leach          Morris 
              4               4               5               3               3 
      Mrs Slate         Pebbles Pebbles Bam-Bam        Piltdown      Poindexter 
              1               1               6               6               6 
         Pyrite           Slate           Wilma 
              6               5               5 

The Katz similarity (correctly) assigns Fred, Barney, Betty, Wilma (along with Slate and Lava) to the same group, and puts the kids in a separate block.

Degree-Weighted Katz Similarity (AKA Leicht-Holme-Newman)

Leicht, Holme, and Newman (2006) argue that the Katz approach is fine and dandy as a similarity measure, but note that it is an unweighted index (like the raw number of common neighbors). This means that nodes with large degree will end up being “similar” to a bunch of other nodes in the graph, just because they have lots of paths of length whatever between them and those nodes.

Leicht, Holme, and Newman (2006) propose a “fix” for this weakness in the Katz similarity, resulting in the matrix linear algebra equivalent of a degree-normalized similarity measure like the Jaccard or Cosine.

So instead of Katz they suggest we compute:

\[ \mathbf{S} = \mathbf{D}^{-1} \left( \frac{\alpha \mathbf{A}}{\lambda_1} \right)^{-1} \mathbf{D}^{-1} \]

Here \(\mathbf{D}\) is a matrix containing the degrees of each node along the diagonal. The inverse of this matrix, \(\mathbf{D}^{-1}\), contains the reciprocal of each degree, \(1/k_i\), along the diagonal. \(\lambda_1\), on the other hand, is just the first eigenvalue of the adjacency matrix.

So, the LHN similarity is just the Katz similarity divided by the degrees of the sender and receiver nodes, which further discounts paths featuring high-degree nodes at either or both ends.

Which leads to the function:

   LHN.sim <- function(A, alpha = 0.9) {
      D <- solve(diag(rowSums(A))) #inverse of degree matrix
      lambda <- eigen(A)$values[1] #first eigenvalue of adjacency matrix
      S <- D %*% solve((alpha * A)/lambda) %*% D #LHN index
      rownames(S) <- rownames(A)
      colnames(S) <- colnames(A)
      return(S)
   }

And the Flintstones result:

   S <- LHN.sim(A)
   round(S[, 1:5], 2)
                Bam-Bam Barney Betty Feldspar  Fred
Bam-Bam           -0.09   0.01  0.01    -0.16 -0.01
Barney             0.01  -0.02  0.00     0.11  0.01
Betty              0.01   0.00 -0.04    -0.48  0.00
Feldspar          -0.16   0.11 -0.48    -2.55  0.25
Fred              -0.01   0.01  0.00     0.25 -0.01
Headmistress       0.20   0.05 -0.13    -2.96 -0.05
Herdmaster         0.16   0.08  0.10     0.07 -0.07
Lava               0.00  -0.02 -0.02    -0.42  0.02
Leach             -0.27  -0.10  0.04     2.01  0.09
Morris            -0.25  -0.03  0.12     1.90  0.03
Mrs Slate         -0.05   0.01  0.04     0.74 -0.01
Pebbles            0.19   0.01  0.01    -0.16 -0.01
Pebbles Bam-Bam    0.03   0.00  0.03     0.16  0.00
Piltdown           0.00   0.00  0.00    -0.35  0.00
Poindexter         0.01   0.03  0.04     0.66 -0.02
Pyrite             0.00   0.00  0.00    -0.35  0.00
Slate             -0.08  -0.01  0.04     0.91  0.01
Wilma              0.01   0.00  0.02     0.09  0.00

Note that, as Leicht, Holme, and Newman (2006) discuss, the entries in the LHN version of the similarity \(\mathbf{S}\) can be either positive or negative. Negative entries mark node pairs that are surprisingly dissimilar given their degrees, and positive entries mark node pairs that are surprisingly similar. Numbers closer to zero are node pairs that are neither similar nor dissimilar given their degrees.

Here we can see that Barney and Fred are not that similar to one another (after we take into account their very high degrees) and that, among the columns shown, Barney is most similar to Feldspar (and Fred even more so).
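One way to check readings like this across the whole matrix is to pull out each node’s most similar alter, ignoring the diagonal (a small helper sketch; S.alt and top are made-up names):

   S.alt <- S
   diag(S.alt) <- NA #ignore self-similarities
   top <- colnames(S)[apply(S.alt, 1, which.max)] #most similar alter per node
   names(top) <- rownames(S)
   top["Barney"] #which alter is Barney closest to under LHN?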

Because the LHN similarities take both negative and positive values, we can feed them directly to a distance computation (much like the correlation distance). So if we wanted to find blocks of actors using this similarity criterion, all we need to do is:

   D <- dist(S)
   h.res <- hclust(D, method = "ward.D2")
   plot(h.res)

   blocks <- cutree(h.res, k = 8)
   blocks
        Bam-Bam          Barney           Betty        Feldspar            Fred 
              1               2               1               3               2 
   Headmistress      Herdmaster            Lava           Leach          Morris 
              4               5               1               6               7 
      Mrs Slate         Pebbles Pebbles Bam-Bam        Piltdown      Poindexter 
              8               1               2               1               8 
         Pyrite           Slate           Wilma 
              1               8               2 

It seems like this approach is geared towards finding smaller, more fine-grained clusters of similar actors in the data.

References

Katz, Leo. 1953. “A New Status Index Derived from Sociometric Analysis.” Psychometrika 18 (1): 39–43.
Leicht, Elizabeth A., Petter Holme, and Mark E. J. Newman. 2006. “Vertex Similarity in Networks.” Physical Review E 73 (2): 026120.
Lü, Linyuan, Ci-Hang Jin, and Tao Zhou. 2009. “Similarity Index Based on Local Paths for Link Prediction of Complex Networks.” Physical Review E 80: 046122.

Footnotes

  1. See here for an explainer of the matrix inverse in general and its computation in R.