Global Vertex Similarities
Global Similarity Indices
As we saw in the lecture notes on local similarity metrics, the most important ingredient of structural similarity measures between pairs of nodes is the number of common neighbors (followed by the degrees of each node), and this quantity is given by the entries of the square of the adjacency matrix \(\mathbf{A}^2\). So we can say that this matrix gives us a basic similarity measure between nodes, namely, the common neighbors similarity:
\[ \mathbf{S} = \mathbf{A}^2 \]
Another way of seeing this is that a common neighbor defines a path of length two between a pair of nodes. So the number of common neighbors between two nodes is equivalent to the number of paths of length two between them. We are thus saying that the similarity between two nodes increases as the number of paths of length two between them increases, and that info is also recorded in the \(\mathbf{A}^2\) matrix.
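We can check this in R on a toy graph (a sketch; the four-node cycle below is just for illustration). Nodes A and D share two neighbors (B and C), so the (A, D) entry of the squared adjacency matrix should be 2:

library(igraph)
g <- graph_from_literal(A-B, A-C, B-D, C-D) #a four-node cycle
M <- as.matrix(as_adjacency_matrix(g))
M %*% M #entry (i, j) counts the paths of length two between i and j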
The Local Path Similarity Index
But if the similarity between node pairs increases in the number of paths of length two between them, wouldn’t nodes be even more similar if they also have a bunch of paths of length three between them?
Lü, Jin, and Zhou (2009) asked themselves the same question and proposed the following as a similarity metric based on paths:
\[ \mathbf{S} = \mathbf{A}^2 + \alpha \mathbf{A}^3 \tag{1}\]
This is the so-called local path similarity index (Lü, Jin, and Zhou 2009). Obviously, structurally equivalent nodes will be counted as similar by this metric (lots of paths of length two between them), but so will nodes indirectly connected by many paths of length three, though to a lesser extent given the discount parameter \(\alpha\) (a number between zero and one).
A function that does this is sketched below (a minimal reconstruction of Equation (1); the default value of \(\alpha = 0.5\) is an assumption):
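#a minimal sketch of Equation (1); the default alpha = 0.5 is an assumption
local.path <- function(A, alpha = 0.5) {
   A2 <- A %*% A #paths of length two (common neighbors)
   A3 <- A2 %*% A #paths of length three
   S <- A2 + alpha * A3 #Equation (1)
   return(S)
}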
Here’s how the local path similarity looks in the Flintstones network:
library(networkdata)
library(igraph)
library(stringr) #using stringr to change names from all caps to title case
g <- movie_267
V(g)$name <- str_to_title(V(g)$name)
g <- delete_vertices(g, degree(g) <= 3) #deleting low degree vertices
A <- as.matrix(as_adjacency_matrix(g))
S <- local.path(A)
S[1:10, 1:10]

             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava Leach Morris
Bam-Bam         20.0   29.5  30.5      6.5 31.5         12.5        9.5 25.0   9.5    8.5
Barney          29.5   50.0  51.0     13.0 52.0         21.0       21.0 46.5  15.5   13.0
Betty           30.5   51.0  52.0     11.0 53.0         24.0       18.0 46.0  16.0   15.5
Feldspar         6.5   13.0  11.0      3.0 13.5          4.5        5.0 10.0   3.5    2.5
Fred            31.5   52.0  53.0     13.5 55.0         21.5       21.5 47.5  17.5   13.0
Headmistress    12.5   21.0  24.0      4.5 21.5         13.0        6.5 21.5   7.5    8.0
Herdmaster       9.5   21.0  18.0      5.0 21.5          6.5       10.0 17.0   5.5    3.5
Lava            25.0   46.5  46.0     10.0 47.5         21.5       17.0 42.0  15.5   12.5
Leach            9.5   15.5  16.0      3.5 17.5          7.5        5.5 15.5   6.0    4.0
Morris           8.5   13.0  15.5      2.5 13.0          8.0        3.5 12.5   4.0    6.0
Of course, as \(\alpha\) approaches zero, the local path measure reduces to the number of common neighbors, while values closer to one give more weight to the paths of length three.
Another thing people may wonder is why not keep going and add paths of length four:
\[ \mathbf{S} = \mathbf{A}^2 + \alpha \mathbf{A}^3 + \alpha^2 \mathbf{A}^4 \]
Or paths of length whatever:
\[ \mathbf{S} = \mathbf{A}^2 + \alpha \mathbf{A}^3 + \alpha^2 \mathbf{A}^4 + \ldots + \alpha^{k-2} \mathbf{A}^k \]
Where \(k\) is the length of the longest path considered. Lü, Jin, and Zhou (2009) argue that these higher-order paths don't add much, so we generally don't have to worry about them; still, the truncated sum is easy to compute, as the sketch below shows.
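A hypothetical generalization of the earlier function (the name and defaults are mine, not from the original sources):

#truncated path-counting similarity up to paths of length k (assumes k >= 3)
path.sim <- function(A, alpha = 0.5, k = 4) {
   Ap <- A %*% A #paths of length two
   S <- Ap
   for (p in 3:k) {
      Ap <- Ap %*% A #paths of length p
      S <- S + alpha^(p - 2) * Ap #discounted by alpha^(p - 2)
   }
   return(S)
}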
The Katz Similarity
Another issue is that there was already an all-paths similarity measure in existence, one developed by the mathematical social scientist Leo Katz (1953) in the 1950s (!).
The basic idea was to use linear algebra tricks to solve:
\[ \mathbf{S} = \sum_{k=0}^{\infty} \alpha^k \mathbf{A}^k \]
Which would theoretically count all the paths of all possible lengths between two nodes while discounting the contribution of the longer paths in proportion to their length (as \(k\) gets bigger, with \(\alpha\) a number between zero and one, \(\alpha^k\) gets smaller and smaller). Note that the \(k = 0\) term is just the identity matrix, so it only adds a constant to each node's self-similarity and leaves the pairwise entries untouched.
Linear algebra hocus-pocus (non-technical explainer here) turns the above infinite sum into the more tractable:
\[ \mathbf{S} = (\mathbf{I} - \alpha \mathbf{A})^{-1} \tag{2}\]
Where \(\mathbf{I}\) is the identity matrix (a matrix of all zeros except that it has the number one in each diagonal cell) of the same dimensions as the original. Raising the result of the subtraction in parentheses to minus one indicates the matrix inverse operation (most matrices are invertible, unless they are weird, that is, singular).
A function to compute the Katz similarity between all node pairs is sketched below (a minimal reconstruction based on the description that follows; the function name is an assumption):
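katz.sim <- function(w) {
   n <- nrow(w) #number of nodes
   alpha <- 1/max(eigen(w)$values) - 1e-10 #reciprocal of the first eigenvalue minus a tiny number
   S <- solve(diag(n) - alpha * w) #the Katz similarity via the matrix inverse
   return(S)
}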
This function takes the network’s adjacency matrix (w) as input and returns the Katz similarity matrix (S) as output.
For technical reasons (e.g., guarantee that the infinite sum converges) we need to choose \(\alpha\) to be a number larger than zero but smaller than the reciprocal of the first eigenvalue of the matrix. Here we just pick the reciprocal of the largest (first) eigenvalue minus a very small number (\(10^{-10}\)).
Line 4 computes the actual Katz similarity using the native R function solve to find the relevant matrix inverse.
In the Flintstones network the Katz similarity looks like:
              Bam-Bam    Barney     Betty Feldspar      Fred Headmistress Herdmaster      Lava    Leach   Morris
Bam-Bam      43912268  75530407  76630126 17269355  78596698     34072232   28235263  68595891 24107266 21013980
Barney       75530407 129914551 131806098 29703805 135188661     58605253   48565494 117986974 41465215 36144672
Betty        76630126 131806098 133725188 30136290 137157000     59458542   49272605 119704858 42068946 36670936
Feldspar     17269355  29703805  30136290  6791511  30909683     13399569   11104068  26976671  9480652  8264157
Fred         78596698 135188661 137157000 30909683 140676885     60984437   50537096 122776865 43148569 37612029
Headmistress 34072232  58605253  59458542 13399569  60984437     26437192   21908193  53224651 18705214 16305084
Herdmaster   28235263  48565494  49272605 11104068  50537096     21908193   18155067  44106651 15500794 13511834
Lava         68595891 117986974 119704858 26976671 122776865     53224651   44106651 107154482 37658255 32826196
Leach        24107266  41465215  42068946  9480652  43148569     18705214   15500794  37658255 13234577 11536403
Morris       21013980  36144672  36670936  8264157  37612029     16305084   13511834  32826196 11536403 10056129
These are pretty big numbers! Which makes sense, since they are (discounted) counts of all the paths of any length between every pair of nodes in the network. We can get more tractable figures by normalizing the matrix by its maximum:
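A one-line sketch (rounding to two decimals to match the table below is an assumption):

S <- round(S/max(S), 2) #normalize by the largest entry and round

Which gives: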
             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava Leach Morris
Bam-Bam         0.31   0.54  0.54     0.12 0.56         0.24       0.20 0.49  0.17   0.15
Barney          0.54   0.92  0.94     0.21 0.96         0.42       0.35 0.84  0.29   0.26
Betty           0.54   0.94  0.95     0.21 0.97         0.42       0.35 0.85  0.30   0.26
Feldspar        0.12   0.21  0.21     0.05 0.22         0.10       0.08 0.19  0.07   0.06
Fred            0.56   0.96  0.97     0.22 1.00         0.43       0.36 0.87  0.31   0.27
Headmistress    0.24   0.42  0.42     0.10 0.43         0.19       0.16 0.38  0.13   0.12
Herdmaster      0.20   0.35  0.35     0.08 0.36         0.16       0.13 0.31  0.11   0.10
Lava            0.49   0.84  0.85     0.19 0.87         0.38       0.31 0.76  0.27   0.23
Leach           0.17   0.29  0.30     0.07 0.31         0.13       0.11 0.27  0.09   0.08
Morris          0.15   0.26  0.26     0.06 0.27         0.12       0.10 0.23  0.08   0.07
Which looks better. As we would expect, Fred is very similar to Barney and Betty. We can then do the standard hierarchical clustering to see how this similarity measure groups nodes:
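A minimal sketch of that step (the similarity-to-distance conversion, the linkage method, and the choice of \(k = 6\) blocks are assumptions consistent with the output below):

D <- as.dist(1 - S) #turn normalized similarities into distances
hc <- hclust(D, method = "ward.D2") #hierarchical clustering
cutree(hc, k = 6) #cut the tree into six blocks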

Block 1: Bam-Bam, Mrs Slate, Pebbles
Block 2: Barney, Betty, Fred
Block 3: Feldspar, Leach, Morris
Block 4: Headmistress, Herdmaster
Block 5: Lava, Slate, Wilma
Block 6: Pebbles Bam-Bam, Piltdown, Poindexter, Pyrite
The Katz similarity (correctly) assigns Fred, Barney, and Betty to the same block, groups Wilma with Slate and Lava, and puts the kids (Pebbles and Bam-Bam) in a separate block.
Degree-Weighted Katz Similarity (AKA Leicht-Holme-Newman)
Leicht, Holme, and Newman (2006) argue that the Katz approach is fine and dandy as a similarity measure, but note that it is an unweighted index (like the raw number of common neighbors). This means that nodes with large degree will end up being "similar" to a bunch of other nodes in the graph, just because they have lots of paths of length whatever between them and those nodes.
Leicht, Holme, and Newman (2006) propose a “fix” for this weakness in the Katz similarity, resulting in the matrix linear algebra equivalent of a degree-normalized similarity measure like the Jaccard or Cosine.
So instead of Katz they suggest we compute:
\[ \mathbf{S} = \mathbf{D}^{-1} \left( \mathbf{I} - \frac{\alpha \mathbf{A}}{\lambda_1} \right)^{-1} \mathbf{D}^{-1} \]
Here \(\mathbf{D}\) is a matrix containing the degrees of each node along the diagonal. The inverse of this matrix \(\mathbf{D}^{-1}\) will contain the reciprocal of each degree \(1/k_i\) along the diagonal. \(\lambda_1\), on the other hand, is just the first (largest) eigenvalue of the adjacency matrix.
So, the LHN Similarity is just the Katz similarity weighted by the degree of the sender and receiver node along each path, further discounting paths featuring high-degree nodes at either or both ends.
Which leads to the function sketched below (the name and the default \(\alpha\) are assumptions, and the exact scaling used to produce the table that follows, with its positive and negative entries, may differ):
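#a minimal sketch of the LHN formula above; assumes an undirected network with no isolated nodes
lhn.sim <- function(w, alpha = 0.99) { #alpha: a number just under one (assumed)
   n <- nrow(w)
   lambda1 <- max(eigen(w)$values) #first eigenvalue of the adjacency matrix
   Dinv <- diag(1/rowSums(w)) #inverse degree matrix
   S <- Dinv %*% solve(diag(n) - (alpha/lambda1) * w) %*% Dinv
   return(S)
}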
And the Flintstones result:
                Bam-Bam Barney Betty Feldspar  Fred
Bam-Bam           -0.09   0.01  0.01    -0.16 -0.01
Barney             0.01  -0.02  0.00     0.11  0.01
Betty              0.01   0.00 -0.04    -0.48  0.00
Feldspar          -0.16   0.11 -0.48    -2.55  0.25
Fred              -0.01   0.01  0.00     0.25 -0.01
Headmistress       0.20   0.05 -0.13    -2.96 -0.05
Herdmaster         0.16   0.08  0.10     0.07 -0.07
Lava               0.00  -0.02 -0.02    -0.42  0.02
Leach             -0.27  -0.10  0.04     2.01  0.09
Morris            -0.25  -0.03  0.12     1.90  0.03
Mrs Slate         -0.05   0.01  0.04     0.74 -0.01
Pebbles            0.19   0.01  0.01    -0.16 -0.01
Pebbles Bam-Bam    0.03   0.00  0.03     0.16  0.00
Piltdown           0.00   0.00  0.00    -0.35  0.00
Poindexter         0.01   0.03  0.04     0.66 -0.02
Pyrite             0.00   0.00  0.00    -0.35  0.00
Slate             -0.08  -0.01  0.04     0.91  0.01
Wilma              0.01   0.00  0.02     0.09  0.00
Note that, as Leicht, Holme, and Newman (2006) discuss, the entries in the LHN version of the similarity \(\mathbf{S}\) can be either positive or negative. Negative entries mark node pairs that are surprisingly dissimilar given their degrees, and positive entries mark node pairs that are surprisingly similar. Numbers closer to zero are pairs that are neither similar nor dissimilar given their degrees.
Here we can see that Barney and Fred are actually not that similar to one another (after we take into account their very high degree) and that Barney is actually most similar to Feldspar (and Fred even more so).
Because the LHN similarities have negative and positive values, they can be converted into a distance between nodes in the graph, just like the correlation distance. So if we wanted to find blocks of actors using this similarity criterion, all we need to do is:
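A minimal sketch (shifting by the maximum to get non-negative distances, the linkage method, and \(k = 8\) blocks are all assumptions consistent with the output below):

D <- as.dist(max(S) - S) #turn similarities into non-negative distances
hc <- hclust(D, method = "ward.D2") #hierarchical clustering
cutree(hc, k = 8) #cut the tree into eight blocks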

Block 1: Bam-Bam, Betty, Lava, Pebbles, Piltdown, Pyrite
Block 2: Barney, Fred, Pebbles Bam-Bam, Wilma
Block 3: Feldspar
Block 4: Headmistress
Block 5: Herdmaster
Block 6: Leach
Block 7: Morris
Block 8: Mrs Slate, Poindexter, Slate
It seems like this approach is geared towards finding smaller, more fine-grained clusters of similar actors in the data.