Local Vertex Similarities

As we noted in the role equivalence handout, classical approaches to structural similarity in networks worked with distance metrics rather than measuring similarity directly. More recent approaches from network and information science prefer to define vertex similarity directly, using measures based on local structural characteristics like node neighborhoods.

Mathematically, similarity is a less stringent (though also less well-defined) relation between pairs of nodes in a graph than distance, which can make it easier to work with in many applications.

For instance, similarity is required to be symmetric (\(s_{ij} = s_{ji}\) for all \(i\) and \(j\)), and most metrics have natural bounds (e.g., 0 for least similar and 1.0 for maximally similar). Given such a bounded similarity, we can obtain a dissimilarity by subtracting the similarity from one: \(d_{ij} = 1 - s_{ij}\).

Basic Ingredients of Vertex Similarity Metrics

Consider two nodes \(i\) and \(j\), along with the set of nodes that are the immediate neighbors of each. Most vertex similarity measures make use of three pieces of information:

  1. The number of common neighbors \(p\).

  2. The number of actors \(q\) who are connected to node \(i\) but not to node \(j\).

  3. The number of actors \(r\) who are connected to node \(j\) but not to node \(i\).

In the simplest case of a binary undirected graph, these are given by:

\[ p = \sum_{k = 1}^{n} a_{ik} a_{jk} \]

\[ q = \sum_{k = 1}^{n} a_{ik} (1 - a_{jk}) \]

\[ r = \sum_{k = 1}^{n} (1- a_{ik}) a_{jk} \]

In matrix form, with \(\mathbf{1}\) denoting the \(n \times n\) all-ones matrix:

\[ \mathbf{A}(p) = \mathbf{A} \mathbf{A} = \mathbf{A}^2 \]

\[ \mathbf{A}(q) = \mathbf{A} (\mathbf{1} - \mathbf{A}) \]

\[ \mathbf{A}(r) = (\mathbf{1} - \mathbf{A}) \mathbf{A} \]

Let’s look at an example:

   library(networkdata)
   library(igraph)
   library(stringr) #using stringr to change names from all caps to title case
   g <- movie_267
   V(g)$name <- str_to_title(V(g)$name)
   g <- delete_vertices(g, degree(g) <= 3) #deleting low degree vertices
   A <- as.matrix(as_adjacency_matrix(g))
   A.p <- A %*% A #common neighbors matrix
   A.q <- A %*% (1 - A) #neighbors of i not connected to j
   A.r <- (1 - A) %*% A #neighbors of j not connected to i
   A.p[1:10, 1:10]
             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam            6      4     5        2    5            2          2    5
Barney             4     13     9        1   11            5          3    7
Betty              5      9    13        2    9            3          4    8
Feldspar           2      1     2        2    1            0          2    2
Fred               5     11     9        1   14            4          3    9
Headmistress       2      5     3        0    4            5          0    3
Herdmaster         2      3     4        2    3            0          4    4
Lava               5      7     8        2    9            3          4   11
Leach              2      3     3        1    2            2          1    2
Morris             2      3     2        0    2            3          0    2
             Leach Morris
Bam-Bam          2      2
Barney           3      3
Betty            3      2
Feldspar         1      0
Fred             2      2
Headmistress     2      3
Herdmaster       1      0
Lava             2      2
Leach            3      1
Morris           1      3
   A.q[1:10, 1:10]
             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam            0      2     1        4    1            4          4    1
Barney             9      0     4       12    2            8         10    6
Betty              8      4     0       11    4           10          9    5
Feldspar           0      1     0        0    1            2          0    0
Fred               9      3     5       13    0           10         11    5
Headmistress       3      0     2        5    1            0          5    2
Herdmaster         2      1     0        2    1            4          0    0
Lava               6      4     3        9    2            8          7    0
Leach              1      0     0        2    1            1          2    1
Morris             1      0     1        3    1            0          3    1
             Leach Morris
Bam-Bam          4      4
Barney          10     10
Betty           10     11
Feldspar         1      2
Fred            12     12
Headmistress     3      2
Herdmaster       3      4
Lava             9      9
Leach            0      2
Morris           2      0
   A.r[1:10, 1:10]
             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam            0      9     8        0    9            3          2    6
Barney             2      0     4        1    3            0          1    4
Betty              1      4     0        0    5            2          0    3
Feldspar           4     12    11        0   13            5          2    9
Fred               1      2     4        1    0            1          1    2
Headmistress       4      8    10        2   10            0          4    8
Herdmaster         4     10     9        0   11            5          0    7
Lava               1      6     5        0    5            2          0    0
Leach              4     10    10        1   12            3          3    9
Morris             4     10    11        2   12            2          4    9
             Leach Morris
Bam-Bam          1      1
Barney           0      0
Betty            0      1
Feldspar         2      3
Fred             1      1
Headmistress     1      0
Herdmaster       2      3
Lava             1      1
Leach            0      2
Morris           2      0

Note that while \(\mathbf{A}(p)\) is necessarily symmetric, neither \(\mathbf{A}(q)\) nor \(\mathbf{A}(r)\) has to be. Barney has many more neighbors that Bam-Bam is not connected to than vice versa. Also note that in the undirected case the \(\mathbf{A}(r)\) matrix is just the transpose of the \(\mathbf{A}(q)\) matrix.
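We can confirm this transpose relation directly:

   all(A.r == t(A.q)) #TRUE for any undirected graph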

So the most obvious measure of similarity between two nodes is simply the number of common neighbors (Leicht, Holme, and Newman 2006):

\[ s_{ij} = p_{ij} \]

We have already seen a version of this in the directed case when talking about the HITS algorithm (Kleinberg 1999), which computes a spectral (eigenvector-based) ranking based on the matrices of common in- and out-neighbors in a directed graph:

\[ p^{in}_{ij} = \sum_{k = 1}^{n} a_{ki} a_{kj} \]

\[ p^{out}_{ij} = \sum_{k = 1}^{n} a_{ik} a_{jk} \]

Which in matrix form is:

\[ \mathbf{A}(p^{in}) = \mathbf{A}^T \mathbf{A} \]

\[ \mathbf{A}(p^{out}) = \mathbf{A} \mathbf{A}^T \]

In this case, similarity can be measured either by the number of common in-neighbors or the number of common out-neighbors.

If the network under consideration is a (directed) citation network with nodes being papers and links between papers defined as a citation from paper \(i\) to paper \(j\), then the number of common in-neighbors between two papers is their co-citation similarity (the number of other papers that cite both papers), and the number of common out-neighbors is their bibliographic coupling similarity (the overlap in their list of references).
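To make this concrete, here is a minimal sketch using a hypothetical five-paper citation network (papers P1 through P5 are made up for illustration), where an edge from \(i\) to \(j\) means that paper \(i\) cites paper \(j\):

   cit <- graph_from_literal(P1 -+ P3, P2 -+ P3, P1 -+ P4,
                             P2 -+ P4, P3 -+ P5, P4 -+ P5)
   B <- as.matrix(as_adjacency_matrix(cit))
   t(B) %*% B #co-citation: P3 and P4 are both cited by P1 and P2
   B %*% t(B) #bibliographic coupling: P1 and P2 share two references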

One problem with using unbounded quantities like the raw number of common (in or out) neighbors to define node similarity is that they are limited only by the number of nodes in the network (Leicht, Holme, and Newman 2006). Thus, an actor with many neighbors will end up sharing neighbors with a large number of other nodes simply by virtue of their degree, which means we would count them as “similar” to almost everyone.

Normalized Similarity Metrics

Normalized similarity metrics deal with this issue by adjusting the raw similarity based on \(p\) using the number of non-shared neighbors \(q\) and \(r\).

The two most popular versions of normalized vertex similarity scores are the Jaccard index and the cosine similarity (sometimes also referred to as the Salton Index).

The Jaccard index is given by:

\[ s_{ij} = \frac{p}{p + q + r} \]

Which is the size of the intersection of the two nodes’ neighborhoods (the number of common neighbors) divided by the size of the union of the two neighborhoods.

In our example, this would be:

   J <- A.p / (A.p + A.q + A.r)
   round(J[1:10, 1:10], 2)
             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         1.00   0.27  0.36     0.33 0.33         0.22       0.25 0.42
Barney          0.27   1.00  0.53     0.07 0.69         0.38       0.21 0.41
Betty           0.36   0.53  1.00     0.15 0.50         0.20       0.31 0.50
Feldspar        0.33   0.07  0.15     1.00 0.07         0.00       0.50 0.18
Fred            0.33   0.69  0.50     0.07 1.00         0.27       0.20 0.56
Headmistress    0.22   0.38  0.20     0.00 0.27         1.00       0.00 0.23
Herdmaster      0.25   0.21  0.31     0.50 0.20         0.00       1.00 0.36
Lava            0.42   0.41  0.50     0.18 0.56         0.23       0.36 1.00
Leach           0.29   0.23  0.23     0.25 0.13         0.33       0.17 0.17
Morris          0.29   0.23  0.14     0.00 0.13         0.60       0.00 0.17
             Leach Morris
Bam-Bam       0.29   0.29
Barney        0.23   0.23
Betty         0.23   0.14
Feldspar      0.25   0.00
Fred          0.13   0.13
Headmistress  0.33   0.60
Herdmaster    0.17   0.00
Lava          0.17   0.17
Leach         1.00   0.20
Morris        0.20   1.00

This shows, for instance, that Barney is more similar to Fred (0.69) and Betty (0.53) than he is to Bam-Bam (0.27).
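Incidentally, igraph ships with a similarity() function that computes Jaccard (and Dice) scores directly from the graph. Assuming it uses the same neighborhood conventions as our hand computation (self-loops excluded), it should reproduce the matrix above:

   J2 <- similarity(g, method = "jaccard")
   max(abs(J - J2)) #should be zero (up to floating point)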

The cosine similarity is given by:

\[ s_{ij} = \frac{p}{\sqrt{p + q} \sqrt{p + r}} \]

Which is the number of common neighbors divided by the product of the square roots of the two nodes’ degrees (or, equivalently, the square root of the product of the degrees), since \(p + q\) is the degree of node \(i\) and \(p + r\) is the degree of node \(j\).
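We can verify this degree identity on the Flintstones matrices, since every entry in row \(i\) of \(\mathbf{A}(p) + \mathbf{A}(q)\) should equal the degree of node \(i\):

   all(A.p + A.q == degree(g)) #TRUE; R recycles the degree vector down the columns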

Computing the cosine similarity in our example:

   C <- A.p / (sqrt(A.p + A.q) * sqrt(A.p + A.r))
   round(C[1:10, 1:10], 2)
             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         1.00   0.45  0.57     0.58 0.55         0.37       0.41 0.62
Barney          0.45   1.00  0.69     0.20 0.82         0.62       0.42 0.59
Betty           0.57   0.69  1.00     0.39 0.67         0.37       0.55 0.67
Feldspar        0.58   0.20  0.39     1.00 0.19         0.00       0.71 0.43
Fred            0.55   0.82  0.67     0.19 1.00         0.48       0.40 0.73
Headmistress    0.37   0.62  0.37     0.00 0.48         1.00       0.00 0.40
Herdmaster      0.41   0.42  0.55     0.71 0.40         0.00       1.00 0.60
Lava            0.62   0.59  0.67     0.43 0.73         0.40       0.60 1.00
Leach           0.47   0.48  0.48     0.41 0.31         0.52       0.29 0.35
Morris          0.47   0.48  0.32     0.00 0.31         0.77       0.00 0.35
             Leach Morris
Bam-Bam       0.47   0.47
Barney        0.48   0.48
Betty         0.48   0.32
Feldspar      0.41   0.00
Fred          0.31   0.31
Headmistress  0.52   0.77
Herdmaster    0.29   0.00
Lava          0.35   0.35
Leach         1.00   0.33
Morris        0.33   1.00

Showing results similar (pun intended) to those obtained using the Jaccard index.

A less commonly used option is the Dice Coefficient (sometimes called the Sorensen Index) given by:

\[ s_{ij} = \frac{2p}{2p + q + r} \]

Which is twice the number of common neighbors divided by that same quantity plus the total number of non-shared neighbors (and is thus a variation on the Jaccard measure).

In our example, this would be:

   D <- (2 * A.p) / ((2 * A.p) + A.q + A.r)
   round(D[1:10, 1:10], 2)
             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         1.00   0.42  0.53     0.50 0.50         0.36       0.40 0.59
Barney          0.42   1.00  0.69     0.13 0.81         0.56       0.35 0.58
Betty           0.53   0.69  1.00     0.27 0.67         0.33       0.47 0.67
Feldspar        0.50   0.13  0.27     1.00 0.12         0.00       0.67 0.31
Fred            0.50   0.81  0.67     0.12 1.00         0.42       0.33 0.72
Headmistress    0.36   0.56  0.33     0.00 0.42         1.00       0.00 0.38
Herdmaster      0.40   0.35  0.47     0.67 0.33         0.00       1.00 0.53
Lava            0.59   0.58  0.67     0.31 0.72         0.38       0.53 1.00
Leach           0.44   0.38  0.38     0.40 0.24         0.50       0.29 0.29
Morris          0.44   0.38  0.25     0.00 0.24         0.75       0.00 0.29
             Leach Morris
Bam-Bam       0.44   0.44
Barney        0.38   0.38
Betty         0.38   0.25
Feldspar      0.40   0.00
Fred          0.24   0.24
Headmistress  0.50   0.75
Herdmaster    0.29   0.00
Lava          0.29   0.29
Leach         1.00   0.33
Morris        0.33   1.00

Once again, the results are comparable to those from the previous two measures.
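This is no accident: the Dice coefficient is a monotone transformation of the Jaccard index, \(2J/(1 + J)\), so the two always rank pairs of nodes in the same order. We can confirm the equivalence numerically:

   all.equal(D, 2 * J / (1 + J)) #should be TRUE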

Note that all three of these pairwise measures of similarity are bounded between zero and one, with nodes being maximally similar to themselves, and with pairs of distinct nodes being maximally similar when they have the same set of neighbors (i.e., when they are structurally equivalent).

Leicht, Holme, and Newman (2006) introduce a variation on the theme of normalized structural similarity scores. Their point is that we should care about nodes that are surprisingly similar given a suitable null model, and they propose the configuration model as that baseline: a random graph with the same degree distribution as the original but with connections otherwise formed at random.

The LHN similarity index (for Leicht, Holme, and Newman) is then given by:

\[ s_{ij} = \frac{p}{(p + q)(p + r)} \]

Which can be seen as a variation of the cosine similarity defined earlier, without the square roots in the denominator.

In our example, this would be:

   L <- A.p / ((A.p + A.q) * (A.p + A.r)) #LHN similarity
   round(L[1:10, 1:10], 2)
             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         0.17   0.05  0.06     0.17 0.06         0.07       0.08 0.08
Barney          0.05   0.08  0.05     0.04 0.06         0.08       0.06 0.05
Betty           0.06   0.05  0.08     0.08 0.05         0.05       0.08 0.06
Feldspar        0.17   0.04  0.08     0.50 0.04         0.00       0.25 0.09
Fred            0.06   0.06  0.05     0.04 0.07         0.06       0.05 0.06
Headmistress    0.07   0.08  0.05     0.00 0.06         0.20       0.00 0.05
Herdmaster      0.08   0.06  0.08     0.25 0.05         0.00       0.25 0.09
Lava            0.08   0.05  0.06     0.09 0.06         0.05       0.09 0.09
Leach           0.11   0.08  0.08     0.17 0.05         0.13       0.08 0.06
Morris          0.11   0.08  0.05     0.00 0.05         0.20       0.00 0.06
             Leach Morris
Bam-Bam       0.11   0.11
Barney        0.08   0.08
Betty         0.08   0.05
Feldspar      0.17   0.00
Fred          0.05   0.05
Headmistress  0.13   0.20
Herdmaster    0.08   0.00
Lava          0.06   0.06
Leach         0.33   0.11
Morris        0.11   0.33

Which, once again, produces similar results to what we found before. Note, however, that the LHN score is not naturally maximal for self-similarity: its diagonal entries equal \(1/k_i\), the reciprocal of each node’s degree, rather than 1.
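We can check this from the code above: the diagonal of the LHN matrix should equal the reciprocal of each node’s degree.

   all.equal(diag(L), 1 / degree(g), check.attributes = FALSE) #should be TRUE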

Similarity and Structural Equivalence

All normalized similarity measures bounded between zero and one (like Jaccard, Cosine, and Dice) also define a distance on each pair of nodes which is equal to one minus the similarity. So the cosine distance between two nodes is one minus the cosine similarity, and so on for the Jaccard and Dice indexes.

Because they define distances, this also means that these approaches can be used to find approximately structurally equivalent classes of nodes in a graph just like we did with the Euclidean and correlation distances.

For instance, consider our toy graph from before with four structurally equivalent sets of nodes:

A toy graph demonstrating structural equivalence.

The cosine similarity matrix for this graph is:

   W <- as.matrix(as_adjacency_matrix(h)) #h is the toy graph pictured above
   W.p <- W %*% W
   W.q <- W %*% (1 - W) 
   W.r <- (1 - W) %*% W
   C <- W.p / (sqrt(W.p + W.q) * sqrt(W.p + W.r))
   round(C, 2)
     A    B    C    D    E    F    G    H    I
A 1.00 0.26 0.26 0.58 0.58 0.58 0.67 0.00 0.00
B 0.26 1.00 1.00 0.00 0.00 0.00 0.26 0.67 0.67
C 0.26 1.00 1.00 0.00 0.00 0.00 0.26 0.67 0.67
D 0.58 0.00 0.00 1.00 1.00 1.00 0.58 0.25 0.25
E 0.58 0.00 0.00 1.00 1.00 1.00 0.58 0.25 0.25
F 0.58 0.00 0.00 1.00 1.00 1.00 0.58 0.25 0.25
G 0.67 0.26 0.26 0.58 0.58 0.58 1.00 0.00 0.00
H 0.00 0.67 0.67 0.25 0.25 0.25 0.00 1.00 0.75
I 0.00 0.67 0.67 0.25 0.25 0.25 0.00 0.75 1.00

Note that structurally equivalent nodes have similarity scores equal to 1.0. In this case, the distance matrix is given by:

   D <- 1 - C

And a hierarchical clustering on this matrix reveals the structurally equivalent classes:

   D <- dist(D) #Euclidean distances between the rows of the distance matrix
   h.res <- hclust(D, method = "ward.D2")
   plot(h.res)

We can package all that we said before into a handy function that takes a graph as input and returns all four normalized similarity metrics as output:

   vertex.sim <- function(x) {
      A <- as.matrix(as_adjacency_matrix(x))
      A.p <- A %*% A #common neighbors
      A.q <- A %*% (1 - A) #neighbors of i not connected to j
      A.r <- (1 - A) %*% A #neighbors of j not connected to i
      J <- A.p / (A.p + A.q + A.r) #Jaccard
      C <- A.p / (sqrt(A.p + A.q) * sqrt(A.p + A.r)) #cosine
      D <- (2 * A.p) / ((2 * A.p) + A.q + A.r) #Dice
      L <- A.p / ((A.p + A.q) * (A.p + A.r)) #LHN
      return(list(J = J, C = C, D = D, L = L))
   }
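As a quick sanity check, the function should reproduce the cosine matrix we computed by hand for the toy graph:

   all.equal(vertex.sim(h)$C, C) #should be TRUE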

In the Flintstones network, we could then find structurally equivalent blocks from the similarity analysis as follows (using Cosine):

   D <- dist(1 - vertex.sim(g)$C) #cosine distance matrix
   h.res <- hclust(D, method = "ward.D2") #hierarchical clustering
   plot(h.res)

We can then extract our blocks just like we did with the Euclidean distance hierarchical clustering results in the previous handout:

   blocks  <- cutree(h.res, k = 6)
   blocks
        Bam-Bam          Barney           Betty        Feldspar            Fred 
              1               2               2               3               2 
   Headmistress      Herdmaster            Lava           Leach          Morris 
              4               3               2               5               4 
      Mrs Slate         Pebbles Pebbles Bam-Bam        Piltdown      Poindexter 
              2               1               5               6               5 
         Pyrite           Slate           Wilma 
              6               2               2 

This analysis now puts \(\{Barney, Betty, Fred, Lava, Mrs. Slate, Slate, Wilma\}\) in the second block of equivalent actors (shown as the right-most cluster of actors in the dendrogram).

References

Kleinberg, Jon M. 1999. “Authoritative Sources in a Hyperlinked Environment.” Journal of the ACM 46 (5): 604–32.
Leicht, Elizabeth A., Petter Holme, and Mark E. J. Newman. 2006. “Vertex Similarity in Networks.” Physical Review E 73 (2): 026120.