Local Vertex Similarities

As we noted in the structural equivalence lecture notes, for odd reasons, classical approaches to structural similarity in networks used distance metrics but did not measure similarity directly. More recent approaches from network and information science prefer to define vertex similarity using direct similarity measures based on local structural characteristics, aka node neighborhoods.

Compared to distance, similarity is a less stringent (but also less well-defined mathematically) relation between pairs of nodes in a graph, so it can be easier to work with in most applications.

For instance, similarity is required to be symmetric (\(s_{ij} = s_{ji}\) for all \(i\) and \(j\)) and most metrics have reasonable bounds (e.g., 0 for least similar and 1.0 for maximally similar). Given such a bounded similarity we can get to dissimilarity by subtracting it from one: \(d_{ij} = 1 - s_{ij}\), although this is not necessarily guaranteed to return a strict distance metric.

Basic Ingredients of Vertex Similarity Metrics

Consider two nodes and the set of nodes that are the immediate neighbors to each. In this case, most vertex similarity measures will make use of three pieces of information:

The number of common neighbors \(p\).
The number of actors \(q\) who are connected to node \(i\) but not to node \(j\).
The number of actors \(r\) who are connected to node \(j\) but not to node \(i\).

In the simplest case of a binary undirected graph, these are given by:

\[ p = \sum_{k = 1}^{n} a_{ik} a_{jk} \]

\[ q = \sum_{k = 1}^{n} a_{ik} (1 - a_{jk}) \]

\[ r = \sum_{k = 1}^{n} (1- a_{ik}) a_{jk} \]

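In matrix form, writing \(1 - \mathbf{A}\) for the elementwise complement of the adjacency matrix (ones where \(\mathbf{A}\) has zeros and vice versa):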

\[ \mathbf{A}(p) = \mathbf{A} \mathbf{A} = \mathbf{A}^2 \]

\[ \mathbf{A}(q) = \mathbf{A} (1 - \mathbf{A}) \]

\[ \mathbf{A}(r) = (1 - \mathbf{A}) \mathbf{A} \]

Let’s look at an example:

   library(networkdata)
   library(igraph)
   library(stringr) #using stringr to change names from all caps to title case
   g <- movie_267
   V(g)$name <- str_to_title(V(g)$name)
   g <- delete_vertices(g, degree(g) <= 3) #deleting low degree vertices
   A <- as.matrix(as_adjacency_matrix(g))
   A.p <- A %*% A #common neighbors matrix
   A.q <- A %*% (1 - A) #neighbors of i not connected to j
   A.r <- (1 - A) %*% A #neighbors of j not connected to i
   A.p[1:10, 1:10]

             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam            6      4     5        2    5            2          2    5
Barney             4     13     9        1   11            5          3    7
Betty              5      9    13        2    9            3          4    8
Feldspar           2      1     2        2    1            0          2    2
Fred               5     11     9        1   14            4          3    9
Headmistress       2      5     3        0    4            5          0    3
Herdmaster         2      3     4        2    3            0          4    4
Lava               5      7     8        2    9            3          4   11
Leach              2      3     3        1    2            2          1    2
Morris             2      3     2        0    2            3          0    2
             Leach Morris
Bam-Bam          2      2
Barney           3      3
Betty            3      2
Feldspar         1      0
Fred             2      2
Headmistress     2      3
Herdmaster       1      0
Lava             2      2
Leach            3      1
Morris           1      3

   A.q[1:10, 1:10]

             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam            0      2     1        4    1            4          4    1
Barney             9      0     4       12    2            8         10    6
Betty              8      4     0       11    4           10          9    5
Feldspar           0      1     0        0    1            2          0    0
Fred               9      3     5       13    0           10         11    5
Headmistress       3      0     2        5    1            0          5    2
Herdmaster         2      1     0        2    1            4          0    0
Lava               6      4     3        9    2            8          7    0
Leach              1      0     0        2    1            1          2    1
Morris             1      0     1        3    1            0          3    1
             Leach Morris
Bam-Bam          4      4
Barney          10     10
Betty           10     11
Feldspar         1      2
Fred            12     12
Headmistress     3      2
Herdmaster       3      4
Lava             9      9
Leach            0      2
Morris           2      0

   A.r[1:10, 1:10]

             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam            0      9     8        0    9            3          2    6
Barney             2      0     4        1    3            0          1    4
Betty              1      4     0        0    5            2          0    3
Feldspar           4     12    11        0   13            5          2    9
Fred               1      2     4        1    0            1          1    2
Headmistress       4      8    10        2   10            0          4    8
Herdmaster         4     10     9        0   11            5          0    7
Lava               1      6     5        0    5            2          0    0
Leach              4     10    10        1   12            3          3    9
Morris             4     10    11        2   12            2          4    9
             Leach Morris
Bam-Bam          1      1
Barney           0      0
Betty            0      1
Feldspar         2      3
Fred             1      1
Headmistress     1      0
Herdmaster       2      3
Lava             1      1
Leach            0      2
Morris           2      0

Note that while \(\mathbf{A}(p)\) is necessarily symmetric, neither \(\mathbf{A}(q)\) nor \(\mathbf{A}(r)\) has to be. Barney has many more neighbors that Bam-Bam is not connected to than vice versa. Also note that, in the undirected case, the \(\mathbf{A}(r)\) matrix is just the transpose of the \(\mathbf{A}(q)\) matrix: \(\mathbf{A}(r) = \left[\mathbf{A}(q)\right]^T\).
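These properties can be checked on a small example. The following is a minimal sketch in base R, assuming a hypothetical four-node undirected graph (the edge list and object names are illustrative, not part of the Flintstones data). It also shows that every row of \(\mathbf{A}(p) + \mathbf{A}(q)\) is constant and equal to the degree of node \(i\), a fact used by the degree-based measures.

```r
# Hypothetical toy graph with undirected edges 1-2, 1-3, 2-3, 3-4
A <- matrix(0, 4, 4)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))
A[edges] <- 1
A[edges[, 2:1]] <- 1 #add the reverse direction to make A symmetric

p <- A %*% A #common neighbors
q <- A %*% (1 - A) #neighbors of i not connected to j
r <- (1 - A) %*% A #neighbors of j not connected to i
deg <- rowSums(A) #node degrees

isSymmetric(p) #the common neighbors matrix is symmetric
all(r == t(q)) #r is the transpose of q
all(p + q == deg) #entry (i, j) of p + q is the degree of i
```

Each of the last three lines should evaluate to TRUE for any binary undirected graph without self-loops.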

Raw Number of Common Neighbors

The most obvious measure of similarity between two nodes is simply the number of common neighbors (Leicht, Holme, and Newman 2006):

\[ s_{ij}^{CN} = p_{ij} \]

We have already seen a version of this in the directed case when talking about the HITS algorithm (Kleinberg 1999), which computes a spectral (eigenvector-based) ranking based on the matrices of common in and out-neighbors in a directed graph.

\[ p^{in}_{ij} = \sum_{k = 1}^{n} a_{ki} a_{kj} \]

\[ p^{out}_{ij} = \sum_{k = 1}^{n} a_{ik} a_{jk} \]

Which in matrix form is:

\[ \mathbf{A}(p^{in}) = \mathbf{A}^T \mathbf{A} \]

\[ \mathbf{A}(p^{out}) = \mathbf{A} \mathbf{A}^T \]

In this case, similarity can be measured either by the number of common in-neighbors or the number of common out-neighbors.

If the network under consideration is a (directed) citation network with nodes being papers and links between papers defined as a citation from paper \(i\) to paper \(j\), then the number of common in-neighbors between two papers is their co-citation similarity (the number of other papers that cite both papers), and the number of common out-neighbors is their bibliographic coupling similarity (the overlap in their list of references).
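As a minimal sketch of the citation case, assume a hypothetical four-paper network in which papers P1 and P2 both cite P3 and P4 (the paper labels and object names are made up for illustration). With \(a_{ij} = 1\) meaning that paper \(i\) cites paper \(j\), the common in-neighbor counts come from \(\mathbf{A}^T\mathbf{A}\) and the common out-neighbor counts from \(\mathbf{A}\mathbf{A}^T\):

```r
# Hypothetical citation network: A[i, j] = 1 means paper i cites paper j
papers <- paste0("P", 1:4)
A <- matrix(0, 4, 4, dimnames = list(papers, papers))
A["P1", c("P3", "P4")] <- 1 #P1 cites P3 and P4
A["P2", c("P3", "P4")] <- 1 #P2 cites P3 and P4

cocitation <- t(A) %*% A #common in-neighbors: papers citing both i and j
coupling <- A %*% t(A) #common out-neighbors: papers cited by both i and j

cocitation["P3", "P4"] #2, since P1 and P2 cite both
coupling["P1", "P2"] #2, since both cite P3 and P4
```

In this toy network P3 and P4 cite nothing, so their coupling similarity is zero, while P1 and P2 are never cited, so their co-citation similarity is zero.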

Similarities Based on the Vertex Degrees

The Preferential Attachment Index

One approach is to count two nodes as similar if they are both connected to lots of other nodes. In terms of the quantities above, this preferential attachment index, proposed by Barabási and Albert (1999) is very simple to compute since it boils down to:

\[ s_{ij}^{PA} = (p + q)(p + r) \tag{1}\]

Which is just the product of the degrees of the two nodes. In terms of the matrices defined earlier, the preferential attachment index can be calculated as follows:

   P.a <- (A.p + A.q) * (A.p + A.r)
   P.a <- P.a / max(P.a)
   round(P.a[1:10, 1:10], 2)

             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         0.18   0.40  0.40     0.06 0.43         0.15       0.12 0.34
Barney          0.40   0.86  0.86     0.13 0.93         0.33       0.27 0.73
Betty           0.40   0.86  0.86     0.13 0.93         0.33       0.27 0.73
Feldspar        0.06   0.13  0.13     0.02 0.14         0.05       0.04 0.11
Fred            0.43   0.93  0.93     0.14 1.00         0.36       0.29 0.79
Headmistress    0.15   0.33  0.33     0.05 0.36         0.13       0.10 0.28
Herdmaster      0.12   0.27  0.27     0.04 0.29         0.10       0.08 0.22
Lava            0.34   0.73  0.73     0.11 0.79         0.28       0.22 0.62
Leach           0.09   0.20  0.20     0.03 0.21         0.08       0.06 0.17
Morris          0.09   0.20  0.20     0.03 0.21         0.08       0.06 0.17
             Leach Morris
Bam-Bam       0.09   0.09
Barney        0.20   0.20
Betty         0.20   0.20
Feldspar      0.03   0.03
Fred          0.21   0.21
Headmistress  0.08   0.08
Herdmaster    0.06   0.06
Lava          0.17   0.17
Leach         0.05   0.05
Morris        0.05   0.05

Which shows Barney to be very similar to Betty and Fred as we would expect.

One problem with using unbounded quantities like the sheer number of common (in or out) neighbors to define node similarity is that they are only limited by the number of nodes in the network (Leicht, Holme, and Newman 2006). Thus, an actor with many neighbors will end up having lots of neighbors in common with lots of other nodes, which means we would count them as “similar” to almost everyone.

Degree-Discounted Common Neighbor Metrics

Adamic-Adar Index

The raw number of common neighbors is a simple and intuitive measure of similarity, but it counts every common neighbor between two nodes as the same in determining the similarity between two nodes.

To deal with this issue, Adamic and Adar (2003) propose a tweak on the raw number of common neighbors measure that is inspired by the PageRank-style degree-weighting approach. The basic idea is to discount a common connection to a high-degree neighbor and to give more weight to a common connection to a more selective (low-degree) neighbor in determining the pairwise similarity.

This metric, called the Adamic-Adar index can be computed as follows:

\[ s_{ij}^{AA} = \sum_{k \in N(i) \cap N(j)}\frac{1}{\log(d_k)} \tag{2}\]

Where \(d_k\) is the degree of common neighbor \(k\). Dividing by the natural logarithm of the degree, rather than by the degree itself, softens the discount applied to high-degree neighbors.
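One way to compute this index for all pairs at once is to place the weights \(1/\log(d_k)\) on the diagonal of a matrix and sandwich it between two copies of the adjacency matrix. The following is a minimal sketch in base R on a hypothetical five-node graph. The object names are illustrative, and setting the weight to zero for nodes of degree one or less is an assumption made here to avoid dividing by \(\log(1) = 0\) (such nodes can never be a common neighbor of two distinct nodes).

```r
# Hypothetical toy graph with undirected edges 1-3, 2-3, 1-4, 2-4, 4-5
A <- matrix(0, 5, 5)
edges <- rbind(c(1, 3), c(2, 3), c(1, 4), c(2, 4), c(4, 5))
A[edges] <- 1
A[edges[, 2:1]] <- 1 #add the reverse direction to make A symmetric

d <- rowSums(A) #degrees: 2, 2, 2, 3, 1
w <- ifelse(d > 1, 1 / log(d), 0) #assumption: zero weight when degree <= 1
AA <- A %*% diag(w) %*% A #entry (i, j) sums w over common neighbors of i and j

AA[1, 2] #common neighbors are 3 and 4: 1/log(2) + 1/log(3)
```

Only the off-diagonal entries of the result are meaningful as pairwise similarities.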

The Resource Allocation Index

Another approach inspired by the same idea as Adamic and Adar (2003) is that proposed by Zhou, Lü, and Zhang (2009), which is the same as Adamic and Adar (2003) but without using the log to lower the discount for high degree nodes as degree increases:

\[ s_{ij}^{RA} = \sum_{k \in N(i) \cap N(j)}\frac{1}{d_k} \tag{3}\]

This is the resource allocation index of Zhou, Lü, and Zhang (2009).
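As a minimal sketch in base R, the resource allocation index can be computed for all pairs by weighting each potential common neighbor \(k\) by \(1/d_k\). The five-node graph and the object names below are hypothetical, and the zero weight for isolated nodes is an assumption made to avoid dividing by zero.

```r
# Hypothetical toy graph with undirected edges 1-3, 2-3, 1-4, 2-4, 4-5
A <- matrix(0, 5, 5)
edges <- rbind(c(1, 3), c(2, 3), c(1, 4), c(2, 4), c(4, 5))
A[edges] <- 1
A[edges[, 2:1]] <- 1 #add the reverse direction to make A symmetric

d <- rowSums(A) #degrees: 2, 2, 2, 3, 1
w <- ifelse(d > 0, 1 / d, 0) #assumption: zero weight for isolated nodes
RA <- A %*% diag(w) %*% A #entry (i, j) sums 1/d_k over common neighbors

RA[1, 2] #common neighbors are 3 and 4: 1/2 + 1/3
```

Because \(1/d_k\) shrinks faster than \(1/\log(d_k)\), a shared high-degree neighbor contributes less here than it would under a logarithmic discount.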

Fully Normalized Similarity Metrics

Jaccard Index

The most popular normalized vertex similarity score is the Jaccard index, which is given by:

\[ s_{ij}^J = \frac{p}{p + q + r} \tag{4}\]

Which is the ratio of the size of the intersection of the neighborhoods of the two nodes (number of common neighbors \(p\)) to the size of the union of the two neighborhoods (\(p + q + r\)). When the neighborhoods of the two nodes coincide (i.e., \(q = 0\) and \(r = 0\)) then Equation 4 turns into \(\frac{p}{p} = 1.0\), which is the maximum Jaccard similarity between two nodes.

In our example, we can compute the Jaccard similarity as follows:

   J <- A.p / (A.p + A.q + A.r)
   round(J[1:10, 1:10], 2)

             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         1.00   0.27  0.36     0.33 0.33         0.22       0.25 0.42
Barney          0.27   1.00  0.53     0.07 0.69         0.38       0.21 0.41
Betty           0.36   0.53  1.00     0.15 0.50         0.20       0.31 0.50
Feldspar        0.33   0.07  0.15     1.00 0.07         0.00       0.50 0.18
Fred            0.33   0.69  0.50     0.07 1.00         0.27       0.20 0.56
Headmistress    0.22   0.38  0.20     0.00 0.27         1.00       0.00 0.23
Herdmaster      0.25   0.21  0.31     0.50 0.20         0.00       1.00 0.36
Lava            0.42   0.41  0.50     0.18 0.56         0.23       0.36 1.00
Leach           0.29   0.23  0.23     0.25 0.13         0.33       0.17 0.17
Morris          0.29   0.23  0.14     0.00 0.13         0.60       0.00 0.17
             Leach Morris
Bam-Bam       0.29   0.29
Barney        0.23   0.23
Betty         0.23   0.14
Feldspar      0.25   0.00
Fred          0.13   0.13
Headmistress  0.33   0.60
Herdmaster    0.17   0.00
Lava          0.17   0.17
Leach         1.00   0.20
Morris        0.20   1.00

This shows that Barney is more similar to Fred and Betty than he is to Bam-Bam.
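As a worked check against the matrices printed earlier, take Barney as node \(i\) and Fred as node \(j\). The corresponding entries are \(p = 11\), \(q = 2\), and \(r = 3\), so that:

\[ s_{ij}^J = \frac{11}{11 + 2 + 3} = \frac{11}{16} \approx 0.69 \]

Which matches the Barney and Fred entry in the Jaccard matrix.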

The Salton Index

The cosine similarity (also called the Salton index) is given by:

\[ s_{ij}^{SAL} = \frac{p}{\sqrt{p + q} \sqrt{p + r}} \tag{5}\]

Which is the ratio of the number of common neighbors divided by the product of the square root of the degrees of each node (or the square root of the product which is the same thing), since \(p\) + \(q\) is the degree of node \(i\) and \(p\) + \(r\) is the degree of node \(j\).

When the neighborhoods of the two nodes coincide (e.g., \(q = 0\) and \(r = 0\)) then Equation 5 turns into \(\frac{p}{\sqrt{p^2}} = \frac{p}{p} = 1.0\).

In our example, we can compute the Cosine similarity as follows:

   C <- A.p / (sqrt(A.p + A.q) * sqrt(A.p + A.r))
   round(C[1:10, 1:10], 2)

             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         1.00   0.45  0.57     0.58 0.55         0.37       0.41 0.62
Barney          0.45   1.00  0.69     0.20 0.82         0.62       0.42 0.59
Betty           0.57   0.69  1.00     0.39 0.67         0.37       0.55 0.67
Feldspar        0.58   0.20  0.39     1.00 0.19         0.00       0.71 0.43
Fred            0.55   0.82  0.67     0.19 1.00         0.48       0.40 0.73
Headmistress    0.37   0.62  0.37     0.00 0.48         1.00       0.00 0.40
Herdmaster      0.41   0.42  0.55     0.71 0.40         0.00       1.00 0.60
Lava            0.62   0.59  0.67     0.43 0.73         0.40       0.60 1.00
Leach           0.47   0.48  0.48     0.41 0.31         0.52       0.29 0.35
Morris          0.47   0.48  0.32     0.00 0.31         0.77       0.00 0.35
             Leach Morris
Bam-Bam       0.47   0.47
Barney        0.48   0.48
Betty         0.48   0.32
Feldspar      0.41   0.00
Fred          0.31   0.31
Headmistress  0.52   0.77
Herdmaster    0.29   0.00
Lava          0.35   0.35
Leach         1.00   0.33
Morris        0.33   1.00

Showing results similar (pun intended) to those obtained using the Jaccard index.

The Sorensen Index

A less commonly used option is the Sorensen Index (sometimes called the Dice Coefficient), which is given by:

\[ s_{ij}^{SOR} = \frac{2p}{(p + q) + (p + r)} = \frac{2p}{2p + q + r} \tag{6}\]

Which is given by the ratio of twice the number of common neighbors divided by the sum of the number of total neighbors of the two nodes (hence the number of common neighbors \(p\) shows up twice also in the denominator). When the neighborhoods of the two nodes coincide (e.g., \(q = 0\) and \(r = 0\)) then Equation 6 turns into \(\frac{2p}{2p} = 1.0\).

In our example, we can compute the Dice similarity as follows:

   D <- (2 * A.p) / ((2 * A.p) + A.q + A.r)
   round(D[1:10, 1:10], 2)

             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         1.00   0.42  0.53     0.50 0.50         0.36       0.40 0.59
Barney          0.42   1.00  0.69     0.13 0.81         0.56       0.35 0.58
Betty           0.53   0.69  1.00     0.27 0.67         0.33       0.47 0.67
Feldspar        0.50   0.13  0.27     1.00 0.12         0.00       0.67 0.31
Fred            0.50   0.81  0.67     0.12 1.00         0.42       0.33 0.72
Headmistress    0.36   0.56  0.33     0.00 0.42         1.00       0.00 0.38
Herdmaster      0.40   0.35  0.47     0.67 0.33         0.00       1.00 0.53
Lava            0.59   0.58  0.67     0.31 0.72         0.38       0.53 1.00
Leach           0.44   0.38  0.38     0.40 0.24         0.50       0.29 0.29
Morris          0.44   0.38  0.25     0.00 0.24         0.75       0.00 0.29
             Leach Morris
Bam-Bam       0.44   0.44
Barney        0.38   0.38
Betty         0.38   0.25
Feldspar      0.40   0.00
Fred          0.24   0.24
Headmistress  0.50   0.75
Herdmaster    0.29   0.00
Lava          0.29   0.29
Leach         1.00   0.33
Morris        0.33   1.00

Once again, showing results comparable to the previous.

The Leicht-Holme-Newman Index

Note, that all three of the fully normalized pairwise measures of similarity are bounded between zero and one, with nodes being maximally similar to themselves and with pairs of distinct nodes being maximally similar when they have the same set of neighbors (e.g., they are structurally equivalent).

Leicht, Holme, and Newman (2006) introduce a variation on the theme of normalized structural similarity scores. Their point is that maybe we should care about nodes that are surprisingly similar given some suitable null model. They propose the configuration model as such a null model. This model takes a graph with the same degree distribution as the original but with connections formed at random as reference.

The LHN similarity index (for Leicht, Holme, and Newman) is then given by:

\[ s_{ij}^{LHN} = \frac{p}{(p + q)(p + r)} \tag{7}\]

Which can be seen as a variation of the cosine similarity defined earlier. Note that in contrast to the other similarities we have considered, which reach a maximum of 1.0 when the two nodes have the same neighborhood (\(q = 0\) and \(r = 0\)), the LHN similarity only reaches that limit for two nodes that share a single common neighbor and have no other neighbors (\(p = 1\), \(q = 0\), and \(r = 0\)).

For nodes with identical neighborhoods, the LHN score will always be \(\frac{p}{p^2} = \frac{1}{p}\), which means that it will decrease as node degree increases, even when the two nodes share the same neighbors. This makes sense, since it is much more “surprising” for two nodes to share neighbors when they only have a few friends than when they have lots of friends.

In our example, we can compute the LHN similarity as follows:

   L <- A.p / ((A.p + A.q) * (A.p + A.r))
   round(L[1:10, 1:10], 2)

             Bam-Bam Barney Betty Feldspar Fred Headmistress Herdmaster Lava
Bam-Bam         0.17   0.05  0.06     0.17 0.06         0.07       0.08 0.08
Barney          0.05   0.08  0.05     0.04 0.06         0.08       0.06 0.05
Betty           0.06   0.05  0.08     0.08 0.05         0.05       0.08 0.06
Feldspar        0.17   0.04  0.08     0.50 0.04         0.00       0.25 0.09
Fred            0.06   0.06  0.05     0.04 0.07         0.06       0.05 0.06
Headmistress    0.07   0.08  0.05     0.00 0.06         0.20       0.00 0.05
Herdmaster      0.08   0.06  0.08     0.25 0.05         0.00       0.25 0.09
Lava            0.08   0.05  0.06     0.09 0.06         0.05       0.09 0.09
Leach           0.11   0.08  0.08     0.17 0.05         0.13       0.08 0.06
Morris          0.11   0.08  0.05     0.00 0.05         0.20       0.00 0.06
             Leach Morris
Bam-Bam       0.11   0.11
Barney        0.08   0.08
Betty         0.08   0.05
Feldspar      0.17   0.00
Fred          0.05   0.05
Headmistress  0.13   0.20
Herdmaster    0.08   0.00
Lava          0.06   0.06
Leach         0.33   0.11
Morris        0.11   0.33

Which, once again, produces similar results to what we found before. Note, however, that the LHN is not naturally maximal for self-similar nodes.

We can package all that we said before into a handy function that takes a graph as input and returns all four fully normalized similarity metrics as output:

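   vertex.sim <- function(w) {
      x <- as.matrix(as_adjacency_matrix(w)) #adjacency matrix of the graph
      p <- x %*% x #common neighbors
      q <- x %*% (1 - x) #neighbors of i not connected to j
      r <- (1 - x) %*% x #neighbors of j not connected to i
      j <- p / (p + q + r) #Jaccard
      cs <- p / (sqrt(p + q) * sqrt(p + r)) #cosine (Salton)
      d <- (2 * p) / ((2 * p) + q + r) #Dice (Sorensen)
      l <- p / ((p + q) * (p + r)) #Leicht-Holme-Newman
      return(list(J = j, C = cs, D = d, L = l))
      }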

Partially Normalized Similarity Metrics

The fully normalized similarity metrics are designed to mute the influence of nodes with high degree in determining the pairwise similarity of two nodes, because we normalize by some function of both the degrees of the nodes in question.

Hub Promoted Index

Sometimes, however, we may actually want to magnify the influence of high-degree nodes in determining the pairwise similarity. One approach due to Ravasz et al. (2002), is to normalize by the minimum of the two degrees. This is called the hub promoted index and is given by:

\[ s_{ij}^{HP} = \frac{p}{\min\left[(p + q), (p + r)\right]} \tag{8}\]

Here, two nodes are maximally similar when the neighborhood of the smaller node is fully included in the neighborhood of the bigger node (\(p = \min\left[(p + q), (p + r)\right]\)). This approach thus “promotes” high degree nodes by making them more similar to low degree nodes.

Hub Depressed Index

We can, of course, follow the reverse approach and suppress the influence of high-degree nodes and count nodes as similar only when their degrees are also similar. This is the hub depressed index also proposed by Ravasz et al. (2002), and is computed as follows:

\[ s_{ij}^{HD} = \frac{p}{\max\left[(p + q), (p + r)\right]} \tag{9}\]

Here by dividing by the maximum, pairings of high degree and low degree nodes are not counted as similar, while pairings of similar-degree nodes are.
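Both indices can be computed from the same \(p\), \(q\), and \(r\) matrices used earlier. The following is a minimal sketch in base R, assuming a hypothetical five-node graph in which node 1 has a single neighbor (node 3) that is also one of the three neighbors of node 2. The object names are illustrative, and the sketch assumes no isolated nodes (otherwise the denominators can be zero).

```r
# Hypothetical toy graph with undirected edges 1-3, 2-3, 2-4, 2-5
A <- matrix(0, 5, 5)
edges <- rbind(c(1, 3), c(2, 3), c(2, 4), c(2, 5))
A[edges] <- 1
A[edges[, 2:1]] <- 1 #add the reverse direction to make A symmetric

p <- A %*% A #common neighbors
q <- A %*% (1 - A) #neighbors of i not connected to j
r <- (1 - A) %*% A #neighbors of j not connected to i

HP <- p / pmin(p + q, p + r) #hub promoted: divide by the smaller degree
HD <- p / pmax(p + q, p + r) #hub depressed: divide by the larger degree

HP[1, 2] #1, the neighborhood of node 1 is contained in that of node 2
HD[1, 2] #1/3, the degrees of nodes 1 and 2 are very different
```

The same pair of nodes is therefore maximally similar under the hub promoted index but only weakly similar under the hub depressed index.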

Similarity and Structural Equivalence

All normalized similarity measures bounded between zero and one (like Jaccard, Cosine, and Dice) also define a distance on each pair of nodes which is equal to one minus the similarity. So the cosine distance between two nodes is one minus the cosine similarity, and so on for the Jaccard and Dice indexes.¹

Because they define distances, this also means that these approaches can be used to find approximately structurally equivalent classes of nodes in a graph just like we did with the Euclidean and correlation distances.

For instance, consider our toy graph from our previous discussion showcasing four structurally equivalent sets of nodes:

A toy graph demonstrating structural equivalence.

The cosine similarity matrix for this graph is:

   W <- as.matrix(as_adjacency_matrix(h))
   W.p <- W %*% W
   W.q <- W %*% (1 - W) 
   W.r <- (1 - W) %*% W
   C <- W.p / (sqrt(W.p + W.q) * sqrt(W.p + W.r))
   round(C, 2)

     A    B    C    D    E    F    G    H    I
A 1.00 0.26 0.26 0.58 0.58 0.58 0.67 0.00 0.00
B 0.26 1.00 1.00 0.00 0.00 0.00 0.26 0.67 0.67
C 0.26 1.00 1.00 0.00 0.00 0.00 0.26 0.67 0.67
D 0.58 0.00 0.00 1.00 1.00 1.00 0.58 0.25 0.25
E 0.58 0.00 0.00 1.00 1.00 1.00 0.58 0.25 0.25
F 0.58 0.00 0.00 1.00 1.00 1.00 0.58 0.25 0.25
G 0.67 0.26 0.26 0.58 0.58 0.58 1.00 0.00 0.00
H 0.00 0.67 0.67 0.25 0.25 0.25 0.00 1.00 0.75
I 0.00 0.67 0.67 0.25 0.25 0.25 0.00 0.75 1.00

Note that structurally equivalent nodes have similarity scores equal to 1.0. In this case, the distance matrix is given by:

   D <- 1 - C
   round(D, 2)

     A    B    C    D    E    F    G    H    I
A 0.00 0.74 0.74 0.42 0.42 0.42 0.33 1.00 1.00
B 0.74 0.00 0.00 1.00 1.00 1.00 0.74 0.33 0.33
C 0.74 0.00 0.00 1.00 1.00 1.00 0.74 0.33 0.33
D 0.42 1.00 1.00 0.00 0.00 0.00 0.42 0.75 0.75
E 0.42 1.00 1.00 0.00 0.00 0.00 0.42 0.75 0.75
F 0.42 1.00 1.00 0.00 0.00 0.00 0.42 0.75 0.75
G 0.33 0.74 0.74 0.42 0.42 0.42 0.00 1.00 1.00
H 1.00 0.33 0.33 0.75 0.75 0.75 1.00 0.00 0.25
I 1.00 0.33 0.33 0.75 0.75 0.75 1.00 0.25 0.00

And a hierarchical clustering on this matrix reveals the structurally equivalent classes:

   D <- as.dist(D) 
   h.res <- hclust(D, method = "ward.D2")
   plot(h.res)

In the Flintstones network, we could then find structurally equivalent blocks from the similarity analysis as follows (using Cosine):

   D <- as.dist(1 - vertex.sim(g)$C)
   h.res <- hclust(D, method = "ward.D2") #hierarchical clustering

We can then plot the resulting clusters into a dendrogram:

   plot(h.res)

Which reveals four major blocks of structurally similar actors.

We can then extract our blocks just like we did with the Euclidean distance hierarchical clustering results in our discussion of structural equivalence:

   blocks  <- cutree(h.res, k = 4) #requesting four blocks
   blocks

        Bam-Bam          Barney           Betty        Feldspar            Fred 
              1               2               2               3               2 
   Headmistress      Herdmaster            Lava           Leach          Morris 
              4               3               2               1               4 
      Mrs Slate         Pebbles Pebbles Bam-Bam        Piltdown      Poindexter 
              2               1               1               3               2 
         Pyrite           Slate           Wilma 
              3               2               2

This analysis puts \(\{\) Barney, Betty, Fred, Lava, Mrs. Slate, Poindexter, Slate, Wilma \(\}\) in the second block of equivalent actors (shown as the right-most cluster in the dendrogram). Which seems about right!

Local Vertex Similarities in Directed Graphs

As we have seen before when considering directed graphs, everything doubles. The same goes for vertex similarity measures.

The basic problem is that when determining whether two vertices are similar or not in a directed graph, we need to choose between considering them similar if they have a lot of common in-neighbors (also referred to as co-citation in the information science literature) or if they have lots of common out-neighbors (also referred to as coupling in the information science literature). This means that we will have a co-citation and a coupling version of each of the measures considered above.

A solution to this issue is due to Amsler (1972), who suggested that we combine both pieces of information in a generalized similarity measure as follows:

\[ s_{ij} = \lambda cc_{ij} + (1-\lambda) co_{ij} \tag{10}\]

Where \(cc_{ij}\) is the co-citation similarity between \(i\) and \(j\), \(co_{ij}\) is the coupling similarity, and \(\lambda\) is a researcher-chosen parameter with the restriction \(0 \leq \lambda \leq 1\). When \(\lambda = 1\) the similarity is driven only by co-citation, when \(\lambda = 0\) the similarity is driven only by coupling, and when \(\lambda = 0.5\) it is equally shared between the two.
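A minimal sketch of Equation 10 in base R follows. It assumes that the raw counts of common in-neighbors and common out-neighbors are used as the co-citation and coupling similarities (any of the normalized versions could be substituted). The function name `amsler.sim`, the five-paper network, and the other object names are hypothetical.

```r
# Hypothetical helper: weighted mix of co-citation and coupling counts
amsler.sim <- function(A, lambda = 0.5) {
   stopifnot(lambda >= 0, lambda <= 1)
   cc <- t(A) %*% A #co-citation: common in-neighbors
   co <- A %*% t(A) #coupling: common out-neighbors
   lambda * cc + (1 - lambda) * co
   }

# Hypothetical citation network: A[i, j] = 1 means paper i cites paper j
papers <- paste0("P", 1:5)
A <- matrix(0, 5, 5, dimnames = list(papers, papers))
A["P1", c("P3", "P4")] <- 1 #P1 cites P3 and P4
A["P2", c("P3", "P4")] <- 1 #P2 cites P3 and P4
A["P5", c("P1", "P2")] <- 1 #P5 cites P1 and P2

amsler.sim(A, lambda = 1)["P1", "P2"] #1, co-citation only (both cited by P5)
amsler.sim(A, lambda = 0)["P1", "P2"] #2, coupling only (both cite P3 and P4)
amsler.sim(A, lambda = 0.5)["P1", "P2"] #1.5, equal mix of the two
```

Intermediate values of \(\lambda\) simply interpolate between the two matrices entry by entry.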

References

Adamic, Lada A, and Eytan Adar. 2003. “Friends and Neighbors on the Web.” Social Networks 25 (3): 211–30.

Amsler, Robert A. 1972. “Applications of Citation-Based Automatic Classification.” University of Texas at Austin.

Barabási, Albert-László, and Réka Albert. 1999. “Emergence of Scaling in Random Networks.” Science 286 (5439): 509–12.

Kleinberg, Jon M. 1999. “Authoritative Sources in a Hyperlinked Environment.” Journal of the ACM (JACM) 46 (5): 604–32.

Leicht, Elizabeth A, Petter Holme, and Mark EJ Newman. 2006. “Vertex Similarity in Networks.” Physical Review E—Statistical, Nonlinear, and Soft Matter Physics 73 (2): 026120.

Ravasz, Erzsébet, Anna Lisa Somera, Dale A Mongru, Zoltán N Oltvai, and A-L Barabási. 2002. “Hierarchical Organization of Modularity in Metabolic Networks.” Science 297 (5586): 1551–55.

Zhou, Tao, Linyuan Lü, and Yi-Cheng Zhang. 2009. “Predicting Missing Links via Local Information.” The European Physical Journal B 71 (4): 623–30.

Footnotes

¹ Note that for a measure to count as a true metric distance, the triangle inequality must be respected (i.e., \(d(x,z) \leq d(x,y) + d(y,z)\)). This holds for the complement of the Jaccard similarity, but it is not guaranteed for the complements of the cosine and Dice similarities, which are better thought of as dissimilarities.