Random Walk Centralities

In the lecture notes on random walk concepts we defined two basic quantities that are helpful for analyzing networks in terms of dynamic diffusion processes: the average first passage time and the average commute time.

Both of these concepts serve as measures of distance (and its inverse, proximity) based on how two nodes are connected via direct and indirect paths, where the paths are defined by how something (e.g., gossip, a disease, a meme) diffuses through the walk structure of the network, with transition probabilities governed by the inverse of the degree of each node.
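To make the transition-probability idea concrete, here is a minimal sketch, on a hypothetical four-node toy graph (not the data used below), of how the transition matrix is built from the adjacency matrix and the inverse degrees:

```r
# Hypothetical 4-node toy graph (not the movie data used below)
A <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 1,
              1, 1, 0, 0,
              0, 1, 0, 0), nrow = 4, byrow = TRUE)
d <- rowSums(A)          # degree of each node
P <- diag(1 / d) %*% A   # transition matrix: P[i, j] = A[i, j] / d[i]
rowSums(P)               # each row sums to 1
```

Each row of P is a probability distribution over the node's neighbors, which is what drives the diffusion processes discussed below.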

In the lecture on closeness centrality we saw how one of the fundamental centrality measures in social network analysis is defined according to a distance measure, in that case the geodesic or shortest path distance. Two nodes are far apart if the shortest path separating them is very long, and close if it is very short (with the minimum being adjacency, or a length of one).

Since random walk concepts are distance/proximity concepts, we can use them to build analogues of the centrality measures based on the geodesic distance.

Random Walk Closeness

The most obvious extension is to build a measure of closeness based on the random walk concept of the average first passage time. Two nodes are distant if the average number of steps it takes for something to get from one to the other is big, and they are close if the number of steps is small. So the average first passage time is the random-walk analogue of the geodesic distance, but it is based on random walks over all paths in the graph regardless of length.

Thus, if \(m_{ij}\) is the average first passage time between nodes \(i\) and \(j\) then the random walk closeness centrality of node \(i\) is given by:

\[ C^{RCLO}_i = \left[\sum_j m_{ij}\right]^{-1} \tag{1}\]

Where the quantity inside the brackets is just the sum of the \(i^{th}\) row of the matrix \(\mathbf{M}\).

Just like closeness, we can also compute a normalized version of random walk closeness by taking the inverse of the average first passage time from each node to all of the other nodes in the graph:

\[ C^{NRCL}_i = \frac{N-1}{\sum_j m_{ij}} \tag{2}\]

Let’s see how this would work in real data. Let’s load up our trusty Pulp Fiction dataset based on scene co-appearances in the movie:

    library(networkdata)
    library(igraph)
    library(stringr) #using stringr to change names from all caps to title case
    g <- movie_559
    V(g)$name <- str_to_title(V(g)$name)
    V(g)$name[which(V(g)$name == "Esmarelda")] <- "Esmeralda" #fixing misspelled name
    E(g)$weight <- 1 #setting edge weights to 1.0

First, we need the code we used before to compute the average first passage time to get our random walk distance matrix \(\mathbf{M}\):

    A <- as.matrix(as_adjacency_matrix(g))
    d <- rowSums(A) #degree vector 
    D <- diag(d) #degree matrix
    L <- D - A #Laplacian
    E <- matrix(1, nrow(A), nrow(A))/nrow(A) #all-ones matrix divided by N
    L.p <- solve(L + E) - E #pseudo-inverse of Laplacian
    e <- matrix(1, nrow(A), 1) #column vector of all ones
    l <- diag(L.p) #diagonal vector of Laplacian inverse
    vol.A <- sum(d) #graph volume 
    M <- vol.A * (e %*% t(l))
    M <- M - (vol.A * L.p)
    M <- M + (L.p %*% d) %*% t(e)
    M <- M - (e %*% (t(d) %*% L.p))

And here are the first ten rows and columns of the \(\mathbf{M}\) matrix:

    rownames(M) <- colnames(M)
    round(M, 1)[1:10, 1:10]
             Brett Buddy Butch Capt Koons Ed Sullivan English Dave Esmeralda
Brett          0.0 110.8  13.1       49.1       110.8         53.9     216.1
Buddy         35.0   0.0  16.7       44.4       102.0         51.6     219.7
Butch         30.6 110.0   0.0       44.2       110.0         50.6     203.0
Capt Koons    34.5 105.7  12.2        0.0       105.7         52.3     215.2
Ed Sullivan   35.0 102.0  16.7       44.4         0.0         51.6     219.7
English Dave  32.8 106.4  12.0       45.7       106.4          0.0     215.0
Esmeralda     31.6 111.0   1.0       45.2       111.0         51.6       0.0
Fabienne      21.6 111.8  11.1       49.1       111.8         54.6     214.1
Fourth Man    33.2 109.2  18.2       49.7       109.2         55.8     221.2
Gawker #2     31.1 111.2   7.8       47.6       111.2         50.9     210.8
             Fabienne Fourth Man Gawker #2
Brett            62.2      106.1      83.0
Buddy            76.6      106.3      87.3
Butch            69.2      108.5      77.2
Capt Koons       75.1      108.0      84.9
Ed Sullivan      76.6      106.3      87.3
English Dave     74.2      107.6      81.7
Esmeralda        70.2      109.5      78.2
Fabienne          0.0      105.7      83.5
Fourth Man       73.4        0.0      88.4
Gawker #2        72.2      109.4       0.0

Now to compute random walk closeness according to Equation 1:

   rw.clos <- rowSums(M)^-1
   sort(rw.clos, decreasing = TRUE)[1:10] 
       The Gimp             Zed Sportscaster #2         Maynard Sportscaster #1 
   0.0003172878    0.0003172878    0.0003158847    0.0003049048    0.0003001459 
      Esmeralda       Gawker #2      Pedestrian           Buddy     Ed Sullivan 
   0.0002890414    0.0002776414    0.0002776414    0.0002773396    0.0002773396 

And the normalized version:

   nrw.clos <- (vcount(g) - 1)/rowSums(M)
   sort(nrw.clos, decreasing = TRUE)[1:10]
       The Gimp             Zed Sportscaster #2         Maynard Sportscaster #1 
     0.01173965      0.01173965      0.01168773      0.01128148      0.01110540 
      Esmeralda       Gawker #2      Pedestrian           Buddy     Ed Sullivan 
     0.01069453      0.01027273      0.01027273      0.01026157      0.01026157 

In contrast to the measures based on shortest paths, in which the top characters also have the top closeness, the random walk version selects more obscure characters as the most central (e.g., The Gimp and Zed).

Note, however, that because the average first passage time is an asymmetric measure, we have a choice as to whether to compute closeness based on the “out” or the “in” distances. When we use the row sums of the \(\mathbf{M}\) matrix, we count a node as close if the average first passage time from that node to the other nodes in the graph is small.

But we can also say that a node is close to others if the average first passage time from the others to that node is small. In that case, we would want to use the column sums of the \(\mathbf{M}\) matrix:

   nrw.clos <- (vcount(g) - 1)/colSums(M)
   sort(nrw.clos, decreasing = TRUE)[1:10]
    Vincent       Butch       Jules         Mia   Marsellus       Brett 
 0.10885425  0.06704354  0.06392210  0.04047785  0.04008249  0.02937037 
     Marvin       Roger Honey Bunny     Pumpkin 
 0.02620981  0.02620981  0.02388114  0.02388114 

Using this measure gives us more interpretable results, with the most important characters in the story being in the top ten.

Current Flow Closeness

What if we were to use the average commute time as our distance measure instead? This makes sense, because just like the average first passage time, two nodes are far if the average commute time between them is big, and close if it is small.

If we do that, we end up with a closeness centrality measure called Current Flow Closeness, a name that it owes to the fact that it was originally developed in the study of electrical circuit networks!

So if \(n_{ij}\) is the average commute time distance between nodes \(i\) and \(j\) then the current flow closeness of \(i\) is given by:

\[ C^{CFCL}_i = \left[\sum_j n_{ij}\right]^{-1} \tag{3}\]

And the normalized version:

\[ C^{NCFC}_i = \frac{N-1}{\sum_j n_{ij}} \tag{4}\]

Remember that once we have \(\mathbf{M}\), the \(\mathbf{N}\) matrix of average commute time distances is just:

   N <- M + t(M)

And here are the first ten rows and columns of the \(\mathbf{N}\) matrix:

   round(N, 1)[1:10, 1:10]
             Brett Buddy Butch Capt Koons Ed Sullivan English Dave Esmeralda
Brett          0.0 145.8  43.6       83.6       145.8         86.7     247.6
Buddy        145.8   0.0 126.8      150.1       204.0        158.0     330.8
Butch         43.6 126.8   0.0       56.3       126.8         62.6     204.0
Capt Koons    83.6 150.1  56.3        0.0       150.1         98.0     260.3
Ed Sullivan  145.8 204.0 126.8      150.1         0.0        158.0     330.8
English Dave  86.7 158.0  62.6       98.0       158.0          0.0     266.6
Esmeralda    247.6 330.8 204.0      260.3       330.8        266.6       0.0
Fabienne      83.7 188.3  80.3      124.2       188.3        128.8     284.3
Fourth Man   139.3 215.4 126.7      157.7       215.4        163.4     330.7
Gawker #2    114.1 198.5  85.0      132.5       198.5        132.5     289.0
             Fabienne Fourth Man Gawker #2
Brett            83.7      139.3     114.1
Buddy           188.3      215.4     198.5
Butch            80.3      126.7      85.0
Capt Koons      124.2      157.7     132.5
Ed Sullivan     188.3      215.4     198.5
English Dave    128.8      163.4     132.5
Esmeralda       284.3      330.7     289.0
Fabienne          0.0      179.1     155.7
Fourth Man      179.1        0.0     197.8
Gawker #2       155.7      197.8       0.0

And now to compute the current flow closeness:

   cf.clos <- rowSums(N/sum(d))^-1
   sort(cf.clos, decreasing = TRUE)[1:10] 
    Vincent       Butch       Jules         Mia   Marsellus       Brett 
 0.05076202  0.04883195  0.04788957  0.04459401  0.04443553  0.04139034 
     Marvin       Roger Honey Bunny     Pumpkin 
 0.04015006  0.04015006  0.03897407  0.03897407 

Note that we divided the entries in the average commute time matrix \(\mathbf{N}\) by the volume of the graph \(\sum_i d_i\) (the sum of degrees) to transform them into resistance distances (Fouss, Saerens, and Shimbo 2016, 78), which does not make a difference for the rankings because it is just dividing by a constant.
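This relationship between commute times and resistance distances can be checked directly: for a connected graph, dividing \(\mathbf{N}\) by the volume should reproduce the effective resistances \(l^+_{ii} + l^+_{jj} - 2l^+_{ij}\). Here is a self-contained sketch on a hypothetical four-node cycle graph (not the movie data), reusing the \(\mathbf{M}\) construction from above:

```r
# Hypothetical toy graph: the 4-cycle (not the movie data)
A <- matrix(0, 4, 4)
A[cbind(1:4, c(2, 3, 4, 1))] <- 1
A <- A + t(A)
d <- rowSums(A)
L <- diag(d) - A
E <- matrix(1, 4, 4) / 4
L.p <- solve(L + E) - E                  # pseudo-inverse of the Laplacian
e <- matrix(1, 4, 1)
vol <- sum(d)                            # graph volume
# Average first passage times (same construction as in the text)
M <- vol * (e %*% t(diag(L.p))) - vol * L.p +
     (L.p %*% d) %*% t(e) - (e %*% (t(d) %*% L.p))
N <- M + t(M)                            # average commute times
# Resistance distances straight from the pseudo-inverse
R <- outer(diag(L.p), diag(L.p), "+") - 2 * L.p
max(abs(N / vol - R))                    # ~0: commute time = volume * resistance
```

On the 4-cycle the resistance between adjacent nodes is the two parallel paths of lengths one and three, i.e. \(3/4\), so the commute time between them is \(8 \times 3/4 = 6\) steps.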

And the normalized version:

   ncf.clos <- (vcount(g)-1)/rowSums(N/sum(d))
   sort(ncf.clos, decreasing = TRUE)[1:10] 
    Vincent       Butch       Jules         Mia   Marsellus       Brett 
   1.878195    1.806782    1.771914    1.649978    1.644115    1.531442 
     Marvin       Roger Honey Bunny     Pumpkin 
   1.485552    1.485552    1.442041    1.442041 

We can see that, in contrast to the random walk closeness based on out-distances (but like the version based on in-distances), current flow closeness gives results closer to those obtained using the geodesic distance version of closeness, with the most important characters in the story appearing in the top ten.

Random Walk Betweenness

We can extend a similar “random walk” reasoning to the concept of betweenness centrality. Recall that the standard computation of betweenness centrality only counts the number of shortest paths that a given node is an intermediary on. We can extend the same reasoning beyond shortest paths to consider all paths of any length.

Consider a random walk process on a social network of the sort we talked about before. In this version we select a “seed” or “source” node \(i\) and a destination (also called a “sink”) node \(j\). We start a diffusion process at \(i\) and stop it when it reaches \(j\) (this is called an absorbing random walk). The node-to-node transitions are governed by the probabilities stored in the matrix \(\mathbf{P}\).

If we replay this process many times, each time selecting a random source and a random destination node in the network (such that all pairs of nodes get to play each role), then we can count the number of times the thing diffusing through the network went through some third node \(k\).

A node has a high random walk betweenness centrality¹ if it is likely to be a frequent intermediary in random walk diffusion processes that start and end with some other pair of nodes \(i\) and \(j\) in the graph.
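Before deriving the closed-form version, we can sanity-check this idea by brute force. The sketch below (a Monte Carlo approximation on a hypothetical five-node toy graph, not the movie data) repeatedly runs an absorbing random walk from a source \(i\) to a sink \(l\) and counts how often it passes through an intermediary \(k\):

```r
set.seed(42)
# Hypothetical toy graph: node 3 bridges the pair {1, 2} and the pair {4, 5}
A <- matrix(0, 5, 5)
A[rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(3, 5), c(4, 5))] <- 1
A <- A + t(A)
P <- diag(1 / rowSums(A)) %*% A  # transition probability matrix

# Average number of visits to node k on absorbing walks from i to l
visits.to.k <- function(i, l, k, P, n.walks = 2000) {
    total <- 0
    for (walk in 1:n.walks) {
        node <- i
        while (node != l) {
            node <- sample(ncol(P), 1, prob = P[node, ])
            if (node == k) total <- total + 1
        }
    }
    total / n.walks
}

# Node 3 lies on every path from 1 to 5, so it is hit on every walk;
# node 2 is visited less often
visits.to.k(1, 5, 3, P)
visits.to.k(1, 5, 2, P)
```

Note that this simulation counts every visit, while the closed-form measure developed below is based on net flow across edges, so the two agree in spirit (ranking intermediaries) rather than numerically.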

Computing Random Walk Betweenness

How do we calculate random walk betweenness? It turns out the information we seek is stored in the pseudo-inverse of the Laplacian matrix (\(\mathbf{L}^+\)), which you recall from our previous discussion is defined as:

\[ \mathbf{L}^+ = \left(\mathbf{L} + \frac{\mathbf{E}}{N}\right)^{-1} - \frac{\mathbf{E}}{N} \tag{5}\]

Where \(\mathbf{L} = \mathbf{D} - \mathbf{A}\), \(\mathbf{D}\) is the matrix containing the degrees of each node on the diagonal, \(\mathbf{E}\) is the all-ones matrix of the same dimension as \(\mathbf{A}\), and \(N\) is the number of nodes in the network. Note that we already computed this matrix when calculating random walk closeness earlier for the Pulp Fiction network (L.p).
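As a quick sanity check, and assuming only a hypothetical four-node path graph (not the movie data), we can verify that Equation 5 really does behave like a Moore-Penrose pseudo-inverse, i.e. that \(\mathbf{L}\mathbf{L}^+\mathbf{L} = \mathbf{L}\):

```r
# Hypothetical toy graph: the path 1-2-3-4 (not the movie data)
A <- matrix(0, 4, 4)
A[cbind(1:3, 2:4)] <- 1
A <- A + t(A)                     # symmetric adjacency matrix
L <- diag(rowSums(A)) - A         # Laplacian
E <- matrix(1, 4, 4) / 4          # all-ones matrix divided by N
L.p <- solve(L + E) - E           # pseudo-inverse via Equation 5
max(abs(L %*% L.p %*% L - L))     # ~0: L L+ L = L
max(abs(L.p %*% L %*% L.p - L.p)) # ~0: L+ L L+ = L+
```

This works because, for a connected graph, the only direction \(\mathbf{L}\) cannot invert is the constant vector, and adding \(\mathbf{E}/N\) patches exactly that direction before inverting.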

As Fouss, Saerens, and Shimbo (2016, 161) note, if we wanted to know the amount of “flow” (\(\phi\)) that goes through a node \(k\) during a random walk process involving a seed node \(i\) and destination node \(l\) all we need to do is compute:

\[ \phi_k(i,l) = \frac{1}{2}\sum_{j \in N(k)} a_{jk}|l_{ij}^+ - l_{jl}^+ - l_{ik}^+ + l_{kl}^+| \tag{6}\]

Where the summation goes over all of \(k\)’s neighbors \(j\). What this formula tells us is that the amount of flow that \(k\) intermediates when a random walk process starts at \(i\) and ends at \(l\) is the sum of the flow that comes through each edge \(e_{jk}\) connecting \(k\) to one of its neighbors \(j\).

Let’s write a function to test this out in the Pulp Fiction network:

    flow.k <- function(a, b, c, w) {
        #flow through node c for walks from a to b, per Equation 6
        #relies on the pseudo-inverse L.p computed earlier
        z <- as.matrix(as_adjacency_matrix(w))
        i <- which(V(w)$name == a)
        l <- which(V(w)$name == b)
        k <- which(V(w)$name == c)
        f <- 0
        if (i != k && l != k) { #no flow is intermediated by the source or sink
            for (j in 1:vcount(w)) {
                f <- f + (z[j, k] * abs(L.p[i, j] - L.p[j, l] - L.p[i, k] + L.p[k, l]))
            }
        }
        f <- f/2
        return(f)
    }

The flow.k function takes the seed node a, the destination node b, the intermediary node c, and the graph w as arguments, and returns the total amount of flow that goes through the intermediary via each of \(k\)’s neighbors \(j\), according to Equation 6.

For instance, if we wanted to find the amount of flow that would go through Vincent if we started a random walk process at Mia and ended it at Pumpkin an infinite number of times, we would just type:

    round(flow.k("Mia", "Pumpkin", "Vincent", g), 3)
[1] 0.711

And if we wanted to change the intermediary node to Brett:

    round(flow.k("Mia", "Pumpkin", "Brett", g), 3)
[1] 0.048

Indicating that Vincent intermediates much more of the random walk flow between Mia and Pumpkin than Brett does.

Now, the overall random walk betweenness centrality of any node is just their average flow score across all pairs of nodes in the graph:

\[ C^{RWB}_k = \frac{\sum_{i} \sum_{l \neq i} \phi_k(i,l)}{(N-1)(N-2)} \]

The following wrapper around flow.k figures that out:

    flow.bet <- function(c, w) {
        b <- 0
        n <- vcount(w)
        for (i in V(w)$name) {
            for (j in V(w)$name) {
                b <- b + flow.k(i, j, c, w) #summing flows
            }
        }
        b <- b * ((n - 1) * (n - 2))^-1 #averaging across pairs
        return(b)
    }

We can now find Vincent’s random walk betweenness:

    round(flow.bet("Vincent", g), 3)
[1] 0.498

Which is higher than Brett’s:

    round(flow.bet("Brett", g), 3)
[1] 0.077

Fouss, Saerens, and Shimbo (2016, 162) show that we can use some matrix tricks to compute these quantities more efficiently (e.g., without looping over every pair of nodes). They note that the total flow going through a particular edge \(e_{jk}\) in the graph, across every source/destination pair of nodes, can be computed as follows.

First, we compute the matrix \(\mathbf{N}_{jk}\):

\[ \mathbf{N}_{jk} = (\mathbf{l}^+_j - \mathbf{l}^+_k)\mathbf{e^T} - \mathbf{e}(\mathbf{l}^+_j - \mathbf{l}^+_k)^T \]

Where \(\mathbf{l}^+_j\) is the \(j^{th}\) column of \(\mathbf{L}^+\).

Once we have \(\mathbf{N}_{jk}\), the total random walk flow going through edge \(e_{jk}\) is given by:

\[ \phi_{jk} = (\mathbf{e} - \mathbf{e}_k)^T |\mathbf{N}_{jk}| (\mathbf{e} - \mathbf{e}_k) \]

Where \(\mathbf{e}_k\) is a column vector of all zeros except for the \(k^{th}\) position, where it contains a one.

The random walk betweenness of \(k\) is then the sum of \(\phi_{jk}\) across each neighbor \(j \in N(k)\).

Here’s a function that packages those tricks to compute the random walk betweenness of each node in the graph, based on Fouss, Saerens, and Shimbo (2016, algorithm 4.5):

    cur.flow.bet <- function(w) {
        #random walk betweenness of every node (Fouss et al. 2016, algorithm 4.5)
        #relies on the pseudo-inverse L.p and all-ones vector e computed earlier
        n <- vcount(w)
        b <- rep(0, n)
        for (k in 1:n) {
            l.k <- L.p[, k]
            for (j in as.vector(neighbors(w, k))) {
                l.j <- L.p[, j]
                Njk <- ((l.j - l.k) %*% t(e)) - (e %*% t(l.j - l.k))
                e.k <- matrix(0, n, 1)
                e.k[k] <- 1
                b[k] <- b[k] + (t(e - e.k) %*% abs(Njk) %*% (e - e.k))
            }
        }
        b <- b * (2 * ((n - 1) * (n - 2)))^-1 #averaging across pairs
        names(b) <- V(w)$name
        return(b)
    }

And now for the big reveal:

    cf.bet <- cur.flow.bet(g)
    round(sort(cf.bet, decreasing = TRUE), 3)[1:10]
    Vincent       Butch       Jules         Mia   Marsellus     Pumpkin 
      0.498       0.454       0.309       0.187       0.147       0.115 
Honey Bunny     Maynard       Brett      Marvin 
      0.115       0.106       0.077       0.060 

Which gives us the same answer as we obtained earlier for Vincent (the top node), while also showing the other nodes with the highest random walk betweenness centrality in the network. These nodes are, not surprisingly, also some of the most central characters in the story.

References

Fouss, François, Marco Saerens, and Masashi Shimbo. 2016. Algorithms and Models for Network Data and Link Analysis. Cambridge University Press.

Footnotes

  1. Also called current flow betweenness and absorbing random walk betweenness.↩︎