Random Walk Centralities
In the lecture notes on random walk concepts we defined two basic quantities that are helpful for analyzing networks in terms of dynamic diffusion processes: the average first passage time and the average commute distance.
Both of these concepts serve as measures of distance (and its inverse, proximity) based on how two nodes are connected via direct and indirect paths, where the paths are defined by how something (e.g., gossip, disease, a meme) diffuses through the walk structure of the network, with transition probabilities governed by the inverse of the degree of each node.
In the lecture on closeness centrality we saw how one of the fundamental centrality measures in social network analysis is defined according to a distance measure, in that case the geodesic or shortest path distance. Two nodes are far apart if the length of the shortest path separating them is long, and close if it is short (with the minimum being adjacency, or length one).
Since random walk concepts are distance/proximity concepts, we can use them to build analogues of the centrality measures based on the geodesic distance.
Random Walk Closeness
The most obvious extension is to build a measure of closeness based on the random walk concept of the average first passage time. Two nodes are distant if the average number of steps it takes for something to get from one to the other is large, and they are close if the number of steps is small. So the average first passage time metric is the random-walk analogue of the geodesic distance, but it is based on random walks over all paths in the graph, regardless of length.
Thus, if \(m_{ij}\) is the average first passage time between nodes \(i\) and \(j\) then the random walk closeness centrality of node \(i\) is given by:
\[ C^{RCLO}_i = \left[\sum_j m_{ij}\right]^{-1} \tag{1}\]
Where the quantity we invert is just the \(i^{th}\) row sum of the matrix \(\mathbf{M}\).
Just like regular closeness, we can also compute a normalized version of random walk closeness by averaging the first passage times of each node across all of the other nodes in the graph:
\[ C^{NRCL}_i = \frac{N-1}{\sum_j m_{ij}} \tag{2}\]
Let’s see how this would work in real data. Let’s load up our trusty Pulp Fiction dataset based on scene co-appearances in the movie:
First, we need the code we used before to compute the average first passage time to get our random walk distance matrix \(\mathbf{M}\):
A <- as.matrix(as_adjacency_matrix(g))
d <- rowSums(A) #degree vector
D <- diag(d) #degree matrix
L <- D - A #Laplacian
E <- matrix(1, nrow(A), nrow(A))/nrow(A)
L.p <- solve(L + E) - E #pseudo-inverse of Laplacian
e <- matrix(1, nrow(A), 1) #column vector of all ones
l <- diag(L.p) #diagonal vector of Laplacian inverse
vol.A <- sum(d) #graph volume
M <- vol.A * (e %*% t(l))
M <- M - (vol.A * L.p)
M <- M + (L.p %*% d) %*% t(e)
M <- M - (e %*% (t(d) %*% L.p))

And here are the first ten rows and columns of the \(\mathbf{M}\) matrix:
Brett Buddy Butch Capt Koons Ed Sullivan English Dave Esmeralda
Brett 0.0 110.8 13.1 49.1 110.8 53.9 216.1
Buddy 35.0 0.0 16.7 44.4 102.0 51.6 219.7
Butch 30.6 110.0 0.0 44.2 110.0 50.6 203.0
Capt Koons 34.5 105.7 12.2 0.0 105.7 52.3 215.2
Ed Sullivan 35.0 102.0 16.7 44.4 0.0 51.6 219.7
English Dave 32.8 106.4 12.0 45.7 106.4 0.0 215.0
Esmeralda 31.6 111.0 1.0 45.2 111.0 51.6 0.0
Fabienne 21.6 111.8 11.1 49.1 111.8 54.6 214.1
Fourth Man 33.2 109.2 18.2 49.7 109.2 55.8 221.2
Gawker #2 31.1 111.2 7.8 47.6 111.2 50.9 210.8
Fabienne Fourth Man Gawker #2
Brett 62.2 106.1 83.0
Buddy 76.6 106.3 87.3
Butch 69.2 108.5 77.2
Capt Koons 75.1 108.0 84.9
Ed Sullivan 76.6 106.3 87.3
English Dave 74.2 107.6 81.7
Esmeralda 70.2 109.5 78.2
Fabienne 0.0 105.7 83.5
Fourth Man 73.4 0.0 88.4
Gawker #2 72.2 109.4 0.0
Now to compute random walk closeness according to Equation 1:
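One way this computation might look in code, assuming the \(\mathbf{M}\) matrix computed above (the sorting into decreasing order is just for display):

```r
# Random walk closeness: inverse of the row sums of M (Equation 1)
rw.clos <- 1 / rowSums(M)

# top ten nodes by random walk closeness
sort(rw.clos, decreasing = TRUE)[1:10]
```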
The Gimp Zed Sportscaster #2 Maynard Sportscaster #1
0.0003172878 0.0003172878 0.0003158847 0.0003049048 0.0003001459
Esmeralda Gawker #2 Pedestrian Buddy Ed Sullivan
0.0002890414 0.0002776414 0.0002776414 0.0002773396 0.0002773396
And the normalized version:
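A sketch of the normalized computation, again assuming the \(\mathbf{M}\) matrix from above:

```r
# Normalized random walk closeness (Equation 2):
# multiply by N - 1, the number of other nodes
rw.clos.norm <- (nrow(M) - 1) / rowSums(M)
sort(rw.clos.norm, decreasing = TRUE)[1:10]
```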
The Gimp Zed Sportscaster #2 Maynard Sportscaster #1
0.01173965 0.01173965 0.01168773 0.01128148 0.01110540
Esmeralda Gawker #2 Pedestrian Buddy Ed Sullivan
0.01069453 0.01027273 0.01027273 0.01026157 0.01026157
In contrast to the measures based on shortest paths, in which the top characters also have the top closeness, the random walk version selects more obscure characters as the most central (e.g., The Gimp and Zed).
Note, however, that because the average first passage time is an asymmetric measure, we have a choice as to whether to compute it based on the “out” or the “in” distances. When we use the row sums of the \(\mathbf{M}\) matrix, we count a node as close if the average first passage time from it to the other nodes in the graph is small.
But we can also say that a node is close to others if the average first passage time from the others to that node is small. In that case we would want to use the column sums of the \(\mathbf{M}\) matrix:
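A sketch of the column-sum (in-distance) version, assuming the \(\mathbf{M}\) matrix from above; shown here in normalized form:

```r
# In-distance random walk closeness: swap rowSums for colSums
rw.clos.in <- (nrow(M) - 1) / colSums(M)
sort(rw.clos.in, decreasing = TRUE)[1:10]
```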
Vincent Butch Jules Mia Marsellus Brett
0.10885425 0.06704354 0.06392210 0.04047785 0.04008249 0.02937037
Marvin Roger Honey Bunny Pumpkin
0.02620981 0.02620981 0.02388114 0.02388114
Using this measure gives us more interpretable results, with the most important characters in the story being in the top ten.
Current Flow Closeness
What if we were to use the average commute time as our distance measure instead? This makes sense because, just like the average first passage time, two nodes are far apart if the average commute time between them is large, and close if it is small.
If we do that, we end up with a closeness centrality measure called Current Flow Closeness, an odd name that comes from the fact that it was first developed in the study of electrical circuit networks!
So if \(n_{ij}\) is the average commute time distance between nodes \(i\) and \(j\) then the current flow closeness of \(i\) is given by:
\[ C^{CFCL}_i = \left[\sum_j n_{ij}\right]^{-1} \tag{3}\]
And the normalized version:
\[ C^{NCFC}_i = \frac{N-1}{\sum_j n_{ij}} \tag{4}\]
Remember that once we have \(\mathbf{M}\), the \(\mathbf{N}\) matrix of average commute time distances is just the sum of the first passage times in both directions:

\[
\mathbf{N} = \mathbf{M} + \mathbf{M}^T
\]
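In code, this is a one-liner, assuming the \(\mathbf{M}\) matrix computed earlier:

```r
# Average commute times: n_ij = m_ij + m_ji
N <- M + t(M)
```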
And here are the first ten rows and columns of the \(\mathbf{N}\) matrix:
Brett Buddy Butch Capt Koons Ed Sullivan English Dave Esmeralda
Brett 0.0 145.8 43.6 83.6 145.8 86.7 247.6
Buddy 145.8 0.0 126.8 150.1 204.0 158.0 330.8
Butch 43.6 126.8 0.0 56.3 126.8 62.6 204.0
Capt Koons 83.6 150.1 56.3 0.0 150.1 98.0 260.3
Ed Sullivan 145.8 204.0 126.8 150.1 0.0 158.0 330.8
English Dave 86.7 158.0 62.6 98.0 158.0 0.0 266.6
Esmeralda 247.6 330.8 204.0 260.3 330.8 266.6 0.0
Fabienne 83.7 188.3 80.3 124.2 188.3 128.8 284.3
Fourth Man 139.3 215.4 126.7 157.7 215.4 163.4 330.7
Gawker #2 114.1 198.5 85.0 132.5 198.5 132.5 289.0
Fabienne Fourth Man Gawker #2
Brett 83.7 139.3 114.1
Buddy 188.3 215.4 198.5
Butch 80.3 126.7 85.0
Capt Koons 124.2 157.7 132.5
Ed Sullivan 188.3 215.4 198.5
English Dave 128.8 163.4 132.5
Esmeralda 284.3 330.7 289.0
Fabienne 0.0 179.1 155.7
Fourth Man 179.1 0.0 197.8
Gawker #2 155.7 197.8 0.0
And now to compute the current flow closeness:
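A sketch of the computation, assuming the \(\mathbf{M}\) matrix and graph volume vol.A from earlier; as noted below, we first rescale commute times to resistance distances:

```r
# Rescale average commute times to resistance distances
R.dist <- (M + t(M)) / vol.A

# Current flow closeness: inverse of the row sums (Equation 3)
cf.clos <- 1 / rowSums(R.dist)
sort(cf.clos, decreasing = TRUE)[1:10]
```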
Vincent Butch Jules Mia Marsellus Brett
0.05076202 0.04883195 0.04788957 0.04459401 0.04443553 0.04139034
Marvin Roger Honey Bunny Pumpkin
0.04015006 0.04015006 0.03897407 0.03897407
Note that we divided the entries in the average commute time matrix \(\mathbf{N}\) by the volume of the graph \(\sum_i d_i\) (the sum of degrees) to transform them into resistance distances (Fouss, Saerens, and Shimbo 2016, 78), which does not make a difference for the rankings because it is just dividing by a constant.
And the normalized version:
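And a sketch of the normalized computation, under the same assumptions:

```r
# Normalized current flow closeness (Equation 4)
R.dist <- (M + t(M)) / vol.A
cf.clos.norm <- (nrow(M) - 1) / rowSums(R.dist)
sort(cf.clos.norm, decreasing = TRUE)[1:10]
```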
Vincent Butch Jules Mia Marsellus Brett
1.878195 1.806782 1.771914 1.649978 1.644115 1.531442
Marvin Roger Honey Bunny Pumpkin
1.485552 1.485552 1.442041 1.442041
We can see that, in contrast to the random walk closeness based on out-distances, but like the random walk closeness based on in-distances, the current flow closeness gives results closer to those obtained using the geodesic version of closeness, with the most important characters in the story being in the top ten.
Random Walk Betweenness
We can extend a similar “random walk” reasoning to the concept of betweenness centrality. Recall that the standard computation of betweenness centrality only counts the number of shortest paths that a given node is an intermediary on. We can extend the same reasoning beyond shortest paths to consider all paths of any length.
Consider a random walk process on a social network of the sort we talked about before. In this version we select a “seed” or “source” node \(i\) and a destination (also called “sink”) node \(j\). We start a diffusion process at \(i\) and stop it when it reaches \(j\) (this is called an absorbing random walk). The node-to-node transitions are governed by the probabilities stored in the matrix \(\mathbf{P}\).
If we replay this process many times, each time selecting a random source and a random destination node in the network (such that all pairs of nodes get to play each role), then we could count the number of times the thing diffusing through the network went through some third node \(k\).
A node has a high random walk betweenness centrality1 if it is likely to be a frequent intermediary on random walk diffusion processes that start and end with some other pair of nodes \(i\) and \(j\) in the graph.
Computing Random Walk Betweenness
How do we calculate random walk betweenness? It turns out the information we seek is stored in the pseudo-inverse of the Laplacian matrix (\(\mathbf{L}^+\)), which, as you will recall from our previous discussion, is defined as:
\[ \mathbf{L}^+ = \left(\mathbf{L} + \frac{\mathbf{E}}{N}\right)^{-1} - \frac{\mathbf{E}}{N} \tag{5}\]
Where \(\mathbf{L} = \mathbf{D} - \mathbf{A}\), \(\mathbf{D}\) is the matrix containing the degrees of each node in the diagonal, \(\mathbf{E}\) is the all ones matrix of the same dimension as \(\mathbf{A}\), and \(N\) is the number of nodes in the network. Note that we already computed this matrix when calculating random walk closeness earlier for the Pulp Fiction network (L.p).
As Fouss, Saerens, and Shimbo (2016, 161) note, if we wanted to know the amount of “flow” (\(\phi\)) that goes through a node \(k\) during a random walk process involving a seed node \(i\) and destination node \(l\) all we need to do is compute:
\[ \phi_k(i,l) = \frac{1}{2}\sum_{j \in N(k)} a_{jk}|l_{ij}^+ - l_{jl}^+ - l_{ik}^+ + l_{kl}^+| \tag{6}\]
Where the summation goes over all of \(k\)’s neighbors \(j\). What this formula tells us is that the amount of flow that \(k\) intermediates when a random walk process starts at \(i\) and ends at \(l\) is the sum of the amount of flow that would come through each edge \(e_{jk}\) in the graph connecting \(k\) to one of its neighbors \(j\).
Let’s write a function to test this out in the Pulp Fiction network:
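One way flow.k might be written, directly implementing Equation 6. The sketch below assumes the pseudo-inverse L.p computed earlier and takes node names as arguments:

```r
flow.k <- function(a, b, c, w) {
    A <- as.matrix(as_adjacency_matrix(w))
    i <- which(V(w)$name == a) # seed node
    l <- which(V(w)$name == b) # destination node
    k <- which(V(w)$name == c) # intermediary node
    phi <- 0
    # sum flow over every edge connecting k to a neighbor j (Equation 6)
    for (j in as.vector(neighbors(w, k))) {
        phi <- phi + A[j, k] * abs(L.p[i, j] - L.p[j, l] - L.p[i, k] + L.p[k, l])
    }
    return(0.5 * phi)
}
```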
The flow.k function takes the seed node a, the destination node b, the intermediary node c, and the graph w as arguments, and returns the total amount of flow going through \(k\), summed over each of \(k\)’s neighbors \(j\), according to Equation 6.
For instance, if we wanted to find the amount of flow that would go through Vincent if we started a random walk process with Mia and ended it with Pumpkin an infinite number of times, we would just type:
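With flow.k in hand (and g the Pulp Fiction graph object), the call would be:

```r
# flow through Vincent for walks starting at Mia and ending at Pumpkin
flow.k("Mia", "Pumpkin", "Vincent", g)
```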
And if we wanted to change the intermediary node to Brett:
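Again using flow.k, with only the intermediary argument changed:

```r
# flow through Brett for walks starting at Mia and ending at Pumpkin
flow.k("Mia", "Pumpkin", "Brett", g)
```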
Indicating that Vincent would intermediate much more of the random walk flow between Mia and Pumpkin than would Brett.
Now, the overall random walk betweenness centrality of any node is just their average flow score across each pair of nodes in the graph:
\[ C^{RWB}_k = \frac{\sum_{i \neq l} \sum_{l \neq i} \phi_k(i,l)}{(N-1)(N-2)} \]
The following wrapper around flow.k figures that out:
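A sketch of what that wrapper might look like (the name rwb.k is ours), looping flow.k over every ordered source/destination pair other than the focal node and averaging:

```r
rwb.k <- function(v, w) {
    nodes <- V(w)$name
    n <- length(nodes)
    tot <- 0
    # sum flow through v over all ordered pairs of distinct other nodes
    for (a in nodes) {
        for (b in nodes) {
            if (a != b && a != v && b != v) {
                tot <- tot + flow.k(a, b, v, w)
            }
        }
    }
    # average across the (n - 1)(n - 2) ordered pairs
    return(tot / ((n - 1) * (n - 2)))
}
```

Then rwb.k("Vincent", g) and rwb.k("Brett", g) give the two quantities compared in the text.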
We can now find Vincent’s random walk betweenness:
Which is higher than Brett’s
Fouss, Saerens, and Shimbo (2016, 162) show that we can use some matrix tricks to compute these quantities more efficiently (e.g., without looping). First, they note that the total flow going through a particular edge \(e_{jk}\) in the graph for every source/destination pair of nodes can be computed as follows.
First, we compute the matrix \(\mathbf{N}_{jk}\):
\[ \mathbf{N}_{jk} = (\mathbf{l}^+_j - \mathbf{l}^+_k)\mathbf{e^T} - \mathbf{e}(\mathbf{l}^+_j - \mathbf{l}^+_k)^T \]
Where \(\mathbf{l}^+_j\) is the \(j^{th}\) column of \(\mathbf{L}^+\).
Once we have \(\mathbf{N}_{jk}\), the total random walk flow going through edge \(e_{jk}\) is given by:
\[ \phi_{jk} = (\mathbf{e} - \mathbf{e}_k)^T |\mathbf{N}_{jk}| (\mathbf{e} - \mathbf{e}_k) \]
Where \(\mathbf{e}_k\) is a column vector of all zeros except for the \(k^{th}\) position, where it contains a one.
The random walk betweenness of \(k\) is then the sum of \(\phi_{jk}\) across each neighbor \(j \in N(k)\).
Here’s a function that packages those tricks to compute the random walk betweenness of each node in the graph, based on Fouss, Saerens, and Shimbo (2016, algorithm 4.5):
cur.flow.bet <- function(w) {
    n <- vcount(w)
    b <- rep(0, n)
    for (k in 1:n) {
        l.k <- L.p[, k] #kth column of Laplacian pseudo-inverse
        for (j in as.vector(neighbors(w, k))) {
            l.j <- L.p[, j]
            Njk <- ((l.j - l.k) %*% t(e)) - (e %*% t(l.j - l.k))
            e.k <- matrix(0, n, 1)
            e.k[k] <- 1
            b[k] <- b[k] + (t(e - e.k) %*% abs(Njk) %*% (e - e.k))
        }
    }
    b <- b * (2 * ((n - 1) * (n - 2)))^-1 #averaging across pairs
    names(b) <- V(w)$name
    return(b)
}

And now for the big reveal:
Vincent Butch Jules Mia Marsellus Pumpkin
0.498 0.454 0.309 0.187 0.147 0.115
Honey Bunny Maynard Brett Marvin
0.115 0.106 0.077 0.060
Which gives us the same answer as we obtained earlier for Vincent (the top node), while also showing the other top random walk betweenness centrality nodes in the network. These nodes are, not surprisingly, also some of the most central characters in the story.
References
Footnotes
Also called current flow betweenness and absorbing random walk betweenness.↩︎