Hubs and Authorities

Recall from previous discussions that everything doubles (sometimes quadruples like degree correlations) in directed graphs. The same goes for status as reflected in a distributive model through the network.

Consider two ways of showing your status in a system governed by directed relations (like advice). You can be highly sought after by others (be an “authority”), or you can be discerning in who you seek advice from, preferably seeking out people who are also sought after (e.g., be a “hub” pointing to high-quality others).

These two forms of status are mutually defining (Bonacich and Lloyd 2001). The top authorities are those who are sought after by the top hubs, and the top hubs are the ones who seek the top authorities!

So this leads to a doubling of the Bonacich prestige status accounting equation:

\[ x^h_i = \sum_j a_{ij} x^a_j \]

\[ x^a_i = \sum_i a_{ij} x^h_i \]

Which says that the hub score $x^h$ of a node is the sum of the authority scores $x^a$ of the nodes they point to (sum over $j$; the outdegree), and the authority score of a node is the sum of the hub scores of the nodes that point to it (sum over $i$; the indegree).

So we need to make our status game a bit more complicated (but not too much) to account for this duality:

   status2 <- function(w) {
     a <- rep(1/nrow(w), nrow(w))  #initializing authority scores
     d <- 1 #initializing delta
     k <- 0 #initializing counter
     while (d >= 1e-10) {
         o.a <- a #old authority scores
         h <- w %*% o.a #new hub scores a function of previous authority scores of their out-neighbors
         h <- h/norm(h, type = "E")
         a <- t(w) %*% h #new authority scores a function of current hub scores of their in-neighbors
         a <- a/norm(a, type = "E")
         d <- abs(sum(abs(o.a) - abs(a))) #delta between old and new authority scores
         k <- k + 1
         }
   return(list(h = as.vector(h), a = as.vector(a), k = k))
   }

Everything is like our previous status1 function except now we are keeping track of two mutually defining scores a and h. We first initialize the authority scores by setting them to the value of $1/n$ (where $n$ is the number of nodes or the number of rows in the adjacency matrix) in line 2. We then initialize the $\delta$ difference and $k$ counter in lines 3-4. The while loop in lines 5-13 then updates the hub scores (to be the sum of the authority scores of each out-neighbor) in line 7 normalize them in line 8 and update the new authority scores to be the sum (across each in-neighbor) of these new hub scores, which are then themselves normalized in line 10.

So at each step $t$, the authority and hub scores are calculated like this:

\[ x^h_i(t) = \sum_j a_{ij} x^a_j(t-1) \]

\[ x^a_i(t) = \sum_j a^T_{ij} x^h_j(t) \]

Where $a^T_{ij}$ is the corresponding entry in the transpose of the adjacency matrix (t(w) in line 9 of the above function).

As you may have guessed this is just an implementation of the “HITS” algorithm developed by Kleinberg (1999).¹

The results for the Krackhardt advice network are:

   library(networkdata)
   library(igraph)
   g <- ht_advice
   A <- as.matrix(as_adjacency_matrix(g))
   hits.res1 <- status2(A)
   round(hits.res1$a/max(hits.res1$a), 3)

 [1] 0.782 1.000 0.356 0.496 0.330 0.644 0.684 0.711 0.290 0.615 0.769 0.498
[13] 0.323 0.677 0.267 0.570 0.645 0.871 0.323 0.589 0.776

   round(hits.res1$h/max(hits.res1$h), 3)

 [1] 0.370 0.176 0.841 0.709 0.835 0.065 0.492 0.490 0.773 0.672 0.206 0.122
[13] 0.331 0.279 1.000 0.274 0.313 0.800 0.581 0.687 0.600

Which are equivalent to using the igraph function hits_scores:

   ha <- hits_scores(g, scale = TRUE)
   round(ha$authority, 3)

 [1] 0.782 1.000 0.356 0.496 0.330 0.644 0.684 0.711 0.290 0.615 0.769 0.498
[13] 0.323 0.677 0.267 0.570 0.645 0.871 0.323 0.589 0.776

   round(ha$hub, 3)

 [1] 0.370 0.176 0.841 0.709 0.835 0.065 0.492 0.490 0.773 0.672 0.206 0.122
[13] 0.331 0.279 1.000 0.274 0.313 0.800 0.581 0.687 0.600

Note that the just like the status2 function, the igraph function hits_scores returns the two sets of scores as elements of a list, so we need to access them using the $ operator on the object that we store the results in (in this case ha). We also set the scale argument to TRUE so that the scores are normalized by the maximum.

Hubs, Authorities and Eigenvectors

Recall that both the eigenvector and PageRank status scores computed via the network status distribution game routine ended up being equivalent to the leading eigenvector of a network proximity matrix (the adjacency matrix $\mathbf{A}$ and the probability matrix $\mathbf{P}$ respectively). It would be surprising if the same wasn’t true of the hub and authority status scores.

Let’s find out which ones!

Consider the matrices:

\[ \mathbf{M}_h = \mathbf{A}\mathbf{A}^T \]

\[ \mathbf{M}_a = \mathbf{A}^T\mathbf{A} \]

Let’s see what they look like in the Krackhardt manager’s network:

   M.h = A %*% t(A)
   M.a = t(A) %*% A
   M.h[1:10, 1:10]

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    6    1    5    5    5    1    3    4    5     5
 [2,]    1    3    3    2    3    1    2    3    3     0
 [3,]    5    3   15   11   12    1    8    8   12     8
 [4,]    5    2   11   12   11    1    7    6   11     8
 [5,]    5    3   12   11   15    1    7    7   12    10
 [6,]    1    1    1    1    1    1    1    1    1     0
 [7,]    3    2    8    7    7    1    8    5    8     4
 [8,]    4    3    8    6    7    1    5    8    7     4
 [9,]    5    3   12   11   12    1    8    7   13     7
[10,]    5    0    8    8   10    0    4    4    7    14

   M.a[1:10, 1:10]

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   13   13    4    5    5    6    8    8    4     8
 [2,]   13   18    5    8    5    9   11   10    4     9
 [3,]    4    5    5    4    4    2    4    4    2     3
 [4,]    5    8    4    8    3    4    6    6    3     4
 [5,]    5    5    4    3    5    1    3    3    3     3
 [6,]    6    9    2    4    1   10    7    7    2     6
 [7,]    8   11    4    6    3    7   13    6    3     7
 [8,]    8   10    4    6    3    7    6   10    3     6
 [9,]    4    4    2    3    3    2    3    3    4     3
[10,]    8    9    3    4    3    6    7    6    3     9

What’s in these matrices? Well let’s look at $\mathbf{M}_h$. The diagonals will look familiar because they happen to be the outdegree of each node:

   degree(g, mode = "out")[1:10]

 [1]  6  3 15 12 15  1  8  8 13 14

You may have guessed that the diagonals of matrix $\mathbf{M}_a$ contain the indegrees:

   degree(g, mode = "in")[1:10]

 [1] 13 18  5  8  5 10 13 10  4  9

Which means that the off-diagonals cells of each matrix $m_{ij}$ and $n_{ij}$, contain the common out-neighbors and common in-neighbors shared by nodes $i$ and $j$ in the graph, respectively.

Figure 1: Subgraph from Krackhardt’s Managers Network.

In information science, $\mathbf{M}_h$ and $\mathbf{M}_a$ matrices have special interpretations. Consider the subgraph shown in Figure 1, which contains nodes 2 and 11 from the Krackhardt advice network and their respective neighbors:

If these nodes where papers, then we would say that both 2 and 11 point to a common third node 1. In an information network, the papers that other papers point to are their common references. Therefore the number of common out-neighbors of two nodes is is called the bibliographic coupling score between the two papers. In the same way, we can see that 2 and 11 are pointed to by a common third neighbor 21. The number of common in-neighbors between two-papers is called their co-citation score.

Both the bibliographic coupling and the co-citation scores get at two ways that nodes can be similar in a directed graph. In the social context of advice seeking, for instance, two people can be similar if they seek advice from the same others, or two people can be similar if they are sought after for advice by the same others.

The $\mathbf{M}_h$ and $\mathbf{M}_a$ matrices, therefore are two (unweighted) similarity matrices between the nodes in a directed graph. As you may also be suspecting, the hub and authorities scores are the leading eigenvectors of the $\mathbf{M}_h$ and $\mathbf{M}_a$ matrices (Kleinberg 1999):

   a <- eigen(M.a)$vector[,1] * -1
   h <- eigen(M.h)$vector[,1] * -1
   round(a/max(a), 3)

 [1] 0.782 1.000 0.356 0.496 0.330 0.644 0.684 0.711 0.290 0.615 0.769 0.498
[13] 0.323 0.677 0.267 0.570 0.645 0.871 0.323 0.589 0.776

   round(h/max(h), 3)

 [1] 0.370 0.176 0.841 0.709 0.835 0.065 0.492 0.490 0.773 0.672 0.206 0.122
[13] 0.331 0.279 1.000 0.274 0.313 0.800 0.581 0.687 0.600

Given the connection to the HITS dual status ranking, sometimes the $\mathbf{M}_h$ is called the hub matrix and the $\mathbf{M}_a$ is called the authority matrix (Ding et al. 2002).

Note that this also means we could have obtained the hub and authority scores using our old status1 function, but we would have had to play the game twice, once for the matrix $\mathbf{M}_a$ and the other one for the matrix $\mathbf{M}_h$:

   status1 <- function(w) {
      x <- rep(1, nrow(w)) #initial status vector set to all ones of length equal to the number of nodes
      d <- 1 #initial delta
      k <- 0 #initializing counter
      while (d > 1e-10) {
          o.x <- x #old status scores
          x <- w %*% o.x #new scores a function of old scores and adjacency matrix
          x <- x/norm(x, type = "E") #normalizing new status scores
          d <- abs(sum(abs(x) - abs(o.x))) #delta between new and old scores
          k <- k + 1 #incrementing while counter
      }
   return(as.vector(x))
   }

   a <- status1(M.a)
   round(a/max(a), 3)

 [1] 0.782 1.000 0.356 0.496 0.330 0.644 0.684 0.711 0.290 0.615 0.769 0.498
[13] 0.323 0.677 0.267 0.570 0.645 0.871 0.323 0.589 0.776

   h <- status1(M.h)
   round(h/max(h), 3)

 [1] 0.370 0.176 0.841 0.709 0.835 0.065 0.492 0.490 0.773 0.672 0.206 0.122
[13] 0.331 0.279 1.000 0.274 0.313 0.800 0.581 0.687 0.600

This link once again demonstrates the equivalence between the eigenvectors of the hub and authority matrices, as similarity matrices between nodes in the network, and our prismatic status distribution game!

Combining PageRank and HITS: SALSA

Lempel and Moran (2001) show that we can combine the logic of PageRank and HITS. Their basic idea is to use the same mutually reinforcing approach as in HITS but with degree-normalized (stochastic) versions of the adjacency matrix (like in PageRank).²

Let’s see how it works.

Recall that PageRank works on the $\mathbf{P}$ matrix, which is defined like this:

\[ \mathbf{P}_{a} = \mathbf{D}_{out}^{-1}\mathbf{A} \]

In R we compute it like this:

   D.o <- diag(1/rowSums(A))
   P.a <- D.o %*% A

This matrix is row-stochastic, because each row is divided by the row total (the outdegrees of each node), meaning its rows sum to one, like we saw before:

   rowSums(P.a)

 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

It is also possible to compute the indegree normalized version of the $\mathbf{P}$ matrix, defined like this:

\[ \mathbf{P}_{h} = \mathbf{D}_{in}^{-1} \mathbf{A}^T \]

Where $\mathbf{D}_{in}^{-1}$ is a matrix containing the inverse of the indegrees along the diagonals (and zeroes elsewhere) and $\mathbf{A}^T$ is the transpose of the adjacency matrix. Each non-zero entry of is thus equal to one divided by that row node’s indegree.

In R we compute it like this:

   D.i <- diag(1/colSums(A))
   P.h <- D.i %*% t(A)

Like the $\mathbf{P}_{a}$ matrix, the $\mathbf{P}_{a}$ matrix is row-stochastic, meaning its rows sum to 1.0:

   rowSums(P.h)

 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

To get the SALSA version of the hub and authority scores, we can just play our status game over newly defined versions of the hub and authority matrices (Langville and Meyer 2005, 156).

The SALSA hub matrix is defined like this:

\[ \mathbf{Q}_h = \mathbf{P}_a\mathbf{P}_h \]

And the SALSA authority matrix like this:

\[ \mathbf{Q}_a = \mathbf{P}_h\mathbf{P}_a \]

Which in R looks like:

   Q.h <- P.a %*% P.h
   Q.a <- P.h %*% P.a

Each of these matrices are row stochastic:

   rowSums(Q.h)

 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

   rowSums(Q.a)

 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Which means that inequalities will be defined according to differences in the in-degrees of each node just like PageRank.

And now to obtain our SALSA hub and authority scores, we simply play our status1 game on (the transpose of) these matrices, just like we did for PageRank:

   salsa.h <- status1(t(Q.h)) 
   round(salsa.h/sum(salsa.h), 2)

 [1] 0.03 0.02 0.08 0.06 0.08 0.01 0.04 0.04 0.07 0.07 0.02 0.01 0.03 0.02 0.11
[16] 0.02 0.03 0.09 0.06 0.06 0.06

   salsa.a <- status1(t(Q.a)) 
   round(salsa.a/sum(salsa.a), 2)

 [1] 0.07 0.09 0.03 0.04 0.03 0.05 0.07 0.05 0.02 0.05 0.06 0.04 0.02 0.05 0.02
[16] 0.04 0.05 0.08 0.02 0.04 0.08

What are these numbers? Well, it turns out that they are equivalent to the out and indegrees of each node, divided by the total number of edges in the network (Fouss, Renders, and Saerens 2004, 451).

So the SALSA hub and authority scores can also be obtained like this, without having to play our status game over any matrix:

   round(rowSums(A)/sum(A), 2)

 [1] 0.03 0.02 0.08 0.06 0.08 0.01 0.04 0.04 0.07 0.07 0.02 0.01 0.03 0.02 0.11
[16] 0.02 0.03 0.09 0.06 0.06 0.06

   round(colSums(A)/sum(A), 2)

 [1] 0.07 0.09 0.03 0.04 0.03 0.05 0.07 0.05 0.02 0.05 0.06 0.04 0.02 0.05 0.02
[16] 0.04 0.05 0.08 0.02 0.04 0.08

Note that for the SALSA scores we use a different normalization compared to before. Instead of dividing by the maximum, we divide by the sum, so that the vector of SALSA hub and authority scores sum to 1.0.

The reason for this is that, as Fouss, Renders, and Saerens (2004) explain, these numbers have a straightforward interpretation in terms of the probability that a random walker in the network will find itself in that particular node, when the walker goes from hub -> authority -> hub -> authority (e.g., never going from hub to hub or authority to authority) using the entries in the P.a and P.h matrices to determine the probability of jumping from hub $i$ to authority $j$ and vice versa. Thus, the higher the probability the more “central” the specific hub or authority is, just like the random walk interpretation of the PageRank scores.

HITS versus Principal Components Analysis

Saerens and Fouss (2005) argue that there is an intimate relationship between the HITS dual ranking scores and one of the most widely used multivariate analysis techniques, Principal Components Analysis (PCA).

In fact, they argue that HITS is just PCA on an uncentered data matrix (which in the this case is the network’s adjacency matrix). Technically we could also just argue that PCA is just HITS on a centered adjacency matrix, as we will see in just a bit.

Let’s see how this works. First, let’s create a centered version of the network adjacency matrix. To center a typical data matrix (e.g., of individuals in the rows by variables in the columns) we subtract the column mean (the average score of all individuals on that variable) from each individual’s score.

So to get the centered network adjacency matrix, we first need to compute the corresponding column means of the matrix. Note that this will be equivalent to each individual’s indegree (the sum of the columns) divided by the total number of nodes in the network ($k^{in}/n$), which is kind of a normalized degree centrality score, except the usual normalization is to divide by $n-1$ (so as to have a maximum of 1.0) as we saw in our discussion of centrality.

First, let’s compute the column means vector, using the native R function colMeans which takes a matrix as input and returns a vector of column means:

   cm <- colMeans(A)
   round(cm, 2)

 [1] 0.62 0.86 0.24 0.38 0.24 0.48 0.62 0.48 0.19 0.43 0.52 0.33 0.19 0.48 0.19
[16] 0.38 0.43 0.71 0.19 0.38 0.71

Now that we have this vector of column means, all we need to do is subtract it from each column of the adjacency matrix, to create the centered adjacency matrix:

   A.c <- t(t(A) - cm)

Note that to subtract the column mean vector from each column of the adjacency matrix we first transpose it, do the subtraction, and “untranspose” back to the original (by taking the transpose of the transpose). We can use the colMeans function to check that the column means of the centered adjacency matrix are indeed zero:

   round(colMeans(A.c), 2)

 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Now all we need to do is play our HITS status game on the centered adjacency matrix:

   hits.res <- status2(A.c)

We then use the function below to normalize each status score to be within the minus one to plus one interval and have mean zero:

   norm.v <- function(x) {
      x <- x - min(x)
      x <- x/max(x)
      x <- x - mean(x)
      return(x)
   }

The first thing the function does (line 2) is subtract the minimum value from the vector (which becomes zero), in line 3 we divided by the maximum (which becomes one), and in line 4 we subtract the mean from each value.

The following code applies the normalization to the results from our status game on the centered adjacency matrix:

   h2 <- round(norm.v(hits.res$h), 3)
   a2 <- round(norm.v(hits.res$a), 3)
   names(h2) <- 1:21
   names(a2) <- 1:21

And the resulting normalized hub and authority scores for each node in the Krackhardt managers advice network are:

   round(h2, 3)

     1      2      3      4      5      6      7      8      9     10     11 
-0.162 -0.374  0.318  0.241  0.364 -0.429 -0.027 -0.117  0.239  0.334 -0.335 
    12     13     14     15     16     17     18     19     20     21 
-0.421 -0.160 -0.339  0.571 -0.240 -0.295  0.385  0.105  0.206  0.134

   round(a2, 3)

     1      2      3      4      5      6      7      8      9     10     11 
 0.052 -0.155 -0.031 -0.177 -0.150 -0.020 -0.555  0.415 -0.199  0.106  0.411 
    12     13     14     15     16     17     18     19     20     21 
 0.031  0.020  0.215 -0.233  0.232  0.314 -0.056  0.020  0.343 -0.585

Saerens and Fouss (2005) show that these are the same scores we would obtain from a simple PCA of the regular old adjacency matrix. To see this, let’s do the PCA analysis using the PCA function from the package FactorMineR.³

   library(FactoMineR)
   pca.res <- PCA(A, graph = FALSE, scale.unit = FALSE)

Note that we set the argument scale.unit to FALSE so that the PCA is conducted on the centered adjacency matrix and not a standardized (to unit variance) version of it.

The PCA function stores the corresponding scores for the rows and columns of the matrix in these objects:

   pca.h <- pca.res$ind$coord[, 1]
   pca.a <- pca.res$var$coord[, 1]
   names(pca.a) <- 1:21

And now, for the big reveal:

   round(norm.v(pca.h), 3)

     1      2      3      4      5      6      7      8      9     10     11 
-0.162 -0.374  0.318  0.241  0.364 -0.429 -0.027 -0.117  0.239  0.334 -0.335 
    12     13     14     15     16     17     18     19     20     21 
-0.421 -0.160 -0.339  0.571 -0.240 -0.295  0.385  0.105  0.206  0.134

   round(norm.v(pca.a), 3)

     1      2      3      4      5      6      7      8      9     10     11 
 0.052 -0.155 -0.031 -0.177 -0.150 -0.020 -0.555  0.415 -0.199  0.106  0.411 
    12     13     14     15     16     17     18     19     20     21 
 0.031  0.020  0.215 -0.233  0.232  0.314 -0.056  0.020  0.343 -0.585

Which are indeed the same scores we obtained earlier when we played our status game on the centered adjacency matrix!

We will see one way to interpret these scores in the next section.

HITS and SALSA versus Correspondence Analysis

Fouss, Renders, and Saerens (2004) also argue that there is a close link between a method to analyze two-way tables called correspondence analysis, and both Lempel and Moran’s SALSA and Kleinberg’s HITS algorithms.

They first ask: What if we play our status distribution game not on the transpose of the SALSA hub and authority matrices like we just did but just on the regular matrices without transposition?

Here’s what happens:

   round(status1(Q.h), 3)

 [1] 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218
[13] 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218

   round(status1(Q.a), 3)

 [1] 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218
[13] 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218 0.218

OK, so that’s weird. All that we get is a vector with the same number repeated twenty one times (in this case, the number of nodes in the graph). What’s going on?

Recall from the previous that the status game computes the leading eigenvector of the matrix we play the game on, and spits that vector out as our status scores for that matrix. The leading eigenvector is that associated with the largest eigenvalue (if the matrix contains one).

So all that this is telling us is that the first eigenvector of the un-transposed versions of the SALSA hub and authority matrices is pretty useless because it assigns everyone the same status score.

But Fouss, Renders, and Saerens (2004) note, like we did at the beginning, that a matrix has many eigenvector/eigenvalue pairs and that perhaps the second leading eigenvector is not that useless; this is the eigenvector associated with the second largest eigenvalue.

How do we get that vector? Well, as always, there is a mathematical workaround. The trick is to create a new matrix that removes the influence of that first (useless) eigenvector and then play our status game on that matrix.

To do that, let’s create a matrix that is equal to the original useless eigenvector times its own transpose. In R this goes like this:

   v1 <- status1(Q.h)
   D <- v1 %*% t(v1)

What’s in this matrix? Let’s see the first ten rows and columns:

   round(D[1:10, 1:10], 3)

       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,] 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
 [2,] 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
 [3,] 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
 [4,] 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
 [5,] 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
 [6,] 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
 [7,] 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
 [8,] 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
 [9,] 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
[10,] 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048

So it’s just a matrix of the same dimension as the SALSA hub matrix with the same number over and over. In fact that number is equal to:

   round(0.2182179^2, 3)

[1] 0.048

Which is just the useless constant status score multiplied by itself (squared).

Now, we create new SALSA hub and authority matrices, which are equal to the original minus the constant D matrix above:

   Q.h2 <- Q.h - D
   Q.a2 <- Q.a - D

And now we play our status game on these matrices:

   h3 <- status1(Q.h2)
   a3 <- status1(Q.a2)
   names(h3) <- 1:21
   names(a3) <- 1:21

We then use the same function we used for the PCA scores to normalize each status score to be within the minus one to plus one interval and have mean zero:

   h3 <- norm.v(h3)
   a3 <- norm.v(a3)

And the resulting scores are:

   round(h3, 3)

     1      2      3      4      5      6      7      8      9     10     11 
 0.007 -0.436  0.074  0.074  0.138 -0.626 -0.019 -0.098  0.023  0.374 -0.063 
    12     13     14     15     16     17     18     19     20     21 
-0.530  0.334 -0.243  0.212  0.119 -0.147  0.292  0.290  0.107  0.115

   round(a3, 3)

     1      2      3      4      5      6      7      8      9     10     11 
-0.008 -0.146  0.299 -0.106  0.414 -0.329 -0.455 -0.005  0.223 -0.049 -0.027 
    12     13     14     15     16     17     18     19     20     21 
-0.158  0.291  0.035  0.323  0.026 -0.057 -0.114  0.291  0.139 -0.586

Now, these scores don’t seem useless. They are different across each node; and like the PCA scores we obtained earlier, some are positive and some are negative.

Fouss, Renders, and Saerens (2004) show that these are the same scores you would obtain from a correspondence analysis (CA) of the original affiliation matrix.

Let’s check that out in our case. The same package we used to compute the PCA of the adjacency matrix (FactoMineR) can be used to compute the correspondence analysis of any matrix in R, including a network adjacency matrix, using the function CA:

   #library(FactoMineR)
   ca.res <- CA(A, graph = FALSE)
   ca.h <- ca.res$row$coord[, 1]
   ca.a <- ca.res$col$coord[, 1]
   names(ca.a) <- 1:21

In line 2 we store the CA results in the object ca.res. We then grab the CA scores associated with the columns of the adjacency matrix and put them in the object ca.h in line 3 and the scores associated with the rows of the adjacency matrix and put them in the object ca.a inline 4.

Now for the big reveal:

   round(norm.v(ca.h), 3)

     1      2      3      4      5      6      7      8      9     10     11 
 0.007 -0.436  0.074  0.074  0.138 -0.626 -0.019 -0.098  0.023  0.374 -0.063 
    12     13     14     15     16     17     18     19     20     21 
-0.530  0.334 -0.243  0.212  0.119 -0.147  0.292  0.290  0.107  0.115

   round(norm.v(ca.a), 3)

     1      2      3      4      5      6      7      8      9     10     11 
-0.008 -0.146  0.299 -0.106  0.414 -0.329 -0.455 -0.005  0.223 -0.049 -0.027 
    12     13     14     15     16     17     18     19     20     21 
-0.158  0.291  0.035  0.323  0.026 -0.057 -0.114  0.291  0.139 -0.586

Which shows that indeed the CA scores are the same ones as we obtain from playing our status game on the corrected versions of the un-transposed SALSA hub and authorities matrices!

But what are the CA (and PCA) scores supposed to capture? Let’s look at them side-by-side next to the SALSA hub and authority scores, as shown in Table 1 which has nodes rank ordered by their SALSA hub score.

Table 1: Hubs and Authorties Scores for Krackhardt’s managers advice network
	salsa.h	salsa.a	pca.hub	pca.auth	ca.hub	ca.auth
15	0.416	0.088	0.571	-0.233	0.212	0.323
18	0.353	0.331	0.385	-0.056	0.292	-0.114
3	0.312	0.110	0.318	-0.031	0.074	0.299
5	0.312	0.110	0.364	-0.150	0.138	0.414
10	0.291	0.199	0.334	0.106	0.374	-0.049
9	0.270	0.088	0.239	-0.199	0.023	0.223
4	0.249	0.177	0.241	-0.177	0.074	-0.106
20	0.249	0.177	0.206	0.343	0.107	0.139
19	0.229	0.088	0.105	0.020	0.290	0.291
21	0.229	0.331	0.134	-0.585	0.115	-0.586
7	0.166	0.287	-0.027	-0.555	-0.019	-0.455
8	0.166	0.221	-0.117	0.415	-0.098	-0.005
1	0.125	0.287	-0.162	0.052	0.007	-0.008
13	0.125	0.088	-0.160	0.020	0.334	0.291
17	0.104	0.199	-0.295	0.314	-0.147	-0.057
14	0.083	0.221	-0.339	0.215	-0.243	0.035
16	0.083	0.177	-0.240	0.232	0.119	0.026
2	0.062	0.398	-0.374	-0.155	-0.436	-0.146
11	0.062	0.243	-0.335	0.411	-0.063	-0.027
12	0.042	0.155	-0.421	0.031	-0.530	-0.158
6	0.021	0.221	-0.429	-0.020	-0.626	-0.329

Using information in Table 1, Figure 2 (a) shows a point and line network plot but using the CA hub and authority scores to embed the nodes in a common two-dimensional space. In the plot nodes are colored by the difference between their SALSA hub and authority scores such that nodes with positive scores (hub score larger than their authority scores) appear in blue and nodes with negative scores (authority scores larger than their hub scores) appear in red.

We can see that the CA hub score places the “hubbiest” of hubs (blue nodes) in the lower-right quadrant of the diagram, with positive CA hub scores and negative CA authority scores (this last multiplied by minus one from the ones shown above). Note from Table 1, that the nodes in this region (e.g. {15, 5, 3, 19, 9, 13}) all have large SALSA hub scores and very low SALSA authority scores.

In the same way, the most authoritative of authorities (red nodes) appear in the upper-left side, with negative CA hub scores and positive CA authority scores. Note that nodes in this region have very large SALSA authority scores and very low SALSA hub scores.

Finally, note that nodes in the upper-right quadrant of the diagram (e.g. {21, 7, 4, 18, 10}) are “ambidextrous” nodes, showcasing relatively large scores as both hubs and authorities according to SALSA.

Thus what the CA scores seem to be doing is separating out the purest examples of hubs and authorities in the network from those who play both roles.

Note that as shown in Figure 2 (b), we get an even stronger separation between hubs and authorities when using the PCA scores to place nodes in a common space, with all the authorities (except nodes 7 and 21) to the upper left, and the hubs on the right-hand side of the plot. So the PCA scores reflect the same distinction between hub and authority status as the CA scores.

A Hybrid PageRank/HITS Approach

Borodin et al. (2005) argue that perhaps a better approach to combining PageRank and HITS is to normalize only one of the scores by degree while leaving the other score alone.

Picky Hubs

For instance, in some settings, it might make more sense to assign more authority to nodes that are pointed to by picky hubs (e.g., people who seek advice from a few select others), and discount the authority scores of nodes that are pointed to by indiscriminate hubs (people who seek advice from everyone).

We can do this by changing the way we compute the hub score of each node. Instead of just summing the authority scores of all the nodes they point to (like in regular HITS) we instead take the average of the authority scores of all nodes they point to. We then feed this average back to the authority score calculation.

This entails slightly modifying the HITS status game as follows:

   status3 <- function(w) {
     a <- rep(1, nrow(w))  #initializing authority scores
     d.o <- rowSums(w) #outdegree of each node
     d <- 1 #initializing delta
     k <- 0 #initializing counter
     while (d >= 1e-10) {
         o.a <- a #old authority scores
         h <- (w %*% o.a) * 1/d.o #new hub scores a function of previous authority scores of their out-neighbors
         h <- h/norm(h, type = "E")
         a <- t(w) %*% h #new authority scores a function of current hub scores of their in-neighbors
         a <- a/norm(a, type = "E")
         d <- abs(sum(abs(o.a) - abs(a))) #delta between old and new authority scores
         k <- k + 1
         }
   return(list(h = as.vector(h), a = as.vector(a), k = k))
   }

Note that the only modification is the addition of the multiplication by the inverse of the outdegrees in line 8, which divides the standard HITS hub score by the outdegree of each hub.

The resulting hubs and authorities scores are:

   hits.res2 <- status3(A)
   round(hits.res2$a/max(hits.res2$a), 3)

 [1] 0.693 1.000 0.221 0.414 0.223 0.538 0.759 0.494 0.187 0.470 0.552 0.361
[13] 0.174 0.498 0.180 0.395 0.451 0.817 0.174 0.375 0.893

   round(hits.res2$h/max(hits.res2$h), 3)

 [1] 0.749 0.818 0.635 0.657 0.619 1.000 0.716 0.762 0.683 0.493 0.916 0.925
[13] 0.639 0.972 0.543 0.835 0.842 0.508 0.590 0.642 0.604

This is an implementation of the “HubAvg” algorithm described by Borodin et al. (2005, 238–39).

Exclusive Authorities

In the same way, depending on the application, it might make more sense to assign a larger hub score to hubs that point to exclusive authorities (authorities that are sought after by a few select others) and discount the “hubness” of hubs that point to popular authorities (those who are sought after by everyone).

We can implement this approach—let’s call it the “AuthAvg” algorithm—with a slight modification of the status2 function similar to the one we used to create status3:

   status4 <- function(w) {
     a <- rep(1, nrow(w))  #initializing authority scores
     d.i <- colSums(w) #indegree of each node
     d <- 1 #initializing delta
     k <- 0 #initializing counter
     while (d >= 1e-10) {
         o.a <- a #old authority scores
         h <- w %*% o.a #new hub scores a function of previous authority scores of their out-neighbors
         h <- h/norm(h, type = "E")
         a <- (t(w) %*% h) * 1/d.i #new authority scores a function of current hub scores of their in-neighbors
         a <- a/norm(a, type = "E")
         d <- abs(sum(abs(o.a) - abs(a))) #delta between old and new authority scores
         k <- k + 1
         }
   return(list(h = as.vector(h), a = as.vector(a), k = k))
   }

And the resulting hubs and authorities scores are:

   hits.res3 <- status4(A)
   round(hits.res3$a/max(hits.res3$a), 3)

 [1] 0.651 0.579 0.877 0.694 0.819 0.658 0.548 0.784 0.855 0.733 0.761 0.736
[13] 1.000 0.738 0.808 0.797 0.768 0.600 1.000 0.847 0.524

   round(hits.res3$h/max(hits.res3$h), 3)

 [1] 0.266 0.116 0.700 0.564 0.734 0.035 0.358 0.341 0.593 0.734 0.119 0.072
[13] 0.283 0.150 1.000 0.171 0.200 0.869 0.532 0.561 0.523

References

Bonacich, Phillip, and Paulette Lloyd. 2001. “Eigenvector-Like Measures of Centrality for Asymmetric Relations.” Social Networks 23 (3): 191–201.

Borodin, Allan, Gareth O Roberts, Jeffrey S Rosenthal, and Panayiotis Tsaparas. 2005. “Link Analysis Ranking: Algorithms, Theory, and Experiments.” ACM Transactions on Internet Technology (TOIT) 5 (1): 231–97.

Ding, Chris, Xiaofeng He, Parry Husbands, Hongyuan Zha, and Horst Simon. 2002. “PageRank, HITS and a Unified Framework for Link Analysis.” Lawrence Berkeley National Laboratory, University of California, Berkeley.

Fouss, François, Jean-Michel Renders, and Marco Saerens. 2004. “Some Relationships Between Kleinberg’s Hubs and Authorities, Correspondence Analysis, and the Salsa Algorithm.” In International Conference on the Statistical Analysis of Textual Data (JADT 2004).

Kleinberg, Jon M. 1999. “Authoritative Sources in a Hyperlinked Environment.” Journal of the ACM (JACM) 46 (5): 604–32.

Langville, Amy N, and Carl D Meyer. 2005. “A Survey of Eigenvector Methods for Web Information Retrieval.” SIAM Review 47 (1): 135–61.

Lempel, Ronny, and Shlomo Moran. 2001. “SALSA: The Stochastic Approach for Link-Structure Analysis.” ACM Transactions on Information Systems (TOIS) 19 (2): 131–60.

Saerens, Marco, and Francois Fouss. 2005. “HITS Is Principal Components Analysis.” In The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI’05), 782–85. IEEE.

Footnotes

HITS is an acronym for the unfortunate and impossible to remember name “Hypertext Induced Topic Selection” reflecting the origins the approach in web-based information science.↩︎
Lempel and Moran (2001) call this method the “Stochastic Approach for Link Structure Analysis” or SALSA (an actually clever and memorable acronym!).↩︎
See http://factominer.free.fr/ for all of the package’s capabilities.↩︎