Basic Network Statistics

Loading Data

Here we will analyze a small network and computer some basic statistics of interest. The first thing we need to do is get some data! For this purpose, we will use the package networkdata (available here). To install the package, use the following code:

   #install.packages("remotes") 
   remotes::install_github("schochastics/networkdata")

To load the network datasets in the networkdata just type:

   library(networkdata)

The package contains a bunch of human and animal social networks to browse through them, type:

   data(package = "networkdata")

We will pick one of the movies for this analysis, namely, Pulp Fiction. This is movie_559. In the movie network two characters are linked by an edge if they appear in a scene together. The networkdata data sets come in igraph format, so we need to load that package (or install it using install.packages if you haven’t done that yet).

   #install.packages("igraph") 
   library(igraph)
   g <- movie_559

Number of Nodes and Edges

Now we are ready to compute some basic network statistics. As with any network, we want to know what the number of nodes and the number of edges (links) are. Since this is a relatively small network, we can begin by listing the actors.

   V(g)
+ 38/38 vertices, named, from 9e7cc7a:
 [1] BRETT           BUDDY           BUTCH           CAPT KOONS     
 [5] ED SULLIVAN     ENGLISH DAVE    ESMARELDA       FABIENNE       
 [9] FOURTH MAN      GAWKER #2       HONEY BUNNY     JIMMIE         
[13] JODY            JULES           LANCE           MANAGER        
[17] MARSELLUS       MARVIN          MAYNARD         MIA            
[21] MOTHER          PATRON          PEDESTRIAN      PREACHER       
[25] PUMPKIN         RAQUEL          ROGER           SPORTSCASTER #1
[29] SPORTSCASTER #2 THE GIMP        THE WOLF        VINCENT        
[33] WAITRESS        WINSTON         WOMAN           YOUNG MAN      
[37] YOUNG WOMAN     ZED            

The function V takes the igraph network object as input and returns an igraph.vs object as output (short for “igraph vertex sequence”), listing the names (if given as a graph attribute) of each node. The first line also tells us that there are 38 nodes in this network.

The igraph.vs object operates much like an R character vector, so we can query its length to figure out the number of nodes:

   length(V(g))
[1] 38

The analogue function for edges in igraph is E which also takes the network object as input and returns an object of class igraph.es (“igraph edge sequence”) as output:

   E(g)
+ 102/102 edges from 9e7cc7a (vertex names):
 [1] BRETT      --MARSELLUS       BRETT      --MARVIN         
 [3] BRETT      --ROGER           BRETT      --VINCENT        
 [5] BUDDY      --MIA             BUDDY      --VINCENT        
 [7] BRETT      --BUTCH           BUTCH      --CAPT KOONS     
 [9] BUTCH      --ESMARELDA       BUTCH      --GAWKER #2      
[11] BUTCH      --JULES           BUTCH      --MARSELLUS      
[13] BUTCH      --PEDESTRIAN      BUTCH      --SPORTSCASTER #1
[15] BUTCH      --ENGLISH DAVE    BRETT      --FABIENNE       
[17] BUTCH      --FABIENNE        FABIENNE   --JULES          
[19] FOURTH MAN --JULES           FOURTH MAN --VINCENT        
+ ... omitted several edges

This tells us that there are 102 edges (connected dyads) in the network. Some of these include Brett and Marsellus and Fabienne and Jules, but not all can be listed for reasons of space.

igraph also has two dedicated functions that return the number of nodes and edges in the graph in one fell swoop. They are called vcount and ecount and take the graph object as input:

   vcount(g)
[1] 38
   ecount(g)
[1] 102

Graph Density

Once we have the number of edges and nodes, we can calculate the most basic derived statistic in a network, which is the density. Since the movie network is an undirected graph, the density is given by:

\[ \frac{2m}{n(n-1)} \]

Where \(m\) is the number of edges and \(n\) is the number of nodes, or in our case:

   (2 * 102) / (38 * (38 - 1))
[1] 0.1450925

Of course, igraph has a dedicated function called edge_density to compute the density too, which takes the igraph object as input:

   edge_density(g)
[1] 0.1450925

Degree

The next set of graph metrics are based on the degree of the graph. We can list the graph’s degree set using the igraph function degree:

   degree(g)
          BRETT           BUDDY           BUTCH      CAPT KOONS     ED SULLIVAN 
              7               2              17               5               2 
   ENGLISH DAVE       ESMARELDA        FABIENNE      FOURTH MAN       GAWKER #2 
              4               1               3               2               3 
    HONEY BUNNY          JIMMIE            JODY           JULES           LANCE 
              8               3               4              16               4 
        MANAGER       MARSELLUS          MARVIN         MAYNARD             MIA 
              5              10               6               3              11 
         MOTHER          PATRON      PEDESTRIAN        PREACHER         PUMPKIN 
              5               5               3               3               8 
         RAQUEL           ROGER SPORTSCASTER #1 SPORTSCASTER #2        THE GIMP 
              3               6               2               1               2 
       THE WOLF         VINCENT        WAITRESS         WINSTON           WOMAN 
              3              25               4               3               5 
      YOUNG MAN     YOUNG WOMAN             ZED 
              4               4               2 

The degree function takes the igraph network object as input and returns a plain old R named vector as output with the names being the names attribute of vertices in the network object.

Usually we are interested in who are the “top nodes” in the network by degree (a kind of centrality). To figure that out, all we need to do is sort the degree set (to generate the graph’s degree sequence) and list the top entries:

   d <- degree(g)
   d.sort <- sort(d, decreasing = TRUE)
   d.sort[1:8]
    VINCENT       BUTCH       JULES         MIA   MARSELLUS HONEY BUNNY 
         25          17          16          11          10           8 
    PUMPKIN       BRETT 
          8           7 

Line 1 stores the degrees in an object “d”, line 2 creates a “sorted” version of the same object (from bigger to smaller) and line 3 shows the first eight entries of the sorted degree sequence.

Because the degree vector “d” is just a regular old vector we can use native R mathematical operations to figure out things like the sum, maximum, minimum, and average degree of the graph:

   sum(d)
[1] 204
   max(d)
[1] 25
   min(d)
[1] 1
   mean(d)
[1] 5.368421

So the sum of degrees is 204, the maximum degree is 25 (belonging to Vincent), the minimum is one, and the average is about 5.4.

Note that these numbers recreate some well-known equalities in graph theory:

   2 * ecount(g)
[1] 204
  • The average degree is just the sum of degrees divided by the number of nodes:
   sum(d)/vcount(g)
[1] 5.368421
  • The density is just the average degree divided by the number of nodes minus one, as explained here:
   mean(d)/(vcount(g) - 1)
[1] 0.1450925

Some people also consider the degree variance of the graph as a measure of inequality of connectivity in the system. It is equal to the average sum of square deviations of each node’s degree from the average:

\[ \mathcal{v}(G) = \frac{\sum_i (k_i - \bar{k})^2}{n} \]

   sum((d - mean(d))^2)/vcount(g)
[1] 22.96953

This tells us that there is a lot of inequality in the distribution of degrees in the graph (a graph with all nodes equal degree would have variance zero).

The Degree Distribution

Another way of looking at inequalities of degrees in a graph is to examine its degree distribution. This gives us the probability of observing a node with a given degree k in the graph.

   deg.dist <- degree_distribution(g)
   deg.dist <- round(deg.dist, 3)
   deg.dist
 [1] 0.000 0.053 0.158 0.237 0.158 0.132 0.053 0.026 0.053 0.000 0.026 0.026
[13] 0.000 0.000 0.000 0.000 0.026 0.026 0.000 0.000 0.000 0.000 0.000 0.000
[25] 0.000 0.026

The igraph function degree_distribution just returns a numeric vector of the same length as the maximum degree of the graph plus one. In this case that’s a vector of length 25 + 1 = 26. The first entry gives us the proportion of nodes with degree zero (isolates), the second the proportion of nodes of degree one, and so on up to the graph’s maximum degree.

Since there are no isolates in the network, we can ignore the first element of this vector, to get the proportion of nodes of each degree in the Pulp Fiction network. To that, we fist create a two-column data.frame with the degrees in the first column and the proportions in the second:

   degree <- c(1:max(d))
   prop <- deg.dist
   prop <- prop[-1]
   deg.dist <- data.frame(degree, prop)
   deg.dist
   degree  prop
1       1 0.053
2       2 0.158
3       3 0.237
4       4 0.158
5       5 0.132
6       6 0.053
7       7 0.026
8       8 0.053
9       9 0.000
10     10 0.026
11     11 0.026
12     12 0.000
13     13 0.000
14     14 0.000
15     15 0.000
16     16 0.026
17     17 0.026
18     18 0.000
19     19 0.000
20     20 0.000
21     21 0.000
22     22 0.000
23     23 0.000
24     24 0.000
25     25 0.026

Of course, a better way to display the degree distribution of a graph is via some kind of data visualization, particularly for large networks where a long table of numbers is just not feasible. To do that, we can call on our good friend ggplot:

   # install.packages(ggplot2)
   library(ggplot2)
   p <- ggplot(data = deg.dist, aes(x = degree, y = prop))
   p <- p + geom_bar(stat = "identity", fill = "red", color = "red")
   p <- p + theme_minimal()
   p <- p + labs(x = "", y = "Proportion", 
                 title = "Degree Distribution in Pulp Fiction Network") 
   p <- p + geom_vline(xintercept = mean(d), 
                       linetype = 2, linewidth = 0.5, color = "blue")
   p <- p + scale_x_continuous(breaks = c(1, 5, 10, 15, 20, 25))
   p

The plot clearly shows that the Pulp Fiction network degree distribution is skewed with a small number of characters having a large degree \(k \geq 15\) while most other characters in the movie have a small degree \(k \leq 5\) indicating inequality of connectivity in the system.

The Degree Correlation

Another overall network statistic we may want to know is the degree correlation (Newman 2002). How do we compute it? Imagine taking each edge in the network and creating two degree vectors, one based on the degree of the node in one end and the degre of the node in another. Then the degree assortativity coefficient is just the Pearson product moment correlation between these two vectors.

Let’s see how this would work for the Pulp Fiction network. First we need to extract an edge list from the graph:

   g.el <- as_edgelist(g) #transforming graph to edgelist
   head(g.el)
     [,1]    [,2]       
[1,] "BRETT" "MARSELLUS"
[2,] "BRETT" "MARVIN"   
[3,] "BRETT" "ROGER"    
[4,] "BRETT" "VINCENT"  
[5,] "BUDDY" "MIA"      
[6,] "BUDDY" "VINCENT"  

We can see that the as_edgelist function takes the igraph network object as input and returns an \(E \times 2\) matrix, with \(E = 102\) being the number of rows. Each column of the matrix records the name of the node on each end of the edge. So the first row of the edge list with entries “BRETT” and “MARSELLUS” tells us that there is an edge linking Brett and Marsellus, and so forth for each row.

To compute the correlation between the degrees of each node, all we need to do is attach the corresponding degrees to each name for each of the columns of the edge list, which can be done via data wrangling magic from the dplyr package (part of the tidyverse):

   # install.packages(dplyr)
   library(dplyr)
   deg.dat <- data.frame(name1 = names(d), name2 = names(d), d)
   el.temp <- data.frame(name2 = g.el[, 2]) %>% 
      left_join(deg.dat, by = "name2") %>% 
      dplyr::select(c("name2", "d")) %>% 
      rename(d2 = d) 
   d.el <- data.frame(name1 = g.el[, 1]) %>% 
      left_join(deg.dat, by = "name1") %>% 
      dplyr::select(c("name1", "d")) %>% 
      rename(d1 = d) %>% 
      cbind(el.temp)
head(d.el)
  name1 d1     name2 d2
1 BRETT  7 MARSELLUS 10
2 BRETT  7    MARVIN  6
3 BRETT  7     ROGER  6
4 BRETT  7   VINCENT 25
5 BUDDY  2       MIA 11
6 BUDDY  2   VINCENT 25

Line 3 creates a two-column data frame called “deg.dat” with as many rows as there are nodes in the network. The first two columns contain the names of each node (identically listed with different names) and the third columns contains the corresponding node’s degree.

Lines 4-7 use dplyr functions to create a new object “el.temp” joining the degree information to each of the node names listed in the second position in the original edge list “g.el,” and rename the imported column of degrees “d2.”

Lines 8-12 do the same for the nodes listed in the first position in the original edge list, renames the imported columns of degrees “d1,” and the binds the columns of the “el.temp” object to the new object “d.el.” The resulting object has four columns: Two for the names of the nodes incident to each edge on the edge list (columns 1 and 3), and two other ones corresponding to the degrees of the corresponding nodes (columns 2 and 4).

We can see from the output of the first few rows of the “d.el” object that indeed “BRETT” is assigned a degree of 7 in each row of the edge list, “BUDDY” a degree of 2, “MARSELLUS” a degree of 10, “VINCENT” a degree of 25 and so forth.

Now to compute the degree correlation in the network all we need to do is call the native R function cor on the two columns from “d.el” that containing the degree information. Note that because each degree appears twice at the end of each edge in an undirected graph (as both “sender” and “receiver”), we need to double each column by appending the other degree column at the end. So the first degree column is the vector:

   d1 <- c(d.el$d1, d.el$d2)

And the second degree column is the vector:

   d2 <- c(d.el$d2, d.el$d1)

And the graph’s degree correlation (Newman 2003) is just the Pearson correlation between these two degree vectors:

   cor(d1, d2)
[1] -0.2896427

The result \(r_{deg} = -0.29\) tells us that there is anti-correlation by degree in the Pulp Fiction network. That is high-degree characters tend to appear with low degree characters, or conversely, high-degree characters (like Marsellus and Jules) don’t appear together very often.

Of course, igraph has a function called assortativity_degree that does all the work for us:

   assortativity_degree(g)
[1] -0.2896427

The Average Shortest Path Length

The final statistic people use to characterize networks is the average shortest path length. In a network, even non-adjacent nodes, could be indirectly connected to other nodes via a path of some length (\(l > 1\)) So it is useful to know what the average of this quantity is across all dyads in the network.

To do that, we first need to compute the length of the shortest path \(l\) for each pair of nodes in the network (also known as the geodesic distance). Adjacent nodes get an automatic score of \(l = 1\). In igraph this is done as follows:

   S <- distances(g)
   S[1:7, 1:7]
             BRETT BUDDY BUTCH CAPT KOONS ED SULLIVAN ENGLISH DAVE ESMARELDA
BRETT            0     2     1          2           2            2         3
BUDDY            2     0     2          2           2            2         4
BUTCH            1     2     0          1           2            1         2
CAPT KOONS       2     2     1          0           2            2         3
ED SULLIVAN      2     2     2          2           0            2         4
ENGLISH DAVE     2     2     1          2           2            0         3
ESMARELDA        3     4     2          3           4            3         0

The igraph function distances takes the network object as input and returns the desired shortest path matrix. So for instance, Brett is directly connected to Butch (they appear in a scene together) but indirectly connected to Buddy via a path of length two (they both appear in scenes with common neighbors even if they don’t appear together).

The maximum distance between two nodes in the graph (the longest shortest path to put it confusingly) is called the graph diameter. We can find this out simply by using the native R function for the maximum on the shortest paths matrix:

   max(S)
[1] 8

This means that in the Pulp Fiction network the maximum degree of separation between two characters is a path of length 8.

Of course, we cann also call the igraph function diammeter:

   diameter(g)
[1] 8

Once we have the geodesic distance matrix, it is easy to calculate the average path length of the graph:

   rs.S <- rowSums(S)
   rm.S <- rs.S/(vcount(g) - 1)
   mean(rm.S)
[1] 2.769559
  • First (line 1) we sum all the rows (or columns) of the geodesic distance matrix. This vector (of the same length as the number of nodes) gives us the sum of the geodesic distance of each node to each of the nodes (we will use this to compute closeness centrality later).

  • Then (line 2) we divide this vector by the number of nodes minus one (to exclude the focal node) to create a vector of the average distance of each node to each of the other nodes.

  • Finally (line 3) we take the average across all nodes of this average distance vector to get the graph’s average shortest path length, which in this case equals L = 2.8.

This means that, on average, each character in Pulp Fiction is separated by little less than three contacts in the co-appearance network (a fairly small world).

Of course this can also be done in just one step on igraph:

   mean_distance(g)
[1] 2.769559

Putting it all Together

Now we can put together all the basic network statistics that we have computed into some sort of summary table, like the ones here. We first create a vector with the names of each statistic:

   Stat <- c("Nodes", "Edges", "Min. Degree", "Max. Degree", "Avg. Degree", "Degree Corr.", "Diameter", "Avg. Shortest Path Length")

Then we create a vector with the values:

   Value <- c(vcount(g), ecount(g), min(d), max(d), round(mean(d), 2), round(assortativity_degree(g), 2), max(S), round(mean_distance(g), 2))

We can then put these two vector together into a data frame:

   net.stats <- data.frame(Stat, Value)

We can then use the package kableExtra (a nice table maker) to create a nice html table:

   # intall.packages(kableExtra)
   library(kableExtra)
   kbl(net.stats, format = "pipe", align = c("l", "c"),
       caption = "Key Statistics for Pulp Fiction Network.") %>% 
   kable_styling(bootstrap_options = c("hover", "condensed", "responsive"))
Key Statistics for Pulp Fiction Network.
Stat Value
Nodes 38.00
Edges 102.00
Min. Degree 1.00
Max. Degree 25.00
Avg. Degree 5.37
Degree Corr. -0.29
Diameter 8.00
Avg. Shortest Path Length 2.77

Appendix: Loading Network Data from a File

When get network data from an archival source, and it will be in the form of a matrix or an edge list, typically in some kind of comma separated value (csv) format. Here will show how to input that into R to create an igraph network object from an outside file.

First we will write the Pulp Fiction data into an edge list and save it to disk. We already did that earlier with the “g.el” object. So all we have to do is save it to your local folder as a csv file:

   #install.packages(here)
   library(here)
   write.csv(d.el[c("name1", "name2")], here("pulp.csv"))

The write.csv function just saves an R object into a .csv file. Here the R object is “g.el” and we asked it to save just the columns which contain the name of each character. This represents the adjacency relations in the network as an edge list. We use the package here to keep track of our working directory. See here (pun intended) for details.

Now suppose that’s the network we want to work with and it’s saved in our hard drive. To load it, we just type:

   g.el <- read.csv(here("pulp.csv"), 
                    col.names = c("name1", "name2"))
   head(g.el)
  name1     name2
1 BRETT MARSELLUS
2 BRETT    MARVIN
3 BRETT     ROGER
4 BRETT   VINCENT
5 BUDDY       MIA
6 BUDDY   VINCENT

Which gives us the edge list we want now saved into an R object of class data.frame. So all we need is to convert that into an igraph object. To do that we use one of the many graph_from... functions in the igraph package. In this case, we want graph_from_edgelist because our network is stored as an edge list:

   g.el <- as.matrix(g.el)
   g <- graph_from_edgelist(g.el, directed = FALSE)
   V(g)
+ 38/38 vertices, named, from 99e5caf:
 [1] BRETT           MARSELLUS       MARVIN          ROGER          
 [5] VINCENT         BUDDY           MIA             BUTCH          
 [9] CAPT KOONS      ESMARELDA       GAWKER #2       JULES          
[13] PEDESTRIAN      SPORTSCASTER #1 ENGLISH DAVE    FABIENNE       
[17] FOURTH MAN      HONEY BUNNY     MANAGER         JIMMIE         
[21] JODY            PATRON          PUMPKIN         RAQUEL         
[25] WINSTON         LANCE           MAYNARD         THE GIMP       
[29] ZED             ED SULLIVAN     MOTHER          WOMAN          
[33] PREACHER        SPORTSCASTER #2 THE WOLF        WAITRESS       
[37] YOUNG MAN       YOUNG WOMAN    
   E(g)
+ 102/102 edges from 99e5caf (vertex names):
 [1] BRETT      --MARSELLUS       BRETT      --MARVIN         
 [3] BRETT      --ROGER           BRETT      --VINCENT        
 [5] BUDDY      --MIA             VINCENT    --BUDDY          
 [7] BRETT      --BUTCH           BUTCH      --CAPT KOONS     
 [9] BUTCH      --ESMARELDA       BUTCH      --GAWKER #2      
[11] BUTCH      --JULES           MARSELLUS  --BUTCH          
[13] BUTCH      --PEDESTRIAN      BUTCH      --SPORTSCASTER #1
[15] BUTCH      --ENGLISH DAVE    BRETT      --FABIENNE       
[17] BUTCH      --FABIENNE        JULES      --FABIENNE       
[19] JULES      --FOURTH MAN      VINCENT    --FOURTH MAN     
+ ... omitted several edges

Which gives us back the original igraph object we have been working with. Note that first we converted the data.frame object into a matrix object. We also specified that the graph is undirected by setting the option directed to false.

References

Newman, Mark EJ. 2002. “Assortative Mixing in Networks.” Physical Review Letters 89 (20): 208701.
———. 2003. “Mixing Patterns in Networks.” Physical Review E 67 (2): 026126.