Statistical Models of Networks II

In this previous handout we saw how to do some basic, single-hypothesis testing against plausible null models. This approach can be thought of as conditioning on some set of graph level properties and seeing if something we compute in the graph obtains “net” of these conditioning (e.g., the observed degree distribution).

Conditioning, of course, can be thought of as the analogue of “controlling for” something in a standard linear regression model. This means that it is possible to extend the framework we explored to a setting in which we conditioning on multiple graph properties to see if what we measure obtains “net” of all these properties at the same time, something like a parametric version of the QAP regression approach.

Exponential Random Graph Models

The framework of Exponential Random Graph Models (ERGMs) is useful for these purposes. There are many ways to think of ERGMs, one of them is as a network regression in which we are modeling the probability of observing each tie as a function of a bunch of tie level variables. Another way of thinking about it is as specifying a dependence structure that tell us how probability of observing a given tie depends on other things going on in the “neighborhood” of that tie, like whether one of the nodes incident to that edge is also incident to other edges (Pattison and Robins 2002).

Formally, this means that an Exponential Random Graph Model is a probabilistic (auto)“regression” model in which the network itself, considered as a random variable (\(\mathbf{Y}\)) is the main outcome and the specified patterns of local dependence are the main predictors:

\[ Pr(\mathbf{Y} = \mathbf{y}) = \frac{exp\left[\theta^T S(\mathbf{y})\right] h(\mathbf{y})}{\kappa(\theta)} \]

This means that the probability of observing a particular realization of the network (\(Pr(\mathbf{Y} = \mathbf{y})\)) is a function of a given set of local structural configurations defined on the same network (\(S(\mathbf{y})\)) with the parameter \(\theta\) specifying the form of the dependence (e.g., positive or negative) of that local structural factor.1

Note that if the we take the natural logarithm of the numerator above it leads to the usual additive linear model expression:

\[ log(exp\left[\theta^T S(\mathbf{y})\right]) = \theta_1s_1(y) + \theta_2s_2(y) + \theta_3s_3(y) + \ldots \theta_ps_p(y) \]

Where \(p\) is the number of terms included in the model. Each \(S_p\) is thus a network statistic specifying counts of specific configurations (e.g., dyads or triads of a certain type, or links beginning at a particular type of node). When \(\theta\) is positive, the existence of a given configuration enhances the probability of observing a link contained in it, while when it is negative \(\theta\) is positive, the same configuration depresses the respective probability.

The whole network, therefore, is seen as being generated by the entire ensemble of local structural mechanisms operating in tandem. This means that each ergm also specifies a probability distribution of graphs that are generated by the model’s parameters.

Fitting ERGMs

This is all very abstract, so let’s see how it works in practice. First, we need a package to estimate ERGMs. Thankfully, there is one, called (predictably) ergm (which is dependent on the package network):

   library(networkdata)
   library(igraph)
   #install.packages("network")
   #install.packages("ergm")
   #install.packages("intergraph")
   library(ergm)
   library(intergraph)

Let’s bring back the friendship network from Krackhardt’s 21 high tech managers:

   g <- ht_friends
   g <- as_undirected(g, mode = "collapse")
   n <- asNetwork(g)

Note that we use the function asNetwork from the intergraph package to coerce the igraph object into something that the package ergm can work with (namely, a network object).

And, now, let’s fit an ergm!

   m1 <- ergm(n ~ edges)
   summary(m1)
Call:
ergm(formula = n ~ edges)

Maximum Likelihood Results:

      Estimate Std. Error MCMC % z value Pr(>|z|)    
edges  -0.5057     0.1424      0   -3.55 0.000385 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 291.1  on 210  degrees of freedom
 Residual Deviance: 278.1  on 209  degrees of freedom
 
AIC: 280.1  BIC: 283.5  (Smaller is better. MC Std. Err. = 0)

The typical structure of an ergm call is modeled after the linear regression models R functions like lm or glm, with dependent variable followed by squiggly symbol followed by an additive list of independent variables of the form y ~ x.

The output which we can check out by using the function summary on the resulting object, is also pretty similar to the typical outputs of linear model functions in R. We have a table with the independent variable edges, followed by a coefficient estimate, and some kind of p-value against the null hypothesis that that estimate is zero. Which in this case we can reject pretty handily (\(Z = - 3.55\), \(p < 0.01\)).

What is edges? Well as we said before, the independent variables in ergm are just counts of stuff that’s going on in the network, to predict…the network!

In this case, edges fits the number of edges in the network, which means that it fits anything that is a function of this number, like the density (or the average degree).

In fact, the coefficient estimate of the edges parameters without any other network covariates is the density.

We can check this as follows, let’s transform the coefficient estimate from the logit to the probability scale:

   m1$coefficients
     edges 
-0.5057495 
   exp(m1$coefficients)/(1 + exp(m1$coefficients))
    edges 
0.3761905 

This probability estimate is the same as:

   edge_density(g)
[1] 0.3761905

That is, the expected probability of an edge existing between two randomly chosen nodes in the network, which is the same as the observed density.

If you like regression analogies, fitting an ergm including only the edges parameter and nothing else is equivalent to fitting a linear regression model with only an intercept (which returns the expected value—a.k.a., the mean—of the dependent variable).

As we noted, an ergm model is a model of the network that conditions on the things we put on the right-hand side of the equation, in this case, we conditioned on the density (number of edges), which means that this is an Erdos-Renyi random graph model for the observed network, like the ones we simulated using edge-swapping.

This also means that we can simulate a bunch of networks that condition on the fit of a given ergm. For this we use the function simulate:

   set.seed(123)
   sim <- simulate(m1, nsim = 100) #generates list of network objects
   sim <- lapply(sim, asIgraph) #converts to list of igraph objects

Original Friendship Network.

One ergm simulation.

Another ergm simulation.

Yet another ergm simulation.

Note that the simulated networks are a graph ensemble of networks drawn from a probability distribution in which the expected (mean) density is equivalent to the observed density.

We can check by looking at the densities of the separate networks:

   dens <- sapply(sim, edge_density)
   round(dens, 2)
  [1] 0.35 0.38 0.35 0.38 0.34 0.36 0.33 0.37 0.35 0.32 0.44 0.36 0.40 0.42 0.37
 [16] 0.39 0.39 0.46 0.37 0.36 0.33 0.37 0.40 0.36 0.40 0.42 0.40 0.39 0.36 0.35
 [31] 0.36 0.37 0.38 0.45 0.36 0.41 0.34 0.39 0.36 0.38 0.38 0.35 0.34 0.38 0.36
 [46] 0.38 0.34 0.43 0.38 0.40 0.37 0.43 0.34 0.37 0.34 0.39 0.40 0.40 0.39 0.49
 [61] 0.32 0.38 0.41 0.43 0.40 0.40 0.33 0.32 0.38 0.37 0.35 0.41 0.44 0.42 0.39
 [76] 0.38 0.38 0.40 0.38 0.38 0.36 0.37 0.38 0.35 0.37 0.37 0.40 0.43 0.42 0.43
 [91] 0.31 0.37 0.38 0.35 0.39 0.34 0.34 0.42 0.35 0.35

Some of the values are larger and some of them are lower, but they all revolve around the estimate:

If we take larger and larger graph ensembles the distribution of densities would become bell-shaped and the average of the ensemble would converge around the density of the observed network.

So a useful way to think about ergms is as models that fix expected graph statistics in a given ensemble. The values of the expectations are provided by the coefficients obtained in the model for that statistic, and the range of possible variation around those expectations are given by the standard error of the estimate.

Fitting a Real ERGM

Of course, nobody uses the ergm package to fit single parameter Erdo-Renyi random graph models. The point is to test multivariate hypotheses of the type: Is the value of a given network effect (e.g., homophily) I observe larger or smaller than we would expect in a network with these other characteristics?

So let’s recreate our original analysis of homophily (assortativity) by managerial level from our previous analysis:

   m2 <- ergm(n ~ edges + nodematch("level"))
   summary(m2)
Call:
ergm(formula = n ~ edges + nodematch("level"))

Maximum Likelihood Results:

                Estimate Std. Error MCMC % z value Pr(>|z|)    
edges            -1.0361     0.2481      0  -4.175  < 1e-04 ***
nodematch.level   0.8450     0.3060      0   2.762  0.00575 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 291.1  on 210  degrees of freedom
 Residual Deviance: 270.1  on 208  degrees of freedom
 
AIC: 274.1  BIC: 280.8  (Smaller is better. MC Std. Err. = 0)

Here we fit a nodematch term with the value of the node-characteristic “level.” This term fixes the expected number of dyads that match on this attribute to equal that observed in the network. That is, it adds a “dummy” (change score) indicator if a dyad matches on this attribute.

The results tell us that, net of density, there is higher likelihood of observing ties that match on this attribute (\(\theta = 0.86, p <0.01\)).

If we wanted to put a percentage, on this estimate, we could exponentiate:

   exp(m2$coefficients[2])
nodematch.level 
       2.328063 

Which tells us that there is more than 2 to 1 odds of observing connected dyads that match on level than not.

Activity Differences Based on a Categorical Factor

As we noted before, it could be that people with some values of the level variable are just more likely to form ties (e.g., the mean degree of nodes varies across the value of levels), so before we conclude that there is a lot of homophily going on, we may want to condition on the expected probability of a node with a given value of the level variable to show up at either end of an edge. We can do that like this:

   m3 <- ergm(n ~ edges + nodefactor("level", base = 2) + nodematch("level"))
   summary(m3)
Call:
ergm(formula = n ~ edges + nodefactor("level", base = 2) + nodematch("level"))

Maximum Likelihood Results:

                   Estimate Std. Error MCMC % z value Pr(>|z|)  
edges               -0.7035     0.4879      0  -1.442   0.1493  
nodefactor.level.1  -0.9055     0.6891      0  -1.314   0.1888  
nodefactor.level.3  -0.1587     0.4039      0  -0.393   0.6944  
nodematch.level      0.8147     0.4902      0   1.662   0.0965 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 291.1  on 210  degrees of freedom
 Residual Deviance: 268.1  on 206  degrees of freedom
 
AIC: 276.1  BIC: 289.5  (Smaller is better. MC Std. Err. = 0)

The ergm term nodefactor uses dummy coding for the levels of a given categorical variable and fits \(k-1\) terms for mean differences in degree for each category with a category serving as the (omitted) comparison. In this case, we indicate that using the base argument, which set the second level category as the reference.

As we can see, after adjusting for mean differences in degree across levels the assortativity estimate for the nodematch term is no longer significant at conventional levels \(p = 0.10\), indicating that heterogeneity in connectivity by nodes at different levels partially accounts for this effect (although model fit estimate prefer the more parsimonious model without nodefactor level effects).

Considering Multiple Homophily Effects

Of course, what makes ergms such a flexible modeling system is that we can fit multiple homophily terms at once. In fact, the list of “covariates” can be as long as you want, with the caveat that the more stuff you condition on, the more restricted (and perhaps nonsensical) is the space of graphs that you are presuming your actual data came from.

Regardless, here’s a model with multiple homophily effects, one for levels and the other for “Department”:

   m4 <- ergm(n ~ edges +  nodematch("level") + nodematch("dept"))
   summary(m4)
Call:
ergm(formula = n ~ edges + nodematch("level") + nodematch("dept"))

Maximum Likelihood Results:

                Estimate Std. Error MCMC % z value Pr(>|z|)    
edges            -1.2682     0.2679      0  -4.734  < 1e-04 ***
nodematch.level   0.8217     0.3127      0   2.628  0.00859 ** 
nodematch.dept    1.0345     0.3458      0   2.991  0.00278 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 291.1  on 210  degrees of freedom
 Residual Deviance: 261.0  on 207  degrees of freedom
 
AIC: 267  BIC: 277.1  (Smaller is better. MC Std. Err. = 0)

Which reveals strong homophily effects for each node-level factor “net” of the other one, suggesting that our graph is likely to have come from a world in which nodes form ties selectively based on both these criteria.

Homophily on a Continuous Attribute

Some node attributes like “age” and “tenure” in the Krackhardt’s High-Tech Managers data are continuous. Obviously, we couldn’t condition on matching on each level of a continuous attribute.

In the ergm framework homophily on a continuous attribute is handled using the absdiff term, which computes the absolute difference in the value of the attribute between the two nodes incident to the edge:

   m5 <- ergm(n ~ edges 
              +  nodematch("level") + nodematch("dept")
              + absdiff("age")
              )
   summary(m5)
Call:
ergm(formula = n ~ edges + nodematch("level") + nodematch("dept") + 
    absdiff("age"))

Maximum Likelihood Results:

                Estimate Std. Error MCMC % z value Pr(>|z|)   
edges           -0.89316    0.31550      0  -2.831  0.00464 **
nodematch.level  0.85287    0.31692      0   2.691  0.00712 **
nodematch.dept   1.06643    0.35049      0   3.043  0.00234 **
absdiff.age     -0.03972    0.01850      0  -2.147  0.03180 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 291.1  on 210  degrees of freedom
 Residual Deviance: 256.2  on 206  degrees of freedom
 
AIC: 264.2  BIC: 277.6  (Smaller is better. MC Std. Err. = 0)

Which tells us that there is indeed age homophily in this network \(\theta_{agediff} = -0.04, p < 0.05\). Note that here a negative effect indicates a smaller age difference in the tie, which indicates homophily. A positive effect in the absdiff term would indicate heterophily on the continuous attribute.

Activity Differences Based on a Continuous Factor

Of course, just like before, we may want to check that we are not mistaking homophily, with the tendency of people in certain age groups to form more ties in general (e.g., an age/degree correlation in the network).

To fit the analogue of nodefactor for continuous attributes we use nodecov, which conditions on the sum of the values a given attribute for the two nodes incident to the edge:

   m6 <- ergm(n ~ edges 
              +  nodecov("age") + nodematch("level") 
              + nodematch("dept") + absdiff("age")
              )
   summary(m6)
Call:
ergm(formula = n ~ edges + nodecov("age") + nodematch("level") + 
    nodematch("dept") + absdiff("age"))

Maximum Likelihood Results:

                Estimate Std. Error MCMC % z value Pr(>|z|)   
edges           -0.10503    1.11891      0  -0.094  0.92521   
nodecov.age     -0.01104    0.01511      0  -0.731  0.46496   
nodematch.level  0.81527    0.32107      0   2.539  0.01111 * 
nodematch.dept   1.07818    0.35134      0   3.069  0.00215 **
absdiff.age     -0.02951    0.02312      0  -1.276  0.20181   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 291.1  on 210  degrees of freedom
 Residual Deviance: 255.6  on 205  degrees of freedom
 
AIC: 265.6  BIC: 282.4  (Smaller is better. MC Std. Err. = 0)

Like before, after adjusting for the age/degree correlation, there does not seem to be a statistically significant age homophily effect (\(p = 0.20\)).

Finally, here’s a model including homophily effects for all four node attributes (two categorical and two continuous) in the Krackhardt High-Tech Managers data:

   m7 <- ergm(n ~ edges 
              +  nodecov("age") + nodecov("tenure")
              + nodematch("level") + nodematch("dept")  
              + absdiff("age") + absdiff("tenure")
              )
   summary(m7)
Call:
ergm(formula = n ~ edges + nodecov("age") + nodecov("tenure") + 
    nodematch("level") + nodematch("dept") + absdiff("age") + 
    absdiff("tenure"))

Maximum Likelihood Results:

                Estimate Std. Error MCMC % z value Pr(>|z|)    
edges            0.39296    1.21536      0   0.323 0.746444    
nodecov.age     -0.04671    0.01858      0  -2.514 0.011954 *  
nodecov.tenure   0.07830    0.02116      0   3.701 0.000215 ***
nodematch.level  1.25596    0.35743      0   3.514 0.000442 ***
nodematch.dept   1.63590    0.41072      0   3.983  < 1e-04 ***
absdiff.age     -0.03375    0.02681      0  -1.259 0.208042    
absdiff.tenure   0.01060    0.02936      0   0.361 0.718209    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 291.1  on 210  degrees of freedom
 Residual Deviance: 236.4  on 203  degrees of freedom
 
AIC: 250.4  BIC: 273.9  (Smaller is better. MC Std. Err. = 0)

Note that this model also conditions on the degree correlation between tenure at the firm and degree. As we can see, the categorical homophily effects due to level and department remain strong, however, there is no evidence for homophily effects based on age and tenure.

Instead, the continuous attributes are correlated with degree, with ties featuring younger people being more likely to be observed (negative age effect on the nodecov term) and ties featuring people with longer tenure (positive tenure effect on the nodecov term) also being more likely to be observed. It makes sense to condition on both age and tenure at the same time, since these are strongly correlated at the individual level (\(r=\) 0.48)

Heterogeneous Homophily Effects

Note that nodematch fits a single term for the homophily on a categorical attribute. For an attribute with just two categories this may not be a problem, since we are just comparing “same” connected dyads to “different” connected dyads.

But consider an attribute like “department” in these data, which has four levels. Fitting a single term means we are constraining the homophily effect to be the same regardless of the levels at which the nodes match. So the effect for 4-4 nodes is the same as the effect for 3-3 nodes, which is the same as the effect for 2-2 nodes and so forth.

We can of course relax this homogeneity restriction and specify separate nodematch effects by levels of the categorical attribute:

   m8 <- ergm(n ~ edges 
              + nodecov("age") + nodecov("tenure")
              + nodematch("level", levels = c(2, 3)) 
              + nodematch("dept", diff = TRUE, levels = c(2, 3, 4))  
              + absdiff("age") + absdiff("tenure") 
              )
   summary(m8)
Call:
ergm(formula = n ~ edges + nodecov("age") + nodecov("tenure") + 
    nodematch("level", levels = c(2, 3)) + nodematch("dept", 
    diff = TRUE, levels = c(2, 3, 4)) + absdiff("age") + absdiff("tenure"))

Maximum Likelihood Results:

                 Estimate Std. Error MCMC % z value Pr(>|z|)    
edges             1.03210    1.26058      0   0.819 0.412928    
nodecov.age      -0.05374    0.02055      0  -2.616 0.008909 ** 
nodecov.tenure    0.07958    0.02327      0   3.420 0.000627 ***
nodematch.level   1.15303    0.34606      0   3.332 0.000863 ***
nodematch.dept.1  1.26854    0.74658      0   1.699 0.089294 .  
nodematch.dept.2  1.55826    0.56209      0   2.772 0.005566 ** 
nodematch.dept.3 -0.34708    1.33313      0  -0.260 0.794592    
absdiff.age      -0.03089    0.02671      0  -1.157 0.247476    
absdiff.tenure    0.01397    0.02941      0   0.475 0.634642    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 291.1  on 210  degrees of freedom
 Residual Deviance: 242.7  on 201  degrees of freedom
 
AIC: 260.7  BIC: 290.8  (Smaller is better. MC Std. Err. = 0)

Here we are using connected dyads that match at department = 1 in the data (there’s one node that is coded at zero on this variable, so there cannot be dyads that match at this value) as the reference by specifying the levels argument. We specify we want heterogeneous homophily effects by department by setting the diff argument inside the nodematch term to TRUE (it is FALSE by default).

As we can see, we can observe heterogeneous homophily effects on the department attribute in these data. Connected dyads matching on department at levels 2 and 3 are responsible for most of the positive homophily effects. For dyads that match at level 4 of department, there is no homophily (\(\theta_{dept4-4} = - 0.35, p = 0.80\)).

Looking at combinations of ties featuring an attribute

In the preceding we looked at terms that condition on homophily or the probability that a given edge in a network drawn at random would feature nodes that match on a given attribute. Sometimes, we may be interested in looking at terms that condition on both matching and not-matching. For this task, we use the ergm term nodemix.

Let’s look at an example. But first let’s switch to a different data set, this time the friendship nomination network from Lazega’s Law Firm data:

   g <- law_friends
   g <- as_undirected(g, mode = "collapse")
   n <- asNetwork(g)

And now let’s fit an ergm that conditions on homophily and activity effects for four individual level covariates: gender, partner status, age, seniority.

   m9 <- ergm(n ~ edges + 
                  + nodecov("age") + nodecov("seniority")
                  + nodefactor("status") + nodefactor("gender")
                  + nodematch("status") + nodematch("gender")
                  + absdiff("age") + absdiff("seniority")
              )
   summary(m9)
Call:
ergm(formula = n ~ edges + +nodecov("age") + nodecov("seniority") + 
    nodefactor("status") + nodefactor("gender") + nodematch("status") + 
    nodematch("gender") + absdiff("age") + absdiff("seniority"))

Maximum Likelihood Results:

                     Estimate Std. Error MCMC % z value Pr(>|z|)    
edges               -0.851155   0.524376      0  -1.623 0.104551    
nodecov.age         -0.008933   0.007078      0  -1.262 0.206900    
nodecov.seniority    0.017056   0.008958      0   1.904 0.056905 .  
nodefactor.status.2 -0.406772   0.118871      0  -3.422 0.000622 ***
nodefactor.gender.2  0.193677   0.115764      0   1.673 0.094320 .  
nodematch.status     0.799008   0.147556      0   5.415  < 1e-04 ***
nodematch.gender     0.389504   0.149017      0   2.614 0.008953 ** 
absdiff.age         -0.030883   0.010068      0  -3.068 0.002158 ** 
absdiff.seniority   -0.067122   0.012294      0  -5.460  < 1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 3445  on 2485  degrees of freedom
 Residual Deviance: 1928  on 2476  degrees of freedom
 
AIC: 1946  BIC: 1999  (Smaller is better. MC Std. Err. = 0)

We can see that there are not statistically significant activity effects based on attribute, except for the one premised on status in the firm (although both the positive seniority and gender effects are significant at the \(p <0.10\) level. We can also observe predicted homophily effects based on all four characteristics. This looks like a well-fitting useful baseline model.

Let’s now add additional activity and homophily effects law_school. This variable is coded into three categories: (1) Harvard/Yale, (2) Uconn, (3) “Other”:

   m10 <- ergm(n ~ edges +
                  + nodecov("age") + nodecov("seniority")
                  + nodefactor("status") + nodefactor("gender")
                  + nodefactor("law_school", base = 3)
                  + nodematch("status") + nodematch("gender")
                  + absdiff("age") + absdiff("seniority")
                  + nodematch("law_school")
              )
   summary(m10)
Call:
ergm(formula = n ~ edges + +nodecov("age") + nodecov("seniority") + 
    nodefactor("status") + nodefactor("gender") + nodefactor("law_school", 
    base = 3) + nodematch("status") + nodematch("gender") + absdiff("age") + 
    absdiff("seniority") + nodematch("law_school"))

Maximum Likelihood Results:

                         Estimate Std. Error MCMC % z value Pr(>|z|)    
edges                   -1.275735   0.550528      0  -2.317 0.020488 *  
nodecov.age             -0.006860   0.007370      0  -0.931 0.351919    
nodecov.seniority        0.015054   0.009091      0   1.656 0.097734 .  
nodefactor.status.2     -0.429909   0.120257      0  -3.575 0.000350 ***
nodefactor.gender.2      0.134855   0.118160      0   1.141 0.253747    
nodefactor.law_school.1  0.091510   0.128525      0   0.712 0.476464    
nodefactor.law_school.2  0.319710   0.094629      0   3.379 0.000729 ***
nodematch.status         0.802762   0.148891      0   5.392  < 1e-04 ***
nodematch.gender         0.383468   0.149576      0   2.564 0.010356 *  
absdiff.age             -0.030985   0.010123      0  -3.061 0.002207 ** 
absdiff.seniority       -0.066486   0.012437      0  -5.346  < 1e-04 ***
nodematch.law_school     0.109033   0.122245      0   0.892 0.372436    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 3445  on 2485  degrees of freedom
 Residual Deviance: 1915  on 2473  degrees of freedom
 
AIC: 1939  BIC: 2009  (Smaller is better. MC Std. Err. = 0)

This model shows that Uconn graduates are more popular in the friendship network (they form more ties), compared to graduates from “other” schools, but that there doesn’t seem to be homophily based on this factor.

As we noted, sometimes we may suspect that homophily happens via a combination of attributes (e.g., gender and status). We can specify such effects using the nodemix term as follows:

   m11 <- ergm(n ~ edges +
                  + nodecov("age") + nodecov("seniority")
                  + nodefactor("status") + nodefactor("gender")
                  + nodefactor("law_school", base = 3)
                  + absdiff("age") + absdiff("seniority")
                  + nodematch("law_school")
                  + nodemix(c("gender", "status"), levels = TRUE, levels2 = c(1:6))
              )
   summary(m11)
Call:
ergm(formula = n ~ edges + +nodecov("age") + nodecov("seniority") + 
    nodefactor("status") + nodefactor("gender") + nodefactor("law_school", 
    base = 3) + absdiff("age") + absdiff("seniority") + nodematch("law_school") + 
    nodemix(c("gender", "status"), levels = TRUE, levels2 = c(1:6)))

Maximum Likelihood Results:

                           Estimate Std. Error MCMC % z value Pr(>|z|)    
edges                     -2.225120   0.715918      0  -3.108 0.001883 ** 
nodecov.age               -0.006030   0.007508      0  -0.803 0.421886    
nodecov.seniority          0.017111   0.009211      0   1.858 0.063210 .  
nodefactor.status.2        0.132407   0.239020      0   0.554 0.579607    
nodefactor.gender.2        0.554151   0.238303      0   2.325 0.020050 *  
nodefactor.law_school.1    0.080613   0.128892      0   0.625 0.531689    
nodefactor.law_school.2    0.326575   0.095087      0   3.434 0.000594 ***
absdiff.age               -0.029508   0.010222      0  -2.887 0.003891 ** 
absdiff.seniority         -0.069504   0.012543      0  -5.541  < 1e-04 ***
nodematch.law_school       0.110916   0.122475      0   0.906 0.365138    
mix.gender.status.1.1.1.1  1.940435   0.472391      0   4.108  < 1e-04 ***
mix.gender.status.1.1.1.2  0.645250   0.358610      0   1.799 0.071970 .  
mix.gender.status.1.2.1.2  1.134592   0.380036      0   2.985 0.002831 ** 
mix.gender.status.1.1.2.1  1.462671   0.443169      0   3.300 0.000965 ***
mix.gender.status.1.2.2.1 -0.869626   0.622416      0  -1.397 0.162360    
mix.gender.status.2.1.2.1  2.127482   1.314879      0   1.618 0.105661    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 3445  on 2485  degrees of freedom
 Residual Deviance: 1908  on 2469  degrees of freedom
 
AIC: 1940  BIC: 2033  (Smaller is better. MC Std. Err. = 0)

Here we use low status homogeneous dyads (regardless of gender mix) as the comparison category. The results show that compared to those dyads, high-status homogeneous, all men dyads are more prevalent, as are gender-mixed dyads where the man is higher status. We also find a preponderance of all-men, status heterogeneous dyads.

Checking for Model Fit

In ergms we can always check for model fit using standard statistics they share with general linear models, like the model deviance, BIC, AIC and so forth. These are particularly useful when trying to decide between nested models.

However, we can also use the fact that each ergm specifies a probability distribution across a graph ensemble consistent with the estimated parameters to see whether the specification we decided on actually produces graphs with global (network level) characteristics that fall within a plausible range.

That is, even though network level factors are not directly specified in the model, a good model should be able to reproduce them via the local tie generating processes we specified.

When an ergm specification does not do this, and instead produces graphs within a narrow or weird range of values—e.g., for basic network statistics like the degree distribution, the geodesic distance distribution, or basic counts of key motifs like triangles and number of shared partners (a technical issue known as “degeneracy” in the ergm framework)—then that is a signal we need to change something in our specification.

In the ergm package we can check for goodness of fit based on the ability to match global network characteristics using the gof function. The function takes a ergm model object as input along with an optional argument called GOF specifying what simulated global network property across the implied graph ensemble we want to compare to our observed values. The main ones include the degree distribution, the distribution of edge-wise shared partners (common neighbors of connected dyads), and the geodesic distance distribution (number of dyads separated by shortest parths of a given length).

Let’s use m8 above which was the attribute-driven model with heterogeneous homophily effects for department we fit to the Krackhardt High Tech Managers data. Let’s say, we wanted to see whether our model was successful in reproducing the distribution of activity (degree) across nodes. In that case, we would type:

   set.seed(123)
   gof.m8 <- gof(m8, GOF = ~ degree)

Here we asked gof to create a graph ensemble of one hundred graphs (the default) and look at the degree distribution in each, recording how many times a node with a given degree is observed in each graph.

To see the result we can check out the following object:

   gof.m8$pval.deg
         obs min mean max MC p-value
degree0    0   0 0.08   1       1.00
degree1    0   0 0.36   3       1.00
degree2    1   0 0.45   2       0.78
degree3    1   0 1.03   4       1.00
degree4    1   0 1.32   5       1.00
degree5    3   0 1.93   5       0.68
degree6    4   0 2.37   8       0.44
degree7    2   0 2.68   7       1.00
degree8    2   0 2.71   8       1.00
degree9    2   0 2.66   8       1.00
degree10   3   0 1.97   6       0.56
degree11   0   0 1.60   5       0.46
degree12   0   0 1.05   4       0.64
degree13   0   0 0.48   3       1.00
degree14   1   0 0.18   1       0.36
degree15   0   0 0.09   2       1.00
degree16   0   0 0.03   1       1.00
degree17   0   0 0.01   1       1.00
degree18   1   0 0.00   0       0.00
degree19   0   0 0.00   0       1.00
degree20   0   0 0.00   0       1.00

The way to read the table is as follows: Under the column “obs” is the values of degree in the data (observed), then the other columns contain the range (min and max) and the mean of the each value of degree in the graph ensemble implied by the model.

The p-value is a test that the observed value is significantly different from the mean of the graph ensemble. Note that in a good fitting model, the mean of the ensemble implied by the model so we want \(p > 0.05\) for each row in the table as this indicates a good fitting model. We can see that this is indeed the case for every value except the last outlier one (\(k = 18\)), the simulations show this never happening (max = 0), so it is outside of the range of the observed value of one (\(p = 0.00\)).

Overall, however, the model does a good job of reproducing a distribution of degrees in the ensemble consistent with the observed data. If we wanted to see it in a plot, we would just type:

   plot(gof.m8)

In this box plot, the ensemble mean is the blue diamond, the gray box is the interquartile (25th to 75th percentile) range, the black horizontal line in the gray box is the ensemble median, and the black connected line plot is the observed values.

What we would like to see in a well-fitting model is a black line that goes through the gray box and is not outside them, which is generally the case here for most of the observed range except for the very large values.

Conditioning on Degree and Shared Partners

Note that all the models estimated earlier only adjust for the edges parameter as a structural effect. In this sense, they are using the Erdos-Renyi model as the structural baseline.

As we saw with the edge swapping models, it is probably better to use more sophisticated null models than the Erdos-Renyi one. For instance, models that condition on degree heterogeneity in some way, or that also account for transitivity (e.g., triangle) effects (e.g., the tendency of dyads sharing common partners to be connected).

In an ergm context, we could condition on the degree distribution by adding a separate parameter for each observed degree, or adding a triangle term. These models are computationally intensive, and often fail to converge (for obscure technical reasons).

A better approach is to fit both degree heterogeneity and transitivity effect using a family of geometrically weighted decay parameter terms. The idea is to fit the distribution of a given quantity (e.g., degree, number of shared partners per dyad) by fitting a curve with a term that decays as either the degree or the number of shared partners increases, thus giving more weight to lower degrees and the first or second shared partner as compared to the third, fourth, fifth, and so forth.

These gw terms are less computationally intensive and use a single parameter to fit both the degree distribution—using the gwdegree term—and the (dyad or edge)wise shared partner distribution—using the gwdsp and gwesp terms (respectively).

Let’s see how the gw ergm terms work, going back to Krackhardt’s High Tech Managers data. Here’s a model looking at activity and homophily effects conditioning on the degree distribution:

   g <- ht_friends
   g <- as_undirected(g, mode = "collapse")
   n <- asNetwork(g)
   m12 <- ergm(n ~ edges 
              + nodecov("age") + nodecov("tenure")
              + nodematch("level", levels = c(2, 3)) 
              + nodematch("dept", diff = TRUE, levels = c(2, 3, 4))  
              + absdiff("age") + absdiff("tenure") 
              + gwdegree(decay = 0.25, fixed = TRUE)
              )
   summary(m12)
Call:
ergm(formula = n ~ edges + nodecov("age") + nodecov("tenure") + 
    nodematch("level", levels = c(2, 3)) + nodematch("dept", 
    diff = TRUE, levels = c(2, 3, 4)) + absdiff("age") + absdiff("tenure") + 
    gwdegree(decay = 0.25, fixed = TRUE))

Monte Carlo Maximum Likelihood Results:

                 Estimate Std. Error MCMC % z value Pr(>|z|)   
edges             1.56101    1.53828      0   1.015  0.31021   
nodecov.age      -0.06362    0.02524      0  -2.520  0.01173 * 
nodecov.tenure    0.08518    0.02593      0   3.285  0.00102 **
nodematch.level   1.15615    0.36159      0   3.197  0.00139 **
nodematch.dept.1  1.39805    0.78609      0   1.778  0.07532 . 
nodematch.dept.2  1.62179    0.58450      0   2.775  0.00553 **
nodematch.dept.3 -0.50239    1.36220      0  -0.369  0.71227   
absdiff.age      -0.03298    0.02463      0  -1.339  0.18058   
absdiff.tenure    0.01995    0.03107      0   0.642  0.52079   
gwdeg.fixed.0.25  7.55080   15.28561      0   0.494  0.62132   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 291.1  on 210  degrees of freedom
 Residual Deviance: 242.1  on 200  degrees of freedom
 
AIC: 262.1  BIC: 295.6  (Smaller is better. MC Std. Err. = 0.08229)

Note that our homophily estimates are still significant, even after accounting for heterogeneity in the degree distribution, the estimate of corresponding to the gwdegree term is not significant, suggesting that our activity and homophily attributes do a good job of fitting the empirical degree distribution (which we knew form our goodness of fit tests).

Now, here’s a model that conditions on the tendency for dyads who share common partners to themselves be connected:

   g <- ht_friends
   g <- as_undirected(g, mode = "collapse")
   n <- asNetwork(g)
   m12 <- ergm(n ~ edges 
              + nodecov("age") + nodecov("tenure")
              + nodematch("level", levels = c(2, 3)) 
              + nodematch("dept", diff = TRUE, levels = c(2, 3, 4))  
              + absdiff("age") + absdiff("tenure") 
              + gwdsp(decay = 0.25, fixed = TRUE)
              )
   summary(m12)
Call:
ergm(formula = n ~ edges + nodecov("age") + nodecov("tenure") + 
    nodematch("level", levels = c(2, 3)) + nodematch("dept", 
    diff = TRUE, levels = c(2, 3, 4)) + absdiff("age") + absdiff("tenure") + 
    gwdsp(decay = 0.25, fixed = TRUE))

Monte Carlo Maximum Likelihood Results:

                 Estimate Std. Error MCMC % z value Pr(>|z|)   
edges             1.16726    1.40838      0   0.829  0.40722   
nodecov.age      -0.06852    0.02288      0  -2.995  0.00274 **
nodecov.tenure    0.08571    0.02619      0   3.273  0.00107 **
nodematch.level   1.26334    0.40822      0   3.095  0.00197 **
nodematch.dept.1  1.40475    0.76587      0   1.834  0.06663 . 
nodematch.dept.2  1.53716    0.58401      0   2.632  0.00849 **
nodematch.dept.3 -0.67235    1.36362      0  -0.493  0.62197   
absdiff.age      -0.03087    0.02743      0  -1.125  0.26039   
absdiff.tenure    0.01625    0.03146      0   0.516  0.60552   
gwdsp.fixed.0.25  0.55992    0.23784      0   2.354  0.01856 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 291.1  on 210  degrees of freedom
 Residual Deviance: 235.4  on 200  degrees of freedom
 
AIC: 255.4  BIC: 288.9  (Smaller is better. MC Std. Err. = 0.1518)

In this model, we find a statistically significant \(\theta\) estimate for the geometrically weighted dyadwise shared partners effect (with decay parameter \(\gamma = 0.25\)), suggesting that this network is likely drawn from a distribution of graphs where pairs of nodes mutually connected to a third are also more likely to be connected themselves (\(p < 0.05\)).

However, even after adjusting for this structural tendency our homophily effects are still statistically significant, suggesting that they are not the spurious byproduct of transitivity in friendship ties.

Finally, we can also adjust for the tendency of connected dyads to share multiple friends (e.g., connected dyads being embedded on one or more triangles), also known as the Simmelian Tie effect, also known as the geometrically weighted edgewise shared partners effect:

   g <- ht_friends
   g <- as_undirected(g, mode = "collapse")
   n <- asNetwork(g)
   m13 <- ergm(n ~ edges 
              + nodecov("age") + nodecov("tenure")
              + nodematch("level", levels = c(2, 3)) 
              + nodematch("dept", diff = TRUE, levels = c(2, 3, 4))  
              + absdiff("age") + absdiff("tenure") 
              + gwdsp(decay = 0.25, fixed = TRUE)
              + gwesp(decay = 0.25, fixed = TRUE)
              )
   summary(m13)
Call:
ergm(formula = n ~ edges + nodecov("age") + nodecov("tenure") + 
    nodematch("level", levels = c(2, 3)) + nodematch("dept", 
    diff = TRUE, levels = c(2, 3, 4)) + absdiff("age") + absdiff("tenure") + 
    gwdsp(decay = 0.25, fixed = TRUE) + gwesp(decay = 0.25, fixed = TRUE))

Monte Carlo Maximum Likelihood Results:

                 Estimate Std. Error MCMC % z value Pr(>|z|)   
edges            -0.75620    1.99545      0  -0.379  0.70471   
nodecov.age      -0.06005    0.02130      0  -2.820  0.00480 **
nodecov.tenure    0.07811    0.02526      0   3.092  0.00199 **
nodematch.level   1.20285    0.39791      0   3.023  0.00250 **
nodematch.dept.1  1.32797    0.78746      0   1.686  0.09172 . 
nodematch.dept.2  1.39367    0.55827      0   2.496  0.01255 * 
nodematch.dept.3 -0.71792    1.32015      0  -0.544  0.58657   
absdiff.age      -0.03118    0.02791      0  -1.117  0.26396   
absdiff.tenure    0.01854    0.03028      0   0.612  0.54038   
gwdsp.fixed.0.25  0.40011    0.23478      0   1.704  0.08835 . 
gwesp.fixed.0.25  1.14136    1.06525      0   1.071  0.28397   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 291.1  on 210  degrees of freedom
 Residual Deviance: 233.1  on 199  degrees of freedom
 
AIC: 255.1  BIC: 292  (Smaller is better. MC Std. Err. = 0.2465)

Which shows that our results are also robust to this structural tendency, although the transitivity effect is weakened a bit (\(p < 0.10\)) when adjusting for statistical prevalence of Simmelian ties.

References

Pattison, Philippa, and Garry Robins. 2002. “Neighborhood-Based Models for Social Networks.” Sociological Methodology 32 (1): 301–37.

Footnotes

  1. The denominator (\(\kappa(\theta)\)) is a normalizing constant required for the expression to result in a proper probability distribution and thus is of no substantive significance.↩︎