Introduction

At times it can be advantagous to simulate data and study the HLM models and their behavior. In Edpsy 587, some small simulation studies are presented. In the notes on estimation this was done to show the impact of estimation methods on the results (especailly standard errors of fixed effets). Later we will examine a simulation study where we violate assumptions and examine impact on various results.

The basic steps covered here are

Simulate data from an a random intercept model with 2 levels
Fit and compare models fit to random intercept data (too simple, correct model and too complex.
Simulate data from a random intercept and slope model
Fit and comparge models fti to intercept and slope model

These could be writen more efficiently but they work and can be relatively easy to modify

Using screenreg from the texreg package can save you a lot of typing. To get output from texreg into word (and keep spacing, use

“courier new” font in word.
Landscape (especially if have lots of models). portrait works better for a few models and lots effects in them.
Change space with 0 pt between lines.
Edit model titles to be more informative.
If needed use smaller font (my least favorite thing to do).

Set up

We will use the same set up in terms of packages as previously done this semester

library(lme4)
library(texreg)

Simulate Random Intercept Data

We will simulate data from a model with a level 1 predictor and a level 2 predictor of the random intercept. In particular, \[ \mbox{Level 1:}\hspace{.5in} Y_{ij} = \beta_{0j} + \beta_{1j}x_{ij} + R_{ij}\] where \(R_{ij} \sim N(0,\sigma^2) i.i.d\). The variable \(x_{ij}\) is our level one variable. \[\mbox{Level 2:} \hspace{.5in} \beta_{0j} = \gamma_{00} + \gamma_{01}z_{j} + U_{0j} \\ \beta_{1j} = \gamma_{10}\] where \(U_{0j} \sim N(0,\tau_0^2)\) and independent of \(R_{ij}\). The linear mixed model is \[ Y_{ij} = \gamma_{00} + \gamma_{10} + \gamma_{01}\mbox{age}_{j} + U_{0j} + R_{ij}.\] Given our model, we need to set up some parameters and sample sized. These could be based on previous studies, expected outcomes, the purpose of the study, or other considerations.

N <-  100         # number of clusters/groups = 100 * 50
nj <- 20          # number of observations for group j

The linear mixed model we are Set fixed effects parameters,

gamma00 <- 5
gamma01 <- 2
gamma10 <- 3

We also need to set variance of random intercept, slope, covariance, and level 1 residual variances

tau2 = 2
sigma2 = 4

We need to save the simulated values so that we can later fit models to it, the following null matrix will do the trick

R.intercept <- matrix(,nrow=N*nj,ncol=7)
head(R.intercept)

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]   NA   NA   NA   NA   NA   NA   NA
## [2,]   NA   NA   NA   NA   NA   NA   NA
## [3,]   NA   NA   NA   NA   NA   NA   NA
## [4,]   NA   NA   NA   NA   NA   NA   NA
## [5,]   NA   NA   NA   NA   NA   NA   NA
## [6,]   NA   NA   NA   NA   NA   NA   NA

We also need and index to keep track of rows of the R.intercept object

index <- 1

We will simulate the random intercept data using multiple nexted loops. Think of the sampling of nested data, we first sample marco (e.g., schools) and then within the macro units, sample the micro units (e.g., students)

# (Outer Loop) Sample the marco units
    for (macro in (1:N)) {
      
     U0j= sqrt(tau2) * rnorm(1) 
     age=50 * runif(1) + 20     # this makes age range between 20 - 70
     
#   (Inner loop) Within a marco unit, take a sample of micro units;
       for (micro in (1:nj)) {

# Create level 1 explanatory/predictor variable (draw from standard normal)
        xij = rnorm(1) 

# Create level 1 residual with sigma2= 2    
        z = rnorm(1)   
        Rij=sqrt(sigma2)*z;  

# Computed response/outcome variable using linear mixed model 
#  and save the data in R.intercept
        yij = gamma00 + gamma01*xij + gamma01*age + U0j + Rij
        R.intercept[index,1:7] <- c(macro, micro, yij, age, xij, U0j, Rij)
          index <- index + 1   # up-data index
     }  # --- end inner loop, go to next micro unit. 

} # --- end outer loop, go to next macro unit;

The final step is to change our data to a data frame and make sure we have good names for all columns.

R.intercept <- as.data.frame(R.intercept)
names(R.intercept) <- c("macro","micro","yij","age","xij","U0j","Rij")
head(R.intercept)

##   macro micro      yij      age        xij      U0j       Rij
## 1     1     1 111.8731 52.19103 -0.8990345 1.731856  2.557265
## 2     1     2 108.2850 52.19103 -0.3353093 1.731856 -2.158303
## 3     1     3 113.7693 52.19103 -0.2769367 1.731856  3.209267
## 4     1     4 110.7890 52.19103 -1.0855800 1.731856  1.846235
## 5     1     5 103.0624 52.19103 -0.8738995 1.731856 -6.303725
## 6     1     6 112.0681 52.19103 -0.5397123 1.731856  2.033652

Fit Models to Simulated Random Intercept Data

We will fit a series of models here. One will the “correct” model; that is, the one used to simulate data. We will also fit one that is simpler that levels out out variable that I’ve called age. The last models is on that overfits the data; that is, it is more complex that how we’ve siumlated data. The last models includes a random slope

The correct model:

summary( model.1 <- lmer(yij ~ xij + age + (1 | macro),data=R.intercept, REML=FALSE) )

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: yij ~ xij + age + (1 | macro)
##    Data: R.intercept
## 
##      AIC      BIC   logLik deviance df.resid 
##   8814.7   8842.7  -4402.3   8804.7     1995 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -3.02312 -0.67129  0.01416  0.65867  2.91436 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  macro    (Intercept) 1.912    1.383   
##  Residual             4.261    2.064   
## Number of obs: 2000, groups:  macro, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 4.800728   0.464761   10.33
## xij         2.128698   0.047293   45.01
## age         2.008693   0.009943  202.03
## 
## Correlation of Fixed Effects:
##     (Intr) xij   
## xij -0.001       
## age -0.950  0.003

Too simple

summary( model.0 <- lmer(yij ~ xij  + (1 | macro),data=R.intercept, REML=FALSE) )

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: yij ~ xij + (1 | macro)
##    Data: R.intercept
## 
##      AIC      BIC   logLik deviance df.resid 
##   9414.1   9436.5  -4703.0   9406.1     1996 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9514 -0.6570  0.0155  0.6461  2.9961 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  macro    (Intercept) 869.259  29.483  
##  Residual               4.261   2.064  
## Number of obs: 2000, groups:  macro, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 93.95585    2.94868   31.86
## xij          2.12759    0.04741   44.87
## 
## Correlation of Fixed Effects:
##     (Intr)
## xij 0.000

Too complex

summary( model.2 <- lmer(yij ~ xij + age + (1 + xij | macro), data=R.intercept, REML=FALSE) )

## boundary (singular) fit: see ?isSingular

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: yij ~ xij + age + (1 + xij | macro)
##    Data: R.intercept
## 
##      AIC      BIC   logLik deviance df.resid 
##   8816.8   8856.0  -4401.4   8802.8     1993 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.97811 -0.66921  0.01805  0.65947  2.91403 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  macro    (Intercept) 1.911034 1.38240      
##           xij         0.004766 0.06904  1.00
##  Residual             4.256428 2.06311      
## Number of obs: 2000, groups:  macro, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 4.781866   0.460719   10.38
## xij         2.129125   0.047763   44.58
## age         2.009108   0.009846  204.05
## 
## Correlation of Fixed Effects:
##     (Intr) xij   
## xij  0.038       
## age -0.949  0.007
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular

Comparison

We can more easily compare the results of the models using `with each other and with the parameters used to simulate

screenreg(list(model.0, model.1, model.2),
          custom.model.names=c("Simple","Correct", "Complex"))

## 
## ====================================================================
##                             Simple        Correct       Complex     
## --------------------------------------------------------------------
## (Intercept)                    93.96 ***      4.80 ***      4.78 ***
##                                (2.95)        (0.46)        (0.46)   
## xij                             2.13 ***      2.13 ***      2.13 ***
##                                (0.05)        (0.05)        (0.05)   
## age                                           2.01 ***      2.01 ***
##                                              (0.01)        (0.01)   
## --------------------------------------------------------------------
## AIC                          9414.08       8814.68       8816.79    
## BIC                          9436.49       8842.68       8855.99    
## Log Likelihood              -4703.04      -4402.34      -4401.39    
## Num. obs.                    2000          2000          2000       
## Num. groups: macro            100           100           100       
## Var: macro (Intercept)        869.26          1.91          1.91    
## Var: Residual                   4.26          4.26          4.26    
## Var: macro xij                                              0.00    
## Cov: macro (Intercept) xij                                  0.10    
## ====================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

We see that the correct models (model.1) is the best in terms of both AIC and BIC (more on these measures later). Note that the model that is overly complext have \(\tau_{1}^2=0.00\), basically zero.

Simulate Random Intercept and Slope Model

We’ll use the same sample sizes, fixed effects, and variances as above; however, we also need \(\tau_1^2\) and \(\tau_{01}\), which are set below.

varU1j = 1
covU0jU1j= .4

To get desired covariance matrix for U0j and U1j, I find the use cholskey root to get weights for a linear combination that will create random effects that have the desired variance and covariance.

(Tau.mtx <- matrix(c(tau2, covU0jU1j, covU0jU1j, varU1j),nrow=2,ncol=2) )

##      [,1] [,2]
## [1,]  2.0  0.4
## [2,]  0.4  1.0

w <- chol(Tau.mtx)
w

##          [,1]      [,2]
## [1,] 1.414214 0.2828427
## [2,] 0.000000 0.9591663

We set up a null matrix to save data

R.intslope <- matrix( , nrow=N*nj, ncol=8)

and need to set the index back to 1

index <- 1

The process is the same, except we sample an extra random effect and take a linear combination of them to get the desired covariance

# (Outer Loop) Sample the marco units
    for (macro in (1:N)) {
      U0j <- w[1,1]*rnorm(1)
      U1j <- w[1,2]*U0j + w[2,2]*rnorm(1) 
      zj = 50 * runif(1) + 20      

# (Inner loop) Within a marco unit, take a sample of micro units;
       for (micro in (1:nj)) { 
         xij=rnorm(1)
         z=rnorm(1);   
         Rij=sqrt(sigma2)*z;      

# Computed response/outcome variable of the linear mixed model and save     
        yij = gamma00 + gamma10*xij + gamma01*zj + U0j + + U1j*xij + Rij
          R.intslope[index,1:8] <- c(macro,micro,yij,zj,xij,U0j,U1j,Rij)
        index <- index + 1
     }  # <--- end inner loop, go to next micro unit
} # <--- end outer loop, go to next macro unit

Again, we turn the simulated data into a data frame with names of what’s what

R.intslope <- as.data.frame(R.intslope)
names(R.intslope) <- c("macro","micro","yij","zj","xij","U0j","U1j","Rij")

Every time you randomly sample values, you’ll get different results, unless you set the seed for the random number generator (which I didn’t). Just to check on what the actual correlation matrix of random effects equals and how close it is to the desried one.

x <- R.intslope[ , c(6,7,8)]
cor(x)

##             U0j         U1j         Rij
## U0j 1.000000000 0.402359453 0.003544223
## U1j 0.402359453 1.000000000 0.003367676
## Rij 0.003544223 0.003367676 1.000000000

cov(x)

##             U0j        U1j         Rij
## U0j 1.892319336 0.63547624 0.009877273
## U1j 0.635476238 1.31818324 0.007833160
## Rij 0.009877273 0.00783316 4.104285616

The covariance matrix between \(U_{0j}\) and \(U_{1j}\) should be approximately equal to

Tau.mtx

##      [,1] [,2]
## [1,]  2.0  0.4
## [2,]  0.4  1.0

The covariance bewteen the \(U_{j}\) and \(R_{ij}\) should be very small and close to zero, and \(\sigma\) should be close to 4.

Fitting Random Intercept and Slope Models

Again we fit models that are too simple, too complex, and just right.

Too simple

summary( simple <- lmer(yij ~ xij + zj + (1  |macro), data=R.intslope, REML=FALSE) )

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: yij ~ xij + zj + (1 | macro)
##    Data: R.intslope
## 
##      AIC      BIC   logLik deviance df.resid 
##   9276.9   9304.9  -4633.4   9266.9     1995 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9368 -0.6325  0.0079  0.6463  3.2158 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  macro    (Intercept) 2.067    1.438   
##  Residual             5.407    2.325   
## Number of obs: 2000, groups:  macro, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  4.85653    0.51565   9.418
## xij          3.06499    0.05541  55.311
## zj           2.00326    0.01087 184.311
## 
## Correlation of Fixed Effects:
##     (Intr) xij   
## xij  0.009       
## zj  -0.955 -0.010

Correct model

summary( correct <- lmer(yij ~ xij + zj + (1 + xij |macro), data=R.intslope, REML=FALSE) )

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: yij ~ xij + zj + (1 + xij | macro)
##    Data: R.intslope
## 
##      AIC      BIC   logLik deviance df.resid 
##   8895.7   8934.9  -4440.8   8881.7     1993 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8597 -0.6219 -0.0166  0.6294  3.7895 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  macro    (Intercept) 1.952    1.397        
##           xij         1.475    1.215    0.44
##  Residual             4.032    2.008        
## Number of obs: 2000, groups:  macro, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 4.672484   0.461626   10.12
## xij         3.049993   0.131451   23.20
## zj          2.007790   0.009661  207.82
## 
## Correlation of Fixed Effects:
##     (Intr) xij   
## xij  0.125       
## zj  -0.948 -0.003

Too complex,

summary( complex <- lmer(yij ~ xij + zj + xij*zj + (1 + xij  | macro), data=R.intslope, REML=FALSE) )

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: yij ~ xij + zj + xij * zj + (1 + xij | macro)
##    Data: R.intslope
## 
##      AIC      BIC   logLik deviance df.resid 
##   8897.5   8942.3  -4440.8   8881.5     1992 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8566 -0.6212 -0.0164  0.6310  3.7916 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  macro    (Intercept) 1.952    1.397        
##           xij         1.474    1.214    0.43
##  Residual             4.032    2.008        
## Number of obs: 2000, groups:  macro, 100
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  4.742022   0.496187   9.557
## xij          3.211910   0.444019   7.234
## zj           2.006261   0.010457 191.860
## xij:zj      -0.003569   0.009350  -0.382
## 
## Correlation of Fixed Effects:
##        (Intr) xij    zj    
## xij     0.385              
## zj     -0.955 -0.367       
## xij:zj -0.367 -0.955  0.383

Comparison

Lastly to make comparison easer,

screenreg(list(simple, correct, complex),
          custom.model.names=c("Simple","Correct", "Complex"))

## 
## ====================================================================
##                             Simple        Correct       Complex     
## --------------------------------------------------------------------
## (Intercept)                     4.86 ***      4.67 ***      4.74 ***
##                                (0.52)        (0.46)        (0.50)   
## xij                             3.06 ***      3.05 ***      3.21 ***
##                                (0.06)        (0.13)        (0.44)   
## zj                              2.00 ***      2.01 ***      2.01 ***
##                                (0.01)        (0.01)        (0.01)   
## xij:zj                                                     -0.00    
##                                                            (0.01)   
## --------------------------------------------------------------------
## AIC                          9276.87       8895.66       8897.51    
## BIC                          9304.87       8934.87       8942.32    
## Log Likelihood              -4633.44      -4440.83      -4440.76    
## Num. obs.                    2000          2000          2000       
## Num. groups: macro            100           100           100       
## Var: macro (Intercept)          2.07          1.95          1.95    
## Var: Residual                   5.41          4.03          4.03    
## Var: macro xij                                1.48          1.47    
## Cov: macro (Intercept) xij                    0.74          0.74    
## ====================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

Some questions for you to answer:

What do you notice?
Are the results close to thosed used to simulate the data?

Final Remarks

In most stimulation studies, the process of simulating data, fiting models, and saving desired results is iterated a large number of times. After finishing all iterations, say \(Iter=100\) or \(500\), the fixed effects are averaged (their standard deviation is an estimate of the standard errors), varainces and covariances are averaged, and other desirable output. You should have a clear goal so that you saved all the important information. After running a simulation study, which can take a long time, you don’t want to think “I should have saved …”. It’s better to save more than less. Also, when saving data and output, it is often good to write results (and possibly) data to your hard drive at the end of the outer loop just in case your system crashes.

Simulation of Multilevel Data

Carolyn J. Anderson

2/21/2022

Introduction

Set up

Simulate Random Intercept Data

Fit Models to Simulated Random Intercept Data

The correct model:

Too simple

Too complex

Comparison

Simulate Random Intercept and Slope Model

Fitting Random Intercept and Slope Models

Too simple

Correct model

Too complex,

Comparison

Some questions for you to answer:

Final Remarks