At times it can be advantagous to simulate data and study the HLM models and their behavior. In Edpsy 587, some small simulation studies are presented. In the notes on estimation this was done to show the impact of estimation methods on the results (especailly standard errors of fixed effets). Later we will examine a simulation study where we violate assumptions and examine impact on various results.
The basic steps covered here are
These could be writen more efficiently but they work and can be relatively easy to modify
Using screenreg from the texreg package can save you a lot of typing. To get output from texreg into word (and keep spacing, use
We will use the same set up in terms of packages as previously done this semester
library(lme4)
library(texreg)
We will simulate data from a model with a level 1 predictor and a level 2 predictor of the random intercept. In particular, \[ \mbox{Level 1:}\hspace{.5in} Y_{ij} = \beta_{0j} + \beta_{1j}x_{ij} + R_{ij}\] where \(R_{ij} \sim N(0,\sigma^2) i.i.d\). The variable \(x_{ij}\) is our level one variable. \[\mbox{Level 2:} \hspace{.5in} \beta_{0j} = \gamma_{00} + \gamma_{01}z_{j} + U_{0j} \\ \beta_{1j} = \gamma_{10}\] where \(U_{0j} \sim N(0,\tau_0^2)\) and independent of \(R_{ij}\). The linear mixed model is \[ Y_{ij} = \gamma_{00} + \gamma_{10} + \gamma_{01}\mbox{age}_{j} + U_{0j} + R_{ij}.\] Given our model, we need to set up some parameters and sample sized. These could be based on previous studies, expected outcomes, the purpose of the study, or other considerations.
N <- 100 # number of clusters/groups = 100 * 50
nj <- 20 # number of observations for group j
The linear mixed model we are Set fixed effects parameters,
gamma00 <- 5
gamma01 <- 2
gamma10 <- 3
We also need to set variance of random intercept, slope, covariance, and level 1 residual variances
tau2 = 2
sigma2 = 4
We need to save the simulated values so that we can later fit models to it, the following null matrix will do the trick
R.intercept <- matrix(,nrow=N*nj,ncol=7)
head(R.intercept)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] NA NA NA NA NA NA NA
## [2,] NA NA NA NA NA NA NA
## [3,] NA NA NA NA NA NA NA
## [4,] NA NA NA NA NA NA NA
## [5,] NA NA NA NA NA NA NA
## [6,] NA NA NA NA NA NA NA
We also need and index to keep track of rows of the R.intercept object
index <- 1
We will simulate the random intercept data using multiple nexted loops. Think of the sampling of nested data, we first sample marco (e.g., schools) and then within the macro units, sample the micro units (e.g., students)
# (Outer Loop) Sample the marco units
for (macro in (1:N)) {
U0j= sqrt(tau2) * rnorm(1)
age=50 * runif(1) + 20 # this makes age range between 20 - 70
# (Inner loop) Within a marco unit, take a sample of micro units;
for (micro in (1:nj)) {
# Create level 1 explanatory/predictor variable (draw from standard normal)
xij = rnorm(1)
# Create level 1 residual with sigma2= 2
z = rnorm(1)
Rij=sqrt(sigma2)*z;
# Computed response/outcome variable using linear mixed model
# and save the data in R.intercept
yij = gamma00 + gamma01*xij + gamma01*age + U0j + Rij
R.intercept[index,1:7] <- c(macro, micro, yij, age, xij, U0j, Rij)
index <- index + 1 # up-data index
} # --- end inner loop, go to next micro unit.
} # --- end outer loop, go to next macro unit;
The final step is to change our data to a data frame and make sure we have good names for all columns.
R.intercept <- as.data.frame(R.intercept)
names(R.intercept) <- c("macro","micro","yij","age","xij","U0j","Rij")
head(R.intercept)
## macro micro yij age xij U0j Rij
## 1 1 1 111.8731 52.19103 -0.8990345 1.731856 2.557265
## 2 1 2 108.2850 52.19103 -0.3353093 1.731856 -2.158303
## 3 1 3 113.7693 52.19103 -0.2769367 1.731856 3.209267
## 4 1 4 110.7890 52.19103 -1.0855800 1.731856 1.846235
## 5 1 5 103.0624 52.19103 -0.8738995 1.731856 -6.303725
## 6 1 6 112.0681 52.19103 -0.5397123 1.731856 2.033652
We will fit a series of models here. One will the “correct” model; that is, the one used to simulate data. We will also fit one that is simpler that levels out out variable that I’ve called age. The last models is on that overfits the data; that is, it is more complex that how we’ve siumlated data. The last models includes a random slope
summary( model.1 <- lmer(yij ~ xij + age + (1 | macro),data=R.intercept, REML=FALSE) )
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: yij ~ xij + age + (1 | macro)
## Data: R.intercept
##
## AIC BIC logLik deviance df.resid
## 8814.7 8842.7 -4402.3 8804.7 1995
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.02312 -0.67129 0.01416 0.65867 2.91436
##
## Random effects:
## Groups Name Variance Std.Dev.
## macro (Intercept) 1.912 1.383
## Residual 4.261 2.064
## Number of obs: 2000, groups: macro, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 4.800728 0.464761 10.33
## xij 2.128698 0.047293 45.01
## age 2.008693 0.009943 202.03
##
## Correlation of Fixed Effects:
## (Intr) xij
## xij -0.001
## age -0.950 0.003
summary( model.0 <- lmer(yij ~ xij + (1 | macro),data=R.intercept, REML=FALSE) )
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: yij ~ xij + (1 | macro)
## Data: R.intercept
##
## AIC BIC logLik deviance df.resid
## 9414.1 9436.5 -4703.0 9406.1 1996
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.9514 -0.6570 0.0155 0.6461 2.9961
##
## Random effects:
## Groups Name Variance Std.Dev.
## macro (Intercept) 869.259 29.483
## Residual 4.261 2.064
## Number of obs: 2000, groups: macro, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 93.95585 2.94868 31.86
## xij 2.12759 0.04741 44.87
##
## Correlation of Fixed Effects:
## (Intr)
## xij 0.000
summary( model.2 <- lmer(yij ~ xij + age + (1 + xij | macro), data=R.intercept, REML=FALSE) )
## boundary (singular) fit: see ?isSingular
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: yij ~ xij + age + (1 + xij | macro)
## Data: R.intercept
##
## AIC BIC logLik deviance df.resid
## 8816.8 8856.0 -4401.4 8802.8 1993
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.97811 -0.66921 0.01805 0.65947 2.91403
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## macro (Intercept) 1.911034 1.38240
## xij 0.004766 0.06904 1.00
## Residual 4.256428 2.06311
## Number of obs: 2000, groups: macro, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 4.781866 0.460719 10.38
## xij 2.129125 0.047763 44.58
## age 2.009108 0.009846 204.05
##
## Correlation of Fixed Effects:
## (Intr) xij
## xij 0.038
## age -0.949 0.007
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
We can more easily compare the results of the models using `with each other and with the parameters used to simulate
screenreg(list(model.0, model.1, model.2),
custom.model.names=c("Simple","Correct", "Complex"))
##
## ====================================================================
## Simple Correct Complex
## --------------------------------------------------------------------
## (Intercept) 93.96 *** 4.80 *** 4.78 ***
## (2.95) (0.46) (0.46)
## xij 2.13 *** 2.13 *** 2.13 ***
## (0.05) (0.05) (0.05)
## age 2.01 *** 2.01 ***
## (0.01) (0.01)
## --------------------------------------------------------------------
## AIC 9414.08 8814.68 8816.79
## BIC 9436.49 8842.68 8855.99
## Log Likelihood -4703.04 -4402.34 -4401.39
## Num. obs. 2000 2000 2000
## Num. groups: macro 100 100 100
## Var: macro (Intercept) 869.26 1.91 1.91
## Var: Residual 4.26 4.26 4.26
## Var: macro xij 0.00
## Cov: macro (Intercept) xij 0.10
## ====================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
We see that the correct models (model.1) is the best in terms of both AIC and BIC (more on these measures later). Note that the model that is overly complext have \(\tau_{1}^2=0.00\), basically zero.
We’ll use the same sample sizes, fixed effects, and variances as above; however, we also need \(\tau_1^2\) and \(\tau_{01}\), which are set below.
varU1j = 1
covU0jU1j= .4
To get desired covariance matrix for U0j and U1j, I find the use cholskey root to get weights for a linear combination that will create random effects that have the desired variance and covariance.
(Tau.mtx <- matrix(c(tau2, covU0jU1j, covU0jU1j, varU1j),nrow=2,ncol=2) )
## [,1] [,2]
## [1,] 2.0 0.4
## [2,] 0.4 1.0
w <- chol(Tau.mtx)
w
## [,1] [,2]
## [1,] 1.414214 0.2828427
## [2,] 0.000000 0.9591663
We set up a null matrix to save data
R.intslope <- matrix( , nrow=N*nj, ncol=8)
and need to set the index back to 1
index <- 1
The process is the same, except we sample an extra random effect and take a linear combination of them to get the desired covariance
# (Outer Loop) Sample the marco units
for (macro in (1:N)) {
U0j <- w[1,1]*rnorm(1)
U1j <- w[1,2]*U0j + w[2,2]*rnorm(1)
zj = 50 * runif(1) + 20
# (Inner loop) Within a marco unit, take a sample of micro units;
for (micro in (1:nj)) {
xij=rnorm(1)
z=rnorm(1);
Rij=sqrt(sigma2)*z;
# Computed response/outcome variable of the linear mixed model and save
yij = gamma00 + gamma10*xij + gamma01*zj + U0j + + U1j*xij + Rij
R.intslope[index,1:8] <- c(macro,micro,yij,zj,xij,U0j,U1j,Rij)
index <- index + 1
} # <--- end inner loop, go to next micro unit
} # <--- end outer loop, go to next macro unit
Again, we turn the simulated data into a data frame with names of what’s what
R.intslope <- as.data.frame(R.intslope)
names(R.intslope) <- c("macro","micro","yij","zj","xij","U0j","U1j","Rij")
Every time you randomly sample values, you’ll get different results, unless you set the seed for the random number generator (which I didn’t). Just to check on what the actual correlation matrix of random effects equals and how close it is to the desried one.
x <- R.intslope[ , c(6,7,8)]
cor(x)
## U0j U1j Rij
## U0j 1.000000000 0.402359453 0.003544223
## U1j 0.402359453 1.000000000 0.003367676
## Rij 0.003544223 0.003367676 1.000000000
cov(x)
## U0j U1j Rij
## U0j 1.892319336 0.63547624 0.009877273
## U1j 0.635476238 1.31818324 0.007833160
## Rij 0.009877273 0.00783316 4.104285616
The covariance matrix between \(U_{0j}\) and \(U_{1j}\) should be approximately equal to
Tau.mtx
## [,1] [,2]
## [1,] 2.0 0.4
## [2,] 0.4 1.0
The covariance bewteen the \(U_{j}\) and \(R_{ij}\) should be very small and close to zero, and \(\sigma\) should be close to 4.
Again we fit models that are too simple, too complex, and just right.
summary( simple <- lmer(yij ~ xij + zj + (1 |macro), data=R.intslope, REML=FALSE) )
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: yij ~ xij + zj + (1 | macro)
## Data: R.intslope
##
## AIC BIC logLik deviance df.resid
## 9276.9 9304.9 -4633.4 9266.9 1995
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9368 -0.6325 0.0079 0.6463 3.2158
##
## Random effects:
## Groups Name Variance Std.Dev.
## macro (Intercept) 2.067 1.438
## Residual 5.407 2.325
## Number of obs: 2000, groups: macro, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 4.85653 0.51565 9.418
## xij 3.06499 0.05541 55.311
## zj 2.00326 0.01087 184.311
##
## Correlation of Fixed Effects:
## (Intr) xij
## xij 0.009
## zj -0.955 -0.010
summary( correct <- lmer(yij ~ xij + zj + (1 + xij |macro), data=R.intslope, REML=FALSE) )
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: yij ~ xij + zj + (1 + xij | macro)
## Data: R.intslope
##
## AIC BIC logLik deviance df.resid
## 8895.7 8934.9 -4440.8 8881.7 1993
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.8597 -0.6219 -0.0166 0.6294 3.7895
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## macro (Intercept) 1.952 1.397
## xij 1.475 1.215 0.44
## Residual 4.032 2.008
## Number of obs: 2000, groups: macro, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 4.672484 0.461626 10.12
## xij 3.049993 0.131451 23.20
## zj 2.007790 0.009661 207.82
##
## Correlation of Fixed Effects:
## (Intr) xij
## xij 0.125
## zj -0.948 -0.003
summary( complex <- lmer(yij ~ xij + zj + xij*zj + (1 + xij | macro), data=R.intslope, REML=FALSE) )
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: yij ~ xij + zj + xij * zj + (1 + xij | macro)
## Data: R.intslope
##
## AIC BIC logLik deviance df.resid
## 8897.5 8942.3 -4440.8 8881.5 1992
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.8566 -0.6212 -0.0164 0.6310 3.7916
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## macro (Intercept) 1.952 1.397
## xij 1.474 1.214 0.43
## Residual 4.032 2.008
## Number of obs: 2000, groups: macro, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 4.742022 0.496187 9.557
## xij 3.211910 0.444019 7.234
## zj 2.006261 0.010457 191.860
## xij:zj -0.003569 0.009350 -0.382
##
## Correlation of Fixed Effects:
## (Intr) xij zj
## xij 0.385
## zj -0.955 -0.367
## xij:zj -0.367 -0.955 0.383
Lastly to make comparison easer,
screenreg(list(simple, correct, complex),
custom.model.names=c("Simple","Correct", "Complex"))
##
## ====================================================================
## Simple Correct Complex
## --------------------------------------------------------------------
## (Intercept) 4.86 *** 4.67 *** 4.74 ***
## (0.52) (0.46) (0.50)
## xij 3.06 *** 3.05 *** 3.21 ***
## (0.06) (0.13) (0.44)
## zj 2.00 *** 2.01 *** 2.01 ***
## (0.01) (0.01) (0.01)
## xij:zj -0.00
## (0.01)
## --------------------------------------------------------------------
## AIC 9276.87 8895.66 8897.51
## BIC 9304.87 8934.87 8942.32
## Log Likelihood -4633.44 -4440.83 -4440.76
## Num. obs. 2000 2000 2000
## Num. groups: macro 100 100 100
## Var: macro (Intercept) 2.07 1.95 1.95
## Var: Residual 5.41 4.03 4.03
## Var: macro xij 1.48 1.47
## Cov: macro (Intercept) xij 0.74 0.74
## ====================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
In most stimulation studies, the process of simulating data, fiting models, and saving desired results is iterated a large number of times. After finishing all iterations, say \(Iter=100\) or \(500\), the fixed effects are averaged (their standard deviation is an estimate of the standard errors), varainces and covariances are averaged, and other desirable output. You should have a clear goal so that you saved all the important information. After running a simulation study, which can take a long time, you don’t want to think “I should have saved …”. It’s better to save more than less. Also, when saving data and output, it is often good to write results (and possibly) data to your hard drive at the end of the outer loop just in case your system crashes.