Graphics are powerful tools; however, producing them can be challenging. This document is a semi-tutorial that will guide you through the steps to create graphics particularly useful in multilevel modeling. I have set up this document such that if in the future you want to produce a particular graphic, you can find an example and then copy the code (and modify to suit your analysis).
In this computer lab, you will produce various graphics designed for exploaratory analysis and some diagnostic plots. I assume that you already know basic R.
The use of each package in this document is given next to the package name. If you need additonal information on these, check online for package documentation.
library(lme4) # Fitting HLM models
library(lmerTest) # Sattherwait df and t-tests
library(lattice) # Lattice/pannel plots
library(ggplot2) # Some nice grahics
library(stringi) # install before HLMdiag or HLMdiag won't load properly
library(HLMdiag) # Some diagnostic graphics for hlm
library(texreg) # Nice tabular output
library(optimx) # If need to change optimizer
Note that the data is basically the same as lab 3, but all variables are present in this data set. You need to change the path to where your data live.
lab4<-read.table("D:/Dropbox/edps587/Lab_and_homework/Computer lab4/lab4_data.txt",header=TRUE)
Check the data
names(lab4)
## [1] "idschool" "idstudent" "grade"
## [4] "gender" "science" "math"
## [7] "genshortages" "hoursTV" "hourscomputergames"
## [10] "typecommunity" "shortages" "isolate"
## [13] "rural" "suburban" "urban"
## [16] "short0" "short1" "short2"
## [19] "short3"
head(lab4)
lab4$boy = ifelse(lab4$gender=="boy", 1, 0)
lab4$boy[is.na(lab4$boy)] <- 1
lab4$third <- ifelse(lab4$grade==3, 1, 0)
school.mean <- as.data.frame(aggregate(math~idschool, data=lab4, "mean"))
names(school.mean) <- c('idschool', 'school.mean')
lab4 <- merge(lab4,school.mean, by=c('idschool'))
lab4$centered.math <- lab4$math - lab4$school.mean
lab4$idschool<-as.factor(lab4$idschool)
lab4$centered.math <- lab4$centered.math/sd(lab4$centered.math)
lab4$school.mean <- lab4$school.mean/sd(lab4$school.mean)
sd(lab4$centered.math)
## [1] 1
sd(lab4$school.mean)
## [1] 1
We are done with all the set up
Since this is a very large data set, it is sometimes nice to have a smaller random sample to use. The following code will do this. The 1st line determines which schools are randomly sampled (without replacement), and the 2nd line extracts the data for the randomly sampled school.
groups <- unique(lab4$idschool)[sample(1:145,25)]
subset <- lab4[lab4$idschool%in%groups,]
The first graph is a lattice plot for looking at students within schools. We’ll Use the random sample and plot science by math scores (otherwise we’ld have 146 plots).
The basic command here is xyplot. The plot will have both points (‘p’) and linear regression lines (‘r’). I have also used the layout command below to specfiy that the panel should have 5 rows and 5 columns of plots. It sometimes will default to odd arrangements.
xyplot(science ~ math | idschool, data=subset,
col.line=c('black','blue'),
type=c('p','r'),
main='Varability in Science ~ math relationship',
xlab="Math Scores",
ylab="Science Scores",
layout=c(5,5)
)
A lattice plot where points are joined is illustrated in the lecture notes on longitudinal data. You do not need to do this, because it would be very hard to see what’s going on. To join points, use the option ‘l’ (“ell”) for line; that is, replace the type command by
type=c('p','l'),
Do the same plot as in part (a), but use smooth regression rather than linear regression line. To do this change the type of plot to
type=c('p','smooth')
Using the sub-sample of the data, draw lattice plot of science by hours of watching TV (as numeric variable). You can copy and edit the example given in part 3. Do this for the following 2 cases
Points and Linear regression lines
Points and smooth lines
Using the sub-sample of the data, draw lattice plot of science by hours of of playing computer games (as numeric variable). You can copy and edit the example given in part 4. Do this for the following 2 cases
Points and Linear regression lines
Points and smooth lines
There are (at least) 2 ways of doing this. An easy way and a bit harder way. I prefer the look of the hard way. The plots produced by the harder way are in most of the lectures and the R code is on the course web-site that go with the lectures. The easier way is below.
Using ggplot
ggplot(lab4, aes(x=math, y=science, col=idschool,type='l'), main='Regressions all in one plot') + geom_smooth(method="lm", se=FALSE ) + theme(legend.position="none")
## `geom_smooth()` using formula 'y ~ x'
Perhaps this it is more informative using less data; that is, easier to see what’s going on.
First compute the means
(y <- aggregate(science~hoursTV, data=lab4, "mean"))
and then plot Means by mean for hours watching TV,
plot(y[,1], y[,2],
type='b',
ylab='Mean Science',
xlab='Hours Watching TV',
main='Marginal of Science x HoursTV')
You can revise the code from item 7 to do this.
You will do this using linear regression and then smooth curves
ggplot(lab4, aes(x=hoursTV, y=science, col=idschool,type='l')) + geom_smooth(method="lm", se=FALSE ) + theme(legend.position="none")
## `geom_smooth()` using formula 'y ~ x'
geom_smooth(method="loess", se=FALSE)
Repeat item 9 parts (a) and (b), except use hours play computer games.
Since type of community is a nominal (discrete) varaible, we will add it to graphs by putting in distinct line for each community.
Frist change type of community to a factor and the graph.
lab4$community <- as.factor(lab4$typecommunity)
ggplot(lab4, aes(x=math, y=science, col=community, type='l')) + geom_smooth(method='lm', se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
For smooth curves change the “geom_smooth” option is indicated below.
ggplot(lab4, aes(x=math, y=science, col=community, type='l')) + geom_smooth(method='loess', se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
It may be the case that some communities are very small so when viewing graphs we may want to place less weight on them.
table(lab4$typecommunity)
##
## 1 2 3 4
## 70 1042 2483 3502
Recall that “1” is isolate, “2” is rural, “3” is suburban, and “4” is urban.
An even better check, “discretize” science score and create 10 groups as follows
lab4$group <- cut(lab4$science,10)
table(lab4$group,lab4$typecommunity)
##
## 1 2 3 4
## (103,112] 1 0 1 2
## (112,120] 0 1 2 5
## (120,128] 1 5 17 39
## (128,136] 1 47 109 261
## (136,144] 5 161 390 708
## (144,152] 11 338 817 1134
## (152,161] 35 351 770 939
## (161,169] 14 117 289 315
## (169,177] 1 11 46 52
## (177,185] 1 11 42 47
This table indicates there are certian combinations of communities and science that have no data and very little.
The same could be done for math.
We will use a very simple model to start with: science~math.
We need some basic information about the data and to set up objects to hold the results
N <- length(unique(lab4$idschool)) # Number of schools
nj <- table(lab4$idschool) # number per school
nj <- as.data.frame(nj)
names(nj) <- c("idschool","nj")
ssmodel <- (0) # initialize SS model
sstotal <- (0) # initialize SS total
R2 <- matrix(99,nrow=N,ncol=2) # for saving R2s
We can now fit regressions for each school and save the R^2, sums of squares for model and total (corrected for total) for each school.
index <-1 # initialize
for (i in (unique(lab4$idschool))) {
sub <- lab4[ which(lab4$idschool==i),] # data for one school
model0 <- lm(science ~ math, data=sub) # fit the model
a <- anova(model0) # save anova table
ssmodel <- ssmodel + a[1,2] # add to SS model
sstotal <- sstotal + sum(a[,2]) # sum of all SS
# save the R2
R2[index,1:2] <- as.numeric(cbind(i,summary(model0)$r.squared))
index <- index+1 # up-date index
}
We also need to compute meta R^2 and do a bit of clean up before graphing the regsults
R2meta <- ssmodel/sstotal
R2 <- as.data.frame(R2)
names(R2) <- c("idschool","R2")
R2.mod1 <- merge(R2,nj,by=c("idschool"))
Notice that the R^2 is call “R2.mod1”. For different models, I will want a different R object name that holds the R^2 for each school.
And at last a graph of R^2 by sample size…
plot(R2.mod1$nj,R2.mod1$R2,
type='p',
ylim=c(0,1),
ylab="R squared",
xlab="n_j (number of observations per school)",
main="Science ~ math")
abline(h=R2meta,col='blue') # adds line for R2meta
text(70,0.95,'R2meta=.36',col="blue") # text of R2meta
Compute and plot R-squares and R^2 meta from the regression of science on math, hours watching TV, and hours playing computer games. Treate both hours watching TV and hours playing computer games as NUMERIC variables. So that we can later compare models, call the R^2 for this model
R2.mod2
Compute and plot R-squares and R^2 meta from the regressions of science on math, hours watching TV, and hours playing computer games. Treate hours watching TV and hours playing comptuer games are both CATEGORICAL variables. So that we can compare R^2s from various models, call them for this model
R2.mod3
## Warning in anova.lm(model0): ANOVA F-tests on an essentially perfect fit are
## unreliable
To compare the R-squares from the models from item 13 (numeric hours) versus those from item 14 (categorical hours), plot the R-squares against each other and draw in reference line.
plot(R2.mod2$R2,R2.mod3$R2, cex=1.4,
ylim=c(0,1),
xlim=c(0,1),
ylab='Hours as Discrete',
xlab='Hours as Numeric',
main='Hours Discrete vs Numeric')
abline(a=0,b=1, col='blue') # adds a reference line
We need a model to work with. We’ll use the as our model where both hours TV and hours computer games are treated as factors (categorical variables), which is defintely warented based on pervious graphs.
modelxx <- lmer(science ~ 1 + centered.math + school.mean + hrTV + hrCG
+ (1 + centered.math|idschool),
data=lab4, REML=FALSE,
control = lmerControl(optimizer ="Nelder_Mead"))
This is a plot of standardized (Pearson) residuals by the fitted contional values of science)
plot(modelxx, xlab='Fitted Conditional',
ylab='Pearson Residuals')
Why do you suppose there is straight line of points in the graph?
The conditional residuals are computed using the fixed plus estimated random effects. We’ll look at these using a dot plot.
# Get y-(X*gamma+Z*U) where U estimated by Empirical Bayes
res1 <- hlm_resid(modelxx, level=1, type="EB", sim=NULL,
standardize=TRUE)
## Warning in LSresids.lmerMod(object, level = 1, standardize = standardize): LS residuals might be inaccurate as one or more groups are rank deficient.
## Use the 'include.ls = FALSE' parameter to get EB residuals only.
head(res1) # look at what is here
Now that we have residuals we want, let’s take a look at them
dotplot(ranef(modelxx,condVar=TRUE),
lattice.options=list(layout=c(1,2)))
## $idschool
(similar to what SAS does):