Graphics are powerful tools; however, producing them can be challenging. This document is a semi-tutorial that guides you through the steps to create graphics that are particularly useful in multilevel modeling. I have set it up so that if you want to produce a particular graphic in the future, you can find an example here and copy the code (modifying it to suit your analysis).

In this computer lab, you will produce various graphics designed for exploratory analysis, as well as some diagnostic plots. I assume that you already know basic R.

1 Set Up

1.1 The following packages will be used

The use of each package in this document is given next to the package name. If you need additional information on these, check the package documentation online.

library(lme4)     # Fitting HLM models
library(lmerTest) # Satterthwaite df and t-tests
library(lattice)  # Lattice/panel plots
library(ggplot2)  # Some nice graphics
library(stringi)  # Install before HLMdiag or HLMdiag won't load properly
library(HLMdiag)  # Some diagnostic graphics for HLM
library(texreg)   # Nice tabular output
library(optimx)   # In case we need to change the optimizer

1.2 Read in the data.

Note that the data is basically the same as lab 3, but all variables are present in this data set. You need to change the path to where your data live.

lab4<-read.table("D:/Dropbox/edps587/Lab_and_homework/Computer lab4/lab4_data.txt",header=TRUE)

Check the data

names(lab4)
##  [1] "idschool"           "idstudent"          "grade"             
##  [4] "gender"             "science"            "math"              
##  [7] "genshortages"       "hoursTV"            "hourscomputergames"
## [10] "typecommunity"      "shortages"          "isolate"           
## [13] "rural"              "suburban"           "urban"             
## [16] "short0"             "short1"             "short2"            
## [19] "short3"
head(lab4)

1.3 Create some variables by re-coding

1.3.1 Dummy code gender:

lab4$boy = ifelse(lab4$gender=="boy", 1, 0)  # dummy code: 1 = boy, 0 = otherwise
lab4$boy[is.na(lab4$boy)] <- 1               # recode missing gender as 1

1.3.2 Dummy code grade:

lab4$third <- ifelse(lab4$grade==3, 1, 0)

1.3.3 Center math around the school mean math

school.mean <- as.data.frame(aggregate(math~idschool, data=lab4, "mean"))

names(school.mean) <- c('idschool', 'school.mean')

lab4 <- merge(lab4,school.mean, by=c('idschool'))

lab4$centered.math <- lab4$math - lab4$school.mean

1.3.4 Make id a factor (not really necessary)

lab4$idschool<-as.factor(lab4$idschool)

1.3.5 Put some variables on the same scale (and check). We will need this later, so we'll just do it now.

lab4$centered.math <- lab4$centered.math/sd(lab4$centered.math)

lab4$school.mean <- lab4$school.mean/sd(lab4$school.mean)

sd(lab4$centered.math)
## [1] 1
sd(lab4$school.mean)
## [1] 1

We are now done with all of the set up.

2 Draw random sample of data

Since this is a very large data set, it is sometimes nice to have a smaller random sample to work with. The following code will do this. The 1st line determines which schools are randomly sampled (without replacement), and the 2nd line extracts the data for the randomly sampled schools.

groups <- sample(unique(lab4$idschool), 25)   # 25 school ids, sampled without replacement

subset <- lab4[lab4$idschool %in% groups, ]   # data for the sampled schools

3 Lattice Plot

The first graph is a lattice plot for looking at students within schools. We'll use the random sample and plot science by math scores (otherwise we'd have 146 plots).

(a) With points for students and linear regression for each school

The basic command here is xyplot. The plot will have both points ('p') and linear regression lines ('r'). I have also used the layout argument below to specify that the panel should have 5 rows and 5 columns of plots; otherwise it sometimes defaults to odd arrangements.

xyplot(science ~ math | idschool, data=subset, 
     col.line=c('black','blue'), 
     type=c('p','r'), 
     main='Variability in Science ~ math relationship',
       xlab="Math Scores",
       ylab="Science Scores",
       layout=c(5,5)
     )

A lattice plot where points are joined is illustrated in the lecture notes on longitudinal data. You do not need to do this, because it would be very hard to see what’s going on. To join points, use the option ‘l’ (“ell”) for line; that is, replace the type command by

     type=c('p','l'), 

(b) Smooth

Do the same plot as in part (a), but use a smooth (loess) regression curve rather than a linear regression line. To do this, change the type of plot to
type=c('p','smooth')
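
For example, reusing the code from part (a), the full call might look like this (the title is my own; adjust as you like):

xyplot(science ~ math | idschool, data=subset, 
     type=c('p','smooth'), 
     main='Science ~ math with loess smooths',
       xlab="Math Scores",
       ylab="Science Scores",
       layout=c(5,5)
     )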

4 Produce lattice (xyplot) plots for Hours TV

Using the sub-sample of the data, draw a lattice plot of science by hours of watching TV (as a numeric variable). You can copy and edit the example given in part 3; a sketch for part (a) is also given after the list. Do this for the following 2 cases

(a) Points and linear regression lines

(b) Points and smooth (loess) regression lines
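
A possible starting point for part (a), adapting the xyplot code from item 3 (for part (b), change type to c('p','smooth')):

xyplot(science ~ hoursTV | idschool, data=subset, 
     type=c('p','r'), 
     main='Science ~ hours watching TV',
       xlab="Hours Watching TV",
       ylab="Science Scores",
       layout=c(5,5)
     )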

5 Produce lattice (xyplot) plots for Hours Computer Games

Using the sub-sample of the data, draw a lattice plot of science by hours of playing computer games (as a numeric variable). You can copy and edit the example given in part 4. Do this for the following 2 cases

(a) Points and linear regression lines

(b) Points and smooth (loess) regression lines

6 Regressions for all schools in the same plot

There are (at least) two ways of doing this: an easy way and a somewhat harder way. I prefer the look of the harder way; the plots it produces appear in most of the lectures, and the R code that goes with those lectures is on the course web-site. The easier way is shown below.

Using ggplot

(a)

ggplot(lab4, aes(x=math, y=science, col=idschool)) + 
  geom_smooth(method="lm", se=FALSE) + 
  ggtitle('Regressions all in one plot') + 
  theme(legend.position="none")
## `geom_smooth()` using formula 'y ~ x'

(b) Repeat part (a) but use the random sub-sample.

Perhaps it is more informative with less data; that is, it is easier to see what's going on.
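
A minimal sketch, assuming the random sub-sample from item 2 is stored in the data frame called subset:

ggplot(subset, aes(x=math, y=science, col=idschool)) + 
  geom_smooth(method="lm", se=FALSE) + 
  ggtitle('Regressions, random sub-sample of schools') + 
  theme(legend.position="none")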

7 Plot mean science by hours watching TV

First compute the means

(y <- aggregate(science~hoursTV, data=lab4, "mean"))

and then plot the mean science score against hours watching TV,

plot(y[,1], y[,2], 
     type='b', 
     ylab='Mean Science',
     xlab='Hours Watching TV',
     main='Marginal of Science x HoursTV')

8 Plot mean science by hours playing computer games

You can revise the code from item 7 to do this.
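
A minimal sketch of one way to do this, revising the code from item 7 (hourscomputergames simply replaces hoursTV):

(y <- aggregate(science~hourscomputergames, data=lab4, "mean"))

plot(y[,1], y[,2], 
     type='b', 
     ylab='Mean Science',
     xlab='Hours Playing Computer Games',
     main='Marginal of Science x Hours Computer Games')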

9 Overlay school specific regressions of science on hours TV

You will do this using linear regression and then smooth curves.

(a) Linear regression

ggplot(lab4, aes(x=hoursTV, y=science, col=idschool)) + geom_smooth(method="lm", se=FALSE) + theme(legend.position="none")
## `geom_smooth()` using formula 'y ~ x'

(b) For smooth curves, change the method to

geom_smooth(method="loess", se=FALSE)
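
For example, the full call (a sketch mirroring part (a)) would be:

ggplot(lab4, aes(x=hoursTV, y=science, col=idschool)) + 
  geom_smooth(method="loess", se=FALSE) + 
  theme(legend.position="none")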

10 Overlay school specific regressions of science on hours play computer games

Repeat item 9 parts (a) and (b), except use hours playing computer games.

11 Plot science by type of community

Since type of community is a nominal (discrete) variable, we will add it to the graphs by drawing a distinct line for each type of community.

(a) Using linear regression

First change type of community to a factor and then graph.

lab4$community <- as.factor(lab4$typecommunity)

ggplot(lab4, aes(x=math, y=science, col=community)) + geom_smooth(method='lm', se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

(b) Using Smooth curves

For smooth curves, change the "geom_smooth" option as indicated below.

ggplot(lab4, aes(x=math, y=science, col=community)) + geom_smooth(method='loess', se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

(c) Check frequencies

It may be the case that some communities are very small, so when viewing the graphs we may want to place less weight on them.

table(lab4$typecommunity)
## 
##    1    2    3    4 
##   70 1042 2483 3502

Recall that “1” is isolate, “2” is rural, “3” is suburban, and “4” is urban.

For an even better check, "discretize" the science score and create 10 groups as follows:

lab4$group <- cut(lab4$science,10)
table(lab4$group,lab4$typecommunity)
##            
##                1    2    3    4
##   (103,112]    1    0    1    2
##   (112,120]    0    1    2    5
##   (120,128]    1    5   17   39
##   (128,136]    1   47  109  261
##   (136,144]    5  161  390  708
##   (144,152]   11  338  817 1134
##   (152,161]   35  351  770  939
##   (161,169]   14  117  289  315
##   (169,177]    1   11   46   52
##   (177,185]    1   11   42   47

This table indicates that there are certain combinations of community and science score that have no data or very little data.

The same could be done for math.
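
For example, a quick sketch (the variable name mathgroup is just for illustration):

lab4$mathgroup <- cut(lab4$math,10)      # discretize math into 10 groups
table(lab4$mathgroup,lab4$typecommunity)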

12 R^2 and Meta R^2: Simple model

We will use a very simple model to start with: science~math.

We need some basic information about the data, and we need to set up objects to hold the results.

N <- length(unique(lab4$idschool))  # Number of schools
nj <- table(lab4$idschool)          # number per school
nj <- as.data.frame(nj)
names(nj) <- c("idschool","nj")

ssmodel <- (0)                 # initialize SS model
sstotal <- (0)                 # initialize SS total
R2 <- matrix(99,nrow=N,ncol=2) # for saving R2s

We can now fit a regression for each school and, for each school, save the R^2 along with the model and (corrected) total sums of squares.

index <-1                               # initialize
for (i in (unique(lab4$idschool))) {   
 sub <- lab4[ which(lab4$idschool==i),] # data for one school
 model0 <- lm(science ~ math, data=sub) # fit the model
 a <- anova(model0)                     # save anova table
 ssmodel <- ssmodel + a[1,2]            # add to SS model
 sstotal <- sstotal + sum(a[,2])        # sum of all SS
                                        # save the R2
 R2[index,1:2] <- as.numeric(cbind(i,summary(model0)$r.squared))

 index <- index+1                      # update the index 
}

We also need to compute the meta R^2 and do a bit of clean up before graphing the results.

R2meta <- ssmodel/sstotal

R2 <- as.data.frame(R2)
names(R2) <- c("idschool","R2")
R2.mod1 <- merge(R2,nj,by=c("idschool"))

Notice that the data frame of R^2 values is called "R2.mod1". For different models, I will want a different R object name to hold the R^2 for each school.

And at last a graph of R^2 by sample size…

plot(R2.mod1$nj,R2.mod1$R2,
     type='p', 
     ylim=c(0,1),
     ylab="R squared",
     xlab="n_j (number of observations per school)", 
     main="Science ~ math")      
abline(h=R2meta,col='blue')           # adds line for R2meta
text(70,0.95,'R2meta=.36',col="blue") # text of R2meta 

13 R^2 and Meta R^2: More complex model

Compute and plot the R-squares and meta R^2 from the regression of science on math, hours watching TV, and hours playing computer games. Treat both hours watching TV and hours playing computer games as NUMERIC variables. So that we can compare models later, call the data frame of R^2 values for this model

R2.mod2
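
A minimal sketch of the only change needed inside the loop from item 12 (the set up, the rest of the loop, and the clean up stay the same; just save the result as R2.mod2 instead of R2.mod1):

 model0 <- lm(science ~ math + hoursTV + hourscomputergames, data=sub) # hours as numeric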

14 R^2 and meta R^2 for categorical TV and computer games

Compute and plot the R-squares and meta R^2 from the regressions of science on math, hours watching TV, and hours playing computer games. Treat hours watching TV and hours playing computer games both as CATEGORICAL variables. So that we can compare the R^2s from various models, call the data frame of R^2 values for this model

R2.mod3
## Warning in anova.lm(model0): ANOVA F-tests on an essentially perfect fit are
## unreliable
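
One way to treat the hours variables as categorical inside the loop is to wrap them in factor(); a sketch (you could also create factor versions of the variables beforehand):

 model0 <- lm(science ~ math + factor(hoursTV) + factor(hourscomputergames), data=sub)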

15 Comparison

To compare the R-squares from the models from item 13 (numeric hours) versus those from item 14 (categorical hours), plot the R-squares against each other and draw in a reference line.

plot(R2.mod2$R2,R2.mod3$R2, cex=1.4,
   ylim=c(0,1),
   xlim=c(0,1),
   ylab='Hours as Discrete',
   xlab='Hours as Numeric',
   main='Hours Discrete vs Numeric')
abline(a=0,b=1, col='blue') # adds a reference line

16 Easiest regression diagnostic to obtain

We need a model to work with. We'll use as our model one in which both hours TV and hours computer games are treated as factors (categorical variables), which is definitely warranted based on the previous graphs.
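
The model below uses hrTV and hrCG; I am assuming these are the factor (categorical) versions of hours watching TV and hours playing computer games. If you have not created them yet, something like this will do:

lab4$hrTV <- factor(lab4$hoursTV)             # hours watching TV as a factor
lab4$hrCG <- factor(lab4$hourscomputergames)  # hours playing computer games as a factor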

modelxx <- lmer(science ~ 1 + centered.math + school.mean + hrTV + hrCG 
                + (1 + centered.math|idschool),
                data=lab4, REML=FALSE,
                control = lmerControl(optimizer ="Nelder_Mead")) 

(a) Plot of Residuals

This is a plot of the standardized (Pearson) residuals against the fitted conditional values of science.

plot(modelxx, xlab='Fitted Conditional', 
              ylab='Pearson Residuals')

Why do you suppose there is a straight line of points in the graph?

(b) Use HLMdiag to get conditional residuals

The conditional residuals are computed using the fixed effects plus the estimated (Empirical Bayes) random effects.

# Get y-(X*gamma+Z*U) where U estimated by Empirical Bayes
res1 <- hlm_resid(modelxx, level=1, type="EB", sim=NULL,
         standardize=TRUE)
## Warning in LSresids.lmerMod(object, level = 1, standardize = standardize): LS residuals might be inaccurate as one or more groups are rank deficient.
## Use the 'include.ls = FALSE' parameter to get EB residuals only.
head(res1)          # look at what is here

Now that we have the residuals we want, let's also take a look at the estimated random effects using a dot plot.

dotplot(ranef(modelxx,condVar=TRUE),
          lattice.options=list(layout=c(1,2)))         
## $idschool

17 A panel of diagnostic plots

(similar to what SAS does). One possible way to build such a panel is sketched below.
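
A minimal sketch, assuming the model modelxx from item 16. The four plots chosen here (residuals vs fitted, a histogram of residuals, and normal Q-Q plots of the conditional residuals and the random intercepts) are one reasonable set, not necessarily the exact panel SAS shows.

res.cond <- resid(modelxx)           # conditional residuals
fit.cond <- fitted(modelxx)          # conditional fitted values
u0 <- ranef(modelxx)$idschool[,1]    # estimated random intercepts (EBLUPs)

par(mfrow=c(2,2))                    # 2 x 2 panel of plots
plot(fit.cond, res.cond, xlab='Fitted Conditional', ylab='Residual',
     main='Residuals vs Fitted')
abline(h=0, col='blue')
hist(res.cond, breaks=30, xlab='Residual', main='Histogram of Residuals')
qqnorm(res.cond, main='Q-Q: Conditional Residuals')
qqline(res.cond)
qqnorm(u0, main='Q-Q: Random Intercepts')
qqline(u0)
par(mfrow=c(1,1))                    # reset the plotting layout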