The data include all 1003 schools but we'll only look at 2 of them. The data are in csv format and I will be lazy and just delete the missing values.
```r setwd("D:/Dropbox/edps587/lectures/1 introduction")
nels <- read.csv("nels_data.csv",header=TRUE,sep=",") nels <- na.omit(nels) head(nels) ```
## sch_id stud_id sex race homework schtyp ses par_edu math clastruc size ## 1 7472 3 2 4 1 1 -0.13 2 48 2 3 ## 2 7472 8 1 4 0 1 -0.39 2 48 2 3 ## 3 7472 13 1 4 0 1 -0.80 2 53 2 3 ## 4 7472 17 1 4 1 1 -0.72 2 42 2 3 ## 5 7472 27 2 4 2 1 -0.74 2 43 2 3 ## 6 7472 28 2 4 1 1 -0.58 2 57 2 3 ## urban geograph perminor ratio ## 1 2 2 0 19 ## 2 2 2 0 19 ## 3 2 2 0 19 ## 4 2 2 0 19 ## 5 2 2 0 19 ## 6 2 2 0 19
We'll pull out data for schools 24725 and 62821
nels1 <- nels[ which(nels$sch_id==24725 | nels$sch_id==62821), ]
We'll start with a plane scatter plot with a linear regression overlayed
plot(nels1$homework,nels1$math, type="p", pch=19,
main=expression(paste("NELS Data (sub-set) ")),
xlab="Time Spent Doing Homework",
ylab="Math Scores",
xlim=c(0,7),
ylim=c(20,80)
)
abline(lm(nels1$math ~ nels1$homework),col="blue",lwd=2)
summary(lm1<- lm(nels1$math ~ nels1$homework))
##
## Call:
## lm(formula = nels1$math ~ nels1$homework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.466 -7.414 2.060 6.043 15.534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.9745 1.7784 26.976 < 2e-16 ***
## nels1$homework 3.4913 0.5225 6.682 2.14e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.881 on 87 degrees of freedom
## Multiple R-squared: 0.3391, Adjusted R-squared: 0.3315
## F-statistic: 44.65 on 1 and 87 DF, p-value: 2.145e-09
text(19,47.5, "math = 47.97 + 3.49*homework",cex=1.25)
Does linear regression look OK? What do you notice about the data?
Note that some students have the same values. To get a better idea of how much data there are, we'll "jitter" the points a bit
plot(jitter(nels$homework),nels$math, type="p", pch=19,
main=expression(paste("NELS Data (sub-set) with some Horizondent Jittering")),
xlab="Time Spent Doing Homework",
ylab="Math Scores",
xlim=c(0,7),
ylim=c(20,80)
)
abline(lm(nels1$math ~ nels1$homework),col="blue",lwd=2)
summary(lm1<- lm(nels1$math ~ nels1$homework))
##
## Call:
## lm(formula = nels1$math ~ nels1$homework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.466 -7.414 2.060 6.043 15.534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.9745 1.7784 26.976 < 2e-16 ***
## nels1$homework 3.4913 0.5225 6.682 2.14e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.881 on 87 degrees of freedom
## Multiple R-squared: 0.3391, Adjusted R-squared: 0.3315
## F-statistic: 44.65 on 1 and 87 DF, p-value: 2.145e-09
text(19,47.5, "math = 47.97 + 3.49*homework",cex=1.25)
We know that we have two schools, it is insightful note this in the figure.
nels.s1 <- nels[ which(nels$sch_id==24725), ]
nels.s2 <- nels[ which(nels$sch_id==62821), ]
# Graph to illustrate need for random intercept and slope
# -- create frame for plot, notice that type='n' for none
par(new=FALSE)
plot(jitter(nels.s1$homework),nels.s1$math,
type="p", pch=19,
col="blue",
lwd=2,
main="NELS: Linear Regression by School",
xlab="Time Spent Doing Homework",
ylab="Math Scores",
xlim=c(0,7),
ylim=c(20,80)
)
abline(lm(math ~ homework,data=nels.s1),col="blue")
par(new=TRUE)
plot(jitter(nels.s2$homework),nels.s2$math,type="p",
pch=19,
col="red",
xlab="Time Spent Doing Homework",
ylab="Math Scores",
xlim=c(0,7),
ylim=c(20,80)
)
abline(lm(math ~ homework,data=nels.s2),col="red",pch=19)
legend(4, 35, legend=c("School 62821","School 24725"),
col=c("red","blue"), lty=1, cex=1.1,
box.lty=1, box.lwd=1)
The schools have different intercepts and slopes. In this case, the higher intercept, the smaller slope.