Data set up

The data include all 1003 schools but we'll only look at 2 of them. The data are in csv format and I will be lazy and just delete the missing values.

Read in data and look

```r setwd("D:/Dropbox/edps587/lectures/1 introduction")

nels <- read.csv("nels_data.csv",header=TRUE,sep=",") nels <- na.omit(nels) head(nels) ```

## sch_id stud_id sex race homework schtyp ses par_edu math clastruc size ## 1 7472 3 2 4 1 1 -0.13 2 48 2 3 ## 2 7472 8 1 4 0 1 -0.39 2 48 2 3 ## 3 7472 13 1 4 0 1 -0.80 2 53 2 3 ## 4 7472 17 1 4 1 1 -0.72 2 42 2 3 ## 5 7472 27 2 4 2 1 -0.74 2 43 2 3 ## 6 7472 28 2 4 1 1 -0.58 2 57 2 3 ## urban geograph perminor ratio ## 1 2 2 0 19 ## 2 2 2 0 19 ## 3 2 2 0 19 ## 4 2 2 0 19 ## 5 2 2 0 19 ## 6 2 2 0 19

Select 2 schools

We'll pull out data for schools 24725 and 62821

nels1 <- nels[ which(nels$sch_id==24725 | nels$sch_id==62821), ]

Graph data

Basic

We'll start with a plane scatter plot with a linear regression overlayed

plot(nels1$homework,nels1$math, type="p", pch=19,
     main=expression(paste("NELS Data (sub-set) ")),
     xlab="Time Spent Doing Homework",
     ylab="Math Scores",
     xlim=c(0,7),
     ylim=c(20,80)
     )
abline(lm(nels1$math ~ nels1$homework),col="blue",lwd=2)
summary(lm1<- lm(nels1$math ~ nels1$homework))
## 
## Call:
## lm(formula = nels1$math ~ nels1$homework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.466  -7.414   2.060   6.043  15.534 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     47.9745     1.7784  26.976  < 2e-16 ***
## nels1$homework   3.4913     0.5225   6.682 2.14e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.881 on 87 degrees of freedom
## Multiple R-squared:  0.3391, Adjusted R-squared:  0.3315 
## F-statistic: 44.65 on 1 and 87 DF,  p-value: 2.145e-09
text(19,47.5, "math = 47.97 + 3.49*homework",cex=1.25)

Does linear regression look OK? What do you notice about the data?

Graph with Jitter

Note that some students have the same values. To get a better idea of how much data there are, we'll "jitter" the points a bit

plot(jitter(nels$homework),nels$math, type="p", pch=19,
     main=expression(paste("NELS Data (sub-set) with some Horizondent Jittering")),
     xlab="Time Spent Doing Homework",
     ylab="Math Scores",
     xlim=c(0,7),
     ylim=c(20,80)
     )
abline(lm(nels1$math ~ nels1$homework),col="blue",lwd=2)
summary(lm1<- lm(nels1$math ~ nels1$homework))
## 
## Call:
## lm(formula = nels1$math ~ nels1$homework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.466  -7.414   2.060   6.043  15.534 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     47.9745     1.7784  26.976  < 2e-16 ***
## nels1$homework   3.4913     0.5225   6.682 2.14e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.881 on 87 degrees of freedom
## Multiple R-squared:  0.3391, Adjusted R-squared:  0.3315 
## F-statistic: 44.65 on 1 and 87 DF,  p-value: 2.145e-09
text(19,47.5, "math = 47.97 + 3.49*homework",cex=1.25)

Identify the Schools

We know that we have two schools, it is insightful note this in the figure.

nels.s1 <- nels[ which(nels$sch_id==24725), ]

nels.s2 <- nels[ which(nels$sch_id==62821), ]

# Graph to illustrate need for random intercept and slope
# -- create frame for plot, notice that type='n' for none 
par(new=FALSE)
plot(jitter(nels.s1$homework),nels.s1$math,
    type="p", pch=19,
    col="blue",
    lwd=2,
    main="NELS: Linear Regression by School",
    xlab="Time Spent Doing Homework",
    ylab="Math Scores",
    xlim=c(0,7),
    ylim=c(20,80)
    )
abline(lm(math ~ homework,data=nels.s1),col="blue")
par(new=TRUE)
plot(jitter(nels.s2$homework),nels.s2$math,type="p",
    pch=19,
    col="red",
    xlab="Time Spent Doing Homework",
    ylab="Math Scores",
    xlim=c(0,7),
    ylim=c(20,80)
)
abline(lm(math ~ homework,data=nels.s2),col="red",pch=19)
legend(4, 35, legend=c("School 62821","School 24725"), 
       col=c("red","blue"), lty=1, cex=1.1,
       box.lty=1, box.lwd=1)

The schools have different intercepts and slopes. In this case, the higher intercept, the smaller slope.