Introduction

For this exercise you will need to write some R to fit a linear regression model to data where the outcome (response) variable is the state Total SAT scores (ave_tot) and the predictor (explanatory) variable is expenditure per student (exp_pp).

Below is the code that sets up the data and a template of what you’ll need to do. You will have to write code to get some practice with this. You will make mistakes and part of this exercise is to learn to interpret errors messages.

Libraries

Library that I used

library(rjags)
library(runjags)
library(jagsUI)
library(coda)
library(HDInterval)  # for high density interval

Data

You can find the data on the course web-site. You need to change the path to where ever you put the data.

df <- read.table("D:/Dropbox/edps 590BAY/Lectures/3 Normal/getting_what_paid_for_data.txt", header=TRUE)
names(df)
##  [1] "state"     "exp_pp"    "ave_pt"    "salary"    "taking"    "ave_v"    
##  [7] "ave_m"     "ave_tot"   "region"    "state_abv"
head(df)
##        state exp_pp ave_pt salary taking ave_v ave_m ave_tot region state_abv
## 1    Alabama  4.405   17.2 31.144      8   491   538    1029      S        AL
## 2     Alaska  8.963   17.6 47.951     47   445   489     934      W        AK
## 3    Arizona  4.778   19.3 32.175     27   448   496     944      W        AZ
## 4   Arkansas  4.459   17.1 28.934      6   482   523    1005      S        AR
## 5 California  4.992   24.0 41.078     45   417   485     902     CA        CA
## 6   Colorado  5.443   18.4 34.571     29   462   518     980      W        CO

Plot the Data

..and overlay the normal distribution. There are a number of ways to do this and what’s below keeps things in terms of frequencies. The first line sets up the histogram and saves it as “h”. The next two lines get data for horizontal and vertical axis. The third command puts density values of vertical fit values into scale of frequencies and uses information from the historgram. The “lines” command puts in the normal curve.

h =hist(df$ave_tot, breaks = 10, density = 10,
            col = "darkblue", 
            xlab = "Average Total SAT", 
            ylab = "Frequency",
            main="Normal curve Over histogram")
xfit <- seq(min(df$ave_tot), max(df$ave_tot), length = 50) 
yfit <- dnorm(xfit, mean = mean(df$ave_to), sd = sd(df$ave_tot)) 
yfit <- yfit * diff(h$mids[1:2]) * length(df$ave_tot) 
lines(xfit, yfit, col = "black", lwd = 2)

Sample Statistics

We need the sample size and observed sample mean. We will pretend that we know what the variance equals and use the sample variance for this.

n <- nrow(df)
ybar <- mean(df$ave_tot)
std <- sd(df$ave_tot)
s2 <- var(df$ave_tot)


descriptive <- c(n, ybar, std, s2)
names(descriptive) <- c("N","Ybar","std","s^2")
descriptive
##          N       Ybar        std        s^2 
##   50.00000  965.92000   74.82056 5598.11592

OLS Simple Linear Regression

summary(ols <- lm(ave_tot ~ exp_pp, data=df))
## 
## Call:
## lm(formula = ave_tot ~ exp_pp, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -145.074  -46.821    4.087   40.034  128.489 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1089.294     44.390  24.539  < 2e-16 ***
## exp_pp       -20.892      7.328  -2.851  0.00641 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.91 on 48 degrees of freedom
## Multiple R-squared:  0.1448, Adjusted R-squared:  0.127 
## F-statistic: 8.128 on 1 and 48 DF,  p-value: 0.006408

Bayesian

Data

Model for Simple Linear Regression

Starting Values

Create Model Object

Get Samples

Did it Converge?

Summary Statistics of Posteriors

Monte Carlo

Simulate posterior distribution of state average SAT total scores and use these to

Compute the mean per state

Compare with data

Any other statistics that interests you