In this practice problem, we use data from “Getting What You Pay For: The Debate Over Equity in Public School Expenditures,” which contains state-level measures of the cost of education and SAT scores. These data are from 1994–1995. We will use these data later when we talk about linear regression, but for now we’ll just estimate the mean and assume that we know the variance.
The data are from Deborah Guber and the full reference is Guber, D. (1999). “Getting What You Pay For: The Debate Over Equity in Public School Expenditures”. Journal of Statistics Education. For more information about the data see http://jse.amstat.org/datasets/sat.txt.
Goal for edsp690BAY: Find the mean of the state average total SAT score (i.e., verbal + math). This gives you a start, and we’ll go over this and finish it in class.
Library that I used
library(HDInterval) # for high density interval
You can find the data on the course website. You need to change the path to wherever you put the data.
df <- read.table("D:/Dropbox/edps 590BAY/Lectures/3 Normal/getting_what_paid_for_data.txt", header=TRUE)
names(df)
## [1] "state" "exp_pp" "ave_pt" "salary" "taking" "ave_v"
## [7] "ave_m" "ave_tot" "region" "state_abv"
head(df)
## state exp_pp ave_pt salary taking ave_v ave_m ave_tot region state_abv
## 1 Alabama 4.405 17.2 31.144 8 491 538 1029 S AL
## 2 Alaska 8.963 17.6 47.951 47 445 489 934 W AK
## 3 Arizona 4.778 19.3 32.175 27 448 496 944 W AZ
## 4 Arkansas 4.459 17.1 28.934 6 482 523 1005 S AR
## 5 California 4.992 24.0 41.078 45 417 485 902 W CA
## 6 Colorado 5.443 18.4 34.571 29 462 518 980 W CO
…and overlay the normal distribution. There are a number of ways to do this; the approach below keeps things in terms of frequencies. The first command sets up the histogram and saves it as “h”. The next two lines compute values for the horizontal and vertical axes. The third command rescales the density values to the frequency scale, using the bin-width and sample-size information stored in the histogram object. The “lines” command draws the normal curve.
h <- hist(df$ave_tot, breaks = 10, density = 10,
          col = "darkblue",
          xlab = "Average Total SAT",
          ylab = "Frequency",
          main = "Normal curve over histogram")
xfit <- seq(min(df$ave_tot), max(df$ave_tot), length = 50)
yfit <- dnorm(xfit, mean = mean(df$ave_tot), sd = sd(df$ave_tot))
yfit <- yfit * diff(h$mids[1:2]) * length(df$ave_tot)  # rescale density to frequency
lines(xfit, yfit, col = "black", lwd = 2)
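An equivalent and somewhat more compact approach (a sketch, assuming the same `df` as above) is to draw the histogram on the density scale, which lets the normal density be overlaid directly with no rescaling:

```r
# Histogram on the density scale; no frequency rescaling needed
hist(df$ave_tot, breaks = 10, freq = FALSE, density = 10,
     col = "darkblue",
     xlab = "Average Total SAT",
     main = "Normal curve over histogram")
curve(dnorm(x, mean = mean(df$ave_tot), sd = sd(df$ave_tot)),
      add = TRUE, lwd = 2)
```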
We need the sample size and the observed sample mean. We will pretend that we know the variance and use the sample variance for it.
(n <- nrow(df))
## [1] 50
(ybar <- mean(df$ave_tot))
## [1] 965.92
(sigma2 <- var(df$ave_tot))
## [1] 5598.116
We’ll start with a flat prior reflecting no knowledge about SAT scores.
mu.o <- 0
tau.o2 <- 1000000 # Very flat, N(0, 1000000):
And the posterior distribution of the mean is normal with mean mu.n and variance tau.n2; that is, \[ \mu \mid y \sim N(\mu_n, \tau_n^2)\] will be our posterior distribution.
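Written out, the conjugate updating formulas for a normal likelihood with known variance \(\sigma^2\) and a \(N(\mu_0, \tau_0^2)\) prior (the ones computed in the code below) are
\[ \mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}, \qquad \tau_n^2 = \frac{1}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}, \]
so the posterior mean is a precision-weighted average of the prior mean and the sample mean.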
# Posterior Mean
(mu.n <- ((1/tau.o2)*mu.o + (n/sigma2)*ybar)/(1/tau.o2 + n/sigma2))
## [1] 965.8119
# Posterior variance
(tau.n2 <- 1/ ( 1/tau.o2 + n/sigma2) )
## [1] 111.9498
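Because the prior is so flat, the data dominate: the posterior mean is essentially the sample mean and the posterior variance is essentially \(\sigma^2/n\). A quick check, using the `n`, `ybar`, and `sigma2` computed above:

```r
# With a nearly flat prior, the posterior reduces to the sampling results
ybar        # compare with mu.n  = 965.8119
sigma2 / n  # compare with tau.n2 = 111.9498
```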
Interval estimates
# Credible interval
(pq <- qnorm(c(.025, .975), mu.n, sqrt(tau.n2)))
## [1] 945.0742 986.5495
# High Density interval
x <- rnorm(100000, mu.n, sqrt(tau.n2))
library(HDInterval)
hdi(x, credMass=.95)
## lower upper
## 945.5419 986.9321
## attr(,"credMass")
## [1] 0.95
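Since this posterior is normal, and therefore symmetric, the HDI and the equal-tail credible interval coincide; the small differences above are just Monte Carlo error from the 100,000 draws. If I recall the package’s interface correctly, hdi() can also be given an inverse-CDF (quantile) function, which avoids simulation entirely:

```r
# HDI from the inverse CDF (no simulation; assumes HDInterval's method
# for functions, which takes a quantile function plus its arguments)
library(HDInterval)
hdi(qnorm, credMass = 0.95, mean = mu.n, sd = sqrt(tau.n2))
```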
Lastly, we take a quick look at the relationship between expenditures and SAT scores, to see what effect expenditures appear to have on results.
names(df)
## [1] "state" "exp_pp" "ave_pt" "salary" "taking" "ave_v"
## [7] "ave_m" "ave_tot" "region" "state_abv"
cor(df$exp_pp, df$ave_tot)
## [1] -0.380537
plot(df$exp_pp, df$ave_tot, type = "n",
     main = "Getting what you pay for? r(expenditure/pupil, ave_total_SAT) = -.38",
     xlab = "Expenditure per Pupil",
     ylab = "Average Total SAT Score")
text(df$exp_pp, df$ave_tot, df$state_abv)