Introduction

In this practice problem, we use data from “Getting What You Pay For: The Debate Over Equity in Public School Expenditures,” which contains state-level measures of the cost of education and SAT scores. The data are from 1994-1995. We will use these data later when we talk about linear regression, but for now we’ll just estimate the mean and assume that we know the variances.

The data are from Deborah Guber; the full reference is Guber, D. (1999). “Getting What You Pay For: The Debate Over Equity in Public School Expenditures.” Journal of Statistics Education. For more information about the data, see http://jse.amstat.org/datasets/sat.txt.

Goal for edsp690BAY: Find the mean and variance of the state average total SAT score (i.e., verbal + math). This gives you a start; we’ll go over this and finish it in class.

Set-up

This is pretty much the same as what we did when estimating the mean with a known variance.

Tools

Library that I used

library(HDInterval)  # for high density interval

Data

You can find the data on the course website. You need to change the path to wherever you put the data.

df <- read.table("D:/Dropbox/edps 590BAY/Lectures/3 Normal/getting_what_paid_for_data.txt", header=TRUE)
names(df)
##  [1] "state"     "exp_pp"    "ave_pt"    "salary"    "taking"    "ave_v"    
##  [7] "ave_m"     "ave_tot"   "region"    "state_abv"
head(df)
##        state exp_pp ave_pt salary taking ave_v ave_m ave_tot region state_abv
## 1    Alabama  4.405   17.2 31.144      8   491   538    1029      S        AL
## 2     Alaska  8.963   17.6 47.951     47   445   489     934      W        AK
## 3    Arizona  4.778   19.3 32.175     27   448   496     944      W        AZ
## 4   Arkansas  4.459   17.1 28.934      6   482   523    1005      S        AR
## 5 California  4.992   24.0 41.078     45   417   485     902     CA        CA
## 6   Colorado  5.443   18.4 34.571     29   462   518     980      W        CO

Plot the Data

…and overlay the normal distribution. There are a number of ways to do this; what’s below keeps things in terms of frequencies. The first command sets up the histogram and saves it as “h”. The next two lines compute values for the horizontal and vertical axes of the normal curve. The third command rescales the density values to the frequency scale using information from the histogram (the bin width). The “lines” command draws the normal curve.

h <- hist(df$ave_tot, breaks = 10, density = 10,
          col = "darkblue",
          xlab = "Average Total SAT",
          ylab = "Frequency",
          main = "Normal curve over histogram")
xfit <- seq(min(df$ave_tot), max(df$ave_tot), length = 50)
yfit <- dnorm(xfit, mean = mean(df$ave_tot), sd = sd(df$ave_tot))
yfit <- yfit * diff(h$mids[1:2]) * length(df$ave_tot)
lines(xfit, yfit, col = "black", lwd = 2)
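
An alternative, if you prefer to work on the density scale rather than frequencies, is to plot relative frequencies with freq = FALSE and add the normal density directly with curve(); this is just a sketch of one other way to do it, not the approach used above.

# Alternative (density scale): no rescaling of the fitted curve is needed
hist(df$ave_tot, breaks = 10, freq = FALSE, density = 10,
     col = "darkblue",
     xlab = "Average Total SAT",
     main = "Normal curve over histogram (density scale)")
curve(dnorm(x, mean = mean(df$ave_tot), sd = sd(df$ave_tot)),
      add = TRUE, col = "black", lwd = 2)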

Sample Statistics

We need the sample size and the observed sample mean. We will pretend that we know the variance and use the sample variance for it.

n <- nrow(df)
ybar <- mean(df$ave_tot)
std <- sd(df$ave_tot)
s2 <- var(df$ave_tot)


descriptive <- c(n, ybar, std, s2)
names(descriptive) <- c("N","Ybar","std","s^2")
descriptive
##          N       Ybar        std        s^2 
##   50.00000  965.92000   74.82056 5598.11592

Flat (diffuse) and non-informative Priors

We’ll start with flat priors that reflect ignorance about SAT scores and these data.

mu.o     <- 0        # prior mean, well below the possible minimum of 400
tau2.o   <- 1000000  # very flat, N(0, 1000000)
kappa.o  <- 1
nu.o     <- 1
sigma2.o <- 1000000

And the posterior distribution of the mean is normal with mean \(\mu_n\) and variance \(\tau_n^2\); that is, \[ \mu \mid y \sim N(\mu_n, \tau_n^2) \qquad \mbox{and} \qquad 1/\sigma^2 \mid y \sim \mbox{Gamma}(\nu_{n}/2, \ (\nu_{n}/2)\sigma^2_{n})\] will be our posterior distributions.

# Posterior sample size
(kappa.n <- kappa.o + n)
## [1] 51
# Posterior nu
(nu.n <- nu.o + n)
## [1] 51
# Posterior mean (mu.n)
(mu.n <- (kappa.o*mu.o + n*ybar)/(kappa.o + n) )
## [1] 946.9804
(sigma2.n <- (nu.o*sigma2.o + (n-1)*s2 + (kappa.o*n/kappa.n)*(ybar-mu.o)**2)/nu.n)
## [1] 42921.86
(s2_theta.n <- sigma2.n/nu.n)
## [1] 841.6051
(tau2.n <- 1/(kappa.o/tau2.o + (nu.o + n)/sigma2.n))
## [1] 840.8974
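
If you want to see what this posterior for the mean looks like, one optional sketch (using the quantities just computed) is to plot the normal density with mean mu.n and standard deviation sqrt(tau2.n) and mark the sample mean:

# Posterior density of the mean under the flat priors
curve(dnorm(x, mean = mu.n, sd = sqrt(tau2.n)),
      from = mu.n - 4*sqrt(tau2.n), to = mu.n + 4*sqrt(tau2.n),
      xlab = expression(mu), ylab = "Density",
      main = "Posterior of the mean (flat priors)")
abline(v = ybar, lty = 2)   # sample mean for reference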

Interval estimates

# Credible interval
(pq <- qnorm(c(.025, .975), mu.n, sqrt(tau2.n)))
## [1]  890.1449 1003.8159
# High Density interval
x <- rnorm(100000, mu.n, sqrt(tau2.n))
library(HDInterval)
hdi(x, credMass=.95)
##     lower     upper 
##  890.0951 1004.0515 
## attr(,"credMass")
## [1] 0.95

The results don’t seem right (e.g., the posterior mean of 946.98 sits well below the sample mean of 965.92), and the problem is the poor priors.
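
To see why, note that the posterior mean is a weighted average of the prior mean and the sample mean; with mu.o = 0, even the small prior weight of 1/51 is enough to pull the estimate down. A quick sketch of the arithmetic:

# The posterior mean as a weighted average of mu.o and ybar
w.prior  <- kappa.o/(kappa.o + n)   # weight on mu.o (= 1/51)
w.sample <- n/(kappa.o + n)         # weight on ybar (= 50/51)
w.prior*mu.o + w.sample*ybar        # reproduces mu.n = 946.98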

Priors Using Some Information About SAT scores

I know that SAT total scores range from 400 to 1600 (i.e., a minimum of 200 and a maximum of 800 on each of the two tests). A score of 500 might be a reasonable average on each of the verbal and quantitative tests, so I set mu.o to 1000. I also use a less extreme value for sigma2.o; the other priors stay the same.

mu.o     <- 1000
tau2.o   <- 1000000  # very flat, N(1000, 1000000)
kappa.o  <- 1
nu.o     <- 1
sigma2.o <- 10000    # smaller prior scale for the variance
# Posterior sample size
kappa.n <- kappa.o + n

# Posterior nu
nu.n <- nu.o + n

# Posterior mean (mu.n)
mu.n <- (kappa.o*mu.o + n*ybar)/(kappa.o + n) 
  
sigma2.n <- (nu.o*sigma2.o + (n-1)*s2 + (kappa.o*n/kappa.n)*(ybar-mu.o)**2)/nu.n

Two ways of computing the variance of the mean lead to about the same result:

s2_theta.n <- sigma2.n/nu.n

tau2.n <- 1/(kappa.o/tau2.o + (nu.o + n)/sigma2.n)
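
To verify this, you can print the two side by side; the values should be essentially equal:

# Compare the two versions of the posterior variance of the mean
c(s2_theta.n, tau2.n)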

These results look much more reasonable and are comparable to the sample statistics. As a check, I used Gibbs sampling (rjags) and got similar results.
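
For reference, below is a minimal sketch of what such an rjags check could look like. This is not the exact code used for the check: it assumes JAGS and the rjags package are installed, uses the semi-conjugate set-up (normal prior on the mean with variance tau2.o, gamma prior on the precision), and the chain lengths are arbitrary.

library(rjags)

# Model in JAGS syntax; dnorm() is parameterized by the precision
model_string <- "
model {
  for (i in 1:n) {
    y[i] ~ dnorm(mu, tau)
  }
  mu  ~ dnorm(1000, 1.0E-6)   # prior mean 1000, prior variance 1,000,000
  tau ~ dgamma(0.5, 5000)     # nu.o/2 = 0.5, nu.o*sigma2.o/2 = 5000
  sigma2 <- 1/tau
}
"

jags_mod <- jags.model(textConnection(model_string),
                       data = list(y = df$ave_tot, n = nrow(df)),
                       n.chains = 2)
update(jags_mod, 1000)                                   # burn-in
samps <- coda.samples(jags_mod, c("mu", "sigma2"), n.iter = 5000)
summary(samps)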

Putting this all together, \[ \mu \mid y \sim N(966.59,\ 109.73) \qquad \mbox{and} \qquad \sigma^2 \mid y \sim \mbox{IGamma}(25.5,\ 1.4272318\times 10^{5}), \quad \mbox{i.e.,} \quad 1/\sigma^2 \mid y \sim \mbox{Gamma}(25.5,\ 1.4272318\times 10^{5}).\] Note that the standard deviation of the posterior for \(\mu\) is \(\sqrt{\tau_n^2} = \sqrt{109.73} \approx 10.48\).

Interval estimates

# Credible interval
(pq <- qnorm(c(.025, .975), mu.n, sqrt(tau2.n)))
## [1] 946.0569 987.1195
# High Density interval
x <- rnorm(100000, mu.n, sqrt(tau2.n))
library(HDInterval)
hdi(x, credMass=.95)
##    lower    upper 
## 946.1257 987.1883 
## attr(,"credMass")
## [1] 0.95

Monte Carlo

We first draw random samples for the precision (i.e., \(1/\sigma^2\)). Then we draw random values of the mean from the normal using the simulated values of \(\sigma^2\). Taking 1,000 random draws from the posterior:

sigma2.mc <- 1/rgamma(1000, nu.n/2, nu.n*sigma2.n/2)  # draws of sigma^2
tau2.mc   <- sigma2.mc/(kappa.o + n)                  # variance of the mean for each draw
mu.mc     <- rnorm(1000, mu.n, sqrt(tau2.mc))         # draws of the mean

Compare the following with the analytic estimates of the mean, variance, and precision:

summary(mu.mc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   935.1   959.7   967.3   966.9   973.8  1003.4
summary(tau2.mc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   66.53   98.68  112.12  114.63  127.08  266.13
summary(sigma2.mc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3393    5032    5718    5846    6481   13573
sd(mu.mc)
## [1] 10.64246
var(mu.mc)
## [1] 113.262

Pretty close.
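
As one more check, the Monte Carlo draws of the mean can also be turned into interval estimates, which should be close to the analytic intervals above:

quantile(mu.mc, c(.025, .975))   # Monte Carlo equal-tail credible interval
hdi(mu.mc, credMass = .95)       # Monte Carlo high density interval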