Overview

I will create a series of documents that run various analyses on the NELS 23 schools data. I could put them all in one document, but that would take a really, really long time to run.

Set-up

Packages

So, let’s load up the rstan package at the start and see the messages that we’re given:

library(rstan)
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
library(stargazer)   # used for tables of descriptive statistics
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

The next package, loo, computes some useful statistics for model evaluation and comparison. One is an efficient approximation to “leave one out” (hence the name “loo”) cross-validation for Bayesian models. The loo package will also compute the WAIC (“widely applicable information criterion”) and compare multiple models using the loo_compare function.

library(loo)
## This is loo version 2.4.1
## - Online documentation and vignettes at mc-stan.org/loo
## - As of v2.0.0 loo defaults to 1 core but we recommend using as many as possible. Use the 'cores' argument or set options(mc.cores = NUM_CORES) for an entire session.
## - Windows 10 users: loo may be very slow if 'mc.cores' is set in your .Rprofile file (see https://github.com/stan-dev/loo/issues/94).
## 
## Attaching package: 'loo'
## The following object is masked from 'package:rstan':
## 
##     loo

Recommendations for rstan

Note that there are three recommendations here about what to run before we use rstan. The first one, “For execution on a local, multicore CPU with excess RAM we recommend calling”, only needs to be issued once per session. It makes the default number of cores to use equal to the number of cores available:

options(mc.cores = parallel::detectCores())

The second recommendation, “To avoid recompilation of unchanged Stan programs, we recommend calling”, saves the compiled model so that it can be re-used. Models are translated to C++ and compiled, which can take some time.

rstan_options(auto_write = TRUE)

The third message, “Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file”, is actually a warning. Older versions of rstan recommended the line below “for improved execution time”; the flag tells the C++ compiler to optimize the compiled model for the local machine’s CPU, but recent versions advise against setting it.

Sys.setenv(LOCAL_CPPFLAGS = '-march=native -mtune=native')

Directory and Data Set Up

My working directory where the data live is

setwd("C:/Users/cja/Dropbox/edps 590BAY/Lectures/10 Ham_Stan_brms")

And read in the data:

nels <- read.table("school23_data.txt",header=TRUE)
nels <- nels[order(nels$school,nels$student), ]
# Get the numbers of schools and students
school <- unique(nels$school)        # unique school IDs
N <- length(school)                  # number of schools
n <- length(nels$math)               # total number of students
names(nels)
##  [1] "school"   "student"  "sex"      "race"     "homew"    "schtype" 
##  [7] "ses"      "pared"    "math"     "classtr"  "schsize"  "urban"   
## [13] "geo"      "minority" "ratio"
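
As a quick check on the objects just created:

N                        # number of schools
n                        # total number of students
head(table(nels$school)) # students per school, for the first few schools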

I like computing sample statistics

stargazer(nels, type="text") # default output is LaTeX
## 
## ===================================================================
## Statistic  N     Mean     St. Dev.   Min   Pctl(25) Pctl(75)  Max  
## -------------------------------------------------------------------
## school    519 35,488.710 26,056.710 6,053   7,801    62,821  72,991
## student   519   49.863     28.623     0      24.5      74      99  
## sex       519   1.520      0.500      1       1        2       2   
## race      519   3.615      0.797      1       4        4       5   
## homew     519   1.971      1.484      0       1        3       7   
## schtype   519   1.977      1.328      1       1        4       4   
## ses       519   -0.001     0.881    -2.410  -0.620   0.730   1.850 
## pared     519   3.289      1.440      1       2        4       6   
## math      519   51.723     10.709     30      43      61.5     71  
## classtr   519   3.819      0.790      2       3        4       5   
## schsize   519   3.229      1.400      1       2        4       7   
## urban     519   1.832      0.859      1       1        3       3   
## geo       519   2.125      0.704      1       2        3       3   
## minority  519   2.501      2.037      0       1        3       7   
## ratio     519   16.757     4.931      10      13       20      28  
## -------------------------------------------------------------------
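
Since students are nested within schools, school-level summaries are also worth a look. A minimal sketch using base R (the choice of variables is just illustrative):

# Mean math score and mean SES for each school
aggregate(cbind(math, ses) ~ school, data = nels, FUN = mean)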

Estimate mean and sd of math scores

Our first Stan model! We’ll start simple and work up to more complex ones.

We can either define the model as a character string, as below, and run it with rstan (like we did with rjags), OR we can put it in a separate file with the extension .stan (an example of the file approach appears after we fit the model).

#############################################
# stan model definition
#############################################
model1 <- "
data {                  // declare the data we pass in
  int<lower=0> n;       // number of students
  real y[n];            // vector of math scores
}
parameters {
  real<lower=0> sigma;  // standard deviation (half-Cauchy prior below)
  real mu;              // mean
}
model {
  target += cauchy_lpdf(sigma | 0, 1);   // prior for sigma
  target += normal_lpdf(mu | 0, 30);     // prior for mu
  target += normal_lpdf(y | mu, sigma);  // likelihood
}
generated quantities{
  real log_lik[n];      // pointwise log-likelihood, used later by loo
  for (i in 1:n) {
    log_lik[i] = normal_lpdf(y[i] | mu, sigma);
  }
}
"

“lpdf” stands for log probability density function. Each target += statement adds the corresponding log density to target, Stan’s running total of the log posterior density.

Alternatively, the model block could be written with sampling statements, which drop additive constants from the log density but yield the same posterior:

model {
  sigma ~ cauchy(0, 1);
  mu ~ normal(0, 30);
  y ~ normal(mu, sigma);
}


Before we run rstan, we need to create a data list, as follows:

########################################
# Make data list
########################################
y <- as.numeric(nels$math) # otherwise it's class integer
n <- nrow(nels) 
dataList <- list(n=n,y=y) 

Now for the part that takes a bit of time (compiling the model and sampling):

fit.1 <- stan(
  model_code = model1,
  model_name = "Nels23: mean & sd",
  data = dataList,
  iter = 2000,
  chains = 4,
  cores = 4,
  warmup = floor(2000/2),
  verbose = FALSE)
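
If we had instead saved the model in a separate .stan file, as mentioned above, we would use the file argument in place of model_code. A sketch, where "model1.stan" is a hypothetical file containing the model definition:

fit.1f <- stan(
  file = "model1.stan",    # hypothetical file holding the model1 code above
  model_name = "Nels23: mean & sd",
  data = dataList,
  iter = 2000,
  chains = 4,
  cores = 4)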

And now some output from fit.1. I don’t want all the log-likelihood terms printed here, so I’ll only ask for the mean and standard deviation:

print(fit.1,pars = c("mu","sigma"), probs = c(0.025, 0.975), 
    digits = 4)
## Inference for Stan model: Nels23: mean & sd.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##          mean se_mean     sd    2.5%   97.5% n_eff   Rhat
## mu    51.6984  0.0075 0.4708 50.7726 52.6128  3955 1.0000
## sigma 10.7116  0.0056 0.3256 10.1088 11.3682  3430 0.9998
## 
## Samples were drawn using NUTS(diag_e) at Mon Nov 01 14:37:27 2021.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
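
If we want the posterior draws themselves, say to compute our own summaries, extract() returns them as a named list; a minimal sketch:

post <- extract(fit.1)                  # named list: mu, sigma, log_lik, lp__
mean(post$mu)                           # posterior mean of mu
quantile(post$sigma, c(0.025, 0.975))   # 95% credible interval for sigma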

And various graphics:

plot(fit.1)
## 'pars' not specified. Showing first 10 parameters by default.
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

stan_trace(fit.1)
## 'pars' not specified. Showing first 10 parameters by default.

stan_dens(fit.1)
## 'pars' not specified. Showing first 10 parameters by default.

stan_hist(fit.1)
## 'pars' not specified. Showing first 10 parameters by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

stan_ac(fit.1)
## 'pars' not specified. Showing first 10 parameters by default.

stan_scat(fit.1,pars=c("mu","sigma"))

pairs(fit.1, pars = c("mu", "sigma", "lp__"), las = 1)
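
The repeated “'pars' not specified” messages go away if we name the parameters we want, just as we did with stan_scat; for example:

stan_trace(fit.1, pars = c("mu", "sigma"))
stan_dens(fit.1, pars = c("mu", "sigma"), separate_chains = TRUE)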

There are a few more diagnostic plots we could look at, but I’m not exactly sure what to do with these:

stan_diag(fit.1, information = c("stepsize"))
stan_diag(fit.1, information = c("divergence"))
stan_diag(fit.1, information = c("treedepth"))

And to use the loo package,

log_lik.1 <- extract_log_lik(fit.1, merge_chains = FALSE)  # pull log_lik from generated quantities
r_eff <- relative_eff(exp(log_lik.1))  # relative effective sample sizes, used by PSIS

loo.1 <- loo(log_lik.1, r_eff = r_eff, cores = 4)
print(loo.1)
## 
## Computed from 4000 by 519 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo  -1968.2 10.5
## p_loo         1.4  0.1
## looic      3936.4 21.0
## ------
## Monte Carlo SE of elpd_loo is 0.0.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
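
The loo package also provides a plot method for these Pareto k diagnostics; if any k values were large, the plot would show which observations are problematic:

plot(loo.1)   # Pareto k estimate for each of the 519 observations

Similarly for the WAIC: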
waic.1 <- waic(log_lik.1)
print(waic.1)
## 
## Computed from 4000 by 519 log-likelihood matrix
## 
##           Estimate   SE
## elpd_waic  -1968.2 10.5
## p_waic         1.4  0.1
## waic        3936.4 21.0

If we had fit a second model, we might want to compare the two (the older compare function is deprecated in favor of loo_compare):

loo_compare(loo.1, loo.2)

loo_compare reports the difference in expected log predictive density (elpd_diff) and its standard error, relative to the model in the first row.