I will create a series of documents that run various analysis on the nels 23 schools data. I could put them all in one document, but this would take a really, really long time. The documents in the set will do the following
So, let’s load up the rstan package at the start and see the messages that we’re given:
library(rstan)
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
library(stargazer) # use for table of descriptive
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
This package computes some useful statistics for model evaluation and comparison. One statistic is efficient approximate “leave one out” (hence it’s called “loo”) cross-validation for Bayesian models. The loo package will also compute waic (“widely applicable information criteria”) and compare waic for multiple models using the loo_compare function.
library(loo)
## This is loo version 2.4.1
## - Online documentation and vignettes at mc-stan.org/loo
## - As of v2.0.0 loo defaults to 1 core but we recommend using as many as possible. Use the 'cores' argument or set options(mc.cores = NUM_CORES) for an entire session.
## - Windows 10 users: loo may be very slow if 'mc.cores' is set in your .Rprofile file (see https://github.com/stan-dev/loo/issues/94).
##
## Attaching package: 'loo'
## The following object is masked from 'package:rstan':
##
## loo
Note that there three recommendations here about what to run before we use rstan. The first one, “For execution on a local, multicore CPU with excess RAM we recommend calling”, only needs to be issue once per session. This make the default number of core to use equal to the number of cores available
options(mc.cores = parallel::detectCores())
The second recommendation, “To avoid recompilation of unchanged Stan programs, we recommend calling”, save the compile model so that it can be re-used. Models are compiled in C++ and this can take some time.
rstan_options(auto_write = TRUE)
The third recommendation, “For improved execution time, we recommend calling”, I’m not exactly sure what this does. “For improved execution time, we recommend calling”
Sys.setenv(LOCAL_CPPFLAGS = '-march=native -mtune=native')
My working directory where the data live is
setwd("C:/Users/cja/Dropbox/edps 590BAY/Lectures/10 Ham_Stan_brms")
And read in data
nels <- read.table("school23_data.txt",header=TRUE)
nels <- nels[order(nels$school,nels$student), ]
# Get information on numbers of school and students
school <- unique(nels$school)
N <- length(school)
# total number of students
n <- length(nels$math)
names(nels)
## [1] "school" "student" "sex" "race" "homew" "schtype"
## [7] "ses" "pared" "math" "classtr" "schsize" "urban"
## [13] "geo" "minority" "ratio"
I like computing sample statistics
stargazer(nels, type="text") # default is LaTex
##
## ===================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -------------------------------------------------------------------
## school 519 35,488.710 26,056.710 6,053 7,801 62,821 72,991
## student 519 49.863 28.623 0 24.5 74 99
## sex 519 1.520 0.500 1 1 2 2
## race 519 3.615 0.797 1 4 4 5
## homew 519 1.971 1.484 0 1 3 7
## schtype 519 1.977 1.328 1 1 4 4
## ses 519 -0.001 0.881 -2.410 -0.620 0.730 1.850
## pared 519 3.289 1.440 1 2 4 6
## math 519 51.723 10.709 30 43 61.5 71
## classtr 519 3.819 0.790 2 3 4 5
## schsize 519 3.229 1.400 1 2 4 7
## urban 519 1.832 0.859 1 1 3 3
## geo 519 2.125 0.704 1 2 3 3
## minority 519 2.501 2.037 0 1 3 7
## ratio 519 16.757 4.931 10 13 20 28
## -------------------------------------------------------------------
Our first stan model! We’ll start simple and work up to more complex ones.
We could either define the model as below and run in rstan like we did with rjags OR we could put this in a file with extension .stan
#############################################
# stan model definition
#############################################
model1 <- "
data { // We need to define the data
int<lower=0> n; // number of students
real y[n]; // vector of data
}
parameters {
real<lower=0> sigma;
real mu;
}
model {
target += cauchy_lpdf(sigma | 0, 1);
target += normal_lpdf(mu |0, 30);
target += normal_lpdf(y | mu, sigma);
}
generated quantities{
real log_lik[n];
for (i in 1:n) {
log_lik[i] = normal_lpdf(y[i] | mu, sigma);
}
}
"
“lpdf” stands for log probability density function.
Alternatively the model part could be written as
sigma ~ cauchy(0,1);
mu ~ normal(0, 10);
y ~ normal(mu,sigma);
}
“lpdf” stands for log probability density function.
Before we run rstan, we need to create a data list, as follows:
########################################
# Make data list
########################################
y <- as.numeric(nels$math) # otherwise it's class integer
n <- nrow(nels)
dataList <- list(n=n,y=y)
Now for the part that takes a bit of time
fit.1 <- stan(
model_code = model1,
model_name = "Nels23: mean & sd",
data = dataList,
iter = 2000,
chains = 4,
cores=4,
warmup = floor(2000/2),
verbose = FALSE)
And some output but I don’t want all the log likelihoods here, so I’ll only ask for mean and standard deviation:
print(fit.1,pars = c("mu","sigma"), probs = c(0.025, 0.975),
digits = 4)
## Inference for Stan model: Nels23: mean & sd.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 97.5% n_eff Rhat
## mu 51.6984 0.0075 0.4708 50.7726 52.6128 3955 1.0000
## sigma 10.7116 0.0056 0.3256 10.1088 11.3682 3430 0.9998
##
## Samples were drawn using NUTS(diag_e) at Mon Nov 01 14:37:27 2021.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
and various graphics
plot(fit.1)
## 'pars' not specified. Showing first 10 parameters by default.
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
stan_trace(fit.1)
## 'pars' not specified. Showing first 10 parameters by default.
stan_dens(fit.1)
## 'pars' not specified. Showing first 10 parameters by default.
stan_hist(fit.1)
## 'pars' not specified. Showing first 10 parameters by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
stan_ac(fit.1)
## 'pars' not specified. Showing first 10 parameters by default.
stan_scat(fit.1,pars=c("mu","sigma"))
pairs(fit.1, pars = c("mu", "sigma", "lp__"), las = 1)
There are some more that we could look at but I’m not exactly sure what to do with these
stan_diag(fit.2,information=c("stepsize"))
stan_diag(fit.2,information=c("divergence"))
stan_diag(fit.2,information=c("treedepth"))
And to use the loo package,
log_lik.1 <- extract_log_lik(fit.1, merge_chains = FALSE)
r_eff <- relative_eff(exp(log_lik.1))
loo.1 <- loo(log_lik.1, r_eff = r_eff, cores = 4)
print(loo.1)
##
## Computed from 4000 by 519 log-likelihood matrix
##
## Estimate SE
## elpd_loo -1968.2 10.5
## p_loo 1.4 0.1
## looic 3936.4 21.0
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
waic.1 <- waic(log_lik.1)
print(waic.1)
##
## Computed from 4000 by 519 log-likelihood matrix
##
## Estimate SE
## elpd_waic -1968.2 10.5
## p_waic 1.4 0.1
## waic 3936.4 21.0
If we had two models we might want to look at
compare(loo.1,loo.2)