import arviz as az
import aesara.tensor as at
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
Hierarchical modeling with the LKJ prior in PyMC
Throughout this blog post, I will be working with the famous sleepstudy dataset. I’m going to estimate a hierarchical linear regression with both varying intercepts and varying slopes. The goal is to understand how to place non-independent priors on the group-specific effects in PyMC as efficiently as possible.
The sleepstudy dataset is derived from the study described in Belenky et al. (2003) and popularized in the lme4 R package. This dataset contains the average reaction time per day (in milliseconds) on a series of tests for the most sleep-deprived group in a sleep deprivation study. The first two days of the study are considered as adaptation and training, the third day is a baseline, and sleep deprivation started after day 3. The subjects in this group were restricted to 3 hours of sleep per night.
With that said, let’s get into the code!
%matplotlib inline
"arviz-darkgrid")
az.style.use("figure.facecolor"] = "white" mpl.rcParams[
Let’s get started by downloading and exploring the sleepstudy dataset.
= "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/lme4/sleepstudy.csv"
url = pd.read_csv(url, index_col = 0) data
The following is a description of the variables we have in the dataset.
- Reaction: Average of the reaction time measurements on a given subject for a given day.
- Days: Number of days of sleep deprivation.
- Subject: The subject ID.
print(f"There are {len(data)} observations.")
data.head()
There are 180 observations.
|   | Reaction | Days | Subject |
|---|---|---|---|
| 1 | 249.5600 | 0 | 308 |
| 2 | 258.7047 | 1 | 308 |
| 3 | 250.8006 | 2 | 308 |
| 4 | 321.4398 | 3 | 308 |
| 5 | 356.8519 | 4 | 308 |
print(f"Days range from {data['Days'].min()} to {data['Days'].max()}.")
print(f"There are J={data['Subject'].unique().size} subjects.")
Days range from 0 to 9.
There are J=18 subjects.
Let’s explore the evolution of the reaction times through the days for every subject.
def plot_data(data, figsize=(16, 7.5)):
    fig, axes = plt.subplots(2, 9, figsize=figsize, sharey=True, sharex=True)
    fig.subplots_adjust(left=0.075, right=0.975, bottom=0.075, top=0.925, wspace=0.03)

    for i, (subject, ax) in enumerate(zip(data["Subject"].unique(), axes.ravel())):
        idx = data.index[data["Subject"] == subject].tolist()
        days = data.loc[idx, "Days"].values
        reaction = data.loc[idx, "Reaction"].values

        # Plot observed data points
        ax.scatter(days, reaction, color="C0", ec="black", alpha=0.7)

        # Add a title
        ax.set_title(f"Subject: {subject}", fontsize=14)

        ax.xaxis.set_ticks([0, 2, 4, 6, 8])

    fig.text(0.5, 0.02, "Days", fontsize=14)
    fig.text(0.03, 0.5, "Reaction time (ms)", rotation=90, fontsize=14, va="center")
    return fig, axes

plot_data(data);
For most of the subjects, there’s a clear positive association between Days and Reaction time. Reaction times increase as people accumulate more days of sleep deprivation. Participants differ in the initial reaction times as well as in the association between sleep deprivation and reaction time. Reaction times increase faster for some subjects and slower for others. Finally, the relationship between Days and Reaction time presents some deviations from linearity within the panels, but these are neither substantial nor systematic.
The model
The model we’re going to build today is a hierarchical linear regression with a Gaussian likelihood. In the following description, I use the Greek letter \(\beta\) to refer to common effects and the Roman letter \(u\) to refer to group-specific (or varying) effects.
\[ y_{ij} = \beta_0 + u_{0j} + \left( \beta_1 + u_{1j} \right) \cdot {\text{Days}} + \epsilon_i \]
where
\[ \begin{aligned} y_{ij} &= \text{Reaction time for the subject } j \text{ on day } i \\ \beta_0 &= \text{Intercept common to all subjects} \\ \beta_1 &= \text{Days slope common to all subjects} \\ u_{0j} &= \text{Intercept deviation of the subject } j \\ u_{1j} &= \text{Days slope deviation of the subject } j \\ \epsilon_i &= \text{Residual random error} \end{aligned} \]
we also have
\[ \begin{aligned} i = 1, \cdots, 10 \\ j = 1, \cdots, 18 \end{aligned} \]
where \(i\) indexes Days and \(j\) indexes subjects.
From the mathematical description we notice both the intercept and the slope are made of two components. The intercept is made of a common or global intercept \(\beta_0\) and subject-specific deviations \(u_{0j}\). The same logic applies for the slope with both \(\beta_1\) and \(u_{1j}\).
Priors
Common effects
For the common effects, we use independent Gaussian priors.
\[ \begin{array}{c} \beta_0 \sim \text{Normal}\left(\bar{y}, \sigma_{\beta_0}\right) \\ \beta_1 \sim \text{Normal}\left(0, \sigma_{\beta_1}\right) \end{array} \]
Bambi centers the prior for the intercept at \(\bar{y}\), so we do the same here. For \(\sigma_{\beta_0}\) and \(\sigma_{\beta_1}\) I’m going to use 50 and 10, respectively. We’ll use these same priors for all the variants of the model described above.
Residual error
\[ \begin{aligned} \epsilon_i &\sim \text{Normal}(0, \sigma) \\ \sigma &\sim \text{HalfStudentT}(\nu, \sigma_\epsilon) \end{aligned} \]
Where \(\nu\) and \(\sigma_\epsilon\), both positive constants, represent the degrees of freedom and the scale parameter, respectively.
Group-specific effects
Throughout this post we’ll propose the following variants for the priors of the group-specific effects.
- Independent priors.
- Correlated priors.
  - Using LKJCholeskyCov.
  - Using LKJCorr.
  - Using LKJCorr with non-trivial standard deviation.
Each of them will be described in more detail in its own section.
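Before diving in, it helps to keep in mind the common structure behind all the correlated variants: the pair of subject-specific deviations receives a joint multivariate normal prior whose covariance is split into standard deviations and a correlation matrix, and the correlation matrix gets an LKJ prior. A generic sketch of that structure (each section fills in its own hyperpriors) is
\[ \begin{aligned} \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} &\sim \text{MvNormal}\left(\boldsymbol{0}, \Sigma_u \right) \\ \Sigma_u &= \begin{pmatrix} \sigma_{u_0} & 0 \\ 0 & \sigma_{u_1} \end{pmatrix} \Omega \begin{pmatrix} \sigma_{u_0} & 0 \\ 0 & \sigma_{u_1} \end{pmatrix} \\ \Omega &\sim \text{LKJ}(\eta) \end{aligned} \]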
Then we create subjects and subjects_idx. These represent the subject IDs and their indexes, which we use when working with the distributions of the group-specific coefficients. We also create the coords that we pass to the model, and the mean of the response, which is used in the prior for the intercept.
# Subjects and subjects index
subjects, subjects_idx = np.unique(data["Subject"], return_inverse=True)

# Coordinates to handle dimensions of PyMC distributions and use better labels
coords = {"subject": subjects}

# Response mean -- used in the prior for the intercept
y_mean = data["Reaction"].mean()

# Days variable
days = data["Days"].values
Model 1: Independent priors
Group-specific effects: Independent priors
\[ \begin{array}{lr} u_{0j} \sim \text{Normal} \left(0, \sigma_{u_0}\right) & \text{for all } j:1,..., 18 \\ u_{1j} \sim \text{Normal} \left(0, \sigma_{u_1}\right) & \text{for all } j:1,..., 18 \end{array} \]
where the hyperpriors are
\[ \begin{array}{c} \sigma_{u_0} \sim \text{HalfNormal} \left(\tau_0\right) \\ \sigma_{u_1} \sim \text{HalfNormal} \left(\tau_1\right) \end{array} \]
where \(\tau_0\) and \(\tau_1\) represent the standard deviations of the hyperpriors. These are fixed positive constants. We set them to the same values as \(\sigma_{\beta_0}\) and \(\sigma_{\beta_1}\), respectively.
with pm.Model(coords=coords) as model_independent:
    # Common effects
    β0 = pm.Normal("β0", mu=y_mean, sigma=50)
    β1 = pm.Normal("β1", mu=0, sigma=10)

    # Group-specific effects
    # Intercept
    σ_u0 = pm.HalfNormal("σ_u0", sigma=50)
    u0 = pm.Normal("u0", mu=0, sigma=σ_u0, dims="subject")

    # Slope
    σ_u1 = pm.HalfNormal("σ_u1", sigma=10)
    u1 = pm.Normal("u1", mu=0, sigma=σ_u1, dims="subject")

    # Construct intercept and slope
    intercept = pm.Deterministic("intercept", β0 + u0[subjects_idx])
    slope = pm.Deterministic("slope", (β1 + u1[subjects_idx]) * days)

    # Conditional mean
    μ = pm.Deterministic("μ", intercept + slope)

    # Residual standard deviation
    σ = pm.HalfStudentT("σ", nu=4, sigma=50)

    # Response
    y = pm.Normal("y", mu=μ, sigma=σ, observed=data["Reaction"])
pm.model_to_graphviz(model_independent)
with model_independent:
    idata_independent = pm.sample(draws=1000, chains=4, random_seed=1234)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 2 jobs)
NUTS: [β0, β1, σ_u0, u0, σ_u1, u1, σ]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 33 seconds.
az.plot_trace(
    idata_independent,
    var_names=["β0", "β1", "u0", "u1", "σ", "σ_u0", "σ_u1"],
    combined=True,
    chain_prop={"ls": "-"}
);
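A numerical summary is a handy complement to the trace plot, since it also reports effective sample sizes and R-hat values; something along these lines works (the variable selection here is just illustrative).

# Posterior summaries plus ESS and R-hat convergence diagnostics
az.summary(idata_independent, var_names=["β0", "β1", "σ", "σ_u0", "σ_u1"])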
def plot_predictions(data, idata, figsize=(16, 7.5)):
    # Plot the data
    fig, axes = plot_data(data, figsize=figsize)

    # Extract predicted mean
    reaction_mean = idata.posterior["μ"]

    for subject, ax in zip(subjects, axes.ravel()):
        idx = (data["Subject"] == subject).values
        days = data.loc[idx, "Days"].values

        # Plot highest density interval / credibility interval
        az.plot_hdi(days, reaction_mean[..., idx], color="C0", ax=ax)

        # Plot mean regression line
        ax.plot(days, reaction_mean[..., idx].mean(("chain", "draw")), color="C0")

    return fig, axes

plot_predictions(data, idata_independent);
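The comparison below also uses the correlated-prior variants (idata_lkj_cov, idata_lkj_corr, and idata_lkj_corr_2), each built and sampled in its own section. As a rough orientation, a minimal sketch of the LKJCholeskyCov-based variant could look like the following; the variable names, eta=2, and the single HalfNormal scale for sd_dist are illustrative choices, not necessarily the ones used in that section.

with pm.Model(coords={**coords, "effect": ["intercept", "slope"]}) as model_lkj_sketch:
    # Common effects, same priors as before
    β0 = pm.Normal("β0", mu=y_mean, sigma=50)
    β1 = pm.Normal("β1", mu=0, sigma=10)

    # Cholesky factor of the covariance of (u0, u1): the standard deviations get the
    # (stateless) sd_dist prior and the correlation matrix gets an LKJ(eta) prior
    chol, corr, sds = pm.LKJCholeskyCov(
        "chol_u", eta=2, n=2, sd_dist=pm.HalfNormal.dist(sigma=25, shape=2), compute_corr=True
    )
    # corr holds the 2x2 correlation matrix between intercept and slope deviations

    # Correlated subject-specific deviations for intercept and slope
    u = pm.MvNormal("u", mu=np.zeros(2), chol=chol, dims=("subject", "effect"))

    # Conditional mean, residual standard deviation, and likelihood
    μ = β0 + u[subjects_idx, 0] + (β1 + u[subjects_idx, 1]) * days
    σ = pm.HalfStudentT("σ", nu=4, sigma=50)
    pm.Normal("y", mu=μ, sigma=σ, observed=data["Reaction"])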
Compare inferences
groups = [
    ["β0", "β1", "σ"],
    ["u0", "σ_u0"],
    ["u1", "σ_u1"],
]

model_names = ["Independent", "LKJCholeskyCov", "LKJCorr", "LKJCorr + rstanarm"]
fig, ax = plt.subplots(1, 3, figsize=(20, 10))

for idx, group in enumerate(groups):
    az.plot_forest(
        [idata_independent, idata_lkj_cov, idata_lkj_corr, idata_lkj_corr_2],
        model_names=model_names,
        var_names=group,
        combined=True,
        ax=ax[idx],
    )
az.plot_forest(
    [idata_lkj_cov, idata_lkj_corr, idata_lkj_corr_2],
    model_names=model_names[1:],
    var_names=["ρ_u"],
    combined=True
);
Conclusions
- We showed how to use correlated priors for group-specific coefficients.
- The posteriors turned out to be the same in all the cases.
- The correlated priors didn’t bring any benefit to our sampling process here. However, they could be beneficial for more complex hierarchical models.
- What’s more, the model with the correlated priors took more time to sample than the one with independent priors.
- Attempting to replicate the rstanarm approach takes even longer because we are forced to compute many things manually.
Notes and suggestions
- Sometimes, the models with correlated priors based on pm.LKJCorr resulted in divergences. We needed to increase target_accept.
- It would be good to be able to pass a random variable to sd_dist in pm.LKJCholeskyCov, and not just a stateless distribution. This forced me to use pm.LKJCorr and perform many manipulations manually, which was more error-prone and inefficient.
- It would be good to check whether there’s something in LKJCorr/LKJCholeskyCov that could be improved. I plan to use LKJCholeskyCov within Bambi in the future and I would like it to work as well as possible.
%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Sun Jun 12 2022
Python implementation: CPython
Python version : 3.10.4
IPython version : 8.3.0
numpy : 1.21.6
pandas : 1.4.2
matplotlib: 3.5.2
arviz : 0.12.1
aesara : 2.6.6
pymc : 4.0.0
sys : 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:39:04) [GCC 10.3.0]
Watermark: 2.3.1