import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
Data simulation with PyMC
To simulate data in Python, people generally use NumPy or SciPy. Their interfaces are simple and easy to use. For example, to simulate values from a normal distribution, np.random.normal()
is all we need.
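For instance, a minimal sketch (the mean, standard deviation, and sample size below are arbitrary values I picked just for illustration):

np.random.normal(loc=2.0, scale=1.5, size=5)  # five draws from a Normal(2, 1.5)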
But not all the scenarios where one may want to simulate data are as simple as our example above. In statistics, quite often we want to perform parameter recovery studies. These tell us if our model and estimation procedure are able to recover the true value of the parameters of the assumed data generating process.
To perform a parameter recovery study, we roughly follow the steps below:
- Define model (input variables, parameters, and output variables)
- Set model parameters to some plausible values
- Simulate values of the outcome variable
- Estimate the parameters
- Compare the parameter estimates against the true values
If we wanted to do this using NumPy, we would need to write the model from scratch using functions in its random
module. In general, when our goal is to perform estimation and prediction, we don’t write models from scratch in NumPy. Rather, we use the interface provided by another library that usually comes with the tools needed to do estimation, criticism, prediction, etc.
What the previous paragraph highlights is that, at the end of the day, the model is implemented twice. One implementation is used to carry out simulation and parameter recovery, the other is used to estimate the parameters. As the complexity of the model increases, it becomes increasingly harder to avoid mistakes in the implementation of our model in a pure NumPy or SciPy approach.
Fortunately, for those of us following a Bayesian approach, there’s PyMC. The model representation in PyMC gives users the possibility to do both simulation and estimation using the same model object.
Through a series of examples of increasing complexity, I'm going to show how to simulate data with PyMC in the context of a parameter recovery study. In our journey, we'll learn about some PyMC-specific terminology and functions, including:
- pm.draw(), pm.do(), and pm.observe()
- coords and dims
- pm.Data and pm.Deterministic
Hopefully, the groupings will make sense by the end of this post.
seed = sum(map(ord, "happy data simulation"))  # A fancy way to determine a seed
rng = np.random.default_rng(seed)  # Random number generator
A simple normal model
It all starts with a very simple model. There’s an outcome variable \(Y\) which we model with a normal distribution, and we’re interested in the estimation of the mean and standard deviation parameters, \(\mu\) and \(\sigma\).
\[ \begin{aligned} Y_i \mid \mu, \sigma &\sim \text{Normal}(\mu, \sigma^2) \\ \mu &\sim \text{Normal}(0, 1^2) \\ \sigma &\sim \text{Gamma}(2, 2) \end{aligned} \]
with \(i = 1, \dots, N\). In this case, as we're starting small, we will use \(N = 10\).
N = 10

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=1)          # (1)
    sigma = pm.Gamma("sigma", alpha=2, beta=2)   # (2)
    pm.Normal("y", mu=mu, sigma=sigma, shape=N)  # (3)

model.to_graphviz()

- (1) Prior for \(\mu\).
- (2) Prior for \(\sigma\).
- (3) The observational model. shape=N indicates there are N values of y.
Note that we have created a PyMC model without any observed data. If you’re already familiar with PyMC, don’t look for the observed
keyword in the distribution for \(Y\), you won’t find it.
To simulate values from a PyMC distribution, there's the very handy pm.draw() function. It does what its name says: it draws samples from a probability distribution.¹
Let’s draw a value from the distribution of \(\mu\):
pm.draw(mu)
array(0.02122677)
You want many values? No problem:
pm.draw(mu, draws=10)
array([ 0.10972394, -0.46840241, 2.81970857, 0.1552543 , 0.89596717,
-0.86260537, 0.31740876, -0.95548771, 0.68387801, 2.33809632])
How about reproducibility? It has you covered with random_seed
. You can pass an integer seed or a random number generator. I prefer the latter and I will use it throughout this blog post.
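For example, the integer version would look like this (1234 is an arbitrary seed I made up):

pm.draw(mu, random_seed=1234)  # reproducible: same seed, same draw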
pm.draw(sigma, random_seed=rng)
array(0.66709727)
Finally, the last thing I’m going to say about pm.draw()
is that it accepts a sequence of random variables. Below, we get a draw for both \(\mu\) and \(\sigma\).
mu_value, sigma_value = pm.draw([mu, sigma], random_seed=rng)
mu_value, sigma_value
(array(-0.64551614), array(0.29948268))
The next step is to set these parameter values in the model. This is where the pm.do()
function comes into play. The name of the function comes from the causal inference literature and it is used to perform “interventions” in a model graph. For our purposes, we can think of it as a way to assign specific values to model parameters. If you’re curious about it you can have a look at Causal analysis with PyMC: Answering “What If?” with the new do operator and Interventional distributions and graph mutation with the do-operator.
Let’s return to our topic. To assign values to model parameters using pm.do()
, we pass it a model instance and a dictionary that maps variable names to values. The output we obtain is a new PyMC model instance.
= pm.do(model, {"mu": mu_value, "sigma": sigma_value})
model_fixed_parameters model_fixed_parameters.to_graphviz()
In this new model instance, the nodes for mu and sigma got updated. They don't represent random variables anymore; they are now constant values.
Now that the parameter values are fixed, we need to simulate values from the outcome variable y
. To do so, we use pm.draw()
again. Note that we access the random variable y
from the model object, as there’s no global variable that represents it.
y_values = pm.draw(model_fixed_parameters["y"], random_seed=rng)
y_values
array([-0.93216552, -0.37497902, 0.02807845, -0.53793148, -0.99724182,
-0.77754811, -0.60105837, -0.34428665, -0.63623636, -0.66625698])
The next step is to set the “observed” values of the outcome variable in the original model, where mu
and sigma
are random variables. This is where pm.observe()
comes into play. Unlike pm.do()
, pm.observe()
doesn't convert random variables into fixed quantities. Instead, it attaches observed values (realizations of these random variables) that are later used to condition inference.
Its usage is analogous to pm.do(): we pass a model and a dictionary mapping variable names to values, and it returns a new PyMC model.
model_observed_data = pm.observe(model, {"y": y_values})
model_observed_data.to_graphviz()
Finally, we now can perform inference and evaluate whether the posterior concentrates around the parameter value used to simulate the values of the outcome.
with model_observed_data:
    idata = pm.sample(random_seed=rng)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
az.plot_posterior(idata, ref_val=[mu_value, sigma_value]);
And everything worked just as expected.
Summary
Since we’re going to follow the same steps in all the examples below, let’s summarise them here for reference.
- Define model (input variables, parameters, and output variables)

N = 10

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=1)
    sigma = pm.Gamma("sigma", alpha=2, beta=2)
    pm.Normal("y", mu=mu, sigma=sigma, shape=N)

- Set model parameters to some plausible values

mu_value, sigma_value = pm.draw([mu, sigma], random_seed=rng)
model_fixed_parameters = pm.do(model, {"mu": mu_value, "sigma": sigma_value})

- Simulate values of the outcome variable

y_values = pm.draw(model_fixed_parameters["y"], random_seed=rng)

- Estimate the parameters

model_observed_data = pm.observe(model, {"y": y_values})

with model_observed_data:
    idata = pm.sample(random_seed=rng)

- Compare the parameter estimates against the true values

az.plot_posterior(idata, ref_val=[mu_value, sigma_value]);
Before moving on to the next example, some clarification notes:
- You don't need to use the prior to simulate parameter values. You could call pm.do() with arbitrary values, as long as they are within the parameter domain (see the sketch right after these notes).
- This is not a post about thorough parameter recovery, just the basics to get the idea across. Don't take it as a gold standard.
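To make the first note concrete, here is a minimal sketch (the values 0.5 and 1.0 and the name model_handpicked are arbitrary choices of mine):

# Hand-picked values: any real number works for mu, but sigma must be positive
model_handpicked = pm.do(model, {"mu": 0.5, "sigma": 1.0})
pm.draw(model_handpicked["y"], random_seed=rng)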
Simple linear regression
Let’s work with the following simple linear regression model:
\[ \begin{aligned} Y_i \mid \mu_i &\sim \text{Normal}(\mu_i, \sigma^2) \\ \mu_i &= \alpha + \beta x_i \\ \alpha &\sim \text{Normal}(0, 1^2) \\ \beta &\sim \text{Normal}(0, 1^2) \\ \sigma &\sim \text{Gamma}(2, 2) \end{aligned} \]
with \(i = 1, \dots, N\)
Known covariate values
Usually, the values of the covariate are assumed to be fixed and known. We start with a scenario where we have the values for it, and later on, we show how to simulate data for both covariate and outcome variables.
x_values = np.array(
    [0.786, -0.399, 1.018, 0.657, -0.195, -0.083, 0.651, -0.476, -0.584, -0.194,
     0.282, 0.176, -0.309, 2.022, 0.341, -0.982, -0.904, 0.491, -2.07, -0.568]
)
N = 20

with pm.Model() as model:
    alpha = pm.Normal("alpha", mu=0, sigma=1)
    beta = pm.Normal("beta", mu=0, sigma=1)

    mu = alpha + beta * x_values
    sigma = pm.Gamma("sigma", alpha=2, beta=2)

    pm.Normal("y", mu=mu, sigma=sigma, shape=N)

model.to_graphviz()
The steps below are analogous to the ones above. We only change the random variables from which we sample.
# Get plausible values for the parameters
alpha_value, beta_value, sigma_value = pm.draw([alpha, beta, sigma], random_seed=rng)

print("alpha:", alpha_value)
print("beta:", beta_value)
print("sigma:", sigma_value)

# Set parameters to the sampled values
model_fixed_parameters = pm.do(
    model,
    {"alpha": alpha_value, "beta": beta_value, "sigma": sigma_value}
)
model_fixed_parameters.to_graphviz()
alpha: 0.6559553890005595
beta: -0.2818584784669223
sigma: 2.9596230424561605
Given the fixed parameter values and the values of the covariate, we simulate values of the response y
.
y_values = pm.draw(model_fixed_parameters["y"], random_seed=rng)
y_values
array([ 3.12686173, -0.11044308, -4.05742728, 2.9712478 , 1.03231454,
-0.77103352, 0.42986971, 2.32298661, 2.22334492, 4.25858143,
-1.75237172, -0.51776157, 3.27740227, 1.12482262, -4.2394528 ,
0.77089266, 4.45871827, 5.20353471, 0.61391209, 3.22912347])
fig, ax = plt.subplots()
ax.scatter(x_values, y_values, alpha=0.7)
ax.axline((0, alpha_value), slope=beta_value, color="0.3")  # theoretical line
ax.set(xlabel="x", ylabel="y");
Observe the simulated values of y
in the original model:
model_observed_data = pm.observe(model, {"y": y_values})
model_observed_data.to_graphviz()
and perform inference:
with model_observed_data:
    idata = pm.sample(random_seed=rng)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta, sigma]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
Again, the true parameter value is fairly well contained in the posterior distribution.
az.plot_posterior(idata, ref_val=[alpha_value, beta_value, sigma_value]);
We can also simulate some draws of the posterior regression line.
posterior_draws = az.extract(
    idata, group="posterior", var_names=["alpha", "beta"], num_samples=15
)

fig, ax = plt.subplots()
ax.scatter(x_values, y_values, alpha=0.7)
ax.axline((0, alpha_value), slope=beta_value, color="0.3")

for a, b in zip(posterior_draws["alpha"].to_numpy(), posterior_draws["beta"].to_numpy()):
    ax.axline((0, a), slope=b, color="C1", alpha=0.8, lw=1, zorder=-1)

ax.set(xlabel="x", ylabel="y");
Unknown covariate values
We may need to simulate values for x
either because we don’t yet have values for the covariate or because we want to evaluate different scenarios.
The process is again similar to what we’ve done so far, but now we need to specify a distribution for x
to determine how its values are generated. There is a plethora of choices. Here, I'm falling back on a standard normal distribution.
N = 20

with pm.Model() as model:
    alpha = pm.Normal("alpha", mu=0, sigma=1)
    beta = pm.Normal("beta", mu=0, sigma=1)
    x = pm.Normal("x", shape=N)  # there are as many 'x' values as observations

    mu = alpha + beta * x
    sigma = pm.Gamma("sigma", alpha=2, beta=2)

    pm.Normal("y", mu=mu, sigma=sigma, shape=N)

model.to_graphviz()
Everything else stays the same:
# Get plausible values for the parameters
alpha_value, beta_value, sigma_value, x_values = pm.draw([alpha, beta, sigma, x], random_seed=rng)

print("alpha:", alpha_value)
print("beta:", beta_value)
print("sigma:", sigma_value)
print("\nx values:", x_values, sep="\n")

# Set parameters to the sampled values
model_fixed_parameters = pm.do(
    model,
    {"alpha": alpha_value, "beta": beta_value, "sigma": sigma_value, "x": x_values}
)
model_fixed_parameters.to_graphviz()
alpha: 0.1719973000588151
beta: 0.6008524215503052
sigma: 1.1351657143542704
x values:
[-0.07292669 -0.54826599 -0.46954908 0.26994204 1.00053203 0.65220785
-1.52515351 0.51970338 -0.51429822 -0.26137883 -0.47653569 0.16923646
-0.93059417 0.45502837 -1.05738506 -0.61663942 -0.52632527 0.34543959
-0.04647066 1.32000584]
# Simulate values for the outcome variable
y_values = pm.draw(model_fixed_parameters["y"], random_seed=rng)
print("y values:", y_values, sep="\n")
y values:
[ 1.71790428 -0.0682522 -1.26126753 -0.9484205 -0.06096033 0.28471004
-0.40986102 0.38968251 -2.03112939 1.44298618 0.99584892 -0.46963706
0.99470417 0.2107016 0.50460528 -0.72376308 1.58806132 0.28525278
0.71371689 2.99055351]
fig, ax = plt.subplots()
ax.scatter(x_values, y_values, alpha=0.7)
ax.axline((0, alpha_value), slope=beta_value, color="0.3")
ax.set(xlabel="x", ylabel="y");
And observe the simulated values of y
in the original model.
model_observed_data = pm.observe(pm.do(model, {"x": x_values}), {"y": y_values})
model_observed_data.to_graphviz()
Hold on for a second and take a second look at the code above. See that we’re using both pm.do()
and pm.observe()
? We first need to use pm.do()
to set the values of x
in the original model, and then we use pm.observe()
to attach the observed values to the outcome variable y
. We can’t pass the values of x
through pm.observe()
because that would mean they are not fixed values, but realizations from a random variable.
To conclude, let’s explore the posterior.
with model_observed_data:
    idata = pm.sample(random_seed=rng)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta, sigma]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
az.plot_posterior(idata, ref_val=[alpha_value, beta_value, sigma_value]);
posterior_draws = az.extract(
    idata, group="posterior", var_names=["alpha", "beta"], num_samples=15
)

fig, ax = plt.subplots()
ax.scatter(x_values, y_values, alpha=0.7)
ax.axline((0, alpha_value), slope=beta_value, color="0.3")

for a, b in zip(posterior_draws["alpha"].to_numpy(), posterior_draws["beta"].to_numpy()):
    ax.axline((0, a), slope=b, color="C1", alpha=0.8, lw=1, zorder=-1)

ax.set(xlabel="x", ylabel="y");
Normal model for multiple groups
The following model is a simple extension of the first one, where we have multiple groups with different population means.
\[ \begin{aligned} Y_i &\sim \text{Normal}(\mu_{j[i]}, \sigma^2) \\ \mu_j &\sim \text{Normal}(0, 3^2) & \text{for all } j \\ \sigma &\sim \text{Gamma}(2, 2) \end{aligned} \]
with \(i = 1, \dots, N\) and \(j = 1, \dots, J\).
The indexing notation in \(\mu_{j[i]}\) is read as “the value of \(j\) for the \(i\)-th observation”. You can also see \(j[i]\) as a function call, where you pass the index of an observation and it gives you the value of the group it belongs to.
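To make the indexing notation concrete, here is a tiny NumPy sketch (the group means and memberships below are made up for illustration):

mu_example = np.array([1.0, 5.0, -2.0])  # one mean per group (J = 3)
j_example = np.array([0, 0, 1, 2, 2])    # group membership of each observation, i.e. j[i]
mu_example[j_example]                    # the mean assigned to each observation
# array([ 1.,  1.,  5., -2., -2.])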
Known group membership
Let’s start with a simple example, where:
- All groups have the same, pre-defined, sample size.
- Group memberships are known.
- Observations are sorted by group.
J = 4        # Number of groups
n_j = 20     # Number of observations per group
N = J * n_j  # Total number of observations

# Generate group indexes
group_idx = np.repeat(np.arange(J), n_j)
print("Group indexes:")
print(group_idx)
Group indexes:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3]
The first 20 observations are from the first group, the second 20 observations are from the second group, and so on.
Let’s write the model in PyMC. Since we have as many means as groups, we need to pass shape=J
when we define the prior for \(\mu\). Note that we create an intermediate variable mu_indexed
that contains the value of mu
for each observation, according to their group.
with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=3, shape=J)
    sigma = pm.Gamma("sigma", alpha=2, beta=2)
    mu_indexed = mu[group_idx]
    pm.Normal("y", mu=mu_indexed, sigma=sigma, shape=N)

model.to_graphviz()

mu_values, mu_indexed_values, sigma_value = pm.draw([mu, mu_indexed, sigma], random_seed=rng)
print("mu:", mu_values)
print("sigma:", sigma_value)
mu: [6.05496819 1.16678048 2.30070048 4.72798215]
sigma: 0.5873507289564738
As expected, there are 4 values for mu
. Let’s have a look at the values for mu_indexed_values
.
mu_indexed_values
array([6.05496819, 6.05496819, 6.05496819, 6.05496819, 6.05496819,
6.05496819, 6.05496819, 6.05496819, 6.05496819, 6.05496819,
6.05496819, 6.05496819, 6.05496819, 6.05496819, 6.05496819,
6.05496819, 6.05496819, 6.05496819, 6.05496819, 6.05496819,
1.16678048, 1.16678048, 1.16678048, 1.16678048, 1.16678048,
1.16678048, 1.16678048, 1.16678048, 1.16678048, 1.16678048,
1.16678048, 1.16678048, 1.16678048, 1.16678048, 1.16678048,
1.16678048, 1.16678048, 1.16678048, 1.16678048, 1.16678048,
2.30070048, 2.30070048, 2.30070048, 2.30070048, 2.30070048,
2.30070048, 2.30070048, 2.30070048, 2.30070048, 2.30070048,
2.30070048, 2.30070048, 2.30070048, 2.30070048, 2.30070048,
2.30070048, 2.30070048, 2.30070048, 2.30070048, 2.30070048,
4.72798215, 4.72798215, 4.72798215, 4.72798215, 4.72798215,
4.72798215, 4.72798215, 4.72798215, 4.72798215, 4.72798215,
4.72798215, 4.72798215, 4.72798215, 4.72798215, 4.72798215,
4.72798215, 4.72798215, 4.72798215, 4.72798215, 4.72798215])
The result is not a coincidence. The first 20 values are the first value of mu
, the second 20 values are the second value of mu
, and so on. This is because observations are sorted by group.
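If you want to verify this programmatically, here is a quick sanity check using the arrays already in scope:

# mu_indexed was defined as mu[group_idx], so the drawn values must line up
np.allclose(mu_indexed_values, mu_values[group_idx])  # expected: True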
The usage of pm.do()
is the same as always.
= pm.do(model, {"mu": mu_values, "sigma": sigma_value})
model_fixed_parameters model_fixed_parameters.to_graphviz()
And the same logic applies to the simulation of the values of y
.
# Simulate values
y_values = pm.draw(model_fixed_parameters["y"], random_seed=rng)

# Plot empirical distributions
fig, ax = plt.subplots()

for j in range(J):
    ax.hist(y_values[group_idx == j], bins=6, alpha=0.6, label=f"Group {j}")

ax.legend()
ax.set(xlabel="y");
Now let’s set the observed values, sample the posterior, and, subsequently, explore it.
model_observed_data = pm.observe(model, {"y": y_values})

with model_observed_data:
    idata = pm.sample(random_seed=rng)

az.plot_posterior(idata, var_names=["mu"], ref_val=mu_values.tolist());
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
We can also make use of pm.sample_posterior_predictive()
to get draws of the posterior predictive distribution:
with model_observed_data:
    idata.extend(pm.sample_posterior_predictive(idata))
Sampling: [y]
This allows us to plot the predictive distribution for each group, where we can see they differ in location but not in dispersion.
fig, ax = plt.subplots(figsize=(10, 4))

for j in range(J):
    predictive_values = idata.posterior_predictive["y"].to_numpy()[..., group_idx == j].flatten()
    ax.hist(predictive_values, bins=50, alpha=0.6, label=f"Group {j}", density=True)

ax.legend()
ax.set(xlabel="y");
Unknown group membership
\[ \begin{aligned} Y_i &\sim \text{Normal}(\mu_{j[i]}, \sigma^2) \\ \mu_j &\sim \text{Normal}(0, 3^2) & \text{for all } j \\ \sigma &\sim \text{Gamma}(2, 2) \end{aligned} \]
with \(i = 1, \dots, N\) and \(j = 1, \dots, J\).
The model is exactly the same as the one in the previous section. The difference is that we don’t know how many observations belong to each group – we also want to simulate that.
To simulate group memberships we need a distribution that gives us integers between \(0\) and \(J-1\). One such distribution is the discrete uniform distribution.
J = 4   # Number of groups
N = 48  # Total number of observations

with pm.Model() as model:
    group_idx = pm.DiscreteUniform("group_idx", lower=0, upper=J-1, shape=N)
    mu = pm.Normal("mu", mu=0, sigma=3, shape=J)
    sigma = pm.Gamma("sigma", alpha=2, beta=2)
    pm.Normal("y", mu=mu[group_idx], sigma=sigma, shape=N)

model.to_graphviz()
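As a point of reference, the pm.DiscreteUniform above plays the same role as drawing integers directly with NumPy; here is a sketch using the rng, J, and N defined above (group_idx_np is a name I made up):

group_idx_np = rng.integers(low=0, high=J, size=N)  # integers in {0, ..., J - 1}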
mu_values, sigma_value, group_idx_values = pm.draw([mu, sigma, group_idx], random_seed=rng)

print("mu:", mu_values)
print("sigma:", sigma_value)
print("group indexes:", group_idx_values, sep="\n")
mu: [ 4.39222186 3.06006526 -1.82575653 3.42869414]
sigma: 0.36418924500311767
group indexes:
[2 0 0 2 1 3 2 0 2 2 3 3 3 3 3 0 3 0 3 3 3 0 3 1 2 2 1 2 2 0 3 0 3 2 3 0 0
0 1 0 1 0 0 2 0 0 2 3]
Not only are the observations no longer sorted by group, but the group sizes are also different.
np.unique(group_idx_values, return_counts=True)
(array([0, 1, 2, 3]), array([16, 5, 12, 15]))
All the rest is exactly the same:
# Fix model parameters and data
model_fixed_parameters = pm.do(
    model,
    {"mu": mu_values, "sigma": sigma_value, "group_idx": group_idx_values}
)

# Display model graph
display(model_fixed_parameters.to_graphviz())

# Simulate values from the outcome
y_values = pm.draw(model_fixed_parameters["y"], random_seed=rng)

# Fix group indexes and observe the outcome values
model_observed_data = pm.observe(
    pm.do(model, {"group_idx": group_idx_values}),
    {"y": y_values}
)
display(model_observed_data.to_graphviz())
And so on…
Good practices
Here, I want to mention a few practices that are recommended when you work with PyMC models. They do not change the previous workflow, but make everything we do with our model more robust and, once we are familiar with it, more intuitive.
Use pm.Data containers to register data variables
After using the do operator to fix parameters as constants, the corresponding nodes will display a Data label, indicating that they represent data rather than random variables.
It turns out it's possible to register all fixed quantities in the model with pm.Data containers. For example, let's have a look at our simple linear regression model.
x_values = np.array(
    [0.786, -0.399, 1.018, 0.657, -0.195, -0.083, 0.651, -0.476, -0.584, -0.194,
     0.282, 0.176, -0.309, 2.022, 0.341, -0.982, -0.904, 0.491, -2.07, -0.568]
)
N = 20

with pm.Model() as model:
    x_data = pm.Data("x", x_values)  # (1)

    alpha = pm.Normal("alpha")
    beta = pm.Normal("beta")
    mu = alpha + beta * x_data  # Use 'x_data' instead of 'x'
    sigma = pm.Gamma("sigma", alpha=2, beta=2)

    pm.Normal("y", mu=mu, sigma=sigma, shape=N)

model.to_graphviz()

- (1) Registers "x" in the model.
And when we modify the model, the Data
for x
is still there.
# Simulate values for the parameters
alpha_value, beta_value, sigma_value = pm.draw([alpha, beta, sigma], random_seed=rng)

# Fix parameter values in the model
model_fixed_parameters = pm.do(
    model,
    {"alpha": alpha_value, "beta": beta_value, "sigma": sigma_value}
)

# Simulate values for the outcome variable
y_values = pm.draw(model_fixed_parameters["y"], random_seed=rng)

# Set the outcome values as observed in the original model
model_observed_data = pm.observe(model, {"y": y_values})

# Visualize graph
model_observed_data.to_graphviz()
Use coords and dims
So far, we have always used shape
to indicate the dimensions of a variable. It works fine, but it’s not the recommended approach when using PyMC. A more robust practice is to use dims
, which allows us to specify the real-world entities that each dimension of the variable corresponds to.
For example:
with pm.Model() as model:
    pm.Normal("mu", shape=4, dims="group")

pm.model_to_graphviz(model)
Now it is clear that the length of mu
is 4 because there are 4 groups.
This becomes truly powerful when used together with coordinates. While dims
defines the dimensions of a variable, coords
provides the length of the dimensions and the labels for the positions (or indices) within those dimensions:
coords = {
    "group": ["Big", "Medium", "Small", "Very small"]
}

with pm.Model(coords=coords) as model:
    pm.Normal("mu", dims="group")

pm.model_to_graphviz(model)
Note how it was not necessary to specify the shape
explicitly. It was inferred from the coordinates.
If two variables have the same dims
, they will have the same shape, and it will be clear their dimensions are mapped to the same entities.
This is how we could have used coords
and dims
in our simple linear regression model:
N = 20

coords = {
    "obs_idx": np.arange(N)
}

with pm.Model(coords=coords) as model:
    x_data = pm.Data("x", value=x_values, dims="obs_idx")

    alpha = pm.Normal("alpha")
    beta = pm.Normal("beta")
    mu = alpha + beta * x_data
    sigma = pm.Gamma("sigma", alpha=2, beta=2)

    pm.Normal("y", mu=mu, sigma=sigma, dims="obs_idx")

model.to_graphviz()
And this is how we could have done it for the normal model for multiple groups:
J = 4   # Number of groups
N = 48  # Total number of observations

coords = {
    "group": ["Group 1", "Group 2", "Group 3", "Group 4"],
    "obs_idx": np.arange(N)
}

with pm.Model(coords=coords) as model:
    group_idx = pm.DiscreteUniform("group_idx", lower=0, upper=J-1, dims="obs_idx")
    mu = pm.Normal("mu", mu=0, sigma=3, dims="group")
    sigma = pm.Gamma("sigma", alpha=2, beta=2)
    pm.Normal("y", mu=mu[group_idx], sigma=sigma, dims="obs_idx")

model.to_graphviz()
mu_values, sigma_value, group_idx_values = pm.draw([mu, sigma, group_idx], random_seed=rng)

# Fix model parameters _and_ data
model_fixed_parameters = pm.do(
    model,
    {"mu": mu_values, "sigma": sigma_value, "group_idx": group_idx_values}
)

# Simulate values from the outcome
y_values = pm.draw(model_fixed_parameters["y"], random_seed=rng)

# Observe the outcome values
model_observed_data = pm.observe(
    pm.do(model, {"group_idx": group_idx_values}),
    {"y": y_values}
)
model_observed_data.to_graphviz()

with model_observed_data:
    idata = pm.sample(random_seed=rng)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
The great thing about coords and dims is that everything is labeled, so we no longer have to mentally map positions to the names we have in our heads.
az.summary(idata, kind="stats")
|             | mean   | sd    | hdi_3% | hdi_97% |
|-------------|--------|-------|--------|---------|
| mu[Group 1] | 0.388  | 0.189 | 0.028  | 0.736   |
| mu[Group 2] | -3.817 | 0.150 | -4.105 | -3.550  |
| mu[Group 3] | -3.890 | 0.157 | -4.173 | -3.580  |
| mu[Group 4] | 3.016  | 0.172 | 2.680  | 3.333   |
| sigma       | 0.568  | 0.062 | 0.453  | 0.681   |
="mu"); az.plot_posterior(idata, var_names
Even slicing of the posterior draws becomes easier.
"mu"].sel(group="Group 3") idata.posterior[
<xarray.DataArray 'mu' (chain: 4, draw: 1000)> Size: 32kB array([[-4.03705149, -3.78900314, -3.76479051, ..., -4.04628137, -3.88638045, -4.21007427], [-3.74006981, -4.03929702, -4.01893038, ..., -3.92875527, -3.76260604, -4.02621378], [-3.77833314, -3.85270145, -3.86176492, ..., -3.76218849, -3.93732774, -3.91962555], [-4.25014046, -3.70552177, -3.99825767, ..., -3.76144302, -3.61515607, -3.93889675]]) Coordinates: * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 8kB 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999 group <U7 28B 'Group 3'
A not-so-trivial logistic regression model
Let’s conclude this blog post by applying the concepts discussed in the previous examples to a more advanced example.
Here, we work on a logistic regression model with varying intercepts and slopes. The covariates are given by a grouping variable and a continuous variable representing age.
Initial attempt
\[ \begin{aligned} Y_i \mid \pi_i &\sim \text{Bernoulli}(\pi_i) \\ \pi_i & = \text{expit}(\alpha_{j[i]} + \beta_{j[i]} \times \text{age}_i) \\ \alpha_j &\sim \text{Normal}(0, 1^2) & \text{for all } j \\ \beta_j &\sim \text{Normal}(0, 1^2) & \text{for all } j \\ \end{aligned} \]
where we have \(J=3\) groups and \(N=200\) observations in total.
J = 3
p = [0.5, 0.3, 0.2]
N = 200

coords = {
    "group": ["A", "B", "C"],
    "obs_idx": np.arange(N)
}

with pm.Model(coords=coords) as model:
    age = pm.Uniform("age", lower=16, upper=70, dims="obs_idx")   # (1)
    group_idx = pm.Categorical("group_idx", p=p, dims="obs_idx")  # (2)

    alpha = pm.Normal("alpha", mu=0, sigma=1, dims="group")
    beta = pm.Normal("beta", mu=0, sigma=1, dims="group")

    pi = pm.Deterministic(  # (3)
        "pi",
        pm.math.sigmoid(alpha[group_idx] + beta[group_idx] * age),
        dims="obs_idx"
    )
    pm.Bernoulli("y", p=pi, dims="obs_idx")

model.to_graphviz()
- (1) Age is simulated using a uniform distribution between 16 and 70.
- (2) For group memberships, we move from a discrete uniform to a categorical distribution, which enables us to assign different probabilities to each group.
- (3) We register the success probability pi in the model graph using pm.Deterministic. Although pi is a random variable, it is represented as deterministic because its values are derived from other random variables and constant values.
At this point, we are quite familiar with what comes next:
# Simulate values for parameters and data
alpha_values, beta_values, age_values, group_idx_values = pm.draw(
    [alpha, beta, age, group_idx],
    random_seed=rng
)

print("alpha:", alpha_values)
print("beta:", beta_values)
print("Groups and their frequencies:")
print(np.unique(group_idx_values, return_counts=True))

# Fix model parameters and data
model_fixed_parameters = pm.do(
    model,
    {
        "alpha": alpha_values,
        "beta": beta_values,
        "age": age_values,
        "group_idx": group_idx_values
    }
)

# Simulate values from the outcome
y_values = pm.draw(model_fixed_parameters["y"], random_seed=rng)

# Fix covariate values and observe the outcome
model_observed_data = pm.observe(
    pm.do(model, {"age": age_values, "group_idx": group_idx_values}),
    {"y": y_values}
)
model_observed_data.to_graphviz()
alpha: [0.47438197 0.20107513 1.36161128]
beta: [ 0.3889577 -1.65269511 2.58244597]
Groups and their frequencies:
(array([0, 1, 2]), array([98, 58, 44]))
And, finally, we can get our sampler rolling:
with model_observed_data:
    idata = pm.sample(random_seed=rng)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.
There were 3306 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters. A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
Well, that doesn’t seem to work.
Taking the covariate scale into account
The main goal of the blog post is not to show you how to diagnose the model and/or sampler when things are not working, so I will jump straight to the solution.
The sampler above is failing because the scale of the age covariate, combined with the slope parameters, produces extremely large values of the linear predictor, which pushes the sigmoid into its saturated regions and breaks sampling.
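As a rough illustration (this check uses scipy, which is not otherwise needed in this post, and plugs in the group B values drawn above, roughly alpha ≈ 0.2 and beta ≈ -1.65):

from scipy.special import expit  # inverse logit, i.e. the sigmoid

# With unscaled ages, the linear predictor easily reaches magnitudes near 100,
# so the success probability saturates numerically and carries almost no information
expit(0.2 - 1.65 * 16)  # age 16: about 4e-12, effectively 0
expit(0.2 - 1.65 * 70)  # age 70: about 1e-50, effectively 0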
One solution is to use a standardized version of age:
\[ \begin{aligned} Y_i \mid \pi_i &\sim \text{Bernoulli}(\pi_i) \\ \pi_i & = \text{expit}(\alpha_{j[i]} + \beta_{j[i]} \times \text{age}^*_i) \\ \alpha_j &\sim \text{Normal}(0, 1^2) & \text{for all } j \\ \beta_j &\sim \text{Normal}(0, 1^2) & \text{for all } j \\ \end{aligned} \]
where:
\[ \text{age}^*_i = \frac{\text{age}_i - \text{mean}(\text{age})}{\text{std}(\text{age})} \]
We need to import pytensor.tensor
to compute the mean and standard deviation of PyMC random variables.
import pytensor.tensor as pt

J = 3
p = [0.5, 0.3, 0.2]
N = 200

coords = {
    "group": ["A", "B", "C"],
    "obs_idx": np.arange(N)
}

with pm.Model(coords=coords) as model:
    age = pm.Uniform("age", lower=16, upper=70, dims="obs_idx")
    group_idx = pm.Categorical("group_idx", p=p, dims="obs_idx")

    alpha = pm.Normal("alpha", mu=0, sigma=1, dims="group")
    beta = pm.Normal("beta", mu=0, sigma=1, dims="group")

    age_scaled = pm.Deterministic(
        "age_scaled", (age - pt.mean(age)) / pt.std(age), dims="obs_idx"
    )

    pi = pm.Deterministic(
        "pi",
        pm.math.sigmoid(alpha[group_idx] + beta[group_idx] * age_scaled),
        dims="obs_idx"
    )
    pm.Bernoulli("y", p=pi, dims="obs_idx")

model.to_graphviz()
It’s lovely to see how things come together in the graph. Let’s simulate now.
# Fix model parameters and data, use the ones we got before
model_fixed_parameters = pm.do(
    model,
    {
        "alpha": alpha_values,
        "beta": beta_values,
        "age": age_values,
        "group_idx": group_idx_values
    }
)

# Simulate values from the outcome
y_values = pm.draw(model_fixed_parameters["y"], random_seed=rng)

# Fix covariate values and observe the outcome
model_observed_data = pm.observe(
    pm.do(model, {"age": age_values, "group_idx": group_idx_values}),
    {"y": y_values}
)
model_observed_data.to_graphviz()
Fingers crossed!
with model_observed_data:
    idata = pm.sample(random_seed=rng)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
="alpha", ref_val=alpha_values.tolist()); az.plot_posterior(idata, var_names
="beta", ref_val=beta_values.tolist()); az.plot_posterior(idata, var_names
The sampler worked perfectly this time and we can also see that the posterior successfully recovered the true parameter values.
Out of model predictions
To conclude this blog post, I would like to quickly show one last feature of PyMC that I really like: out of model predictions.
Imagine we want to see how the probability of success changes with age for the different groups. To create such a plot, we need a grid of age values along with posterior draws of \(\pi\) conditional on each age value and group. However, the age values we currently have are not organized in a grid, they are just random. What should we do?
One can always manually work with the posterior draws to perform the necessary computations. But I usually prefer to let PyMC do that for me.
We are going to create a new PyMC model, which is like the previous logistic regression model, but uses the grid of age values that we want. Then, we are going to ask PyMC to “predict” (compute) the values of \(\pi\) in this model using the draws we obtained with the previous model. That’s the out of model prediction.
age_mean = np.mean(age_values)  # (1)
age_std = np.std(age_values)

age_range = np.arange(18, 71)
age_values = np.tile(age_range, J)
group_idx = np.repeat(np.arange(J), len(age_range))

coords = {
    "group": ["A", "B", "C"],
    "obs_idx": np.arange(len(age_values))
}

with pm.Model(coords=coords) as model:
    age_scaled_data = pm.Data("age_scaled", (age_values - age_mean) / age_std, dims="obs_idx")
    group_idx_data = pm.Data("group_idx", group_idx, dims="obs_idx")

    alpha = pm.Flat("alpha", dims="group")  # (2)
    beta = pm.Flat("beta", dims="group")

    pi = pm.Deterministic(
        "pi",
        pm.math.sigmoid(alpha[group_idx_data] + beta[group_idx_data] * age_scaled_data),
        dims="obs_idx"
    )  # (3)

model.to_graphviz()
- (1) Covariate transformations have to be done with the original summaries.
- (2) The pm.Flat objects represent distributions from which sampling is not possible. We use them to be extra sure the forward sampling below uses the draws available in idata.
- (3) Since our goal is not to predict values of y, it does not need to be defined in this new model.
Notice we do forward sampling, specifying that we want to sample pi and setting predictions=True, which stores the result in a new predictions group.
with model:
    predictions = pm.sample_posterior_predictive(
        idata, var_names=["pi"], predictions=True, random_seed=rng
    )["predictions"]
Sampling: []
1= predictions["pi"].mean(("chain", "draw"))
pi_mean 2= predictions["pi"].quantile((0.025, 0.975), ("chain", "draw"))
pi_lower, pi_upper
= plt.subplots(figsize=(9, 5))
fig, ax
for j, group in enumerate(["A", "B", "C"]):
=group_idx==j), label=group)
ax.plot(age_range, pi_mean.sel(obs_idx
ax.fill_between(
age_range,=group_idx==j),
pi_lower.sel(obs_idx=group_idx==j),
pi_upper.sel(obs_idx=0.5
alpha
)
="Group")
ax.legend(titleset(xlabel="Age", ylabel="P(Y = 1)"); ax.
- (1) Compute the mean of \(\pi\) across all chains and draws. It gives the mean for every observation.
- (2) Compute quantiles of \(\pi\) across all chains and draws. It gives the bounds for a credible interval of \(\pi\) for every observation.
Where to go next
If you, like me, enjoyed exploring these less obvious uses of PyMC, I invite you to check out two other articles I co-authored with Ricardo Vieira. They dive deeper into advanced data simulation and out-of-model predictions with PyMC.
Footnotes
1. You could also say it generates values from a random variable.