```
library(dplyr)
library(ggplot2)
library(lme4)
library(patchwork)
```

# Design matrices for group-specific effects in formulae and lme4

Bambi uses the library formulae to automatically construct design matrices for both common and group-specific effects. This post compares design matrices for group-specific effects obtained with formulae for a variety of scenarios involving categorical variables with the ones obtained with the R package lme4.

## Introduction

A linear mixed model can be written as

\[ \boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{Z}\boldsymbol{u} + \boldsymbol{\epsilon} \]

where \(\boldsymbol{X}\) and \(\boldsymbol{Z}\) are the two design matrices we need to somehow construct when dealing with this type of model. \(\boldsymbol{X}\) is the design matrix for the common (a.k.a. fixed) effects, and \(\boldsymbol{Z}\) is the design matrix for the group-specific (a.k.a. random or varying) effects.

It is quite easy to obtain the design matrix \(\boldsymbol{X}\) in R using its popular formula interface. In Python, patsy provides equivalent functionality. Unfortunately, there aren’t as many alternatives to compute the matrix \(\boldsymbol{Z}\).

In R, there’s lme4, the statistical package par excellence for mixed models. It extends the base formula interface to include group-specific effects via the pipe operator (`|`

) and internally computes both \(\boldsymbol{X}\) and \(\boldsymbol{Z}\) without the user noticing. That’s great!

In Python, we are working on formulae, a library we use to handle mixed model formulas in Bambi. In this process, I’ve found Fitting Linear Mixed-Effects Models Using lme4 vignette extremely useful when figuring out how to compute the design matrix for the group-specific effects.

Today, I was adding tests to make sure we are constructing \(\boldsymbol{Z}\) appropriately and found myself comparing the matrices obtained with formulae with matrices obtained with **lme4**. Then I was like … why not making this a blog post? 🤔

… and so here we are! But before we get started, just note this post mixes both R and Python code. I will try to be explicit when I’m using one language or the other. But if you’re reading a chunk and it looks like Python, it’s Python. And if it looks like R… you guessed! It’s R.

## Setup

`from formulae import design_matrices`

## Problem

Here we will be comparing design matrices for the group-specific terms in a mixed-effects model obtained with both **lme4** and **formulae**. We’re using the dataset `Pixel`

that comes with the R package **nlme**.

```
data("Pixel", package = "nlme")
head(Pixel)
```

```
Grouped Data: pixel ~ day | Dog/Side
Dog Side day pixel
1 1 R 0 1045.8
2 1 R 1 1044.5
3 1 R 2 1042.9
4 1 R 4 1050.4
5 1 R 6 1045.2
6 1 R 10 1038.9
```

We’re not interested in how to fit a certain model here. We’re interested in constructing the design matrix for group-specific effects with different characteristics. We use the following formula

`= ~ (0 + day | Dog) + (1 | Side / Dog) f1 `

where each part can be interpreted as follows

`(0 + day | Dog)`

means that`day`

has a group-specific slope for each`Dog`

. This is usually known as a random slope. The`0`

indicates not to add the default group-specific intercept (because it’s added next).`(1 | Side / Dog)`

is equivalent to`(1 | Side) + (1 | Dog:Side)`

. This means there’s a varying intercept for each`Side`

and a varying intercept for each combination of`Dog`

and`Side`

. In other words, we have a nested group-specific intercept, where`Dog`

is nested within`Side`

.

`= mkReTrms(findbars(f1), model.frame(subbars(f1), data = Pixel)) lme4_terms `

`lme4_terms`

contains much more information than what we need for this post. We mostly use `lme4_terms$Ztlist`

, which is a list that contains the transpose of the group-specific effects model matrix, separated by term. These matrices are stored as sparse matrices of `dgCMatrix`

class. If we want to have the sub-matrix for a given group-specific term as a base R matrix, we have to do `as.matrix(t(lme4_terms$Ztlist$[["term"]]))`

.

`names(lme4_terms$Ztlist)`

`[1] "1 | Dog:Side" "0 + day | Dog" "1 | Side" `

We have three group-specific terms. The first and the last ones are the group-specific intercepts we mentioned. These are the result of the nested group-specific intercept `(1 | Side / Dog)`

. `Dog`

is nested within `Side`

and consequently there’s an intercept varying among `Side`

and another varying among `Dog`

within `Side`

. The second term, `0 + day | Dog`

, represents varying slope of `day`

for each level of `Dog`

.

We finally store the sub-matrix for each term in different objects that we’ll later use when comparing results with those obtained with **formulae**.

```
= as.matrix(t(lme4_terms$Ztlist$`0 + day | Dog`))
day_by_dog = as.matrix(t(lme4_terms$Ztlist$`1 | Side`))
intercept_by_side = as.matrix(t(lme4_terms$Ztlist$`1 | Dog:Side`)) intercept_by_side_dog
```

On the other hand, in Python, we use `design_matrices()`

from the **formulae** library to obtain a `DesignMatrices`

object. All the information associated with the group-specific terms is contained in the `.group`

attribute and the sub-matrix corresponding to a particular term is accessed with `.group[term_name]`

.

`= design_matrices("(0 + day | Dog) + (1 | Side / Dog)", r.Pixel) dm `

There’s a dictionary called `terms_info`

within `dm.group`

. To see the names of the group-specific effects we just retrieve the keys.

` dm.group.terms.keys()`

`dict_keys(['day|Dog', '1|Side', '1|Side:Dog'])`

Names differ a little with the ones from **lme4**, but they represent the same thing.

```
= dm.group['day|Dog']
day_by_dog = dm.group['1|Side']
intercept_by_side = dm.group['1|Side:Dog'] intercept_by_side_dog
```

Now let’s compare those matrices!

## Design matrices for `(day|Dog)`

Rectangles in the following plot correspond to the cells in the matrix. The lowest value for `day`

is 0, represented by violet, and the highest value is 21, represented by yellow. The 10 columns represent the 10 groups in `Dog`

, and the rows represent the observations in `Pixel`

. Here, and also in the other cases, the left panel contains the matrix obtained with **lme4** and the right panel the one produced with **formulae**.

In this first case, both panels are representing the same data so we can happily conclude the result obtained with **formulae** matches the one from **lme4**. Yay!!

But we’re humans and our eyes can fail so it’s better to always check appropiately with

`all(py$day_by_dog == day_by_dog)`

`[1] TRUE`

### Design matrices for `(1|Side)`

Here the first column represents `Side == "L"`

and the second column represents `Side == "R"`

. Since we’re dealing with an intercept, violet means 0 and yellow means 1. In this case it is much easier to see both results match.

`all(py$intercept_by_side == intercept_by_side)`

`[1] TRUE`

### Design matrices for `(1|Side:Dog)`

But things are not always as one wishes. It’s clear from the following plot that both matrices aren’t equal here.

But don’t worry. We’re not giving up. We still have things to do^{1}. We can check what are the groups being represented in the columns of the matrices we’re plotting.

`colnames(intercept_by_side_dog)`

```
[1] "1:L" "1:R" "10:L" "10:R" "2:L" "2:R" "3:L" "3:R" "4:L" "4:R"
[11] "5:L" "5:R" "6:L" "6:R" "7:L" "7:R" "8:L" "8:R" "9:L" "9:R"
```

`"1|Side:Dog"].labels dm.group.terms[`

`['1|Side[L]:Dog[1]', '1|Side[L]:Dog[10]', '1|Side[L]:Dog[2]', '1|Side[L]:Dog[3]', '1|Side[L]:Dog[4]', '1|Side[L]:Dog[5]', '1|Side[L]:Dog[6]', '1|Side[L]:Dog[7]', '1|Side[L]:Dog[8]', '1|Side[L]:Dog[9]', '1|Side[R]:Dog[1]', '1|Side[R]:Dog[10]', '1|Side[R]:Dog[2]', '1|Side[R]:Dog[3]', '1|Side[R]:Dog[4]', '1|Side[R]:Dog[5]', '1|Side[R]:Dog[6]', '1|Side[R]:Dog[7]', '1|Side[R]:Dog[8]', '1|Side[R]:Dog[9]']`

And there it is! Matrices differ because columns are representing different groups. In **lme4**, groups are looping first along `Dog`

and then along `Side`

, while in **formulae** it is the other way around.

We can simply re-order the columns of one of the matrices and generate and check whether they match or not.

```
= as.data.frame(py$intercept_by_side_dog)
intercept_by_side_dog_f colnames(intercept_by_side_dog_f) = py$dm$group$terms[["1|Side:Dog"]]$groups
= paste(
names_lme4_order rep(c("L", "R"), 10),
rep(c(1, 10, 2, 3, 4, 5, 6, 7, 8, 9), each = 2),
sep = ":"
)= intercept_by_side_dog_f[names_lme4_order] %>%
intercept_by_side_dog_f as.matrix() %>%
unname()
```

`all(intercept_by_side_dog_f == intercept_by_side_dog)`

`[1] TRUE`

And there it is! Results match 🤩

## Another formula

This other formula contains an interaction between categorical variables as the expression of the group-specific term, which is something we’re not covering above. In this case, we are going to subset the data so the design matrices are smaller and we can understand what’s going on with more ease.

```
# Subset data
= Pixel %>%
Pixel2 filter(Dog %in% c(1, 2, 3), day %in% c(2, 4, 6)) %>%
mutate(Dog = forcats::fct_drop(Dog))
# Create terms with lme4
= ~ day + (0 + Dog:Side | day)
f2 = mkReTrms(findbars(f2), model.frame(subbars(f2), data = Pixel2))
lme4_terms = as.matrix(t(lme4_terms$Ztlist$`0 + Dog:Side | day`)) dog_and_side_by_day
```

And now with `design_matrices()`

in Python.

```
# Create terms with
= design_matrices("(0 + Dog:Side|day)", r.Pixel2)
dm = dm.group["Dog:Side|day"] dog_and_side_by_day
```

### Design matrix for `(Dog:Side|day)`

Although this term is called slope, it is not actually a slope like the one for `(day|Dog)`

. Since both `Dog`

and `Side`

are categorical, the entries of this matrix consist of zeros and ones.

We have the same problem than above, matrices don’t match. So we know what to do: look at the groups represented in the columns.

`colnames(dog_and_side_by_day)`

` [1] "2" "2" "2" "2" "2" "2" "4" "4" "4" "4" "4" "4" "6" "6" "6" "6" "6" "6"`

`"Dog:Side|day"].labels dm.group.terms[`

`['Dog[1]:Side[L]|day[2.0]', 'Dog[1]:Side[R]|day[2.0]', 'Dog[2]:Side[L]|day[2.0]', 'Dog[2]:Side[R]|day[2.0]', 'Dog[3]:Side[L]|day[2.0]', 'Dog[3]:Side[R]|day[2.0]', 'Dog[1]:Side[L]|day[4.0]', 'Dog[1]:Side[R]|day[4.0]', 'Dog[2]:Side[L]|day[4.0]', 'Dog[2]:Side[R]|day[4.0]', 'Dog[3]:Side[L]|day[4.0]', 'Dog[3]:Side[R]|day[4.0]', 'Dog[1]:Side[L]|day[6.0]', 'Dog[1]:Side[R]|day[6.0]', 'Dog[2]:Side[L]|day[6.0]', 'Dog[2]:Side[R]|day[6.0]', 'Dog[3]:Side[L]|day[6.0]', 'Dog[3]:Side[R]|day[6.0]']`

But this they represent the same groups^{2}. We can look if there’s a difference in how the interactions are ordered within each group.

`$cnms lme4_terms`

```
$day
[1] "Dog1:SideL" "Dog2:SideL" "Dog3:SideL" "Dog1:SideR" "Dog2:SideR"
[6] "Dog3:SideR"
```

And again, thankfully, we see there’s a difference in how columns are being ordered. Let’s see if matrices match after we reorder the one obtained with **formulae**.

```
= as.data.frame(py$dog_and_side_by_day)
dog_and_side_by_day_f colnames(dog_and_side_by_day_f) = py$dm$group$terms[["Dog:Side|day"]]$labels
= rep(rep(c("L", "R"), each = 3), 3)
side = rep(1:3, 6)
dog = rep(c("2.0", "4.0", "6.0"), each = 6)
day = glue::glue("Dog[{dog}]:Side[{side}]|day[{day}]")
names_lme4_order = dog_and_side_by_day_f[names_lme4_order] %>%
dog_and_side_by_day_f as.matrix() %>%
unname()
```

`all(dog_and_side_by_day_f == dog_and_side_by_day)`

`[1] TRUE`

## Conclusion

Although **formulae** works differently than **lme4**, and has different goals, we showed that **formulae** produces the same design matrices as **lme4** for the variety of examples we covered. While case-based comparisons like these are not what one should rely on when writing software, the examples here were really helpful when working on the implementation in **formulae** and writing the corresponding tests. And if this post helps someone to better understand what’s going on when working with design matrices associated with group-specific effects, it will have been even more worth it!