An implementation of Wilkinson's formula language for statistical models à la lme4

Project description

formulae

formulae is a Python library that implements Wilkinson's formulas for statistical models à la lme4. The main difference with other implementations like Patsy or formulaic is that formulaic can work with formulas describing a model with both common and group specific effects (a.k.a. fixed and random effects, respectively).

This package has been written to make it easier to specify models with group effects in Bambi, a package that makes it easy to work with Bayesian GLMMs in Python, but it could be used independently as a backend for another library.

Note: While this package is working, there is a 100% chance of finding bugs within the code. You are encouraged to play with this library and give feedback about it, but it is not recommended to incorporate formulae in a larger project at this early stage of development.

Installation

formulae requires a working Python interpreter (3.7+) and the libraries numpy, scipy and pandas with versions specified in the requirements.txt file.

Assuming a standard Python environment is installed on your machine (including pip), the development version of formulae can be installed in one line using pip:

pip install git+https://github.com/bambinos/formulae.git

Example code

The main function you encounter in this library is design_matrices(). It returns an object of class DesignMatrices that contains information about the response, the common effects, and the group specific effects that can be accessed with the attributes .response, .common, and .group respectively.

import numpy as np 
import pandas as pd

from formulae import design_matrices

np.random.seed(1234)
df = pd.DataFrame({
    'y_num': np.random.normal(size=10),
    'y_cat': np.random.choice(['A', 'B'], size=10),
    'x': np.random.normal(size=10),
    'g': np.random.choice(['Group 1', 'Group 2', 'Group 3'], size=10)
})
df

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	y_num	y_cat	x	g
0	0.471435	B	-0.304260	Group 1
1	-1.190976	A	0.861661	Group 3
2	1.432707	B	-0.689927	Group 3
3	-0.312652	B	0.187497	Group 1
4	-0.720589	A	0.604309	Group 2
5	0.887163	A	-0.183014	Group 2
6	0.859588	B	-1.126502	Group 1
7	-0.636524	A	1.658873	Group 1
8	0.015696	A	-0.660441	Group 1
9	-2.242685	B	1.041086	Group 2

Example 1

A simple linear model with numeric response, numeric common effects and varying slope and intercept for each level of g.

design = design_matrices("y_num ~ x + (x|g)", df)

print(design.response)
print(design.response.design_vector)

ResponseVector(name=y_num, type=numeric, length=10)
[[ 0.47143516]
 [-1.19097569]
 [ 1.43270697]
 [-0.3126519 ]
 [-0.72058873]
 [ 0.88716294]
 [ 0.85958841]
 [-0.6365235 ]
 [ 0.01569637]
 [-2.24268495]]

print(design.common)
print(design.common.design_matrix) # this can be printed as a pandas.DataFrame with design.common.as_dataframe()

CommonEffectsMatrix(
  shape: (10, 2),
  terms: {
    'Intercept': {type=Intercept, cols=slice(0, 1, None), full_names=['Intercept']},
    'x': {type=numeric, cols=slice(1, 2, None), full_names=['x']}
  }
)
[[ 1.         -0.30426018]
 [ 1.          0.861661  ]
 [ 1.         -0.68992667]
 [ 1.          0.18749737]
 [ 1.          0.60430874]
 [ 1.         -0.18301422]
 [ 1.         -1.12650247]
 [ 1.          1.65887284]
 [ 1.         -0.66044141]
 [ 1.          1.04108597]]

Before exploring the group level effects we mention that formulae returns a sparse matrix in CSC format. If it is the case the matrix is not that big and you want to see it as a whole, you can call design.group.design_matrix.toarray()

print(design.group)
print(design.group.design_matrix.toarray())

GroupEffectsMatrix(
  shape: (20, 6),
  terms: {
    '1|g': {type=Intercept, groups=['Group 1', 'Group 3', 'Group 2'], idxs=(slice(0, 10, None), slice(0, 3, None)), full_names=['1|g[Group 1]', '1|g[Group 3]', '1|g[Group 2]']},
    'x|g': {type=numeric, groups=['Group 1', 'Group 3', 'Group 2'], idxs=(slice(10, 20, None), slice(3, 6, None)), full_names=['x|g[Group 1]', 'x|g[Group 3]', 'x|g[Group 2]']}
  }
)
[[ 1.          0.          0.          0.          0.          0.        ]
 [ 0.          1.          0.          0.          0.          0.        ]
 [ 0.          1.          0.          0.          0.          0.        ]
 [ 1.          0.          0.          0.          0.          0.        ]
 [ 0.          0.          1.          0.          0.          0.        ]
 [ 0.          0.          1.          0.          0.          0.        ]
 [ 1.          0.          0.          0.          0.          0.        ]
 [ 1.          0.          0.          0.          0.          0.        ]
 [ 1.          0.          0.          0.          0.          0.        ]
 [ 0.          0.          1.          0.          0.          0.        ]
 [ 0.          0.          0.         -0.30426018  0.          0.        ]
 [ 0.          0.          0.          0.          0.861661    0.        ]
 [ 0.          0.          0.          0.         -0.68992667  0.        ]
 [ 0.          0.          0.          0.18749737  0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.60430874]
 [ 0.          0.          0.          0.          0.         -0.18301422]
 [ 0.          0.          0.         -1.12650247  0.          0.        ]
 [ 0.          0.          0.          1.65887284  0.          0.        ]
 [ 0.          0.          0.         -0.66044141  0.          0.        ]
 [ 0.          0.          0.          0.          0.          1.04108597]]

But if you are interested only in the sub-matrix corresponding to a given group specific effect, you can use design.group['level_name'] as follows

design.group['x|g']

array([[-0.30426018,  0.        ,  0.        ],
       [ 0.        ,  0.861661  ,  0.        ],
       [ 0.        , -0.68992667,  0.        ],
       [ 0.18749737,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.60430874],
       [ 0.        ,  0.        , -0.18301422],
       [-1.12650247,  0.        ,  0.        ],
       [ 1.65887284,  0.        ,  0.        ],
       [-0.66044141,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  1.04108597]])

Example 2

A categorical response and a linear predictor that has an interaction between a categorical variable and a function call. In this example we use the variable[level] notation that is taken from Bambi that makes it easier to indicate which level represents a success in a categorical response.

design = design_matrices("y_cat[A] ~ np.exp(x) * g", df)

print(design.response)
print(design.response.design_vector)

ResponseVector(name=y_cat, type=categoric, length=10, refclass=A)
[[0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [0]]

design.common

CommonEffectsMatrix(
  shape: (10, 7),
  terms: {
    'Intercept': {type=Intercept, cols=slice(0, 1, None), full_names=['Intercept']},
    'np.exp(x)': {type=call, cols=slice(1, 2, None), full_names=['np.exp(x)']},
    'g': {type=categoric, levels=['Group 1', 'Group 3', 'Group 2'], reference=Group 1, encoding=reduced, cols=slice(2, 4, None), full_names=['g[Group 3]', 'g[Group 2]']},
    'np.exp(x):g': {type=interaction, vars={
      np.exp(x): {type=call},
      g: {type=categoric, levels=['Group 1', 'Group 3', 'Group 2'], reference=Group 1, encoding=full}
    }}
  }
)

design.common.as_dataframe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Intercept	np.exp(x)	g[Group 3]	g[Group 2]	np.exp(x):g[Group 1]	np.exp(x):g[Group 3]	np.exp(x):g[Group 2]
0	1.0	0.737669	0.0	0.0	0.737669	0.000000	0.000000
1	1.0	2.367089	1.0	0.0	0.000000	2.367089	0.000000
2	1.0	0.501613	1.0	0.0	0.000000	0.501613	0.000000
3	1.0	1.206227	0.0	0.0	1.206227	0.000000	0.000000
4	1.0	1.829987	0.0	1.0	0.000000	0.000000	1.829987
5	1.0	0.832756	0.0	1.0	0.000000	0.000000	0.832756
6	1.0	0.324165	0.0	0.0	0.324165	0.000000	0.000000
7	1.0	5.253386	0.0	0.0	5.253386	0.000000	0.000000
8	1.0	0.516623	0.0	0.0	0.516623	0.000000	0.000000
9	1.0	2.832291	0.0	1.0	0.000000	0.000000	2.832291

Notes

The data argument only accepts objects of class pandas.DataFrame.
y ~ . is not implemented and won't be implemented in a first version. However, it is planned to be included in the future.

Project details

Release history Release notifications | RSS feed

0.5.4

Jul 20, 2024

0.5.3

Feb 20, 2024

0.5.2

Feb 18, 2024

0.5.1

Sep 19, 2023

0.5.0

Jun 28, 2023

0.4.0

May 24, 2023

0.3.4

May 18, 2022

0.3.3

Apr 14, 2022

0.3.2

Apr 13, 2022

0.3.1

Apr 12, 2022

0.3.0

Mar 11, 2022

0.2.0

Sep 15, 2021

0.1.4

Aug 9, 2021

0.1.3

Aug 2, 2021

0.1.2

Jul 30, 2021

0.1.1

Jun 18, 2021

0.1.0

Jun 12, 2021

0.0.10

May 10, 2021

0.0.9

Apr 5, 2021

0.0.8

Apr 5, 2021

This version

0.0.7

Mar 9, 2021

0.0.6

Mar 4, 2021

0.0.5

Feb 23, 2021

0.0.4

Feb 22, 2021

0.0.3

Feb 19, 2021

0.0.2

Feb 9, 2021

0.0.1

Feb 9, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

formulae-0.0.7.tar.gz (93.6 kB view hashes)

Uploaded Mar 9, 2021 Source

Built Distribution

formulae-0.0.7-py3-none-any.whl (28.5 kB view hashes)

Uploaded Mar 9, 2021 Python 3

Hashes for formulae-0.0.7.tar.gz

Hashes for formulae-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`aca06d8404c2075943eaa5a9610a0edfe12a25d312cdaaa3fcfd3c9330f08e83`
MD5	`3fb9a8025711a64635b1428005852617`
BLAKE2b-256	`52e5e28b24a113b8e6b46347b66515b1fa17fd21db2242ac05d3685940b4a290`

Hashes for formulae-0.0.7-py3-none-any.whl

Hashes for formulae-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`661934b8e07a6033ca6bc87337089e8f37b717e823eb6227efeb224e05851ac9`
MD5	`43bc2874771cc8e3dddef52b779e3134`
BLAKE2b-256	`643ffe656dc0aaf64127325e29ac85ea6ddaa3272c08e3c21e29bd843d103b73`