Linear regression in Python using R's formula style

An example from the documentation for the statsmodels package

Apr 01, 2025

In academic statistics, the dominant programming language is R, and it was the first language in which I implemented regression models. If you are comfortable with R's formula syntax, I have some good news for you: you can use a similar syntax for linear regression (and other generalized linear models) in Python. In this article, I will walk through an example of how to do this.

In Python, the statsmodels package contains many useful modules and functions for statistical analyses. It offers 2 broad interfaces for specifying a model.

statsmodels.api uses a syntax that is based on matrices
statsmodels.formula.api uses a syntax that is based on formulas

To mirror the regression formulas in R, you need to use statsmodels.formula.api.

First, import the statsmodels package and the formula module by running the following code. (I am also importing statsmodels.api, which will allow me to access a built-in dataset later.)

import statsmodels.api as sm
import statsmodels.formula.api as smf

Then, load a pandas DataFrame that contains your response variable and explanatory variables. The statsmodels documentation includes an example that uses linear regression to analyze a dataset called “Guerry”. It refers to André-Michel Guerry’s data [1] about literacy, crime, and other social variables in France in the 1830s. Let’s keep only the 4 columns that we will use in our regression model and drop any rows with missing values.

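# Download the Guerry dataset from the HistData R package (requires an internet connection)
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
# Keep the 4 modelling columns and drop rows with missing values
df = dta.data[["Lottery", "Literacy", "Wealth", "Region"]].dropna()
print(df.head())  # in a Jupyter notebook, display(df.head()) also works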

Here is what the dataset looks like.

Finally, use smf.ols() to fit an ordinary least-squares regression. The example uses the per capita lottery wager as the response variable and 3 explanatory variables:

literacy rate
a ranked index of wealth based on the per capita tax on personal property
the region of France

I encourage you to read the full descriptions of all variables in the documentation.

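# Specify the model with an R-style formula: response ~ explanatory variables
mod = smf.ols(formula="Lottery ~ Literacy + Wealth + Region", data=df)
res = mod.fit()  # estimate the coefficients
print(res.summary())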

The resulting output is below.

Notice how the syntax and the output look very similar to what you get from the lm() function in R.
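Beyond the printed summary, the fitted results object can be queried directly. The sketch below uses a small synthetic stand-in for the Guerry columns (the values are invented and the name toy is hypothetical), so it runs without downloading anything. It illustrates how a text column such as Region is expanded into indicator variables, and how to pull out coefficients, confidence intervals, and predictions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Guerry columns (made-up values, not the real data)
rng = np.random.default_rng(1)
n = 60
toy = pd.DataFrame({
    "Literacy": rng.uniform(10, 80, size=n),
    "Wealth": rng.uniform(1, 86, size=n),
    "Region": np.tile(["C", "E", "N"], n // 3),  # 3 region codes, 20 rows each
})
toy["Lottery"] = (
    40 - 0.2 * toy["Literacy"] + 0.4 * toy["Wealth"] + rng.normal(scale=5, size=n)
)

res = smf.ols("Lottery ~ Literacy + Wealth + Region", data=toy).fit()

# Intercept, one coefficient per numeric variable,
# and one per Region level other than the reference level
print(res.params)
print(res.conf_int())  # 95% confidence intervals by default
print(res.rsquared)

# New rows for prediction use the same column names as the formula
new_rows = pd.DataFrame(
    {"Literacy": [30.0, 60.0], "Wealth": [20.0, 70.0], "Region": ["C", "N"]}
)
pred = res.predict(new_rows)
print(pred)
```

With 3 region codes, the model has 5 coefficients: the intercept, Literacy, Wealth, and 2 Region indicators. The omitted level serves as the baseline, which is the same treatment coding that R's lm() applies to factors by default.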

[1] Vincent Arel-Bundock, a political scientist at the Université de Montréal, provides the original URL for this dataset in his catalogue of datasets.
