Linear regression in Python using R's formula style
An example from the documentation for the statsmodels package
In academic statistics, the dominant programming language is R, and that was my first language for implementing regression models. If you are familiar and comfortable with its formula syntax, I have some good news for you: you can use a similar syntax for running linear regression (and other generalized linear models) in Python. In this article, I will walk through an example of how to do this.
In Python, the statsmodels package contains many useful modules and functions for statistical analyses. There are 2 broad ways to specify a regression model with it.
statsmodels.api uses a syntax that is based on matrices
statsmodels.formula.api uses a syntax that is based on formulas
To mirror the regression formulas in R, you need to use statsmodels.formula.api.
First, import the statsmodels package and the formula module by running the following code. (I am also importing statsmodels.api, which provides the function that I will use to fetch an example dataset later.)
import statsmodels.api as sm
import statsmodels.formula.api as smf
Then, load a Pandas dataframe that contains your response variable and explanatory variables. Within the documentation for statsmodels, there is an example that uses linear regression to analyze a dataset called “Guerry”. It refers to Andre-Michel Guerry’s data [1] about literacy, crime, and other social variables in France in the 1830s. Let’s keep only the 4 columns that we will use in our regression model.
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
df = dta.data[["Lottery", "Literacy", "Wealth", "Region"]].dropna()
print(df.head())  # in a Jupyter notebook, display(df.head()) renders a formatted table instead
Here is what the dataset looks like.
Finally, use smf.ols() to implement ordinary least-squares regression. The example uses a per-capita lottery wager as the response variable, and 3 explanatory variables:
literacy rate
a ranked index of wealth based on the per capita tax on personal property
the region of France
I encourage you to read the full descriptions of all variables in the documentation.
mod = smf.ols(formula="Lottery ~ Literacy + Wealth + Region", data=df)
res = mod.fit()
print(res.summary())
The resulting output is below.
Notice how the syntax and the output look very similar to what you get from the lm() function in R.
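Beyond the printed summary, the fitted results object exposes the estimates as attributes, which is handy when you want to reuse them in later code. The sketch below uses synthetic stand-in data rather than the Guerry dataset, so the names toy, x, region, and y are hypothetical. It also shows how a text column in the formula is expanded into indicator variables, much like a factor in R.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical; not the Guerry dataset)
rng = np.random.default_rng(1)
n = 90
toy = pd.DataFrame({
    "x": rng.normal(size=n),
    "region": np.repeat(["A", "B", "C"], n // 3),
})
offsets = toy["region"].map({"A": 0.0, "B": 1.0, "C": -1.0})
toy["y"] = 2.0 + 0.5 * toy["x"] + offsets + rng.normal(scale=0.2, size=n)

res_toy = smf.ols("y ~ x + region", data=toy).fit()

# One intercept, one slope for x, and one indicator coefficient for each
# non-reference level of region (the first level acts as the baseline).
print(res_toy.params)

# Other quantities from the summary table are available as well.
print(res_toy.rsquared)    # coefficient of determination
print(res_toy.conf_int())  # 95% confidence intervals by default
```

In the Guerry model, the Region column is handled in the same way, which is why the summary lists several Region coefficients rather than one.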
[1] Vincent Arel-Bundock, a political scientist at the Université de Montréal, provided the original URL among his catalogue of datasets.