import altair as alt
import numpy as np
import pandas as pd
from palmerpenguins import load_penguins
from scipy import optimize
Using Scipy optimize to estimate linear regression parameters
When I teach estimating parameters for linear regression, I like start by framing the problem as one of simple optimization. That is, we need to find the parameter values that minimize squared error. I don’t like to jump immediately to the closed form solution for the best parameter estimates. Instead, I typically set up a simple model in Excel, calculate error for each row in the training data, and I optimize parameter estimates with Excel’s solver function. Using Excel for teaching resonated with me when I took the Fast.AI deep learning course.
The R for Data Science book takes a similar approach when introducing models but using R
’s optim
function instead of Excel.
The following is the example I came up with to reproduce the Excel and R examples in Python using an optimizaton algorithm available from Scipy.
We’ll just use the Adelie
data from Palmer Penguins for demonstration purposes. We are going to make a model of body mass, using the bill length variable.
= load_penguins().dropna().query('species == "Adelie"')
penguins
penguins.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |
We need to define a function that calculates squared error for a given combination of parameters.
def objective(params, X, y):
= params
m, b = m * X + b
y_pred return np.sum((y - y_pred) ** 2)
This callback function (below ⬇️) will get called after each iteration of the optimizer. It records the parameters, the iteration count, and the corresponding squared error for each optimizer iteration.
= []
iter_data = []
n
def callback(params, *args, **kwargs):
len(n))
n.append(
iter_data.append(
{"m": params[0],
"b": params[1],
"n": len(n),
"SSE": objective(
"bill_depth_mm"].values, penguins["body_mass_g"].values
params, penguins[
),
} )
Now we are ready to estimate parameters!
= [0, 0]
initial_guess
= optimize.minimize(
result
objective,
initial_guess,1=(penguins["bill_length_mm"].values, penguins["body_mass_g"].values),
args="Nelder-Mead",
method=callback,
callback
)
2= result.x m, b
- 1
-
These two arrays get passed to the
objective
function defined above. - 2
-
The
result
object holds the optimized parameter values in thex
attribute.
The fun part about doing this in Python
(or R
) is we can readily plot the output. Here is the model visualized using the optimal parameter estimates:
= penguins.assign(pred=lambda df_: m * df_["bill_length_mm"].values + b)
source = {
axis_kws "titleFontSize": 16,
"titleFontWeight": "normal",
"titlePadding": 10,
"labelFontSize": 12,
}
= alt.Chart(source, width=400).encode(
base =alt.X("bill_length_mm")
x=False)
.scale(zero="Bill length, mm", **axis_kws),
.axis(title
)
= base.mark_point(size=50, fill="steelblue", fillOpacity=0.25).encode(
points =alt.Y("body_mass_g").scale(zero=False).axis(title="Body mass, g", **axis_kws),
y
)
= base.mark_line(stroke="coral", strokeWidth=3).encode(
line =alt.Y("pred").axis(title="")
y
)
+ line points
…and here is how the squared error evolved through iterations of the optimizer with different values for the model parameters.
= pd.DataFrame(iter_data)
source
= alt.Chart(source, height=200, width=250).encode(
base =alt.X("n").axis(title="Iteration", **axis_kws),
x=alt.Y(alt.repeat("repeat"), type="quantitative").axis(format=".2s", **axis_kws),
y
)
= base.mark_point(size=25)
points = base.mark_line()
lines
+ lines).repeat(repeat=["m", "b", "SSE"], columns=2) (points
…and we can look at each parameter combination the optimizer considered.
= []
dfs for data in iter_data:
= penguins.assign(
df =lambda df_: data["m"] * df_["bill_length_mm"].values + data["b"],
pred=data["m"],
m=data["b"],
b=data["n"],
iter_=data["SSE"],
SSE
)
dfs.append(df)
= pd.concat(dfs)
source
= alt.data_transformers.disable_max_rows()
_
= alt.Chart(source, width=400).encode(
base =alt.X("bill_length_mm")
x=False)
.scale(zero="Bill length, mm", **axis_kws),
.axis(title
)
= base.mark_point(size=50, fill="steelblue", fillOpacity=0.25).encode(
points =alt.Y("body_mass_g").scale(zero=False).axis(title="Body mass, g", **axis_kws),
y
)
= base.mark_line(strokeWidth=1).encode(
line =alt.Y("pred").axis(title=""),
y=alt.Color("SSE")
color="lightgreyred", reverse=True)
.scale(scheme"SSE")
.title(="top", format=".2s", **axis_kws),
.legend(orient=alt.Detail("iter_"),
detail
)
+ line points