import altair as alt
import numpy as np
import pandas as pd
from palmerpenguins import load_penguins
from scipy import optimizeUsing Scipy optimize to estimate linear regression parameters
When I teach estimating parameters for linear regression, I like start by framing the problem as one of simple optimization. That is, we need to find the parameter values that minimize squared error. I don’t like to jump immediately to the closed form solution for the best parameter estimates. Instead, I typically set up a simple model in Excel, calculate error for each row in the training data, and I optimize parameter estimates with Excel’s solver function. Using Excel for teaching resonated with me when I took the Fast.AI deep learning course.
The R for Data Science book takes a similar approach when introducing models but using R’s optim function instead of Excel.
The following is the example I came up with to reproduce the Excel and R examples in Python using an optimizaton algorithm available from Scipy.
We’ll just use the Adelie data from Palmer Penguins for demonstration purposes. We are going to make a model of body mass, using the bill length variable.
penguins = load_penguins().dropna().query('species == "Adelie"')
penguins.head()| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 | 
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 | 
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 | 
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 | 
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 | 
We need to define a function that calculates squared error for a given combination of parameters.
def objective(params, X, y):
    m, b = params
    y_pred = m * X + b
    return np.sum((y - y_pred) ** 2)This callback function (below ⬇️) will get called after each iteration of the optimizer. It records the parameters, the iteration count, and the corresponding squared error for each optimizer iteration.
iter_data = []
n = []
def callback(params, *args, **kwargs):
    n.append(len(n))
    iter_data.append(
        {
            "m": params[0],
            "b": params[1],
            "n": len(n),
            "SSE": objective(
                params, penguins["bill_depth_mm"].values, penguins["body_mass_g"].values
            ),
        }
    )Now we are ready to estimate parameters!
initial_guess = [0, 0]
result = optimize.minimize(
    objective,
    initial_guess,
1    args=(penguins["bill_length_mm"].values, penguins["body_mass_g"].values),
    method="Nelder-Mead",
    callback=callback,
)
2m, b = result.x- 1
- 
These two arrays get passed to the objectivefunction defined above.
- 2
- 
The resultobject holds the optimized parameter values in thexattribute.
The fun part about doing this in Python (or R) is we can readily plot the output. Here is the model visualized using the optimal parameter estimates:
source = penguins.assign(pred=lambda df_: m * df_["bill_length_mm"].values + b)
axis_kws = {
    "titleFontSize": 16,
    "titleFontWeight": "normal",
    "titlePadding": 10,
    "labelFontSize": 12,
}
base = alt.Chart(source, width=400).encode(
    x=alt.X("bill_length_mm")
    .scale(zero=False)
    .axis(title="Bill length, mm", **axis_kws),
)
points = base.mark_point(size=50, fill="steelblue", fillOpacity=0.25).encode(
    y=alt.Y("body_mass_g").scale(zero=False).axis(title="Body mass, g", **axis_kws),
)
line = base.mark_line(stroke="coral", strokeWidth=3).encode(
    y=alt.Y("pred").axis(title="")
)
points + line…and here is how the squared error evolved through iterations of the optimizer with different values for the model parameters.
source = pd.DataFrame(iter_data)
base = alt.Chart(source, height=200, width=250).encode(
    x=alt.X("n").axis(title="Iteration", **axis_kws),
    y=alt.Y(alt.repeat("repeat"), type="quantitative").axis(format=".2s", **axis_kws),
)
points = base.mark_point(size=25)
lines = base.mark_line()
(points + lines).repeat(repeat=["m", "b", "SSE"], columns=2)…and we can look at each parameter combination the optimizer considered.
dfs = []
for data in iter_data:
    df = penguins.assign(
        pred=lambda df_: data["m"] * df_["bill_length_mm"].values + data["b"],
        m=data["m"],
        b=data["b"],
        iter_=data["n"],
        SSE=data["SSE"],
    )
    dfs.append(df)
source = pd.concat(dfs)
_ = alt.data_transformers.disable_max_rows()
base = alt.Chart(source, width=400).encode(
    x=alt.X("bill_length_mm")
    .scale(zero=False)
    .axis(title="Bill length, mm", **axis_kws),
)
points = base.mark_point(size=50, fill="steelblue", fillOpacity=0.25).encode(
    y=alt.Y("body_mass_g").scale(zero=False).axis(title="Body mass, g", **axis_kws),
)
line = base.mark_line(strokeWidth=1).encode(
    y=alt.Y("pred").axis(title=""),
    color=alt.Color("SSE")
    .scale(scheme="lightgreyred", reverse=True)
    .title("SSE")
    .legend(orient="top", format=".2s", **axis_kws),
    detail=alt.Detail("iter_"),
)
points + line