Using Scipy optimize to estimate linear regression parameters

scipy

linear regression

teaching

Author

ChuckPR

Published

March 17, 2025

When I teach estimating parameters for linear regression, I like start by framing the problem as one of simple optimization. That is, we need to find the parameter values that minimize squared error. I don’t like to jump immediately to the closed form solution for the best parameter estimates. Instead, I typically set up a simple model in Excel, calculate error for each row in the training data, and I optimize parameter estimates with Excel’s solver function. Using Excel for teaching resonated with me when I took the Fast.AI deep learning course.

The R for Data Science book takes a similar approach when introducing models but using R’s optim function instead of Excel.

The following is the example I came up with to reproduce the Excel and R examples in Python using an optimizaton algorithm available from Scipy.

import altair as alt
import numpy as np
import pandas as pd
from palmerpenguins import load_penguins
from scipy import optimize

We’ll just use the Adelie data from Palmer Penguins for demonstration purposes. We are going to make a model of body mass, using the bill length variable.

penguins = load_penguins().dropna().query('species == "Adelie"')

penguins.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007
5	Adelie	Torgersen	39.3	20.6	190.0	3650.0	male	2007

We need to define a function that calculates squared error for a given combination of parameters.

def objective(params, X, y):
    m, b = params
    y_pred = m * X + b
    return np.sum((y - y_pred) ** 2)

This callback function (below ⬇️) will get called after each iteration of the optimizer. It records the parameters, the iteration count, and the corresponding squared error for each optimizer iteration.

iter_data = []
n = []


def callback(params, *args, **kwargs):
    n.append(len(n))
    iter_data.append(
        {
            "m": params[0],
            "b": params[1],
            "n": len(n),
            "SSE": objective(
                params, penguins["bill_depth_mm"].values, penguins["body_mass_g"].values
            ),
        }
    )

Now we are ready to estimate parameters!

initial_guess = [0, 0]

result = optimize.minimize(
    objective,
    initial_guess,
1    args=(penguins["bill_length_mm"].values, penguins["body_mass_g"].values),
    method="Nelder-Mead",
    callback=callback,
)

2m, b = result.x

1: These two arrays get passed to the objective function defined above.
2: The result object holds the optimized parameter values in the x attribute.

The fun part about doing this in Python (or R) is we can readily plot the output. Here is the model visualized using the optimal parameter estimates:

source = penguins.assign(pred=lambda df_: m * df_["bill_length_mm"].values + b)
axis_kws = {
    "titleFontSize": 16,
    "titleFontWeight": "normal",
    "titlePadding": 10,
    "labelFontSize": 12,
}

base = alt.Chart(source, width=400).encode(
    x=alt.X("bill_length_mm")
    .scale(zero=False)
    .axis(title="Bill length, mm", **axis_kws),
)

points = base.mark_point(size=50, fill="steelblue", fillOpacity=0.25).encode(
    y=alt.Y("body_mass_g").scale(zero=False).axis(title="Body mass, g", **axis_kws),
)

line = base.mark_line(stroke="coral", strokeWidth=3).encode(
    y=alt.Y("pred").axis(title="")
)

points + line

…and here is how the squared error evolved through iterations of the optimizer with different values for the model parameters.

source = pd.DataFrame(iter_data)

base = alt.Chart(source, height=200, width=250).encode(
    x=alt.X("n").axis(title="Iteration", **axis_kws),
    y=alt.Y(alt.repeat("repeat"), type="quantitative").axis(format=".2s", **axis_kws),
)

points = base.mark_point(size=25)
lines = base.mark_line()

(points + lines).repeat(repeat=["m", "b", "SSE"], columns=2)

…and we can look at each parameter combination the optimizer considered.

dfs = []
for data in iter_data:
    df = penguins.assign(
        pred=lambda df_: data["m"] * df_["bill_length_mm"].values + data["b"],
        m=data["m"],
        b=data["b"],
        iter_=data["n"],
        SSE=data["SSE"],
    )
    dfs.append(df)

source = pd.concat(dfs)

_ = alt.data_transformers.disable_max_rows()

base = alt.Chart(source, width=400).encode(
    x=alt.X("bill_length_mm")
    .scale(zero=False)
    .axis(title="Bill length, mm", **axis_kws),
)

points = base.mark_point(size=50, fill="steelblue", fillOpacity=0.25).encode(
    y=alt.Y("body_mass_g").scale(zero=False).axis(title="Body mass, g", **axis_kws),
)

line = base.mark_line(strokeWidth=1).encode(
    y=alt.Y("pred").axis(title=""),
    color=alt.Color("SSE")
    .scale(scheme="lightgreyred", reverse=True)
    .title("SSE")
    .legend(orient="top", format=".2s", **axis_kws),
    detail=alt.Detail("iter_"),
)

points + line