I was working on a side project where I needed to find the linear fit to a set of data points. A linear fit is also known as a “linear approximation” or “linear regression”. This is quite easy using a Numbers spreadsheet. Numbers will even show you equation of the line in slope-intercept form:

\[y = mx + b\]

Unfortunately, there is no way that I know of to get the slope and Y-intercept from the Numbers plot besides visual inspection. I also wanted a way to do this from the command line. This post will explain how to do this using Python and the NumPy library.

TLDR: Python One-Liners

While the rest of the post goes into more detail, here are two quick Python one-liners to find the slope and Y-intercept, given two NumPy arrays, x and y. First, with Polynomial.fit():

b, m = np.polynomial.polynomial.Polynomial.fit(x, y, 1).convert().coef

And second, with Polynomial.polyfit():

b, m = np.polynomial.polynomial.polyfit(x, y, 1)

Both of these give you the slope in m and the Y-intercept in b. I also created a Jupyter notebook demonstrating these APIs.

I’m honestly not sure why you would pick one over the other. Polynomial.polyfit() is less code, so that seems like the better option, to me. But if you know, please contact me! And please read on for more details.

Using Numbers

Here’s the sample data we’ll be using throughout the rest of the post:

X Y
0 -1
1 0.2
2 0.9
3 2.1

In Numbers, put this X and Y data into a table. Then add a Scatter Plot and enable a “Linear Trendline” with “Show Equation” checked. You should see a plot, as in this screenshot:

Thumbnail of a screenshot of a Numbers spreadsheet plot

The plot shows the points in blue and a line in red as the “best fit” line for the points. The legend shows the formula of the line as:

\[y = x - 0.95\]

In other words, the “best fit” line has a slope of 1 and a Y-intercept of -0.95. The goal here is to take the same input data and come up with the same slope and Y-Intercept using Python.

Using Python and NumPy

For those that don’t know, NumPy is a fantastic Python library for doing numerical calculations. And one of the many things it can do is a linear fit. Unfortunately, it also has multiple ways to do this, which I find a bit confusing. Here are all the ways I found:

  1. numpy.polyfit()
  2. numpy.polynomial.Polynomial.fit()
  3. numpy.polynomial.Polynomial.polyfit()

The first option, numpy.polyfit(), is considered legacy, and the documentation says to use numpy.polynomial:

As noted above, the poly1d class and associated functions defined in numpy.lib.polynomial, such as numpy.polyfit and numpy.poly, are considered legacy and should not be used in new code. Since NumPy version 1.4, the numpy.polynomial package is preferred for working with polynomials.

Because this is deprecated, we’ll skip over this option and look at the other two.

Using Polynomial.fit

The Polynomial transition guide specifically says that Polynomial.fit() is the replacement for numpy.polyfit(), so we’ll look at this option first.

The first order of business is to get the data into Python. For demonstration purposes, we’ll create two separate NumPy arrays in code:

import numpy as np

x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])

It is, however, better to store this data outside the code, for example in a CSV or JSON file. You can easily read these in Python using the Standard Library (csv or json modules) or the Pandas Library (pandas.read_csv() or pandas.read_json() functions). Pandas is nice because it automatically parses into NumPy arrays of floating point numbers.

With the data in arrays, we can use Polynomial.fit() to create a linear “polynomial” from the x and y arrays:

import numpy.polynomial.polynomial as poly

linear = poly.Polynomial.fit(x, y, 1)

This returns a Polynomial object of degree 1, meaning it is linear. But what we really want are the coefficients of this polynomial. The docs tell us how to get them:

If the coefficients for the unscaled and unshifted basis polynomials are of interest, do new_series.convert().coef.

So that’s what we’ll do:

coefs = linear.convert().coef

And if we print them out, we get:

>>> print(coefs)
[-0.95  1.  ]

These are the same numbers we got from Numbers, but the Y-intercept is first and the slope is second. Why is that? Because Polynomial represents an \(n\)-degree polynomial (sometimes called the order of the polynomial) of this form:

\[y = c_0 + c_1x + c_1x^2 + \dots + c_nx^n\]

And with a degree of 1, the equation is:

\[y = c_0 + c_1x\]

Plugging the coefficient values of [-0.95 1. ] into that equation give us:

\[y = -0.95 + x\]

This is backwards from the slope-intercept form of \(y = mx + b\). The first coefficient, \(c_0\), is the Y-intercept \(b\), and the second coefficient, \(c_1\), is the slope \(m\). Python’s destructuring makes it easy to assign separate variables, m and b, from the coefficients:

b, m = coefs

This can even be done in one line:

b, m = poly.Polynomial.fit(x, y, 1).convert().coef

Using Polynomial.polyfit

While you can use Polynomial.fit() in one line, I find it a bit verbose. It creates an intermediate Polynomial object and I’m honestly not sure what the convert().coef really does. There’s a simpler way to get the coefficients using Polynomial.polyfit():

coefs = poly.polyfit(x, y, 1)

Again, the 1 is the degree, meaning a linear polynomial. Printing out coefs shows the same results as above:

>>> print(coefs)
[-0.95  1.  ]

Also like above, these are in the opposite order as the slope-intercept form, so we can use destructing to get m and b:

b, m = coef

This gives us the expected slope of 1 and Y-intercept of -0.95:

>>> print(f"{m=:.2} {b=:.2}")
m=1.0 b=-0.95

All in one line, this looks like:

b, m = poly.polyfit(x, y, 1)

I personally find this the more readable option. There’s less code and it does not create an intermediate Polynomial object. Again, if there’s some reason to prefer Polynomial.fit(), please let me know. And as a reminder, check out the Jupyter notebook demonstrating these APIs.

Bonus: Plotting with Python

One nice thing that Numbers gave us was a plot of the data and the line. Python can do this, too, using another fantastic library called Matplotlib. It has good integration with NumPy, and can use the x and y arrays directly. This code (which is also in the notebook):

import matplotlib.pyplot as plt

plt.plot(x, y, "o", label="Original data")
plt.plot(x, m*x + b, "red",
         label=f"Fitted line: y = {m:.2}*x + {b:.2}")
plt.grid(True)
plt.legend()
plt.show()

Produces this plot:

Plot of data and linear fit line using Matplotlib