Let’s say you run a business and you’re spending a varying amount of money on advertisements every single month.
Naturally, your revenue is not uniform across all months, and you’re trying to better understand, and hopefully predict, how ad spending translates into revenue:
“I spent X on ads this month, so my revenue should be Y, with error E.”
You have tracked ad spending and revenue across your business’s lifetime, and here’s what the data shows.
You can see that in months you spent $20 on ads, your revenue varied from ~$130 to $300, and when you spent $60, it was ~$300 to $470.
Using this (simulated) data, we are going to use two techniques, Linear Regression and Polynomial Regression, to try to predict the revenue for any given amount of ad spending.
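The data set itself isn’t shown here, so as a stand-in for the `load_data()` pseudocode used later, here is one way such simulated data might be generated. The trend and noise levels are my own assumptions, chosen only to roughly match the ranges described above:

```python
import numpy as np

def load_data(n=100, seed=42):
    """Hypothetical stand-in for the pseudocode load_data():
    simulates ad spending and a noisy, slightly non-linear revenue."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(10, 100, size=(n, 1))          # monthly ad spending in $
    y = 50 + 2.5 * x[:, 0] + 0.03 * x[:, 0] ** 2   # assumed underlying trend
    y += rng.normal(0, 40, size=n)                 # random noise
    return x, y

x, y = load_data()
print(x.shape, y.shape)  # (100, 1) (100,)
```

Note that `x` is returned with shape `(n, 1)` rather than `(n,)`, because sklearn estimators expect a 2-dimensional feature matrix.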
Linear regression is a model that assumes a linear relationship between the input variable (e.g. ads spending) and the output variable (e.g. revenue).
The model can be represented in the following way:

y = α₀ + α₁x₁ + α₂x₂ + … + αₙxₙ

where:
- y is the predicted value
- α₀ is the bias term
- α₁, α₂, …, αₙ are the model parameters
- x₁, x₂, …, xₙ are the input values
In our case, we only have one input, ad spending, so our model will be represented with

y = α₀ + α₁x
Basically, we are trying to come up with the linear equation that best describes the relationship between our input and output.
The above statement leads us to the next question
How do we know which linear equation describes our data best?
To determine that, I am going to introduce a new term: residuals.
Given a linear equation and a data point, the residual of the point is the vertical distance between the point and the line, i.e. the difference between the actual value and the value the line predicts.
The sum of squared residuals gives us an estimate of how good or bad the linear equation is (we square the residuals so that positive and negative ones don’t cancel out).
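As a quick sketch with made-up numbers, four data points and an arbitrary candidate line:

```python
import numpy as np

# Made-up data points (x, y) and an arbitrary candidate line y = a0 + a1*x
x = np.array([20.0, 40.0, 60.0, 80.0])
y = np.array([200.0, 280.0, 390.0, 450.0])
a0, a1 = 100.0, 4.0

y_on_line = a0 + a1 * x        # value the line predicts at each x
residuals = y - y_on_line      # vertical distance from each point to the line
print(residuals)               # [20. 20. 50. 30.]
print(np.sum(residuals ** 2))  # 4200.0 -- the sum of squared residuals
```

A different choice of `a0` and `a1` gives a different sum of squared residuals; the best line is the one that makes this number as small as possible.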
Above you can see an example of a ‘bad’ linear equation — we can see that most of our data points are pretty far from the line which means the residual of each point is relatively large.
In comparison, here’s a better linear equation (for our data)
Our goal is to minimize the sum of squared residuals, i.e. the error.
There are a few ways to do that, least-squares and gradient descent are both good options. (In some cases, gradient descent can be cheaper in terms of computational complexity).
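For reference, the least-squares route has a closed-form answer via the normal equations. A minimal sketch with made-up data points:

```python
import numpy as np

# Made-up data points
x = np.array([20.0, 40.0, 60.0, 80.0])
y = np.array([200.0, 280.0, 390.0, 450.0])

# Design matrix with a column of ones for the bias term a0
X = np.column_stack([np.ones_like(x), x])

# Normal equations: (X^T X) a = X^T y
a0, a1 = np.linalg.solve(X.T @ X, X.T @ y)
print(a0, a1)  # a0 ≈ 115.0, a1 ≈ 4.3
```

This is exactly what happens under the hood when we later call sklearn’s `LinearRegression`; gradient descent reaches the same answer iteratively instead.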
First, let’s better define what we are minimizing by defining a cost function, the sum of squared errors:

SSE = Σᵢ (yᵢ − ŷᵢ)²

where yᵢ are the actual data points and ŷᵢ are the predicted values.
I have chosen not to elaborate on gradient descent in this story and I will be using it as a “black box”.
For those of you unfamiliar with the algorithm, the details aren’t too important. All you need to understand is that by applying gradient descent to our cost function (SSE), we minimize it and update our model parameters to match that minimum, which means that by the end of the algorithm we will have the best-fitting linear equation.
Another key thing about this algorithm is that it’s iterative: at every iteration, we try to further minimize our function, until the algorithm converges, meaning no further minimization is possible.
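For those who do want a peek inside the black box, here is a minimal sketch of gradient descent for our one-input model. The data points, learning rate, and iteration count are my own made-up choices:

```python
import numpy as np

# Made-up data points
x = np.array([20.0, 40.0, 60.0, 80.0])
y = np.array([200.0, 280.0, 390.0, 450.0])

# Standardizing the input helps gradient descent converge quickly
x_mean, x_std = x.mean(), x.std()
z = (x - x_mean) / x_std

a0, a1 = 0.0, 0.0   # start from an arbitrary line
lr = 0.01           # learning rate
for _ in range(2000):
    error = (a0 + a1 * z) - y
    # Gradients of SSE with respect to a0 and a1
    a0 -= lr * 2 * error.sum()
    a1 -= lr * 2 * (error * z).sum()

# Convert the parameters back to the original x scale
slope = a1 / x_std
intercept = a0 - slope * x_mean
print(intercept, slope)  # ≈ 115.0, 4.3
```

Each pass nudges the parameters a little further downhill on the SSE surface; once the updates stop changing anything, the algorithm has converged to the least-squares solution.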
Applying Linear Regression
Thanks to amazing modules such as sklearn and others, we are able to do this in just a few lines of code.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

x, y = load_data()  # pseudocode for loading your data set

model = LinearRegression()
model.fit(x, y)  # note: sklearn expects x with shape (n_samples, n_features)
y_pred = model.predict(x)
```
And that’s it. We made a prediction of the y values, and now we can see how good it is with a few more lines:
```python
plt.scatter(x, y, s=10)
plt.plot(x, y_pred, color='r')
plt.show()
```
To get an understanding of how good this line is, we can compute its MSE
```python
mse = mean_squared_error(y, y_pred)
```
And in our case, MSE= 0.7343.
From the graph, we can see that the trend is not linear, and because of that our limited linear model is not giving a good prediction: there are points, especially near the edges, that are pretty far from the line.
A straight line can’t capture a non-linear trend well enough.
At this point, it is time to introduce Polynomial Regression.
As you might have guessed, this time we are going to look for the best polynomial that describes our data, and not a linear equation.
The polynomial regression model can be represented in the following way:

y = α₀ + α₁x + α₂x² + … + αₙxⁿ
Since we already have some intuition on how it works and we know what we are trying to minimize here (same as linear regression — MSE), let’s go straight to applying polynomial regression on our data and see the differences.
Applying Polynomial Regression
Similar to what we did with linear regression — we can apply a polynomial regression in the following way
```python
import numpy as np
import operator
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

POLY_DEGREE = 2

x, y = load_data()  # pseudocode for loading your data set

poly_reg = PolynomialFeatures(degree=POLY_DEGREE)
x_poly = poly_reg.fit_transform(x)

model = LinearRegression()
model.fit(x_poly, y)
y_pred = model.predict(x_poly)
```
Now in order to check how good this prediction is let’s generate the graph as we did before (with a little necessary twist)
```python
plt.scatter(x, y, s=10)

# Sort x and y_pred by the values in x so the curve is drawn left to right
sort_axis = operator.itemgetter(0)
sorted_zip = sorted(zip(x, y_pred), key=sort_axis)
x, y_pred = zip(*sorted_zip)

plt.plot(x, y_pred, color='r')
plt.show()
```
Just by looking at the graph, we can see that this line is a better predictor of our data. Now let’s calculate the mean squared error.
Pay attention to the point at which you calculate the error, or you might get nonsense scores: the sorting step above reorders x and y_pred, so they no longer line up with the original y. Calculate the MSE right after calling predict:

```python
mse = mean_squared_error(y, y_pred)
```
And this time the result is MSE = 0.3467.
If you paid attention, you saw me passing the argument POLY_DEGREE, which I set to 2; we can tweak it to try to get better results.
A degree of 2 means that the model is looking for the parameters of the following equation:

y = α₀ + α₁x + α₂x²
And by increasing the degree, we are basically giving the model more “power”, so it can fit the data more closely (though a degree that is too high will start to overfit).
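As a sketch of that tweak (on simulated data, since the original data set isn’t shown, with a trend and noise level of my own choosing), we can sweep a few degrees and compare the training MSE. Keep in mind that the training error can only go down as the degree grows, while too high a degree fits the noise and predicts new data poorly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Simulated data with an assumed non-linear underlying trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(10, 100, size=(100, 1))        # hypothetical ad spending
y = 50 + 2.5 * x[:, 0] + 0.03 * x[:, 0] ** 2   # assumed trend
y += rng.normal(0, 40, size=100)               # random noise

results = {}
for degree in (1, 2, 3, 5):
    x_poly = PolynomialFeatures(degree=degree).fit_transform(x)
    model = LinearRegression().fit(x_poly, y)
    results[degree] = mean_squared_error(y, model.predict(x_poly))
    print(degree, results[degree])
```

On this kind of data, the jump from degree 1 to degree 2 cuts the training MSE sharply, while higher degrees give only marginal gains, which is a hint that 2 is the right amount of “power” here.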
Hopefully, you now have a better understanding of what regression is, what it aims to do, and how to apply linear and polynomial regression to your data.