Linear Regression for the Impatient Programmer

Back To Index

In [31]:
import numpy as np
import scipy as scp
import matplotlib.pyplot as plt
import math

Load the data

In [20]:
datafile = open("LinearData.txt",'r')
In [21]:
data = [ [float(word) for word in line.split()] for line in datafile.readlines()]
In [22]:
data[0]
Out[22]:
[3.36, 15.6495]
In [24]:
n = float(len(data))
In [25]:
print(n)
1000.0
In [9]:
datafile.close()
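
As an aside, NumPy can do this whole load in a single call. A minimal sketch, assuming the same whitespace-separated, two-column file:

In [ ]:
# Sketch: np.loadtxt parses whitespace-separated numeric columns directly,
# giving an array of shape (number_of_points, 2).
dataNP = np.loadtxt("LinearData.txt")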

Next, split the data pairs into lists of x and y

In [10]:
dataX = [XY[0] for XY in data]
In [11]:
dataY = [XY[1] for XY in data]
In [12]:
print(dataX[0],dataY[0])
3.36 15.6495
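
Equivalently, the split can be done in one line with zip, which transposes the list of [x, y] pairs. A sketch; note that in Python 3 this gives tuples rather than lists:

In [ ]:
# Sketch: zip(*data) transposes the list of pairs into two sequences.
dataX_alt, dataY_alt = zip(*data)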

Find average X and Y:

In [26]:
xbar = sum(dataX)/n
In [27]:
ybar = sum(dataY)/n
In [28]:
print("Average X: ",xbar," AverageY: ",ybar)
Average X:  6.651379799999999  AverageY:  38.04975769999991

Now find the standard deviation of X and Y:

In [29]:
varianceX = sum([pow(x-xbar,2) for x in dataX])/(n-1)
In [32]:
stdevX = math.sqrt(varianceX)
In [33]:
print("x variance:", varianceX, "  x standard deviation:", stdevX)
x variance: 17.36767965202397   x standard deviation: 4.167454817034489
In [34]:
varianceY = sum([pow(y-ybar,2) for y in dataY])/(n-1)
In [35]:
stdevY = math.sqrt( varianceY )
In [37]:
print("y variance:", varianceY, "  y standard deviation:", stdevY)
y variance: 758.7477305717524   y standard deviation: 27.545375847349632
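
These can be cross-checked with NumPy; a sketch, where ddof=1 asks np.std for the same n-1 (sample) normalization we used above:

In [ ]:
# Sketch: cross-check the hand-rolled statistics with NumPy.
# ddof=1 selects the sample (n-1) normalization.
print(np.mean(dataX), np.std(dataX, ddof=1))
print(np.mean(dataY), np.std(dataY, ddof=1))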

And find the covariance and correlation between X and Y. (Note that we divide by n here, while the variances above used n-1; for strict consistency the covariance should use n-1 as well, but with n = 1000 the difference is negligible.)

In [38]:
covariance = sum( [(x-xbar)*(y-ybar) for x,y in zip(dataX,dataY)] ) / n
In [39]:
correlation = covariance/(stdevX*stdevY)
In [41]:
print("covariance:", covariance, "  correlation:",correlation)
covariance: 87.22690608192573   correlation: 0.7598552455579058
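
NumPy computes these directly too. A sketch; note that np.cov defaults to the n-1 normalization, so its covariance will differ slightly from ours:

In [ ]:
# Sketch: np.cov returns the 2x2 covariance matrix (off-diagonal = cov(X, Y));
# np.corrcoef returns the correlation matrix in the same layout.
print(np.cov(dataX, dataY)[0, 1])
print(np.corrcoef(dataX, dataY)[0, 1])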

Next we’ll define a function to make a scatterplot of our data. It’s a good idea to write a plotting function, both for reusability and to keep the plot-beautification code in one place.

In [42]:
def splot(dataX, dataY):
    fig = plt.figure()
    ax = plt.subplot(111)
    ax.scatter(dataX, dataY, label='data')
    ax.set_title('Some Very Attractive Title')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_xlim([0, 18])
    ax.set_ylim([0, 190])
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), shadow=False, ncol=2)
    fig.patch.set_facecolor('white')  # set the figure border (background) color
    plt.show()
In [43]:
splot(dataX,dataY)

Now to make a histogram function. We’ll use this to plot X and Y for now, and other quantities later.

In [47]:
def hplot(data, nbins, label=''):
    fig = plt.figure()
    ax = plt.subplot(111)
    ax.hist(data, nbins, color='green', alpha=0.8)
    ax.set_title('Histogram')
    ax.set_xlabel(label)
    fig.patch.set_facecolor('white')  # set the figure border (background) color
    plt.show()
In [48]:
hplot(dataY,10)

Exercise: try changing the number of bins. Also, make a histogram of dataX.

Now we’ll do the least-squares fit. First we have to produce some intermediate sums.

In [49]:
dataX2 = [x*x for x in dataX]                 # x^2 terms
dataY2 = [y*y for y in dataY]                 # y^2 terms (not needed for the fit itself)
dataXY = [x*y for x,y in zip(dataX,dataY)]    # cross terms

Now we compute the fit line y = m*x + b, with slope m and vertical intercept b, using the standard least-squares formulas (all sums run over the data points):

m = ( n*Σ(xy) − Σx*Σy ) / ( n*Σ(x²) − (Σx)² )
b = ( Σy*Σ(x²) − Σx*Σ(xy) ) / ( n*Σ(x²) − (Σx)² )

In [50]:
m = ( sum(dataXY)*n - sum(dataX)*sum(dataY)  )/( sum(dataX2)*n - pow(sum(dataX),2)  )
In [51]:
b = ( sum(dataY)*sum(dataX2) - sum(dataX)*sum(dataXY) )/( sum(dataX2)*n - pow(sum(dataX),2)  )
In [52]:
print("m: ",m,"  b: ",b)
m:  5.027396984032517   b:  4.610630953825091
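
As a sanity check, np.polyfit performs the same least-squares fit; with degree 1 it returns the coefficients highest power first. A sketch:

In [ ]:
# Sketch: degree-1 polynomial fit; returns [slope, intercept].
m_np, b_np = np.polyfit(dataX, dataY, 1)
print(m_np, b_np)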

We would like to plot this line on top of the data. To do that, we first have to make some points that follow the line.

In [54]:
linX = range(20)
linY = [m*x+b for x in linX]

Now make the scatter plot with a line on top

In [56]:
fig = plt.figure()
ax = plt.subplot(111)
ax.scatter(dataX, dataY, label='data')
ax.plot(linX, linY, color='red', label='Fit Line', linewidth=4.0)  # new: the fit line
plt.title('Least Squares Fit to the Data')
plt.xlabel('x')
plt.ylabel('y')
ax.set_xlim([0, 18])
ax.set_ylim([0, 190])
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), shadow=False, ncol=2)
fig.patch.set_facecolor('white')
plt.show()

The first thing to do with this fit line is to check how well it explains the data. R² ("R squared") gives a measure of the goodness of the fit: it is the proportion of the variance in Y that is explained by the model, computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. Ideally R² would be 1.

In [60]:
SS_residuals = sum( [pow(y - (m*x+b),2) for x,y in zip(dataX,dataY) ]  )
SS_total = sum([pow(y-ybar,2) for y in dataY])
R_squared = 1.0 - (SS_residuals/SS_total)
print(R_squared)
0.5785364886426594

That’s pretty mediocre, which is to be expected in a system as noisy as this one.
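
For a simple one-variable linear fit, R² is the square of the correlation coefficient (exactly so when the covariance and the variances use the same normalization; our n versus n-1 mismatch above makes the match only approximate). A quick sketch:

In [ ]:
# Sketch: for simple linear regression, R^2 equals the squared correlation
# (up to the n vs n-1 normalization discussed earlier).
print(correlation**2)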

We can also select a slice of the plot, or the whole thing, and view the residuals. Again, these are the vertical distances from the data points to the fit line.

In [61]:
residuals = [y - (m*x + b) for x, y in zip(dataX, dataY) if x < 4]
hplot(residuals,15)
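
A common follow-up is to scatter the residuals for the whole dataset against x; a structureless horizontal band suggests the linear model is adequate. A minimal sketch:

In [ ]:
# Sketch: residuals versus x for the whole dataset.
all_residuals = [y - (m*x + b) for x, y in zip(dataX, dataY)]
plt.scatter(dataX, all_residuals, s=5, label='residuals')
plt.axhline(0, color='red')   # the fit line maps to the zero line here
plt.xlabel('x')
plt.ylabel('residual')
plt.show()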

We can project the data onto the direction perpendicular to the fit line by finding the perpendicular distances between the points and the fit line, and then making a histogram of those.

In [62]:
perp_dist = [ (y - (m*x+b))/math.sqrt(1.0+m*m) for x,y in zip(dataX,dataY) ]
In [64]:
hplot(perp_dist,25)

This gives us a narrow peak centered near zero, so we can see the spread of the data around our model line.
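
To put a number on that spread, we can take the sample standard deviation of the perpendicular distances. A sketch:

In [ ]:
# Sketch: one number summarizing the spread around the fit line.
print(np.std(perp_dist, ddof=1))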


Back To Index