Introduction to Linear Regression
In the previous chapter, we established the fundamentals of a straight line equation, using a simple example to demonstrate how two variables can be connected. To further illustrate this idea, we presented a table of two arrays of numbers and challenged the readers to discern if there was any relationship between them and if yes, how could one put it into an equation form. The additional task was finding out what the intercept and constant should be.
In this chapter, we will look into forming a link and progress to relative value trading. For ease, I’ll post the table with two sets of numbers again –
A casual perusal of the two sets of numbers won’t give a clue as to how they might be related. Unless you’re some kind of superhuman, it’s not an effective way to discern the connection.
In this situation, we turn to the ‘Linear Regression’ method. This statistical operation takes an array of two sets of numbers as input and returns various parameters, like the intercept and constant required for forming a straight line equation.
We will utilize Excel to perform the linear regression procedure. Here is the comprehensive guide for performing a standard linear regression on two data arrays. Be ready for plenty of screenshots and instructions
Step 1 – Install the Plugin
Create a new Excel spreadsheet, utilizing the values of X and Y from the table. Refer to the example below –
This data set is ours to use. Recall that Y, the ‘Dependent’ variable, takes its value based on X, the independent one. Both of them will be our input variables for the linear regression operation.
To access the Data ribbon, click on the highlighted tab in red on the Excel sheet. The tab can be seen below.
The data ribbon should display the ‘Data Analysis’ option, highlighted in blue. If you do not see it, don’t worry – just follow these instructions.
To access the file menu, select ‘File’ on the top bar.
By clicking on the new window, you will notice a selection to choose from in the left-hand side panel – ‘Option’.
To add the ‘Data Analysis’ option to the data ribbon, you need to click on the ‘Options’ tab. Navigate to the left-hand side panel and select ‘Add-Ins’, then press ‘Go’ and finally hit ‘Ok’. That’s it!
Shut your Excel document and restart your system, and you should be ready to go.
Step 2 – Enter the values
Assuming your excel sheet is equipped with the data analysis pack, the next step is to invoke the linear regression function. Click ‘Data’ in the ribbon and choose Data Analysis. This will open a pop-up presenting a range of statistical operations you can conduct on data sets; select ‘Regression’.
When you choose regression and hit OK, a pop-up will appear
You will notice a variety of fields in this area. Take particular notice of the ‘Input Y Range’ and ‘Input X Range’ in the initial section. These are for the dependent and independent variables, respectively.
To feed in the X and Y series data, click on the input channel and select the corresponding ranges.
I have checked the label box, indicating that cells A2 and B2 both have labels – X and Y, respectively.
For the time being, it would be best to overlook the extra inputs.
Check that you have gone through the steps mentioned below –
When you select ‘New worksheet’, the output data will show up on a different sheet. Additionally, please ensure Residuals and Standardised Residuals are both ticked; I’ll provide more details about them at a later point.
This will enable you to do the linear regression operation. Simply hit the ‘Ok’ button located in the top-right corner.
Excel will accept these inputs and complete the linear regression, then generate a new sheet within the workbook with the results.
Linear Regression Output
This is how the linear regression output looks; as expected, a summary of the result has been provided in a new sheet.
The summary output is intimidating at first glance, since it contains such a huge amount of information. We’ll break it up gradually as we go along.
Right now, let us focus on discovering our incline and intercept. I have highlighted this for you in the image underneath –
The data points coloured red contain the coefficients we seek: the intercept (or constant) and the slope (denoted by x).
You may be confused by the use of x to denote the slope, rather than M as it would typically go with a straight-line equation. I understand that this may be misleading, but unfortunately we’ll have to use x for our purposes.
Slope of the equation = 1.885
Intercept (or constant) = -7.859813.
Given this, the straight-line equation for the arbitrary set of data is –
y = 1.885*x + (-7.859813) or
y = 1.885*x – 7.859813
In other words, what is the implication of this?
Recall from the preceding chapter, this equation enables us to ascertain the value of y or the dependent variable for a given x. For ease, I will post the table here again –
I’ve included a 15 to the x data, and now we can utilise the slope and intercept to estimate y. Let’s get on with it!
y = 1.885 * 15 – 7.859813
= 28.275 – 7.859813
Considering x is 15, the most probable value of y should be around 20.415.
You might be wondering how reliable this projection is.
That is simply what one might expect. To illustrate, if x had a value of 18 (as mentioned in the preceding data point), then according to the straight line equation, y should be -.
y = 1.885*18 – 7.859813
= 33.93 – 7.859813
However, the actual value of y is 22.
This leads us two values of y –
The straight line equation predicts the value of y.
Actual value of y
The residual for y, when x = 18, is the difference between the actual and predicted y.
26.07019 – 22
The summary output when linear regression is carried out includes the residuals, as shown in the snapshot below –
The residual when x equals 18 is discernible, as calculated previously.