# Cointegration and Stationarity: Mastering Pair Trading

Cointegration of two time series

Now we will familiarise ourselves with two new terms – Cointegration and Stationarity.

Generally speaking, if stock X and stock Y are ‘cointegrated’, it means that the two stocks tend to move in step. Any departure from this is typically either short-term or related to a special event. One can expect the two time series to revert back to their regular pattern of converging and moving together, which is essential for pair trading. Therefore, the pairs chosen for pair trading should be cointegrated.

The issue is, how can we assess whether the two stocks are cointegrated?

To find out whether two stocks are cointegrated, we first run a linear regression of one stock on the other and extract the residuals. We then test whether those residuals are ‘stationary’.

If the residuals are stationary, this indicates that the two stocks are cointegrated and thus move together, providing a good potential for pair trading.

This brings up an interesting perspective – when running a regression on any two time-series, the output is not necessarily reliable. To decipher whether this is the case, one must assess stationarity; if residuals aren’t stationary then the regression cannot be trusted and should be disregarded.

Speculation and trading on a co-integrated time series is more meaningful and unaffected by the trend of the market.

Ultimately, it is essential to determine whether the residuals are stationary or not.

At this point, I can quickly demonstrate how to tell if the residuals are stationary or not via a simple test known as the ‘ADF test’, which should sufficiently answer your query. But, it would be beneficial for you to take a few minutes to grasp the true meaning of ‘stationarity’ without diving too deep into the more complex quantitative elements.

Read the next portion if you are interested in learning more; otherwise, skip ahead to the part that discusses the ADF test.

Stationary and non-stationary series

A time series can be marked as “Stationary” if it fulfils three statistical conditions. Weak stationarity is observed when two or even one of the conditions are met. If none of them are met, then the time series can be considered “non-stationary”.

These three principles are –

The mean of the series should remain steady or within a narrow margin.

The standard deviation of the series should stay within a tight range.

There should be no autocorrelation within the series; in other words, each value of the time series must be independent of any other value prior to it. We will discuss this further at a later stage.

Pair trading requires complete stationarity from the pairs we choose; anything that varies in a non-stationary or weak stationary manner will not fulfil our needs.

So, to give a better understanding of ‘stationarity’, I suggest taking up an example such as a sample time series and exploring the three conditions mentioned above.

To illustrate this, I have two datasets with 9000 data points each. I’ve designated them as Series A and Series B, and will use these series to investigate the three stationarity conditions outlined above.

Condition 1 – The mean of the series should be the same or within a tight range

In order to assess the data, I will divide the time series into three segments and calculate the mean for each. Ideally, all three values should be similar. If this is the case, then it is likely that any new data points introduced in the future will not greatly alter the average.

Let us proceed with the task of splitting the Series A data into three segments and calculating their respective means. The result will appear as follows –

As mentioned, I have 9000 data points in Series A and Series B. These have been split into 3 parts; the starting and ending cells have been highlighted for ease of understanding.

The mean for all the three parts is alike, clearly meeting the first requirement.

For Series B, here’s how the means look –

As is evident, Series B shifts quite drastically and thus fails to meet the initial requisites for stationarity.
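The three-segment mean check above can be sketched as follows. The data here is not the author's: `series_a` and `series_b` are synthetic stand-ins (independent noise and a random walk, respectively), and `segment_means` is a hypothetical helper name.

```python
import random
import statistics

random.seed(1)
n = 9000

# Hypothetical stand-ins for the two series: A is independent noise around a
# fixed level, B is a random walk (each value builds on the previous one).
series_a = [random.gauss(0, 1) for _ in range(n)]
series_b = [0.0]
for _ in range(n - 1):
    series_b.append(series_b[-1] + random.gauss(0, 1))

def segment_means(series, parts=3):
    """Split the series into equal consecutive parts and return each part's mean."""
    size = len(series) // parts
    return [statistics.fmean(series[i * size:(i + 1) * size]) for i in range(parts)]

means_a = segment_means(series_a)
means_b = segment_means(series_b)
print("Series A means:", [round(m, 3) for m in means_a])
print("Series B means:", [round(m, 3) for m in means_b])
```

If the three means of a series sit close together, the first condition is plausibly met; means that wander far apart, as a random walk's typically do, point to non-stationarity.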

Condition 2 – The standard deviation needs to be within a predetermined range.

My plan is to compute the standard deviation for all of the components in both collections and then examine their respective values.

The findings of Series A are as follows: the result is…

The standard deviation stays between 14% and 19%, which is a ‘tight’ range and therefore fulfils the second stationarity condition.

Series B’s standard deviation can be revealed in the following manner:

Did you notice the difference? Series B’s standard deviation appears highly unpredictable. It is evidently not stationary. In comparison, Series A seems stable thus far. Let us now check the autocorrelation and finalise our assessment.
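The same segment-wise comparison for the standard deviation might look like the sketch below. Again, the two series are synthetic stand-ins rather than the author's data, and `segment_stdevs` is a hypothetical helper name.

```python
import random
import statistics

random.seed(2)
n = 9000

# Hypothetical stand-ins: A is independent noise, B is a random walk.
series_a = [random.gauss(0, 1) for _ in range(n)]
series_b = [0.0]
for _ in range(n - 1):
    series_b.append(series_b[-1] + random.gauss(0, 1))

def segment_stdevs(series, parts=3):
    """Split the series into equal consecutive parts and return each part's standard deviation."""
    size = len(series) // parts
    return [statistics.stdev(series[i * size:(i + 1) * size]) for i in range(parts)]

stdevs_a = segment_stdevs(series_a)
stdevs_b = segment_stdevs(series_b)
print("Series A standard deviations:", [round(s, 3) for s in stdevs_a])
print("Series B standard deviations:", [round(s, 3) for s in stdevs_b])
```

A stationary series should show roughly the same standard deviation in every segment, while a non-stationary one usually does not.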

Condition 3 – There should be no autocorrelation within the series

Put simply, autocorrelation is a phenomenon in which the value of a time series depends on the values prior to it. For a series to be stationary, there should be no such dependence.

Take a look at the image below. It provides a perfect example of what we are discussing.

The 9th value in Series A is 29. If there is no autocorrelation in the series, this value does not depend on any of the values before it; that is, cells 1 through 8 have no bearing on the 29.

How can we check this?

There is a simple technique for it.

Take the data from Cell 1 to Cell 9 and call it Series X. Then, take the data from Cell 2 to Cell 10 and label it Series Y. Calculate the correlation between these two series – this is referred to as 1-lag correlation; this figure should be close to 0.

I can also calculate the correlation between cells 1 to 8 and cells 3 to 10, which is the 2-lag correlation; this figure should be close to 0 as well. If it is, then the series is not autocorrelated, which satisfies the third requirement for stationarity.

The correlation between Series A and itself lagged by 2 was measured, and the results are as follows:

I split Series A into two subsets, Series X and Series Y. The correlation between the two is almost zero, which satisfies the third condition and lets us conclude that Series A is stationary.

Let’s also do this for Series B.

Using a comparable methodology, I was able to establish an almost perfect correlation.

It’s clear that the conditions for stationarity have been met by Series A, thus making it stationary. Conversely, the same cannot be said of Series B.

My way of explaining stationarity and cointegration may not be the typical one, as it omits what some people call ‘scary’ formulas. I chose to explain these topics this way because our main focus is to learn pair trading efficiently, rather than to delve deep into statistics.

You may be wondering if all the steps mentioned are necessary to determine whether the time series (residuals) is stationary. To put it simply, they are not.

The result of the ‘ADF test’ tells us whether the time series is stationary or not.

The augmented Dickey-Fuller test (ADF) is an effective way of evaluating the stationarity of a time series, particularly the residuals series in this instance.

Basically, the ADF test carries out all of the steps outlined earlier and additionally involves a multiple lag process to assess autocorrelation. One final thing you should bear in mind – the results of the ADF test are not categorical. Instead, they offer a probability. This probability suggests how likely it is that the series is non-stationary.

For example, the ADF test of a time series might output 0.25, which loosely implies a 25% chance of non-stationarity and, consequently, a 75% chance that the series is stationary. This figure is known as the p-value.

For a time series to be considered stationary, its p-value should be less than 0.05 (5%). Read the same way, this means the likelihood of the time series being stationary is 95% or higher.

Alright, so how do you run an ADF test?

Frankly, this process is highly complex and I could not locate any free sources or methods to run an ADF test. I do possess an excel sheet with a paid plugin, but it is not possible to share here. I would have done so if it were feasible.

As a programmer, you might want to consider running an ADF test with the help of a Python plugin. It’s apparently easy to get one.

For those of us who aren’t experienced programmers, we find ourselves stuck at this stage. To help, I plan to upload a ‘Pair Data’ sheet on a biweekly basis. This will contain the top combinations of pairs, with information such as:

You will know which stock is X and which stock is Y

You will know the intercept and Beta of this combination

You will also know the p-value of the combination

This covers a period of 200 trading days and comprises only banking stocks, but with luck, more sectors can be added in the future. Here is a snapshot of the latest Pair Datasheet for banking stocks for you to refer to and gain a better understanding.

This suggests that Federal Bank Y and PNB X are a viable combination. Regression was conducted for both Federal Y and PNB X, plus the reverse scenario, and it was determined that the error ratio was lowest with Federal Y and PNB X.

Once the order has been determined (which stock is Y and which is X), the intercept and Beta for the combination are calculated. The ADF test is then run on the residuals; with Federal Bank as Y and PNB as X, the p-value is 0.365.

Put differently, the chance that these residuals are stationary is only 63.5%, which is not high enough.

As seen in the snapshot, two pairs have the p-value we are looking for (below 0.05): Kotak with PNB at 0.01 and HDFC with PNB at 0.037.

The p-values don’t change drastically over short periods, so I typically check them once every 15 or 20 days and see if they need updating.

We have gained considerable insight from this chapter. Much of the material discussed may be unfamiliar to the majority of readers, so I will now give a summary of all the key points about Pair Trading that you need to know.

The basic premise of pair trading

Basic overview of linear regression and how to perform one

In linear regression, the dependent variable (Y) is regressed on the independent variable (X).

When analysing our regressions, we must consider some outputs that are of note – such as the intercept, slope, residuals and their respective standard errors.

The error ratio is what determines whether a stock will be labelled as dependent or independent.

We determine which stock is X and which is Y by exchanging them and finding the one with the smallest error ratio.

The residuals from the regression need to be stationary for us to verify that the two stocks are co-integrated. If they are stationary, then our conclusion is supported.

If the stocks are cointegrated, then they tend to remain in sync.

The ADF test can be used to determine stationarity of a series