Time Series Forecasting with SARIMA in Python

栏目: IT技术 · 发布时间: 4年前

内容简介:In previous articles, we introducedNow, add one last component to the model: seasonality.This article will cover:

Hands-on tutorial on time series modelling with SARIMA using Python

Time Series Forecasting with SARIMA in Python

Photo by Morgan Housel on Unsplash

In previous articles, we introduced moving average processes MA(q) , and autoregressive processes AR(p) . We combined them and formed ARMA(p,q) and ARIMA(p,d,q) models to model more complex time series.

Now, add one last component to the model: seasonality.

This article will cover:

  • Seasonal ARIMA models
  • A complete modelling and forecasting project with real-life data

The notebook and dataset are available on Github.

Let’s get started!

For hands-on video tutorials on machine learning, deep learning, and artificial intelligence, checkout my YouTube channel .

SARIMA Model

Up until now, we have not considered the effect of seasonality in time series. However, this behaviour is surely present in many cases, such as gift shop sales, or total number of air passengers.

A seasonal ARIMA model or SARIMA is written as follows:

Time Series Forecasting with SARIMA in Python
SARIMA notation

You can see that we add P, D, and Q for the seasonal portion of the time series. They are the same terms as the non-seasonal components, by they involve backshifts of the seasonal period.

In the formula above, m is the number of observations per year or the period. If we are analyzing quarterly data, m would equal 4.

ACF and PACF plots

The seasonal part of an AR and MA model can be inferred from the PACF and ACF plots.

In the case of a SARIMA model with only a seasonal moving average process of order 1 and period of 12, denoted as:

  • A spike is observed at lag 12
  • Exponential decay in the seasonal lags of the PACF (lag 12, 24, 36, …)

Similarly, for a model with only a seasonal autoregressive process of order 1 and period of 12:

  • Exponential decay in the seasonal lags of the ACF (lag 12, 24, 36, …)
  • A spike is observed at lag 12 in the PACF

Modelling

The modelling process is the same as with non-seasonal ARIMA models. In this case, we simply need to consider the additional parameters.

Steps required to make the time series stationary and selecting the model according to the lowest AIC remain in the modelling process.

Let’s cover a complete example with a real-world dataset.

Project — Modelling the quarterly EPS for Johnson&Johnson

We are going to revisit the dataset of the quarterly earnings per share (EPS) of Johnson&Johnson. This is a very interesting dataset, because there is a moving average process at play, and we have seasonality, which is perfect to get some practice with SARIMA.

As always, we start off by importing all the necessary libraries for our analysis

from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.stattools import adfuller
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
import numpy as np
import pandas as pdfrom itertools import productimport warnings
warnings.filterwarnings('ignore')%matplotlib inline

Now, let’s read in the data in a dataframe:

data = pd.read_csv('jj.csv')

Then, we can display a plot of the time series:

plt.figure(figsize=[15, 7.5]); # Set dimensions for figure
plt.plot(data['date'], data['data'])
plt.title('Quarterly EPS for Johnson & Johnson')
plt.ylabel('EPS per share ($)')
plt.xlabel('Date')
plt.xticks(rotation=90)
plt.grid(True)
plt.show()

Time Series Forecasting with SARIMA in Python

Quarterly EPS for Johnson&Johnson

Clearly, the time series is not stationary, as its mean is not constant through time, and we see an increasing variance in the data, a sign of heteroscedasticity .

To make sure, let’s plot the PACF and ACF:

plot_pacf(data['data']);
plot_acf(data['data']);

Time Series Forecasting with SARIMA in Python

PACF and ACF

Again, no information can be deduced from those plots. You can further test for stationarity with the Augmented Dickey-Fuller test:

ad_fuller_result = adfuller(data['data'])
print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')
ADF test result

Since the p-value is large, we cannot reject the null hypothesis and must assume that the time series is non-stationary.

Now, let’s take the log difference in an effort to make it stationary:

data['data'] = np.log(data['data'])
data['data'] = data['data'].diff()
data = data.drop(data.index[0])

Plotting the new data should give:

plt.figure(figsize=[15, 7.5]); # Set dimensions for figure
plt.plot(data['data'])
plt.title("Log Difference of Quarterly EPS for Johnson & Johnson")
plt.show()

Time Series Forecasting with SARIMA in Python

Log return of quarterly EPS for Johnson&Johnson

Awesome! Now, we still see the seasonality in the plot above. Since we are dealing with quarterly data, our period is 4. Therefore, we will take the difference over a period of 4:

# Seasonal differencingdata['data'] = data['data'].diff(4)
data = data.drop([1, 2, 3, 4], axis=0).reset_index(drop=True)

Plotting the new data:

plt.figure(figsize=[15, 7.5]); # Set dimensions for figure
plt.plot(data['data'])
plt.title("Log Difference of Quarterly EPS for Johnson & Johnson")
plt.show()

Time Series Forecasting with SARIMA in Python

Perfect! Keep in mind that although we took the difference over a period of 4 months, the order of seasonal differencing (D) is 1, because we only took the difference once.

Now, let’s run the Augmented Dickey-Fuller test again to see if we have a stationary time series:

ad_fuller_result = adfuller(data['data'])
print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')

Indeed, the p-value is small enough for us to reject the null hypothesis, and we can consider that the time series is stationary.

Taking a look at the ACF and PACF:

plot_pacf(data['data']);
plot_acf(data['data']);

Time Series Forecasting with SARIMA in Python

We can see from the PACF that we have a significant peak at lag 1, which suggest an AR(1) process. Also, we have another peak at lag 4, suggesting a seasonal autoregressive process of order 1 (P = 1).

Looking at the ACF plot, we only see a significant peak at lag 1, suggesting a non-seasonal MA(1) process.

Although these plots can give us a rough idea of the processes in play, it is better to test multiple scenarios and choose the model that yield the lowest AIC.

Therefore, let’s write a function that will test a series of parameters for the SARIMA model and output a table with the best performing model at the top:

def optimize_SARIMA(parameters_list, d, D, s, exog):
    """
        Return dataframe with parameters, corresponding AIC and SSE
        
        parameters_list - list with (p, q, P, Q) tuples
        d - integration order
        D - seasonal integration order
        s - length of season
        exog - the exogenous variable
    """
    
    results = []
    
    for param in tqdm_notebook(parameters_list):
        try: 
            model = SARIMAX(exog, order=(param[0], d, param[1]), seasonal_order=(param[2], D, param[3], s)).fit(disp=-1)
        except:
            continue
            
        aic = model.aic
        results.append([param, aic])
        
    result_df = pd.DataFrame(results)
    result_df.columns = ['(p,q)x(P,Q)', 'AIC']
    #Sort in ascending order, lower AIC is better
    result_df = result_df.sort_values(by='AIC', ascending=True).reset_index(drop=True)
    
    return result_df

Note that we will only test different values for the parameters p, P, q and Q. We know that both seasonal and non-seasonal integration parameters should be 1, and that the length of the season is 4.

Therefore, we generate all possible parameters combination:

p = range(0, 4, 1)
d = 1
q = range(0, 4, 1)
P = range(0, 4, 1)
D = 1
Q = range(0, 4, 1)
s = 4parameters = product(p, q, P, Q)
parameters_list = list(parameters)
print(len(parameters_list))

And you should see that we get 256 unique combinations! Now, our function will fit 256 different SARIMA models on our data to find the one with the lowest AIC:

result_df = optimize_SARIMA(parameters_list, 1, 1, 4, data['data'])
result_df

Time Series Forecasting with SARIMA in Python

Results table

From the table, you can see that the best model is: SARIMA(0, 1, 2)(0, 1, 2, 4).

We can now fit the model and output its summary:

best_model = SARIMAX(data['data'], order=(0, 1, 2), seasonal_order=(0, 1, 2, 4)).fit(dis=-1)
print(best_model.summary())

Time Series Forecasting with SARIMA in Python

Best model summary

Here, you see that the best performing model has both seasonal and non-seasonal moving average processes.

From the summary above, you can find the value of the coefficients and their p-value. Notice that from the p-value, all coefficients are significant.

Now, we can study the residuals:

best_model.plot_diagnostics(figsize=(15,12));

Time Series Forecasting with SARIMA in Python

Model diagnostics

From the normal Q-Q plot, we can see that we almost have a straight line, which suggest no systematic departure from normality. Also, the correlogram on the bottom right suggests that there is no autocorrelation in the residuals, and so they are effectively white noise.

We are ready to plot the predictions of our model and forecast into the future:

data['arima_model'] = best_model.fittedvalues
data['arima_model'][:4+1] = np.NaNforecast = best_model.predict(start=data.shape[0], end=data.shape[0] + 8)
forecast = data['arima_model'].append(forecast)plt.figure(figsize=(15, 7.5))
plt.plot(forecast, color='r', label='model')
plt.axvspan(data.index[-1], forecast.index[-1], alpha=0.5, color='lightgrey')
plt.plot(data['data'], label='actual')
plt.legend()plt.show()

Time Series Forecasting with SARIMA in Python

Model’s forecast

Voilà!

Conclusion

Congratulations! You now understand what a seasonal ARIMA (or SARIMA) model is and how to use it to model and forecast.

Learn more about time series with the following resources (I may earn a commission if you opt in one of the courses below):

Cheers :beer:


以上所述就是小编给大家介绍的《Time Series Forecasting with SARIMA in Python》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

自制编程语言 基于C语言

自制编程语言 基于C语言

郑钢 / 人民邮电出版社 / 2018-9-1 / CNY 89.00

本书是一本专门介绍自制编程语言的图书,书中深入浅出地讲述了如何开发一门编程语言,以及运行这门编程语言的虚拟机。本书主要内容包括:脚本语言的功能、词法分析器、类、对象、原生方法、自上而下算符优先、语法分析、语义分析、虚拟机、内建类、垃圾回收、命令行及调试等技术。 本书适合程序员阅读,也适合对编程语言原理感兴趣的计算机从业人员学习。一起来看看 《自制编程语言 基于C语言》 这本书的介绍吧!

在线进制转换器
在线进制转换器

各进制数互转换器

正则表达式在线测试
正则表达式在线测试

正则表达式在线测试

HEX CMYK 转换工具
HEX CMYK 转换工具

HEX CMYK 互转工具