内容简介:In previous articles, we introducedNow, add one last component to the model: seasonality.This article will cover:
Hands-on tutorial on time series modelling with SARIMA using Python
Jul 29 ·7min read
In previous articles, we introduced moving average processes MA(q) , and autoregressive processes AR(p) . We combined them and formed ARMA(p,q) and ARIMA(p,d,q) models to model more complex time series.
Now, add one last component to the model: seasonality.
This article will cover:
- Seasonal ARIMA models
- A complete modelling and forecasting project with real-life data
The notebook and dataset are available on Github.
Let’s get started!
For hands-on video tutorials on machine learning, deep learning, and artificial intelligence, checkout my YouTube channel .
SARIMA Model
Up until now, we have not considered the effect of seasonality in time series. However, this behaviour is surely present in many cases, such as gift shop sales, or total number of air passengers.
A seasonal ARIMA model or SARIMA is written as follows:
You can see that we add P, D, and Q for the seasonal portion of the time series. They are the same terms as the non-seasonal components, by they involve backshifts of the seasonal period.
In the formula above, m is the number of observations per year or the period. If we are analyzing quarterly data, m would equal 4.
ACF and PACF plots
The seasonal part of an AR and MA model can be inferred from the PACF and ACF plots.
In the case of a SARIMA model with only a seasonal moving average process of order 1 and period of 12, denoted as:
- A spike is observed at lag 12
- Exponential decay in the seasonal lags of the PACF (lag 12, 24, 36, …)
Similarly, for a model with only a seasonal autoregressive process of order 1 and period of 12:
- Exponential decay in the seasonal lags of the ACF (lag 12, 24, 36, …)
- A spike is observed at lag 12 in the PACF
Modelling
The modelling process is the same as with non-seasonal ARIMA models. In this case, we simply need to consider the additional parameters.
Steps required to make the time series stationary and selecting the model according to the lowest AIC remain in the modelling process.
Let’s cover a complete example with a real-world dataset.
Project — Modelling the quarterly EPS for Johnson&Johnson
We are going to revisit the dataset of the quarterly earnings per share (EPS) of Johnson&Johnson. This is a very interesting dataset, because there is a moving average process at play, and we have seasonality, which is perfect to get some practice with SARIMA.
As always, we start off by importing all the necessary libraries for our analysis
from statsmodels.graphics.tsaplots import plot_pacf from statsmodels.graphics.tsaplots import plot_acf from statsmodels.tsa.statespace.sarimax import SARIMAX from statsmodels.tsa.holtwinters import ExponentialSmoothing from statsmodels.tsa.stattools import adfuller import matplotlib.pyplot as plt from tqdm import tqdm_notebook import numpy as np import pandas as pdfrom itertools import productimport warnings warnings.filterwarnings('ignore')%matplotlib inline
Now, let’s read in the data in a dataframe:
data = pd.read_csv('jj.csv')
Then, we can display a plot of the time series:
plt.figure(figsize=[15, 7.5]); # Set dimensions for figure plt.plot(data['date'], data['data']) plt.title('Quarterly EPS for Johnson & Johnson') plt.ylabel('EPS per share ($)') plt.xlabel('Date') plt.xticks(rotation=90) plt.grid(True) plt.show()
Clearly, the time series is not stationary, as its mean is not constant through time, and we see an increasing variance in the data, a sign of heteroscedasticity .
To make sure, let’s plot the PACF and ACF:
plot_pacf(data['data']); plot_acf(data['data']);
Again, no information can be deduced from those plots. You can further test for stationarity with the Augmented Dickey-Fuller test:
ad_fuller_result = adfuller(data['data']) print(f'ADF Statistic: {ad_fuller_result[0]}') print(f'p-value: {ad_fuller_result[1]}')
Since the p-value is large, we cannot reject the null hypothesis and must assume that the time series is non-stationary.
Now, let’s take the log difference in an effort to make it stationary:
data['data'] = np.log(data['data']) data['data'] = data['data'].diff() data = data.drop(data.index[0])
Plotting the new data should give:
plt.figure(figsize=[15, 7.5]); # Set dimensions for figure plt.plot(data['data']) plt.title("Log Difference of Quarterly EPS for Johnson & Johnson") plt.show()
Awesome! Now, we still see the seasonality in the plot above. Since we are dealing with quarterly data, our period is 4. Therefore, we will take the difference over a period of 4:
# Seasonal differencingdata['data'] = data['data'].diff(4) data = data.drop([1, 2, 3, 4], axis=0).reset_index(drop=True)
Plotting the new data:
plt.figure(figsize=[15, 7.5]); # Set dimensions for figure plt.plot(data['data']) plt.title("Log Difference of Quarterly EPS for Johnson & Johnson") plt.show()
Perfect! Keep in mind that although we took the difference over a period of 4 months, the order of seasonal differencing (D) is 1, because we only took the difference once.
Now, let’s run the Augmented Dickey-Fuller test again to see if we have a stationary time series:
ad_fuller_result = adfuller(data['data']) print(f'ADF Statistic: {ad_fuller_result[0]}') print(f'p-value: {ad_fuller_result[1]}')
Indeed, the p-value is small enough for us to reject the null hypothesis, and we can consider that the time series is stationary.
Taking a look at the ACF and PACF:
plot_pacf(data['data']); plot_acf(data['data']);
We can see from the PACF that we have a significant peak at lag 1, which suggest an AR(1) process. Also, we have another peak at lag 4, suggesting a seasonal autoregressive process of order 1 (P = 1).
Looking at the ACF plot, we only see a significant peak at lag 1, suggesting a non-seasonal MA(1) process.
Although these plots can give us a rough idea of the processes in play, it is better to test multiple scenarios and choose the model that yield the lowest AIC.
Therefore, let’s write a function that will test a series of parameters for the SARIMA model and output a table with the best performing model at the top:
def optimize_SARIMA(parameters_list, d, D, s, exog): """ Return dataframe with parameters, corresponding AIC and SSE parameters_list - list with (p, q, P, Q) tuples d - integration order D - seasonal integration order s - length of season exog - the exogenous variable """ results = [] for param in tqdm_notebook(parameters_list): try: model = SARIMAX(exog, order=(param[0], d, param[1]), seasonal_order=(param[2], D, param[3], s)).fit(disp=-1) except: continue aic = model.aic results.append([param, aic]) result_df = pd.DataFrame(results) result_df.columns = ['(p,q)x(P,Q)', 'AIC'] #Sort in ascending order, lower AIC is better result_df = result_df.sort_values(by='AIC', ascending=True).reset_index(drop=True) return result_df
Note that we will only test different values for the parameters p, P, q and Q. We know that both seasonal and non-seasonal integration parameters should be 1, and that the length of the season is 4.
Therefore, we generate all possible parameters combination:
p = range(0, 4, 1) d = 1 q = range(0, 4, 1) P = range(0, 4, 1) D = 1 Q = range(0, 4, 1) s = 4parameters = product(p, q, P, Q) parameters_list = list(parameters) print(len(parameters_list))
And you should see that we get 256 unique combinations! Now, our function will fit 256 different SARIMA models on our data to find the one with the lowest AIC:
result_df = optimize_SARIMA(parameters_list, 1, 1, 4, data['data']) result_df
From the table, you can see that the best model is: SARIMA(0, 1, 2)(0, 1, 2, 4).
We can now fit the model and output its summary:
best_model = SARIMAX(data['data'], order=(0, 1, 2), seasonal_order=(0, 1, 2, 4)).fit(dis=-1) print(best_model.summary())
Here, you see that the best performing model has both seasonal and non-seasonal moving average processes.
From the summary above, you can find the value of the coefficients and their p-value. Notice that from the p-value, all coefficients are significant.
Now, we can study the residuals:
best_model.plot_diagnostics(figsize=(15,12));
From the normal Q-Q plot, we can see that we almost have a straight line, which suggest no systematic departure from normality. Also, the correlogram on the bottom right suggests that there is no autocorrelation in the residuals, and so they are effectively white noise.
We are ready to plot the predictions of our model and forecast into the future:
data['arima_model'] = best_model.fittedvalues data['arima_model'][:4+1] = np.NaNforecast = best_model.predict(start=data.shape[0], end=data.shape[0] + 8) forecast = data['arima_model'].append(forecast)plt.figure(figsize=(15, 7.5)) plt.plot(forecast, color='r', label='model') plt.axvspan(data.index[-1], forecast.index[-1], alpha=0.5, color='lightgrey') plt.plot(data['data'], label='actual') plt.legend()plt.show()
Voilà!
Conclusion
Congratulations! You now understand what a seasonal ARIMA (or SARIMA) model is and how to use it to model and forecast.
Learn more about time series with the following resources (I may earn a commission if you opt in one of the courses below):
Cheers :beer:
以上所述就是小编给大家介绍的《Time Series Forecasting with SARIMA in Python》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
Software Engineering for Internet Applications
Eve Andersson、Philip Greenspun、Andrew Grumet / The MIT Press / 2006-03-06 / USD 35.00
After completing this self-contained course on server-based Internet applications software, students who start with only the knowledge of how to write and debug a computer program will have learned ho......一起来看看 《Software Engineering for Internet Applications》 这本书的介绍吧!