内容简介:版权声明:本套技术专栏是作者(秦凯新)平时工作的总结和升华,通过从真实商业环境抽取案例进行总结和分享,并给出商业应用的调优建议和集群环境容量规划等内容,请持续关注本套博客。QQ邮箱地址:1120746959@qq.com,如有任何学术交流,可随时联系。
版权声明:本套技术专栏是作者(秦凯新)平时工作的总结和升华,通过从真实商业环境抽取案例进行总结和分享,并给出商业应用的调优建议和集群环境容量规划等内容,请持续关注本套博客。QQ邮箱地址:1120746959@qq.com,如有任何学术交流,可随时联系。
1 数据的预处理
-
时间序列数据生成
import pandas as pd import numpy as np date_range: 可以指定开始时间与周期 H:小时 D:天 M:月 # TIMES #2016 Jul 1 7/1/2016 1/7/2016 2016-07-01 2016/07/01 rng = pd.date_range('2016-07-01', periods = 10, freq = '3D') rng DatetimeIndex(['2016-07-01', '2016-07-04', '2016-07-07', '2016-07-10', '2016-07-13', '2016-07-16', '2016-07-19', '2016-07-22', '2016-07-25', '2016-07-28'], dtype='datetime64[ns]', freq='3D') time=pd.Series(np.random.randn(20), index=pd.date_range(dt.datetime(2016,1,1),periods=20)) print(time) 2016-01-01 -0.129379 2016-01-02 0.164480 2016-01-03 -0.639117 2016-01-04 -0.427224 2016-01-05 2.055133 2016-01-06 1.116075 2016-01-07 0.357426 2016-01-08 0.274249 2016-01-09 0.834405 2016-01-10 -0.005444 2016-01-11 -0.134409 2016-01-12 0.249318 2016-01-13 -0.297842 2016-01-14 -0.128514 2016-01-15 0.063690 2016-01-16 -2.246031 2016-01-17 0.359552 2016-01-18 0.383030 2016-01-19 0.402717 2016-01-20 -0.694068 Freq: D, dtype: float64 复制代码 -
truncate过滤
time.truncate(before='2016-1-10') 2016-01-10 -0.005444 2016-01-11 -0.134409 2016-01-12 0.249318 2016-01-13 -0.297842 2016-01-14 -0.128514 2016-01-15 0.063690 2016-01-16 -2.246031 2016-01-17 0.359552 2016-01-18 0.383030 2016-01-19 0.402717 2016-01-20 -0.694068 Freq: D, dtype: float64 time.truncate(after='2016-1-10') 2016-01-01 -0.129379 2016-01-02 0.164480 2016-01-03 -0.639117 2016-01-04 -0.427224 2016-01-05 2.055133 2016-01-06 1.116075 2016-01-07 0.357426 2016-01-08 0.274249 2016-01-09 0.834405 2016-01-10 -0.005444 Freq: D, dtype: float64 print(time['2016-01-15':'2016-01-20']) 2016-01-15 0.063690 2016-01-16 -2.246031 2016-01-17 0.359552 2016-01-18 0.383030 2016-01-19 0.402717 2016-01-20 -0.694068 Freq: D, dtype: float64 data=pd.date_range('2010-01-01','2011-01-01',freq='M') print(data) DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30', '2010-05-31', '2010-06-30', '2010-07-31', '2010-08-31', '2010-09-30', '2010-10-31', '2010-11-30', '2010-12-31'], dtype='datetime64[ns]', freq='M') # 指定索引 rng = pd.date_range('2016 Jul 1', periods = 10, freq = 'D') rng pd.Series(range(len(rng)), index = rng) 2016-07-01 0 2016-07-02 1 2016-07-03 2 2016-07-04 3 2016-07-05 4 2016-07-06 5 2016-07-07 6 2016-07-08 7 2016-07-09 8 2016-07-10 9 Freq: D, dtype: int32 复制代码 -
指定索引
periods = [pd.Period('2016-01'), pd.Period('2016-02'), pd.Period('2016-03')] ts = pd.Series(np.random.randn(len(periods)), index = periods) ts 2016-07-01 0 2016-07-02 1 2016-07-03 2 2016-07-04 3 2016-07-05 4 2016-07-06 5 2016-07-07 6 2016-07-08 7 2016-07-09 8 2016-07-10 9 Freq: D, dtype: int32 复制代码 -
时间戳和时间周期可以转换
ts = pd.Series(range(10), pd.date_range('07-10-16 8:00', periods = 10, freq = 'H')) ts 2016-07-10 08:00:00 0 2016-07-10 09:00:00 1 2016-07-10 10:00:00 2 2016-07-10 11:00:00 3 2016-07-10 12:00:00 4 2016-07-10 13:00:00 5 2016-07-10 14:00:00 6 2016-07-10 15:00:00 7 2016-07-10 16:00:00 8 2016-07-10 17:00:00 9 Freq: H, dtype: int32 ts_period = ts.to_period() ts_period 2016-07-10 08:00 0 2016-07-10 09:00 1 2016-07-10 10:00 2 2016-07-10 11:00 3 2016-07-10 12:00 4 2016-07-10 13:00 5 2016-07-10 14:00 6 2016-07-10 15:00 7 2016-07-10 16:00 8 2016-07-10 17:00 9 Freq: H, dtype: int32 ts_period['2016-07-10 08:30':'2016-07-10 11:45'] 2016-07-10 08:00 0 2016-07-10 09:00 1 2016-07-10 10:00 2 2016-07-10 11:00 3 Freq: H, dtype: int32 ts['2016-07-10 08:30':'2016-07-10 11:45'] 2016-07-10 09:00:00 1 2016-07-10 10:00:00 2 2016-07-10 11:00:00 3 Freq: H, dtype: int32 复制代码
2 数据重采样
-
时间数据由一个频率转换到另一个频率
-
降采样
-
升采样
rng = pd.date_range('1/1/2011', periods=90, freq='D') ts = pd.Series(np.random.randn(len(rng)), index=rng) ts.head() 2011-01-01 -1.025562 2011-01-02 0.410895 2011-01-03 0.660311 2011-01-04 0.710293 2011-01-05 0.444985 Freq: D, dtype: float64 ts.resample('M').sum() 2011-01-31 2.510102 2011-02-28 0.583209 2011-03-31 2.749411 Freq: M, dtype: float64 ts.resample('3D').sum() 2011-01-01 0.045643 2011-01-04 -2.255206 2011-01-07 0.571142 2011-01-10 0.835032 2011-01-13 -0.396766 2011-01-16 -1.156253 2011-01-19 -1.286884 2011-01-22 2.883952 2011-01-25 1.566908 2011-01-28 1.435563 2011-01-31 0.311565 2011-02-03 -2.541235 2011-02-06 0.317075 2011-02-09 1.598877 2011-02-12 -1.950509 2011-02-15 2.928312 2011-02-18 -0.733715 2011-02-21 1.674817 2011-02-24 -2.078872 2011-02-27 2.172320 2011-03-02 -2.022104 2011-03-05 -0.070356 2011-03-08 1.276671 2011-03-11 -2.835132 2011-03-14 -1.384113 2011-03-17 1.517565 2011-03-20 -0.550406 2011-03-23 0.773430 2011-03-26 2.244319 2011-03-29 2.951082 Freq: 3D, dtype: float64 day3Ts = ts.resample('3D').mean() day3Ts 2011-01-01 0.015214 2011-01-04 -0.751735 2011-01-07 0.190381 2011-01-10 0.278344 2011-01-13 -0.132255 2011-01-16 -0.385418 2011-01-19 -0.428961 2011-01-22 0.961317 2011-01-25 0.522303 2011-01-28 0.478521 2011-01-31 0.103855 2011-02-03 -0.847078 2011-02-06 0.105692 2011-02-09 0.532959 2011-02-12 -0.650170 2011-02-15 0.976104 2011-02-18 -0.244572 2011-02-21 0.558272 2011-02-24 -0.692957 2011-02-27 0.724107 2011-03-02 -0.674035 2011-03-05 -0.023452 2011-03-08 0.425557 2011-03-11 -0.945044 2011-03-14 -0.461371 2011-03-17 0.505855 2011-03-20 -0.183469 2011-03-23 0.257810 2011-03-26 0.748106 2011-03-29 0.983694 Freq: 3D, dtype: float64 ## 下采样 print(day3Ts.resample('D').asfreq()) 2011-01-01 0.015214 2011-01-02 NaN 2011-01-03 NaN 2011-01-04 -0.751735 2011-01-05 NaN 2011-01-06 NaN 2011-01-07 0.190381 2011-01-08 NaN 2011-01-09 NaN 2011-01-10 0.278344 2011-01-11 NaN 2011-01-12 NaN 2011-01-13 -0.132255 2011-01-14 NaN 2011-01-15 NaN 2011-01-16 -0.385418 2011-01-17 NaN 2011-01-18 NaN 2011-01-19 -0.428961 2011-01-20 NaN 2011-01-21 NaN 2011-01-22 0.961317 Freq: D, Length: 88, dtype: float64 复制代码 -
ffill 空值取前面的值
-
bfill 空值取后面的值
-
interpolate 线性取值
day3Ts.resample('D').ffill(1) 2011-01-01 0.015214 2011-01-02 0.015214 2011-01-03 NaN 2011-01-04 -0.751735 2011-01-05 -0.751735 2011-01-06 NaN 2011-01-07 0.190381 2011-01-08 0.190381 2011-01-09 NaN 2011-01-10 0.278344 2011-01-11 0.278344 day3Ts.resample('D').bfill(1) 2011-01-01 0.015214 2011-01-02 NaN 2011-01-03 -0.751735 2011-01-04 -0.751735 2011-01-05 NaN 2011-01-06 0.190381 2011-01-07 0.190381 2011-01-08 NaN 2011-01-09 0.278344 2011-01-10 0.278344 2011-01-11 NaN 2011-01-12 -0.132255 2011-01-13 -0.132255 day3Ts.resample('D').interpolate('linear') 2011-01-01 0.015214 2011-01-02 -0.240435 2011-01-03 -0.496085 2011-01-04 -0.751735 2011-01-05 -0.437697 2011-01-06 -0.123658 2011-01-07 0.190381 2011-01-08 0.219702 2011-01-09 0.249023 2011-01-10 0.278344 2011-01-11 0.141478 2011-01-12 0.004611 2011-01-13 -0.132255 2011-01-14 -0.216643 2011-01-15 -0.301030 复制代码
3 滑动窗
-
滑动窗计算
%matplotlib inline import matplotlib.pylab import numpy as np import pandas as pd df = pd.Series(np.random.randn(600), index = pd.date_range('7/1/2016', freq = 'D', periods = 600)) df.head() 2016-07-01 -0.192140 2016-07-02 0.357953 2016-07-03 -0.201847 2016-07-04 -0.372230 2016-07-05 1.414753 Freq: D, dtype: float64 r = df.rolling(window = 10) #r.max, r.median, r.std, r.skew, r.sum, r.var print(r.mean()) 016-07-01 NaN 2016-07-02 NaN 2016-07-03 NaN 2016-07-04 NaN 2016-07-05 NaN 2016-07-06 NaN 2016-07-07 NaN 2016-07-08 NaN 2016-07-09 NaN 2016-07-10 0.300133 2016-07-11 0.284780 2016-07-12 0.252831 2016-07-13 0.220699 2016-07-14 0.167137 2016-07-15 0.018593 2016-07-16 -0.061414 2016-07-17 -0.134593 2016-07-18 -0.153333 2016-07-19 -0.218928 2016-07-20 -0.169426 2016-07-21 -0.219747 2016-07-22 -0.181266 2016-07-23 -0.173674 2016-07-24 -0.130629 2016-07-25 -0.166730 2016-07-26 -0.233044 2016-07-27 -0.256642 2016-07-28 -0.280738 2016-07-29 -0.289893 2016-07-30 -0.379625 ... 2018-01-22 -0.211467 2018-01-23 0.034996 2018-01-24 -0.105910 2018-01-25 -0.145774 2018-01-26 -0.089320 2018-01-27 -0.164370 2018-01-28 -0.110892 2018-01-29 -0.205786 2018-01-30 -0.101162 2018-01-31 -0.034760 2018-02-01 0.229333 2018-02-02 0.043741 2018-02-03 0.052837 2018-02-04 0.057746 2018-02-05 -0.071401 2018-02-06 -0.011153 2018-02-07 -0.045737 2018-02-08 -0.021983 2018-02-09 -0.196715 2018-02-10 -0.063721 2018-02-11 -0.289452 2018-02-12 -0.050946 2018-02-13 -0.047014 2018-02-14 0.048754 2018-02-15 0.143949 2018-02-16 0.424823 2018-02-17 0.361878 2018-02-18 0.363235 2018-02-19 0.517436 2018-02-20 0.368020 Freq: D, Length: 600, dtype: float64 复制代码 -
可视化
import matplotlib.pyplot as plt %matplotlib inline plt.figure(figsize=(15, 5)) df.plot(style='r--') df.rolling(window=10).mean().plot(style='b') 复制代码
4 ARIMA预测
-
数据的预处理
import pandas_datareader import datetime import matplotlib.pylab as plt import seaborn as sns from matplotlib.pylab import style from statsmodels.tsa.arima_model import ARIMA from statsmodels.graphics.tsaplots import plot_acf, plot_pacf style.use('ggplot') plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['axes.unicode_minus'] = False stockFile = 'data/T10yr.csv' stock = pd.read_csv(stockFile, index_col=0, parse_dates=[0]) stock.head(10) 复制代码
stock_week = stock['Close'].resample('W-MON').mean()
stock_train = stock_week['2000':'2015']
stock_train.plot(figsize=(12,8))
plt.legend(bbox_to_anchor=(1.25, 0.5))
plt.title("Stock Close")
sns.despine()
复制代码
stock_diff = stock_train.diff()
stock_diff = stock_diff.dropna()
plt.figure()
plt.plot(stock_diff)
plt.title('一阶差分')
plt.show()
复制代码
acf = plot_acf(stock_diff, lags=20)
plt.title("ACF")
acf.show()
复制代码
pacf = plot_pacf(stock_diff, lags=20)
plt.title("PACF")
pacf.show()
复制代码
model = ARIMA(stock_train, order=(1, 1, 1),freq='W-MON')
result = model.fit()
#print(result.summary())
pred = result.predict('20140609', '20160701',dynamic=True, typ='levels')
print (pred)
2014-06-09 2.463559
2014-06-16 2.455539
2014-06-23 2.449569
2014-06-30 2.444183
2014-07-07 2.438962
2014-07-14 2.433788
2014-07-21 2.428627
2014-07-28 2.423470
2014-08-04 2.418315
2014-08-11 2.413159
2014-08-18 2.408004
2014-08-25 2.402849
2014-09-01 2.397693
2014-09-08 2.392538
2014-09-15 2.387383
plt.figure(figsize=(6, 6))
plt.xticks(rotation=45)
plt.plot(pred)
plt.plot(stock_train)
复制代码
以上所述就是小编给大家介绍的《时间序列数据的预处理及基于ARIMA模型进行趋势预测-大数据ML样本集案例实战》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!
猜你喜欢:- Netflix数据库架构变革:缩放时间序列的数据存储
- 处理时间序列数据需要注意的 5 个要点
- Pandas必备技能之“时间序列数据处理”
- 规模化时间序列数据存储Part1
- 【火炉炼AI】机器学习043-pandas操作时间序列数据
- PostgreSQL中的大容量空间探索时间序列数据存储
本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
Probability and Computing
Michael Mitzenmacher、Eli Upfal / Cambridge University Press / 2005-01-31 / USD 66.00
Assuming only an elementary background in discrete mathematics, this textbook is an excellent introduction to the probabilistic techniques and paradigms used in the development of probabilistic algori......一起来看看 《Probability and Computing》 这本书的介绍吧!