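Summary: Copyright notice: this technical column series is a summary and distillation of the day-to-day work of the author (秦凯新, Qin Kaixin). It draws its cases from real commercial environments and offers tuning advice for commercial applications, along with cluster capacity planning. Please keep following this blog series. QQ email: 1120746959@qq.com. Feel free to get in touch for any academic exchange.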
1 Data preprocessing
- Generating time series data
    import datetime as dt  # needed for dt.datetime below

    import numpy as np
    import pandas as pd

    # date_range takes a start time plus a number of periods (or an end time).
    # Common freq aliases: H = hour, D = day, M = month.
    # Accepted spellings of July 1, 2016:
    #   '2016 Jul 1', '7/1/2016', '2016-07-01', '2016/07/01'
    # Note: '1/7/2016' is read month-first by default (January 7),
    # unless it is parsed with pd.to_datetime(..., dayfirst=True).
    rng = pd.date_range('2016-07-01', periods=10, freq='3D')
    rng

    DatetimeIndex(['2016-07-01', '2016-07-04', '2016-07-07', '2016-07-10',
                   '2016-07-13', '2016-07-16', '2016-07-19', '2016-07-22',
                   '2016-07-25', '2016-07-28'],
                  dtype='datetime64[ns]', freq='3D')

    # 20 random values indexed by consecutive days (daily is the default frequency)
    time = pd.Series(np.random.randn(20),
                     index=pd.date_range(dt.datetime(2016, 1, 1), periods=20))
    print(time)

    2016-01-01   -0.129379
    2016-01-02    0.164480
    2016-01-03   -0.639117
    2016-01-04   -0.427224
    2016-01-05    2.055133
    2016-01-06    1.116075
    2016-01-07    0.357426
    2016-01-08    0.274249
    2016-01-09    0.834405
    2016-01-10   -0.005444
    2016-01-11   -0.134409
    2016-01-12    0.249318
    2016-01-13   -0.297842
    2016-01-14   -0.128514
    2016-01-15    0.063690
    2016-01-16   -2.246031
    2016-01-17    0.359552
    2016-01-18    0.383030
    2016-01-19    0.402717
    2016-01-20   -0.694068
    Freq: D, dtype: float64
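As a small illustration of the two ways to bound a range, the sketch below builds the same 3-day grid once from a period count and once from an end date, and checks that several date spellings parse to the same timestamp. The variable names are made up for this example.

```python
import pandas as pd

# Fixed number of points: 10 timestamps, one every 3 days.
by_periods = pd.date_range("2016-07-01", periods=10, freq="3D")

# Fixed end date: every 3rd day from the start, up to and including the end
# when the end lands on the grid.
by_end = pd.date_range("2016-07-01", "2016-07-28", freq="3D")

# Several spellings of the same date.
spellings = ["2016 Jul 1", "7/1/2016", "2016-07-01", "2016/07/01"]
parsed = [pd.Timestamp(s) for s in spellings]

print(by_periods.equals(by_end))
print(parsed)
```

Passing `periods` is convenient when the number of samples is known; passing an end date is convenient when the calendar span is known.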
- Filtering with truncate
    # Drop everything before 2016-01-10 (the boundary itself is kept)
    time.truncate(before='2016-1-10')

    2016-01-10   -0.005444
    2016-01-11   -0.134409
    2016-01-12    0.249318
    2016-01-13   -0.297842
    2016-01-14   -0.128514
    2016-01-15    0.063690
    2016-01-16   -2.246031
    2016-01-17    0.359552
    2016-01-18    0.383030
    2016-01-19    0.402717
    2016-01-20   -0.694068
    Freq: D, dtype: float64

    # Drop everything after 2016-01-10 (the boundary itself is kept)
    time.truncate(after='2016-1-10')

    2016-01-01   -0.129379
    2016-01-02    0.164480
    2016-01-03   -0.639117
    2016-01-04   -0.427224
    2016-01-05    2.055133
    2016-01-06    1.116075
    2016-01-07    0.357426
    2016-01-08    0.274249
    2016-01-09    0.834405
    2016-01-10   -0.005444
    Freq: D, dtype: float64

    # Label slicing on a DatetimeIndex includes both endpoints
    print(time['2016-01-15':'2016-01-20'])

    2016-01-15    0.063690
    2016-01-16   -2.246031
    2016-01-17    0.359552
    2016-01-18    0.383030
    2016-01-19    0.402717
    2016-01-20   -0.694068
    Freq: D, dtype: float64

    # Month-end dates between two bounds
    data = pd.date_range('2010-01-01', '2011-01-01', freq='M')
    print(data)

    DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30',
                   '2010-05-31', '2010-06-30', '2010-07-31', '2010-08-31',
                   '2010-09-30', '2010-10-31', '2010-11-30', '2010-12-31'],
                  dtype='datetime64[ns]', freq='M')

    # Specify the index
    rng = pd.date_range('2016 Jul 1', periods=10, freq='D')
    rng
    pd.Series(range(len(rng)), index=rng)

    2016-07-01    0
    2016-07-02    1
    2016-07-03    2
    2016-07-04    3
    2016-07-05    4
    2016-07-06    5
    2016-07-07    6
    2016-07-08    7
    2016-07-09    8
    2016-07-10    9
    Freq: D, dtype: int32
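To make the boundary behaviour easy to check, here is a minimal sketch on a deterministic series (integers instead of random values, an assumption made only for this example). It shows that `truncate` keeps the boundary date on both sides and that a `before`/`after` pair selects the same rows as label slicing.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", periods=20, freq="D")
s = pd.Series(np.arange(20), index=idx)

tail = s.truncate(before="2016-01-10")   # 2016-01-10 onward
head = s.truncate(after="2016-01-10")    # everything up to 2016-01-10
window = s.truncate(before="2016-01-15", after="2016-01-20")

# Label slicing on a sorted DatetimeIndex is inclusive on both ends too.
same_window = s["2016-01-15":"2016-01-20"]

print(len(tail), len(head), len(window))
print(window.equals(same_window))
```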
- Specifying the index
    # A list of monthly Period objects can serve as the index
    periods = [pd.Period('2016-01'), pd.Period('2016-02'), pd.Period('2016-03')]
    ts = pd.Series(np.random.randn(len(periods)), index=periods)
    ts

The result is a Series of three random floats indexed by the monthly periods 2016-01, 2016-02 and 2016-03 (a PeriodIndex with monthly frequency).
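A Period stands for a whole span of time rather than a single instant. The sketch below, with fixed values chosen for this example, builds the same monthly index with `pd.period_range` and inspects the span covered by the first period.

```python
import pandas as pd

# Same three months as a hand-written list of Period objects.
months = pd.period_range("2016-01", periods=3, freq="M")
ts = pd.Series([10, 20, 30], index=months)

first = months[0]
print(first.start_time)   # first instant of January 2016
print(first.end_time)     # last instant of January 2016

# Look up a value by its Period label.
feb_value = ts.loc[pd.Period("2016-02", freq="M")]
print(feb_value)
```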
- Timestamps and periods can be converted into each other
    # Hourly timestamps starting at 2016-07-10 08:00
    ts = pd.Series(range(10), pd.date_range('07-10-16 8:00', periods=10, freq='H'))
    ts

    2016-07-10 08:00:00    0
    2016-07-10 09:00:00    1
    2016-07-10 10:00:00    2
    2016-07-10 11:00:00    3
    2016-07-10 12:00:00    4
    2016-07-10 13:00:00    5
    2016-07-10 14:00:00    6
    2016-07-10 15:00:00    7
    2016-07-10 16:00:00    8
    2016-07-10 17:00:00    9
    Freq: H, dtype: int32

    # Convert the DatetimeIndex to a PeriodIndex of hourly periods
    ts_period = ts.to_period()
    ts_period

    2016-07-10 08:00    0
    2016-07-10 09:00    1
    2016-07-10 10:00    2
    2016-07-10 11:00    3
    2016-07-10 12:00    4
    2016-07-10 13:00    5
    2016-07-10 14:00    6
    2016-07-10 15:00    7
    2016-07-10 16:00    8
    2016-07-10 17:00    9
    Freq: H, dtype: int32

    # On the PeriodIndex, 08:30 falls inside the 08:00 period, so that period is included
    ts_period['2016-07-10 08:30':'2016-07-10 11:45']

    2016-07-10 08:00    0
    2016-07-10 09:00    1
    2016-07-10 10:00    2
    2016-07-10 11:00    3
    Freq: H, dtype: int32

    # On the DatetimeIndex, the instant 08:00 is earlier than 08:30, so it is excluded
    ts['2016-07-10 08:30':'2016-07-10 11:45']

    2016-07-10 09:00:00    1
    2016-07-10 10:00:00    2
    2016-07-10 11:00:00    3
    Freq: H, dtype: int32
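The conversion works in both directions. The following sketch uses daily data (an assumption for this example, chosen to keep it short) to show the round trip from timestamps to periods and back, and a coarser conversion to monthly periods.

```python
import pandas as pd

stamps = pd.date_range("2016-07-10", periods=3, freq="D")
ts = pd.Series([1, 2, 3], index=stamps)

as_periods = ts.to_period("D")      # DatetimeIndex -> PeriodIndex
back = as_periods.to_timestamp()    # PeriodIndex -> DatetimeIndex (start of each period)

# Converting to a coarser frequency maps all three days to the same month.
monthly = ts.to_period("M")

print(type(as_periods.index).__name__)
print(back.index.equals(stamps))
print(monthly.index)
```

`to_timestamp()` places each value at the start of its period by default, which is why the daily round trip reproduces the original index.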
2 Resampling
- Converting time series data from one frequency to another
- Downsampling
- Upsampling
    # 90 days of random daily data
    rng = pd.date_range('1/1/2011', periods=90, freq='D')
    ts = pd.Series(np.random.randn(len(rng)), index=rng)
    ts.head()

    2011-01-01   -1.025562
    2011-01-02    0.410895
    2011-01-03    0.660311
    2011-01-04    0.710293
    2011-01-05    0.444985
    Freq: D, dtype: float64

    # Downsampling: daily -> monthly totals
    ts.resample('M').sum()

    2011-01-31    2.510102
    2011-02-28    0.583209
    2011-03-31    2.749411
    Freq: M, dtype: float64

    # Downsampling: daily -> 3-day totals
    ts.resample('3D').sum()

    2011-01-01    0.045643
    2011-01-04   -2.255206
    2011-01-07    0.571142
    2011-01-10    0.835032
    2011-01-13   -0.396766
    2011-01-16   -1.156253
    2011-01-19   -1.286884
    2011-01-22    2.883952
    2011-01-25    1.566908
    2011-01-28    1.435563
    2011-01-31    0.311565
    2011-02-03   -2.541235
    2011-02-06    0.317075
    2011-02-09    1.598877
    2011-02-12   -1.950509
    2011-02-15    2.928312
    2011-02-18   -0.733715
    2011-02-21    1.674817
    2011-02-24   -2.078872
    2011-02-27    2.172320
    2011-03-02   -2.022104
    2011-03-05   -0.070356
    2011-03-08    1.276671
    2011-03-11   -2.835132
    2011-03-14   -1.384113
    2011-03-17    1.517565
    2011-03-20   -0.550406
    2011-03-23    0.773430
    2011-03-26    2.244319
    2011-03-29    2.951082
    Freq: 3D, dtype: float64

    # Downsampling: daily -> 3-day means
    day3Ts = ts.resample('3D').mean()
    day3Ts

    2011-01-01    0.015214
    2011-01-04   -0.751735
    2011-01-07    0.190381
    2011-01-10    0.278344
    2011-01-13   -0.132255
    2011-01-16   -0.385418
    2011-01-19   -0.428961
    2011-01-22    0.961317
    2011-01-25    0.522303
    2011-01-28    0.478521
    2011-01-31    0.103855
    2011-02-03   -0.847078
    2011-02-06    0.105692
    2011-02-09    0.532959
    2011-02-12   -0.650170
    2011-02-15    0.976104
    2011-02-18   -0.244572
    2011-02-21    0.558272
    2011-02-24   -0.692957
    2011-02-27    0.724107
    2011-03-02   -0.674035
    2011-03-05   -0.023452
    2011-03-08    0.425557
    2011-03-11   -0.945044
    2011-03-14   -0.461371
    2011-03-17    0.505855
    2011-03-20   -0.183469
    2011-03-23    0.257810
    2011-03-26    0.748106
    2011-03-29    0.983694
    Freq: 3D, dtype: float64

    # Upsampling: 3-day -> daily; days without a source value become NaN
    print(day3Ts.resample('D').asfreq())

    2011-01-01    0.015214
    2011-01-02         NaN
    2011-01-03         NaN
    2011-01-04   -0.751735
    2011-01-05         NaN
    2011-01-06         NaN
    2011-01-07    0.190381
    2011-01-08         NaN
    2011-01-09         NaN
    2011-01-10    0.278344
    2011-01-11         NaN
    2011-01-12         NaN
    2011-01-13   -0.132255
    2011-01-14         NaN
    2011-01-15         NaN
    2011-01-16   -0.385418
    2011-01-17         NaN
    2011-01-18         NaN
    2011-01-19   -0.428961
    2011-01-20         NaN
    2011-01-21         NaN
    2011-01-22    0.961317
    Freq: D, Length: 88, dtype: float64
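The direction of each conversion is easier to follow on a tiny deterministic series. The sketch below, using values 0 to 8 as an assumption for this example, downsamples daily data to 3-day means and then upsamples the result back to daily rows.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2011-01-01", periods=9, freq="D")
ts = pd.Series(np.arange(9, dtype=float), index=rng)

# Downsampling: each 3-day bin is reduced to one value.
down = ts.resample("3D").mean()      # bins hold (0,1,2), (3,4,5), (6,7,8)

# Upsampling: new daily rows appear; rows with no source value are NaN.
up = down.resample("D").asfreq()

print(down)
print(up)
```

Downsampling always needs an aggregation (sum, mean, and so on), while upsampling needs a rule for the newly created gaps.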
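- ffill: fill a missing value with the previous value
- bfill: fill a missing value with the next value
- interpolate: fill missing values by linear interpolation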
    # Forward fill, at most 1 consecutive missing value per gap
    day3Ts.resample('D').ffill(limit=1)

    2011-01-01    0.015214
    2011-01-02    0.015214
    2011-01-03         NaN
    2011-01-04   -0.751735
    2011-01-05   -0.751735
    2011-01-06         NaN
    2011-01-07    0.190381
    2011-01-08    0.190381
    2011-01-09         NaN
    2011-01-10    0.278344
    2011-01-11    0.278344

    # Backward fill, at most 1 consecutive missing value per gap
    day3Ts.resample('D').bfill(limit=1)

    2011-01-01    0.015214
    2011-01-02         NaN
    2011-01-03   -0.751735
    2011-01-04   -0.751735
    2011-01-05         NaN
    2011-01-06    0.190381
    2011-01-07    0.190381
    2011-01-08         NaN
    2011-01-09    0.278344
    2011-01-10    0.278344
    2011-01-11         NaN
    2011-01-12   -0.132255
    2011-01-13   -0.132255

    # Linear interpolation between the known points
    day3Ts.resample('D').interpolate('linear')

    2011-01-01    0.015214
    2011-01-02   -0.240435
    2011-01-03   -0.496085
    2011-01-04   -0.751735
    2011-01-05   -0.437697
    2011-01-06   -0.123658
    2011-01-07    0.190381
    2011-01-08    0.219702
    2011-01-09    0.249023
    2011-01-10    0.278344
    2011-01-11    0.141478
    2011-01-12    0.004611
    2011-01-13   -0.132255
    2011-01-14   -0.216643
    2011-01-15   -0.301030
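The three fill strategies can be compared side by side on a series whose values are easy to interpolate by hand. The numbers below are invented for this example.

```python
import pandas as pd

idx = pd.date_range("2011-01-01", periods=3, freq="3D")
sparse = pd.Series([0.0, 3.0, 9.0], index=idx)

forward = sparse.resample("D").ffill(limit=1)    # copy the previous value, one step at most
backward = sparse.resample("D").bfill(limit=1)   # copy the next value, one step at most
linear = sparse.resample("D").interpolate("linear")

print(pd.DataFrame({"ffill": forward, "bfill": backward, "linear": linear}))
```

With `limit=1` each two-day gap keeps one NaN, whereas linear interpolation fills every gap with evenly spaced values.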
3 Sliding windows
- Rolling window computation
    %matplotlib inline
    import matplotlib.pylab
    import numpy as np
    import pandas as pd

    # 600 days of random daily data
    df = pd.Series(np.random.randn(600),
                   index=pd.date_range('7/1/2016', freq='D', periods=600))
    df.head()

    2016-07-01   -0.192140
    2016-07-02    0.357953
    2016-07-03   -0.201847
    2016-07-04   -0.372230
    2016-07-05    1.414753
    Freq: D, dtype: float64

    # 10-day rolling window; the first 9 results are NaN because the window is not yet full
    r = df.rolling(window=10)
    # Also available: r.max, r.median, r.std, r.skew, r.sum, r.var
    print(r.mean())

    2016-07-01         NaN
    2016-07-02         NaN
    2016-07-03         NaN
    2016-07-04         NaN
    2016-07-05         NaN
    2016-07-06         NaN
    2016-07-07         NaN
    2016-07-08         NaN
    2016-07-09         NaN
    2016-07-10    0.300133
    2016-07-11    0.284780
    2016-07-12    0.252831
    2016-07-13    0.220699
    2016-07-14    0.167137
    2016-07-15    0.018593
    2016-07-16   -0.061414
    2016-07-17   -0.134593
    2016-07-18   -0.153333
    2016-07-19   -0.218928
    2016-07-20   -0.169426
    2016-07-21   -0.219747
    2016-07-22   -0.181266
    2016-07-23   -0.173674
    2016-07-24   -0.130629
    2016-07-25   -0.166730
    2016-07-26   -0.233044
    2016-07-27   -0.256642
    2016-07-28   -0.280738
    2016-07-29   -0.289893
    2016-07-30   -0.379625
                    ...
    2018-01-22   -0.211467
    2018-01-23    0.034996
    2018-01-24   -0.105910
    2018-01-25   -0.145774
    2018-01-26   -0.089320
    2018-01-27   -0.164370
    2018-01-28   -0.110892
    2018-01-29   -0.205786
    2018-01-30   -0.101162
    2018-01-31   -0.034760
    2018-02-01    0.229333
    2018-02-02    0.043741
    2018-02-03    0.052837
    2018-02-04    0.057746
    2018-02-05   -0.071401
    2018-02-06   -0.011153
    2018-02-07   -0.045737
    2018-02-08   -0.021983
    2018-02-09   -0.196715
    2018-02-10   -0.063721
    2018-02-11   -0.289452
    2018-02-12   -0.050946
    2018-02-13   -0.047014
    2018-02-14    0.048754
    2018-02-15    0.143949
    2018-02-16    0.424823
    2018-02-17    0.361878
    2018-02-18    0.363235
    2018-02-19    0.517436
    2018-02-20    0.368020
    Freq: D, Length: 600, dtype: float64
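The leading NaN values come from the default rule that a window must be full before it yields a result. A minimal sketch on six known values (chosen for this example) shows that rule and how `min_periods` relaxes it.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1.0, 7.0),
              index=pd.date_range("2016-07-01", periods=6, freq="D"))

# Default: a result appears only once 3 observations are available.
strict = s.rolling(window=3).mean()

# min_periods=1: partial windows at the start are averaged as well.
lenient = s.rolling(window=3, min_periods=1).mean()

# Manual check of one window: the value on 2016-07-04 averages days 2 to 4.
manual = s.iloc[1:4].mean()

print(pd.DataFrame({"strict": strict, "lenient": lenient}))
print(manual)
```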
- Visualization
    import matplotlib.pyplot as plt
    %matplotlib inline

    plt.figure(figsize=(15, 5))
    df.plot(style='r--')                          # raw series, red dashed line
    df.rolling(window=10).mean().plot(style='b')  # 10-day rolling mean, blue line
4 ARIMA forecasting
- Data preprocessing
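    import datetime

    import matplotlib.pylab as plt
    import pandas as pd
    import pandas_datareader
    import seaborn as sns
    from matplotlib.pylab import style
    # Legacy import path. In statsmodels 0.13 and later this module no longer works;
    # the replacement is statsmodels.tsa.arima.model.ARIMA (whose predict() has no typ argument).
    from statsmodels.tsa.arima_model import ARIMA
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    style.use('ggplot')
    plt.rcParams['font.sans-serif'] = ['SimHei']   # font that can render Chinese labels
    plt.rcParams['axes.unicode_minus'] = False     # render minus signs correctly with that font

    # Load the data, using the first column as a parsed date index
    stockFile = 'data/T10yr.csv'
    stock = pd.read_csv(stockFile, index_col=0, parse_dates=[0])
    stock.head(10)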
    # Weekly mean of the closing price, with weeks anchored on Monday
    stock_week = stock['Close'].resample('W-MON').mean()
    # Training window: 2000 through 2015
    stock_train = stock_week['2000':'2015']

    stock_train.plot(figsize=(12, 8))
    plt.legend(bbox_to_anchor=(1.25, 0.5))
    plt.title("Stock Close")
    sns.despine()
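The `'W-MON'` alias groups daily rows into weeks that end on a Monday and labels each week by that Monday. A small sketch with made-up values (the real data file is not reproduced here) makes the bin edges visible.

```python
import numpy as np
import pandas as pd

# 14 consecutive days starting on Monday 2016-07-04, with values 0..13.
days = pd.date_range("2016-07-04", periods=14, freq="D")
close = pd.Series(np.arange(14, dtype=float), index=days)

weekly = close.resample("W-MON").mean()

print(weekly)
print(weekly.index.dayofweek.tolist())   # 0 means Monday
```

Each bin covers the days after the previous Monday up to and including the labelled Monday, so the first bin here holds only the starting Monday itself.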
    # First-order difference removes the trend; the first value is NaN and is dropped
    stock_diff = stock_train.diff()
    stock_diff = stock_diff.dropna()

    plt.figure()
    plt.plot(stock_diff)
    plt.title('First-order difference')
    plt.show()
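Differencing is the "I" (integrated) part of ARIMA: the model is fitted on the differenced series and forecasts are summed back to the original level. The sketch below, on invented numbers, shows what one and two rounds of differencing do and how a cumulative sum undoes the first round.

```python
import pandas as pd

level = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0])

d1 = level.diff().dropna()   # first difference: x[t] - x[t-1]
d2 = d1.diff().dropna()      # second difference: change of the change

# Undo the first difference: start value plus the running sum of differences.
restored = level.iloc[0] + d1.cumsum()

print(d1.tolist())
print(d2.tolist())
print(restored.tolist())
```

Here the second difference is constant, which is the kind of flat behaviour one looks for when deciding how many times to difference.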
    # Autocorrelation of the differenced series, up to lag 20
    acf = plot_acf(stock_diff, lags=20)
    plt.title("ACF")
    acf.show()
    # Partial autocorrelation of the differenced series, up to lag 20
    pacf = plot_pacf(stock_diff, lags=20)
    plt.title("PACF")
    pacf.show()
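For readers who want to see what an ACF plot is built from, here is a minimal NumPy sketch of the standard sample autocorrelation. The function name `sample_acf` is hypothetical, and library implementations may differ in details such as normalization or confidence bands.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation r_k for k = 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    n = len(x)
    return np.array([
        np.sum(centered[k:] * centered[:n - k]) / denom
        for k in range(nlags + 1)
    ])

# A strictly alternating series is strongly negatively correlated at lag 1.
alternating = np.array([1.0, -1.0] * 4)
r = sample_acf(alternating, nlags=2)
print(r)
```

In the usual Box-Jenkins reading, a sharp cutoff in the ACF suggests the MA order q and a sharp cutoff in the PACF suggests the AR order p.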
    # ARIMA(p=1, d=1, q=1) on the weekly training series
    model = ARIMA(stock_train, order=(1, 1, 1), freq='W-MON')
    result = model.fit()
    # print(result.summary())

    # typ='levels' (legacy API) returns predictions on the original scale rather than as differences;
    # dynamic=True feeds earlier predictions back in instead of observed values
    pred = result.predict('20140609', '20160701', dynamic=True, typ='levels')
    print(pred)

    2014-06-09    2.463559
    2014-06-16    2.455539
    2014-06-23    2.449569
    2014-06-30    2.444183
    2014-07-07    2.438962
    2014-07-14    2.433788
    2014-07-21    2.428627
    2014-07-28    2.423470
    2014-08-04    2.418315
    2014-08-11    2.413159
    2014-08-18    2.408004
    2014-08-25    2.402849
    2014-09-01    2.397693
    2014-09-08    2.392538
    2014-09-15    2.387383

    # Plot the prediction together with the training series
    plt.figure(figsize=(6, 6))
    plt.xticks(rotation=45)
    plt.plot(pred)
    plt.plot(stock_train)
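To show the mechanics behind such a forecast without depending on a particular statsmodels version, the sketch below fits a simpler ARIMA(1,1,0) with drift by ordinary least squares in NumPy. This is a swapped-in, simplified technique and not the estimator the library uses, and it has no MA term, unlike the (1,1,1) model above. The function names and the toy data are assumptions made for this example.

```python
import numpy as np

def fit_ar1_on_differences(y):
    """Fit d[t] = c + phi * d[t-1] by least squares, where d = diff(y)."""
    d = np.diff(np.asarray(y, dtype=float))
    design = np.column_stack([np.ones(len(d) - 1), d[:-1]])
    (c, phi), *_ = np.linalg.lstsq(design, d[1:], rcond=None)
    return c, phi

def forecast_levels(y, c, phi, steps):
    """Iterate the AR(1) on differences and add them back onto the last level."""
    level = float(y[-1])
    diff = float(y[-1] - y[-2])
    out = []
    for _ in range(steps):
        diff = c + phi * diff    # next predicted difference
        level = level + diff     # integrate back to the original scale
        out.append(level)
    return np.array(out)

# Noise-free toy series whose differences follow d[t] = 0.2 + 0.5 * d[t-1].
diffs = [1.0]
for _ in range(7):
    diffs.append(0.2 + 0.5 * diffs[-1])
y = np.concatenate([[0.0], np.cumsum(diffs)])

c, phi = fit_ar1_on_differences(y)
pred = forecast_levels(y, c, phi, steps=3)
print(c, phi)
print(pred)
```

The two steps mirror the structure of the model: an autoregression on the differenced series, followed by a running sum that returns the forecast to the original level, which is what `typ='levels'` requests in the legacy API.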