Time Series Preprocessing and ARIMA-Based Trend Forecasting: A Hands-On Big Data ML Sample Case

Category: R language · Published: 6 years ago

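Summary: Copyright notice: this technical column is the author's (秦凯新) summary and distillation of day-to-day work. It draws its cases from real commercial environments and offers tuning advice for business applications as well as capacity planning for cluster environments. Please keep following this blog series. QQ email: 1120746959@qq.com. Feel free to get in touch for any academic exchange.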

1 Data preprocessing

Generating time series data

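import datetime as dt

import numpy as np
import pandas as pd

# pd.date_range takes a start time, a number of periods and a frequency.
# Common frequency aliases: H = hour, D = day, M = month end
# (pandas 2.2+ prefers 'h' and 'ME' for hour and month end)

# Accepted date spellings include:
# 2016 Jul 1, 7/1/2016, 1/7/2016, 2016-07-01, 2016/07/01
rng = pd.date_range('2016-07-01', periods=10, freq='3D')
rng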
  
  DatetimeIndex(['2016-07-01', '2016-07-04', '2016-07-07', '2016-07-10',
         '2016-07-13', '2016-07-16', '2016-07-19', '2016-07-22',
         '2016-07-25', '2016-07-28'],
        dtype='datetime64[ns]', freq='3D')
        

time = pd.Series(np.random.randn(20),
                 index=pd.date_range(dt.datetime(2016, 1, 1), periods=20))
print(time)
   
  2016-01-01   -0.129379
  2016-01-02    0.164480
  2016-01-03   -0.639117
  2016-01-04   -0.427224
  2016-01-05    2.055133
  2016-01-06    1.116075
  2016-01-07    0.357426
  2016-01-08    0.274249
  2016-01-09    0.834405
  2016-01-10   -0.005444
  2016-01-11   -0.134409
  2016-01-12    0.249318
  2016-01-13   -0.297842
  2016-01-14   -0.128514
  2016-01-15    0.063690
  2016-01-16   -2.246031
  2016-01-17    0.359552
  2016-01-18    0.383030
  2016-01-19    0.402717
  2016-01-20   -0.694068
  Freq: D, dtype: float64
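
The comment in the listing above names several ways to spell the same date. The following is a minimal sketch, not part of the original article, that checks which spellings pandas treats as 1 July 2016. The variable names are illustrative only.

```python
import pandas as pd

# Unambiguous spellings of 1 July 2016 all parse to the same Timestamp.
formats = ['2016 Jul 1', '7/1/2016', '2016-07-01', '2016/07/01']
parsed = [pd.Timestamp(s) for s in formats]
all_same = all(p == pd.Timestamp(2016, 7, 1) for p in parsed)

# '1/7/2016' is ambiguous: pandas reads it month-first unless told otherwise.
month_first = pd.to_datetime('1/7/2016')               # 7 January 2016
day_first = pd.to_datetime('1/7/2016', dayfirst=True)  # 1 July 2016

print(all_same, month_first.date(), day_first.date())
```

When the data uses day-first dates, passing dayfirst=True explicitly (or using ISO 8601 strings throughout) avoids silently swapping day and month.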

Filtering with truncate

# Keep only the rows on or after 2016-01-10
time.truncate(before='2016-1-10')
  2016-01-10   -0.005444
  2016-01-11   -0.134409
  2016-01-12    0.249318
  2016-01-13   -0.297842
  2016-01-14   -0.128514
  2016-01-15    0.063690
  2016-01-16   -2.246031
  2016-01-17    0.359552
  2016-01-18    0.383030
  2016-01-19    0.402717
  2016-01-20   -0.694068
  Freq: D, dtype: float64
  
# Keep only the rows on or before 2016-01-10
time.truncate(after='2016-1-10')
  2016-01-01   -0.129379
  2016-01-02    0.164480
  2016-01-03   -0.639117
  2016-01-04   -0.427224
  2016-01-05    2.055133
  2016-01-06    1.116075
  2016-01-07    0.357426
  2016-01-08    0.274249
  2016-01-09    0.834405
  2016-01-10   -0.005444
  Freq: D, dtype: float64
  
# Slicing with date strings is label-based and includes both endpoints
print(time['2016-01-15':'2016-01-20'])
  2016-01-15    0.063690
  2016-01-16   -2.246031
  2016-01-17    0.359552
  2016-01-18    0.383030
  2016-01-19    0.402717
  2016-01-20   -0.694068
  Freq: D, dtype: float64
  
# Month-end dates between the two bounds ('M' = month end; 'ME' in pandas 2.2+)
data = pd.date_range('2010-01-01', '2011-01-01', freq='M')
print(data)
  
  DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30',
         '2010-05-31', '2010-06-30', '2010-07-31', '2010-08-31',
         '2010-09-30', '2010-10-31', '2010-11-30', '2010-12-31'],
        dtype='datetime64[ns]', freq='M')
        
        
# Use the date range as the index of a Series
rng = pd.date_range('2016 Jul 1', periods=10, freq='D')
rng
pd.Series(range(len(rng)), index=rng)
  
  2016-07-01    0
  2016-07-02    1
  2016-07-03    2
  2016-07-04    3
  2016-07-05    4
  2016-07-06    5
  2016-07-07    6
  2016-07-08    7
  2016-07-09    8
  2016-07-10    9
  Freq: D, dtype: int32
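
As a small illustration of how truncate relates to label slicing, here is a sketch on a deterministic series (values 0 to 19 in place of the random numbers above, an assumption made so the result is reproducible).

```python
import pandas as pd

# Deterministic stand-in for the random series used above (values 0..19).
s = pd.Series(range(20), index=pd.date_range('2016-01-01', periods=20))

tail = s.truncate(before='2016-01-10')                         # 10 Jan .. 20 Jan
head = s.truncate(after='2016-01-10')                          # 1 Jan .. 10 Jan
middle = s.truncate(before='2016-01-05', after='2016-01-10')   # both bounds at once

# The same selections written as inclusive label slices.
same_tail = tail.equals(s['2016-01-10':])
same_middle = middle.equals(s['2016-01-05':'2016-01-10'])

print(len(tail), len(head), len(middle), same_tail, same_middle)
```

Both forms keep the boundary dates, so the choice between them is mostly a matter of readability.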

Specifying the index

# Use monthly Period objects, rather than timestamps, as the index
periods = [pd.Period('2016-01'), pd.Period('2016-02'), pd.Period('2016-03')]
ts = pd.Series(np.random.randn(len(periods)), index=periods)
ts
  
  # Output: a float64 Series of three random values indexed by the
  # monthly periods 2016-01, 2016-02 and 2016-03 (Freq: M)
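
To make the idea of a period index more concrete, the sketch below (an illustration with made-up values 10, 20, 30, not the author's data) shows that a Period covers a span of time and can be used as a label.

```python
import pandas as pd

# A Period stands for a whole span of time, not a single instant.
p = pd.Period('2016-01')          # frequency inferred as monthly
print(p.start_time, p.end_time)

# period_range builds the same kind of index without listing each Period.
idx = pd.period_range('2016-01', periods=3, freq='M')
ts = pd.Series([10, 20, 30], index=idx)

# A period can be used as a label, and arithmetic moves in whole periods.
feb_value = ts[pd.Period('2016-02')]
next_period = p + 1
print(feb_value, next_period)
```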

Timestamps and periods can be converted into each other

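# Ten hourly timestamps starting at 08:00 on 10 July 2016
# ('H' = hourly; pandas 2.2+ prefers the lowercase alias 'h')
ts = pd.Series(range(10), pd.date_range('07-10-16 8:00', periods=10, freq='H'))
ts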
  
  2016-07-10 08:00:00    0
  2016-07-10 09:00:00    1
  2016-07-10 10:00:00    2
  2016-07-10 11:00:00    3
  2016-07-10 12:00:00    4
  2016-07-10 13:00:00    5
  2016-07-10 14:00:00    6
  2016-07-10 15:00:00    7
  2016-07-10 16:00:00    8
  2016-07-10 17:00:00    9
  Freq: H, dtype: int32

# Convert the timestamp index into a period index
ts_period = ts.to_period()
ts_period
  
  2016-07-10 08:00    0
  2016-07-10 09:00    1
  2016-07-10 10:00    2
  2016-07-10 11:00    3
  2016-07-10 12:00    4
  2016-07-10 13:00    5
  2016-07-10 14:00    6
  2016-07-10 15:00    7
  2016-07-10 16:00    8
  2016-07-10 17:00    9
  Freq: H, dtype: int32
  
# Each hourly period covers the whole hour, so the 08:00 period contains 08:30 and is included
ts_period['2016-07-10 08:30':'2016-07-10 11:45']
  
  2016-07-10 08:00    0
  2016-07-10 09:00    1
  2016-07-10 10:00    2
  2016-07-10 11:00    3
  Freq: H, dtype: int32
  
# A timestamp is a single instant, so 08:00 lies before 08:30 and is excluded
ts['2016-07-10 08:30':'2016-07-10 11:45']
  
  2016-07-10 09:00:00    1
  2016-07-10 10:00:00    2
  2016-07-10 11:00:00    3
  Freq: H, dtype: int32
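
The two slices above return different rows because a period is an interval while a timestamp is an instant. A minimal sketch of that difference, and of the round trip between the two index types, might look like this (the lowercase alias 'h' is used here in place of the 'H' in the listing above).

```python
import pandas as pd

# 'h' is the hourly alias; the listing above uses the older spelling 'H'.
ts = pd.Series(range(10), pd.date_range('2016-07-10 08:00', periods=10, freq='h'))
ts_period = ts.to_period()

# The first hourly period spans 08:00 up to just before 09:00 ...
first = ts_period.index[0]
contains_0830 = first.start_time <= pd.Timestamp('2016-07-10 08:30') <= first.end_time

# ... whereas the first timestamp is the single instant 08:00.
before_0830 = ts.index[0] < pd.Timestamp('2016-07-10 08:30')

# Converting back yields the start of each period, i.e. the original index.
round_trip = ts_period.to_timestamp()
same_index = round_trip.index.equals(ts.index)

print(contains_0830, before_0830, same_index)
```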

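2 Resampling

Resampling converts time series data from one frequency to another.
Downsampling: from a higher frequency to a lower one (for example daily to monthly), which requires aggregating values.

Upsampling: from a lower frequency to a higher one (for example every 3 days to daily), which requires filling in the new time points.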

# 90 days of random daily data
rng = pd.date_range('1/1/2011', periods=90, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.head()
  
  2011-01-01   -1.025562
  2011-01-02    0.410895
  2011-01-03    0.660311
  2011-01-04    0.710293
  2011-01-05    0.444985
  Freq: D, dtype: float64
  
# Downsample to monthly totals
ts.resample('M').sum()
  
  2011-01-31    2.510102
  2011-02-28    0.583209
  2011-03-31    2.749411
  Freq: M, dtype: float64
  
# Downsample to 3-day totals
ts.resample('3D').sum()
  
  2011-01-01    0.045643
  2011-01-04   -2.255206
  2011-01-07    0.571142
  2011-01-10    0.835032
  2011-01-13   -0.396766
  2011-01-16   -1.156253
  2011-01-19   -1.286884
  2011-01-22    2.883952
  2011-01-25    1.566908
  2011-01-28    1.435563
  2011-01-31    0.311565
  2011-02-03   -2.541235
  2011-02-06    0.317075
  2011-02-09    1.598877
  2011-02-12   -1.950509
  2011-02-15    2.928312
  2011-02-18   -0.733715
  2011-02-21    1.674817
  2011-02-24   -2.078872
  2011-02-27    2.172320
  2011-03-02   -2.022104
  2011-03-05   -0.070356
  2011-03-08    1.276671
  2011-03-11   -2.835132
  2011-03-14   -1.384113
  2011-03-17    1.517565
  2011-03-20   -0.550406
  2011-03-23    0.773430
  2011-03-26    2.244319
  2011-03-29    2.951082
  Freq: 3D, dtype: float64

# Downsample to 3-day means
day3Ts = ts.resample('3D').mean()
day3Ts
  
  2011-01-01    0.015214
  2011-01-04   -0.751735
  2011-01-07    0.190381
  2011-01-10    0.278344
  2011-01-13   -0.132255
  2011-01-16   -0.385418
  2011-01-19   -0.428961
  2011-01-22    0.961317
  2011-01-25    0.522303
  2011-01-28    0.478521
  2011-01-31    0.103855
  2011-02-03   -0.847078
  2011-02-06    0.105692
  2011-02-09    0.532959
  2011-02-12   -0.650170
  2011-02-15    0.976104
  2011-02-18   -0.244572
  2011-02-21    0.558272
  2011-02-24   -0.692957
  2011-02-27    0.724107
  2011-03-02   -0.674035
  2011-03-05   -0.023452
  2011-03-08    0.425557
  2011-03-11   -0.945044
  2011-03-14   -0.461371
  2011-03-17    0.505855
  2011-03-20   -0.183469
  2011-03-23    0.257810
  2011-03-26    0.748106
  2011-03-29    0.983694
  Freq: 3D, dtype: float64
  
# Upsampling: going from 3-day back to daily frequency leaves NaN on the new dates
print(day3Ts.resample('D').asfreq())
  
  2011-01-01    0.015214
  2011-01-02         NaN
  2011-01-03         NaN
  2011-01-04   -0.751735
  2011-01-05         NaN
  2011-01-06         NaN
  2011-01-07    0.190381
  2011-01-08         NaN
  2011-01-09         NaN
  2011-01-10    0.278344
  2011-01-11         NaN
  2011-01-12         NaN
  2011-01-13   -0.132255
  2011-01-14         NaN
  2011-01-15         NaN
  2011-01-16   -0.385418
  2011-01-17         NaN
  2011-01-18         NaN
  2011-01-19   -0.428961
  2011-01-20         NaN
  2011-01-21         NaN
  2011-01-22    0.961317
  Freq: D, Length: 88, dtype: float64
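
Because the listings above use random data, the arithmetic is hard to follow by eye. The sketch below repeats the same downsample-then-upsample steps on a small deterministic series (values 0 to 8, chosen here only for illustration).

```python
import pandas as pd

daily = pd.Series(range(9), index=pd.date_range('2011-01-01', periods=9, freq='D'))

# Downsampling: each 3-day bin collapses to one aggregated value.
three_day_sum = daily.resample('3D').sum()     # 0+1+2, 3+4+5, 6+7+8
three_day_mean = daily.resample('3D').mean()

# Upsampling: new daily rows appear between the bin labels and start out empty.
back_to_daily = three_day_mean.resample('D').asfreq()

print(three_day_sum.tolist())
print(back_to_daily)
```

The upsampled index stops at the label of the last bin (its first day), which is why the 90-day series above comes back with 88 rows rather than 90.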

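ffill: fill a missing value with the value before it
bfill: fill a missing value with the value after it

interpolate: fill missing values by linear interpolation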

# Forward fill, copying each value into at most one following gap
day3Ts.resample('D').ffill(limit=1)
 
  2011-01-01    0.015214
  2011-01-02    0.015214
  2011-01-03         NaN
  2011-01-04   -0.751735
  2011-01-05   -0.751735
  2011-01-06         NaN
  2011-01-07    0.190381
  2011-01-08    0.190381
  2011-01-09         NaN
  2011-01-10    0.278344
  2011-01-11    0.278344
  
# Backward fill, copying each value into at most one preceding gap
day3Ts.resample('D').bfill(limit=1)
  2011-01-01    0.015214
  2011-01-02         NaN
  2011-01-03   -0.751735
  2011-01-04   -0.751735
  2011-01-05         NaN
  2011-01-06    0.190381
  2011-01-07    0.190381
  2011-01-08         NaN
  2011-01-09    0.278344
  2011-01-10    0.278344
  2011-01-11         NaN
  2011-01-12   -0.132255
  2011-01-13   -0.132255

# Fill the gaps on a straight line between neighbouring known values
day3Ts.resample('D').interpolate('linear')
  2011-01-01    0.015214
  2011-01-02   -0.240435
  2011-01-03   -0.496085
  2011-01-04   -0.751735
  2011-01-05   -0.437697
  2011-01-06   -0.123658
  2011-01-07    0.190381
  2011-01-08    0.219702
  2011-01-09    0.249023
  2011-01-10    0.278344
  2011-01-11    0.141478
  2011-01-12    0.004611
  2011-01-13   -0.132255
  2011-01-14   -0.216643
  2011-01-15   -0.301030
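
To compare the three fill strategies side by side, here is a sketch on a tiny series with known values 0, 3 and 6 (an assumption made so each filled value can be checked by hand). The column names are illustrative.

```python
import pandas as pd

sparse = pd.Series([0.0, 3.0, 6.0],
                   index=pd.date_range('2011-01-01', periods=3, freq='3D'))
upsampled = sparse.resample('D')

filled = pd.DataFrame({
    'asfreq': upsampled.asfreq(),                # gaps left as NaN
    'ffill_1': upsampled.ffill(limit=1),         # previous value, at most one row forward
    'bfill_1': upsampled.bfill(limit=1),         # next value, at most one row back
    'linear': upsampled.interpolate('linear'),   # straight line between known points
})
print(filled)
```

With limit=1 only one of the two missing days in each gap is filled, whereas linear interpolation fills every gap.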

3 Sliding windows

Sliding-window computation

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 600 days of random daily data (a Series, despite the name df)
df = pd.Series(np.random.randn(600), index=pd.date_range('7/1/2016', freq='D', periods=600))
df.head()
  
  2016-07-01   -0.192140
  2016-07-02    0.357953
  2016-07-03   -0.201847
  2016-07-04   -0.372230
  2016-07-05    1.414753
  Freq: D, dtype: float64

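# Rolling window over the 10 most recent observations
r = df.rolling(window=10)
# Other aggregations: r.max(), r.median(), r.std(), r.skew(), r.sum(), r.var()
# The first 9 results are NaN because the window is not yet full
print(r.mean())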
  
  2016-07-01         NaN
  2016-07-02         NaN
  2016-07-03         NaN
  2016-07-04         NaN
  2016-07-05         NaN
  2016-07-06         NaN
  2016-07-07         NaN
  2016-07-08         NaN
  2016-07-09         NaN
  2016-07-10    0.300133
  2016-07-11    0.284780
  2016-07-12    0.252831
  2016-07-13    0.220699
  2016-07-14    0.167137
  2016-07-15    0.018593
  2016-07-16   -0.061414
  2016-07-17   -0.134593
  2016-07-18   -0.153333
  2016-07-19   -0.218928
  2016-07-20   -0.169426
  2016-07-21   -0.219747
  2016-07-22   -0.181266
  2016-07-23   -0.173674
  2016-07-24   -0.130629
  2016-07-25   -0.166730
  2016-07-26   -0.233044
  2016-07-27   -0.256642
  2016-07-28   -0.280738
  2016-07-29   -0.289893
  2016-07-30   -0.379625
                  ...   
  2018-01-22   -0.211467
  2018-01-23    0.034996
  2018-01-24   -0.105910
  2018-01-25   -0.145774
  2018-01-26   -0.089320
  2018-01-27   -0.164370
  2018-01-28   -0.110892
  2018-01-29   -0.205786
  2018-01-30   -0.101162
  2018-01-31   -0.034760
  2018-02-01    0.229333
  2018-02-02    0.043741
  2018-02-03    0.052837
  2018-02-04    0.057746
  2018-02-05   -0.071401
  2018-02-06   -0.011153
  2018-02-07   -0.045737
  2018-02-08   -0.021983
  2018-02-09   -0.196715
  2018-02-10   -0.063721
  2018-02-11   -0.289452
  2018-02-12   -0.050946
  2018-02-13   -0.047014
  2018-02-14    0.048754
  2018-02-15    0.143949
  2018-02-16    0.424823
  2018-02-17    0.361878
  2018-02-18    0.363235
  2018-02-19    0.517436
  2018-02-20    0.368020
  Freq: D, Length: 600, dtype: float64
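
The leading NaN values in the output above come from the window not being full yet. A minimal sketch on five known values (chosen for illustration) shows this and the effect of the min_periods parameter.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a result needs a full window, so the first window-1 entries are NaN.
strict = s.rolling(window=3).mean()

# min_periods=1 averages whatever is available while the window fills up.
lenient = s.rolling(window=3, min_periods=1).mean()

# The third value is simply the mean of the first three observations.
by_hand = s.iloc[0:3].mean()

print(strict.tolist())
print(lenient.tolist())
```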

Visualization

import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(15, 5))

# Raw series as a red dashed line, 10-day rolling mean as a solid blue line
df.plot(style='r--')
df.rolling(window=10).mean().plot(style='b')

4 Forecasting with ARIMA

Data preprocessing

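import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Legacy ARIMA class: deprecated in statsmodels 0.12 and removed in later releases,
# which provide statsmodels.tsa.arima.model.ARIMA instead
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plt.style.use('ggplot')
# SimHei lets matplotlib render Chinese labels (the font must be installed)
plt.rcParams['font.sans-serif'] = ['SimHei']
# Keep the minus sign rendering correctly with a non-default font
plt.rcParams['axes.unicode_minus'] = False

# Load the data, using the first column as a parsed date index
stockFile = 'data/T10yr.csv'
stock = pd.read_csv(stockFile, index_col=0, parse_dates=[0])
stock.head(10)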

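# Downsample the daily closing values to weekly means (weeks ending on Monday)
stock_week = stock['Close'].resample('W-MON').mean()
# Train on the years 2000 through 2015
stock_train = stock_week['2000':'2015']
stock_train.plot(figsize=(12, 8))
plt.legend(bbox_to_anchor=(1.25, 0.5))
plt.title("Stock Close")
sns.despine()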
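
The 'W-MON' rule can be surprising at first. The sketch below, on a deterministic two-week series rather than the T10yr data, illustrates how the weekly bins are formed and labelled, and how a year string selects rows.

```python
import pandas as pd

# 1 July 2016 is a Friday; values 0..13 cover two calendar weeks.
daily = pd.Series(range(14), index=pd.date_range('2016-07-01', periods=14, freq='D'))

# 'W-MON' bins run from Tuesday through Monday and are labelled by that Monday.
weekly = daily.resample('W-MON').mean()
print(weekly)

# Slicing with year strings keeps every row that falls inside those years.
in_2016 = weekly['2016':'2016']
print(len(in_2016))
```

The first and last bins are partial weeks here, so their means are taken over fewer than seven days.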

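# First-order differencing brings the series closer to stationarity;
# the first value has no predecessor, so it is NaN and gets dropped
stock_diff = stock_train.diff()
stock_diff = stock_diff.dropna()

plt.figure()
plt.plot(stock_diff)
plt.title('First-order difference')
plt.show()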
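
As a quick check of what diff() computes, the following sketch uses a made-up series with a constant upward step of 2 and shows that differencing removes the trend and that a cumulative sum restores it.

```python
import numpy as np
import pandas as pd

# A series with a steady upward trend: 10, 12, 14, ...
trend = pd.Series(10.0 + 2.0 * np.arange(6))

# diff() computes x[t] - x[t-1]; the first entry has no predecessor, hence NaN.
d1 = trend.diff().dropna()

# Undo the differencing: start value plus the running sum of the differences.
restored = trend.iloc[0] + d1.cumsum()

print(d1.tolist())
print(restored.tolist())
```

This is the d = 1 step of an ARIMA(p, d, q) model: the model is fitted on the differences and its predictions are integrated back to the original scale.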

# Autocorrelation of the differenced series up to lag 20
acf = plot_acf(stock_diff, lags=20)
plt.title("ACF")
plt.show()

# Partial autocorrelation of the differenced series up to lag 20
pacf = plot_pacf(stock_diff, lags=20)
plt.title("PACF")
plt.show()
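
To give a feel for the numbers behind an ACF plot, here is a sketch that computes the sample autocorrelation by hand with NumPy on a simulated AR(1) process. The helper sample_acf and the coefficient 0.7 are assumptions made for illustration and are not taken from the article or from statsmodels.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation r_0..r_nlags of a 1-D array."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    return np.array([np.dot(centered[k:], centered[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

# Hypothetical AR(1) process x[t] = 0.7 * x[t-1] + noise, used only as test data.
rng = np.random.default_rng(0)
noise = rng.standard_normal(2000)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.7 * x[t - 1] + noise[t]

r = sample_acf(x, nlags=5)
band = 1.96 / np.sqrt(len(x))   # rough 95% band for white noise
print(np.round(r, 2), round(band, 3))
```

For an AR(1) process the autocorrelation decays roughly geometrically with the lag. Lags whose value stays outside the rough band are the ones usually considered when choosing the p and q orders.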

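# ARIMA with p=1, d=1, q=1 on the weekly training series
model = ARIMA(stock_train, order=(1, 1, 1), freq='W-MON')
result = model.fit()
# print(result.summary())
# typ='levels' (legacy API) returns predictions on the original scale, not as differences;
# dynamic=True feeds predicted values, not observations, back in after the start date
pred = result.predict('20140609', '20160701', dynamic=True, typ='levels')
print(pred)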
    
    2014-06-09    2.463559
    2014-06-16    2.455539
    2014-06-23    2.449569
    2014-06-30    2.444183
    2014-07-07    2.438962
    2014-07-14    2.433788
    2014-07-21    2.428627
    2014-07-28    2.423470
    2014-08-04    2.418315
    2014-08-11    2.413159
    2014-08-18    2.408004
    2014-08-25    2.402849
    2014-09-01    2.397693
    2014-09-08    2.392538
    2014-09-15    2.387383
    
# Plot the predictions together with the training series
plt.figure(figsize=(6, 6))
plt.xticks(rotation=45)
plt.plot(pred)
plt.plot(stock_train)
