Time Series Preprocessing and Trend Forecasting with an ARIMA Model: A Big Data ML Sample Set Case Study

Category: R language · Published: 7 years ago

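Copyright notice: This technical column summarizes and refines the day-to-day work of the author (秦凯新). It draws its cases from real commercial environments and offers tuning advice for commercial applications as well as capacity planning for cluster environments. Please keep following this blog series. QQ email: 1120746959@qq.com. Feel free to get in touch for any academic exchange.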

1 Data Preprocessing

  • Generating time series data

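      import datetime as dt

      import numpy as np
      import pandas as pd

      # pd.date_range takes a start time, a number of periods and a frequency.
      # Common frequency aliases: H = hour, D = day, M = month.
      # Accepted spellings of the start date include '2016 Jul 1', '7/1/2016',
      # '2016-07-01' and '2016/07/01'; '1/7/2016' is read as 7 January by default.
      rng = pd.date_range('2016-07-01', periods=10, freq='3D')
      rng

      DatetimeIndex(['2016-07-01', '2016-07-04', '2016-07-07', '2016-07-10',
                     '2016-07-13', '2016-07-16', '2016-07-19', '2016-07-22',
                     '2016-07-25', '2016-07-28'],
                    dtype='datetime64[ns]', freq='3D')

      # 20 daily observations of random noise starting on 2016-01-01
      time = pd.Series(np.random.randn(20),
                       index=pd.date_range(dt.datetime(2016, 1, 1), periods=20))
      print(time)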
       
      2016-01-01   -0.129379
      2016-01-02    0.164480
      2016-01-03   -0.639117
      2016-01-04   -0.427224
      2016-01-05    2.055133
      2016-01-06    1.116075
      2016-01-07    0.357426
      2016-01-08    0.274249
      2016-01-09    0.834405
      2016-01-10   -0.005444
      2016-01-11   -0.134409
      2016-01-12    0.249318
      2016-01-13   -0.297842
      2016-01-14   -0.128514
      2016-01-15    0.063690
      2016-01-16   -2.246031
      2016-01-17    0.359552
      2016-01-18    0.383030
      2016-01-19    0.402717
      2016-01-20   -0.694068
      Freq: D, dtype: float64
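As a quick check of the date spellings mentioned in the comment above, here is a minimal sketch (an illustration added for this walkthrough, not part of the original notebook). It assumes pandas' default month-first parsing and uses `dayfirst=True` for the day-first spelling.

```python
import pandas as pd

# Unambiguous spellings of 1 July 2016 all parse to the same Timestamp.
spellings = ['2016 Jul 1', '2016-07-01', '2016/07/01']
stamps = [pd.Timestamp(s) for s in spellings]
same_day = all(t == pd.Timestamp(2016, 7, 1) for t in stamps)

# Slash-separated dates are read month-first unless dayfirst=True is passed.
month_first = pd.to_datetime('7/1/2016')
day_first = pd.to_datetime('1/7/2016', dayfirst=True)

# Ten timestamps spaced three days apart, starting on 2016-07-01.
rng = pd.date_range('2016-07-01', periods=10, freq='3D')
print(same_day, month_first == day_first, rng[-1])
```

Passing ISO-style strings such as '2016-07-01' avoids the month-first versus day-first ambiguity altogether.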
  • Filtering with truncate

      time.truncate(before='2016-1-10')
      2016-01-10   -0.005444
      2016-01-11   -0.134409
      2016-01-12    0.249318
      2016-01-13   -0.297842
      2016-01-14   -0.128514
      2016-01-15    0.063690
      2016-01-16   -2.246031
      2016-01-17    0.359552
      2016-01-18    0.383030
      2016-01-19    0.402717
      2016-01-20   -0.694068
      Freq: D, dtype: float64
      
      time.truncate(after='2016-1-10')
      2016-01-01   -0.129379
      2016-01-02    0.164480
      2016-01-03   -0.639117
      2016-01-04   -0.427224
      2016-01-05    2.055133
      2016-01-06    1.116075
      2016-01-07    0.357426
      2016-01-08    0.274249
      2016-01-09    0.834405
      2016-01-10   -0.005444
      Freq: D, dtype: float64
      
      print(time['2016-01-15':'2016-01-20'])
      2016-01-15    0.063690
      2016-01-16   -2.246031
      2016-01-17    0.359552
      2016-01-18    0.383030
      2016-01-19    0.402717
      2016-01-20   -0.694068
      Freq: D, dtype: float64
      
      data=pd.date_range('2010-01-01','2011-01-01',freq='M')
      print(data)
      
      DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30',
             '2010-05-31', '2010-06-30', '2010-07-31', '2010-08-31',
             '2010-09-30', '2010-10-31', '2010-11-30', '2010-12-31'],
            dtype='datetime64[ns]', freq='M')
            
            
      # Use the date range as the index of a Series
      rng = pd.date_range('2016 Jul 1', periods = 10, freq = 'D')
      rng
      pd.Series(range(len(rng)), index = rng)
      
      2016-07-01    0
      2016-07-02    1
      2016-07-03    2
      2016-07-04    3
      2016-07-05    4
      2016-07-06    5
      2016-07-07    6
      2016-07-08    7
      2016-07-09    8
      2016-07-10    9
      Freq: D, dtype: int32
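The outputs above suggest that `truncate` keeps both boundary labels, just like label-based slicing. The sketch below checks this on a small deterministic series; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(range(20), index=pd.date_range('2016-01-01', periods=20))

# truncate keeps the boundary label on both sides.
tail = s.truncate(before='2016-01-10')   # 2016-01-10 .. 2016-01-20
head = s.truncate(after='2016-01-10')    # 2016-01-01 .. 2016-01-10
middle = s.truncate(before='2016-01-05', after='2016-01-10')

# The same selections written as label slices with .loc.
print(tail.equals(s.loc['2016-01-10':]))
print(head.equals(s.loc[:'2016-01-10']))
print(len(middle))
```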
  • Specifying a period index

      periods = [pd.Period('2016-01'), pd.Period('2016-02'), pd.Period('2016-03')]
      ts = pd.Series(np.random.randn(len(periods)), index=periods)
      ts
      # Result: three random float64 values indexed by the monthly periods
      # 2016-01, 2016-02 and 2016-03 (Freq: M).
  • Converting between timestamps and periods

      ts = pd.Series(range(10), pd.date_range('07-10-16 8:00', periods=10, freq='H'))
      ts
      
      2016-07-10 08:00:00    0
      2016-07-10 09:00:00    1
      2016-07-10 10:00:00    2
      2016-07-10 11:00:00    3
      2016-07-10 12:00:00    4
      2016-07-10 13:00:00    5
      2016-07-10 14:00:00    6
      2016-07-10 15:00:00    7
      2016-07-10 16:00:00    8
      2016-07-10 17:00:00    9
      Freq: H, dtype: int32
    
      ts_period = ts.to_period()
      ts_period
      
      2016-07-10 08:00    0
      2016-07-10 09:00    1
      2016-07-10 10:00    2
      2016-07-10 11:00    3
      2016-07-10 12:00    4
      2016-07-10 13:00    5
      2016-07-10 14:00    6
      2016-07-10 15:00    7
      2016-07-10 16:00    8
      2016-07-10 17:00    9
      Freq: H, dtype: int32
      
      ts_period['2016-07-10 08:30':'2016-07-10 11:45']
      
      2016-07-10 08:00    0
      2016-07-10 09:00    1
      2016-07-10 10:00    2
      2016-07-10 11:00    3
      Freq: H, dtype: int32
      
      ts['2016-07-10 08:30':'2016-07-10 11:45']
      
      2016-07-10 09:00:00    1
      2016-07-10 10:00:00    2
      2016-07-10 11:00:00    3
      Freq: H, dtype: int32
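The two slices above differ because a timestamp marks an instant while a period covers a whole span: the hour period 08:00 contains 08:30, so it is kept, whereas the timestamp 08:00:00 lies before 08:30 and is dropped. A minimal sketch of this, assuming a recent pandas in which the hourly alias is written in lowercase as 'h':

```python
import pandas as pd

ts = pd.Series(range(10),
               index=pd.date_range('2016-07-10 08:00', periods=10, freq='h'))
ts_period = ts.to_period()

# Timestamp slicing keeps only instants inside [08:30, 11:45].
by_stamp = ts.loc['2016-07-10 08:30':'2016-07-10 11:45']

# Period slicing keeps every hour period that contains a bound.
by_period = ts_period.loc['2016-07-10 08:30':'2016-07-10 11:45']

# Converting back uses the start of each period by default.
round_trip = ts_period.to_timestamp()

print(by_stamp.tolist())
print(by_period.tolist())
print(round_trip.index.equals(ts.index))
```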

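2 Resampling

  • Resampling converts time series data from one frequency to another

  • Downsampling: from a higher frequency to a lower one (e.g. daily to monthly)

  • Upsampling: from a lower frequency to a higher one (e.g. 3-day to daily)

      rng = pd.date_range('1/1/2011', periods=90, freq='D')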
      ts = pd.Series(np.random.randn(len(rng)), index=rng)
      ts.head()
      
      2011-01-01   -1.025562
      2011-01-02    0.410895
      2011-01-03    0.660311
      2011-01-04    0.710293
      2011-01-05    0.444985
      Freq: D, dtype: float64
      
      ts.resample('M').sum()
      
      2011-01-31    2.510102
      2011-02-28    0.583209
      2011-03-31    2.749411
      Freq: M, dtype: float64
      
      ts.resample('3D').sum()
      
      2011-01-01    0.045643
      2011-01-04   -2.255206
      2011-01-07    0.571142
      2011-01-10    0.835032
      2011-01-13   -0.396766
      2011-01-16   -1.156253
      2011-01-19   -1.286884
      2011-01-22    2.883952
      2011-01-25    1.566908
      2011-01-28    1.435563
      2011-01-31    0.311565
      2011-02-03   -2.541235
      2011-02-06    0.317075
      2011-02-09    1.598877
      2011-02-12   -1.950509
      2011-02-15    2.928312
      2011-02-18   -0.733715
      2011-02-21    1.674817
      2011-02-24   -2.078872
      2011-02-27    2.172320
      2011-03-02   -2.022104
      2011-03-05   -0.070356
      2011-03-08    1.276671
      2011-03-11   -2.835132
      2011-03-14   -1.384113
      2011-03-17    1.517565
      2011-03-20   -0.550406
      2011-03-23    0.773430
      2011-03-26    2.244319
      2011-03-29    2.951082
      Freq: 3D, dtype: float64
    
      day3Ts = ts.resample('3D').mean()
      day3Ts
      
      2011-01-01    0.015214
      2011-01-04   -0.751735
      2011-01-07    0.190381
      2011-01-10    0.278344
      2011-01-13   -0.132255
      2011-01-16   -0.385418
      2011-01-19   -0.428961
      2011-01-22    0.961317
      2011-01-25    0.522303
      2011-01-28    0.478521
      2011-01-31    0.103855
      2011-02-03   -0.847078
      2011-02-06    0.105692
      2011-02-09    0.532959
      2011-02-12   -0.650170
      2011-02-15    0.976104
      2011-02-18   -0.244572
      2011-02-21    0.558272
      2011-02-24   -0.692957
      2011-02-27    0.724107
      2011-03-02   -0.674035
      2011-03-05   -0.023452
      2011-03-08    0.425557
      2011-03-11   -0.945044
      2011-03-14   -0.461371
      2011-03-17    0.505855
      2011-03-20   -0.183469
      2011-03-23    0.257810
      2011-03-26    0.748106
      2011-03-29    0.983694
      Freq: 3D, dtype: float64
      
      # Upsampling (3-day back to daily): slots with no source value become NaN
      print(day3Ts.resample('D').asfreq())
      
      2011-01-01    0.015214
      2011-01-02         NaN
      2011-01-03         NaN
      2011-01-04   -0.751735
      2011-01-05         NaN
      2011-01-06         NaN
      2011-01-07    0.190381
      2011-01-08         NaN
      2011-01-09         NaN
      2011-01-10    0.278344
      2011-01-11         NaN
      2011-01-12         NaN
      2011-01-13   -0.132255
      2011-01-14         NaN
      2011-01-15         NaN
      2011-01-16   -0.385418
      2011-01-17         NaN
      2011-01-18         NaN
      2011-01-19   -0.428961
      2011-01-20         NaN
      2011-01-21         NaN
      2011-01-22    0.961317
      Freq: D, Length: 88, dtype: float64
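  • ffill: fill missing values with the preceding value

  • bfill: fill missing values with the following value

  • interpolate: fill missing values by linear interpolation

      # limit=1 fills at most one consecutive missing value
      day3Ts.resample('D').ffill(limit=1)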
     
      2011-01-01    0.015214
      2011-01-02    0.015214
      2011-01-03         NaN
      2011-01-04   -0.751735
      2011-01-05   -0.751735
      2011-01-06         NaN
      2011-01-07    0.190381
      2011-01-08    0.190381
      2011-01-09         NaN
      2011-01-10    0.278344
      2011-01-11    0.278344
      
      day3Ts.resample('D').bfill(limit=1)
      2011-01-01    0.015214
      2011-01-02         NaN
      2011-01-03   -0.751735
      2011-01-04   -0.751735
      2011-01-05         NaN
      2011-01-06    0.190381
      2011-01-07    0.190381
      2011-01-08         NaN
      2011-01-09    0.278344
      2011-01-10    0.278344
      2011-01-11         NaN
      2011-01-12   -0.132255
      2011-01-13   -0.132255
    
      day3Ts.resample('D').interpolate('linear')
      2011-01-01    0.015214
      2011-01-02   -0.240435
      2011-01-03   -0.496085
      2011-01-04   -0.751735
      2011-01-05   -0.437697
      2011-01-06   -0.123658
      2011-01-07    0.190381
      2011-01-08    0.219702
      2011-01-09    0.249023
      2011-01-10    0.278344
      2011-01-11    0.141478
      2011-01-12    0.004611
      2011-01-13   -0.132255
      2011-01-14   -0.216643
      2011-01-15   -0.301030
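Because the series above is random, the effect of each fill strategy is easier to see on a small deterministic example. The following sketch (the values 0 to 8 are made up for illustration) downsamples to 3-day means and then upsamples back to daily frequency with each strategy.

```python
import numpy as np
import pandas as pd

daily = pd.Series(np.arange(9, dtype=float),
                  index=pd.date_range('2011-01-01', periods=9, freq='D'))

# Downsample: one mean per 3-day bin (bins hold 0-2, 3-5 and 6-8).
three_day = daily.resample('3D').mean()

# Upsample back to daily: the new slots are empty until they are filled.
up = three_day.resample('D')
raw = up.asfreq()                    # gaps stay NaN
forward = up.ffill()                 # copy the previous known value
forward_1 = up.ffill(limit=1)        # fill at most one slot per gap
backward = up.bfill()                # copy the next known value
linear = up.interpolate('linear')    # straight line between known values

print(pd.DataFrame({'raw': raw, 'ffill': forward, 'ffill(1)': forward_1,
                    'bfill': backward, 'linear': linear}))
```

Forward filling never looks ahead in time, which matters when the filled series later feeds a forecasting model; interpolation and backward filling both use future values.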

3 Rolling Windows

  • Rolling window computation

      %matplotlib inline
      import numpy as np
      import pandas as pd

      # 600 daily observations of random noise (a Series, despite the name df)
      df = pd.Series(np.random.randn(600), index=pd.date_range('7/1/2016', freq='D', periods=600))
      df.head()
      
      2016-07-01   -0.192140
      2016-07-02    0.357953
      2016-07-03   -0.201847
      2016-07-04   -0.372230
      2016-07-05    1.414753
      Freq: D, dtype: float64
    
      r = df.rolling(window=10)
      # Other aggregations: r.max, r.median, r.std, r.skew, r.sum, r.var
      print(r.mean())

      2016-07-01         NaN
      2016-07-02         NaN
      2016-07-03         NaN
      2016-07-04         NaN
      2016-07-05         NaN
      2016-07-06         NaN
      2016-07-07         NaN
      2016-07-08         NaN
      2016-07-09         NaN
      2016-07-10    0.300133
      2016-07-11    0.284780
      2016-07-12    0.252831
      2016-07-13    0.220699
      2016-07-14    0.167137
      2016-07-15    0.018593
      2016-07-16   -0.061414
      2016-07-17   -0.134593
      2016-07-18   -0.153333
      2016-07-19   -0.218928
      2016-07-20   -0.169426
      2016-07-21   -0.219747
      2016-07-22   -0.181266
      2016-07-23   -0.173674
      2016-07-24   -0.130629
      2016-07-25   -0.166730
      2016-07-26   -0.233044
      2016-07-27   -0.256642
      2016-07-28   -0.280738
      2016-07-29   -0.289893
      2016-07-30   -0.379625
                      ...   
      2018-01-22   -0.211467
      2018-01-23    0.034996
      2018-01-24   -0.105910
      2018-01-25   -0.145774
      2018-01-26   -0.089320
      2018-01-27   -0.164370
      2018-01-28   -0.110892
      2018-01-29   -0.205786
      2018-01-30   -0.101162
      2018-01-31   -0.034760
      2018-02-01    0.229333
      2018-02-02    0.043741
      2018-02-03    0.052837
      2018-02-04    0.057746
      2018-02-05   -0.071401
      2018-02-06   -0.011153
      2018-02-07   -0.045737
      2018-02-08   -0.021983
      2018-02-09   -0.196715
      2018-02-10   -0.063721
      2018-02-11   -0.289452
      2018-02-12   -0.050946
      2018-02-13   -0.047014
      2018-02-14    0.048754
      2018-02-15    0.143949
      2018-02-16    0.424823
      2018-02-17    0.361878
      2018-02-18    0.363235
      2018-02-19    0.517436
      2018-02-20    0.368020
      Freq: D, Length: 600, dtype: float64
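The nine leading NaN values in the output above appear because a window of 10 is not full until the tenth observation. The sketch below shows the same effect with a window of 3 on made-up values, together with the `min_periods` parameter, which is one way to get results from partial windows.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=pd.date_range('2016-07-01', periods=5, freq='D'))

# With window=3 the first two positions lack a full window, hence NaN.
full = s.rolling(window=3).mean()

# min_periods=1 lets the window start from a single observation.
partial = s.rolling(window=3, min_periods=1).mean()

print(full.tolist())
print(partial.tolist())
```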
  • Visualization

      %matplotlib inline
      import matplotlib.pyplot as plt
      
      plt.figure(figsize=(15, 5))
      
      df.plot(style='r--')
      df.rolling(window=10).mean().plot(style='b')

4 ARIMA Forecasting

  • Data preprocessing

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    # statsmodels.tsa.arima_model was removed in statsmodels 0.13;
    # the ARIMA class now lives in statsmodels.tsa.arima.model.
    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    plt.style.use('ggplot')
    plt.rcParams['font.sans-serif'] = ['SimHei']  # font with CJK glyphs for plot labels
    plt.rcParams['axes.unicode_minus'] = False    # render minus signs correctly with that font
    stockFile = 'data/T10yr.csv'
    stock = pd.read_csv(stockFile, index_col=0, parse_dates=[0])
    stock.head(10)
    # Weekly mean of the closing value, with weeks anchored on Monday
    stock_week = stock['Close'].resample('W-MON').mean()
    stock_train = stock_week['2000':'2015']
    stock_train.plot(figsize=(12, 8))
    plt.legend(bbox_to_anchor=(1.25, 0.5))
    plt.title("Stock Close")
    sns.despine()
    # First-order differencing removes the trend before the model is fitted
    stock_diff = stock_train.diff()
    stock_diff = stock_diff.dropna()

    plt.figure()
    plt.plot(stock_diff)
    plt.title('First-Order Difference')
    plt.show()
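To make the differencing step concrete, here is a small sketch on a made-up series with a linear trend (it does not use the T10yr data). It shows that one round of `diff()` turns a linear trend into a constant, which is the reason for d = 1 in an ARIMA order such as (1, 1, 1), and that a cumulative sum undoes the operation.

```python
import numpy as np
import pandas as pd

# Deterministic weekly series with a linear trend: 10, 12, 14, ...
trend = pd.Series(10.0 + 2.0 * np.arange(6),
                  index=pd.date_range('2000-01-03', periods=6, freq='W-MON'))

# First difference x[t] - x[t-1]; the first element has no predecessor.
diff1 = trend.diff().dropna()

# Undo the differencing: cumulative sum plus the first original value.
restored = diff1.cumsum() + trend.iloc[0]

print(diff1.tolist())
print(restored.tolist())
```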
    # Autocorrelation of the differenced series, up to 20 lags
    acf = plot_acf(stock_diff, lags=20)
    plt.title("ACF")
    acf.show()
    # Partial autocorrelation of the differenced series, up to 20 lags
    pacf = plot_pacf(stock_diff, lags=20)
    plt.title("PACF")
    pacf.show()
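    # Current statsmodels API (>= 0.13): predict() returns values on the
    # original scale by default, so the old typ='levels' argument is dropped.
    from statsmodels.tsa.arima.model import ARIMA

    model = ARIMA(stock_train, order=(1, 1, 1), freq='W-MON')
    result = model.fit()
    # print(result.summary())
    pred = result.predict('20140609', '20160701', dynamic=True)
    print(pred)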
    
    2014-06-09    2.463559
    2014-06-16    2.455539
    2014-06-23    2.449569
    2014-06-30    2.444183
    2014-07-07    2.438962
    2014-07-14    2.433788
    2014-07-21    2.428627
    2014-07-28    2.423470
    2014-08-04    2.418315
    2014-08-11    2.413159
    2014-08-18    2.408004
    2014-08-25    2.402849
    2014-09-01    2.397693
    2014-09-08    2.392538
    2014-09-15    2.387383
    
    plt.figure(figsize=(6, 6))
    plt.xticks(rotation=45)
    plt.plot(pred)
    plt.plot(stock_train)