PandaPy has the speed of NumPy and the usability of Pandas

栏目: IT技术 · 发布时间: 4年前

内容简介：PandaPy software, similar to the original Pandas project, is developed to improve the usability of python for finance. Structured datatypes are designed to be able to mimic ‘structs’ in the C language, and share a similar memory layout. PandaPy currently h

PandaPy

Install

!pip3 install pandapy

Load

import pandapy as pp

Why PandaPy?

Maintains the full functionality and speed of structured NumPy datatype (eg., array[col1] + array[col2], or np.log(array[col1] )
Provides wrapper functions over NumPy to give you the usability of Pandas (eg., pp.group(array, [col1, col2, col2], ['mean', 'std'], ['Adj_Close','Close'])
If you need Pandas for speciality functions, you can easily df = pp.pandas(array) and back array = pp.structured(df)
For simple calculations (i.e, plus, mult, log) PandaPy is 25x - 80x faster than Pandas.
For table functions (i.e., group, pivot, drop, concat, fillna) PandaPy is 5x - 100x times faster than Pandas.
For most use cases, PandaPy is faster than Dask, Modin Ray and Pandas.
The best competing python package for performance on table functions is datatable , it is 2x - 10x faster than PandaPy.
The problem is that datatable is 5x - 10x slower with simple calculations (plus, mult, returns), it is less intuitive, does not have a large range of functions, have very few complementary libraries, e.g. matplotlib, and doesn't leave you in a Numpy datatype.
For finance applications the speed of simple calculations takes preference over table function speed.
PandaPy is not created to allow you to scale up to clusters for multiple computer processing like Dask, Modin, and Spark, instead it is focused on speed and usability within a single computer's Memory.
Machines are getting large, EC2 X1 has 2TB of RAM and is remarkably affordable. If it can be done on a single machine then it should be done on a single machine. Quoting Dask - "For data that fits into RAM, Pandas can often be faster and easier to use than Dask DataFrame"
If your dataset is very small you can load your data using PandaPy's read() function, for medium sized data, it is best to load it with datatable or pyspark and convert it to structured Numpy, if it is large pyspark, Dask, or Modin, if it is very large use pyspark.
Lastly PandaPy can have as input any multidimensional object and does not have to conform to the basic NumPy datatypes. It can include nested datatypes, subarrays, functions as long as each column conforms to the array lenght, this allows for a great amount of flexibility. You can for example, add(array, "panda function",[[pd for i in range(len(multiple_stocks))]]) to create a list of the panda (pd) module and access it along any index value array["panda function"][0].read_csv(url) .

PandaPy software, similar to the original Pandas project, is developed to improve the usability of python for finance. Structured datatypes are designed to be able to mimic ‘structs’ in the C language, and share a similar memory layout. PandaPy currently houses more than 30 functions. Structured NumPy are meant for interfacing with C code and for low-level manipulation of structured buffers, for example for interpreting binary blobs. For these purposes they support specialized features such as subarrays, nested datatypes, and unions, and allow control over the memory layout of the structure.

Note this is a fledgling project, much room for improvement, all feedback appreciated (issues tab)

Description

A Structured NumPy Array is an array of structures. NumPy arrays can only contain one data type, but structured arrays in a sense create an array of homogeneous structures. This is done without moving out of NumPy such as is required with Xarray. For structured arrays the data type only has to be the same per column like an SQL data base. Each column can be another multidimensional object and does not have to conform to the basic NumPy datatypes.

PandaPy comes with similar functionality like Pandas, such as groupby, pivot, and others. The biggest benefit of this approach is that NumPy dtype(data type) directly maps onto a C structure definition, so the buffer containing the array content can be accessed directly within an appropriately written C program. If you find yourself writing a Python interface to a legacy C or Fortran library that manipulates structured data, you'll probably find structured arrays quite useful.

Additional

Play around with speed tests here and some more here .
Test and explore the package with this Google Colab Notebook .
Get in touch on LinkedIn or Twitter .
Use table(array) to get a pandas looking table printout

Functions

PandaPy Speed Over Pandas In (X) e.g., (dropnarow) (30x)

Array Structure

Read In Arrays (read)
To Pandas (unstructured) 
Pandas to Structured (structured) 
To Unstructured (to_unstruct) 
To Structured (to_struct) 
Print Table (table)

Explorative Functions

Descriptive Statistics (describe) (5x)
Correlation Array (corr) (2x)

Finance Functions

Returns (returns) (50x)  
Portfolio Value (portfolio_value) (50x)
Cummulative Value (cummulative_return) (50x)
Column Lags (lags) (7x)

Array Functions

Drop Null Rows (dropnarow) (30x)
Drop Column/s (drop) (100x)
Add Column/s (add) (3x)
Concatenate (concat) (rows 25x columns 70x)
Merge (merge) (2x)
Group by (group) (10x)
Pivot (pivot) (20x)
Fill Nulls (fillna) (20x)
Shift Column (shift) (50x)
Rename (rename) (500x)

Other Speed Tests

Update (array[col] = values) (60x)
Addition (array[col] + array[col]) (80x)
Multiplication (array[col] * array[col]) (80x)
Log (np.log(array[col]) (25x)

note speed tests done on financial dataset only

Documentation by Example

Read In Arrays

# First Example
multiple_stocks = pp.read('https://github.com/firmai/random-assets-two/blob/master/numpy/multiple_stocks.csv?raw=true')
closing = multiple_stocks[['Ticker','Date','Adj_Close']]
piv = pp.pivot(closing,"Date","Ticker","Adj_Close"); piv
closing = pp.to_struct(piv, name_list = [x for x in np.unique(multiple_stocks["Ticker"])])

# Second Example
tsla = pp.read('https://github.com/firmai/random-assets-two/raw/master/numpy/tsla.csv')
crm = pp.read('https://github.com/firmai/random-assets-two/raw/master/numpy/crm.csv')
tsla_sub = tsla[["Date","Adj_Close","Volume"]]
crm_sub = crm[["Date","Adj_Close","Volume"]]
crm_adj = crm[['Date','Adj_Close']]

closing

array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312,  41.9791832 ,  81.51140594, 66.33999634),
       (35.08446503,  97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312,  41.59314346,  80.89860535, 66.15000153),
       (35.34244537,  97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897,  40.98268127,  80.28580475, 64.58000183),
       ...,
       (21.57999992, 289.79998779, 59.08000183, 11.18000031, 135.27000427, 55.34999847, 158.96000671, 137.53999329, 88.37000275),
       (21.34000015, 291.51998901, 58.65999985, 11.07999992, 132.80999756, 55.27000046, 157.58999634, 136.80999756, 87.95999908),
       (21.51000023, 293.6499939 , 58.47999954, 11.15999985, 134.03999329, 55.34999847, 157.69999695, 136.66999817, 88.08999634)],
      dtype=[('AA', '<f8'), ('AAPL', '<f8'), ('DAL', '<f8'), ('GE', '<f8'), ('IBM', '<f8'), ('KO', '<f8'), ('MSFT', '<f8'), ('PEP', '<f8'), ('UAL', '<f8')])

Rename

pp.rename(closing,["AA","AAPL"],["GAP","FAF"])[:5]

array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312, 41.9791832 , 81.51140594, 66.33999634),
       (35.08446503,  97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312, 41.59314346, 80.89860535, 66.15000153),
       (35.34244537,  97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897, 40.98268127, 80.28580475, 64.58000183),
       (36.25707626,  99.00255585, 42.57216263, 19.91554451, 124.94229126, 36.52467346, 41.50337982, 82.63342285, 65.52999878),
       (37.28897095, 102.80648041, 43.67792892, 20.15538216, 127.65791321, 36.966465  , 42.72432327, 84.13523865, 66.63999939)],
      dtype=[('GAP', '<f8'), ('FAF', '<f8'), ('DAL', '<f8'), ('GE', '<f8'), ('IBM', '<f8'), ('KO', '<f8'), ('MSFT', '<f8'), ('PEP', '<f8'), ('UAL', '<f8')])

pp.rename(closing,"AA", "GALLY")[:5]

array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312, 41.9791832 , 81.51140594, 66.33999634),
       (35.08446503,  97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312, 41.59314346, 80.89860535, 66.15000153),
       (35.34244537,  97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897, 40.98268127, 80.28580475, 64.58000183),
       (36.25707626,  99.00255585, 42.57216263, 19.91554451, 124.94229126, 36.52467346, 41.50337982, 82.63342285, 65.52999878),
       (37.28897095, 102.80648041, 43.67792892, 20.15538216, 127.65791321, 36.966465  , 42.72432327, 84.13523865, 66.63999939)],
      dtype=[('GALLY', '<f8'), ('AAPL', '<f8'), ('DAL', '<f8'), ('GE', '<f8'), ('IBM', '<f8'), ('KO', '<f8'), ('MSFT', '<f8'), ('PEP', '<f8'), ('UAL', '<f8')])

Statistics

described = pp.describe(closing)

Describe	observations	minimum	maximum	mean	variance	skewness	kurtosis
AA	1258.00	15.97	60.23	31.46	99.42	0.67	-0.58
AAPL	1258.00	85.39	293.65	149.45	2119.86	0.66	-0.28
DAL	1258.00	30.73	62.69	47.15	44.33	-0.01	-0.78
GE	1258.00	6.42	28.67	18.85	48.45	-0.25	-1.54
IBM	1258.00	99.83	161.17	133.35	116.28	-0.37	0.56
KO	1258.00	32.81	55.35	41.67	28.86	0.80	-0.05
MSFT	1258.00	36.27	158.96	78.31	1102.21	0.61	-0.82
PEP	1258.00	78.46	139.30	102.86	229.01	0.63	-0.32
UAL	1258.00	37.75	96.70	69.22	195.65	0.02	-1.04

Drop Column/s

removed = pp.drop(closing,["AA","AAPL","IBM"]) ; removed[:5]

array([(44.57522202, 20.72605705, 35.80251312, 41.9791832 , 81.51140594, 66.33999634),
       (43.83200836, 20.34561157, 35.80251312, 41.59314346, 80.89860535, 66.15000153),
       (42.79874039, 19.90727234, 36.07437897, 40.98268127, 80.28580475, 64.58000183),
       (42.57216263, 19.91554451, 36.52467346, 41.50337982, 82.63342285, 65.52999878),
       (43.67792892, 20.15538216, 36.966465  , 42.72432327, 84.13523865, 66.63999939)],
      dtype={'names':['DAL','GE','KO','MSFT','PEP','UAL'], 'formats':['<f8','<f8','<f8','<f8','<f8','<f8'], 'offsets':[16,24,40,48,56,64], 'itemsize':72})

Add Column/s

added = pp.add(closing,["GALLY","FAF"],[closing["IBM"],closing["AA"]]); added[:5]  ## set two new columns with that two previous columnns

array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312, 41.9791832 , 81.51140594, 66.33999634, 130.59109497, 37.24206924),
       (35.08446503,  97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312, 41.59314346, 80.89860535, 66.15000153, 128.53627014, 35.08446503),
       (35.34244537,  97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897, 40.98268127, 80.28580475, 64.58000183, 125.76422119, 35.34244537),
       (36.25707626,  99.00255585, 42.57216263, 19.91554451, 124.94229126, 36.52467346, 41.50337982, 82.63342285, 65.52999878, 124.94229126, 36.25707626),
       (37.28897095, 102.80648041, 43.67792892, 20.15538216, 127.65791321, 36.966465  , 42.72432327, 84.13523865, 66.63999939, 127.65791321, 37.28897095)],
      dtype=[('AA', '<f8'), ('AAPL', '<f8'), ('DAL', '<f8'), ('GE', '<f8'), ('IBM', '<f8'), ('KO', '<f8'), ('MSFT', '<f8'), ('PEP', '<f8'), ('UAL', '<f8'), ('GALLY', '<f8'), ('FAF', '<f8')])

Concatenate Arrays by Row

concat_row = pp.concat(removed[["DAL","GE"]], added[["PEP","UAL"]], type="row"); concat_row[:5]

array([(44.57522202, 20.72605705), (43.83200836, 20.34561157),
       (42.79874039, 19.90727234), (42.57216263, 19.91554451),
       (43.67792892, 20.15538216)], dtype=[('DAL', '<f8'), ('GE', '<f8')])

Concatenate Arrays by Column

concat_col = pp.concat(removed[["DAL","GE"]], added[["PEP","UAL"]], type="columns"); concat_col[:5]

array([(44.57522202, 20.72605705, 81.51140594, 66.33999634),
       (43.83200836, 20.34561157, 80.89860535, 66.15000153),
       (42.79874039, 19.90727234, 80.28580475, 64.58000183),
       (42.57216263, 19.91554451, 82.63342285, 65.52999878),
       (43.67792892, 20.15538216, 84.13523865, 66.63999939)],
      dtype=[('DAL', '<f8'), ('GE', '<f8'), ('PEP', '<f8'), ('UAL', '<f8')])

Concatenate by Array

concat_array = pp.concat(removed[["DAL","GE"]], added[["PEP","UAL"]], type="array"); concat_array[:5]

array([[(44.57522201538086, 20.726057052612305),
        (43.832008361816406, 20.345611572265625),
        (42.79874038696289, 19.907272338867188), ...,
        (59.08000183105469, 11.180000305175781),
        (58.65999984741211, 11.079999923706055),
        (58.47999954223633, 11.15999984741211)],
       [(81.51140594482422, 66.33999633789062),
        (80.89860534667969, 66.1500015258789),
        (80.28580474853516, 64.58000183105469), ...,
        (137.5399932861328, 88.37000274658203),
        (136.80999755859375, 87.95999908447266),
        (136.6699981689453, 88.08999633789062)]], dtype=object)

Concatenate by Melt

concat_melt = pp.concat(removed[["DAL","GE"]], added[["PEP","UAL"]], type="melt"); concat_melt[:5]

array([(44.57522202, 20.72605705), (43.83200836, 20.34561157),
       (42.79874039, 19.90727234), (42.57216263, 19.91554451),
       (43.67792892, 20.15538216)], dtype=[('DAL', '<f8'), ('GE', '<f8')])

Merge Array (inner, outer)

merged = pp.merge(tsla_sub, crm_adj, left_on="Date", right_on="Date",how="inner",left_postscript="_TSLA",right_postscript="_CRM"); merged[:5]

array([('2019-01-02', 310.11999512, 135.55000305, 11658600),
       ('2019-01-03', 300.35998535, 130.3999939 ,  6965200),
       ('2019-01-04', 317.69000244, 137.96000671,  7394100),
       ('2019-01-07', 334.95999146, 142.22000122,  7551200),
       ('2019-01-08', 335.3500061 , 145.72000122,  7008500)],
      dtype=[('Date', '<M8[D]'), ('Adj_Close_TSLA', '<f8'), ('Adj_Close_CRM', '<f8'), ('Volume', '<i8')])

Replace Individual Values

## More work to done on replace (structured)
## replace(merged,original=317.69000244, replacement=np.nan)[:5]

Print Table

pp.table(merged[:5])

	Date	Adj_Close_TSLA	Adj_Close_CRM	Volume
0	2019-01-02	310.120	135.550	11658600
1	2019-01-03	300.360	130.400	6965200
2	2019-01-04	317.690	137.960	7394100
3	2019-01-07	334.960	142.220	7551200
4	2019-01-08	335.350	145.720	7008500

### This is the new function that you should include above
### You can add the same peculuarities to remove

Add and Concatenate

tsla = pp.add(tsla,["Ticker"], "TSLA", "U10")
crm = pp.add(crm,["Ticker"], "CRM", "U10")
combine = pp.concat(tsla[0:5], crm[0:5], type="row"); combine

array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, '2019-01-02', 'TSLA'),
       (309.3999939 , 297.38000488, 307.        , 300.35998535,  6965200, 300.35998535, '2019-01-03', 'TSLA'),
       (318.        , 302.73001099, 306.        , 317.69000244,  7394100, 317.69000244, '2019-01-04', 'TSLA'),
       (336.73999023, 317.75      , 321.72000122, 334.95999146,  7551200, 334.95999146, '2019-01-07', 'TSLA'),
       (344.01000977, 327.01998901, 341.95999146, 335.3500061 ,  7008500, 335.3500061 , '2019-01-08', 'TSLA'),
       (136.83000183, 133.05000305, 133.3999939 , 135.55000305,  4783900, 135.55000305, '2019-01-02', 'CRM'),
       (134.77999878, 130.1000061 , 133.47999573, 130.3999939 ,  6365700, 130.3999939 , '2019-01-03', 'CRM'),
       (139.32000732, 132.22000122, 133.5       , 137.96000671,  6650600, 137.96000671, '2019-01-04', 'CRM'),
       (143.38999939, 138.78999329, 141.02000427, 142.22000122,  9064800, 142.22000122, '2019-01-07', 'CRM'),
       (146.46000671, 142.88999939, 144.72999573, 145.72000122,  9057300, 145.72000122, '2019-01-08', 'CRM')],
      dtype=[('High', '<f8'), ('Low', '<f8'), ('Open', '<f8'), ('Close', '<f8'), ('Volume', '<i8'), ('Adj_Close', '<f8'), ('Date', '<M8[D]'), ('Ticker', '<U10')])

dropped = pp.drop(combine,["High","Low","Open"]); dropped[:10]

array([(310.11999512, 11658600, 310.11999512, '2019-01-02', 'TSLA'),
       (300.35998535,  6965200, 300.35998535, '2019-01-03', 'TSLA'),
       (317.69000244,  7394100, 317.69000244, '2019-01-04', 'TSLA'),
       (334.95999146,  7551200, 334.95999146, '2019-01-07', 'TSLA'),
       (335.3500061 ,  7008500, 335.3500061 , '2019-01-08', 'TSLA'),
       (135.55000305,  4783900, 135.55000305, '2019-01-02', 'CRM'),
       (130.3999939 ,  6365700, 130.3999939 , '2019-01-03', 'CRM'),
       (137.96000671,  6650600, 137.96000671, '2019-01-04', 'CRM'),
       (142.22000122,  9064800, 142.22000122, '2019-01-07', 'CRM'),
       (145.72000122,  9057300, 145.72000122, '2019-01-08', 'CRM')],
      dtype={'names':['Close','Volume','Adj_Close','Date','Ticker'], 'formats':['<f8','<i8','<f8','<M8[D]','<U10'], 'offsets':[24,32,40,48,56], 'itemsize':96})

Pivot Array

piv = pp.pivot(dropped,"Date","Ticker","Adj_Close",display=True)

Adj_Close	CRM	TSLA
2019-01-02	135.55	310.12
2019-01-03	130.40	300.36
2019-01-04	137.96	317.69
2019-01-07	142.22	334.96
2019-01-08	145.72	335.35

Add New Data types

tsla_extended = pp.add(tsla,"Month",tsla["Date"],'datetime64[M]')
tsla_extended = pp.add(tsla_extended,"Year",tsla_extended["Date"],'datetime64[Y]')

Update Existing Column

## faster method elsewhere
year_frame = pp.update(tsla,"Date", [dt.year for dt in tsla["Date"].astype(object)],types="|U10"); year_frame[:5]

array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, 'TSLA', '2019'),
       (309.3999939 , 297.38000488, 307.        , 300.35998535,  6965200, 300.35998535, 'TSLA', '2019'),
       (318.        , 302.73001099, 306.        , 317.69000244,  7394100, 317.69000244, 'TSLA', '2019'),
       (336.73999023, 317.75      , 321.72000122, 334.95999146,  7551200, 334.95999146, 'TSLA', '2019'),
       (344.01000977, 327.01998901, 341.95999146, 335.3500061 ,  7008500, 335.3500061 , 'TSLA', '2019')],
      dtype=[('High', '<f8'), ('Low', '<f8'), ('Open', '<f8'), ('Close', '<f8'), ('Volume', '<i8'), ('Adj_Close', '<f8'), ('Ticker', '<U10'), ('Date', '<U10')])

Group Arrays By

grouped = pp.group(tsla_extended, ['Ticker','Month','Year'],['mean', 'std', 'min', 'max'], ['Adj_Close','Close'], display=True)

	Ticker	Month	Year	Adj_Close_mean	Adj_Close_std	Adj_Close_min	Adj_Close_max	Close_mean	Close_std	Close_min	Close_max
0	TSLA	2019-01-01	2019-01-01	318.494	21.098	287.590	347.310	318.494	21.098	287.590	347.310
1	TSLA	2019-02-01	2019-01-01	307.728	8.053	291.230	321.350	307.728	8.053	291.230	321.350
2	TSLA	2019-03-01	2019-01-01	277.757	8.925	260.420	294.790	277.757	8.925	260.420	294.790
3	TSLA	2019-04-01	2019-01-01	266.656	14.985	235.140	291.810	266.656	14.985	235.140	291.810
4	TSLA	2019-05-01	2019-01-01	219.715	24.040	185.160	255.340	219.715	24.040	185.160	255.340
5	TSLA	2019-06-01	2019-01-01	213.717	12.125	178.970	226.430	213.717	12.125	178.970	226.430
6	TSLA	2019-07-01	2019-01-01	242.382	12.077	224.550	264.880	242.382	12.077	224.550	264.880
7	TSLA	2019-08-01	2019-01-01	225.103	7.831	211.400	238.300	225.103	7.831	211.400	238.300
8	TSLA	2019-09-01	2019-01-01	237.261	8.436	220.680	247.100	237.261	8.436	220.680	247.100
9	TSLA	2019-10-01	2019-01-01	266.355	31.463	231.430	328.130	266.355	31.463	231.430	328.130
10	TSLA	2019-11-01	2019-01-01	338.300	13.226	313.310	359.520	338.300	13.226	313.310	359.520
11	TSLA	2019-12-01	2019-01-01	377.695	36.183	330.370	430.940	377.695	36.183	330.370	430.940

Convert Array to Pandas

grouped_frame = pp.pandas(grouped); grouped_frame.head()

	Ticker	Month	Year	Adj_Close_mean	Adj_Close_std	Adj_Close_min	Adj_Close_max	Close_mean	Close_std	Close_min	Close_max
0	TSLA	2019-01-01	2019-01-01	318.494284	21.098362	287.589996	347.309998	318.494284	21.098362	287.589996	347.309998
1	TSLA	2019-02-01	2019-01-01	307.728421	8.052522	291.230011	321.350006	307.728421	8.052522	291.230011	321.350006
2	TSLA	2019-03-01	2019-01-01	277.757140	8.924873	260.420013	294.790009	277.757140	8.924873	260.420013	294.790009
3	TSLA	2019-04-01	2019-01-01	266.655716	14.984572	235.139999	291.809998	266.655716	14.984572	235.139999	291.809998
4	TSLA	2019-05-01	2019-01-01	219.715454	24.039647	185.160004	255.339996	219.715454	24.039647	185.160004	255.339996

From Pandas to Structured

struct = pp.structured(grouped_frame); struct[:5]

rec.array([('TSLA', '2019-01-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 318.49428449, 21.09836186, 287.58999634, 347.30999756, 318.49428449, 21.09836186, 287.58999634, 347.30999756),
           ('TSLA', '2019-02-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 307.72842086,  8.05252198, 291.23001099, 321.3500061 , 307.72842086,  8.05252198, 291.23001099, 321.3500061 ),
           ('TSLA', '2019-03-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 277.75713966,  8.92487345, 260.42001343, 294.79000854, 277.75713966,  8.92487345, 260.42001343, 294.79000854),
           ('TSLA', '2019-04-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 266.65571594, 14.98457194, 235.13999939, 291.80999756, 266.65571594, 14.98457194, 235.13999939, 291.80999756),
           ('TSLA', '2019-05-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 219.7154541 , 24.03964724, 185.16000366, 255.33999634, 219.7154541 , 24.03964724, 185.16000366, 255.33999634)],
          dtype=[('Ticker', 'O'), ('Month', '<M8[ns]'), ('Year', '<M8[ns]'), ('Adj_Close_mean', '<f8'), ('Adj_Close_std', '<f8'), ('Adj_Close_min', '<f8'), ('Adj_Close_max', '<f8'), ('Close_mean', '<f8'), ('Close_std', '<f8'), ('Close_min', '<f8'), ('Close_max', '<f8')])

Shift Column

pp.shift(merged["Adj_Close_TSLA"],1)[:5]

array([         nan, 310.11999512, 300.35998535, 317.69000244,
       334.95999146])

Multiple Lags for Column

tsla_lagged = pp.lags(tsla_extended, "Adj_Close", 5); tsla_lagged[:5]

array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, '2019-01-02', 'TSLA', '2019-01', '2019',          nan,          nan,          nan,          nan, nan),
       (309.3999939 , 297.38000488, 307.        , 300.35998535,  6965200, 300.35998535, '2019-01-03', 'TSLA', '2019-01', '2019', 310.11999512,          nan,          nan,          nan, nan),
       (318.        , 302.73001099, 306.        , 317.69000244,  7394100, 317.69000244, '2019-01-04', 'TSLA', '2019-01', '2019', 300.35998535, 310.11999512,          nan,          nan, nan),
       (336.73999023, 317.75      , 321.72000122, 334.95999146,  7551200, 334.95999146, '2019-01-07', 'TSLA', '2019-01', '2019', 317.69000244, 300.35998535, 310.11999512,          nan, nan),
       (344.01000977, 327.01998901, 341.95999146, 335.3500061 ,  7008500, 335.3500061 , '2019-01-08', 'TSLA', '2019-01', '2019', 334.95999146, 317.69000244, 300.35998535, 310.11999512, nan)],
      dtype=[('High', '<f8'), ('Low', '<f8'), ('Open', '<f8'), ('Close', '<f8'), ('Volume', '<i8'), ('Adj_Close', '<f8'), ('Date', '<M8[D]'), ('Ticker', '<U10'), ('Month', '<M8[M]'), ('Year', '<M8[Y]'), ('Adj_Close_lag_1', '<f8'), ('Adj_Close_lag_2', '<f8'), ('Adj_Close_lag_3', '<f8'), ('Adj_Close_lag_4', '<f8'), ('Adj_Close_lag_5', '<f8')])

Correlation Array

correlated = pp.corr(closing)

Correlation	AA	AAPL	DAL	GE	IBM	KO	MSFT	PEP	UAL
AA	1.00	0.21	0.24	-0.17	0.39	-0.09	0.05	-0.04	0.12
AAPL	0.21	1.00	0.86	-0.83	0.22	0.85	0.94	0.85	0.82
DAL	0.24	0.86	1.00	-0.78	0.14	0.79	0.86	0.78	0.86
GE	-0.17	-0.83	-0.78	1.00	0.06	-0.76	-0.86	-0.69	-0.76
IBM	0.39	0.22	0.14	0.06	1.00	0.07	0.15	0.24	0.18
KO	-0.09	0.85	0.79	-0.76	0.07	1.00	0.94	0.96	0.74
MSFT	0.05	0.94	0.86	-0.86	0.15	0.94	1.00	0.93	0.83
PEP	-0.04	0.85	0.78	-0.69	0.24	0.96	0.93	1.00	0.75
UAL	0.12	0.82	0.86	-0.76	0.18	0.74	0.83	0.75	1.00

Log Returns

pp.returns(closing,"IBM",type="log")

array([        nan, -0.01585991, -0.02180223, ...,  0.0026649 ,
       -0.0183533 ,  0.0092187 ])

Normal Returns

loga = pp.returns(closing,"IBM",type="normal"); loga

array([        nan, -0.0157348 , -0.02156628, ...,  0.00266845,
       -0.0181859 ,  0.00926132])

Add Column

close_ret = pp.add(closing,"IBM_log_return",loga); close_ret[:5]

array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312, 41.9791832 , 81.51140594, 66.33999634,         nan),
       (35.08446503,  97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312, 41.59314346, 80.89860535, 66.15000153, -0.0157348 ),
       (35.34244537,  97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897, 40.98268127, 80.28580475, 64.58000183, -0.02156628),
       (36.25707626,  99.00255585, 42.57216263, 19.91554451, 124.94229126, 36.52467346, 41.50337982, 82.63342285, 65.52999878, -0.00653548),
       (37.28897095, 102.80648041, 43.67792892, 20.15538216, 127.65791321, 36.966465  , 42.72432327, 84.13523865, 66.63999939,  0.02173501)],
      dtype=[('AA', '<f8'), ('AAPL', '<f8'), ('DAL', '<f8'), ('GE', '<f8'), ('IBM', '<f8'), ('KO', '<f8'), ('MSFT', '<f8'), ('PEP', '<f8'), ('UAL', '<f8'), ('IBM_log_return', '<f8')])

Drop Array Rows Where Null

close_ret_na = pp.dropnarow(close_ret, "IBM_log_return"); close_ret[:5]

array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312, 41.9791832 , 81.51140594, 66.33999634,         nan),
       (35.08446503,  97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312, 41.59314346, 80.89860535, 66.15000153, -0.0157348 ),
       (35.34244537,  97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897, 40.98268127, 80.28580475, 64.58000183, -0.02156628),
       (36.25707626,  99.00255585, 42.57216263, 19.91554451, 124.94229126, 36.52467346, 41.50337982, 82.63342285, 65.52999878, -0.00653548),
       (37.28897095, 102.80648041, 43.67792892, 20.15538216, 127.65791321, 36.966465  , 42.72432327, 84.13523865, 66.63999939,  0.02173501)],
      dtype=[('AA', '<f8'), ('AAPL', '<f8'), ('DAL', '<f8'), ('GE', '<f8'), ('IBM', '<f8'), ('KO', '<f8'), ('MSFT', '<f8'), ('PEP', '<f8'), ('UAL', '<f8'), ('IBM_log_return', '<f8')])

Portfolio Value from Log Return

pp.portfolio_value(close_ret_na, "IBM_log_return", type="log")

array([0.98438834, 0.96338604, 0.95711037, ..., 1.15115429, 1.13040872,
       1.14092643])

Cummulative Value from Log Return

pp.cummulative_return(close_ret_na, "IBM_log_return", type="log")

array([-0.01561166, -0.03661396, -0.04288963, ...,  0.15115429,
        0.13040872,  0.14092643])

Fillna Mean

pp.fillna(tsla_lagged,type="mean")[:5]

array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, 272.95330665, 272.38631982, 271.75180703, 271.10991915, 270.48587024),
       (309.3999939 , 297.38000488, 307.        , 300.35998535,  6965200, 300.35998535, 310.11999512, 272.38631982, 271.75180703, 271.10991915, 270.48587024),
       (318.        , 302.73001099, 306.        , 317.69000244,  7394100, 317.69000244, 300.35998535, 310.11999512, 271.75180703, 271.10991915, 270.48587024),
       (336.73999023, 317.75      , 321.72000122, 334.95999146,  7551200, 334.95999146, 317.69000244, 300.35998535, 310.11999512, 271.10991915, 270.48587024),
       (344.01000977, 327.01998901, 341.95999146, 335.3500061 ,  7008500, 335.3500061 , 334.95999146, 317.69000244, 300.35998535, 310.11999512, 270.48587024)],
      dtype={'names':['High','Low','Open','Close','Volume','Adj_Close','Adj_Close_lag_1','Adj_Close_lag_2','Adj_Close_lag_3','Adj_Close_lag_4','Adj_Close_lag_5'], 'formats':['<f8','<f8','<f8','<f8','<i8','<f8','<f8','<f8','<f8','<f8','<f8'], 'offsets':[0,8,16,24,32,40,112,120,128,136,144], 'itemsize':152})

Fillna Value

pp.fillna(tsla_lagged,type="value",value=-999999)[:5]

array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, -9.99999000e+05, -9.99999000e+05, -9.99999000e+05, -9.99999000e+05, -999999.),
       (309.3999939 , 297.38000488, 307.        , 300.35998535,  6965200, 300.35998535,  3.10119995e+02, -9.99999000e+05, -9.99999000e+05, -9.99999000e+05, -999999.),
       (318.        , 302.73001099, 306.        , 317.69000244,  7394100, 317.69000244,  3.00359985e+02,  3.10119995e+02, -9.99999000e+05, -9.99999000e+05, -999999.),
       (336.73999023, 317.75      , 321.72000122, 334.95999146,  7551200, 334.95999146,  3.17690002e+02,  3.00359985e+02,  3.10119995e+02, -9.99999000e+05, -999999.),
       (344.01000977, 327.01998901, 341.95999146, 335.3500061 ,  7008500, 335.3500061 ,  3.34959991e+02,  3.17690002e+02,  3.00359985e+02,  3.10119995e+02, -999999.)],
      dtype={'names':['High','Low','Open','Close','Volume','Adj_Close','Adj_Close_lag_1','Adj_Close_lag_2','Adj_Close_lag_3','Adj_Close_lag_4','Adj_Close_lag_5'], 'formats':['<f8','<f8','<f8','<f8','<i8','<f8','<f8','<f8','<f8','<f8','<f8'], 'offsets':[0,8,16,24,32,40,112,120,128,136,144], 'itemsize':152})

Fillna Forward Fill

pp.fillna(tsla_lagged,type="ffill")[:5]

array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512,          nan,          nan,          nan,          nan, nan),
       (309.3999939 , 297.38000488, 307.        , 300.35998535,  6965200, 300.35998535, 310.11999512,          nan,          nan,          nan, nan),
       (318.        , 302.73001099, 306.        , 317.69000244,  7394100, 317.69000244, 300.35998535, 310.11999512,          nan,          nan, nan),
       (336.73999023, 317.75      , 321.72000122, 334.95999146,  7551200, 334.95999146, 317.69000244, 300.35998535, 310.11999512,          nan, nan),
       (344.01000977, 327.01998901, 341.95999146, 335.3500061 ,  7008500, 335.3500061 , 334.95999146, 317.69000244, 300.35998535, 310.11999512, nan)],
      dtype={'names':['High','Low','Open','Close','Volume','Adj_Close','Adj_Close_lag_1','Adj_Close_lag_2','Adj_Close_lag_3','Adj_Close_lag_4','Adj_Close_lag_5'], 'formats':['<f8','<f8','<f8','<f8','<i8','<f8','<f8','<f8','<f8','<f8','<f8'], 'offsets':[0,8,16,24,32,40,112,120,128,136,144], 'itemsize':152})

Fillna Backward Fill

pp.fillna(tsla_lagged,type="bfill")[:5]

array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, 310.11999512, 310.11999512, 310.11999512, 310.11999512, 310.11999512),
       (309.3999939 , 297.38000488, 307.        , 300.35998535,  6965200, 300.35998535, 310.11999512, 310.11999512, 310.11999512, 310.11999512, 310.11999512),
       (318.        , 302.73001099, 306.        , 317.69000244,  7394100, 317.69000244, 300.35998535, 310.11999512, 310.11999512, 310.11999512, 310.11999512),
       (336.73999023, 317.75      , 321.72000122, 334.95999146,  7551200, 334.95999146, 317.69000244, 300.35998535, 310.11999512, 310.11999512, 310.11999512),
       (344.01000977, 327.01998901, 341.95999146, 335.3500061 ,  7008500, 335.3500061 , 334.95999146, 317.69000244, 300.35998535, 310.11999512, 310.11999512)],
      dtype={'names':['High','Low','Open','Close','Volume','Adj_Close','Adj_Close_lag_1','Adj_Close_lag_2','Adj_Close_lag_3','Adj_Close_lag_4','Adj_Close_lag_5'], 'formats':['<f8','<f8','<f8','<f8','<i8','<f8','<f8','<f8','<f8','<f8','<f8'], 'offsets':[0,8,16,24,32,40,112,120,128,136,144], 'itemsize':152})

Print Table

pp.table(tsla_lagged,5)

	High	Low	Open	Close	Volume	Adj_Close	Date	Ticker	Month	Year	Adj_Close_lag_1	Adj_Close_lag_2	Adj_Close_lag_3	Adj_Close_lag_4	Adj_Close_lag_5
0	315.130	298.800	306.100	310.120	11658600	310.120	2019-01-02	TSLA	2019-01-01	2019-01-01	nan	nan	nan	nan	nan
1	309.400	297.380	307.000	300.360	6965200	300.360	2019-01-03	TSLA	2019-01-01	2019-01-01	310.120	nan	nan	nan	nan
2	318.000	302.730	306.000	317.690	7394100	317.690	2019-01-04	TSLA	2019-01-01	2019-01-01	300.360	310.120	nan	nan	nan
3	336.740	317.750	321.720	334.960	7551200	334.960	2019-01-07	TSLA	2019-01-01	2019-01-01	317.690	300.360	310.120	nan	nan
4	344.010	327.020	341.960	335.350	7008500	335.350	2019-01-08	TSLA	2019-01-01	2019-01-01	334.960	317.690	300.360	310.120	nan

Outliers

signal = tsla_lagged["Volume"]
z_signal = (signal - np.mean(signal)) / np.std(signal)

tsla_lagged = pp.add(tsla_lagged,"z_signal_volume",z_signal)

outliers = pp.detect(tsla_lagged["z_signal_volume"]); outliers

[12, 40, 42, 64, 78, 79, 84, 95, 97, 98, 107, 141, 205, 206, 207]

import matplotlib.pyplot as plt

plt.figure(figsize=(15, 7))
plt.plot(np.arange(len(tsla_lagged["Volume"])), tsla_lagged["Volume"])
plt.plot(np.arange(len(tsla_lagged["Volume"])), tsla_lagged["Volume"], 'X', label='outliers',markevery=outliers, c='r')
plt.legend()
plt.show()

PandaPy has the speed of NumPy and the usability of Pandas

Remove Noise

price_signal = tsla_lagged["Close"]
removed_signal = pp.removal(price_signal, 30)
noise = pp.get(price_signal, removed_signal)

plt.figure(figsize=(15, 7))
plt.subplot(2, 1, 1)
plt.plot(removed_signal)
plt.title('timeseries without noise')
plt.subplot(2, 1, 2)
plt.plot(noise)
plt.title('noise timeseries')
plt.show()

PandaPy has the speed of NumPy and the usability of Pandas

以上所述就是小编给大家介绍的《PandaPy has the speed of NumPy and the usability of Pandas》，希望对大家有所帮助，如果大家有任何疑问请给我留言，小编会及时回复大家的。在此也非常感谢大家对码农网的支持！

查看所有标签

猜你喜欢:

PandaPy has the speed of NumPy and the usability of Pandas

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

算法

Robert Sedgewick、Kevin Wayne / 人民邮电出版社 / 2012-3 / 99.00元

《算法(英文版•第4版)》作为算法领域经典的参考书，全面介绍了关于算法和数据结构的必备知识，并特别针对排序、搜索、图处理和字符串处理进行了论述。第4版具体给出了每位程序员应知应会的50个算法，提供了实际代码，而且这些Java代码实现采用了模块化的编程风格，读者可以方便地加以改造。本书配套网站提供了本书内容的摘要及更多的代码实现、测试数据、练习、教学课件等资源。《算法(英文版•第4版)》适合......一起来看看《算法》这本书的介绍吧!

码农工具

PandaPy has the speed of NumPy and the usability of Pandas

PandaPy

Install

Load

Why PandaPy?

Note this is a fledgling project, much room for improvement, all feedback appreciated (issues tab)

Description

Additional

Functions

PandaPy Speed Over Pandas In (X) e.g., (dropnarow) (30x)

Array Structure

Explorative Functions

Finance Functions

Array Functions

Other Speed Tests

Documentation by Example

Read In Arrays

Rename

Statistics

Drop Column/s

Add Column/s

Concatenate Arrays by Row

Concatenate Arrays by Column

Concatenate by Array

Concatenate by Melt

Merge Array (inner, outer)

Replace Individual Values

Print Table

Add and Concatenate

Pivot Array

Add New Data types

Update Existing Column

Group Arrays By

Convert Array to Pandas

From Pandas to Structured

Shift Column

Multiple Lags for Column

Correlation Array

Log Returns

Normal Returns

Add Column

Drop Array Rows Where Null

Portfolio Value from Log Return

Cummulative Value from Log Return

Fillna Mean

Fillna Value

Fillna Forward Fill

Fillna Backward Fill

Print Table

Outliers

Remove Noise

算法

随机密码生成器

Markdown 在线编辑器

正则表达式在线测试