内容简介:PandaPy software, similar to the original Pandas project, is developed to improve the usability of python for finance. Structured datatypes are designed to be able to mimic ‘structs’ in the C language, and share a similar memory layout. PandaPy currently h
PandaPy
Install
!pip3 install pandapy
Load
import pandapy as pp
Why PandaPy?
- Maintains the full functionality and speed of structured NumPy datatype (eg.,
array[col1] + array[col2], or np.log(array[col1]
) - Provides wrapper functions over NumPy to give you the usability of Pandas (eg.,
pp.group(array, [col1, col2, col2], ['mean', 'std'], ['Adj_Close','Close'])
- If you need Pandas for speciality functions, you can easily
df = pp.pandas(array)
and backarray = pp.structured(df)
- For simple calculations (i.e, plus, mult, log) PandaPy is 25x - 80x faster than Pandas.
- For table functions (i.e., group, pivot, drop, concat, fillna) PandaPy is 5x - 100x times faster than Pandas.
- For most use cases, PandaPy is faster than Dask, Modin Ray and Pandas.
- The best competing python package for performance on table functions is datatable , it is 2x - 10x faster than PandaPy.
- The problem is that datatable is 5x - 10x slower with simple calculations (plus, mult, returns), it is less intuitive, does not have a large range of functions, have very few complementary libraries, e.g. matplotlib, and doesn't leave you in a Numpy datatype.
- For finance applications the speed of simple calculations takes preference over table function speed.
- PandaPy is not created to allow you to scale up to clusters for multiple computer processing like Dask, Modin, and Spark, instead it is focused on speed and usability within a single computer's Memory.
- Machines are getting large, EC2 X1 has 2TB of RAM and is remarkably affordable. If it can be done on a single machine then it should be done on a single machine. Quoting Dask - "For data that fits into RAM, Pandas can often be faster and easier to use than Dask DataFrame"
- If your dataset is very small you can load your data using PandaPy's
read()
function, for medium sized data, it is best to load it with datatable or pyspark and convert it to structured Numpy, if it is large pyspark, Dask, or Modin, if it is very large use pyspark. - Lastly PandaPy can have as input any multidimensional object and does not have to conform to the basic NumPy datatypes. It can include nested datatypes, subarrays, functions as long as each column conforms to the array lenght, this allows for a great amount of flexibility. You can for example,
add(array, "panda function",[[pd for i in range(len(multiple_stocks))]])
to create a list of the panda (pd) module and access it along any index valuearray["panda function"][0].read_csv(url)
.
PandaPy software, similar to the original Pandas project, is developed to improve the usability of python for finance. Structured datatypes are designed to be able to mimic ‘structs’ in the C language, and share a similar memory layout. PandaPy currently houses more than 30 functions. Structured NumPy are meant for interfacing with C code and for low-level manipulation of structured buffers, for example for interpreting binary blobs. For these purposes they support specialized features such as subarrays, nested datatypes, and unions, and allow control over the memory layout of the structure.
Note this is a fledgling project, much room for improvement, all feedback appreciated (issues tab)
Description
A Structured NumPy Array is an array of structures. NumPy arrays can only contain one data type, but structured arrays in a sense create an array of homogeneous structures. This is done without moving out of NumPy such as is required with Xarray. For structured arrays the data type only has to be the same per column like an SQL data base. Each column can be another multidimensional object and does not have to conform to the basic NumPy datatypes.
PandaPy comes with similar functionality like Pandas, such as groupby, pivot, and others. The biggest benefit of this approach is that NumPy dtype(data type) directly maps onto a C structure definition, so the buffer containing the array content can be accessed directly within an appropriately written C program. If you find yourself writing a Python interface to a legacy C or Fortran library that manipulates structured data, you'll probably find structured arrays quite useful.
Additional
- Play around with speed tests here and some more here .
- Test and explore the package with this Google Colab Notebook .
- Get in touch on LinkedIn or Twitter .
- Use
table(array)
to get a pandas looking table printout
Functions
PandaPy Speed Over Pandas In (X) e.g., (dropnarow) (30x)
Array Structure
Read In Arrays (read) To Pandas (unstructured) Pandas to Structured (structured) To Unstructured (to_unstruct) To Structured (to_struct) Print Table (table)
Explorative Functions
Descriptive Statistics (describe) (5x) Correlation Array (corr) (2x)
Finance Functions
Returns (returns) (50x) Portfolio Value (portfolio_value) (50x) Cummulative Value (cummulative_return) (50x) Column Lags (lags) (7x)
Array Functions
Drop Null Rows (dropnarow) (30x) Drop Column/s (drop) (100x) Add Column/s (add) (3x) Concatenate (concat) (rows 25x columns 70x) Merge (merge) (2x) Group by (group) (10x) Pivot (pivot) (20x) Fill Nulls (fillna) (20x) Shift Column (shift) (50x) Rename (rename) (500x)
Other Speed Tests
Update (array[col] = values) (60x) Addition (array[col] + array[col]) (80x) Multiplication (array[col] * array[col]) (80x) Log (np.log(array[col]) (25x)
note speed tests done on financial dataset only
Documentation by Example
Read In Arrays
# First Example multiple_stocks = pp.read('https://github.com/firmai/random-assets-two/blob/master/numpy/multiple_stocks.csv?raw=true') closing = multiple_stocks[['Ticker','Date','Adj_Close']] piv = pp.pivot(closing,"Date","Ticker","Adj_Close"); piv closing = pp.to_struct(piv, name_list = [x for x in np.unique(multiple_stocks["Ticker"])]) # Second Example tsla = pp.read('https://github.com/firmai/random-assets-two/raw/master/numpy/tsla.csv') crm = pp.read('https://github.com/firmai/random-assets-two/raw/master/numpy/crm.csv') tsla_sub = tsla[["Date","Adj_Close","Volume"]] crm_sub = crm[["Date","Adj_Close","Volume"]] crm_adj = crm[['Date','Adj_Close']]
closing
array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312, 41.9791832 , 81.51140594, 66.33999634), (35.08446503, 97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312, 41.59314346, 80.89860535, 66.15000153), (35.34244537, 97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897, 40.98268127, 80.28580475, 64.58000183), ..., (21.57999992, 289.79998779, 59.08000183, 11.18000031, 135.27000427, 55.34999847, 158.96000671, 137.53999329, 88.37000275), (21.34000015, 291.51998901, 58.65999985, 11.07999992, 132.80999756, 55.27000046, 157.58999634, 136.80999756, 87.95999908), (21.51000023, 293.6499939 , 58.47999954, 11.15999985, 134.03999329, 55.34999847, 157.69999695, 136.66999817, 88.08999634)], dtype=[('AA', '<f8'), ('AAPL', '<f8'), ('DAL', '<f8'), ('GE', '<f8'), ('IBM', '<f8'), ('KO', '<f8'), ('MSFT', '<f8'), ('PEP', '<f8'), ('UAL', '<f8')])
Rename
pp.rename(closing,["AA","AAPL"],["GAP","FAF"])[:5]
array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312, 41.9791832 , 81.51140594, 66.33999634), (35.08446503, 97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312, 41.59314346, 80.89860535, 66.15000153), (35.34244537, 97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897, 40.98268127, 80.28580475, 64.58000183), (36.25707626, 99.00255585, 42.57216263, 19.91554451, 124.94229126, 36.52467346, 41.50337982, 82.63342285, 65.52999878), (37.28897095, 102.80648041, 43.67792892, 20.15538216, 127.65791321, 36.966465 , 42.72432327, 84.13523865, 66.63999939)], dtype=[('GAP', '<f8'), ('FAF', '<f8'), ('DAL', '<f8'), ('GE', '<f8'), ('IBM', '<f8'), ('KO', '<f8'), ('MSFT', '<f8'), ('PEP', '<f8'), ('UAL', '<f8')])
pp.rename(closing,"AA", "GALLY")[:5]
array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312, 41.9791832 , 81.51140594, 66.33999634), (35.08446503, 97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312, 41.59314346, 80.89860535, 66.15000153), (35.34244537, 97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897, 40.98268127, 80.28580475, 64.58000183), (36.25707626, 99.00255585, 42.57216263, 19.91554451, 124.94229126, 36.52467346, 41.50337982, 82.63342285, 65.52999878), (37.28897095, 102.80648041, 43.67792892, 20.15538216, 127.65791321, 36.966465 , 42.72432327, 84.13523865, 66.63999939)], dtype=[('GALLY', '<f8'), ('AAPL', '<f8'), ('DAL', '<f8'), ('GE', '<f8'), ('IBM', '<f8'), ('KO', '<f8'), ('MSFT', '<f8'), ('PEP', '<f8'), ('UAL', '<f8')])
Statistics
described = pp.describe(closing)
Describe | observations | minimum | maximum | mean | variance | skewness | kurtosis |
---|---|---|---|---|---|---|---|
AA | 1258.00 | 15.97 | 60.23 | 31.46 | 99.42 | 0.67 | -0.58 |
AAPL | 1258.00 | 85.39 | 293.65 | 149.45 | 2119.86 | 0.66 | -0.28 |
DAL | 1258.00 | 30.73 | 62.69 | 47.15 | 44.33 | -0.01 | -0.78 |
GE | 1258.00 | 6.42 | 28.67 | 18.85 | 48.45 | -0.25 | -1.54 |
IBM | 1258.00 | 99.83 | 161.17 | 133.35 | 116.28 | -0.37 | 0.56 |
KO | 1258.00 | 32.81 | 55.35 | 41.67 | 28.86 | 0.80 | -0.05 |
MSFT | 1258.00 | 36.27 | 158.96 | 78.31 | 1102.21 | 0.61 | -0.82 |
PEP | 1258.00 | 78.46 | 139.30 | 102.86 | 229.01 | 0.63 | -0.32 |
UAL | 1258.00 | 37.75 | 96.70 | 69.22 | 195.65 | 0.02 | -1.04 |
Drop Column/s
removed = pp.drop(closing,["AA","AAPL","IBM"]) ; removed[:5]
array([(44.57522202, 20.72605705, 35.80251312, 41.9791832 , 81.51140594, 66.33999634), (43.83200836, 20.34561157, 35.80251312, 41.59314346, 80.89860535, 66.15000153), (42.79874039, 19.90727234, 36.07437897, 40.98268127, 80.28580475, 64.58000183), (42.57216263, 19.91554451, 36.52467346, 41.50337982, 82.63342285, 65.52999878), (43.67792892, 20.15538216, 36.966465 , 42.72432327, 84.13523865, 66.63999939)], dtype={'names':['DAL','GE','KO','MSFT','PEP','UAL'], 'formats':['<f8','<f8','<f8','<f8','<f8','<f8'], 'offsets':[16,24,40,48,56,64], 'itemsize':72})
Add Column/s
added = pp.add(closing,["GALLY","FAF"],[closing["IBM"],closing["AA"]]); added[:5] ## set two new columns with that two previous columnns
array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312, 41.9791832 , 81.51140594, 66.33999634, 130.59109497, 37.24206924), (35.08446503, 97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312, 41.59314346, 80.89860535, 66.15000153, 128.53627014, 35.08446503), (35.34244537, 97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897, 40.98268127, 80.28580475, 64.58000183, 125.76422119, 35.34244537), (36.25707626, 99.00255585, 42.57216263, 19.91554451, 124.94229126, 36.52467346, 41.50337982, 82.63342285, 65.52999878, 124.94229126, 36.25707626), (37.28897095, 102.80648041, 43.67792892, 20.15538216, 127.65791321, 36.966465 , 42.72432327, 84.13523865, 66.63999939, 127.65791321, 37.28897095)], dtype=[('AA', '<f8'), ('AAPL', '<f8'), ('DAL', '<f8'), ('GE', '<f8'), ('IBM', '<f8'), ('KO', '<f8'), ('MSFT', '<f8'), ('PEP', '<f8'), ('UAL', '<f8'), ('GALLY', '<f8'), ('FAF', '<f8')])
Concatenate Arrays by Row
concat_row = pp.concat(removed[["DAL","GE"]], added[["PEP","UAL"]], type="row"); concat_row[:5]
array([(44.57522202, 20.72605705), (43.83200836, 20.34561157), (42.79874039, 19.90727234), (42.57216263, 19.91554451), (43.67792892, 20.15538216)], dtype=[('DAL', '<f8'), ('GE', '<f8')])
Concatenate Arrays by Column
concat_col = pp.concat(removed[["DAL","GE"]], added[["PEP","UAL"]], type="columns"); concat_col[:5]
array([(44.57522202, 20.72605705, 81.51140594, 66.33999634), (43.83200836, 20.34561157, 80.89860535, 66.15000153), (42.79874039, 19.90727234, 80.28580475, 64.58000183), (42.57216263, 19.91554451, 82.63342285, 65.52999878), (43.67792892, 20.15538216, 84.13523865, 66.63999939)], dtype=[('DAL', '<f8'), ('GE', '<f8'), ('PEP', '<f8'), ('UAL', '<f8')])
Concatenate by Array
concat_array = pp.concat(removed[["DAL","GE"]], added[["PEP","UAL"]], type="array"); concat_array[:5]
array([[(44.57522201538086, 20.726057052612305), (43.832008361816406, 20.345611572265625), (42.79874038696289, 19.907272338867188), ..., (59.08000183105469, 11.180000305175781), (58.65999984741211, 11.079999923706055), (58.47999954223633, 11.15999984741211)], [(81.51140594482422, 66.33999633789062), (80.89860534667969, 66.1500015258789), (80.28580474853516, 64.58000183105469), ..., (137.5399932861328, 88.37000274658203), (136.80999755859375, 87.95999908447266), (136.6699981689453, 88.08999633789062)]], dtype=object)
Concatenate by Melt
concat_melt = pp.concat(removed[["DAL","GE"]], added[["PEP","UAL"]], type="melt"); concat_melt[:5]
array([(44.57522202, 20.72605705), (43.83200836, 20.34561157), (42.79874039, 19.90727234), (42.57216263, 19.91554451), (43.67792892, 20.15538216)], dtype=[('DAL', '<f8'), ('GE', '<f8')])
Merge Array (inner, outer)
merged = pp.merge(tsla_sub, crm_adj, left_on="Date", right_on="Date",how="inner",left_postscript="_TSLA",right_postscript="_CRM"); merged[:5]
array([('2019-01-02', 310.11999512, 135.55000305, 11658600), ('2019-01-03', 300.35998535, 130.3999939 , 6965200), ('2019-01-04', 317.69000244, 137.96000671, 7394100), ('2019-01-07', 334.95999146, 142.22000122, 7551200), ('2019-01-08', 335.3500061 , 145.72000122, 7008500)], dtype=[('Date', '<M8[D]'), ('Adj_Close_TSLA', '<f8'), ('Adj_Close_CRM', '<f8'), ('Volume', '<i8')])
Replace Individual Values
## More work to done on replace (structured) ## replace(merged,original=317.69000244, replacement=np.nan)[:5]
Print Table
pp.table(merged[:5])
Date | Adj_Close_TSLA | Adj_Close_CRM | Volume | |
---|---|---|---|---|
0 | 2019-01-02 | 310.120 | 135.550 | 11658600 |
1 | 2019-01-03 | 300.360 | 130.400 | 6965200 |
2 | 2019-01-04 | 317.690 | 137.960 | 7394100 |
3 | 2019-01-07 | 334.960 | 142.220 | 7551200 |
4 | 2019-01-08 | 335.350 | 145.720 | 7008500 |
### This is the new function that you should include above ### You can add the same peculuarities to remove
Add and Concatenate
tsla = pp.add(tsla,["Ticker"], "TSLA", "U10") crm = pp.add(crm,["Ticker"], "CRM", "U10") combine = pp.concat(tsla[0:5], crm[0:5], type="row"); combine
array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, '2019-01-02', 'TSLA'), (309.3999939 , 297.38000488, 307. , 300.35998535, 6965200, 300.35998535, '2019-01-03', 'TSLA'), (318. , 302.73001099, 306. , 317.69000244, 7394100, 317.69000244, '2019-01-04', 'TSLA'), (336.73999023, 317.75 , 321.72000122, 334.95999146, 7551200, 334.95999146, '2019-01-07', 'TSLA'), (344.01000977, 327.01998901, 341.95999146, 335.3500061 , 7008500, 335.3500061 , '2019-01-08', 'TSLA'), (136.83000183, 133.05000305, 133.3999939 , 135.55000305, 4783900, 135.55000305, '2019-01-02', 'CRM'), (134.77999878, 130.1000061 , 133.47999573, 130.3999939 , 6365700, 130.3999939 , '2019-01-03', 'CRM'), (139.32000732, 132.22000122, 133.5 , 137.96000671, 6650600, 137.96000671, '2019-01-04', 'CRM'), (143.38999939, 138.78999329, 141.02000427, 142.22000122, 9064800, 142.22000122, '2019-01-07', 'CRM'), (146.46000671, 142.88999939, 144.72999573, 145.72000122, 9057300, 145.72000122, '2019-01-08', 'CRM')], dtype=[('High', '<f8'), ('Low', '<f8'), ('Open', '<f8'), ('Close', '<f8'), ('Volume', '<i8'), ('Adj_Close', '<f8'), ('Date', '<M8[D]'), ('Ticker', '<U10')])
dropped = pp.drop(combine,["High","Low","Open"]); dropped[:10]
array([(310.11999512, 11658600, 310.11999512, '2019-01-02', 'TSLA'), (300.35998535, 6965200, 300.35998535, '2019-01-03', 'TSLA'), (317.69000244, 7394100, 317.69000244, '2019-01-04', 'TSLA'), (334.95999146, 7551200, 334.95999146, '2019-01-07', 'TSLA'), (335.3500061 , 7008500, 335.3500061 , '2019-01-08', 'TSLA'), (135.55000305, 4783900, 135.55000305, '2019-01-02', 'CRM'), (130.3999939 , 6365700, 130.3999939 , '2019-01-03', 'CRM'), (137.96000671, 6650600, 137.96000671, '2019-01-04', 'CRM'), (142.22000122, 9064800, 142.22000122, '2019-01-07', 'CRM'), (145.72000122, 9057300, 145.72000122, '2019-01-08', 'CRM')], dtype={'names':['Close','Volume','Adj_Close','Date','Ticker'], 'formats':['<f8','<i8','<f8','<M8[D]','<U10'], 'offsets':[24,32,40,48,56], 'itemsize':96})
Pivot Array
piv = pp.pivot(dropped,"Date","Ticker","Adj_Close",display=True)
Adj_Close | CRM | TSLA |
---|---|---|
2019-01-02 | 135.55 | 310.12 |
2019-01-03 | 130.40 | 300.36 |
2019-01-04 | 137.96 | 317.69 |
2019-01-07 | 142.22 | 334.96 |
2019-01-08 | 145.72 | 335.35 |
Add New Data types
tsla_extended = pp.add(tsla,"Month",tsla["Date"],'datetime64[M]') tsla_extended = pp.add(tsla_extended,"Year",tsla_extended["Date"],'datetime64[Y]')
Update Existing Column
## faster method elsewhere year_frame = pp.update(tsla,"Date", [dt.year for dt in tsla["Date"].astype(object)],types="|U10"); year_frame[:5]
array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, 'TSLA', '2019'), (309.3999939 , 297.38000488, 307. , 300.35998535, 6965200, 300.35998535, 'TSLA', '2019'), (318. , 302.73001099, 306. , 317.69000244, 7394100, 317.69000244, 'TSLA', '2019'), (336.73999023, 317.75 , 321.72000122, 334.95999146, 7551200, 334.95999146, 'TSLA', '2019'), (344.01000977, 327.01998901, 341.95999146, 335.3500061 , 7008500, 335.3500061 , 'TSLA', '2019')], dtype=[('High', '<f8'), ('Low', '<f8'), ('Open', '<f8'), ('Close', '<f8'), ('Volume', '<i8'), ('Adj_Close', '<f8'), ('Ticker', '<U10'), ('Date', '<U10')])
Group Arrays By
grouped = pp.group(tsla_extended, ['Ticker','Month','Year'],['mean', 'std', 'min', 'max'], ['Adj_Close','Close'], display=True)
Ticker | Month | Year | Adj_Close_mean | Adj_Close_std | Adj_Close_min | Adj_Close_max | Close_mean | Close_std | Close_min | Close_max | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | TSLA | 2019-01-01 | 2019-01-01 | 318.494 | 21.098 | 287.590 | 347.310 | 318.494 | 21.098 | 287.590 | 347.310 |
1 | TSLA | 2019-02-01 | 2019-01-01 | 307.728 | 8.053 | 291.230 | 321.350 | 307.728 | 8.053 | 291.230 | 321.350 |
2 | TSLA | 2019-03-01 | 2019-01-01 | 277.757 | 8.925 | 260.420 | 294.790 | 277.757 | 8.925 | 260.420 | 294.790 |
3 | TSLA | 2019-04-01 | 2019-01-01 | 266.656 | 14.985 | 235.140 | 291.810 | 266.656 | 14.985 | 235.140 | 291.810 |
4 | TSLA | 2019-05-01 | 2019-01-01 | 219.715 | 24.040 | 185.160 | 255.340 | 219.715 | 24.040 | 185.160 | 255.340 |
5 | TSLA | 2019-06-01 | 2019-01-01 | 213.717 | 12.125 | 178.970 | 226.430 | 213.717 | 12.125 | 178.970 | 226.430 |
6 | TSLA | 2019-07-01 | 2019-01-01 | 242.382 | 12.077 | 224.550 | 264.880 | 242.382 | 12.077 | 224.550 | 264.880 |
7 | TSLA | 2019-08-01 | 2019-01-01 | 225.103 | 7.831 | 211.400 | 238.300 | 225.103 | 7.831 | 211.400 | 238.300 |
8 | TSLA | 2019-09-01 | 2019-01-01 | 237.261 | 8.436 | 220.680 | 247.100 | 237.261 | 8.436 | 220.680 | 247.100 |
9 | TSLA | 2019-10-01 | 2019-01-01 | 266.355 | 31.463 | 231.430 | 328.130 | 266.355 | 31.463 | 231.430 | 328.130 |
10 | TSLA | 2019-11-01 | 2019-01-01 | 338.300 | 13.226 | 313.310 | 359.520 | 338.300 | 13.226 | 313.310 | 359.520 |
11 | TSLA | 2019-12-01 | 2019-01-01 | 377.695 | 36.183 | 330.370 | 430.940 | 377.695 | 36.183 | 330.370 | 430.940 |
Convert Array to Pandas
grouped_frame = pp.pandas(grouped); grouped_frame.head()
Ticker | Month | Year | Adj_Close_mean | Adj_Close_std | Adj_Close_min | Adj_Close_max | Close_mean | Close_std | Close_min | Close_max | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | TSLA | 2019-01-01 | 2019-01-01 | 318.494284 | 21.098362 | 287.589996 | 347.309998 | 318.494284 | 21.098362 | 287.589996 | 347.309998 |
1 | TSLA | 2019-02-01 | 2019-01-01 | 307.728421 | 8.052522 | 291.230011 | 321.350006 | 307.728421 | 8.052522 | 291.230011 | 321.350006 |
2 | TSLA | 2019-03-01 | 2019-01-01 | 277.757140 | 8.924873 | 260.420013 | 294.790009 | 277.757140 | 8.924873 | 260.420013 | 294.790009 |
3 | TSLA | 2019-04-01 | 2019-01-01 | 266.655716 | 14.984572 | 235.139999 | 291.809998 | 266.655716 | 14.984572 | 235.139999 | 291.809998 |
4 | TSLA | 2019-05-01 | 2019-01-01 | 219.715454 | 24.039647 | 185.160004 | 255.339996 | 219.715454 | 24.039647 | 185.160004 | 255.339996 |
From Pandas to Structured
struct = pp.structured(grouped_frame); struct[:5]
rec.array([('TSLA', '2019-01-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 318.49428449, 21.09836186, 287.58999634, 347.30999756, 318.49428449, 21.09836186, 287.58999634, 347.30999756), ('TSLA', '2019-02-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 307.72842086, 8.05252198, 291.23001099, 321.3500061 , 307.72842086, 8.05252198, 291.23001099, 321.3500061 ), ('TSLA', '2019-03-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 277.75713966, 8.92487345, 260.42001343, 294.79000854, 277.75713966, 8.92487345, 260.42001343, 294.79000854), ('TSLA', '2019-04-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 266.65571594, 14.98457194, 235.13999939, 291.80999756, 266.65571594, 14.98457194, 235.13999939, 291.80999756), ('TSLA', '2019-05-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 219.7154541 , 24.03964724, 185.16000366, 255.33999634, 219.7154541 , 24.03964724, 185.16000366, 255.33999634)], dtype=[('Ticker', 'O'), ('Month', '<M8[ns]'), ('Year', '<M8[ns]'), ('Adj_Close_mean', '<f8'), ('Adj_Close_std', '<f8'), ('Adj_Close_min', '<f8'), ('Adj_Close_max', '<f8'), ('Close_mean', '<f8'), ('Close_std', '<f8'), ('Close_min', '<f8'), ('Close_max', '<f8')])
Shift Column
pp.shift(merged["Adj_Close_TSLA"],1)[:5]
array([ nan, 310.11999512, 300.35998535, 317.69000244, 334.95999146])
Multiple Lags for Column
tsla_lagged = pp.lags(tsla_extended, "Adj_Close", 5); tsla_lagged[:5]
array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, '2019-01-02', 'TSLA', '2019-01', '2019', nan, nan, nan, nan, nan), (309.3999939 , 297.38000488, 307. , 300.35998535, 6965200, 300.35998535, '2019-01-03', 'TSLA', '2019-01', '2019', 310.11999512, nan, nan, nan, nan), (318. , 302.73001099, 306. , 317.69000244, 7394100, 317.69000244, '2019-01-04', 'TSLA', '2019-01', '2019', 300.35998535, 310.11999512, nan, nan, nan), (336.73999023, 317.75 , 321.72000122, 334.95999146, 7551200, 334.95999146, '2019-01-07', 'TSLA', '2019-01', '2019', 317.69000244, 300.35998535, 310.11999512, nan, nan), (344.01000977, 327.01998901, 341.95999146, 335.3500061 , 7008500, 335.3500061 , '2019-01-08', 'TSLA', '2019-01', '2019', 334.95999146, 317.69000244, 300.35998535, 310.11999512, nan)], dtype=[('High', '<f8'), ('Low', '<f8'), ('Open', '<f8'), ('Close', '<f8'), ('Volume', '<i8'), ('Adj_Close', '<f8'), ('Date', '<M8[D]'), ('Ticker', '<U10'), ('Month', '<M8[M]'), ('Year', '<M8[Y]'), ('Adj_Close_lag_1', '<f8'), ('Adj_Close_lag_2', '<f8'), ('Adj_Close_lag_3', '<f8'), ('Adj_Close_lag_4', '<f8'), ('Adj_Close_lag_5', '<f8')])
Correlation Array
correlated = pp.corr(closing)
Correlation | AA | AAPL | DAL | GE | IBM | KO | MSFT | PEP | UAL |
---|---|---|---|---|---|---|---|---|---|
AA | 1.00 | 0.21 | 0.24 | -0.17 | 0.39 | -0.09 | 0.05 | -0.04 | 0.12 |
AAPL | 0.21 | 1.00 | 0.86 | -0.83 | 0.22 | 0.85 | 0.94 | 0.85 | 0.82 |
DAL | 0.24 | 0.86 | 1.00 | -0.78 | 0.14 | 0.79 | 0.86 | 0.78 | 0.86 |
GE | -0.17 | -0.83 | -0.78 | 1.00 | 0.06 | -0.76 | -0.86 | -0.69 | -0.76 |
IBM | 0.39 | 0.22 | 0.14 | 0.06 | 1.00 | 0.07 | 0.15 | 0.24 | 0.18 |
KO | -0.09 | 0.85 | 0.79 | -0.76 | 0.07 | 1.00 | 0.94 | 0.96 | 0.74 |
MSFT | 0.05 | 0.94 | 0.86 | -0.86 | 0.15 | 0.94 | 1.00 | 0.93 | 0.83 |
PEP | -0.04 | 0.85 | 0.78 | -0.69 | 0.24 | 0.96 | 0.93 | 1.00 | 0.75 |
UAL | 0.12 | 0.82 | 0.86 | -0.76 | 0.18 | 0.74 | 0.83 | 0.75 | 1.00 |
Log Returns
pp.returns(closing,"IBM",type="log")
array([ nan, -0.01585991, -0.02180223, ..., 0.0026649 , -0.0183533 , 0.0092187 ])
Normal Returns
loga = pp.returns(closing,"IBM",type="normal"); loga
array([ nan, -0.0157348 , -0.02156628, ..., 0.00266845, -0.0181859 , 0.00926132])
Add Column
close_ret = pp.add(closing,"IBM_log_return",loga); close_ret[:5]
array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312, 41.9791832 , 81.51140594, 66.33999634, nan), (35.08446503, 97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312, 41.59314346, 80.89860535, 66.15000153, -0.0157348 ), (35.34244537, 97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897, 40.98268127, 80.28580475, 64.58000183, -0.02156628), (36.25707626, 99.00255585, 42.57216263, 19.91554451, 124.94229126, 36.52467346, 41.50337982, 82.63342285, 65.52999878, -0.00653548), (37.28897095, 102.80648041, 43.67792892, 20.15538216, 127.65791321, 36.966465 , 42.72432327, 84.13523865, 66.63999939, 0.02173501)], dtype=[('AA', '<f8'), ('AAPL', '<f8'), ('DAL', '<f8'), ('GE', '<f8'), ('IBM', '<f8'), ('KO', '<f8'), ('MSFT', '<f8'), ('PEP', '<f8'), ('UAL', '<f8'), ('IBM_log_return', '<f8')])
Drop Array Rows Where Null
close_ret_na = pp.dropnarow(close_ret, "IBM_log_return"); close_ret[:5]
array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312, 41.9791832 , 81.51140594, 66.33999634, nan), (35.08446503, 97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312, 41.59314346, 80.89860535, 66.15000153, -0.0157348 ), (35.34244537, 97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897, 40.98268127, 80.28580475, 64.58000183, -0.02156628), (36.25707626, 99.00255585, 42.57216263, 19.91554451, 124.94229126, 36.52467346, 41.50337982, 82.63342285, 65.52999878, -0.00653548), (37.28897095, 102.80648041, 43.67792892, 20.15538216, 127.65791321, 36.966465 , 42.72432327, 84.13523865, 66.63999939, 0.02173501)], dtype=[('AA', '<f8'), ('AAPL', '<f8'), ('DAL', '<f8'), ('GE', '<f8'), ('IBM', '<f8'), ('KO', '<f8'), ('MSFT', '<f8'), ('PEP', '<f8'), ('UAL', '<f8'), ('IBM_log_return', '<f8')])
Portfolio Value from Log Return
pp.portfolio_value(close_ret_na, "IBM_log_return", type="log")
array([0.98438834, 0.96338604, 0.95711037, ..., 1.15115429, 1.13040872, 1.14092643])
Cummulative Value from Log Return
pp.cummulative_return(close_ret_na, "IBM_log_return", type="log")
array([-0.01561166, -0.03661396, -0.04288963, ..., 0.15115429, 0.13040872, 0.14092643])
Fillna Mean
pp.fillna(tsla_lagged,type="mean")[:5]
array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, 272.95330665, 272.38631982, 271.75180703, 271.10991915, 270.48587024), (309.3999939 , 297.38000488, 307. , 300.35998535, 6965200, 300.35998535, 310.11999512, 272.38631982, 271.75180703, 271.10991915, 270.48587024), (318. , 302.73001099, 306. , 317.69000244, 7394100, 317.69000244, 300.35998535, 310.11999512, 271.75180703, 271.10991915, 270.48587024), (336.73999023, 317.75 , 321.72000122, 334.95999146, 7551200, 334.95999146, 317.69000244, 300.35998535, 310.11999512, 271.10991915, 270.48587024), (344.01000977, 327.01998901, 341.95999146, 335.3500061 , 7008500, 335.3500061 , 334.95999146, 317.69000244, 300.35998535, 310.11999512, 270.48587024)], dtype={'names':['High','Low','Open','Close','Volume','Adj_Close','Adj_Close_lag_1','Adj_Close_lag_2','Adj_Close_lag_3','Adj_Close_lag_4','Adj_Close_lag_5'], 'formats':['<f8','<f8','<f8','<f8','<i8','<f8','<f8','<f8','<f8','<f8','<f8'], 'offsets':[0,8,16,24,32,40,112,120,128,136,144], 'itemsize':152})
Fillna Value
pp.fillna(tsla_lagged,type="value",value=-999999)[:5]
array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, -9.99999000e+05, -9.99999000e+05, -9.99999000e+05, -9.99999000e+05, -999999.), (309.3999939 , 297.38000488, 307. , 300.35998535, 6965200, 300.35998535, 3.10119995e+02, -9.99999000e+05, -9.99999000e+05, -9.99999000e+05, -999999.), (318. , 302.73001099, 306. , 317.69000244, 7394100, 317.69000244, 3.00359985e+02, 3.10119995e+02, -9.99999000e+05, -9.99999000e+05, -999999.), (336.73999023, 317.75 , 321.72000122, 334.95999146, 7551200, 334.95999146, 3.17690002e+02, 3.00359985e+02, 3.10119995e+02, -9.99999000e+05, -999999.), (344.01000977, 327.01998901, 341.95999146, 335.3500061 , 7008500, 335.3500061 , 3.34959991e+02, 3.17690002e+02, 3.00359985e+02, 3.10119995e+02, -999999.)], dtype={'names':['High','Low','Open','Close','Volume','Adj_Close','Adj_Close_lag_1','Adj_Close_lag_2','Adj_Close_lag_3','Adj_Close_lag_4','Adj_Close_lag_5'], 'formats':['<f8','<f8','<f8','<f8','<i8','<f8','<f8','<f8','<f8','<f8','<f8'], 'offsets':[0,8,16,24,32,40,112,120,128,136,144], 'itemsize':152})
Fillna Forward Fill
pp.fillna(tsla_lagged,type="ffill")[:5]
array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, nan, nan, nan, nan, nan), (309.3999939 , 297.38000488, 307. , 300.35998535, 6965200, 300.35998535, 310.11999512, nan, nan, nan, nan), (318. , 302.73001099, 306. , 317.69000244, 7394100, 317.69000244, 300.35998535, 310.11999512, nan, nan, nan), (336.73999023, 317.75 , 321.72000122, 334.95999146, 7551200, 334.95999146, 317.69000244, 300.35998535, 310.11999512, nan, nan), (344.01000977, 327.01998901, 341.95999146, 335.3500061 , 7008500, 335.3500061 , 334.95999146, 317.69000244, 300.35998535, 310.11999512, nan)], dtype={'names':['High','Low','Open','Close','Volume','Adj_Close','Adj_Close_lag_1','Adj_Close_lag_2','Adj_Close_lag_3','Adj_Close_lag_4','Adj_Close_lag_5'], 'formats':['<f8','<f8','<f8','<f8','<i8','<f8','<f8','<f8','<f8','<f8','<f8'], 'offsets':[0,8,16,24,32,40,112,120,128,136,144], 'itemsize':152})
Fillna Backward Fill
pp.fillna(tsla_lagged,type="bfill")[:5]
array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, 310.11999512, 310.11999512, 310.11999512, 310.11999512, 310.11999512), (309.3999939 , 297.38000488, 307. , 300.35998535, 6965200, 300.35998535, 310.11999512, 310.11999512, 310.11999512, 310.11999512, 310.11999512), (318. , 302.73001099, 306. , 317.69000244, 7394100, 317.69000244, 300.35998535, 310.11999512, 310.11999512, 310.11999512, 310.11999512), (336.73999023, 317.75 , 321.72000122, 334.95999146, 7551200, 334.95999146, 317.69000244, 300.35998535, 310.11999512, 310.11999512, 310.11999512), (344.01000977, 327.01998901, 341.95999146, 335.3500061 , 7008500, 335.3500061 , 334.95999146, 317.69000244, 300.35998535, 310.11999512, 310.11999512)], dtype={'names':['High','Low','Open','Close','Volume','Adj_Close','Adj_Close_lag_1','Adj_Close_lag_2','Adj_Close_lag_3','Adj_Close_lag_4','Adj_Close_lag_5'], 'formats':['<f8','<f8','<f8','<f8','<i8','<f8','<f8','<f8','<f8','<f8','<f8'], 'offsets':[0,8,16,24,32,40,112,120,128,136,144], 'itemsize':152})
Print Table
pp.table(tsla_lagged,5)
High | Low | Open | Close | Volume | Adj_Close | Date | Ticker | Month | Year | Adj_Close_lag_1 | Adj_Close_lag_2 | Adj_Close_lag_3 | Adj_Close_lag_4 | Adj_Close_lag_5 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 315.130 | 298.800 | 306.100 | 310.120 | 11658600 | 310.120 | 2019-01-02 | TSLA | 2019-01-01 | 2019-01-01 | nan | nan | nan | nan | nan |
1 | 309.400 | 297.380 | 307.000 | 300.360 | 6965200 | 300.360 | 2019-01-03 | TSLA | 2019-01-01 | 2019-01-01 | 310.120 | nan | nan | nan | nan |
2 | 318.000 | 302.730 | 306.000 | 317.690 | 7394100 | 317.690 | 2019-01-04 | TSLA | 2019-01-01 | 2019-01-01 | 300.360 | 310.120 | nan | nan | nan |
3 | 336.740 | 317.750 | 321.720 | 334.960 | 7551200 | 334.960 | 2019-01-07 | TSLA | 2019-01-01 | 2019-01-01 | 317.690 | 300.360 | 310.120 | nan | nan |
4 | 344.010 | 327.020 | 341.960 | 335.350 | 7008500 | 335.350 | 2019-01-08 | TSLA | 2019-01-01 | 2019-01-01 | 334.960 | 317.690 | 300.360 | 310.120 | nan |
Outliers
signal = tsla_lagged["Volume"] z_signal = (signal - np.mean(signal)) / np.std(signal)
tsla_lagged = pp.add(tsla_lagged,"z_signal_volume",z_signal)
outliers = pp.detect(tsla_lagged["z_signal_volume"]); outliers
[12, 40, 42, 64, 78, 79, 84, 95, 97, 98, 107, 141, 205, 206, 207]
import matplotlib.pyplot as plt plt.figure(figsize=(15, 7)) plt.plot(np.arange(len(tsla_lagged["Volume"])), tsla_lagged["Volume"]) plt.plot(np.arange(len(tsla_lagged["Volume"])), tsla_lagged["Volume"], 'X', label='outliers',markevery=outliers, c='r') plt.legend() plt.show()
Remove Noise
price_signal = tsla_lagged["Close"] removed_signal = pp.removal(price_signal, 30) noise = pp.get(price_signal, removed_signal)
plt.figure(figsize=(15, 7)) plt.subplot(2, 1, 1) plt.plot(removed_signal) plt.title('timeseries without noise') plt.subplot(2, 1, 2) plt.plot(noise) plt.title('noise timeseries') plt.show()
以上所述就是小编给大家介绍的《PandaPy has the speed of NumPy and the usability of Pandas》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。