内容简介:A python module is simply a set of python operations, often functions, placed in a single file with aLet’s run through an example.In the below code I have read in the CSV file I will be working with using pandas.
A data science use case for modules
A python module is simply a set of python operations, often functions, placed in a single file with a .py
extension. This file can then imported into a Jupyter notebook, IPython shell or into another module for use in your project.
Let’s run through an example.
In the below code I have read in the CSV file I will be working with using pandas.
import pandas as pddata = pd.read_csv('adults_data.csv') data.head()
This dataset contains a lot of categorical features. If we were planning to use this to train a machine learning model we would first need to perform some pre-processing.
Having analysed this data I have determined that I will take the following steps to preprocess the data before training a model.
- One-hot encode the following columns: workclass, marital-status, relationship, race and gender.
- Take the most commonly occurring values, group remaining values as ‘others’ and one-hot encode the resulting feature. This will need to be performed for the following columns as they have a large number of unique values: education, occupation, native-country.
- Scale the remaining numerical values.
The code that we will need to write to perform these tasks will be quite large. Additionally, these are all tasks that we may want to perform more than once. To make our code more readable and to easily be able to re-use it we can write a series of functions into a separate file that can be imported for use in our notebook — a module.
Writing a module
To create a module you will need to first make a new blank text file and save it with the .py
extension. You can use an ordinary text editor for this but many people use an IDE (Integrated Development Environment). IDE’s provide a lot of additional functionality for writing code including tools for compiling code, debugging and integrating with Github. There are many different types of IDE’s available and it is worth experimenting with a few to find the one which works best for you. I personally prefer PyCharm
so I will be using this in the example.
To start writing the python module I am going to create a new python file.
I will name it preprocessing.py
.
Let’s write our first preprocessing function in this file and test importing and using it in a Jupyter Notebook.
I have written the following code at the top of the preprocessing.py
file. It is good practice to annotate the code to make it more readable. I have added some notes to the function in the code below.
def one_hot_encoder(df, column_list):
"""Takes in a dataframe and a list of columns
for pre-processing via one hot encoding"""
df_to_encode = df[column_list]
df = pd.get_dummies(df_to_encode)
return df
To import this module into a Jupyter Notebook we simply write the following.
import preprocessing as pr
IPython has a handy magic extension known as autoreload
. If you add the following code before the import then if you make any changes to the module file they will automatically be reflected in the notebook.
%load_ext autoreload %autoreload 2import preprocessing as pr
Let’s test using it to preprocess some data.
cols = ['workclass', 'marital-status', 'relationship', 'race', 'gender']one_hot_df = pr.one_hot_encoder(data, cols)
Now we will add the remaining preprocessing functions to our preprocessing.py
file.
def one_hot_encoder(df, column_list):
"""Takes in a dataframe and a list of columns
for pre-processing via one hot encoding returns
a dataframe of one hot encoded values"""
df_to_encode = df[column_list]
df = pd.get_dummies(df_to_encode)
return df
def reduce_uniques(df, column_threshold_dict):
"""Takes in a dataframe and a dictionary consisting
of column name : value count threshold returns the original
dataframe"""
for key, value in column_threshold_dict.items():
counts = df[key].value_counts()
others = set(counts[counts < value].index)
df[key] = df[key].replace(list(others), 'Others')
return df
def scale_data(df, column_list):
"""Takes in a dataframe and a list of column names to transform
returns a dataframe of scaled values"""
df_to_scale = df[column_list]
x = df_to_scale.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_to_scale = pd.DataFrame(x_scaled, columns=df_to_scale.columns)
return df_to_scale
If we go back to the notebook we can use all these functions to transform the data.
import pandas as pd from sklearn import preprocessing%load_ext autoreload %autoreload 2import preprocessing as prone_hot_list = ['workclass', 'marital-status', 'relationship', 'race', 'gender'] reduce_uniques_dict = {'education' : 1000,'occupation' : 3000, 'native-country' : 100} scale_data_list = data.select_dtypes(include=['int64', 'float64']).columnsone_hot_enc_df = pr.one_hot_encoder(data, one_hot_list) reduce_uniques_df = pr.reduce_uniques(data, reduce_uniques_dict) reduce_uniques_df = pr.one_hot_encoder(data, reduce_uniques_dict.keys()) scale_data_df = pr.scale_data(data, scale_data_list)final_data = pd.concat([one_hot_enc_df, reduce_uniques_df, scale_data_df], axis=1)final_data.dtypes
We now have an entirely numerical dataset which is suitable for training a machine learning model.
Packages
When working on a machine learning project it can often be desirable or sometimes necessary to create several related modules and package them so that they can be installed and used together.
For example, in my work, I am currently using a Google Cloud deployment solution for machine learning models called AI Platform . This tool requires that you package up preprocessing, training and prediction steps in the machine learning model to upload and install on the platform to deploy the final model.
A python package is a directory containing modules, files and subdirectories. The directory needs to contain a file called __init__.py
. This file indicates that the directory it is contained within should be treated as a package and specifies the modules and functions that should be imported.
We are going to create a package for all the steps in our preprocessing pipeline. The contents of the __init__.py file are as follows.
from .preprocessing import one_hot_encoder from .preprocessing import reduce_uniques from .preprocessing import scale_data from .makedata import preprocess_data
Modules within the same package can be imported for use within another module. We are going to add another module to our directory called makedata.py that uses the preprocessing.py module to execute the data transformations and then export the final dataset as a CSV file for later use.
import preprocessing as pr import pandas as pd def preprocess_data(df, one_hot_list, reduce_uniques_dict, scale_data_list, output_filename): one_hot_enc_df = pr.one_hot_encoder(data, one_hot_list) reduce_uniques_df = pr.reduce_uniques(data, reduce_uniques_dict) reduce_uniques_df = pr.one_hot_encoder(data, reduce_uniques_dict.keys()) scale_data_df = pr.scale_data(data, scale_data_list) final_data = pd.concat([one_hot_enc_df, reduce_uniques_df, scale_data_df], axis=1) final_data.to_csv(output_filename)
The new directory now looks like this.
Now we can go back to the Jupyter Notebook and use this package to execute all the preprocessing. Our code is now very simple and clean.
import pandas as pd%load_ext autoreload %autoreload 2 import preprocessing as prdata = pd.read_csv('adults_data.csv')one_hot_list = ['workclass', 'marital-status', 'relationship', 'race', 'gender'] reduce_uniques_dict = {'education' : 1000,'occupation' : 3000, 'native-country' : 100} scale_data_list = data.select_dtypes(include=['int64', 'float64']).columnspr.preprocess_data(data, one_hot_list, reduce_uniques_dict,scale_data_list, 'final_data.csv')
In our current working directory, there will now be a new CSV file called final_data.csv
which contains the preprocessed dataset. Let’s read this back in and inspect a few rows to ensure that our package has performed as expected.
data_ = pd.read_csv('final_data.csv') data.head()
以上所述就是小编给大家介绍的《A Data Scientists Guide to Python Modules and Packages》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
Complete Web Monitoring
Alistair Croll、Sean Power / O'Reilly Media / June 29, 2009 / GBP 39.99
Do you know the true value of your website to your organization? i??Complete Web Monitoringi?? shows you how to integrate several different views of your online business - including analytics, back-en......一起来看看 《Complete Web Monitoring》 这本书的介绍吧!