The 4 steps necessary before fitting a machine learning model
A plain, object-oriented approach to data processing.
Mar 6 · 5 min read
There are many steps in a common machine learning pipeline, and much thought goes into architecting it: problem definition, data acquisition, error detection and data cleaning, etc. In this story, we begin with the assumption that we have a clean and ready-to-go dataset.
With that in mind, we outline the four steps necessary before fitting any machine learning model. We then implement those steps in PyTorch, using a common syntax for invoking multiple method calls: method chaining. The goal is to define a simple yet generalizable API that transforms any raw dataset into a format that is ready to be consumed by a machine learning model.
To this end, we will use the builder pattern, which constructs a complex object using a step-by-step approach.
The builder pattern is a design pattern which provides a flexible solution to object-creation problems in object-oriented programming. Its aim is to separate the construction of a complex object from its representation.
So, what are those four steps? In its simplest case, processing data before modelling includes four distinct actions:
- Load the data
- Split into train/valid/test sets
- Label the data tuples
- Obtain batches of data
In the following sections, I analyze those four steps one by one and implement them in code. Our goal is to finally create a PyTorch DataLoader, an abstraction PyTorch uses to represent an iterable over a dataset. Having a DataLoader is the first step in setting up the training loop. So, without further ado, let us get our hands dirty.
Loading the data
For this example, we use a mock dataset that is kept in a pandas DataFrame. Our goal is to create one PyTorch DataLoader for the training set and one for the validation set. Thus, let us build a class named DataLoaderBuilder that is responsible for building those loaders.
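A minimal sketch of what such a class could look like; the exact constructor signature is an assumption:

```python
import torch

class DataLoaderBuilder:
    """Builds PyTorch DataLoaders for the training and validation sets."""

    def __init__(self, data: torch.Tensor):
        # Store the raw data; later steps split, label and batch it.
        self.data = data
```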
We see that the only operation of the DataLoaderBuilder is to store a data variable, whose type is a torch.Tensor. So now, we need a way to initialize it from a pandas DataFrame. For that, we use a Python classmethod.
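A sketch of such a classmethod; the name from_pandas and the optional columns parameter are assumptions:

```python
import pandas as pd
import torch

class DataLoaderBuilder:
    def __init__(self, data: torch.Tensor):
        self.data = data

    @classmethod
    def from_pandas(cls, df: pd.DataFrame, columns=None):
        # Optionally keep only the columns we care about.
        if columns is not None:
            df = df[columns]
        # Turn the DataFrame into a float tensor and instantiate
        # the builder class, which arrives here as the `cls` argument.
        return cls(torch.tensor(df.values, dtype=torch.float))
```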
A classmethod is a plain Python method, but instead of receiving self as its first argument, it receives the class itself. Thus, given a pandas DataFrame, we turn the DataFrame into a PyTorch tensor and instantiate the DataLoaderBuilder class, which is passed to the method as the cls argument. Optionally, we can keep only the columns of the DataFrame we care about. After defining the method, we patch it onto the main DataLoaderBuilder class.
Splitting into Training & Validation
For this example, we split the dataset into two sets: training and validation. It is easy to extend the code and split it into three sets: training, validation and testing.
We want to split the dataset randomly, keeping some percentage of the data for training and setting aside what is left for validation. To this end, we use PyTorch's SubsetRandomSampler. You can read more about this sampler and many other sampling methods in the official PyTorch documentation.
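A sketch of the splitting step; the method name split and the pct and axis parameter names are assumptions:

```python
import torch
from torch.utils.data import SubsetRandomSampler

class DataLoaderBuilder:
    def __init__(self, data: torch.Tensor):
        self.data = data

    def split(self, pct=0.9, axis=0):
        # Draw every index along `axis` in a random order.
        n = self.data.shape[axis]
        idx = list(SubsetRandomSampler(range(n)))
        # Keep the first `pct` of the shuffled indices for training.
        cut = int(n * pct)
        self.train_data = self.data.index_select(axis, torch.tensor(idx[:cut]))
        self.valid_data = self.data.index_select(axis, torch.tensor(idx[cut:]))
        return self  # return the builder itself to enable method chaining
```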
By default, we keep 90% of the data for training, and we split across rows (axis=0). Another detail in the code is that we return self. Thus, after creating the train_data and valid_data splits, we return the whole builder object. This will permit us to use method chaining in the end.
Label the Dataset
Next, we should label the dataset. Most of the time, we use some feature variables to predict a dependent variable (i.e. the target). That is, of course, called supervised learning. The label_by_func method annotates the dataset according to a given function. After this call, the dataset is usually converted into (features, target) tuples.
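A sketch of the labelling step; the constructor here treats the whole tensor as both splits purely for brevity, since the real splits come from the earlier step:

```python
import torch

class DataLoaderBuilder:
    def __init__(self, data: torch.Tensor):
        self.data = data
        # Normally filled in by the split step; set directly here for brevity.
        self.train_data = data
        self.valid_data = data

    def label_by_func(self, func):
        # Apply the user-supplied labelling function to each split,
        # typically turning it into a (features, target) tuple.
        self.train_data = func(self.train_data)
        self.valid_data = func(self.valid_data)
        return self  # keep the method chain going
```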
We see that the label_by_func method accepts a function as an argument and applies it to the train and valid sets. Our job is to design a function that serves our purposes any time we want to label a dataset of some form. Later, in the “putting it all together” example, we show how simple it is to create such a function.
Create Batches
Finally, only one step is left: breaking the dataset into batches. For this, we can leverage PyTorch's TensorDataset and DataLoader classes.
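A sketch of the final build step; the constructor signature here is simplified so the method stands alone:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class DataLoaderBuilder:
    def __init__(self, train_data, valid_data):
        # By this point each split is a labelled (features, target) tuple.
        self.train_data = train_data
        self.valid_data = valid_data

    def build(self, batch_size=64):
        # self.train_data[0] keeps the features, self.train_data[1] the target.
        train_ds = TensorDataset(self.train_data[0], self.train_data[1])
        valid_ds = TensorDataset(self.valid_data[0], self.valid_data[1])
        # Wrap each dataset in a DataLoader with a known batch size.
        train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
        valid_dl = DataLoader(valid_ds, batch_size=batch_size)
        return train_dl, valid_dl
```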
This is the last method in the chain, thus we name it “build”. It creates the train and valid datasets, and having them, it is easy to instantiate the corresponding PyTorch DataLoader with a known batch size. Keep in mind that we have now labelled the data; thus, self.train_data is a tuple of a features and a target variable. Consequently, self.train_data[0] keeps the features and self.train_data[1] holds the target.
Having that in place, let us put it all together with a simple example.
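The whole pipeline, condensed into one self-contained sketch; method names, column names, and get_label are illustrative:

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

class DataLoaderBuilder:
    def __init__(self, data):
        self.data = data

    @classmethod
    def from_pandas(cls, df, columns=None):
        if columns is not None:
            df = df[columns]
        return cls(torch.tensor(df.values, dtype=torch.float))

    def split(self, pct=0.9):
        # Shuffle row indices and keep the first `pct` for training.
        idx = list(SubsetRandomSampler(range(len(self.data))))
        cut = int(len(idx) * pct)
        self.train_data = self.data[idx[:cut]]
        self.valid_data = self.data[idx[cut:]]
        return self

    def label_by_func(self, func):
        self.train_data = func(self.train_data)
        self.valid_data = func(self.valid_data)
        return self

    def build(self, batch_size=4):
        train_ds = TensorDataset(*self.train_data)
        valid_ds = TensorDataset(*self.valid_data)
        return (DataLoader(train_ds, batch_size=batch_size, shuffle=True),
                DataLoader(valid_ds, batch_size=batch_size))

# A dummy dataset of three columns; the last one stores the target.
df = pd.DataFrame({"x1": range(10), "x2": range(10, 20), "y": range(20, 30)})

def get_label(data):
    # Pull the last column out as the target.
    return data[:, :-1], data[:, -1]

train_dl, valid_dl = (DataLoaderBuilder.from_pandas(df)
                      .split(pct=0.8)
                      .label_by_func(get_label)
                      .build(batch_size=4))
```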
In this example, we create a dummy dataset of three columns, where the last column stores the target or dependent variable. We then define a get_label function that pulls out the last column and creates a features-target tuple. Finally, using method chaining, we can easily create the data loaders we need from a given pandas DataFrame.
Conclusion
In this story, we saw the four necessary steps of data processing before fitting any model, assuming that the dataset is clean. Although this is a toy example, it can be used and extended to cover a wide variety of machine learning problems.
Also, there are steps that are not covered in this article (e.g. data normalization or augmentation for computer vision), but the goal of the story is to provide a general idea of how to structure code that solves a relevant problem.
My name is Dimitris Poulopoulos and I’m a machine learning researcher at BigDataStack and PhD(c) at the University of Piraeus, Greece. I have worked on designing and implementing AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA. If you are interested in reading more posts about Machine Learning, Deep Learning and Data Science, follow me on Medium , LinkedIn or @james2pl on twitter.