End-to-End Machine Learning Project Tutorial — Part 1

Category: IT Technology · Published: 4 years ago

The perennial question about Data Science that I come across:

What is the best way to master Data Science? What will get me hired?

My answer remains constant: there is no alternative to working on portfolio-worthy projects. Even after clearing the TensorFlow Developer Certificate Exam, I’d say that no certificate or course can substitute for them: you can only prove your competency with projects that showcase your research, programming skills, mathematical background, and so on.

In my post on how to build an effective Data Science portfolio, I shared many project ideas and other tips for preparing a kickass portfolio. This post is dedicated to one of those ideas: end-to-end data science/ML projects.

Agenda

This tutorial is intended to walk you through all the major steps involved in completing an End-to-End Machine Learning project. For this project, I’ve chosen a supervised learning regression problem.

Major topics covered:

  • Pre-requisites and Resources
  • Data Collection and Problem Statement
  • Exploratory Data Analysis with Pandas and NumPy
  • Data Preparation using Sklearn
  • Selecting and Training a few Machine Learning Models
  • Cross-Validation and Hyperparameter Tuning using Sklearn
  • Deploying the Final Trained Model on Heroku via a Flask App

Let’s start building…

Pre-requisites and Resources

This project and tutorial expect familiarity with Machine Learning algorithms, Python environment setup, and common ML terminologies. Here are a few resources to get you started:

That’s it, make sure you have an understanding of these concepts and tools and you’re ready to go!

Data Collection and Problem Statement

The first step is to get your hands on the data. If you already have access to data (as in most product-based companies), the first step is instead to define the problem that you want to solve. We don’t have the data yet, so we are going to collect it first.

We are using the Auto MPG dataset from the UCI Machine Learning Repository. Here is the link to the dataset:

The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

Once you have downloaded the data, move it to your project directory, activate your virtualenv, and start the local Jupyter server.

  • You can also download the data into your project from the notebook using wget:
!wget "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"

  • The next step is to load this .data file into a pandas DataFrame. For that, make sure you have pandas and the other general-purpose libraries installed, then import them:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
  • Reading and loading the file into a dataframe using read_csv() method:
  • Looking at a few rows of the dataframe and reading the description of each attribute from the website helps you define the problem statement.
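
A minimal sketch of the read_csv() call described above. The parsing options and the `Model Year` column name are assumptions (the other column names are the ones used later in this post), and `load_auto_mpg` is a helper name introduced here for illustration. Two rows in the file's layout are parsed from memory so the sketch runs without the download:

```python
import io
import pandas as pd

# Column names: most appear later in this tutorial; 'Model Year' is an assumed label.
COLS = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
        'Weight', 'Acceleration', 'Model Year', 'Origin']

def load_auto_mpg(source):
    """Read the space-separated Auto MPG file from a path or file-like object."""
    return pd.read_csv(
        source,
        names=COLS,             # the raw file has no header row
        na_values='?',          # missing horsepower is recorded as '?'
        comment='\t',           # drop the tab-separated car name ending each row
        sep=' ',
        skipinitialspace=True,  # columns are padded with several spaces
    )

# Two rows in the file's layout, used so the sketch runs without the download.
sample = io.StringIO(
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"\n'
)
sample_df = load_auto_mpg(sample)
print(sample_df)
```

With the downloaded file, `data = load_auto_mpg('./auto-mpg.data')` would produce the `data` DataFrame that the rest of this post works with.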

Problem Statement: The data contains the MPG (miles per gallon) variable, which is continuous and describes the fuel efficiency of vehicles from the 70s and 80s.

Our aim here is to predict the MPG value for a vehicle, given the other attributes of that vehicle.

Exploratory Data Analysis with Pandas and NumPy

For this rather simple dataset, the exploration is broken down into a series of steps:

1. Check the data type of each column
##checking the data info
data.info()
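
One reason to look at the dtypes first: a placeholder such as '?' for a missing value makes pandas read an otherwise numeric column as text unless the placeholder is declared in `na_values`. A small sketch on a two-row made-up CSV (not the real file):

```python
import io
import pandas as pd

# Made-up CSV with a '?' placeholder in a numeric column
raw = "MPG,Horsepower\n18.0,130.0\n25.0,?\n"

as_text = pd.read_csv(io.StringIO(raw))                   # '?' is kept as a string
as_number = pd.read_csv(io.StringIO(raw), na_values='?')  # '?' is parsed as NaN

print(as_text.dtypes)
print(as_number.dtypes)
```

If a column you expect to be numeric shows up with a non-numeric dtype in data.info(), a stray placeholder like this is a likely cause.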

2. Check for null values.

##checking for all the null values
data.isnull().sum()

The horsepower column has 6 missing values. We’ll have to study the column a bit more.
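
One way to study the column is to pull out the rows where it is missing. A small sketch on a hypothetical stand-in frame (the values are made up for illustration; on the real data the same mask would be applied to `data`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; values are illustrative only.
demo = pd.DataFrame({
    'MPG': [25.0, 21.0, 40.9, 18.0],
    'Cylinders': [4, 6, 4, 8],
    'Horsepower': [np.nan, 100.0, np.nan, 130.0],
})

# Boolean mask selecting the rows whose Horsepower is missing
missing_rows = demo[demo['Horsepower'].isnull()]
print(missing_rows)

# Share of rows affected, useful when deciding between dropping and imputing
missing_share = demo['Horsepower'].isnull().mean()
print(f"{missing_share:.0%} of rows lack Horsepower")
```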

3. Check for outliers in the Horsepower column

##summary statistics of quantitative variables
data.describe()

##looking at horsepower box plot
sns.boxplot(x=data['Horsepower'])

Since the column has a few outliers, the median is a safer fill value than the mean, so we impute the missing values with the column median, computed with the pandas median() method.

##imputing the values with median
median = data['Horsepower'].median()
data['Horsepower'] = data['Horsepower'].fillna(median)
data.info()
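
To see why the median is preferred when outliers are present, here is a small sketch on made-up numbers (not taken from the dataset): a single extreme value pulls the mean much further than the median.

```python
import numpy as np
import pandas as pd

# Made-up horsepower-like values with one extreme outlier and one missing entry
hp = pd.Series([90.0, 95.0, 100.0, 105.0, 400.0, np.nan])

mean_value = hp.mean()      # skips NaN; dragged upward by the 400.0 outlier
median_value = hp.median()  # skips NaN; middle of the sorted values

# Fill the missing entry with the median, as in the step above
filled = hp.fillna(median_value)

print(mean_value, median_value)
```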

4. Look for the category distribution in categorical columns

##category distribution
data["Cylinders"].value_counts() / len(data)
data['Origin'].value_counts()

The two categorical columns are Cylinders and Origin, which each have only a few categories. Looking at how the rows are spread across these categories tells us how the data is distributed.
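
Dividing the counts by `len(data)` gives each category's share of the rows. pandas can compute the same proportions directly with `value_counts(normalize=True)`. A sketch on a made-up column:

```python
import pandas as pd

# Made-up cylinder values, for illustration only
cyl = pd.Series([4, 4, 4, 4, 6, 6, 8, 8, 8, 3], name='Cylinders')

manual = cyl.value_counts() / len(cyl)       # the approach used above
built_in = cyl.value_counts(normalize=True)  # same proportions in one call

print(built_in)
```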

5. Plot for correlation

##pairplots to get an intuition of potential correlations
sns.pairplot(data[["MPG", "Cylinders", "Displacement", "Weight", "Horsepower"]], diag_kind="kde")

The pair plot gives you a brief overview of how each variable behaves with respect to every other variable.

For example, the MPG column (our target variable) is negatively correlated with the Displacement, Weight, and Horsepower features.
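
The visual impression from a pair plot can be checked numerically with Pearson correlation coefficients. A sketch on synthetic data (generated here so that heavier cars get lower MPG; it is not the Auto MPG data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: MPG falls as weight rises, plus some noise
weight = rng.uniform(1500, 5000, size=200)
mpg = 50 - 0.008 * weight + rng.normal(0, 2, size=200)
demo = pd.DataFrame({'MPG': mpg, 'Weight': weight})

# Pearson correlation of every column with the target, strongest positive first
corr_with_mpg = demo.corr()['MPG'].sort_values(ascending=False)
print(corr_with_mpg)
```

A coefficient close to -1 corresponds to the downward-sloping scatter that the pair plot shows for such a feature.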

6. Set aside the test data set

This is one of the first things we should do as we want to test our final model on unseen/unbiased data.

There are many ways to split the data into training and testing sets, but we want our test set to represent the overall population and not just a few specific categories. Thus, instead of using the simple and common train_test_split() method from sklearn, we use stratified sampling.

Stratified Sampling: We create homogeneous subgroups called strata from the overall population and sample the right number of instances from each stratum to ensure that the test set is representative of the overall population.

In step 4, we saw how the data is distributed over each category of the Cylinders column. We’re using the Cylinders column to create the strata:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["Cylinders"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

Checking for the distribution in training set:

##checking for cylinder category distribution in training set
strat_train_set['Cylinders'].value_counts() / len(strat_train_set)

Testing set:

strat_test_set["Cylinders"].value_counts() / len(strat_test_set)

You can compare these results with the output of train_test_split() to find out which one produces better splits.
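
One way to make that comparison is to measure how far each test set's category proportions drift from those of the full data. A sketch on a synthetic `Cylinders` column (the category shares are made up, and `proportion_gap` is a helper name introduced here):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in with an imbalanced categorical column
demo = pd.DataFrame({
    'Cylinders': rng.choice([4, 6, 8], size=400, p=[0.5, 0.2, 0.3]),
    'MPG': rng.normal(23, 7, size=400),
})

def proportion_gap(full, subset, col='Cylinders'):
    """Largest absolute difference in category share between subset and full data."""
    full_share = full[col].value_counts(normalize=True)
    sub_share = (subset[col].value_counts(normalize=True)
                 .reindex(full_share.index, fill_value=0))
    return (full_share - sub_share).abs().max()

# Purely random split
_, random_test = train_test_split(demo, test_size=0.2, random_state=42)

# Stratified split on the categorical column (positional indices, hence iloc)
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(demo, demo['Cylinders']))
strat_test = demo.iloc[test_idx]

print("random gap:    ", proportion_gap(demo, random_test))
print("stratified gap:", proportion_gap(demo, strat_test))
```

A smaller gap means the test set mirrors the full data more closely. Note that train_test_split() also accepts a `stratify` argument, which is another route to a stratified split.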

7. Checking the Origin Column

The Origin column describes the origin of the vehicle and has discrete values that look like country codes.

To add some complication and make it more explicit, I converted these numbers to strings:

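##converting integer classes to countries in Origin column
train_set = strat_train_set.copy()  ##assumption: train_set is a copy of the stratified training set from step 6
train_set['Origin'] = train_set['Origin'].map({1: 'India', 2: 'USA', 3 : 'Germany'})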
train_set.sample(10)

We’ll have to preprocess this categorical column by one-hot encoding these values:

##one hot encoding
train_set = pd.get_dummies(train_set, prefix='', prefix_sep='')
train_set.head()
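
To see what this call produces, here is a sketch on a tiny made-up frame. With an empty prefix and separator, each category value becomes its own column name:

```python
import pandas as pd

# Tiny made-up frame using the country labels from the mapping above
demo = pd.DataFrame({
    'MPG': [18.0, 30.5, 26.0],
    'Origin': ['India', 'USA', 'Germany'],
})

# Only the string column is encoded; numeric columns pass through unchanged
encoded = pd.get_dummies(demo, prefix='', prefix_sep='')
print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator column switched on, the one matching its original Origin value.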

8. Testing for new variables — Analyze the correlation of each variable with the target variable

## testing new variables by checking their correlation w.r.t. MPG
data['displacement_on_power'] = data['Displacement'] / data['Horsepower']
data['weight_on_cylinder'] = data['Weight'] / data['Cylinders']
data['acceleration_on_power'] = data['Acceleration'] / data['Horsepower']
data['acceleration_on_cyl'] = data['Acceleration'] / data['Cylinders']

corr_matrix = data.corr()
corr_matrix['MPG'].sort_values(ascending=False)

We found that two of the new variables, acceleration_on_power and acceleration_on_cyl, turned out to be more positively correlated with MPG than the original variables.

This brings us to the end of the Exploratory Analysis. We are ready to proceed to our next step: preparing the data for Machine Learning.

