The perpetual question about Data Science that I come across:
What is the best way to master Data Science? What will get me hired?
My answer remains constant: there is no alternative to working on portfolio-worthy projects. Even after clearing the TensorFlow Developer Certificate Exam, I'd say that no certificate or course can do it for you; you can only prove your competency with projects that showcase your research, programming skills, mathematical background, etc.
In my post, how to build an effective Data Science Portfolio, I shared many project ideas and other tips for preparing a kickass portfolio. This post is dedicated to one of those ideas: end-to-end data science/ML projects.
Agenda
This tutorial is intended to walk you through all the major steps involved in completing an End-to-End Machine Learning project. For this project, I’ve chosen a supervised learning regression problem.
Major topics covered:
- Pre-requisites and Resources
- Data Collection and Problem Statement
- Exploratory Data Analysis with Pandas and NumPy
- Data Preparation using Sklearn
- Selecting and Training a few Machine Learning Models
- Cross-Validation and Hyperparameter Tuning using Sklearn
- Deploying the Final Trained Model on Heroku via a Flask App
Let’s start building…
Pre-requisites and Resources
This project and tutorial expect familiarity with Machine Learning algorithms, Python environment setup, and common ML terminologies. Here are a few resources to get you started:
- Read the first 2–3 chapters of The hundred page ML book: http://themlbook.com/wiki/doku.php
- List of Tasks for almost every Machine Learning Project — keep referring to this list while working on this (or any other) ML project.
- You need a Python Environment set up — a virtual environment dedicated to this project.
- Familiarity with Jupyter Notebook.
That’s it, make sure you have an understanding of these concepts and tools and you’re ready to go!
Data Collection and Problem Statement
The first step is to get your hands on the data. If you already have access to data (as in most product-based companies), then the first step is to define the problem that you want to solve. We don't have the data yet, so we are going to collect it first.
We are using the Auto MPG dataset from the UCI Machine Learning Repository (the direct download URL is shown in the wget command below).
The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.
Once you have downloaded the data, move it to your project directory, activate your virtualenv, and start the local Jupyter server.
- You can also download the data into your project directly from the notebook using wget:

!wget "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
- The next step is to load this .data file into a pandas dataframe. For that, make sure you have pandas and other general use case libraries installed. Import all the general use case libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
- Reading and loading the file into a dataframe using the read_csv() method.
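A minimal sketch of that load is shown below. The column names and parsing options (whitespace separator, '?' as the missing-value marker, and comment='\t' to drop the trailing car-name field) are assumptions based on the dataset's documentation on the UCI page, so adjust them if your copy of the file differs:

##defining the column names in the order described on the UCI page (assumed)
cols = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
        'Acceleration', 'Model Year', 'Origin']

##reading the whitespace-separated file; '?' marks missing horsepower values,
##and the trailing car-name field (after a tab) is dropped via comment='\t'
data = pd.read_csv('./auto-mpg.data', names=cols, na_values='?',
                   comment='\t', sep=' ', skipinitialspace=True)
data.head()

With this, the '?' entries in horsepower are parsed as NaN, which is exactly what the null-value check in the next section looks for.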
- Looking at a few rows of the dataframe and reading the description of each attribute from the website helps you define the problem statement.
Problem Statement — The data contains the MPG (Miles Per Gallon) variable, which is continuous and tells us about the fuel-consumption efficiency of a vehicle in the 70s and 80s.
Our aim here is to predict the MPG value for a vehicle, given the other attributes of that vehicle.
Exploratory Data Analysis with Pandas and NumPy
For this rather simple dataset, the exploration is broken down into a series of steps:
1. Check the data type of each column

##checking the data info
data.info()
2. Check for null values.

##checking for all the null values
data.isnull().sum()
The horsepower column has 6 missing values. We’ll have to study the column a bit more.
3. Check for outliers in the horsepower column

##summary statistics of quantitative variables
data.describe()

##looking at horsepower box plot
sns.boxplot(x=data['Horsepower'])
Since there are a few outliers, we can use the median of the column to impute the missing values, using the pandas median() method.
##imputing the values with median
median = data['Horsepower'].median()
data['Horsepower'] = data['Horsepower'].fillna(median)
data.info()
4. Look at the category distribution in the categorical columns

##category distribution
data["Cylinders"].value_counts() / len(data)

data['Origin'].value_counts()
The two categorical columns are Cylinders and Origin, which have only a few distinct categories. Looking at the distribution of values among these categories tells us how the data is distributed.
5. Plot for correlations

##pairplots to get an intuition of potential correlations
sns.pairplot(data[["MPG", "Cylinders", "Displacement", "Weight", "Horsepower"]], diag_kind="kde")
The pair plot gives you a brief overview of how each variable behaves with respect to every other variable.
For example, the MPG column (our target variable) is negatively correlated with the Displacement, Weight, and Horsepower features.
6. Set aside the test data set
This is one of the first things we should do as we want to test our final model on unseen/unbiased data.
There are many ways to split the data into training and testing sets, but we want our test set to represent the overall population and not just a few specific categories. Thus, instead of using the simple and common train_test_split() method from sklearn, we use stratified sampling.
Stratified Sampling — We create homogeneous subgroups called strata from the overall population and sample the right number of instances from each stratum to ensure that the test set is representative of the overall population.
In task 4, we saw how the data is distributed over each category of the Cylinder column. We’re using the Cylinder column to create the strata:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["Cylinders"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
Checking for the distribution in the training set:

##checking for cylinder category distribution in training set
strat_train_set['Cylinders'].value_counts() / len(strat_train_set)
Testing set:
strat_test_set["Cylinders"].value_counts() / len(strat_test_set)
You can compare these results with the output of train_test_split() to find out which one produces better splits.
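If you want to see that comparison, here is a rough sketch of what it could look like (the random_state value is an arbitrary choice, not something prescribed by the stratified split above):

##a purely random split for comparison
from sklearn.model_selection import train_test_split

random_train_set, random_test_set = train_test_split(data, test_size=0.2, random_state=42)

##cylinder category proportions in the random test set vs. the stratified one
print(random_test_set["Cylinders"].value_counts() / len(random_test_set))
print(strat_test_set["Cylinders"].value_counts() / len(strat_test_set))

The closer a split's proportions are to the full dataset's distribution from task 4, the more representative the test set is.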
7. Checking the Origin Column
The Origin column contains information about the origin of the vehicle and has discrete values that look like country codes.
To add some complication and make it more explicit, I converted these numbers to strings:
##working on a copy of the stratified training set
train_set = strat_train_set.copy()

##converting integer classes to countries in Origin column
train_set['Origin'] = train_set['Origin'].map({1: 'India', 2: 'USA', 3: 'Germany'})
train_set.sample(10)
We’ll have to preprocess this categorical column by one-hot encoding these values:
##one hot encoding
train_set = pd.get_dummies(train_set, prefix='', prefix_sep='')
train_set.head()
8. Testing for new variables — Analyze the correlation of each variable with the target variable
## testing new variables by checking their correlation w.r.t. MPG
data['displacement_on_power'] = data['Displacement'] / data['Horsepower']
data['weight_on_cylinder'] = data['Weight'] / data['Cylinders']
data['acceleration_on_power'] = data['Acceleration'] / data['Horsepower']
data['acceleration_on_cyl'] = data['Acceleration'] / data['Cylinders']

corr_matrix = data.corr()
corr_matrix['MPG'].sort_values(ascending=False)
We found acceleration_on_power and acceleration_on_cyl to be two new variables that turned out to be more positively correlated with MPG than the original variables.
This brings us to the end of the Exploratory Analysis. We are ready to proceed to our next step: preparing the data for our Machine Learning models.