Data splitting technique to fit any Machine Learning Model

栏目: IT技术 · 发布时间: 4年前

内容简介:This aims to be a short 4-minute article to introduce you guys with Data splitting technique and its importance in practical projects.Ethically, it is suggested to divide your dataset into three parts to avoid overfitting and model selection bias called -T

This aims to be a short 4-minute article to introduce you guys with Data splitting technique and its importance in practical projects.

Ethically, it is suggested to divide your dataset into three parts to avoid overfitting and model selection bias called -

  1. Training set (Has to be the largest set)
  2. Cross-Validation set or Development set or Dev set
  3. Testing Set

The test set can be sometimes omitted too. It is meant to get an unbiased estimate of algorithms performance in the real world. People who divide their dataset into just two parts usually call their Dev set the Test set.

We try to build a model upon training set then try to optimize hyperparameters on the dev set as much as possible then after our model is ready, we try and evaluate the testing set.

# Training Set:

The sample of data used to fit the model, that is the actual subset of the dataset that we use to train the model (estimating the weights and biases in the case ofNeural Network). The model observes and learns from this data and optimize its parameters.

# Cross-Validation Set:

We select the appropriate model or the degree of the polynomial (if using regression model only) by minimizing the error on the cross-validation set.

# Test set:

The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset. It is only used once the model is completely trained using the training and validation sets. Therefore test set is the one used to replicate the type of situation that will be encountered once the model is deployed for real-time use.

The test set is generally what is used to evaluate different models in competitions of Kaggle or Analytics Vidhya . Generally in a Machine Learning hackathon, the cross-validation set is released along with the training set and the actual test set is only released when the competition is about to close, and it is the score of the model on the Test set that decides the winner.

# How to decide the ratio of splitting the dataset?


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

The Ruby Programming Language

The Ruby Programming Language

David Flanagan、Yukihiro Matsumoto / O'Reilly Media, Inc. / 2008 / USD 39.99

Ruby has gained some attention through the popular Ruby on Rails web development framework, but the language alone is worthy of more consideration -- a lot more. This book offers a definition explanat......一起来看看 《The Ruby Programming Language》 这本书的介绍吧!

HTML 压缩/解压工具
HTML 压缩/解压工具

在线压缩/解压 HTML 代码

SHA 加密
SHA 加密

SHA 加密工具

Markdown 在线编辑器
Markdown 在线编辑器

Markdown 在线编辑器