Data splitting technique to fit any Machine Learning Model

Category: IT Technology · Published: 5 years ago

This is intended as a short, four-minute article introducing the data splitting technique and its importance in practical projects.

As a matter of good practice, it is recommended to divide your dataset into three parts to avoid overfitting and model selection bias:

  1. Training set (has to be the largest set)
  2. Cross-validation set, also called the development set or dev set
  3. Test set

The test set can sometimes be omitted. Its purpose is to give an unbiased estimate of the algorithm's performance in the real world. People who divide their dataset into just two parts usually call their dev set the test set.

We build a model on the training set, then tune its hyperparameters on the dev set as far as possible. Once the model is ready, we evaluate it on the test set.
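The three-way split described above can be sketched in plain Python as follows. This is a minimal illustration, not a prescribed method: the function name `train_dev_test_split` and the 60/20/20 fractions are assumptions chosen only for the example, and the shuffle assumes the samples are independent (it would not be appropriate for time-ordered data).

```python
import random

def train_dev_test_split(data, dev_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle `data` and split it into (train, dev, test) lists.

    The default fractions are placeholders for illustration only.
    """
    if dev_frac < 0 or test_frac < 0 or dev_frac + test_frac >= 1:
        raise ValueError("dev_frac and test_frac must be >= 0 and sum to < 1")
    items = list(data)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_dev = round(len(items) * dev_frac)
    n_test = round(len(items) * test_frac)
    n_train = len(items) - n_dev - n_test  # the training set gets the rest
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test

# Usage: split 100 hypothetical samples.
train, dev, test = train_dev_test_split(range(100))
print(len(train), len(dev), len(test))
```

Fixing the seed keeps the three sets identical across runs, so dev-set scores from different experiments remain comparable.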

# Training Set:

The sample of data used to fit the model, that is, the actual subset of the dataset that we use to train the model (estimating the weights and biases in the case of a neural network). The model observes and learns from this data and optimizes its parameters.
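To make "estimating the weights and biases" concrete, here is a small sketch that fits the simplest possible model, a line with one weight and one bias, to a training set using closed-form least squares. A neural network would estimate many more parameters, usually by gradient descent; the linear model is swapped in here only to keep the example short. The function name `fit_line` and the toy data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Estimate weight w and bias b of y ≈ w*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

# Hypothetical training set generated from y = 2x + 1, without noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [2 * x + 1 for x in train_x]

w, b = fit_line(train_x, train_y)
print(w, b)  # the fit recovers the weight 2 and the bias 1
```

Only the training samples are passed to `fit_line`; the dev and test sets play no part in estimating the parameters.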

# Cross-Validation Set:

We select the appropriate model, or the degree of the polynomial (in the case of polynomial regression), by minimizing the error on the cross-validation set.
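As a rough sketch of this selection step, the example below fits polynomials of several degrees on a training set and keeps the degree with the lowest mean squared error on the cross-validation set. Everything in it is an assumption made for illustration: the synthetic quadratic data, the noise level, the candidate degrees 1 to 6, and the helper names `fit_polynomial`, `mse`, and `make_data`. The fit uses the normal equations in plain Python; in practice a numerical library would normally do this.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ..., c_degree] of c0 + c1*x + c2*x^2 + ...
    """
    m = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    rows = [[sum(x ** (i + j) for x in xs) for j in range(m)]
            + [sum(y * x ** i for x, y in zip(xs, ys))]
            for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(m):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * p for a, p in zip(rows[r], rows[col])]
    return [rows[i][m] / rows[i][i] for i in range(m)]

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial `coeffs` on (xs, ys)."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

rng = random.Random(0)

def make_data(n):
    """Synthetic samples from y = 1 - 2x + 3x^2 plus Gaussian noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [1 - 2 * x + 3 * x ** 2 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

train_x, train_y = make_data(60)  # used to fit the coefficients
dev_x, dev_y = make_data(20)      # used only to compare degrees

dev_errors = {}
for degree in range(1, 7):
    coeffs = fit_polynomial(train_x, train_y, degree)
    dev_errors[degree] = mse(coeffs, dev_x, dev_y)

best_degree = min(dev_errors, key=dev_errors.get)
print("dev-set MSE per degree:", dev_errors)
print("selected degree:", best_degree)
```

Because the degree is chosen by looking at the dev set, the dev-set error of the selected model is an optimistic estimate, which is why a separate test set is held back for the final evaluation.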

# Test set:

The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset. It is used only once the model has been completely trained using the training and validation sets. The test set is therefore the one used to replicate the type of situation that will be encountered once the model is deployed for real-time use.

The test set is generally what is used to evaluate different models in competitions on Kaggle or Analytics Vidhya. Generally, in a machine learning hackathon, the cross-validation set is released along with the training set, and the actual test set is released only when the competition is about to close. It is the model's score on the test set that decides the winner.

# How to decide the ratio of splitting the dataset?

