Data splitting technique to fit any Machine Learning Model

Category: IT Technology · Published: 5 years ago


This aims to be a short 4-minute article introducing you to the data splitting technique and its importance in practical projects.

Ideally, you should divide your dataset into three parts to avoid overfitting and model selection bias:

Training set (has to be the largest set)
Cross-validation set (also called the development set or dev set)
Test set

The test set can sometimes be omitted. It is meant to give an unbiased estimate of the algorithm's performance in the real world. People who divide their dataset into just two parts usually call their dev set the test set.

We build a model on the training set, then tune its hyperparameters on the dev set as far as possible. Once the model is ready, we evaluate it on the test set.
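As a minimal sketch of the three-way split described above, the snippet below shuffles a dataset and cuts it into training, dev, and test portions. The function name `split_dataset` and the 60/20/20 default ratio are assumptions chosen for illustration, not values prescribed by this article.

```python
import random

def split_dataset(samples, train_frac=0.6, dev_frac=0.2, seed=0):
    """Shuffle samples and split them into train, dev, and test lists.

    The 60/20/20 default is an illustrative assumption only.
    """
    if not 0 < train_frac < 1 or not 0 <= dev_frac < 1 or train_frac + dev_frac > 1:
        raise ValueError("fractions must be in range and sum to at most 1")
    shuffled = list(samples)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # whatever remains goes to the test set
    return train, dev, test

data = list(range(100))  # stand-in for 100 labelled examples
train, dev, test = split_dataset(data)
print(len(train), len(dev), len(test))
```

Shuffling before cutting keeps each part representative when the original data is ordered. In practice, a library helper such as scikit-learn's `train_test_split` is commonly applied twice to get the same three parts.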

# Training Set:

The sample of data used to fit the model, that is, the actual subset of the dataset that we use to train the model (estimating the weights and biases in the case of a neural network). The model observes and learns from this data and optimizes its parameters.
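To make "estimating the weights and biases" concrete, here is a small sketch that fits a single weight and bias to a training set by ordinary least squares. It uses a one-feature linear model rather than a neural network, and the toy data (generated from y = 2x + 1) is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimate of weight w and bias b for y ≈ w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

# Hypothetical training set generated from y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]

# Only the training set is used to estimate the parameters.
w, b = fit_line(train_x, train_y)
print(w, b)
```

The dev and test sets play no part in this step; they come in only when the fitted model is compared with alternatives or evaluated.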

# Cross-Validation Set:

We select the appropriate model, or the degree of the polynomial (if a regression model is being used), by minimizing the error on the cross-validation set.
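The selection step can be sketched as follows, assuming NumPy is available. The data is synthetic (a noisy cubic invented for this example), and the candidate degrees 1 to 8 and the 120/80 split are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy cubic.
x = rng.uniform(-3, 3, size=200)
y = 0.5 * x**3 - x + rng.normal(scale=1.0, size=x.size)

# First 120 points for training, the remaining 80 as the cross-validation set.
x_train, y_train = x[:120], y[:120]
x_cv, y_cv = x[120:], y[120:]

cv_errors = {}
for degree in range(1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # fit on training data only
    predictions = np.polyval(coeffs, x_cv)         # predict on held-out data
    cv_errors[degree] = float(np.mean((predictions - y_cv) ** 2))

# Keep the degree with the lowest mean squared error on the cross-validation set.
best_degree = min(cv_errors, key=cv_errors.get)
print(best_degree, cv_errors[best_degree])
```

Each candidate is fitted on the training points alone, so the cross-validation error measures how well that degree generalizes rather than how well it memorizes.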

# Test Set:

The sample of data used to provide an unbiased evaluation of the final model fit on the training dataset. It is used only once the model has been completely trained using the training and validation sets. The test set is therefore the one used to replicate the type of situation that will be encountered once the model is deployed for real-world use.
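One way to see why a separate test set gives a less biased estimate is the toy simulation below. It is a contrived sketch: the labels are coin flips and the 200 "models" are hypothetical random guessers, so no candidate can genuinely do better than 50% accuracy.

```python
import random

rng = random.Random(0)

def random_labels(n):
    """Return n random 0/1 values."""
    return [rng.randint(0, 1) for _ in range(n)]

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Labels are coin flips, so the true accuracy of any model is 50%.
dev_labels = random_labels(100)
test_labels = random_labels(100)

# 200 hypothetical "models" that only guess: (dev predictions, test predictions).
candidates = [(random_labels(100), random_labels(100)) for _ in range(200)]

# Model selection: keep the candidate with the best dev-set accuracy.
best_dev_preds, best_test_preds = max(
    candidates, key=lambda c: accuracy(c[0], dev_labels)
)

dev_acc = accuracy(best_dev_preds, dev_labels)
test_acc = accuracy(best_test_preds, test_labels)
print(dev_acc, test_acc)
```

Because the winner was chosen for its dev-set score, that score is typically noticeably above 0.5 even though the model is only guessing, while its score on the untouched test set tends to stay close to 0.5.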

The test set is generally what is used to evaluate different models in competitions on Kaggle or Analytics Vidhya. In a machine learning hackathon, the cross-validation set is usually released along with the training set, and the actual test set is released only when the competition is about to close. It is the model's score on the test set that decides the winner.

# How to decide the ratio of splitting the dataset?
