Data splitting technique to fit any Machine Learning Model

Category: IT Technology · Published: 5 years ago


This is a short, 4-minute article that introduces the data splitting technique and explains why it matters in practical projects.

As a matter of good practice, it is recommended to divide your dataset into three parts to avoid overfitting and model selection bias:

Training set (should be the largest of the three)
Cross-validation set, also called the development set or dev set
Test set

The test set is sometimes omitted. Its purpose is to give an unbiased estimate of the algorithm's performance in the real world. People who divide their dataset into just two parts usually call their dev set the test set.

We build a model on the training set, then tune its hyperparameters on the dev set as far as possible. Once the model is ready, we evaluate it on the test set.
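The split itself can be sketched in a few lines of plain Python. This is only a minimal sketch: the function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration, not values the article prescribes.

```python
import random

def split_dataset(samples, train_frac=0.6, dev_frac=0.2, seed=0):
    """Shuffle samples and split them into train, dev, and test lists.

    The fractions are illustrative defaults (an assumption), not a rule.
    """
    if train_frac <= 0 or dev_frac < 0 or train_frac + dev_frac >= 1:
        raise ValueError("fractions must leave room for a non-empty test set")

    shuffled = list(samples)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split

    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)

    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # everything that remains
    return train, dev, test

# Hypothetical dataset: 100 integer "samples".
data = list(range(100))
train, dev, test = split_dataset(data)
print(len(train), len(dev), len(test))
```

Shuffling before slicing matters when the data is stored in some meaningful order (by date or by class, for example); without it, the three parts could end up with different distributions.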

# Training Set:

The sample of data used to fit the model, that is, the actual subset of the dataset that we use to train the model (estimating the weights and biases in the case of a Neural Network). The model observes and learns from this data and optimizes its parameters.
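As a small illustration of what "fitting" means, the sketch below estimates a single weight and bias for a straight line using only training data. It swaps the neural network for ordinary least squares on one feature to stay short; the function name `fit_line` and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Estimate weight w and bias b of y = w*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

# Toy training set generated from y = 2x + 1 (hypothetical, noise-free data).
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(train_x, train_y)
print(w, b)
```

Only the training samples enter `fit_line`; the dev and test samples play no part in estimating the parameters.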

# Cross-Validation Set:

We select the appropriate model, or the degree of the polynomial (in the case of a polynomial regression model), by minimizing the error on the cross-validation set.
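Choosing a polynomial degree this way might look like the following sketch. It assumes NumPy is available and uses synthetic data (a noisy cubic); the candidate degrees 1 to 8, the split sizes, and the mean-squared-error metric are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a cubic signal plus Gaussian noise.
x = rng.uniform(-1.0, 1.0, size=200)
y = x**3 - 0.5 * x + rng.normal(scale=0.05, size=x.size)

# First 120 points for training, next 40 for the dev set.
# The last 40 would form the test set and are deliberately not touched here.
x_train, y_train = x[:120], y[:120]
x_dev, y_dev = x[120:160], y[120:160]

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

dev_errors = {}
for degree in range(1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on training data only
    dev_errors[degree] = mse(coeffs, x_dev, y_dev)  # score on the dev set

# Keep the degree with the lowest dev-set error.
best_degree = min(dev_errors, key=dev_errors.get)
print(best_degree, dev_errors[best_degree])
```

The training error alone would keep falling as the degree grows, which is why the comparison is made on data the fit has not seen.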

# Test set:

The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset. It is only used once the model has been completely trained using the training and validation sets. The test set is therefore the one used to replicate the type of situation that will be encountered once the model is deployed for real-time use.

The test set is generally what is used to evaluate different models in competitions on Kaggle or Analytics Vidhya. In a Machine Learning hackathon, the cross-validation set is generally released along with the training set, while the actual test set is only released when the competition is about to close; it is the model's score on the test set that decides the winner.

# How to decide the ratio of splitting the dataset?
