Splitting a dataset

Category: IT Technology · Published: 3 years ago


To train and evaluate any machine learning model, irrespective of the type of dataset being used, you have to split the dataset into training data and testing data. So, let us look at how it can be done.

Here I am going to use the iris dataset and split it using the ‘train_test_split’ function from sklearn’s model_selection module.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

Then I load the iris dataset into a variable.

iris = load_iris()

I then store the data and the target values in two separate variables.

x, y = iris.data, iris.target

Here I use ‘train_test_split’ to split the data in an 80:20 ratio, i.e. 80% of the data will be used for training the model while the remaining 20% will be used for testing the model built from it.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123)
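As a quick sanity check, the steps so far can be put together and the shapes of the resulting arrays printed. This is only a minimal sketch that repeats the calls shown above. The iris dataset has 150 samples with 4 features each, so an 80:20 split should give 120 training rows and 30 testing rows.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the data and targets exactly as in the steps above.
iris = load_iris()
x, y = iris.data, iris.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=123
)

# 150 samples in total: 120 go to training, 30 go to testing.
print(x_train.shape, x_test.shape)  # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)  # (120,) (30,)
```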

As you can see, I have passed the following parameters to ‘train_test_split’:

  1. x and y, which we defined previously
  2. test_size: this is set to 0.2, which means the test set will be 20% of the dataset
  3. random_state: it controls the shuffling applied to the data before the split. Setting random_state to a fixed integer guarantees that the same shuffle, and therefore the same split, is produced each time you run the code.
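The effect of random_state can be checked directly. The sketch below splits the data twice with the same seed and once with a different one, then compares the resulting test sets. The variable names and the second seed (7) are chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# Two calls with the same random_state produce identical splits.
_, x_test_a, _, y_test_a = train_test_split(x, y, test_size=0.2, random_state=123)
_, x_test_b, _, y_test_b = train_test_split(x, y, test_size=0.2, random_state=123)
same_seed_matches = bool(
    np.array_equal(x_test_a, x_test_b) and np.array_equal(y_test_a, y_test_b)
)

# A different seed shuffles the rows differently, so the test set
# will almost certainly not match the first one.
_, x_test_c, _, _ = train_test_split(x, y, test_size=0.2, random_state=7)
different_seed_matches = bool(np.array_equal(x_test_a, x_test_c))

print("same seed, same split:", same_seed_matches)
print("different seed, same split:", different_seed_matches)
```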

When splitting a dataset there are two competing concerns:

- If you have less training data, your parameter estimates will have greater variance.

- If you have less testing data, your performance statistic will have greater variance.
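One way to see the second concern in practice is to repeat the split with many different seeds and look at how much the test accuracy moves around. The sketch below is only an illustration: the choice of a k-nearest-neighbours classifier, the helper name accuracy_spread and the number of repeats are assumptions, not part of the original walkthrough, and the exact numbers will depend on the model and the data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)


def accuracy_spread(test_size, n_repeats=30):
    """Mean and standard deviation of test accuracy over repeated random splits."""
    scores = []
    for seed in range(n_repeats):
        x_tr, x_te, y_tr, y_te = train_test_split(
            x, y, test_size=test_size, random_state=seed
        )
        # Hypothetical model choice, used only to obtain a score per split.
        model = KNeighborsClassifier(n_neighbors=5).fit(x_tr, y_tr)
        scores.append(model.score(x_te, y_te))
    return float(np.mean(scores)), float(np.std(scores))


# Compare a small, a medium and a large test set.
for test_size in (0.1, 0.2, 0.5):
    mean, std = accuracy_spread(test_size)
    print(f"test_size={test_size}: mean accuracy={mean:.3f}, std={std:.3f}")
```

The standard deviation printed for each test size gives a rough picture of how noisy the performance estimate is for that split ratio.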

The data should be divided so that neither variance is too high, and the right ratio depends mostly on how much data you have. If your dataset is very small, no single split will give you satisfactory variance, so you will have to use cross-validation. If your dataset is huge, it doesn’t really matter whether you choose an 80:20 split or a 90:10 split (indeed, you may choose to use less training data, since training on all of it might be more computationally intensive).
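For the small-data case, cross-validation can be sketched with sklearn’s ‘cross_val_score’. The example below assumes 5 folds and a k-nearest-neighbours classifier. Both are illustrative choices rather than recommendations from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

# Hypothetical model choice; any sklearn estimator could be used here.
model = KNeighborsClassifier(n_neighbors=5)

# With cv=5, every sample is used for testing exactly once
# and for training in the other four folds.
scores = cross_val_score(model, x, y, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
print("std of accuracy:", scores.std())
```

Averaging over the folds gives a steadier performance estimate than a single small test set would, at the cost of training the model several times.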

