The 4 steps necessary before fitting a machine learning model


A plain, object-oriented approach to data processing.


Photo by chuttersnap on Unsplash

There are many steps in a common machine learning pipeline, and much thought goes into architecting it: problem definition, data acquisition, error detection, data cleaning, and so on. In this story, we begin with the assumption that we already have a clean, ready-to-go dataset.

With that in mind, we outline the four steps necessary before fitting any machine learning model. We then implement those steps in PyTorch, using a common syntax for invoking multiple method calls: method chaining. The goal is to define a simple yet generalizable API that transforms any raw dataset into a format ready to be consumed by a machine learning model.

To this end, we will use the builder pattern, which constructs a complex object using a step-by-step approach.

The builder pattern is a design pattern which provides a flexible solution to object creation problems in object-oriented programming. Its aim is to separate the construction of a complex object from its representation.

So, what are those four things? In its simplest form, processing data before modelling includes four distinct actions:

  1. Load the data
  2. Split into train/valid/test sets
  3. Label the data tuples
  4. Obtain batches of data

In the following sections, I analyze those four steps one by one and implement them in code. Our goal is to finally create a PyTorch DataLoader, the abstraction PyTorch uses to represent an iterable over a dataset. Having a DataLoader is the first step in setting up the training loop. So, without further ado, let us get our hands dirty.

Loading the data

For this example, we use a mock dataset kept in a pandas DataFrame. Our goal is to create one PyTorch DataLoader for the training set and one for the validation set. Thus, let us build a class named DataLoaderBuilder that is responsible for building those loaders.
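The original snippet is not reproduced here; a minimal sketch of such a class could look like the following (the class name comes from the article, while the exact constructor signature is an assumption):

```python
import torch

class DataLoaderBuilder:
    """Builds PyTorch DataLoaders step by step (builder pattern)."""

    def __init__(self, data: torch.Tensor):
        # The builder's only state at this point: the raw data tensor.
        self.data = data
```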

We see that the only job of the DataLoaderBuilder so far is to store a data variable, whose type is torch.Tensor. So now, we need a way to initialize it from a pandas DataFrame. For that, we use a Python classmethod.

A classmethod is a plain Python method that, instead of receiving self as its first argument, receives the class itself. Thus, given a pandas DataFrame, we turn the DataFrame into a PyTorch tensor and instantiate the DataLoaderBuilder class, which is passed to the method as the cls argument. Optionally, we can keep only the columns of the DataFrame we care about. After defining it, we patch it onto the main DataLoaderBuilder class.
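A sketch of such a classmethod, patched onto the class after its definition (the method name from_df and the cols parameter are assumptions; the minimal class from the previous step is repeated so the snippet is self-contained):

```python
import pandas as pd
import torch

class DataLoaderBuilder:
    def __init__(self, data: torch.Tensor):
        self.data = data

@classmethod
def from_df(cls, df: pd.DataFrame, cols=None):
    # Optionally keep only the columns we care about.
    if cols is not None:
        df = df[cols]
    # Turn the DataFrame into a float tensor and instantiate the builder.
    data = torch.tensor(df.values, dtype=torch.float32)
    return cls(data)

# Patch the classmethod onto the main class.
DataLoaderBuilder.from_df = from_df
```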

Splitting into Training & Validation

For this example, we split the dataset into two sets: training and validation. It is easy to extend the code to split it into three sets: training, validation, and testing.

We want to split the dataset randomly, keeping some percentage of the data for training and setting aside what is left for validation. To this end, we can use PyTorch's SubsetRandomSampler. You can read more about this sampler and many other sampling methods in the official PyTorch documentation.
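A minimal sketch of such a split method (the method name and the train_pct default are assumptions; for brevity, the random row selection here uses torch.randperm directly rather than SubsetRandomSampler):

```python
import torch

class DataLoaderBuilder:
    def __init__(self, data: torch.Tensor):
        self.data = data

def split(self, train_pct: float = 0.9):
    # Shuffle the row indices and cut at the requested percentage (axis=0).
    n = self.data.shape[0]
    idx = torch.randperm(n)
    cut = int(n * train_pct)
    self.train_data = self.data[idx[:cut]]
    self.valid_data = self.data[idx[cut:]]
    return self  # return the builder itself to allow method chaining

DataLoaderBuilder.split = split
```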

By default, we keep 90% of the data for training, and we split across rows (axis=0). Another detail in the code is that we return self. Thus, after creating the train_data and valid_data splits, we return the whole builder. This is what permits us to use method chaining in the end.

Label the Dataset

Next, we should label the dataset. Most of the time, we use some feature variables to predict a dependent variable (i.e. the target). That is, of course, called supervised learning. The label_by_func method annotates the dataset according to a given function. After this call, the dataset is usually converted to (features, target) tuples.
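A sketch of label_by_func, which simply applies the user-supplied function to both splits (the attribute names follow the previous steps; a minimal class stub keeps the snippet self-contained):

```python
import torch

class DataLoaderBuilder:
    def __init__(self, data=None):
        self.data = data

def label_by_func(self, func):
    # Apply the user-supplied labelling function to both splits,
    # typically turning each tensor into a (features, target) tuple.
    self.train_data = func(self.train_data)
    self.valid_data = func(self.valid_data)
    return self  # keep the chain going

DataLoaderBuilder.label_by_func = label_by_func
```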

We see that the label_by_func method accepts a function as an argument and applies it to the train and valid sets. Our job is to design a function that serves our purposes any time we want to label a dataset of some form. Later in the “putting it all together” example we show how simple it is to create such a function.

Create Batches

Finally, only one step is left: break the dataset into batches. For this, we can leverage PyTorch's TensorDataset and DataLoader classes.
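A sketch of this final build method (the batch_size default and the shuffling of the training loader are assumptions; the snippet assumes train_data and valid_data already hold (features, target) tuples from the labelling step):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class DataLoaderBuilder:
    def __init__(self, data=None):
        self.data = data

def build(self, batch_size: int = 64):
    # train_data[0] holds the features, train_data[1] the target.
    train_ds = TensorDataset(self.train_data[0], self.train_data[1])
    valid_ds = TensorDataset(self.valid_data[0], self.valid_data[1])
    # Wrap each dataset in a DataLoader with a known batch size.
    train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    valid_dl = DataLoader(valid_ds, batch_size=batch_size)
    return train_dl, valid_dl

DataLoaderBuilder.build = build
```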

This is the last method in the chain; thus, we name it build. It creates the train and valid datasets, and having them, it is easy to instantiate the corresponding PyTorch DataLoader with a known batch size. Keep in mind that we have now labelled the data; thus, self.train_data is a tuple of features and a target variable. Consequently, self.train_data[0] keeps the features and self.train_data[1] holds the target.

Having that in place, let us put it all together with a simple example.

In this example, we create a dummy dataset of three columns, where the last column stores the target or dependent variable. We then define a get_label function that pulls the last column out and creates a features-target tuple. Finally, using method chaining, we can easily create the data loaders we need from a given pandas DataFrame.
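Putting the sketches above together, an end-to-end version could look like this (the column names, the get_label implementation, and the batch size are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

class DataLoaderBuilder:
    """Builds train/valid DataLoaders from a DataFrame, step by step."""

    def __init__(self, data: torch.Tensor):
        self.data = data

    @classmethod
    def from_df(cls, df: pd.DataFrame, cols=None):
        if cols is not None:
            df = df[cols]
        return cls(torch.tensor(df.values, dtype=torch.float32))

    def split(self, train_pct: float = 0.9):
        idx = torch.randperm(self.data.shape[0])
        cut = int(len(idx) * train_pct)
        self.train_data = self.data[idx[:cut]]
        self.valid_data = self.data[idx[cut:]]
        return self

    def label_by_func(self, func):
        self.train_data = func(self.train_data)
        self.valid_data = func(self.valid_data)
        return self

    def build(self, batch_size: int = 64):
        train_ds = TensorDataset(*self.train_data)
        valid_ds = TensorDataset(*self.valid_data)
        return (DataLoader(train_ds, batch_size=batch_size, shuffle=True),
                DataLoader(valid_ds, batch_size=batch_size))

def get_label(data):
    # The last column is the target; everything before it is a feature.
    return data[:, :-1], data[:, -1]

# Dummy dataset: two features and one target column.
df = pd.DataFrame(np.random.rand(100, 3), columns=["x1", "x2", "y"])

# Method chaining: load, split, label, and batch in one expression.
train_dl, valid_dl = (DataLoaderBuilder.from_df(df)
                      .split()
                      .label_by_func(get_label)
                      .build(batch_size=16))

xb, yb = next(iter(train_dl))
print(xb.shape, yb.shape)  # torch.Size([16, 2]) torch.Size([16])
```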

Conclusion

In this story, we saw the four necessary steps of data processing before fitting any model, assuming that the dataset is clean. Although this is a toy example, it can be used and extended to cover a wide variety of machine learning problems.

Also, there are steps not covered in this article (e.g. data normalization, or augmentation for computer vision), but the goal of the story is to provide a general idea of how to structure code that solves a relevant problem.

My name is Dimitris Poulopoulos and I’m a machine learning researcher at BigDataStack and PhD(c) at the University of Piraeus, Greece. I have worked on designing and implementing AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA. If you are interested in reading more posts about Machine Learning, Deep Learning and Data Science, follow me on Medium , LinkedIn or @james2pl on twitter.

