The 4 steps necessary before fitting a machine learning model


A plain, object-oriented approach to data processing.


Photo by chuttersnap on Unsplash

There are many steps in a common machine learning pipeline, and much thought goes into architecting it: problem definition, data acquisition, error detection, data cleaning, and so on. In this story, we begin with the assumption that we already have a clean, ready-to-go dataset.

With that in mind, we outline the four steps necessary before fitting any machine learning model. We then implement those steps in PyTorch, using a common syntax for invoking multiple method calls: method chaining. The goal is to define a simple yet generalizable API that transforms any raw dataset into a format ready to be consumed by a machine learning model.

To this end, we will use the builder pattern, which constructs a complex object using a step-by-step approach.

The builder pattern is a design pattern which provides a flexible solution to object creation problems in object-oriented programming. Its aim is to separate the construction of a complex object from its representation.

So, what are those four things? In its simplest form, processing data before modelling includes four distinct actions:

  1. Load the data
  2. Split into train/valid/test sets
  3. Label the data tuples
  4. Obtain batches of data

In the following sections, I analyze those four steps one by one and implement them in code. Our goal is to finally create a PyTorch DataLoader, the abstraction PyTorch uses to represent an iterable over a dataset. Having a DataLoader is the first step in setting up the training loop. So, without further ado, let us get our hands dirty.

Loading the data

For this example, we use a mock dataset kept in a pandas DataFrame. Our goal is to create one PyTorch DataLoader for the training set and one for the validation set. Thus, let us build a class named DataLoaderBuilder that is responsible for building those loaders.
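The original snippet is not reproduced here; a minimal sketch of such a class could look like the following (the class name comes from the article, while the exact constructor signature is an assumption):

```python
import torch

class DataLoaderBuilder:
    """Builds PyTorch DataLoaders step by step (builder pattern)."""

    def __init__(self, data: torch.Tensor):
        # The builder's only state at this point: the raw data tensor.
        self.data = data
```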

We see that the only job of the DataLoaderBuilder so far is to store a data variable, whose type is torch.Tensor. So now, we need a way to initialize it from a pandas DataFrame. For that, we use a Python classmethod.

A classmethod is a plain Python method that, instead of receiving self as its first argument, receives the class itself. Thus, given a pandas DataFrame, we turn the DataFrame into a PyTorch tensor and instantiate the DataLoaderBuilder class, which is passed to the method as the cls argument. Optionally, we can keep only the columns of the DataFrame we care about. After defining it, we patch it onto the main DataLoaderBuilder class.
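A sketch of such a classmethod, patched onto the class after its definition (the method name from_df and the cols parameter are assumptions; the minimal class from the previous step is repeated so the snippet is self-contained):

```python
import pandas as pd
import torch

class DataLoaderBuilder:
    def __init__(self, data: torch.Tensor):
        self.data = data

@classmethod
def from_df(cls, df: pd.DataFrame, cols=None):
    # Optionally keep only the columns we care about.
    if cols is not None:
        df = df[cols]
    # Turn the DataFrame into a float tensor and instantiate the builder.
    data = torch.tensor(df.values, dtype=torch.float32)
    return cls(data)

# Patch the classmethod onto the main class.
DataLoaderBuilder.from_df = from_df
```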

Splitting into Training & Validation

For this example, we split the dataset into two sets: training and validation. It is easy to extend the code to split it into three sets: training, validation, and testing.

We want to split the dataset randomly, keeping some percentage of the data for training and setting aside what is left for validation. To this end, we can use PyTorch's SubsetRandomSampler. You can read more about this sampler and many other sampling methods in the official PyTorch documentation.
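A minimal sketch of such a split method (the method name and the train_pct default are assumptions; for brevity, the random row selection here uses torch.randperm directly rather than SubsetRandomSampler):

```python
import torch

class DataLoaderBuilder:
    def __init__(self, data: torch.Tensor):
        self.data = data

def split(self, train_pct: float = 0.9):
    # Shuffle the row indices and cut at the requested percentage (axis=0).
    n = self.data.shape[0]
    idx = torch.randperm(n)
    cut = int(n * train_pct)
    self.train_data = self.data[idx[:cut]]
    self.valid_data = self.data[idx[cut:]]
    return self  # return the builder itself to allow method chaining

DataLoaderBuilder.split = split
```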

By default, we keep 90% of the data for training, and we split across rows (axis=0). Another detail in the code is that we return self. Thus, after creating the train_data and valid_data splits, we return the whole builder. This is what permits us to use method chaining in the end.

Label the Dataset

Next, we should label the dataset. Most of the time, we use some feature variables to predict a dependent variable (i.e. the target). That is, of course, called supervised learning. The label_by_func method annotates the dataset according to a given function. After this call, the dataset is usually converted to (features, target) tuples.
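A sketch of label_by_func, which simply applies the user-supplied function to both splits (the attribute names follow the previous steps; a minimal class stub keeps the snippet self-contained):

```python
import torch

class DataLoaderBuilder:
    def __init__(self, data=None):
        self.data = data

def label_by_func(self, func):
    # Apply the user-supplied labelling function to both splits,
    # typically turning each tensor into a (features, target) tuple.
    self.train_data = func(self.train_data)
    self.valid_data = func(self.valid_data)
    return self  # keep the chain going

DataLoaderBuilder.label_by_func = label_by_func
```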

We see that the label_by_func method accepts a function as an argument and applies it to the train and valid sets. Our job is to design a function that serves our purposes any time we want to label a dataset of some form. Later in the “putting it all together” example we show how simple it is to create such a function.

Create Batches

Finally, only one step is left: break the dataset into batches. For this, we can leverage PyTorch's TensorDataset and DataLoader classes.
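A sketch of this final build method (the batch_size default and the shuffling of the training loader are assumptions; the snippet assumes train_data and valid_data already hold (features, target) tuples from the labelling step):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class DataLoaderBuilder:
    def __init__(self, data=None):
        self.data = data

def build(self, batch_size: int = 64):
    # train_data[0] holds the features, train_data[1] the target.
    train_ds = TensorDataset(self.train_data[0], self.train_data[1])
    valid_ds = TensorDataset(self.valid_data[0], self.valid_data[1])
    # Wrap each dataset in a DataLoader with a known batch size.
    train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    valid_dl = DataLoader(valid_ds, batch_size=batch_size)
    return train_dl, valid_dl

DataLoaderBuilder.build = build
```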

This is the last method in the chain; thus, we name it build. It creates the train and valid datasets, and having them, it is easy to instantiate the corresponding PyTorch DataLoader with a known batch size. Keep in mind that we have now labelled the data; thus, self.train_data is a tuple of features and a target variable. Consequently, self.train_data[0] keeps the features and self.train_data[1] holds the target.

Having that in place, let us put it all together with a simple example.

In this example, we create a dummy dataset of three columns, where the last column stores the target or dependent variable. We then define a get_label function that pulls the last column out and creates a features-target tuple. Finally, using method chaining, we can easily create the data loaders we need from a given pandas DataFrame.
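Putting the sketches above together, an end-to-end version could look like this (the column names, the get_label implementation, and the batch size are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

class DataLoaderBuilder:
    """Builds train/valid DataLoaders from a DataFrame, step by step."""

    def __init__(self, data: torch.Tensor):
        self.data = data

    @classmethod
    def from_df(cls, df: pd.DataFrame, cols=None):
        if cols is not None:
            df = df[cols]
        return cls(torch.tensor(df.values, dtype=torch.float32))

    def split(self, train_pct: float = 0.9):
        idx = torch.randperm(self.data.shape[0])
        cut = int(len(idx) * train_pct)
        self.train_data = self.data[idx[:cut]]
        self.valid_data = self.data[idx[cut:]]
        return self

    def label_by_func(self, func):
        self.train_data = func(self.train_data)
        self.valid_data = func(self.valid_data)
        return self

    def build(self, batch_size: int = 64):
        train_ds = TensorDataset(*self.train_data)
        valid_ds = TensorDataset(*self.valid_data)
        return (DataLoader(train_ds, batch_size=batch_size, shuffle=True),
                DataLoader(valid_ds, batch_size=batch_size))

def get_label(data):
    # The last column is the target; everything before it is a feature.
    return data[:, :-1], data[:, -1]

# Dummy dataset: two features and one target column.
df = pd.DataFrame(np.random.rand(100, 3), columns=["x1", "x2", "y"])

# Method chaining: load, split, label, and batch in one expression.
train_dl, valid_dl = (DataLoaderBuilder.from_df(df)
                      .split()
                      .label_by_func(get_label)
                      .build(batch_size=16))

xb, yb = next(iter(train_dl))
print(xb.shape, yb.shape)  # torch.Size([16, 2]) torch.Size([16])
```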

Conclusion

In this story, we saw the four necessary steps of data processing before fitting any model, assuming that the dataset is clean. Although this is a toy example, it can be used and extended to cover a wide variety of machine learning problems.

Also, there are steps not covered in this article (e.g. data normalization, or augmentation for computer vision), but the goal of the story is to provide a general idea of how to structure code that solves a relevant problem.

My name is Dimitris Poulopoulos and I’m a machine learning researcher at BigDataStack and PhD(c) at the University of Piraeus, Greece. I have worked on designing and implementing AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA. If you are interested in reading more posts about Machine Learning, Deep Learning and Data Science, follow me on Medium , LinkedIn or @james2pl on twitter.

