GridSearch: the ultimate Machine Learning Tool



Photo by Chris Liverani on Unsplash

Machine Learning in short

The goal of supervised Machine Learning is to build a prediction function based on historical data. This data has independent (explanatory) variables and a target variable (the variable that you want to predict).

Once a predictive model has been built, we measure its error on a separate test data set. We do this using metrics that quantify the model's error, for example the Mean Squared Error in a regression context (quantitative target variable) or the Accuracy in a classification context (categorical target variable).
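As a minimal sketch of the two metrics just mentioned, scikit-learn's metrics module computes both directly (the values below are purely illustrative):

```python
from sklearn.metrics import mean_squared_error, accuracy_score

# Regression context: quantitative target variable
y_true_reg = [3.0, 2.5, 4.0]
y_pred_reg = [2.8, 2.7, 3.9]
mse = mean_squared_error(y_true_reg, y_pred_reg)  # average squared difference

# Classification context: categorical target variable
y_true_clf = ["cat", "dog", "dog", "cat"]
y_pred_clf = ["cat", "dog", "cat", "cat"]
acc = accuracy_score(y_true_clf, y_pred_clf)  # fraction of correct predictions

print(mse, acc)
```

A smaller MSE is better, while a larger Accuracy is better; either way, the metric is what lets us compare models on the same test data.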

The model with the smallest error is generally selected as the best model. Then we use this model to predict the values of the target variable by inputting the explanatory variables.

In this article, I will deep-dive into GridSearch.

Machine Learning’s Two Types of Optimization

GridSearch is a tool used for hyperparameter tuning. As stated before, Machine Learning in practice comes down to comparing different models and trying to find the best-performing one.

Apart from selecting the right data set, there are generally two aspects of optimizing a predictive model:

  1. Optimize the choice of the best model
  2. Optimize a model’s fit using hyperparameters tuning

Let’s now look at both to understand why GridSearch is needed.

Part 1. Optimize the choice of the best model

In some datasets, there may exist a simple linear relationship that can predict a target variable from the explanatory variables. In other datasets, these relationships may be more complex or highly nonlinear.

At the same time, many models exist, ranging from simple ones like Linear Regression up to very complex ones like Deep Neural Networks.

It is key to use a model that is appropriate for our data.

For example, if we use a Linear Regression on a very complex task, the model will not perform well. But if we use a Deep Neural Network on a very simple task, it will not perform well either!

To find a well-fitting Machine Learning model, the solution is to split data into train and test data, then fit many models on the training data and test each of them on the test data. The model that has the smallest error on the test data will be kept.
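The comparison loop described above can be sketched with scikit-learn; the dataset and the two candidate models here are just illustrative choices:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Split data into train and test sets
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit several candidate models on the training data
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=0),
}

errors = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    errors[name] = mean_squared_error(y_test, model.predict(X_test))

# Keep the model with the smallest error on the test data
best = min(errors, key=errors.get)
print(best, errors)
```

The same pattern scales to any number of candidate models: fit each on the training data, score each on the test data, keep the winner.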


A screenshot from Scikit Learn’s list of supervised models shows that there are a lot of models to try out!

Part 2. Optimize a model’s fit using hyperparameters tuning

After choosing one well-performing model (or a few), the second thing to optimize is the hyperparameters of a model. Hyperparameters are like a configuration of the training phase of the model. They influence what a model can or cannot learn.

Tuning hyperparameters can, therefore, lower the error on the test data set even more.

Each model estimates in its own way, and thus each model has its own hyperparameters to optimize.


This extract of the documentation of Scikit Learn’s RandomForestClassifier shows numerous parameters that can all influence the final accuracy of your model.
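Hyperparameters like these are set when the model is created. As a small sketch, here are a few of RandomForestClassifier's knobs (the values chosen are arbitrary examples, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,    # number of trees in the forest
    max_depth=5,         # maximum depth of each tree
    min_samples_leaf=2,  # minimum number of samples required at a leaf node
    random_state=0,
)
print(model.get_params()["n_estimators"])
```

Each of these values changes what the model can or cannot learn, which is exactly why they are worth tuning.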

One way to do a thorough search for the best hyperparameters is to use a tool called GridSearch.

What is GridSearch?

GridSearch is an optimization tool that we use when tuning hyperparameters. We define the grid of parameters that we want to search through, and we select the best combination of parameters for our data.

The “Search” in GridSearch

The hypothesis is that there is a specific combination of values of the different hyperparameters that will minimize the error of our predictive model. Our goal using GridSearch is to find this specific combination of parameters.

The “Grid” in GridSearch

GridSearch’s idea for finding this best parameter combination is simple: just test every possible parameter combination and select the best one!

Not every possible combination, though, since on a continuous scale there would be infinitely many combinations to test. The solution is to define a Grid, which specifies, for each hyperparameter, the values that should be tested.


A schematic overview of GridSearch on two hyperparameters Alpha and Beta (graphics by author)

In an example case where two hyperparameters, Alpha and Beta, are tuned, we could give both of them the values [0.1, 0.01, 0.001, 0.0001], resulting in the following “Grid” of values. At each crossing point, GridSearch will fit the model to see what the error at this point is.

And after checking all the grid points, we know which parameter combination is best for our prediction.
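The grid above is just the Cartesian product of the two value lists, which can be sketched in a few lines (Alpha and Beta are the hypothetical hyperparameters from the example):

```python
from itertools import product

alpha_values = [0.1, 0.01, 0.001, 0.0001]
beta_values = [0.1, 0.01, 0.001, 0.0001]

# Every crossing point of the grid is one candidate combination to evaluate
grid = list(product(alpha_values, beta_values))
print(len(grid))  # 4 x 4 = 16 combinations
```

Note how quickly the grid grows: adding a third hyperparameter with four values would already mean 64 model fits.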

The “Cross-Validation” in GridSearch

At this point, only one thing remains to be added: the Cross-Validation Error.

When testing the performance of a model with each combination of hyperparameters, there is a risk of overfitting: just by pure chance, a particular hyperparameter combination may happen to work well on the specific data split used. The performance on new, real-life data could be much worse!

To get a more reliable estimate of the performances of a hyperparameter combination, we take the Cross Validation Error.


A schematic overview of Cross-Validation (graphics by author)

In Cross-Validation, the data is split into multiple parts, for example 5. The model is then fit 5 times, each time leaving out one-fifth of the data; the left-out fifth is used to measure performance.

For one combination of hyperparameter values, the average of the 5 errors constitutes the cross-validation error. This makes the selection of the final combination more reliable.

What makes GridSearch so important?

GridSearch allows us to find the best model for a given data set very easily. It makes the Machine Learning part of the Data Scientist’s role much easier by automating the search.

On the Machine Learning side, some tasks still remain: deciding on the right way to measure error, which models to try out, and which hyperparameter values to test. And the most important part, data preparation, is also left to the data scientist.

Thanks to the GridSearch approach, the Data Scientist can focus on the data wrangling work while automating the repetitive tasks of model comparison. This makes the work more interesting and allows the Data Scientist to add value where they’re most needed: working with data.

A number of alternatives for GridSearch exist, including Random Search, Bayesian Optimization, Genetic Algorithms, and more. I will write an article about those soon, so don’t hesitate to stay tuned. Thanks for reading!

