Easily visualize Scikit-learn models’ decision boundaries



A simple utility function to visualize the decision boundaries of Scikit-learn machine learning models/estimators.

Tirthajyoti Sarkar

Apr 12 · 6 min read

Image source: Pixabay (free license)

Introduction

Scikit-learn is an amazing Python library for working and experimenting with a plethora of supervised and unsupervised machine learning (ML) algorithms and associated tools.

It is built with robustness and speed in mind, using NumPy and SciPy methods as much as possible together with memory-optimization techniques. Most importantly, the library offers a simple and intuitive API across the board for all kinds of ML estimators: fitting the data, predicting, and examining the model parameters.

For many classification problems in the domain of supervised ML, we may want to go beyond the numerical prediction (of the class or of the probability) and visualize the actual decision boundary between the classes. This works best for binary classification problems and for a pair of features, because the visualization is displayed on a 2-dimensional (2D) plane.

For example, here is a visualization of the decision boundary from a Support Vector Machine (SVM) tutorial in the official Scikit-learn documentation.

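At the time of writing, Scikit-learn did not offer a ready-made, accessible method for this kind of visualization (newer releases, from version 1.1 onward, include DecisionBoundaryDisplay in sklearn.inspection). In this article, we examine a simple piece of Python code that achieves it.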

A simple Python function

The full code is given in my GitHub repo on Python machine learning. You are welcome to explore the whole repository for other useful ML tutorials as well.

Here, we show the docstring to illustrate how the function can be used:

You can pass the model class and the model parameters (specific and unique to each model class) to the function, along with the feature and label data (as NumPy arrays).

Here, the model class denotes the exact Scikit-learn estimator class that you call to instantiate your ML estimator object. Note that you do not have to pass the specific estimator object that you are working with; the class itself suffices. The function internally fits the data and predicts to create the appropriate decision boundary (taking into account the model parameters that you also pass).

At present, the function uses just the first two columns of the data for fitting the model, because we need a predicted value for every point of a 2D mesh grid drawn under the scatter plot.
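The mechanics described above can be sketched roughly as follows. This is a minimal illustration, not the function from the repository: the name `sketch_decision_boundary`, its `resolution` argument, the margin of 1.0 around the data, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB


def sketch_decision_boundary(X, y, model_class, resolution=200, **model_params):
    """Hypothetical helper: fit model_class(**model_params) on the first two
    columns of X and shade the predicted class over a mesh grid."""
    X2 = np.asarray(X)[:, :2]  # only the first two features are used
    y = np.asarray(y)

    # Instantiate the estimator from its class and fit it internally.
    model = model_class(**model_params)
    model.fit(X2, y)

    # Build a mesh grid that covers the data with a small margin.
    x_min, x_max = X2[:, 0].min() - 1.0, X2[:, 0].max() + 1.0
    y_min, y_max = X2[:, 1].min() - 1.0, X2[:, 1].max() + 1.0
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, resolution),
        np.linspace(y_min, y_max, resolution),
    )

    # Predict a class for every grid point, then reshape back to the grid.
    grid = np.c_[xx.ravel(), yy.ravel()]
    Z = model.predict(grid).reshape(xx.shape)

    # Shade the predicted regions and overlay the training points.
    fig, ax = plt.subplots(figsize=(6, 5))
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X2[:, 0], X2[:, 1], c=y, edgecolor="k", s=20)
    ax.set_xlabel("Feature 1")
    ax.set_ylabel("Feature 2")
    return model, Z, ax


# Synthetic stand-in data (an assumption): two Gaussian blobs with three
# columns, of which only the first two are used by the helper.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(3.0, 1.0, (50, 3))])
y_demo = np.array([0] * 50 + [1] * 50)

model, Z, ax = sketch_decision_boundary(X_demo, y_demo, GaussianNB)
# In an interactive session, plt.show() would display the figure.
```

Taking the class plus keyword arguments, rather than a ready-made estimator object, means each call builds and fits a fresh model, which makes it easy to compare estimators or parameter settings on the same data.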

Some illustrative results

Code is boring, while results (and plots) are exciting, aren’t they?

For the demonstration, we used a divorce classification dataset. It records participants who completed a personal information form and a divorce predictors scale. The data is a modified version of the publicly available data at the UCI portal (after injecting some noise). There are 170 participants and 54 attributes (or predictor variables), all real-valued.

We compared the performance of multiple ML estimators on the same dataset:

Naive Bayes
Logistic regression
K-nearest neighbor (KNN)

Because the binary classes of this particular dataset are fairly easily separable, all the ML algorithms perform almost equally well. However, their decision boundaries look different from one another, and that is what we are interested in visualizing through this utility function.
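A comparison of this kind might look like the sketch below. The divorce dataset itself is not reproduced here, so the example uses `make_classification` to generate a synthetic stand-in with the same shape (170 samples, 54 real-valued features); the train/test split, the choice of `GaussianNB` as the Naive Bayes variant, and all parameter values are assumptions, and the printed accuracies say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the divorce dataset (NOT the real data):
# 170 samples, 54 real-valued features, two fairly well-separated classes.
X, y = make_classification(
    n_samples=170,
    n_features=54,
    n_informative=10,
    n_redundant=0,
    class_sep=2.0,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The three estimators listed above, with illustrative settings.
estimators = {
    "Naive Bayes": GaussianNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit each estimator on the same split and record its test accuracy.
scores = {}
for name, est in estimators.items():
    est.fit(X_train, y_train)
    scores[name] = est.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```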

Naive Bayes decision boundary

The decision boundary from the Naive Bayes algorithm was smooth and slightly nonlinear. And it took only four lines of code!

Logistic regression decision boundary

As expected, the decision boundary from the logistic regression estimator was visualized as a linear separator.
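The linearity follows from the model itself: with two features, logistic regression predicts class 1 wherever w1*x1 + w2*x2 + b > 0, so the boundary is the straight line w1*x1 + w2*x2 + b = 0. The sketch below recovers that line from the fitted `coef_` and `intercept_` attributes; the two-blob data is a synthetic assumption, not the article's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (an assumption for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.5, 1.0, (60, 2)), rng.normal(1.5, 1.0, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

clf = LogisticRegression().fit(X, y)
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]

# Boundary: w1*x1 + w2*x2 + b = 0  ->  x2 = -(w1*x1 + b) / w2
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 5)
x2 = -(w1 * x1 + b) / w2
boundary_points = np.column_stack([x1, x2])

# On the boundary the decision function is zero (up to rounding),
# which corresponds to a predicted probability of 0.5 for each class.
print(clf.decision_function(boundary_points))
print(clf.predict_proba(boundary_points)[:, 1])
```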

K-nearest neighbor (KNN) decision boundary

K-nearest neighbor is an algorithm based on the local geometry of the distribution of the data in the feature space (and the relative distances between points). The decision boundary, therefore, comes out nonlinear and non-smooth.

You can pass even a neural network classifier

The function works with any Scikit-learn classification estimator, even a neural network. Here is the decision boundary with the MLPClassifier estimator of Scikit-learn, which models a densely connected neural network (with user-configurable parameters). Note that, in the code, we pass the hidden layer settings, the learning rate, and the optimizer (Stochastic Gradient Descent, or SGD).
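As a rough illustration of passing those settings, the sketch below instantiates `MLPClassifier` from the class plus a parameter dictionary and predicts over a mesh grid. The specific values (two hidden layers of 50 units, `learning_rate_init=0.01`, `max_iter=500`), the `make_moons` data, and the feature scaling step are assumptions; they are not the settings used for the article's plot.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data (an assumption), scaled because
# SGD-trained networks are sensitive to feature scale.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X = StandardScaler().fit_transform(X)

# Model class and parameters are kept separate, mirroring the
# "class plus parameters" calling style described in the article.
model_class = MLPClassifier
mlp_params = dict(
    hidden_layer_sizes=(50, 50),  # two dense hidden layers
    learning_rate_init=0.01,      # initial learning rate
    solver="sgd",                 # stochastic gradient descent optimizer
    max_iter=500,
    random_state=0,
)
mlp = model_class(**mlp_params).fit(X, y)

# Predict the class of every point on a mesh grid covering the data.
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 150),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 150),
)
Z = mlp.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
print("Predicted classes on the grid:", np.unique(Z))
```

The array `Z` can then be passed to a filled contour plot, for example `matplotlib.pyplot.contourf(xx, yy, Z)`, to draw the regions.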

Examining the impact of model parameters

As mentioned before, we can pass any model parameters that we want to the utility function. In the case of the KNN classifier, as we increase the number of neighboring data points, the decision boundary becomes smoother. This can be readily visualized using our utility function. Note how the code passes the variable k to the n_neighbors model parameter inside a loop.
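A loop of that shape could be sketched as below. Instead of plotting, this example counts how often the predicted class changes between adjacent grid cells, a crude proxy (introduced here, not taken from the article) for how ragged the boundary is; lower counts tend to go with smoother boundaries. The `make_moons` data, the grid extent, and the values of k are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy two-feature data (an assumption for illustration).
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Fixed mesh grid shared by every value of k.
xx, yy = np.meshgrid(np.linspace(-2, 3, 150), np.linspace(-1.5, 2, 150))
grid = np.c_[xx.ravel(), yy.ravel()]


def class_changes(Z):
    """Count adjacent grid-cell pairs (horizontal and vertical) whose
    predicted classes differ: a rough measure of boundary raggedness."""
    horizontal = (Z[:, 1:] != Z[:, :-1]).sum()
    vertical = (Z[1:, :] != Z[:-1, :]).sum()
    return int(horizontal + vertical)


results = {}
for k in (1, 5, 15, 45):
    # k is passed to the n_neighbors model parameter inside the loop.
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    Z = knn.predict(grid).reshape(xx.shape)
    results[k] = class_changes(Z)
    print(f"k={k:>2}: {results[k]} class changes between adjacent grid cells")
```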
