Easily visualize Scikit-learn models’ decision boundaries

A simple utility function to visualize the decision boundaries of Scikit-learn machine learning models/estimators.

Image source: Pixabay (Free license)

Introduction

Scikit-learn is an amazing Python library for working and experimenting with a plethora of supervised and unsupervised machine learning (ML) algorithms and associated tools.

It is built with robustness and speed in mind, using NumPy and SciPy methods as much as possible, along with memory-optimization techniques. Most importantly, the library offers a simple and intuitive API across the board for all kinds of ML estimators: fitting the data, predicting, and examining the model parameters.

Image: Scikit-learn estimator illustration

For many classification problems in the domain of supervised ML, we may want to go beyond the numerical prediction (of the class or of the probability) and visualize the actual decision boundary between the classes. This is particularly suitable for binary classification problems with a pair of features, since the visualization can then be displayed on a two-dimensional (2D) plane.

For example, here is a visualization of a decision boundary from a Support Vector Machine (SVM) tutorial in the official Scikit-learn documentation.

Image source: Scikit-learn SVM

Scikit-learn does not offer a ready-made, accessible method for this kind of visualization, so in this article we examine a simple piece of Python code that achieves it.

A simple Python function

The full code is given in my GitHub repo on Python machine learning. You are certainly welcome to explore the whole repository for other useful ML tutorials as well.

Here is the docstring, which illustrates how the function can be used:

The docstring for the utility function

You pass the model class and the model parameters (specific and unique to each model class) to the function, along with the feature and label data (as NumPy arrays).

Here, the model class denotes the exact Scikit-learn estimator class that you call to instantiate your ML estimator object. Note that you don't have to pass an already instantiated estimator; the class name alone suffices. The function internally fits the data and predicts in order to create the appropriate decision boundary (taking into account the model parameters that you also pass).

At present, the function uses only the first two columns of the data for fitting the model, because we need a predicted value for every point on a mesh grid to build the plot.

Main code section
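
Since the original post shows the code only as a screenshot, here is a minimal sketch of what such a function might look like. The name plot_decision_boundaries, its signature, and the grid step are assumptions for illustration, not the repo's exact code:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_decision_boundaries(X, y, model_class, **model_params):
        # Illustrative sketch; the name and signature are assumptions
        X = np.asarray(X)[:, :2]             # only the first two features are used
        model = model_class(**model_params)  # instantiate from the class passed in
        model.fit(X, y)                      # the function fits internally

        # Build a mesh grid spanning the range of the two features
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                             np.arange(y_min, y_max, 0.1))

        # Predict a class for every grid point and shade the regions
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
        plt.contourf(xx, yy, Z, alpha=0.3)
        plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
        plt.show()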

Some illustrative results

Code is boring, but results (and plots) are exciting, aren't they?

For the demonstration, we used a divorce classification dataset, collected from participants who completed a personal-information form and a divorce predictors scale. The data is a modified version of the publicly available dataset on the UCI portal (with some noise injected). There are 170 participants and 54 attributes (or predictor variables), all real-valued.

UCI divorce predictor dataset
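
For the examples below, assume the features and labels have been loaded into NumPy arrays X and y. A minimal loading sketch, in which the file name and the label column name are hypothetical, might be:

    import pandas as pd

    # Hypothetical local copy of the (noise-injected) divorce data
    df = pd.read_csv('divorce.csv')
    X = df.drop('Class', axis=1).values  # 54 real-valued predictors
    y = df['Class'].values               # binary divorce label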

We compared the performance of multiple ML estimators on the same dataset:

  • Naive Bayes
  • Logistic regression
  • K-nearest neighbor (KNN)

Because the binary classes of this particular dataset are fairly easy to separate, all the ML algorithms perform almost equally well. However, their decision boundaries look different from one another, and that is exactly what we are interested in visualizing with this utility function.

Image: Class separability of the divorce prediction dataset

Naive Bayes decision boundary

The decision boundary from the Naive Bayes algorithm is smooth and slightly nonlinear. And it takes only four lines of code!
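
As a hedged illustration, using the plot_decision_boundaries sketch from above, those four lines could look like this:

    from sklearn.naive_bayes import GaussianNB

    # Pass the class itself, not an instance; fitting happens inside the function
    plot_decision_boundaries(X, y, GaussianNB)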

Image: Naive Bayes decision boundary

Logistic regression decision boundary

As expected, the decision boundary from the logistic regression estimator shows up as a linear separator.

Image: Logistic regression decision boundary

K-nearest neighbor (KNN) decision boundary

K-nearest neighbor is an algorithm based on the local geometry of the data distribution in the feature space (and the relative distances between points). The decision boundary therefore comes out nonlinear and non-smooth.

Image: K-nearest neighbor (KNN) decision boundary

You can even pass a neural network classifier

The function works with any Scikit-learn estimator, even a neural network. Here is the decision boundary with Scikit-learn's MLPClassifier estimator, which models a densely connected neural network (with user-configurable parameters). Note that in the code we pass the hidden-layer settings, the learning rate, and the optimizer (Stochastic Gradient Descent, or SGD).
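
A hedged sketch of such a call, where the parameter values are illustrative rather than the article's exact settings:

    from sklearn.neural_network import MLPClassifier

    # Hidden-layer sizes, learning rate, and the SGD solver are passed
    # straight through to the MLPClassifier constructor
    plot_decision_boundaries(X, y, MLPClassifier,
                             hidden_layer_sizes=(30, 30),
                             learning_rate_init=0.01,
                             solver='sgd')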

Image: Neural network (MLPClassifier) decision boundary

Examining the impact of model parameters

As mentioned before, we can pass any model parameters we want to the utility function. In the case of the KNN classifier, as we increase the number of neighboring data points, the decision boundary becomes smoother. This is readily visualized using our utility function. Note, in the code below, how we pass the variable k to the n_neighbors model parameter inside a loop.
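
A sketch of that loop, again assuming the plot_decision_boundaries function from above (the range of k values is illustrative):

    from sklearn.neighbors import KNeighborsClassifier

    # The boundary smooths out as the number of neighbors k grows
    for k in range(1, 10, 2):
        plot_decision_boundaries(X, y, KNeighborsClassifier, n_neighbors=k)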

