Easily visualize Scikit-learn models’ decision boundaries

A simple utility function to visualize the decision boundaries of Scikit-learn machine learning models/estimators.

Image source: Pixabay (Free license)

Introduction

Scikit-learn is an amazing Python library for working and experimenting with a plethora of supervised and unsupervised machine learning (ML) algorithms and associated tools.

It is built with robustness and speed in mind, using NumPy and SciPy methods as much as possible, along with memory-optimization techniques. Most importantly, the library offers a simple and intuitive API across the board for all kinds of ML estimators: fitting the data, predicting, and examining the model parameters.

Image: Scikit-learn estimator illustration

For many classification problems in the domain of supervised ML, we may want to go beyond the numerical prediction (of the class or of the probability) and visualize the actual decision boundary between the classes. This is particularly suitable for binary classification problems with a pair of features, since the visualization can then be displayed on a two-dimensional (2D) plane.

For example, here is a visualization of a decision boundary from a Support Vector Machine (SVM) tutorial in the official Scikit-learn documentation.

Image source: Scikit-learn SVM

Scikit-learn does not offer a ready-made, accessible method for this kind of visualization, so in this article we examine a simple piece of Python code that achieves it.

A simple Python function

The full code is given in my GitHub repo on Python machine learning. You are certainly welcome to explore the whole repository for other useful ML tutorials as well.

Here is the docstring, which illustrates how the function can be used:

The docstring for the utility function

You pass the model class and the model parameters (specific and unique to each model class) to the function, along with the feature and label data (as NumPy arrays).

Here, the model class denotes the exact Scikit-learn estimator class that you call to instantiate your ML estimator object. Note that you don't have to pass an already instantiated estimator; the class name alone suffices. The function internally fits the data and predicts in order to create the appropriate decision boundary (taking into account the model parameters that you also pass).

At present, the function uses only the first two columns of the data for fitting the model, because we need a predicted value for every point on a mesh grid to build the plot.

Main code section
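
Since the original post shows the code only as a screenshot, here is a minimal sketch of what such a function might look like. The name plot_decision_boundaries, its signature, and the grid step are assumptions for illustration, not the repo's exact code:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_decision_boundaries(X, y, model_class, **model_params):
        # Illustrative sketch; the name and signature are assumptions
        X = np.asarray(X)[:, :2]             # only the first two features are used
        model = model_class(**model_params)  # instantiate from the class passed in
        model.fit(X, y)                      # the function fits internally

        # Build a mesh grid spanning the range of the two features
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                             np.arange(y_min, y_max, 0.1))

        # Predict a class for every grid point and shade the regions
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
        plt.contourf(xx, yy, Z, alpha=0.3)
        plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
        plt.show()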

Some illustrative results

Code is boring, but results (and plots) are exciting, aren't they?

For the demonstration, we used a divorce classification dataset, collected from participants who completed a personal-information form and a divorce predictors scale. The data is a modified version of the publicly available dataset on the UCI portal (with some noise injected). There are 170 participants and 54 attributes (or predictor variables), all real-valued.

UCI divorce predictor dataset
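
For the examples below, assume the features and labels have been loaded into NumPy arrays X and y. A minimal loading sketch, in which the file name and the label column name are hypothetical, might be:

    import pandas as pd

    # Hypothetical local copy of the (noise-injected) divorce data
    df = pd.read_csv('divorce.csv')
    X = df.drop('Class', axis=1).values  # 54 real-valued predictors
    y = df['Class'].values               # binary divorce label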

We compared the performance of multiple ML estimators on the same dataset:

  • Naive Bayes
  • Logistic regression
  • K-nearest neighbor (KNN)

Because the binary classes of this particular dataset are fairly easy to separate, all the ML algorithms perform almost equally well. However, their decision boundaries look different from one another, and that is exactly what we are interested in visualizing with this utility function.

Image: Class separability of the divorce prediction dataset

Naive Bayes decision boundary

The decision boundary from the Naive Bayes algorithm is smooth and slightly nonlinear. And it takes only four lines of code!
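
As a hedged illustration, using the plot_decision_boundaries sketch from above, those four lines could look like this:

    from sklearn.naive_bayes import GaussianNB

    # Pass the class itself, not an instance; fitting happens inside the function
    plot_decision_boundaries(X, y, GaussianNB)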

Image: Naive Bayes decision boundary

Logistic regression decision boundary

As expected, the decision boundary from the logistic regression estimator shows up as a linear separator.

Image: Logistic regression decision boundary

K-nearest neighbor (KNN) decision boundary

K-nearest neighbor is an algorithm based on the local geometry of the data distribution in the feature space (and the relative distances between points). The decision boundary therefore comes out nonlinear and non-smooth.

Image: K-nearest neighbor (KNN) decision boundary

You can even pass a neural network classifier

The function works with any Scikit-learn estimator, even a neural network. Here is the decision boundary with Scikit-learn's MLPClassifier estimator, which models a densely connected neural network (with user-configurable parameters). Note that in the code we pass the hidden-layer settings, the learning rate, and the optimizer (Stochastic Gradient Descent, or SGD).
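
A hedged sketch of such a call, where the parameter values are illustrative rather than the article's exact settings:

    from sklearn.neural_network import MLPClassifier

    # Hidden-layer sizes, learning rate, and the SGD solver are passed
    # straight through to the MLPClassifier constructor
    plot_decision_boundaries(X, y, MLPClassifier,
                             hidden_layer_sizes=(30, 30),
                             learning_rate_init=0.01,
                             solver='sgd')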

Image: Neural network (MLPClassifier) decision boundary

Examining the impact of model parameters

As mentioned before, we can pass any model parameters we want to the utility function. In the case of the KNN classifier, as we increase the number of neighboring data points, the decision boundary becomes smoother. This is readily visualized using our utility function. Note, in the code below, how we pass the variable k to the n_neighbors model parameter inside a loop.
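
A sketch of that loop, again assuming the plot_decision_boundaries function from above (the range of k values is illustrative):

    from sklearn.neighbors import KNeighborsClassifier

    # The boundary smooths out as the number of neighbors k grows
    for k in range(1, 10, 2):
        plot_decision_boundaries(X, y, KNeighborsClassifier, n_neighbors=k)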

