A simple utility function to visualize the decision boundaries of Scikit-learn machine learning models/estimators.
Introduction
Scikit-learn is an amazing Python library for working and experimenting with a plethora of supervised and unsupervised machine learning (ML) algorithms and associated tools.
It is built with robustness and speed in mind — using NumPy and SciPy methods as much as possible with memory-optimization techniques. Most importantly, the library offers a simple and intuitive API across the board for all kinds of ML estimators — fitting the data, predicting, and examining the model parameters.
For many classification problems in the domain of supervised ML, we may want to go beyond the numerical prediction (of the class or of the probability) and visualize the actual decision boundary between the classes. This is, of course, particularly suitable for binary classification problems and for a pair of features — the visualization is displayed on a 2-dimensional (2D) plane.
For example, here is a visualization of the decision boundary from a Support Vector Machine (SVM) tutorial in the official Scikit-learn documentation.
Scikit-learn does not offer a ready-made, accessible method for this kind of visualization, so in this article, we examine a simple piece of Python code that achieves it.
A simple Python function
The full code is given here in my GitHub repo on Python machine learning. You are certainly welcome to explore the whole repository for other useful ML tutorials as well.
Here, we show the function's docstring to illustrate how it can be used.
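The exact code is in the repo linked above; what follows is only a minimal sketch reconstructed from the description in this article. The function name plot_decision_boundaries, its argument names, and the plotting details are assumptions made for illustration, not necessarily what the repo uses.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundaries(X, y, model_class, **model_params):
    """
    Fit a Scikit-learn estimator on the first two columns of X and plot
    the resulting decision boundary along with the data points.

    Parameters
    ----------
    X : numpy.ndarray, shape (n_samples, n_features)
        Feature matrix; only the first two columns are used.
    y : numpy.ndarray, shape (n_samples,)
        Class labels.
    model_class : a Scikit-learn estimator class (not an instance),
        e.g. LogisticRegression or KNeighborsClassifier.
    **model_params : keyword arguments forwarded to the estimator constructor.
    """
    X = np.asarray(X)[:, :2]              # only the first two features are used
    model = model_class(**model_params)   # instantiate the estimator from its class
    model.fit(X, y)

    # Build a mesh grid spanning the range of the two features
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))

    # Predict a class for every grid point and draw filled contours
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.4)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k')
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()
```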
You can pass on the model class and the model parameters (specific and unique to each model class) to the function, along with the feature and label data (as NumPy arrays).
Here, the model class denotes the exact Scikit-learn estimator class that you call to instantiate your ML estimator object. Note that you don't have to pass on an already instantiated ML estimator. Just the class name will suffice. The function will internally fit the data and predict to create the appropriate decision boundary (taking into account the model parameters that you also pass on).
At present, the function uses just the first two columns of the data for fitting the model as we need to find the predicted value for every point in a mesh grid-style scatter plot.
Some illustrative results
Code is boring, while results (and plots) are exciting, aren’t they?
For the demonstration, we used a divorce classification dataset. The dataset comes from participants who completed a personal information form and a divorce predictors scale. The data is a modified version of the publicly available data at the UCI portal (after injecting some noise). There are 170 participants and 54 attributes (or predictor variables), all real-valued.
We compared the performance of multiple ML estimators on the same dataset:
- Naive Bayes
- Logistic regression
- K-nearest neighbor (KNN)
Because the binary classes of this particular dataset are fairly easily separable, all the ML algorithms perform almost equally well. However, their respective decision boundaries look different from one another, and that is what we are interested in visualizing with this utility function.
Naive Bayes decision boundary
The decision boundary from the Naive Bayes algorithm was smooth and slightly nonlinear. And with only four lines of code!
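As a usage sketch (assuming X and y hold the divorce-dataset features and labels, and that plot_decision_boundaries is the illustrative function sketched earlier), the call can indeed be this short:

```python
from sklearn.naive_bayes import GaussianNB

# Assumed context: X (features) and y (labels) are NumPy arrays loaded earlier,
# and plot_decision_boundaries is the illustrative sketch defined above.
plot_decision_boundaries(X, y, GaussianNB)
```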
Logistic regression decision boundary
As expected, the decision boundary from the logistic regression estimator was visualized as a linear separator.
K-nearest neighbor (KNN) decision boundary
K-nearest neighbor is an algorithm based on the local geometry of the distribution of the data in the feature space (and the relative distances between points). The decision boundary, therefore, comes out as nonlinear and non-smooth.
You can even pass a neural network classifier
The function works with any Scikit-learn estimator, even a neural network. Here is the decision boundary with the MLPClassifier estimator of Scikit-learn, which models a densely-connected neural network (with user-configurable parameters). Note, in the code, how we pass on the hidden layer settings, the learning rate, and the optimizer (Stochastic Gradient Descent, or SGD).
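Here is a hedged sketch of such a call; the specific hidden-layer sizes, learning rate, and iteration count below are illustrative values, not necessarily the ones used in the original article.

```python
from sklearn.neural_network import MLPClassifier

# Illustrative settings: two small hidden layers, the SGD optimizer,
# and a fixed initial learning rate; X and y are the same arrays as before.
plot_decision_boundaries(X, y, MLPClassifier,
                         hidden_layer_sizes=(5, 5),
                         solver='sgd',
                         learning_rate_init=0.01,
                         max_iter=2000)
```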
Examining the impact of model parameters
As mentioned before, we can pass on any model parameters we want to the utility function. In the case of the KNN classifier, as we increase the number of neighboring data points, the decision boundary becomes smoother. This can be readily visualized using our utility function. Note, in the code below, how we pass on the variable k to the n_neighbors model parameter inside a loop.
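A sketch of such a loop might look like the following (the particular values of k are illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier

# Sweep the number of neighbors; larger k should yield a smoother boundary.
for k in (1, 5, 25, 50):
    plot_decision_boundaries(X, y, KNeighborsClassifier, n_neighbors=k)
```

Each call produces its own plot, so the smoothing effect of increasing k can be compared side by side.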