Feature Selection
In many situations, using all the features available in a data set will not result in the most predictive model. Depending on the type of model being used, the size of the data set and various other factors, excess features can reduce model performance.
There are three main goals of feature selection.
- Improve the accuracy with which the model is able to predict for new data.
- Reduce computational cost.
- Produce a more interpretable model.
There are a number of reasons why you may remove certain features over others. These include the relationships that exist between features, whether a statistically significant relationship to the target variable exists, and the value of the information contained within a feature.
Feature selection can be performed manually, by analysing the data set both before and after training, or through automated statistical methods.
Manual feature selection
There are a number of reasons why you may want to remove a feature from the training phase. These include:
- A feature that is highly correlated with another feature in the data set. If this is the case then both features are in essence providing the same information. Some algorithms are sensitive to correlated features.
- Features that provide little to no information. An example would be a feature where most examples have the same value.
- Features that have little to no statistical relationship with the target variable.
Features can be selected through data analysis performed either before or after training a model. Here are a couple of common techniques to manually perform feature selection.
Correlation plot
One manual technique to perform feature selection is to create a visualisation which plots the pairwise correlations between the features in the data set. Seaborn is a good Python library to use for this. The code below produces a correlation plot for the features in the breast cancer data set available from the scikit-learn API.
# library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.datasets import load_breast_cancer

# load the breast_cancer data set from the scikit-learn API
breast_cancer = load_breast_cancer()
data = pd.DataFrame(data=breast_cancer['data'], columns = breast_cancer['feature_names'])
data['target'] = breast_cancer['target']
data.head()

# use the pandas .corr() function to compute pairwise correlations for the dataframe
corr = data.corr()

# visualise the correlations with seaborn
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.set_style(style = 'white')
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(10, 250, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, square=True, linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
In the resulting visualisation we can identify features that are closely correlated with one another, some of which we may want to remove, as well as features that have very low correlation with the target variable, which we may also want to remove.
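As a rough sketch of how that pruning could be done programmatically, the code below uses the pairwise correlations to flag one feature from every highly correlated pair. The 0.9 cut-off is an arbitrary choice for illustration, not a recommended value.
# a minimal sketch: flag one feature from every pair whose absolute
# correlation exceeds an (arbitrary) 0.9 cut-off
feature_corr = data.drop('target', axis=1).corr().abs()

# keep only the upper triangle so each pair is considered once
upper = feature_corr.where(np.triu(np.ones(feature_corr.shape), k=1).astype(bool))

to_drop = [column for column in upper.columns if (upper[column] > 0.9).any()]
print("Candidates for removal:", to_drop)

reduced_data = data.drop(columns=to_drop)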
Feature importances
Once we have trained a model it is possible to apply further statistical analysis to understand the effects features have on the output of the model and determine from this which features are most useful.
There are a number of tools and techniques available to determine feature importance. Some techniques are unique to a specific algorithm, whereas others can be applied to a wide range of models and are known as model agnostic.
To illustrate feature importances I will use the built-in feature importances method for a random forest classifier in scikit-learn. The code below fits the classifier and creates a plot displaying the feature importances.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# splitting data into test and train sets
X_train, X_test, y_train, y_test = train_test_split(data.drop('target', axis=1), data['target'],
                                                    test_size=0.20, random_state=0)

# fitting the model
model = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)

# plotting feature importances
features = data.drop('target', axis=1).columns
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(10, 15))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
This gives a good indicator of those features that are having an impact on the model and those that are not. We may choose to remove some of the less important features after analysing this chart.
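As a sketch of what that might look like, the code below keeps only the N most important features according to the fitted forest. The value of N is an arbitrary choice here and would need to be validated against model performance.
# a rough sketch: keep only the N most important features (N = 15 is arbitrary)
N = 15
top_indices = np.argsort(importances)[-N:]
top_features = [features[i] for i in top_indices]

X_train_reduced = X_train[top_features]
X_test_reduced = X_test[top_features]
print("Reduced training set shape:", X_train_reduced.shape)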
Automated feature selection
There are a number of mechanisms that use statistical methods to find the optimal set of features to use in a model automatically. The scikit-learn library contains a number of methods that provide a very simple implementation for many of these techniques.
Variance threshold
In statistics, variance is the average of the squared deviations of a variable from its mean; in other words, it measures how far the data points for a given variable are spread out.
Suppose we were building a machine learning model to detect breast cancer and the data set had a boolean variable for gender. This data set is likely to consist almost entirely of one gender, and therefore nearly all data points would be 1. This variable would have extremely low variance and would not be useful for predicting the target variable.
This is one of the simplest approaches to feature selection. The scikit-learn library has a method called VarianceThreshold. This method takes a threshold value and, when fitted to a feature set, will remove any features below this threshold. The default value for the threshold is 0, which removes any features with zero variance, or in other words features where all values are the same.
Let’s apply this default setting to the breast cancer data set we used earlier to find out if any features are eliminated.
from sklearn.feature_selection import VarianceThreshold

X = data.drop('target', axis=1)
selector = VarianceThreshold()
print("Original feature shape:", X.shape)

new_X = selector.fit_transform(X)
print("Transformed feature shape:", new_X.shape)
The output shows that the transformed feature set has the same shape, so all features have at least some variance.
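If we did want to prune low-variance features, we could pass a non-zero threshold instead. One caveat: the threshold is compared against raw variances, so features measured on larger scales pass more easily; the value used below is purely illustrative.
# sketch: apply an (illustrative) non-zero variance threshold
selector = VarianceThreshold(threshold=0.05)
new_X = selector.fit_transform(X)
print("Transformed feature shape:", new_X.shape)
print("Features kept:", list(X.columns[selector.get_support()]))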
Univariate feature selection
Univariate feature selection applies univariate statistical tests to features and selects those which perform best in these tests. Univariate tests consider each feature individually in its relationship to the target; examples include analysis of variance (ANOVA), linear regression and t-tests of means.
Again scikit-learn provides a number of feature selection methods that apply a variety of different univariate tests to find the best features for machine learning.
We will apply one of these, known as SelectKBest, to the breast cancer data set. This function selects the k best features based on a univariate statistical test. The default value of k is 10, so 10 features will be kept, and the default test is f_classif.
The f_classif test is used for categorical targets, as is the case for the breast cancer data set. If the target is a continuous variable, f_regression should be used instead. The f_classif test is based on the analysis of variance (ANOVA) F-test which, for each feature, compares the means of that feature across the target classes and determines whether they are significantly different from one another.
The below code applies this function using the default parameters to the breast cancer data set.
from sklearn.feature_selection import SelectKBest

X = data.drop('target', axis=1)
y = data['target']
selector = SelectKBest()
print("Original feature shape:", X.shape)

new_X = selector.fit_transform(X, y)
print("Transformed feature shape:", new_X.shape)
The output shows that the function has reduced the features to 10. We could experiment with different values of k, training multiple models, until we find the optimum number of features.
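One way to run that experiment, sketched below, is to wrap SelectKBest and a classifier in a pipeline and cross-validate a few candidate values of k. The candidate values and the choice of a random forest here are arbitrary.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# sketch: compare cross-validated accuracy for a few (arbitrary) values of k
for k in [5, 10, 15, 20, 30]:
    pipeline = make_pipeline(SelectKBest(k=k),
                             RandomForestClassifier(n_estimators=100, random_state=42))
    scores = cross_val_score(pipeline, X, y, cv=5)
    print("k =", k, "mean accuracy:", round(scores.mean(), 3))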
Recursive feature elimination
This method trains a model on progressively smaller sets of features. At each step the feature importances or coefficients are calculated and the features with the lowest scores are removed. At the end of this process, the optimal set of features is known.
As this method involves repeatedly training a model we need to instantiate an estimator first. This method also works best if the data is first scaled, so I have added a preprocessing step that normalizes the data.
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
from sklearn import preprocessing

# normalize the data before fitting
X_normalized = preprocessing.normalize(X, norm='l2')

estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=2)
selector = selector.fit(X_normalized, y)
print("Features selected", selector.support_)
print("Feature ranking", selector.ranking_)
The output shows which features have been selected (the support_ mask) and the ranking assigned to each feature.
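The support_ mask can also be used to recover the names of the selected columns from the original dataframe, which is usually easier to read than the raw boolean array.
# recover the names of the selected features from the support_ mask
selected_features = X.columns[selector.support_]
print("Optimal number of features:", selector.n_features_)
print("Selected features:", list(selected_features))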
Feature engineering
Where the goal of feature selection is to reduce the dimensionality of a data set by removing unnecessary features, feature engineering is about transforming existing features and constructing new features to improve the performance of a model.
There are three main reasons why we may need to perform feature engineering to develop an optimal model.
- Features cannot be used in their raw form. This includes features such as dates and times, where a machine learning model can only make use of the information contained within them if they are transformed into a numerical representation, e.g. an integer representation of the day of the week.
- Features can be used in their raw form, but the information contained within the feature is stronger if the data is aggregated or represented in a different way. An example here might be a feature containing the age of a person; aggregating the ages into buckets or bins may better represent the relationship to the target.
- A feature on its own does not have a strong enough statistical relationship with the target, but when combined with another feature it has a meaningful relationship. Let's say we have a data set with a number of features based on credit history for a group of customers and a target that denotes whether they have defaulted on a loan. Suppose we have a loan amount and a salary value. If we combined these into a new feature called "loan to salary ratio", this may give more or better information than those features alone. All three kinds of transformation are illustrated in the sketch after this list.
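The following sketch illustrates all three ideas on a small, entirely hypothetical loans dataframe; the column names and values are invented for illustration and are not part of the breast cancer data set used elsewhere in this post.
import pandas as pd

# a hypothetical loans dataframe, purely for illustration
loans = pd.DataFrame({
    'application_date': pd.to_datetime(['2021-01-04', '2021-02-17', '2021-03-09']),
    'age': [23, 41, 67],
    'loan_amount': [5000, 12000, 3000],
    'salary': [25000, 48000, 18000],
})

# 1. transform a raw date into a numerical representation (day of the week)
loans['application_day_of_week'] = loans['application_date'].dt.dayofweek

# 2. aggregate a continuous feature into buckets
loans['age_bucket'] = pd.cut(loans['age'], bins=[18, 30, 45, 60, 100],
                             labels=['18-30', '31-45', '46-60', '60+'])

# 3. combine two features into a ratio
loans['loan_to_salary_ratio'] = loans['loan_amount'] / loans['salary']

print(loans)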
Manual feature engineering
Feature engineering can be performed through data analysis, intuition and domain knowledge. Often similar techniques to those used for manual feature selection are performed.
For example, observing how features correlate to the target and how they perform in terms of feature importance indicates which features to explore further in terms of analysis.
If we go back to the example of the loan data set, let’s imagine we have an age variable which from a correlation plot appears to have some relationship to the target variable. We might find when we further analyse this relationship that those that default are skewed towards a particular age group. In this case, we might engineer a feature that picks out this age group and this may provide better information to the model.
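Continuing the hypothetical loans dataframe from the earlier sketch, such a feature could be as simple as a binary indicator for the suspect age band; the 25 to 35 range below is invented purely for illustration.
# hypothetical: flag the age band that the analysis suggested is skewed towards defaults
loans['in_high_default_age_band'] = loans['age'].between(25, 35).astype(int)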
Automated feature engineering
Manual feature engineering can be an extremely time-consuming process and requires a large amount of human intuition and domain knowledge to get right. There are tools available which have the ability to automatically synthesise a large number of new features.
The Featuretools python library is an example of a tool that can perform automated feature engineering on a data set. Let’s install this library and walk through an example of automated feature engineering on the breast cancer data set.
Featuretools can be installed via pip.
pip install featuretools
Featuretools is designed to work with relational data sets (tables or dataframes that can be joined together with a unique identifier). The featuretools library refers to each table as an entity.
The breast cancer data set consists of only one entity but we can still use featuretools to generate the features. However, we need to first create a new column containing a unique id. The below code uses the index of the data frame to create a new id column.
data['id'] = data.index + 1
Next, we import featuretools and create the entity set.
import featuretools as ft

es = ft.EntitySet(id = 'data')
es.entity_from_dataframe(entity_id = 'data', dataframe = data, index = 'id')
This registers the dataframe as a single entity within the entity set.
We can now use the library to perform feature synthesis.
feature_matrix, feature_names = ft.dfs(entityset=es, target_entity = 'data', max_depth = 2, verbose = 1, n_jobs = 3)
Featuretools has created 31 new features.
We can view all the new features by running the following.
feature_matrix.columns