Add Binary Flags for Missing Values for Machine Learning


Missing values can cause problems when modeling classification and regression prediction problems with machine learning algorithms.

A common approach is to replace missing values with a calculated statistic, such as the mean of the column. This allows the dataset to be modeled as normal but gives no indication to the model that the row originally contained missing values.

One approach to address this issue is to include additional binary flag input features that indicate whether a row or a column contained a missing value that was imputed. This additional information may or may not be helpful to the model in predicting the target value.
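The idea can be illustrated with a tiny contrived example (toy data, not the tutorial's dataset): impute the column mean with scikit-learn's SimpleImputer and append a binary column marking which rows contained a missing value. This is only a sketch of the approach described above.

# toy sketch: mean imputation plus a binary "row had a missing value" flag
from numpy import nan
from numpy import isnan
from numpy import hstack
from sklearn.impute import SimpleImputer
# three rows and two input columns; the second row has a missing value
X = [[1.0, 10.0], [nan, 20.0], [3.0, 30.0]]
# 1.0 if the row contains at least one NaN, 0.0 otherwise
flag = isnan(X).any(axis=1).astype(float).reshape((3, 1))
# replace each NaN with the mean of its column (here (1.0 + 3.0) / 2 = 2.0)
X_imputed = SimpleImputer(strategy='mean').fit_transform(X)
# append the flag as an extra input column
X_new = hstack((X_imputed, flag))
print(X_new)

Running this sketch should print a (3, 3) array in which the NaN has been replaced with 2.0 and the new third column is 1.0 only for the middle row.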

In this tutorial, you will discover how to add binary flags for missing values for modeling.

After completing this tutorial, you will know:

  • How to load and evaluate models with statistical imputation on a classification dataset with missing values.
  • How to add a flag that indicates if a row has one or more missing values and evaluate models with this new feature.
  • How to add a flag for each input variable that has missing values and evaluate models with these new features.

Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.


Photo by keith o connell, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Imputing the Horse Colic Dataset
  2. Model With a Binary Flag for Missing Values
  3. Model With Indicators of All Missing Values

Imputing the Horse Colic Dataset

The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died.

There are 300 rows, 27 input variables, and one output variable. It is a binary classification prediction task that involves predicting 1 if the horse lived and 2 if the horse died.

There are many fields we could select to predict in this dataset. In this case, we will predict whether the problem was surgical or not (column index 23), making it a binary classification problem.

The dataset has numerous missing values across many of the columns, where each missing value is marked with a question mark character (“?”).

The sample below shows rows from the dataset with marked missing values.

2,1,530101,38.50,66,28,3,3,?,2,5,4,4,?,?,?,3,5,45.00,8.40,?,?,2,2,11300,00000,00000,2
1,1,534817,39.2,88,20,?,?,4,1,3,4,2,?,?,?,4,2,50,85,2,2,3,2,02208,00000,00000,2
2,1,530334,38.30,40,24,1,1,3,1,3,3,1,?,?,?,1,1,33.00,6.70,?,?,1,2,00000,00000,00000,1
1,9,5290409,39.10,164,84,4,1,6,2,2,4,4,1,2,5.00,3,?,48.00,7.20,3,5.30,2,1,02208,00000,00000,1
...


You can learn more about the dataset here:

  • Horse Colic Dataset (horse-colic.csv): https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv

No need to download the dataset as we will download it automatically in the worked examples.

It is a best practice to mark missing values with a NaN (not a number) value when loading a dataset in Python.

We can load the dataset using the read_csv() Pandas function and specify the “na_values” argument so that values of ‘?’ are loaded as missing and marked with a NaN value.

The example below downloads the dataset, marks “?” values as NaN (missing) and summarizes the shape of the dataset.

# summarize the horse colic dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
# split into input and output elements
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print(X.shape, y.shape)


Running the example downloads the dataset and reports the number of rows and columns, matching our expectations.

(300, 27) (300,)

Next, we can evaluate a model on this dataset.

We can use the SimpleImputer class to perform statistical imputation and replace the missing values with the mean of each column. We can then fit a random forest model on the dataset.
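As a quick sanity check (a short sketch, not part of the original listing), we can confirm that the imputer removes every NaN from the input data before it reaches the model. The first count depends on the dataset; the second should be zero.

# sanity check: count NaN values before and after mean imputation
from numpy import isnan
from pandas import read_csv
from sklearn.impute import SimpleImputer
# load dataset and split into input and output elements as before
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X = data[:, ix]
print('NaN values before imputation: %d' % isnan(X).sum())
Xt = SimpleImputer(strategy='mean').fit_transform(X)
print('NaN values after imputation: %d' % isnan(Xt).sum())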

For more on how to use the SimpleImputer class, see the separate tutorial on statistical imputation for missing values.

To achieve this, we will define a pipeline that first performs imputation, then fits the model and evaluates this modeling pipeline using repeated stratified k-fold cross-validation with three repeats and 10 folds.

The complete example is listed below.

# evaluate mean imputation and random forest for the horse colic dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# define modeling pipeline
model = RandomForestClassifier()
imputer = SimpleImputer()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))


Running the example evaluates the random forest with mean statistical imputation on the horse colic dataset.

Your specific results may vary given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, the pipeline achieved an estimated classification accuracy of about 86.2 percent.

Mean Accuracy: 0.862 (0.056)

Next, let’s see if we can improve the performance of the model by providing more information about missing values.

Want to Get Started With Data Preparation?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

Model With a Binary Flag for Missing Values

In the previous section, we replaced missing values with a calculated statistic.

The model is unaware that missing values were replaced.

It is possible that knowledge of whether a row contains a missing value or not will be useful to the model when making a prediction.

One approach to exposing the model to this knowledge is to provide an additional column with a binary flag indicating whether the row had a missing value or not.

  • 0: Row does not contain a missing value.
  • 1: Row contains a missing value (which was/will be imputed).

This can be achieved directly on the loaded dataset. First, we can sum the values for each row to create a new column; if a row contains at least one NaN, its sum will be NaN.

We can then mark all values in the new column as 1 if they contain a NaN, or 0 otherwise.

Finally, we can add this column to the loaded dataset.
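As an aside, the same flag can be computed more directly with NumPy's isnan() function by asking whether each row contains any NaN. The sketch below is an alternative to the summation trick used in the listing that follows, not a replacement for it.

# alternative: flag rows that contain any NaN directly with isnan()
from numpy import isnan
from numpy import hstack
from pandas import read_csv
# load dataset and split into input and output elements as before
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X = data[:, ix]
# 1.0 if the row has at least one missing value, 0.0 otherwise
a = isnan(X).any(axis=1).astype(float).reshape((len(X), 1))
X = hstack((X, a))
print(X.shape)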

Tying this together, the complete example of adding a binary flag to indicate one or more missing values in each row is listed below.

# add a binary flag that indicates if a row contains a missing value
from numpy import isnan
from numpy import hstack
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print(X.shape)
# sum each row where rows with a nan will sum to nan
a = X.sum(axis=1)
# mark all nan as 1
a[isnan(a)] = 1
# mark all non-nan as 0
a[~isnan(a)] = 0
a = a.reshape((len(a), 1))
# add to the dataset as another column
X = hstack((X, a))
print(X.shape)


Running the example first downloads the dataset and reports the number of rows and columns, as expected.

Then the new binary variable indicating whether a row contains a missing value is created and added to the end of the input variables. The shape of the input data is then reported, confirming the addition of the feature, from 27 to 28 columns.

(300, 27)
(300, 28)


We can then evaluate the model as we did in the previous section with the additional binary flag and see if it impacts model performance.

The complete example is listed below.

# evaluate model performance with a binary flag for missing values and imputed missing
from numpy import isnan
from numpy import hstack
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# sum each row where rows with a nan will sum to nan
a = X.sum(axis=1)
# mark all nan as 1
a[isnan(a)] = 1
# mark all non-nan as 0
a[~isnan(a)] = 0
a = a.reshape((len(a), 1))
# add to the dataset as another column
X = hstack((X, a))
# define modeling pipeline
model = RandomForestClassifier()
imputer = SimpleImputer()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))


Running the example reports the mean and standard deviation classification accuracy on the horse colic dataset with the additional feature and imputation.

Your specific results may vary given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we see a modest lift in performance from 86.2 percent to 86.3 percent. The difference is small and may not be statistically significant.

Mean Accuracy: 0.863 (0.055)

Most rows in this dataset have a missing value, and this approach might be more beneficial on datasets with fewer missing values.
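If you want to verify how widespread missing values are in a dataset, a few lines of pandas are enough. The sketch below (not part of the original tutorial) counts the rows that contain at least one missing value among the input columns.

# count how many rows have at least one missing value among the input columns
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# drop the target column (index 23) and flag rows that contain any NaN
inputs = dataframe.drop(columns=[23])
n_rows_missing = inputs.isna().any(axis=1).sum()
print('%d of %d rows contain at least one missing value' % (n_rows_missing, len(inputs)))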

Next, let’s see if we can provide even more information about the missing values to the model.

Model With Indicators of All Missing Values

In the previous section, we added one additional column to indicate whether a row contains a missing value or not.

One step further is to indicate whether each input value was missing and imputed or not. This effectively adds one additional column for each input variable that contains missing values and may offer benefit to the model.

This can be achieved by setting the “add_indicator” argument to True when defining the SimpleImputer instance.

...
# impute and mark missing values
X = SimpleImputer(add_indicator=True).fit_transform(X)


We can demonstrate this with a worked example.

The example below loads the horse colic dataset as before, then imputes the missing values on the entire dataset and adds an indicator variable for each input variable that has missing values.

# impute and add indicators for columns with missing values
from pandas import read_csv
from sklearn.impute import SimpleImputer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
# split into input and output elements
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print(X.shape)
# impute and mark missing values
X = SimpleImputer(strategy='mean', add_indicator=True).fit_transform(X)
print(X.shape)


Running the example first downloads and summarizes the shape of the dataset as expected, then applies the imputation and adds the binary (1 and 0 values) columns indicating whether each row contains a missing value for a given input variable.

We can see that the number of input variables has increased from 27 to 48, indicating the addition of 21 binary input variables, and in turn, that 21 of the 27 input variables must contain at least one missing value.

(300, 27)
(300, 48)
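As an aside, scikit-learn also exposes the indicator columns as a standalone transformer, MissingIndicator, which produces only the binary flags without performing any imputation. By default it creates one column per input variable that contains missing values, so its output shape also confirms how many such variables there are. A minimal sketch (not part of the original tutorial):

# standalone indicator columns with MissingIndicator (no imputation performed)
from pandas import read_csv
from sklearn.impute import MissingIndicator
# load dataset and split into input and output elements as before
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X = data[:, ix]
# one boolean column per input variable that contains missing values
indicators = MissingIndicator().fit_transform(X)
print(indicators.shape)

If the claim above that 21 input variables contain missing values holds, this should print (300, 21).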


Next, we can evaluate the model with this additional information.

The complete example below demonstrates this.

# evaluate imputation with added indicators features on the horse colic dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# define modeling pipeline
model = RandomForestClassifier()
imputer = SimpleImputer(add_indicator=True)
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))


Running the example reports the mean and standard deviation classification accuracy on the horse colic dataset with the additional indicators features and imputation.

Your specific results may vary given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we see a nice lift in performance from 86.3 percent in the previous section to 86.7 percent.

This suggests that adding one flag per imputed column may be a better strategy on this dataset with the chosen model.

Mean Accuracy: 0.867 (0.055)

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

Summary

In this tutorial, you discovered how to add binary flags for missing values for modeling.

Specifically, you learned:

  • How to load and evaluate models with statistical imputation on a classification dataset with missing values.
  • How to add a flag that indicates if a row has one or more missing values and evaluate models with this new feature.
  • How to add a flag for each input variable that has missing values and evaluate models with these new features.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Get a Handle on Modern Data Preparation!


Prepare Your Machine Learning Data in Minutes

...with just a few lines of Python code

Discover how in my new Ebook:

Data Preparation for Machine Learning

It provides self-study tutorials with full working code on:

Feature Selection, RFE, Data Cleaning, Data Transforms, Scaling, Dimensionality Reduction, and much more...

Bring Modern Data Preparation Techniques to Your Machine Learning Projects

See What's Inside
