Ensemble Learning: Stacking

Stacking is a model-combination technique that merges information from several predictive models into a new model. Concretely, every trained base model predicts on the entire training set, and the prediction of the j-th base model for the i-th training sample becomes the j-th feature of the i-th sample in a new training set, on which a meta-model is then trained. Prediction works the same way: the base models first transform the test set into a new test set, and the meta-model then predicts on that.

Of course, stacking does not always deliver striking gains. When the base models differ noticeably from one another it works very well; when the models are all similar, the improvement is usually unremarkable.
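
A minimal sketch of the construction just described (illustrative only; the function and data below are hypothetical):

import numpy as np

def stack_features(base_preds):
    # base_preds: one 1-D prediction array per base model, each of length n_samples.
    # Column j of the result is base model j's prediction, i.e. feature j of the new set.
    return np.column_stack(base_preds)

# e.g. three base models on four samples give a (4, 3) level-2 feature matrix
new_train = stack_features([np.array([.1, .9, .4, .7]),
                            np.array([.2, .8, .5, .6]),
                            np.array([.3, .7, .3, .8])])
print(new_train.shape)  # (4, 3)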

Implementation

We take the data from Kaggle's Porto Seguro's Safe Driver Prediction competition as the running example.

The competition opened on Kaggle on September 30 and is hosted by Porto Seguro, one of Brazil's largest auto and homeowner insurance companies. Participants build machine learning models that predict, from data about an auto-insurance policyholder, whether that policyholder will file a claim in the following year. All of the provided data have been preprocessed, and because the features carry no real-world meaning, feature engineering cannot rely on common sense or domain knowledge.

Download the data from the competition's Data page.

Loading the required modules

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
import xgboost as xgb
import lightgbm as lgb
import time
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 1000)

Feature Engineering

This part applies some simple preprocessing to the data, mainly:

  • Categorical feature encoding
    • Frequency Encoding
    • Binary Encoding
  • Feature Reduction

1.1 Frequency Encoding
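
Frequency encoding replaces each category level with how often it occurs in the training set, turning a nominal column into a single numeric one. A tiny illustration on made-up data:

import pandas as pd
s = pd.Series(['a', 'b', 'a', 'a', 'c'])
print(s.map(s.value_counts()).tolist())  # [3, 1, 3, 3, 1]

The implementation on the competition data: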

# read the raw data sets
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
sample_submission=pd.read_csv('sample_submission.csv')
def freq_encoding(cols, train_df, test_df):
    result_train_df=pd.DataFrame()
    result_test_df=pd.DataFrame()
    for col in cols:
        col_freq=col+'_freq'
        # count the occurrences of each category level in the training set
        freq=train_df[col].value_counts()
        freq=pd.DataFrame(freq)
        freq.reset_index(inplace=True)
        freq.columns=[col, col_freq]
        temp_train_df=pd.merge(train_df[[col]], freq, how='left', on=col)
        temp_train_df.drop([col], axis=1, inplace=True)
        temp_test_df=pd.merge(test_df[[col]], freq, how='left', on=col)
        temp_test_df.drop([col], axis=1, inplace=True)
        # categories that appear in test but not in train get frequency 0
        temp_test_df.fillna(0, inplace=True)
        temp_test_df[col_freq]=temp_test_df[col_freq].astype(np.int32)
        if result_train_df.shape[0]==0:
            result_train_df=temp_train_df
            result_test_df=temp_test_df
        else:
            result_train_df=pd.concat([result_train_df, temp_train_df], axis=1)
            result_test_df=pd.concat([result_test_df, temp_test_df], axis=1)
    return result_train_df, result_test_df

cat_cols=['ps_ind_02_cat','ps_car_04_cat', 'ps_car_09_cat',
          'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_11_cat']
train_freq, test_freq=freq_encoding(cat_cols, train, test)
# merge the new frequency features back into the original data
train=pd.concat([train, train_freq], axis=1)
test=pd.concat([test,test_freq], axis=1)

1.2 Binary Encoding
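
Binary encoding writes each category's integer id in base 2 and stores each bit in its own 0/1 column, so k levels need only about log2(k) columns instead of the k columns one-hot encoding would use. A small, made-up illustration:

print("{0:b}".format(5).zfill(3))  # '101': id 5 becomes three bit columns 1, 0, 1

The implementation: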

def binary_encoding(train_df, test_df, feat):
    # find the largest category id across train and test
    train_feat_max = train_df[feat].max()
    test_feat_max = test_df[feat].max()
    if train_feat_max > test_feat_max:
        feat_max = train_feat_max
    else:
        feat_max = test_feat_max
    # recode missing values (-1) as feat_max + 1
    train_df.loc[train_df[feat] == -1, feat] = feat_max + 1
    test_df.loc[test_df[feat] == -1, feat] = feat_max + 1
    # take the sorted union of the unique category ids in train and test
    union_val = np.union1d(train_df[feat].unique(), test_df[feat].unique())
    max_dec = union_val.max()
    max_bin_len = len("{0:b}".format(max_dec))
    index = np.arange(len(union_val))
    columns = list([feat])
    bin_df = pd.DataFrame(index=index, columns=columns)
    bin_df[feat] = union_val
    feat_bin = bin_df[feat].apply(lambda x: "{0:b}".format(x).zfill(max_bin_len))
    splitted = feat_bin.apply(lambda x: pd.Series(list(x)).astype(np.uint8))
    splitted.columns = [feat + '_bin_' + str(x) for x in splitted.columns]
    bin_df = bin_df.join(splitted)
    train_df = pd.merge(train_df, bin_df, how='left', on=[feat])
    test_df = pd.merge(test_df, bin_df, how='left', on=[feat])
    return train_df, test_df

cat_cols=['ps_ind_02_cat','ps_car_04_cat', 'ps_car_09_cat',
          'ps_ind_05_cat', 'ps_car_01_cat']
for col in cat_cols:
    train, test=binary_encoding(train, test, col)

1.3 Feature Reduction

Whether to drop the original columns can be decided by comparing the CV score before and after dropping them.

col_to_drop = train.columns[train.columns.str.startswith('ps_calc_')]
train.drop(col_to_drop, axis=1, inplace=True)
test.drop(col_to_drop, axis=1, inplace=True)

Cross-validation: OOF features

Note: the model hyperparameters used below have already been tuned, so tune each model before generating its OOF features. The out-of-fold (OOF) construction matters because each training sample's meta-feature is produced by a model that never saw that sample, which keeps label information from leaking into the next level.

def auc_to_gini_norm(auc_score):
    return 2*auc_score-1
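
The normalized Gini used to score this competition is just a rescaled AUC (Gini = 2*AUC - 1), so a random model scores 0 and a perfect one scores 1. For example:

print(auc_to_gini_norm(0.64))  # 0.28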

Sklearn K-fold & OOF function

This version targets estimators from the sklearn Python module.

def cross_validate_sklearn(clf, x_train, y_train , x_test, kf,scale=False, verbose=True):
    '''
    :param clf: the model
    :param x_train: training features
    :param y_train: training labels
    :param x_test: test features
    :param kf: the cross-validation splitter (e.g. a StratifiedKFold instance)
    :param scale: whether to standardize the features
    :param verbose: whether to print per-fold scores
    :return:
    '''
    start_time=time.time()
    # initialize the OOF feature arrays
    train_pred = np.zeros((x_train.shape[0]))
    test_pred = np.zeros((x_test.shape[0]))
    # generate the OOF features fold by fold
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):
        x_train_kf, x_val_kf = x_train.loc[train_index, :], x_train.loc[val_index, :]
        y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
        # standardize if required, e.g. for linear models
        if scale:
            scaler = StandardScaler().fit(x_train_kf.values)
            x_train_kf_values = scaler.transform(x_train_kf.values)
            x_val_kf_values = scaler.transform(x_val_kf.values)
            x_test_values = scaler.transform(x_test.values)
        else:
            x_train_kf_values = x_train_kf.values
            x_val_kf_values = x_val_kf.values
            x_test_values = x_test.values
        # fit the model
        clf.fit(x_train_kf_values, y_train_kf.values)
        # predict probabilities
        val_pred=clf.predict_proba(x_val_kf_values)[:,1]
        train_pred[val_index] += val_pred
        y_test_preds = clf.predict_proba(x_test_values)[:,1]
        test_pred += y_test_preds
        fold_auc = roc_auc_score(y_val_kf.values, val_pred)
        fold_gini_norm = auc_to_gini_norm(fold_auc)
        if verbose:
            print('fold cv {} AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(i, fold_auc, fold_gini_norm))
    # average the test-set predictions over the folds
    test_pred /= kf.n_splits
    cv_auc = roc_auc_score(y_train, train_pred)
    cv_gini_norm = auc_to_gini_norm(cv_auc)
    cv_score = [cv_auc, cv_gini_norm]
    if verbose:
        print('cv AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(cv_auc, cv_gini_norm))
        end_time = time.time()
        print("it takes %.3f seconds to perform cross validation" % (end_time - start_time))
    return cv_score, train_pred, test_pred

Xgboost K-fold & OOF function

Next we construct OOF features with the Kaggle workhorses xgboost and lightgbm, using their native modules. You could use their sklearn APIs instead, but the native modules expose more functionality.

# convert predicted probabilities into normalized ranks
def probability_to_rank(prediction, scaler=1):
    pred_df=pd.DataFrame(columns=['probability'])
    pred_df['probability']=prediction
    pred_df['rank']=pred_df['probability'].rank()/len(prediction)*scaler
    return pred_df['rank'].values
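
Rank-transforming spreads each model's predictions uniformly over (0, 1], which makes differently calibrated models comparable before they are averaged. For example:

print(probability_to_rank(np.array([0.2, 0.9, 0.5])))  # approximately [0.33, 1.0, 0.67]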
def cross_validate_xgb(params, x_train, y_train, x_test, kf, cat_cols=[], verbose=True,
                       verbose_eval=50, num_boost_round=4000, use_rank=True):
    '''
    :param params: model parameters
    :param x_train: training features
    :param y_train: training labels
    :param x_test: test features
    :param kf: the cross-validation splitter
    :param cat_cols: categorical feature names
    :param verbose: whether to print per-fold scores
    :param verbose_eval: xgboost evaluation print interval
    :param num_boost_round: maximum number of boosting rounds
    :param use_rank: whether to rank-transform the predictions
    :return:
    '''
    start_time=time.time()
    # initialize the OOF feature arrays
    train_pred = np.zeros((x_train.shape[0]))
    test_pred = np.zeros((x_test.shape[0]))
    # generate the OOF features fold by fold
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):
        x_train_kf, x_val_kf = x_train.loc[train_index, :], x_train.loc[val_index, :]
        y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
        x_test_kf=x_test.copy()
        # xgboost's native data format
        d_train_kf = xgb.DMatrix(x_train_kf, label=y_train_kf)
        d_val_kf = xgb.DMatrix(x_val_kf, label=y_val_kf)
        d_test = xgb.DMatrix(x_test_kf)
        # train the xgboost model with early stopping on the validation fold
        bst = xgb.train(params, d_train_kf, num_boost_round=num_boost_round,
                        evals=[(d_train_kf, 'train'), (d_val_kf, 'val')], verbose_eval=verbose_eval,
                        early_stopping_rounds=50)
        val_pred = bst.predict(d_val_kf, ntree_limit=bst.best_ntree_limit)
        if use_rank:
            train_pred[val_index] += probability_to_rank(val_pred)
            test_pred+=probability_to_rank(bst.predict(d_test))
        else:
            train_pred[val_index] += val_pred
            test_pred+=bst.predict(d_test)
        fold_auc = roc_auc_score(y_val_kf.values, val_pred)
        fold_gini_norm = auc_to_gini_norm(fold_auc)
        if verbose:
            print('fold cv {} AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(i, fold_auc, fold_gini_norm))
    test_pred /= kf.n_splits
    cv_auc = roc_auc_score(y_train, train_pred)
    cv_gini_norm = auc_to_gini_norm(cv_auc)
    cv_score = [cv_auc, cv_gini_norm]
    if verbose:
        print('cv AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(cv_auc, cv_gini_norm))
        end_time = time.time()
        print("it takes %.3f seconds to perform cross validation" % (end_time - start_time))
    return cv_score, train_pred, test_pred

LightGBM K-fold & OOF function

Analogous to the xgboost version.

def cross_validate_lgb(params, x_train, y_train, x_test, kf, cat_cols=[],
                       verbose=True, verbose_eval=50, use_cat=True, use_rank=True):
    '''
    :param params: model parameters
    :param x_train: training features
    :param y_train: training labels
    :param x_test: test features
    :param kf: the cross-validation splitter
    :param cat_cols: categorical feature names
    :param verbose: whether to print per-fold scores
    :param verbose_eval: lightgbm evaluation print interval
    :param use_cat: whether to declare categorical features to lightgbm
    :param use_rank: whether to rank-transform the predictions
    :return:
    '''
    start_time = time.time()
    # initialize the OOF feature arrays
    train_pred = np.zeros((x_train.shape[0]))
    test_pred = np.zeros((x_test.shape[0]))
    if len(cat_cols)==0: use_cat=False
    # generate the OOF features fold by fold
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):
        x_train_kf, x_val_kf = x_train.loc[train_index, :], x_train.loc[val_index, :]
        y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
        # optionally declare categorical features (lightgbm can find optimal splits on them)
        if use_cat:
            lgb_train = lgb.Dataset(x_train_kf, y_train_kf, categorical_feature=cat_cols)
            lgb_val = lgb.Dataset(x_val_kf, y_val_kf, reference=lgb_train, categorical_feature=cat_cols)
        else:
            lgb_train = lgb.Dataset(x_train_kf, y_train_kf)
            lgb_val = lgb.Dataset(x_val_kf, y_val_kf, reference=lgb_train)
        # train the lightgbm model with early stopping on the validation fold
        gbm = lgb.train(params,
                        lgb_train,
                        num_boost_round=4000,
                        valid_sets=lgb_val,
                        early_stopping_rounds=30,
                        verbose_eval=verbose_eval)
        val_pred = gbm.predict(x_val_kf)
        if use_rank:
            train_pred[val_index] += probability_to_rank(val_pred)
            test_pred += probability_to_rank(gbm.predict(x_test))
        else:
            train_pred[val_index] += val_pred
            test_pred += gbm.predict(x_test)
        fold_auc = roc_auc_score(y_val_kf.values, val_pred)
        fold_gini_norm = auc_to_gini_norm(fold_auc)
        if verbose:
            print('fold cv {} AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(i, fold_auc, fold_gini_norm))
    test_pred /= kf.n_splits
    cv_auc = roc_auc_score(y_train, train_pred)
    cv_gini_norm = auc_to_gini_norm(cv_auc)
    cv_score = [cv_auc, cv_gini_norm]
    if verbose:
        print('cv AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(cv_auc, cv_gini_norm))
        end_time = time.time()
        print("it takes %.3f seconds to perform cross validation" % (end_time - start_time))
    return cv_score, train_pred, test_pred

Generate level 1 OOF predictions

With the OOF functions defined above, we can now construct the OOF outputs for each level. First define the data and the cross-validation splitter:

drop_cols=['id','target']
y_train=train['target']
x_train=train.drop(drop_cols, axis=1)
x_test=test.drop(['id'], axis=1)
kf=StratifiedKFold(n_splits=5, shuffle=True, random_state=2017)

Below, several common models produce the level 1 outputs:

Random Forest

rf=RandomForestClassifier(n_estimators=200, n_jobs=6, min_samples_split=5, max_depth=7,
                          criterion='gini', random_state=0)
outcomes =cross_validate_sklearn(rf, x_train, y_train ,x_test, kf, scale=False, verbose=True)
rf_cv=outcomes[0]
rf_train_pred=outcomes[1]
rf_test_pred=outcomes[2]
rf_train_pred_df=pd.DataFrame(columns=['prediction_probability'], data=rf_train_pred)
rf_test_pred_df=pd.DataFrame(columns=['prediction_probability'], data=rf_test_pred)
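
Extra Trees

The level 2 feature lists below also expect ExtraTrees predictions ('et'). The corresponding block is missing from the original post, so here is a sketch mirroring the Random Forest setup (the hyperparameters are assumptions):

et=ExtraTreesClassifier(n_estimators=200, n_jobs=6, min_samples_split=5, max_depth=7,
                        criterion='gini', random_state=0)  # assumed: same settings as rf
outcomes =cross_validate_sklearn(et, x_train, y_train ,x_test, kf, scale=False, verbose=True)
et_cv=outcomes[0]
et_train_pred=outcomes[1]
et_test_pred=outcomes[2]
et_train_pred_df=pd.DataFrame(columns=['prediction_probability'], data=et_train_pred)
et_test_pred_df=pd.DataFrame(columns=['prediction_probability'], data=et_test_pred)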

Logistic Regression

logit=LogisticRegression(random_state=0, C=0.5)
outcomes = cross_validate_sklearn(logit, x_train, y_train ,x_test, kf, scale=True, verbose=True)
logit_cv=outcomes[0]
logit_train_pred=outcomes[1]
logit_test_pred=outcomes[2]
logit_train_pred_df=pd.DataFrame(columns=['prediction_probability'], data=logit_train_pred)
logit_test_pred_df=pd.DataFrame(columns=['prediction_probability'], data=logit_test_pred)

BernoulliNB

A single model of this kind usually cannot match xgb or lgb on its own, but it adds diversity to the pool of predictions, which helps the stacking (a quick diversity check is sketched after the code below).

nb=BernoulliNB()
outcomes =cross_validate_sklearn(nb, x_train, y_train ,x_test, kf, scale=True, verbose=True)
nb_cv=outcomes[0]
nb_train_pred=outcomes[1]
nb_test_pred=outcomes[2]
nb_train_pred_df=pd.DataFrame(columns=['prediction_probability'], data=nb_train_pred)
nb_test_pred_df=pd.DataFrame(columns=['prediction_probability'], data=nb_test_pred)
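
Since diversity is the reason to include BernoulliNB, one sanity check (a sketch using the OOF arrays computed above) is to correlate its predictions with another model's; lower correlation suggests more complementary information:

print(np.corrcoef(rf_train_pred, nb_train_pred)[0, 1])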

xgboost

xgb_params = {
    "booster"  :  "gbtree",
    "objective"         :  "binary:logistic",
    "tree_method": "hist",
    "eval_metric": "auc",
    "eta": 0.1,
    "max_depth": 5,
    "min_child_weight": 10,
    "gamma": 0.70,
    "subsample": 0.76,
    "colsample_bytree": 0.95,
    "nthread": 6,
    "seed": 0,
    'silent': 1
}
outcomes=cross_validate_xgb(xgb_params, x_train, y_train, x_test, kf, use_rank=False, verbose_eval=False)
xgb_cv=outcomes[0]
xgb_train_pred=outcomes[1]
xgb_test_pred=outcomes[2]
xgb_train_pred_df=pd.DataFrame(columns=['prediction_probability'], data=xgb_train_pred)
xgb_test_pred_df=pd.DataFrame(columns=['prediction_probability'], data=xgb_test_pred)

lightGBM

lgb_params = {
    'task': 'train',
    'boosting_type': 'dart',
    'objective': 'binary',
    'metric': {'auc'},
    'num_leaves': 22,
    'min_sum_hessian_in_leaf': 20,
    'max_depth': 5,
    'learning_rate': 0.1,  # 0.618580
    'num_threads': 6,
    'feature_fraction': 0.6894,
    'bagging_fraction': 0.4218,
    'max_drop': 5,
    'drop_rate': 0.0123,
    'min_data_in_leaf': 10,
    'bagging_freq': 1,
    'lambda_l1': 1,
    'lambda_l2': 0.01,
    'verbose': 1
}

cat_cols=['ps_ind_02_cat','ps_car_04_cat', 'ps_car_09_cat','ps_ind_05_cat', 'ps_car_01_cat']
outcomes=cross_validate_lgb(lgb_params,x_train, y_train ,x_test,kf, cat_cols, use_cat=True,
                            verbose_eval=False, use_rank=False)
lgb_cv=outcomes[0]
lgb_train_pred=outcomes[1]
lgb_test_pred=outcomes[2]
lgb_train_pred_df=pd.DataFrame(columns=['prediction_probability'], data=lgb_train_pred)
lgb_test_pred_df=pd.DataFrame(columns=['prediction_probability'], data=lgb_test_pred)

With the level 1 features generated, we move on to the level 2 ensemble.

Level 2 ensemble

Generate L1 output dataframe

The level 1 OOF features become the input features for level 2:

columns=['rf','et','logit','nb','xgb','lgb']
train_pred_df_list=[rf_train_pred_df, et_train_pred_df, logit_train_pred_df, nb_train_pred_df,
                    xgb_train_pred_df, lgb_train_pred_df]
test_pred_df_list=[rf_test_pred_df, et_test_pred_df, logit_test_pred_df, nb_test_pred_df,xgb_test_pred_df, lgb_test_pred_df]
lv1_train_df=pd.DataFrame(columns=columns)
lv1_test_df=pd.DataFrame(columns=columns)
for i in range(0,len(columns)):
    lv1_train_df[columns[i]]=train_pred_df_list[i]['prediction_probability']
    lv1_test_df[columns[i]]=test_pred_df_list[i]['prediction_probability']

Level 2 XGB

Train an xgboost model on the level 1 OOF features and output level 2 OOF features:

xgb_lv2_outcomes=cross_validate_xgb(xgb_params, lv1_train_df, y_train, lv1_test_df, kf,
                                          verbose=True, verbose_eval=False, use_rank=False)
xgb_lv2_cv=xgb_lv2_outcomes[0]
xgb_lv2_train_pred=xgb_lv2_outcomes[1]
xgb_lv2_test_pred=xgb_lv2_outcomes[2]

Level 2 LightGBM

lgb_lv2_outcomes=cross_validate_lgb(lgb_params,lv1_train_df, y_train ,lv1_test_df,kf, [], use_cat=False,
                                    verbose_eval=False, use_rank=True)
lgb_lv2_cv=lgb_lv2_outcomes[0]
lgb_lv2_train_pred=lgb_lv2_outcomes[1]
lgb_lv2_test_pred=lgb_lv2_outcomes[2]

Level 2 Random Forest

rf_lv2=RandomForestClassifier(n_estimators=200, n_jobs=6, min_samples_split=5, max_depth=7,
                          criterion='gini', random_state=0)
rf_lv2_outcomes = cross_validate_sklearn(rf_lv2, lv1_train_df, y_train ,lv1_test_df, kf,
                                            scale=True, verbose=True)
rf_lv2_cv=rf_lv2_outcomes[0]
rf_lv2_train_pred=rf_lv2_outcomes[1]
rf_lv2_test_pred=rf_lv2_outcomes[2]

Level 2 Logistic Regression

logit_lv2=LogisticRegression(random_state=0, C=0.5)
logit_lv2_outcomes = cross_validate_sklearn(logit_lv2, lv1_train_df, y_train ,lv1_test_df, kf,
                                            scale=True, verbose=True)
logit_lv2_cv=logit_lv2_outcomes[0]
logit_lv2_train_pred=logit_lv2_outcomes[1]
logit_lv2_test_pred=logit_lv2_outcomes[2]

Level 3 ensemble

This works like the level 2 ensemble; the level 1 OOF features can of course be added to the inputs as well (see the sketch after the next code block).

Generate L2 output dataframe

lv2_columns=['rf_lv2', 'logit_lv2', 'xgb_lv2','lgb_lv2']
train_lv2_pred_list=[rf_lv2_train_pred, logit_lv2_train_pred, xgb_lv2_train_pred, lgb_lv2_train_pred]
test_lv2_pred_list=[rf_lv2_test_pred, logit_lv2_test_pred, xgb_lv2_test_pred, lgb_lv2_test_pred]
lv2_train=pd.DataFrame(columns=lv2_columns)
lv2_test=pd.DataFrame(columns=lv2_columns)
for i in range(0,len(lv2_columns)):
    lv2_train[lv2_columns[i]]=train_lv2_pred_list[i]
    lv2_test[lv2_columns[i]]=test_lv2_pred_list[i]
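
As mentioned above, the level 3 inputs can also include the level 1 OOF features; a minimal sketch of that variant (not used below):

lv2_train_ext=pd.concat([lv2_train, lv1_train_df], axis=1)
lv2_test_ext=pd.concat([lv2_test, lv1_test_df], axis=1)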

Level 3 XGB

xgb_lv3_params = {
    "booster"  :  "gbtree",
    "objective"         :  "binary:logistic",
    "tree_method": "hist",
    "eval_metric": "auc",
    "eta": 0.1,
    "max_depth": 2,
    "min_child_weight": 10,
    "gamma": 0.70,
    "subsample": 0.76,
    "colsample_bytree": 0.95,
    "nthread": 6,
    "seed": 0,
    'silent': 1
}

xgb_lv3_outcomes=cross_validate_xgb(xgb_lv3_params, lv2_train, y_train, lv2_test, kf,
                                          verbose=True, verbose_eval=False, use_rank=True)
xgb_lv3_cv=xgb_lv3_outcomes[0]
xgb_lv3_train_pred=xgb_lv3_outcomes[1]
xgb_lv3_test_pred=xgb_lv3_outcomes[2]

Level 3 Logistic Regression

logit_lv3=LogisticRegression(random_state=0, C=0.5)
logit_lv3_outcomes = cross_validate_sklearn(logit_lv3, lv2_train, y_train ,lv2_test, kf,
                                            scale=True, verbose=True)
logit_lv3_cv=logit_lv3_outcomes[0]
logit_lv3_train_pred=logit_lv3_outcomes[1]
logit_lv3_test_pred=logit_lv3_outcomes[2]

Average L3 outputs & Submission Generation

weight_avg=logit_lv3_train_pred*0.5+ xgb_lv3_train_pred*0.5
print(auc_to_gini_norm(roc_auc_score(y_train, weight_avg)))

submission=sample_submission.copy()
submission['target']=logit_lv3_test_pred*0.5+ xgb_lv3_test_pred*0.5
filename='stacking_demonstration.csv.gz'
submission.to_csv(filename,compression='gzip', index=False)

Conclusion

The 3-level stack above may not be the best possible, but it should lead you to more useful insights. Is a higher stacking level always better? In general, most solutions only go to level 2 or level 3. My own usual strategy is:

  • usually stack only up to level 2, then average the level 2 results;
  • apply the same stacking strategy with different random seeds;
  • average the results from all seeds.

Update: an ensemble-weight function

The ensemble weights trained here are typically the weights of the final linear blend over the last few models.

#encoding:utf-8

'''
Optimizes the weights of the final linear blend of models.
'''
import pandas as pd
import numpy as np
from scipy.optimize import minimize  # the optimizer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
import os
os.system("ls ../input")
# dataset
train = pd.read_csv("../input/train.csv")
print("Training set has {0[0]} rows and {0[1]} columns".format(train.shape))
labels = train['target']
train.drop(['target', 'id'], axis=1, inplace=True)
print(train.head())
### split the data; the held-out part is used to train the ensemble weights
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.05, random_state=1234)
train_index, test_index = next(sss.split(train, labels))
# data split
train_x, train_y = train.values[train_index], labels.values[train_index]
test_x, test_y = train.values[test_index], labels.values[test_index]
### list of classifiers
clfs = []
rfc = RandomForestClassifier(n_estimators=50, random_state=4141, n_jobs=-1)
rfc.fit(train_x, train_y)
print('RFC LogLoss {score}'.format(score=log_loss(test_y, rfc.predict_proba(test_x))))
clfs.append(rfc)
### in practice you would use xgboost, lightgbm or a neural network here;
### the logistic model below is just for demonstration
logreg = LogisticRegression()
logreg.fit(train_x, train_y)
print('LogisticRegression LogLoss {score}'.format(score=log_loss(test_y, logreg.predict_proba(test_x))))
clfs.append(logreg)
rfc2 = RandomForestClassifier(n_estimators=50, random_state=1337, n_jobs=-1)
rfc2.fit(train_x, train_y)
print('RFC2 LogLoss {score}'.format(score=log_loss(test_y, rfc2.predict_proba(test_x))))
clfs.append(rfc2)
### train the ensemble weights
predictions = []
for clf in clfs:
    predictions.append(clf.predict_proba(test_x))
def log_loss_func(weights):
    ''' scipy minimize will pass the weights as a numpy array '''
    final_prediction = 0
    for weight, prediction in zip(weights, predictions):
        final_prediction += weight * prediction
    return log_loss(test_y, final_prediction)
# start every weight at 0.5
# it's better to choose many random starting points and run minimize a few times
starting_values = [0.5] * len(predictions)
# the weights must sum to 1
cons = ({'type': 'eq', 'fun': lambda w: 1 - sum(w)})
# lower and upper bounds on each weight
bounds = [(0, 1)] * len(predictions)
res = minimize(log_loss_func, starting_values, method='SLSQP', bounds=bounds, constraints=cons)
print('Ensemble Score: {best_score}'.format(best_score=res['fun']))
print('Best Weights: {weights}'.format(weights=res['x']))
