Stacking is a model-combination technique that merges the information from multiple predictive models to produce a new model. All trained base models make predictions for the entire training set: the prediction of the j-th base model for the i-th training sample becomes the j-th feature of the i-th sample in a new training set, and a second-level model is then trained on that new set. Prediction works the same way: the test set is first pushed through all base models to form a new test set, on which the second-level model finally predicts. To avoid leaking labels, the base-model predictions for the training set are generated out-of-fold via cross-validation, as is done throughout the implementation below.
Of course, stacking does not always yield impressive gains. It works very well when the base models differ noticeably from one another; when the models are all very similar, the improvement is usually modest.
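As a minimal sketch of this recipe (separate from the competition pipeline below), sklearn's cross_val_predict can generate the out-of-fold base-model predictions that become the meta-model's input features; the two base models and the toy dataset here are arbitrary choices for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, random_state=0)
base_models = [RandomForestClassifier(random_state=0), LogisticRegression(max_iter=1000)]

# Column j of the new training set = out-of-fold predictions of base model j
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method='predict_proba')[:, 1]
    for model in base_models
])

# The meta-model (here a logistic regression) trains on the new feature matrix
meta_model = LogisticRegression().fit(meta_features, y)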
Implementation
We take the data from Kaggle's Porto Seguro's Safe Driver Prediction competition as a worked example.
This competition opened on Kaggle on September 30 and is hosted by Porto Seguro, one of Brazil's largest auto and homeowner insurance companies. Entrants must build machine-learning models on auto policyholder data to predict whether a holder will file a claim in the following year. All of the provided data have been preprocessed, and since the features carry no real-world meaning, you cannot simply engineer features from common sense or domain knowledge.
Dataset download: Data
Load the required modules
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
import xgboost as xgb
import lightgbm as lgb
import time

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 1000)
Feature engineering
This section applies some simple preprocessing to the data, mainly:
- Categorical feature encoding
  - Frequency Encoding
  - Binary Encoding
- Feature Reduction
1.1 Frequency Encoding
# Read the raw datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sample_submission.csv')
def freq_encoding(cols, train_df, test_df):
    result_train_df = pd.DataFrame()
    result_test_df = pd.DataFrame()
    for col in cols:
        col_freq = col + '_freq'
        # Count the occurrences of each category in the training set
        freq = train_df[col].value_counts()
        freq = pd.DataFrame(freq)
        freq.reset_index(inplace=True)
        # First column is the category value, second its frequency
        freq.columns = [col, col_freq]

        temp_train_df = pd.merge(train_df[[col]], freq, how='left', on=col)
        temp_train_df.drop([col], axis=1, inplace=True)
        temp_test_df = pd.merge(test_df[[col]], freq, how='left', on=col)
        temp_test_df.drop([col], axis=1, inplace=True)

        # Categories in test that never appear in train get frequency 0
        temp_test_df.fillna(0, inplace=True)
        temp_test_df[col_freq] = temp_test_df[col_freq].astype(np.int32)

        if result_train_df.shape[0] == 0:
            result_train_df = temp_train_df
            result_test_df = temp_test_df
        else:
            result_train_df = pd.concat([result_train_df, temp_train_df], axis=1)
            result_test_df = pd.concat([result_test_df, temp_test_df], axis=1)
    return result_train_df, result_test_df

cat_cols = ['ps_ind_02_cat', 'ps_car_04_cat', 'ps_car_09_cat',
            'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_11_cat']
train_freq, test_freq = freq_encoding(cat_cols, train, test)

# Append the frequency features to the original data
train = pd.concat([train, train_freq], axis=1)
test = pd.concat([test, test_freq], axis=1)
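To make the behavior concrete, here is a tiny, hypothetical check (the toy category values are made up):

# Category 1 appears twice in train; category 4 never appears in train
demo_train = pd.DataFrame({'ps_ind_02_cat': [1, 1, 2, 3]})
demo_test = pd.DataFrame({'ps_ind_02_cat': [1, 4]})
tr, te = freq_encoding(['ps_ind_02_cat'], demo_train, demo_test)
print(tr['ps_ind_02_cat_freq'].tolist())  # [2, 2, 1, 1]
print(te['ps_ind_02_cat_freq'].tolist())  # [2, 0] -- the unseen category maps to 0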
1.2 Binary Encoding
def binary_encoding(train_df, test_df, feat):
    # Find the largest category value across train and test
    train_feat_max = train_df[feat].max()
    test_feat_max = test_df[feat].max()
    if train_feat_max > test_feat_max:
        feat_max = train_feat_max
    else:
        feat_max = test_feat_max

    # Missing values are coded as -1; recode them as feat_max + 1
    train_df.loc[train_df[feat] == -1, feat] = feat_max + 1
    test_df.loc[test_df[feat] == -1, feat] = feat_max + 1

    # Sorted union of the unique values in train and test
    union_val = np.union1d(train_df[feat].unique(), test_df[feat].unique())
    max_dec = union_val.max()
    max_bin_len = len("{0:b}".format(max_dec))
    index = np.arange(len(union_val))
    columns = list([feat])
    bin_df = pd.DataFrame(index=index, columns=columns)
    bin_df[feat] = union_val

    # Zero-padded binary representation, one uint8 column per bit
    feat_bin = bin_df[feat].apply(lambda x: "{0:b}".format(x).zfill(max_bin_len))
    splitted = feat_bin.apply(lambda x: pd.Series(list(x)).astype(np.uint8))
    splitted.columns = [feat + '_bin_' + str(x) for x in splitted.columns]
    bin_df = bin_df.join(splitted)

    train_df = pd.merge(train_df, bin_df, how='left', on=[feat])
    test_df = pd.merge(test_df, bin_df, how='left', on=[feat])
    return train_df, test_df

cat_cols = ['ps_ind_02_cat', 'ps_car_04_cat', 'ps_car_09_cat',
            'ps_ind_05_cat', 'ps_car_01_cat']
for col in cat_cols:
    train, test = binary_encoding(train, test, col)
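A quick, hypothetical sanity check of what binary_encoding produces (the column name cat and its values are made up): the missing-value code -1 becomes max+1 = 6, and each value is split into its zero-padded binary digits.

demo_train = pd.DataFrame({'cat': [0, 1, 2, 5]})
demo_test = pd.DataFrame({'cat': [1, 3, -1, 5]})
demo_train, demo_test = binary_encoding(demo_train, demo_test, 'cat')
print(demo_train)
#    cat  cat_bin_0  cat_bin_1  cat_bin_2
# 0    0          0          0          0
# 1    1          0          0          1
# 2    2          0          1          0
# 3    5          1          0          1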
1.3 Feature Reduction
Whether to drop the original columns is best decided by comparing CV results before and after dropping them.
col_to_drop = train.columns[train.columns.str.startswith('ps_calc_')]
train.drop(col_to_drop, axis=1, inplace=True)
test.drop(col_to_drop, axis=1, inplace=True)
Cross-validation: OOF features
Note: the model parameters used below have already been tuned. In general, tune each model before generating its OOF features.
The competition is scored by the normalized Gini coefficient, which relates to AUC as Gini = 2 * AUC - 1:

def auc_to_gini_norm(auc_score):
    return 2 * auc_score - 1
Sklearn K-fold & OOF function
This helper targets estimators from the Python sklearn module.
def cross_validate_sklearn(clf, x_train, y_train, x_test, kf, scale=False, verbose=True):
    '''
    :param clf: the model
    :param x_train: training features
    :param y_train: training labels
    :param x_test: test features
    :param kf: the cross-validation splitter
    :param scale: whether to standardize the features
    :param verbose: print per-fold scores
    :return: cv scores, OOF train predictions, averaged test predictions
    '''
    start_time = time.time()

    # Initialize the OOF arrays
    train_pred = np.zeros((x_train.shape[0]))
    test_pred = np.zeros((x_test.shape[0]))

    # Generate OOF predictions fold by fold
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):
        x_train_kf, x_val_kf = x_train.loc[train_index, :], x_train.loc[val_index, :]
        y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]

        # Standardize if required, e.g. for linear models
        if scale:
            scaler = StandardScaler().fit(x_train_kf.values)
            x_train_kf_values = scaler.transform(x_train_kf.values)
            x_val_kf_values = scaler.transform(x_val_kf.values)
            x_test_values = scaler.transform(x_test.values)
        else:
            x_train_kf_values = x_train_kf.values
            x_val_kf_values = x_val_kf.values
            x_test_values = x_test.values

        # Fit the model
        clf.fit(x_train_kf_values, y_train_kf.values)

        # Predict probabilities; OOF slots are indexed by the validation fold
        val_pred = clf.predict_proba(x_val_kf_values)[:, 1]
        train_pred[val_index] += val_pred

        y_test_preds = clf.predict_proba(x_test_values)[:, 1]
        test_pred += y_test_preds

        fold_auc = roc_auc_score(y_val_kf.values, val_pred)
        fold_gini_norm = auc_to_gini_norm(fold_auc)
        if verbose:
            print('fold cv {} AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(i, fold_auc, fold_gini_norm))

    # Average the test predictions over the folds
    test_pred /= kf.n_splits

    cv_auc = roc_auc_score(y_train, train_pred)
    cv_gini_norm = auc_to_gini_norm(cv_auc)
    cv_score = [cv_auc, cv_gini_norm]
    if verbose:
        print('cv AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(cv_auc, cv_gini_norm))
    end_time = time.time()
    print("it takes %.3f seconds to perform cross validation" % (end_time - start_time))
    return cv_score, train_pred, test_pred
XGBoost K-fold & OOF function
Next we construct OOF features with the Kaggle workhorses XGBoost and LightGBM, using their native Python modules. You could also use their sklearn APIs, but the native modules expose more functionality. Because AUC and Gini depend only on the ordering of predictions, the helper below can optionally rank-transform predictions so that outputs from different folds are on a comparable scale before averaging:
# Convert predicted probabilities into normalized ranks
def probability_to_rank(prediction, scaler=1):
    pred_df = pd.DataFrame(columns=['probability'])
    pred_df['probability'] = prediction
    pred_df['rank'] = pred_df['probability'].rank() / len(prediction) * scaler
    return pred_df['rank'].values
def cross_validate_xgb(params, x_train, y_train, x_test, kf, cat_cols=[], verbose=True,
                       verbose_eval=50, num_boost_round=4000, use_rank=True):
    '''
    :param params: model parameters
    :param x_train: training features
    :param y_train: training labels
    :param x_test: test features
    :param kf: the cross-validation splitter
    :param cat_cols: categorical feature names
    :param verbose: print per-fold scores
    :param verbose_eval: xgboost evaluation logging interval
    :param num_boost_round: maximum number of boosting rounds
    :param use_rank: whether to rank-transform the predictions
    :return: cv scores, OOF train predictions, averaged test predictions
    '''
    start_time = time.time()

    # Initialize the OOF arrays
    train_pred = np.zeros((x_train.shape[0]))
    test_pred = np.zeros((x_test.shape[0]))

    # Generate OOF predictions fold by fold
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):
        x_train_kf, x_val_kf = x_train.loc[train_index, :], x_train.loc[val_index, :]
        y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
        x_test_kf = x_test.copy()

        # Wrap the data in xgboost's DMatrix format
        d_train_kf = xgb.DMatrix(x_train_kf, label=y_train_kf)
        d_val_kf = xgb.DMatrix(x_val_kf, label=y_val_kf)
        d_test = xgb.DMatrix(x_test_kf)

        # Train the xgboost model with early stopping on the validation fold
        bst = xgb.train(params, d_train_kf, num_boost_round=num_boost_round,
                        evals=[(d_train_kf, 'train'), (d_val_kf, 'val')],
                        verbose_eval=verbose_eval, early_stopping_rounds=50)

        val_pred = bst.predict(d_val_kf, ntree_limit=bst.best_ntree_limit)
        if use_rank:
            train_pred[val_index] += probability_to_rank(val_pred)
            test_pred += probability_to_rank(bst.predict(d_test))
        else:
            train_pred[val_index] += val_pred
            test_pred += bst.predict(d_test)

        fold_auc = roc_auc_score(y_val_kf.values, val_pred)
        fold_gini_norm = auc_to_gini_norm(fold_auc)
        if verbose:
            print('fold cv {} AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(i, fold_auc, fold_gini_norm))

    test_pred /= kf.n_splits

    cv_auc = roc_auc_score(y_train, train_pred)
    cv_gini_norm = auc_to_gini_norm(cv_auc)
    cv_score = [cv_auc, cv_gini_norm]
    if verbose:
        print('cv AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(cv_auc, cv_gini_norm))
    end_time = time.time()
    print("it takes %.3f seconds to perform cross validation" % (end_time - start_time))
    return cv_score, train_pred, test_pred
LightGBM K-fold & OOF function
Analogous to the XGBoost version.
def cross_validate_lgb(params, x_train, y_train, x_test, kf, cat_cols=[], verbose=True,
                       verbose_eval=50, use_cat=True, use_rank=True):
    '''
    :param params: model parameters
    :param x_train: training features
    :param y_train: training labels
    :param x_test: test features
    :param kf: the cross-validation splitter
    :param cat_cols: categorical feature names
    :param verbose: print per-fold scores
    :param verbose_eval: lightgbm evaluation logging interval
    :param use_cat: whether to declare categorical features to LightGBM
    :param use_rank: whether to rank-transform the predictions
    :return: cv scores, OOF train predictions, averaged test predictions
    '''
    start_time = time.time()

    # Initialize the OOF arrays
    train_pred = np.zeros((x_train.shape[0]))
    test_pred = np.zeros((x_test.shape[0]))

    if len(cat_cols) == 0:
        use_cat = False

    # Generate OOF predictions fold by fold
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):
        x_train_kf, x_val_kf = x_train.loc[train_index, :], x_train.loc[val_index, :]
        y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]

        # LightGBM can find optimal splits for declared categorical features
        if use_cat:
            lgb_train = lgb.Dataset(x_train_kf, y_train_kf, categorical_feature=cat_cols)
            lgb_val = lgb.Dataset(x_val_kf, y_val_kf, reference=lgb_train, categorical_feature=cat_cols)
        else:
            lgb_train = lgb.Dataset(x_train_kf, y_train_kf)
            lgb_val = lgb.Dataset(x_val_kf, y_val_kf, reference=lgb_train)

        # Train the lightgbm model with early stopping on the validation fold
        gbm = lgb.train(params, lgb_train, num_boost_round=4000, valid_sets=lgb_val,
                        early_stopping_rounds=30, verbose_eval=verbose_eval)

        val_pred = gbm.predict(x_val_kf)
        if use_rank:
            train_pred[val_index] += probability_to_rank(val_pred)
            test_pred += probability_to_rank(gbm.predict(x_test))
        else:
            train_pred[val_index] += val_pred
            test_pred += gbm.predict(x_test)

        fold_auc = roc_auc_score(y_val_kf.values, val_pred)
        fold_gini_norm = auc_to_gini_norm(fold_auc)
        if verbose:
            print('fold cv {} AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(i, fold_auc, fold_gini_norm))

    test_pred /= kf.n_splits

    cv_auc = roc_auc_score(y_train, train_pred)
    cv_gini_norm = auc_to_gini_norm(cv_auc)
    cv_score = [cv_auc, cv_gini_norm]
    if verbose:
        print('cv AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(cv_auc, cv_gini_norm))
    end_time = time.time()
    print("it takes %.3f seconds to perform cross validation" % (end_time - start_time))
    return cv_score, train_pred, test_pred
Generate level 1 OOF predictions
With the OOF helpers defined above, we can now produce the outputs of each stacking level. First, set up the data and the CV folds.
drop_cols = ['id', 'target']
y_train = train['target']
x_train = train.drop(drop_cols, axis=1)
x_test = test.drop(['id'], axis=1)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2017)
Below, several common models are used to produce the level 1 outputs:
Random Forest
rf = RandomForestClassifier(n_estimators=200, n_jobs=6, min_samples_split=5, max_depth=7,
                            criterion='gini', random_state=0)
outcomes = cross_validate_sklearn(rf, x_train, y_train, x_test, kf, scale=False, verbose=True)
rf_cv = outcomes[0]
rf_train_pred = outcomes[1]
rf_test_pred = outcomes[2]
rf_train_pred_df = pd.DataFrame(columns=['prediction_probability'], data=rf_train_pred)
rf_test_pred_df = pd.DataFrame(columns=['prediction_probability'], data=rf_test_pred)
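Extra Trees

The level 2 code below also consumes et_train_pred_df and et_test_pred_df from an ExtraTreesClassifier (imported at the top), but the corresponding block is missing from the post; here is a minimal sketch, assuming hyperparameters that mirror the random forest above:

et = ExtraTreesClassifier(n_estimators=200, n_jobs=6, min_samples_split=5, max_depth=7,
                          criterion='gini', random_state=0)  # assumed settings, mirroring the RF
outcomes = cross_validate_sklearn(et, x_train, y_train, x_test, kf, scale=False, verbose=True)
et_cv = outcomes[0]
et_train_pred = outcomes[1]
et_test_pred = outcomes[2]
et_train_pred_df = pd.DataFrame(columns=['prediction_probability'], data=et_train_pred)
et_test_pred_df = pd.DataFrame(columns=['prediction_probability'], data=et_test_pred)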
Logistic Regression
logit = LogisticRegression(random_state=0, C=0.5)
outcomes = cross_validate_sklearn(logit, x_train, y_train, x_test, kf, scale=True, verbose=True)
logit_cv = outcomes[0]
logit_train_pred = outcomes[1]
logit_test_pred = outcomes[2]
logit_train_pred_df = pd.DataFrame(columns=['prediction_probability'], data=logit_train_pred)
logit_test_pred_df = pd.DataFrame(columns=['prediction_probability'], data=logit_test_pred)
BernoulliNB
A model like this usually cannot match xgb or lgb on its own, but it brings diversity to the pool of predictions, which helps stacking (see the correlation check after the level 1 features are assembled below).
nb = BernoulliNB()
outcomes = cross_validate_sklearn(nb, x_train, y_train, x_test, kf, scale=True, verbose=True)
nb_cv = outcomes[0]
nb_train_pred = outcomes[1]
nb_test_pred = outcomes[2]
nb_train_pred_df = pd.DataFrame(columns=['prediction_probability'], data=nb_train_pred)
nb_test_pred_df = pd.DataFrame(columns=['prediction_probability'], data=nb_test_pred)
XGBoost
xgb_params = {
    "booster": "gbtree",
    "objective": "binary:logistic",
    "tree_method": "hist",
    "eval_metric": "auc",
    "eta": 0.1,
    "max_depth": 5,
    "min_child_weight": 10,
    "gamma": 0.70,
    "subsample": 0.76,
    "colsample_bytree": 0.95,
    "nthread": 6,
    "seed": 0,
    "silent": 1
}

outcomes = cross_validate_xgb(xgb_params, x_train, y_train, x_test, kf,
                              use_rank=False, verbose_eval=False)
xgb_cv = outcomes[0]
xgb_train_pred = outcomes[1]
xgb_test_pred = outcomes[2]
xgb_train_pred_df = pd.DataFrame(columns=['prediction_probability'], data=xgb_train_pred)
xgb_test_pred_df = pd.DataFrame(columns=['prediction_probability'], data=xgb_test_pred)
LightGBM
lgb_params = {
    'task': 'train',
    'boosting_type': 'dart',
    'objective': 'binary',
    'metric': {'auc'},
    'num_leaves': 22,
    'min_sum_hessian_in_leaf': 20,
    'max_depth': 5,
    'learning_rate': 0.1,  # 0.618580
    'num_threads': 6,
    'feature_fraction': 0.6894,
    'bagging_fraction': 0.4218,
    'max_drop': 5,
    'drop_rate': 0.0123,
    'min_data_in_leaf': 10,
    'bagging_freq': 1,
    'lambda_l1': 1,
    'lambda_l2': 0.01,
    'verbose': 1
}

cat_cols = ['ps_ind_02_cat', 'ps_car_04_cat', 'ps_car_09_cat',
            'ps_ind_05_cat', 'ps_car_01_cat']
outcomes = cross_validate_lgb(lgb_params, x_train, y_train, x_test, kf, cat_cols,
                              use_cat=True, verbose_eval=False, use_rank=False)
lgb_cv = outcomes[0]
lgb_train_pred = outcomes[1]
lgb_test_pred = outcomes[2]
lgb_train_pred_df = pd.DataFrame(columns=['prediction_probability'], data=lgb_train_pred)
lgb_test_pred_df = pd.DataFrame(columns=['prediction_probability'], data=lgb_test_pred)
We have now produced the level 1 features; next comes the level 2 ensemble.
Level 2 ensemble
Generate L1 output dataframe
The level 1 OOF features become the input features of level 2.
columns = ['rf', 'et', 'logit', 'nb', 'xgb', 'lgb']
train_pred_df_list = [rf_train_pred_df, et_train_pred_df, logit_train_pred_df,
                      nb_train_pred_df, xgb_train_pred_df, lgb_train_pred_df]
test_pred_df_list = [rf_test_pred_df, et_test_pred_df, logit_test_pred_df,
                     nb_test_pred_df, xgb_test_pred_df, lgb_test_pred_df]

lv1_train_df = pd.DataFrame(columns=columns)
lv1_test_df = pd.DataFrame(columns=columns)
for i in range(0, len(columns)):
    lv1_train_df[columns[i]] = train_pred_df_list[i]['prediction_probability']
    lv1_test_df[columns[i]] = test_pred_df_list[i]['prediction_probability']
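As a sanity check on the diversity point made for BernoulliNB above, you can inspect how correlated the level 1 OOF columns are; pairs with correlations well below 1 are the ones contributing fresh information:

# Pairwise correlation of the level 1 OOF predictions
print(lv1_train_df.corr())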
Level 2 XGB
Train an XGBoost model on the level 1 OOF features and output level 2 OOF features.
xgb_lv2_outcomes = cross_validate_xgb(xgb_params, lv1_train_df, y_train, lv1_test_df, kf,
                                      verbose=True, verbose_eval=False, use_rank=False)
xgb_lv2_cv = xgb_lv2_outcomes[0]
xgb_lv2_train_pred = xgb_lv2_outcomes[1]
xgb_lv2_test_pred = xgb_lv2_outcomes[2]
Level 2 LightGBM
lgb_lv2_outcomes = cross_validate_lgb(lgb_params, lv1_train_df, y_train, lv1_test_df, kf, [],
                                      use_cat=False, verbose_eval=False, use_rank=True)
lgb_lv2_cv = lgb_lv2_outcomes[0]
lgb_lv2_train_pred = lgb_lv2_outcomes[1]
lgb_lv2_test_pred = lgb_lv2_outcomes[2]
Level 2 Random Forest
rf_lv2 = RandomForestClassifier(n_estimators=200, n_jobs=6, min_samples_split=5, max_depth=7,
                                criterion='gini', random_state=0)
rf_lv2_outcomes = cross_validate_sklearn(rf_lv2, lv1_train_df, y_train, lv1_test_df, kf,
                                         scale=True, verbose=True)
rf_lv2_cv = rf_lv2_outcomes[0]
rf_lv2_train_pred = rf_lv2_outcomes[1]
rf_lv2_test_pred = rf_lv2_outcomes[2]
Level 2 Logistic Regression
logit_lv2 = LogisticRegression(random_state=0, C=0.5)
logit_lv2_outcomes = cross_validate_sklearn(logit_lv2, lv1_train_df, y_train, lv1_test_df, kf,
                                            scale=True, verbose=True)
logit_lv2_cv = logit_lv2_outcomes[0]
logit_lv2_train_pred = logit_lv2_outcomes[1]
logit_lv2_test_pred = logit_lv2_outcomes[2]
Level 3 ensemble
This works just like the level 2 ensemble; the level 1 OOF features can also be added to the inputs (see the sketch after the next code block).
Generate L2 output dataframe
lv2_columns = ['rf_lv2', 'logit_lv2', 'xgb_lv2', 'lgb_lv2']
train_lv2_pred_list = [rf_lv2_train_pred, logit_lv2_train_pred, xgb_lv2_train_pred, lgb_lv2_train_pred]
test_lv2_pred_list = [rf_lv2_test_pred, logit_lv2_test_pred, xgb_lv2_test_pred, lgb_lv2_test_pred]

lv2_train = pd.DataFrame(columns=lv2_columns)
lv2_test = pd.DataFrame(columns=lv2_columns)
for i in range(0, len(lv2_columns)):
    lv2_train[lv2_columns[i]] = train_lv2_pred_list[i]
    lv2_test[lv2_columns[i]] = test_lv2_pred_list[i]
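If you also want the level 1 features available at level 3, one minimal option is to concatenate the two frames (lv2_train_plus and lv2_test_plus are made-up names):

# Carry the level 1 OOF features forward alongside the level 2 ones
lv2_train_plus = pd.concat([lv2_train, lv1_train_df], axis=1)
lv2_test_plus = pd.concat([lv2_test, lv1_test_df], axis=1)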
Level 3 XGB
xgb_lv3_params = {
    "booster": "gbtree",
    "objective": "binary:logistic",
    "tree_method": "hist",
    "eval_metric": "auc",
    "eta": 0.1,
    "max_depth": 2,
    "min_child_weight": 10,
    "gamma": 0.70,
    "subsample": 0.76,
    "colsample_bytree": 0.95,
    "nthread": 6,
    "seed": 0,
    "silent": 1
}

xgb_lv3_outcomes = cross_validate_xgb(xgb_lv3_params, lv2_train, y_train, lv2_test, kf,
                                      verbose=True, verbose_eval=False, use_rank=True)
xgb_lv3_cv = xgb_lv3_outcomes[0]
xgb_lv3_train_pred = xgb_lv3_outcomes[1]
xgb_lv3_test_pred = xgb_lv3_outcomes[2]
Level 3 Logistic Regression
logit_lv3 = LogisticRegression(random_state=0, C=0.5)
logit_lv3_outcomes = cross_validate_sklearn(logit_lv3, lv2_train, y_train, lv2_test, kf,
                                            scale=True, verbose=True)
logit_lv3_cv = logit_lv3_outcomes[0]
logit_lv3_train_pred = logit_lv3_outcomes[1]
logit_lv3_test_pred = logit_lv3_outcomes[2]
Average L3 outputs & Submission Generation
weight_avg = logit_lv3_train_pred * 0.5 + xgb_lv3_train_pred * 0.5
print(auc_to_gini_norm(roc_auc_score(y_train, weight_avg)))

submission = sample_submission.copy()
submission['target'] = logit_lv3_test_pred * 0.5 + xgb_lv3_test_pred * 0.5
filename = 'stacking_demonstration.csv.gz'
submission.to_csv(filename, compression='gzip', index=False)
Conclusion
The three-level stack above may well not be optimal, but it should help you uncover more useful ideas. Is a higher stacking level always better? Generally not: most solutions stop at level 2 or level 3. My own usual strategy is (sketched after the list):
- Usually stack only up to level 2, then average the level 2 results
- Repeat the same stacking strategy with different random seeds
- Average the results of those runs
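A minimal sketch of that seed-averaging strategy; run_stack_pipeline is a hypothetical wrapper around the level 1 and level 2 steps above that returns the level 2 test predictions for a given fold split:

seed_test_preds = []
for seed in [2017, 2018, 2019]:  # any handful of seeds
    kf_seed = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # Re-run the same stacking pipeline with this fold split
    seed_test_preds.append(run_stack_pipeline(x_train, y_train, x_test, kf_seed))

# The final prediction is the average over the seeds
final_test_pred = np.mean(seed_test_preds, axis=0)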
Update: a model-ensemble weight function
The weights trained here are the weights of the final linear blend of the last few models.
# encoding: utf-8
'''
Optimize the weights of the final linear blend of models
'''
import os
import pandas as pd
import numpy as np
from scipy.optimize import minimize  # the optimizer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

os.system("ls ../input")

# The dataset
train = pd.read_csv("../input/train.csv")
print("Training set has {0[0]} rows and {0[1]} columns".format(train.shape))
labels = train['target']
train.drop(['target', 'id'], axis=1, inplace=True)
print(train.head())

### Split off a holdout set on which to train the ensemble weights
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.05, random_state=1234)
train_index, test_index = next(sss.split(train, labels))

# The split
train_x, train_y = train.values[train_index], labels.values[train_index]
test_x, test_y = train.values[test_index], labels.values[test_index]

### List of classifiers
clfs = []

rfc = RandomForestClassifier(n_estimators=50, random_state=4141, n_jobs=-1)
rfc.fit(train_x, train_y)
print('RFC LogLoss {score}'.format(score=log_loss(test_y, rfc.predict_proba(test_x))))
clfs.append(rfc)

### Normally you would use xgboost, lightgbm or NN models here;
### the logistic model is only for demonstration
logreg = LogisticRegression()
logreg.fit(train_x, train_y)
print('LogisticRegression LogLoss {score}'.format(score=log_loss(test_y, logreg.predict_proba(test_x))))
clfs.append(logreg)

rfc2 = RandomForestClassifier(n_estimators=50, random_state=1337, n_jobs=-1)
rfc2.fit(train_x, train_y)
print('RFC2 LogLoss {score}'.format(score=log_loss(test_y, rfc2.predict_proba(test_x))))
clfs.append(rfc2)

### Train the ensemble weights
predictions = []
for clf in clfs:
    predictions.append(clf.predict_proba(test_x))

def log_loss_func(weights):
    ''' scipy minimize will pass the weights as a numpy array '''
    final_prediction = 0
    for weight, prediction in zip(weights, predictions):
        final_prediction += weight * prediction
    return log_loss(test_y, final_prediction)

# Start every weight at 0.5 by default;
# it's better to choose many random starting points and run minimize a few times
starting_values = [0.5] * len(predictions)
# Constrain the weights to sum to 1
cons = ({'type': 'eq', 'fun': lambda w: 1 - sum(w)})
# Lower and upper bounds for each weight
bounds = [(0, 1)] * len(predictions)

res = minimize(log_loss_func, starting_values, method='SLSQP', bounds=bounds, constraints=cons)

print('Ensemble Score: {best_score}'.format(best_score=res['fun']))
print('Best Weights: {weights}'.format(weights=res['x']))
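Once the weights are found, applying them to other predictions is just a weighted sum; a minimal sketch, where new_test_x is a made-up stand-in for whatever unseen features you want to score:

# Blend the classifiers' probabilities with the optimized weights
new_test_x = test_x  # substitute your real held-out or test features here
blended = sum(weight * clf.predict_proba(new_test_x)
              for weight, clf in zip(res['x'], clfs))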