Stacking is a model-combination technique that merges the information from multiple predictive models to produce a new model. Concretely, every trained base model makes predictions for the whole training set; the prediction of the j-th base model for the i-th training sample becomes the j-th feature of the i-th sample in a new training set, and a second-level model is then trained on that new set. Prediction works the same way: the test set is first run through all base models to form a new test set, and the final prediction is made on that. In practice, to avoid leaking the training labels into these new features, the base-model predictions for the training set are generated out-of-fold (OOF) via cross-validation, which is exactly what the code below does.

Of course, stacking does not always deliver impressive gains. When the base models are clearly different from one another, stacking works very well; when the models are all very similar, the improvement is usually far less striking.
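As a minimal sketch of the idea (function and variable names here are illustrative, assuming numpy arrays and sklearn-style classifiers, not the exact setup used later in this post), each base model contributes one column of out-of-fold predictions to the new training set and one column of averaged test predictions to the new test set:

import numpy as np
from sklearn.model_selection import KFold

def stacking_features(base_models, X_train, y_train, X_test, n_splits=5, seed=0):
    """Build level-2 features: one column of out-of-fold predictions per base model."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    new_train = np.zeros((X_train.shape[0], len(base_models)))
    new_test = np.zeros((X_test.shape[0], len(base_models)))
    for j, model in enumerate(base_models):
        test_fold_preds = np.zeros((X_test.shape[0], n_splits))
        for f, (tr_idx, val_idx) in enumerate(kf.split(X_train)):
            model.fit(X_train[tr_idx], y_train[tr_idx])
            # OOF prediction for the held-out fold -> j-th feature of the new training set
            new_train[val_idx, j] = model.predict_proba(X_train[val_idx])[:, 1]
            test_fold_preds[:, f] = model.predict_proba(X_test)[:, 1]
        # average the per-fold test predictions -> j-th feature of the new test set
        new_test[:, j] = test_fold_preds.mean(axis=1)
    return new_train, new_test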
Implementation
We use the data from Kaggle's Porto Seguro's Safe Driver Prediction competition as a concrete example.
This competition opened on Kaggle on September 30 and is hosted by Porto Seguro, one of Brazil's largest auto and homeowner insurance companies. Participants are asked to build a machine learning model from auto policy holder data to predict whether the policy holder will file a claim in the following year. The data provided have already been preprocessed and the features carry no real-world meaning, so simple feature engineering based on common sense or domain knowledge is not possible.
Data download: Data
Load the required modules
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
import xgboost as xgb
import lightgbm as lgb
import time

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 1000)
Feature engineering
This part does some simple processing of the data, mainly the following (a toy illustration of the two encodings appears right after the list):
- Categorical feature encoding
- Frequency Encoding
- Binary Encoding
- Feature Reduction
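Before looking at the full implementations below, here is a toy illustration of what the two categorical encodings produce (the values are made up for illustration): frequency encoding replaces each category by how often it occurs in the training data, and binary encoding writes the category's integer code in base 2 and splits the digits into separate 0/1 columns.

import pandas as pd

toy = pd.DataFrame({'ps_ind_02_cat': [1, 1, 2, 3, 1]})

# Frequency encoding: category -> its count in the (training) data
freq = toy['ps_ind_02_cat'].map(toy['ps_ind_02_cat'].value_counts())
print(freq.tolist())         # [3, 3, 1, 1, 3]

# Binary encoding: category code -> fixed-width binary digits, one column per bit
max_bin_len = len("{0:b}".format(toy['ps_ind_02_cat'].max()))
bits = toy['ps_ind_02_cat'].apply(
    lambda x: pd.Series(list("{0:b}".format(x).zfill(max_bin_len))).astype(int))
bits.columns = ['ps_ind_02_cat_bin_' + str(c) for c in bits.columns]
print(bits.values.tolist())  # [[0, 1], [0, 1], [1, 0], [1, 1], [0, 1]]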
1.1 Frequency Encoding
# Read the raw data sets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sample_submission.csv')
def freq_encoding(cols, train_df, test_df):
    result_train_df = pd.DataFrame()
    result_test_df = pd.DataFrame()
    for col in cols:
        col_freq = col + '_freq'
        # Count the occurrences of each category level
        freq = train_df[col].value_counts()
        freq = pd.DataFrame(freq)
        freq.reset_index(inplace=True)
        freq.columns = [col, col_freq]

        temp_train_df = pd.merge(train_df[[col]], freq, how='left', on=col)
        temp_train_df.drop([col], axis=1, inplace=True)
        temp_test_df = pd.merge(test_df[[col]], freq, how='left', on=col)
        temp_test_df.drop([col], axis=1, inplace=True)
        # Levels that appear in test but not in train get frequency 0
        temp_test_df.fillna(0, inplace=True)
        temp_test_df[col_freq] = temp_test_df[col_freq].astype(np.int32)

        if result_train_df.shape[0] == 0:
            result_train_df = temp_train_df
            result_test_df = temp_test_df
        else:
            result_train_df = pd.concat([result_train_df, temp_train_df], axis=1)
            result_test_df = pd.concat([result_test_df, temp_test_df], axis=1)
    return result_train_df, result_test_df


cat_cols = ['ps_ind_02_cat', 'ps_car_04_cat', 'ps_car_09_cat',
            'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_11_cat']
train_freq, test_freq = freq_encoding(cat_cols, train, test)

# Join the new frequency features onto the original data
train = pd.concat([train, train_freq], axis=1)
test = pd.concat([test, test_freq], axis=1)
1.2 Binary Encoding
def binary_encoding(train_df, test_df, feat):
    # Find the largest category value across train and test
    train_feat_max = train_df[feat].max()
    test_feat_max = test_df[feat].max()
    if train_feat_max > test_feat_max:
        feat_max = train_feat_max
    else:
        feat_max = test_feat_max

    # Use feat_max + 1 to replace the missing-value marker (-1)
    train_df.loc[train_df[feat] == -1, feat] = feat_max + 1
    test_df.loc[test_df[feat] == -1, feat] = feat_max + 1

    # Union of the unique values, returned sorted
    union_val = np.union1d(train_df[feat].unique(), test_df[feat].unique())
    max_dec = union_val.max()
    max_bin_len = len("{0:b}".format(max_dec))
    index = np.arange(len(union_val))
    columns = list([feat])
    bin_df = pd.DataFrame(index=index, columns=columns)
    bin_df[feat] = union_val

    # Write each value in fixed-width binary and split the digits into columns
    feat_bin = bin_df[feat].apply(lambda x: "{0:b}".format(x).zfill(max_bin_len))
    splitted = feat_bin.apply(lambda x: pd.Series(list(x)).astype(np.uint8))
    splitted.columns = [feat + '_bin_' + str(x) for x in splitted.columns]
    bin_df = bin_df.join(splitted)

    train_df = pd.merge(train_df, bin_df, how='left', on=[feat])
    test_df = pd.merge(test_df, bin_df, how='left', on=[feat])
    return train_df, test_df


cat_cols = ['ps_ind_02_cat', 'ps_car_04_cat', 'ps_car_09_cat',
            'ps_ind_05_cat', 'ps_car_01_cat']
train, test = binary_encoding(train, test, 'ps_ind_02_cat')
train, test = binary_encoding(train, test, 'ps_car_04_cat')
train, test = binary_encoding(train, test, 'ps_car_09_cat')
train, test = binary_encoding(train, test, 'ps_ind_05_cat')
train, test = binary_encoding(train, test, 'ps_car_01_cat')
1.3 Feature Reduction
Whether to drop the original columns can be decided by comparing the CV results before and after dropping them.
col_to_drop = train.columns[train.columns.str.startswith('ps_calc_')]
train.drop(col_to_drop, axis=1, inplace=True)
test.drop(col_to_drop, axis=1, inplace=True)
Cross-validation: OOF features
Note: the model hyperparameters used below have already been tuned; in general you should tune each model before generating its OOF features.
# Normalized Gini is a linear transform of AUC: Gini_norm = 2 * AUC - 1
def auc_to_gini_norm(auc_score):
    return 2 * auc_score - 1
Sklearn K-fold & OOF function
This version is for estimators from the Python sklearn module.
def cross_validate_sklearn(clf, x_train, y_train, x_test, kf, scale=False, verbose=True):
    '''
    :param clf: the model
    :param x_train: training features
    :param y_train: training labels
    :param x_test: test features
    :param kf: cross-validation splitter
    :param scale: whether to standardize the features
    :param verbose:
    :return:
    '''
    start_time = time.time()

    # Initialize the OOF prediction arrays
    train_pred = np.zeros((x_train.shape[0]))
    test_pred = np.zeros((x_test.shape[0]))

    # Generate the OOF features fold by fold
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):
        x_train_kf, x_val_kf = x_train.loc[train_index, :], x_train.loc[val_index, :]
        y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]

        # Standardize if requested, e.g. for linear models
        if scale:
            scaler = StandardScaler().fit(x_train_kf.values)
            x_train_kf_values = scaler.transform(x_train_kf.values)
            x_val_kf_values = scaler.transform(x_val_kf.values)
            x_test_values = scaler.transform(x_test.values)
        else:
            x_train_kf_values = x_train_kf.values
            x_val_kf_values = x_val_kf.values
            x_test_values = x_test.values

        # Fit the model
        clf.fit(x_train_kf_values, y_train_kf.values)

        # Predict probabilities for the held-out fold and the test set
        val_pred = clf.predict_proba(x_val_kf_values)[:, 1]
        train_pred[val_index] += val_pred

        y_test_preds = clf.predict_proba(x_test_values)[:, 1]
        test_pred += y_test_preds

        fold_auc = roc_auc_score(y_val_kf.values, val_pred)
        fold_gini_norm = auc_to_gini_norm(fold_auc)

        if verbose:
            print('fold cv {} AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(i, fold_auc, fold_gini_norm))

    # Average the k test-set predictions
    test_pred /= kf.n_splits

    cv_auc = roc_auc_score(y_train, train_pred)
    cv_gini_norm = auc_to_gini_norm(cv_auc)
    cv_score = [cv_auc, cv_gini_norm]
    if verbose:
        print('cv AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(cv_auc, cv_gini_norm))
    end_time = time.time()
    print("it takes %.3f seconds to perform cross validation" % (end_time - start_time))
    return cv_score, train_pred, test_pred
Xgboost K-fold & OOF function
Next we build OOF features with xgboost and lightgbm, the workhorses of Kaggle competitions, using their native Python modules. You could also use their sklearn APIs, but the native modules expose more functionality.
# Convert predicted probabilities to normalized ranks
def probability_to_rank(prediction, scaler=1):
    pred_df = pd.DataFrame(columns=['probability'])
    pred_df['probability'] = prediction
    pred_df['rank'] = pred_df['probability'].rank() / len(prediction) * scaler
    return pred_df['rank'].values
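Since AUC and the normalized Gini depend only on the ordering of the predictions, replacing probabilities by their normalized ranks does not change the metric but puts the outputs of different models on a comparable scale before averaging. For example (illustrative values):

print(probability_to_rank(np.array([0.2, 0.9, 0.5])))
# -> approximately [0.333, 1.0, 0.667]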
def cross_validate_xgb(params, x_train, y_train, x_test, kf, cat_cols=[], verbose=True,
                       verbose_eval=50, num_boost_round=4000, use_rank=True):
    '''
    :param params: model parameters
    :param x_train: training features
    :param y_train: training labels
    :param x_test: test features
    :param kf: cross-validation splitter
    :param cat_cols: categorical feature columns
    :param verbose:
    :param verbose_eval:
    :param num_boost_round: maximum number of boosting rounds
    :param use_rank: whether to rank-transform the predictions
    :return:
    '''
    start_time = time.time()

    # Initialize the OOF prediction arrays
    train_pred = np.zeros((x_train.shape[0]))
    test_pred = np.zeros((x_test.shape[0]))

    # Generate the OOF features fold by fold
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):  # folds 1, 2, 3, 4, 5
        x_train_kf, x_val_kf = x_train.loc[train_index, :], x_train.loc[val_index, :]
        y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
        x_test_kf = x_test.copy()

        # xgboost data format (DMatrix)
        d_train_kf = xgb.DMatrix(x_train_kf, label=y_train_kf)
        d_val_kf = xgb.DMatrix(x_val_kf, label=y_val_kf)
        d_test = xgb.DMatrix(x_test_kf)

        # Train the xgboost model with early stopping on the validation fold
        bst = xgb.train(params, d_train_kf, num_boost_round=num_boost_round,
                        evals=[(d_train_kf, 'train'), (d_val_kf, 'val')],
                        verbose_eval=verbose_eval, early_stopping_rounds=50)

        val_pred = bst.predict(d_val_kf, ntree_limit=bst.best_ntree_limit)
        if use_rank:
            train_pred[val_index] += probability_to_rank(val_pred)
            test_pred += probability_to_rank(bst.predict(d_test))
        else:
            train_pred[val_index] += val_pred
            test_pred += bst.predict(d_test)

        fold_auc = roc_auc_score(y_val_kf.values, val_pred)
        fold_gini_norm = auc_to_gini_norm(fold_auc)
        if verbose:
            print('fold cv {} AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(i, fold_auc, fold_gini_norm))

    test_pred /= kf.n_splits

    cv_auc = roc_auc_score(y_train, train_pred)
    cv_gini_norm = auc_to_gini_norm(cv_auc)
    cv_score = [cv_auc, cv_gini_norm]
    if verbose:
        print('cv AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(cv_auc, cv_gini_norm))
    end_time = time.time()
    print("it takes %.3f seconds to perform cross validation" % (end_time - start_time))
    return cv_score, train_pred, test_pred
LightGBM K-fold & OOF function
Similar to the xgboost version.
def cross_validate_lgb(params, x_train, y_train, x_test, kf, cat_cols=[], verbose=True,
                       verbose_eval=50, use_cat=True, use_rank=True):
    '''
    :param params: model parameters
    :param x_train: training features
    :param y_train: training labels
    :param x_test: test features
    :param kf: cross-validation splitter
    :param cat_cols: categorical feature columns
    :param verbose:
    :param verbose_eval:
    :param use_cat: whether to declare categorical features to LightGBM
    :param use_rank: whether to rank-transform the predictions
    :return:
    '''
    start_time = time.time()

    # Initialize the OOF prediction arrays
    train_pred = np.zeros((x_train.shape[0]))
    test_pred = np.zeros((x_test.shape[0]))

    if len(cat_cols) == 0:
        use_cat = False

    # Generate the OOF features fold by fold
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):  # folds 1, 2, 3, 4, 5
        x_train_kf, x_val_kf = x_train.loc[train_index, :], x_train.loc[val_index, :]
        y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]

        # Optionally declare categorical features (LightGBM can find optimal splits on them)
        if use_cat:
            lgb_train = lgb.Dataset(x_train_kf, y_train_kf, categorical_feature=cat_cols)
            lgb_val = lgb.Dataset(x_val_kf, y_val_kf, reference=lgb_train, categorical_feature=cat_cols)
        else:
            lgb_train = lgb.Dataset(x_train_kf, y_train_kf)
            lgb_val = lgb.Dataset(x_val_kf, y_val_kf, reference=lgb_train)

        # Train the lightgbm model with early stopping on the validation fold
        gbm = lgb.train(params, lgb_train, num_boost_round=4000, valid_sets=lgb_val,
                        early_stopping_rounds=30, verbose_eval=verbose_eval)

        val_pred = gbm.predict(x_val_kf)
        if use_rank:
            train_pred[val_index] += probability_to_rank(val_pred)
            test_pred += probability_to_rank(gbm.predict(x_test))
        else:
            train_pred[val_index] += val_pred
            test_pred += gbm.predict(x_test)

        fold_auc = roc_auc_score(y_val_kf.values, val_pred)
        fold_gini_norm = auc_to_gini_norm(fold_auc)
        if verbose:
            print('fold cv {} AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(i, fold_auc, fold_gini_norm))

    test_pred /= kf.n_splits

    cv_auc = roc_auc_score(y_train, train_pred)
    cv_gini_norm = auc_to_gini_norm(cv_auc)
    cv_score = [cv_auc, cv_gini_norm]
    if verbose:
        print('cv AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(cv_auc, cv_gini_norm))
    end_time = time.time()
    print("it takes %.3f seconds to perform cross validation" % (end_time - start_time))
    return cv_score, train_pred, test_pred
Generate level 1 OOF predictions
With the OOF functions defined above, we can now construct the OOF outputs at each level. First define the data and the CV splitter.
drop_cols = ['id', 'target']
y_train = train['target']
x_train = train.drop(drop_cols, axis=1)
x_test = test.drop(['id'], axis=1)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2017)
Next, several common models are used to produce the level-1 outputs:
Random Forest
rf = RandomForestClassifier(n_estimators=200, n_jobs=6, min_samples_split=5, max_depth=7,
                            criterion='gini', random_state=0)
outcomes = cross_validate_sklearn(rf, x_train, y_train, x_test, kf, scale=False, verbose=True)
rf_cv = outcomes[0]
rf_train_pred = outcomes[1]
rf_test_pred = outcomes[2]

rf_train_pred_df = pd.DataFrame(columns=['prediction_probability'], data=rf_train_pred)
rf_test_pred_df = pd.DataFrame(columns=['prediction_probability'], data=rf_test_pred)
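Extra Trees

The level-2 code further down expects OOF features from an Extra Trees model (et_train_pred_df / et_test_pred_df), but the corresponding block is missing from the text above. A plausible reconstruction that mirrors the Random Forest setup would look like the following (the hyperparameters here are assumptions, not necessarily the original ones):

et = ExtraTreesClassifier(n_estimators=200, n_jobs=6, min_samples_split=5, max_depth=7,
                          criterion='gini', random_state=0)
outcomes = cross_validate_sklearn(et, x_train, y_train, x_test, kf, scale=False, verbose=True)
et_cv = outcomes[0]
et_train_pred = outcomes[1]
et_test_pred = outcomes[2]

et_train_pred_df = pd.DataFrame(columns=['prediction_probability'], data=et_train_pred)
et_test_pred_df = pd.DataFrame(columns=['prediction_probability'], data=et_test_pred)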
Logistic Regression
logit = LogisticRegression(random_state=0, C=0.5)
outcomes = cross_validate_sklearn(logit, x_train, y_train, x_test, kf, scale=True, verbose=True)
logit_cv = outcomes[0]
logit_train_pred = outcomes[1]
logit_test_pred = outcomes[2]

logit_train_pred_df = pd.DataFrame(columns=['prediction_probability'], data=logit_train_pred)
logit_test_pred_df = pd.DataFrame(columns=['prediction_probability'], data=logit_test_pred)
BernoulliNB
On its own this kind of model usually cannot match xgb or lgb, but it adds diversity to the pool of predictions, which helps the stacking performance.
nb = BernoulliNB()
outcomes = cross_validate_sklearn(nb, x_train, y_train, x_test, kf, scale=True, verbose=True)
nb_cv = outcomes[0]
nb_train_pred = outcomes[1]
nb_test_pred = outcomes[2]

nb_train_pred_df = pd.DataFrame(columns=['prediction_probability'], data=nb_train_pred)
nb_test_pred_df = pd.DataFrame(columns=['prediction_probability'], data=nb_test_pred)
xgboost
xgb_params = {
    "booster": "gbtree",
    "objective": "binary:logistic",
    "tree_method": "hist",
    "eval_metric": "auc",
    "eta": 0.1,
    "max_depth": 5,
    "min_child_weight": 10,
    "gamma": 0.70,
    "subsample": 0.76,
    "colsample_bytree": 0.95,
    "nthread": 6,
    "seed": 0,
    "silent": 1
}

outcomes = cross_validate_xgb(xgb_params, x_train, y_train, x_test, kf, use_rank=False, verbose_eval=False)
xgb_cv = outcomes[0]
xgb_train_pred = outcomes[1]
xgb_test_pred = outcomes[2]

xgb_train_pred_df = pd.DataFrame(columns=['prediction_probability'], data=xgb_train_pred)
xgb_test_pred_df = pd.DataFrame(columns=['prediction_probability'], data=xgb_test_pred)
lightGBM
lgb_params = {
    'task': 'train',
    'boosting_type': 'dart',
    'objective': 'binary',
    'metric': {'auc'},
    'num_leaves': 22,
    'min_sum_hessian_in_leaf': 20,
    'max_depth': 5,
    'learning_rate': 0.1,  # 0.618580
    'num_threads': 6,
    'feature_fraction': 0.6894,
    'bagging_fraction': 0.4218,
    'max_drop': 5,
    'drop_rate': 0.0123,
    'min_data_in_leaf': 10,
    'bagging_freq': 1,
    'lambda_l1': 1,
    'lambda_l2': 0.01,
    'verbose': 1
}

cat_cols = ['ps_ind_02_cat', 'ps_car_04_cat', 'ps_car_09_cat', 'ps_ind_05_cat', 'ps_car_01_cat']
outcomes = cross_validate_lgb(lgb_params, x_train, y_train, x_test, kf, cat_cols,
                              use_cat=True, verbose_eval=False, use_rank=False)
lgb_cv = outcomes[0]
lgb_train_pred = outcomes[1]
lgb_test_pred = outcomes[2]

lgb_train_pred_df = pd.DataFrame(columns=['prediction_probability'], data=lgb_train_pred)
lgb_test_pred_df = pd.DataFrame(columns=['prediction_probability'], data=lgb_test_pred)
We have now produced the level-1 features; next we build the level-2 ensemble.
Level 2 ensemble
Generate L1 output dataframe
The level-1 OOF features are used as the input features of level 2.
columns = ['rf', 'et', 'logit', 'nb', 'xgb', 'lgb']
train_pred_df_list = [rf_train_pred_df, et_train_pred_df, logit_train_pred_df,
                      nb_train_pred_df, xgb_train_pred_df, lgb_train_pred_df]
test_pred_df_list = [rf_test_pred_df, et_test_pred_df, logit_test_pred_df,
                     nb_test_pred_df, xgb_test_pred_df, lgb_test_pred_df]

lv1_train_df = pd.DataFrame(columns=columns)
lv1_test_df = pd.DataFrame(columns=columns)

for i in range(0, len(columns)):
    lv1_train_df[columns[i]] = train_pred_df_list[i]['prediction_probability']
    lv1_test_df[columns[i]] = test_pred_df_list[i]['prediction_probability']
Level 2 XGB
Train an xgboost model on the level-1 OOF features and output level-2 OOF features.
xgb_lv2_outcomes = cross_validate_xgb(xgb_params, lv1_train_df, y_train, lv1_test_df, kf,
                                      verbose=True, verbose_eval=False, use_rank=False)
xgb_lv2_cv = xgb_lv2_outcomes[0]
xgb_lv2_train_pred = xgb_lv2_outcomes[1]
xgb_lv2_test_pred = xgb_lv2_outcomes[2]
Level 2 LightGBM
lgb_lv2_outcomes = cross_validate_lgb(lgb_params, lv1_train_df, y_train, lv1_test_df, kf, [],
                                      use_cat=False, verbose_eval=False, use_rank=True)
lgb_lv2_cv = lgb_lv2_outcomes[0]
lgb_lv2_train_pred = lgb_lv2_outcomes[1]
lgb_lv2_test_pred = lgb_lv2_outcomes[2]
Level 2 Random Forest
rf_lv2 = RandomForestClassifier(n_estimators=200, n_jobs=6, min_samples_split=5, max_depth=7,
                                criterion='gini', random_state=0)
rf_lv2_outcomes = cross_validate_sklearn(rf_lv2, lv1_train_df, y_train, lv1_test_df, kf,
                                         scale=True, verbose=True)
rf_lv2_cv = rf_lv2_outcomes[0]
rf_lv2_train_pred = rf_lv2_outcomes[1]
rf_lv2_test_pred = rf_lv2_outcomes[2]
Level 2 Logistic Regression
logit_lv2 = LogisticRegression(random_state=0, C=0.5)
logit_lv2_outcomes = cross_validate_sklearn(logit_lv2, lv1_train_df, y_train, lv1_test_df, kf,
                                            scale=True, verbose=True)
logit_lv2_cv = logit_lv2_outcomes[0]
logit_lv2_train_pred = logit_lv2_outcomes[1]
logit_lv2_test_pred = logit_lv2_outcomes[2]
Level 3 ensemble
This works just like the level-2 ensemble; you can of course also feed the level-1 OOF features in as additional inputs (a small sketch follows the next code block).
Generate L2 output dataframe
lv2_columns = ['rf_lv2', 'logit_lv2', 'xgb_lv2', 'lgb_lv2']
train_lv2_pred_list = [rf_lv2_train_pred, logit_lv2_train_pred, xgb_lv2_train_pred, lgb_lv2_train_pred]
test_lv2_pred_list = [rf_lv2_test_pred, logit_lv2_test_pred, xgb_lv2_test_pred, lgb_lv2_test_pred]

lv2_train = pd.DataFrame(columns=lv2_columns)
lv2_test = pd.DataFrame(columns=lv2_columns)

for i in range(0, len(lv2_columns)):
    lv2_train[lv2_columns[i]] = train_lv2_pred_list[i]
    lv2_test[lv2_columns[i]] = test_lv2_pred_list[i]
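As mentioned above, the level-1 OOF features can also be carried forward into level 3. A minimal sketch, assuming you simply want to reuse lv1_train_df / lv1_test_df as extra columns (the *_plus names are only illustrative):

# Optionally append the level-1 OOF features as additional level-3 inputs
lv2_train_plus = pd.concat([lv2_train, lv1_train_df.reset_index(drop=True)], axis=1)
lv2_test_plus = pd.concat([lv2_test, lv1_test_df.reset_index(drop=True)], axis=1)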
Level 3 XGB
xgb_lv3_params = {
    "booster": "gbtree",
    "objective": "binary:logistic",
    "tree_method": "hist",
    "eval_metric": "auc",
    "eta": 0.1,
    "max_depth": 2,
    "min_child_weight": 10,
    "gamma": 0.70,
    "subsample": 0.76,
    "colsample_bytree": 0.95,
    "nthread": 6,
    "seed": 0,
    "silent": 1
}

xgb_lv3_outcomes = cross_validate_xgb(xgb_lv3_params, lv2_train, y_train, lv2_test, kf,
                                      verbose=True, verbose_eval=False, use_rank=True)
xgb_lv3_cv = xgb_lv3_outcomes[0]
xgb_lv3_train_pred = xgb_lv3_outcomes[1]
xgb_lv3_test_pred = xgb_lv3_outcomes[2]
Level 3 Logistic Regression
logit_lv3 = LogisticRegression(random_state=0, C=0.5)
logit_lv3_outcomes = cross_validate_sklearn(logit_lv3, lv2_train, y_train, lv2_test, kf,
                                            scale=True, verbose=True)
logit_lv3_cv = logit_lv3_outcomes[0]
logit_lv3_train_pred = logit_lv3_outcomes[1]
logit_lv3_test_pred = logit_lv3_outcomes[2]
Average L3 outputs & Submission Generation
weight_avg = logit_lv3_train_pred * 0.5 + xgb_lv3_train_pred * 0.5
print(auc_to_gini_norm(roc_auc_score(y_train, weight_avg)))

submission = sample_submission.copy()
submission['target'] = logit_lv3_test_pred * 0.5 + xgb_lv3_test_pred * 0.5
filename = 'stacking_demonstration.csv.gz'
submission.to_csv(filename, compression='gzip', index=False)
Closing remarks
The three-level stacking above may not be optimal, but it should point you toward more useful ideas. Is a higher stacking level always better? In general, most people go only as far as level 2 or level 3. My own usual strategy is (a seed-averaging sketch follows the list):
- usually stop at level 2 and average the level-2 results;
- repeat the same stacking strategy with different random seeds;
- average the results of those runs.
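A minimal sketch of that last step, assuming a helper run_stacking(seed) that re-runs the whole pipeline above with the given seed for the CV splitter and models and returns the final test predictions (the helper is hypothetical, named here only for illustration):

# Average the final test predictions over several random seeds
seeds = [0, 2017, 42]
seed_preds = []
for seed in seeds:
    test_pred = run_stacking(seed)  # hypothetical: re-run the pipeline with this seed
    seed_preds.append(test_pred)
final_pred = np.mean(seed_preds, axis=0)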
Update: a function for tuning ensemble weights
The weights trained here are typically the weights used to linearly blend the final few models.
# encoding: utf-8
'''
Optimize the weights used in the final linear blend of models.
'''
import pandas as pd
import numpy as np
from scipy.optimize import minimize  # the optimizer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
import os

os.system("ls ../input")

# Data set
train = pd.read_csv("../input/train.csv")
print("Training set has {0[0]} rows and {0[1]} columns".format(train.shape))

labels = train['target']
train.drop(['target', 'id'], axis=1, inplace=True)
print(train.head())

### Split the data; the held-out part is used to train the ensemble weights
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.05, random_state=1234)
train_index, test_index = next(sss.split(train, labels))

train_x, train_y = train.values[train_index], labels.values[train_index]
test_x, test_y = train.values[test_index], labels.values[test_index]

### List of classifiers
clfs = []

rfc = RandomForestClassifier(n_estimators=50, random_state=4141, n_jobs=-1)
rfc.fit(train_x, train_y)
print('RFC LogLoss {score}'.format(score=log_loss(test_y, rfc.predict_proba(test_x))))
clfs.append(rfc)

### Usually you would use xgboost, lightgbm or NN models here;
### the logistic model below is only for demonstration
logreg = LogisticRegression()
logreg.fit(train_x, train_y)
print('LogisticRegression LogLoss {score}'.format(score=log_loss(test_y, logreg.predict_proba(test_x))))
clfs.append(logreg)

rfc2 = RandomForestClassifier(n_estimators=50, random_state=1337, n_jobs=-1)
rfc2.fit(train_x, train_y)
print('RFC2 LogLoss {score}'.format(score=log_loss(test_y, rfc2.predict_proba(test_x))))
clfs.append(rfc2)

### Train the ensemble weights
predictions = []
for clf in clfs:
    predictions.append(clf.predict_proba(test_x))

def log_loss_func(weights):
    ''' scipy minimize will pass the weights as a numpy array '''
    final_prediction = 0
    for weight, prediction in zip(weights, predictions):
        final_prediction += weight * prediction
    return log_loss(test_y, final_prediction)

# Use 0.5 as the starting weight for every model
# it is better to choose many random starting points and run minimize a few times
starting_values = [0.5] * len(predictions)

# Constraint: the weights must sum to 1
cons = ({'type': 'eq', 'fun': lambda w: 1 - sum(w)})
# Bounds on each weight
bounds = [(0, 1)] * len(predictions)

res = minimize(log_loss_func, starting_values, method='SLSQP', bounds=bounds, constraints=cons)

print('Ensemble Score: {best_score}'.format(best_score=res['fun']))
print('Best Weights: {weights}'.format(weights=res['x']))
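Once the optimizer has found the weights, applying them is just the same weighted sum over each model's predictions for the real test set. A tiny sketch, where test_set_probas is a hypothetical list of predict_proba outputs (one per classifier in clfs) on the actual test data, which the script above does not compute:

# Blend per-model test probabilities with the learned weights res['x']
# (test_set_probas is a hypothetical list of predict_proba outputs on the real test set)
blended = sum(w * p for w, p in zip(res['x'], test_set_probas))
submission_target = blended[:, 1]  # probability of the positive class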