Introduction: I recently came across William Koehrsen on Medium and found that he has shared dozens of high-quality Python data analysis articles. I will try to make time to translate his articles and share them here.
Author: William Koehrsen
Title: 《Random Forest Simple Explanation - Understanding the random forest with an intuitive example》
Translated by: 大邓
Yesterday I shared "五分钟带你了解随机森林" (a five-minute introduction to random forests); today we will walk through a small example of how to implement a random forest in Python.
Task Overview
Random forests are a supervised learning method, so training the model requires both a feature matrix X and a target vector. This article uses data from the NOAA climate site for Seattle, where the target (the dependent variable: the actual maximum temperature) is a continuous numeric value.
The Data
This article uses a CSV file of NOAA climate data for Seattle. The CSV has 9 fields:
- year: 2016 for all records
- month: the month of the year
- day: the day of the year
- week: the day of the week
- temp_2: the maximum temperature two days before this record
- temp_1: the maximum temperature one day before this record
- average: the historical average maximum temperature for this date
- actual: the actual maximum temperature on this day
- friend: a friend's prediction
Workflow
Before we start coding, we should lay out a brief plan to keep ourselves on track. Once we have a problem and a model in mind, the following steps form the basis of any machine learning workflow:
- Get the data
- Prepare the data for the machine learning model
- Establish a baseline model
- Train the model on the training data
- Make predictions on the test data
- Evaluate how well the trained model performs
Get the Data
import pandas as pd
features = pd.read_csv('temps.csv')
features.head(5)
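Before moving on, it can help to take a quick look at the data we just loaded; a minimal sketch (not in the original post) that checks the dimensions, summary statistics, and missing values:
# Quick sanity checks on the raw data loaded above
print('Shape:', features.shape)      # (number of rows, number of columns)
print(features.describe())           # summary statistics for the numeric columns
print(features.isnull().sum())       # count of missing values per column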
One-Hot Encoding
The week column in the data is text with 7 distinct values, so we one-hot encode it here. This column actually contributes very little to the model, but it is a good opportunity to practice some pandas along the way.
Before one-hot encoding:
After one-hot encoding:
features = pd.get_dummies(features)
features.head(5)
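To confirm the encoding worked, a small optional check that the single week column has been replaced by seven week_* indicator columns:
# The 'week' text column should now be seven week_* indicator columns
print('Shape after encoding:', features.shape)
print([col for col in features.columns if col.startswith('week_')])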
Feature Matrix and Target Vector
# Target vector (dependent variable)
targets = features['actual']
# Remove the 'actual' column from the feature matrix
# axis=1 means drop along the column axis
features = features.drop('actual', axis = 1)
# List of feature names
feature_list = list(features.columns)
Split the Data into Training and Test Sets
from sklearn.model_selection import train_test_split

train_features, test_features, train_targets, test_targets = train_test_split(
    features, targets, test_size=0.25, random_state=42)
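As a quick sanity check (not part of the original code), we can print the shapes of the resulting splits; roughly a quarter of the rows should end up in the test set:
# Verify the 75 / 25 split
print('Training features shape:', train_features.shape)
print('Testing features shape:', test_features.shape)
print('Training targets shape:', train_targets.shape)
print('Testing targets shape:', test_targets.shape)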
Establish a Baseline
To judge how good our trained model is, we need a reference baseline. Here we use the historical average as the baseline and then compare whether the trained random forest predicts better or worse than this average.
import numpy as np
# Select all rows of test_features
# Select the 'average' column of test_features
baseline_preds = test_features.loc[:, 'average']
baseline_errors = abs(baseline_preds - test_targets)
print('Average baseline error:', round(np.mean(baseline_errors), 2))
Output
Average baseline error: 5.06
Train the Random Forest Model
from sklearn.ensemble import RandomForestRegressor

# 1000 decision trees
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_targets)
Output
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
oob_score=False, random_state=42, verbose=0, warm_start=False)
Evaluate Model Performance
predictions = rf.predict(test_features)
errors = abs(predictions - test_targets)
print('Average error:', round(np.mean(errors), 2))
Output
Average error: 3.87
Accuracy
# Compute the mean absolute percentage error (MAPE)
mape = 100 * (errors / test_targets)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
Accuracy: 93.94 %.
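As an optional cross-check on the hand-computed error, scikit-learn's metrics module reports the same mean absolute error directly:
from sklearn.metrics import mean_absolute_error

# Should match the average error computed above
print('MAE:', round(mean_absolute_error(test_targets, predictions), 2))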
Visualize a Decision Tree
The model contains 1000 decision trees, and here I simply pick one of them to visualize. Note that this visualization step ran into problems under Python 3.7; it works fine under 3.6.
print('The model contains', len(rf.estimators_), 'decision trees')
Output
The model contains 1000 decision trees
Inspect the first 5 decision trees in the model
# Take a look at the first 5 of the 1000 decision trees
rf.estimators_[:5]
Output
[DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1608637542, splitter='best'),
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1273642419, splitter='best'),
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1935803228, splitter='best'),
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=787846414, splitter='best'),
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=996406378, splitter='best')]
In this article we pick just one decision tree at random and visualize it.
from sklearn.tree import export_graphviz
import pydot

# Out of the 1000 decision trees, let's go with the sixth one (index 5)
tree = rf.estimators_[5]
# Export the decision tree to a dot file
export_graphviz(tree,
                out_file = 'tree.dot',
                feature_names = feature_list,
                rounded = True,
                precision = 1)
# Convert the dot file into a graph object
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write the graph out as a png image
graph.write_png('tree.png')
print('The maximum depth (number of levels) of this tree is:', tree.tree_.max_depth)
Output
The maximum depth (number of levels) of this tree is: 13
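Note that graph.write_png works by having pydot call the Graphviz dot executable, so Graphviz itself must be installed on the system. A quick, optional way to check whether dot is on the PATH:
import shutil

# Prints the path of the Graphviz `dot` executable, or None if it is not installed
print(shutil.which('dot'))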
A tree with that many levels is too complex to read. Let's simplify it by setting max_depth=3.
rf_small = RandomForestRegressor(n_estimators=10, max_depth = 3, random_state=42)
rf_small.fit(train_features, train_targets)
tree_small = rf_small.estimators_[5]
export_graphviz(tree_small, out_file = 'small_tree.dot',
                feature_names = feature_list,
                rounded = True,
                precision = 1)
(graph, ) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png')
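The smaller forest is easier to read but usually less accurate. If you want to quantify that trade-off, it can be scored in the same way as the full model (a sketch, not run in the original post):
# Evaluate the simplified forest on the same test set
small_predictions = rf_small.predict(test_features)
small_errors = abs(small_predictions - test_targets)
print('Average error of the small forest:', round(np.mean(small_errors), 2))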
Feature Importances
# Get the feature importances
importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance, 2))
                       for feature, importance in zip(feature_list, importances)]
# Sort by importance, from high to low
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]
Output
Variable: temp_1               Importance: 0.66
Variable: average              Importance: 0.15
Variable: forecast_noaa        Importance: 0.05
Variable: forecast_acc         Importance: 0.03
Variable: day                  Importance: 0.02
Variable: temp_2               Importance: 0.02
Variable: forecast_under       Importance: 0.02
Variable: friend               Importance: 0.02
Variable: month                Importance: 0.01
Variable: year                 Importance: 0.0
Variable: week_Fri             Importance: 0.0
Variable: week_Mon             Importance: 0.0
Variable: week_Sat             Importance: 0.0
Variable: week_Sun             Importance: 0.0
Variable: week_Thurs           Importance: 0.0
Variable: week_Tues            Importance: 0.0
Variable: week_Wed             Importance: 0.0
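Since temp_1 and average carry almost all of the importance, a natural follow-up experiment (sketched here on the assumption that train_features and test_features are still DataFrames) is to retrain the forest on only those two columns and see how much accuracy is lost:
# Retrain using only the two most important features
important_cols = ['temp_1', 'average']
rf_important = RandomForestRegressor(n_estimators=1000, random_state=42)
rf_important.fit(train_features[important_cols], train_targets)

important_predictions = rf_important.predict(test_features[important_cols])
important_errors = abs(important_predictions - test_targets)
print('Average error with two features:', round(np.mean(important_errors), 2))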
Visualize Feature Importances
import matplotlib.pyplot as plt
%matplotlib inline
# Set the plot style
plt.style.use('fivethirtyeight')
# list of x locations for plotting
x_values = list(range(len(importances)))
# Make a bar chart
plt.bar(x_values, importances, orientation = 'vertical')
# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');
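Another quick visual check (an optional sketch using the variables defined earlier) is to plot the full model's predictions against the actual temperatures; points near the diagonal are accurate predictions:
# Predicted vs. actual maximum temperatures for the test set
plt.figure()
plt.scatter(test_targets, predictions, alpha=0.5)
plt.plot([test_targets.min(), test_targets.max()],
         [test_targets.min(), test_targets.max()], 'k--')
plt.xlabel('Actual Max Temp'); plt.ylabel('Predicted Max Temp'); plt.title('Predictions vs Actual');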