[Translation] Machine Learning: Random Forest in Practice

Category: Databases · Published: 7 years ago


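I recently came across William Koehrsen on Medium, who has published dozens of high-quality Python data analysis articles. I will try to find the time to translate his articles and share them here.

Author: William Koehrsen

Title: "Random Forest Simple Explanation: Understanding the random forest with an intuitive example"

Translator: 大邓 (Da Deng)

Yesterday's post was a five-minute introduction to random forests. Today we work through a small case study to see how to implement a random forest in Python.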

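Task Overview

Random forest is a supervised learning method, so training requires both a feature matrix X and a target vector. This article uses data from NOAA's climate site for Seattle. The target (the dependent variable, the actual temperature) is a continuous value, which makes this a regression task.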

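Data Overview

The data is a CSV file from NOAA's climate site for Seattle. It has 9 fields:

year: the year (2016)
month: month of the year
day: day of the year
week: day of the week
temp_2: maximum temperature 2 days before this record
temp_1: maximum temperature 1 day before this record
average: historical average maximum temperature for this day
actual: actual maximum temperature on this day
friend: a friend's prediction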

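Workflow

Before we start coding, we should lay out a brief plan to keep us on track. Once we have a problem and a model in mind, the following steps form the basis of any machine learning workflow:

Acquire the data
Prepare the data for the machine learning model
Establish a baseline model
Train the model on the training data
Make predictions on the test data
Evaluate how well the trained model performs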

Acquiring the Data

import pandas as pd

features = pd.read_csv('temps.csv')
features.head(5)


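One-Hot Encoding

The week column holds text values with seven distinct categories, so we one-hot encode it here. The week column actually helps model training very little; treat this step mainly as pandas practice.

[Figure: the data before one-hot encoding]

[Figure: the data after one-hot encoding]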

features = pd.get_dummies(features)
features.head(5)
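
To make the effect of the encoding concrete, here is a minimal sketch on a tiny made-up frame. The values are assumptions for illustration only, not rows from temps.csv; only the column names temp_1 and week are borrowed from the article.

```python
import pandas as pd

# Toy frame (hypothetical values, not rows from temps.csv)
toy = pd.DataFrame({
    'temp_1': [45, 44, 41],
    'week': ['Fri', 'Sat', 'Sun'],
})

# get_dummies leaves numeric columns alone and expands each text column
# into one indicator column per distinct value, named <column>_<value>
encoded = pd.get_dummies(toy)

print(list(encoded.columns))
# ['temp_1', 'week_Fri', 'week_Sat', 'week_Sun']
```

Each row has exactly one of the week_* indicators set, which is why the seven weekday values later show up as seven separate features.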


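Feature Matrix and Target Vector

# Target vector (the dependent variable)
targets = features['actual']

# Remove the actual column from the feature matrix
# axis=1 means we are dropping a column rather than a row
features = features.drop('actual', axis=1)

# List of feature names
feature_list = list(features.columns)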

Splitting the Data into Training and Test Sets

from sklearn.model_selection import train_test_split

train_features, test_features, train_targets, test_targets = train_test_split(
    features, targets, test_size=0.25, random_state=42)
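
As a quick sanity check on what the split returns, the sketch below runs the same call on synthetic data. The array sizes and the random generator seed are assumptions chosen for illustration; the real data has a different number of rows and columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)  # (75, 4) (25, 4)
print(y_train.shape, y_test.shape)  # (75,) (25,)

# Fixing random_state makes the split reproducible across runs
X_train_again, _, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(np.array_equal(X_train, X_train_again))  # True
```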

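Establishing a Baseline Model

To judge whether the trained model is any good, we need a reference baseline. Here we use the average column (the historical average maximum temperature) as the baseline prediction and check whether the random forest predicts better than it does.

import numpy as np

# Select every row of test_features
# and only the average column
baseline_preds = test_features.loc[:, 'average']

# Absolute error of the baseline against the true values
baseline_errors = abs(baseline_preds - test_targets)
print('Average baseline error:', round(np.mean(baseline_errors), 2))

Output

Average baseline error: 5.06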
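
The baseline figure is a mean absolute error (MAE). The following minimal sketch works the arithmetic on four made-up values (assumptions, not taken from temps.csv) and cross-checks it against scikit-learn's mean_absolute_error.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Hypothetical values for illustration (not rows from temps.csv)
toy_average = pd.Series([50, 55, 60, 65])   # baseline prediction
toy_actual = pd.Series([52, 50, 61, 70])    # observed temperature

# Same arithmetic as the baseline: mean of the absolute differences
toy_errors = abs(toy_average - toy_actual)  # 2, 5, 1, 5
manual_mae = float(np.mean(toy_errors))
print(manual_mae)  # 3.25

# The library function gives the same number
library_mae = float(mean_absolute_error(toy_actual, toy_average))
print(library_mae)  # 3.25
```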

Training the Random Forest Model

from sklearn.ensemble import RandomForestRegressor

# Use 1000 decision trees
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_targets)

Output

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

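Evaluating the Trained Model

# Predict on the held-out test features
predictions = rf.predict(test_features)

# Absolute error of each prediction
errors = abs(predictions - test_targets)

print('Mean absolute error:', round(np.mean(errors), 2))

Output

Mean absolute error: 3.87

This is lower than the baseline error of 5.06, so the random forest beats the baseline.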

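Accuracy

# Per-sample absolute percentage errors; their mean is the
# mean absolute percentage error (MAPE)
mape = 100 * (errors / test_targets)

# Accuracy is defined here as 100% minus the MAPE
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Output

Accuracy: 93.94 %.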
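
Here is the same accuracy calculation worked on three made-up values, as a sketch of what the number means. The temperatures are assumptions for illustration. Note that this definition divides by the true value, so it would break down if any target were zero.

```python
import numpy as np

# Hypothetical values for illustration
toy_actual = np.array([50.0, 60.0, 80.0])
toy_pred = np.array([45.0, 63.0, 84.0])

toy_errors = np.abs(toy_pred - toy_actual)      # 5, 3, 4
toy_pct = 100 * (toy_errors / toy_actual)       # 10%, 5%, 5%

# 100 minus the mean absolute percentage error
toy_accuracy = 100 - np.mean(toy_pct)
print(round(float(toy_accuracy), 2))  # 93.33
```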

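Visualizing a Decision Tree

The model contains 1000 decision trees, so we pick just one to visualize. Note that I found the visualization code fails under Python 3.7 but runs fine under 3.6.

print('Number of decision trees in the model:', len(rf.estimators_))

Output

Number of decision trees in the model: 1000

Look at the first five decision trees in the model:

# Take the first 5 of the 1000 decision trees
rf.estimators_[:5]

Output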

[DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1608637542, splitter='best'),
 DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1273642419, splitter='best'),
 DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1935803228, splitter='best'),
 DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=787846414, splitter='best'),
 DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=996406378, splitter='best')]
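
Each entry in estimators_ is an ordinary fitted DecisionTreeRegressor, and the forest's prediction for regression is the average of the individual tree predictions. The sketch below checks this on synthetic data; the data, the forest size of 20, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(X, y)

# Predict with every tree separately, then average across trees
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

print(per_tree.shape)  # (20, 5)
print(np.allclose(manual_mean, forest.predict(X[:5])))  # True
```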

In this article we pick just one decision tree to visualize.

from sklearn.tree import export_graphviz
import pydot

# Out of the 1000 trees, pick the sixth one (index 5); the choice is arbitrary
tree = rf.estimators_[5]

# Export the decision tree to a dot file
export_graphviz(tree,
                out_file='tree.dot',
                feature_names=feature_list,
                rounded=True,
                precision=1)

# Load the dot file as a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')

# Write the graph to a PNG image file
graph.write_png('tree.png')


print('Maximum depth (number of levels) of this tree:', tree.tree_.max_depth)

Output

Maximum depth (number of levels) of this tree: 13

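A tree this deep is too complex to read. To get a simpler tree, we train a smaller forest with max_depth=3.

rf_small = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)
# Fit on the same training features and targets as before
rf_small.fit(train_features, train_targets)

# Pick the sixth tree (index 5) from the small forest
tree_small = rf_small.estimators_[5]

# Export the small tree to a dot file
export_graphviz(tree_small, out_file='small_tree.dot',
                feature_names=feature_list,
                rounded=True,
                precision=1)

# Load the dot file as a graph
(graph, ) = pydot.graph_from_dot_file('small_tree.dot')

# Write the graph to a PNG image file
graph.write_png('small_tree.png')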
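
If pydot or Graphviz is not available, one alternative (not used in the original article) is scikit-learn's export_text, which prints a tree's split rules as plain text. The sketch below is a minimal example on synthetic data; the data and the two feature names passed in are assumptions chosen to echo the article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic data with two features (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 0.7 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=2.0, size=200)
names = ['temp_1', 'average']

small = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)
small.fit(X, y)

# Same pick as above: the sixth tree (index 5)
tree_small = small.estimators_[5]

# Render the split rules as indented text instead of an image
report = export_text(tree_small, feature_names=names)
print(report)

# The depth cap is respected by every tree in the forest
print(tree_small.tree_.max_depth <= 3)  # True
```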


Feature Importances

# Get the feature importances from the trained forest
importances = list(rf.feature_importances_)

# Pair each feature name with its importance, rounded to 2 decimals
feature_importances = [(feature, round(importance, 2))
                       for feature, importance in zip(feature_list, importances)]

# Sort by importance, highest first
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

# Print each feature and its importance
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

Output

Variable: temp_1               Importance: 0.66
Variable: average              Importance: 0.15
Variable: forecast_noaa        Importance: 0.05
Variable: forecast_acc         Importance: 0.03
Variable: day                  Importance: 0.02
Variable: temp_2               Importance: 0.02
Variable: forecast_under       Importance: 0.02
Variable: friend               Importance: 0.02
Variable: month                Importance: 0.01
Variable: year                 Importance: 0.0
Variable: week_Fri             Importance: 0.0
Variable: week_Mon             Importance: 0.0
Variable: week_Sat             Importance: 0.0
Variable: week_Sun             Importance: 0.0
Variable: week_Thurs           Importance: 0.0
Variable: week_Tues            Importance: 0.0
Variable: week_Wed             Importance: 0.0
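
To build some intuition for these numbers, the sketch below trains a forest on synthetic data in which only one column drives the target. The data and the feature names signal, noise_a, and noise_b are hypothetical. The importances are normalized to sum to 1, and the informative column should rank first.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
# Only the first column drives the target; the other two are pure noise
y = 5.0 * X[:, 0] + rng.normal(scale=0.5, size=300)
names = ['signal', 'noise_a', 'noise_b']  # hypothetical feature names

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X, y)

# Rank features from most to least important
ranked = sorted(zip(names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print('Variable: {:10} Importance: {:.2f}'.format(name, importance))

# Importances are normalized, so they add up to 1
print(round(float(forest.feature_importances_.sum()), 6))  # 1.0
```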

Visualizing Feature Importances

import matplotlib.pyplot as plt
# Jupyter notebook magic; omit this line when running as a plain script
%matplotlib inline

# Set the plot style
plt.style.use('fivethirtyeight')

# List of x locations for plotting
x_values = list(range(len(importances)))

# Make a bar chart
plt.bar(x_values, importances, orientation='vertical')

# Tick labels for the x axis
plt.xticks(x_values, feature_list, rotation='vertical')

# Axis labels and title
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')


