Using Feature Importance for Feature Selection


In the previous post on feature selection in machine learning, we saw that tree models such as GBDT can serve as base models for feature selection. This post builds on that idea and covers XGBoost and LightGBM, the two boosting libraries most widely used beyond plain decision trees.

DecisionTree

A decision tree's feature_importances_ attribute reports each feature's importance as the total gain it contributed across all the splits in the tree where it was used.

The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

For background on information gain (Gain), see the earlier introduction to decision trees.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.feature_importances_

GradientBoosting and ExtraTrees expose feature_importances_ in the same way as DecisionTree.
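
As a quick illustration, here is a minimal sketch of ranking features with feature_importances_ and then selecting the strongest ones via SelectFromModel; the iris dataset and the "mean" threshold are illustrative choices, not from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.feature_importances_)  # normalized Gini importances; they sum to 1

# Keep only the features whose importance exceeds the mean importance
# ("mean" is an illustrative threshold choice).
selector = SelectFromModel(clf, threshold="mean", prefit=True)
X_selected = selector.transform(X)
print(X_selected.shape)  # fewer columns than X
```

The same pattern works for GradientBoostingClassifier and ExtraTreesClassifier, since they expose the same feature_importances_ attribute.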

XGBoost

get_score(fmap='', importance_type='weight')
Get feature importance of each feature. Importance type can be defined as:
‘weight’: the number of times a feature is used to split the data across all trees.
‘gain’: the average gain across all splits the feature is used in.
‘cover’: the average coverage across all splits the feature is used in.
‘total_gain’: the total gain across all splits the feature is used in.
‘total_cover’: the total coverage across all splits the feature is used in.

Here:

  • weight: the number of times the feature is chosen as a split feature.
  • gain: the average gain the feature brings when it is used (there are multiple trees): the sum of the gains of its splits divided by the number of times it is used, i.e. gain = total_gain / weight.
  • cover: the average number of samples covered by the splits that use the feature.
  • total_gain: the total gain brought by the feature's splits, summed over all trees.
  • total_cover: the total number of samples processed (covered) by the feature's splits, summed over all trees.

Reference: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_score
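
Here is a minimal sketch of reading all five importance types off a trained Booster; the iris dataset and the training parameters are illustrative assumptions.

```python
import xgboost as xgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "multi:softmax", "num_class": 3, "max_depth": 3}
booster = xgb.train(params, dtrain, num_boost_round=10)

# get_score returns a {feature_name: score} dict; features that are
# never used in any split are omitted from the dict.
for imp_type in ("weight", "gain", "cover", "total_gain", "total_cover"):
    print(imp_type, booster.get_score(importance_type=imp_type))
```

Dividing each feature's "total_gain" entry by its "weight" entry reproduces its "gain" entry, matching the gain = total_gain / weight relation above.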

LightGBM

feature_importance(importance_type='split', iteration=None)
Get feature importances.
  • importance_type (string, optional (default="split")) – How the importance is calculated. If "split", result contains numbers of times the feature is used in a model. If "gain", result contains total gains of splits which use the feature.
  • iteration (int or None, optional (default=None)) – Limit number of iterations in the feature importance calculation. If None, if the best iteration exists, it is used; otherwise, all trees are used. If <= 0, all trees are used (no limits).

Here:

  • split: the total number of times the feature is used to split, summed over all trees.
  • gain: the total gain of all the splits that use the feature, summed over all trees.

Note that LightGBM's "gain" is a total, so it corresponds to XGBoost's "total_gain" rather than to its (averaged) "gain".

Reference: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance
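
And a minimal sketch of both importance types on a trained LightGBM Booster; again, the iris dataset and the parameters are illustrative assumptions.

```python
import lightgbm as lgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
train_set = lgb.Dataset(X, label=y)

params = {"objective": "multiclass", "num_class": 3, "verbosity": -1}
booster = lgb.train(params, train_set, num_boost_round=10)

# Each call returns one value per feature, in column order.
print(booster.feature_importance(importance_type="split"))  # split counts
print(booster.feature_importance(importance_type="gain"))   # summed gains
```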

