Learn from Top Kagglers: Advanced Feature Engineering II

Category: Databases · Published: 6 years ago


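These are notes on the Coursera course How to Win a Data Science Competition: Learn from Top Kagglers.

This article covers feature engineering techniques commonly used in data science competitions. It is the second part of the article.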

If you are reading this article on a computer, consider opening the original post to view the Jupyter notebook file.

Statistics and distance based features

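This section focuses on two advanced feature engineering techniques: computing various statistics of one feature grouped by another, and deriving features from an analysis of the neighborhood of a given point.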

groupby and nearest neighbor methods

Example: here is some data from a CTR task


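We can assume that the ad with the lowest price on a page will attract most of the attention, while the other ads on the page will be much less attractive. Features that capture this assumption are easy to compute: for each user and web page pair, we can add the lowest and highest ad price. The position of the ad with the lowest price can also be used as a feature.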


Code implementation
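
The min and max price features described above can be sketched with a pandas groupby. This is a minimal sketch, not the code from the original post: the toy data and the column names (user_id, page_id, ad_price, ad_position) are assumptions made for illustration.

```python
import pandas as pd

# Toy CTR log; all column names and values are hypothetical.
df = pd.DataFrame({
    "user_id":     [1, 1, 1, 2, 2],
    "page_id":     [10, 10, 10, 10, 10],
    "ad_price":    [100.0, 50.0, 80.0, 30.0, 70.0],
    "ad_position": [1, 2, 3, 1, 2],
})
keys = ["user_id", "page_id"]

# Lowest and highest price among the ads shown to this user on this page.
df["min_price"] = df.groupby(keys)["ad_price"].transform("min")
df["max_price"] = df.groupby(keys)["ad_price"].transform("max")

# Row label of the cheapest ad in each (user, page) group.
cheapest_rows = df.groupby(keys)["ad_price"].idxmin()

# Position of that cheapest ad, merged back onto every row of the group.
min_price_pos = (
    df.loc[cheapest_rows, keys + ["ad_position"]]
      .rename(columns={"ad_position": "min_price_position"})
)
df = df.merge(min_price_pos, on=keys, how="left")
```

Using transform keeps the result aligned with the original rows, so the new columns can be assigned directly without a merge.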

More features

How many pages the user visited
Standard deviation of prices
Most visited page
Many, many more
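
The additional per-user statistics listed above can be computed in one groupby with named aggregations. The sketch below assumes a hypothetical visit log with columns user_id, page_id and ad_price.

```python
import pandas as pd

# Hypothetical visit log; column names and values are illustrative only.
log = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 2],
    "page_id":  [10, 10, 11, 12, 12, 10],
    "ad_price": [100.0, 50.0, 80.0, 30.0, 70.0, 50.0],
})

per_user = log.groupby("user_id").agg(
    n_pages=("page_id", "nunique"),    # how many distinct pages the user visited
    price_std=("ad_price", "std"),     # standard deviation of prices seen by the user
    most_visited_page=("page_id", lambda s: s.value_counts().idxmax()),
).reset_index()

# Attach the per-user statistics to every row of the log.
log = log.merge(per_user, on="user_id", how="left")
```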

What if there is no feature that can be used in a groupby like this? In that case we can use nearest neighbors.

Neighbors

Explicit group is not needed
More flexible
Much harder to implement

Examples

Number of houses within 500 m, 1000 m, ...
Average price per square meter within 500 m, 1000 m, ...
Number of schools/supermarkets/parking lots within 500 m, 1000 m, ...
Distance to the closest subway station
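
Radius-based features like the ones listed above can be sketched with sklearn's NearestNeighbors. The coordinates, prices and subway locations below are made up, and the sketch assumes planar coordinates measured in meters so that Euclidean distance is meaningful.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical planar coordinates (meters) and price per square meter.
houses = np.array([[0.0, 0.0], [300.0, 0.0], [0.0, 350.0],
                   [900.0, 0.0], [5000.0, 5000.0]])
price_per_m2 = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
subways = np.array([[100.0, 0.0], [4000.0, 5000.0]])

def radius_features(points, values, radius):
    """Count of other points within `radius` and the mean of their values."""
    nn = NearestNeighbors(radius=radius).fit(points)
    neighbor_idx = nn.radius_neighbors(points, return_distance=False)
    counts, means = [], []
    for i, idx in enumerate(neighbor_idx):
        idx = idx[idx != i]  # drop the house itself
        counts.append(len(idx))
        means.append(values[idx].mean() if len(idx) else np.nan)
    return np.array(counts), np.array(means)

n_500, mean_price_500 = radius_features(houses, price_per_m2, 500.0)
n_1000, mean_price_1000 = radius_features(houses, price_per_m2, 1000.0)

# Distance from every house to its closest subway station.
subway_nn = NearestNeighbors(n_neighbors=1).fit(subways)
dist_to_subway = subway_nn.kneighbors(houses)[0].ravel()
```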

The instructor used this approach in the Springleaf competition.

KNN features in Springleaf

Mean encode all the variables
For every point, find the 2000 nearest neighbors using the Bray-Curtis metric
Calculate various features from those 2000 neighbors

Evaluate

Mean target of the nearest 5, 10, 15, 500, 2000 neighbors
Mean distance to 10 closest neighbors
Mean distance to 10 closest neighbors with target 1
Mean distance to 10 closest neighbors with target 0
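
The KNN features listed above can be sketched as follows. This is not the instructor's Springleaf code: the helper knn_features, its parameters and the random data are hypothetical, far fewer than 2000 neighbors are used to keep the example small, and the sketch assumes there are no duplicate rows, so that every point's nearest neighbor is itself. In a real competition the target-based features would also need to be computed out-of-fold to avoid leakage, which is omitted here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_features(X, y, k_list=(5, 10), n_closest=10):
    """KNN features for the rows of X, computed from their neighbors within X."""
    k_max = max(max(k_list), n_closest)
    nn = NearestNeighbors(n_neighbors=k_max + 1, metric="braycurtis",
                          algorithm="brute").fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]  # drop column 0, the point itself
    neighbor_y = y[idx]                  # neighbor targets, nearest first

    feats = {}
    for k in k_list:
        feats[f"mean_target_{k}"] = neighbor_y[:, :k].mean(axis=1)
    feats[f"mean_dist_{n_closest}"] = dist[:, :n_closest].mean(axis=1)

    # Mean distance to the closest neighbors of each class, taken from the
    # k_max neighbors already found; NaN if no neighbor has that class.
    for cls in (0, 1):
        out = np.full(len(X), np.nan)
        for i in range(len(X)):
            d = dist[i][neighbor_y[i] == cls][:n_closest]
            if d.size:
                out[i] = d.mean()
        feats[f"mean_dist_{n_closest}_target_{cls}"] = out
    return feats

rng = np.random.default_rng(0)
X = rng.random((200, 5))              # stands in for mean-encoded variables
y = (X[:, 0] > 0.5).astype(int)
features = knn_features(X, y, k_list=(5, 10, 15), n_closest=10)
```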

Matrix factorizations for feature extraction

Example of feature fusion

Notes about Matrix Factorization

Can be applied to only some columns
Can provide additional diversity

Good for ensembles

It is a lossy transformation. Its efficiency depends on:

Particular task
Number of latent factors (usually 5-100)

Implementation

Several MF methods can be found in sklearn
SVD and PCA

Standard tools for Matrix Factorization

TruncatedSVD

Works with sparse matrices

Non-negative Matrix Factorization (NMF)

Ensures that all latent factors are non-negative
Good for counts-like data
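
A minimal sketch of the two sklearn tools named above, applied to a sparse matrix of counts. The random Poisson data and the choice of 5 latent factors are assumptions made only for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import NMF, TruncatedSVD

# Hypothetical sparse count data, e.g. a bag-of-words matrix.
rng = np.random.default_rng(0)
counts = sparse.csr_matrix(rng.poisson(1.0, size=(100, 30)).astype(float))

# TruncatedSVD accepts sparse input directly.
svd = TruncatedSVD(n_components=5, random_state=0)
svd_features = svd.fit_transform(counts)

# NMF requires non-negative input and yields non-negative latent factors.
nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
nmf_features = nmf.fit_transform(counts)
```

The resulting 5-column matrices can be appended to the original features or used on their own as a compressed representation.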

NMF for tree-based methods

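Non-negative matrix factorization (NMF) transforms the data in a way that makes it more suitable for decision trees.

NMF transforms the data so that it forms lines parallel to the axes.

Factorization

Tricks that work for linear models can also be used when factorizing a matrix.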

Conclusion

Matrix Factorization is a very general approach for dimensionality reduction and feature extraction
It can be applied to transform categorical features into real-valued ones
Many of the tricks suitable for linear models can be useful for MF

Feature interactions

All combinations of feature values

Example: banner selection

Suppose we are building a model that predicts the best advertising banner to display on a website.

…	category_ad	category_site	…	is_clicked
…	auto_part	game_news	…	0
…	music_tickets	music_news	..	1
…	mobile_phones	auto_blog	…	0

Combining the category of the banner itself with the category of the site where it will be shown produces a very strong feature.

…	ad_site	…	is_clicked
…	auto_part \| game_news	…	0
…	music_tickets \| music_news	..	1
…	mobile_phones \| auto_blog	…	0

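The combined feature ad_site is built from these two features.

From a technical point of view, there are two ways to build such an interaction.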

Example of interactions

Method 1

Method 2
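
The two methods were shown only as images in the original post. The sketch below assumes they correspond to the two usual constructions: (A) concatenate the category strings and one-hot encode the result, and (B) one-hot encode each feature separately and multiply the columns pairwise. The data reuses the banner example above.

```python
import pandas as pd

df = pd.DataFrame({
    "category_ad":   ["auto_part", "music_tickets", "mobile_phones"],
    "category_site": ["game_news", "music_news", "auto_blog"],
})

# Approach A: concatenate the two values, then one-hot encode the combination.
df["ad_site"] = df["category_ad"] + " | " + df["category_site"]
onehot_a = pd.get_dummies(df["ad_site"], dtype=int)

# Approach B: one-hot encode each feature, then multiply every pair of columns.
ad = pd.get_dummies(df["category_ad"], dtype=int)
site = pd.get_dummies(df["category_site"], dtype=int)
onehot_b = pd.DataFrame(
    {f"{a} | {s}": ad[a] * site[s] for a in ad.columns for s in site.columns}
)
```

Approach A only creates columns for combinations that occur in the data, while approach B creates a column for every possible pair; on the combinations that do occur, the two encodings agree.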

A similar idea can also be applied to numerical variables

In fact, this is not limited to multiplication; other operations can be used as well:

Multiplication
Sum
Diff
Division
..
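
The operations listed above can be generated for every pair of numerical columns with a short loop. The column names f1, f2, f3 and the naming scheme for the new features are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0],
    "f2": [2.0, 4.0, 0.0],
    "f3": [5.0, 5.0, 5.0],
})

interactions = {}
for a, b in combinations(df.columns, 2):
    interactions[f"{a}_mul_{b}"] = df[a] * df[b]
    interactions[f"{a}_sum_{b}"] = df[a] + df[b]
    interactions[f"{a}_diff_{b}"] = df[a] - df[b]
    # Division by zero yields inf; replace it with NaN so it reads as missing.
    interactions[f"{a}_div_{b}"] = (df[a] / df[b]).replace([np.inf, -np.inf], np.nan)

inter_df = pd.DataFrame(interactions)
```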

Practical Notes

We have a lot of possible interactions: N*N for N features.

a. Even more if several types of interactions are used

Need to reduce their number

a. Dimensionality reduction
b. Feature selection

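This approach generates a large number of features, which can be reduced with feature selection or dimensionality reduction. The original post illustrates this with feature selection.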
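
One possible way to do this selection, sketched under the assumption that random forest feature importances are used as the selection criterion (the original illustration is not available): generate all pairwise products, fit a forest on them, and keep the most important ones. The synthetic data is built so that the product f0*f1 determines the target.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f0", "f1", "f2", "f3"])
y = (X["f0"] * X["f1"] > 0).astype(int)  # synthetic target driven by f0 * f1

# Candidate interactions: all pairwise products.
candidates = pd.DataFrame(
    {f"{a}*{b}": X[a] * X[b] for a, b in combinations(X.columns, 2)}
)

# Rank the candidates by random forest importance and keep the top ones.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(candidates, y)
importance = pd.Series(forest.feature_importances_, index=candidates.columns)
top_features = importance.sort_values(ascending=False).head(2).index.tolist()
```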

Interactions’ order

We looked at 2nd order interactions.
Such approach can be generalized for higher orders.
It is hard to do generation and selection automatically.
Manual building of high-order interactions is some kind of art.

Extract features from DT

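Consider a decision tree. Let us map each leaf to a binary feature. The index of the leaf that an object falls into can be used as the value of a new categorical feature. If we use an ensemble of trees rather than a single tree, for example a random forest, this operation can be applied to each tree. This is a powerful way to extract high-order interactions.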

How to use it

In sklearn:

tree_model.apply()

In xgboost:

booster.predict(pred_leaf=True)
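
A minimal sklearn sketch of the leaf-index idea, using a small random forest on synthetic data. The dataset and hyperparameters are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0).fit(X, y)

# apply() returns, for every sample, the index of the leaf it lands in per tree.
leaf_idx = forest.apply(X)  # shape: (n_samples, n_estimators)

# Treat each tree's leaf index as a categorical feature and one-hot encode it.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaf_idx)  # sparse binary matrix
```

Each row of leaf_features has exactly one active column per tree, and the matrix can be fed to a linear model as a set of high-order interaction features.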

Conclusion

We looked at ways to build an interaction of categorical attributes
Extended this approach to real-valued features
Learned how to extract features via decision trees

t-SNE

t-SNE is used for exploratory data analysis. It can also be viewed as a way to obtain features from the data.

Practical Notes

Result heavily depends on hyperparameters (perplexity)

Good practice is to use several projections with different perplexities (5-100)

Due to its stochastic nature, tSNE provides different projections even for the same data and hyperparameters

Train and test should be projected together

tSNE runs for a long time with a large number of features

It is common to do dimensionality reduction before projection.

An implementation of tSNE can be found in the sklearn library.
But personally I prefer the stand-alone Python package tsne because it is faster.
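
The practical notes above can be put together in a short sketch using sklearn's TSNE (not the stand-alone tsne package): train and test are stacked and projected together, PCA is applied first, and several perplexities are used. The random data, the 10 PCA components and the two perplexity values are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
train = rng.normal(size=(60, 50))
test = rng.normal(size=(40, 50))

# Project train and test together so both live in the same embedding.
full = np.vstack([train, test])

# Reduce dimensionality before running t-SNE.
reduced = PCA(n_components=10, random_state=0).fit_transform(full)

# One 2-D projection per perplexity; a fixed random_state makes each run repeatable.
perplexities = (5, 30)
projections = [
    TSNE(n_components=2, perplexity=p, random_state=0).fit_transform(reduced)
    for p in perplexities
]

# Stack the projections side by side and split back into train and test.
tsne_features = np.hstack(projections)
train_tsne, test_tsne = tsne_features[:len(train)], tsne_features[len(train):]
```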

Conclusion

tSNE is a great tool for visualization
It can be used as a feature as well
Be careful with interpretation of results
Try different perplexities

Matrix factorization:

Overview of matrix decomposition methods (sklearn) (http://scikit-learn.org/stable/modules/decomposition.html)

t-SNE:

Multicore t-SNE implementation (https://github.com/DmitryUlyanov/Multicore-TSNE)

Comparison of manifold learning methods (sklearn) (http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html)

How to use t-SNE effectively (distill.pub blog) (https://distill.pub/2016/misread-tsne/)

tSNE homepage (Laurens van der Maaten) (https://lvdmaaten.github.io/tsne/)

Example: tSNE with different perplexities (sklearn) (http://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html#sphx-glr-auto-examples-manifold-plot-t-sne-perplexity-py)

Interactions:

Facebook Research paper on extracting categorical features from trees (https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/)

Example: feature transformations with ensembles of trees (sklearn) (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html)
