7 open-source Python AI projects and machine learning libraries you may have missed in 2018: David 9's picks

Category: Python · Published: 6 years ago


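2018 may well have been the first year of serious exploration in AutoML (automated machine learning), so let's start there.

1. AdaNet: an open-source, TensorFlow-based project for automatically learning neural networks.

You may have heard of Auto-Keras and Auto-Sklearn before, but if you want to take AutoML for neural networks seriously, AdaNet has a lot worth borrowing.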

Figure source: https://github.com/tensorflow/adanet

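As the figure above shows, AdaNet tries different candidate structures and parameters for the network layers. It also maintains its own regularized objective, the AdaNet loss.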
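For reference, the regularized loss mentioned above has the following form in the AdaNet project documentation and in Cortes et al. (ICML 2017). The symbols come from those sources, not from this article:

```latex
F(w) = \frac{1}{m}\sum_{i=1}^{m} \Phi\Big(\sum_{j=1}^{N} w_j\, h_j(x_i),\; y_i\Big)
       + \sum_{j=1}^{N} \big(\lambda\, r(h_j) + \beta\big)\,\lvert w_j \rvert
```

Here the h_j are the candidate subnetworks, the w_j are their mixture weights, Φ is the loss on the m training examples, r(h_j) measures the complexity of subnetwork h_j, and λ and β are regularization hyperparameters. The first term is the ensemble's training loss; the second penalizes weight placed on complex subnetworks.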

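To get started with AdaNet, first read through the tutorial in the project: https://github.com/tensorflow/adanet/blob/master/adanet/examples/tutorials/adanet_objective.ipynb , and make sure you understand how to build a subnetwork generator from AdaNet's existing classes: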

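import functools

import adanet
import tensorflow as tf

# _SimpleDNNBuilder and _NUM_LAYERS_KEY are defined elsewhere in the linked
# tutorial notebook.


class SimpleDNNGenerator(adanet.subnetwork.Generator):
  """Generates two DNN subnetworks at each iteration.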

  The first DNN has an identical shape to the most recently added subnetwork
  in `previous_ensemble`. The second has the same shape plus one more dense
  layer on top. This is similar to the adaptive network presented in Figure 2 of
  [Cortes et al. ICML 2017](https://arxiv.org/abs/1607.01097), without the
  connections to hidden layers of networks from previous iterations.
  """

  def __init__(self,
               optimizer,
               layer_size=64,
               learn_mixture_weights=False,
               seed=None):
    """Initializes a DNN `Generator`.

    Args:
      optimizer: An `Optimizer` instance for training both the subnetwork and
        the mixture weights.
      layer_size: Number of nodes in each hidden layer of the subnetwork
        candidates. Note that this parameter is ignored in a DNN with no hidden
        layers.
      learn_mixture_weights: Whether to solve a learning problem to find the
        best mixture weights, or use their default value according to the
        mixture weight type. When `False`, the subnetworks will return a no_op
        for the mixture weight train op.
      seed: A random seed.

    Returns:
      An instance of `Generator`.
    """

    self._seed = seed
    self._dnn_builder_fn = functools.partial(
        _SimpleDNNBuilder,
        optimizer=optimizer,
        layer_size=layer_size,
        learn_mixture_weights=learn_mixture_weights)

  def generate_candidates(self, previous_ensemble, iteration_number,
                          previous_ensemble_reports, all_reports):
    """See `adanet.subnetwork.Generator`."""

    num_layers = 0
    seed = self._seed
    if previous_ensemble:
      num_layers = tf.contrib.util.constant_value(
          previous_ensemble.weighted_subnetworks[
              -1].subnetwork.persisted_tensors[_NUM_LAYERS_KEY])
    if seed is not None:
      seed += iteration_number
    return [
        self._dnn_builder_fn(num_layers=num_layers, seed=seed),
        self._dnn_builder_fn(num_layers=num_layers + 1, seed=seed),
    ]

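2. TPOT: thoughtful enough to handle feature selection, model selection, and model optimization in one go

TPOT tries to fold the tedious steps of feature selection, model selection, and model optimization into a single optimization run, and writes the result to a separate .py file: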

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')

Running the code above optimizes the pipeline automatically and writes out the tpot_mnist_pipeline.py code file:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)


exported_pipeline = KNeighborsClassifier(n_neighbors=6, weights="distance")

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

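TPOT is built on scikit-learn, however, and without substantial optimization its running time can become unbearable. On the other hand, it is simple enough to use that it suits beginners' experiments.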
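To see where the running time goes, it helps to picture what `generations` and `population_size` control. The following is a minimal sketch of an elitist evolutionary search in plain Python. It is a simplification for illustration, not TPOT's actual algorithm, and `evolve`, `sample_config`, `mutate`, and `score` are hypothetical names. In TPOT each score would come from cross-validating a whole pipeline, so the number of model fits is on the order of generations times population size.

```python
import random


def evolve(score, sample_config, mutate, generations=5, population_size=20, seed=0):
    """Tiny elitist evolutionary search; a simplification, not TPOT's algorithm."""
    rng = random.Random(seed)
    population = [sample_config(rng) for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        best_per_generation.append(score(ranked[0]))
        survivors = ranked[: max(1, population_size // 2)]
        # Keep the survivors and refill the population with mutated copies.
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    best = max(population, key=score)
    return best, best_per_generation


# Hypothetical search space: choose k for a k-nearest-neighbours model.
def sample_config(rng):
    return {"n_neighbors": rng.randint(1, 30)}


def mutate(config, rng):
    k = config["n_neighbors"] + rng.choice([-2, -1, 1, 2])
    return {"n_neighbors": min(30, max(1, k))}


def score(config):
    # Stand-in for cross-validated accuracy, peaking at k = 6.
    return -abs(config["n_neighbors"] - 6)


best, history = evolve(score, sample_config, mutate)
print(best, history)
```

Because the survivors are carried over unchanged, the best score never gets worse from one generation to the next; the price is that every generation re-evaluates a full population.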

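3. SHAP: explaining a model's predictions

SHAP is easier to work with than typical model analysis tools in two ways:

It supports deep learning frameworks such as TensorFlow, PyTorch, and Keras.
It can visualize the predictions of deep neural network models: in the project's example figure, red pixel regions indicate a higher probability for the current label.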
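SHAP's attributions are based on Shapley values from cooperative game theory. As a rough illustration of the idea, the sketch below computes exact Shapley values for a toy function by brute force over all feature orderings. It does not use the shap library, and `toy_model`, the input, and the baseline are made up for the example.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to baseline.

    Features not yet added take their baseline value. The cost grows
    factorially with the number of features, so this only suits toy inputs.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = predict(current)
        for feature in order:
            # Marginal contribution of switching this feature on.
            current[feature] = x[feature]
            output = predict(current)
            totals[feature] += output - previous_output
            previous_output = output
    return [total / len(orderings) for total in totals]


def toy_model(features):
    # Hypothetical model: one linear term plus one interaction term.
    return 2.0 * features[0] + features[1] * features[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # [2.0, 3.0, 3.0]
```

The attributions sum to the difference between the model output at `x` and at the baseline (8.0 here), and the interaction term is split evenly between the two features involved. Libraries such as SHAP exist largely because this brute-force computation is infeasible for real models.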

4. Augmentor: a simple, practical data augmentation library

A few short lines of code are enough to generate augmented images:

import Augmentor

p = Augmentor.Pipeline("/path/to/images")
# Point to a directory containing ground truth data.
# Images with the same file names will be added as ground truth data
# and augmented in parallel to the original data.
p.ground_truth("/path/to/ground_truth_images")
# Add operations to the pipeline as normal:
p.rotate(probability=1, max_left_rotation=5, max_right_rotation=5)
p.flip_left_right(probability=0.5)
p.zoom_random(probability=0.5, percentage_area=0.8)
p.flip_top_bottom(probability=0.5)
p.sample(50)
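The `probability` arguments above mean that each operation is applied to a given sample only some of the time. The sketch below mimics that behaviour on tiny list-of-lists "images" in plain Python. `ToyPipeline` and its methods are hypothetical stand-ins for illustration, not Augmentor's implementation.

```python
import random


def flip_left_right(image):
    """Mirror each row of a 2-D list image."""
    return [list(reversed(row)) for row in image]


def flip_top_bottom(image):
    """Reverse the order of the rows."""
    return [list(row) for row in reversed(image)]


class ToyPipeline:
    """Hypothetical stand-in for an augmentation pipeline."""

    def __init__(self, seed=None):
        self._operations = []  # list of (function, probability) pairs
        self._rng = random.Random(seed)

    def add(self, operation, probability):
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be in [0, 1]")
        self._operations.append((operation, probability))

    def sample(self, image, n):
        """Return n augmented copies of image; the input is left untouched."""
        results = []
        for _ in range(n):
            augmented = [list(row) for row in image]
            for operation, probability in self._operations:
                # Each operation fires independently with its own probability.
                if self._rng.random() < probability:
                    augmented = operation(augmented)
            results.append(augmented)
        return results


image = [[1, 2], [3, 4]]
pipeline = ToyPipeline(seed=0)
pipeline.add(flip_left_right, probability=0.5)
pipeline.add(flip_top_bottom, probability=0.5)
samples = pipeline.sample(image, 50)
distinct = {tuple(map(tuple, s)) for s in samples}
print(len(samples), "samples,", len(distinct), "distinct variants")
```

With two independent flips there are at most four distinct variants of the image; an operation added with probability 1 is applied to every sample.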

Augmentor also supports adding image noise, image distortion, and similar operations.

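5. spaCy: helping you build advanced NLP applications

2018 had no shortage of good natural language projects, and spaCy is one of them. spaCy turns fairly recent research into production-grade functionality, and its features go well beyond what is listed here.

Tokenization alone supports 31 languages as well as nested tokenization.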

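6. pytext: deep learning + NLP + PyTorch

pytext, an open-source project from Facebook, is built on PyTorch and has a research flavor of its own, which helps if you are looking for implementations of deep learning + NLP papers. One example is the one-dimensional convolution that David 9 covered in an earlier article ("Applying one-dimensional convolution to semantic understanding: an analysis of DeepPavlov, the open-source chatbot from the Moscow Institute of Physics and Technology, with code").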
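For readers who have not met one-dimensional convolution on text, the idea is to slide a filter over windows of consecutive token embeddings and then pool the resulting activations. The following is a minimal plain-Python sketch under that assumption. It is not pytext's code, and the embeddings and filter weights are made-up numbers.

```python
def conv1d_over_tokens(embeddings, kernel, bias=0.0):
    """Slide one filter over a sequence of token vectors (no padding).

    embeddings: list of token vectors, each of length dim.
    kernel: list of `width` weight vectors, each of length dim.
    Returns one ReLU activation per window of `width` consecutive tokens.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(embeddings) - width + 1):
        total = bias
        for offset in range(width):
            token = embeddings[start + offset]
            weights = kernel[offset]
            total += sum(t * w for t, w in zip(token, weights))
        outputs.append(max(0.0, total))  # ReLU
    return outputs


def max_over_time(activations):
    """Collapse a variable-length feature map into a single feature."""
    return max(activations)


# Hypothetical 2-dimensional embeddings for a 4-token sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter spanning 2 tokens.
kernel = [[1.0, -1.0], [0.5, 0.5]]
feature_map = conv1d_over_tokens(sentence, kernel)
print(feature_map, max_over_time(feature_map))  # [1.5, 0.0, 0.0] 1.5
```

A real model would learn many such filters of several widths and feed the pooled features to a classifier; the pooling step is what lets sentences of different lengths map to a fixed-size vector.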

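7. flair: another simple, easy-to-use natural language framework

Besides being easy to use, flair differs from pytext in that it does not focus solely on neural networks, yet it also provides implementations of several approaches that have matured in recent years.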

Another highlight of flair is its own simple way of combining different word embeddings, including Flair embeddings, BERT embeddings, and ELMo embeddings.
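Combining embeddings in this style comes down to concatenating, for each token, the vectors produced by several embedding models. The sketch below shows that idea in plain Python. The two embedders are toy stand-ins rather than real pretrained models, and the function names are hypothetical. Real contextual embeddings such as Flair, BERT, or ELMo also depend on the surrounding sentence, which this simplification ignores.

```python
def stack_embeddings(tokens, embedders):
    """Concatenate the vectors each embedder produces for every token."""
    stacked = []
    for token in tokens:
        vector = []
        for embed in embedders:
            vector.extend(embed(token))
        stacked.append(vector)
    return stacked


# Two hypothetical embedders standing in for real pretrained models.
def length_embedding(token):
    return [float(len(token))]


def shape_embedding(token):
    starts_upper = 1.0 if token[:1].isupper() else 0.0
    all_digits = 1.0 if token.isdigit() else 0.0
    return [starts_upper, all_digits]


tokens = ["Berlin", "is", "2018"]
vectors = stack_embeddings(tokens, [length_embedding, shape_embedding])
for token, vector in zip(tokens, vectors):
    print(token, vector)
```

Each stacked vector has a length equal to the sum of the individual embedding sizes (1 + 2 = 3 here), so a downstream model sees all sources of information side by side.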


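This article is licensed under the Attribution-NonCommercial-NoDerivs 3.0 China Mainland license. Copyright belongs to "David 9's blog" as the original author. To republish, please contact WeChat: david9ml, or email: yanchao727@gmail.com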
