Data processing with pipe


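Cloud computing and big-data stream processing both rely on concepts such as MapReduce and pipelines. The article 《左耳朵耗子：什么是函数式编程？》 ("What is functional programming?") explains them in an accessible way, and the Pipe code in its final section is especially graceful and elegant. This post practices using that Pipe.

First, following Haozi's article, let's implement a Pipe decorator class: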

import functools

class Pipe:
    def __init__(self, func):
        self.func = func
        functools.update_wrapper(self, func)

    def __ror__(self, pipe_left_obj):
        return self.func(pipe_left_obj)

    def __call__(self, *args, **kwargs):
        def wrapped(pipe_left_obj):
            return self.func(pipe_left_obj, *args, **kwargs)

        return Pipe(wrapped)

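The special method __ror__ used here overloads the | operator.

Note the difference between __ror__ and __or__. We overload __ror__ because the data has to flow from the object on the left of | into the object on the right. For x | y, Python first tries x.__or__(y); when the left operand cannot handle the operation (a list or str, for example, has no suitable __or__), Python falls back to y.__ror__(x). So __or__ is defined on the left operand, and __ror__ on the right one.

Pipe usage example: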
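To make the dispatch order concrete, here is a minimal sketch. The class name `Right` is hypothetical and exists only for this illustration; it defines nothing but the reflected operator.

```python
class Right:
    """Hypothetical class that defines only the reflected | operator."""

    def __ror__(self, left):
        # Reached for `left | Right()` when type(left) cannot handle the operation.
        return ("__ror__ received", left)


# list has no __or__, so Python goes straight to Right.__ror__.
from_list = [1, 2, 3] | Right()

# set does define __or__, but it returns NotImplemented for a non-set operand,
# so Python again falls back to Right.__ror__.
from_set = {1} | Right()

print(from_list)
print(from_set)
```

In both cases the left operand ends up as the argument of `__ror__`, which is exactly the direction of data flow that Pipe needs.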

@Pipe
def to_str(data, sep=','):
    return sep.join(map(str, data))

print([1, 2, 3] | to_str)       # prints 1,2,3
print([4, 5, 6] | to_str('#'))  # prints 4#5#6

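Here to_str('#') invokes Pipe.__call__(). A few points to note when implementing __call__:

1. Declare it with (*args, **kwargs) so that it can accept the arguments meant for to_str.
2. Its return value must be a Pipe object, so that the result can take part in a | operation.
3. The new Pipe is initialized with a function object (wrapped), and the first parameter of that function receives the object on the left of |.
4. Inside __call__, self.func refers to the function to_str, whereas inside the __ror__ of the Pipe that __call__ returns, self.func refers to the function wrapped.

Being told is no substitute for doing it yourself. To really understand this, step through the code with the debugger in PyCharm.

Next, let's try a scenario that comes up often in big data. Given an English text, which steps does it take to count word frequencies, sort them, and print the result?

- Split the whole text into words
- Group the words
- Count the items in each group
- Sort according to some rule
- Print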
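The two stages can be observed separately, as in the sketch below. It repeats the Pipe class and to_str from above so that it runs on its own; the variable name `hash_joiner` is an assumption introduced for illustration.

```python
import functools


class Pipe:
    # Same class as above, repeated so this snippet is self-contained.
    def __init__(self, func):
        self.func = func
        functools.update_wrapper(self, func)

    def __ror__(self, pipe_left_obj):
        return self.func(pipe_left_obj)

    def __call__(self, *args, **kwargs):
        def wrapped(pipe_left_obj):
            return self.func(pipe_left_obj, *args, **kwargs)

        return Pipe(wrapped)


@Pipe
def to_str(data, sep=','):
    return sep.join(map(str, data))


# Stage 1: Pipe.__call__ runs and returns a new Pipe; nothing is joined yet.
hash_joiner = to_str('#')
print(isinstance(hash_joiner, Pipe))        # True
print(hash_joiner.func is to_str.func)      # False: the new Pipe holds `wrapped`

# Stage 2: Pipe.__ror__ of the new object runs and calls wrapped([4, 5, 6]).
joined = [4, 5, 6] | hash_joiner
print(joined)                               # 4#5#6

# Keyword arguments are forwarded the same way.
dashed = [4, 5, 6] | to_str(sep='-')
print(dashed)                               # 4-5-6
```

Because `to_str('#')` is an ordinary Pipe object, it can be stored and reused on several inputs.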
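Before the Pipe version, the steps above can be sketched as ordinary function calls for comparison. The helper name `word_frequencies` and the sample sentence are assumptions made up for this illustration, not part of the original article.

```python
import itertools


def word_frequencies(content):
    """Hypothetical helper: the five steps written without Pipe."""
    words = content.split()                              # 1. split into words
    groups = itertools.groupby(sorted(words))            # 2. group (groupby needs sorted input)
    counts = [(word, sum(1 for _ in group))              # 3. count each group
              for word, group in groups]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)  # 4. sort by count


sample = "is better than is better is"
frequencies = word_frequencies(sample)
print(frequencies)                                       # 5. print
# [('is', 3), ('better', 2), ('than', 1)]
```

This works, but each step needs a named intermediate variable or a nested call; the Pipe version expresses the same flow as one left-to-right chain.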

import itertools

@Pipe
def split_to_words(content):
    return content.split()

@Pipe
def groupby(iterable, keyfunc):
    return itertools.groupby(sorted(iterable, key=keyfunc), keyfunc)

@Pipe
def mapping(iterable, func):
    return (func(x) for x in iterable)

@Pipe
def count(iterable):
    return sum(map(lambda x: 1, iterable))

@Pipe
def sort(iterable, **kwargs):
    return sorted(iterable, **kwargs)

@Pipe
def echo(iterable):
    print(iterable)

Let's try it out on "The Zen of Python":

text = """
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
"""

text | split_to_words | groupby(lambda x: x) | mapping(lambda x: (x[0], x[1] | count)) | sort(key =lambda x: x[1], reverse=True) | echo

The output is:

[('is', 10), ('better', 8), ('than', 8), ('the', 5), ('to', 5), ('Although', 3), ('be', 3), ('of', 3), ('If', 2), ('a', 2), ('do', 2), ('explain,', 2), ('idea.', 2), ('implementation', 2), ('may', 2), ('never', 2), ('one', 2), ('should', 2), ('way', 2), ('*right*', 1), ('--', 1), ('--obvious', 1), ('Beautiful', 1), ('Complex', 1), ('Dutch.', 1), ('Errors', 1), ('Explicit', 1), ('Flat', 1), ('In', 1), ('Namespaces', 1), ('Now', 1), ('Peters', 1), ('Python,', 1), ('Readability', 1), ('Simple', 1), ('Sparse', 1), ('Special', 1), ('The', 1), ('There', 1), ('Tim', 1), ('Unless', 1), ('Zen', 1), ('ambiguity,', 1), ('and', 1), ('are', 1), ("aren't", 1), ('at', 1), ('bad', 1), ('beats', 1), ('break', 1), ('by', 1), ('cases', 1), ('complex.', 1), ('complicated.', 1), ('counts.', 1), ('dense.', 1), ('easy', 1), ('enough', 1), ('explicitly', 1), ('face', 1), ('first', 1), ('good', 1), ('great', 1), ('guess.', 1), ('hard', 1), ('honking', 1), ('idea', 1), ('implicit.', 1), ('it', 1), ("it's", 1), ('it.', 1), ("let's", 1), ('more', 1), ('nested.', 1), ('never.', 1), ('not', 1), ('now.', 1), ('obvious', 1), ('often', 1), ('one--', 1), ('only', 1), ('pass', 1), ('practicality', 1), ('preferably', 1), ('purity.', 1), ('refuse', 1), ('rules.', 1), ('silenced.', 1), ('silently.', 1), ('special', 1), ('temptation', 1), ('that', 1), ('those!', 1), ('ugly.', 1), ('unless', 1), ("you're", 1)]

Works like a charm!
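As a quick cross-check, the counts can also be reproduced with `collections.Counter` from the standard library. This is a different technique from the Pipe chain, shown only for verification; the two-line `excerpt` is taken from the text above to keep the example short.

```python
from collections import Counter

# Two lines from "The Zen of Python", used as a small sample.
excerpt = """Now is better than never.
Although never is often better than *right* now."""

# Counter does the grouping and counting in one step; split() tokenizes on
# whitespace, so punctuation stays attached to words just as in the Pipe version.
counts = Counter(excerpt.split())

print(counts['is'], counts['better'], counts['than'])   # 2 2 2
print(counts['never'], counts['never.'])                # 1 1
```

Note that 'never' and 'never.' are counted as different words, which matches the behaviour visible in the full output above.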
