Data processing with pipe

Category: Ruby on Rails · Published: 5 years ago

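Cloud computing and big-data stream processing these days all involve concepts such as MapReduce and pipelines, and 《左耳朵耗子:什么是函数式编程?》 explains them clearly and accessibly. The Pipe code in its final section is especially graceful and elegant, so this article practices using Pipe.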

First, following Haozi's article, let's implement a Pipe decorator class:

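import functools

class Pipe:
    def __init__(self, func):
        # Keep the wrapped function and copy its name/docstring onto this object.
        self.func = func
        functools.update_wrapper(self, func)

    def __ror__(self, pipe_left_obj):
        # Runs for `left | pipe`: feed the left operand to the wrapped function.
        return self.func(pipe_left_obj)

    def __call__(self, *args, **kwargs):
        # Runs for `pipe(args)`: bind the extra arguments now and return a new
        # Pipe that still waits for the left operand.
        def wrapped(pipe_left_obj):
            return self.func(pipe_left_obj, *args, **kwargs)

        return Pipe(wrapped)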

The special method __ror__ used here overloads the | operator.

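Note the difference between __ror__ and __or__. We overload __ror__ because the data has to flow from the object on the left of | to the object on the right. For x | y, Python first tries x.__or__(y); if the left operand does not implement it (as with a list or a str) or it returns NotImplemented, Python falls back to y.__ror__(x), which is the call a Pipe on the right-hand side receives.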
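The dispatch order can be checked with a small sketch. Recorder is a hypothetical helper class, not part of the article's code; it only reports which special method Python invoked.

```python
class Recorder:
    """Hypothetical helper that records which special method Python invoked."""

    def __or__(self, other):
        return ("__or__", other)

    def __ror__(self, other):
        return ("__ror__", other)


r = Recorder()

# Recorder is the left operand, so Python calls r.__or__(5).
left_result = r | 5

# list does not define __or__, so Python falls back to r.__ror__([1, 2]).
right_result = [1, 2] | r

print(left_result)   # ('__or__', 5)
print(right_result)  # ('__ror__', [1, 2])
```

This is why Pipe only needs __ror__: the pipe object always sits on the right of the data it receives.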

An example of using Pipe:

@Pipe
def to_str(data, sep=','):
    return sep.join(map(str, data))

print([1, 2, 3] | to_str)       # output is '1,2,3'
print([4, 5, 6] | to_str('#'))  # output is '4#5#6'

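Here to_str('#') invokes Pipe.__call__(). A few points to note when implementing __call__:
1. Declare it with (*args, **kwargs) so that it can accept to_str's arguments.
2. The return value should be a Pipe object so that it can take part in the | operation.
3. The new Pipe must be initialized with a function object (wrapped) whose first parameter receives the object on the left of |.
4. Inside __call__, self.func refers to the function to_str, whereas when __ror__ then runs on the returned Pipe, self.func refers to the function wrapped.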
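The points about __call__ can be observed directly. The sketch below repeats the Pipe class and to_str so that it runs on its own; the variable name stage and the '-' separator are illustrative choices, not from the original article.

```python
import functools


class Pipe:
    def __init__(self, func):
        self.func = func
        functools.update_wrapper(self, func)

    def __ror__(self, pipe_left_obj):
        return self.func(pipe_left_obj)

    def __call__(self, *args, **kwargs):
        def wrapped(pipe_left_obj):
            return self.func(pipe_left_obj, *args, **kwargs)

        return Pipe(wrapped)


@Pipe
def to_str(data, sep=','):
    return sep.join(map(str, data))


# Pipe.__call__ runs here; nothing is joined yet.
stage = to_str(sep='-')

print(isinstance(stage, Pipe))  # True: the result can still be used with |
print(to_str.func.__name__)     # to_str: the decorated function
print(stage.func.__name__)      # wrapped: the closure created in __call__
print([7, 8, 9] | stage)        # 7-8-9
```

Because stage is an ordinary Pipe object, it can be stored in a variable and reused on several inputs.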

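As the saying goes, a tune you were merely taught is one you cannot sing. To really understand this, step through the code yourself in the PyCharm debugger.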

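Next, let's try a scenario that comes up often in big-data work: given a passage of English text, count the word frequencies, sort them, and print the result. What are the steps?
- Split the passage into words
- Group identical words together
- Count the items in each group
- Sort according to some rule
- Print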
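For comparison, the five steps can be sketched first in plain Python without Pipe, which makes the later pipeline version easier to map onto them. The sample string and variable names are made up for illustration.

```python
import itertools

sample = "to be or not to be"  # hypothetical sample text, not from the article

# 1. Split the passage into words.
words = sample.split()

# 2. Group identical words (groupby needs sorted input to group them all).
groups = itertools.groupby(sorted(words))

# 3. Count the items in each group.
counts = [(word, sum(1 for _ in group)) for word, group in groups]

# 4. Sort by count, highest first (ties keep their alphabetical order).
ranked = sorted(counts, key=lambda pair: pair[1], reverse=True)

# 5. Print.
print(ranked)  # [('be', 2), ('to', 2), ('not', 1), ('or', 1)]
```

Each intermediate variable here corresponds to one stage of a pipeline.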

import itertools

@Pipe
def split_to_words(content):
    return content.split()

@Pipe
def groupby(iterable, keyfunc):
    return itertools.groupby(sorted(iterable, key=keyfunc), keyfunc)

@Pipe
def mapping(iterable, func):
    return (func(x) for x in iterable)

@Pipe
def count(iterable):
    return sum(map(lambda x: 1, iterable))

@Pipe
def sort(iterable, **kwargs):
    return sorted(iterable, **kwargs)

@Pipe
def echo(iterable):
    print(iterable)
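Before chaining everything together, it can help to try a few stages on small inputs. The following is a self-contained sketch that repeats Pipe and four of the stages; the inputs and the names squares and parity are illustrative assumptions, and count is written here with a generator expression instead of map.

```python
import functools
import itertools


class Pipe:
    def __init__(self, func):
        self.func = func
        functools.update_wrapper(self, func)

    def __ror__(self, pipe_left_obj):
        return self.func(pipe_left_obj)

    def __call__(self, *args, **kwargs):
        def wrapped(pipe_left_obj):
            return self.func(pipe_left_obj, *args, **kwargs)

        return Pipe(wrapped)


@Pipe
def groupby(iterable, keyfunc):
    return itertools.groupby(sorted(iterable, key=keyfunc), keyfunc)


@Pipe
def mapping(iterable, func):
    return (func(x) for x in iterable)


@Pipe
def count(iterable):
    return sum(1 for _ in iterable)


@Pipe
def sort(iterable, **kwargs):
    return sorted(iterable, **kwargs)


# A stage called with arguments and a bare stage both work on the right of |.
squares = range(4) | mapping(lambda x: x * x) | sort(reverse=True)
print(squares)        # [9, 4, 1, 0]
print("abc" | count)  # 3

# Group numbers by parity, then count each group as it is produced.
parity = ([1, 2, 3, 4, 5]
          | groupby(lambda n: n % 2)
          | mapping(lambda kv: (kv[0], kv[1] | count))
          | sort)
print(parity)         # [(0, 2), (1, 3)]
```

Note that mapping returns a lazy generator, so each group from itertools.groupby is counted as soon as it is yielded and before groupby advances to the next key.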

Let's try it out on "The Zen of Python":

text = """
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
"""

(text
 | split_to_words
 | groupby(lambda x: x)
 | mapping(lambda x: (x[0], x[1] | count))
 | sort(key=lambda x: x[1], reverse=True)
 | echo)

The output is:

[('is', 10), ('better', 8), ('than', 8), ('the', 5), ('to', 5), ('Although', 3), ('be', 3), ('of', 3), ('If', 2), ('a', 2), ('do', 2), ('explain,', 2), ('idea.', 2), ('implementation', 2), ('may', 2), ('never', 2), ('one', 2), ('should', 2), ('way', 2), ('*right*', 1), ('--', 1), ('--obvious', 1), ('Beautiful', 1), ('Complex', 1), ('Dutch.', 1), ('Errors', 1), ('Explicit', 1), ('Flat', 1), ('In', 1), ('Namespaces', 1), ('Now', 1), ('Peters', 1), ('Python,', 1), ('Readability', 1), ('Simple', 1), ('Sparse', 1), ('Special', 1), ('The', 1), ('There', 1), ('Tim', 1), ('Unless', 1), ('Zen', 1), ('ambiguity,', 1), ('and', 1), ('are', 1), ("aren't", 1), ('at', 1), ('bad', 1), ('beats', 1), ('break', 1), ('by', 1), ('cases', 1), ('complex.', 1), ('complicated.', 1), ('counts.', 1), ('dense.', 1), ('easy', 1), ('enough', 1), ('explicitly', 1), ('face', 1), ('first', 1), ('good', 1), ('great', 1), ('guess.', 1), ('hard', 1), ('honking', 1), ('idea', 1), ('implicit.', 1), ('it', 1), ("it's", 1), ('it.', 1), ("let's", 1), ('more', 1), ('nested.', 1), ('never.', 1), ('not', 1), ('now.', 1), ('obvious', 1), ('often', 1), ('one--', 1), ('only', 1), ('pass', 1), ('practicality', 1), ('preferably', 1), ('purity.', 1), ('refuse', 1), ('rules.', 1), ('silenced.', 1), ('silently.', 1), ('special', 1), ('temptation', 1), ('that', 1), ('those!', 1), ('ugly.', 1), ('unless', 1), ("you're", 1)]

Works like a charm!

