Discretized Streams: Discretized Stream Data Processing
Published: 2019-06-08


Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

 

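Current stream processing systems such as Yahoo!'s S4 and Twitter's Storm use the traditional "record-at-a-time" processing model: when a node receives a record, it either updates its state or emits new records.

The problem is that users of these systems have to deal with many concerns themselves, for example: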

Fault tolerance

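There are two traditional approaches to fault tolerance:

a. Replication of the processing nodes. This needs several times the hardware, and it can still fail if all replicas go down.
b. Upstream backup and replay, which is the approach Storm takes. Recovery is slow because failure handling is based on timeouts, so the system has to wait.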

Consistency

Depending on the system, it can be hard to reason about the global state, because different nodes may be processing data that arrived at different times. For example, suppose that a system counts page views from male users on one node and from females on another. If one of these nodes is backlogged, the ratio of their counters will be wrong.
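The page-view example can be made concrete with a small sketch. It is only an illustration: the event list, the `female_lag` parameter, and both function names are made up here, and a single process stands in for the two nodes.

```python
from collections import Counter

# Hypothetical page-view events: (arrival_second, gender).
events = [(0, "male"), (0, "female"), (1, "male"), (1, "female"),
          (2, "male"), (2, "female")]

def record_at_a_time_counts(events, now, female_lag):
    """Counters visible at time `now` when the 'female' node lags by `female_lag` seconds."""
    counts = Counter()
    for t, gender in events:
        lag = female_lag if gender == "female" else 0
        if t + lag <= now:  # the record has been processed by its node
            counts[gender] += 1
    return counts

def interval_counts(events, interval_end):
    """Counts over every event that arrived before `interval_end`, regardless of node speed."""
    return Counter(gender for t, gender in events if t < interval_end)

skewed = record_at_a_time_counts(events, now=2, female_lag=2)
consistent = interval_counts(events, interval_end=3)
print(skewed, consistent)
```

With the backlogged node two seconds behind, the first view reports three male and one female page view even though the input is balanced. The second view is defined by the arrival interval alone, so the ratio is the same no matter which node is slow.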

Unification with batch processing

Existing stream processing models require writing separate code and cannot reuse batch logic.

 

The paper proposes discretized streams (D-Streams), a model that overcomes these challenges.

The key idea behind D-Streams is to treat a streaming computation as a series of deterministic batch computations on small time intervals.
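A minimal sketch of that idea in plain Python, not the Spark Streaming API. The helper names `discretize`, `word_count`, and `run_stream`, the sample records, and the 1-second interval are all assumptions made for illustration.

```python
from collections import defaultdict

def discretize(records, interval):
    """Group (timestamp, value) records into consecutive intervals of `interval` seconds."""
    batches = defaultdict(list)
    for ts, value in records:
        batches[int(ts // interval)].append(value)
    if not batches:
        return []
    # Intervals with no records still yield an (empty) batch.
    return [batches[i] for i in range(max(batches) + 1)]

def word_count(batch):
    """A deterministic batch computation; the same function could run on stored data."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

def run_stream(records, interval):
    """Treat the stream as a series of batch computations, one per interval."""
    return [word_count(batch) for batch in discretize(records, interval)]

# Hypothetical input: (timestamp in seconds, word).
records = [(0.1, "a"), (0.7, "b"), (1.2, "a"), (1.9, "a"), (3.5, "b")]
results = run_stream(records, 1.0)
print(results)
```

Because `word_count` depends only on its input batch, rerunning it on the same interval always gives the same result, which is the property recovery relies on. It also shows the batch-reuse point: nothing in `word_count` is specific to streaming.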

 

Two issues in the implementation:

Low latency

With Spark and RDDs, latency can be kept under 1 second.

Fast fault tolerance

D-Streams use "parallel recovery".

The system periodically checkpoints some of the state RDDs, by asynchronously replicating them to other nodes.
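The mechanism is fairly simple: the system periodically checkpoints some of the state RDDs and creates replicas of them on other nodes.
When a failure occurs, it reads the most recent checkpoint and recomputes forward from it to rebuild the latest state.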
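The checkpoint-then-recompute step can be sketched as follows. This is a single-process illustration under stated assumptions: `CheckpointedStream` and `update_state` are hypothetical names, the input batches are assumed to be retained reliably, and the sketch does not model the parallelism across nodes that gives parallel recovery its name.

```python
def update_state(state, batch):
    """Deterministic state update: a running count per key."""
    new_state = dict(state)
    for key in batch:
        new_state[key] = new_state.get(key, 0) + 1
    return new_state

class CheckpointedStream:
    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.batches = []            # input batches, assumed to be retained reliably
        self.state = {}
        self.checkpoints = {0: {}}   # number of batches applied -> copy of the state

    def process(self, batch):
        self.batches.append(batch)
        self.state = update_state(self.state, batch)
        applied = len(self.batches)
        if applied % self.checkpoint_every == 0:
            self.checkpoints[applied] = dict(self.state)

    def recover(self):
        """Rebuild lost state: load the latest checkpoint, then replay later batches."""
        start = max(self.checkpoints)
        state = dict(self.checkpoints[start])
        for batch in self.batches[start:]:
            state = update_state(state, batch)
        self.state = state
        return start

stream = CheckpointedStream(checkpoint_every=2)
for batch in [["a"], ["a", "b"], ["b"]]:
    stream.process(batch)
expected = dict(stream.state)

stream.state = None                  # simulate losing the in-memory state
replayed_from = stream.recover()
print(replayed_from, stream.state)
```

Only the batches after the last checkpoint are replayed, so the checkpoint interval bounds the amount of recomputation. Recovery reproduces the lost state exactly because `update_state` is deterministic.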
 

The rest of the paper mainly discusses how fault tolerance is achieved, but it does not go into much detail either.

One reason why parallel recovery was hard to perform in previous streaming systems is that they process data on a per-record basis, which requires complex and costly bookkeeping protocols (e.g., Flux [20]) even for basic replication. In contrast, D-Streams apply deterministic transformations at the much coarser granularity of RDD partitions, which leads to far lighter bookkeeping and simple recovery similar to batch data flow systems [6].

Reposted from: https://www.cnblogs.com/fxjwind/p/3333213.html
