正文之前
终于翻译完了,可以开始看论文了,开心啊。。。。。。
正文
Event time for a given event essentially never changes, but processing time changes constantly for each event as it flows through the pipeline and time marches ever forward. This is an important distinction when it comes to robustly analyzing events in the context of when they occurred.
一个给定的事件的时间基本不会变,但是处理时间就会随着事件的数据在处理管道一步步被处理而随时间前移而不断地变化。这是一个十分重要的区别,特别是我们迫切需要根据事件的发生时间进行分析的时候。
During processing, the realities of the systems in use (communication delays, scheduling algorithms, time spent processing, pipeline serialization, etc.) result in an inherent and dynamically changing amount of skew between the two domains. Global progress metrics, such as punctuations or watermarks, provide a good way to visualize this skew. For our purposes, we’ll consider something like MillWheel’s watermark, which is a lower bound (often heuristically established) on event times that have been processed by the pipeline. As we’ve made very clear above, notions of completeness are generally incompatible with correctness, so we won’t rely on watermarks as such. They do, however, provide a useful notion of when the system thinks it likely that all data up to a given point in event time have been observed, and thus find application in not only visualizing skew, but in monitoring overall system health and progress, as well as making decisions around progress that do not require complete accuracy, such as basic garbage collection policies.
在处理过程中,由于使用的系统的一些现实因素的影响(通信延迟、调度算法、处理时间、流水线序列化等),或导致这两个时间域中产生一些内在的、动态的波动变化。全局数据处理进度,比如标记或者水位标记是一种很好的将上述差值可视化的办法。为了达到这个目的,考虑类似MillWheel的水位标记,它是一个下界,代表着在这个时间之前的数据已经完全被系统处理了(通常采用启发式的方法建立。)前面我们已经说的很清楚了,数据已经被完全处理的标记往往是和正确性不兼容,相互冲突的。所以我们同样不应该依赖于水位标记。但是,它提供了一个很有用的概念,当系统认为一个时间节点之前的数据都已经全部被观察到了。应用就可以用这个时间节点来可视化处理时间差,并且还能检测系统上层的健康状态和进程,还可以对一些对精确度要求不高的决策,比如基本的垃圾回收策略等进行决策。
In an ideal world, time domain skew would always be zero; we would always be processing all events immediately as they happen. Reality is not so favorable, however, and often what we end up with looks more like Figure 2. Starting around 12:00, the watermark starts to skew more away from real time as the pipeline lags, diving back close to real time around 12:02, then lagging behind again noticeably by the time 12:03 rolls around. This dynamic variance in skew is very common in distributed data processing systems, and will play a big role in defining what functionality is necessary for providing correct, repeatable results.
理想情况下,系统时间与时间处理时间之间的差为0,我们应该是在事件发生的时候就立刻处理。实际上远没有那么好,往往更像是图2所显示的那样子。从12点左右开始,管道处理开始迟滞,水位标记出现偏差,12:02左右的时候又开始靠拢,然后又开始在12:03的时候又有了很大的迟滞。分布式数据处理系统中这个动态的时间差变量很常见,而且它在考虑如何提供一个正确的,可重复的结果的时候是一个必须考虑的重要角色(变量)。
2. DATAFLOW MODEL
2. 数据流模型
In this section, we will define the formal model for the system and explain why its semantics are general enough to subsume the standard batch, micro-batch, and streaming models, as well as the hybrid streaming and batch semantics of the Lambda Architecture. For code examples, we will use a simplified variant of the Dataflow Java SDK, which itself is an evolution of the FlumeJava API.
这一节中,我们将会正式定义一个系统模型,并且解释为什么它的语义能够泛化到将批处理、微批处理、流失模型归纳到一起,同时还有混合流和Lambda架构的语义。举个编程的实例,我们将会用DataFlow 的Java SDK的一个简单变体来进行示例,这个变体也是FlumeJava API演化来的。
2.1 Core Primitives
2.1 核心基元
To begin with, let us consider primitives from the classic batch model. The Dataflow SDK has two core transforms that operate on the (key,value) pairs flowing through the system:
首先让我们来考虑经典的批处理模型的基元。DataFlow SDK在通过系统转换成的键值对上有两个核心的转换操作:
? ParDo for generic parallel processing. Each input element to be processed (which itself may be a finite collection) is provided to a user-defined function (called a DoFn in Dataflow), which can yield zero or more output elements per input. For example, consider an operation which expands all prefixes of the input key, duplicating the value across them:
ParDo 用于并行处理。每一个要被出处理的输入(可能是个有限集合)都会被提供一个用户定义的函数(在DataFlow里面被称为 DoFn ),它可以为每一个输入产生一个或者多个的输出。例如,考虑一个给输入的键扩展所有前缀的操作,在他们之间复制所有的值:
? GroupByKey for key-grouping (key, value) pairs.
GroupByKey 用来按照键对键值对元素重新分组。
The ParDo operation operates element-wise on each input element, and thus translates naturally to unbounded data. The GroupByKey operation, on the other hand, collects all data for a given key before sending them downstream for reduction. If the input source is unbounded, we have no way of knowing when it will end. The common solution to this problem is to window the data.
ParDo 操作因为对每个输入数据进行处理,因此自然地转化为无界数据。而另一方面,GroupByKey操作则是根据给定的Key值,在送到下流进行聚合前,收集所有此键对应的数据。如果输入源无界,那么我们就无法知道是什么时候结束了。通用的解决办法是对数据进行窗口化。
2.2 Windowing
2.2 窗口化
Systems which support grouping typically redefine their GroupByKey operation to essentially be GroupByKeyAndWindow. Our primary contribution here is support for unaligned windows, for which there are two key insights. The first is that it is simpler to treat all windowing strategies as unaligned from the perspective of the model, and allow underlying implementations to apply optimizations relevant to the aligned cases where applicable. The second is that windowing can be broken apart into two related operations:
支持分组的系统通常都会重定义其GroupByKey操作为GroupByKeyWindow操作。我们的主要的贡献就是支持不对称窗口,这个贡献包含两个关键见解。第一个是:从模型的角度来看,将所有的窗口视为未对齐的会比较简单,并且允许底层实现对对齐窗口的相关优化(而底层实现来负责把对齐窗口作为一个特例进行优化)。第二个是:窗口化可以被分割成两个相关的操作:
? Set<Window> AssignWindows(T datum), which assigns the element to zero or more windows. This is essentially the Bucket Operator from Li.
使用 Set<Window> AssignWindows(T datum) 将元素分配给0或者几个窗口(窗口分配操作)。这是Li在22文献中提到的Bucket(桶)操作符。
? Set<Window> MergeWindows(Set<Window> windows), which merges windows at grouping time. This allows datadriven windows to be constructed over time as data arrive and are grouped together.
通过Set<Window> MergeWindows(Set<Window> windows),可以在汇总数据。这一操作允许数据驱动窗口可以在数据到达的过程中逐渐建立并且被组合。
For any given windowing strategy, the two operations are intimately related; sliding window assignment requires sliding window merging, sessions window assignment requires sessions window merging, etc.
Note that, to support event-time windowing natively, instead of passing (key,value) pairs through the system, we now pass (key, value, event time, window) 4-tuples. Elements are provided to the system with event-time timestamps (which may also be modified at any point in the pipeline), and are initially assigned to a default global window, covering all of event time, providing semantics that match the defaults in the standard batch model.
对于任何给定的窗口策略,这两个操作都是相互关联的;滑动窗口分配需要滑动窗口合并?;峄按翱诜峙湫枰峄按翱诤喜⒌取G胱⒁?,为了支持事件时间窗口化,而不是通过系统传递键值对,我们要传递【键-值-事件时间-窗口】这个四元元组。被提交到系统的元素需要自带一个事件发生时间戳(在后面的管道处理过程中可能会被修改),并且初始分配到一个全局的窗口中,这个窗口覆盖了所有的事件发生时间,还提供了一个标准的批处理模型的语义作为默认配置。
2.2.1 Window Assignment
2.2.1 窗口分配
From the model’s perspective, window assignment creates a new copy of the element in each of the windows to which it has been assigned. For example, consider windowing a dataset by sliding windows of two-minute width and one- minute period, as shown in Figure 3 (for brevity, timestamps are given in HH:MM format).
从模型的角度来说,窗口的分配创造了一个窗口内所有元素的副本到其被分配的对应窗口。比如考虑一个滑动窗口的数据集分配,它的宽度为2mins,并且滑动周期为1min,如图3(简单起见,时间戳格式为:HH:MM)。
In this case, each of the two (key,value) pairs is duplicated to exist in both of the windows that overlapped the element’s timestamp. Since windows are associated directly with the elements to which they belong, this means window assignment can happen anywhere in the pipeline before grouping is applied. This is important, as the grouping operation may be buried somewhere downstream inside a composite transformation (e.g. Sum.integersPerKey()).
在这个例子里面,两条四元元组数据被复制到每一个元素时间戳重叠的窗口。因为窗口是直接与元素相关,这就意味着窗口分配在处理管道的聚合动作发生前随时都可能会发生。这很重要!因为聚合操作可能会出现在下游的某一个复杂的组合变化中。(比如Sum.integersPerKey())
2.2.2 Window Merging
2.2.2 窗口合并
Window merging occurs as part of the GroupByKeyAndWindow operation, and is best explained in the context of an example. We will use session windowing since it is our motivating use case. Figure 4 shows four example data, three for k1 and one for k2, as they are windowed by session, with a 30-minute session timeout. All are initially placed in a default global window by the system. The sessions implementation of AssignWindows puts each element into a single window that extends 30 minutes beyond its own timestamp; this window denotes the range of time into which later events can fall if they are to be considered part of the same session. We then begin the GroupByKeyAndWindow operation, which is really a five-part composite operation:
窗口合并是GroupByKeyWindow操作的一个子操作,这一点我们最好在下文的例子中进行说明。我们将会使用会话窗口因为它是我们很想要解决的用例。图4展示了四个示例数据。3条包含了k1,一条包含了k2,并且他们是被会话划分的窗口,会话的过期时间是30mins。所有的数据都被初始化放置在一个全局的窗口中。AssignWindows的会话实现把每一个元素放到一个由它自身的时间戳往外延伸30mins长的单独的窗口中,这个窗口表示后面的事件如果可以被纳入同一个会话的话就会落入这个窗口中(这个窗口的时间段如果和另外一个窗口的时间段相互重合,则意味着这两个窗口应该属于同一个会话)。然后我们就开始由五个部分所混合组成的GroupByKeyAndWindow操作:
? DropTimestamps - Drops element timestamps, as only the window is relevant from here on out.
DropTimestamps :丢弃元素的时间戳,在输出里面只有窗口的与此相关(后续输出里面我们只关心窗口);
? GroupByKey - Groups (value, window) tuples by key.
GroupByKey 按照键值将(值,窗口)分组。
? MergeWindows - Merges the set of currently buffered windows for a key. The actual merge logic is defined by the windowing strategy. In this case, the windows for v1 and v4 overlap, so the sessions windowing strategy merges them into a single new, larger session, as indicated in bold.
MergeWindows 把同一个键对应的(值,窗口)进行窗口合并。具体的合并方式取决于窗口策略。在这个例子中,v1的窗口和v4的窗口重叠了。所以会话窗口策略将他们合并成一个单独的、更大的会话,如粗体所示。
? GroupAlsoByWindow - For each key, groups values by window. After merging in the prior step, v1 and v4 are now in identical windows, and thus are grouped together at this step.
GroupAlsoByWindow 对每一个键,通过窗口对值进行分组。在上一步的合并后,v1和v4在一个完全相同的窗口中,并且这一步被分在同一个组里面。
? ExpandToElements - Expands per-key, per-window groups of values into (key, value, event time, window) tuples, with new per-window timestamps. In this example, we set the timestamp to the end of the window, but any timestamp greater than or equal to the timestamp of the earliest event in the window is valid with respect to watermark correctness.
ExpandToElements 扩展每一个键、每一个窗口的值,用新的时间戳使之成为(键、值、事件时间、窗口)的四元元组。在这个例子里面,我们将时间设置为窗口的末尾,但是任何时间戳都比最早的事件时间要好,或者差不多,因为这样就符合水位标记的正确性需求。
2.2.3 API
As a brief example of the use of windowing in practice, consider the following Cloud Dataflow SDK code to calculate keyed integer sums:
作为一个简单的窗口应用实例。考虑下面的Cloud Dataflow SDK编程计算键控整数和:
To do the same thing, but windowed into sessions with a 30-minute timeout as in Figure 4, one would add a single Window.into call before initiating the summation:
要做到一样的效果,但是窗口化为一个30mins终止时间的会话窗口,好比在图4中那样。那么我们在开始求和之前就要加一个单独的Window.into 调用。
2.3 Triggers & Incremental Processing
2.3 触发器 & 增量处理
The ability to build unaligned, event-time windows is an improvement, but now we have two more shortcomings to address:
建立非对齐、基于事件时间窗口的能力是一种改进,但是现在我们需要说明两个缺点:
? We need some way of providing support for tuple- and processing-time-based windows, otherwise we have regressed our windowing semantics relative to other systems in existence.
我们需要一些方法去提供基于记录(元组)和基于处理时间的窗口。否则就会出现我们的窗口语义与别的系统不兼容的情况。
? We need some way of knowing when to emit the results for a window. Since the data are unordered with respect to event time, we require some other signal to tell us when the window is done.
我们需要一些方法来了解到窗口的结果。因为数据无序随我们需要一些其他的信号来告诉我们窗口结束了。
The problem of tuple- and processing-time-based windows we will address in Section 2.4, once we have built up a solution to the window completeness problem. As to window completeness, an initial inclination for solving it might be to use some sort of global event-time progress metric, such as watermarks. However, watermarks themselves have two major shortcomings with respect to correctness:
关于第一点,基于记录和基于处理时间的窗口在2.4节中就会讨论到,现在我们要构建了一个窗口完整性问题的解决方案。对于窗口完整性问题,一开始我们倾向于使用一些全局的事件时间进度标记,比如水位标记来解决它。但是会未标记这种又有两个正确性方面的缺点:
? They are sometimes too fast, meaning there may be late data that arrives behind the watermark. For many distributed data sources, it is intractable to derive a completely perfect event time watermark, and thus impossible to rely on it solely if we want 100% correctness in our output data.
它们有时候太快了(水位标记可能设置的过短),这就意味着会有数据在水位标记之后到达。对于很多分布式数据源来说,一个很难解决的问题就是得到完美的事件时间标记,因此如果我们想要100%的输出结果正确率,那么就不能只依靠水位标记了。
? They are sometimes too slow. Because they are a global progress metric, the watermark can be held back for the entire pipeline by a single slow datum. And even for healthy pipelines with little variability in event-time skew, the baseline level of skew may still be multiple minutes or more, depending upon the input source. As a result, using watermarks as the sole signal for emitting window results is likely to yield higher latency of overall results than, for example, a comparable Lambda Architecture pipeline.
它们有时候太慢了(水位标记可能设置的过长),因为水位标记是全局进程标志,它可以只要一个数据迟到就会影响到整个水位标记。甚至对一些只有一些小小的事件时间延迟的健康管道而言,延迟基准可能几分钟或者更多,这取决于输入源。因此,使用水位标记作为唯一的窗口完整触发信号,可能导致整个处理结果比诸如Lambda架构管道的处理结果更大的延迟。
For these reasons, we postulate that watermarks alone are insufficient. A useful insight in addressing the completeness problem is that the Lambda Architecture effectively sidesteps the issue: it does not solve the completeness problem by somehow providing correct answers faster; it simply provides the best low-latency estimate of a result that the streaming pipeline can provide, with the promise of eventual consistency and correctness once the batch pipeline runs. If we want to do the same thing from within a single pipeline (regardless of execution engine), then we will need a way to provide multiple answers (or panes) for any given window. We call this feature triggers, since they allow the specification of when to trigger the output results for a given window.
因为这些原因,我们假设单独的水位标记是不够的。一个在完整性问题方面的实用深刻的见解是Labmda架构的有效地回避了这个问题:它并不是通过提供一个更快的方式来解决这个问题,它只提供了最好的流管道输出结果低延迟预估,而且承诺通过批处理保证结果的一致性和正确性。如果我们通过单独的处理管道(不论是什么执行引擎)做到同样的事情,那么我们将需要一种对任何给定的窗口提供多种答案的方式(或者可以叫做窗格 译者注:对窗口这个比喻的引申)。我们称之为触发器,它们可以选择在何时触发指定窗口的输出结果。
In a nutshell, triggers are a mechanism for stimulating the production of GroupByKeyAndWindow results in response to internal or external signals. They are complementary to the windowing model, in that they each affect system behaviour along a different axis of time:
简言之,触发器是一种受内部或者外部信号触发,从而激发GroupByKeyAndWindow执行并且将结果输出的机制。它们与窗口模型互补,因为它们从不同的时间维度影响着系统的行为:
? Windowing determines where in event time data are grouped together for processing.
窗口化将基于事件时间的数据(where 基于时间段选择)进行分组等待处理。
? Triggering determines when in processing time the results of groupings are emitted as panes.
触发器的目的是在处理过程中(when 基于处理时间选择)决定分组结果作为窗格输出。
Our systems provide predefined trigger implementations for triggering at completion estimates (e.g. watermarks, including percentile watermarks, which provide useful semantics for dealing with stragglers in both batch and streaming execution engines when you care more about processing a minimum percentage of the input data quickly than processing every last piece of it), at points in processing time, and in response to data arriving (counts, bytes, data punctuations, pattern matching, etc.). We also support composing triggers into logical combinations (and, or, etc.), loops, sequences, and other such constructions. In addition, users may define their own triggers utilizing both the underlying primitives of the execution runtime (e.g. watermark timers, processingtime timers, data arrival, composition support) and any other relevant external signals (data injection requests, external progress metrics, RPC completion callbacks, etc.). We will look more closely at examples in Section 2.4.
我们的系统提供了基于窗口的完成度的估计的预定义触发器(比如,水位标记,包含百分比水位标记,当我们更关心迅速处理一小部分输入信息而不是等待最后的那一点数据到来时,它能提供一个实用的语义在批处理和流式系统中处理迟来的数据)。触发器有基于处理时间的,有基于数据抵达清苦啊大哥(数据条数,字节数,数据到达标记,模式匹配等),我们也支持对基础触发器进行逻辑的组合(与、或)、循环、序列和一些其他的复合构造方法。另外,用户可以基于系统底层执行时间(比如,水位标记计时器,处理时间计时器,数据到达,复合构造)和任何的其他的外部信号(数据注入请求,外部进程进度指标,RPC完成回调等)来自定义触发器。我们会在2.4节节中看到更多的具体的例子。
In addition to controlling when results are emitted, the triggers system provides a way to control how multiple panes for the same window relate to each other, via three different refinement modes:
除了控制结果输出,触发器系统也定义了控制一个窗口的多个相互关联的窗格的方法,主要通过三种不同的模式:
? Discarding: Upon triggering, window contents are discarded, and later results bear no relation to previous results. This mode is useful in cases where the downstream consumer of the data (either internal or external to the pipeline) expects the values from various trigger fires to be independent (e.g. when injecting into a system that generates a sum of the values injected). It is also the most efficient in terms of amount of data buffered, though for associative and commutative operations which can be modeled as a Dataflow Combiner, the efficiency delta will often be minimal. For our video sessions use case, this is not sufficient, since it is impractical to require downstream consumers of our data to stitch together partial sessions.
丢弃:触发后,窗口内容就被丢弃了,之后的结果与之前的结果无关。这个模式在下游数据使用者(管道内部或者外部)希望来自不同的触发器计算结果是各自独立的(比如当数据注入到求和系统中的时候)时候很实用。这个模式在数据缓存方面也是最有效的,然而对于可以建模成Dataflow的Combiner的累积的、交替操作,效率增量一般是最小的。在视频会话实例场景中,抛弃模式效率并不好,因为要求下游的数据消费者只关心部分的会话是不切实际的。
? Accumulating: Upon triggering, window contents are left intact in persistent state, and later results become a refinement of previous results. This is useful when the downstream consumer expects to overwrite old values with new ones when receiving multiple results for the same window, and is effectively the mode used in Lambda Architecture systems, where the streaming pipeline produces low-latency results, which are then overwritten in the future by the results from the batch pipeline. For video sessions, this might be sufficient if we are simply calculating sessions and then immediately writing them to some output source that supports updates (e.g. a database or key/value store).
累积:触发后,窗口内容被完整的保留在持久化的状态中,后来的结果就成了对前一次结果的改良。当下游的数据消费者希望在接收到同一个窗口的多个计算结果,并且希望用新的结果覆写旧的数据的时候这个模式就很有用了。当流管道产生可能在后面被批处理管道产生的新结果覆写的低延迟结果的时候,这个模式在Lambda架构里面十分有效。对于视频会话实例场景中,如果单纯的计算然后立刻写到一些支持更新的输出中的话,这个模式很有效(数据库或者键/值存储)。
? Accumulating & Retracting: Upon triggering, in addition to the Accumulating semantics, a copy of the emitted value is also stored in persistent state. When the window triggers again in the future, a retraction for the previous value will be emitted first, followed by the new value as a normal datum. Retractions are necessary in pipelines with multiple serial GroupByKeyAndWindow operations, since the multiple results generated by a single window over subsequent trigger fires may end up on separate keys when grouped downstream. In that case, the second grouping operation will generate incorrect results for those keys unless it is informed via a retraction that the effects of the original output should be reversed. Dataflow Combiner operations that are also reversible can support retractions efficiently via an uncombine method. For video sessions, this mode is the ideal. If we are performing aggregations downstream from session creation that depend on properties of the sessions themselves, for example detecting unpopular ads (such as those which are viewed for less than five seconds in a majority of sessions), initial results may be invalidated as inputs evolve over time, e.g. as a significant number of offline mobile viewers come back online and upload session data. Retractions provide a way for us to adapt to these types of changes in complex pipelines with multiple serial grouping stages.
累积和撤回:触发后,除了累积语义之外,一个输出副本也被存储在持久化的状态中。当窗口再次被触发的时候,一个以前的输出结果的撤回操作讲会首先执行,然后新的结果作为正常输出。撤回在多个串行GroupByKeyAndWindow操作中是必须的,因为由单一窗口所产生的多个触发结果可能会在下游中被分组到不同的键对应的数据中去。在这种情况下,除非通过一个撤回操作撤回上一次分组的结果,否则第二个分组操作就会产生错误的结果。Dataflow Combiner操作同样也是可逆的,它可以通过uncombine方法有效的支持撤回操作。在视频会话中,这个模式是很理想的。如果我们在下游会话创建的一开始,就基于会话本身的一些属性执行汇总操作,比如检查不受欢迎的广告(比如在很多的会话中都被关看不到5秒的广告),初始的结果随着输入的增加会变的无效,比如说离线手机用户再次上线并且更新数据。撤回为我们提供了一种适应这些复杂的、有多个串行组合阶段的管道的方法。(简单的撤回实现只能支持确定性的计算,而非确定性计算的支持需要更复杂,代价也更高。我们已经看到这样的使用场景,比如说概率模型 译者注:比如说基于布隆过滤器的UV统计)
2.4 Examples
2.4 例子
We will now consider a series of examples that highlight the plurality of useful output patterns supported by the Dataflow Model. We will look at each example in the context of the integer summation pipeline from Section 2.2.3:
我们将会考虑一系列的例子,并且高亮显示Dataflow模式支持的多种有用的输出格式。下面的例子是关于2.2.3 中提到的整数求和管道:
Let us assume we have an input source from which we are observing ten data points, each themselves small integer values. We will consider them in the context of both bounded and unbounded data sources. For diagrammatic simplicity, we will assume all these data are for the same key; in a real pipeline, the types of operations we describe here would be happening in parallel for multiple keys. Figure 5 diagrams how these data relate together along both axes of time we care about. The X axis plots the data in event time (i.e. when the events actually occurred), while the Y axis plots the data in processing time (i.e. when the pipeline observes them). All examples assume execution on our streaming engine unless otherwise specified.
假设有一个输入源,我们从中选择十个数据点,每一个都是比较小的整数。我们将会考虑有界输入源与无界数据输入源两种情况。为了画图的简单,我们假设所有的数据都属于一个键。在一个真正的管道中,我们描述的操作类型都是多个键并行处理的。图5 画出了我们关心的两个时间轴的联系。X轴是事件时间(比如事件实际发生的时间),Y轴画出了处理时间(管道聚焦于这些数据上的时间)所有的例子都假设数据处理执行都是在流处理引擎上的。
Many of the examples will also depend on watermarks, in which cases we will include them in our diagrams. We will graph both the ideal watermark and an example actual watermark. The straight dotted line with slope of one represents the ideal watermark, i.e. if there were no event-time skew and all events were processed by the system as they occurred. Given the vagaries of distributed systems, skew is a common occurrence; this is exemplified by the meandering path the actual watermark takes in Figure 5, represented by the darker, dashed line. Note also that the heuristic nature of this watermark is exemplified by the single “late” datum with value 9 that appears behind the watermark.
在很多的例子中也会用到水位标记,在我们的图中也用到了。我们画出理想的水位标记和实际的水位标记。直的虚线代表着理想水位标记,也就是说在系统在事件发生的时候就已经处理了它们,以至于没有事件时间延迟。不过考虑到分布式系统的一些特性,延迟是一个常见的偏差;在图5中实际的那条黑色的、弯曲的水位标记线就很好的例证了这一点。另外要注意,在水位标记后面出现的单独的延迟的数值9也表明了实际的水位标记线是猜测获得的。
If we were to process these data in a classic batch system using the described summation pipeline, we would wait for all the data to arrive, group them together into one bundle (since these data are all for the same key), and sum their values to arrive at total result of 51. This result is represented by the darkened rectangle in Figure 6, whose area covers the ranges of event and processing time included in the sum (with the top of the rectangle denoting when in processing time the result was materialized). Since classic batch processing is event-time agnostic, the result is contained within a single global window covering all of event time. And since outputs are only calculated once all inputs are received, the result covers all of processing time for the execution.
如果我们在传统的批处理系统上构建的上述的数据求和管道处理数据,我们应该等到所有的数据都抵达,将他们聚合成一批(假设这些数据都是同一个键对应的值),然后再进行求和,最后得到结果51。这个结果的示意图如图6中的黑色边框,长方形区域代表求和运算涵盖的处理时间和事件时间。
Note the inclusion of watermarks in this diagram. Though not typically used for classic batch processing, watermarks would semantically be held at the beginning of time until all data had been processed, then advanced to infinity. An important point to note is that one can get identical semantics to classic batch by running the data through a streaming system with watermarks progressed in this manner.
注意此图中包含水位标记,虽然通常不用于经典的批处理,但是在语义上我们仍然可以引入这个概念。在一开始的时候在所有的数据都到达之前水位标记线是不动的,到达之后水位线近似平行于事件发生时间轴开始平移,然后一直到无限远处。一个重点是,我们可以通过在流式系统上运行一个水位标记进程来在批处理系统上得到同样的语义。(这提示我们其实水位线的概念可以同样适用于批处理)
Now let us say we want to convert this pipeline to run over an unbounded data source. In Dataflow, the default triggering semantics are to emit windows when the watermark passes them. But when using the global window with an unbounded input source, we are guaranteed that will never happen, since the global window covers all of event time. As such, we will need to either trigger by something other than the default trigger, or window by something other than the global window. Otherwise, we will never get any output.
我们想要转换管道让它可以在无界数据上运行。在Dataflow里面,默认的触发器语义是当水位标记通过的时候制造窗口。但是当对无界的输入数据源使用全局窗口的时候,那么窗口就永远都不会被触发,因为全局窗口覆盖了所有的事件时间。因此,我们需要在默认触发器外设置其他的触发器计算。或者按照另外的方式打开窗口而不是全局窗口。否则我们就永远得不到任何输出。
Let us first look at changing the trigger, since this will allow us to to generate conceptually identical output (a global per-key sum over all time), but with periodic updates. In this example, we apply a Window.trigger operation that repeatedly fires on one-minute periodic processing-time boundaries. We also specify Accumulating mode so that our global sum will be refined over time (this assumes we have an output sink into which we can simply overwrite previous results for the key with new results, e.g. a database or key/value store). Thus, in Figure 7, we generate updated global sums once per minute of processing time. Note how the semi-transparent output rectangles overlap, since Accumulating panes build upon prior results by incorporating overlapping regions of processing time:
我们首先看一下改变窗口这条思路,因为这会帮助我们产生概念上一致的输出(一个全局的包含所有时间的按键求和),以及周期性的更新结果。在这个例子中,我们使用 Window.trigger操作在一分钟的周期内重复触发窗口。我们使用累积模式来修正我们的全局求和的结果,确保其比较精确(这假设我们将结果输出到一个数据库或者是键值对中)。因此,在图7中我们生成了每分钟更新的全局求和结果。注意半透明的输出长方形是重叠的,因为累积窗格模式下计算时包含了之前的输出窗口的内容:
If we instead wanted to generate the delta in sums once per minute, we could switch to Discarding mode, as in Figure 8. Note that this effectively gives the processing-time windowing semantics provided by many streaming systems. The output panes no longer overlap, since their results incorporate data from independent regions of processing time.
如果我们想要求出每分钟的求和增量,那么我们应该转化为丢弃模式。就跟在图8中的一样。注意,通过流式系统提供的处理时间窗口化的计算模式很有效。输出窗格不再重叠,因为输出结果来自于相互独立的时间内的数据。
Another, more robust way of providing processing-time windowing semantics is to simply assign arrival time as event times at data ingress, then use event time windowing. A nice side effect of using arrival time event times is that the system has perfect knowledge of the event times in flight, and thus can provide perfect (i.e. non-heuristic) watermarks, with no late data. This is an effective and cost-efficient way of processing unbounded data for use cases where true event times are not necessary or available.
另一种更加健壮的方式是:提供处理时间窗口实现方式,将数据进入处理管道的时间分配为数据的事件时间,然后使用这个事件时间进行窗口化。使用读取时间作为事件时间的一个好处是系统对流式系统的事件时间十分的清楚,因此可以提供完美的、无延迟数据的水位标记(比如无启发式)。所以这种方式在无界数据且事件时间并非必须或者无法获取的情况下是一种非常低成本且有效的方式。
Before we look more closely at other windowing options, let us consider one more change to the triggers for this pipeline. The other common windowing mode we would like to model is tuple-based windows. We can provide this sort of functionality by simply changing the trigger to fire after a certain number of data arrive, say two. In Figure 9, we get five outputs, each containing the sum of two adjacent (by processing time) data. More sophisticated tuple-based windowing schemes (e.g. sliding tuple-based windows) require custom windowing strategies, but are otherwise supported.
在我们讨论更多的窗口类型前,先看一下对于对于处理管道的触发器的改进。另一种常用的窗口化模式是基于记录数据数的窗口策略。我们可以通过简单的修改触发器使其在一定数据达到后被触发,来实现基于记录数据数的窗口化。图9中我们给定五个输出,每一个都包含了两个在处理时间上邻近的数据的和。更加精密的基于记录数的窗口方案(比如基于记录数的滑动窗口)需要定制化窗口策略来支持。
Let us now return to the other option for supporting unbounded sources: switching away from global windowing. To start with, let us window the data into fixed, two-minute Accumulating windows:
让我们回到另一个支持无界数据源的选项:将视线从全局窗口上移开。一开始,我们观察固定的、两分钟宽度的累积窗口:
With no trigger strategy specified, the system would use the default trigger, which is effectively:
没有定义触发器策略,系统只能使用默认的有效触发器:
The watermark trigger fires when the watermark passes the end of the window in question. Both batch and streaming engines implement watermarks, as detailed in Section 3.1. The Repeat call in the trigger is used to handle late data; should any data arrive after the watermark, they will instantiate the repeated watermark trigger, which will fire immediately since the watermark has already passed.
水位标记触发器在水位线到达窗口尽头的时候会被触发。我们假设批处理和流式引擎都实现了水位标记,在3.1节中详细解释。Repeat在触发器中用于处理后续迟到的数据,也就是那些在水位标记之后抵达的数据。这就意味着当越过水位线之后,水位标志触发器会立刻触发。因为按照定义来说,此时的水位线已经越过了窗口尽头了。
Figures 10?12 each characterize this pipeline on a different type of runtime engine. We will first observe what execution of this pipeline would look like on a batch engine. Given our current implementation, the data source would have to be a bounded one, so as with the classic batch example above, we would wait for all data in the batch to arrive. We would then process the data in event-time order, with windows being emitted as the simulated watermark advances, as in Figure 10:
图10-12描述了窗口在上述三种类型的数据处理引擎上运行的特征。首先观察在批处理引擎上这个数据管道是如何运行的。受限于当前的实现情况,输入源必须是有限的,所以前面提到的就像是经典的批处理示例一样,我们应该等所有的数据到达之后再进行运算。然后我们要根据数据的事件时间进行处理,在模拟的水位线到达后窗口计算被触发,输出计算结果。整体如图10所示。
Now imagine executing a micro-batch engine over this data source with one minute micro-batches. The system would gather input data for one minute, process them, and repeat. Each time, the watermark for the current batch would start at the beginning of time and advance to the end of time (technically jumping from the end time of the batch to the end of time instantaneously, since no data would exist for that period). We would thus end up with a new watermark for every micro-batch round, and corresponding outputs for all windows whose contents had changed since the last round. This provides a very nice mix of latency and eventual correctness, as in Figure 11:
现在考虑下微型批处理引擎,每分钟弄对数据源进行一次微型批处理。系统每分钟先收集一分钟的数据,然后处理,然后再重复这个过程。每一次开始之后,水位线会从当前的批次开始并且上升到该批次的结束时间(技术上来说是即刻完成的,取决于一分钟内管道内是否会有数据积压)。这样每一轮的微型批处理都是以一个新的水位标记为结束,并且对应所有窗口都会产生不同的内容输出。者提供了一个很好地延迟和最终准确新的混合权衡。如图11:
Next, consider this pipeline executed on a streaming engine, as in Figure 12. Most windows are emitted when the watermark passes them. Note however that the datum with value 9 is actually late relative to the watermark. For whatever reason (mobile input source being offline, network partition, etc.), the system did not realize that datum had not yet been injected, and thus, having observed the 5, allowed the watermark to proceed past the point in event time that would eventually be occupied by the 9. Hence, once the 9 finally arrives, it causes the first window (for event-time range [12:00, 12:02)) to retrigger with an updated sum:
接下来,考虑在流式引擎上执行管道计算,就比如在图12中显示的那样。大部分的窗口都会在水位标记到达的时候进行输出。注意值为9的那个数据是在水位标记到达之后才抵达的。不论什么原因(手机用户的输入源可能离线,或者网络故障分区等)系统并未意识到数据那一条数据并未被注入,因此,看数据5,一开始在水位标记到达的时候触发了计算,后来值9来了之后窗口重新触发,进行计算。因此,当9最终抵达,会造成一开始的窗口(事件时间为 [12:00, 12:02)的那一个)再次触发,并且更新求和的值:
This output pattern is nice in that we have roughly one output per window, with a single refinement in the case of the late datum. But the overall latency of results is noticeably worse than the micro-batch system, on account of having to wait for the watermark to advance; this is the case of watermarks being too slow from Section 2.3.
输出格式是每一个窗口严格的一个输出,并且对迟到的数据有一个细微的改良,这一点很不错。但是总体结果的延迟比之微型批处理系统更为糟糕,因为必须要等到水位标记有动作。这就是在2.3节中说到的单纯地依赖水位标记的问题。
If we want lower latency via multiple partial results for all of our windows, we can add in some additional, processing-time-based triggers to provide us with regular updates until the watermark actually passes, as in Figure 13. This yields somewhat better latency than the micro-batch pipeline, since data are accumulated in windows as they arrive instead of being processed in small batches. Given strongly-consistent micro-batch and streaming engines, the choice between them (as well as the choice of micro-batch size) really becomes just a matter of latency versus cost, which is exactly one of the goals we set out to achieve with this model.
如果我们想要通过对所有的窗口输出多种局部结果来降低延迟,我们可以加一个额外的、基于处理时间的触发器来提供规律的刷新,直到水位标记最终通过,比如图13。这样我们可以得到比微型批处理管道更低的延迟,因为数据在会窗口中累积而不是到达之后就被小批量的处理了。假设微型批处理和流式引擎都是强一致的,那么在他们之间的选择(类似于对微型批处理的尺寸的选择)就会成为一个影响延迟与成本之间的关键问题,这也是我们这个模型所致力于要去实现的。
As one final exercise, let us update our example to satisfy the video sessions requirements (modulo the use of summation as the aggregation operation, which we will maintain for diagrammatic consistency; switching to another aggregation would be trivial), by updating to session windowing with a one minute timeout and enabling retractions. This highlights the composability provided by breaking the model into four pieces (what you are computing, where in event time you are computing it, when in processing time you are observing the answers, and how those answers relate to later refinements), and also illustrates the power of reverting previous values which otherwise might be left uncorrelated to the value offered as replacement.
作为最后一个例子,通过更新会话窗口化,设置终止时间为1分钟,并且开启撤回功能,我们来看下更新后如何满足视频会话的需求(为了图表的一致性,我们继续把求合作为我们的示范操作,转变为别的聚合操作也是很简单的)。高亮的部分也显示了我们将模型的四个维度拆开之后所带来的高灵活性(就算什么?在哪一段事件时间里面计算?在处理时间中什么时候看到结果?结果如何与后面的细微改良相关联?),同时这也展示了可以可以进行撤回操作是一个强力的工具,否则可能会让下游之前接受到的数据无法进行修正。
In this example, we output initial singleton sessions for values 5 and 7 at the first one-minute processing-time boundary. At the second minute boundary, we output a third session with value 10, built up from the values 3, 4, and 3. When the value of 8 is finally observed, it joins the two sessions with values 7 and 10. As the watermark passes the end of this new combined session, retractions for the 7 and 10 sessions are emitted, as well as a normal datum for the new session with value 25. Similarly, when the 9 arrives (late), it joins the session with value 5 to the session with value 25. The repeated watermark trigger then immediately emits retractions for the 5 and the 25, followed by a combined session of value 39. A similar dance occurs for the values 3, 8, and 1, ultimately ending with a retraction for an initial 3 session, followed by a combined session of 12.
在这个例子中,由于5和7之间事件发生时间大于1分钟,因此被当做了两个会话,因为两者之间的事件时间大于一分钟,所以被当做两个结果输出。在第二分钟的分界线上,我们输出了第三个值为10的会话,它是由数据3、4、3得来的。当数据8最后被系统观察到的时候,它就加入了数据7和10的会话中,当水位标记到达这个合并后的和窗口的终点对数据7和10的撤回操作就会发出,撤回方式是往下游发送两条键为之前的两个会话标记,值为-7和-10的记录,然后发送一个值为25的新的窗口计算结果。同样的,当数据9到达(迟到的),它与值为5的会话一起与值为25会话组成了新窗口。重复的水位标记触发器马上对值5和25进行撤回操作,并且随之发送一个值为39的合并窗口。一个类似的行为发生在值3、8、1上,最终对三个初始的窗口进行撤回操作,随之的是一个结果为12的新会话。
正文之后
本文大部分是自己翻译,小部分比较长、难的句子,则是参考了欧陆字典、Google 翻译、以及最重要的下面的来自阿里云的一篇已经翻译好的部分!