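Based on the stable 1.6.x release of the big data processing engine Spark, this book examines Spark Streaming, the real-time computing framework built on Spark, from several angles: application cases, principles, source code, execution flow, and tuning. After outlining the Spark Streaming architecture, it starts from the basic source code and moves from simple to complex, guiding readers who already have basic knowledge of Spark and Spark Streaming through advanced study of the framework. Readers come to understand the principles and runtime mechanisms of Spark Streaming, and the book serves as a technical reference for decisions about, and applications of, stream data processing. To meet the needs of in-depth Spark Streaming applications, it also analyzes Spark Streaming performance tuning and offers guidance on modifying and extending Spark Streaming functionality.

This book is suitable for CTOs, architects, and senior software engineers in the big data field, especially Spark practitioners who already have basic Spark Streaming knowledge. It can also serve as a reference for graduate students and senior undergraduates who need to study Spark and Spark Streaming in depth.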
Contents
- Chapter 1 Overview of Spark Streaming Applications
- 1.1 Spark Streaming Application Cases
- 1.2 Analysis of Spark Streaming Applications
- Chapter 2 Basic Principles of Spark Streaming
- 2.1 Introduction to Spark Core
- 2.2 Spark Streaming Design Philosophy
- 2.3 Spark Streaming Overall Architecture
- 2.4 Programming Interface
- Chapter 3 Spark Streaming Execution Flow in Detail
- 3.1 From StreamingContext Initialization to Startup
- 3.2 Data Receiving
- 3.3 Data Processing
- 3.4 Data Cleanup
- 3.5 Fault Tolerance Mechanisms
- 3.5.1 Fault Tolerance Principles
- 3.5.2 Driver Fault Tolerance
- 3.5.3 Executor Fault Tolerance
- 3.6 The No Receiver Approach
- 3.7 Avoiding Duplicate Output
- 3.8 Dynamic Control of the Consumption Rate
- 3.9 State Operations
- 3.10 Window Operations
- 3.11 Page Display
- 3.12 Stopping a Spark Streaming Application
- Chapter 4 Spark Streaming Performance Tuning Mechanisms
- 4.1 Parallelism Explained
- 4.1.1 Parallelism of Data Receiving
- 4.1.2 Parallelism of Data Processing
- 4.2 Memory
- 4.3 Serialization
- 4.4 Batch Interval
- 4.5 Task
- 4.6 JVM GC
- Chapter 5 Stream Computing in Spark 2.0
- 5.1 Continuous Applications
- 5.2 Unbounded Table
- 5.3 Incremental Output Mode
- 5.4 API Simplification
- 5.5 Other Improvements