Spark Cluster Setup and Basic Operations

Category: Scala · Published: 5 years ago


Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is an open-source, Hadoop MapReduce-like general parallel framework developed at UC Berkeley's AMP Lab. It retains the advantages of Hadoop MapReduce, but unlike MapReduce, a job's intermediate results can be kept in memory, so there is no need to read and write HDFS between steps. This makes Spark better suited to iterative algorithms of the kind used in data mining and machine learning.


Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in useful ways that make Spark superior for certain workloads. In particular, Spark enables in-memory distributed datasets, which lets it optimize iterative workloads in addition to providing interactive queries.

Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark integrates tightly with Scala, letting you manipulate distributed datasets as easily as local collection objects.
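This tight integration shows up in the shared API: an RDD supports the same higher-order operations (map, filter, reduce, and so on) as a local Scala collection. A minimal local-collection sketch, with the corresponding RDD call shown as a comment (the RDD line assumes a spark-shell SparkContext named `sc`):

```scala
// Local Scala collections and RDDs share the same functional vocabulary.
val nums    = List(1, 2, 3, 4, 5)
val doubled = nums.map(_ * 2)        // List(2, 4, 6, 8, 10)
val total   = doubled.reduce(_ + _)  // 30

// The equivalent on a distributed dataset (assumes a SparkContext `sc`):
//   sc.parallelize(1 to 5).map(_ * 2).reduce(_ + _)
println(total)
```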

Although Spark was created to support iterative jobs on distributed datasets, it is in fact a complement to Hadoop and can run in parallel on the Hadoop file system. This is supported through a third-party cluster framework named Mesos.

Spark was developed at UC Berkeley's AMP Lab (Algorithms, Machines, and People Lab) and can be used to build large-scale, low-latency data analytics applications.


2. After starting spark-shell, load the data 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 in Scala, find the numbers whose doubled value is divisible by 3, and inspect the RDD lineage with the toDebugString method. Submit the commands and output as text in the answer box.

scala> val num = sc.parallelize(1 to 10)
num: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> val doublenum = num.map(_*2)
doublenum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:29

scala> val threenum = doublenum.filter(_%3==0)
threenum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:31
scala> threenum.collect
19/02/11 13:06:18 INFO SparkContext: Starting job: collect at <console>:34
19/02/11 13:06:18 INFO DAGScheduler: Got job 0 (collect at <console>:34) with 2 output partitions
19/02/11 13:06:18 INFO DAGScheduler: Final stage: ResultStage 0 (collect at <console>:34)
19/02/11 13:06:18 INFO DAGScheduler: Parents of final stage: List()
19/02/11 13:06:18 INFO DAGScheduler: Missing parents: List()
19/02/11 13:06:18 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:31), which has no missing parents
19/02/11 13:06:18 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.2 KB, free 511.1 MB)
19/02/11 13:06:18 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1333.0 B, free 511.1 MB)
19/02/11 13:06:18 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:58532 (size: 1333.0 B, free: 511.1 MB)
19/02/11 13:06:18 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008
19/02/11 13:06:18 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:31)
19/02/11 13:06:18 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/02/11 13:06:18 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2078 bytes)
19/02/11 13:06:18 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2135 bytes)
19/02/11 13:06:18 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
19/02/11 13:06:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/02/11 13:06:18 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 902 bytes result sent to driver
19/02/11 13:06:18 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 906 bytes result sent to driver
19/02/11 13:06:18 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 207 ms on localhost (1/2)
19/02/11 13:06:18 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 172 ms on localhost (2/2)
19/02/11 13:06:18 INFO DAGScheduler: ResultStage 0 (collect at <console>:34) finished in 0.275 s
19/02/11 13:06:18 INFO DAGScheduler: Job 0 finished: collect at <console>:34, took 0.714041 s
19/02/11 13:06:18 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
res2: Array[Int] = Array(6, 12, 18)

scala> threenum.toDebugString
res3: String = 
(2) MapPartitionsRDD[2] at filter at <console>:31 []
 |  MapPartitionsRDD[1] at map at <console>:29 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:27 []

scala>
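The same computation can be sanity-checked on a plain Scala collection, since map and filter behave identically on a local Range (a sketch; no SparkContext needed):

```scala
// Doubling 1..10 gives the even numbers 2..20; of those, 6, 12, and 18 are divisible by 3.
val result = (1 to 10).toList.map(_ * 2).filter(_ % 3 == 0)
println(result)  // List(6, 12, 18)
```

This agrees with the Array(6, 12, 18) returned by collect in the transcript above.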

3. After starting spark-shell, load the key-value data ("A",1), ("B",2), ("C",3), ("A",4), ("B",5), ("C",4), ("A",3), ("A",9), ("B",4), ("D",5) in Scala, sort the data in ascending order by key, and group it by key. Submit the commands and output as text in the answer box.

scala> val kv=sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),("A",3),("A",9),("B",4),("D",5)))
kv: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:27

scala> kv.sortByKey().collect
19/02/11 13:20:10 INFO SparkContext: Starting job: sortByKey at <console>:30
19/02/11 13:20:10 INFO DAGScheduler: Got job 1 (sortByKey at <console>:30) with 2 output partitions
19/02/11 13:20:10 INFO DAGScheduler: Final stage: ResultStage 1 (sortByKey at <console>:30)
19/02/11 13:20:10 INFO DAGScheduler: Parents of final stage: List()
19/02/11 13:20:10 INFO DAGScheduler: Missing parents: List()
19/02/11 13:20:10 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at sortByKey at <console>:30), which has no missing parents
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.3 KB, free 511.1 MB)
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1398.0 B, free 511.1 MB)
19/02/11 13:20:10 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:58532 (size: 1398.0 B, free: 511.1 MB)
19/02/11 13:20:10 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008
19/02/11 13:20:10 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at sortByKey at <console>:30)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
19/02/11 13:20:10 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2261 bytes)
19/02/11 13:20:10 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2255 bytes)
19/02/11 13:20:10 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
19/02/11 13:20:10 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
19/02/11 13:20:10 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1179 bytes result sent to driver
19/02/11 13:20:10 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1178 bytes result sent to driver
19/02/11 13:20:10 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 71 ms on localhost (1/2)
19/02/11 13:20:10 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 61 ms on localhost (2/2)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
19/02/11 13:20:10 INFO DAGScheduler: ResultStage 1 (sortByKey at <console>:30) finished in 0.075 s
19/02/11 13:20:10 INFO DAGScheduler: Job 1 finished: sortByKey at <console>:30, took 0.095223 s
19/02/11 13:20:10 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:20:10 INFO DAGScheduler: Registering RDD 3 (parallelize at <console>:27)
19/02/11 13:20:10 INFO DAGScheduler: Got job 2 (collect at <console>:30) with 2 output partitions
19/02/11 13:20:10 INFO DAGScheduler: Final stage: ResultStage 3 (collect at <console>:30)
19/02/11 13:20:10 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
19/02/11 13:20:10 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 2)
19/02/11 13:20:10 INFO DAGScheduler: Submitting ShuffleMapStage 2 (ParallelCollectionRDD[3] at parallelize at <console>:27), which has no missing parents
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.3 KB, free 511.1 MB)
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1427.0 B, free 511.1 MB)
19/02/11 13:20:10 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:58532 (size: 1427.0 B, free: 511.1 MB)
19/02/11 13:20:10 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1008
19/02/11 13:20:10 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 2 (ParallelCollectionRDD[3] at parallelize at <console>:27)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
19/02/11 13:20:10 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4, localhost, partition 0,PROCESS_LOCAL, 2250 bytes)
19/02/11 13:20:10 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 5, localhost, partition 1,PROCESS_LOCAL, 2244 bytes)
19/02/11 13:20:10 INFO Executor: Running task 0.0 in stage 2.0 (TID 4)
19/02/11 13:20:10 INFO Executor: Running task 1.0 in stage 2.0 (TID 5)
19/02/11 13:20:10 INFO Executor: Finished task 0.0 in stage 2.0 (TID 4). 1159 bytes result sent to driver
19/02/11 13:20:10 INFO Executor: Finished task 1.0 in stage 2.0 (TID 5). 1159 bytes result sent to driver
19/02/11 13:20:10 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 70 ms on localhost (1/2)
19/02/11 13:20:10 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 5) in 69 ms on localhost (2/2)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
19/02/11 13:20:10 INFO DAGScheduler: ShuffleMapStage 2 (parallelize at <console>:27) finished in 0.074 s
19/02/11 13:20:10 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:20:10 INFO DAGScheduler: running: Set()
19/02/11 13:20:10 INFO DAGScheduler: waiting: Set(ResultStage 3)
19/02/11 13:20:10 INFO DAGScheduler: failed: Set()
19/02/11 13:20:10 INFO DAGScheduler: Submitting ResultStage 3 (ShuffledRDD[6] at sortByKey at <console>:30), which has no missing parents
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.9 KB, free 511.1 MB)
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1770.0 B, free 511.1 MB)
19/02/11 13:20:10 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:58532 (size: 1770.0 B, free: 511.1 MB)
19/02/11 13:20:10 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1008
19/02/11 13:20:10 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 3 (ShuffledRDD[6] at sortByKey at <console>:30)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
19/02/11 13:20:10 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:20:10 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 7, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:20:10 INFO Executor: Running task 0.0 in stage 3.0 (TID 6)
19/02/11 13:20:10 INFO Executor: Running task 1.0 in stage 3.0 (TID 7)
19/02/11 13:20:10 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:20:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 18 ms
19/02/11 13:20:10 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:20:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 28 ms
19/02/11 13:20:10 INFO Executor: Finished task 0.0 in stage 3.0 (TID 6). 1433 bytes result sent to driver
19/02/11 13:20:10 INFO Executor: Finished task 1.0 in stage 3.0 (TID 7). 1347 bytes result sent to driver
19/02/11 13:20:10 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 6) in 295 ms on localhost (1/2)
19/02/11 13:20:10 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 7) in 295 ms on localhost (2/2)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
19/02/11 13:20:10 INFO DAGScheduler: ResultStage 3 (collect at <console>:30) finished in 0.300 s
19/02/11 13:20:10 INFO DAGScheduler: Job 2 finished: collect at <console>:30, took 0.456128 s
res5: Array[(String, Int)] = Array((A,1), (A,4), (A,3), (A,9), (B,2), (B,5), (B,4), (C,3), (C,4), (D,5))

scala>
scala> kv.groupByKey().collect
19/02/11 13:21:15 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:21:15 INFO DAGScheduler: Registering RDD 3 (parallelize at <console>:27)
19/02/11 13:21:15 INFO DAGScheduler: Got job 3 (collect at <console>:30) with 2 output partitions
19/02/11 13:21:15 INFO DAGScheduler: Final stage: ResultStage 5 (collect at <console>:30)
19/02/11 13:21:15 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 4)
19/02/11 13:21:15 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 4)
19/02/11 13:21:15 INFO DAGScheduler: Submitting ShuffleMapStage 4 (ParallelCollectionRDD[3] at parallelize at <console>:27), which has no missing parents
19/02/11 13:21:15 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.9 KB, free 511.1 MB)
19/02/11 13:21:15 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1644.0 B, free 511.1 MB)
19/02/11 13:21:15 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:58532 (size: 1644.0 B, free: 511.1 MB)
19/02/11 13:21:15 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1008
19/02/11 13:21:15 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 4 (ParallelCollectionRDD[3] at parallelize at <console>:27)
19/02/11 13:21:15 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks
19/02/11 13:21:15 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 8, localhost, partition 0,PROCESS_LOCAL, 2250 bytes)
19/02/11 13:21:15 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 9, localhost, partition 1,PROCESS_LOCAL, 2244 bytes)
19/02/11 13:21:15 INFO Executor: Running task 0.0 in stage 4.0 (TID 8)
19/02/11 13:21:15 INFO Executor: Finished task 0.0 in stage 4.0 (TID 8). 1159 bytes result sent to driver
19/02/11 13:21:15 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 8) in 52 ms on localhost (1/2)
19/02/11 13:21:15 INFO Executor: Running task 1.0 in stage 4.0 (TID 9)
19/02/11 13:21:15 INFO Executor: Finished task 1.0 in stage 4.0 (TID 9). 1159 bytes result sent to driver
19/02/11 13:21:15 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 9) in 75 ms on localhost (2/2)
19/02/11 13:21:15 INFO DAGScheduler: ShuffleMapStage 4 (parallelize at <console>:27) finished in 0.076 s
19/02/11 13:21:15 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
19/02/11 13:21:15 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:21:15 INFO DAGScheduler: running: Set()
19/02/11 13:21:15 INFO DAGScheduler: waiting: Set(ResultStage 5)
19/02/11 13:21:15 INFO DAGScheduler: failed: Set()
19/02/11 13:21:15 INFO DAGScheduler: Submitting ResultStage 5 (ShuffledRDD[7] at groupByKey at <console>:30), which has no missing parents
19/02/11 13:21:15 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.9 KB, free 511.1 MB)
19/02/11 13:21:15 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.1 KB, free 511.1 MB)
19/02/11 13:21:15 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:58532 (size: 2.1 KB, free: 511.1 MB)
19/02/11 13:21:15 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1008
19/02/11 13:21:15 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 5 (ShuffledRDD[7] at groupByKey at <console>:30)
19/02/11 13:21:15 INFO TaskSchedulerImpl: Adding task set 5.0 with 2 tasks
19/02/11 13:21:15 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 10, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:21:15 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID 11, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:21:15 INFO Executor: Running task 0.0 in stage 5.0 (TID 10)
19/02/11 13:21:15 INFO Executor: Running task 1.0 in stage 5.0 (TID 11)
19/02/11 13:21:15 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:21:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:21:15 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:21:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:21:15 INFO Executor: Finished task 0.0 in stage 5.0 (TID 10). 1780 bytes result sent to driver
19/02/11 13:21:15 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 10) in 84 ms on localhost (1/2)
19/02/11 13:21:15 INFO Executor: Finished task 1.0 in stage 5.0 (TID 11). 1789 bytes result sent to driver
19/02/11 13:21:15 INFO TaskSetManager: Finished task 1.0 in stage 5.0 (TID 11) in 103 ms on localhost (2/2)
19/02/11 13:21:15 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 
19/02/11 13:21:15 INFO DAGScheduler: ResultStage 5 (collect at <console>:30) finished in 0.113 s
19/02/11 13:21:15 INFO DAGScheduler: Job 3 finished: collect at <console>:30, took 0.252907 s
res6: Array[(String, Iterable[Int])] = Array((B,CompactBuffer(2, 5, 4)), (D,CompactBuffer(5)), (A,CompactBuffer(1, 4, 3, 9)), (C,CompactBuffer(3, 4)))

scala>
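sortByKey and groupByKey have direct local-collection analogues, sortBy and groupBy, which can be used to check the results above (a sketch; Scala's sortBy is stable, so values keep their input order within each key, just as in the sortByKey output):

```scala
val kv = List(("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),("A",3),("A",9),("B",4),("D",5))

// Ascending sort by key (stable: ties keep their original relative order).
val sorted = kv.sortBy(_._1)

// Group values by key, keeping only the values.
val grouped = kv.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

println(sorted)        // List((A,1), (A,4), (A,3), (A,9), (B,2), (B,5), (B,4), (C,3), (C,4), (D,5))
println(grouped("A"))  // List(1, 4, 3, 9)
```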

4. After starting spark-shell, load the key-value data ("A",1), ("B",3), ("C",5), ("D",4), ("B",7), ("C",4), ("E",5), ("A",8), ("B",4), ("D",5) in Scala, sort the data in ascending order by key, and sum the values for each key. Submit the commands and output as text in the answer box.

scala> val kv1=sc.parallelize(List(("A",1),("B",3),("C",5),("D",4),("B",7),("C",4),("E",5),("A",8),("B",4),("D",5)))
kv1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> kv1.sortByKey().collect

19/02/11 13:31:01 INFO SparkContext: Starting job: sortByKey at <console>:30
19/02/11 13:31:01 INFO DAGScheduler: Got job 0 (sortByKey at <console>:30) with 2 output partitions
19/02/11 13:31:01 INFO DAGScheduler: Final stage: ResultStage 0 (sortByKey at <console>:30)
19/02/11 13:31:01 INFO DAGScheduler: Parents of final stage: List()
19/02/11 13:31:01 INFO DAGScheduler: Missing parents: List()
19/02/11 13:31:01 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at sortByKey at <console>:30), which has no missing parents
19/02/11 13:31:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.3 KB, free 511.1 MB)
19/02/11 13:31:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1408.0 B, free 511.1 MB)
19/02/11 13:31:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:49365 (size: 1408.0 B, free: 511.1 MB)
19/02/11 13:31:03 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008
19/02/11 13:31:03 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at sortByKey at <console>:30)
19/02/11 13:31:03 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/02/11 13:31:03 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2260 bytes)
19/02/11 13:31:03 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2249 bytes)
19/02/11 13:31:03 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/02/11 13:31:03 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
19/02/11 13:31:04 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1178 bytes result sent to driver
19/02/11 13:31:04 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1177 bytes result sent to driver
19/02/11 13:31:04 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 657 ms on localhost (1/2)
19/02/11 13:31:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 582 ms on localhost (2/2)
19/02/11 13:31:04 INFO DAGScheduler: ResultStage 0 (sortByKey at <console>:30) finished in 0.814 s
19/02/11 13:31:04 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
19/02/11 13:31:04 INFO DAGScheduler: Job 0 finished: sortByKey at <console>:30, took 2.947240 s
19/02/11 13:31:04 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:31:04 INFO DAGScheduler: Registering RDD 0 (parallelize at <console>:27)
19/02/11 13:31:04 INFO DAGScheduler: Got job 1 (collect at <console>:30) with 2 output partitions
19/02/11 13:31:04 INFO DAGScheduler: Final stage: ResultStage 2 (collect at <console>:30)
19/02/11 13:31:04 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
19/02/11 13:31:04 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
19/02/11 13:31:04 INFO DAGScheduler: Submitting ShuffleMapStage 1 (ParallelCollectionRDD[0] at parallelize at <console>:27), which has no missing parents
19/02/11 13:31:04 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.3 KB, free 511.1 MB)
19/02/11 13:31:04 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1424.0 B, free 511.1 MB)
19/02/11 13:31:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:49365 (size: 1424.0 B, free: 511.1 MB)
19/02/11 13:31:04 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008
19/02/11 13:31:04 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 1 (ParallelCollectionRDD[0] at parallelize at <console>:27)
19/02/11 13:31:04 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
19/02/11 13:31:04 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2249 bytes)
19/02/11 13:31:04 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2238 bytes)
19/02/11 13:31:04 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
19/02/11 13:31:04 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
19/02/11 13:31:04 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1159 bytes result sent to driver
19/02/11 13:31:04 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1159 bytes result sent to driver
19/02/11 13:31:04 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 176 ms on localhost (1/2)
19/02/11 13:31:04 INFO DAGScheduler: ShuffleMapStage 1 (parallelize at <console>:27) finished in 0.225 s
19/02/11 13:31:04 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:31:04 INFO DAGScheduler: running: Set()
19/02/11 13:31:04 INFO DAGScheduler: waiting: Set(ResultStage 2)
19/02/11 13:31:04 INFO DAGScheduler: failed: Set()
19/02/11 13:31:04 INFO DAGScheduler: Submitting ResultStage 2 (ShuffledRDD[3] at sortByKey at <console>:30), which has no missing parents
19/02/11 13:31:04 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 209 ms on localhost (2/2)
19/02/11 13:31:04 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
19/02/11 13:31:04 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 511.1 MB)
19/02/11 13:31:04 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1770.0 B, free 511.1 MB)
19/02/11 13:31:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:49365 (size: 1770.0 B, free: 511.1 MB)
19/02/11 13:31:04 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1008
19/02/11 13:31:04 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 2 (ShuffledRDD[3] at sortByKey at <console>:30)
19/02/11 13:31:04 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
19/02/11 13:31:04 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:31:04 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 5, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:31:04 INFO Executor: Running task 1.0 in stage 2.0 (TID 5)
19/02/11 13:31:04 INFO Executor: Running task 0.0 in stage 2.0 (TID 4)
19/02/11 13:31:04 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:31:04 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 9 ms
19/02/11 13:31:05 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:31:05 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 17 ms
19/02/11 13:31:05 INFO Executor: Finished task 0.0 in stage 2.0 (TID 4). 1391 bytes result sent to driver
19/02/11 13:31:05 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 512 ms on localhost (1/2)
19/02/11 13:31:05 INFO Executor: Finished task 1.0 in stage 2.0 (TID 5). 1385 bytes result sent to driver
19/02/11 13:31:05 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 5) in 530 ms on localhost (2/2)
19/02/11 13:31:05 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
19/02/11 13:31:05 INFO DAGScheduler: ResultStage 2 (collect at <console>:30) finished in 0.541 s
19/02/11 13:31:05 INFO DAGScheduler: Job 1 finished: collect at <console>:30, took 0.940656 s
res0: Array[(String, Int)] = Array((A,1), (A,8), (B,3), (B,7), (B,4), (C,5), (C,4), (D,4), (D,5), (E,5))
scala> kv1.reduceByKey(_+_).collect
19/02/11 13:33:00 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:33:00 INFO DAGScheduler: Registering RDD 0 (parallelize at <console>:27)
19/02/11 13:33:00 INFO DAGScheduler: Got job 2 (collect at <console>:30) with 2 output partitions
19/02/11 13:33:00 INFO DAGScheduler: Final stage: ResultStage 4 (collect at <console>:30)
19/02/11 13:33:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 3)
19/02/11 13:33:00 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 3)
19/02/11 13:33:00 INFO DAGScheduler: Submitting ShuffleMapStage 3 (ParallelCollectionRDD[0] at parallelize at <console>:27), which has no missing parents
19/02/11 13:33:00 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.0 KB, free 511.1 MB)
19/02/11 13:33:00 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1291.0 B, free 511.1 MB)
19/02/11 13:33:00 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:49365 (size: 1291.0 B, free: 511.1 MB)
19/02/11 13:33:00 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1008
19/02/11 13:33:00 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 3 (ParallelCollectionRDD[0] at parallelize at <console>:27)
19/02/11 13:33:00 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
19/02/11 13:33:00 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, localhost, partition 0,PROCESS_LOCAL, 2249 bytes)
19/02/11 13:33:00 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 7, localhost, partition 1,PROCESS_LOCAL, 2238 bytes)
19/02/11 13:33:00 INFO Executor: Running task 0.0 in stage 3.0 (TID 6)
19/02/11 13:33:00 INFO Executor: Running task 1.0 in stage 3.0 (TID 7)
19/02/11 13:33:00 INFO Executor: Finished task 1.0 in stage 3.0 (TID 7). 1159 bytes result sent to driver
19/02/11 13:33:00 INFO Executor: Finished task 0.0 in stage 3.0 (TID 6). 1159 bytes result sent to driver
19/02/11 13:33:00 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 7) in 52 ms on localhost (1/2)
19/02/11 13:33:00 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 6) in 52 ms on localhost (2/2)
19/02/11 13:33:00 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
19/02/11 13:33:00 INFO DAGScheduler: ShuffleMapStage 3 (parallelize at <console>:27) finished in 0.054 s
19/02/11 13:33:00 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:33:00 INFO DAGScheduler: running: Set()
19/02/11 13:33:00 INFO DAGScheduler: waiting: Set(ResultStage 4)
19/02/11 13:33:00 INFO DAGScheduler: failed: Set()
19/02/11 13:33:00 INFO DAGScheduler: Submitting ResultStage 4 (ShuffledRDD[4] at reduceByKey at <console>:30), which has no missing parents
19/02/11 13:33:00 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.7 KB, free 511.1 MB)
19/02/11 13:33:00 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1609.0 B, free 511.1 MB)
19/02/11 13:33:00 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:49365 (size: 1609.0 B, free: 511.1 MB)
19/02/11 13:33:00 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1008
19/02/11 13:33:00 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 4 (ShuffledRDD[4] at reduceByKey at <console>:30)
19/02/11 13:33:00 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks
19/02/11 13:33:00 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 8, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:33:00 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 9, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:33:00 INFO Executor: Running task 0.0 in stage 4.0 (TID 8)
19/02/11 13:33:00 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:33:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:33:00 INFO Executor: Finished task 0.0 in stage 4.0 (TID 8). 1327 bytes result sent to driver
19/02/11 13:33:00 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 8) in 29 ms on localhost (1/2)
19/02/11 13:33:00 INFO Executor: Running task 1.0 in stage 4.0 (TID 9)
19/02/11 13:33:00 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:33:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:33:00 INFO Executor: Finished task 1.0 in stage 4.0 (TID 9). 1342 bytes result sent to driver
19/02/11 13:33:00 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 9) in 43 ms on localhost (2/2)
19/02/11 13:33:00 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
19/02/11 13:33:00 INFO DAGScheduler: ResultStage 4 (collect at <console>:30) finished in 0.051 s
19/02/11 13:33:00 INFO DAGScheduler: Job 2 finished: collect at <console>:30, took 0.139948 s
res1: Array[(String, Int)] = Array((B,14), (D,9), (A,9), (C,9), (E,5))

scala>
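reduceByKey(_ + _) can be mimicked locally with groupBy plus a per-key sum (a sketch of the aggregation logic only; the local version does not reproduce Spark's map-side partial aggregation before the shuffle):

```scala
val kv1 = List(("A",1),("B",3),("C",5),("D",4),("B",7),("C",4),("E",5),("A",8),("B",4),("D",5))

// Local equivalent of reduceByKey(_ + _): group by key, then sum each group's values.
val sums = kv1.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

println(sums.toList.sortBy(_._1))  // List((A,9), (B,14), (C,9), (D,9), (E,5))
```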

5. After starting spark-shell, load the key-value data ("A",4), ("A",2), ("C",3), ("A",4), ("B",5), ("C",3), ("A",4) in Scala, deduplicate it by key, and inspect the RDD lineage with the toDebugString method. Submit the commands and output as text in the answer box.


scala> val kv2=sc.parallelize(List(("A",4),("A",2),("C",3),("A",4),("B",5),("C",3),("A",4)))
kv2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[5] at parallelize at <console>:27

scala> kv2.distinct.collect
19/02/11 13:42:15 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:42:15 INFO DAGScheduler: Registering RDD 6 (distinct at <console>:30)
19/02/11 13:42:15 INFO DAGScheduler: Got job 3 (collect at <console>:30) with 2 output partitions
19/02/11 13:42:15 INFO DAGScheduler: Final stage: ResultStage 6 (collect at <console>:30)
19/02/11 13:42:15 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 5)
19/02/11 13:42:15 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 5)
19/02/11 13:42:15 INFO DAGScheduler: Submitting ShuffleMapStage 5 (MapPartitionsRDD[6] at distinct at <console>:30), which has no missing parents
19/02/11 13:42:15 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.7 KB, free 511.1 MB)
19/02/11 13:42:15 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 1618.0 B, free 511.1 MB)
19/02/11 13:42:15 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:49365 (size: 1618.0 B, free: 511.1 MB)
19/02/11 13:42:15 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1008
19/02/11 13:42:15 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 5 (MapPartitionsRDD[6] at distinct at <console>:30)
19/02/11 13:42:15 INFO TaskSchedulerImpl: Adding task set 5.0 with 2 tasks
19/02/11 13:42:15 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 10, localhost, partition 0,PROCESS_LOCAL, 2209 bytes)
19/02/11 13:42:15 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID 11, localhost, partition 1,PROCESS_LOCAL, 2224 bytes)
19/02/11 13:42:15 INFO Executor: Running task 1.0 in stage 5.0 (TID 11)
19/02/11 13:42:15 INFO Executor: Running task 0.0 in stage 5.0 (TID 10)
19/02/11 13:42:15 INFO Executor: Finished task 1.0 in stage 5.0 (TID 11). 1159 bytes result sent to driver
19/02/11 13:42:15 INFO Executor: Finished task 0.0 in stage 5.0 (TID 10). 1159 bytes result sent to driver
19/02/11 13:42:15 INFO TaskSetManager: Finished task 1.0 in stage 5.0 (TID 11) in 93 ms on localhost (1/2)
19/02/11 13:42:15 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 10) in 97 ms on localhost (2/2)
19/02/11 13:42:15 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 
19/02/11 13:42:15 INFO DAGScheduler: ShuffleMapStage 5 (distinct at <console>:30) finished in 0.100 s
19/02/11 13:42:15 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:42:15 INFO DAGScheduler: running: Set()
19/02/11 13:42:15 INFO DAGScheduler: waiting: Set(ResultStage 6)
19/02/11 13:42:15 INFO DAGScheduler: failed: Set()
19/02/11 13:42:15 INFO DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[8] at distinct at <console>:30), which has no missing parents
19/02/11 13:42:15 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.3 KB, free 511.1 MB)
19/02/11 13:42:15 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1930.0 B, free 511.1 MB)
19/02/11 13:42:15 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:49365 (size: 1930.0 B, free: 511.1 MB)
19/02/11 13:42:15 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1008
19/02/11 13:42:15 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 6 (MapPartitionsRDD[8] at distinct at <console>:30)
19/02/11 13:42:15 INFO TaskSchedulerImpl: Adding task set 6.0 with 2 tasks
19/02/11 13:42:15 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 12, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:42:15 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 13, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:42:15 INFO Executor: Running task 0.0 in stage 6.0 (TID 12)
19/02/11 13:42:15 INFO Executor: Running task 1.0 in stage 6.0 (TID 13)
19/02/11 13:42:15 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:42:15 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:42:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:42:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:42:15 INFO Executor: Finished task 1.0 in stage 6.0 (TID 13). 1347 bytes result sent to driver
19/02/11 13:42:15 INFO Executor: Finished task 0.0 in stage 6.0 (TID 12). 1307 bytes result sent to driver
19/02/11 13:42:15 INFO TaskSetManager: Finished task 1.0 in stage 6.0 (TID 13) in 29 ms on localhost (1/2)
19/02/11 13:42:15 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 12) in 34 ms on localhost (2/2)
19/02/11 13:42:15 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool 
19/02/11 13:42:15 INFO DAGScheduler: ResultStage 6 (collect at <console>:30) finished in 0.037 s
19/02/11 13:42:15 INFO DAGScheduler: Job 3 finished: collect at <console>:30, took 0.220315 s
res2: Array[(String, Int)] = Array((A,4), (B,5), (C,3), (A,2))

scala>
scala> kv2.toDebugString
res3: String = (2) ParallelCollectionRDD[5] at parallelize at <console>:27 []

scala>
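Two caveats about the transcript above. First, `distinct` deduplicates whole (key, value) pairs, not keys, which is why both (A,4) and (A,2) survive in `res2`; key-level deduplication would need something along the lines of `kv2.reduceByKey((a, _) => a)` (an assumption about intent, not what the transcript ran). Second, `kv2.toDebugString` shows only the lineage of the source `ParallelCollectionRDD`; calling `toDebugString` on the result of `distinct` would also show the shuffle stage. A minimal plain-Scala sketch of pair-level versus key-level deduplication:

```scala
val data = List(("A", 4), ("A", 2), ("C", 3), ("A", 4), ("B", 5), ("C", 3), ("A", 4))

// Pair-level dedup: what RDD.distinct does -- (A,4) and (A,2) both survive
// because they are different pairs, even though they share the key "A".
val pairDistinct = data.distinct
println(pairDistinct) // List((A,4), (A,2), (C,3), (B,5))

// Key-level dedup: keep only the first value seen for each key.
val keyDistinct = data.foldLeft(List.empty[(String, Int)]) { (acc, kv) =>
  if (acc.exists(_._1 == kv._1)) acc else acc :+ kv
}
println(keyDistinct) // List((A,4), (C,3), (B,5))
```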

6. After starting spark-shell, load two groups of Key-Value data in Scala: ("A",1), ("B",2), ("C",3), ("A",4), ("B",5) and ("A",1), ("B",2), ("C",3), ("A",4), ("B",5). JOIN the two groups with Key as the basis, and submit the above commands and results as text in the answer box.

scala> val kv4=sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))
kv4: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[9] at parallelize at <console>:27

scala> val kv5=sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))
kv5: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:27

scala> kv4.join(kv5).collect
19/02/11 13:47:38 INFO SparkContext: Starting job: collect at <console>:32
19/02/11 13:47:38 INFO DAGScheduler: Registering RDD 9 (parallelize at <console>:27)
19/02/11 13:47:38 INFO DAGScheduler: Registering RDD 10 (parallelize at <console>:27)
19/02/11 13:47:38 INFO DAGScheduler: Got job 4 (collect at <console>:32) with 2 output partitions
19/02/11 13:47:38 INFO DAGScheduler: Final stage: ResultStage 9 (collect at <console>:32)
19/02/11 13:47:38 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 7, ShuffleMapStage 8)
19/02/11 13:47:38 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 7, ShuffleMapStage 8)
19/02/11 13:47:38 INFO DAGScheduler: Submitting ShuffleMapStage 7 (ParallelCollectionRDD[9] at parallelize at <console>:27), which has no missing parents
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 1864.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 1188.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:49365 (size: 1188.0 B, free: 511.1 MB)
19/02/11 13:47:38 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1008
19/02/11 13:47:38 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 7 (ParallelCollectionRDD[9] at parallelize at <console>:27)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Adding task set 7.0 with 2 tasks
19/02/11 13:47:38 INFO DAGScheduler: Submitting ShuffleMapStage 8 (ParallelCollectionRDD[10] at parallelize at <console>:27), which has no missing parents
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 1864.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 1188.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:49365 (size: 1188.0 B, free: 511.1 MB)
19/02/11 13:47:38 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1008
19/02/11 13:47:38 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 8 (ParallelCollectionRDD[10] at parallelize at <console>:27)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Adding task set 8.0 with 2 tasks
19/02/11 13:47:38 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 14, localhost, partition 0,PROCESS_LOCAL, 2188 bytes)
19/02/11 13:47:38 INFO TaskSetManager: Starting task 1.0 in stage 7.0 (TID 15, localhost, partition 1,PROCESS_LOCAL, 2208 bytes)
19/02/11 13:47:38 INFO Executor: Running task 0.0 in stage 7.0 (TID 14)
19/02/11 13:47:38 INFO Executor: Running task 1.0 in stage 7.0 (TID 15)
19/02/11 13:47:38 INFO Executor: Finished task 1.0 in stage 7.0 (TID 15). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO Executor: Finished task 0.0 in stage 7.0 (TID 14). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 16, localhost, partition 0,PROCESS_LOCAL, 2188 bytes)
19/02/11 13:47:38 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 17, localhost, partition 1,PROCESS_LOCAL, 2208 bytes)
19/02/11 13:47:38 INFO Executor: Running task 0.0 in stage 8.0 (TID 16)
19/02/11 13:47:38 INFO Executor: Running task 1.0 in stage 8.0 (TID 17)
19/02/11 13:47:38 INFO TaskSetManager: Finished task 0.0 in stage 7.0 (TID 14) in 133 ms on localhost (1/2)
19/02/11 13:47:38 INFO TaskSetManager: Finished task 1.0 in stage 7.0 (TID 15) in 134 ms on localhost (2/2)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool 
19/02/11 13:47:38 INFO DAGScheduler: ShuffleMapStage 7 (parallelize at <console>:27) finished in 0.177 s
19/02/11 13:47:38 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:47:38 INFO DAGScheduler: running: Set(ShuffleMapStage 8)
19/02/11 13:47:38 INFO DAGScheduler: waiting: Set(ResultStage 9)
19/02/11 13:47:38 INFO DAGScheduler: failed: Set()
19/02/11 13:47:38 INFO Executor: Finished task 0.0 in stage 8.0 (TID 16). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO Executor: Finished task 1.0 in stage 8.0 (TID 17). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 16) in 68 ms on localhost (1/2)
19/02/11 13:47:38 INFO DAGScheduler: ShuffleMapStage 8 (parallelize at <console>:27) finished in 0.166 s
19/02/11 13:47:38 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:47:38 INFO DAGScheduler: running: Set()
19/02/11 13:47:38 INFO DAGScheduler: waiting: Set(ResultStage 9)
19/02/11 13:47:38 INFO DAGScheduler: failed: Set()
19/02/11 13:47:38 INFO DAGScheduler: Submitting ResultStage 9 (MapPartitionsRDD[13] at join at <console>:32), which has no missing parents
19/02/11 13:47:38 INFO TaskSetManager: Finished task 1.0 in stage 8.0 (TID 17) in 65 ms on localhost (2/2)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool 
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 3.2 KB, free 511.1 MB)
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 1814.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:49365 (size: 1814.0 B, free: 511.1 MB)
19/02/11 13:47:38 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1008
19/02/11 13:47:38 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 9 (MapPartitionsRDD[13] at join at <console>:32)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Adding task set 9.0 with 2 tasks
19/02/11 13:47:38 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID 18, localhost, partition 0,PROCESS_LOCAL, 1967 bytes)
19/02/11 13:47:38 INFO TaskSetManager: Starting task 1.0 in stage 9.0 (TID 19, localhost, partition 1,PROCESS_LOCAL, 1967 bytes)
19/02/11 13:47:38 INFO Executor: Running task 1.0 in stage 9.0 (TID 19)
19/02/11 13:47:38 INFO Executor: Running task 0.0 in stage 9.0 (TID 18)
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:47:38 INFO Executor: Finished task 1.0 in stage 9.0 (TID 19). 1453 bytes result sent to driver
19/02/11 13:47:38 INFO Executor: Finished task 0.0 in stage 9.0 (TID 18). 1417 bytes result sent to driver
19/02/11 13:47:38 INFO TaskSetManager: Finished task 1.0 in stage 9.0 (TID 19) in 184 ms on localhost (1/2)
19/02/11 13:47:38 INFO TaskSetManager: Finished task 0.0 in stage 9.0 (TID 18) in 207 ms on localhost (2/2)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool 
19/02/11 13:47:38 INFO DAGScheduler: ResultStage 9 (collect at <console>:32) finished in 0.212 s
19/02/11 13:47:38 INFO DAGScheduler: Job 4 finished: collect at <console>:32, took 0.671400 s
res4: Array[(String, (Int, Int))] = Array((B,(2,2)), (B,(2,5)), (B,(5,2)), (B,(5,5)), (A,(1,1)), (A,(1,4)), (A,(4,1)), (A,(4,4)), (C,(3,3)))

scala>
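The nine rows in `res4` come from a per-key cartesian product: key A occurs twice in each input, so the inner join emits 2 × 2 = 4 A-pairs, likewise 4 B-pairs, while C contributes 1 × 1 = 1 pair. A plain-Scala sketch of the same semantics on local lists (not Spark code, just an illustration of what `join` computes):

```scala
val kv4 = List(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5))
val kv5 = List(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5))

// RDD.join pairs every value of a key on the left with every value
// of the same key on the right (an inner join on the key).
val joined = for {
  (k1, v1) <- kv4
  (k2, v2) <- kv5
  if k1 == k2
} yield (k1, (v1, v2))

println(joined.length) // 9 rows, matching res4 above
```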

7. Log in to spark-shell, define i as 1 and sum as 0, and use a while loop to sum the numbers from 1 to 100, finally printing sum with Scala's standard output function. Submit all of the above commands and results as text in the answer box.

scala> var i=1
i: Int = 1

scala> var sum=0
sum: Int = 0

scala> while(i<=100)
     | {
     | sum+=i
     | i+=1
     | }

scala> println(sum)
5050

8. Log in to spark-shell, define i as 1 and sum as 0, and use a for loop to sum the numbers from 1 to 100, finally printing sum with Scala's standard output function. Submit all of the above commands and results as text in the answer box.

scala> var i=1
i: Int = 1

scala> var sum=0
sum: Int = 0

scala> for(i<- 1 to 100) sum+=i

scala> println(sum)
5050

scala>
9. Every functional language provides the two functions map and flatMap. map, as its name suggests, takes a function, applies it to every element of a collection, and returns the collection of results. flatMap differs from map only in that the function it takes must return a collection (such as a List) for each element; those per-element collections are then flattened into a single collection.
(1) Log in to spark-shell, define a list, then use the map function to multiply every element of the list by 2. Submit all of the above commands and results as text in the answer box.

(2) Log in to spark-shell, define a list, then use the flatMap function to convert the list into single letters and convert them to upper case. Submit all of the above commands and results as text in the answer box.
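No transcript is given for these two sub-tasks. The following is a plain-Scala sketch of the intended operations (in spark-shell the same calls could be made on an RDD built with `sc.parallelize` and finished with `collect` — an assumption about the expected answer, not a recorded session; the example lists are also made up for illustration):

```scala
// (1) map: multiply every element of a list by 2.
val nums = List(1, 2, 3, 4, 5)
val doubled = nums.map(_ * 2)
println(doubled) // List(2, 4, 6, 8, 10)

// (2) flatMap: upper-case each word, then flatten it into single letters
// (a String converts to a List[Char], so flatMap flattens word by word).
val words = List("spark", "hadoop")
val letters = words.flatMap(_.toUpperCase.toList)
println(letters) // List(S, P, A, R, K, H, A, D, O, O, P)
```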


10. Log in to the master node of the big-data cloud host and create a new file abc.txt under the root directory with the content: hadoop hive solr redis kafka hadoop storm flume sqoop docker spark spark hadoop spark elasticsearch hbase hadoop hive spark hive hadoop spark. Then log in to spark-shell; first count the number of lines in abc.txt, then count the words in the document, sort them in ascending order of their first letters, and finally count the number of result rows. Submit the above commands and results as text in the answer box.
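No transcript is given for this task. In spark-shell the steps would look roughly like `val rdd = sc.textFile("/root/abc.txt")`, `rdd.count` for the line count, then `rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortByKey()` followed by `collect` and `count` — a sketch of the expected commands, not a recorded session. The same logic on local collections (the line breaks inside abc.txt are not specified in the task, so the split below is an assumption; the word totals do not depend on it):

```scala
// abc.txt content from the task, one assumed line per string.
val lines = List(
  "hadoop hive solr redis kafka hadoop storm",
  "flume sqoop docker spark spark hadoop",
  "spark elasticsearch hbase hadoop hive",
  "spark hive hadoop spark"
)
println(lines.length) // line count

// Word count, sorted by word in ascending order.
val counts = lines
  .flatMap(_.split(" "))
  .groupBy(identity)
  .map { case (w, ws) => (w, ws.length) }
  .toList
  .sortBy(_._1)

counts.foreach(println)
println(counts.length) // number of distinct words: 12
```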


11. Log in to spark-shell, define a List, and deduplicate it using Spark's built-in function.
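No transcript is given here either; in spark-shell the answer would likely be along the lines of `sc.parallelize(List(1, 2, 2, 3, 3, 3, 4)).distinct.collect` (the example list is made up). The local-collection equivalent, which behaves the same way:

```scala
// distinct removes duplicate elements, keeping the first occurrence of each.
val xs = List(1, 2, 2, 3, 3, 3, 4)
val deduped = xs.distinct
println(deduped) // List(1, 2, 3, 4)
```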
