Spark
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is a Hadoop-MapReduce-like general parallel framework open-sourced by UC Berkeley's AMP Lab (Algorithms, Machines, and People Lab). It has the strengths of Hadoop MapReduce, but unlike MapReduce, the intermediate output of a job can be kept in memory, so it no longer needs to be written to and re-read from HDFS. This makes Spark particularly well suited to iterative MapReduce-style algorithms such as those used in data mining and machine learning. Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in useful ways that make Spark superior for certain workloads: Spark enables in-memory distributed datasets, which optimize iterative workloads in addition to supporting interactive queries. Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, so Scala can manipulate distributed datasets as easily as local collection objects. Although Spark was created to support iterative jobs on distributed datasets, it is in fact a complement to Hadoop and can run in parallel on the Hadoop file system; this is supported through a third-party cluster framework named Mesos. Developed by the AMP Lab at UC Berkeley, Spark can be used to build large-scale, low-latency data-analytics applications.
2. After starting spark-shell, load the data 1,2,3,4,5,6,7,8,9,10 in Scala, find the numbers whose doubled values are divisible by 3, and use the toDebugString method to inspect the RDD's lineage. Submit the commands and the resulting output as text in the answer box.
scala> val num = sc.parallelize(1 to 10)
num: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> val doublenum = num.map(_*2)
doublenum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:29

scala> val threenum = doublenum.filter(_%3==0)
threenum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:31

scala> threenum.collect
19/02/11 13:06:18 INFO SparkContext: Starting job: collect at <console>:34
[... repetitive DAGScheduler/TaskSetManager/Executor INFO lines elided ...]
19/02/11 13:06:18 INFO DAGScheduler: ResultStage 0 (collect at <console>:34) finished in 0.275 s
19/02/11 13:06:18 INFO DAGScheduler: Job 0 finished: collect at <console>:34, took 0.714041 s
res2: Array[Int] = Array(6, 12, 18)

scala> threenum.toDebugString
res3: String =
(2) MapPartitionsRDD[2] at filter at <console>:31 []
 |  MapPartitionsRDD[1] at map at <console>:29 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:27 []
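Because the input is only ten numbers, the expected answer can be cross-checked without a cluster. The following plain-Scala sketch (no Spark; pasteable into any Scala REPL) runs the same map/filter pipeline on a local list:

```scala
// Plain-Scala sketch (no Spark) of the same pipeline: double each
// number in 1..10, then keep the doubled values divisible by 3.
val nums = (1 to 10).toList
val doubled = nums.map(_ * 2)                  // like num.map(_*2)
val divisibleBy3 = doubled.filter(_ % 3 == 0)  // like doublenum.filter(_%3==0)
println(divisibleBy3)                          // List(6, 12, 18)
```

Note one difference: the local pipeline evaluates eagerly, while the RDD transformations are lazy and nothing runs until the collect action triggers the job.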
3. After starting spark-shell, load the key-value data ("A",1), ("B",2), ("C",3), ("A",4), ("B",5), ("C",4), ("A",3), ("A",9), ("B",4), ("D",5) in Scala, sort the data in ascending order by key, and group it by key. Submit the commands and the resulting output as text in the answer box.
scala> val kv=sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),("A",3),("A",9),("B",4),("D",5)))
kv: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:27

scala> kv.sortByKey().collect
19/02/11 13:20:10 INFO SparkContext: Starting job: sortByKey at <console>:30
[... repetitive DAGScheduler/TaskSetManager/Executor INFO lines elided ...]
19/02/11 13:20:10 INFO DAGScheduler: ResultStage 3 (collect at <console>:30) finished in 0.300 s
19/02/11 13:20:10 INFO DAGScheduler: Job 2 finished: collect at <console>:30, took 0.456128 s
res5: Array[(String, Int)] = Array((A,1), (A,4), (A,3), (A,9), (B,2), (B,5), (B,4), (C,3), (C,4), (D,5))

scala> kv.groupByKey().collect
19/02/11 13:21:15 INFO SparkContext: Starting job: collect at <console>:30
[... repetitive DAGScheduler/TaskSetManager/Executor INFO lines elided ...]
19/02/11 13:21:15 INFO DAGScheduler: ResultStage 5 (collect at <console>:30) finished in 0.113 s
19/02/11 13:21:15 INFO DAGScheduler: Job 3 finished: collect at <console>:30, took 0.252907 s
res6: Array[(String, Iterable[Int])] = Array((B,CompactBuffer(2, 5, 4)), (D,CompactBuffer(5)), (A,CompactBuffer(1, 4, 3, 9)), (C,CompactBuffer(3, 4)))
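The same two operations can be checked on a local Scala collection (a sketch, no Spark; pasteable into any Scala REPL). sortBy is stable, so values keep their original relative order within each key, matching the sortByKey output; groupBy returns an unordered Map, so only the per-key contents are comparable with groupByKey:

```scala
// Plain-Scala sketch (no Spark): local analogues of sortByKey and groupByKey.
val kv = List(("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),("A",3),("A",9),("B",4),("D",5))

// Ascending sort by key; stable, so values stay in encounter order per key.
val sorted = kv.sortBy(_._1)

// Group values under their key; Map ordering differs from Spark's Array.
val grouped: Map[String, List[Int]] =
  kv.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

println(sorted)
println(grouped("A"))  // List(1, 4, 3, 9), same contents as CompactBuffer(1, 4, 3, 9)
```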
4. After starting spark-shell, load the key-value data ("A",1), ("B",3), ("C",5), ("D",4), ("B",7), ("C",4), ("E",5), ("A",8), ("B",4), ("D",5) in Scala, sort the data in ascending order by key, and sum the values for each key. Submit the commands and the resulting output as text in the answer box.
scala> val kv1=sc.parallelize(List(("A",1),("B",3),("C",5),("D",4),("B",7),("C",4),("E",5),("A",8),("B",4),("D",5)))
kv1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> kv1.sortByKey().collect
19/02/11 13:31:01 INFO SparkContext: Starting job: sortByKey at <console>:30
[... repetitive DAGScheduler/TaskSetManager/Executor INFO lines elided ...]
19/02/11 13:31:05 INFO DAGScheduler: ResultStage 2 (collect at <console>:30) finished in 0.541 s
19/02/11 13:31:05 INFO DAGScheduler: Job 1 finished: collect at <console>:30, took 0.940656 s
res0: Array[(String, Int)] = Array((A,1), (A,8), (B,3), (B,7), (B,4), (C,5), (C,4), (D,4), (D,5), (E,5))

scala> kv1.reduceByKey(_+_).collect
19/02/11 13:33:00 INFO SparkContext: Starting job: collect at <console>:30
[... repetitive DAGScheduler/TaskSetManager/Executor INFO lines elided ...]
19/02/11 13:33:00 INFO DAGScheduler: ResultStage 4 (collect at <console>:30) finished in 0.051 s
19/02/11 13:33:00 INFO DAGScheduler: Job 2 finished: collect at <console>:30, took 0.139948 s
res1: Array[(String, Int)] = Array((B,14), (D,9), (A,9), (C,9), (E,5))
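reduceByKey(_+_) produces one summed value per key. The sums can be verified locally with groupBy plus sum (a plain-Scala sketch, no Spark; pasteable into any Scala REPL):

```scala
// Plain-Scala sketch (no Spark): reduceByKey(_+_) is equivalent to
// grouping by key and summing each group's values.
val kv1 = List(("A",1),("B",3),("C",5),("D",4),("B",7),("C",4),("E",5),("A",8),("B",4),("D",5))

val sorted = kv1.sortBy(_._1)  // like sortByKey()
val sums: Map[String, Int] =
  kv1.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

println(sums)  // A -> 9, B -> 14, C -> 9, D -> 9, E -> 5
```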
5. After starting spark-shell, load the key-value data ("A",4), ("A",2), ("C",3), ("A",4), ("B",5), ("C",3), ("A",4) in Scala, deduplicate the data by key, and use the toDebugString method to inspect the RDD's lineage. Submit the commands and the resulting output as text in the answer box.
scala> val kv2=sc.parallelize(List(("A",4),("A",2),("C",3),("A",4),("B",5),("C",3),("A",4)))
kv2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[5] at parallelize at <console>:27

scala> kv2.distinct.collect
19/02/11 13:42:15 INFO SparkContext: Starting job: collect at <console>:30
[... repetitive DAGScheduler/TaskSetManager/Executor INFO lines elided ...]
19/02/11 13:42:15 INFO DAGScheduler: ResultStage 6 (collect at <console>:30) finished in 0.037 s
19/02/11 13:42:15 INFO DAGScheduler: Job 3 finished: collect at <console>:30, took 0.220315 s
res2: Array[(String, Int)] = Array((A,4), (B,5), (C,3), (A,2))

scala> kv2.toDebugString
res3: String = (2) ParallelCollectionRDD[5] at parallelize at <console>:27 []
6. After starting spark-shell, load two sets of key-value data in Scala: ("A",1),("B",2),("C",3),("A",4),("B",5) and ("A",1),("B",2),("C",3),("A",4),("B",5). Join the two sets on their keys. Submit the commands and resulting output as text in the answer box.
scala> val kv4 = sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))
kv4: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[9] at parallelize at <console>:27

scala> val kv5 = sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))
kv5: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:27

scala> kv4.join(kv5).collect
19/02/11 13:47:38 INFO SparkContext: Starting job: collect at <console>:32
19/02/11 13:47:38 INFO DAGScheduler: Registering RDD 9 (parallelize at <console>:27)
19/02/11 13:47:38 INFO DAGScheduler: Registering RDD 10 (parallelize at <console>:27)
19/02/11 13:47:38 INFO DAGScheduler: Got job 4 (collect at <console>:32) with 2 output partitions
19/02/11 13:47:38 INFO DAGScheduler: Final stage: ResultStage 9 (collect at <console>:32)
19/02/11 13:47:38 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 7, ShuffleMapStage 8)
19/02/11 13:47:38 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 7, ShuffleMapStage 8)
19/02/11 13:47:38 INFO DAGScheduler: Submitting ShuffleMapStage 7 (ParallelCollectionRDD[9] at parallelize at <console>:27), which has no missing parents
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 1864.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 1188.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:49365 (size: 1188.0 B, free: 511.1 MB)
19/02/11 13:47:38 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1008
19/02/11 13:47:38 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 7 (ParallelCollectionRDD[9] at parallelize at <console>:27)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Adding task set 7.0 with 2 tasks
19/02/11 13:47:38 INFO DAGScheduler: Submitting ShuffleMapStage 8 (ParallelCollectionRDD[10] at parallelize at <console>:27), which has no missing parents
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 1864.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 1188.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:49365 (size: 1188.0 B, free: 511.1 MB)
19/02/11 13:47:38 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1008
19/02/11 13:47:38 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 8 (ParallelCollectionRDD[10] at parallelize at <console>:27)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Adding task set 8.0 with 2 tasks
19/02/11 13:47:38 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 14, localhost, partition 0,PROCESS_LOCAL, 2188 bytes)
19/02/11 13:47:38 INFO TaskSetManager: Starting task 1.0 in stage 7.0 (TID 15, localhost, partition 1,PROCESS_LOCAL, 2208 bytes)
19/02/11 13:47:38 INFO Executor: Running task 0.0 in stage 7.0 (TID 14)
19/02/11 13:47:38 INFO Executor: Running task 1.0 in stage 7.0 (TID 15)
19/02/11 13:47:38 INFO Executor: Finished task 1.0 in stage 7.0 (TID 15). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO Executor: Finished task 0.0 in stage 7.0 (TID 14). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 16, localhost, partition 0,PROCESS_LOCAL, 2188 bytes)
19/02/11 13:47:38 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 17, localhost, partition 1,PROCESS_LOCAL, 2208 bytes)
19/02/11 13:47:38 INFO Executor: Running task 0.0 in stage 8.0 (TID 16)
19/02/11 13:47:38 INFO Executor: Running task 1.0 in stage 8.0 (TID 17)
19/02/11 13:47:38 INFO TaskSetManager: Finished task 0.0 in stage 7.0 (TID 14) in 133 ms on localhost (1/2)
19/02/11 13:47:38 INFO TaskSetManager: Finished task 1.0 in stage 7.0 (TID 15) in 134 ms on localhost (2/2)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool
19/02/11 13:47:38 INFO DAGScheduler: ShuffleMapStage 7 (parallelize at <console>:27) finished in 0.177 s
19/02/11 13:47:38 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:47:38 INFO DAGScheduler: running: Set(ShuffleMapStage 8)
19/02/11 13:47:38 INFO DAGScheduler: waiting: Set(ResultStage 9)
19/02/11 13:47:38 INFO DAGScheduler: failed: Set()
19/02/11 13:47:38 INFO Executor: Finished task 0.0 in stage 8.0 (TID 16). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO Executor: Finished task 1.0 in stage 8.0 (TID 17). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 16) in 68 ms on localhost (1/2)
19/02/11 13:47:38 INFO DAGScheduler: ShuffleMapStage 8 (parallelize at <console>:27) finished in 0.166 s
19/02/11 13:47:38 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:47:38 INFO DAGScheduler: running: Set()
19/02/11 13:47:38 INFO DAGScheduler: waiting: Set(ResultStage 9)
19/02/11 13:47:38 INFO DAGScheduler: failed: Set()
19/02/11 13:47:38 INFO DAGScheduler: Submitting ResultStage 9 (MapPartitionsRDD[13] at join at <console>:32), which has no missing parents
19/02/11 13:47:38 INFO TaskSetManager: Finished task 1.0 in stage 8.0 (TID 17) in 65 ms on localhost (2/2)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 3.2 KB, free 511.1 MB)
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 1814.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:49365 (size: 1814.0 B, free: 511.1 MB)
19/02/11 13:47:38 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1008
19/02/11 13:47:38 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 9 (MapPartitionsRDD[13] at join at <console>:32)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Adding task set 9.0 with 2 tasks
19/02/11 13:47:38 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID 18, localhost, partition 0,PROCESS_LOCAL, 1967 bytes)
19/02/11 13:47:38 INFO TaskSetManager: Starting task 1.0 in stage 9.0 (TID 19, localhost, partition 1,PROCESS_LOCAL, 1967 bytes)
19/02/11 13:47:38 INFO Executor: Running task 1.0 in stage 9.0 (TID 19)
19/02/11 13:47:38 INFO Executor: Running task 0.0 in stage 9.0 (TID 18)
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:47:38 INFO Executor: Finished task 1.0 in stage 9.0 (TID 19). 1453 bytes result sent to driver
19/02/11 13:47:38 INFO Executor: Finished task 0.0 in stage 9.0 (TID 18). 1417 bytes result sent to driver
19/02/11 13:47:38 INFO TaskSetManager: Finished task 1.0 in stage 9.0 (TID 19) in 184 ms on localhost (1/2)
19/02/11 13:47:38 INFO TaskSetManager: Finished task 0.0 in stage 9.0 (TID 18) in 207 ms on localhost (2/2)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool
19/02/11 13:47:38 INFO DAGScheduler: ResultStage 9 (collect at <console>:32) finished in 0.212 s
19/02/11 13:47:38 INFO DAGScheduler: Job 4 finished: collect at <console>:32, took 0.671400 s
res4: Array[(String, (Int, Int))] = Array((B,(2,2)), (B,(2,5)), (B,(5,2)), (B,(5,5)), (A,(1,1)), (A,(1,4)), (A,(4,1)), (A,(4,4)), (C,(3,3)))

scala>
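The nine pairs in res4 follow directly from join's semantics: for each key, the output is the Cartesian product of that key's values on the two sides (2×2 for A, 2×2 for B, 1×1 for C). The same logic can be sketched on plain Scala collections, without Spark; `joinLocal` below is a hypothetical helper of ours, not part of Spark's API:

```scala
// Plain-Scala sketch of RDD join semantics: a per-key Cartesian product.
// joinLocal is our own illustrative helper, not a Spark function.
def joinLocal[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
  for {
    (k1, v) <- left
    (k2, w) <- right
    if k1 == k2  // keep only pairs whose keys match
  } yield (k1, (v, w))

val kv4 = List(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5))
val kv5 = List(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5))
val joined = joinLocal(kv4, kv5)
// A appears twice on each side -> 4 pairs; likewise B; C -> 1 pair: 9 in total.
println(joined.size)
```

Spark's actual `join` additionally shuffles both RDDs so that equal keys land in the same partition, which is why the log above shows two ShuffleMapStages before the ResultStage.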
7. Log in to spark-shell, define i as 1 and sum as 0, and use a while loop to compute the sum from 1 to 100. Finally, print sum with Scala's standard output function. Submit all commands and returned results as text in the answer box.
scala> var i=1
i: Int = 1

scala> var sum=0
sum: Int = 0

scala> while(i<=100) {
     |   sum+=i
     |   i+=1
     | }

scala> println(sum)
5050
8. Log in to spark-shell, define i as 1 and sum as 0, and use a for loop to compute the sum from 1 to 100. Finally, print sum with Scala's standard output function. Submit all commands and returned results as text in the answer box.
scala> var i=1
i: Int = 1

scala> var sum=0
sum: Int = 0

scala> for(i <- 1 to 100) sum+=i

scala> println(sum)
5050

scala>
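Both loop answers can be cross-checked against Scala's built-in range sum, which is a convenient one-liner sanity check:

```scala
// Scala ranges provide sum directly; 1 + 2 + ... + 100 = 100 * 101 / 2.
val total = (1 to 100).sum
println(total)  // 5050
```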
9. Every functional language provides the map and flatMap functions. map, as its name suggests, takes a function, applies it to every element of a collection, and returns the processed results. flatMap differs from map only in that the passed-in function must return a collection (such as a List) for each element, so that the results can then be flattened into a single collection.
(1) Log in to spark-shell, define a list, and use map to multiply each element of the list by 2. Submit all commands and returned results as text in the answer box.
(2) Log in to spark-shell, define a list, and use flatMap to break the list into single letters and convert them to uppercase. Submit all commands and returned results as text in the answer box.
10. Log in to the master node of the big-data cloud host and, under the root directory, create a file abc.txt with the following content:
hadoop hive solr redis kafka hadoop storm flume sqoop docker spark spark hadoop spark elasticsearch hbase hadoop hive spark hive hadoop spark
Then log in to spark-shell. First count the number of lines in abc.txt, then count the words in the file, sort them in ascending order by their first letter, and finally count the number of lines in the result. Submit the commands and returned results as text in the answer box.
11. Log in to spark-shell, define a List, and use Spark's built-in function to deduplicate the List.
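For question 9, the two transformations can be sketched on plain Scala collections (the same `map`/`flatMap` calls work on an RDD created with `sc.parallelize`); the sample lists below are our own choices:

```scala
// (1) map: apply a function to every element; here, multiply each by 2.
val nums = List(1, 2, 3, 4, 5)
val doubled = nums.map(_ * 2)
println(doubled)  // List(2, 4, 6, 8, 10)

// (2) flatMap: each word maps to a sequence of characters (a String is a
// Seq[Char] in Scala), and the per-element sequences are flattened into one list.
val words = List("spark", "hadoop")
val letters = words.flatMap(_.toUpperCase)
println(letters)  // List(S, P, A, R, K, H, A, D, O, O, P)
```

With map the second example would instead produce a list of strings, `List("SPARK", "HADOOP")`; flattening the inner sequences is exactly the difference described above.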