spark集群搭建和简单运维操作

栏目: Scala · 发布时间: 5年前

spark

spark集群搭建和简单运维操作
Apache Spark 是专为大规模数据处理而设计的快速通用的计算引擎。
Spark是UC Berkeley AMP lab (加州大学伯克利分校的AMP实验室)所开源的类Hadoop MapReduce的通用并行框架,
Spark,拥有Hadoop MapReduce所具有的优点;
但不同于MapReduce的是——Job中间输出结果可以保存在内存中,从而不再需要读写HDFS,
因此Spark能更好地适用于数据挖掘与机器学习等需要迭代的MapReduce的算法。


Spark 是一种与 Hadoop 相似的开源集群计算环境,
但是两者之间还存在一些不同之处,这些有用的不同之处使 Spark 在某些工作负载方面表现得更加优越,
换句话说,Spark 启用了内存分布数据集,除了能够提供交互式查询外,它还可以优化迭代工作负载。

Spark 是在 Scala 语言中实现的,它将 Scala 用作其应用程序框架。
与 Hadoop 不同,Spark 和 Scala 能够紧密集成,其中的 Scala 可以像操作本地集合对象一样轻松地操作分布式数据集。

尽管创建 Spark 是为了支持分布式数据集上的迭代作业,
但是实际上它是对 Hadoop 的补充,可以在 Hadoop 文件系统中并行运行。
通过名为 Mesos 的第三方集群框架可以支持此行为。

Spark 由加州大学伯克利分校 AMP 实验室 (Algorithms, Machines, and People Lab) 开发,
可用来构建大型的、低延迟的数据分析应用程序。

spark集群搭建和简单运维操作

spark集群搭建和简单运维操作

spark集群搭建和简单运维操作

spark集群搭建和简单运维操作

spark集群搭建和简单运维操作 spark集群搭建和简单运维操作 spark集群搭建和简单运维操作 spark集群搭建和简单运维操作 spark集群搭建和简单运维操作 ‘’ spark集群搭建和简单运维操作

启动 spark-shell 后,在 scala 中加载数据“1,2,3,4,5,6,7,8,9,10”, 求这些数据的 2 倍乘积能够被 3 整除的数字,并通过 toDebugString 方法来查看 RDD 的谱系。将以上操作命令和结果信息以文本形式提交到答题框中。

scala> val num = sc.parallelize(1 to 10)
num: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> val doublenum = num.map(_*2)
doublenum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:29

scala> val threenum = doublenum.filter(_%3==0)
threenum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:31
scala> threenum.collect
19/02/11 13:06:18 INFO SparkContext: Starting job: collect at <console>:34
19/02/11 13:06:18 INFO DAGScheduler: Got job 0 (collect at <console>:34) with 2 output partitions
19/02/11 13:06:18 INFO DAGScheduler: Final stage: ResultStage 0 (collect at <console>:34)
19/02/11 13:06:18 INFO DAGScheduler: Parents of final stage: List()
19/02/11 13:06:18 INFO DAGScheduler: Missing parents: List()
19/02/11 13:06:18 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:31), which has no missing parents
19/02/11 13:06:18 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.2 KB, free 511.1 MB)
19/02/11 13:06:18 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1333.0 B, free 511.1 MB)
19/02/11 13:06:18 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:58532 (size: 1333.0 B, free: 511.1 MB)
19/02/11 13:06:18 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008
19/02/11 13:06:18 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:31)
19/02/11 13:06:18 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/02/11 13:06:18 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2078 bytes)
19/02/11 13:06:18 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2135 bytes)
19/02/11 13:06:18 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
19/02/11 13:06:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/02/11 13:06:18 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 902 bytes result sent to driver
19/02/11 13:06:18 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 906 bytes result sent to driver
19/02/11 13:06:18 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 207 ms on localhost (1/2)
19/02/11 13:06:18 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 172 ms on localhost (2/2)
19/02/11 13:06:18 INFO DAGScheduler: ResultStage 0 (collect at <console>:34) finished in 0.275 s
19/02/11 13:06:18 INFO DAGScheduler: Job 0 finished: collect at <console>:34, took 0.714041 s
19/02/11 13:06:18 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
res2: Array[Int] = Array(6, 12, 18)

scala> threenum.toDebugString
res3: String = 
(2) MapPartitionsRDD[2] at filter at <console>:31 []
 |  MapPartitionsRDD[1] at map at <console>:29 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:27 []

scala>

3.启动 spark-shell 后,在 scala 中加载 Key-Value 数据(“A”,1),(“B”,2), (“C”,3),(“A”,4), (“B”,5), (“C”,4), (“A”,3), (“A”,9), (“B”,4), (“D”,5),将这些数据以 Key 为基准进行升序排序,并以 Key 为基准进行分组。 将以上操作命令和结果信息以文本形式提交到答题框中。

scala> val kv=sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),("A",3),("A",9),("B",4),("D",5)))
kv: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:27

scala> kv.sortByKey().collect
19/02/11 13:20:10 INFO SparkContext: Starting job: sortByKey at <console>:30
19/02/11 13:20:10 INFO DAGScheduler: Got job 1 (sortByKey at <console>:30) with 2 output partitions
19/02/11 13:20:10 INFO DAGScheduler: Final stage: ResultStage 1 (sortByKey at <console>:30)
19/02/11 13:20:10 INFO DAGScheduler: Parents of final stage: List()
19/02/11 13:20:10 INFO DAGScheduler: Missing parents: List()
19/02/11 13:20:10 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at sortByKey at <console>:30), which has no missing parents
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.3 KB, free 511.1 MB)
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1398.0 B, free 511.1 MB)
19/02/11 13:20:10 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:58532 (size: 1398.0 B, free: 511.1 MB)
19/02/11 13:20:10 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008
19/02/11 13:20:10 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at sortByKey at <console>:30)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
19/02/11 13:20:10 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2261 bytes)
19/02/11 13:20:10 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2255 bytes)
19/02/11 13:20:10 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
19/02/11 13:20:10 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
19/02/11 13:20:10 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1179 bytes result sent to driver
19/02/11 13:20:10 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1178 bytes result sent to driver
19/02/11 13:20:10 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 71 ms on localhost (1/2)
19/02/11 13:20:10 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 61 ms on localhost (2/2)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
19/02/11 13:20:10 INFO DAGScheduler: ResultStage 1 (sortByKey at <console>:30) finished in 0.075 s
19/02/11 13:20:10 INFO DAGScheduler: Job 1 finished: sortByKey at <console>:30, took 0.095223 s
19/02/11 13:20:10 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:20:10 INFO DAGScheduler: Registering RDD 3 (parallelize at <console>:27)
19/02/11 13:20:10 INFO DAGScheduler: Got job 2 (collect at <console>:30) with 2 output partitions
19/02/11 13:20:10 INFO DAGScheduler: Final stage: ResultStage 3 (collect at <console>:30)
19/02/11 13:20:10 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
19/02/11 13:20:10 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 2)
19/02/11 13:20:10 INFO DAGScheduler: Submitting ShuffleMapStage 2 (ParallelCollectionRDD[3] at parallelize at <console>:27), which has no missing parents
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.3 KB, free 511.1 MB)
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1427.0 B, free 511.1 MB)
19/02/11 13:20:10 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:58532 (size: 1427.0 B, free: 511.1 MB)
19/02/11 13:20:10 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1008
19/02/11 13:20:10 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 2 (ParallelCollectionRDD[3] at parallelize at <console>:27)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
19/02/11 13:20:10 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4, localhost, partition 0,PROCESS_LOCAL, 2250 bytes)
19/02/11 13:20:10 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 5, localhost, partition 1,PROCESS_LOCAL, 2244 bytes)
19/02/11 13:20:10 INFO Executor: Running task 0.0 in stage 2.0 (TID 4)
19/02/11 13:20:10 INFO Executor: Running task 1.0 in stage 2.0 (TID 5)
19/02/11 13:20:10 INFO Executor: Finished task 0.0 in stage 2.0 (TID 4). 1159 bytes result sent to driver
19/02/11 13:20:10 INFO Executor: Finished task 1.0 in stage 2.0 (TID 5). 1159 bytes result sent to driver
19/02/11 13:20:10 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 70 ms on localhost (1/2)
19/02/11 13:20:10 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 5) in 69 ms on localhost (2/2)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
19/02/11 13:20:10 INFO DAGScheduler: ShuffleMapStage 2 (parallelize at <console>:27) finished in 0.074 s
19/02/11 13:20:10 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:20:10 INFO DAGScheduler: running: Set()
19/02/11 13:20:10 INFO DAGScheduler: waiting: Set(ResultStage 3)
19/02/11 13:20:10 INFO DAGScheduler: failed: Set()
19/02/11 13:20:10 INFO DAGScheduler: Submitting ResultStage 3 (ShuffledRDD[6] at sortByKey at <console>:30), which has no missing parents
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.9 KB, free 511.1 MB)
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1770.0 B, free 511.1 MB)
19/02/11 13:20:10 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:58532 (size: 1770.0 B, free: 511.1 MB)
19/02/11 13:20:10 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1008
19/02/11 13:20:10 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 3 (ShuffledRDD[6] at sortByKey at <console>:30)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
19/02/11 13:20:10 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:20:10 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 7, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:20:10 INFO Executor: Running task 0.0 in stage 3.0 (TID 6)
19/02/11 13:20:10 INFO Executor: Running task 1.0 in stage 3.0 (TID 7)
19/02/11 13:20:10 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:20:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 18 ms
19/02/11 13:20:10 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:20:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 28 ms
19/02/11 13:20:10 INFO Executor: Finished task 0.0 in stage 3.0 (TID 6). 1433 bytes result sent to driver
19/02/11 13:20:10 INFO Executor: Finished task 1.0 in stage 3.0 (TID 7). 1347 bytes result sent to driver
19/02/11 13:20:10 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 6) in 295 ms on localhost (1/2)
19/02/11 13:20:10 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 7) in 295 ms on localhost (2/2)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
19/02/11 13:20:10 INFO DAGScheduler: ResultStage 3 (collect at <console>:30) finished in 0.300 s
19/02/11 13:20:10 INFO DAGScheduler: Job 2 finished: collect at <console>:30, took 0.456128 s
res5: Array[(String, Int)] = Array((A,1), (A,4), (A,3), (A,9), (B,2), (B,5), (B,4), (C,3), (C,4), (D,5))

scala>
scala> kv.groupByKey().collect
19/02/11 13:21:15 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:21:15 INFO DAGScheduler: Registering RDD 3 (parallelize at <console>:27)
19/02/11 13:21:15 INFO DAGScheduler: Got job 3 (collect at <console>:30) with 2 output partitions
19/02/11 13:21:15 INFO DAGScheduler: Final stage: ResultStage 5 (collect at <console>:30)
19/02/11 13:21:15 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 4)
19/02/11 13:21:15 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 4)
19/02/11 13:21:15 INFO DAGScheduler: Submitting ShuffleMapStage 4 (ParallelCollectionRDD[3] at parallelize at <console>:27), which has no missing parents
19/02/11 13:21:15 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.9 KB, free 511.1 MB)
19/02/11 13:21:15 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1644.0 B, free 511.1 MB)
19/02/11 13:21:15 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:58532 (size: 1644.0 B, free: 511.1 MB)
19/02/11 13:21:15 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1008
19/02/11 13:21:15 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 4 (ParallelCollectionRDD[3] at parallelize at <console>:27)
19/02/11 13:21:15 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks
19/02/11 13:21:15 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 8, localhost, partition 0,PROCESS_LOCAL, 2250 bytes)
19/02/11 13:21:15 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 9, localhost, partition 1,PROCESS_LOCAL, 2244 bytes)
19/02/11 13:21:15 INFO Executor: Running task 0.0 in stage 4.0 (TID 8)
19/02/11 13:21:15 INFO Executor: Finished task 0.0 in stage 4.0 (TID 8). 1159 bytes result sent to driver
19/02/11 13:21:15 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 8) in 52 ms on localhost (1/2)
19/02/11 13:21:15 INFO Executor: Running task 1.0 in stage 4.0 (TID 9)
19/02/11 13:21:15 INFO Executor: Finished task 1.0 in stage 4.0 (TID 9). 1159 bytes result sent to driver
19/02/11 13:21:15 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 9) in 75 ms on localhost (2/2)
19/02/11 13:21:15 INFO DAGScheduler: ShuffleMapStage 4 (parallelize at <console>:27) finished in 0.076 s
19/02/11 13:21:15 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
19/02/11 13:21:15 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:21:15 INFO DAGScheduler: running: Set()
19/02/11 13:21:15 INFO DAGScheduler: waiting: Set(ResultStage 5)
19/02/11 13:21:15 INFO DAGScheduler: failed: Set()
19/02/11 13:21:15 INFO DAGScheduler: Submitting ResultStage 5 (ShuffledRDD[7] at groupByKey at <console>:30), which has no missing parents
19/02/11 13:21:15 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.9 KB, free 511.1 MB)
19/02/11 13:21:15 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.1 KB, free 511.1 MB)
19/02/11 13:21:15 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:58532 (size: 2.1 KB, free: 511.1 MB)
19/02/11 13:21:15 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1008
19/02/11 13:21:15 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 5 (ShuffledRDD[7] at groupByKey at <console>:30)
19/02/11 13:21:15 INFO TaskSchedulerImpl: Adding task set 5.0 with 2 tasks
19/02/11 13:21:15 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 10, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:21:15 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID 11, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:21:15 INFO Executor: Running task 0.0 in stage 5.0 (TID 10)
19/02/11 13:21:15 INFO Executor: Running task 1.0 in stage 5.0 (TID 11)
19/02/11 13:21:15 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:21:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:21:15 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:21:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:21:15 INFO Executor: Finished task 0.0 in stage 5.0 (TID 10). 1780 bytes result sent to driver
19/02/11 13:21:15 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 10) in 84 ms on localhost (1/2)
19/02/11 13:21:15 INFO Executor: Finished task 1.0 in stage 5.0 (TID 11). 1789 bytes result sent to driver
19/02/11 13:21:15 INFO TaskSetManager: Finished task 1.0 in stage 5.0 (TID 11) in 103 ms on localhost (2/2)
19/02/11 13:21:15 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 
19/02/11 13:21:15 INFO DAGScheduler: ResultStage 5 (collect at <console>:30) finished in 0.113 s
19/02/11 13:21:15 INFO DAGScheduler: Job 3 finished: collect at <console>:30, took 0.252907 s
res6: Array[(String, Iterable[Int])] = Array((B,CompactBuffer(2, 5, 4)), (D,CompactBuffer(5)), (A,CompactBuffer(1, 4, 3, 9)), (C,CompactBuffer(3, 4)))

scala>

4.启动 spark-shell 后,在 scala 中加载 Key-Value 数据(“A”,1),(“B”,3), (“C”,5),(“D”,4), (“B”,7), (“C”,4), (“E”,5), (“A”,8), (“B”,4), (“D”,5),将这些数据以 Key 为基准进行升序排序,并对相同的 Key 进行 Value 求和计算。将以上操作命令和结果信息以文本形式提交到答题框中。

scala> val kv1=sc.parallelize(List(("A",1),("B",3),("C",5),("D",4),("B",7),("C",4),("E",5),("A",8),("B",4),("D",5)))
kv1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> kv1.sortByKey().collect

19/02/11 13:31:01 INFO SparkContext: Starting job: sortByKey at <console>:30
19/02/11 13:31:01 INFO DAGScheduler: Got job 0 (sortByKey at <console>:30) with 2 output partitions
19/02/11 13:31:01 INFO DAGScheduler: Final stage: ResultStage 0 (sortByKey at <console>:30)
19/02/11 13:31:01 INFO DAGScheduler: Parents of final stage: List()
19/02/11 13:31:01 INFO DAGScheduler: Missing parents: List()
19/02/11 13:31:01 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at sortByKey at <console>:30), which has no missing parents
19/02/11 13:31:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.3 KB, free 511.1 MB)
19/02/11 13:31:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1408.0 B, free 511.1 MB)
19/02/11 13:31:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:49365 (size: 1408.0 B, free: 511.1 MB)
19/02/11 13:31:03 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008
19/02/11 13:31:03 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at sortByKey at <console>:30)
19/02/11 13:31:03 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/02/11 13:31:03 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2260 bytes)
19/02/11 13:31:03 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2249 bytes)
19/02/11 13:31:03 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/02/11 13:31:03 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
19/02/11 13:31:04 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1178 bytes result sent to driver
19/02/11 13:31:04 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1177 bytes result sent to driver
19/02/11 13:31:04 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 657 ms on localhost (1/2)
19/02/11 13:31:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 582 ms on localhost (2/2)
19/02/11 13:31:04 INFO DAGScheduler: ResultStage 0 (sortByKey at <console>:30) finished in 0.814 s
19/02/11 13:31:04 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
19/02/11 13:31:04 INFO DAGScheduler: Job 0 finished: sortByKey at <console>:30, took 2.947240 s
19/02/11 13:31:04 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:31:04 INFO DAGScheduler: Registering RDD 0 (parallelize at <console>:27)
19/02/11 13:31:04 INFO DAGScheduler: Got job 1 (collect at <console>:30) with 2 output partitions
19/02/11 13:31:04 INFO DAGScheduler: Final stage: ResultStage 2 (collect at <console>:30)
19/02/11 13:31:04 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
19/02/11 13:31:04 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
19/02/11 13:31:04 INFO DAGScheduler: Submitting ShuffleMapStage 1 (ParallelCollectionRDD[0] at parallelize at <console>:27), which has no missing parents
19/02/11 13:31:04 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.3 KB, free 511.1 MB)
19/02/11 13:31:04 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1424.0 B, free 511.1 MB)
19/02/11 13:31:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:49365 (size: 1424.0 B, free: 511.1 MB)
19/02/11 13:31:04 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008
19/02/11 13:31:04 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 1 (ParallelCollectionRDD[0] at parallelize at <console>:27)
19/02/11 13:31:04 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
19/02/11 13:31:04 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2249 bytes)
19/02/11 13:31:04 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2238 bytes)
19/02/11 13:31:04 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
19/02/11 13:31:04 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
19/02/11 13:31:04 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1159 bytes result sent to driver
19/02/11 13:31:04 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1159 bytes result sent to driver
19/02/11 13:31:04 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 176 ms on localhost (1/2)
19/02/11 13:31:04 INFO DAGScheduler: ShuffleMapStage 1 (parallelize at <console>:27) finished in 0.225 s
19/02/11 13:31:04 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:31:04 INFO DAGScheduler: running: Set()
19/02/11 13:31:04 INFO DAGScheduler: waiting: Set(ResultStage 2)
19/02/11 13:31:04 INFO DAGScheduler: failed: Set()
19/02/11 13:31:04 INFO DAGScheduler: Submitting ResultStage 2 (ShuffledRDD[3] at sortByKey at <console>:30), which has no missing parents
19/02/11 13:31:04 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 209 ms on localhost (2/2)
19/02/11 13:31:04 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
19/02/11 13:31:04 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 511.1 MB)
19/02/11 13:31:04 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1770.0 B, free 511.1 MB)
19/02/11 13:31:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:49365 (size: 1770.0 B, free: 511.1 MB)
19/02/11 13:31:04 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1008
19/02/11 13:31:04 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 2 (ShuffledRDD[3] at sortByKey at <console>:30)
19/02/11 13:31:04 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
19/02/11 13:31:04 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:31:04 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 5, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:31:04 INFO Executor: Running task 1.0 in stage 2.0 (TID 5)
19/02/11 13:31:04 INFO Executor: Running task 0.0 in stage 2.0 (TID 4)
19/02/11 13:31:04 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:31:04 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 9 ms
19/02/11 13:31:05 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:31:05 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 17 ms
19/02/11 13:31:05 INFO Executor: Finished task 0.0 in stage 2.0 (TID 4). 1391 bytes result sent to driver
19/02/11 13:31:05 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 512 ms on localhost (1/2)
19/02/11 13:31:05 INFO Executor: Finished task 1.0 in stage 2.0 (TID 5). 1385 bytes result sent to driver
19/02/11 13:31:05 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 5) in 530 ms on localhost (2/2)
19/02/11 13:31:05 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
19/02/11 13:31:05 INFO DAGScheduler: ResultStage 2 (collect at <console>:30) finished in 0.541 s
19/02/11 13:31:05 INFO DAGScheduler: Job 1 finished: collect at <console>:30, took 0.940656 s
res0: Array[(String, Int)] = Array((A,1), (A,8), (B,3), (B,7), (B,4), (C,5), (C,4), (D,4), (D,5), (E,5))
scala> kv1.reduceByKey(_+_).collect
19/02/11 13:33:00 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:33:00 INFO DAGScheduler: Registering RDD 0 (parallelize at <console>:27)
19/02/11 13:33:00 INFO DAGScheduler: Got job 2 (collect at <console>:30) with 2 output partitions
19/02/11 13:33:00 INFO DAGScheduler: Final stage: ResultStage 4 (collect at <console>:30)
19/02/11 13:33:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 3)
19/02/11 13:33:00 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 3)
19/02/11 13:33:00 INFO DAGScheduler: Submitting ShuffleMapStage 3 (ParallelCollectionRDD[0] at parallelize at <console>:27), which has no missing parents
19/02/11 13:33:00 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.0 KB, free 511.1 MB)
19/02/11 13:33:00 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1291.0 B, free 511.1 MB)
19/02/11 13:33:00 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:49365 (size: 1291.0 B, free: 511.1 MB)
19/02/11 13:33:00 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1008
19/02/11 13:33:00 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 3 (ParallelCollectionRDD[0] at parallelize at <console>:27)
19/02/11 13:33:00 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
19/02/11 13:33:00 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, localhost, partition 0,PROCESS_LOCAL, 2249 bytes)
19/02/11 13:33:00 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 7, localhost, partition 1,PROCESS_LOCAL, 2238 bytes)
19/02/11 13:33:00 INFO Executor: Running task 0.0 in stage 3.0 (TID 6)
19/02/11 13:33:00 INFO Executor: Running task 1.0 in stage 3.0 (TID 7)
19/02/11 13:33:00 INFO Executor: Finished task 1.0 in stage 3.0 (TID 7). 1159 bytes result sent to driver
19/02/11 13:33:00 INFO Executor: Finished task 0.0 in stage 3.0 (TID 6). 1159 bytes result sent to driver
19/02/11 13:33:00 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 7) in 52 ms on localhost (1/2)
19/02/11 13:33:00 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 6) in 52 ms on localhost (2/2)
19/02/11 13:33:00 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
19/02/11 13:33:00 INFO DAGScheduler: ShuffleMapStage 3 (parallelize at <console>:27) finished in 0.054 s
19/02/11 13:33:00 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:33:00 INFO DAGScheduler: running: Set()
19/02/11 13:33:00 INFO DAGScheduler: waiting: Set(ResultStage 4)
19/02/11 13:33:00 INFO DAGScheduler: failed: Set()
19/02/11 13:33:00 INFO DAGScheduler: Submitting ResultStage 4 (ShuffledRDD[4] at reduceByKey at <console>:30), which has no missing parents
19/02/11 13:33:00 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.7 KB, free 511.1 MB)
19/02/11 13:33:00 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1609.0 B, free 511.1 MB)
19/02/11 13:33:00 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:49365 (size: 1609.0 B, free: 511.1 MB)
19/02/11 13:33:00 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1008
19/02/11 13:33:00 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 4 (ShuffledRDD[4] at reduceByKey at <console>:30)
19/02/11 13:33:00 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks
19/02/11 13:33:00 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 8, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:33:00 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 9, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:33:00 INFO Executor: Running task 0.0 in stage 4.0 (TID 8)
19/02/11 13:33:00 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:33:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:33:00 INFO Executor: Finished task 0.0 in stage 4.0 (TID 8). 1327 bytes result sent to driver
19/02/11 13:33:00 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 8) in 29 ms on localhost (1/2)
19/02/11 13:33:00 INFO Executor: Running task 1.0 in stage 4.0 (TID 9)
19/02/11 13:33:00 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:33:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:33:00 INFO Executor: Finished task 1.0 in stage 4.0 (TID 9). 1342 bytes result sent to driver
19/02/11 13:33:00 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 9) in 43 ms on localhost (2/2)
19/02/11 13:33:00 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
19/02/11 13:33:00 INFO DAGScheduler: ResultStage 4 (collect at <console>:30) finished in 0.051 s
19/02/11 13:33:00 INFO DAGScheduler: Job 2 finished: collect at <console>:30, took 0.139948 s
res1: Array[(String, Int)] = Array((B,14), (D,9), (A,9), (C,9), (E,5))

scala>

5.启动 spark-shell 后,在 scala 中加载 Key-Value 数据(“A”,4),(“A”,2), (“C”,3),(“A”,4),(“B”,5),(“C”,3),(“A”,4),以 Key 为基准进行去重操 作,并通过 toDebugString 方法来查看 RDD 的谱系。将以上操作命令和结果信 息以文本形式提交到答题框中。

`

scala> val kv2=sc.parallelize(List(("A",4),("A",2),("C",3),("A",4),("B",5),("C",3),("A",4)))
kv2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[5] at parallelize at <console>:27

scala> kv2.distinct.collect
19/02/11 13:42:15 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:42:15 INFO DAGScheduler: Registering RDD 6 (distinct at <console>:30)
19/02/11 13:42:15 INFO DAGScheduler: Got job 3 (collect at <console>:30) with 2 output partitions
19/02/11 13:42:15 INFO DAGScheduler: Final stage: ResultStage 6 (collect at <console>:30)
19/02/11 13:42:15 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 5)
19/02/11 13:42:15 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 5)
19/02/11 13:42:15 INFO DAGScheduler: Submitting ShuffleMapStage 5 (MapPartitionsRDD[6] at distinct at <console>:30), which has no missing parents
19/02/11 13:42:15 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.7 KB, free 511.1 MB)
19/02/11 13:42:15 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 1618.0 B, free 511.1 MB)
19/02/11 13:42:15 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:49365 (size: 1618.0 B, free: 511.1 MB)
19/02/11 13:42:15 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1008
19/02/11 13:42:15 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 5 (MapPartitionsRDD[6] at distinct at <console>:30)
19/02/11 13:42:15 INFO TaskSchedulerImpl: Adding task set 5.0 with 2 tasks
19/02/11 13:42:15 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 10, localhost, partition 0,PROCESS_LOCAL, 2209 bytes)
19/02/11 13:42:15 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID 11, localhost, partition 1,PROCESS_LOCAL, 2224 bytes)
19/02/11 13:42:15 INFO Executor: Running task 1.0 in stage 5.0 (TID 11)
19/02/11 13:42:15 INFO Executor: Running task 0.0 in stage 5.0 (TID 10)
19/02/11 13:42:15 INFO Executor: Finished task 1.0 in stage 5.0 (TID 11). 1159 bytes result sent to driver
19/02/11 13:42:15 INFO Executor: Finished task 0.0 in stage 5.0 (TID 10). 1159 bytes result sent to driver
19/02/11 13:42:15 INFO TaskSetManager: Finished task 1.0 in stage 5.0 (TID 11) in 93 ms on localhost (1/2)
19/02/11 13:42:15 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 10) in 97 ms on localhost (2/2)
19/02/11 13:42:15 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 
19/02/11 13:42:15 INFO DAGScheduler: ShuffleMapStage 5 (distinct at <console>:30) finished in 0.100 s
19/02/11 13:42:15 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:42:15 INFO DAGScheduler: running: Set()
19/02/11 13:42:15 INFO DAGScheduler: waiting: Set(ResultStage 6)
19/02/11 13:42:15 INFO DAGScheduler: failed: Set()
19/02/11 13:42:15 INFO DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[8] at distinct at <console>:30), which has no missing parents
19/02/11 13:42:15 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.3 KB, free 511.1 MB)
19/02/11 13:42:15 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1930.0 B, free 511.1 MB)
19/02/11 13:42:15 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:49365 (size: 1930.0 B, free: 511.1 MB)
19/02/11 13:42:15 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1008
19/02/11 13:42:15 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 6 (MapPartitionsRDD[8] at distinct at <console>:30)
19/02/11 13:42:15 INFO TaskSchedulerImpl: Adding task set 6.0 with 2 tasks
19/02/11 13:42:15 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 12, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:42:15 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 13, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:42:15 INFO Executor: Running task 0.0 in stage 6.0 (TID 12)
19/02/11 13:42:15 INFO Executor: Running task 1.0 in stage 6.0 (TID 13)
19/02/11 13:42:15 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:42:15 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:42:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:42:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:42:15 INFO Executor: Finished task 1.0 in stage 6.0 (TID 13). 1347 bytes result sent to driver
19/02/11 13:42:15 INFO Executor: Finished task 0.0 in stage 6.0 (TID 12). 1307 bytes result sent to driver
19/02/11 13:42:15 INFO TaskSetManager: Finished task 1.0 in stage 6.0 (TID 13) in 29 ms on localhost (1/2)
19/02/11 13:42:15 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 12) in 34 ms on localhost (2/2)
19/02/11 13:42:15 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool 
19/02/11 13:42:15 INFO DAGScheduler: ResultStage 6 (collect at <console>:30) finished in 0.037 s
19/02/11 13:42:15 INFO DAGScheduler: Job 3 finished: collect at <console>:30, took 0.220315 s
res2: Array[(String, Int)] = Array((A,4), (B,5), (C,3), (A,2))

scala>
scala> kv2.toDebugString
res3: String = (2) ParallelCollectionRDD[5] at parallelize at <console>:27 []

scala>

6.启动 spark-shell 后,在 scala 中加载两组 Key-Value 数据(“A”,1),(“B”, 2),(“C”,3),(“A”,4),(“B”,5)、(“A”,1),(“B”,2),(“C”,3),(“A”,4), (“B”,5),将两组数据以 Key 为基准进行 JOIN 操作,将以上操作命令和结果信 息以文本形式提交到答题框中。

scala> val kv4=sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))
kv4: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[9] at parallelize at <console>:27

scala> val kv5=sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))
kv5: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:27

scala> kv4.join(kv5).collect
19/02/11 13:47:38 INFO SparkContext: Starting job: collect at <console>:32
19/02/11 13:47:38 INFO DAGScheduler: Registering RDD 9 (parallelize at <console>:27)
19/02/11 13:47:38 INFO DAGScheduler: Registering RDD 10 (parallelize at <console>:27)
19/02/11 13:47:38 INFO DAGScheduler: Got job 4 (collect at <console>:32) with 2 output partitions
19/02/11 13:47:38 INFO DAGScheduler: Final stage: ResultStage 9 (collect at <console>:32)
19/02/11 13:47:38 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 7, ShuffleMapStage 8)
19/02/11 13:47:38 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 7, ShuffleMapStage 8)
19/02/11 13:47:38 INFO DAGScheduler: Submitting ShuffleMapStage 7 (ParallelCollectionRDD[9] at parallelize at <console>:27), which has no missing parents
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 1864.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 1188.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:49365 (size: 1188.0 B, free: 511.1 MB)
19/02/11 13:47:38 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1008
19/02/11 13:47:38 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 7 (ParallelCollectionRDD[9] at parallelize at <console>:27)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Adding task set 7.0 with 2 tasks
19/02/11 13:47:38 INFO DAGScheduler: Submitting ShuffleMapStage 8 (ParallelCollectionRDD[10] at parallelize at <console>:27), which has no missing parents
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 1864.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 1188.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:49365 (size: 1188.0 B, free: 511.1 MB)
19/02/11 13:47:38 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1008
19/02/11 13:47:38 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 8 (ParallelCollectionRDD[10] at parallelize at <console>:27)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Adding task set 8.0 with 2 tasks
19/02/11 13:47:38 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 14, localhost, partition 0,PROCESS_LOCAL, 2188 bytes)
19/02/11 13:47:38 INFO TaskSetManager: Starting task 1.0 in stage 7.0 (TID 15, localhost, partition 1,PROCESS_LOCAL, 2208 bytes)
19/02/11 13:47:38 INFO Executor: Running task 0.0 in stage 7.0 (TID 14)
19/02/11 13:47:38 INFO Executor: Running task 1.0 in stage 7.0 (TID 15)
19/02/11 13:47:38 INFO Executor: Finished task 1.0 in stage 7.0 (TID 15). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO Executor: Finished task 0.0 in stage 7.0 (TID 14). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 16, localhost, partition 0,PROCESS_LOCAL, 2188 bytes)
19/02/11 13:47:38 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 17, localhost, partition 1,PROCESS_LOCAL, 2208 bytes)
19/02/11 13:47:38 INFO Executor: Running task 0.0 in stage 8.0 (TID 16)
19/02/11 13:47:38 INFO Executor: Running task 1.0 in stage 8.0 (TID 17)
19/02/11 13:47:38 INFO TaskSetManager: Finished task 0.0 in stage 7.0 (TID 14) in 133 ms on localhost (1/2)
19/02/11 13:47:38 INFO TaskSetManager: Finished task 1.0 in stage 7.0 (TID 15) in 134 ms on localhost (2/2)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool 
19/02/11 13:47:38 INFO DAGScheduler: ShuffleMapStage 7 (parallelize at <console>:27) finished in 0.177 s
19/02/11 13:47:38 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:47:38 INFO DAGScheduler: running: Set(ShuffleMapStage 8)
19/02/11 13:47:38 INFO DAGScheduler: waiting: Set(ResultStage 9)
19/02/11 13:47:38 INFO DAGScheduler: failed: Set()
19/02/11 13:47:38 INFO Executor: Finished task 0.0 in stage 8.0 (TID 16). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO Executor: Finished task 1.0 in stage 8.0 (TID 17). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 16) in 68 ms on localhost (1/2)
19/02/11 13:47:38 INFO DAGScheduler: ShuffleMapStage 8 (parallelize at <console>:27) finished in 0.166 s
19/02/11 13:47:38 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:47:38 INFO DAGScheduler: running: Set()
19/02/11 13:47:38 INFO DAGScheduler: waiting: Set(ResultStage 9)
19/02/11 13:47:38 INFO DAGScheduler: failed: Set()
19/02/11 13:47:38 INFO DAGScheduler: Submitting ResultStage 9 (MapPartitionsRDD[13] at join at <console>:32), which has no missing parents
19/02/11 13:47:38 INFO TaskSetManager: Finished task 1.0 in stage 8.0 (TID 17) in 65 ms on localhost (2/2)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool 
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 3.2 KB, free 511.1 MB)
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 1814.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:49365 (size: 1814.0 B, free: 511.1 MB)
19/02/11 13:47:38 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1008
19/02/11 13:47:38 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 9 (MapPartitionsRDD[13] at join at <console>:32)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Adding task set 9.0 with 2 tasks
19/02/11 13:47:38 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID 18, localhost, partition 0,PROCESS_LOCAL, 1967 bytes)
19/02/11 13:47:38 INFO TaskSetManager: Starting task 1.0 in stage 9.0 (TID 19, localhost, partition 1,PROCESS_LOCAL, 1967 bytes)
19/02/11 13:47:38 INFO Executor: Running task 1.0 in stage 9.0 (TID 19)
19/02/11 13:47:38 INFO Executor: Running task 0.0 in stage 9.0 (TID 18)
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:47:38 INFO Executor: Finished task 1.0 in stage 9.0 (TID 19). 1453 bytes result sent to driver
19/02/11 13:47:38 INFO Executor: Finished task 0.0 in stage 9.0 (TID 18). 1417 bytes result sent to driver
19/02/11 13:47:38 INFO TaskSetManager: Finished task 1.0 in stage 9.0 (TID 19) in 184 ms on localhost (1/2)
19/02/11 13:47:38 INFO TaskSetManager: Finished task 0.0 in stage 9.0 (TID 18) in 207 ms on localhost (2/2)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool 
19/02/11 13:47:38 INFO DAGScheduler: ResultStage 9 (collect at <console>:32) finished in 0.212 s
19/02/11 13:47:38 INFO DAGScheduler: Job 4 finished: collect at <console>:32, took 0.671400 s
res4: Array[(String, (Int, Int))] = Array((B,(2,2)), (B,(2,5)), (B,(5,2)), (B,(5,5)), (A,(1,1)), (A,(1,4)), (A,(4,1)), (A,(4,4)), (C,(3,3)))

scala>

7.登录 spark-shell,定义 i 值为 1,sum 值为 0,使用 while 循环,求从 1 加 到 100 的值,最后使用 scala 的标准输出函数输出 sum 值。将上述所有操作命令 和返回结果以文本形式提交到答题框。

scala> var i=1
i: Int = 1

scala> var sum=0
sum: Int = 0

scala> while(i<100)
     | {
     | i+=1
     | sum+=i
     | }

scala> println(sum)
5049

8.登录 spark-shell,定义 i 值为 1,sum 值为 0,使用 for 循环,求从 1 加到 100 的值,最后使用 scala 的标准输出函数输出 sum 值。将上述所有操作命令和 返回结果以文本形式提交到答题框

scala> var i=1
i: Int = 1

scala> var sum=0
sum: Int = 0

scala> for(i<- 1 to 100) sum+=i

scala> println(sum)
5050

scala>
9.任何一种函数式语言中,都有 map 函数与 faltMap 这两个函数: map 函数的用法,顾名思义,将一个函数传入 map 中,
然后利用传入的这 个函数,将集合中的每个元素处理,并将处理后的结果返回。
 而 flatMap 与 map 唯一不一样的地方就是传入的函数在处理完后返回值必须 是 List,
 所以需要返回值是 List 才能执行 flat 这一步。 
(1)登录 spark-shell,自定义一个 list,然后利用 map 函数,对这个 list 进 行元素乘 2 的操作,

将上述所有操作命令和返回结果以文本形式提交到答题框。 


(2)登录 spark-shell,自定义一个 list,然后利用 flatMap 函数将 list 转换为 单个字母并转换为大写
,将上述所有命令和返回结果以文本形式提交到答题框。


10.登录大数据云主机 master 节点,在 root 目录下新建一个 abc.txt,
里面的 内容为: hadoop hive solr redis kafka hadoop storm flume sqoop docker spark spark hadoop spark elasticsearch hbase hadoop hive spark hive hadoop spark 然后登录 spark-shell,
首先使用命令统计 abc.txt 的行数,接着对 abc.txt 文档 中的单词进行计数,
并按照单词首字母的升序进行排序,最后统计结果行数,将 上述操作命令和返回结果以文本形式提交到答题框。


.登录 spark-shell,自定义一个 List,使用 spark 自带函数对这个 List 进行 去重操作,

以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

计算机程序设计艺术(第3卷 英文版·第2版)

计算机程序设计艺术(第3卷 英文版·第2版)

Donald E.Knuth / 人民邮电出版社 / 2010-10 / 119.00元

《计算机程序设计艺术》系列被公认为计算机科学领域的权威之作,深入阐述了程序设计理论,对计算机领域的发展有着极为深远的影响。本书是该系列的第3卷,扩展了第1卷中信息结构的内容,主要讲排序和查找。书中对排序和查找算法进行了详细的介绍,并对各种算法的效率做了大量的分析。 本书适合从事计算机科学、计算数学等各方面工作的人员阅读,也适合高等院校相关专业的师生作为教学参考书,对于想深入理解计算机算法的读......一起来看看 《计算机程序设计艺术(第3卷 英文版·第2版)》 这本书的介绍吧!

Base64 编码/解码
Base64 编码/解码

Base64 编码/解码

XML、JSON 在线转换
XML、JSON 在线转换

在线XML、JSON转换工具

RGB HSV 转换
RGB HSV 转换

RGB HSV 互转工具