spark杂记:Spark Basics

栏目: 编程工具 · 发布时间: 5年前

内容简介:下面来看几个问题,下面将关注几个问题进行阐述:可以参考:Big Data Analytics using Spark这个课程:若查看所有版的JAVA_HOME,使用命令:/usr/libexec/java_home -v

下面来看几个问题,下面将关注几个问题进行阐述:

  • Mac下安装pyspark
  • spark相关基础知识

1、Mac下安装pyspark

可以参考:Big Data Analytics using Spark这个课程: https://courses.edx.org/courses/course-v1:UCSanDiegoX+DSE230x+1T2018/courseware/b341cd4498054fa089cc99dcadd5875a/13b26725c3564f73ba763ef209b8449e/1?activate_block_id=block-v1%3AUCSanDiegoX%2BDSE230x%2B1T2018%2Btype%40vertical%2Bblock%40ff752a67a23547db9efbc7769dc93987

若查看所有版的JAVA_HOME,使用命令:/usr/libexec/java_home -v

下载完以后,可以不用配置通过下面方法进行使用:

import os
import sys

#下面这些目录都是你自己机器的Spark安装目录和 Java 安装目录
os.environ['SPARK_HOME'] = "/Users/liupeng/spark/spark-2.4.0-bin-hadoop2.7/"

sys.path.append("/Users/liupeng/spark/spark-2.4.0-bin-hadoop2.7/bin")
sys.path.append("/Users/liupeng/spark/spark-2.4.0-bin-hadoop2.7/python")
sys.path.append("/Users/liupeng/spark/spark-2.4.0-bin-hadoop2.7/python/pyspark")
sys.path.append("/Users/liupeng/spark/spark-2.4.0-bin-hadoop2.7/python/lib")
sys.path.append("/Users/liupeng/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip")
sys.path.append("/Users/liupeng/spark/spark-2.4.0-bin-hadoop2.7/lib/py4j-0.9-src.zip")
# sys.path.append("/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home")
os.environ['JAVA_HOME'] = "/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home"

from pyspark import SparkContext
from pyspark import SparkConf


sc = SparkContext("local","testing")

print (sc.version)

x = sc.parallelize([1,2,3,4])
y = x.map(lambda x:(x, x**3))

print ( y.collect() )

2、spark相关基础知识

相关spark基础知识如下:

Spark Context:

We start by creating a SparkContext object named sc. In this case we create a spark context that uses 4 executors (one per core)。

Only one sparkContext at a time!

  • Spark is designed for single user
  • Only one sparkContext per program/notebook.
  • Before starting a new sparkContext. Stop the one currently running

# sc.stop() #commented out so that you don't stop your context by mistake

RDDs

RDD (or Resilient Distributed DataSet) is the main novel data structure in Spark. You can think of it as a list whose elements are stored on several computers.

spark杂记:Spark Basics

The elements of each RDD are distributed across the worker nodes which are the nodes that perform the actual computations. This notebook, however, is running on the Driver node. As the RDD is not stored on the driver-node you cannot access it directly. The variable name RDD is really just a pointer to a python object which holds the information regardnig the actual location of the elements.

Some basic RDD commands

Parallelize

  • Simplest way to create an RDD.
  • The method A=sc.parallelize(L), creates an RDD named A from list L.
  • A is an RDD of type PythonRDD.
A=sc.parallelize(range(3))
print (A)

output:PythonRDD[1] at RDD at PythonRDD.scala:48

Collect:

  • RDD content is distributed among all executors.
  • collect() is the inverse of `parallelize()'
  • collects the elements of the RDD
  • Returns a list
L=A.collec t()
print (type(L))
print (L)

output:<class 'list'> [0, 1, 2]

Using .collect() eliminates the benefits of parallelism

It is often tempting to .collect() and RDD, make it into a list, and then process the list using standard python. However, note that this means that you are using only the head node to perform the computation which means that you are not getting any benefit from spark.

Using RDD operations, as described below, will make use of all of the computers at your disposal.

Map

  • applies a given operation to each element of an RDD
  • parameter is the function defining the operation.
  • returns a new RDD.
  • Operation performed in parallel on all executors.
  • Each executor operates on the data local to it.
A.map(lambda x: x*x).collect()

output:[0, 1, 4]

Note: Here we are using lambda functions, later we will see that regular functions can also be used.

Reduce

  • Takes RDD as input, returns a single value.
  • Reduce operator takes two elements as input returns one as output.
  • Repeatedly applies a reduce operator
  • Each executor reduces the data local to it.
  • The results from all executors are combined.

The simplest example of a 2-to-1 operation is the sum:

A.reduce(lambda x,y:x+y)

output:3

Here is an example of a reduce operation that finds the shortest string in an RDD of strings.

words=['this','is','the','best','mac','ever']
wordRDD=sc.parallelize(words)
wordRDD.reduce(lambda w,v: w if len(w)<len(v) else v)

output:'s'

Properties of reduce operations:

  • Reduce operations must not depend on the order
    • Order of operands should not matter
    • Order of application of reduce operator should not matter
  • Multiplication and summation are good:

1 + 3 + 5 + 2 5 + 3 + 1 + 2

  • Division and subtraction are bad:

1 - 3 - 5 - 2 1 - 3 - 5 - 2

Why must reordering not change the result?

You can think about the reduce operation as a binary tree where the leaves are the elements of the list and the root is the final result. Each triplet of the form (parent, child1, child2) corresponds to a single application of the reduce function.

The order in which the reduce operation is applied is determined at run time and depends on how the RDD is partitioned across the cluster. There are many different orders to apply the reduce operation.

If we want the input RDD to uniquely determine the reduced value all evaluation orders must must yield the same final result. In addition, the order of the elements in the list must not change the result. In particular, reversing the order of the operands in a reduce function must not change the outcome.

For example the arithmetic operations multiply * and add + can be used in a reduce, but the operations subtract - and divide / should not.

Doing so will not raise an error, but the result is unpredictable.

B=sc.parallelize([1,3,5,2])
B.reduce(lambda x,y: x-y)

output:-9

Slide Type

Which of these the following orders was executed?

  • ((1−3)−5)−2

or

  • (1−3)−(5−2)

Using regular functions instead of lambda functions

  • lambda function are short and sweet.
  • but sometimes it's hard to use just one line.
  • We can use full-fledged functions instead.
A.reduce(lambda x,y: x+y)

output:3

Suppose we want to find the

  • last word in a lexicographical order 
  • among 
  • the longest words in the list.

We could achieve that as follows:

def largerThan(x,y):
    if len(x)>len(y): return x
    elif len(y)>len(x): return y
    else:  #lengths are equal, compare lexicographically
        if x>y: 
            return x
        else: 
            return y
wordRDD.reduce(largerThan)

output:'this'

Summary:

We saw how to:

  • Start a SparkContext
  • Create an RDD
  • Perform Map and Reduce operations on an RDD
  • Collect the final results back to head node.

以上所述就是小编给大家介绍的《spark杂记:Spark Basics》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

Linux设备驱动程序

Linux设备驱动程序

科波特 / 魏永明、耿岳、钟书毅 / 中国电力出版社 / 2006-1-1 / 69.00元

本书是经典著作《Linux设备驱动程序》的第三版。如果您希望在Linux操作系统上支持计算机外部设备,或者在Linux上运行新的硬件,或者只是希望一般性地了解Linux内核的编程,就一定要阅读本书。本书描述了如何针对各种设备编写驱动程序,而在过去,这些内容仅仅以口头形式交流,或者零星出现在神秘的代码注释中。 本书的作者均是Linux社区的领导者。Jonathan Corbet虽不是专职的内核......一起来看看 《Linux设备驱动程序》 这本书的介绍吧!

RGB转16进制工具
RGB转16进制工具

RGB HEX 互转工具

URL 编码/解码
URL 编码/解码

URL 编码/解码

正则表达式在线测试
正则表达式在线测试

正则表达式在线测试