Ultimate PySpark Cheat Sheet



A short guide to the PySpark DataFrames API


Spark is one of the major players in the data engineering and data science space today. With ever-increasing requirements to crunch more data, businesses have frequently incorporated Spark into their data stack to process large amounts of data quickly. The project is maintained by Apache, and the main commercial player in the Spark ecosystem is Databricks (founded by the original creators of Spark). Spark has seen extensive adoption across all kinds of companies and setups, on-prem and in the cloud. Some of the most popular cloud offerings that use Spark underneath are AWS Glue, Google Dataproc and Azure Databricks.

No single technology or programming language is good enough for every use case. Spark is one of the many technologies used for solving large-scale data analysis and ETL problems. Having worked on Spark for a while now, I thought of compiling a cheat sheet with real examples. Although there are a lot of resources on using Spark with Scala, I couldn’t find a halfway decent cheat sheet except for the one here on Datacamp, but I thought it needed an update and needed to be a bit more extensive than a one-pager.

First off, it helps to start with a decent introduction to how Spark works before getting into the API itself.

Configuration & Initialization

Before you get into what lines of code you have to write to get your PySpark notebook/application up and running, you should know a little bit about SparkContext , SparkSession and SQLContext .

  • SparkContext — provides connection to Spark with the ability to create RDDs
  • SQLContext — provides connection to Spark with the ability to run SQL queries on data
  • SparkSession — all-encompassing context which includes coverage for SparkContext , SQLContext and HiveContext .
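
As a rough sketch (the app name and config value below are placeholders, not from the original article), here is how you would typically spin up a SparkSession, which gives you the SparkContext and SQL functionality in one object:

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; appName and the config value
# below are placeholders -- tune them for your own setup.
spark = (
    SparkSession.builder
    .appName("pyspark-cheatsheet")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

# The underlying SparkContext is still reachable if you need raw RDDs.
sc = spark.sparkContext
```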

We’ll be using the MovieLens dataset in some of the examples. You can go ahead and download it from Kaggle.

Reading Data

Spark supports reading from various data sources like CSV, Text, Parquet, Avro, JSON. It also supports reading from Hive and any database that has a JDBC channel available. Here’s how you read a CSV in Spark —
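
A minimal sketch, with the file path and options as placeholders for wherever you unpacked the dataset:

```python
# Read a CSV with a header row, letting Spark infer column types.
movies = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data/movies_metadata.csv")
)

movies.printSchema()
movies.show(5)
```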

Throughout your Spark journey, you’ll find that there are many ways of writing the same line of code to achieve the same result. Many functions have aliases (e.g., dropDuplicates and drop_duplicates ). Here’s an example displaying a couple of ways of reading files in Spark.
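
For instance, the following two reads are equivalent (paths are again placeholders), and the same aliasing applies to DataFrame methods such as dropDuplicates and drop_duplicates:

```python
# .csv(path) is shorthand for .format("csv").load(path).
ratings = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data/ratings.csv")
)

ratings_alt = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/ratings.csv")
)

# Aliased methods behave identically.
deduped = ratings.dropDuplicates()
also_deduped = ratings.drop_duplicates()
```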

Writing Data

Once you’re done transforming your data, you’d want to write it on some kind of persistent storage. Here’s an example showing two different ways to write a Parquet file to disk —
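
A sketch of both styles, reusing the movies DataFrame from the reading examples; the output paths and partition column are illustrative assumptions:

```python
# Shorthand writer method:
movies.write.mode("overwrite").parquet("output/movies_parquet")

# Equivalent long-form writer, here also partitioned by a
# hypothetical column for the sake of illustration:
(
    movies.write
    .format("parquet")
    .mode("overwrite")
    .partitionBy("original_language")
    .save("output/movies_by_language")
)
```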

Obviously, based on your consumption patterns and requirements, you can use similar commands to write other file formats to disk too. When writing to a Hive table, you can use bucketBy instead of partitionBy.

The idea behind both bucketBy and partitionBy is to skip over the data that doesn’t need to be queried, i.e., to prune partitions. It’s an old concept that comes from traditional relational database partitioning.
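
Here is a sketch of bucketBy, which only works together with saveAsTable (i.e., when writing to a table in the metastore); the bucket count, column and table name are illustrative, and the ratings DataFrame is the one read earlier:

```python
# bucketBy requires saveAsTable; a plain .save() will raise an error.
(
    ratings.write
    .bucketBy(16, "movieId")
    .sortBy("movieId")
    .mode("overwrite")
    .saveAsTable("ratings_bucketed")
)
```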

Creating DataFrames

Apart from the direct method df = spark.read.csv(csv_file_path) you saw in the Reading Data section above, there’s one other way to create DataFrames and that is using the Row construct of SparkSQL.
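
A small sketch using Row, with made-up data purely for illustration:

```python
from pyspark.sql import Row

rows = [
    Row(movie_id=1, title="Toy Story", rating=8.3),
    Row(movie_id=2, title="Jumanji", rating=7.0),
]

manual_df = spark.createDataFrame(rows)
manual_df.show()
```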

There’s one more option: you can use Spark’s .parallelize() or .textFile() methods to represent a file as an RDD. To convert it into a DataFrame, you’d obviously need to specify a schema. That’s where pyspark.sql.types comes into the picture.
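
Here is a sketch of that route, assuming a headerless, comma-separated file at a placeholder path:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Represent the file as an RDD of tuples first
# (assumes the file has no header row).
rdd = (
    spark.sparkContext.textFile("data/links_noheader.csv")
    .map(lambda line: line.split(","))
    .map(lambda parts: (int(parts[0]), parts[1]))
)

# Then attach an explicit schema to get a DataFrame.
schema = StructType([
    StructField("movieId", IntegerType(), True),
    StructField("imdbId", StringType(), True),
])

links = spark.createDataFrame(rdd, schema)
links.show(5)
```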

We’ll be using a lot of SQL-like functionality in PySpark, so please take a couple of minutes to familiarize yourself with the pyspark.sql module documentation.

Modifying DataFrames

DataFrames abstract away RDDs. Datasets do the same, but Datasets don’t come with the tabular, relational-database-table-like representation of the RDDs that DataFrames do. For that reason, DataFrames support operations similar to what you’d usually perform on a database table, i.e., changing the table structure by adding, removing and modifying columns. Spark provides all of this functionality in the DataFrames API. Here’s how it goes:
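
For example, adding and modifying columns with withColumn (the column names are assumptions about the metadata file):

```python
from pyspark.sql import functions as F

movies = (
    movies
    # Make sure the rating column is numeric.
    .withColumn("vote_average", F.col("vote_average").cast("double"))
    # Derive a new boolean column from an existing one.
    .withColumn("is_highly_rated", F.col("vote_average") > 7.5)
)

movies.show(5)
```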

Aside from just creating new columns, we can also rename existing columns using the following method —
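
A quick sketch with withColumnRenamed (the column names are illustrative):

```python
# Returns a new DataFrame with the column renamed; the original is untouched.
movies = movies.withColumnRenamed("original_title", "title_original")
```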

And, if we have to drop a column or multiple columns, here’s how we do it —
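
For example (column names again illustrative):

```python
# Drop a single column...
movies = movies.drop("homepage")

# ...or several in one call.
movies = movies.drop("adult", "video", "poster_path")
```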

Joins

The whole idea behind using a SQL-like interface for Spark is that there’s a lot of data that can be represented in a loose relational model, i.e., a model with tables but without ACID guarantees, integrity checks, etc. Given that, we can expect a lot of joins to happen. Spark provides full support for joining two or more datasets. Here’s how:
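
A sketch joining the ratings file to the movie metadata; the join keys and selected columns are assumptions about the MovieLens column names:

```python
# Inner join on the movie identifier; other join types include
# left, right, outer, left_semi and left_anti.
joined = ratings.join(
    movies,
    ratings.movieId == movies.id,
    how="inner",
)

joined.select("movieId", "title", "rating").show(5)
```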

Filters

Filters are just WHERE clauses, like in SQL. In fact, you can use filter and where interchangeably in Spark. Here’s an example of filtering movies rated between 7.5 and 8.2 in the MovieLens dataset’s movie metadata file.
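
A sketch of that filter (the vote_average and title columns are assumptions about the metadata file), plus a regular-expression variant of the kind mentioned in the next paragraph:

```python
from pyspark.sql import functions as F

# filter and where are interchangeable.
mid_rated = movies.filter(
    (F.col("vote_average") >= 7.5) & (F.col("vote_average") <= 8.2)
)

# The same condition, written with where() and between().
mid_rated_too = movies.where(F.col("vote_average").between(7.5, 8.2))

# A regular-expression filter on the title, case-insensitive.
toy_movies = movies.filter(F.col("title").rlike("(?i)^toy"))
```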

Filters support all the SQL-like features such as filtering using comparison operators, regular expressions and bitwise operators.

Filtering out null and not-null values is one of the most common use cases in querying. Spark provides simple isNull and isNotNull operations on a column object.
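
For example (assuming the title column from earlier):

```python
from pyspark.sql import functions as F

with_title = movies.filter(F.col("title").isNotNull())
missing_title = movies.filter(F.col("title").isNull())
```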

Aggregates

Aggregations are at the centre of the massive effort of processing large-scale data, as it all usually comes down to BI dashboards and ML, both of which require aggregation of one sort or another. Using the Spark SQL library, you can achieve almost everything you could in a traditional relational database or a data warehouse query engine. Here’s an example showing how aggregation is done in Spark.
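
A sketch using the ratings DataFrame from earlier; the column names follow the MovieLens ratings file:

```python
from pyspark.sql import functions as F

# Count and average the ratings per movie, then sort by popularity.
rating_stats = (
    ratings.groupBy("movieId")
    .agg(
        F.count("rating").alias("num_ratings"),
        F.avg("rating").alias("avg_rating"),
        F.min("rating").alias("min_rating"),
        F.max("rating").alias("max_rating"),
    )
    .orderBy(F.desc("num_ratings"))
)

rating_stats.show(10)
```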

Window Functions & Sorting

As with most analysis engines, window functions have become quite the standard with rank , dense_rank , etc., being heavily used. Spark utilizes the traditional SQL based window function syntax of rank() over (partition by something order by something_else desc) .

Please note that sort and orderBy can be used interchangeably in Spark, except in window specifications, where only orderBy is available.
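
A sketch that ranks each user’s ratings, mirroring rank() over (partition by userId order by rating desc); it builds on the ratings DataFrame from earlier:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# The window spec: partition by user, order by rating, highest first.
w = Window.partitionBy("userId").orderBy(F.desc("rating"))

ranked = (
    ratings
    .withColumn("rank", F.rank().over(w))
    .withColumn("dense_rank", F.dense_rank().over(w))
)

# Outside a window spec, sort and orderBy are interchangeable;
# the window spec itself only exposes orderBy.
ranked.sort(F.desc("rating")).show(10)
```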

These were some examples that I compiled. Obviously, there’s much more to Spark than can fit in a cheat sheet. If you’re interested or haven’t found anything useful here, head over to the documentation; it’s pretty good.

