A short guide to the PySpark DataFrames API
Jun 14 · 5 min read
Spark is one of the major players in the data engineering and data science space today. With ever-increasing requirements to crunch more data, businesses have frequently incorporated Spark into their data stacks to process large amounts of data quickly. Spark is maintained by Apache, and the main commercial player in the Spark ecosystem is Databricks (founded by the original creators of Spark). Spark has seen extensive adoption across all kinds of companies and setups, on-prem and in the cloud. Some of the most popular cloud offerings that use Spark underneath are AWS Glue, Google Dataproc and Azure Databricks.
No technology, no programming language is good enough for all use cases. Spark is one of the many technologies used for solving the large-scale data analysis and ETL problem. Having worked on Spark for a bit now, I thought of compiling a cheatsheet with real examples. Although there are a lot of resources on using Spark with Scala, I couldn't find a halfway decent cheat sheet except for the one here on Datacamp, but I thought it needed an update and needed to be just a bit more extensive than a one-pager.
First off, it's worth going through a decent introduction to how Spark works.
Configuration & Initialization
Before you get into what lines of code you have to write to get your PySpark notebook/application up and running, you should know a little bit about SparkContext, SparkSession and SQLContext.

- SparkContext: provides the connection to Spark with the ability to create RDDs
- SQLContext: provides the connection to Spark with the ability to run SQL queries on data
- SparkSession: an all-encompassing context which includes coverage for SparkContext, SQLContext and HiveContext
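Here is a minimal sketch of how you would typically initialize these in a PySpark application; the application name is just an illustrative placeholder.

from pyspark.sql import SparkSession

# A SparkSession wraps SparkContext, SQLContext and HiveContext for you
spark = (
    SparkSession.builder
    .appName("pyspark-cheatsheet")  # illustrative name
    .getOrCreate()
)

sc = spark.sparkContext  # the underlying SparkContext, if you need raw RDDs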
We’ll be using the MovieLens database in some of the examples. Here’s the link to that database. You can go ahead and download it from Kaggle.
Reading Data
Spark supports reading from various data sources like CSV, Text, Parquet, Avro, JSON. It also supports reading from Hive and any database that has a JDBC channel available. Here’s how you read a CSV in Spark —
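A minimal sketch, assuming the MovieLens movies_metadata.csv file sits at an illustrative local path:

# Point the path at wherever you downloaded the dataset
movies = spark.read.csv(
    "data/movies_metadata.csv",
    header=True,       # first row contains column names
    inferSchema=True,  # let Spark guess the column types
)
movies.printSchema()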
Throughout your Spark journey, you'll find that there are many ways of writing the same line of code to achieve the same result. Many functions have aliases (e.g., dropDuplicates and drop_duplicates). Here's an example displaying a couple of ways of reading files in Spark.
Writing Data
Once you’re done transforming your data, you’d want to write it on some kind of persistent storage. Here’s an example showing two different ways to write a Parquet file to disk —
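A sketch of the two equivalent styles, assuming movies is the DataFrame read earlier; output paths are illustrative.

# Shortcut writer
movies.write.parquet("output/movies", mode="overwrite")

# Generic writer with an explicit format
(
    movies.write
    .format("parquet")
    .mode("overwrite")
    .save("output/movies_alt")
)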
Obviously, based on your consumption patterns and requirements, you can use similar commands to write other file formats to disk too. When writing to a Hive table, you can use bucketBy instead of partitionBy.
The idea behind both bucketBy and partitionBy is to skip over the data that doesn't need to be queried, i.e., to prune the partitions. It's an old concept which comes from traditional relational database partitioning.
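A rough sketch of both, reusing the ratings DataFrame from the earlier snippets; note that bucketBy requires saving as a table rather than to a plain file path.

# Partition the output directory by a column's values
(
    ratings.write
    .partitionBy("rating")
    .mode("overwrite")
    .parquet("output/ratings_partitioned")
)

# Bucketing only works with saveAsTable
(
    ratings.write
    .bucketBy(16, "movieId")
    .sortBy("movieId")
    .mode("overwrite")
    .saveAsTable("ratings_bucketed")
)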
Creating DataFrames
Apart from the direct method df = spark.read.csv(csv_file_path) you saw in the Reading Data section above, there's one other way to create DataFrames and that is using the Row construct of SparkSQL.
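A small sketch using made-up movie rows:

from pyspark.sql import Row

rows = [
    Row(movie_id=1, title="Toy Story", rating=8.3),
    Row(movie_id=2, title="Jumanji", rating=7.0),
]
df_from_rows = spark.createDataFrame(rows)
df_from_rows.show()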
There's one more option, where you can use either the .parallelize or .textFile feature of Spark to represent a file as an RDD. To convert it into a DataFrame, you'd obviously need to specify a schema. That's where pyspark.sql.types comes into the picture.
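A sketch with parallelize and an explicit schema; the column names and types are made up for illustration.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Represent some in-memory records as an RDD
rdd = spark.sparkContext.parallelize([
    (1, "Toy Story", 8.3),
    (2, "Jumanji", 7.0),
])

# Describe the columns so Spark can build a DataFrame from the RDD
schema = StructType([
    StructField("movie_id", IntegerType(), True),
    StructField("title", StringType(), True),
    StructField("rating", DoubleType(), True),
])

df_from_rdd = spark.createDataFrame(rdd, schema)
df_from_rdd.printSchema()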
We'll be using a lot of SQL-like functionality in PySpark, so please take a couple of minutes to familiarize yourself with the following documentation.
Modifying DataFrames
DataFrames abstract away RDDs. Datasets do the same, but Datasets don't come with a tabular, relational-database-table-like representation of the RDDs the way DataFrames do. For that reason, DataFrames support operations similar to what you'd usually perform on a database table, i.e., changing the table structure by adding, removing and modifying columns. Spark provides all of this functionality in the DataFrames API. Here's how it goes:
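For example, a minimal sketch using withColumn; the runtime and budget columns are assumptions about the movies metadata, not something the original spells out.

from pyspark.sql import functions as F

movies_mod = (
    movies
    .withColumn("is_long", F.col("runtime") > 120)          # add a derived boolean column
    .withColumn("budget", F.col("budget").cast("double"))   # modify an existing column's type
)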
Aside from just creating new columns, we can also rename existing columns using the following method —
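A one-liner sketch with withColumnRenamed; the column name is only illustrative.

movies_renamed = movies.withColumnRenamed("original_title", "title_original")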
And, if we have to drop a column or multiple columns, here’s how we do it —
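drop takes one or more column names; again, the names here are only illustrative.

movies_slim = movies.drop("homepage")                                # drop a single column
movies_slimmer = movies.drop("homepage", "tagline", "poster_path")   # or several at once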
Joins
The whole idea behind using a SQL-like interface for Spark is that there's a lot of data that can be represented in a loose relational model, i.e., a model with tables but without ACID guarantees, integrity checks, etc. Given that, we can expect a lot of joins to happen. Spark provides full support for joining two or more datasets. Here's how:
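A sketch joining the movies metadata with per-movie average ratings; the id and movieId join keys are assumptions about the dataset's column names.

from pyspark.sql import functions as F

avg_ratings = ratings.groupBy("movieId").agg(F.avg("rating").alias("avg_rating"))

movies_with_ratings = movies.join(
    avg_ratings,
    movies["id"] == avg_ratings["movieId"],
    how="inner",  # other options: left, right, outer, left_semi, left_anti
)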
Filters
Filters are just WHERE clauses, like in SQL. In fact, you can use filter and where interchangeably in Spark. Here's an example of filtering movies rated between 7.5 and 8.2 in the MovieLens database's movie metadata file.
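A sketch, assuming the rating lives in a vote_average column:

from pyspark.sql import functions as F

well_rated = movies.filter(
    (F.col("vote_average") >= 7.5) & (F.col("vote_average") <= 8.2)
)

# where() is an alias for filter(), and SQL-style expression strings work too
well_rated_alt = movies.where("vote_average BETWEEN 7.5 AND 8.2")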
Filters support all the SQL-like features such as filtering using comparison operators, regular expressions and bitwise operators.
Filtering out null and not-null values is one of the most common use cases in querying. Spark provides simple isNull and isNotNull operations on a column object.
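For instance (the tagline column is an assumption about the metadata file):

from pyspark.sql import functions as F

no_tagline = movies.filter(F.col("tagline").isNull())
with_tagline = movies.filter(F.col("tagline").isNotNull())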
Aggregates
Aggregations are at the centre of the massive effort of processing large-scale data, as it all usually comes down to BI dashboards and ML, both of which require aggregation of one sort or another. Using the SparkSQL library, you can achieve almost everything you can in a traditional relational database or a data warehouse query engine. Here's an example showing how aggregation is done in Spark.
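A sketch computing per-movie rating statistics from the ratings DataFrame used in the earlier snippets:

from pyspark.sql import functions as F

rating_stats = (
    ratings
    .groupBy("movieId")
    .agg(
        F.count("*").alias("num_ratings"),
        F.avg("rating").alias("avg_rating"),
        F.min("rating").alias("min_rating"),
        F.max("rating").alias("max_rating"),
    )
)
rating_stats.show(5)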
Window Functions & Sorting
As with most analysis engines, window functions have become quite the standard, with rank, dense_rank, etc., being heavily used. Spark utilizes the traditional SQL-based window function syntax of rank() over (partition by something order by something_else desc).
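In PySpark that translates into a Window specification plus the rank functions. The sketch below ranks each user's ratings from highest to lowest; userId and rating are assumed column names.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Equivalent to: rank() over (partition by userId order by rating desc)
w = Window.partitionBy("userId").orderBy(F.desc("rating"))

ranked = (
    ratings
    .withColumn("rank", F.rank().over(w))
    .withColumn("dense_rank", F.dense_rank().over(w))
)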
Please note that sort and orderBy can be used interchangeably in Spark, except when it comes to window functions.
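For plain DataFrame sorting, either call works, continuing from the ranked DataFrame above:

from pyspark.sql import functions as F

ranked.sort("userId", F.desc("rank")).show(5)      # sort and orderBy are aliases here
ranked.orderBy("userId", F.desc("rank")).show(5)

# Inside a window specification, however, only orderBy exists:
# Window.partitionBy("userId").orderBy(F.desc("rating"))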
These were some examples that I compiled. Obviously there’s much more to Spark than a cheatsheet. If you’re interested or haven’t found anything useful here, head over to the documentation — it’s pretty good.