Ultimate PySpark Cheat Sheet


A short guide to the PySpark DataFrames API


Spark is one of the major players in the data engineering and data science space today. With ever-increasing requirements to crunch more data, businesses have frequently incorporated Spark into their data stacks to process large amounts of data quickly. The project is maintained by Apache, and the main commercial player in the Spark ecosystem is Databricks (founded by the original creators of Spark). Spark has seen extensive adoption across all kinds of companies and setups, both on-prem and in the cloud. Some of the most popular cloud offerings that use Spark underneath are AWS Glue, Google Dataproc and Azure Databricks.

No technology and no programming language is good enough for all use cases. Spark is one of the many technologies used for solving large-scale data analysis and ETL problems. Having worked on Spark for a while now, I thought of compiling a cheat sheet with real examples. Although there are a lot of resources on using Spark with Scala, I couldn't find a halfway decent cheat sheet except for the one here on Datacamp, and I thought it needed an update and could be a bit more extensive than a one-pager.

First off, a decent introduction to how Spark works —

Configuration & Initialization

Before you get into what lines of code you have to write to get your PySpark notebook/application up and running, you should know a little bit about SparkContext, SparkSession and SQLContext (a minimal initialization sketch follows the list below).

  • SparkContext — provides connection to Spark with the ability to create RDDs
  • SQLContext — provides connection to Spark with the ability to run SQL queries on data
  • SparkSession — all-encompassing context which includes coverage for SparkContext , SQLContext and HiveContext .
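Putting that together, here's a minimal initialization sketch; the application name is just a placeholder.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; the SparkContext and the SQL entry point
# are reachable from it as spark.sparkContext and spark.sql(...).
spark = (
    SparkSession.builder
    .appName("pyspark-cheat-sheet")  # hypothetical app name
    .getOrCreate()
)
```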

We'll be using the MovieLens dataset in some of the examples. Here's the link to that dataset. You can go ahead and download it from Kaggle.

Reading Data

Spark supports reading from various data sources like CSV, Text, Parquet, Avro, JSON. It also supports reading from Hive and any database that has a JDBC channel available. Here’s how you read a CSV in Spark —
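A minimal sketch, assuming a SparkSession named spark and a placeholder path to the MovieLens ratings file:

```python
# Read a CSV file with a header row, letting Spark infer the column types.
ratings = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data/ratings.csv")  # placeholder path
)
ratings.printSchema()
ratings.show(5)
```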

Throughout your Spark journey, you’ll find that there are many ways of writing the same line of code to achieve the same result. Many functions have aliases (e.g., dropDuplicates and drop_duplicates ). Here’s an example displaying a couple of ways of reading files in Spark.
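A sketch of both styles, with hypothetical file paths:

```python
# Shorthand reader method with keyword options ...
df1 = spark.read.csv("data/movies_metadata.csv", header=True, inferSchema=True)

# ... and the equivalent generic format/option/load form.
df2 = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/movies_metadata.csv")
)
```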

Writing Data

Once you’re done transforming your data, you’d want to write it on some kind of persistent storage. Here’s an example showing two different ways to write a Parquet file to disk —
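A sketch of both styles, assuming a DataFrame df and hypothetical output paths and column names:

```python
# Shorthand writer method ...
df.write.mode("overwrite").parquet("output/movies.parquet")

# ... and the equivalent generic format/save form, partitioned by a column.
(
    df.write
    .format("parquet")
    .mode("overwrite")
    .partitionBy("release_year")  # hypothetical partition column
    .save("output/movies_by_year.parquet")
)
```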

Obviously, based on your consumption patterns and requirements, you can use similar commands for writing other file formats to disk too. When writing to a Hive table, you can use bucketBy instead of partitionBy.

The idea behind both bucketBy and partitionBy is to skip the data that doesn't need to be scanned for a query, i.e., to prune partitions. It's an old concept that comes from traditional relational database partitioning.
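A sketch of the bucketed variant, assuming a DataFrame df and hypothetical column and table names; bucketBy applies when writing to a table with saveAsTable rather than to a plain path:

```python
# partitionBy writes one directory per distinct value of the column(s);
# bucketBy hashes rows into a fixed number of buckets within the table.
(
    df.write
    .mode("overwrite")
    .bucketBy(16, "movieId")         # hypothetical bucket column
    .sortBy("movieId")
    .saveAsTable("movies_bucketed")  # hypothetical table name
)
```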

Creating DataFrames

Apart from the direct method df = spark.read.csv(csv_file_path) you saw in the Reading Data section above, there's one other way to create DataFrames, and that is using the Row construct of SparkSQL.
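A small sketch using Row with made-up movie records:

```python
from pyspark.sql import Row

# Build a DataFrame from a list of Row objects.
rows = [
    Row(movie_id=1, title="Toy Story", rating=8.3),
    Row(movie_id=2, title="Jumanji", rating=7.0),
]
movies_df = spark.createDataFrame(rows)
movies_df.show()
```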

There's one more option, where you can use the .parallelize or .textFile feature of Spark to represent a file as an RDD. To convert it into a DataFrame, you'd obviously need to specify a schema. That's where pyspark.sql.types comes into the picture.
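A sketch of that route, assuming a SparkSession named spark; the columns and values are made up:

```python
from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, DoubleType,
)

# Represent the data as an RDD first ...
rdd = spark.sparkContext.parallelize([
    (1, "Toy Story", 8.3),
    (2, "Jumanji", 7.0),
])

# ... then attach an explicit schema to turn it into a DataFrame.
schema = StructType([
    StructField("movie_id", IntegerType(), True),
    StructField("title", StringType(), True),
    StructField("rating", DoubleType(), True),
])
movies_df = spark.createDataFrame(rdd, schema)
movies_df.printSchema()
```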

We'll be using a lot of SQL-like functionality in PySpark, so please take a couple of minutes to familiarize yourself with the following documentation.

Modifying DataFrames

DataFrames abstract away RDDs. Datasets do too, but Datasets don't come with a tabular, relational-database-table-like representation of the RDDs; DataFrames do. For that reason, DataFrames support operations similar to the ones you'd usually perform on a database table, i.e., changing the table structure by adding, removing and modifying columns. Spark provides all of this functionality in the DataFrames API. Here's how it goes —
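A sketch of adding and modifying columns with withColumn, assuming a DataFrame movies_df with a rating column:

```python
from pyspark.sql import functions as F

# Add a new column derived from an existing one ...
movies_df = movies_df.withColumn("rating_pct", F.col("rating") * 10)

# ... or overwrite an existing column in place, e.g. by casting its type.
movies_df = movies_df.withColumn("rating", F.col("rating").cast("double"))
```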

Aside from just creating new columns, we can also rename existing columns using the following method —
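For example, with hypothetical column names:

```python
# Rename a single column; chain the call to rename several.
movies_df = (
    movies_df
    .withColumnRenamed("rating", "imdb_rating")
    .withColumnRenamed("title", "movie_title")
)
```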

And, if we have to drop a column or multiple columns, here’s how we do it —
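For example (the column names are placeholders):

```python
# Drop a single column ...
movies_df = movies_df.drop("rating_pct")

# ... or several columns at once.
movies_df = movies_df.drop("budget", "homepage")
```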

Joins

The whole idea behind using a SQL-like interface for Spark is that there's a lot of data that can be represented in a loose relational model, i.e., a model with tables but without ACID guarantees, integrity checks, etc. Given that, we can expect a lot of joins to happen. Spark provides full support for joining two or more datasets. Here's how —
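A sketch assuming two DataFrames, movies_df and ratings_df, with hypothetical key column names:

```python
# Inner join on a shared key column; other join types ("left", "right",
# "full", "left_semi", "left_anti") go in the how argument.
joined = movies_df.join(ratings_df, on="movie_id", how="inner")

# When the key columns are named differently, use an explicit condition.
joined_left = movies_df.join(
    ratings_df,
    movies_df["movie_id"] == ratings_df["movieId"],
    how="left",
)
```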

Filters

Filters are just WHERE clauses, exactly like in SQL. In fact, you can use filter and where interchangeably in Spark. Here's an example of filtering movies rated between 7.5 and 8.2 in the MovieLens dataset's movie metadata file.
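A sketch assuming the metadata has been read into movies_df and has a vote_average column:

```python
from pyspark.sql import functions as F

# filter and where are aliases; the two lines below are equivalent.
well_rated = movies_df.filter(F.col("vote_average").between(7.5, 8.2))
well_rated = movies_df.where("vote_average BETWEEN 7.5 AND 8.2")
well_rated.select("title", "vote_average").show(10)
```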

Filters support all the SQL-like features such as filtering using comparison operators, regular expressions and bitwise operators.
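For instance, combining a regular expression with a comparison (column names as above):

```python
from pyspark.sql import functions as F

# Comparison operators, regular expressions and boolean combinations
# can all be mixed inside a single filter.
toy_films = movies_df.filter(
    F.col("title").rlike("(?i)toy") & (F.col("vote_average") > 7.0)
)
```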

Filtering out null and not-null values is one of the most common use cases in querying. Spark provides simple isNull and isNotNull operations on a column object.
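For example, assuming a release_date column:

```python
from pyspark.sql import functions as F

# Keep only the rows where the column is populated ...
with_dates = movies_df.filter(F.col("release_date").isNotNull())

# ... or look at the rows where it is missing.
missing_dates = movies_df.filter(F.col("release_date").isNull())
```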

Aggregates

Aggregations are at the centre of the massive effort of processing large-scale data, as it usually all comes down to BI dashboards and ML, both of which require aggregation of one sort or another. Using the SparkSQL library, you can achieve almost everything you can in a traditional relational database or a data warehouse query engine. Here's an example showing how aggregation is done in Spark.
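A sketch assuming a ratings_df DataFrame with movieId and rating columns:

```python
from pyspark.sql import functions as F

# Average rating and number of ratings per movie, highest-rated first.
agg_df = (
    ratings_df
    .groupBy("movieId")
    .agg(
        F.avg("rating").alias("avg_rating"),
        F.count("rating").alias("num_ratings"),
    )
    .orderBy(F.col("avg_rating").desc())
)
agg_df.show(10)
```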

Window Functions & Sorting

As with most analysis engines, window functions have become quite the standard with rank , dense_rank , etc., being heavily used. Spark utilizes the traditional SQL based window function syntax of rank() over (partition by something order by something_else desc) .

Please note that sort and orderBy can be used interchangeably in Spark, except in window functions, where only orderBy is accepted when defining the window specification.
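A sketch assuming movies_df has release_year and vote_average columns:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rank() over (partition by release_year order by vote_average desc),
# expressed with the DataFrame API.
w = Window.partitionBy("release_year").orderBy(F.col("vote_average").desc())
ranked = movies_df.withColumn("rank_in_year", F.rank().over(w))

# For plain sorting of a DataFrame, sort and orderBy are interchangeable.
ranked.sort("release_year", "rank_in_year").show(20)
```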

These were some examples that I compiled. Obviously, there's much more to Spark than a cheat sheet can cover. If you're interested or haven't found anything useful here, head over to the documentation; it's pretty good.

