A short guide to the PySpark DataFrames API
Jun 14 · 5 min read
Spark is one of the major players in the data engineering and data science space today. With ever-increasing requirements to crunch more data, businesses have frequently incorporated Spark into their data stacks to process large amounts of data quickly. Spark is maintained by Apache, and the main commercial player in the Spark ecosystem is Databricks (founded by the original creators of Spark). Spark has seen extensive adoption across all kinds of companies and setups, on-prem and in the cloud. Some of the most popular cloud offerings that use Spark underneath are AWS Glue, Google Dataproc and Azure Databricks.
No technology, no programming language is good enough for all use cases. Spark is one of the many technologies used for solving the large-scale data analysis and ETL problem. Having worked on Spark for a bit now, I thought of compiling a cheatsheet with real examples. Although there are a lot of resources on using Spark with Scala, I couldn't find a halfway decent cheat sheet except for the one here on Datacamp, but I thought it needed an update and needed to be just a bit more extensive than a one-pager.
First off, it's worth going through a decent introduction to how Spark works.
Configuration & Initialization
Before you get into what lines of code you have to write to get your PySpark notebook/application up and running, you should know a little bit about SparkContext, SparkSession and SQLContext.

- SparkContext: provides the connection to Spark with the ability to create RDDs
- SQLContext: provides the connection to Spark with the ability to run SQL queries on data
- SparkSession: an all-encompassing context which includes coverage for SparkContext, SQLContext and HiveContext
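Here is a minimal sketch of how you would typically initialize these in a PySpark application; the application name is just an illustrative placeholder.

from pyspark.sql import SparkSession

# A SparkSession wraps SparkContext, SQLContext and HiveContext for you
spark = (
    SparkSession.builder
    .appName("pyspark-cheatsheet")  # illustrative name
    .getOrCreate()
)

sc = spark.sparkContext  # the underlying SparkContext, if you need raw RDDs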
We’ll be using the MovieLens database in some of the examples. Here’s the link to that database. You can go ahead and download it from Kaggle.
Reading Data
Spark supports reading from various data sources like CSV, Text, Parquet, Avro, JSON. It also supports reading from Hive and any database that has a JDBC channel available. Here’s how you read a CSV in Spark —
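A minimal sketch, assuming the MovieLens movies_metadata.csv file sits at an illustrative local path:

# Point the path at wherever you downloaded the dataset
movies = spark.read.csv(
    "data/movies_metadata.csv",
    header=True,       # first row contains column names
    inferSchema=True,  # let Spark guess the column types
)
movies.printSchema()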
Throughout your Spark journey, you'll find that there are many ways of writing the same line of code to achieve the same result. Many functions have aliases (e.g., dropDuplicates and drop_duplicates). Here's an example displaying a couple of ways of reading files in Spark.
Writing Data
Once you’re done transforming your data, you’d want to write it on some kind of persistent storage. Here’s an example showing two different ways to write a Parquet file to disk —
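A sketch of the two equivalent styles, assuming movies is the DataFrame read earlier; output paths are illustrative.

# Shortcut writer
movies.write.parquet("output/movies", mode="overwrite")

# Generic writer with an explicit format
(
    movies.write
    .format("parquet")
    .mode("overwrite")
    .save("output/movies_alt")
)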
Obviously, based on your consumption patterns and requirements, you can use similar commands to write other file formats to disk too. When writing to a Hive table, you can use bucketBy instead of partitionBy.
The idea behind both bucketBy and partitionBy is to skip over the data that doesn't need to be queried, i.e., to prune the partitions. It's an old concept which comes from traditional relational database partitioning.
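A rough sketch of both, reusing the ratings DataFrame from the earlier snippets; note that bucketBy requires saving as a table rather than to a plain file path.

# Partition the output directory by a column's values
(
    ratings.write
    .partitionBy("rating")
    .mode("overwrite")
    .parquet("output/ratings_partitioned")
)

# Bucketing only works with saveAsTable
(
    ratings.write
    .bucketBy(16, "movieId")
    .sortBy("movieId")
    .mode("overwrite")
    .saveAsTable("ratings_bucketed")
)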
Creating DataFrames
Apart from the direct method df = spark.read.csv(csv_file_path) you saw in the Reading Data section above, there's one other way to create DataFrames and that is using the Row construct of SparkSQL.
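A small sketch using made-up movie rows:

from pyspark.sql import Row

rows = [
    Row(movie_id=1, title="Toy Story", rating=8.3),
    Row(movie_id=2, title="Jumanji", rating=7.0),
]
df_from_rows = spark.createDataFrame(rows)
df_from_rows.show()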
There's one more option, where you can use either the .parallelize or .textFile feature of Spark to represent a file as an RDD. To convert it into a DataFrame, you'd obviously need to specify a schema. That's where pyspark.sql.types comes into the picture.
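A sketch with parallelize and an explicit schema; the column names and types are made up for illustration.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Represent some in-memory records as an RDD
rdd = spark.sparkContext.parallelize([
    (1, "Toy Story", 8.3),
    (2, "Jumanji", 7.0),
])

# Describe the columns so Spark can build a DataFrame from the RDD
schema = StructType([
    StructField("movie_id", IntegerType(), True),
    StructField("title", StringType(), True),
    StructField("rating", DoubleType(), True),
])

df_from_rdd = spark.createDataFrame(rdd, schema)
df_from_rdd.printSchema()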
We'll be using a lot of SQL-like functionality in PySpark, so please take a couple of minutes to familiarize yourself with the following documentation.
Modifying DataFrames
DataFrames abstract away RDDs. Datasets do the same, but Datasets don't come with a tabular, relational-database-table-like representation of the RDDs the way DataFrames do. For that reason, DataFrames support operations similar to what you'd usually perform on a database table, i.e., changing the table structure by adding, removing and modifying columns. Spark provides all of this functionality in the DataFrames API. Here's how it goes:
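For example, a minimal sketch using withColumn; the runtime and budget columns are assumptions about the movies metadata, not something the original spells out.

from pyspark.sql import functions as F

movies_mod = (
    movies
    .withColumn("is_long", F.col("runtime") > 120)          # add a derived boolean column
    .withColumn("budget", F.col("budget").cast("double"))   # modify an existing column's type
)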
Aside from just creating new columns, we can also rename existing columns using the following method —
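A one-liner sketch with withColumnRenamed; the column name is only illustrative.

movies_renamed = movies.withColumnRenamed("original_title", "title_original")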
And, if we have to drop a column or multiple columns, here’s how we do it —
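drop takes one or more column names; again, the names here are only illustrative.

movies_slim = movies.drop("homepage")                                # drop a single column
movies_slimmer = movies.drop("homepage", "tagline", "poster_path")   # or several at once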
Joins
The whole idea behind using a SQL-like interface for Spark is that there's a lot of data that can be represented in a loose relational model, i.e., a model with tables but without ACID guarantees, integrity checks, etc. Given that, we can expect a lot of joins to happen. Spark provides full support for joining two or more datasets. Here's how:
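A sketch joining the movies metadata with per-movie average ratings; the id and movieId join keys are assumptions about the dataset's column names.

from pyspark.sql import functions as F

avg_ratings = ratings.groupBy("movieId").agg(F.avg("rating").alias("avg_rating"))

movies_with_ratings = movies.join(
    avg_ratings,
    movies["id"] == avg_ratings["movieId"],
    how="inner",  # other options: left, right, outer, left_semi, left_anti
)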
Filters
Filters are just WHERE clauses, like in SQL. In fact, you can use filter and where interchangeably in Spark. Here's an example of filtering movies rated between 7.5 and 8.2 in the MovieLens database's movie metadata file.
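A sketch, assuming the rating lives in a vote_average column:

from pyspark.sql import functions as F

well_rated = movies.filter(
    (F.col("vote_average") >= 7.5) & (F.col("vote_average") <= 8.2)
)

# where() is an alias for filter(), and SQL-style expression strings work too
well_rated_alt = movies.where("vote_average BETWEEN 7.5 AND 8.2")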
Filters support all the SQL-like features such as filtering using comparison operators, regular expressions and bitwise operators.
Filtering out null and not-null values is one of the most common use cases in querying. Spark provides simple isNull and isNotNull operations on a column object.
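For instance (the tagline column is an assumption about the metadata file):

from pyspark.sql import functions as F

no_tagline = movies.filter(F.col("tagline").isNull())
with_tagline = movies.filter(F.col("tagline").isNotNull())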
Aggregates
Aggregations are at the centre of the massive effort of processing large-scale data, as it all usually comes down to BI dashboards and ML, both of which require aggregation of one sort or another. Using the SparkSQL library, you can achieve almost everything you can in a traditional relational database or a data warehouse query engine. Here's an example showing how aggregation is done in Spark.
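A sketch computing per-movie rating statistics from the ratings DataFrame used in the earlier snippets:

from pyspark.sql import functions as F

rating_stats = (
    ratings
    .groupBy("movieId")
    .agg(
        F.count("*").alias("num_ratings"),
        F.avg("rating").alias("avg_rating"),
        F.min("rating").alias("min_rating"),
        F.max("rating").alias("max_rating"),
    )
)
rating_stats.show(5)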
Window Functions & Sorting
As with most analysis engines, window functions have become quite the standard, with rank, dense_rank, etc., being heavily used. Spark utilizes the traditional SQL-based window function syntax of rank() over (partition by something order by something_else desc).
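In PySpark that translates into a Window specification plus the rank functions. The sketch below ranks each user's ratings from highest to lowest; userId and rating are assumed column names.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Equivalent to: rank() over (partition by userId order by rating desc)
w = Window.partitionBy("userId").orderBy(F.desc("rating"))

ranked = (
    ratings
    .withColumn("rank", F.rank().over(w))
    .withColumn("dense_rank", F.dense_rank().over(w))
)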
Please note that sort and orderBy can be used interchangeably in Spark, except when it comes to window functions.
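For plain DataFrame sorting, either call works, continuing from the ranked DataFrame above:

from pyspark.sql import functions as F

ranked.sort("userId", F.desc("rank")).show(5)      # sort and orderBy are aliases here
ranked.orderBy("userId", F.desc("rank")).show(5)

# Inside a window specification, however, only orderBy exists:
# Window.partitionBy("userId").orderBy(F.desc("rating"))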
These were some examples that I compiled. Obviously there’s much more to Spark than a cheatsheet. If you’re interested or haven’t found anything useful here, head over to the documentation — it’s pretty good.