
Production Data Processing with Apache Spark

Using the AWS CLI to submit PySpark applications on a cluster, a step-by-step guide

Apache Spark has been all the rage for large scale data processing
and analytics — for good reason. With Spark, organizations are
able to extract a ton of value from their ever-growing piles of data.
Because of this, data scientists and engineers who can build Spark
applications are highly valued by businesses. This article will show
you how to run your Spark application on an Amazon EMR cluster
from the command line.

Most of the PySpark tutorials out there use Jupyter notebooks to
demonstrate Spark’s data processing and machine learning
functionality. The reason is simple. When working on a cluster,
notebooks make it much easier to test syntax and debug Spark
applications by giving you quick feedback and presenting error
messages within the UI. Otherwise, you would have to dig through
log files to figure out what went wrong — not ideal for learning.

Once you’re confident your code works, you may want to integrate
your Spark application into your systems. Here, notebooks are
much less useful. To run PySpark on a schedule, we need to move
our code from a notebook to a Python script and submit that script
to a cluster.
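On a cluster you are already logged in to, submitting a script usually means handing it to spark-submit; later in this guide we do the equivalent by registering the script as an EMR step, which runs spark-submit for us. A minimal sketch (pyspark_job.py is the example script defined later in this article):

# Submit a PySpark script to a YARN cluster in cluster deploy mode.
spark-submit --master yarn --deploy-mode cluster pyspark_job.py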

Submitting Spark applications to a cluster from the command line can be intimidating at first. My goal is to demystify the process. This guide will show you how to use the AWS Command Line Interface to:
1. Create a cluster that can handle datasets much larger than what
fits on your local machine.
2. Submit a Spark application to the cluster that reads data, processes it, and stores the results in an accessible location.
3. Auto-terminate the cluster once the step is complete, so you
only pay for the cluster while you’re using it.

Spark Development Workflow


When developing Spark applications for processing data or
running machine learning models, my preference is to start by
using a Jupyter notebook for the reasons stated above. Here’s a
guide to creating an Amazon EMR cluster and connecting to it
with a Jupyter notebook.

Once I know my code works, I may want to run the process as a scheduled job. I’ll put the code in a script so I can schedule it with cron or Apache Airflow.
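For example, a cron entry could kick off the cluster-creation command covered later in this guide. This is just a sketch, assuming the full aws emr create-cluster command is wrapped in a script with the hypothetical name run_spark_job.sh:

# Run the EMR job every day at 02:00 and append the CLI output to a log file.
0 2 * * * /home/ubuntu/run_spark_job.sh >> /home/ubuntu/logs/spark_job.log 2>&1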

Production Spark Applications


Create your AWS account if you haven’t
already. Install and configure the AWS Command Line Interface.
To configure the AWS CLI, you’ll need to add your credentials.
You can create credentials by following these instructions. You’ll
also need to specify your default region. For this tutorial, we’re
using us-west-2. You can use whichever region you want. Just be
sure to use the same region for all of your resources.
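Configuring the CLI is typically a single interactive command; the values below are placeholders for your own credentials:

# Writes your credentials and defaults to ~/.aws/credentials and ~/.aws/config.
aws configure
# AWS Access Key ID [None]: <your-access-key-id>
# AWS Secret Access Key [None]: <your-secret-access-key>
# Default region name [None]: us-west-2
# Default output format [None]: json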

Defining a Spark Application


For this example, we’ll load Amazon book review data from S3,
perform basic processing, and calculate some aggregates. We’ll
then write our aggregated data frame back to S3.

The example is simple, but this is a common workflow for Spark.

1. Read the data from a source (S3 in this example).
2. Process the data or execute a model workflow with Spark ML.
3. Write the results somewhere accessible to our systems (another S3 bucket in this example).

If you haven’t already, create an S3 bucket now. Make sure the region you create the bucket in is the same region you use for the rest of this tutorial. I’ll be using the region “US West (Oregon)”. Copy the file below. Be sure to edit the output_path in main() to use your S3 bucket. Then upload pyspark_job.py to your bucket.
# pyspark_job.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def create_spark_session():
    """Create spark session.

    Returns:
        spark (SparkSession) - spark session connected to AWS EMR cluster
    """
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages",
                "org.apache.hadoop:hadoop-aws:2.7.0") \
        .getOrCreate()
    return spark


def process_book_data(spark, input_path, output_path):
    """Process the book review data and write to S3.

    Arguments:
        spark (SparkSession) - spark session connected to AWS EMR cluster
        input_path (str) - AWS S3 bucket path for source data
        output_path (str) - AWS S3 bucket for writing processed data
    """
    df = spark.read.parquet(input_path)
    # Apply some basic filters and aggregate by product_title.
    book_agg = df.filter(df.verified_purchase == 'Y') \
        .groupBy('product_title') \
        .agg({'star_rating': 'avg', 'review_id': 'count'}) \
        .filter(F.col('count(review_id)') >= 500) \
        .sort(F.desc('avg(star_rating)')) \
        .select(F.col('product_title').alias('book_title'),
                F.col('count(review_id)').alias('review_count'),
                F.col('avg(star_rating)').alias('review_avg_stars'))
    # Save the data to your S3 bucket as a .parquet file.
    book_agg.write.mode('overwrite') \
        .save(output_path)


def main():
    spark = create_spark_session()
    input_path = ('s3://amazon-reviews-pds/parquet/' +
                  'product_category=Books/*.parquet')
    output_path = 's3://spark-tutorial-bwl/book-aggregates'
    process_book_data(spark, input_path, output_path)


if __name__ == '__main__':
    main()
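If you prefer to stay on the command line, a couple of AWS CLI calls can create the bucket and upload the script. This is a minimal sketch; the bucket name is a placeholder, so substitute your own bucket and region:

# Create an S3 bucket in us-west-2 (use your own, globally unique bucket name).
aws s3 mb s3://your-bucket --region us-west-2

# Upload the Spark application script to the bucket.
aws s3 cp pyspark_job.py s3://your-bucket/pyspark_job.py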

Using AWS Command Line Interface


It’s time to create our cluster and submit our application. Once our
application finishes, we’ll tell the cluster to terminate. Auto-
terminate allows us to pay for the resources only when we need
them.

Depending on our use case, we may not want to terminate our cluster upon completion. For instance, if you have a web
application that relies on Spark for a data processing task, you
may want to have a dedicated cluster running at all times.

Run the command below. Make sure you replace the placeholder values (your-bucket, your-key-pair, and so on) with your own. Details on --ec2-attributes and --bootstrap-actions, and all of the other arguments, are included below.
aws emr create-cluster --name "Spark cluster with step" \
    --release-label emr-5.24.1 \
    --applications Name=Spark \
    --log-uri s3://your-bucket/logs/ \
    --ec2-attributes KeyName=your-key-pair \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --bootstrap-actions Path=s3://your-bucket/emr_bootstrap.sh \
    --steps Type=Spark,Name="Spark job",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,s3://your-bucket/pyspark_job.py] \
    --use-default-roles \
    --auto-terminate

Important aws emr create-cluster arguments:

 --steps tells your cluster what to do after the cluster starts. Be sure to replace s3://your-bucket/pyspark_job.py in the --steps argument with the S3 path to your Spark application; your application code needs to live on S3 so the cluster can access it.
 --bootstrap-actions allows you to specify what packages you want to be installed on all of your cluster’s nodes. This step is only necessary if your application uses Python packages beyond the standard library and pyspark. To use such packages, create your emr_bootstrap.sh file using the example below as a template, and add it to your S3 bucket (a sample upload command is shown after this list). Include --bootstrap-actions Path=s3://your-bucket/emr_bootstrap.sh in the aws emr create-cluster command.
#!/bin/bash
sudo pip install -U \
matplotlib \
pandas \
spark-nlp

 --ec2-attributes allows you to specify many different EC2 attributes. Set your key pair using this syntax: --ec2-attributes KeyName=your-key-pair. Note: this is just the name of your key pair, not the file path. You can learn more about creating a key pair file here.
 --log-uri requires an S3 bucket to store your log files.
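If you do need the bootstrap script, uploading it works the same way as the application script. A minimal sketch, assuming the script is saved locally as emr_bootstrap.sh and your-bucket is a placeholder for your own bucket:

# Upload the bootstrap script so EMR can run it on every node at startup.
aws s3 cp emr_bootstrap.sh s3://your-bucket/emr_bootstrap.sh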
Other aws emr create-cluster arguments explained:

 --name gives the cluster you are creating an identifier.
 --release-label specifies which version of EMR to use. I recommend using the latest version.
 --applications tells EMR which type of application you will be using on your cluster. To create a Spark cluster, use Name=Spark.
 --instance-type specifies which type of EC2 instance you want to use for your cluster.
 --instance-count specifies how many instances you want in your cluster.
 --use-default-roles tells the cluster to use the default IAM roles for EMR. If this is your first time using EMR, you’ll need to run aws emr create-default-roles before you can use this command. If you’ve created a cluster on EMR in the region you have the AWS CLI configured for, then you should be good to go.
 --auto-terminate tells the cluster to terminate once the steps specified in --steps finish. Exclude this argument if you would like to leave your cluster running; beware that you are paying for your cluster as long as you keep it running.

Check the Spark application’s progress


After you execute the aws emr create-cluster command, you should get
a response:
{
"ClusterId": "j-xxxxxxxxxx"
}

Sign in to the AWS console and navigate to the EMR dashboard. Your cluster status should be “Starting”. It should take about ten minutes for your cluster to start up, bootstrap, and run your application (if you used my example code). Once the step is complete, you should see the output data in your S3 bucket.
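You can also keep an eye on things without leaving the terminal. A minimal sketch using standard AWS CLI commands; the cluster ID and bucket path are placeholders:

# Check the cluster state (STARTING, BOOTSTRAPPING, RUNNING, TERMINATED, ...).
aws emr describe-cluster --cluster-id j-xxxxxxxxxx \
    --query 'Cluster.Status.State'

# Check the status of the steps submitted to the cluster.
aws emr list-steps --cluster-id j-xxxxxxxxxx

# Once the step completes, list the output written to S3.
aws s3 ls s3://your-bucket/book-aggregates/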

That’s it!

Final thoughts
You now know how to create an Amazon EMR cluster and submit
Spark applications to it. This workflow is a crucial component of
building production data processing applications with Spark. I
hope you’re now feeling more confident working with all of these
tools.
