Production Data Processing With Apache Spark
Using the AWS CLI to submit PySpark applications on a cluster, a step-by-step guide
Apache Spark has been all the rage for large-scale data processing
and analytics — for good reason. With Spark, organizations are
able to extract a ton of value from their ever-growing piles of data.
Because of this, data scientists and engineers who can build Spark
applications are highly valued by businesses. This article will show
you how to run your Spark application on an Amazon EMR cluster
from the command line.
Once you’re confident your code works, you may want to integrate
your Spark application into your systems. Notebooks are much less
useful here: to run PySpark on a schedule, we need to move our code
from a notebook to a Python script and submit that script to a cluster.
I’ll put the code in a script, pyspark_job.py, so I can run it on a
schedule with Cron or Apache Airflow.
# ...tail of pyspark_job.py (the top of the script is sketched below)
        F.col('avg(star_rating)').alias('review_avg_stars'))

    # Save the data to your S3 bucket as a .parquet file.
    book_agg.write.mode('overwrite') \
        .save(output_path)


def main():
    spark = create_spark_session()
    input_path = ('s3://amazon-reviews-pds/parquet/' +
                  'product_category=Books/*.parquet')
    output_path = 's3://spark-tutorial-bwl/book-aggregates'
    process_book_data(spark, input_path, output_path)


if __name__ == '__main__':
    main()
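The excerpt above picks up near the end of process_book_data(). For context, here is a minimal sketch of what the top of pyspark_job.py could look like. The create_spark_session() and process_book_data() names come from the excerpt; the app name and the aggregation details (grouping the reviews by product_title and averaging star_rating) are my assumptions based on the column names used above, not the article's exact code.

# Hypothetical top of pyspark_job.py -- aggregation details are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def create_spark_session():
    """Create (or reuse) the SparkSession for this job."""
    return SparkSession.builder.appName('book-aggregates').getOrCreate()


def process_book_data(spark, input_path, output_path):
    """Read the book reviews, average star ratings per title, and write Parquet."""
    reviews = spark.read.parquet(input_path)
    book_agg = (reviews
                .groupBy('product_title')
                .agg(F.avg('star_rating'))
                .select(F.col('product_title'),
                        F.col('avg(star_rating)').alias('review_avg_stars')))
    # Write the result, replacing any previous run (as shown in the excerpt above).
    book_agg.write.mode('overwrite').save(output_path)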
Run the command below. Make sure you replace the placeholder values
(your-bucket, your-key-pair, and the S3 paths to your bootstrap script
and PySpark job) with your own. Details on --ec2-attributes and
--bootstrap-actions, and all of the other arguments, are included below.
aws emr create-cluster --name "Spark cluster with step" \
--release-label emr-5.24.1 \
--applications Name=Spark \
--log-uri s3://your-bucket/logs/ \
--ec2-attributes KeyName=your-key-pair \
--instance-type m5.xlarge \
--instance-count 3 \
--bootstrap-actions Path=s3://your-bucket/emr_bootstrap.sh \
--steps Type=Spark,Name="Spark
job",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--
master,yarn,s3://your-bucket/pyspark_job.py] \
--use-default-roles \
--auto-terminate
That’s it!
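If you do end up running this on a schedule with Cron or an Airflow task, as mentioned earlier, you may prefer launching the cluster from Python instead of shelling out to the CLI. Here is a rough boto3 equivalent of the create-cluster call above; it is a sketch, not code from the article, and the bucket, key pair, and script paths are the same placeholders you would replace with your own.

# Hypothetical launch_emr_job.py: a boto3 version of the CLI command above.
import boto3


def launch_spark_job():
    emr = boto3.client('emr')
    response = emr.run_job_flow(
        Name='Spark cluster with step',
        ReleaseLabel='emr-5.24.1',
        Applications=[{'Name': 'Spark'}],
        LogUri='s3://your-bucket/logs/',
        Instances={
            'MasterInstanceType': 'm5.xlarge',
            'SlaveInstanceType': 'm5.xlarge',
            'InstanceCount': 3,
            'Ec2KeyName': 'your-key-pair',
            # Equivalent of --auto-terminate: shut down when the step finishes.
            'KeepJobFlowAliveWhenNoSteps': False,
        },
        BootstrapActions=[{
            'Name': 'Install dependencies',
            'ScriptBootstrapAction': {'Path': 's3://your-bucket/emr_bootstrap.sh'},
        }],
        Steps=[{
            'Name': 'Spark job',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit', '--deploy-mode', 'cluster',
                         '--master', 'yarn', 's3://your-bucket/pyspark_job.py'],
            },
        }],
        # Equivalent of --use-default-roles.
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole',
    )
    print('Started cluster', response['JobFlowId'])


if __name__ == '__main__':
    launch_spark_job()

A function like this is easy to call from a Cron-driven script or an Airflow PythonOperator, which is the main reason to prefer it over pasting the CLI command by hand.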
Final thoughts
You now know how to create an Amazon EMR cluster and submit
Spark applications to it. This workflow is a crucial component of
building production data processing applications with Spark. I
hope you’re now feeling more confident working with all of these
tools.