Modern Data Pipelines With Apache Airflow


Andy Cooper & Taylor Edmiston @ Astronomer.io
Momentum Dev Con 2018
About Us
Andy Cooper
● Data Engineer
● 6 years of experience developing software and data pipelines
● Began career developing traditional data warehouses with Microsoft stack
● Using Airflow since 1.7
Taylor Edmiston
● Backend software engineer building the Airflow platform at Astronomer.io
● 9 years with Python, 6 years as a professional developer
● Top 20% all time on Stack Overflow with a reach of 750k developers
● Enjoys travel - 9 countries / 4 continents
What is Astronomer?
● Astronomer is a data engineering platform built on Apache Airflow and clickstream analytics
● Building tools that make data engineers' lives easier
● Seed-stage startup, founded ~3 years ago, located in Cincinnati (OTR)
● AngelPad #9 batch
● https://www.astronomer.io
● https://www.crunchbase.com/organization/astronomer
What do we do?
Airflow
● Astronomer Cloud (Managed Airflow)
○ Get up and running with Airflow quickly
● Astronomer Enterprise (docs)
○ Keep your data and workflows in your private cloud
○ Astronomer Spacecamp - Enterprise support & training available (https://www.astronomer.io/blog/announcing-astronomer-spacecamp/)
● Astronomer Open (docs)
○ The core of our platform is open source; try our Docker images on your machine
Clickstream
● A clickstream analytics pipeline and router for user events
● Client-side (web, native mobile) or server-side
● Not an analytics service! We integrate with 50+
● Free tier
● astronomer.io/clickstream
● 2-min demo video - https://www.youtube.com/watch?v=ru7VMe5MXZk
Outline (~40 min)
● (5 min) Intro
● (10 min) Part I - Airflow overview & concepts
● (10 min) Part II - Example DAGs
● Midpoint Q&A?
● (10 min) Part III - Getting started with Airflow + Astro CLI demo
● (5 min) Summary / Outro
● Q&A
What We’ll Cover
● Airflow Concepts
● Getting Started with Airflow
● Astro CLI
● Preview and Discussion Of Airflow UI
● Q&A
What is Apache Airflow?
● “Airflow is a platform to programmatically author, schedule and monitor
workflows.”
● Open source, currently in the Apache Incubator phase
○ 7,500 stars
○ 4,000 commits
○ 400 contributors
● Written in Python
● Leverages Flask web framework
Airflow Concepts
What is a DAG?
Directed Acyclic Graph
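To make the "directed acyclic graph" idea concrete, here is a minimal sketch in plain Python (not Airflow's scheduler, and not its API). It assumes a hypothetical three-task pipeline and a hypothetical helper, `execution_order`, which returns an order in which every task runs only after its upstream tasks and rejects graphs that contain a cycle.

```python
from collections import deque


def execution_order(dependencies):
    """Return task names in an order that respects upstream dependencies.

    `dependencies` maps each task to the set of tasks it depends on.
    Raises ValueError if the graph contains a cycle (so it is not a DAG).
    """
    # Collect every task mentioned, either as a key or as an upstream task.
    tasks = set(dependencies)
    for upstream in dependencies.values():
        tasks.update(upstream)

    # remaining[t] = upstream tasks of t that have not run yet.
    remaining = {t: set(dependencies.get(t, ())) for t in tasks}
    downstream = {t: set() for t in tasks}
    for task, ups in remaining.items():
        for up in ups:
            downstream[up].add(task)

    # Start with tasks that have no upstream dependencies (Kahn's algorithm).
    ready = deque(sorted(t for t, ups in remaining.items() if not ups))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in sorted(downstream[task]):
            remaining[child].discard(task)
            if not remaining[child]:
                ready.append(child)

    if len(order) != len(tasks):
        raise ValueError("cycle detected: not a DAG")
    return order


# Hypothetical pipeline: extract -> transform -> load.
pipeline = {"transform": {"extract"}, "load": {"transform"}}
print(execution_order(pipeline))  # ['extract', 'transform', 'load']
```

A graph such as `{"a": {"b"}, "b": {"a"}}` raises `ValueError`, which is the property that lets a scheduler always find a valid run order.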
Define Your Pipelines in Code
A Centralized Web App for All Workflows
Web App Features
● A quick look into DAG and task progress
● Error Logging
● Connections & Variables
● Connection Pooling
Hooks and Operators
Hooks
● An interface to an external system
● Often a wrapper for an API client
● Examples
○ DbApiHook
○ S3Hook
○ SlackHook
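The hook idea above can be sketched without Airflow installed. The class below is hypothetical (`InMemorySqliteHook` is not part of Airflow); it uses the standard-library `sqlite3` module as a stand-in for the external system. The method names `get_conn`, `run`, and `get_records` are chosen to loosely echo the style of database hooks, but the signatures here are assumptions for illustration only.

```python
import sqlite3


class InMemorySqliteHook:
    """Hypothetical hook: hides connection details behind a small interface."""

    def __init__(self, database=":memory:"):
        self.database = database
        self._conn = None

    def get_conn(self):
        # Create the connection lazily and reuse it on later calls.
        if self._conn is None:
            self._conn = sqlite3.connect(self.database)
        return self._conn

    def run(self, sql, parameters=()):
        conn = self.get_conn()
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, parameters)

    def get_records(self, sql, parameters=()):
        return self.get_conn().execute(sql, parameters).fetchall()


hook = InMemorySqliteHook()
hook.run("CREATE TABLE events (name TEXT, count INTEGER)")
hook.run("INSERT INTO events VALUES (?, ?)", ("page_view", 3))
rows = hook.get_records("SELECT name, count FROM events")
print(rows)  # [('page_view', 3)]
```

Task code only talks to the hook, so swapping the underlying client or credentials does not touch the pipeline logic.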
Operators
● Sensor Operators
○ S3KeySensor
○ S3PrefixSensor
○ HttpSensor
● Action Operators
○ BashOperator
○ PythonOperator
○ EmailOperator
● Transfer Operators
○ SalesforceToRedshiftSchemaSync
○ SalesforceToS3
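The three operator families above differ mainly in what `execute` does: a sensor waits for a condition, an action operator does some work, and a transfer operator moves data between systems. The sketch below uses plain Python with hypothetical classes (none of these are Airflow classes), and dictionaries stand in for an S3 bucket and a warehouse.

```python
import time


class FileArrivedSensor:
    """Hypothetical sensor: waits until a key shows up in a fake bucket."""

    def __init__(self, bucket, key, poke_interval=0.01, timeout=1.0):
        self.bucket = bucket
        self.key = key
        self.poke_interval = poke_interval
        self.timeout = timeout

    def poke(self):
        # One check of the condition; True means "stop waiting".
        return self.key in self.bucket

    def execute(self):
        deadline = time.monotonic() + self.timeout
        while not self.poke():
            if time.monotonic() >= deadline:
                raise TimeoutError("gave up waiting for " + self.key)
            time.sleep(self.poke_interval)
        return True


class CopyOperator:
    """Hypothetical transfer operator: copies one value between two stores."""

    def __init__(self, source, destination, key):
        self.source = source
        self.destination = destination
        self.key = key

    def execute(self):
        self.destination[self.key] = self.source[self.key]


class CallableOperator:
    """Hypothetical action operator: runs an arbitrary Python callable."""

    def __init__(self, python_callable, op_args=()):
        self.python_callable = python_callable
        self.op_args = op_args

    def execute(self):
        return self.python_callable(*self.op_args)


key = "2018/events.json"
bucket = {key: '{"event": "page_view"}'}  # stand-in for S3
warehouse = {}                            # stand-in for a database

steps = [
    FileArrivedSensor(bucket, key),
    CopyOperator(bucket, warehouse, key),
    CallableOperator(len, (warehouse,)),
]
results = [step.execute() for step in steps]
print(results)  # [True, None, 1]
```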
DAG Runs & Task Instances
Dynamic DAGs
Executors & Scaling
Executors
● SequentialExecutor
● LocalExecutor
○ No additional dependencies
○ Runs tasks in parallel on a single machine (as local worker processes) out of the box
● CeleryExecutor
● MesosExecutor
● KubernetesExecutor (future)
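The practical difference between a sequential and a parallel executor can be illustrated with the standard library alone. This is only a sketch of the idea, not how Airflow's executors are implemented: the task function and task names are hypothetical, and a thread pool is used here purely as a convenient stand-in for local parallelism.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_task(name):
    # Hypothetical task: pretend to do 50 ms of I/O-bound work.
    time.sleep(0.05)
    return name + " done"


def run_sequentially(names):
    # One task at a time, in order.
    return [run_task(name) for name in names]


def run_in_parallel(names, parallelism=4):
    # Up to `parallelism` tasks at once; results keep the input order.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(run_task, names))


tasks = ["load_a", "load_b", "load_c", "load_d"]

start = time.perf_counter()
sequential = run_sequentially(tasks)
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
parallel = run_in_parallel(tasks)
parallel_seconds = time.perf_counter() - start

print(sequential == parallel)
print("sequential: %.2fs, parallel: %.2fs" % (sequential_seconds, parallel_seconds))
```

Both runs produce the same results; only the wall-clock time differs, which is why independent tasks in a DAG benefit from a parallel executor.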
Plugins
What can a plugin do?
● Extend the Airflow API
● Build new dashboards
● Create custom Hooks and Operators
● Astronomer maintains the most comprehensive collection of Airflow Plugins
○ github.com/airflow-plugins
● Code reuse, composition, good software engineering practices, etc.
● Examples
○ Salesforce To Redshift Plugin
○ airflow-api-plugin
○ Airflow DAG Creation Manager Plugin
Example DAGs
DAG Examples
● GitHub stats DAG
● Clickstream Redshift loader DAG
○ ~200 million events per month from customer apps
○ ~2 million Airflow task instances per month
● https://github.com/airflow-plugins/Example-Airflow-DAGs
GitHub Issue and Commit Tracking Example
Clickstream Redshift DAG
● Your Website → Astronomer Clickstream → S3 → [S3 sensor → Redshift copy via Apache Spark]
● Dynamic DAGs configured via API → Scheduler (cached) → Variable
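The "dynamic DAGs" idea, one pipeline generated per configuration entry, can be sketched in plain Python. Everything here is an assumption for illustration: the config shape, the `build_pipeline` helper, and the task names are hypothetical, and a hard-coded list stands in for configuration that the slide describes as coming from an API and being cached in a Variable. In a real Airflow deployment the loop would typically build DAG objects rather than dictionaries.

```python
def build_pipeline(app_id, tables):
    """Return a hypothetical pipeline spec for one customer app.

    Each table gets an S3 sensor task followed by a Redshift copy task.
    `tasks` maps a task name to the set of its upstream task names.
    """
    tasks = {}
    for table in tables:
        sensor = "wait_for_s3_{}".format(table)
        copy = "copy_to_redshift_{}".format(table)
        tasks[sensor] = set()
        tasks[copy] = {sensor}
    return {"dag_id": "clickstream_{}".format(app_id), "tasks": tasks}


# Stand-in for configuration fetched from an API (hypothetical values).
configs = [
    {"app_id": "app1", "tables": ["page", "track"]},
    {"app_id": "app2", "tables": ["identify"]},
]

# One pipeline per config entry, keyed by its generated id.
pipelines = {}
for config in configs:
    pipeline = build_pipeline(config["app_id"], config["tables"])
    pipelines[pipeline["dag_id"]] = pipeline

print(sorted(pipelines))  # ['clickstream_app1', 'clickstream_app2']
```

Adding a customer then means adding a config entry, not writing a new pipeline file.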
Astro CLI
The fastest way to get started with Airflow
How can I get started with Airflow?
● Source Code
○ https://github.com/astronomerio/astro-cli
● Install CLI
○ $ curl -sL https://install.astronomer.io | sudo bash
● Start a Project
○ $ mkdir test-project && cd test-project
○ $ astro airflow init
○ $ astro airflow start
Takeaway
● Part I - Airflow overview & concepts
● Part II - Example DAGs
● Part III - Getting started with Airflow + Astro CLI demo
Resources
● Official
○ https://github.com/apache/incubator-airflow
○ https://airflow.apache.org
○ Airflow Dev Mailing List
○ Apache Airflow meetups
● Community
○ https://github.com/airflow-plugins
○ https://soundcloud.com/the-airflow-podcast
○ https://github.com/jghoman/awesome-apache-airflow
● Related Talks
○ https://blog.tedmiston.com/talks/
Contact Info
● Andy
○ https://twitter.com/andscoop
○ https://www.linkedin.com/in/andscoop/
○ https://andscoop.com/
● Taylor
○ https://twitter.com/kicksopenminds
○ https://www.linkedin.com/in/tedmiston/
○ https://blog.tedmiston.com