The document discusses modern data pipelines using Apache Airflow. It provides an overview of Airflow concepts like DAGs, tasks, the Airflow web interface and scaling options. It also demonstrates example DAGs for GitHub stats and loading clickstream data into Redshift. Finally, it shows how to quickly get started with Airflow using the Astro CLI.
Andy Cooper & Taylor Edmiston @ Astronomer.io
Momentum Dev Con 2018

About Us

Andy Cooper
● Data Engineer
● 6 years of experience developing software and data pipelines
● Began career developing traditional data warehouses with Microsoft stack
● Using Airflow since 1.7

Taylor Edmiston
● Backend software engineer building the Airflow platform at Astronomer.io
● 9 years with Python, 6 years as a professional developer
● Top 20% all time on Stack Overflow with a reach of 750k developers
● Enjoys travel - 9 countries / 4 continents

What is Astronomer?
● Astronomer is a data engineering platform built on Apache Airflow and clickstream analytics
● Building tools that make data engineers' lives easier
● Seed-stage startup, founded ~3 years ago, located in Cincinnati (OTR)
● AngelPad #9 batch
● https://www.astronomer.io
● https://www.crunchbase.com/organization/astronomer

What do we do?

Airflow
● Astronomer Cloud (Managed Airflow)
  ○ Get up and running with Airflow quickly
● Astronomer Enterprise (docs)
  ○ Keep your data and workflows in your private cloud
  ○ Astronomer Spacecamp - Enterprise support & training available (https://www.astronomer.io/blog/announcing-astronomer-spacecamp/)
● Astronomer Open (docs)
  ○ The core of our platform is open source; try our Docker images on your machine

Clickstream
● A clickstream analytics pipeline and router for user events
● Client-side (web, native mobile) or server-side
● Not an analytics service! We integrate with 50+
● Free tier
● astronomer.io/clickstream
● 2-min demo video - https://www.youtube.com/watch?v=ru7VMe5MXZk

Outline (~40 min)
● (5 min) Intro
● (10 min) Part I - Airflow overview & concepts
● (10 min) Part II - Example DAGs
● Midpoint Q&A?
● (10 min) Part III - Getting started with Airflow + Astro CLI demo
● (5 min) Summary / Outro
● Q&A

What We’ll Cover
● Airflow Concepts
● Getting Started with Airflow
● Astro CLI
● Preview and Discussion of Airflow UI
● Q&A

What is Apache Airflow?
● “Airflow is a platform to programmatically author, schedule and monitor workflows.”
● Open source, currently in the Apache Incubator phase
  ○ 7,500 stars
  ○ 4,000 commits
  ○ 400 contributors
● Written in Python
● Leverages the Flask web framework

Airflow Concepts

What is a DAG?
Directed Acyclic Graph

Define Your Pipelines in Code

A Centralized Web App for All Workflows

Web App Features
● A quick look into DAG and task progress
● Error Logging
● Connections & Variables
● Connection Pooling

Hooks and Operators

Hooks
● An interface to an external system
● Often a wrapper for an API client
● Examples
  ○ DbApiHook
  ○ S3Hook
  ○ SlackHook

Operators
● Sensor Operators
  ○ S3KeySensor
  ○ S3PrefixSensor
  ○ HTTPSensor
● Action Operators
  ○ BashOperator
  ○ PythonOperator
  ○ EmailOperator
● Transfer Operators
  ○ SalesforceToRedshiftSchemaSync
  ○ SalesforceToS3

DAG Runs & Task Instances

Dynamic DAGs

Executors & Scaling

Executors
● SequentialExecutor
● LocalExecutor
  ○ No additional dependencies
  ○ Multi-threaded out of the box
● CeleryExecutor
● MesosExecutor
● KubernetesExecutor (future)

Plugins

What can a plugin do?
● Extend the Airflow API
● Build new dashboards
● Create custom Hooks and Operators

● Astronomer maintains the most comprehensive collection of Airflow Plugins
  ○ github.com/airflow-plugins
● Code reuse, composition, good software engineering practices, etc
● Examples
  ○ Salesforce To Redshift Plugin
  ○ airflow-api-plugin
  ○ Airflow DAG Creation Manager Plugin

Example DAGs

DAG Examples
● GitHub stats DAG
● Clickstream Redshift loader DAG
  ○ ~200 million events per month from customer apps
  ○ ~2 million Airflow task instances per month
● https://github.com/airflow-plugins/Example-Airflow-DAGs

Github Issue and Commit Tracking Ex.

Clickstream Redshift DAG
● Your Website → Astronomer Clickstream → S3 → [S3 sensor → Redshift copy via Apache Spark]
● Dynamic DAGs configured via API → Scheduler (cached) → Variable

Astro CLI
The fastest way to get started with Airflow

How can I get started with Airflow?
● Source Code
  ○ https://github.com/astronomerio/astro-cli
● Install CLI
  ○ $ curl -sL https://install.astronomer.io | sudo bash
● Start a Project
  ○ $ mkdir test-project && cd test-project
  ○ $ astro airflow init
  ○ $ astro airflow start

Takeaway
● Part I - Airflow overview & concepts
● Part II - Example DAGs
● Part III - Getting started with Airflow + Astro CLI demo

Resources
● Official
  ○ https://github.com/apache/incubator-airflow
  ○ https://airflow.apache.org
  ○ Airflow Dev Mailing List
  ○ Apache Airflow meetups
● Community
  ○ https://github.com/airflow-plugins
  ○ https://soundcloud.com/the-airflow-podcast
  ○ https://github.com/jghoman/awesome-apache-airflow
● Related Talks
  ○ https://blog.tedmiston.com/talks/

Contact Info
● Andy
  ○ https://twitter.com/andscoop
  ○ https://www.linkedin.com/in/andscoop/
  ○ https://andscoop.com/
  ○ [email protected]
● Taylor
  ○ https://twitter.com/kicksopenminds
  ○ https://www.linkedin.com/in/tedmiston/
  ○ https://blog.tedmiston.com
  ○ [email protected]
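
Sketch: what "Directed Acyclic Graph" means for task ordering

The "What is a DAG?" slide can be made concrete with a small standard-library sketch. This is not Airflow code and does not use Airflow's API; it only illustrates the idea that tasks run in dependency order and that a cycle makes a pipeline invalid. The task names are hypothetical, loosely modelled on the S3 sensor → Redshift copy flow from the clickstream example, and `notify_team` is an invented extra step. It assumes Python 3.9+ for `graphlib`.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each key is a task, each value is the set of
# tasks that must finish before it can start (its upstream tasks).
dependencies = {
    "wait_for_s3_file": set(),
    "copy_to_redshift": {"wait_for_s3_file"},
    "notify_team": {"copy_to_redshift"},
}


def run_order(deps):
    """Return one valid execution order for the tasks in deps.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    i.e. if the graph is not acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


def is_valid_dag(deps):
    """Return True if deps describes an acyclic graph, False otherwise."""
    try:
        run_order(deps)
        return True
    except CycleError:
        return False


order = run_order(dependencies)
print(order)

# Two tasks that depend on each other form a cycle, so this is not a DAG.
print(is_valid_dag({"a": {"b"}, "b": {"a"}}))
```

In Airflow itself the same dependencies are declared between operator instances inside a DAG definition file, and the scheduler decides when each task instance runs; the sketch above only mirrors the ordering rule.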