Cloud Data Engineering For Dummies
by David Baum
Cloud Data Engineering For Dummies®, Snowflake Special Edition
Published by
John Wiley & Sons, Inc.
111 River St.
Hoboken, NJ 07030-5774
www.wiley.com
Copyright © 2020 by John Wiley & Sons, Inc., Hoboken, New Jersey
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise,
except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without
the prior written permission of the Publisher. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, and related trade
dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in
the United States and other countries, and may not be used without written permission.
Snowflake and the Snowflake logo are trademarks or registered trademarks of Snowflake Inc. All
other trademarks are the property of their respective owners. John Wiley & Sons, Inc., is not
associated with any product or vendor mentioned in this book.
For general information on our other products and services, or how to create a custom For Dummies book for your business or organization, please contact our Business Development Department in the U.S. at 877-409-4177, contact [email protected], or visit www.wiley.com/go/custompub. For information about licensing the For Dummies brand for products or services, contact BrandedRights&[email protected].
ISBN 978-1-119-75456-5 (pbk); ISBN 978-1-119-75458-9 (ebk)
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
Publisher’s Acknowledgments
We’re proud of this book and of the people who worked on it. Some of the
people who helped bring this book to market include the following:
Development Editor: Colleen Diamond
Project Editor: Martin V. Minner
Editorial Manager: Rev Mengle
Executive Editor: Steve Hayes
Business Development Representative: Karen Hattan
Production Editor: Siddique Shaik
Snowflake Contributors Team: Vincent Morello, Shiyi Gu, Kent Graziano, Clarke Patterson, Jeremiah Hansen, Mike Klaczynski, Dinesh Kulkarni, Leslie Steere
Table of Contents
INTRODUCTION
Who Should Read This Book
Icons Used in This Book
Beyond the Book
Enforcing Data Governance and Security
Streamlining DataOps by Cloning Data
Introduction
If data is the heartbeat of the enterprise, then data engineering
is the activity that ensures current, accurate, and high-quality
data is flowing to the solutions that depend on it. As analytics
become progressively more important, data engineering has
become a competitive edge and central to the technology initia-
tives that position companies for success.
In this book, you’ll discover how your business and IT teams can
effectively collaborate as the entire organization adopts mod-
ern data engineering tools and procedures. You’ll see how IT
teams lay the data engineering foundation to build reliable, high-
performance data pipelines that can benefit the entire organization.
You’ll also learn how self-service data preparation and integration
activities can be extended to analysts, data scientists, and line-
of-business users. Lastly, you’ll discover how a cloud data platform
enables your organization to amass all of its data in one central
location, making it easy to consolidate, cleanse, transform, and
deliver that data to a wide range of analytic tools and securely share
it with internal and external data consumers.
Who Should Read This Book
Cloud Data Engineering For Dummies, Snowflake Special Edition,
explains why data engineering is important and introduces the
essential components of a modern data engineering practice. Pro-
fessionals of all levels of technical proficiency can benefit from
this book.
Look for this icon to read about organizations that have success-
fully applied modern data engineering practices and principles.
IN THIS CHAPTER
»» Understanding the basics of data engineering
Chapter 1
Charting the Rise of Modern Data Engineering
The software industry is driven by innovation, but most new
technologies have historical precedents. This chapter
describes the data engineering principles, practices, and
capabilities that have paved the way for breakthroughs in data
management and analytics.
information systems. Data engineers make an organization’s data
“production ready” by putting it into a usable form, and typically
in a centralized repository or cloud data platform. They under-
stand how to manipulate data formats, scale data systems, and
enforce data quality and security.
Reviewing the History of Data Engineering
Extract, transform, and load (ETL), or what we now call data
engineering, used to be much simpler. There was much less data
in the world, it came in fewer types, and it was needed at a much slower pace.
Enterprise data was moved from one system to another, and soft-
ware professionals generally transmitted it in batch mode, as a
bulk data load operation. When data needed to be shared, it was
often moved through File Transfer Protocol (FTP), application
programming interfaces (APIs), and web services.
so. The cost of storing data has gone down significantly, even as
the computing devices that process that data have become more
powerful. As the cloud computing industry matures, a growing
number of organizations rely on cloud services to store,
manage, and process their data. The cloud offers virtually unlim-
ited capacity and near-infinite elasticity and scalability, allowing
companies of any size to deploy a large number of concurrent,
high-performance workloads within a centralized platform.
The cloud has also given rise to highly efficient methods of appli-
cation development and operations (DevOps). These include server-
less computing, in which the cloud service provider automatically
provisions, scales, and manages the infrastructure required to host
your data and run your business applications. A serverless environ-
ment allows developers to bring products to market faster via con-
tainers (software packages that contain everything needed to run
an application), microservices (applications built as modular com-
ponents or services), and continuous integration/continuous delivery
(CI/CD) processes. Chapter 2 discusses these concepts further.
What is the profile of my data? How can I tell if the quality is good
enough for the types of analysis I want to perform? These are com-
mon questions for which business users seek immediate answers.
data governance must be the foundation of these “citizen inte-
gration” efforts.
Previously, Fair’s legacy data warehouse could not keep pace with the
company’s rapidly expanding appetite for data. This led to frequent
contention for scarce server resources, resulting in the creation of
data marts — siloed copies of some of the data stored in a data
warehouse to offload analytic activity and preserve performance.
In addition, dealer inventory data was imported only once per day to
avoid system overload, making real-time analytics impossible. And
Fair’s analytics team spent hours troubleshooting cluster failures and
waiting for ETL jobs to run.
IN THIS CHAPTER
»» Understanding how a data pipeline works
Chapter 2
Describing the Data Engineering Process
Data engineering involves ingesting, transforming, delivering, and sharing data for analysis. These fundamental tasks
are completed via data pipelines that automate the process
in a repeatable way. This chapter describes the primary procedures
that make this possible.
Most modern pipelines use three basic steps. The first step
is collection, during which raw data is loaded into a reposi-
tory or data platform, often in a raw data zone. In the sec-
ond step, transformation, the data is standardized, cleansed,
mapped, or combined with data from other sources. Trans-
formation also entails modifying the data from one data type
to another, so it is ready for different types of data consump-
tion. And finally, data delivery and secure data sharing makes
business-ready data available to other users and departments,
both within the organization and externally (see Figure 2-1).
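To make these steps concrete, here is a minimal sketch in SQL, assuming a Snowflake-style cloud data platform; the stage, table, and role names are hypothetical:

-- Step 1: Collection -- bulk-load raw files into a raw data zone.
CREATE TABLE IF NOT EXISTS raw_zone.orders_landing (
  order_id STRING, customer_id STRING, order_date STRING, amount STRING);
COPY INTO raw_zone.orders_landing
  FROM @landing_stage/orders/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Step 2: Transformation -- standardize types and cleanse bad records.
CREATE OR REPLACE TABLE analytics.orders AS
SELECT order_id,
       customer_id,
       TO_DATE(order_date) AS order_date,
       TRY_TO_NUMBER(amount, 12, 2) AS amount
FROM raw_zone.orders_landing
WHERE order_id IS NOT NULL;

-- Step 3: Delivery -- grant read access so other teams can use the data.
GRANT SELECT ON TABLE analytics.orders TO ROLE reporting_team;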
Collecting and Ingesting Data
Many types of data exist, and you can store it in many ways, on
premises and in the cloud. For example, your business may gen-
erate data from transactional applications, such as customer rela-
tionship management (CRM) data from Salesforce or enterprise
resource planning (ERP) data from SAP. Or you may have Internet
of Things (IoT) sensors that gather readings from a production
line or factory floor operation.
These materials are © 2020 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.
Transforming Data
Data transformation is the process of preparing data for different
kinds of consumption. It can involve standardizing (converting all
data types to the same format), cleansing (resolving inconsisten-
cies and inaccuracies), mapping (combining data elements from
two or more data models), augmenting (pulling in data from other
sources), and so on.
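As a hedged illustration, the following SQL sketch (all table and column names are hypothetical) applies several of these operations in a single query:

SELECT UPPER(TRIM(c.country_code)) AS country_code,   -- standardizing formats
       COALESCE(c.email, 'unknown') AS email,         -- cleansing missing values
       o.order_id,                                    -- mapping: joining two data models
       o.amount * fx.usd_rate AS amount_usd           -- augmenting with another source
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
LEFT JOIN fx_rates fx ON fx.currency = o.currency;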
Designing pipelines
How data pipelines are designed has everything to do with the
underlying database and processing engine as well as the skill
sets of your team. These design decisions invariably reflect the
choice of underlying architecture.
For example, suppose you used Hadoop as the file system for your
data lake, and your data pipelines were based on MapReduce. Only
a few years later, you decide to leverage Spark as a much faster
processing framework, requiring you to modify or re-create all
your pipelines accordingly.
Evolving from ETL to ELT
As stated in Chapter 1, ETL refers to the process of extracting,
transforming, and loading data. With ETL, data is transformed outside the target system by a separate processing engine, which involves unnecessary data movement and tends to be slow. Modern data integration workloads
are enhanced by leveraging the processing power of target data-
bases. In these instances, the data pipelines are designed to extract
and load the data first, and then transform it later (ELT).
Efficient ELT does not require a data schema at the outset, even for semi-structured data. Data can simply be loaded in raw form and transformed later, once it is clear how it will
be used.
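A hedged sketch of the ELT pattern on a Snowflake-style platform (stage and object names hypothetical): semi-structured data lands as-is, and structure is applied later, at query time:

-- Load first: no schema design needed up front.
CREATE TABLE IF NOT EXISTS raw_zone.clickstream (v VARIANT);
COPY INTO raw_zone.clickstream
  FROM @web_logs_stage
  FILE_FORMAT = (TYPE = 'JSON');

-- Transform later, inside the target platform, once the use is clear.
CREATE OR REPLACE VIEW analytics.page_views AS
SELECT v:user_id::STRING AS user_id,
       v:page::STRING    AS page,
       v:ts::TIMESTAMP   AS viewed_at
FROM raw_zone.clickstream;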
Modern data sharing involves simply granting access to live, gov-
erned, read-only data by pointing at its original location. With
granular-level access control, data is shared rather than cop-
ied, and no additional cloud storage is required. With this more
advanced architecture, data providers can easily and securely
publish data for instant discovery, query, and enrichment by data
consumers, as shown in Figure 2-2.
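On Snowflake, for example, the provider side of this pattern looks roughly like the following (a sketch; the share, object, and account names are hypothetical):

CREATE SHARE sales_share;
GRANT USAGE ON DATABASE analytics TO SHARE sales_share;
GRANT USAGE ON SCHEMA analytics.public TO SHARE sales_share;
GRANT SELECT ON TABLE analytics.public.daily_orders TO SHARE sales_share;
-- The consumer account queries the live data in place: read-only, no copies.
ALTER SHARE sales_share ADD ACCOUNTS = partner_account;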
Whatever industry or market you operate in, having all your data
on hand and readily accessible opens doors to new opportunities.
This is especially true when that data is stored and managed in a
consistent way, and when you can use the near-infinite capacity
of the cloud to scale your data initiatives.
A cloud data platform allows you to store all your data in one place,
in its raw form, regardless of format, and deliver analytics-ready
data to the people who need it. It provides convenient access to
that data and improves the speed at which you can ingest, trans-
form, and share data across your organization — and beyond.
IN THIS CHAPTER
»» Reaping the full benefits from cloud-based solutions
Chapter 3
Mapping the Data Engineering Landscape
As you create data pipelines, remember the ultimate goal: to
turn your data into useful information such as actionable
analytics for business users and predictive models for data
scientists. To do so, you must think about the journey your data
will take through your data pipelines. Start by answering some
fundamental questions:
Working with Data Warehouses and Data Lakes
Data engineering involves extracting data from various applica-
tions, devices, event streams, and databases. Where will it land?
For many companies, the answer is often a data warehouse or a
data lake.
True cloud data platforms are built using a cloud-optimized
architecture that takes advantage of storage as a service, where
data storage expands and contracts automatically.
Change data capture (CDC) capabilities simplify data pipelines by
recognizing the changes that have occurred since the last data
load and incrementally processing or ingesting that data. For
example, in the case of a financial services company, a bulk
upload from the banking system refreshes the data warehouse
each night, while CDC adds new transactions every five minutes.
This type of process allows analytic databases to stay current
without reloading the entire data set.
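A hedged sketch of the incremental step (standard SQL MERGE; table names hypothetical), run every five minutes to apply only what changed:

MERGE INTO dw.transactions t
USING staging.transaction_changes c
  ON t.txn_id = c.txn_id
WHEN MATCHED THEN UPDATE SET
  t.amount = c.amount, t.status = c.status, t.updated_at = c.updated_at
WHEN NOT MATCHED THEN INSERT (txn_id, amount, status, updated_at)
  VALUES (c.txn_id, c.amount, c.status, c.updated_at);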
In the case of streaming data, be aware that event time and pro-
cessing time are not always the same. You can’t simply follow the
timestamp in the data, since some transactions may be delayed in
transit, which could cause them to be recorded in the wrong order.
If you need to work with streaming data, you may need to create
a pipeline that can verify precisely when each packet, record, or
transaction occurred, and ensure they are recorded only once, and
in the right order, according to your business requirements. Add-
ing event time to the record ensures that processing delays do
not cause incorrect results due to an earlier change overwriting a
later change.
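One hedged way to express this in SQL (Snowflake-style QUALIFY; names hypothetical) is to keep, for each record key, only the row with the latest event time, so a late-arriving packet cannot overwrite newer state:

CREATE OR REPLACE TABLE dw.sensor_current AS
SELECT *
FROM raw_zone.sensor_events
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY sensor_id        -- one row per sensor: recorded only once
  ORDER BY event_time DESC      -- ordered by when it happened, not when it arrived
) = 1;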
Saving time for developers
Data integration platforms and tools sort out the diversity of data
types and APIs so data engineers can connect to data sources
directly instead of coding APIs, which is complex and time-
consuming. For example, you may need to connect to a com-
plex legacy SAP enterprise resource planning (ERP) system that
requires remote function calls (RFCs), business application pro-
gramming interfaces (BAPIs), and Intermediate Document (IDoc)
messages. Using an integration tool to connect to a data source is
much more efficient than building and managing your own.
Users can also visually create data pipelines with customizable options. Data integration tools help them craft the necessary
software interfaces. Instead of having to master the nuances of
transformation logic, networking protocols, and data models, the
integration tools encapsulate the nitty-gritty details of merging,
mapping, and integrating your data.
and how many software interfaces do you need to establish? How
many engineering resources are you willing to spend on not only
building but also maintaining data pipelines? You may have both
simple transformations and highly complex and customized logic.
Does it make more sense to use your skilled data engineers to
hand-code everything, or to purchase a general-purpose integra-
tion tool your data engineers can use and customize?
Data scientists experiment with many data sets as they create and
train machine learning models. A model designed to predict cus-
tomer churn, for example, may incorporate data about customer
behavior relative to sales, service, and purchasing, both historic
and current. Each time data scientists pull in new data, they must
wait for data engineers to load and prepare the data set, which
introduces latency into the flow.
They also must rescale the data to a specific range or distribution (often referred to as normalization and standardization), as required by each machine learning algorithm. Machine
learning models must be periodically retrained, which requires
fresh data to be reprocessed through the cycle, often via manual
extract, transform, and load (ETL) processes that can potentially
introduce errors.
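As a hedged illustration (window-function SQL; feature and table names hypothetical), min-max normalization rescales a feature into the range [0, 1], while z-score standardization centers it on zero with unit variance:

SELECT customer_id,
       (spend - MIN(spend) OVER ())
         / NULLIF(MAX(spend) OVER () - MIN(spend) OVER (), 0) AS spend_normalized,
       (spend - AVG(spend) OVER ())
         / NULLIF(STDDEV(spend) OVER (), 0) AS spend_standardized
FROM features.customer_behavior;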
To simplify data pipeline development, look for a data platform
that also has a built-in data ingestion service designed to asyn-
chronously load data into the cloud storage environment. You’ll
also want support for CDC technology, so when data is changed,
detecting the differences is easy. This increases performance by
allowing you to work with the changed data instead of bringing in
the whole data set again.
IN THIS CHAPTER
»» Balancing agility and governance
Chapter 4
Establishing Your Data
Engineering Foundation
Establishing a healthy and productive data engineering prac-
tice requires balancing agility with governance. You need
comprehensive controls to ensure your data is clean, accu-
rate, and up to date. However, you don’t want to stymie the user
community by imposing data governance procedures that are too
onerous or obtrusive. Ultimately, you want an agile environment
that is broadly accessible and easy to use.
This chapter describes how you can bring together a broad net-
work of stakeholders, from highly skilled engineers and data
scientists to casual users who simply want to explore, enhance,
modify, and cleanse their data with self-service tools. It explains
how you can enforce good governance to provide a safe environ-
ment that allows your user community to be creative, while also
ensuring data is secure, consistent, and compliant with data pri-
vacy regulations.
Closing the Gap Between Governance and Agility
Many people are involved in the cycle of preparing, managing,
and analyzing data. To orchestrate these efforts, you must create
a cohesive environment that accommodates multiple skill sets.
Data and analytics requirements change all the time, and you
need managed self-service procedures, backed by continuous
DataOps delivery methods, to keep accurate and governed data
moving through the pipeline.
WHAT IS DATAOPS?
DataOps, short for data operations, brings together data engineers,
data scientists, business analysts, and other data stakeholders to
apply agile best practices to the data lifecycle, from data preparation
to reporting with data. As shown in the following figure, DataOps
automates critical data engineering activities and orchestrates hand-
offs throughout the data management cycle, from plan, develop,
build, manage, and test to release, deploy, operate, and monitor.
DataOps takes its cues from agile programming methods to
ensure the delivery of data via a continuous plan, develop, build,
manage, test, release, deploy, operate, and monitor loop. DataOps
practices include tools, processes, and frameworks that recognize
the interconnected nature of both business and IT personnel in
these data-driven endeavors.
framework that serves as a backbone for data quality, in
conjunction with data security and change control
procedures.
»» Data catalog capabilities: Data catalog capabilities help
organize the information within your storage. A data catalog
is a collection of metadata, combined with data manage-
ment and search tools, that helps analysts and other data
users find the data they need, serves as an inventory of
available data, and provides information that helps organiza-
tions evaluate the relevance of that data for intended uses.
»» Data access: Data access rules must be established to
determine who can see, work with, and change the data. Pay
special attention to personally identifiable information (PII),
financial data, and other sensitive information, which may
need to be masked or tokenized to uphold data privacy
regulations. Some cloud data platforms can apply masks and
tokens to the data automatically, as well as unmask the data
for authorized users (see the masking sketch after this list).
»» Change management: Change management utilities keep
track of who accesses and changes databases and data
pipelines. They track when changes were made, who made
them, and which applications those changes affect. These
tools help you safely manage the environment, audit usage,
and trace data back to its source, reducing the chance of
unauthorized alterations and errors.
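As a hedged sketch of automatic masking (Snowflake-style masking policy; the policy, role, table, and column names are hypothetical):

-- Mask email addresses for everyone except an authorized role.
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_READER') THEN val ELSE '*** MASKED ***' END;
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;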
Data engineers
Data engineers make an organization’s data “production ready.”
They create and manage data pipelines to serve different business
use cases. They need to understand how to handle various data
formats, how to scale data systems, and how to provide practices
to enforce data quality, governance, and security. They monitor
changes and enforce version control for all data objects, as well as
for the data pipelines that act on them.
community needs. Data architects are the technology leads
who set up and enforce the overall standards of a data
engineering project.
»» Data stewards help determine what data is needed and
ensure the data is properly defined. They use the data
governance processes to ensure organizational data and
metadata (data about data) are appropriately described and
used. Generally hailing from business rather than IT, these
project stakeholders understand the source data and the
applications it connects to. They often get involved at the
outset of a project as well as during user acceptance testing,
when a data application or pipeline moves from the develop-
ment phase into the testing and QA phase, and then on to
production. Data stewards and product owners also play a
key role in ensuring data quality.
Nominate the business users, who own and manage the data, to be
responsible for data quality because they are in the best position to detect inaccuracies and inconsistencies. These data stewards
can also determine how often the data should be refreshed to
ensure it remains relevant, as well as when it is being analyzed
out of context. Data stewards should work with data engineers to
establish repeatable and consistent data quality rules, processes,
and procedures.
FIVE STEPS TO GOOD GOVERNANCE
Key steps for establishing good data governance include the
following:
has access to it, how it’s used, and whether or not it has been fully
deleted when a consumer requests it be deleted.
The dynamic nature of the cloud allows you to forgo most capacity
planning exercises. If you need to spin up a new database envi-
ronment for development or testing, you can provision it instantly
with a couple of clicks. Similarly, you can clone an existing data-
base of any size with a single command, with no need to pre-
provision storage.
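For instance, with zero-copy cloning on a Snowflake-style platform (database names hypothetical), a full-size test environment is a single statement, and storage is consumed only as the clone diverges from the original:

CREATE DATABASE dev_db CLONE prod_db;   -- instant, no storage pre-provisioning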
IN THIS CHAPTER
»» Explaining integration styles with streaming, batch data, and data replication
Chapter 5
Outlining Technology Requirements
As you lay out a data architecture, select data engineering
technologies, and design data pipelines, the goal is to cre-
ate a data environment that not only serves your organiza-
tion’s current needs but also positions it for the future. To ensure
you are following best practices, consider these fundamental
principles:
»» Build or buy: Will you build or buy data integration tools,
utilities, and infrastructure? Will you work with commercial
vendors or use open source software? Have you compared
the price/performance and total cost of ownership with
different data systems and the talent resources required?
»» Self-service: Are you democratizing access to data by
insulating users from complex technologies? Are you
encouraging autonomy among the user community?
»» Investments: Can you leverage existing languages, process-
ing engines, and data integration procedures to maximize
investments, minimize training, and make the most of
available talent?
»» Versatility: Can your data pipeline architecture accommo-
date structured, semi-structured, and unstructured data
based on your needs, as well as batch and real-time data?
with an object storage service such as Amazon Simple Storage
Service (S3), Microsoft Azure Blob Storage, or Google Cloud Storage.
These tools can move the data while maintaining the same struc-
ture and values, producing a one-to-one copy of the source. First,
they establish the structure in the destination location, usually a
table or some other type of database object within a cloud repos-
itory. Then, they incrementally move the data. They manage the
initial bulk data load to populate the destination database fol-
lowed by periodic incremental loads to merge new data and keep
the destination database up to date. These are generally two sep-
arate ETL processes you would have to create manually, so using
a replication tool is a big timesaver.
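A hedged sketch of the two processes a replication tool automates (standard SQL; names hypothetical):

-- One-time initial bulk load: a one-to-one copy of structure and values.
CREATE TABLE replica.customers AS SELECT * FROM source.customers;

-- Recurring incremental load: merge only new and changed rows.
MERGE INTO replica.customers r
USING source.customer_changes c
  ON r.customer_id = c.customer_id
WHEN MATCHED THEN UPDATE SET r.name = c.name, r.updated_at = c.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, name, updated_at)
  VALUES (c.customer_id, c.name, c.updated_at);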
Creating data transformations has two main phases: design and
execute. The following sections describe these phases.
Some cloud data platforms allow you to design these jobs with
SQL, others with Python, Java, or Scala. The talent and resources
available to you are key decision factors in choosing your plat-
form, but also look for extensibility of data pipelines so you can
allow users to bring in their own language of choice when design-
ing the transformation logic. This encourages more collaboration
for data engineering.
Cloud services today provide more options for compute and stor-
age resources. Look for a cloud data platform that allows you to
isolate your transformation workloads from each other and auto-
matically allocate appropriate compute capacity to handle the
integration and transformation work without degrading the per-
formance of your analytic workloads.
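On one cloud data platform, this isolation is expressed by giving each workload its own independently sized compute cluster (a hedged, Snowflake-style sketch; warehouse names hypothetical):

-- Dedicated compute for transformation jobs...
CREATE WAREHOUSE transform_wh WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60;
-- ...so heavy ELT work never degrades BI dashboard performance.
CREATE WAREHOUSE bi_wh WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60;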
Whenever possible, use ELT (instead of ETL) processes to push
resource-hungry transformation work to your destination plat-
form. This provides better performance, especially if that desti-
nation is a scalable cloud service. Ideally, you should process the
data where that data resides rather than moving it to an indepen-
dent server or storage mechanism.
CONTINUOUS IMPROVEMENT
CYCLE
To make sure all your software tools and utilities work with your auto-
mated CI/CD cycle, verify that these requirements have been met:
• Can you build data pipelines with leading data integration tools?
• Can you easily seed preproduction environments with production
data?
• Can you instantly create multiple isolated environments to do your
validations?
• Can you scale the data environment to run validation jobs quickly
and cost-effectively?
• Can you clone data and immediately spin up compute clusters for
development and testing purposes?
• Can you control your schema with change management tools and
keep track of versions for development, testing, and auditing
purposes?
• Can you automate CI/CD pipelines with your preferred software
automation tools?
• Can you restore data easily by rolling back to previous versions
and points in time?
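Several of these checks reduce to one-line operations on a modern cloud platform. A hedged, Snowflake-style sketch (object names hypothetical):

-- Seed a preproduction environment with production data, instantly.
CREATE DATABASE staging_db CLONE prod_db;
-- Restore by querying a previous point in time (time travel).
SELECT * FROM prod_db.public.orders AT (OFFSET => -3600);  -- as of one hour ago
-- Recover a dropped object within the retention window.
UNDROP TABLE prod_db.public.orders;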
collect, store, and analyze their data in one place, so they can eas-
ily obtain all types of insights from their data. And they want to
simplify and democratize the exploration of that data, automate
routine data management activities, and support a broad range of
data and analytics workloads.
By creating a single place for all types of data and all types of
data workloads, a cloud data platform can dramatically simplify
your infrastructure, without incurring the costs inherent in tradi-
tional architectures. For example, by centralizing data, you reduce
the number of stages the data needs to move through before it
becomes actionable, which eliminates the need for complex data
pipeline tools. By reducing the wait time for data, you allow users
to obtain the insights they need, when they need them, so they
can immediately spot business opportunities and address press-
ing issues. One unified platform can handle everything you used
to do in a data warehouse, data lake, and multiple data marts.
Modern data pipelines have five unique characteristics.
IN THIS CHAPTER
»» Working with existing data sets, applications, and resources
»» Encouraging self-sufficiency
Chapter 6
Six Steps to Building a Modern Data Engineering Practice
The previous chapters outline the various technology and
organizational decisions you need to take into account in
your data engineering endeavors. This chapter offers six
guidelines for putting those decisions into practice.
Longer-term initiatives focus on opening up new revenue opportu-
nities. In both cases, ask how you can extend your current technol-
ogy assets. What tools have you invested in? Where can you benefit
the most by replacing legacy tools with modern technology? Start
with one project, and move on to the next. Gradually, think about
how you can establish an extensible architecture that leverages the
data, tools, and capabilities you have in place while incorporating
the modern tools, processes, and procedures described in this book.
4. Don’t Make Data Governance an Afterthought
Once you have alignment with business and IT, identify product
owners to oversee data quality. Remember the fundamentals of
data governance, data security, curation, lineage, and other data
management practices outlined in Chapter 4. Does your organiza-
tion already have a DevOps strategy? Find out who spearheads this
effort and if they are familiar with the principles of DataOps as
well. DataOps practices help set good data governance foundations
so you can empower users to self-serve as they prepare, explore,
analyze, and model fresh, high-quality data.
The best cloud data platforms include scalable pipeline services
that can ingest streaming and batch data. They enable a wide
variety of concurrent workloads, including data warehouses, data
lakes, data pipelines, and data exchanges, and they facilitate
business intelligence, data science, and analytics applications.
WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.