Do We Need The Lakehouse Architecture?
Vu Trinh · Published in Data Engineer Things · Apr 20, 2024
Table of contents
Intro
Challenges and Context
The Motivation
The Lakehouse Architecture
Implementation
Outro
Intro
I first heard the term “Lakehouse” in 2019 while scrolling through the Dremio
documentation. With a conservative mind, I assumed it was just another marketing
term. Five years later, it seems like everybody is talking about the Lakehouse (after
they finish discussing AI :D); all major cloud data warehouses now support reading the
Hudi, Iceberg, or Delta Lake formats directly in object storage, and BigQuery even has
a dedicated query engine for this task. The innovation doesn’t stop there: Apache
XTable (formerly OneTable) provides abstractions and tools for translating
Lakehouse table format metadata, and Confluent recently announced TableFlow,
which feeds Apache Kafka data directly into the data lake, warehouse, or analytics
engine as Apache Iceberg tables. This made me re-examine my old assumption:
was the Lakehouse just a marketing term?
This week, we will answer that question with my notes from the paper Lakehouse: A
New Generation of Open Platforms that Unify Data Warehousing and Advanced
Analytics.
Challenges and Context
Data warehousing was first introduced to help business users get analytical insights
by consolidating data from operational databases into a centralized warehouse.
Analysts use this data to support business decisions. Data is written with
schema-on-write to ensure the data model is optimized for BI consumption.
This is the first-generation data analytics platform.
In the past, organizations typically coupled computing and storage to build data
warehouses on-premises. This forced enterprises to pay for more hardware whenever
analytics demand or data size grew. Moreover, data no longer comes only in tabular
form; it can be video, audio, or text documents. This unstructured data caused
massive trouble for warehouse systems, which were designed to handle structured
data.
Second-generation data analytics platforms came to the rescue. People started
putting all the raw data into data lakes: low-cost storage systems with a file interface
that hold data in open formats such as Apache Parquet, CSV, or ORC. This approach
began with the rise of Apache Hadoop, which used HDFS for storage. Unlike the data
warehouse, the data lake is a schema-on-read architecture that allows flexibility in
storing data, though it introduces challenges for data quality and governance. In this
approach, a subset of the data in the lake is later ETLed into the warehouse, so
analytics users can leverage the warehouse’s power to mine valuable insights. From
2015 onwards, cloud object storage, such as S3 or GCS, started replacing HDFS,
offering superior durability and availability at an extremely low cost. The rest of the
architecture stayed mostly the same in the cloud era for the second-generation
platform, with a data warehouse such as Redshift, Snowflake, or BigQuery. This
two-tier data lake + warehouse architecture dominated the industry at the time the
paper was written (and I guess it still dominates today). Despite this dominance, the
architecture encounters the following challenges:
Reliability: Keeping the data lake and the warehouse consistent is difficult and
costly, requiring considerable engineering effort to ETL data between the two systems.
Data staleness: The data in the warehouse is stale compared to the lake’s data.
This is a step back from first-generation systems, where new operational data was
immediately available for analytics.
Total cost of ownership: In addition to paying for ETL pipelines, users pay for
storage twice, because the data is duplicated in the data lake and the data warehouse.
Note from me: The point “Limited support for advanced analytics” no longer reflects
reality, given how heavily major cloud data warehouses like BigQuery, Snowflake, and
Redshift now support machine learning workloads. Feel free to discuss this with me in
the comments if you don’t think so.
The Motivation
The paper argues that a new paradigm, referred to as the Lakehouse, can solve some
of these data warehousing challenges. Databricks believes the Lakehouse will get more
attention thanks to recent innovations that address its fundamental problems:
Reliable data management on data lakes: Like data lakes, the Lakehouse must
be able to store raw data and support ETL/ELT processes. Initially, data lakes just
meant “a bunch of files” in various formats, making it hard to offer key management
features of data warehouses, such as transactions or rollbacks to old table versions.
However, systems such as Delta Lake or Apache Iceberg provide a transactional layer
for the data lake and enable these management features. In this case, there are fewer
ETL steps overall, and analysts can also quickly and performantly query the raw data
tables if needed, as in first-generation analytics platforms.
Support for machine learning and data science: ML systems’ support for direct
reads from data lake formats gives them efficient access to the data in the
Lakehouse (a minimal read sketch follows this list).
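To make that second point concrete, here is a minimal sketch (not from the paper) of reading a Lakehouse table directly for an ML workflow with PySpark and the open-source delta-spark package; the bucket path and column names are hypothetical.

```python
# Minimal sketch: read a Delta table straight from object storage for ML,
# instead of exporting data from a warehouse first. Assumes the delta-spark
# package is installed; the path and columns below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-ml-read")
    # Enable the Delta Lake extensions so the "delta" format is available.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# The ML side reads the same open-format files the SQL engines query; here we
# simply collect a feature subset into pandas for a downstream training step.
features = (
    spark.read.format("delta")
    .load("s3://my-lake/events")          # hypothetical Delta table location
    .select("user_id", "feature_1", "feature_2", "label")
    .toPandas()
)
print(features.head())
```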
The paper also highlights two pain points of the current architecture that push the
industry in this direction:
Data quality and reliability are the top challenges reported by enterprise data
users. Implementing efficient data pipelines is hard, and dominant data
architectures that separate the lake and the warehouse add extra complexity to this
problem.
Data warehouses and lakes serve machine learning and data science
applications poorly.
Some current industry trends give further evidence that customers are unsatisfied
with the two-tier model:
All the major cloud data warehouses have added support for external tables in
Parquet and ORC format.
There is broad investment in SQL engines that run directly against the data lake,
such as Spark SQL or Presto.
However, these improvements solve only some of the problems of the
lake-plus-warehouse architecture: the lakes still lack essential management features,
such as ACID transactions, and the efficient data access methods needed to match
warehouse analytics performance.
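As an illustration of that trend, here is a small sketch of the “SQL engine directly on the lake” pattern using Spark SQL over raw Parquet files; the bucket path, view name, and columns are made up for this example.

```python
# Sketch: Spark SQL querying raw Parquet files in object storage, with no copy
# into a warehouse. Paths and columns are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-the-lake").getOrCreate()

# Expose the lake files as a queryable temporary view.
spark.read.parquet("s3://my-lake/sales/").createOrReplaceTempView("sales")

# Analysts query the lake directly -- but without a metadata layer there are
# no ACID transactions or indexes, which is exactly the gap the Lakehouse targets.
daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM sales
    GROUP BY order_date
    ORDER BY order_date
""")
daily_revenue.show()
```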
The Lakehouse Architecture
Databricks defines a Lakehouse as a data management system based on low-cost
storage that also provides the management and performance features of traditional
analytical DBMSs, such as ACID transactions, versioning, caching, and query
optimization. Thus, Lakehouses combine the benefits of both worlds. In the following
sections, we will learn about the possible Lakehouse design proposed by Databricks.
Implementation
Metadata Layer
Data lake storage systems such as S3 or HDFS only provide a low-level object store or
filesystem interface. Over the years, the need for data management layers has
emerged, starting with Apache Hive, which keeps track of which data files are part
of a given Hive table.
In 2016, Databricks started developing Delta Lake, which stores the information
about which objects belong to which table as a transaction log kept in the object
storage itself (compacted into Parquet checkpoints). Apache Iceberg, first introduced
at Netflix, uses a similar design. Apache Hudi, which started at Uber, is another
system in this area, focused on streaming ingest into data lakes. Databricks observes
that these systems provide similar or better performance than raw Parquet/ORC data
lakes while adding data management features such as transactions, zero-copy
cloning, and time travel.
One thing to note here: these systems are easy to adopt for organizations that already
have a data lake. For example, Delta Lake can turn an existing directory of Parquet
files into a Delta Lake table without moving data around, simply by adding a
transaction log over the existing files. In addition, metadata layers can help
implement data quality constraints. Delta Lake’s constraints API lets users apply
constraints on new data (e.g., a list of valid values for a specific column), and Delta’s
client libraries will reject records that violate them. Finally, metadata layers help
implement governance features such as access control; for example, the layer can
check whether a client may access a table before granting credentials to read the
table’s raw data.
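Here is a hedged sketch of both capabilities, in-place adoption and a quality constraint, using the open-source delta-spark API; the storage path and the constraint itself are hypothetical, and CHECK constraints require a reasonably recent Delta Lake release.

```python
# Sketch: adopt an existing Parquet directory as a Delta table, then declare a
# data quality constraint. Paths and the constraint are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("adopt-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# 1) In-place conversion: Delta only writes a transaction log over the existing
#    Parquet files; no data is copied or moved.
DeltaTable.convertToDelta(spark, "parquet.`s3://my-lake/events`")

# 2) Constraint on new data: writes that violate the check are rejected.
spark.sql("""
    ALTER TABLE delta.`s3://my-lake/events`
    ADD CONSTRAINT valid_event_type
    CHECK (event_type IN ('click', 'view', 'purchase'))
""")
```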
SQL Performance
Although a metadata layer adds management capabilities, more is needed to match
the warehouse’s capabilities. SQL performance, where the engine runs directly on the
raw data, may be the most significant technical question for the Lakehouse approach.
Databricks proposes several techniques to optimize SQL performance in the
Lakehouse, all independent of the chosen data format:
Caching: With a metadata layer in place, the Lakehouse system can cache files
from the cloud object store on faster devices such as SSDs and RAM.
Auxiliary data: The Lakehouse can maintain auxiliary data to optimize queries. In
Delta Lake and Delta Engine, Databricks maintains column min-max statistics for
each data file, stored in the same Parquet file as the transaction log, which lets the
engine skip unnecessary data during the scanning phase. They are also implementing
a Bloom-filter-based index for the same data-skipping purpose.
Data layout: The Lakehouse can optimize many layout decisions. The first is
record ordering, which clusters related records so the engine can read them together.
Delta Lake supports ordering records along individual dimensions or along
space-filling curves such as the Z-order curve to provide locality across more than
one dimension (see the sketch after this list).
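As a rough sketch of the layout and data-skipping ideas, assuming Delta Lake 2.0+ where OPTIMIZE and Z-ordering are available in the open-source release; the table path and column names are hypothetical.

```python
# Sketch: Z-order an existing Delta table, then run a selective query that can
# benefit from the per-file min/max statistics in the transaction log.
# Assumes Delta Lake 2.0+; the path and columns are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("layout-optimization")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table = DeltaTable.forPath(spark, "s3://my-lake/events")

# Cluster records along two dimensions with a Z-order curve so queries filtering
# on either column touch fewer files.
table.optimize().executeZOrderBy("event_date", "user_id")

# A selective filter now scans far fewer files, because the engine can skip
# files whose min/max ranges do not overlap the predicate.
recent = (
    spark.read.format("delta")
    .load("s3://my-lake/events")
    .where("event_date = '2024-04-01'")
)
print(recent.count())
```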
These optimizations work well together for the typical access patterns of analytical
systems. In analytics workloads, most queries focus on a “hot” subset of the data,
which benefits from caching. For “cold” data in a cloud object store, the critical
performance factor is the amount of data scanned per query; combining data layout
optimizations with auxiliary data structures allows the Lakehouse system to minimize
that I/O.
Outro
If a solution solves a real problem, it is not just a cliché term. The Lakehouse was
initially introduced to relieve the pain point of the two-tier architecture: maintaining
two separate systems for storage (the lake) and analytics (the warehouse). By
bringing analytics power directly to the lake, the Lakehouse paradigm has to deal
with its most challenging problem: query performance. Doing analytics directly on
raw data means the engine doesn’t know much about the data beforehand, which
complicates optimization. Thanks to recent innovations in open table formats like
Hudi, Iceberg, and Delta Lake, the Lakehouse seems to keep up with the traditional
warehouse in the performance competition. It will be exciting to watch whether the
Lakehouse keeps rising, co-exists with the lake-warehouse paradigm, or replaces the
two-tier architecture completely; who knows?
References
[1] Databricks, Lakehouse: A New Generation of Open Platforms that Unify Data
Warehousing and Advanced Analytics (2020).
My newsletter is a weekly blog-style email in which I note things I learn from people
smarter than me.
So, if you want to learn and grow with me, subscribe here: https://vutr.substack.com.