
Flash: A Next-gen Vectorized Stream Processing Engine Compatible with Apache Flink

This article is based on a presentation by Mr. Wang Feng (nickname: Mowen), senior director at Alibaba Cloud and head of the open source big data department, in the open source big data session at the Apsara Conference 2024. Key topics include:

  1. Apache Flink Has Become the De Facto Standard for Stream Processing
  2. Core Technologies of Flash
  3. Performance Improvement of Flash
  4. Business Applications of Flash within Alibaba Group

Today I'll be sharing the latest technical advancements in real-time computing from the open source big data team of Alibaba Cloud: Flash, our native, vectorized stream processing engine. Fully compatible with Apache Flink, Flash represents the next generation of stream processing. I'll cover the motivations behind its development, its core technologies, current achievements, and its successful application within Alibaba Group.

1. Apache Flink Has Become the De Facto Standard for Stream Processing

As a pioneering adopter and advocate of Apache Flink, Alibaba has accumulated over a decade of experience in stream processing. Early stream processing engines, such as Apache Storm, emerged during the Hadoop era. Although groundbreaking, Apache Storm faced limitations, particularly in state management. Its stateless nature hindered its effectiveness in accuracy-critical data processing scenarios due to the lack of efficient state management. Subsequently, Apache Spark Streaming, built upon Apache Spark, a batch processing engine, emerged. Apache Spark Streaming provided an alternative approach to stream processing by using a micro-batch model. However, this inherent micro-batching introduced latency, impacting both performance and throughput. Moreover, Apache Spark Streaming could not fully achieve accurate stream processing semantics.

Flink addressed many longstanding challenges in stream processing. Flink provides exceptional real-time stream processing capabilities. With native support for low-latency, high-throughput, and stateful computations, Flink effectively handles complex event-time and out-of-order data. This makes it a powerful solution for real-time data analytics. Since joining the Apache Software Foundation in 2014, Apache Flink has evolved over the past ten years to become a de facto standard for stream processing.

The core architecture of Apache Flink can be illustrated with a simple diagram. Apache Flink is a streaming execution engine that also provides unified batch and stream processing. It can process both streaming and batch data by using a single execution engine. Furthermore, Apache Flink offers unified APIs for both stream and batch processing, simplifying development. Its API ecosystem includes SQL, Java, and Python, further enhancing developer productivity.

The technical architecture of Apache Flink has advanced significantly in recent years. For example, the rise of Flink Change Data Capture (CDC) for real-time data synchronization has driven broader adoption of Apache Flink for data synchronization and integration. Apache Flink is also widely used in classic machine learning tasks, with a robust and growing ecosystem. Furthermore, Apache Flink seamlessly integrates with Kubernetes, fully embracing containerization and cloud-native deployments.

Apache Flink acts as a powerful connector, bringing together the diverse data stores that make up the big data landscape. But before we dive into Flink itself, let's take a step back and consider the fundamental question: where does all this big data actually originate? The storage and management of data is the first step in big data processing. Data is typically stored in databases, message queues, data lakes, or data warehouses. However, data only generates value when it flows. Apache Flink plays an important role in the entire open source big data ecosystem. Its greatest value lies in enabling real-time data streaming between various data sources.

Flink connectors can connect to various types of data sources, and Flink has become the central bridge of the entire big data ecosystem. The Apache Flink ecosystem is mature and comprehensive, facilitating data export and movement between diverse storage systems. This highlights the current role of Apache Flink in the big data landscape.

As you can see, due to its robust architecture and thriving ecosystem, Apache Flink has become the de facto standard for stream processing, both commercially and within the open source community. Alibaba was an early adopter of Apache Flink. Since 2016, Apache Flink has been deployed within Alibaba Group to support diverse business units and industry verticals, such as e-commerce, logistics, travel, and mapping.

In recent years, the adoption of Apache Flink has grown rapidly across various industries in China, including the Internet, finance, logistics, transportation, and automotive. Internationally, Apache Flink has gained widespread recognition and adoption, becoming the de facto standard for stream processing in regions like North America, Europe, and Southeast Asia, and solidifying its position as a global streaming processing standard.

In 2023, Apache Flink received the prestigious SIGMOD Systems Award, recognizing its outstanding stream processing model design and widespread industry adoption. This recognition reinforces the position of Apache Flink as the de facto standard for stream processing. Alibaba Cloud leverages Apache Flink to power its real-time computing products, offering enhanced services to our users.

So, why develop Flash, a next-gen vectorized stream processing engine that delivers much better performance while remaining fully compatible with Apache Flink?

First, Apache Flink has become the de facto standard. Users inevitably demand Flink compatibility so that they are not bound to a specific platform and can seamlessly connect with the various upstream and downstream systems that Apache Flink supports. Adhering to open source standards and maintaining full compatibility with Apache Flink are therefore crucial.

Second, enterprise users want a unified data development and O&M platform in the cloud that makes the product easy to use and eliminates the need to build a complex stream data processing stack from scratch. Our current Realtime Compute for Apache Flink offering already addresses these needs, and many users have adopted it. In today's economic climate, cost-efficiency is paramount. Users expect cloud services to be more cost-effective than on-premises infrastructure. Their goal is cost-effective data processing and analytics, coupled with open-standards compatibility. Furthermore, a comprehensive service with guaranteed service level agreements (SLAs) is essential. These are the key requirements of enterprise users.

The biggest challenge and the hardest requirement to satisfy is achieving 100% compatibility with open source standards while simultaneously delivering significantly better performance and lower costs. As a cloud provider, we have invested heavily in our Realtime Compute for Apache Flink offering. Over the years, we have launched an enterprise-grade Flink engine that significantly outperforms the open source version. However, we have found this still falls short of meeting the needs of many enterprise customers. Current Apache Flink engine optimizations primarily focus on engineering improvements to the Java codebase of open source Apache Flink, limiting potential performance gains. Therefore, we are developing a new generation of native and vectorized stream processing engine to achieve greater performance improvements and overcome existing bottlenecks, while maintaining Apache Flink compatibility. This is the context for the development of our new engine.

In batch processing, similar requirements have emerged. For example, Databricks has significantly optimized Apache Spark with its Photon engine. This has resulted in substantial performance gains, with Photon outperforming open source Apache Spark by several times. In the open source community, Facebook's Velox, a vectorized operator library, can be integrated with Apache Spark to leverage a C++ backend. Intel's Gluten project further facilitates this integration, effectively enabling a native Apache Spark engine.

These optimizations leverage Single Instruction, Multiple Data (SIMD) vectorization. SIMD allows a single instruction to concurrently process multiple records, significantly improving performance compared to traditional serial processing where one instruction operates on one record at a time. By fully utilizing CPU hardware capabilities, vectorization drastically accelerates computations and is a widely adopted optimization technique.
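
To make the idea concrete, here is a minimal, self-contained C++ sketch of the technique (illustrative only, not Flash's actual code): with AVX2, a single compare instruction evaluates eight 32-bit records at once, where the scalar loop handles one record per iteration.

```cpp
#include <immintrin.h>  // AVX2 intrinsics; compile with -mavx2
#include <cstdint>

// Scalar baseline: one comparison per record.
int count_greater_scalar(const int32_t* data, int n, int32_t threshold) {
    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (data[i] > threshold) ++count;
    }
    return count;
}

// SIMD version: one compare instruction processes eight records at a time.
int count_greater_simd(const int32_t* data, int n, int32_t threshold) {
    const __m256i t = _mm256_set1_epi32(threshold);
    int count = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i v  = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(data + i));
        __m256i gt = _mm256_cmpgt_epi32(v, t);  // 8 comparisons at once
        int mask   = _mm256_movemask_ps(_mm256_castsi256_ps(gt));
        count += __builtin_popcount(mask);      // GCC/Clang builtin
    }
    for (; i < n; ++i) {  // scalar tail for leftover records
        if (data[i] > threshold) ++count;
    }
    return count;
}
```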

A truly vectorized, native Apache Flink engine was unprecedented in the stream processing domain before this project. While Apache Spark had seen similar optimizations, Apache Flink lacked any corresponding exploration. Therefore, we initiated this project two years ago to develop a native Apache Flink engine. For years, our team has been a key contributor to Apache Flink's technology development. We recognized the potential of integrating vectorized computing and the performance advantages of C++ into the computational model of Apache Flink, maximizing hardware utilization. This led to the creation of the Flash engine. After two years of development, we have achieved significant improvements and formally released Flash 1.0.

2. Core Technologies of Flash

Nexmark benchmarks show a 5x to 10x performance improvement of Flash 1.0 compared to open source Apache Flink. This is the context for today's Flash engine announcement. Next, I'll delve into the core technical design of the Flash vectorized stream processing engine and explain the reasons behind its significant performance advantage over the open source version.

This diagram illustrates the core architecture of Flash. The blue components represent the open source Flink framework, including the APIs and distributed runtime, all of which remain fully open source compatible. The orange components comprise the newly introduced native runtime kernel. We have maintained full compatibility with Flink tasks and Flink SQL by retaining the SQL API, Table API, and SQL optimizer of Apache Flink. We also retain some Flink Java runtime functionality to provide fallback execution for operators not yet supported natively, ensuring seamless migration. The core design of the engine revolves around three key layers: the Leno integration layer, the Falcon operator layer, and the ForStDB state storage layer. Falcon provides vectorized computation, while ForStDB offers vectorized state storage. Together, these three layers enable Flash to achieve significantly higher performance than the Java-based runtime of Apache Flink.

2.1 Leno layer

Leno, similar to Gluten in Apache Spark, decouples the streaming native runtime from the distributed framework of Apache Flink, enabling independent deployment of native operators. Leno generates a native execution plan based on user-submitted SQL queries. It leverages the Flink planner to determine whether all operators within the SQL query have native counterparts. If so, a complete C++ vectorized execution plan is generated. Otherwise, it falls back to a Java execution plan. This layer primarily focuses on framework integration and acts as a bridge between the Java and Native execution environments.
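
A simplified sketch of that decision logic follows. The names here (PhysicalPlan, NativeOpRegistry, canRunNative) are hypothetical; the real Leno layer works against Flink's optimized physical plan, but the fallback rule is the same: a single unsupported operator sends the whole query to the Java plan.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for Flink's optimized physical plan and the
// catalog of operators that Falcon implements natively.
struct PhysicalPlan {
    std::vector<std::string> operators;
};

struct NativeOpRegistry {
    std::unordered_set<std::string> native_ops{"Calc", "Filter", "HashAggregate"};
    bool hasNativeImpl(const std::string& op) const {
        return native_ops.count(op) > 0;
    }
};

// If every operator has a native counterpart, emit a complete C++
// vectorized execution plan; otherwise fall back to the Java plan.
bool canRunNative(const PhysicalPlan& plan, const NativeOpRegistry& registry) {
    for (const auto& op : plan.operators) {
        if (!registry.hasNativeImpl(op)) {
            return false;  // one unsupported operator forces Java fallback
        }
    }
    return true;
}
```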

2.2 Falcon vectorized operator layer

The core design of Flash centers around the Falcon vectorized operator layer and the ForStDB state storage layer. First, let's discuss the vectorized operator layer. This layer utilizes C++ for implementing vectorized operators and memory optimizations, ensuring that all computations are performed in a vectorized fashion. Within Apache Flink, operators are categorized as either stateless or stateful. Stateless operators, such as filters or string processors in stream processing, do not maintain state. Conversely, stateful operators, like those used for aggregations or streaming joins in stream processing, require state maintenance.

Within the Falcon layer, we reimplement numerous built-in data types, time functions, and string processing functions in C++. All operators operate in a vectorized fashion, leading to computational improvements. Based on analysis of internal streaming analytics workloads within Alibaba Group, the Flash engine currently covers over 80% of use cases. This indicates coverage of the majority of computation and arithmetic logic. We are actively developing the remaining operators to address a broader range of stream processing requirements.

Let me illustrate why the Falcon vectorized operator layer outperforms Java-based Flink. We leverage SIMD instructions for data parallelism. While stream processing conceptually handles records individually, the underlying implementation uses batch data in buffers. An upstream node processes a batch of data (such as 1,000 records, approximately 32 KB) and transmits it as a network buffer downstream. Downstream processing also occurs in batches of 1,000 records. The specific batch size for processing these 1,000 records (such as 10, 100, or 1,000) depends on the algorithm characteristics.
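
The following toy structure illustrates that buffering model under the assumptions stated above (roughly 1,000 records per buffer); the field names and layout are illustrative, not Flash's actual buffer format. A columnar (struct-of-arrays) layout is a natural choice because it lets a vectorized operator sweep each field contiguously with SIMD.

```cpp
#include <array>
#include <cstdint>

constexpr int kBatchSize = 1000;  // records per network buffer, per the text

// Columnar layout: each field of the buffered records is stored
// contiguously, so a downstream vectorized operator can scan a whole
// column at once instead of touching records one by one.
struct RecordBatch {
    std::array<int64_t, kBatchSize> event_times;  // 8 KB
    std::array<int32_t, kBatchSize> keys;         // 4 KB
    std::array<double,  kBatchSize> values;       // 8 KB
    int size = 0;  // number of valid records currently in the batch
};
```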

Leveraging SIMD instructions allows us to process multiple records concurrently, accelerating even single-record operations like string parsing and comparison. Vectorization enables simultaneous comparison of all data, significantly boosting computational efficiency. This parallels the benefits of batch processing. Applying this approach to Apache Flink, we optimized frequently used built-in functions, particularly those for string and time processing. These optimizations yielded performance improvements of tens or even hundreds of times. This stems from both the advantages of C++ and the efficiency gains of vectorized execution.

One notable highlight is our support for user-defined functions (UDFs). Within Alibaba Group, stream processing is widely used, with over 80% of use cases requiring UDFs. Without UDF support, many of these use cases would be impossible to implement. For example, open source batch processing engines like Velox often resort to the Java runtime environment when encountering UDFs, hindering optimization of user code. Recognizing the critical importance of UDF support, we prioritized it from the outset. Our approach allows us to leverage vectorized computation even with Java UDFs, eliminating the reliance on the Java runtime for UDFs. This capability provides significant advantages in production, enabling rapid prototyping and implementation of diverse business scenarios.
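
A minimal sketch of the batching idea behind UDF support follows; the actual mechanism Flash uses to bridge Java UDFs into the vectorized runtime is not detailed here, so this only shows how a black-box scalar function can be driven over a whole column batch, keeping the surrounding operators batched.

```cpp
#include <functional>
#include <vector>

// A user-defined scalar function: one input record, one output record.
using ScalarUdf = std::function<double(double)>;

// Drive the black-box UDF across an entire column batch in one call, so
// the operators around it keep their batched execution model even though
// the UDF itself processes one value at a time.
std::vector<double> applyUdfOverBatch(const std::vector<double>& column,
                                      const ScalarUdf& udf) {
    std::vector<double> out;
    out.reserve(column.size());
    for (double v : column) {
        out.push_back(udf(v));
    }
    return out;
}
```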

2.3 ForStDB layer

After introducing the Falcon vectorized operator layer, let's move on to another core technology: ForStDB, a state storage engine. Why develop this layer? Because Apache Flink is a stateful compute engine. So, what is state? During stream processing, such as page view (PV) and unique visitor (UV) statistics or streaming joins, these statistical values and intermediate data need to be stored; this is our state data. If the state data is not stored in a small database similar to the in-memory state backend of Apache Flink, these operations are not possible. In addition, if a task fails, we cannot recover it. Thus, an efficient state storage engine like ForStDB is essential.
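
As a concrete illustration of the state just described, here is a toy exact UV counter (production systems often use approximate sketches instead; the names here are illustrative). The map it holds is precisely the state that must live in a state backend to survive failures.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Toy per-page unique-visitor counter. The map below IS the state: if it
// lived only in transient memory with no state backend, a failure would
// lose all counts and the job could not be recovered.
class UvCounter {
public:
    void observe(int64_t page_id, int64_t user_id) {
        seen_[page_id].insert(user_id);
    }
    std::size_t uv(int64_t page_id) const {
        auto it = seen_.find(page_id);
        return it == seen_.end() ? 0 : it->second.size();
    }
private:
    std::unordered_map<int64_t, std::unordered_set<int64_t>> seen_;
};
```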

While Apache Flink offers built-in state backends, performance can be severely impacted when state size increases and spills to disk. Even with NVMe drives, disk access remains hundreds or even thousands of times slower than memory and CPU operations. Without optimization, this bottleneck can negate the benefits of vectorized computation. This performance disparity between state access and computation is a key differentiator between stream and batch processing, and adds significant complexity to vectorized stream processing engines. To address this challenge, we introduce ForStDB, a novel vectorized state storage engine designed specifically for stream processing. ForStDB, short for "For Streaming DB", aims to significantly boost performance and efficiency of stream processing.

When discussing state management in stream processing, it is well-known that state size can vary significantly depending on the use case. For simple aggregations like counting UVs within a single concurrent instance, the state can be relatively small, potentially fitting within 1 GB of memory, with a moderate user base. However, scenarios like streaming joins often require storing substantial amounts of data, resulting in much larger state sizes. Similarly, real-time procurement applications may need to retain rules and historical data for extended periods (such as a month or longer), leading to state sizes that can reach tens or even hundreds of gigabytes, exceeding available memory. Therefore, our design mirrors the approach of in-memory and disk-based databases, providing efficient storage solutions tailored to different state size requirements.

ForStDB is available in two versions, each tailored to different state storage needs. The Mini version is designed for small, dynamic data, leveraging in-memory storage for managing small to medium-sized state. The Pro version is designed for large-scale state management. We'll now detail the implementation of each version.

ForStDB Mini is an in-memory state store designed for statistics such as PVs and UVs. It utilizes a vectorized interface and batched output for all data access to increase throughput. Its performance relies on a modern index structure, similar to a large hash index, which facilitates both single and parallel lookups using SIMD. This gives it a significant performance advantage over traditional key-value stores and even outperforms the in-memory state backend of open source Apache Flink, especially in Java environments. Built entirely in C++ with arena-based memory pool management, ForStDB Mini offers superior memory efficiency compared to Java-based solutions. This makes it a production-ready, high-performance solution for state storage.
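
The sketch below shows what a batched (vectorized) state-access interface in the spirit of ForStDB Mini might look like; all names are hypothetical, and the std::unordered_map stands in for the SIMD-probed hash index described above.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

class MiniStateStore {
public:
    void put(int64_t key, int64_t value) { table_[key] = value; }

    // Batched lookup: one call resolves a whole batch of keys, amortizing
    // per-call overhead; the real store would probe its hash index with
    // SIMD to answer several probes in parallel.
    std::vector<std::optional<int64_t>>
    multiGet(const std::vector<int64_t>& keys) const {
        std::vector<std::optional<int64_t>> out;
        out.reserve(keys.size());
        for (int64_t k : keys) {
            auto it = table_.find(k);
            out.push_back(it == table_.end()
                              ? std::nullopt
                              : std::optional<int64_t>(it->second));
        }
        return out;
    }

private:
    std::unordered_map<int64_t, int64_t> table_;  // stand-in for the hash index
};
```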

The Pro version faces greater challenges due to the need to manage extremely large state datasets, which require both memory and disk storage. When the state data exceeds available memory, performance can be significantly impacted. Therefore, we have introduced asynchronous I/O capabilities, enabling asynchronous operations alongside existing vectorization and batch processing. We have also customized and optimized the Log-Structured Merge-Tree (LSM) architecture to leverage the characteristics of stream processing. This combination of asynchronous I/O and parallel processing accelerates state data access and improves overall efficiency.

Maintaining data order is a critical new challenge in stream processing once asynchronous execution is introduced. Unlike traditional key-value stores, where relationships between data points are not strictly enforced, stream processing requires ordering for these key-value accesses. Our framework ensures batched, in-order processing of stream data while also guaranteeing high processing efficiency. By addressing these challenges, we have combined the strengths of ForStDB's state storage with the Falcon operator layer, achieving significant performance improvements.
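
The sketch below illustrates the ordering constraint: state reads are issued asynchronously so disk latency overlaps, but results are collected in the original record order, so downstream operators observe the same sequence as a synchronous run. It uses std::async for brevity; the real ForStDB Pro scheduler is more elaborate, and readStateFromDisk is a hypothetical stand-in.

```cpp
#include <cstdint>
#include <future>
#include <vector>

// Stand-in for an asynchronous LSM-tree state read (hypothetical).
int64_t readStateFromDisk(int64_t key) { return key * 2; }

std::vector<int64_t> processBatchInOrder(const std::vector<int64_t>& keys) {
    // Issue all state reads concurrently so I/O latency overlaps...
    std::vector<std::future<int64_t>> pending;
    pending.reserve(keys.size());
    for (int64_t k : keys) {
        pending.push_back(std::async(std::launch::async, readStateFromDisk, k));
    }
    // ...then collect results strictly in arrival order, preserving the
    // record ordering that stream semantics require.
    std::vector<int64_t> results;
    results.reserve(keys.size());
    for (auto& f : pending) {
        results.push_back(f.get());
    }
    return results;
}
```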

3. Performance Improvement of Flash

We used the open source Nexmark benchmark to evaluate performance. Nexmark is a widely recognized stream processing benchmark available on GitHub. We compared Flash 1.0 with the latest open source Apache Flink 1.19, both running on Alibaba Cloud. Open source Apache Flink was deployed on Elastic Compute Service (ECS) instances to simulate a user-managed environment, while Flash 1.0 ran on our fully managed serverless platform, using the same number of compute units (CUs). This ensured a fair comparison with equivalent hardware resources.

We tested with datasets of 100 million and 200 million input records to simulate different stream processing scales. The 100 million record dataset represents a small to medium scale, with a smaller state size, suitable for testing with ForStDB Mini. The 200 million record dataset represents a larger scale, where state management may require disk spilling, and thus uses ForStDB Pro. We observed a performance improvement of more than 5x compared to open source Apache Flink across both scales, exceeding 8x for the smaller dataset. These results are reproducible. Our test environment, methodology, and datasets are publicly available. We encourage you to reproduce these results, and we will provide opportunities for hands-on validation in the future.

As mentioned at the beginning, Apache Flink, a widely recognized and de facto standard stream-batch unified engine, excels in stream processing. Its execution engine also brings significant advantages to batch computing by leveraging performance optimizations from stream processing. We benchmarked batch performance using the TPC-DS benchmark with a 10 TB dataset on ECS instances. Maintaining a consistent testing environment and procedure, we compared open source Apache Flink 1.19 and Apache Spark 3.4 against our product, utilizing the same number of CUs. Results demonstrate that our product outperforms both open source Apache Flink and Apache Spark by over 3x, even in batch processing scenarios. These results are reproducible and verifiable. These stream and batch processing benchmarks demonstrate the significant performance and cost-effectiveness advantages of our C++ vectorized Flash engine compared to purely open source solutions.

4. Business Applications of Flash within Alibaba Group

Beyond theoretical and benchmark results, I will share the real-world business applications of the Flash engine from Alibaba's production environment. Since the beginning of this year, the Flash engine has been gradually rolled out within Alibaba through continuous online refinement and iteration. As of September, Flash has covered over 100,000 CUs' worth of experimental services within Alibaba. The production traffic covers Alibaba's main business scenarios, including Tmall, Cainiao, Lazada, Fliggy, AMAP, and Ele.me. Business applications involve scenarios such as user PV and UV statistics, business intelligence (BI), advertisement performance monitoring, personalized real-time recommendations, and order and logistics tracking. The results demonstrate a 50% cost reduction for participating business units.

We are confident that significant resource savings will be realized once fully deployed. This new engine boasts a robust theoretical design, validated by laboratory testing and compelling production results. We are therefore excited to launch this technology on Alibaba Cloud, empowering SMEs and cloud-native enterprises. Specifically, users of open source Apache Flink can seamlessly leverage the new vectorized Flash engine, achieving both cost reduction and performance gains without modifying their existing code.

Given the significant code refactoring involved in developing this entirely new, predominantly C++ engine, we will adopt a phased rollout strategy. We plan to begin with an invitational preview, followed by a public preview, culminating in general availability. Interested customers and developers are invited to contact their account managers to participate in the preview programs and learn more about the upcoming launch. We are eager to release the public beta and subsequently the production-ready version as soon as possible. We hope this new stream processing engine will be a valuable asset to you. Thank you.
