15 big data tools and technologies to know about in 2021


Lots of tools are available to use in big data applications. Here's a look at the key features of 15 open source options to
see if they fit your needs.

Mary K. Pratt
Published: 11 May 2021

The world of big data is only getting bigger. Organizations of all stripes are producing more data year after year,
and they're finding more ways to use that data to improve operations, better understand customers, and deliver
products faster and at lower costs, among other applications. In addition, business executives looking to get
value from data faster are seeking real-time analytics capabilities.

That's all driving significant investments in big data tools and technologies. A report published in January 2021
by IT research and analysis firm Mind Commerce estimated that the global market for big data in business
intelligence applications will amount to $50.4 billion by 2026.

The list of big data technologies is long, with numerous commercial products available to help organizations
implement a full range of data-driven analytics initiatives -- from real-time reporting to machine learning
applications.

In addition, there are many open source big data tools, some of which are also offered in commercial versions or
as part of big data platforms and managed services. Here's a look at 15 popular open source tools and
technologies for managing and analyzing big data, listed in alphabetical order with a summary of their key
features and capabilities.

1. Delta Lake
Databricks Inc., a software vendor founded by the creators of the Spark processing engine, developed Delta
Lake and then open sourced the Spark-based technology in 2019 through the Linux Foundation. The
company describes Delta Lake as "an open format storage layer that delivers reliability, security and
performance on your data lake for both streaming and batch operations."

Delta Lake doesn't replace data lakes; rather, it's designed to sit on top of them and create a single home for
structured, semistructured and unstructured data, eliminating data silos that can stymie big data applications.
Furthermore, using Delta Lake can help prevent data corruption, enable faster queries, increase data freshness
and support compliance efforts, according to Databricks. The technology supports ACID transactions, stores
data in an open Apache Parquet format and includes Spark-compatible APIs.
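To make that concrete, here is a minimal sketch of writing and reading a Delta table from PySpark. It assumes the delta-spark package and its JARs are available to the Spark session; the table path is a placeholder.

```python
# A minimal sketch of writing and reading a Delta Lake table with PySpark,
# assuming the delta-spark package is installed and its JARs are on Spark's classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-demo")
    # These two settings register Delta's SQL extensions and catalog (assumed setup).
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate()
)

# Write a small DataFrame as a Delta table: ACID transactions over Parquet files
# plus a transaction log, sitting on top of ordinary data lake storage.
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# Read it back with the same Spark-compatible API.
spark.read.format("delta").load("/tmp/events_delta").show()
```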

2. Drill
The Apache Software Foundation's Drill website describes it as "a low latency distributed query engine for large-
scale datasets, including structured and semi-structured/nested data." Drill can scale across thousands of
cluster nodes and is capable of querying petabytes of data by using SQL and standard connectivity APIs.

Designed for exploring sets of big data, Drill layers on top of multiple data sources, enabling users to query a
wide range of data in different formats, from Hadoop sequence files and server logs to NoSQL databases and
cloud object storage. It can also access most relational databases through a plugin, and it works with commonly
used BI tools, such as Tableau and Qlik. Although Drill requires Apache's ZooKeeper software to maintain
information about clusters, it can run in any distributed cluster environment.
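The sketch below shows one way to send a standard SQL query to Drill over its REST API. It assumes a local Drill instance in embedded mode on the default port and a sample JSON file readable through the built-in dfs storage plugin; the file path and column names are placeholders.

```python
# A hedged sketch of querying Apache Drill over its REST API with standard SQL,
# assuming Drill runs locally in embedded mode on the default port (8047).
import requests

sql = "SELECT t.user_id, t.event FROM dfs.`/tmp/events.json` AS t LIMIT 10"

resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL", "query": sql},
    timeout=30,
)
resp.raise_for_status()

# Drill returns column metadata plus a list of row dictionaries.
for row in resp.json().get("rows", []):
    print(row)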

3. Flink
Another Apache open source technology, Flink is a stream processing framework for distributed, high-
performing and always-available applications. It supports stateful computations over both bounded and
unbounded data streams and can be used for batch, graph and iterative processing.

One of the main benefits touted by Flink's proponents is its speed: It can process millions of events in real time
for low latency and high throughput. Flink, which is designed to run in all common cluster environments,
provides three layers of APIs and a set of libraries for complex event processing, machine learning and other
common big data use cases.
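As an illustration, here is a small PyFlink sketch of a keyed, stateful computation. It assumes the apache-flink Python package is installed; a bounded collection stands in for the unbounded sources, such as Kafka, that a real job would read from.

```python
# A minimal PyFlink sketch of a keyed aggregation over an event stream,
# assuming the apache-flink package is installed.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A bounded collection stands in for an unbounded event stream in this toy example.
events = env.from_collection([("page_view", 1), ("click", 1), ("page_view", 1)])

# Key by event type and keep a running sum per key -- a simple stateful computation.
counts = events.key_by(lambda e: e[0]).reduce(lambda a, b: (a[0], a[1] + b[1]))

counts.print()
env.execute("event-count")
```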

4. Hadoop
A distributed framework for storing data and running applications on clusters of commodity hardware, Hadoop
was developed as a pioneering big data technology to help handle the growing volumes of structured,
unstructured and semistructured data. First released in 2006, it was almost synonymous with big data early on;
it has since been partially eclipsed by other technologies but is still widely used.

Hadoop has four primary components:

the Hadoop Distributed File System (HDFS), which splits data into blocks for storage on the nodes in a
cluster, uses replication methods to prevent data loss and manages access to the data;
YARN, short for Yet Another Resource Negotiator, which schedules jobs to run on cluster nodes and
allocates system resources to them;
MapReduce, a built-in batch processing engine that splits up large computations and runs them on different
nodes for speed and load balancing; and
Hadoop Common, a shared set of utilities and libraries.

Initially, Hadoop was limited to running MapReduce batch applications. The addition of YARN in 2013 opened it
up to other processing engines and use cases, but the framework is still closely associated with MapReduce.
The broader Apache Hadoop ecosystem also includes various big data tools and additional frameworks for
processing, managing and analyzing big data.
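To show the MapReduce model in its simplest form, here is a word-count mapper written for Hadoop Streaming, which lets MapReduce jobs run arbitrary executables. It assumes a companion reducer script that sums the emitted counts; the input and output paths are placeholders.

```python
# mapper.py -- a minimal Hadoop Streaming word-count mapper. A matching reducer.py
# would sum the counts per word. Both would be submitted with the streaming JAR, e.g.:
#   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
#       -mapper mapper.py -reducer reducer.py
import sys

# The map phase: emit a (word, 1) pair for every word read from stdin;
# YARN schedules many copies of this script across the cluster's nodes.
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```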

5. Hive
Hive is SQL-based data warehouse infrastructure software for reading, writing and managing large data sets in
distributed storage environments. It was created by Facebook but then open sourced to Apache, which
continues to develop and maintain the technology.

Hive runs on top of Hadoop and is used to process structured data; more specifically, it's used for data
summarization and analysis, as well as for querying large amounts of data. Although it can't be used for online
transaction processing, real-time updates, or queries and jobs that require low-latency data retrieval, Hive is
described by its developers as scalable, fast and flexible.
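A typical Hive workload is a batch summarization query over a large table. The sketch below runs one from Python using PyHive, a third-party client chosen here as an assumption, against a HiveServer2 instance on the default port; the table and columns are placeholders.

```python
# A hedged sketch of running a HiveQL summarization query via PyHive (assumed client)
# against HiveServer2 on its default port.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Batch summarization over a large table -- the kind of job Hive is built for.
cursor.execute(
    "SELECT country, COUNT(*) AS orders "
    "FROM sales WHERE year = 2021 "
    "GROUP BY country"
)
for country, orders in cursor.fetchall():
    print(country, orders)
```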

6. HPCC Systems
HPCC Systems is a big data processing platform developed by LexisNexis before being open sourced in 2011.
True to its full name -- High-Performance Computing Cluster -- the technology is, at its core, a cluster of
computers built from commodity hardware to process, manage and deliver big data.

A production-ready data lake platform that enables rapid development and data exploration, HPCC Systems
includes three main components:

Thor, a data refinery engine that's used to cleanse, merge and transform data, and to profile, analyze and
ready it for use in queries;
Roxie, a data delivery engine used to serve up prepared data from the refinery; and
Enterprise Control Language (ECL), a programming language for developing applications.

7. Hudi
Hudi (pronounced hoodie) is short for Hadoop Upserts Deletes and Incrementals. Another open source
technology maintained by Apache, it's used to manage the ingestion and storage of large analytics data sets on
Hadoop-compatible file systems, including HDFS and cloud object storage services.

First developed by Uber, Hudi is designed to provide efficient and low-latency data ingestion and data
preparation capabilities. Moreover, it includes a data management framework that organizations can use to
simplify incremental data processing and data pipeline development, improve data quality and manage the data
lifecycle.
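Hudi is commonly used through Spark's data source API. The sketch below shows an upsert-style write from PySpark; it assumes Spark is launched with the Hudi bundle JAR, and the option names reflect common Hudi configurations that can vary by version.

```python
# A hedged PySpark sketch of an upsert write to a Hudi table, assuming the Hudi
# bundle JAR is on Spark's classpath; table name, fields and path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-demo").getOrCreate()

df = spark.createDataFrame(
    [("u1", "2021-05-11 10:00:00", 42)], ["record_id", "ts", "value"]
)

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "record_id",
    # Hudi uses this field to pick the latest version of a record during upserts.
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/events")
```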

8. Iceberg
Iceberg is an open table format used to manage data in data lakes, which it does partly by tracking individual
data files in tables rather than by tracking directories. Created by Netflix for use with the company's petabyte-
sized tables, Iceberg is now an Apache project. According to the project's website, Iceberg typically "is used in
production where a single table can contain tens of petabytes of data."

Designed to improve on the standard layouts that exist within tools like Hive, Presto, Spark and Trino, the
Iceberg table format has functions similar to SQL tables in relational databases. However, it also accommodates
multiple engines operating on the same data set.
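The SQL-table-like behavior is easiest to see from Spark. The sketch below assumes the iceberg-spark-runtime package is available and configures a Hadoop-style catalog named "local" at session startup; the warehouse path and table names are placeholders.

```python
# A hedged PySpark sketch of creating and querying an Iceberg table through Spark SQL,
# assuming the iceberg-spark-runtime package and a local Hadoop-style catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg_warehouse")
    .getOrCreate()
)

# Iceberg tables behave much like SQL tables but track individual data files under the hood.
spark.sql("CREATE TABLE IF NOT EXISTS local.db.logs (id BIGINT, level STRING) USING iceberg")
spark.sql("INSERT INTO local.db.logs VALUES (1, 'INFO'), (2, 'ERROR')")
spark.sql("SELECT level, COUNT(*) AS n FROM local.db.logs GROUP BY level").show()
```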

9. Kafka
Kafka is a distributed event streaming platform that, according to Apache, is used by more than 80% of Fortune
100 companies and thousands of other organizations for high-performance data pipelines, streaming analytics,
data integration and mission-critical applications. In simpler terms, Kafka is a framework for storing, reading and
analyzing streaming data.

The technology decouples data streams and systems, holding the data streams so they can then be used
elsewhere. It runs in a distributed environment and uses a high-performance TCP network protocol to
communicate with systems and applications. Kafka was created by LinkedIn before being passed on to Apache
in 2011.
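The decoupling of producers and consumers is easiest to see in code. The sketch below uses the third-party kafka-python client, an assumed choice, against a broker on the default port; the topic name and payload are placeholders.

```python
# A minimal sketch of producing and consuming events with Kafka, using the
# kafka-python client (assumed) against a broker on localhost:9092.
from kafka import KafkaProducer, KafkaConsumer

# Producer: append events to a topic; Kafka retains them for any number of consumers.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"user": "u1", "action": "page_view"}')
producer.flush()

# Consumer: read the stream independently of the systems that produced it.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 seconds of inactivity in this demo
)
for message in consumer:
    print(message.value)
```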

10. Kylin
Kylin is a distributed data warehouse and analytics platform for big data. It provides an online analytical
processing, or OLAP, engine designed to support extremely large data sets. Because Kylin is built on top of
other Apache technologies -- including Hadoop, Hive, Parquet and Spark -- it can easily scale to handle those
large data loads, according to its backers.

It's also fast, delivering query responses measured in milliseconds. In addition, Kylin provides a simple interface
for multidimensional analysis of big data and integrates with Tableau, Microsoft Power BI and other BI tools.
Kylin was developed by eBay, which contributed it as an open source technology in 2015.
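Kylin queries can be submitted over JDBC or its REST API. The sketch below is a hedged example of the latter; the endpoint, default port, credentials and the sample project and table names are assumptions based on a typical local setup and may differ in a real deployment.

```python
# A hedged sketch of submitting a multidimensional SQL query to Kylin's REST API;
# port, credentials, project and table names are assumed placeholders.
import requests

resp = requests.post(
    "http://localhost:7070/kylin/api/query",
    json={
        "sql": "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
        "project": "learn_kylin",
    },
    auth=("ADMIN", "KYLIN"),
    timeout=60,
)
resp.raise_for_status()
print(resp.json().get("results", []))
```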

11. Presto
Formerly known as PrestoDB, this open source SQL query engine can simultaneously handle both fast queries
and large data volumes in distributed data sets. Presto is optimized for low-latency interactive querying and
scales to support analytics applications across multiple petabytes of data in data warehouses and other
repositories.

Development of Presto began at Facebook in 2012. When its creators left the company in 2018, the technology
split into two branches: PrestoDB, which was still led by Facebook, and PrestoSQL, which the original
developers launched. That continued until December 2020, when PrestoSQL was renamed Trino and PrestoDB
reverted to the Presto name. The Presto open source project is now overseen by the Presto Foundation, which
was set up as part of the Linux Foundation in 2019.
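An interactive Presto query from Python might look like the following sketch, which uses the presto-python-client package, an assumed choice; the host, catalog, schema and table are placeholders for a real cluster.

```python
# A hedged sketch of a low-latency interactive query against Presto, using the
# presto-python-client package (assumed); connection details are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cursor = conn.cursor()

# Presto fans the query out across the cluster and streams results back.
cursor.execute("SELECT region, COUNT(*) FROM orders GROUP BY region")
for region, cnt in cursor.fetchall():
    print(region, cnt)
```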

12. Samza
Samza is a distributed stream processing system that was built by LinkedIn and is now an open source project
managed by Apache. According to the project website, Samza enables users to build stateful applications that
can do real-time processing of data from Kafka, HDFS and other sources.
The system can run on top of Hadoop YARN or Kubernetes and also offers a standalone deployment option.
The Samza site says it can handle "several terabytes" of state data, with low latency and high throughput for fast
data analysis. It can also use the same code written for data streaming jobs to run batch applications. LinkedIn
open sourced Samza in 2013.

13. Spark
Spark is an in-memory data processing and analytics engine that can run on clusters managed by Hadoop
YARN, Mesos and Kubernetes or in a standalone mode. It enables large-scale data transformations and
analysis and can be used for both batch and streaming applications, as well as machine learning and graph
processing use cases, all supported by a set of built-in modules and libraries.

Data can be accessed from various sources, including HDFS, relational and NoSQL databases, and flat-file data
sets. Spark also supports various file formats and offers a diverse set of APIs for developers.

But its biggest calling card is speed: Spark's developers claim it can perform up to 100 times faster than
traditional counterpart MapReduce on batch jobs when processing in memory. As a result, Spark has become
the top choice for many batch applications in big data environments, while also functioning as a general-purpose
engine. First developed at the University of California, Berkeley, and now maintained by Apache, it can also
process on disk when data sets are too large to fit into the available memory.
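Here is a minimal PySpark sketch of a batch transformation, assuming a local Spark installation; the CSV path and columns are placeholders. The same DataFrame API also underpins Spark's streaming, machine learning and graph libraries.

```python
# A minimal PySpark batch job: read a flat-file data set, transform it in memory
# and aggregate the results. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-batch-demo").getOrCreate()

# Spark can just as easily read from HDFS, relational or NoSQL databases.
df = spark.read.csv("/tmp/sales.csv", header=True, inferSchema=True)

# In-memory transformation and aggregation distributed across the cluster.
summary = (
    df.filter(F.col("amount") > 0)
      .groupBy("country")
      .agg(F.sum("amount").alias("total_sales"))
)
summary.show()
```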

14. Storm
Another Apache open source technology, Storm is a distributed real-time computation system that's designed to
reliably process unbounded streams of data. According to the project website, it can be used for applications
that include real-time analytics, online machine learning and continuous computation, as well as extract,
transform and load (ETL) jobs.

Storm clusters are akin to Hadoop ones, but applications continue to run on an ongoing basis unless they're
stopped. The system is fault-tolerant and guarantees that data will be processed. In addition, the Storm site says
it can be used with any programming language, message queueing system and database.

15. Trino
As mentioned above, Trino is one of the two branches of the Presto query engine. Known as PrestoSQL until it
was rebranded in December 2020, Trino "runs at ludicrous speed," in the words of the Trino Software
Foundation. That group, which oversees Trino's development, was originally formed in 2019 as the Presto
Software Foundation; its name was also changed as part of the rebranding.

Trino enables users to query data regardless of where it's stored, and is built for both ad hoc interactive
analytics and long-running batch queries. Data from multiple systems can be combined in queries, and the
software works with Tableau, Power BI, R and other BI and analytics tools.
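That cross-system capability is what query federation looks like in practice. The sketch below uses the trino Python client, an assumed choice, to join a Hive-backed table with a PostgreSQL table in a single query; the catalogs, schemas and tables are placeholders that would need to be configured on the Trino server.

```python
# A hedged sketch of a federated Trino query joining data from two different systems,
# using the trino Python client (assumed); catalog and table names are placeholders.
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="analyst")
cursor = conn.cursor()

# One query combining a Hive-backed table with a PostgreSQL table, wherever each lives.
cursor.execute(
    "SELECT o.region, SUM(o.amount) AS total "
    "FROM hive.sales.orders AS o "
    "JOIN postgresql.public.regions AS r ON o.region = r.name "
    "GROUP BY o.region"
)
for row in cursor.fetchall():
    print(row)
```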

Next Steps
Hadoop vs. Spark: Comparing the two big data frameworks

12 must-have features for big data analytics tools


6 essential big data best practices for businesses
