Unit 1 Notes Final Part A
Presented By: Dr. Gaurav Agarwal
What is Data & Information?
An information system has six components: hardware, software,
database, telecommunications, people and procedures.
Hardware is the computer equipment used to perform input,
processing and output activities; the monitor, keyboard and CPU
are examples of hardware. Software consists of the computer
programs that govern the operation of the computer. It allows
processing payroll, sending bills to customers, and providing
managers with information to increase profit, reduce costs and
provide better customer service. Thirdly, a database is an
organized collection of facts and information, typically
consisting of two or more related data files. Telecommunications
is the electronic transmission of signals for communication,
which enables organizations to carry out their processes and
tasks through effective computer networks. The fifth component
of an IS is the people who manage or program the system. The
last component is procedures, which are the strategies, policies,
methods and rules for using the computer information system,
including operation, security and maintenance of the computer.
Good procedures can help companies take advantage of new
opportunities and avoid potential disasters. Auditing systems,
expert systems, customer relationship management systems,
payroll systems, applicant tracking systems and knowledge
management systems are examples of information systems.
Data – The Most Valuable Resource
“In its raw form oil has little value. Once processed and
refined, it helps power the world.”
– Ann Winblad
In fact, the computer and the Internet together have given data its digital form.
Digital data can be classified into three forms:
– Unstructured
– Semi-structured
– Structured
Characteristics of unstructured data:
– Does not conform to any data model
– Has no easily identifiable structure
– Cannot be stored in the form of rows and columns as in a database
– Does not follow any rules or semantics
– Is not in any particular format or sequence
– Is not easily usable by a program
Where does Unstructured Data Come from?
Web pages
Memos
Body of an e-mail
PowerPoint presentations
Chats
Reports
Whitepapers
Surveys
Broadly speaking, anything in a non-database form is unstructured
data.
Consider, for example, an email exchange between two colleagues, say Dr. Ben
and Dr. Stanley. Even though email messages like the ones exchanged by Dr. Ben
and Dr. Stanley are organized in databases such as Microsoft Exchange or
Lotus Notes, the body of the email is essentially raw data, i.e., free-form
text without any structure.
A lot of unstructured data is also noisy text such as chats, emails and
SMS texts.
The language of noisy text differs significantly from the standard
form of language.
A Myth Demystified
Indexing and searching: indexing becomes difficult as data grows, and
searching is difficult for non-text data.
How to Store Unstructured Data?
Change formats: unstructured data may be converted to formats which are more easily
managed, stored and searched. For example, IBM is working on providing a solution
which converts audio, video, etc. to text.
CAS (Content Addressable Storage): organize files based on their metadata.
How to Extract Information from Unstructured
Data?
Interpretation: unstructured data is not easily interpreted by conventional
search algorithms.
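As a small illustration of what even simple tools can extract from free-form text, here is a minimal sketch that pulls email addresses and dates out of raw text with regular expressions; the sample text and the patterns are purely illustrative assumptions, not part of the original notes.

```python
import re

# Free-form text, e.g. the body of an e-mail (illustrative sample).
raw_text = """Hi Stanley, let's meet on 2024-03-15 to review the survey results.
Please copy ben.carter@example.org and the project alias team-data@example.org."""

# Naive extraction rules: e-mail addresses and ISO-style dates.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

print("E-mails:", email_pattern.findall(raw_text))
print("Dates:  ", date_pattern.findall(raw_text))
```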
Do it Exercise
Search, think and write about two best practices for managing the growth of
unstructured data
Semi-structured Data
• Semi-structured data does not conform to any data model, i.e. it is difficult to
determine the meaning of the data, nor can the data be stored in rows and columns as
in a database. However, semi-structured data has tags and markers which help to group
the data and describe how it is stored. These give some metadata, but not enough for
the management and automation of data.
• Similar entities in the data are grouped and organized in a hierarchy. The
attributes or properties within a group may or may not be the same. For
example, two addresses may or may not contain the same number of properties, as
in
Address 1
<house number><street name><area name><city>
Address 2
<house number><street name><city>
• For example, an e-mail follows a standard format:
To: <Name>
From: <Name>
Subject: <Text>
CC: <Name>
Body: <Text, Graphics, Images etc. >
• The tags give us some metadata, but the body of the e-mail has no format of its
own and nothing that conveys the meaning of the data it contains (a small sketch of
reading such tagged headers follows below).
• There is a very fine line between unstructured and semi-structured data.
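The following is a minimal sketch, assuming a hypothetical raw message, of how Python's standard email module can read the tagged header fields while the body remains free-form text.

```python
import email

# Illustrative raw message; the header fields are the "tags" discussed above.
raw_message = """To: Dr. Stanley <stanley@example.org>
From: Dr. Ben <ben@example.org>
Subject: Greetings
CC: admin@example.org

Hello! How are you? The body is free-form text with no structure of its own.
"""

msg = email.message_from_string(raw_message)

# Headers are easy to read programmatically...
print(msg["From"], "->", msg["To"], "|", msg["Subject"])

# ...but the body is just unstructured text that a program cannot interpret further.
print(msg.get_payload())
```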
What is Semi-structured Data?
Characteristics of semi-structured data:
– Does not conform to a data model, but contains tags and elements (metadata)
– Cannot be stored in the form of rows and columns as in a database
– Similar entities are grouped
– Metadata is not sufficient
Where does Semi-structured Data Come from?
– XML
– TCP/IP packets
– Zipped files
– Binary executables
– Mark-up languages
– Graph-based data models
– XML schemas
Challenges faced
Implicit structure: in many cases the structure is implicit, so interpreting
relationships and correlations is very difficult.
Distinction between schema and data: a vague distinction between schema and data
exists at times, making it difficult to capture data.
How to Store Semi-structured Data?
Possible solutions
Special purpose DBMS: databases which are specifically designed to store
semi-structured data.
OEM (Object Exchange Model): data can be stored and exchanged in the form of a
graph, where entities are represented as objects which are the vertices of the graph.
How to Extract Information from Semi-structured Data?
Possible solutions include mark-up languages such as XML, whose tags can be exploited by parsers. For example:
<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you? </body>
</message>
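A minimal sketch, using Python's built-in xml.etree.ElementTree on the sample message above, of how the tags let a program pick out each field:

```python
import xml.etree.ElementTree as ET

xml_data = """
<message>
  <to>XYZ</to>
  <from>ABC</from>
  <subject>Greetings</subject>
  <body>Hello! How are you?</body>
</message>
"""

root = ET.fromstring(xml_data)

# The tags (metadata) let a program locate each piece of data directly.
for child in root:
    print(f"{child.tag}: {child.text}")
```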
Answer a Quick Question
Characteristics of structured data:
– Conforms to a data model
– Data is stored in the form of rows and columns (e.g., a relational database)
– Similar entities are grouped
– Definition, format and meaning of data are explicitly known
Where does Structured Data Come from?
– Spreadsheets
– SQL databases
– OLTP systems
Structured Data: Everything in its Place
(Figure: examples of semi-structured vs. structured data.)
Do it Exercise
Think and write about an instance where data was presented to you in
unstructured, semi-structured and structured format.
What Is Big Data Architecture?
To see how data flows through its systems and to ensure that the data is managed properly and meets
business needs for information, an organization needs a well-structured Big Data architecture. Data
architecture is one of the domains of enterprise architecture; when it is well structured, it connects
business strategy with technical implementation.
In practical terms, Big Data architecture can be seen as a model for data collection, storage,
processing, and transformation for subsequent analysis or visualization. The choice of an
architectural model depends on the basic purpose of the information system and the context of its
application, including the levels of processes and IT maturity, as well as the technologies currently
available.
The best-known paradigms are ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform),
used in conjunction with data lake, lakehouse, and data warehouse approaches.
1. Data Sources
Data sources, as the name suggests, are the sources of data for systems based on Big Data
architecture. These sources include software and hardware capable of collecting and storing data,
and the variety of data collection methods depends directly on the source. All big data solutions
start with one or more data sources, for example sensors, weblogs, social media feeds, and
operational databases.
2. Data Ingestion
The first step is to extract data from an external system or data source and ingest it into the Big
Data architecture platform for subsequent processing. In practical terms this means collecting data
from an external data source using a pull or a push approach:
• The pull approach is when your Big Data platform retrieves bulk data or individual
records from an external data source. Data is usually collected in batches if the external
data source supports it; in this case the system can better control the amount and
throughput of ingested data per unit of time.
• The push approach is when an external data source pushes data into your Big Data platform,
usually delivered as real-time events and messages. In this case the system should
support a high ingestion rate or use an intermediate buffer/event-log solution like Apache
Kafka as internal storage. A minimal sketch of the push approach is shown below.
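A minimal sketch of the push approach, assuming a hypothetical local broker and topic name and using the kafka-python client; this is an illustration under those assumptions, not part of the original notes:

```python
import json
from kafka import KafkaProducer  # kafka-python package; assumed available

# Hypothetical broker address and topic name.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# An external source pushing individual events into the ingestion layer.
event = {"user_id": 42, "action": "page_view", "ts": "2024-03-15T10:00:00Z"}
producer.send("ingest.events", value=event)
producer.flush()  # make sure the event is actually delivered before exiting
```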
3. Data Storage
Data for batch processing operations is typically stored in a distributed file store that can hold
high volumes of large files in various formats. This kind of store is often called a data lake;
options for implementing it include Azure Data Lake Store or blob containers in Azure Storage.
Data is persisted in a data lake, lakehouse, or distributed data storage as raw data. Raw data
means that the data keeps its original format and view, which guarantees that subsequent processing
does not lose the original information. A small sketch of a date-partitioned raw zone follows below.
• Data is then transferred to the next processing layer in the form of bulk items (batch
processing) or individual messages for real-time processing (with or without intermediate data
persistence).
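A minimal sketch of persisting raw records unchanged into a date-partitioned raw zone; the local directory layout and field names are illustrative assumptions (in practice the target would be S3, ADLS, HDFS, or similar):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical local "data lake" root; in practice this could be object storage.
RAW_ZONE = Path("datalake/raw/events")

def persist_raw(record: dict) -> Path:
    """Append a record, unchanged, to a date-partitioned raw zone."""
    now = datetime.now(timezone.utc)
    partition = RAW_ZONE / f"year={now:%Y}" / f"month={now:%m}" / f"day={now:%d}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "events.jsonl"
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")   # original format preserved
    return path

persist_raw({"user_id": 42, "action": "page_view", "ts": "2024-03-15T10:00:00Z"})
```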
4. Data Processing or Transformation
The next step is processing or transformation of previously ingested data. The main objectives of
such activity include:
• Transforming data into structured data format based on predefined schema
• Enriching and cleaning data, converting data into the required format
• Performing data aggregation
• Ingesting data into an analytical database, storage, data lake, lakehouse, or data
warehouse
• Transforming data from raw format into intermediate format for further processing
• Implementing Machine or Deep Learning analysis and predictions
4.1. Batch Processing
After a dataset has been stored over a period of time, it moves to the processing stage. Batch
processing presupposes that algorithms process and analyze datasets previously stored in an
intermediate distributed data storage like a data lake, lakehouse, or a distributed file system.
As a rule, batch processing is used when data can be processed in chunks or batches on a daily,
weekly, or monthly basis and end users can wait some time for the results. Batch processing uses
resources more effectively, but it increases the latency before processed data is available for
storage, further processing, and analysis. A minimal batch job is sketched below.
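A minimal sketch of a daily batch job using Apache Spark (PySpark), one of the engines named later in these notes; the paths and column names are illustrative assumptions, not a definitive pipeline:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical paths; the raw zone holds the ingested JSON events from earlier steps.
spark = SparkSession.builder.appName("daily-batch").getOrCreate()

raw = spark.read.json("datalake/raw/events")          # read a batch of raw data

daily_counts = (
    raw.withColumn("date", F.to_date("ts"))           # clean / enrich
       .groupBy("date", "action")                     # aggregate
       .count()
)

# Persist the structured result for the analytical layer (e.g. a curated zone).
daily_counts.write.mode("overwrite").parquet("datalake/curated/daily_action_counts")
```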
4.2. Stream Processing
Incoming data can also be presented as a continuous stream of events from any external data source,
which is usually pushed to the Big Data ingestion layer. In this case data is ingested and processed
directly by consumers in the form of individual messages.
This approach is used when the end user or an external system should see or use the result of the
computations almost immediately. The advantage of this approach is its low latency, which allows
data to be processed in a near real-time manner.
4.2.1 Near Real-Time Stream Processing
When a system is required to process data in a real-time manner, then processing is optimized to
achieve millisecond latency. These optimizations include memory processing, caching,
asynchronous persistence of input/output results in addition to classical near real-time stream
processing.
Real-time stream processing makes it possible to process individual events with maximum throughput
per unit of time, but at the cost of additional resources such as memory and CPU. For example,
Apache Flink processes each event immediately, applying the approaches mentioned above. A minimal
streaming sketch follows below.
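The notes mention Apache Flink; purely as an illustration of the same idea, here is a minimal sketch using Spark Structured Streaming to consume events from a hypothetical Kafka topic (it assumes the spark-sql-kafka connector is on the classpath):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Hypothetical broker and topic names.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "ingest.events")
         .load()
)

# Each Kafka record's value is bytes; decode it and process events as they arrive.
decoded = events.select(F.col("value").cast("string").alias("event_json"))

query = (
    decoded.writeStream
           .format("console")       # sink chosen only for demonstration
           .outputMode("append")
           .start()
)
query.awaitTermination()
```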
5. Analytical Data Store
Many big data solutions prepare data for analysis and then serve the
processed data in a structured format that can be queried using analytical tools. The analytical data
store used to serve these queries can be a Kimball-style relational data warehouse, as seen in most
traditional business intelligence (BI) solutions. Alternatively, the data could be presented through a
low-latency NoSQL technology such as HBase, or an interactive Hive database that provides a
metadata abstraction over data files in the distributed data store. Azure Synapse Analytics provides
a managed service for large-scale, cloud-based data warehousing. HDInsight supports Interactive
Hive, HBase, and Spark SQL, which can also be used to serve data for analysis.
6. Analytics and Reporting
The majority of Big Data solutions are built in a way that facilitates further analysis and reporting
in order to gain valuable insights. The analysis reports should be presented in a user-friendly format
(tables, diagrams, typewritten text, etc.), meaning that the results should be visualized. Depending
on the type and complexity of visualization, additional programs, services, or add-ons can be added
to the system (table or multidimensional cube models, analytical notebooks, etc.).
To achieve this goal, ingested and transformed data should be persisted in an analytical data store,
solution, or database in the appropriate format or structure optimized for faster ad-hoc queries, quick
access and scalability to support large number of users.
The first popular approach is the data warehouse, which is in essence a database optimized for read
operations using column-based storage, an optimized reporting schema and a SQL engine. This
approach is usually applied when the data structure is known in advance and sub-second query
latency is critical to support rich reporting functionality and ad-hoc user queries. Examples
include AWS Redshift, HP Vertica, ClickHouse, and Citus for PostgreSQL.
The next popular approach is the data lake. The original goal of a data lake was to democratize
access to data for different use cases, including machine learning algorithms, reporting, and
post-processing, all on the same ingested raw data. It works, but with some limitations. This
approach simplifies the overall solution because a data warehouse is not required by default, so
fewer tools and data transformations are needed. However, the performance of the engines used for
reporting is significantly lower, even for optimized formats such as Parquet, Delta, and Iceberg. A
typical example of this approach is a classical Apache Spark setup which persists ingested data in
Delta or Iceberg format and is queried with the Presto query engine.
The last trend is to combine both previous approaches in one, and it is known as a lakehouse. In
essence, the idea is to have a data lake with a highly optimized data format and storage plus a
vectorized SQL engine similar to a data warehouse, based on the Delta format, which supports ACID
transactions and versioning. For example, the Databricks enterprise version has achieved better
performance for typical reporting queries than classical data warehouse solutions.
(Figure: data warehouse vs. data lake vs. data lakehouse.)
7. Orchestration
Mostly, Big Data analytics solutions have similar repetitive business processes which include data
collection, transfer, processing, uploading in analytical data stores, or direct transmission to the
report. That’s why companies leverage orchestration technology to automate and optimize all the
stages of data analysis.
Different Big Data tools can be used in this area depending on goals and skills.
The first level of abstraction is the Big Data processing solutions themselves, described in the
data transformation section. They usually have orchestration mechanisms where a pipeline and its
logic are implemented directly in code based on the functional programming paradigm; for example,
Spark, Apache Flink and Apache Beam all have such functionality. This level of abstraction is very
powerful but requires programming skills and deep knowledge of Big Data processing solutions.
The next level is orchestration frameworks, which are still based on writing code to implement the
flow of steps for an automated process, but these tools require only basic knowledge of a language
to link steps together, without special knowledge of how a specific step or component is
implemented. Such tools provide a list of predefined steps with the ability to be extended by
advanced users. For example, Apache Airflow and Luigi are popular choices for many people who work
with data but have limited programming knowledge (a minimal Airflow sketch follows after this
section).
The last level is end-user GUI editors that allow orchestration flows and business processes to be
created with a rich editor of graphical components, which are linked visually and configured via
component properties. BPMN notations are often used for such tools in conjunction with custom
components to process data.
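A minimal sketch of the orchestration-framework level using Apache Airflow 2.x; the DAG id, task names and callables are hypothetical placeholders for real ingestion, transformation and loading logic:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():              # placeholder for the real ingestion logic
    ...

def transform():           # placeholder for the real transformation logic
    ...

def load_to_warehouse():   # placeholder for loading into the analytical store
    ...

with DAG(
    dag_id="daily_analytics_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",     # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="ingest", python_callable=ingest)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load_to_warehouse)

    t1 >> t2 >> t3         # ingest -> transform -> load, once a day
```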
Types of Big Data Architecture
To efficiently handle customer requests and perform tasks well, applications have to interact with
the data warehouse. In a nutshell, we will look at the two most popular Big Data architectures,
known as Lambda and Kappa, that serve as the basis for various corporate applications.
1. Lambda has been the key Big Data architecture. It separates real-time and batch
processing where batch processing is used to ensure consistency. This approach allows
implementing most application scenarios. But for the most part, the batch and stream
levels work with different cases, while their internal processing logic is almost the same.
Thus, data and code duplications may happen, which becomes a source of numerous
errors.
2. For this reason, the Kappa architecture was introduced, which consumes fewer
resources and is well suited to real-time processing. Kappa builds on Lambda by
combining the stream and batch processing models, with information stored in the
data lake. The essence of this architecture is to optimize data processing by
applying the same code to both processing models, which simplifies management and
avoids maintaining duplicate logic.
(Figure: Lambda architecture.)
Let’s take a glimpse at the most common Big Data tools and techniques used nowadays:
Distributed Storage and Processing Tools
Accommodating and analyzing expanding volumes of diverse data requires distributed database
technologies. Distributed databases are infrastructures that can split data across multiple physical
servers allowing multiple computers to be used anywhere. Some of the most widespread processing
and distribution tools include:
Hadoop
Big Data will be difficult to process without Hadoop. It’s not only a storage system, but also a set
of utilities, libraries, frameworks, and development distributions.
Spark
Spark is a solution capable of processing real-time, batch, and in-memory data for quick results.
The tool can run on a local system, which facilitates testing and development. Today, this powerful
open-source Big Data tool is one of the most important in the arsenal of top-performing companies.
Spark is created for a wide range of tasks such as batch applications, iterative algorithms, interactive
queries, and machine learning. This makes it suitable for both amateur use and professional
processing of large amounts of data.
No-SQL Databases
No-SQL databases differ from traditional SQL-based databases in that they support flexible
schemas. This simplifies handling vast amounts of all types of information, especially
unstructured and semi-structured data that are poorly suited to strict SQL systems. A minimal
sketch with a document database follows below.
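A minimal sketch of the flexible-schema idea using MongoDB via pymongo; the connection string, database and field names are illustrative assumptions:

```python
from pymongo import MongoClient

# Hypothetical local MongoDB instance and database/collection names.
client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["events"]

# Documents in the same collection may have different fields (flexible schema).
events.insert_one({"user_id": 42, "action": "page_view", "url": "/home"})
events.insert_one({"user_id": 43, "action": "purchase",
                   "items": [{"sku": "A-1", "qty": 2}], "total": 19.98})

# Query by a shared field even though the documents differ in shape.
for doc in events.find({"user_id": 42}):
    print(doc)
```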
Massively Parallel Processing (MPP) Databases
A feature of the massively parallel processing (MPP) architecture is the physical partitioning of
data memory combined into a cluster. When data is received, only the necessary records are selected
and the rest are eliminated so they do not take up space in RAM, which speeds up disk reading and
the processing of results. Predictive analytics, regular reporting, corporate data warehousing
(CDW), and calculating churn rate are some of the typical applications of MPP.
Cloud Computing Tools
Clouds can be used in the initial phase of working with Big Data, for conducting experiments with
data and testing hypotheses. It is easier to test new assumptions and technologies because you do
not need your own infrastructure. Clouds also make it faster and cheaper to launch a solution into
industrial operation with specific requirements, such as data storage reliability, infrastructure
performance, and others. As a result, more companies are moving their Big Data to clouds, which are
scalable and flexible.
5 Things to Consider When Choosing Big Data Architecture
When choosing a database solution, you have to bear in mind the following factors:
1. Data Requirements
Before launching a Big Data solution, find out which processing type (real-time or batch) is more
suitable for your business to achieve the required ingestion speed and extract the relevant data
for analysis. Don't overlook requirements such as response time, accuracy and consistency, and
fault tolerance, which play a crucial role in the data analytics process.
2. Stakeholders’ Needs
Identify your key external stakeholders and study their information needs to help them achieve
mission-critical goals. This presupposes that the choice of a data strategy must be based on a
comprehensive needs analysis of the stakeholders to bring about benefits to everyone.
3. Data Retention Periods
Data volumes keep growing exponentially, which makes their storage far more expensive and
complicated. To keep these costs under control, you have to determine the period within which each
data set can bring value to your business and, thereby, should be retained.
4. Open-Source or Commercial Big Data Tools
Open-source analytics tools will work best for you if you have the people and the skills to work
with them. This software is more tailorable to your business needs, as your staff can add features,
updates, and other adjustments and improvements at any moment.
If you don't have enough staff to maintain your analytics platform, opting for a commercial tool
can deliver more tangible outcomes. Here, you depend on a software vendor, but you get regular
updates and tool improvements, and can use their support services to solve arising problems.
5. Continuous Evolution
The Big Data landscape is quickly changing as the technologies keep evolving, introducing new
capabilities and offering advanced performance and scalability. In addition, your data needs are
certainly evolving, too.
Make sure that your Big Data approach accounts for these changes, meaning that your Big Data
solution should make it easy to introduce enhancements such as integrating new data sources, adding
new custom modules, or implementing additional security measures when needed.
Big Data Architecture Challenges
If built correctly, Big Data architecture can save money and help predict important trends, but as a
ground-breaking technology, it has some pitfalls.
Budget Requirement
A Big Data project can often be held back by the cost of adopting Big Data architecture. Your budget
requirements can vary significantly depending on the type of Big Data application architecture, its
components and tools, management and maintenance activities, as well as whether you build your
Big Data application in-house or outsource it to a third-party vendor. To overcome this challenge,
companies need to carefully analyze their needs and plan their budget accordingly.
Data Quality
When information comes from different sources, it’s necessary to ensure consistency of the data
formats and avoid duplication. Companies have to sort out and prepare data for further analysis with
other data types.
Scalability
The value of Big Data lies in its quantity but it can also become an issue. If your Big Data
architecture isn’t ready to expand, problems may soon arise.
• If infrastructure isn’t managed, the cost of its maintenance will increase hurting the
company’s budget.
• If a company doesn’t plan to expand, its productivity may fall significantly.
Security
Cyberthreats are a common problem, since hackers are very interested in corporate data. They may
try to inject fake information or view corporate data to obtain confidential information. Thus, a
robust security system should be built to protect sensitive information.
Skills Shortage
The industry is facing a shortage of data analysts due to a lack of experience and necessary skills in
aspirants. Fortunately, this problem is solvable today by outsourcing your Big Data architecture
problems to an expert team that has broad experience and can build a fit-for-purpose solution to
drive business performance.
Big Data Characteristics
The five core characteristics that define Big Data are Volume, Velocity, Variety, Veracity and
Value; several further Vs, discussed below, are also commonly cited.
1. VOLUME
Volume refers to the ‘amount of data’, which is growing day by day at a very
fast pace. The size of data generated by humans, machines and their
interactions on social media alone is massive. Researchers predicted that
40 zettabytes (40,000 exabytes) of data would be generated by 2020, an increase
of 300 times from 2005.
2. VELOCITY
Velocity is defined as the pace at which different sources generate data every
day. This flow of data is massive and continuous. Facebook, for example, has
reported 1.03 billion daily active users (DAU) on mobile, an increase of 22%
year over year. This shows how fast the number of users is growing on social
media and how fast data is generated daily. If you are able to handle the
velocity, you will be able to generate insights and take decisions based on
real-time data.
3. VARIETY
As there are many sources contributing to Big Data, the types of data they
generate differ. Data can be structured, semi-structured or unstructured, so a
great variety of data is generated every day. Earlier, data came mostly from
spreadsheets and databases; now it arrives in the form of images, audio, video,
sensor data and so on. This variety of unstructured data creates problems in
capturing, storing, mining and analyzing the data.
4. VERACITY
Veracity refers to data that is in doubt, or uncertainty about the available
data due to inconsistency and incompleteness. For example, a few values may be
missing from a table, and a few values may be hard to accept, such as a recorded
minimum value of 15000 where that is simply not possible. This inconsistency and
incompleteness is veracity.
Available data can sometimes be messy and difficult to trust. With many forms of
big data, quality and accuracy are difficult to control, for example Twitter
posts with hashtags, abbreviations, typos and colloquial speech. The volume is
often the reason behind the lack of quality and accuracy in the data.
5. VALUE
After discussing Volume, Velocity, Variety and Veracity, there is another V that
should be taken into account when looking at Big Data: Value. It is all well
and good to have access to big data, but unless we can turn it into value it is
useless. Turning it into value means asking: is it adding to the benefit of the
organizations analyzing the big data? Is the organization achieving a high ROI
(return on investment)? Unless working on Big Data adds to their profits, it is
useless.
6. Variability
Variability in big data's context refers to a few different things. One is the number of
inconsistencies in the data. These need to be found by anomaly and outlier detection
methods in order for any meaningful analytics to occur.
Big data is also variable because of the multitude of data dimensions resulting from
multiple disparate data types and sources. Variability can also refer to the inconsistent
speed at which big data is loaded into your database.
7: Validity
Similar to veracity, validity refers to how accurate and correct the data is for its
intended use. According to Forbes, an estimated 60 percent of a data scientist's time is
spent cleansing their data before being able to do any analysis. The benefit from big data
analytics is only as good as its underlying data, so you need to adopt good data
governance practices to ensure consistent data quality, common definitions, and
metadata.
8: Vulnerability
Big data brings new security concerns. After all, a data breach with big data is a big
breach. Does anyone remember the infamous AshleyMadison hack in 2015?
Unfortunately there have been many big data breaches. Another example, as reported
by CRN: in May 2016 "a hacker called Peace posted data on the dark web to sell, which
allegedly included information on 167 million LinkedIn accounts and ... 360 million
emails and passwords for MySpace users."
9: Volatility
How old does your data need to be before it is considered irrelevant, historic, or not
useful any longer? How long does data need to be kept for?
Before big data, organizations tended to store data indefinitely; a few terabytes of data
might not create high storage expenses and could even be kept in the live database
without causing performance issues. In a classical data setting, there might not even be
data archival policies in place.
Due to the velocity and volume of big data, however, its volatility needs to be carefully
considered. You now need to establish rules for data currency and availability as well as
ensure rapid retrieval of information when required. Make sure these are clearly tied to
your business needs and processes -- with big data the costs and complexity of a storage
and retrieval process are magnified.
10: Visualization
Current big data visualization tools face technical challenges due to limitations of in-
memory technology and poor scalability, functionality, and response time. You can't rely
on traditional graphs when trying to plot a billion data points, so you need different
ways of representing data such as data clustering or using tree maps, sunbursts, parallel
coordinates, circular network diagrams, or cone trees.
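As one illustration of the data-clustering idea mentioned above, here is a minimal sketch that reduces a large point set to a few hundred weighted cluster centroids that can then be plotted instead of the raw points; the data is synthetic and the parameters are arbitrary assumptions:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for a very large point set (in practice, stream it in chunks).
rng = np.random.default_rng(0)
points = rng.normal(size=(1_000_000, 2))

# MiniBatchKMeans processes the data in small batches, so it scales to datasets
# far too large to plot point by point.
model = MiniBatchKMeans(n_clusters=200, batch_size=10_000, random_state=0)
labels = model.fit_predict(points)

# Keep only the 200 centroids, weighted by how many raw points each represents;
# these summaries can then be fed into any plotting library.
centroids = model.cluster_centers_
weights = np.bincount(labels, minlength=200)
print(centroids[:5], weights[:5])
```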
Applications of Big Data
• Retail: Retail has some of the tightest margins and is one of the greatest
beneficiaries of big data. The beauty of using big data in retail is understanding
consumer behavior; Amazon's recommendation engine, for example, provides suggestions
based on the consumer's browsing history.
• Traffic control: Traffic congestion is a major challenge for many cities globally.
Effective use of data and sensors will be key to managing traffic better as cities
become increasingly densely populated.
• Search Quality: Every time we extract information from Google, we simultaneously
generate data for it. Google stores this data and uses it to improve its search
quality.
Big Data Ethics
Big data analytics raises several ethical issues, especially as
companies begin monetizing their data externally for purposes
different from those for which the data was initially collected. The
scale and ease with which analytics can be conducted today
completely change the ethical framework. We can now do things that
were impossible a few years ago, and existing ethical and legal
frameworks cannot prescribe what we should do. While there is still
no black or white, experts agree on a few principles.
A big data platform is an integrated computing solution that combines numerous software systems,
tools, and hardware for big data management. It is a one-stop architecture that solves all the data
needs of a business regardless of the volume and size of the data at hand. Due to their efficiency in
data management, enterprises are increasingly adopting big data platforms to gather tons of data
and convert them into structured, actionable business insights.
Currently, the marketplace is flooded with numerous open-source and commercially available big
data platforms. They boast different features and capabilities for use in a big data environment.
Characteristics of a big data platform
Any good big data platform should have the following important features:
• Ability to accommodate new applications and tools depending on the evolving business needs
• Have a wide variety of conversion tools to transform data to different preferred formats
• Provide the tools for scouring the data through massive data sets
Data coming into a Data Lake does not have to be collected with a specific purpose from the
beginning; the purpose can be defined later. Because no schema is imposed up front, data can be
loaded faster, since it does not need to undergo an initial transformation process.
In Data Lakes, data is gathered in its native format, which provides more opportunities for
exploration, analysis, and further operations, as all data requirements can be tailored on a
case-by-case basis; once a schema has been developed, it can be kept for future use or discarded.
A Data Warehouse is a scalable data repository holding large volumes of data, but its environment
is far more structured than a Data Lake's. Data collected in a Data Warehouse is already
pre-processed, which means it is no longer in its native format. Data requirements must be known
and set up front to make sure the models and schemas produce usable data for all users.
1. Data Collection
Big Data platforms collect data from various sources, such as sensors, weblogs, social media, and
other databases.
2. Data Storage
Once the data is collected, it is stored in a repository, such as Hadoop Distributed File System
(HDFS), Amazon S3, or Google Cloud Storage.
3. Data Processing
Data Processing involves tasks such as filtering, transforming, and aggregating the data. This can be
done using distributed processing frameworks, such as Apache Spark, Apache Flink, or Apache
Storm.
4. Data Analytics
After data is processed, it is then analyzed with analytics tools and techniques, such as machine
learning algorithms, predictive analytics, and data visualization.
5. Data Governance
Data Governance (data cataloging, data quality management, and data lineage tracking) ensures the
accuracy, completeness, and security of the data.
6. Data Management
Big data platforms provide management capabilities that enable organizations to make backups,
recover, and archive.
These stages are designed to derive meaningful business insights from raw data coming from multiple
sources such as website analytics systems, CRM, ERP, loyalty engines, etc. Processed data stored in
a unified environment can be used not only for preparing static reports and visualizations but also
for other analytics, for example building Machine Learning models.
WHAT IS A BIG DATA PLATFORM?
A big data platform acts as an organized storage medium for large amounts
of data. Big data platforms utilize a combination of data management
hardware and software tools to store aggregated data sets, usually onto the
cloud.
GOOGLE CLOUD
Google Cloud offers lots of big data management tools, each with its own
specialty. BigQuery warehouses petabytes of data in an easily queried format.
Dataflow analyzes ongoing data streams and batches of historical data side
by side. With Google Data Studio, clients can turn varied data into custom
graphics.
MICROSOFT AZURE
Users can analyze data stored on Microsoft’s Cloud platform, Azure, with a
broad spectrum of open-source Apache technologies, including Hadoop and
Spark. Azure also features a native analytics tool, HDInsight, that
streamlines data cluster analysis and integrates seamlessly with Azure’s
other data tools.
SNOWFLAKE
CLOUDERA
SUMO LOGIC
The cloud-native Sumo Logic platform offers apps — including Airbnb and
Pokémon GO — three different types of support. It troubleshoots, tracks
business analytics and catches security breaches, drawing on machine
learning for maximum efficiency. It’s also flexible and able to manage sudden
influxes of data.
SISENSE
Sisense’s data analytics platform processes data swiftly thanks to its
signature In-Chip Technology. The interface also lets clients build, use and
embed custom dashboards and analytics apps. And with its AI technology
and built-in machine learning models, Sisense enables clients to identify
future business opportunities.
TABLEAU
COLLIBRA
TALEND
Talend’s data replication product, Stitch, allows clients to quickly load data
from hundreds of sources into a data warehouse, where it’s structured and
ready for analysis. Additionally, Data Fabric, Talend’s unified data
integration solution, combines data integration with data governance and
integrity, as well as offers application and API integration.
QUALTRICS EXPERIENCE MANAGEMENT
TERADATA
ORACLE
Oracle Cloud’s big data platform can automatically migrate diverse data
formats to cloud servers, purportedly with no downtime. The platform can
also operate on-premise and in hybrid settings, enriching and transforming
data whether it’s streaming in real time or stored in a centralized repository,
also known as a data lake. A free tier of the platform is also available.
DOMO
Domo’s big data platform draws on clients’ full data portfolios to offer
industry-specific findings and AI-based predictions. Even when relevant
data sprawls across multiple cloud servers and hard drives, Domo clients can
gather it all in one place with Magic ETL, a drag-and-drop tool that
streamlines the integration process.
MONGODB
CIVIS ANALYTICS
ALTERYX
ZETA GLOBAL
This platform from Zeta Global uses its database of billions of permission-
based profiles to help users optimize their omnichannel marketing efforts.
The platform's AI features sift through the diverse data, helping marketers
target key demographics and attract new customers.
VERTICA
TREASURE DATA
ACTIAN AVALANCHE
GREENPLUM
Born out of the open-source Greenplum Database project, this platform uses
PostgreSQL to conquer varied data analysis and operations projects, from
quests for business intelligence to deep learning. Greenplum can parse data
housed in clouds and servers, as well as container orchestration systems.
Additionally, it comes with a built-in toolkit of extensions for location-based
analysis, document extraction and multi-node analysis.
HITACHI VANTARA’S PENTAHO
EXASOL
IBM CLOUD
IBM’s full-stack cloud platform comes with over 170 built-in tools, including
many for customizable big data management. Users can opt for a NoSQL or
SQL database, or store their data as JSON documents, among other database
designs. The platform can also run in-memory analysis and integrate open-
source tools like Apache Spark.
MARKLOGIC
Users can import data into MarkLogic’s platform as is. Items ranging from
images and videos to JSON and RDF files coexist peaceably in the flexible
database, uploaded via a simple drag-and-drop process powered by Apache
Nifi. Organized around MarkLogic’s Universal Index, files and metadata are
easily queried. The database also integrates with a host of more intensive
analytics apps.
DATAMEER
Though it’s possible to code within Datameer’s platform, it’s not necessary.
Users can upload structured and unstructured data directly from many data
sources by following a simple wizard. From there, the point-and-click data
cleansing and a built-in library of more than 270 functions, like
chronological organization and custom binning, make it easy to drill into
data even if users don’t have a computer science background.
ALIBABA CLOUD