Unit 1 Notes Final Part A
Presented By: Dr. Gaurav Agarwal
What is Data & Information?
An information system has six components: hardware, software,
database, telecommunications, people and procedures.
Hardware is the computer equipment used to perform input,
processing and output activities; the monitor, keyboard and CPU
are examples of hardware. Software consists of the computer
programs that govern the operation of the computer. It allows
processing payroll, sending bills to customers, and providing
managers with information to increase profit, reduce costs and
provide better customer service. Thirdly, a database is an
organized collection of facts and information, typically
consisting of two or more related data files. Telecommunications
is the electronic transmission of signals for communication,
which enables organizations to carry out their processes and
tasks through effective computer networks. The fifth component
of an IS is the people who manage or program the system. The
last component is procedures, which are the strategies, policies,
methods and rules for using the computer information system,
including operation, security and maintenance of the computer.
Good procedures can help companies take advantage of new
opportunities and avoid potential disasters. Auditing systems,
expert systems, customer relationship management systems,
payroll systems, applicant tracking systems and knowledge
management systems are examples of information systems.
Data – The Most Valuable Resource
“In its raw form oil has little value. Once processed and
refined, it helps power the world.”
– Ann Winblad
In fact, the computer and the Internet together have given data its digital form.
Digital data can be classified into three forms:
– Unstructured
– Semi-structured
– Structured
Characteristics of unstructured data:
– Does not conform to any data model
– Has no easily identifiable structure
– Cannot be stored in the form of rows and columns as in a database
– Does not follow any rules or semantics
– Is not in any particular format or sequence
– Is not easily usable by a program
Where does Unstructured Data Come from?
Web pages
Memos
Body of an e-mail
PowerPoint presentations
Chats
Reports
Whitepapers
Surveys
Broadly speaking, anything in a non-database form is unstructured
data.
Consider, for example, an email exchange between two colleagues, say Dr. Ben
and Dr. Stanley. Even though email messages like the ones exchanged by Dr. Ben
and Dr. Stanley are organized in databases such as Microsoft Exchange or
Lotus Notes, the body of the email is essentially raw data, i.e., free-form
text without any structure.
A lot of unstructured data is also noisy text such as chats, emails and
SMS texts.
The language of noisy text differs significantly from the standard
form of language.
A Myth Demystified
Indexing and searching: indexing becomes difficult as data grows, and
searching is difficult for non-text data.
How to Store Unstructured Data?
Change formats: unstructured data may be converted to formats which are more easily
managed, stored and searched. For example, IBM is working on providing a solution
which converts audio, video, etc. to text.
CAS (Content Addressable Storage): organize files based on their metadata.
How to Extract Information from Unstructured
Data?
Interpretation: unstructured data is not easily interpreted by conventional
search algorithms.
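As a small illustration of what even simple tools can extract from free-form text, here is a minimal sketch that pulls email addresses and dates out of raw text with regular expressions; the sample text and the patterns are purely illustrative assumptions, not part of the original notes.

```python
import re

# Free-form text, e.g. the body of an e-mail (illustrative sample).
raw_text = """Hi Stanley, let's meet on 2024-03-15 to review the survey results.
Please copy ben.carter@example.org and the project alias team-data@example.org."""

# Naive extraction rules: e-mail addresses and ISO-style dates.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

print("E-mails:", email_pattern.findall(raw_text))
print("Dates:  ", date_pattern.findall(raw_text))
```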
Do it Exercise
Search, think and write about two best practices for managing the growth of
unstructured data
Semi-structured Data
• Semi-structured data does not conform to any data model, i.e. it is difficult to
determine the meaning of the data, nor can the data be stored in rows and columns as
in a database. However, semi-structured data has tags and markers which help to group
the data and describe how it is stored. These give some metadata, but not enough for
the management and automation of data.
• Similar entities in the data are grouped and organized in a hierarchy. The
attributes or properties within a group may or may not be the same. For
example, two addresses may or may not contain the same number of properties, as
in
Address 1
<house number><street name><area name><city>
Address 2
<house number><street name><city>
• For example, an e-mail follows a standard format:
To: <Name>
From: <Name>
Subject: <Text>
CC: <Name>
Body: <Text, Graphics, Images etc. >
• The tags give us some metadata, but the body of the e-mail has no format of its
own and nothing that conveys the meaning of the data it contains (a small sketch of
reading such tagged headers follows below).
• There is a very fine line between unstructured and semi-structured data.
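The following is a minimal sketch, assuming a hypothetical raw message, of how Python's standard email module can read the tagged header fields while the body remains free-form text.

```python
import email

# Illustrative raw message; the header fields are the "tags" discussed above.
raw_message = """To: Dr. Stanley <stanley@example.org>
From: Dr. Ben <ben@example.org>
Subject: Greetings
CC: admin@example.org

Hello! How are you? The body is free-form text with no structure of its own.
"""

msg = email.message_from_string(raw_message)

# Headers are easy to read programmatically...
print(msg["From"], "->", msg["To"], "|", msg["Subject"])

# ...but the body is just unstructured text that a program cannot interpret further.
print(msg.get_payload())
```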
What is Semi-structured Data?
Characteristics of semi-structured data:
– Does not conform to a data model, but contains tags and elements (metadata)
– Cannot be stored in the form of rows and columns as in a database
– Similar entities are grouped
– Metadata is not sufficient
Where does Semi-structured Data Come from?
– XML
– TCP/IP packets
– Zipped files
– Binary executables
– Mark-up languages
– Graph-based data models
– XML schemas
Challenges faced
Implicit structure: in many cases the structure is implicit, so interpreting
relationships and correlations is very difficult.
Distinction between schema and data: a vague distinction between schema and data
exists at times, making it difficult to capture data.
How to Store Semi-structured Data?
Possible solutions
Special purpose DBMS: databases which are specifically designed to store
semi-structured data.
OEM (Object Exchange Model): data can be stored and exchanged in the form of a
graph, where entities are represented as objects which are the vertices of the graph.
How to Extract Information from Semi-structured Data?
Possible solutions include mark-up languages such as XML, whose tags can be exploited by parsers. For example:
<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you? </body>
</message>
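A minimal sketch, using Python's built-in xml.etree.ElementTree on the sample message above, of how the tags let a program pick out each field:

```python
import xml.etree.ElementTree as ET

xml_data = """
<message>
  <to>XYZ</to>
  <from>ABC</from>
  <subject>Greetings</subject>
  <body>Hello! How are you?</body>
</message>
"""

root = ET.fromstring(xml_data)

# The tags (metadata) let a program locate each piece of data directly.
for child in root:
    print(f"{child.tag}: {child.text}")
```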
Answer a Quick Question
Characteristics of structured data:
– Conforms to a data model
– Data is stored in the form of rows and columns (e.g., a relational database)
– Similar entities are grouped
– Definition, format and meaning of data are explicitly known
Where does Structured Data Come from?
– Spreadsheets
– SQL databases
– OLTP systems
Structured Data: Everything in its Place
(Figure: examples of semi-structured vs. structured data.)
Do it Exercise
Think and write about an instance where data was presented to you in
unstructured, semi-structured and structured format.
What Is Big Data Architecture?
To see how data flows through its systems and to ensure that the data is managed properly and meets
business needs for information, an organization needs a well-structured Big Data architecture. Data
architecture is one of the domains of enterprise architecture; when it is well structured, it connects
business strategy with technical implementation.
In practical terms, Big Data architecture can be seen as a model for data collection, storage,
processing, and transformation for subsequent analysis or visualization. The choice of an
architectural model depends on the basic purpose of the information system and the context of its
application, including the levels of processes and IT maturity, as well as the technologies currently
available.
The best-known paradigms are ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform),
used in conjunction with data lake, lakehouse, and data warehouse approaches.
1. Data Sources
Data sources, as the name suggests, are the sources of data for systems based on Big Data
architecture. These sources include software and hardware capable of collecting and storing data,
and the variety of data collection methods depends directly on the source. All big data solutions
start with one or more data sources, for example sensors, weblogs, social media feeds, and
operational databases.
2. Data Ingestion
The first step is to extract data from an external system or data source and ingest it into the Big
Data architecture platform for subsequent processing. In practical terms this means collecting data
from an external data source using a pull or a push approach:
• The pull approach is when your Big Data platform retrieves bulk data or individual
records from an external data source. Data is usually collected in batches if the external
data source supports it; in this case the system can better control the amount and
throughput of ingested data per unit of time.
• The push approach is when an external data source pushes data into your Big Data platform,
usually delivered as real-time events and messages. In this case the system should
support a high ingestion rate or use an intermediate buffer/event-log solution like Apache
Kafka as internal storage. A minimal sketch of the push approach is shown below.
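A minimal sketch of the push approach, assuming a hypothetical local broker and topic name and using the kafka-python client; this is an illustration under those assumptions, not part of the original notes:

```python
import json
from kafka import KafkaProducer  # kafka-python package; assumed available

# Hypothetical broker address and topic name.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# An external source pushing individual events into the ingestion layer.
event = {"user_id": 42, "action": "page_view", "ts": "2024-03-15T10:00:00Z"}
producer.send("ingest.events", value=event)
producer.flush()  # make sure the event is actually delivered before exiting
```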
3. Data Storage
Data for batch processing operations is typically stored in a distributed file store that can hold
high volumes of large files in various formats. This kind of store is often called a data lake;
options for implementing it include Azure Data Lake Store or blob containers in Azure Storage.
Data is persisted in a data lake, lakehouse, or distributed data storage as raw data. Raw data
means that the data keeps its original format and view, which guarantees that subsequent processing
does not lose the original information. A small sketch of a date-partitioned raw zone follows below.
• Data is then transferred to the next processing layer in the form of bulk items (batch
processing) or individual messages for real-time processing (with or without intermediate data
persistence).
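A minimal sketch of persisting raw records unchanged into a date-partitioned raw zone; the local directory layout and field names are illustrative assumptions (in practice the target would be S3, ADLS, HDFS, or similar):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical local "data lake" root; in practice this could be object storage.
RAW_ZONE = Path("datalake/raw/events")

def persist_raw(record: dict) -> Path:
    """Append a record, unchanged, to a date-partitioned raw zone."""
    now = datetime.now(timezone.utc)
    partition = RAW_ZONE / f"year={now:%Y}" / f"month={now:%m}" / f"day={now:%d}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "events.jsonl"
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")   # original format preserved
    return path

persist_raw({"user_id": 42, "action": "page_view", "ts": "2024-03-15T10:00:00Z"})
```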
4. Data Processing or Transformation
The next step is processing or transformation of previously ingested data. The main objectives of
such activity include:
• Transforming data into structured data format based on predefined schema
• Enriching and cleaning data, converting data into the required format
• Performing data aggregation
• Ingesting data into an analytical database, storage, data lake, lakehouse, or data
warehouse
• Transforming data from raw format into intermediate format for further processing
• Implementing Machine or Deep Learning analysis and predictions
4.1. Batch Processing
After a dataset has been stored over a period of time, it moves to the processing stage. Batch
processing presupposes that algorithms process and analyze datasets previously stored in an
intermediate distributed data storage like a data lake, lakehouse, or a distributed file system.
As a rule, batch processing is used when data can be processed in chunks or batches on a daily,
weekly, or monthly basis and end users can wait some time for the results. Batch processing uses
resources more effectively, but it increases the latency before processed data is available for
storage, further processing, and analysis. A minimal batch job is sketched below.
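A minimal sketch of a daily batch job using Apache Spark (PySpark), one of the engines named later in these notes; the paths and column names are illustrative assumptions, not a definitive pipeline:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical paths; the raw zone holds the ingested JSON events from earlier steps.
spark = SparkSession.builder.appName("daily-batch").getOrCreate()

raw = spark.read.json("datalake/raw/events")          # read a batch of raw data

daily_counts = (
    raw.withColumn("date", F.to_date("ts"))           # clean / enrich
       .groupBy("date", "action")                     # aggregate
       .count()
)

# Persist the structured result for the analytical layer (e.g. a curated zone).
daily_counts.write.mode("overwrite").parquet("datalake/curated/daily_action_counts")
```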
4.2. Stream Processing
Incoming data can also be presented as a continuous stream of events from any external data source,
which is usually pushed to the Big Data ingestion layer. In this case data is ingested and processed
directly by consumers in the form of individual messages.
This approach is used when the end user or an external system should see or use the result of the
computations almost immediately. The advantage of this approach is its low latency, which allows
data to be processed in a near real-time manner.
4.2.1 Near Real-Time Stream Processing
When a system is required to process data in a real-time manner, then processing is optimized to
achieve millisecond latency. These optimizations include memory processing, caching,
asynchronous persistence of input/output results in addition to classical near real-time stream
processing.
Real-time stream processing makes it possible to process individual events with maximum throughput
per unit of time, but at the cost of additional resources such as memory and CPU. For example,
Apache Flink processes each event immediately, applying the approaches mentioned above. A minimal
streaming sketch follows below.
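The notes mention Apache Flink; purely as an illustration of the same idea, here is a minimal sketch using Spark Structured Streaming to consume events from a hypothetical Kafka topic (it assumes the spark-sql-kafka connector is on the classpath):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Hypothetical broker and topic names.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "ingest.events")
         .load()
)

# Each Kafka record's value is bytes; decode it and process events as they arrive.
decoded = events.select(F.col("value").cast("string").alias("event_json"))

query = (
    decoded.writeStream
           .format("console")       # sink chosen only for demonstration
           .outputMode("append")
           .start()
)
query.awaitTermination()
```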
5. Analytical Data Store
Many big data solutions prepare data for analysis and then serve the
processed data in a structured format that can be queried using analytical tools. The analytical data
store used to serve these queries can be a Kimball-style relational data warehouse, as seen in most
traditional business intelligence (BI) solutions. Alternatively, the data could be presented through a
low-latency NoSQL technology such as HBase, or an interactive Hive database that provides a
metadata abstraction over data files in the distributed data store. Azure Synapse Analytics provides
a managed service for large-scale, cloud-based data warehousing. HDInsight supports Interactive
Hive, HBase, and Spark SQL, which can also be used to serve data for analysis.
6. Analytics and Reporting
The majority of Big Data solutions are built in a way that facilitates further analysis and reporting
in order to gain valuable insights. The analysis reports should be presented in a user-friendly format
(tables, diagrams, typewritten text, etc.), meaning that the results should be visualized. Depending
on the type and complexity of visualization, additional programs, services, or add-ons can be added
to the system (table or multidimensional cube models, analytical notebooks, etc.).
To achieve this goal, ingested and transformed data should be persisted in an analytical data store,
solution, or database in the appropriate format or structure optimized for faster ad-hoc queries, quick
access and scalability to support large number of users.
The first popular approach is the data warehouse, which is in essence a database optimized for read
operations using column-based storage, an optimized reporting schema and a SQL engine. This
approach is usually applied when the data structure is known in advance and sub-second query
latency is critical to support rich reporting functionality and ad-hoc user queries. Examples
include AWS Redshift, HP Vertica, ClickHouse, and Citus for PostgreSQL.
The next popular approach is the data lake. The original goal of a data lake was to democratize
access to data for different use cases, including machine learning algorithms, reporting, and
post-processing, all on the same ingested raw data. It works, but with some limitations. This
approach simplifies the overall solution because a data warehouse is not required by default, so
fewer tools and data transformations are needed. However, the performance of the engines used for
reporting is significantly lower, even for optimized formats such as Parquet, Delta, and Iceberg. A
typical example of this approach is a classical Apache Spark setup which persists ingested data in
Delta or Iceberg format and is queried with the Presto query engine.
The last trend is to combine both previous approaches in one, and it is known as a lakehouse. In
essence, the idea is to have a data lake with a highly optimized data format and storage plus a
vectorized SQL engine similar to a data warehouse, based on the Delta format, which supports ACID
transactions and versioning. For example, the Databricks enterprise version has achieved better
performance for typical reporting queries than classical data warehouse solutions.
(Figure: data warehouse vs. data lake vs. data lakehouse.)
7. Orchestration
Mostly, Big Data analytics solutions have similar repetitive business processes which include data
collection, transfer, processing, uploading in analytical data stores, or direct transmission to the
report. That’s why companies leverage orchestration technology to automate and optimize all the
stages of data analysis.
Different Big Data tools can be used in this area depending on goals and skills.
The first level of abstraction is the Big Data processing solutions themselves, described in the
data transformation section. They usually have orchestration mechanisms where a pipeline and its
logic are implemented directly in code based on the functional programming paradigm; for example,
Spark, Apache Flink and Apache Beam all have such functionality. This level of abstraction is very
powerful but requires programming skills and deep knowledge of Big Data processing solutions.
The next level is orchestration frameworks, which are still based on writing code to implement the
flow of steps for an automated process, but these tools require only basic knowledge of a language
to link steps together, without special knowledge of how a specific step or component is
implemented. Such tools provide a list of predefined steps with the ability to be extended by
advanced users. For example, Apache Airflow and Luigi are popular choices for many people who work
with data but have limited programming knowledge (a minimal Airflow sketch follows after this
section).
The last level is end-user GUI editors that allow orchestration flows and business processes to be
created with a rich editor of graphical components, which are linked visually and configured via
component properties. BPMN notations are often used for such tools in conjunction with custom
components to process data.
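A minimal sketch of the orchestration-framework level using Apache Airflow 2.x; the DAG id, task names and callables are hypothetical placeholders for real ingestion, transformation and loading logic:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():              # placeholder for the real ingestion logic
    ...

def transform():           # placeholder for the real transformation logic
    ...

def load_to_warehouse():   # placeholder for loading into the analytical store
    ...

with DAG(
    dag_id="daily_analytics_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",     # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="ingest", python_callable=ingest)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load_to_warehouse)

    t1 >> t2 >> t3         # ingest -> transform -> load, once a day
```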
Types of Big Data Architecture
To efficiently handle customer requests and perform tasks well, applications have to interact with
the data warehouse. In a nutshell, we will look at the two most popular Big Data architectures,
known as Lambda and Kappa, that serve as the basis for various corporate applications.
1. Lambda has been the key Big Data architecture. It separates real-time and batch
processing where batch processing is used to ensure consistency. This approach allows
implementing most application scenarios. But for the most part, the batch and stream
levels work with different cases, while their internal processing logic is almost the same.
Thus, data and code duplications may happen, which becomes a source of numerous
errors.
2. For this reason, the Kappa architecture was introduced, which consumes fewer
resources and is well suited to real-time processing. Kappa builds on Lambda by
combining the stream and batch processing models, with information stored in the
data lake. The essence of this architecture is to optimize data processing by
applying the same code to both processing models, which simplifies management and
avoids maintaining duplicate logic.
(Figure: Lambda architecture.)
Let’s take a glimpse at the most common Big Data tools and techniques used nowadays:
Distributed Storage and Processing Tools
Accommodating and analyzing expanding volumes of diverse data requires distributed database
technologies. Distributed databases are infrastructures that can split data across multiple physical
servers allowing multiple computers to be used anywhere. Some of the most widespread processing
and distribution tools include:
Hadoop
Big Data will be difficult to process without Hadoop. It’s not only a storage system, but also a set
of utilities, libraries, frameworks, and development distributions.
Spark
Spark is a solution capable of processing real-time, batch, and in-memory data for quick results.
The tool can run on a local system, which facilitates testing and development. Today, this powerful
open-source Big Data tool is one of the most important in the arsenal of top-performing companies.
Spark is created for a wide range of tasks such as batch applications, iterative algorithms, interactive
queries, and machine learning. This makes it suitable for both amateur use and professional
processing of large amounts of data.
No-SQL Databases
No-SQL databases differ from traditional SQL-based databases in that they support flexible
schemas. This simplifies handling vast amounts of all types of information, especially
unstructured and semi-structured data that are poorly suited to strict SQL systems. A minimal
sketch with a document database follows below.
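A minimal sketch of the flexible-schema idea using MongoDB via pymongo; the connection string, database and field names are illustrative assumptions:

```python
from pymongo import MongoClient

# Hypothetical local MongoDB instance and database/collection names.
client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["events"]

# Documents in the same collection may have different fields (flexible schema).
events.insert_one({"user_id": 42, "action": "page_view", "url": "/home"})
events.insert_one({"user_id": 43, "action": "purchase",
                   "items": [{"sku": "A-1", "qty": 2}], "total": 19.98})

# Query by a shared field even though the documents differ in shape.
for doc in events.find({"user_id": 42}):
    print(doc)
```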
Massively Parallel Processing (MPP) Databases
A feature of the massively parallel processing (MPP) architecture is the physical partitioning of
data memory combined into a cluster. When data is received, only the necessary records are selected
and the rest are eliminated so they do not take up space in RAM, which speeds up disk reading and
the processing of results. Predictive analytics, regular reporting, corporate data warehousing
(CDW), and calculating churn rate are some of the typical applications of MPP.
Cloud Computing Tools
Clouds can be used in the initial phase of working with Big Data, for conducting experiments with
data and testing hypotheses. It is easier to test new assumptions and technologies because you do
not need your own infrastructure. Clouds also make it faster and cheaper to launch a solution into
industrial operation with specific requirements, such as data storage reliability, infrastructure
performance, and others. As a result, more companies are moving their Big Data to clouds, which are
scalable and flexible.
5 Things to Consider When Choosing Big Data Architecture
When choosing a database solution, you have to bear in mind the following factors:
1. Data Requirements
Before launching a Big Data solution, find out which processing type (real-time or batch) is more
suitable for your business to achieve the required ingestion speed and extract the relevant data
for analysis. Don't overlook requirements such as response time, accuracy and consistency, and
fault tolerance, which play a crucial role in the data analytics process.
2. Stakeholders’ Needs
Identify your key external stakeholders and study their information needs to help them achieve
mission-critical goals. This presupposes that the choice of a data strategy must be based on a
comprehensive needs analysis of the stakeholders to bring about benefits to everyone.
3. Data Retention Periods
Data volumes keep growing exponentially, which makes their storage far more expensive and
complicated. To keep these costs under control, you have to determine the period within which each
data set can bring value to your business and, thereby, should be retained.
4. Open-Source or Commercial Big Data Tools
Open-source analytics tools will work best for you if you have the people and the skills to work
with them. This software is more tailorable to your business needs, as your staff can add features,
updates, and other adjustments and improvements at any moment.
If you don't have enough staff to maintain your analytics platform, opting for a commercial tool
can deliver more tangible outcomes. Here, you depend on a software vendor, but you get regular
updates and tool improvements, and can use their support services to solve arising problems.
5. Continuous Evolution
The Big Data landscape is quickly changing as the technologies keep evolving, introducing new
capabilities and offering advanced performance and scalability. In addition, your data needs are
certainly evolving, too.
Make sure that your Big Data approach accounts for these changes, meaning that your Big Data
solution should make it easy to introduce enhancements such as integrating new data sources, adding
new custom modules, or implementing additional security measures when needed.
Big Data Architecture Challenges
If built correctly, Big Data architecture can save money and help predict important trends, but as a
ground-breaking technology, it has some pitfalls.
Budget Requirement
A Big Data project can often be held back by the cost of adopting Big Data architecture. Your budget
requirements can vary significantly depending on the type of Big Data application architecture, its
components and tools, management and maintenance activities, as well as whether you build your
Big Data application in-house or outsource it to a third-party vendor. To overcome this challenge,
companies need to carefully analyze their needs and plan their budget accordingly.
Data Quality
When information comes from different sources, it’s necessary to ensure consistency of the data
formats and avoid duplication. Companies have to sort out and prepare data for further analysis with
other data types.
Scalability
The value of Big Data lies in its quantity but it can also become an issue. If your Big Data
architecture isn’t ready to expand, problems may soon arise.
• If infrastructure isn’t managed, the cost of its maintenance will increase hurting the
company’s budget.
• If a company doesn’t plan to expand, its productivity may fall significantly.
Security
Cyberthreats are a common problem, since hackers are very interested in corporate data. They may
try to inject fake information or view corporate data to obtain confidential information. Thus, a
robust security system should be built to protect sensitive information.
Skills Shortage
The industry is facing a shortage of data analysts due to a lack of experience and necessary skills in
aspirants. Fortunately, this problem is solvable today by outsourcing your Big Data architecture
problems to an expert team that has broad experience and can build a fit-for-purpose solution to
drive business performance.
Big Data Characteristics
The five core characteristics that define Big Data are Volume, Velocity, Variety, Veracity and
Value; several further Vs, discussed below, are also commonly cited.
1. VOLUME
Volume refers to the ‘amount of data’, which is growing day by day at a very
fast pace. The size of data generated by humans, machines and their
interactions on social media alone is massive. Researchers predicted that
40 zettabytes (40,000 exabytes) of data would be generated by 2020, an increase
of 300 times from 2005.
2. VELOCITY
Velocity is defined as the pace at which different sources generate data every
day. This flow of data is massive and continuous. Facebook, for example, has
reported 1.03 billion daily active users (DAU) on mobile, an increase of 22%
year over year. This shows how fast the number of users is growing on social
media and how fast data is generated daily. If you are able to handle the
velocity, you will be able to generate insights and take decisions based on
real-time data.
3. VARIETY
As there are many sources contributing to Big Data, the types of data they
generate differ. Data can be structured, semi-structured or unstructured, so a
great variety of data is generated every day. Earlier, data came mostly from
spreadsheets and databases; now it arrives in the form of images, audio, video,
sensor data and so on. This variety of unstructured data creates problems in
capturing, storing, mining and analyzing the data.
4. VERACITY
Veracity refers to data that is in doubt, or uncertainty about the available
data due to inconsistency and incompleteness. For example, a few values may be
missing from a table, and a few values may be hard to accept, such as a recorded
minimum value of 15000 where that is simply not possible. This inconsistency and
incompleteness is veracity.
Available data can sometimes be messy and difficult to trust. With many forms of
big data, quality and accuracy are difficult to control, for example Twitter
posts with hashtags, abbreviations, typos and colloquial speech. The volume is
often the reason behind the lack of quality and accuracy in the data.
5. VALUE
After discussing Volume, Velocity, Variety and Veracity, there is another V that
should be taken into account when looking at Big Data: Value. It is all well
and good to have access to big data, but unless we can turn it into value it is
useless. Turning it into value means asking: is it adding to the benefit of the
organizations analyzing the big data? Is the organization achieving a high ROI
(return on investment)? Unless working on Big Data adds to their profits, it is
useless.
6. Variability
Variability in big data's context refers to a few different things. One is the number of
inconsistencies in the data. These need to be found by anomaly and outlier detection
methods in order for any meaningful analytics to occur.
Big data is also variable because of the multitude of data dimensions resulting from
multiple disparate data types and sources. Variability can also refer to the inconsistent
speed at which big data is loaded into your database.
7: Validity
Similar to veracity, validity refers to how accurate and correct the data is for its
intended use. According to Forbes, an estimated 60 percent of a data scientist's time is
spent cleansing their data before being able to do any analysis. The benefit from big data
analytics is only as good as its underlying data, so you need to adopt good data
governance practices to ensure consistent data quality, common definitions, and
metadata.
8: Vulnerability
Big data brings new security concerns. After all, a data breach with big data is a big
breach. Does anyone remember the infamous AshleyMadison hack in 2015?
Unfortunately there have been many big data breaches. Another example, as reported
by CRN: in May 2016 "a hacker called Peace posted data on the dark web to sell, which
allegedly included information on 167 million LinkedIn accounts and ... 360 million
emails and passwords for MySpace users."
9: Volatility
How old does your data need to be before it is considered irrelevant, historic, or not
useful any longer? How long does data need to be kept for?
Before big data, organizations tended to store data indefinitely; a few terabytes of data
might not create high storage expenses and could even be kept in the live database
without causing performance issues. In a classical data setting, there might not even be
data archival policies in place.
Due to the velocity and volume of big data, however, its volatility needs to be carefully
considered. You now need to establish rules for data currency and availability as well as
ensure rapid retrieval of information when required. Make sure these are clearly tied to
your business needs and processes -- with big data the costs and complexity of a storage
and retrieval process are magnified.
10: Visualization
Current big data visualization tools face technical challenges due to limitations of in-
memory technology and poor scalability, functionality, and response time. You can't rely
on traditional graphs when trying to plot a billion data points, so you need different
ways of representing data such as data clustering or using tree maps, sunbursts, parallel
coordinates, circular network diagrams, or cone trees.
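As one illustration of the data-clustering idea mentioned above, here is a minimal sketch that reduces a large point set to a few hundred weighted cluster centroids that can then be plotted instead of the raw points; the data is synthetic and the parameters are arbitrary assumptions:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for a very large point set (in practice, stream it in chunks).
rng = np.random.default_rng(0)
points = rng.normal(size=(1_000_000, 2))

# MiniBatchKMeans processes the data in small batches, so it scales to datasets
# far too large to plot point by point.
model = MiniBatchKMeans(n_clusters=200, batch_size=10_000, random_state=0)
labels = model.fit_predict(points)

# Keep only the 200 centroids, weighted by how many raw points each represents;
# these summaries can then be fed into any plotting library.
centroids = model.cluster_centers_
weights = np.bincount(labels, minlength=200)
print(centroids[:5], weights[:5])
```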
Applications of Big Data
• Retail: Retail has some of the tightest margins and is one of the greatest
beneficiaries of big data. The beauty of using big data in retail is understanding
consumer behavior; Amazon's recommendation engine, for example, provides suggestions
based on the consumer's browsing history.
• Traffic control: Traffic congestion is a major challenge for many cities globally.
Effective use of data and sensors will be key to managing traffic better as cities
become increasingly densely populated.
• Search Quality: Every time we extract information from Google, we simultaneously
generate data for it. Google stores this data and uses it to improve its search
quality.
Big Data Ethics
Big data analytics raises several ethical issues, especially as
companies begin monetizing their data externally for purposes
different from those for which the data was initially collected. The
scale and ease with which analytics can be conducted today
completely change the ethical framework. We can now do things that
were impossible a few years ago, and existing ethical and legal
frameworks cannot prescribe what we should do. While there is still
no black or white, experts agree on a few principles.
A big data platform is an integrated computing solution that combines numerous software systems,
tools, and hardware for big data management. It is a one-stop architecture that solves all the data
needs of a business regardless of the volume and size of the data at hand. Due to their efficiency in
data management, enterprises are increasingly adopting big data platforms to gather tons of data
and convert them into structured, actionable business insights.
Currently, the marketplace is flooded with numerous open-source and commercially available big
data platforms. They boast different features and capabilities for use in a big data environment.
Characteristics of a big data platform
Any good big data platform should have the following important features:
• Ability to accommodate new applications and tools depending on the evolving business needs
• Have a wide variety of conversion tools to transform data to different preferred formats
• Provide the tools for scouring the data through massive data sets
Data coming into a Data Lake does not have to be collected with a specific purpose from the
beginning; the purpose can be defined later. Because no schema is imposed up front, data can be
loaded faster, since it does not need to undergo an initial transformation process.
In Data Lakes, data is gathered in its native format, which provides more opportunities for
exploration, analysis, and further operations, as all data requirements can be tailored on a
case-by-case basis; once a schema has been developed, it can be kept for future use or discarded.
A Data Warehouse is a scalable data repository holding large volumes of data, but its environment
is far more structured than a Data Lake's. Data collected in a Data Warehouse is already
pre-processed, which means it is no longer in its native format. Data requirements must be known
and set up front to make sure the models and schemas produce usable data for all users.
1. Data Collection
Big Data platforms collect data from various sources, such as sensors, weblogs, social media, and
other databases.
2. Data Storage
Once the data is collected, it is stored in a repository, such as Hadoop Distributed File System
(HDFS), Amazon S3, or Google Cloud Storage.
3. Data Processing
Data Processing involves tasks such as filtering, transforming, and aggregating the data. This can be
done using distributed processing frameworks, such as Apache Spark, Apache Flink, or Apache
Storm.
4. Data Analytics
After data is processed, it is then analyzed with analytics tools and techniques, such as machine
learning algorithms, predictive analytics, and data visualization.
5. Data Governance
Data Governance (data cataloging, data quality management, and data lineage tracking) ensures the
accuracy, completeness, and security of the data.
6. Data Management
Big data platforms provide management capabilities that enable organizations to make backups,
recover, and archive.
These stages are designed to derive meaningful business insights from raw data coming from multiple
sources such as website analytics systems, CRM, ERP, loyalty engines, etc. Processed data stored in
a unified environment can be used not only for preparing static reports and visualizations but also
for other analytics, for example building Machine Learning models.
WHAT IS A BIG DATA PLATFORM?
A big data platform acts as an organized storage medium for large amounts
of data. Big data platforms utilize a combination of data management
hardware and software tools to store aggregated data sets, usually onto the
cloud.
GOOGLE CLOUD
Google Cloud offers lots of big data management tools, each with its own
specialty. BigQuery warehouses petabytes of data in an easily queried format.
Dataflow analyzes ongoing data streams and batches of historical data side
by side. With Google Data Studio, clients can turn varied data into custom
graphics.
MICROSOFT AZURE
Users can analyze data stored on Microsoft’s Cloud platform, Azure, with a
broad spectrum of open-source Apache technologies, including Hadoop and
Spark. Azure also features a native analytics tool, HDInsight, that
streamlines data cluster analysis and integrates seamlessly with Azure’s
other data tools.
SNOWFLAKE
CLOUDERA
SUMO LOGIC
The cloud-native Sumo Logic platform offers apps — including Airbnb and
Pokémon GO — three different types of support. It troubleshoots, tracks
business analytics and catches security breaches, drawing on machine
learning for maximum efficiency. It’s also flexible and able to manage sudden
influxes of data.
SISENSE
Sisense’s data analytics platform processes data swiftly thanks to its
signature In-Chip Technology. The interface also lets clients build, use and
embed custom dashboards and analytics apps. And with its AI technology
and built-in machine learning models, Sisense enables clients to identify
future business opportunities.
TABLEAU
COLLIBRA
TALEND
Talend’s data replication product, Stitch, allows clients to quickly load data
from hundreds of sources into a data warehouse, where it’s structured and
ready for analysis. Additionally, Data Fabric, Talend’s unified data
integration solution, combines data integration with data governance and
integrity, as well as offers application and API integration.
QUALTRICS EXPERIENCE MANAGEMENT
TERADATA
ORACLE
Oracle Cloud’s big data platform can automatically migrate diverse data
formats to cloud servers, purportedly with no downtime. The platform can
also operate on-premise and in hybrid settings, enriching and transforming
data whether it’s streaming in real time or stored in a centralized repository,
also known as a data lake. A free tier of the platform is also available.
DOMO
Domo’s big data platform draws on clients’ full data portfolios to offer
industry-specific findings and AI-based predictions. Even when relevant
data sprawls across multiple cloud servers and hard drives, Domo clients can
gather it all in one place with Magic ETL, a drag-and-drop tool that
streamlines the integration process.
MONGODB
CIVIS ANALYTICS
ALTERYX
ZETA GLOBAL
This platform from Zeta Global uses its database of billions of permission-
based profiles to help users optimize their omnichannel marketing efforts.
The platform's AI features sift through the diverse data, helping marketers
target key demographics and attract new customers.
VERTICA
TREASURE DATA
ACTIAN AVALANCHE
GREENPLUM
Born out of the open-source Greenplum Database project, this platform uses
PostgreSQL to conquer varied data analysis and operations projects, from
quests for business intelligence to deep learning. Greenplum can parse data
housed in clouds and servers, as well as container orchestration systems.
Additionally, it comes with a built-in toolkit of extensions for location-based
analysis, document extraction and multi-node analysis.
HITACHI VANTARA’S PENTAHO
EXASOL
IBM CLOUD
IBM’s full-stack cloud platform comes with over 170 built-in tools, including
many for customizable big data management. Users can opt for a NoSQL or
SQL database, or store their data as JSON documents, among other database
designs. The platform can also run in-memory analysis and integrate open-
source tools like Apache Spark.
MARKLOGIC
Users can import data into MarkLogic’s platform as is. Items ranging from
images and videos to JSON and RDF files coexist peaceably in the flexible
database, uploaded via a simple drag-and-drop process powered by Apache
Nifi. Organized around MarkLogic’s Universal Index, files and metadata are
easily queried. The database also integrates with a host of more intensive
analytics apps.
DATAMEER
Though it’s possible to code within Datameer’s platform, it’s not necessary.
Users can upload structured and unstructured data directly from many data
sources by following a simple wizard. From there, the point-and-click data
cleansing and a built-in library of more than 270 functions, like
chronological organization and custom binning, make it easy to drill into
data even if users don’t have a computer science background.
ALIBABA CLOUD