Big Data Analytics Compiled Notes
We are thrilled to present you with a complete set of notes for Big Data Analytics,
meticulously covering all the essential topics across five comprehensive modules. Whether
you're diving into the vast world of Big Data or preparing for your exams, these notes are
designed to serve as your ultimate resource for mastering this dynamic and evolving subject.
Here’s what you can look forward to in each module:
Module 1: Introduction to Big Data Analytics
Embark on a journey to understand the fundamentals of Big Data. This module introduces the
core concepts, scalability challenges, and parallel processing. It provides insights into
designing data architecture, understanding data sources and quality, and the critical steps in
pre-processing and storing data. Learn about Big Data storage, analysis techniques, and
explore real-world applications and case studies.
Module 2: Introduction to Hadoop
Delve into the Hadoop ecosystem and Distributed File System (HDFS). Understand its design
features, user commands, and the MapReduce framework, along with the Yarn architecture.
This module also introduces essential Hadoop tools like Apache Pig, Hive, Sqoop, Flume,
Oozie, and HBase, equipping you with practical knowledge of handling Big Data efficiently.
Module 3: NoSQL Big Data Management
Explore the world of NoSQL databases tailored for Big Data management. Learn about
NoSQL architecture patterns, shared-nothing architecture for handling tasks, and leveraging
MongoDB and Cassandra for managing vast amounts of data.
Module 4: MapReduce and HiveQL
Uncover the power of MapReduce tasks for Big Data computations and algorithms. This
module covers the basics of MapReduce execution, composing complex calculations, and the
scripting capabilities of HiveQL and Pig for managing and analysing data.
Module 5: Machine Learning and Analytics
Discover the intersection of Big Data and machine learning. This module dives into
algorithms for regression analysis, finding similarities, frequent itemsets, and association
rule mining. You’ll also explore advanced topics like text mining, web content analytics,
PageRank, and social network analytics, offering a complete perspective on data-driven
decision-making.
These notes are carefully curated to not only help you excel in your exams but also provide
valuable insights that will serve as a solid foundation for your career in Big Data Analytics.
From theoretical knowledge to practical applications, you’ll gain a holistic understanding of
the subject. We hope this learning journey empowers you to explore the vast possibilities in
Big Data. Let’s unlock the potential of analytics together!
MODULE 1
INTRODUCTION TO BIG DATA ANALYTICS
Data
Data has multiple definitions and can be used in both singular and plural forms:
1. "Data is information, usually in the form of facts or statistics that one can analyze or
use for further calculations."
2. "Data is information that can be stored and used by a computer program."
3. "Data is information presented in numbers, letters, or other forms."
4. "Data is information from a series of observations, measurements, or facts."
5. "Data is information from a series of behavioral observations, measurements, or
facts."
Web Data
Web data refers to the information available on web servers, including text, images, videos,
audio, and other multimedia content accessible to web users. A user (client software) interacts
with this data in various ways:
Pull: Clients retrieve data by sending requests to the server.
Push/Post: Servers can also publish or push data, or users can post data after
subscribing to services.
Examples of Internet Applications:
Websites, web services, and web portals
Online business applications
Emails, chats, tweets
Social networks
Classification of Data:
Structured, Semi-Structured, Multi-Structured and Unstructured
Data can be broadly classified into the following categories:
1. Structured Data
Structured data conforms to predefined data schemas and models, such as relational tables
with rows and columns. Around 15-20% of data is either structured or semi-structured.
Characteristics of Structured Data:
Supports CRUD operations: Enables creating (inserting), reading, updating, and
deleting data records.
Indexing: Facilitates faster data retrieval through indexing.
2. Transactional Data
Examples: Credit card transactions, flight bookings, public agency records (e.g.,
medical records, insurance data)
Description: Data generated through business transactions and operational processes.
This includes financial transactions, service bookings, and records from public
agencies.
3. Customer Master Data
Examples: Facial recognition data, personal information (e.g., name, date of birth,
marriage anniversary, gender, location, income category)
Description: Data related to customer identity and demographics, often used for
personalized services, marketing, and authentication (e.g., facial recognition).
4. Machine-Generated Data
Examples: Internet of Things (IoT) data, sensors, trackers, web logs, computer
system logs
Description: Data generated from machines, devices, and sensors in an automated
manner. This type includes data from sensors in IoT devices, system logs from
servers, and data from machine-to-machine communication.
5. Human-Generated Data
Examples: Biometrics, human-machine interaction data, email records, student
grades (stored in databases like MySQL)
Description: Data that is generated by human interaction with machines. This
includes biometric data (e.g., fingerprints, facial scans), emails, and data stored in
databases for academic or business purposes.
Big Data Classification
Big Data can be classified based on various criteria such as data sources, formats, storage
structures, processing rates, and analysis methods. This classification helps understand how
Big Data is sourced, stored, processed, and analyzed.
1. Data Sources (Traditional)
Traditional data sources include:
Records, Relational Database Management Systems (RDBMS): Structured data
storage in tables.
Distributed Databases: Data spread across multiple systems for redundancy and
performance.
In-memory Data Tables: Data stored directly in memory for fast processing.
Data Warehouse: Centralized repositories for structured data.
Servers: Data generated from machine interactions and operations.
Business Process (BP) Data and Business Intelligence (BI) Data: Business
operation records and data for decision-making purposes.
Human-Sourced Data: Data generated by human activities, such as emails, social
media, and transactions.
2. Data Formats (Traditional)
Structured and Semi-Structured Data: Data stored in predefined formats, like
tables, XML, or JSON, making it easier to retrieve and analyze.
3. Big Data Sources
Big Data is sourced from a variety of places:
Data Storage Systems: Distributed file systems, Operational Data Stores (ODS), data
marts, data warehouses, and NoSQL databases (e.g., MongoDB, Cassandra).
Sensor Data: Data from IoT devices, monitoring systems, and sensors.
External Data Sources: Web data, social media activity, weather data, and health
records.
Audit Trails: Logs from financial transactions and other system operations.
4. Big Data Formats
Big Data comes in various formats:
Unstructured, Semi-Structured, and Multi-Structured Data: Data without a
predefined schema, such as images, videos, and text, along with semi-structured
formats like XML and JSON.
Data Stores Structure: Includes row-oriented data (used for OLTP systems),
column-oriented data (used for OLAP systems), graph databases, and hashed
key/value pairs.
5. Processing Data Rates
Big Data processing can happen at different speeds:
Batch Processing: Large volumes of data are processed in chunks (e.g., using
MapReduce).
Near-Time Processing: Data is processed almost immediately after it's received.
Real-Time and Streaming Processing: Data is processed as it arrives (e.g., using
Spark Streaming).
6. Big Data Processing Methods
Batch Processing: Includes tools like MapReduce, Hive, and Pig for processing large
data sets over time.
Real-Time Processing: Uses tools like Spark Streaming, Apache Drill, and Spark SQL
for immediate data analysis and decision-making.
Scalability
Capacity Increase: Scalability allows a system to grow or shrink in capacity as data
and processing demands change.
o Vertical Scalability: Increases a single system's resources to improve its
processing power and efficiency.
Scaling software to run on larger machines with more resources can enhance
performance, but the efficiency of the algorithm plays a significant role.
o Simply adding more CPUs or memory without optimizing the software's
ability to leverage these resources won't provide substantial performance
gains.
o Algorithm Design: Properly designed algorithms exploit additional resources
like extra CPUs and memory, enabling efficient use of parallel computing
environments.
Cloud Computing
Cloud computing is an internet-based service that allows on-demand access to shared
resources and data. It provides flexible and scalable computing power, data storage, and
services without requiring users to invest in their own physical infrastructure.
Features of Cloud Computing:
1. On-Demand Service: Users can access computing resources (such as storage,
processing power, or software) whenever needed without human interaction with
service providers.
2. Resource Pooling: Cloud providers use multi-tenant models to pool resources,
dynamically allocating them to meet the demands of multiple customers.
3. Scalability: Cloud resources can be scaled up or down based on demand, ensuring
flexibility.
4. Broad Network Access: Cloud services are accessible via the internet, meaning users
can access their resources from any location, using various devices.
5. Accountability: Cloud providers ensure transparent usage metrics and billing, giving
users clear insight into their resource consumption.
Types of Cloud Computing Services:
1. Infrastructure as a Service (IaaS):
o Provides access to computing resources such as virtual machines, storage, and
network infrastructure.
o Users can rent infrastructure on a pay-as-you-go basis.
o Examples:
Amazon EC2: Virtual server space for scalable computing power.
Tata CloudStack: Open-source software for managing virtual
machines, offering public cloud services.
2. Platform as a Service (PaaS):
o Provides a platform allowing developers to build, deploy, and manage
applications without worrying about the underlying infrastructure.
o Examples:
Microsoft Azure HDInsight: Offers cloud-based Hadoop services.
IBM BigInsights and Oracle Big Data Cloud Service: Provide big
data platforms for analytics and application development.
3. Software as a Service (SaaS):
o Delivers software applications over the internet. Users access software without
installing or maintaining it on their own computers.
o Examples:
GoogleSQL, IBM BigSQL, HPE Vertica: Cloud-based SQL services.
Microsoft Polybase and Oracle Big Data SQL: Cloud solutions for
data analytics and querying large datasets.
Cloud Computing in Big Data Processing
Cloud computing is a powerful environment for handling Big Data as it allows for both
parallel and distributed computing across multiple nodes. Big data solutions leverage the
cloud for:
Data Storage: Cloud platforms such as Amazon S3 provide scalable storage for large
datasets.
Data Processing: Cloud-based services like Microsoft Azure, Apache CloudStack,
and AWS EC2 facilitate the parallel processing of large-scale datasets.
Grid Computing:
Grid computing is a form of distributed computing where computers located in different
locations are interconnected to work together on a common task. It allows the sharing of
resources across various organizations or individuals for achieving large-scale tasks,
particularly data-intensive ones.
Features of Grid Computing:
1. Distributed Network: Grid computing involves a network of computers from
multiple locations, each contributing resources for a common goal.
2. Large-Scale Resource Sharing: It enables the flexible, coordinated, and secure
sharing of resources among users, such as individuals and organizations.
3. Data-Intensive Tasks: Grid computing is particularly well-suited for handling large
datasets that can be distributed across grid nodes.
4. Scalability: Grid computing can scale efficiently by adding more nodes to
accommodate growing data or processing needs.
5. Single-Task Focus: At any given time, a grid typically dedicates its resources to a
single application or task.
Drawbacks of Grid Computing:
1. Single Point of Failure: If one node underperforms or fails, it can disrupt the entire
grid, affecting overall performance.
2. Variable Performance: The performance and storage capacity of the grid can
fluctuate depending on the number of users, instances, and data transferred.
3. Resource Management Complexity: As resources are shared among many users,
managing and coordinating them can be challenging, especially with large volumes of
data.
Cluster Computing:
Cluster computing refers to a group of computers connected by a local network that work
together to accomplish the same task. Unlike grid computing, clusters are typically located in
close proximity and used primarily for load balancing and high availability.
Key Features of Cluster Computing:
1. Local Network: The computers in a cluster are interconnected locally and operate as
a single system.
2. Load Balancing: Clusters distribute processes among nodes to ensure that no single
computer is overloaded. This allows for better resource utilization and higher
availability.
3. Fault Tolerance: Clusters often provide redundancy, where if one node fails, others
can take over, minimizing the risk of downtime.
4. Application: Cluster computing is commonly used in high-performance computing
(HPC), scientific simulations, and business analytics.
5. Hadoop Integration: The Hadoop architecture follows cluster computing principles
by distributing tasks across many nodes for large-scale data processing.
Volunteer Computing:
Volunteer computing is a type of distributed computing that uses the resources of volunteers
(organizations or individuals) to contribute to projects requiring computational power.
Key Features of Volunteer Computing:
1. Volunteer Resources: Volunteers donate the computing power of their personal
devices (computers, smartphones, etc.) to help process data or run simulations for
large-scale projects.
2. Distributed Network: Similar to grid computing, volunteer computing relies on a
network of geographically distributed devices.
3. Popular in Academia: Volunteer computing is often used for science-related projects
by universities or research institutions.
Examples of Volunteer Computing Projects:
SETI@home: A project that uses idle resources from volunteers to analyze radio
signals for extraterrestrial life.
Designing Data Architecture
Data Architecture Design involves organizing how Big Data is stored, accessed, and
managed in a Big Data or IT environment. It creates a structure that allows the flow of
information, security management, and utilization of core components in an efficient manner.
Big Data architecture follows a systematic approach, especially when broken down into
logical layers, each serving a specific function. These layers make it easier to design, process,
and implement data architecture.
Big Data Architecture Layers:
The architecture is broken down into five main layers, each representing a set of core
functions essential for handling Big Data:
1. Identification of Data Sources (L1):
o Purpose: Identify the sources of data, which could be both internal
(organization databases, ERP systems) and external (social media, IoT
devices, APIs).
o Key Task: Determine the relevant data sources to be ingested into the system.
2. Acquisition, Ingestion, and Pre-Processing of Data (L2):
o Purpose: Data ingestion is the process of importing and absorbing data into
the system for further use. This data may be ingested in batches or real-time.
o Key Task: Perform initial data transformation, cleaning, and standardization
to ensure data readiness for storage and processing.
3. Data Storage (L3):
o Purpose: Store data in a variety of storage environments, such as files,
databases, clusters, or cloud systems. This layer holds structured, semi-
structured, and unstructured data for future processing.
o Key Task: Choose appropriate storage systems based on scalability and
reliability (e.g., Hadoop Distributed File System (HDFS), cloud storage like
AWS S3, or distributed storage).
4. Data Processing (L4):
o Purpose: This layer focuses on processing the data using frameworks and
tools like MapReduce, Apache Hive, Apache Pig, and Apache Spark.
o Key Task: Implement large-scale distributed data processing to analyze and
extract meaningful insights from vast datasets.
5. Data Consumption (L5):
o Purpose: After data is processed, this layer delivers the insights and results to
end users through analytics, visualization, and reporting tools.
o Key Task: Use analytics for various applications such as business
intelligence, AI/ML models, predictive analytics, pattern recognition, and
data visualization tools.
11. Machine Learning Integration: Utilizing machine learning techniques for predictive
analytics and data insights.
Data Sources in a Big Data Environment:
Data Storage Solutions:
Traditional data warehouses and modern NoSQL databases (e.g., Oracle Big
Data, MongoDB, Cassandra).
Sensor Data: Data generated from various sensors, which can include IoT devices.
Audit Trails: Records of financial transactions and other business processes.
External Data Sources: Information from web platforms, social media, weather data,
and health records.
Big Data Analytics Applications and Case Studies:
1. Big Data in Marketing and Sales:
o Marketing revolves around delivering value to customers. Big Data plays a
vital role in customer value analytics (CVA), allowing companies like Amazon
to enhance customer experiences. It helps businesses understand customer
needs and perceptions, leading to effective strategies for improving customer
relationships and lifetime value (CLTV).
o Big Data in marketing also aids in lowering customer acquisition cost (CAC)
and enhancing contextual marketing by targeting potential customers based on
browsing patterns.
2. Big Data Analytics in Fraud Detection:
o Fraud detection is critical to avoiding financial losses. Examples of fraud
include sharing customer data with third parties or falsifying company
information. Big Data analytics help detect and prevent fraud by integrating
data from multiple sources such as social media, emails, and websites,
allowing faster detection of threats and preventing potential frauds.
3. Big Data Risks:
o While Big Data offers insights, it also introduces risks. Erroneous or
inaccurate data can lead to faulty analytics, requiring companies to implement
strong risk management strategies to ensure accurate predictions and reliable
data usage.
4. Big Data in Credit Risk Management:
o Financial institutions use Big Data to manage credit risks by analyzing loan
defaults, timely return of interests, and the creditworthiness of borrowers. Big
Data provides insights into industries with higher risks, individuals with poor
credit ratings, and liquidity issues, helping financial institutions make
informed lending decisions.
MODULE 2
Introduction to Hadoop (T1)
Introduction to Hadoop
Hadoop is an Apache open-source framework written in Java that enables the distributed processing
of large datasets across clusters of computers using simple programming models. It allows
applications to work in an environment that supports distributed storage and computation. Hadoop is
scalable, meaning it can grow from a single server to thousands of machines, each providing local
computation and storage. It is designed to handle Big Data and enable efficient processing of massive
datasets.
Big Data Store Model
The Big Data store model in Hadoop is based on a distributed file system. Data is stored in blocks,
which are physical divisions of data spread across multiple nodes. The architecture is organized in
clusters and racks:
Data Nodes: Store data in blocks.
Racks: A collection of data nodes, scalable across clusters.
Clusters: Racks are grouped into clusters to form the overall storage and processing system.
Hadoop ensures reliability by replicating data blocks across nodes. If a data link or node fails, the
system can still access the replicated data from other nodes.
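HDFS exposes this block-and-replication model through a Java API. The sketch below is a minimal illustration (not from the source notes) of how a client might copy a file into HDFS and request a replication factor of 3; the file paths are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStoreExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS splits it into blocks and
        // replicates each block across DataNodes
        Path local = new Path("/tmp/exam_results.csv");      // illustrative path
        Path remote = new Path("/data/exam_results.csv");    // illustrative path
        fs.copyFromLocalFile(local, remote);

        // Request a replication factor of 3 for the stored file
        fs.setReplication(remote, (short) 3);

        System.out.println("Block size used: " + fs.getFileStatus(remote).getBlockSize());
        fs.close();
    }
}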
Big Data Programming Model
In Hadoop's Big Data programming model, jobs and tasks are scheduled to run on the same servers
where the data is stored, minimizing data transfer time. This programming model is enabled by
MapReduce, a powerful tool that divides processing tasks into smaller subtasks that can be executed
in parallel across the cluster.
Example of Jobs in Hadoop
Query Processing: A job that processes queries on datasets and returns results to an
application.
Sorting Data: Sorting performance data from an examination or another large dataset.
Hadoop and Its Ecosystem
The Hadoop framework was developed as part of an Apache project for Big Data storage and
processing, initiated by Doug Cutting and Mike Cafarella. The name Hadoop came from Cutting’s
son, who named his stuffed toy elephant "Hadoop."
Hadoop has two main components:
1. Hadoop Distributed File System (HDFS): A system for storing data in blocks across
clusters.
2. MapReduce: A computational framework that processes data in parallel across the clusters.
Hadoop is written primarily in Java, with some native code in C, and the utilities are managed using
shell scripts. The framework operates on cloud-based infrastructure, making it a cost-effective
solution for managing and processing terabytes of data in minutes.
Characteristics of Hadoop
Hadoop offers several key advantages for managing Big Data:
Scalable: Easily scales from a few machines to thousands.
Self-manageable: Requires minimal manual intervention for management.
Self-healing: Automatically manages node failures by replicating data.
Distributed File System: Ensures reliable storage and quick access to large datasets.
Hadoop Core Components
The Apache Hadoop framework is made up of several core components, which work together to store
and process large datasets in a distributed computing environment. The core components of Hadoop
are as follows:
1. Hadoop Common:
o Description: This is the foundational module that contains the libraries and utilities
required by other Hadoop components. It provides various common services like file
system and input/output operations, serialization, and Remote Procedure Calls
(RPCs).
o Features:
Common utilities shared across the Hadoop modules.
File-based data structures.
Essential interfaces for interacting with the distributed file system.
2. Hadoop Distributed File System (HDFS):
o Description: HDFS is a Java-based distributed file system designed to run on
commodity hardware. It allows Hadoop to store large datasets by distributing data
blocks across multiple machines (nodes) in the cluster.
o Features:
Data is stored in blocks and replicated for fault tolerance.
Highly scalable and reliable.
Optimized for batch processing and provides high throughput for data access.
3. MapReduce v1:
o Description: MapReduce v1 is a programming model that allows for the processing
of large datasets in parallel across multiple nodes. The model divides a job into
smaller sub-tasks, which are then executed across the cluster.
o Features:
Jobs are divided into Map tasks and Reduce tasks.
Suitable for batch processing large sets of data.
o Hadoop processes Big Data characterized by the 3Vs: Volume, Variety, and
Velocity.
4. Distributed Cluster Computing with Data Locality:
o Hadoop optimizes processing by running tasks on the same nodes where the data is
stored, enhancing efficiency.
o High-speed processing is achieved by distributing tasks across multiple nodes in a
cluster.
5. Fault Tolerance:
o Hadoop automatically handles hardware failures. If a node fails, the system recovers
by using data replicated across other nodes.
6. Open-Source Framework:
o Hadoop is open-source, making it cost-effective for handling large data workloads. It
can run on inexpensive hardware and cloud infrastructure.
7. Java and Linux Based:
o Hadoop is built in Java and runs primarily on Linux. It also includes its own set of
shell commands for easy management.
Hadoop Ecosystem Components
Hadoop's ecosystem consists of multiple layers, each responsible for different aspects of storage,
resource management, processing, and application support. The key components are:
SlaveNodes:
SlaveNodes (or DataNodes and Task Trackers) store actual data blocks and execute
computational tasks. Each node has a significant amount of disk space and is responsible for
both data storage and processing.
o DataNodes handle the storage and management of data blocks.
o TaskTrackers execute the processing tasks sent by the MasterNode and return the
results.
Physical Distribution of Nodes:
A typical Hadoop cluster consists of many DataNodes that store data, while MasterNodes
handle administrative tasks. In a large cluster, multiple MasterNodes are used to balance the
load and ensure redundancy.
Client-Server Interaction:
Clients interact with the Hadoop system by submitting queries or applications through various
Hadoop ecosystem projects, such as Hive, Pig, or Mahout.
The MasterNode coordinates with the DataNodes to store data and process tasks. For
example, it organizes how files are distributed across the cluster, assigns jobs to the nodes,
and monitors the health of the system.
MapReduce Job Execution Steps:
1. Job Submission:
o A client submits a request to the JobTracker, which estimates the required resources
and prepares the cluster for execution.
2. Task Assignment:
o The JobTracker assigns Map tasks to nodes that store the relevant data. This is
called data locality, which reduces network overhead.
3. Monitoring:
o The progress of each task is monitored, and if any task fails, it is restarted on a
different node with available resources.
4. Final Output:
o After the Map and Reduce jobs are completed, the results are serialized and
transferred back to the client, typically using formats like AVRO.
o Containers run the actual tasks of the application in parallel, distributed across
multiple nodes.
o During job execution, the NodeManager (NM) monitors resource utilization and ensures the tasks are completed successfully. If there are any failures, the ResourceManager (RM) may reassign tasks to available containers.
Hadoop Ecosystem Tools
1. Zookeeper:
Zookeeper is a centralized coordination service for distributed applications. It provides a reliable,
efficient way to manage configuration, synchronization, and name services across distributed systems.
Zookeeper maintains data in a hierarchy of nodes called znodes, ensuring that distributed systems function
cohesively. Its main coordination services include:
Name Service: Similar to DNS, it maps names to information, tracking servers or services
and checking their statuses.
Concurrency Control: Manages concurrent access to shared resources, preventing
inconsistencies and ensuring that distributed processes run smoothly.
Configuration Management: A centralized configuration manager that updates nodes with
the current system configuration when they join the system.
Failure Management: Automatically recovers from node failures by selecting alternative
nodes to take over processing tasks.
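As an illustration of these coordination services, the sketch below uses ZooKeeper's Java client to publish and read a shared configuration value. It is a minimal example under assumed settings (a ZooKeeper server on localhost:2181 and an illustrative /app-config znode), not taken from the source notes.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (assumed to run on localhost:2181)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // Publish a configuration value as a persistent znode
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "batch.size=128".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node joining the system can read the current configuration
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println("Current config: " + new String(data));

        zk.close();
    }
}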
2. Oozie:
Apache Oozie is a workflow scheduler for Hadoop that manages and coordinates complex jobs and
tasks in big data processing. Oozie allows you to create, schedule, and manage multiple workflows. It
organizes jobs into Directed Acyclic Graphs (DAGs) and supports:
Integration of Multiple Jobs: Oozie integrates MapReduce, Hive, Pig, and Sqoop jobs in a
sequential workflow.
Time and Data Triggers: Automatically runs workflows based on time or specific data
availability.
Batch Management: Manages the timely execution of thousands of jobs in a Hadoop cluster.
Oozie is efficient for automating and scheduling repetitive jobs, simplifying the management of
multiple workflows.
3. Sqoop:
Apache Sqoop is a tool used for efficiently importing and exporting large amounts of data between
Hadoop and relational databases. It uses the MapReduce framework to parallelize data transfer
tasks. The workflow of Sqoop includes:
Command-Line Parsing: Sqoop processes the arguments passed through the command line
and prepares map tasks.
Data Import and Export: Data from external databases is distributed across multiple
mappers. Each mapper connects to the database using JDBC to fetch and import the data into
Hadoop, HDFS, Hive, or HBase.
Parallel Processing: Sqoop leverages Hadoop's parallel processing to transfer data quickly
and efficiently. It also provides fault tolerance and schema definition for data import.
Sqoop's ability to handle structured data makes it an essential tool for integrating relational databases
with the Hadoop ecosystem.
4. Flume:
Apache Flume is a service designed for efficiently collecting, aggregating, and transferring large
volumes of streaming data into Hadoop, particularly into HDFS. It's highly useful for applications
involving continuous data streams, such as logs, social media feeds, or sensor data. Key components
of Flume include:
Sources: These collect data from servers or applications.
Sinks: These store the collected data into HDFS or another destination.
Channels: These act as a buffer, holding event data (typically 4 KB in size) between sources
and sinks.
Agents: Agents run sources and sinks. Interceptors filter or modify the data before it's
written to the target.
Flume is reliable and fault-tolerant, providing a robust solution for handling massive, continuous data
streams.
----------------------------------------END OF MODULE 2-------------------------------------------------
MODULE 3
Introduction to Distributed Systems in Big Data
Definition: Distributed systems consist of multiple data nodes organized into clusters,
enabling tasks to execute in parallel.
Communication: Nodes communicate with applications over a network, optimizing
resource utilization.
Features of Distributed-Computing Architecture
1. Increased Reliability and Fault Tolerance
o Failure of some cluster machines does not impact the overall system.
o Data replication across nodes enhances fault tolerance.
2. Flexibility
o Simplifies installation, implementation, and debugging of new services.
3. Sharding
o Definition: Dividing data into smaller, manageable parts called shards.
o Example: A university student database is sharded into datasets per course
and year.
4. Speed
o Parallel processing on individual nodes in clusters boosts computing
efficiency.
5. Scalability
o Horizontal Scalability: Expanding by adding more machines and shards.
o Vertical Scalability: Enhancing machine capabilities to run multiple
algorithms.
6. Resource Sharing
o Shared memory, machines, and networks reduce operational costs.
7. Open System
o Accessibility of services across all nodes in the system.
8. Performance
o Improved performance through collaborative processor operations with lower
communication costs compared to centralized systems.
Drawbacks of Distributed-Computing Architecture
1. Troubleshooting Complexity
o Diagnosing issues becomes challenging in large network infrastructures.
2. Software Overhead
o Additional software is often required for distributed system management.
3. Security Risks
o Vulnerabilities in data and resource sharing due to distributed architecture.
NoSQL Concepts
NoSQL Data Store: Non-relational databases designed to handle semi-structured and
unstructured data.
NoSQL Data Architecture Patterns: Models such as key-value, document, column-
family, and graph for efficient data organization.
Shared-Nothing Architecture: Ensures no shared resources among nodes, enabling
independent operation and scalability.
MongoDB
Type: Document-oriented NoSQL database.
Features: Schema-less design, JSON-like storage, scalability, and high availability.
Usage: Suitable for real-time applications and Big Data analytics.
Cassandra
Type: Column-family NoSQL database.
Features: High availability, decentralized architecture, linear scalability, and eventual
consistency.
Usage: Ideal for applications requiring fast writes and large-scale data handling.
SQL Databases: ACID Properties
SQL databases are relational and exhibit ACID properties to ensure reliability and
consistency of transactions:
1. Atomicity
o All operations in a transaction must complete entirely, or none at all.
o Example: In a banking transaction, if updating both withdrawal and balance
fails midway, the entire transaction rolls back.
2. Consistency
o A transaction must take the database from one consistent (valid) state to another, preserving all defined rules and constraints.
o Example: After a funds transfer between two accounts, the total of the two balances remains unchanged.
3. Isolation
o Concurrently executing transactions do not interfere with one another; the intermediate state of a transaction is not visible to other transactions.
4. Durability
o Once a transaction commits, its changes persist even in the event of a system failure.
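To make the atomicity property concrete, the following JDBC sketch wraps a withdrawal and a deposit in one transaction so that either both updates succeed or both are rolled back. The accounts table, column names, and connection URL are illustrative assumptions, not from the notes.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class TransferExample {
    public static void main(String[] args) throws Exception {
        // Illustrative JDBC URL and credentials
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/bank", "user", "password")) {
            con.setAutoCommit(false);   // start a transaction
            try (PreparedStatement debit = con.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = con.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setDouble(1, 500.0);  debit.setInt(2, 101);
                credit.setDouble(1, 500.0); credit.setInt(2, 102);
                debit.executeUpdate();
                credit.executeUpdate();
                con.commit();               // both updates become permanent together
            } catch (Exception e) {
                con.rollback();             // atomicity: undo the partial transfer
                throw e;
            }
        }
    }
}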
SQL Features
1. Triggers
o Automated actions executed upon events like INSERT, UPDATE, or
DELETE.
2. Views
o Logical subsets of data from complex queries, simplifying data access.
3. Schedules
o Define the chronological execution order of transactions to maintain
consistency.
4. Joins
o Combine data from multiple tables based on conditions, enabling complex
queries.
CAP Theorem Overview
The CAP Theorem, formulated by Eric Brewer, states that in a distributed system, it is
impossible to simultaneously guarantee all three properties: Consistency (C), Availability
(A), and Partition Tolerance (P). Distributed databases must trade off between these
properties based on specific application needs.
CAP Properties
1. Consistency (C):
o All nodes in the distributed system see the same data at the same time.
o Changes to data are immediately reflected across all nodes.
2. Availability (A):
o Every request receives a response (success or failure), even if some nodes are unavailable.
3. Partition Tolerance (P):
o The system continues to operate even when network failures split the nodes into groups that cannot communicate with each other.
CAP Combinations
Since achieving all three properties is not possible, distributed systems choose two of the
three based on requirements:
1. Consistency + Availability (CA):
o Ensures all nodes see the same data (Consistency).
o Ensures all requests receive responses (Availability).
o Cannot tolerate network partitions.
o Example: Relational databases in centralized systems.
2. Availability + Partition Tolerance (AP):
o Ensures the system responds to requests even during network failures
(Partition Tolerance).
o May sacrifice consistency, meaning some nodes may have stale or outdated
data.
o Example: DynamoDB, where availability is prioritized over consistency.
3. Consistency + Partition Tolerance (CP):
o Ensures all nodes maintain consistent data (Consistency).
o Tolerates network partitions but sacrifices availability during failures (some
requests may be denied).
1. Key-Value Data Stores
Definition: Stores data as pairs of a unique key and an associated value, with lookups performed by key.
Uses:
o Image/document storage.
o Lookup tables and query caches.
2. Document Stores
Definition: Stores unstructured or semi-structured data in a hierarchical format.
Features:
1. Stores data as documents (e.g., JSON, XML).
2. Hierarchical tree structures with paths for navigation.
3. Transactions exhibit ACID properties.
4. Flexible schema-less design.
Advantages:
o Easy querying and navigation using languages like XPath or XQuery.
o Supports dynamic schema changes (e.g., adding new fields).
Limitations:
o Incompatible with traditional SQL.
o Complex implementation compared to other stores.
Examples: MongoDB, CouchDB.
Use Cases:
o Office documents, inventory data, forms, and document searches.
Comparison:
o JSON includes arrays; XML is more verbose but widely used.
o JSON is easier to handle for developers due to its key-value structure.
3. Column-Family Data Stores
Use Cases:
o Web crawling, large sparsely populated tables, and high-variance systems.
NoSQL Data Distribution Models
1. Single Server Model:
o This is the simplest distribution model where all data is stored and processed
on a single server. While this model is easy to implement, it may not scale
well for large datasets or high traffic applications.
o Best for: Small-scale applications or use cases like graph databases where
relationships are processed sequentially on a single server.
o Example: A simple graph database that processes node relationships on a
single server.
2. Sharding Very Large Databases:
o Sharding refers to the process of splitting a large database into smaller, more
manageable parts called "shards". Each shard is distributed across multiple
servers in a cluster.
o Sharding provides horizontal scalability, allowing the system to process data
in parallel across multiple nodes.
o Advantages:
Enhanced performance by distributing data across multiple nodes.
If a node fails, the shard can migrate to another node for continued
processing.
o Example: A dataset of customer records is split across four servers, where
each server handles one shard (e.g., DB1, DB2, DB3 and DB4).
3. Master-Slave Distribution Model:
o In this model, there is one master node that handles write operations, and
multiple slave nodes that replicate the master’s data for read operations.
o The master node directs the slaves to replicate data, ensuring consistency
across nodes.
o Advantages:
Read performance is optimized as multiple slave nodes handle read
requests.
Writing is centralized, ensuring data consistency.
o Challenges:
The replication process can introduce some latency and complexity.
A failure of the master node may impact the write operations until a
failover mechanism is implemented.
o Example: MongoDB uses this model where data is replicated from the master
node to slave nodes.
4. Peer-to-Peer Distribution Model (PPD):
o In this model, all nodes are equal peers that both read and write data. Each
node has a copy of the data and can handle both read and write operations
independently.
o Advantages:
High Availability: Since all nodes can read and write, the system can
tolerate node failures without affecting the ability to perform writes.
MongoDB Database:
MongoDB is a widely-used open-source NoSQL database designed to handle large amounts
of data in a flexible, distributed manner. Initially developed by 10gen (now MongoDB Inc.),
MongoDB was introduced as a platform-as-a-service (PaaS) and later released as an open-
source database. It’s known for its document-oriented model, making it suitable for handling
unstructured and semi-structured data.
Key Characteristics of MongoDB:
Non-relational: Does not rely on traditional SQL-based relational models.
NoSQL: Flexible and can handle large volumes of data across multiple nodes.
Distributed: Data can be stored across multiple machines, supporting horizontal
scalability.
Open Source: Freely available for use and modification.
Document-based: Uses a document-oriented storage model, storing data in flexible
formats such as JSON.
Cross-Platform: Can be used across different operating systems.
Scalable: Can scale horizontally by adding more servers to handle growing data
needs.
Fault Tolerant: Provides high availability through replication and data redundancy.
Features of MongoDB:
1. Database Structure:
o Each database is a physical container for collections. Multiple databases can
run on a single MongoDB server. The server's main process is mongod and the
command-line client is mongo; within the client, the variable db refers to the
currently selected database (the default database is test).
2. Collections:
o Collections are analogous to tables in relational databases, and they store
multiple MongoDB documents. Collections are schema-less, meaning that
documents within a collection can have different fields and structures.
3. Document Model:
o Data is stored as documents in BSON, a binary representation of JSON. Each
document is a set of field-value pairs identified by a unique _id field; fields can
vary from document to document, giving a flexible, dynamic schema.
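A small sketch with the MongoDB Java driver shows the schema-less document model in practice. It assumes a local mongod on the default port 27017; the university database, the students collection, and its fields are illustrative, not from the notes.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;

public class MongoExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("university");
            MongoCollection<Document> students = db.getCollection("students");

            // Documents in the same collection may have different fields (schema-less)
            students.insertOne(new Document("name", "Alice").append("dept", "CS")
                    .append("marks", 91));
            students.insertOne(new Document("name", "Bob").append("dept", "IT"));

            // Query by a field value
            Document first = students.find(eq("dept", "CS")).first();
            System.out.println(first.toJson());
        }
    }
}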
MongoDB Replication
Replication in MongoDB is essential for high availability and fault tolerance in Big Data
environments. Replication involves maintaining multiple copies of data across different
database servers. In MongoDB, this is achieved using replica sets, which ensure data
redundancy and allow for continuous data availability even in the event of server failures.
How Replica Sets Work:
A replica set is a group of MongoDB server processes (mongod) that store the same
data. Each replica set has at least three nodes:
1. Primary Node: Receives all write operations.
2. Secondary Nodes: Replicate data from the primary node.
The primary node handles all write operations, and these are automatically propagated to the
secondary nodes. If the primary node fails, one of the secondary nodes is promoted to
primary in an automatic failover process, ensuring continuous availability.
o Commands for Replica Set Management:
rs.initiate(): Initializes a new replica set.
rs.config(): Checks the replica set configuration.
rs.status(): Displays the status of the replica set.
rs.add(): Adds new members to the replica set.
MongoDB Sharding
Sharding is MongoDB’s method of distributing data across multiple machines, particularly in
scenarios involving large amounts of data. It is useful for scaling out horizontally when a
single machine can no longer store or process the data efficiently.
How Sharding Works:
Shards: A shard is a single MongoDB server or replica set that holds part of the data.
Sharded Cluster: MongoDB uses a sharded cluster to distribute data. Each shard
contains a portion of the data, and queries are routed to the appropriate shard based on
a shard key.
Shard Key: A field in the documents used to determine how data is distributed across
the shards.
Sharding allows MongoDB to handle larger datasets and more operations by spreading the
load across multiple machines.
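The routing idea behind a shard key can be illustrated with a simple hash-based sketch; this is a conceptual illustration only, not MongoDB's actual chunk-balancing mechanism, and the shard-key values are made up.
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    // Map a shard-key value to one of the shards using its hash
    public int shardFor(Object shardKeyValue) {
        int hash = shardKeyValue.hashCode();
        return Math.floorMod(hash, numShards);   // always in [0, numShards)
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        // Documents with the same shard-key value always land on the same shard
        System.out.println("customer-1042 -> shard " + router.shardFor("customer-1042"));
        System.out.println("customer-7315 -> shard " + router.shardFor("customer-7315"));
    }
}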
Cassandra Database
Cassandra, developed by Facebook and later released by Apache, is a highly scalable NoSQL
database designed to handle large amounts of structured, semi-structured, and unstructured
data. The database is named after the Trojan mythological prophet Cassandra, who was
cursed to always speak the truth but never to be believed. It was initially designed by
Facebook to handle their massive data needs, and it has since been adopted by several large
companies like IBM, Twitter, and Netflix.
Characteristics:
Open Source: Cassandra is freely available and open to modifications.
Scalable: It is designed to scale horizontally by adding more nodes to the system.
NoSQL: It is a non-relational database, making it suitable for big data applications.
Distributed: Cassandra's architecture allows it to run on multiple servers, ensuring
high availability and fault tolerance.
Column-based: Data is stored in columns rather than rows, making it more efficient
for write-heavy workloads.
Decentralized: All nodes in a Cassandra cluster are peers, which ensures that there is
no single point of failure.
Fault-tolerant: Due to data replication across multiple nodes, Cassandra can
withstand node failures without data loss.
Tuneable consistency: It provides flexibility to choose the level of consistency for
different operations.
Features of Cassandra:
Maximizes write throughput: It is optimized for handling massive amounts of write
operations.
No support for joins, group by, OR clauses, or complex aggregations: Its
architecture focuses on performance rather than relational operations.
Fast and easily scalable: The database performs well as more nodes are added, and it
can handle high write volumes.
Distributed architecture: Data is distributed across the nodes in the cluster, ensuring
high availability.
Peer-to-peer: Nodes in Cassandra communicate with each other in a peer-to-peer
fashion, unlike master-slave architectures.
Data Replication in Cassandra: Cassandra provides data replication across multiple
nodes, ensuring no single point of failure. The replication factor defines the number of
replicas placed on different nodes. In case of stale data or node failure, Cassandra uses read
repair to ensure that all replicas are consistent. It adheres to the CAP theorem, prioritizing
availability and partition tolerance.
Scalability: Cassandra supports linear scalability. As new nodes are added to the cluster,
throughput increases and response time decreases. It uses a decentralized approach
where each node in the cluster is equally important.
Transaction Support: Cassandra does not provide full ACID transactions in the way a
traditional RDBMS does. Writes are atomic and durable at the row (partition) level, but
there are no multi-table joins or locking; consistency is tunable per operation and is
eventual by default, which favours high availability and fault tolerance.
Replication Strategies:
Simple Strategy: A straightforward replication factor for the entire cluster.
Network Topology Strategy: Allows replication factor configuration per data center,
useful for multi-data center deployments.
Cassandra Data Model:
Cluster: A collection of nodes and keyspaces.
Keyspace: The outermost container in Cassandra that holds column families (tables).
Each keyspace defines the replication strategy and factors.
Column: A single data point consisting of a name, value, and timestamp.
Column Family: A collection of columns, which is equivalent to a table in relational
databases.
Cassandra CQL (Cassandra Query Language):
CREATE KEYSPACE: Creates a keyspace to store tables. It includes replication
strategy options.
ALTER KEYSPACE: Modifies an existing keyspace.
DROP KEYSPACE: Deletes a keyspace.
USE KEYSPACE: Connects to a specific keyspace.
CREATE TABLE: Defines a new table with columns, including primary key
constraints.
ALTER TABLE: Modifies the structure of an existing table (e.g., adding or dropping
columns).
DESCRIBE: Provides detailed information about keyspaces, tables, indexes, etc.
CRUD Operations in Cassandra:
1. INSERT: Adds new data into a table.
o Example: INSERT INTO <tablename> (<columns>) VALUES (<values>);
2. UPDATE: Modifies existing data.
o Example: UPDATE <tablename> SET <column> = <value> WHERE
<condition>;
3. DELETE: Removes data from a table.
o Example: DELETE FROM <tablename> WHERE <condition>;
4. SELECT: Retrieves data from a table.
o Example: SELECT <columns> FROM <tablename> WHERE <condition>;
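The CQL statements above can also be issued programmatically. The sketch below uses the DataStax Java driver (version 4.x is assumed, with a single local node and an illustrative students table); keyspace, table, and data-center names are assumptions, not from the notes.
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import java.net.InetSocketAddress;

public class CassandraExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")   // assumed data-center name
                .build()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS university WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            session.execute("CREATE TABLE IF NOT EXISTS university.students "
                    + "(id int PRIMARY KEY, name text, dept text)");

            session.execute("INSERT INTO university.students (id, name, dept) "
                    + "VALUES (1, 'Alice', 'CS')");

            Row row = session.execute(
                    "SELECT name, dept FROM university.students WHERE id = 1").one();
            System.out.println(row.getString("name") + " - " + row.getString("dept"));
        }
    }
}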
MODULE 4
MapReduce, Hive and Pig
Map Reduce Programming Model
The MapReduce programming model is a powerful framework used for processing and
analysing large-scale datasets in a distributed computing environment. It divides tasks into
two core operations: Map and Reduce.
In the Map phase, the input data is split into smaller chunks and distributed across multiple
nodes for parallel processing, where each node produces key-value pairs as intermediate
outputs.
The Reduce phase then aggregates these outputs, combining them into a smaller, more
concise result. This parallelized approach allows for efficient handling of vast amounts of
data. Hadoop, one of the most widely used implementations of MapReduce, utilizes the
Hadoop Distributed File System (HDFS) for storing and retrieving data. In such systems,
nodes serve both as computational units and storage devices, optimizing resource use and
scalability.
The MapReduce model is highly applicable in big data scenarios, enabling tasks like log
analysis, data transformation, and large-scale data mining. Additionally, database techniques
such as indexing and inner joins further enhance the efficiency of data retrieval and
processing, making MapReduce a foundational concept for modern big data solutions.
MapReduce employs a master-slave architecture, consisting of a JobTracker as the master
and TaskTrackers as slaves that execute the individual Map and Reduce tasks on the data nodes. This division of work
allows for a wide range of data processing tasks, making MapReduce a robust solution for
handling diverse big data workloads.
Map-Tasks
A Map Task in the MapReduce programming model is responsible for processing input data
in the form of key-value pairs, denoted as (k1, v1). Here, k1 represents a set of
keys, and v1 is a value (often a large string) read from the input file(s). The map()
function implemented within the task executes the user application logic on these pairs. The
output of a map task consists of zero or more intermediate key-value pairs (k2, v2),
which are used as input for the Reduce task for further processing.
The Mapper operates independently on each dataset, without intercommunication between
Mappers. The output of the Mapper, v2, serves as input for transformation operations at
the Reduce stage, typically involving aggregation or other reducing functions. A Reduce
Task takes these intermediate outputs, processes them using a combiner, and generates a
smaller, summarized dataset. Reduce tasks are always executed after the completion of all
Map tasks.
The Hadoop Java API provides a Mapper class with a map() function. Any specific Mapper
implementation must extend this class and override the map() function to define its behaviour.
For instance:
public class SampleMapper extends Mapper<K1, V1, K2, V2> {
    // K1, V1, K2, V2 are placeholder types; a real Mapper uses concrete
    // Writable types such as LongWritable, Text or IntWritable.
    @Override
    protected void map(K1 key, V1 value, Context context)
            throws IOException, InterruptedException {
        // User-defined logic: emit zero or more (k2, v2) pairs via context.write()
    }
}
The number of Map tasks, Nmap, is determined by the size of the input files and the
block size of the Hadoop Distributed File System (HDFS).
For example, a 1 TB input file with a block size of 128 MB results in 8192 Map tasks. The
number of Map tasks can also be explicitly set using setNumMapTasks(int) and typically
ranges between 10–100 per node, though higher values can be configured for more granular
parallelism.
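As a concrete instance of the template above, a word-count Mapper (the standard introductory example, not taken from these notes) reads each line as a (byte offset, line text) pair and emits (word, 1) pairs:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself (from TextInputFormat)
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}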
Key-Value Pair
Each phase (Map phase and Reduce phase) of MapReduce has key-value pairs as input
and output. Data should be first converted into key-value pairs before it is passed to
the Mapper, as the Mapper only understands key-value pairs of data.
Key-value pairs in Hadoop MapReduce are generated as follows:
InputSplit - Defines a logical representation of the data and presents the split's data for
processing to an individual map().
RecordReader - Communicates with the InputSplit and converts the split into records,
which are key-value pairs in a format suitable for reading by the Mapper.
RecordReader uses TextInputFormat by default for converting data into key-value
pairs, and communicates with the InputSplit until the file is read.
In MapReduce, the Grouping by Key operation involves collecting and grouping all the
output key-value pairs from the mapper by their keys. This process aggregates values
associated with the same key into a list, which is crucial for further processing during the
Shuffle and Sorting Phase. During this phase, all pairs with the same key are grouped
together, creating a list for each unique key, and the results are sorted. The output format of
the shuffle phase is <k2, List(v2)>. Once the shuffle process completes, the data is divided
into partitions.
A Partitioner plays a key role in this step, distributing the intermediate data into different
partitions, ensuring efficient data handling across multiple reducers.
A Combiner is an optional, local reducer that aggregates map output records on each node
before the shuffle phase, optimizing data transfer between the mapper and reducer by
reducing the volume of data that needs to be shuffled across the network.
The Reduce Tasks then process the grouped key-value pairs, applying the reduce() function
to aggregate the data and produce the final output. Each reduce task receives a list of values
for each key and iterates over them to generate aggregated results, which are then outputted
in the form of key-value pairs (k3, v3). This setup, which includes the shuffle, partitioning,
combiner, and reduce phases, optimizes performance and reduces the network load in
distributed computing environments like Hadoop.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ExampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Processing logic for each key and its grouped list of values
        // Example: sum of the values for each key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit the final output key-value pair (k3, v3)
        context.write(key, new IntWritable(sum));
    }
}
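A driver class (a standard Hadoop pattern, sketched here with illustrative input and output paths) wires the WordCountMapper and ExampleReducer shown above into a job, and can also register the reducer as a combiner for local aggregation before the shuffle:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(ExampleReducer.class);   // optional local reducer
        job.setReducerClass(ExampleReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/input"));     // illustrative path
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));  // illustrative path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}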
Coping with Node Failure
Hadoop achieves fault tolerance by restarting tasks that fail during the execution of a
MapReduce job.
3. Counting Distinct Values
Finding distinct values is a common task in applications like web log analysis or counting
unique users. Here are two possible solutions for counting unique values:
1. First Solution: Mapper emits dummy counters for each field and group ID, and the
reducer calculates the total number of occurrences for each pair.
2. Second Solution: The Mapper emits values and group IDs, and the reducer excludes
duplicates and counts unique values for each group.
Example: Counting unique users by their ID in web logs.
Mapper: Emits the user ID with a dummy count.
Reducer: Filters out duplicate user IDs and counts the total number of unique users.
4. Collating
Collating involves collecting all items with the same key into a list. This is useful for
operations like producing inverted indexes or performing extract, transform, and load (ETL)
tasks.
Mapper: Computes a given function for each item and emits the result as a key, with
the item itself as a value.
Reducer: Groups items by key and processes them.
Example: Creating an inverted index.
Mapper: Emits each word from the document as a key and the document ID as the
value.
Reducer: Collects all document IDs for each word, producing a list of documents
where each word appears.
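A compact sketch of this inverted-index pattern follows; obtaining the document name from the input split is an implementation detail assumed here, not stated in the notes.
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Use the current file name as the document ID
            String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), new Text(docId)); // (word, docId)
                }
            }
        }
    }

    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            Set<String> docs = new LinkedHashSet<>();   // drop duplicate document IDs
            for (Text id : docIds) {
                docs.add(id.toString());
            }
            context.write(word, new Text(String.join(",", docs)));
        }
    }
}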
5. Filtering or Parsing
Filtering or parsing is used when processing datasets to collect only the items that satisfy
certain conditions or transform items into other formats.
Mapper: Accepts only items that satisfy specific conditions and emits them.
Reducer: Collects all the emitted items and outputs the results.
Example: Extracting valid records from a log file.
Mapper: Filters records based on a condition (e.g., logs with errors) and emits the
valid records.
Reducer: Collects the valid records and saves them.
6. Distributed Tasks Execution
Large-scale computations are divided into multiple partitions and executed in parallel. The
results from each partition are then combined to produce the final result.
Mapper: Processes a specific partition of the data and emits the computed results.
Reducer: Combines the partial results from all partitions to produce the final result.
Relational Algebra Operations
1. Selection (σ)
The Selection operation selects the rows (tuples) of a relation that satisfy a given condition.
Syntax:
σcondition(R)
Where:
condition is a predicate (a logical condition) that the rows must satisfy.
R is the relation (table) from which rows are selected.
Example:
Consider a relation Employees with attributes (EmpID, Name, Age, Department).
Employees:
EmpID Name Age Department
101 Alice 30 HR
102 Bob 25 IT
103 Carol 35 HR
Selecting the employees in the HR department, σDepartment='HR'(Employees), gives:
EmpID Name Age Department
101 Alice 30 HR
103 Carol 35 HR
2. Projection (π)
The Projection operation is used to select specific columns from a relation, effectively
reducing the number of attributes in the resulting relation. It eliminates duplicate rows in the
result.
Syntax:
πattribute1, attribute2, ..., attributeN(R)
Where:
attribute1, attribute2, ..., attributeN are the columns to be selected from the
relation.
R is the relation from which attributes are selected.
Example:
Consider the Employees relation again. If we only want to select the Name and Department
columns, we would write: πName, Department(Employees)
This would produce the following result:
Name Department
Alice HR
Bob IT
Carol HR
3. Union (∪)
The Union operation combines the rows of two relations, removing duplicates. The two
relations involved must have the same set of attributes (columns).
Syntax:
R∪S
Where:
R and S are two relations with the same schema (same attributes).
Example:
Let’s assume two relations:
Employees (EmpID, Name) and Contractors (EmpID, Name).
Employees:
EmpID Name
101 Alice
102 Bob
Contractors:
EmpID Name
103 Carol
102 Bob
The result of Employees ∪ Contractors is:
EmpID Name
101 Alice
102 Bob
103 Carol
4. Set Difference (−)
The Set Difference operation returns the rows that are present in one relation but not in the
other. Like the Union operation, the two relations must have the same schema.
Syntax:
R−S
Where:
R is the first relation.
S is the second relation.
Example:
If we subtract Contractors from Employees: Employees − Contractors
This would result in:
EmpID Name
101 Alice
5. Cartesian Product (×)
The Cartesian Product operation pairs every row of one relation with every row of another
relation. If R has m rows and S has n rows, R × S contains m × n rows.
Syntax:
R × S
Example:
Employees:
EmpID Name
101 Alice
102 Bob
Departments:
DeptID Department
D01 HR
D02 IT
Employees × Departments produces every pairing of an employee with a department:
EmpID Name DeptID Department
101 Alice D01 HR
101 Alice D02 IT
102 Bob D01 HR
102 Bob D02 IT
6. Rename (ρ)
The Rename operation is used to rename the attributes (columns) of a relation or to change
the name of the relation itself. This operation is particularly useful when combining relations
in operations like join.
Syntax:
ρNewName(OldName)(R)
Where:
NewName is the new name of the relation.
OldName is the current name of the relation.
R is the relation.
Example:
If we have a relation Employees and want to rename the attribute EmpID to EmployeeID,
we would write: ρEmployees(EmpID → EmployeeID)(Employees)
This would result in the following relation:
EmployeeID Name Age Department
101 Alice 30 HR
102 Bob 25 IT
103 Carol 35 HR
7. Join (⨝)
The Join operation combines two relations based on a common attribute. It is one of the most
important operations in relational algebra, as it allows combining data from different tables.
Types of Join:
Inner Join: Combines rows from both relations where the join condition is true.
Outer Join: Returns all rows from one or both relations, with null values for
unmatched rows.
Syntax:
R ⨝condition S
Where:
R and S are relations.
condition specifies the common attribute used for the join.
Example:
Consider the following relations:
Employees:
EmpID Name
101 Alice
102 Bob
Departments:
EmpID Department
101 HR
102 IT
The join Employees ⨝EmpID Departments matches rows with the same EmpID:
EmpID Name Department
101 Alice HR
102 Bob IT
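A common way to implement such a join in MapReduce is a reduce-side join. The sketch below is illustrative only: it assumes comma-separated input files ("EmpID,Name" and "EmpID,Department"), that each mapper is attached to its own input path (for example with MultipleInputs in the driver), and the class names are hypothetical.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Each mapper emits the join key (EmpID) and a tagged value so the reducer
// can tell which relation a record came from.
class EmployeeMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");   // "EmpID,Name"
        if (f.length >= 2) {
            context.write(new Text(f[0].trim()), new Text("E:" + f[1].trim()));
        }
    }
}

class DepartmentMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");   // "EmpID,Department"
        if (f.length >= 2) {
            context.write(new Text(f[0].trim()), new Text("D:" + f[1].trim()));
        }
    }
}

// Reducer: for each EmpID, pair every employee name with every department value,
// producing the inner join rows (EmpID, Name, Department).
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text empId, Iterable<Text> tagged, Context context)
            throws IOException, InterruptedException {
        List<String> names = new ArrayList<>();
        List<String> departments = new ArrayList<>();
        for (Text t : tagged) {
            String s = t.toString();
            if (s.startsWith("E:")) names.add(s.substring(2));
            else if (s.startsWith("D:")) departments.add(s.substring(2));
        }
        for (String name : names) {
            for (String dept : departments) {
                context.write(empId, new Text(name + "\t" + dept));
            }
        }
    }
}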
Hive
Hive is a data warehousing and SQL-like query system built on top of Hadoop. It was
originally developed by Facebook to manage large amounts of data in Hadoop's distributed
file system (HDFS). Hive simplifies the process of querying and managing large-scale
datasets by providing an abstraction layer that allows users to run SQL-like queries (HiveQL)
on top of the Hadoop ecosystem.
Characteristics of Hive
1. MapReduce Integration:
Hive translates queries written in Hive Query Language (HiveQL) into MapReduce jobs.
This makes Hive scalable and suitable for managing and analyzing vast datasets,
particularly static data. Since Hive uses MapReduce, it inherits the scalability and parallel
processing capabilities of Hadoop.
Hive Architecture
The architecture of Hive is designed to provide an abstraction layer on top of Hadoop,
allowing users to run SQL-like queries (HiveQL) for managing and analyzing large datasets
stored in HDFS. Hive architecture consists of several key components that work together to
enable querying, execution, and management of data within the Hadoop ecosystem.
3. Web Interface:
o Usage: The web interface provides a graphical interface for executing queries,
managing tables, and performing administrative tasks without needing to use
the CLI.
4. Metastore:
o Function: The Metastore is a crucial component of Hive that stores all the
metadata (schema information) related to the tables, databases, and columns.
o Metadata: It stores information such as the database schema, column data
types, and HDFS locations of the data files.
o Interaction: All other components of Hive interact with the Metastore to fetch
or update metadata. For example, when a user queries a table, the Metastore
helps locate the corresponding data in HDFS.
o Storage: The Metastore typically uses a relational database (like MySQL or
PostgreSQL) to store this metadata.
5. Hive Driver:
o Function: The Hive Driver manages the lifecycle of a HiveQL query.
o Lifecycle Management: It is responsible for compiling the HiveQL query,
optimizing it, and finally executing the query on the Hadoop cluster.
o Execution Flow:
Compilation: The Hive Driver compiles the HiveQL statement into a
series of MapReduce jobs (or other execution plans depending on the
environment).
Optimization: The query is then optimized for execution. This may
include tasks such as predicate pushdown, column pruning, and join
optimization.
Execution: The final optimized query is submitted for execution on the
Hadoop cluster, where it is processed by the MapReduce framework.
6. Query Compiler:
o Function: The Query Compiler is responsible for parsing the HiveQL
statements and converting them into execution plans that are understandable
by the Hadoop system.
o Stages: The process involves the compilation of the HiveQL statement into an
Abstract Syntax Tree (AST), followed by the generation of a logical query
plan and its optimization before the physical plan is produced.
7. Execution Engine:
o Function: The Execution Engine is responsible for the actual execution of the
query.
Bucketing
How It Works: Data is divided into a specific number of buckets (files) by hashing a
particular column's value. Each bucket corresponds to one file stored in the partition's
directory.
Example: A customer table might be bucketed by the customer_id column, ensuring
that the data for each customer is stored in a separate bucket.
Hive Integration and Workflow Steps
Hive’s integration with Hadoop involves several key components that handle the query
execution, metadata retrieval, and job management.
1. Execute Query:
o The query is sent from the Hive interface (CLI, Web Interface, etc.) to the
Database Driver, which is responsible for initiating the execution process.
2. Get Plan:
o The Driver forwards the query to the Query Compiler. The compiler parses
the query and creates an execution plan, verifying the syntax and determining
the operations required.
3. Get Metadata:
o The Compiler requests metadata information (like table schema, column
types, etc.) from the Metastore (which can be backed by databases like
MySQL or PostgreSQL).
4. Send Metadata:
o The Metastore responds with the metadata, and the Compiler uses this
information to refine the query plan.
5. Send Plan:
o After parsing the query and receiving metadata, the Compiler sends the
finalized query execution plan back to the Driver.
6. Execute Plan:
o The Driver sends the execution plan to the Execution Engine, which is
responsible for actually running the query on the Hadoop cluster.
7. Execute Job:
o The execution engine triggers the execution of the query, which is typically
translated into a MapReduce job. This job is sent to the JobTracker (running
on the NameNode), which assigns tasks to TaskTrackers on DataNodes for
parallel processing.
8. Metadata Operations:
o During the execution, the Execution Engine may also perform metadata
operations with the Metastore, such as querying schema details or updating
the metastore.
9. Fetch Result:
o After completing the MapReduce job, the Execution Engine collects the
results from the DataNodes where the job was processed.
10. Send Results:
o The results are sent back to the Driver, which in turn forwards them to the
Hive interface for display to the user.
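From an application's point of view, this whole workflow is usually hidden behind a client interface such as the HiveServer2 JDBC driver. The sketch below is illustrative only; the HiveServer2 address (localhost:10000), credentials, and the employees table are hypothetical placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Submits a HiveQL query through HiveServer2 using the Hive JDBC driver.
public class HiveQueryClient {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // register the driver
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles, optimizes and executes the query; the client only
             // sees the result set fetched back through the Driver.
             ResultSet rs = stmt.executeQuery(
                     "SELECT department, COUNT(*) FROM employees GROUP BY department")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}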
Hive Built-in Functions
Hive provides a wide range of built-in functions to operate on different data types, enabling
various data transformations and calculations. Here’s a breakdown of some common built-in
functions in Hive:
1. BIGINT Functions
round(double a)
o Description: Returns the rounded BIGINT (8-byte integer) value of the 8-byte
double-precision floating point number a.
o Return Type: BIGINT
o Example: round(123.456) returns 123.
floor(double a)
o Description: Returns the maximum BIGINT value that is equal to or less than
the double value.
o Return Type: BIGINT
o Example: floor(123.789) returns 123.
ceil(double a)
o Description: Returns the minimum BIGINT value that is equal to or greater
than the double value.
o Return Type: BIGINT
o Example: ceil(123.456) returns 124.
2. Random Number Generation
rand(), rand(int seed)
o Description: Returns a random number (double) that is uniformly distributed
between 0 and 1. The sequence changes with each row, and specifying a seed
ensures the random number sequence is deterministic.
o Return Type: double
o Example: rand() returns a random number like 0.456789, and rand(5) will
generate a sequence based on the seed 5.
3. String Functions
concat(string str1, string str2, ...)
o Description: Concatenates two or more strings into one.
o Return Type: string
o Example: concat('Hello ', 'World') returns 'Hello World'.
substr(string str, int start)
o Description: Returns a substring of str starting from the position start till the
end of the string.
o Return Type: string
o Example: substr('Hello World', 7) returns 'World'.
substr(string str, int start, int length)
o Description: Returns a substring of str starting from position start with the
given length.
o Return Type: string
o Example: substr('Hello World', 1, 5) returns 'Hello'.
upper(string str), ucase(string str)
o Description: Converts all characters of str to upper case.
o Return Type: string
o Example: upper('hello world') returns 'HELLO WORLD'.
HiveQL Features
Data Definition: Allows users to define and manage the schema of tables, databases,
etc.
Data Manipulation: Enables the manipulation of data, such as inserting, updating, or
deleting records (although with some limitations).
Query Processing: Supports querying large datasets using operations like filtering,
joining, and aggregating data.
HiveQL Process Engine
The HiveQL Process Engine translates HiveQL queries into execution plans and
communicates with the Execution Engine to run the query. It is a replacement for the
traditional approach of writing Java-based MapReduce programs.
Hive Execution Engine
The Execution Engine is the component that bridges HiveQL and MapReduce. It
processes the query and generates results in the same way that MapReduce jobs
would do. It uses a variant of MapReduce to execute HiveQL queries across a
distributed Hadoop cluster.
HiveQL Data Definition Language (DDL)
HiveQL provides several commands for defining databases and tables. These commands are
used to manage the structure of the data in Hive.
Creating a Database
To create a new database in Hive, the following command is used:
CREATE DATABASE [IF NOT EXISTS] <database_name>;
IF NOT EXISTS: Ensures that Hive does not throw an error if the database already
exists.
Example:
CREATE DATABASE IF NOT EXISTS my_database;
Show Databases
To list all the databases in Hive, use the command:
SHOW DATABASES;
Dropping a Database
To remove a database, use:
DROP DATABASE [IF EXISTS] <database_name>;
Example:
DROP DATABASE IF EXISTS my_database;
Apache Pig
Pig is a high-level platform built on top of Hadoop to facilitate the processing of large
datasets. It abstracts the complexities of writing MapReduce programs and provides a more
user-friendly interface for data manipulation.
Features of Apache Pig
Dataflow Language: Pig uses a dataflow language, where operations on data are
linked in a chain, and the output of one operation is the input to the next.
Simplifies MapReduce: Pig reduces the complexity of writing raw MapReduce
programs by providing a higher-level abstraction.
Parallel Processing: Pig allows the execution of tasks in parallel, which makes it
suitable for handling large datasets.
Flexible: It can process structured, semi-structured, and unstructured data.
High-level Operations: Supports complex data manipulation tasks like filtering,
joining, and aggregating large datasets.
Applications of Apache Pig
Large Dataset Analysis: Ideal for analyzing vast amounts of data in HDFS.
Ad-hoc Data Processing: Useful for quick, one-time data processing tasks.
Processing Streaming Data: It can process web logs, sensor data, or other real-time
data.
Search Platform Data Processing: Pig can be used for processing and analyzing data
related to search platforms.
Time-sensitive Data Processing: Processes and analyzes data quickly, which is
essential for applications that require fast insights.
Pig scripts are often used in combination with Hadoop for data processing at scale, making it
a powerful tool for big data analytics.
Pig Architecture
The Pig architecture is built to support flexible and scalable data processing in a Hadoop
ecosystem. It executes Pig Latin scripts via three main methods:
1. Grunt Shell: An interactive shell that executes Pig scripts in real time.
2. Script File: A file containing Pig commands that are executed on a Pig server.
3. Embedded Script: Pig Latin functions that can be written as User-Defined Functions
(UDFs) in different programming languages and embedded within Pig scripts.
Pig Execution Modes
Pig Latin scripts can be executed in three modes:
1. Interactive Mode: This mode uses the Grunt shell. It allows you to write and
execute Pig Latin scripts interactively, making it ideal for quick testing and
debugging.
2. Batch Mode: In this mode, you write the Pig Latin script in a single file with a .pig
extension. The script is then executed as a batch process.
3. Embedded Mode: This mode involves defining User-Defined Functions (UDFs) in
programming languages such as Java, and using them in Pig scripts. It allows for
more advanced functionality beyond the built-in operations of Pig.
Pig Commands
To get a list of Pig commands:
pig -help
To check the version of Pig:
pig -version
To start the Grunt shell:
pig
Load Command
The LOAD command in Pig is used to load data into the system from various data sources.
Here's how it works:
Loading data from HBase:
book = LOAD 'MyBook' USING HBaseStorage();
Loading data from a CSV file using PigStorage, with a comma as a separator:
book = LOAD 'PigDemo/Data/Input/myBook.csv' USING PigStorage(',');
Specifying a schema while loading data: You can define a schema for the loaded
data, which helps in interpreting each field of the record.
book = LOAD 'MyBook' AS (name:chararray, author:chararray, edition:int,
publisher:chararray);
Store Command
The STORE command writes the processed data to a storage location, typically HDFS. It can
store data in various formats.
Default storage in HDFS (tab-delimited format):
STORE processed INTO '/PigDemo/Data/Output/Processed';
Storing data in HBase:
STORE processed INTO 'MyBook' USING HBaseStorage();
Module 5
Machine Learning Algorithms for Big Data Analytics
Artificial Intelligence (AI) is the field of computer science focused on creating machines
capable of performing tasks that traditionally require human intelligence. These tasks include
predicting future outcomes, recognizing visual patterns, understanding and processing
speech, making decisions, and engaging in natural language processing. AI systems aim to
mimic human cognitive abilities, allowing them to handle complex processes that are
typically done by humans, such as problem-solving and learning from experience.
Machine Learning (ML), a key subset of AI, revolves around the ability of systems to learn
from data without being explicitly programmed for specific tasks. It involves three main
stages: collecting data, analysing it to identify patterns, and predicting future outcomes based
on those patterns. Over time, as the system processes more data, its performance improves,
enabling it to make more accurate and efficient decisions. ML is used across various
industries and research fields to support decision-making and automation.
Deep Learning (DL) is an advanced approach within machine learning that uses complex
models, such as artificial neural networks (ANN), to simulate the human brain's learning
process. These models are designed to analyse large datasets with multiple layers of
information, making them highly effective for tasks like computer vision, speech recognition,
natural language processing, and bioinformatics. Deep learning techniques can produce
results that rival or exceed human-level performance, enabling breakthroughs in fields like
AI-assisted medical research, automated translation, and more.
Estimating the relationships, outliers, variances, probability distributions and
correlations
Statistical analysis distinguishes several types of variables when estimating relationships,
outliers, variances, probability distributions, and correlations:
1. Independent Variables: These are directly measurable characteristics that are not
affected by other variables. Examples include the year of sales or the semester of
study. The value of an independent variable does not depend on any other variable.
2. Dependent Variables: These represent characteristics that are influenced by
independent variables. For example, profit over successive years or grades awarded in
successive semesters are dependent on other factors. The value of a dependent
variable depends on the value of the independent variable.
3. Predictor Variable: This is an independent variable that helps predict the value of a
dependent variable using an equation, function, or graph. For example, it can predict
sales growth of a car model after five years from past sales data.
4. Outcome Variable: This represents the effect of manipulations using a function,
equation, or experiment. For instance, the CGPA of a student is an outcome variable
that depends on the grades awarded during semesters.
5. Explanatory Variable: An independent variable that explains the behavior of the
dependent variable, such as factors influencing the growth of profit, including the
amount of investment.
Outliers
Outliers are data points that deviate significantly from the other data points in a dataset. They
are numerically far distant from the rest of the points and can indicate anomalous situations
or errors in data collection. Outliers can occur due to various reasons, such as:
Anomalous situations: Unexpected or rare events that deviate from the norm.
Presence of previously unknown facts: New, unrecognized factors that may cause
unusual data points.
Human error: Mistakes made during data entry or collection.
Standard Deviation: The standard deviation measures the spread of the data points around the
mean. It is calculated as:
S = √( Σ (Xi − μ)² / N )
Where:
S is the standard deviation.
Xi is each data point.
μ is the mean of the data.
N is the number of data points.
Standard Error: The standard error estimate measures the accuracy of predictions made by a
model or relationship. It is related to the sum of squared deviations (also known as the sum of
squares error). The formula for the standard error of the estimate is:
Where:
Yi is the observed value.
Ŷi is the predicted value.
Analysis of Variance (ANOVA) compares the variance between groups with the variance
within groups using the F-test statistic:
F = E1(V) / E2(V)
Where:
E1(V) is the estimate of the variance between the groups.
E2(V) is the estimate of the variance within the groups
F-distribution and Critical Value:
To determine whether the F-test statistic is significant, we compare it against a critical value
from the F-distribution table, which depends on the degrees of freedom for both the
numerator (between-group variance) and denominator (within-group variance). If the
calculated F-value is greater than the critical value from the F-table, the null hypothesis is
rejected.
Correlation
Correlation measures the strength and direction of the relationship between two variables. It
quantifies how one variable changes with respect to another and is used to assess whether and
how strongly pairs of variables are related.
R-Squared (R2):
R-Squared is a statistical measure used to evaluate the goodness of fit in a regression
model.
It is also called the coefficient of determination and represents the proportion of the
variance in the dependent variable that can be explained by the independent
variable(s) in the model.
R2 is the square of the Pearson correlation coefficient (R) and ranges from 0 to 1. A
higher R2 value indicates a better fit of the model to the data.
Interpretation:
o R2=1: Perfect fit, where the predicted values are identical to the observed
values.
o R2=0: No correlation between the model and the observed data.
o Larger R2 values indicate a better model fit, implying stronger correlation
between the variables.
Regression Analysis
Regression analysis is a statistical method used to estimate the relationships among
variables. It helps understand how the dependent variable (also known as the response
variable) changes when one or more independent variables (predictor or explanatory
variables) are modified. The main goal of regression analysis is to model these relationships
to make predictions about future values of the dependent variable.
Multivariate Distribution and Regression
In regression analysis, we often deal with multivariate distributions, where multiple
variables are involved. For example, if a company wants to predict future sales of Jaguar cars
based on past sales data, it would analyze the relationship between the sales in previous years
and sales in the current year using regression models.
Regression analysis estimates how one or more independent variables influence the
dependent variable. It involves identifying the strength and nature of the relationship (e.g.,
linear or non-linear) between these variables.
Non-linear and Linear Regression
Non-linear Regression:
Non-linear regression is used when the relationship between the independent and
dependent variables is not linear.
The general equation for non-linear regression can have multiple terms (3 or more) on
the right-hand side of the equation, representing more complex relationships between
variables.
Here, y is the dependent variable, x1, x2, x3, … are the independent variables, and
a1, a2, a3, … are their corresponding coefficients.
Linear Regression:
Linear regression assumes that the relationship between the dependent and
independent variables can be modelled using a straight line.
It is a simpler form of regression where only the first two terms are considered in the
equation.
Simple Linear Regression
Simple Linear Regression is one of the most widely used techniques in regression analysis.
It is a supervised machine learning algorithm that aims to predict the value of a dependent
variable using one independent variable. It is the simplest form of regression, where the
relationship between the independent variable x and the dependent variable y is assumed
to be linear.
Key Features of Simple Linear Regression:
The objective is to fit a line (called the regression line) that minimizes the deviation
from all the data points.
The deviation from the line is called the error or residual.
The equation of the regression line is typically written as:
y = mx + c
Where:
o y is the dependent variable (what we want to predict).
o x is the independent variable (the predictor).
o m is the slope of the line, which represents how much y changes for a one-
unit change in x.
o c is the intercept, the value of y when x = 0.
Objective of Simple Linear Regression:
The goal is to find the best-fitting line that minimizes the total error (the deviation of
observed data points from the predicted values). This is often done using the least
squares method, which minimizes the sum of squared errors (residuals).
Steps in Performing Simple Linear Regression:
1. Collect Data: Gather the data points for both independent and dependent variables.
2. Fit a Line: Use statistical methods (e.g., least squares) to fit the line that minimizes
the error.
3. Predict: Once the regression line is obtained, it can be used to predict the value of y
for new values of x.
4. Evaluate the Model: The accuracy of the model can be measured using statistical
metrics like R-squared ( R2) and mean squared error (MSE).
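The least-squares computation itself is small enough to sketch directly. The standalone Java example below is not from the notes and uses made-up sample data; it fits y = mx + c and reports R-squared for the fit.

// Least-squares fit of y = m*x + c for paired observations, plus R-squared.
public class SimpleLinearRegression {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};           // independent variable (e.g., year)
        double[] y = {2.1, 4.0, 6.2, 8.1, 9.9}; // dependent variable (e.g., sales)
        int n = x.length;

        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;

        // Slope m = sum((x-meanX)(y-meanY)) / sum((x-meanX)^2); intercept c = meanY - m*meanX
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double m = sxy / sxx;
        double c = meanY - m * meanX;

        // R-squared = 1 - SS_res / SS_tot measures how well the line fits the data.
        double ssRes = 0, ssTot = 0;
        for (int i = 0; i < n; i++) {
            double predicted = m * x[i] + c;
            ssRes += (y[i] - predicted) * (y[i] - predicted);
            ssTot += (y[i] - meanY) * (y[i] - meanY);
        }
        double rSquared = 1 - ssRes / ssTot;

        System.out.printf("m = %.4f, c = %.4f, R^2 = %.4f%n", m, c, rSquared);
    }
}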
Multiple Regression
Multiple regression is an extension of simple linear regression that allows for the prediction
of a criterion variable (dependent variable) using two or more predictor variables
(independent variables). While simple linear regression predicts a dependent variable from
one independent variable, multiple regression considers multiple independent variables
simultaneously, making it ideal for more complex scenarios where several factors affect the
outcome.
Why Use Multiple Regression?
Real-world scenarios: Many real-world phenomena are influenced by multiple
factors. For example, a company's sales may depend on various factors like
advertising budget, season, customer sentiment, and economic conditions. Multiple
regression helps in modelling such complex relationships.
Forecasting and Prediction: Multiple regression is often used for forecasting future
values by considering multiple influencing factors. It is also useful for assessing the
strength of these predictors.
Example of Multiple Regression Model:
The general form of the multiple regression equation is:
y = b0 + b1x1 + b2x2 + … + bnxn + ϵ
Where:
y is the dependent variable (the outcome we are predicting).
b0 is the intercept (the predicted value when all predictors are zero).
b1, b2, …, bn are the regression coefficients for the independent variables x1, x2, …, xn,
indicating how much change in y is expected with a one-unit change in each predictor.
ϵ is the error term, representing unexplained variation in y (residuals).
Applications of Multiple Regression
1. Sales Forecasting:
o A company can use multiple regression to predict future sales based on several
variables, such as advertising spend, promotions, and seasonality. The
regression model would allow the company to forecast future sales more
accurately by considering these factors together.
2. Marketing Investment Analysis:
o A company may analyze whether investments in marketing campaigns (e.g.,
TV and radio ads) yield substantial returns. Using multiple regression, the
company can evaluate the individual impact of TV and radio ads as well as
their combined effect on sales.
Text Mining
Text Mining is the process of extracting valuable knowledge, insights, and patterns from
large collections of textual data. This involves analysing text data in a structured or
unstructured form and is used to uncover patterns, relationships, and insights that may not be
immediately apparent.
Text mining is particularly important due to the large amount of text-based data generated in
the world today. With the rise of social media, user-generated content such as text, images,
and videos has increased exponentially. Text mining plays a crucial role in analyzing and
understanding these vast amounts of data for actionable insights across various domains.
Applications of Text Mining
3. Legal:
Legal Case Search: Text mining tools can assist lawyers and paralegals in searching
vast databases of legal documents, case histories, and laws. This can help them find
relevant documents quickly, improving the efficiency and effectiveness of their legal
research.
E-Discovery: Text mining is embedded in e-discovery platforms, helping
organizations minimize the risk associated with sharing legally mandated documents.
These platforms assist in ensuring that relevant legal documents are properly
reviewed, managed, and stored.
Predictive Legal Insights: Case histories, testimonies, and client meeting notes can
be analysed to uncover additional insights that may help predict high-cost injuries or
legal issues. This analysis can contribute to better legal strategies and cost-saving
measures.
4. Governance and Politics:
Social media and Public Sentiment: Text mining can be used to analyse public
sentiment on social media platforms. This can help governments and political parties
gauge the mood of constituents, track public opinions, and adjust their strategies
accordingly.
Micro-Targeting in Elections: Social network analysis enables political campaigns
to create targeted messages based on data gathered from social media. This approach
helps political campaigns more efficiently use resources and reach voters with
messages tailored to their specific concerns.
Geopolitical Security: Text mining can be applied to real-time internet chatter to
detect emerging threats or crises. By analyzing large-scale social media data,
governments and organizations can gain valuable intelligence to improve security
measures.
Research Trend Analysis: In academic and research fields, text mining can help
analyse large amounts of research papers and publications to identify emerging trends.
Meta-analysis of academic research using text mining can uncover important insights
and direct future research initiatives.
Text Mining Process
Text mining is a rapidly growing field, especially with the increasing volume of social media
and other text data. To manage this data, there is a need for efficient techniques to extract
meaningful information. The process of text mining can be divided into five phases, ranging
from pre-processing of the raw text to the analysis and interpretation of the extracted patterns.
Text Mining Techniques
1. Unsupervised Learning (Clustering):
In unsupervised learning, the data is unlabelled, and the model identifies natural
clusters or groups within the data. Examples include grouping similar documents
together based on content.
2. Supervised Learning (Classification):
In supervised learning, the data is labelled, and the model is trained to classify new
data based on these labels. Examples include spam email classification or sentiment
analysis.
3. Evolutionary Pattern Identification:
This technique identifies patterns over time, such as analyzing news articles to
summarize events or identifying trends in research literature.
Text Mining Challenges
1. Natural Language Processing (NLP) Issues
2. Ambiguity:
Words and phrases can have multiple meanings depending on context, which creates
ambiguity. For example, the word "bat" can refer to a flying mammal or a piece of
sports equipment. Resolving such ambiguity requires sophisticated context
understanding.
3. Tokenization:
Tokenization is the process of splitting text into smaller units (tokens), such as words
or phrases. However, tokenizing text correctly can be challenging due to punctuation,
contractions, and compound words that may not be straightforward to split.
4. Parsing:
Parsing aims to analyze the grammatical structure of sentences. This task can be
complicated by sentence complexity, variations in sentence structure, and non-
standard language usage, leading to difficulty in constructing accurate parse trees.
5. Stemming:
Stemming reduces words to their root form, but it is not always perfect. For example,
stemming might strip the suffixes in words like "running" or "better," which could
result in ambiguous roots like "run" or "good."
6. Synonymy and Polysemy:
o Synonymy refers to the challenge of identifying words with similar meanings
(e.g., "car" and "automobile").
o Polysemy involves words with multiple meanings, which can confuse systems
(e.g., "bank" can mean a financial institution or the side of a river). Addressing
these issues requires deep contextual understanding.
2. Mining Techniques
Various mining techniques face challenges, including:
1. Identification of Suitable Algorithm(s):
There is no one-size-fits-all algorithm for text mining. Choosing the right algorithm
depends on the task (e.g., classification, clustering) and the nature of the text. The
diversity of text data requires selecting appropriate algorithms to handle different
types of text and tasks effectively.
2. Massive Amount of Data and Annotated Corpora:
Text mining often deals with vast amounts of unstructured data, and for supervised
learning tasks, large annotated corpora (labelled data) are needed. Annotating such
massive datasets is time-consuming and expensive, which is a significant barrier.
3. Concepts and Semantic Relations Extraction:
Extracting meaningful concepts and understanding the relationships between them in
text is complex. This requires understanding deeper semantics and context, which can
be difficult to model accurately.
3. Variety of Data
The variety of data types and sources adds another layer of complexity to text mining:
1. Different Data Sources Require Different Approaches and Areas of Expertise:
Text data can come from various sources, such as social media, scientific articles,
news, and books. Each type of text may require different pre-processing, feature
extraction, and analysis techniques.
2. Unstructured and Language Independence:
Much of the text data is unstructured, meaning it doesn't have a predefined format,
making it harder to process. Additionally, text mining systems may need to work
across different languages, each with its unique structure and nuances, requiring
language-independent approaches.
4. Information Visualization
Once insights are extracted from text, presenting them in a meaningful way is a challenge.
Text mining results can be complex and multidimensional, so effective visualization tools are
needed to make the insights understandable and actionable for users.
6. Scalability
Text mining systems must be scalable to handle large volumes of text. As the amount of text
data continues to grow, the system needs to scale efficiently, ensuring that computational
resources are not exhausted and that processing time remains manageable even with large
datasets.
Web Mining
Web mining is the process of discovering patterns and insights from data on the World Wide
Web to enhance the web and improve the user experience. As the web grows exponentially,
with more data being uploaded every day than the entire web had just two decades ago, web
mining has become a crucial tool for understanding and optimizing how the internet is used.
The web serves various functions, including electronic commerce, business communication,
and social interactions, which makes it essential to extract valuable insights from web data.
Web mining collects data through web crawlers, web logs, and other means, helping to
uncover trends that can optimize content, improve user experiences, and provide business
insights.
Characteristics of Optimized Websites
For a website to be considered optimized, it needs to have several key characteristics across
three main aspects: appearance, content, and functionality.
1. Appearance
Aesthetic Design: The design of a website plays a significant role in user
engagement. A visually appealing website captures attention and encourages
interaction.
Well-formatted Content: The content should be easy to read, scannable, and
logically structured to enhance user experience.
Easy Navigation: A website with clear navigation pathways ensures that users can
easily find the information they are looking for.
Good Color Contrasts: Proper contrast improves readability and enhances the visual
appeal of the website.
2. Content
Well-planned Information Architecture: The content should be structured logically
and organized in a way that is intuitive for users.
Fresh Content: Regularly updated content ensures that visitors have access to the
latest information and keep them returning.
Search Engine Optimization (SEO): Optimizing the content for search engines
makes it more discoverable, driving organic traffic to the site.
Links to Other Good Sites: Having quality external links helps build authority and
offers additional value to users.
3. Functionality
Accessibility for Authorized Users: Ensuring the website is accessible to users with
disabilities and meets web accessibility standards is important for inclusivity.
Fast Loading Times: A website should load quickly to reduce user frustration and
abandonment.
Usable Forms: User-friendly forms, such as easy-to-fill contact or sign-up forms,
improve interaction rates.
Mobile-enabled: With the increasing use of mobile devices, a responsive website is
essential for reaching a wider audience.
Naïve Bayes:
Naïve Bayes is a supervised machine learning technique based on probability theory. It
predicts the probability of an instance belonging to a specific target class, given prior
probabilities and predictors. Despite its simplicity, Naïve Bayes is powerful for many real-
world applications.
Naïve Bayes applies Bayes' theorem with the 'naïve' assumption that the predictors are
independent of one another:
P(class | predictors) = [ P(predictors | class) × P(class) ] / P(predictors)
Advantages:
1. Efficiency:
o Performs well when the assumption of independent predictors holds.
o Requires minimal training data for estimating test data, leading to short
training periods.
2. Simplicity:
o Easy to implement and computationally efficient.
Disadvantages:
1. Assumption of Independence:
o Naïve Bayes assumes that all predictors are independent. In reality, this is
often not true, which may limit the model’s accuracy.
2. Zero Frequency Issue:
o If a category in the test dataset does not appear in the training dataset, the
model assigns a probability of zero, making it unable to make predictions.
Solution: Apply smoothing techniques like Laplace Estimation to handle
such cases.
Practical Use:
Applications:
o Spam detection, sentiment analysis, text classification, and more.
Despite its simplicity and assumptions, Naïve Bayes often delivers robust results,
particularly for text-based applications.
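To make the idea concrete, the standalone Java sketch below classifies a tiny "document" into spam or ham using word counts, a class prior, and Laplace (add-one) smoothing; the training data, class labels, and vocabulary are invented purely for illustration.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Tiny word-based Naive Bayes: P(class | words) is proportional to
// P(class) * product of P(word | class), with Laplace (add-one) smoothing
// so unseen words never force a zero probability.
public class TinyNaiveBayes {
    public static void main(String[] args) {
        Map<String, List<String[]>> training = new HashMap<>();
        training.put("spam", Arrays.asList(
                new String[]{"win", "money", "now"},
                new String[]{"cheap", "money", "offer"}));
        training.put("ham", Arrays.asList(
                new String[]{"meeting", "schedule", "now"},
                new String[]{"project", "report", "review"}));

        String[] testDoc = {"money", "offer", "now"};

        // Vocabulary size is needed for Laplace smoothing.
        Set<String> vocab = new HashSet<>();
        training.values().forEach(docs -> docs.forEach(d -> vocab.addAll(Arrays.asList(d))));

        int totalDocs = training.values().stream().mapToInt(List::size).sum();
        String best = null;
        double bestLogProb = Double.NEGATIVE_INFINITY;

        for (Map.Entry<String, List<String[]>> e : training.entrySet()) {
            // Word counts and total words for this class.
            Map<String, Integer> counts = new HashMap<>();
            int totalWords = 0;
            for (String[] doc : e.getValue()) {
                for (String w : doc) {
                    counts.merge(w, 1, Integer::sum);
                    totalWords++;
                }
            }
            // Log-probabilities avoid numeric underflow for long documents.
            double logProb = Math.log((double) e.getValue().size() / totalDocs); // prior
            for (String w : testDoc) {
                double p = (counts.getOrDefault(w, 0) + 1.0) / (totalWords + vocab.size());
                logProb += Math.log(p);
            }
            if (logProb > bestLogProb) {
                bestLogProb = logProb;
                best = e.getKey();
            }
        }
        System.out.println("Predicted class: " + best);
    }
}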
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
both classification and regression problems, but it is predominantly applied in
classification tasks.
The algorithm represents data points in an n-dimensional space (where n is the
number of features) and identifies the optimal hyperplane to distinguish between
different classes.
How SVM Works:
1. Data Representation:
o Each data point is plotted as a point in n-dimensional space with its
coordinates corresponding to the feature values.
2. Hyperplane Identification:
o Classification is performed by finding the hyperplane that best separates the
two classes.
3. Margin Maximization:
o The "margin" is the distance between the hyperplane and the nearest data point
from each class.
o The optimal hyperplane maximizes this margin, ensuring better
generalization.
Advantages of SVM:
1. High Dimensional Feature Space:
o SVM performs well even when the number of features exceeds the number of
instances (e.g., spam filtering with numerous features).
2. Nonlinear Decision Boundaries:
o SVM can handle nonlinear decision boundaries by transforming the input data
into higher dimensions (using kernel tricks) where the classifier can be
represented as a linear function.
3. Ease of Understanding:
o Conceptually simple, offering an intuitive linear classification model.
4. Efficiency:
o Focuses only on a subset of relevant data points (support vectors), making it
computationally efficient.
5. Wide Availability:
o Supported by most modern data analytics and machine learning toolsets.
Disadvantages of SVM:
1. Numeric Input Requirement:
o SVM requires all data points in all dimensions to be numeric, limiting its
application to non-numeric datasets without preprocessing.
2. Binary Classification Limitation:
o Primarily designed for binary classification. Multi-class problems require
techniques like cascading multiple SVMs.
3. Computational Complexity:
o Training SVMs can be inefficient and time-consuming, especially for large
datasets.
4. Noise Sensitivity:
o SVM struggles with noisy data, requiring the computation of soft margins to
accommodate misclassifications.
5. Lack of Probability Estimates:
o SVM does not directly produce probability estimates for its predictions; when
probabilities are needed, they must be obtained indirectly through additional calibration.
Practical Applications:
Spam Detection: Identifying spam emails using high-dimensional feature spaces.
Image Recognition: Classifying images into predefined categories.
Text Categorization: Sorting documents into topics or themes.
Bioinformatics: Predicting protein structure or gene expression profiles.
PageRank
PageRank is a link analysis algorithm developed by Larry Page and Sergey Brin, used to rank
web pages based on their importance in a web graph. It operates under the principle that more
important pages are likely to receive more links from other pages.
How PageRank Works:
1. Each page is assigned an initial rank (usually 1/N, where N is the total number of
pages).
2. The rank of a page is calculated iteratively based on the ranks of other pages linking
to it.
PageRank Formula:
PR(p) = (1 − d)/N + d × Σ over q in L(p) of [ PR(q) / C(q) ]
Where:
d: Damping factor (typically set to 0.85). It accounts for the probability of randomly
following links versus jumping to a random page.
N: Total number of pages.
L(p): Set of pages linking to p.
C(q): Number of outbound links on page q.
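An illustrative standalone Java sketch of the iterative computation is shown below; the four-page graph, damping factor, and iteration count are arbitrary choices made for the example.

import java.util.Arrays;

// Iterative PageRank over a small directed graph given as adjacency lists.
// graph[i] holds the pages that page i links to (its outbound links).
public class PageRankDemo {
    public static void main(String[] args) {
        int[][] graph = {
                {1, 2},   // page 0 links to pages 1 and 2
                {2},      // page 1 links to page 2
                {0},      // page 2 links back to page 0
                {0, 2}    // page 3 links to pages 0 and 2
        };
        int n = graph.length;
        double d = 0.85;                       // damping factor
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);            // initial rank 1/N for every page

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);    // random-jump contribution
            for (int q = 0; q < n; q++) {
                int outDegree = graph[q].length;        // C(q)
                for (int p : graph[q]) {
                    next[p] += d * rank[q] / outDegree; // share of q's rank given to p
                }
            }
            rank = next;
        }
        for (int p = 0; p < n; p++) {
            System.out.printf("PR(page %d) = %.4f%n", p, rank[p]);
        }
    }
}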
Structure of the Web:
The web can be represented as a directed graph, where:
Nodes: Represent web pages.
Edges: Represent hyperlinks between pages.
Characteristics:
1. Bow-Tie Structure:
o The web graph often has a bow-tie structure with:
A strongly connected core of pages that are mutually reachable.
In-links: Pages that link to the core but are not reachable from it.
Out-links: Pages reachable from the core but do not link back.
Disconnected components: Isolated pages or groups of pages.
2. Small-World Phenomenon:
o Most pages are reachable from any other page within a small number of clicks
(high clustering coefficient).
3. Power-Law Distribution:
o The number of links per page follows a power-law distribution, meaning a
small number of pages have a high number of links.
Analyzing a Web Graph
Steps:
1. Representation:
o Use an adjacency matrix or adjacency list to represent the graph.
2. Rank Calculation:
o Apply the PageRank algorithm iteratively.
3. Properties:
o Indegree: Number of links pointing to a page.
o Outdegree: Number of links going out from a page.
o Connectivity: Identify strongly connected components.
4. Traversal:
o Use BFS/DFS to explore the graph for reachability and link structure.
Social Network as a Graph and Social Network Analytics
A social network can be represented as a graph, where nodes (also called vertices)
signify entities such as individuals, groups, or organizations, and edges represent
the relationships or interactions between these entities. For instance, in a social
media platform, users can be represented as nodes, and friendships or follows can
be represented as edges. These edges may be undirected (e.g., mutual friendships)
or directed (e.g., one-way follows). Analyzing social networks using graph theory
enables us to understand connectivity, influence, and the overall structure of the
network.
Social network analytics refers to the study and interpretation of social networks
using computational and mathematical methods. This includes identifying key
individuals, understanding community structures, and uncovering patterns of
interaction. Tools like clustering, similarity measures, and community detection
are commonly used to reveal insights such as the most influential nodes, densely
connected communities, and trends within the network.
Clustering involves grouping nodes in a network such that nodes within the same
cluster are more interconnected than those in different clusters. Clustering helps
identify communities or subgroups within the network. Popular clustering methods
include modularity-based clustering, spectral clustering, and hierarchical
clustering. Modularity-based clustering measures the quality of clustering by
evaluating the density of links inside clusters compared to those between clusters.
This technique is widely used for discovering tightly-knit groups in large social
networks.
SimRank
SimRank is a link-based similarity measure: two nodes are considered similar if they are
linked to (referenced by) nodes that are themselves similar. It is computed iteratively over the
graph structure and is commonly used for recommendations, such as suggesting friends or
content.
Counting Triangles and Graph Matching
Triangles in a social network graph are formed when three nodes are mutually
connected. Counting triangles is essential for analyzing the local clustering
coefficient, which indicates the likelihood of two neighbors of a node being
connected. This measure reflects the level of interconnectedness and transitivity in
the network. Graph matching, on the other hand, involves finding specific
subgraph patterns within a larger graph. This is useful in identifying motifs or
recurring structures in the network, such as organizational hierarchies or friend
groups.
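Counting triangles can be sketched with a brute-force enumeration of node triples, which is adequate for small graphs (large graphs need smarter methods, for example MapReduce-based counting). The Java example below uses an invented four-node friendship graph.

// Counts triangles in a small undirected social graph stored as an adjacency matrix.
// A triangle is a set of three nodes that are all mutually connected.
public class TriangleCount {
    public static void main(String[] args) {
        boolean[][] adj = new boolean[4][4];
        int[][] edges = {{0, 1}, {1, 2}, {0, 2}, {2, 3}}; // mutual friendships
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }

        int triangles = 0;
        int n = adj.length;
        // Enumerate node triples i < j < k so each triangle is counted exactly once.
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                for (int k = j + 1; k < n; k++)
                    if (adj[i][j] && adj[j][k] && adj[i][k])
                        triangles++;

        System.out.println("Triangles: " + triangles); // prints 1 for this graph
    }
}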
Community detection aims to identify groups of nodes in a social network that are densely
connected internally but have sparse connections with nodes in other groups. Techniques
such as the Girvan-Newman algorithm, which removes edges with the highest betweenness
centrality, and the Louvain method, which maximizes modularity, are popular for discovering
communities. These methods help in understanding the underlying social dynamics and
targeting specific groups for marketing or information dissemination.
The insights gained from analyzing social networks are applied in various domains.
Clustering helps in detecting communities for marketing campaigns, SimRank is used for
friend and content recommendations, triangle counting measures social cohesion, and
community detection identifies influential subgroups for spreading awareness or
advertisements. Overall, social network analytics enables businesses, researchers, and
organizations to harness the power of networked data effectively.
Clique
It is a subset of vertices within a graph such that every two distinct vertices in the clique are
adjacent; in other words, it forms a complete subgraph. This means that every member of the
clique is directly connected to every other member. Cliques are significant in various fields,
including social network analysis, where they can represent groups of individuals who all
know each other.
------------------------------------END OF MODULE 5--------------------------------------------------
Question Bank
Module 2:
Introduction to Hadoop, Hadoop Distributed File System Basics, Essential Hadoop
Tools
1. What is Hadoop? Explain the core components of Hadoop.
2. Explain Hadoop Ecosystem with a neat Diagram
3. What are the features of Hadoop?
4. Explain Hadoop Physical Organisation
5. Explain Hadoop MapReduce Framework and Programming Model
6. Brief about YARN-Based Execution Model
Module 4
1. Explain Map Reduce Map tasks with the Map reduce programming model
2. Discuss, how to compose Map-reduce for calculations
3. Illustrate different Relational algebraic operations in Map reduce
4. Discuss HIVE
i) Features
ii) Architecture
iii) Installation Process
5. Compare HIVE and RDBMS
6. Explain HIVE Datatypes and file format
7. Discuss Hive Data Model with data flow sequences
8. Explain Hive Built in functions
9. Define HiveQL. Write a program to create, show, drop and query operations taking a
database for toy company
10. Explain Table partitioning, bucketing, views, join and aggregation in Hive QL
11. Explain PIG architecture with applications and features.
12. Give the differences between
i) Pig and Map reduce
ii) Pig and SQL
13. Explain Pig Latin Data Model with pig installation steps
14. Explain Pig Relational operations
Module 5
Machine Learning Algorithms for Big Data Analytics, Text, Web Content, Link and
Social Network Analytics
1. Explain the following
i) Text mining with the text analytics process pipeline
ii) Text mining process and phases
iii) Text mining challenges
2. Discuss the following
i) Naïve Bayes analysis
ii) Support vector machines
iii) Binary classification
3. Discuss
i) Web Mining
ii) Web content
iii) Web usage Analytics
4. Explain
i) Page rank
ii) Structure of Web and Analysing a Web graph authorities
5. What are Hubs and Authorities?
6. Explain Social Network as Graph and Social network analytics
7. Discuss
i) Clustering in social networks
ii) Sim rank
iii) Counting triangles and graph matches
iv) Direct discovery of communities
8. Discuss Analysis of Variances (ANOVA) and correlation indicators of linear relationship
9. Describe how regression analysis predicts the value of the dependent variable in the case of
linear regression
10. In Machine Learning, Explain Linear and Non-Linear Relationships with Graphs
11. Explain Multiple Regression and give examples of its use in forecasting and optimisation
12. Explain with neat diagram K-means clustering.
13. Explain Naïve Bayes Theorem with example.
14. Explain the Apriori Algorithm and how it generates and evaluates candidate itemsets