Big Data Analytics Compiled Notes
We are thrilled to present you with a complete set of notes for Big Data Analytics,
meticulously covering all the essential topics across five comprehensive modules. Whether
you're diving into the vast world of Big Data or preparing for your exams, these notes are
designed to serve as your ultimate resource for mastering this dynamic and evolving subject.
Here’s what you can look forward to in each module:
Module 1: Introduction to Big Data Analytics
Embark on a journey to understand the fundamentals of Big Data. This module introduces the
core concepts, scalability challenges, and parallel processing. It provides insights into
designing data architecture, understanding data sources and quality, and the critical steps in
pre-processing and storing data. Learn about Big Data storage, analysis techniques, and
explore real-world applications and case studies.
Module 2: Introduction to Hadoop
Delve into the Hadoop ecosystem and Distributed File System (HDFS). Understand its design
features, user commands, and the MapReduce framework, along with the Yarn architecture.
This module also introduces essential Hadoop tools like Apache Pig, Hive, Sqoop, Flume,
Oozie, and HBase, equipping you with practical knowledge of handling Big Data efficiently.
Module 3: NoSQL Big Data Management
Explore the world of NoSQL databases tailored for Big Data management. Learn about
NoSQL architecture patterns, shared-nothing architecture for handling tasks, and leveraging
MongoDB and Cassandra for managing vast amounts of data.
Module 4: MapReduce and HiveQL
Uncover the power of MapReduce tasks for Big Data computations and algorithms. This
module covers the basics of MapReduce execution, composing complex calculations, and the
scripting capabilities of HiveQL and Pig for managing and analysing data.
Module 5: Machine Learning and Analytics
Discover the intersection of Big Data and machine learning. This module dives into
algorithms for regression analysis, finding similarities, frequent itemsets, and association
rule mining. You’ll also explore advanced topics like text mining, web content analytics,
PageRank, and social network analytics, offering a complete perspective on data-driven
decision-making.
These notes are carefully curated to not only help you excel in your exams but also provide
valuable insights that will serve as a solid foundation for your career in Big Data Analytics.
From theoretical knowledge to practical applications, you’ll gain a holistic understanding of
the subject. We hope this learning journey empowers you to explore the vast possibilities in
Big Data. Let’s unlock the potential of analytics together!
MODULE 1
INTRODUCTION TO BIG DATA ANALYTICS
Data
Data has multiple definitions and can be used in both singular and plural forms:
1. "Data is information, usually in the form of facts or statistics that one can analyze or
use for further calculations."
2. "Data is information that can be stored and used by a computer program."
3. "Data is information presented in numbers, letters, or other forms."
4. "Data is information from a series of observations, measurements, or facts."
5. "Data is information from a series of behavioral observations, measurements, or
facts."
Web Data
Web data refers to the information available on web servers, including text, images, videos,
audio, and other multimedia content accessible to web users. A user (client software) interacts
with this data in various ways:
Pull: Clients retrieve data by sending requests to the server.
Push/Post: Servers can also publish or push data, or users can post data after
subscribing to services.
Examples of Internet Applications:
Websites, web services, and web portals
Online business applications
Emails, chats, tweets
Social networks
Classification of Data:
Structured, Semi-Structured, Multi-Structured and Unstructured
Data can be broadly classified into the following categories:
1. Structured Data
Structured data conforms to predefined data schemas and models, such as relational tables
with rows and columns. Around 15-20% of data is either structured or semi-structured.
Characteristics of Structured Data:
Supports CRUD operations: Enables creating (inserting), reading, updating, and
deleting data records.
Indexing: Facilitates faster data retrieval through indexing.
2. Transactional Data
Examples: Credit card transactions, flight bookings, public agency records (e.g.,
medical records, insurance data)
Description: Data generated through business transactions and operational processes.
This includes financial transactions, service bookings, and records from public
agencies.
3. Customer Master Data
Examples: Facial recognition data, personal information (e.g., name, date of birth,
marriage anniversary, gender, location, income category)
Description: Data related to customer identity and demographics, often used for
personalized services, marketing, and authentication (e.g., facial recognition).
4. Machine-Generated Data
Examples: Internet of Things (IoT) data, sensors, trackers, web logs, computer
system logs
Description: Data generated from machines, devices, and sensors in an automated
manner. This type includes data from sensors in IoT devices, system logs from
servers, and data from machine-to-machine communication.
5. Human-Generated Data
Examples: Biometrics, human-machine interaction data, email records, student
grades (stored in databases like MySQL)
Description: Data that is generated by human interaction with machines. This
includes biometric data (e.g., fingerprints, facial scans), emails, and data stored in
databases for academic or business purposes.
Big Data Classification
Big Data can be classified based on various criteria such as data sources, formats, storage
structures, processing rates, and analysis methods. This classification helps understand how
Big Data is sourced, stored, processed, and analyzed.
1. Data Sources (Traditional)
Traditional data sources include:
Records, Relational Database Management Systems (RDBMS): Structured data
storage in tables.
Distributed Databases: Data spread across multiple systems for redundancy and
performance.
In-memory Data Tables: Data stored directly in memory for fast processing.
Data Warehouse: Centralized repositories for structured data.
Servers: Data generated from machine interactions and operations.
Business Process (BP) Data and Business Intelligence (BI) Data: Business
operation records and data for decision-making purposes.
Human-Sourced Data: Data generated by human activities, such as emails, social
media, and transactions.
2. Data Formats (Traditional)
Structured and Semi-Structured Data: Data stored in predefined formats, like
tables, XML, or JSON, making it easier to retrieve and analyze.
3. Big Data Sources
Big Data is sourced from a variety of places:
Data Storage Systems: Distributed file systems, Operational Data Stores (ODS), data
marts, data warehouses, and NoSQL databases (e.g., MongoDB, Cassandra).
Sensor Data: Data from IoT devices, monitoring systems, and sensors.
External Data Sources: Web data, social media activity, weather data, and health
records.
Audit Trails: Logs from financial transactions and other system operations.
4. Big Data Formats
Big Data comes in various formats:
Unstructured, Semi-Structured, and Multi-Structured Data: Data without a
predefined schema, such as images, videos, and text, along with semi-structured
formats like XML and JSON.
Data Stores Structure: Includes row-oriented data (used for OLTP systems),
column-oriented data (used for OLAP systems), graph databases, and hashed
key/value pairs.
5. Processing Data Rates
Big Data processing can happen at different speeds:
Batch Processing: Large volumes of data are processed in chunks (e.g., using
MapReduce).
Near-Time Processing: Data is processed almost immediately after it's received.
Real-Time and Streaming Processing: Data is processed as it arrives (e.g., using
Spark Streaming).
6. Big Data Processing Methods
Batch Processing: Includes tools like MapReduce, Hive, and Pig for processing large
data sets over time.
Real-Time Processing: Uses tools like Spark Streaming, Apache Drill, and Spark SQL
for immediate data analysis and decision-making.
Scalability
Capacity Increase: Scalability allows a system to grow or shrink in capacity as data
and processing demands change.
o Vertical Scalability: Increases a single system's resources to improve its
processing power and efficiency.
Scaling software to run on larger machines with more resources can enhance
performance, but the efficiency of the algorithm plays a significant role.
o Simply adding more CPUs or memory without optimizing the software's
ability to leverage these resources won't provide substantial performance
gains.
o Algorithm Design: Properly designed algorithms exploit additional resources
like extra CPUs and memory, enabling efficient use of parallel computing
environments.
Cloud Computing
Cloud computing is an internet-based service that allows on-demand access to shared
resources and data. It provides flexible and scalable computing power, data storage, and
services without requiring users to invest in their own physical infrastructure.
Features of Cloud Computing:
1. On-Demand Service: Users can access computing resources (such as storage,
processing power, or software) whenever needed without human interaction with
service providers.
2. Resource Pooling: Cloud providers use multi-tenant models to pool resources,
dynamically allocating them to meet the demands of multiple customers.
3. Scalability: Cloud resources can be scaled up or down based on demand, ensuring
flexibility.
4. Broad Network Access: Cloud services are accessible via the internet, meaning users
can access their resources from any location, using various devices.
5. Accountability: Cloud providers ensure transparent usage metrics and billing, giving
users clear insight into their resource consumption.
Types of Cloud Computing Services:
1. Infrastructure as a Service (IaaS):
o Provides access to computing resources such as virtual machines, storage, and
network infrastructure.
o Users can rent infrastructure on a pay-as-you-go basis.
o Examples:
Amazon EC2: Virtual server space for scalable computing power.
Tata CloudStack: Open-source software for managing virtual
machines, offering public cloud services.
2. Platform as a Service (PaaS):
o Provides a platform allowing developers to build, deploy, and manage
applications without worrying about the underlying infrastructure.
o Examples:
Microsoft Azure HDInsight: Offers cloud-based Hadoop services.
IBM BigInsights and Oracle Big Data Cloud Service: Provide big
data platforms for analytics and application development.
3. Software as a Service (SaaS):
o Delivers software applications over the internet. Users access software without
installing or maintaining it on their own computers.
o Examples:
GoogleSQL, IBM BigSQL, HPE Vertica: Cloud-based SQL services.
Microsoft Polybase and Oracle Big Data SQL: Cloud solutions for
data analytics and querying large datasets.
Cloud Computing in Big Data Processing
Cloud computing is a powerful environment for handling Big Data as it allows for both
parallel and distributed computing across multiple nodes. Big data solutions leverage the
cloud for:
Data Storage: Cloud platforms such as Amazon S3 provide scalable storage for large
datasets.
Data Processing: Cloud-based services like Microsoft Azure, Apache CloudStack,
and AWS EC2 facilitate the parallel processing of large-scale datasets.
Grid Computing:
Grid computing is a form of distributed computing where computers located in different
locations are interconnected to work together on a common task. It allows the sharing of
resources across various organizations or individuals for achieving large-scale tasks,
particularly data-intensive ones.
Features of Grid Computing:
1. Distributed Network: Grid computing involves a network of computers from
multiple locations, each contributing resources for a common goal.
2. Large-Scale Resource Sharing: It enables the flexible, coordinated, and secure
sharing of resources among users, such as individuals and organizations.
3. Data-Intensive Tasks: Grid computing is particularly well-suited for handling large
datasets that can be distributed across grid nodes.
4. Scalability: Grid computing can scale efficiently by adding more nodes to
accommodate growing data or processing needs.
5. Single-Task Focus: At any given time, a grid typically dedicates its resources to a
single application or task.
Drawbacks of Grid Computing:
1. Single Point of Failure: If one node underperforms or fails, it can disrupt the entire
grid, affecting overall performance.
2. Variable Performance: The performance and storage capacity of the grid can
fluctuate depending on the number of users, instances, and data transferred.
3. Resource Management Complexity: As resources are shared among many users,
managing and coordinating them can be challenging, especially with large volumes of
data.
Cluster Computing:
Cluster computing refers to a group of computers connected by a local network that work
together to accomplish the same task. Unlike grid computing, clusters are typically located in
close proximity and used primarily for load balancing and high availability.
Key Features of Cluster Computing:
1. Local Network: The computers in a cluster are interconnected locally and operate as
a single system.
2. Load Balancing: Clusters distribute processes among nodes to ensure that no single
computer is overloaded. This allows for better resource utilization and higher
availability.
3. Fault Tolerance: Clusters often provide redundancy, where if one node fails, others
can take over, minimizing the risk of downtime.
4. Application: Cluster computing is commonly used in high-performance computing
(HPC), scientific simulations, and business analytics.
5. Hadoop Integration: The Hadoop architecture follows cluster computing principles
by distributing tasks across many nodes for large-scale data processing.
Volunteer Computing:
Volunteer computing is a type of distributed computing that uses the resources of volunteers
(organizations or individuals) to contribute to projects requiring computational power.
Key Features of Volunteer Computing:
1. Volunteer Resources: Volunteers donate the computing power of their personal
devices (computers, smartphones, etc.) to help process data or run simulations for
large-scale projects.
2. Distributed Network: Similar to grid computing, volunteer computing relies on a
network of geographically distributed devices.
3. Popular in Academia: Volunteer computing is often used for science-related projects
by universities or research institutions.
Examples of Volunteer Computing Projects:
SETI@home: A project that uses idle resources from volunteers to analyze radio
signals for extraterrestrial life.
Designing Data Architecture
Data Architecture Design involves organizing how Big Data is stored, accessed, and
managed in a Big Data or IT environment. It creates a structure that allows the flow of
information, security management, and utilization of core components in an efficient manner.
Big Data architecture follows a systematic approach, especially when broken down into
logical layers, each serving a specific function. These layers make it easier to design, process,
and implement data architecture.
Big Data Architecture Layers:
The architecture is broken down into five main layers, each representing a set of core
functions essential for handling Big Data:
1. Identification of Data Sources (L1):
o Purpose: Identify the sources of data, which could be both internal
(organization databases, ERP systems) and external (social media, IoT
devices, APIs).
o Key Task: Determine the relevant data sources to be ingested into the system.
2. Acquisition, Ingestion, and Pre-Processing of Data (L2):
o Purpose: Data ingestion is the process of importing and absorbing data into
the system for further use. This data may be ingested in batches or real-time.
o Key Task: Perform initial data transformation, cleaning, and standardization
to ensure data readiness for storage and processing.
3. Data Storage (L3):
o Purpose: Store data in a variety of storage environments, such as files,
databases, clusters, or cloud systems. This layer holds structured, semi-
structured, and unstructured data for future processing.
o Key Task: Choose appropriate storage systems based on scalability and
reliability (e.g., Hadoop Distributed File System (HDFS), cloud storage like
AWS S3, or distributed storage).
4. Data Processing (L4):
o Purpose: This layer focuses on processing the data using frameworks and
tools like MapReduce, Apache Hive, Apache Pig, and Apache Spark.
o Key Task: Implement large-scale distributed data processing to analyze and
extract meaningful insights from vast datasets.
5. Data Consumption (L5):
o Purpose: After data is processed, this layer delivers the insights and results to
end users through analytics, visualization, and reporting tools.
o Key Task: Use analytics for various applications such as business
intelligence, AI/ML models, predictive analytics, pattern recognition, and
data visualization tools.
11. Machine Learning Integration: Utilizing machine learning techniques for predictive
analytics and data insights.
Data Sources in a Big Data Environment:
Data Storage Solutions:
Traditional data warehouses and modern NoSQL databases (e.g., Oracle Big
Data, MongoDB, Cassandra).
Sensor Data: Data generated from various sensors, which can include IoT devices.
Audit Trails: Records of financial transactions and other business processes.
External Data Sources: Information from web platforms, social media, weather data,
and health records.
Big Data Analytics Applications and Case Studies:
1. Big Data in Marketing and Sales:
o Marketing revolves around delivering value to customers. Big Data plays a
vital role in customer value analytics (CVA), allowing companies like Amazon
to enhance customer experiences. It helps businesses understand customer
needs and perceptions, leading to effective strategies for improving customer
relationships and lifetime value (CLTV).
o Big Data in marketing also aids in lowering customer acquisition cost (CAC)
and enhancing contextual marketing by targeting potential customers based on
browsing patterns.
2. Big Data Analytics in Fraud Detection:
o Fraud detection is critical to avoiding financial losses. Examples of fraud
include sharing customer data with third parties or falsifying company
information. Big Data analytics help detect and prevent fraud by integrating
data from multiple sources such as social media, emails, and websites,
allowing faster detection of threats and preventing potential frauds.
3. Big Data Risks:
o While Big Data offers insights, it also introduces risks. Erroneous or
inaccurate data can lead to faulty analytics, requiring companies to implement
strong risk management strategies to ensure accurate predictions and reliable
data usage.
4. Big Data in Credit Risk Management:
o Financial institutions use Big Data to manage credit risks by analyzing loan
defaults, timely return of interests, and the creditworthiness of borrowers. Big
Data provides insights into industries with higher risks, individuals with poor
credit ratings, and liquidity issues, helping financial institutions make
informed lending decisions.
MODULE 2
Introduction to Hadoop (T1)
Introduction to Hadoop
Hadoop is an Apache open-source framework written in Java that enables the distributed processing
of large datasets across clusters of computers using simple programming models. It allows
applications to work in an environment that supports distributed storage and computation. Hadoop is
scalable, meaning it can grow from a single server to thousands of machines, each providing local
computation and storage. It is designed to handle Big Data and enable efficient processing of massive
datasets.
Big Data Store Model
The Big Data store model in Hadoop is based on a distributed file system. Data is stored in blocks,
which are physical divisions of data spread across multiple nodes. The architecture is organized in
clusters and racks:
Data Nodes: Store data in blocks.
Racks: A collection of data nodes, scalable across clusters.
Clusters: Racks are grouped into clusters to form the overall storage and processing system.
Hadoop ensures reliability by replicating data blocks across nodes. If a data link or node fails, the
system can still access the replicated data from other nodes.
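HDFS exposes this block-and-replication model through a Java API. The sketch below is a minimal illustration (not from the source notes) of how a client might copy a file into HDFS and request a replication factor of 3; the file paths are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStoreExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS splits it into blocks and
        // replicates each block across DataNodes
        Path local = new Path("/tmp/exam_results.csv");      // illustrative path
        Path remote = new Path("/data/exam_results.csv");    // illustrative path
        fs.copyFromLocalFile(local, remote);

        // Request a replication factor of 3 for the stored file
        fs.setReplication(remote, (short) 3);

        System.out.println("Block size used: " + fs.getFileStatus(remote).getBlockSize());
        fs.close();
    }
}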
Big Data Programming Model
In Hadoop's Big Data programming model, jobs and tasks are scheduled to run on the same servers
where the data is stored, minimizing data transfer time. This programming model is enabled by
MapReduce, a powerful tool that divides processing tasks into smaller subtasks that can be executed
in parallel across the cluster.
Example of Jobs in Hadoop
Query Processing: A job that processes queries on datasets and returns results to an
application.
Sorting Data: Sorting performance data from an examination or another large dataset.
Hadoop and Its Ecosystem
The Hadoop framework was developed as part of an Apache project for Big Data storage and
processing, initiated by Doug Cutting and Mike Cafarella. The name Hadoop came from Cutting’s
son, who named his stuffed toy elephant "Hadoop."
Hadoop has two main components:
1. Hadoop Distributed File System (HDFS): A system for storing data in blocks across
clusters.
2. MapReduce: A computational framework that processes data in parallel across the clusters.
Hadoop is written primarily in Java, with some native code in C, and the utilities are managed using
shell scripts. The framework operates on cloud-based infrastructure, making it a cost-effective
solution for managing and processing terabytes of data in minutes.
Characteristics of Hadoop
Hadoop offers several key advantages for managing Big Data:
Scalable: Easily scales from a few machines to thousands.
Self-manageable: Requires minimal manual intervention for management.
Self-healing: Automatically manages node failures by replicating data.
Distributed File System: Ensures reliable storage and quick access to large datasets.
Hadoop Core Components
The Apache Hadoop framework is made up of several core components, which work together to store
and process large datasets in a distributed computing environment. The core components of Hadoop
are as follows:
1. Hadoop Common:
o Description: This is the foundational module that contains the libraries and utilities
required by other Hadoop components. It provides various common services like file
system and input/output operations, serialization, and Remote Procedure Calls
(RPCs).
o Features:
Common utilities shared across the Hadoop modules.
File-based data structures.
Essential interfaces for interacting with the distributed file system.
2. Hadoop Distributed File System (HDFS):
o Description: HDFS is a Java-based distributed file system designed to run on
commodity hardware. It allows Hadoop to store large datasets by distributing data
blocks across multiple machines (nodes) in the cluster.
o Features:
Data is stored in blocks and replicated for fault tolerance.
Highly scalable and reliable.
Optimized for batch processing and provides high throughput for data access.
3. MapReduce v1:
o Description: MapReduce v1 is a programming model that allows for the processing
of large datasets in parallel across multiple nodes. The model divides a job into
smaller sub-tasks, which are then executed across the cluster.
o Features:
Jobs are divided into Map tasks and Reduce tasks.
Suitable for batch processing large sets of data.
o Hadoop processes Big Data characterized by the 3Vs: Volume, Variety, and
Velocity.
4. Distributed Cluster Computing with Data Locality:
o Hadoop optimizes processing by running tasks on the same nodes where the data is
stored, enhancing efficiency.
o High-speed processing is achieved by distributing tasks across multiple nodes in a
cluster.
5. Fault Tolerance:
o Hadoop automatically handles hardware failures. If a node fails, the system recovers
by using data replicated across other nodes.
6. Open-Source Framework:
o Hadoop is open-source, making it cost-effective for handling large data workloads. It
can run on inexpensive hardware and cloud infrastructure.
7. Java and Linux Based:
o Hadoop is built in Java and runs primarily on Linux. It also includes its own set of
shell commands for easy management.
Hadoop Ecosystem Components
Hadoop's ecosystem consists of multiple layers, each responsible for different aspects of storage,
resource management, processing, and application support. The key components are:
SlaveNodes:
SlaveNodes (or DataNodes and Task Trackers) store actual data blocks and execute
computational tasks. Each node has a significant amount of disk space and is responsible for
both data storage and processing.
o DataNodes handle the storage and management of data blocks.
o TaskTrackers execute the processing tasks sent by the MasterNode and return the
results.
Physical Distribution of Nodes:
A typical Hadoop cluster consists of many DataNodes that store data, while MasterNodes
handle administrative tasks. In a large cluster, multiple MasterNodes are used to balance the
load and ensure redundancy.
Client-Server Interaction:
Clients interact with the Hadoop system by submitting queries or applications through various
Hadoop ecosystem projects, such as Hive, Pig, or Mahout.
The MasterNode coordinates with the DataNodes to store data and process tasks. For
example, it organizes how files are distributed across the cluster, assigns jobs to the nodes,
and monitors the health of the system.
MapReduce Job Execution Steps:
1. Job Submission:
o A client submits a request to the JobTracker, which estimates the required resources
and prepares the cluster for execution.
2. Task Assignment:
o The JobTracker assigns Map tasks to nodes that store the relevant data. This is
called data locality, which reduces network overhead.
3. Monitoring:
o The progress of each task is monitored, and if any task fails, it is restarted on a
different node with available resources.
4. Final Output:
o After the Map and Reduce jobs are completed, the results are serialized and
transferred back to the client, typically using formats like AVRO.
o Containers run the actual tasks of the application in parallel, distributed across
multiple nodes.
o During job execution, the NodeManager (NM) monitors resource utilization and ensures the tasks are completed successfully. If there are any failures, the ResourceManager (RM) may reassign tasks to available containers.
Hadoop Ecosystem Tools
1. Zookeeper:
Zookeeper is a centralized coordination service for distributed applications. It provides a reliable,
efficient way to manage configuration, synchronization, and name services across distributed systems.
Zookeeper maintains data in a hierarchy of nodes called znodes, ensuring that distributed systems function
cohesively. Its main coordination services include:
Name Service: Similar to DNS, it maps names to information, tracking servers or services
and checking their statuses.
Concurrency Control: Manages concurrent access to shared resources, preventing
inconsistencies and ensuring that distributed processes run smoothly.
Configuration Management: A centralized configuration manager that updates nodes with
the current system configuration when they join the system.
Failure Management: Automatically recovers from node failures by selecting alternative
nodes to take over processing tasks.
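As an illustration of these coordination services, the sketch below uses ZooKeeper's Java client to publish and read a shared configuration value. It is a minimal example under assumed settings (a ZooKeeper server on localhost:2181 and an illustrative /app-config znode), not taken from the source notes.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (assumed to run on localhost:2181)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // Publish a configuration value as a persistent znode
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "batch.size=128".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node joining the system can read the current configuration
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println("Current config: " + new String(data));

        zk.close();
    }
}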
2. Oozie:
Apache Oozie is a workflow scheduler for Hadoop that manages and coordinates complex jobs and
tasks in big data processing. Oozie allows you to create, schedule, and manage multiple workflows. It
organizes jobs into Directed Acyclic Graphs (DAGs) and supports:
Integration of Multiple Jobs: Oozie integrates MapReduce, Hive, Pig, and Sqoop jobs in a
sequential workflow.
Time and Data Triggers: Automatically runs workflows based on time or specific data
availability.
Batch Management: Manages the timely execution of thousands of jobs in a Hadoop cluster.
Oozie is efficient for automating and scheduling repetitive jobs, simplifying the management of
multiple workflows.
3. Sqoop:
Apache Sqoop is a tool used for efficiently importing and exporting large amounts of data between
Hadoop and relational databases. It uses the MapReduce framework to parallelize data transfer
tasks. The workflow of Sqoop includes:
Command-Line Parsing: Sqoop processes the arguments passed through the command line
and prepares map tasks.
Data Import and Export: Data from external databases is distributed across multiple
mappers. Each mapper connects to the database using JDBC to fetch and import the data into
Hadoop, HDFS, Hive, or HBase.
Parallel Processing: Sqoop leverages Hadoop's parallel processing to transfer data quickly
and efficiently. It also provides fault tolerance and schema definition for data import.
Sqoop's ability to handle structured data makes it an essential tool for integrating relational databases
with the Hadoop ecosystem.
4. Flume:
Apache Flume is a service designed for efficiently collecting, aggregating, and transferring large
volumes of streaming data into Hadoop, particularly into HDFS. It's highly useful for applications
involving continuous data streams, such as logs, social media feeds, or sensor data. Key components
of Flume include:
Sources: These collect data from servers or applications.
Sinks: These store the collected data into HDFS or another destination.
Channels: These act as a buffer, holding event data (typically 4 KB in size) between sources
and sinks.
Agents: Agents run sources and sinks. Interceptors filter or modify the data before it's
written to the target.
Flume is reliable and fault-tolerant, providing a robust solution for handling massive, continuous data
streams.
----------------------------------------END OF MODULE 2-------------------------------------------------
MODULE 3
Introduction to Distributed Systems in Big Data
Definition: Distributed systems consist of multiple data nodes organized into clusters,
enabling tasks to execute in parallel.
Communication: Nodes communicate with applications over a network, optimizing
resource utilization.
Features of Distributed-Computing Architecture
1. Increased Reliability and Fault Tolerance
o Failure of some cluster machines does not impact the overall system.
o Data replication across nodes enhances fault tolerance.
2. Flexibility
o Simplifies installation, implementation, and debugging of new services.
3. Sharding
o Definition: Dividing data into smaller, manageable parts called shards.
o Example: A university student database is sharded into datasets per course
and year.
4. Speed
o Parallel processing on individual nodes in clusters boosts computing
efficiency.
5. Scalability
o Horizontal Scalability: Expanding by adding more machines and shards.
o Vertical Scalability: Enhancing machine capabilities to run multiple
algorithms.
6. Resource Sharing
o Shared memory, machines, and networks reduce operational costs.
7. Open System
o Accessibility of services across all nodes in the system.
8. Performance
o Improved performance through collaborative processor operations with lower
communication costs compared to centralized systems.
Drawbacks of Distributed-Computing Architecture
1. Troubleshooting Complexity
o Diagnosing issues becomes challenging in large network infrastructures.
2. Software Overhead
o Additional software is often required for distributed system management.
3. Security Risks
o Vulnerabilities in data and resource sharing due to distributed architecture.
NoSQL Concepts
NoSQL Data Store: Non-relational databases designed to handle semi-structured and
unstructured data.
NoSQL Data Architecture Patterns: Models such as key-value, document, column-
family, and graph for efficient data organization.
Shared-Nothing Architecture: Ensures no shared resources among nodes, enabling
independent operation and scalability.
MongoDB
Type: Document-oriented NoSQL database.
Features: Schema-less design, JSON-like storage, scalability, and high availability.
Usage: Suitable for real-time applications and Big Data analytics.
Cassandra
Type: Column-family NoSQL database.
Features: High availability, decentralized architecture, linear scalability, and eventual
consistency.
Usage: Ideal for applications requiring fast writes and large-scale data handling.
SQL Databases: ACID Properties
SQL databases are relational and exhibit ACID properties to ensure reliability and
consistency of transactions:
1. Atomicity
o All operations in a transaction must complete entirely, or none at all.
o Example: In a banking transaction, if updating both withdrawal and balance
fails midway, the entire transaction rolls back.
2. Consistency
o A transaction must take the database from one consistent (valid) state to another, preserving all defined rules and constraints.
o Example: After a funds transfer between two accounts, the total of the two balances remains unchanged.
3. Isolation
o Concurrently executing transactions do not interfere with one another; the intermediate state of a transaction is not visible to other transactions.
4. Durability
o Once a transaction commits, its changes persist even in the event of a system failure.
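To make the atomicity property concrete, the following JDBC sketch wraps a withdrawal and a deposit in one transaction so that either both updates succeed or both are rolled back. The accounts table, column names, and connection URL are illustrative assumptions, not from the notes.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class TransferExample {
    public static void main(String[] args) throws Exception {
        // Illustrative JDBC URL and credentials
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/bank", "user", "password")) {
            con.setAutoCommit(false);   // start a transaction
            try (PreparedStatement debit = con.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = con.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setDouble(1, 500.0);  debit.setInt(2, 101);
                credit.setDouble(1, 500.0); credit.setInt(2, 102);
                debit.executeUpdate();
                credit.executeUpdate();
                con.commit();               // both updates become permanent together
            } catch (Exception e) {
                con.rollback();             // atomicity: undo the partial transfer
                throw e;
            }
        }
    }
}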
SQL Features
1. Triggers
o Automated actions executed upon events like INSERT, UPDATE, or
DELETE.
2. Views
o Logical subsets of data from complex queries, simplifying data access.
3. Schedules
o Define the chronological execution order of transactions to maintain
consistency.
4. Joins
o Combine data from multiple tables based on conditions, enabling complex
queries.
CAP Theorem Overview
The CAP Theorem, formulated by Eric Brewer, states that in a distributed system, it is
impossible to simultaneously guarantee all three properties: Consistency (C), Availability
(A), and Partition Tolerance (P). Distributed databases must trade off between these
properties based on specific application needs.
CAP Properties
1. Consistency (C):
o All nodes in the distributed system see the same data at the same time.
o Changes to data are immediately reflected across all nodes.
2. Availability (A):
o Every request receives a response (success or failure), even if some nodes are unavailable.
3. Partition Tolerance (P):
o The system continues to operate even when network failures split the nodes into groups that cannot communicate with each other.
CAP Combinations
Since achieving all three properties is not possible, distributed systems choose two of the
three based on requirements:
1. Consistency + Availability (CA):
o Ensures all nodes see the same data (Consistency).
o Ensures all requests receive responses (Availability).
o Cannot tolerate network partitions.
o Example: Relational databases in centralized systems.
2. Availability + Partition Tolerance (AP):
o Ensures the system responds to requests even during network failures
(Partition Tolerance).
o May sacrifice consistency, meaning some nodes may have stale or outdated
data.
o Example: DynamoDB, where availability is prioritized over consistency.
3. Consistency + Partition Tolerance (CP):
o Ensures all nodes maintain consistent data (Consistency).
o Tolerates network partitions but sacrifices availability during failures (some
requests may be denied).
1. Key-Value Data Stores
Definition: Stores data as pairs of a unique key and an associated value, with lookups performed by key.
Uses:
o Image/document storage.
o Lookup tables and query caches.
2. Document Stores
Definition: Stores unstructured or semi-structured data in a hierarchical format.
Features:
1. Stores data as documents (e.g., JSON, XML).
2. Hierarchical tree structures with paths for navigation.
3. Transactions exhibit ACID properties.
4. Flexible schema-less design.
Advantages:
o Easy querying and navigation using languages like XPath or XQuery.
o Supports dynamic schema changes (e.g., adding new fields).
Limitations:
o Incompatible with traditional SQL.
o Complex implementation compared to other stores.
Examples: MongoDB, CouchDB.
Use Cases:
o Office documents, inventory data, forms, and document searches.
Comparison:
o JSON includes arrays; XML is more verbose but widely used.
o JSON is easier to handle for developers due to its key-value structure.
3. Column-Family Data Stores
Use Cases:
o Web crawling, large sparsely populated tables, and high-variance systems.
NoSQL Data Distribution Models
1. Single Server Model:
o This is the simplest distribution model where all data is stored and processed
on a single server. While this model is easy to implement, it may not scale
well for large datasets or high traffic applications.
o Best for: Small-scale applications or use cases like graph databases where
relationships are processed sequentially on a single server.
o Example: A simple graph database that processes node relationships on a
single server.
2. Sharding Very Large Databases:
o Sharding refers to the process of splitting a large database into smaller, more
manageable parts called "shards". Each shard is distributed across multiple
servers in a cluster.
o Sharding provides horizontal scalability, allowing the system to process data
in parallel across multiple nodes.
o Advantages:
Enhanced performance by distributing data across multiple nodes.
If a node fails, the shard can migrate to another node for continued
processing.
o Example: A dataset of customer records is split across four servers, where
each server handles one shard (e.g., DB1, DB2, DB3 and DB4).
3. Master-Slave Distribution Model:
o In this model, there is one master node that handles write operations, and
multiple slave nodes that replicate the master’s data for read operations.
o The master node directs the slaves to replicate data, ensuring consistency
across nodes.
o Advantages:
Read performance is optimized as multiple slave nodes handle read
requests.
Writing is centralized, ensuring data consistency.
o Challenges:
The replication process can introduce some latency and complexity.
A failure of the master node may impact the write operations until a
failover mechanism is implemented.
o Example: MongoDB uses this model where data is replicated from the master
node to slave nodes.
4. Peer-to-Peer Distribution Model (PPD):
o In this model, all nodes are equal peers that both read and write data. Each
node has a copy of the data and can handle both read and write operations
independently.
o Advantages:
High Availability: Since all nodes can read and write, the system can
tolerate node failures without affecting the ability to perform writes.
MongoDB Database:
MongoDB is a widely-used open-source NoSQL database designed to handle large amounts
of data in a flexible, distributed manner. Initially developed by 10gen (now MongoDB Inc.),
MongoDB was introduced as a platform-as-a-service (PaaS) and later released as an open-
source database. It’s known for its document-oriented model, making it suitable for handling
unstructured and semi-structured data.
Key Characteristics of MongoDB:
Non-relational: Does not rely on traditional SQL-based relational models.
NoSQL: Flexible and can handle large volumes of data across multiple nodes.
Distributed: Data can be stored across multiple machines, supporting horizontal
scalability.
Open Source: Freely available for use and modification.
Document-based: Uses a document-oriented storage model, storing data in flexible
formats such as JSON.
Cross-Platform: Can be used across different operating systems.
Scalable: Can scale horizontally by adding more servers to handle growing data
needs.
Fault Tolerant: Provides high availability through replication and data redundancy.
Features of MongoDB:
1. Database Structure:
o Each database is a physical container for collections. Multiple databases can
run on a single MongoDB server. The server's main process is mongod and the
command-line client is mongo; within the client, the variable db refers to the
currently selected database (the default database is test).
2. Collections:
o Collections are analogous to tables in relational databases, and they store
multiple MongoDB documents. Collections are schema-less, meaning that
documents within a collection can have different fields and structures.
3. Document Model:
o Data is stored as documents in BSON, a binary representation of JSON. Each
document is a set of field-value pairs identified by a unique _id field; fields can
vary from document to document, giving a flexible, dynamic schema.
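A small sketch with the MongoDB Java driver shows the schema-less document model in practice. It assumes a local mongod on the default port 27017; the university database, the students collection, and its fields are illustrative, not from the notes.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;

public class MongoExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("university");
            MongoCollection<Document> students = db.getCollection("students");

            // Documents in the same collection may have different fields (schema-less)
            students.insertOne(new Document("name", "Alice").append("dept", "CS")
                    .append("marks", 91));
            students.insertOne(new Document("name", "Bob").append("dept", "IT"));

            // Query by a field value
            Document first = students.find(eq("dept", "CS")).first();
            System.out.println(first.toJson());
        }
    }
}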
MongoDB Replication
Replication in MongoDB is essential for high availability and fault tolerance in Big Data
environments. Replication involves maintaining multiple copies of data across different
database servers. In MongoDB, this is achieved using replica sets, which ensure data
redundancy and allow for continuous data availability even in the event of server failures.
How Replica Sets Work:
A replica set is a group of MongoDB server processes (mongod) that store the same
data. Each replica set has at least three nodes:
1. Primary Node: Receives all write operations.
2. Secondary Nodes: Replicate data from the primary node.
The primary node handles all write operations, and these are automatically propagated to the
secondary nodes. If the primary node fails, one of the secondary nodes is promoted to
primary in an automatic failover process, ensuring continuous availability.
o Commands for Replica Set Management:
rs.initiate(): Initializes a new replica set.
rs.config(): Checks the replica set configuration.
rs.status(): Displays the status of the replica set.
rs.add(): Adds new members to the replica set.
MongoDB Sharding
Sharding is MongoDB’s method of distributing data across multiple machines, particularly in
scenarios involving large amounts of data. It is useful for scaling out horizontally when a
single machine can no longer store or process the data efficiently.
How Sharding Works:
Shards: A shard is a single MongoDB server or replica set that holds part of the data.
Sharded Cluster: MongoDB uses a sharded cluster to distribute data. Each shard
contains a portion of the data, and queries are routed to the appropriate shard based on
a shard key.
Shard Key: A field in the documents used to determine how data is distributed across
the shards.
Sharding allows MongoDB to handle larger datasets and more operations by spreading the
load across multiple machines.
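The routing idea behind a shard key can be illustrated with a simple hash-based sketch; this is a conceptual illustration only, not MongoDB's actual chunk-balancing mechanism, and the shard-key values are made up.
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    // Map a shard-key value to one of the shards using its hash
    public int shardFor(Object shardKeyValue) {
        int hash = shardKeyValue.hashCode();
        return Math.floorMod(hash, numShards);   // always in [0, numShards)
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        // Documents with the same shard-key value always land on the same shard
        System.out.println("customer-1042 -> shard " + router.shardFor("customer-1042"));
        System.out.println("customer-7315 -> shard " + router.shardFor("customer-7315"));
    }
}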
Cassandra Database
Cassandra, developed by Facebook and later released by Apache, is a highly scalable NoSQL
database designed to handle large amounts of structured, semi-structured, and unstructured
data. The database is named after the Trojan mythological prophet Cassandra, who was
cursed to always speak the truth but never to be believed. It was initially designed by
Facebook to handle their massive data needs, and it has since been adopted by several large
companies like IBM, Twitter, and Netflix.
Characteristics:
Open Source: Cassandra is freely available and open to modifications.
Scalable: It is designed to scale horizontally by adding more nodes to the system.
NoSQL: It is a non-relational database, making it suitable for big data applications.
Distributed: Cassandra's architecture allows it to run on multiple servers, ensuring
high availability and fault tolerance.
Column-based: Data is stored in columns rather than rows, making it more efficient
for write-heavy workloads.
Decentralized: All nodes in a Cassandra cluster are peers, which ensures that there is
no single point of failure.
Fault-tolerant: Due to data replication across multiple nodes, Cassandra can
withstand node failures without data loss.
Tuneable consistency: It provides flexibility to choose the level of consistency for
different operations.
Features of Cassandra:
Maximizes write throughput: It is optimized for handling massive amounts of write
operations.
No support for joins, group by, OR clauses, or complex aggregations: Its
architecture focuses on performance rather than relational operations.
Fast and easily scalable: The database performs well as more nodes are added, and it
can handle high write volumes.
Distributed architecture: Data is distributed across the nodes in the cluster, ensuring
high availability.
Peer-to-peer: Nodes in Cassandra communicate with each other in a peer-to-peer
fashion, unlike master-slave architectures.
Data Replication in Cassandra: Cassandra provides data replication across multiple
nodes, ensuring no single point of failure. The replication factor defines the number of
replicas placed on different nodes. In case of stale data or node failure, Cassandra uses read
repair to ensure that all replicas are consistent. It adheres to the CAP theorem, prioritizing
availability and partition tolerance.
Scalability: Cassandra supports linear scalability. As new nodes are added to the cluster,
throughput increases and response time decreases. It uses a decentralized approach
where each node in the cluster is equally important.
Transaction Support: Cassandra does not provide full ACID transactions in the way a
traditional RDBMS does. Writes are atomic and durable at the row (partition) level, but
there are no multi-table joins or locking; consistency is tunable per operation and is
eventual by default, which favours high availability and fault tolerance.
Replication Strategies:
Simple Strategy: A straightforward replication factor for the entire cluster.
Network Topology Strategy: Allows replication factor configuration per data center,
useful for multi-data center deployments.
Cassandra Data Model:
Cluster: A collection of nodes and keyspaces.
Keyspace: The outermost container in Cassandra that holds column families (tables).
Each keyspace defines the replication strategy and factors.
Column: A single data point consisting of a name, value, and timestamp.
Column Family: A collection of columns, which is equivalent to a table in relational
databases.
Cassandra CQL (Cassandra Query Language):
CREATE KEYSPACE: Creates a keyspace to store tables. It includes replication
strategy options.
ALTER KEYSPACE: Modifies an existing keyspace.
DROP KEYSPACE: Deletes a keyspace.
USE KEYSPACE: Connects to a specific keyspace.
CREATE TABLE: Defines a new table with columns, including primary key
constraints.
ALTER TABLE: Modifies the structure of an existing table (e.g., adding or dropping
columns).
DESCRIBE: Provides detailed information about keyspaces, tables, indexes, etc.
CRUD Operations in Cassandra:
1. INSERT: Adds new data into a table.
o Example: INSERT INTO <tablename> (<columns>) VALUES (<values>);
2. UPDATE: Modifies existing data.
o Example: UPDATE <tablename> SET <column> = <value> WHERE
<condition>;
3. DELETE: Removes data from a table.
o Example: DELETE FROM <tablename> WHERE <condition>;
4. SELECT: Retrieves data from a table.
o Example: SELECT <columns> FROM <tablename> WHERE <condition>;
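The CQL statements above can also be issued programmatically. The sketch below uses the DataStax Java driver (version 4.x is assumed, with a single local node and an illustrative students table); keyspace, table, and data-center names are assumptions, not from the notes.
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import java.net.InetSocketAddress;

public class CassandraExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")   // assumed data-center name
                .build()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS university WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            session.execute("CREATE TABLE IF NOT EXISTS university.students "
                    + "(id int PRIMARY KEY, name text, dept text)");

            session.execute("INSERT INTO university.students (id, name, dept) "
                    + "VALUES (1, 'Alice', 'CS')");

            Row row = session.execute(
                    "SELECT name, dept FROM university.students WHERE id = 1").one();
            System.out.println(row.getString("name") + " - " + row.getString("dept"));
        }
    }
}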
MODULE 4
MapReduce, Hive and Pig
Map Reduce Programming Model
The MapReduce programming model is a powerful framework used for processing and
analysing large-scale datasets in a distributed computing environment. It divides tasks into
two core operations: Map and Reduce.
In the Map phase, the input data is split into smaller chunks and distributed across multiple
nodes for parallel processing, where each node produces key-value pairs as intermediate
outputs.
The Reduce phase then aggregates these outputs, combining them into a smaller, more
concise result. This parallelized approach allows for efficient handling of vast amounts of
data. Hadoop, one of the most widely used implementations of MapReduce, utilizes the
Hadoop Distributed File System (HDFS) for storing and retrieving data. In such systems,
nodes serve both as computational units and storage devices, optimizing resource use and
scalability.
The MapReduce model is highly applicable in big data scenarios, enabling tasks like log
analysis, data transformation, and large-scale data mining. Additionally, database techniques
such as indexing and inner joins further enhance the efficiency of data retrieval and
processing, making MapReduce a foundational concept for modern big data solutions.
MapReduce employs a master-slave architecture, consisting of a JobTracker as the master
and TaskTrackers as slaves that execute the individual Map and Reduce tasks on the data nodes. This division of work
allows for a wide range of data processing tasks, making MapReduce a robust solution for
handling diverse big data workloads.
Map-Tasks
A Map Task in the MapReduce programming model is responsible for processing input data
in the form of key-value pairs, denoted as (k1, v1). Here, k1 represents a set of
keys, and v1 is a value (often a large string) read from the input file(s). The map()
function implemented within the task executes the user application logic on these pairs. The
output of a map task consists of zero or more intermediate key-value pairs (k2, v2),
which are used as input for the Reduce task for further processing.
The Mapper operates independently on each dataset, without intercommunication between
Mappers. The output of the Mapper, v2, serves as input for transformation operations at
the Reduce stage, typically involving aggregation or other reducing functions. A Reduce
Task takes these intermediate outputs, processes them using a combiner, and generates a
smaller, summarized dataset. Reduce tasks are always executed after the completion of all
Map tasks.
The Hadoop Java API provides a Mapper class with a map() function. Any specific Mapper
implementation must extend this class and override the map() function to define its behaviour.
For instance:
public class SampleMapper extends Mapper<K1, V1, K2, V2> {
    // K1, V1, K2, V2 are placeholder types; a real Mapper uses concrete
    // Writable types such as LongWritable, Text or IntWritable.
    @Override
    protected void map(K1 key, V1 value, Context context)
            throws IOException, InterruptedException {
        // User-defined logic: emit zero or more (k2, v2) pairs via context.write()
    }
}
The number of Map tasks, Nmap, is determined by the size of the input files and the
block size of the Hadoop Distributed File System (HDFS).
For example, a 1 TB input file with a block size of 128 MB results in 8192 Map tasks. The
number of Map tasks can also be explicitly set using setNumMapTasks(int) and typically
ranges between 10–100 per node, though higher values can be configured for more granular
parallelism.
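As a concrete instance of the template above, a word-count Mapper (the standard introductory example, not taken from these notes) reads each line as a (byte offset, line text) pair and emits (word, 1) pairs:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself (from TextInputFormat)
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}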
Key-Value Pair
Each phase (Map phase and Reduce phase) of MapReduce has key-value pairs as input
and output. Data should be first converted into key-value pairs before it is passed to
the Mapper, as the Mapper only understands key-value pairs of data.
Key-value pairs in Hadoop MapReduce are generated as follows:
InputSplit - Defines a logical representation of the data and presents the split's data for
processing to an individual map().
RecordReader - Communicates with the InputSplit and converts the split into records,
which are key-value pairs in a format suitable for reading by the Mapper.
RecordReader uses TextInputFormat by default for converting data into key-value
pairs, and communicates with the InputSplit until the file is read.
In MapReduce, the Grouping by Key operation involves collecting and grouping all the
output key-value pairs from the mapper by their keys. This process aggregates values
associated with the same key into a list, which is crucial for further processing during the
Shuffle and Sorting Phase. During this phase, all pairs with the same key are grouped
together, creating a list for each unique key, and the results are sorted. The output format of
the shuffle phase is <k2, List(v2)>. Once the shuffle process completes, the data is divided
into partitions.
A Partitioner plays a key role in this step, distributing the intermediate data into different
partitions, ensuring efficient data handling across multiple reducers.
A Combiner is an optional, local reducer that aggregates map output records on each node
before the shuffle phase, optimizing data transfer between the mapper and reducer by
reducing the volume of data that needs to be shuffled across the network.
The Reduce Tasks then process the grouped key-value pairs, applying the reduce() function
to aggregate the data and produce the final output. Each reduce task receives a list of values
for each key and iterates over them to generate aggregated results, which are then outputted
in the form of key-value pairs (k3, v3). This setup, which includes the shuffle, partitioning,
combiner, and reduce phases, optimizes performance and reduces the network load in
distributed computing environments like Hadoop.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ExampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Processing logic for each key and its grouped list of values
        // Example: sum of the values for each key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit the final output key-value pair (k3, v3)
        context.write(key, new IntWritable(sum));
    }
}
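A driver class (a standard Hadoop pattern, sketched here with illustrative input and output paths) wires the WordCountMapper and ExampleReducer shown above into a job, and can also register the reducer as a combiner for local aggregation before the shuffle:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(ExampleReducer.class);   // optional local reducer
        job.setReducerClass(ExampleReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/input"));     // illustrative path
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));  // illustrative path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}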
Coping with Node Failure
Hadoop achieves fault tolerance by restarting tasks that fail during the execution of a
MapReduce job.
3. Counting Distinct Values
Finding distinct values is a common task in applications like web log analysis or counting
unique users. Here are two possible solutions for counting unique values:
1. First Solution: Mapper emits dummy counters for each field and group ID, and the
reducer calculates the total number of occurrences for each pair.
2. Second Solution: The Mapper emits values and group IDs, and the reducer excludes
duplicates and counts unique values for each group.
Example: Counting unique users by their ID in web logs.
Mapper: Emits the user ID with a dummy count.
Reducer: Filters out duplicate user IDs and counts the total number of unique users.
4. Collating
Collating involves collecting all items with the same key into a list. This is useful for
operations like producing inverted indexes or performing extract, transform, and load (ETL)
tasks.
Mapper: Computes a given function for each item and emits the result as a key, with
the item itself as a value.
Reducer: Groups items by key and processes them.
Example: Creating an inverted index.
Mapper: Emits each word from the document as a key and the document ID as the
value.
Reducer: Collects all document IDs for each word, producing a list of documents
where each word appears.
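A compact sketch of this inverted-index pattern follows; obtaining the document name from the input split is an implementation detail assumed here, not stated in the notes.
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Use the current file name as the document ID
            String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), new Text(docId)); // (word, docId)
                }
            }
        }
    }

    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            Set<String> docs = new LinkedHashSet<>();   // drop duplicate document IDs
            for (Text id : docIds) {
                docs.add(id.toString());
            }
            context.write(word, new Text(String.join(",", docs)));
        }
    }
}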
5. Filtering or Parsing
Filtering or parsing is used when processing datasets to collect only the items that satisfy
certain conditions or transform items into other formats.
Mapper: Accepts only items that satisfy specific conditions and emits them.
Reducer: Collects all the emitted items and outputs the results.
Example: Extracting valid records from a log file.
Mapper: Filters records based on a condition (e.g., logs with errors) and emits the
valid records.
Reducer: Collects the valid records and saves them.
6. Distributed Tasks Execution
Large-scale computations are divided into multiple partitions and executed in parallel. The
results from each partition are then combined to produce the final result.
Mapper: Processes a specific partition of the data and emits the computed results.
Reducer: Combines the partial results from all partitions to produce the final result.
Relational Algebra Operations
1. Selection (σ)
The Selection operation selects the rows (tuples) of a relation that satisfy a given condition.
Syntax:
σcondition(R)
Where:
condition is a predicate (a logical condition) that the rows must satisfy.
R is the relation (table) from which rows are selected.
Example:
Consider a relation Employees with attributes (EmpID, Name, Age, Department).
Employees:
EmpID Name Age Department
101 Alice 30 HR
102 Bob 25 IT
103 Carol 35 HR
Selecting the employees in the HR department, σDepartment='HR'(Employees), gives:
EmpID Name Age Department
101 Alice 30 HR
103 Carol 35 HR
2. Projection (π)
The Projection operation is used to select specific columns from a relation, effectively
reducing the number of attributes in the resulting relation. It eliminates duplicate rows in the
result.
Syntax:
πattribute1, attribute2, ..., attributeN(R)
Where:
attribute1, attribute2, ..., attributeN are the columns to be selected from the
relation.
R is the relation from which attributes are selected.
Example:
Consider the Employees relation again. If we only want to select the Name and Department
columns, we would write: πName, Department(Employees)
This would produce the following result:
Name Department
Alice HR
Bob IT
Carol HR
3. Union (∪)
The Union operation combines the rows of two relations, removing duplicates. The two
relations involved must have the same set of attributes (columns).
Syntax:
R∪S
Where:
R and S are two relations with the same schema (same attributes).
Example:
Let’s assume two relations:
Employees (EmpID, Name) and Contractors (EmpID, Name).
Employees:
EmpID Name
101 Alice
102 Bob
Contractors:
EmpID Name
103 Carol
102 Bob
The result of Employees ∪ Contractors is:
EmpID Name
101 Alice
102 Bob
103 Carol
4. Set Difference (−)
The Set Difference operation returns the rows that are present in one relation but not in the
other. Like the Union operation, the two relations must have the same schema.
Syntax:
R−S
Where:
R is the first relation.
S is the second relation.
Example:
If we subtract Contractors from Employees: Employees − Contractors
This would result in:
EmpID Name
101 Alice
5. Cartesian Product (×)
The Cartesian Product operation pairs every row of one relation with every row of another
relation. If R has m rows and S has n rows, R × S contains m × n rows.
Syntax:
R × S
Example:
Employees:
EmpID Name
101 Alice
102 Bob
Departments:
DeptID Department
D01 HR
D02 IT
Employees × Departments produces every pairing of an employee with a department:
EmpID Name DeptID Department
101 Alice D01 HR
101 Alice D02 IT
102 Bob D01 HR
102 Bob D02 IT
6. Rename (ρ)
The Rename operation is used to rename the attributes (columns) of a relation or to change
the name of the relation itself. This operation is particularly useful when combining relations
in operations like join.
Syntax:
ρNewName(OldName)(R)
Where:
NewName is the new name of the relation.
OldName is the current name of the relation.
R is the relation.
Example:
If we have a relation Employees and want to rename the attribute EmpID to EmployeeID,
we would write: ρEmployees(EmpID → EmployeeID)(Employees)
This would result in the following relation:
EmployeeID Name Age Department
101 Alice 30 HR
102 Bob 25 IT
103 Carol 35 HR
7. Join (⨝)
The Join operation combines two relations based on a common attribute. It is one of the most
important operations in relational algebra, as it allows combining data from different tables.
Types of Join:
Inner Join: Combines rows from both relations where the join condition is true.
Outer Join: Returns all rows from one or both relations, with null values for
unmatched rows.
Syntax:
R ⨝condition S
Where:
R and S are relations.
condition specifies the common attribute used for the join.
Example:
Consider the following relations:
Employees:
EmpID Name
101 Alice
102 Bob
Departments:
EmpID Department
101 HR
102 IT
The join Employees ⨝EmpID Departments matches rows with the same EmpID:
EmpID Name Department
101 Alice HR
102 Bob IT
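A common way to implement such a join in MapReduce is a reduce-side join. The sketch below is illustrative only: it assumes comma-separated input files ("EmpID,Name" and "EmpID,Department"), that each mapper is attached to its own input path (for example with MultipleInputs in the driver), and the class names are hypothetical.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Each mapper emits the join key (EmpID) and a tagged value so the reducer
// can tell which relation a record came from.
class EmployeeMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");   // "EmpID,Name"
        if (f.length >= 2) {
            context.write(new Text(f[0].trim()), new Text("E:" + f[1].trim()));
        }
    }
}

class DepartmentMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");   // "EmpID,Department"
        if (f.length >= 2) {
            context.write(new Text(f[0].trim()), new Text("D:" + f[1].trim()));
        }
    }
}

// Reducer: for each EmpID, pair every employee name with every department value,
// producing the inner join rows (EmpID, Name, Department).
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text empId, Iterable<Text> tagged, Context context)
            throws IOException, InterruptedException {
        List<String> names = new ArrayList<>();
        List<String> departments = new ArrayList<>();
        for (Text t : tagged) {
            String s = t.toString();
            if (s.startsWith("E:")) names.add(s.substring(2));
            else if (s.startsWith("D:")) departments.add(s.substring(2));
        }
        for (String name : names) {
            for (String dept : departments) {
                context.write(empId, new Text(name + "\t" + dept));
            }
        }
    }
}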
Hive
Hive is a data warehousing and SQL-like query system built on top of Hadoop. It was
originally developed by Facebook to manage large amounts of data in Hadoop's distributed
file system (HDFS). Hive simplifies the process of querying and managing large-scale
datasets by providing an abstraction layer that allows users to run SQL-like queries (HiveQL)
on top of the Hadoop ecosystem.
Characteristics of Hive
1. MapReduce Integration:
Hive translates queries written in Hive Query Language (HiveQL) into MapReduce jobs.
This makes Hive scalable and suitable for managing and analyzing vast datasets,
particularly static data. Since Hive uses MapReduce, it inherits the scalability and parallel
processing capabilities of Hadoop.
Hive Architecture
The architecture of Hive is designed to provide an abstraction layer on top of Hadoop,
allowing users to run SQL-like queries (HiveQL) for managing and analyzing large datasets
stored in HDFS. Hive architecture consists of several key components that work together to
enable querying, execution, and management of data within the Hadoop ecosystem.
3. Web Interface:
o Usage: The web interface provides a graphical interface for executing queries,
managing tables, and performing administrative tasks without needing to use
the CLI.
4. Metastore:
o Function: The Metastore is a crucial component of Hive that stores all the
metadata (schema information) related to the tables, databases, and columns.
o Metadata: It stores information such as the database schema, column data
types, and HDFS locations of the data files.
o Interaction: All other components of Hive interact with the Metastore to fetch
or update metadata. For example, when a user queries a table, the Metastore
helps locate the corresponding data in HDFS.
o Storage: The Metastore typically uses a relational database (like MySQL or
PostgreSQL) to store this metadata.
5. Hive Driver:
o Function: The Hive Driver manages the lifecycle of a HiveQL query.
o Lifecycle Management: It is responsible for compiling the HiveQL query,
optimizing it, and finally executing the query on the Hadoop cluster.
o Execution Flow:
Compilation: The Hive Driver compiles the HiveQL statement into a
series of MapReduce jobs (or other execution plans depending on the
environment).
Optimization: The query is then optimized for execution. This may
include tasks such as predicate pushdown, column pruning, and join
optimization.
Execution: The final optimized query is submitted for execution on the
Hadoop cluster, where it is processed by the MapReduce framework.
6. Query Compiler:
o Function: The Query Compiler is responsible for parsing the HiveQL
statements and converting them into execution plans that are understandable
by the Hadoop system.
o Stages: The process involves the compilation of the HiveQL statement into an
Abstract Syntax Tree (AST), followed by the generation of a logical query
plan and its optimization before the physical plan is produced.
7. Execution Engine:
o Function: The Execution Engine is responsible for the actual execution of the
query.
Bucketing
How It Works: Data is divided into a specific number of buckets (files) by hashing a
particular column's value. Each bucket corresponds to one file stored in the partition's
directory.
Example: A customer table might be bucketed by the customer_id column, ensuring
that the data for each customer is stored in a separate bucket.
Hive Integration and Workflow Steps
Hive’s integration with Hadoop involves several key components that handle the query
execution, metadata retrieval, and job management.
1. Execute Query:
o The query is sent from the Hive interface (CLI, Web Interface, etc.) to the
Database Driver, which is responsible for initiating the execution process.
2. Get Plan:
o The Driver forwards the query to the Query Compiler. The compiler parses
the query and creates an execution plan, verifying the syntax and determining
the operations required.
3. Get Metadata:
o The Compiler requests metadata information (like table schema, column
types, etc.) from the Metastore (which can be backed by databases like
MySQL or PostgreSQL).
4. Send Metadata:
o The Metastore responds with the metadata, and the Compiler uses this
information to refine the query plan.
5. Send Plan:
o After parsing the query and receiving metadata, the Compiler sends the
finalized query execution plan back to the Driver.
6. Execute Plan:
o The Driver sends the execution plan to the Execution Engine, which is
responsible for actually running the query on the Hadoop cluster.
7. Execute Job:
o The execution engine triggers the execution of the query, which is typically
translated into a MapReduce job. This job is sent to the JobTracker (running
on the NameNode), which assigns tasks to TaskTrackers on DataNodes for
parallel processing.
8. Metadata Operations:
o During the execution, the Execution Engine may also perform metadata
operations with the Metastore, such as querying schema details or updating
the metastore.
9. Fetch Result:
o After completing the MapReduce job, the Execution Engine collects the
results from the DataNodes where the job was processed.
10. Send Results:
o The results are sent back to the Driver, which in turn forwards them to the
Hive interface for display to the user.
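From an application's point of view, this whole workflow is usually hidden behind a client interface such as the HiveServer2 JDBC driver. The sketch below is illustrative only; the HiveServer2 address (localhost:10000), credentials, and the employees table are hypothetical placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Submits a HiveQL query through HiveServer2 using the Hive JDBC driver.
public class HiveQueryClient {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // register the driver
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles, optimizes and executes the query; the client only
             // sees the result set fetched back through the Driver.
             ResultSet rs = stmt.executeQuery(
                     "SELECT department, COUNT(*) FROM employees GROUP BY department")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}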
Hive Built-in Functions
Hive provides a wide range of built-in functions to operate on different data types, enabling
various data transformations and calculations. Here’s a breakdown of some common built-in
functions in Hive:
1. BIGINT Functions
round(double a)
o Description: Returns the rounded BIGINT (8-byte integer) value of the 8-byte
double-precision floating point number a.
o Return Type: BIGINT
o Example: round(123.456) returns 123.
floor(double a)
o Description: Returns the maximum BIGINT value that is equal to or less than
the double value.
o Return Type: BIGINT
o Example: floor(123.789) returns 123.
ceil(double a)
o Description: Returns the minimum BIGINT value that is equal to or greater
than the double value.
o Return Type: BIGINT
o Example: ceil(123.456) returns 124.
2. Random Number Generation
rand(), rand(int seed)
o Description: Returns a random number (double) that is uniformly distributed
between 0 and 1. The sequence changes with each row, and specifying a seed
ensures the random number sequence is deterministic.
o Return Type: double
o Example: rand() returns a random number like 0.456789, and rand(5) will
generate a sequence based on the seed 5.
3. String Functions
concat(string str1, string str2, ...)
o Description: Concatenates two or more strings into one.
o Return Type: string
o Example: concat('Hello ', 'World') returns 'Hello World'.
substr(string str, int start)
o Description: Returns a substring of str starting from the position start till the
end of the string.
o Return Type: string
o Example: substr('Hello World', 7) returns 'World'.
substr(string str, int start, int length)
o Description: Returns a substring of str starting from position start with the
given length.
o Return Type: string
o Example: substr('Hello World', 1, 5) returns 'Hello'.
upper(string str), ucase(string str)
o Description: Converts all characters of str to upper case.
o Return Type: string
o Example: upper('hello world') returns 'HELLO WORLD'.
HiveQL Features
Data Definition: Allows users to define and manage the schema of tables, databases,
etc.
Data Manipulation: Enables the manipulation of data, such as inserting, updating, or
deleting records (although with some limitations).
Query Processing: Supports querying large datasets using operations like filtering,
joining, and aggregating data.
HiveQL Process Engine
The HiveQL Process Engine translates HiveQL queries into execution plans and
communicates with the Execution Engine to run the query. It is a replacement for the
traditional approach of writing Java-based MapReduce programs.
Hive Execution Engine
The Execution Engine is the component that bridges HiveQL and MapReduce. It
processes the query and generates results in the same way that MapReduce jobs
would do. It uses a variant of MapReduce to execute HiveQL queries across a
distributed Hadoop cluster.
HiveQL Data Definition Language (DDL)
HiveQL provides several commands for defining databases and tables. These commands are
used to manage the structure of the data in Hive.
Creating a Database
To create a new database in Hive, the following command is used:
CREATE DATABASE [IF NOT EXISTS] <database_name>;
IF NOT EXISTS: Ensures that Hive does not throw an error if the database already
exists.
Example:
CREATE DATABASE IF NOT EXISTS my_database;
Show Databases
To list all the databases in Hive, use the command:
SHOW DATABASES;
Dropping a Database
To remove a database, use:
DROP DATABASE [IF EXISTS] <database_name>;
Example:
DROP DATABASE IF EXISTS my_database;
Apache Pig
Pig is a high-level platform built on top of Hadoop to facilitate the processing of large
datasets. It abstracts the complexities of writing MapReduce programs and provides a more
user-friendly interface for data manipulation.
Features of Apache Pig
Dataflow Language: Pig uses a dataflow language, where operations on data are
linked in a chain, and the output of one operation is the input to the next.
Simplifies MapReduce: Pig reduces the complexity of writing raw MapReduce
programs by providing a higher-level abstraction.
Parallel Processing: Pig allows the execution of tasks in parallel, which makes it
suitable for handling large datasets.
Flexible: It can process structured, semi-structured, and unstructured data.
High-level Operations: Supports complex data manipulation tasks like filtering,
joining, and aggregating large datasets.
Applications of Apache Pig
Large Dataset Analysis: Ideal for analyzing vast amounts of data in HDFS.
Ad-hoc Data Processing: Useful for quick, one-time data processing tasks.
Processing Streaming Data: It can process web logs, sensor data, or other real-time
data.
Search Platform Data Processing: Pig can be used for processing and analyzing data
related to search platforms.
Time-sensitive Data Processing: Processes and analyzes data quickly, which is
essential for applications that require fast insights.
Pig scripts are often used in combination with Hadoop for data processing at scale, making it
a powerful tool for big data analytics.
Pig Architecture
The Pig architecture is built to support flexible and scalable data processing in a Hadoop
ecosystem. It executes Pig Latin scripts via three main methods:
1. Grunt Shell: An interactive shell that executes Pig scripts in real time.
2. Script File: A file containing Pig commands that are executed on a Pig server.
3. Embedded Script: Pig Latin functions that can be written as User-Defined Functions
(UDFs) in different programming languages and embedded within Pig scripts.
Pig Execution Modes
Pig Latin scripts can be executed in three modes:
1. Interactive Mode: This mode uses the Grunt shell. It allows you to write and
execute Pig Latin scripts interactively, making it ideal for quick testing and
debugging.
2. Batch Mode: In this mode, you write the Pig Latin script in a single file with a .pig
extension. The script is then executed as a batch process.
3. Embedded Mode: This mode involves defining User-Defined Functions (UDFs) in
programming languages such as Java, and using them in Pig scripts. It allows for
more advanced functionality beyond the built-in operations of Pig.
Pig Commands
To get a list of Pig commands:
pig -help
To check the version of Pig:
pig -version
To start the Grunt shell:
pig
Load Command
The LOAD command in Pig is used to load data into the system from various data sources.
Here's how it works:
Loading data from HBase:
book = LOAD 'MyBook' USING HBaseStorage();
Loading data from a CSV file using PigStorage, with a comma as a separator:
book = LOAD 'PigDemo/Data/Input/myBook.csv' USING PigStorage(',');
Specifying a schema while loading data: You can define a schema for the loaded
data, which helps in interpreting each field of the record.
book = LOAD 'MyBook' AS (name:chararray, author:chararray, edition:int,
publisher:chararray);
Store Command
The STORE command writes the processed data to a storage location, typically HDFS. It can
store data in various formats.
Default storage in HDFS (tab-delimited format):
STORE processed INTO '/PigDemo/Data/Output/Processed';
Storing data in HBase:
STORE processed INTO 'MyBook' USING HBaseStorage();
Module 5
Machine Learning Algorithms for Big Data Analytics
Artificial Intelligence (AI) is the field of computer science focused on creating machines
capable of performing tasks that traditionally require human intelligence. These tasks include
predicting future outcomes, recognizing visual patterns, understanding and processing
speech, making decisions, and engaging in natural language processing. AI systems aim to
mimic human cognitive abilities, allowing them to handle complex processes that are
typically done by humans, such as problem-solving and learning from experience.
Machine Learning (ML), a key subset of AI, revolves around the ability of systems to learn
from data without being explicitly programmed for specific tasks. It involves three main
stages: collecting data, analysing it to identify patterns, and predicting future outcomes based
on those patterns. Over time, as the system processes more data, its performance improves,
enabling it to make more accurate and efficient decisions. ML is used across various
industries and research fields to support decision-making and automation.
Deep Learning (DL) is an advanced approach within machine learning that uses complex
models, such as artificial neural networks (ANN), to simulate the human brain's learning
process. These models are designed to analyse large datasets with multiple layers of
information, making them highly effective for tasks like computer vision, speech recognition,
natural language processing, and bioinformatics. Deep learning techniques can produce
results that rival or exceed human-level performance, enabling breakthroughs in fields like
AI-assisted medical research, automated translation, and more.
Estimating the relationships, outliers, variances, probability distributions and
correlations
Statistical analysis distinguishes several types of variables when estimating relationships,
outliers, variances, probability distributions, and correlations:
1. Independent Variables: These are directly measurable characteristics that are not
affected by other variables. Examples include the year of sales or the semester of
study. The value of an independent variable does not depend on any other variable.
2. Dependent Variables: These represent characteristics that are influenced by
independent variables. For example, profit over successive years or grades awarded in
successive semesters are dependent on other factors. The value of a dependent
variable depends on the value of the independent variable.
3. Predictor Variable: This is an independent variable that helps predict the value of a
dependent variable using an equation, function, or graph. For example, it can predict
sales growth of a car model after five years from past sales data.
4. Outcome Variable: This represents the effect of manipulations using a function,
equation, or experiment. For instance, the CGPA of a student is an outcome variable
that depends on the grades awarded during semesters.
5. Explanatory Variable: An independent variable that explains the behavior of the
dependent variable, such as factors influencing the growth of profit, including the
amount of investment.
Outliers
Outliers are data points that deviate significantly from the other data points in a dataset. They
are numerically far distant from the rest of the points and can indicate anomalous situations
or errors in data collection. Outliers can occur due to various reasons, such as:
Anomalous situations: Unexpected or rare events that deviate from the norm.
Presence of previously unknown facts: New, unrecognized factors that may cause
unusual data points.
Human error: Mistakes made during data entry or collection.
Standard Deviation: The standard deviation measures the spread of the data points around the
mean. It is calculated as:
S = √( Σ (Xi − μ)² / N )
Where:
S is the standard deviation.
Xi is each data point.
μ is the mean of the data.
N is the number of data points.
Standard Error: The standard error estimate measures the accuracy of predictions made by a
model or relationship. It is related to the sum of squared deviations (also known as the sum of
squares error). The formula for the standard error of the estimate is:
Where:
Yi is the observed value.
Ŷi is the predicted value.
Analysis of Variance (ANOVA) compares the variance between groups with the variance
within groups using the F-test statistic:
F = E1(V) / E2(V)
Where:
E1(V) is the estimate of the variance between the groups.
E2(V) is the estimate of the variance within the groups
F-distribution and Critical Value:
To determine whether the F-test statistic is significant, we compare it against a critical value
from the F-distribution table, which depends on the degrees of freedom for both the
numerator (between-group variance) and denominator (within-group variance). If the
calculated F-value is greater than the critical value from the F-table, the null hypothesis is
rejected.
Correlation
Correlation measures the strength and direction of the relationship between two variables. It
quantifies how one variable changes with respect to another and is used to assess whether and
how strongly pairs of variables are related.
R-Squared (R2):
R-Squared is a statistical measure used to evaluate the goodness of fit in a regression
model.
It is also called the coefficient of determination and represents the proportion of the
variance in the dependent variable that can be explained by the independent
variable(s) in the model.
R2 is the square of the Pearson correlation coefficient (R) and ranges from 0 to 1. A
higher R2 value indicates a better fit of the model to the data.
Interpretation:
o R2=1: Perfect fit, where the predicted values are identical to the observed
values.
o R2=0: No correlation between the model and the observed data.
o Larger R2 values indicate a better model fit, implying stronger correlation
between the variables.
Regression Analysis
Regression analysis is a statistical method used to estimate the relationships among
variables. It helps understand how the dependent variable (also known as the response
variable) changes when one or more independent variables (predictor or explanatory
variables) are modified. The main goal of regression analysis is to model these relationships
to make predictions about future values of the dependent variable.
Multivariate Distribution and Regression
In regression analysis, we often deal with multivariate distributions, where multiple
variables are involved. For example, if a company wants to predict future sales of Jaguar cars
based on past sales data, it would analyze the relationship between the sales in previous years
and sales in the current year using regression models.
Regression analysis estimates how one or more independent variables influence the
dependent variable. It involves identifying the strength and nature of the relationship (e.g.,
linear or non-linear) between these variables.
Non-linear and Linear Regression
Non-linear Regression:
Non-linear regression is used when the relationship between the independent and
dependent variables is not linear.
The general equation for non-linear regression can have multiple terms (3 or more) on
the right-hand side of the equation, representing more complex relationships between
variables.
Here, y is the dependent variable, x1, x2, x3, … are the independent variables, and
a1, a2, a3, … are their corresponding coefficients.
Linear Regression:
Linear regression assumes that the relationship between the dependent and
independent variables can be modelled using a straight line.
It is a simpler form of regression where only the first two terms are considered in the
equation.
Simple Linear Regression
Simple Linear Regression is one of the most widely used techniques in regression analysis.
It is a supervised machine learning algorithm that aims to predict the value of a dependent
variable using one independent variable. It is the simplest form of regression, where the
relationship between the independent variable x and the dependent variable y is assumed
to be linear.
Key Features of Simple Linear Regression:
The objective is to fit a line (called the regression line) that minimizes the deviation
from all the data points.
The deviation from the line is called the error or residual.
The equation of the regression line is typically written as:
y = mx + c
Where:
o y is the dependent variable (what we want to predict).
o x is the independent variable (the predictor).
o m is the slope of the line, which represents how much y changes for a one-
unit change in x.
o c is the intercept, the value of y when x = 0.
Objective of Simple Linear Regression:
The goal is to find the best-fitting line that minimizes the total error (the deviation of
observed data points from the predicted values). This is often done using the least
squares method, which minimizes the sum of squared errors (residuals).
Steps in Performing Simple Linear Regression:
1. Collect Data: Gather the data points for both independent and dependent variables.
2. Fit a Line: Use statistical methods (e.g., least squares) to fit the line that minimizes
the error.
3. Predict: Once the regression line is obtained, it can be used to predict the value of y
for new values of x.
4. Evaluate the Model: The accuracy of the model can be measured using statistical
metrics like R-squared ( R2) and mean squared error (MSE).
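The least-squares computation itself is small enough to sketch directly. The standalone Java example below is not from the notes and uses made-up sample data; it fits y = mx + c and reports R-squared for the fit.

// Least-squares fit of y = m*x + c for paired observations, plus R-squared.
public class SimpleLinearRegression {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};           // independent variable (e.g., year)
        double[] y = {2.1, 4.0, 6.2, 8.1, 9.9}; // dependent variable (e.g., sales)
        int n = x.length;

        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;

        // Slope m = sum((x-meanX)(y-meanY)) / sum((x-meanX)^2); intercept c = meanY - m*meanX
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double m = sxy / sxx;
        double c = meanY - m * meanX;

        // R-squared = 1 - SS_res / SS_tot measures how well the line fits the data.
        double ssRes = 0, ssTot = 0;
        for (int i = 0; i < n; i++) {
            double predicted = m * x[i] + c;
            ssRes += (y[i] - predicted) * (y[i] - predicted);
            ssTot += (y[i] - meanY) * (y[i] - meanY);
        }
        double rSquared = 1 - ssRes / ssTot;

        System.out.printf("m = %.4f, c = %.4f, R^2 = %.4f%n", m, c, rSquared);
    }
}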
Multiple Regression
Multiple regression is an extension of simple linear regression that allows for the prediction
of a criterion variable (dependent variable) using two or more predictor variables
(independent variables). While simple linear regression predicts a dependent variable from
one independent variable, multiple regression considers multiple independent variables
simultaneously, making it ideal for more complex scenarios where several factors affect the
outcome.
Why Use Multiple Regression?
Real-world scenarios: Many real-world phenomena are influenced by multiple
factors. For example, a company's sales may depend on various factors like
advertising budget, season, customer sentiment, and economic conditions. Multiple
regression helps in modelling such complex relationships.
Forecasting and Prediction: Multiple regression is often used for forecasting future
values by considering multiple influencing factors. It is also useful for assessing the
strength of these predictors.
Example of Multiple Regression Model:
The general form of the multiple regression equation is:
y = b0 + b1x1 + b2x2 + … + bnxn + ϵ
Where:
y is the dependent variable (the outcome we are predicting).
b0 is the intercept (the predicted value when all predictors are zero).
b1, b2, …, bn are the regression coefficients for the independent variables x1, x2, …, xn,
indicating how much change in y is expected with a one-unit change in each predictor.
ϵ is the error term, representing unexplained variation in y (residuals).
Applications of Multiple Regression
1. Sales Forecasting:
o A company can use multiple regression to predict future sales based on several
variables, such as advertising spend, promotions, and seasonality. The
regression model would allow the company to forecast future sales more
accurately by considering these factors together.
2. Marketing Investment Analysis:
o A company may analyze whether investments in marketing campaigns (e.g.,
TV and radio ads) yield substantial returns. Using multiple regression, the
company can evaluate the individual impact of TV and radio ads as well as
their combined effect on sales.
Text Mining
Text Mining is the process of extracting valuable knowledge, insights, and patterns from
large collections of textual data. This involves analysing text data in a structured or
unstructured form and is used to uncover patterns, relationships, and insights that may not be
immediately apparent.
Text mining is particularly important due to the large amount of text-based data generated in
the world today. With the rise of social media, user-generated content such as text, images,
and videos has increased exponentially. Text mining plays a crucial role in analyzing and
understanding these vast amounts of data for actionable insights across various domains.
Applications of Text Mining
3. Legal:
Legal Case Search: Text mining tools can assist lawyers and paralegals in searching
vast databases of legal documents, case histories, and laws. This can help them find
relevant documents quickly, improving the efficiency and effectiveness of their legal
research.
E-Discovery: Text mining is embedded in e-discovery platforms, helping
organizations minimize the risk associated with sharing legally mandated documents.
These platforms assist in ensuring that relevant legal documents are properly
reviewed, managed, and stored.
Predictive Legal Insights: Case histories, testimonies, and client meeting notes can
be analysed to uncover additional insights that may help predict high-cost injuries or
legal issues. This analysis can contribute to better legal strategies and cost-saving
measures.
4. Governance and Politics:
Social media and Public Sentiment: Text mining can be used to analyse public
sentiment on social media platforms. This can help governments and political parties
gauge the mood of constituents, track public opinions, and adjust their strategies
accordingly.
Micro-Targeting in Elections: Social network analysis enables political campaigns
to create targeted messages based on data gathered from social media. This approach
helps political campaigns more efficiently use resources and reach voters with
messages tailored to their specific concerns.
Geopolitical Security: Text mining can be applied to real-time internet chatter to
detect emerging threats or crises. By analyzing large-scale social media data,
governments and organizations can gain valuable intelligence to improve security
measures.
Research Trend Analysis: In academic and research fields, text mining can help
analyse large amounts of research papers and publications to identify emerging trends.
Meta-analysis of academic research using text mining can uncover important insights
and direct future research initiatives.
Text Mining Process
Text mining is a rapidly growing field, especially with the increasing volume of social media
and other text data. To manage this data, there is a need for efficient techniques to extract
meaningful information. The process of text mining can be divided into five phases, ranging
from pre-processing of the raw text to the analysis and interpretation of the extracted patterns.
Text Mining Techniques
1. Unsupervised Learning (Clustering):
In unsupervised learning, the data is unlabelled, and the model identifies natural
clusters or groups within the data. Examples include grouping similar documents
together based on content.
2. Supervised Learning (Classification):
In supervised learning, the data is labelled, and the model is trained to classify new
data based on these labels. Examples include spam email classification or sentiment
analysis.
3. Evolutionary Pattern Identification:
This technique identifies patterns over time, such as analyzing news articles to
summarize events or identifying trends in research literature.
Text Mining Challenges
1. Natural Language Processing (NLP) Issues
2. Ambiguity:
Words and phrases can have multiple meanings depending on context, which creates
ambiguity. For example, the word "bat" can refer to a flying mammal or a piece of
sports equipment. Resolving such ambiguity requires sophisticated context
understanding.
3. Tokenization:
Tokenization is the process of splitting text into smaller units (tokens), such as words
or phrases. However, tokenizing text correctly can be challenging due to punctuation,
contractions, and compound words that may not be straightforward to split.
4. Parsing:
Parsing aims to analyze the grammatical structure of sentences. This task can be
complicated by sentence complexity, variations in sentence structure, and non-
standard language usage, leading to difficulty in constructing accurate parse trees.
5. Stemming:
Stemming reduces words to their root form, but it is not always perfect. For example,
stemming might strip the suffixes in words like "running" or "better," which could
result in ambiguous roots like "run" or "good."
6. Synonymy and Polysemy:
o Synonymy refers to the challenge of identifying words with similar meanings
(e.g., "car" and "automobile").
o Polysemy involves words with multiple meanings, which can confuse systems
(e.g., "bank" can mean a financial institution or the side of a river). Addressing
these issues requires deep contextual understanding.
2. Mining Techniques
Various mining techniques face challenges, including:
1. Identification of Suitable Algorithm(s):
There is no one-size-fits-all algorithm for text mining. Choosing the right algorithm
depends on the task (e.g., classification, clustering) and the nature of the text. The
diversity of text data requires selecting appropriate algorithms to handle different
types of text and tasks effectively.
2. Massive Amount of Data and Annotated Corpora:
Text mining often deals with vast amounts of unstructured data, and for supervised
learning tasks, large annotated corpora (labelled data) are needed. Annotating such
massive datasets is time-consuming and expensive, which is a significant barrier.
3. Concepts and Semantic Relations Extraction:
Extracting meaningful concepts and understanding the relationships between them in
text is complex. This requires understanding deeper semantics and context, which can
be difficult to model accurately.
3. Variety of Data
The variety of data types and sources adds another layer of complexity to text mining:
1. Different Data Sources Require Different Approaches and Areas of Expertise:
Text data can come from various sources, such as social media, scientific articles,
news, and books. Each type of text may require different pre-processing, feature
extraction, and analysis techniques.
2. Unstructured and Language Independence:
Much of the text data is unstructured, meaning it doesn't have a predefined format,
making it harder to process. Additionally, text mining systems may need to work
across different languages, each with its unique structure and nuances, requiring
language-independent approaches.
4. Information Visualization
Once insights are extracted from text, presenting them in a meaningful way is a challenge.
Text mining results can be complex and multidimensional, so effective visualization tools are
needed to make the insights understandable and actionable for users.
6. Scalability
Text mining systems must be scalable to handle large volumes of text. As the amount of text
data continues to grow, the system needs to scale efficiently, ensuring that computational
resources are not exhausted and that processing time remains manageable even with large
datasets.
Web Mining
Web mining is the process of discovering patterns and insights from data on the World Wide
Web to enhance the web and improve the user experience. As the web grows exponentially,
with more data being uploaded every day than the entire web had just two decades ago, web
mining has become a crucial tool for understanding and optimizing how the internet is used.
The web serves various functions, including electronic commerce, business communication,
and social interactions, which makes it essential to extract valuable insights from web data.
Web mining collects data through web crawlers, web logs, and other means, helping to
uncover trends that can optimize content, improve user experiences, and provide business
insights.
Characteristics of Optimized Websites
For a website to be considered optimized, it needs to have several key characteristics across
three main aspects: appearance, content, and functionality.
1. Appearance
Aesthetic Design: The design of a website plays a significant role in user
engagement. A visually appealing website captures attention and encourages
interaction.
Well-formatted Content: The content should be easy to read, scannable, and
logically structured to enhance user experience.
Easy Navigation: A website with clear navigation pathways ensures that users can
easily find the information they are looking for.
Good Color Contrasts: Proper contrast improves readability and enhances the visual
appeal of the website.
2. Content
Well-planned Information Architecture: The content should be structured logically
and organized in a way that is intuitive for users.
Fresh Content: Regularly updated content ensures that visitors have access to the
latest information and keep them returning.
Search Engine Optimization (SEO): Optimizing the content for search engines
makes it more discoverable, driving organic traffic to the site.
Links to Other Good Sites: Having quality external links helps build authority and
offers additional value to users.
3. Functionality
Accessibility for Authorized Users: Ensuring the website is accessible to users with
disabilities and meets web accessibility standards is important for inclusivity.
Fast Loading Times: A website should load quickly to reduce user frustration and
abandonment.
Usable Forms: User-friendly forms, such as easy-to-fill contact or sign-up forms,
improve interaction rates.
Mobile-enabled: With the increasing use of mobile devices, a responsive website is
essential for reaching a wider audience.
Naïve Bayes:
Naïve Bayes is a supervised machine learning technique based on probability theory. It
predicts the probability of an instance belonging to a specific target class, given prior
probabilities and predictors. Despite its simplicity, Naïve Bayes is powerful for many real-
world applications.
Naïve Bayes applies Bayes' theorem with the 'naïve' assumption that the predictors are
independent of one another:
P(class | predictors) = [ P(predictors | class) × P(class) ] / P(predictors)
Advantages:
1. Efficiency:
o Performs well when the assumption of independent predictors holds.
o Requires minimal training data for estimating test data, leading to short
training periods.
2. Simplicity:
o Easy to implement and computationally efficient.
Disadvantages:
1. Assumption of Independence:
o Naïve Bayes assumes that all predictors are independent. In reality, this is
often not true, which may limit the model’s accuracy.
2. Zero Frequency Issue:
o If a category in the test dataset does not appear in the training dataset, the
model assigns a probability of zero, making it unable to make predictions.
Solution: Apply smoothing techniques like Laplace Estimation to handle
such cases.
Practical Use:
Applications:
o Spam detection, sentiment analysis, text classification, and more.
Despite its simplicity and assumptions, Naïve Bayes often delivers robust results,
particularly for text-based applications.
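To make the idea concrete, the standalone Java sketch below classifies a tiny "document" into spam or ham using word counts, a class prior, and Laplace (add-one) smoothing; the training data, class labels, and vocabulary are invented purely for illustration.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Tiny word-based Naive Bayes: P(class | words) is proportional to
// P(class) * product of P(word | class), with Laplace (add-one) smoothing
// so unseen words never force a zero probability.
public class TinyNaiveBayes {
    public static void main(String[] args) {
        Map<String, List<String[]>> training = new HashMap<>();
        training.put("spam", Arrays.asList(
                new String[]{"win", "money", "now"},
                new String[]{"cheap", "money", "offer"}));
        training.put("ham", Arrays.asList(
                new String[]{"meeting", "schedule", "now"},
                new String[]{"project", "report", "review"}));

        String[] testDoc = {"money", "offer", "now"};

        // Vocabulary size is needed for Laplace smoothing.
        Set<String> vocab = new HashSet<>();
        training.values().forEach(docs -> docs.forEach(d -> vocab.addAll(Arrays.asList(d))));

        int totalDocs = training.values().stream().mapToInt(List::size).sum();
        String best = null;
        double bestLogProb = Double.NEGATIVE_INFINITY;

        for (Map.Entry<String, List<String[]>> e : training.entrySet()) {
            // Word counts and total words for this class.
            Map<String, Integer> counts = new HashMap<>();
            int totalWords = 0;
            for (String[] doc : e.getValue()) {
                for (String w : doc) {
                    counts.merge(w, 1, Integer::sum);
                    totalWords++;
                }
            }
            // Log-probabilities avoid numeric underflow for long documents.
            double logProb = Math.log((double) e.getValue().size() / totalDocs); // prior
            for (String w : testDoc) {
                double p = (counts.getOrDefault(w, 0) + 1.0) / (totalWords + vocab.size());
                logProb += Math.log(p);
            }
            if (logProb > bestLogProb) {
                bestLogProb = logProb;
                best = e.getKey();
            }
        }
        System.out.println("Predicted class: " + best);
    }
}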
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
both classification and regression problems, but it is predominantly applied in
classification tasks.
The algorithm represents data points in an n-dimensional space (where n is the
number of features) and identifies the optimal hyperplane to distinguish between
different classes.
How SVM Works:
1. Data Representation:
o Each data point is plotted as a point in n-dimensional space with its
coordinates corresponding to the feature values.
2. Hyperplane Identification:
o Classification is performed by finding the hyperplane that best separates the
two classes.
3. Margin Maximization:
o The "margin" is the distance between the hyperplane and the nearest data point
from each class.
o The optimal hyperplane maximizes this margin, ensuring better
generalization.
Advantages of SVM:
1. High Dimensional Feature Space:
o SVM performs well even when the number of features exceeds the number of
instances (e.g., spam filtering with numerous features).
2. Nonlinear Decision Boundaries:
o SVM can handle nonlinear decision boundaries by transforming the input data
into higher dimensions (using kernel tricks) where the classifier can be
represented as a linear function.
3. Ease of Understanding:
o Conceptually simple, offering an intuitive linear classification model.
4. Efficiency:
o Focuses only on a subset of relevant data points (support vectors), making it
computationally efficient.
5. Wide Availability:
o Supported by most modern data analytics and machine learning toolsets.
Disadvantages of SVM:
1. Numeric Input Requirement:
o SVM requires all data points in all dimensions to be numeric, limiting its
application to non-numeric datasets without preprocessing.
2. Binary Classification Limitation:
o Primarily designed for binary classification. Multi-class problems require
techniques like cascading multiple SVMs.
3. Computational Complexity:
o Training SVMs can be inefficient and time-consuming, especially for large
datasets.
4. Noise Sensitivity:
o SVM struggles with noisy data, requiring the computation of soft margins to
accommodate misclassifications.
5. Lack of Probability Estimates:
o SVM does not directly produce probability estimates for its predictions; when
probabilities are needed, they must be obtained indirectly through additional calibration.
Practical Applications:
Spam Detection: Identifying spam emails using high-dimensional feature spaces.
Image Recognition: Classifying images into predefined categories.
Text Categorization: Sorting documents into topics or themes.
Bioinformatics: Predicting protein structure or gene expression profiles.
PageRank
PageRank is a link analysis algorithm developed by Larry Page and Sergey Brin, used to rank
web pages based on their importance in a web graph. It operates under the principle that more
important pages are likely to receive more links from other pages.
How PageRank Works:
1. Each page is assigned an initial rank (usually 1/N, where N is the total number of
pages).
2. The rank of a page is calculated iteratively based on the ranks of other pages linking
to it.
PageRank Formula:
PR(p) = (1 − d)/N + d × Σ over q in L(p) of [ PR(q) / C(q) ]
Where:
d: Damping factor (typically set to 0.85). It accounts for the probability of randomly
following links versus jumping to a random page.
N: Total number of pages.
L(p): Set of pages linking to p.
C(q): Number of outbound links on page q.
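An illustrative standalone Java sketch of the iterative computation is shown below; the four-page graph, damping factor, and iteration count are arbitrary choices made for the example.

import java.util.Arrays;

// Iterative PageRank over a small directed graph given as adjacency lists.
// graph[i] holds the pages that page i links to (its outbound links).
public class PageRankDemo {
    public static void main(String[] args) {
        int[][] graph = {
                {1, 2},   // page 0 links to pages 1 and 2
                {2},      // page 1 links to page 2
                {0},      // page 2 links back to page 0
                {0, 2}    // page 3 links to pages 0 and 2
        };
        int n = graph.length;
        double d = 0.85;                       // damping factor
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);            // initial rank 1/N for every page

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);    // random-jump contribution
            for (int q = 0; q < n; q++) {
                int outDegree = graph[q].length;        // C(q)
                for (int p : graph[q]) {
                    next[p] += d * rank[q] / outDegree; // share of q's rank given to p
                }
            }
            rank = next;
        }
        for (int p = 0; p < n; p++) {
            System.out.printf("PR(page %d) = %.4f%n", p, rank[p]);
        }
    }
}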
Structure of the Web:
The web can be represented as a directed graph, where:
Nodes: Represent web pages.
Edges: Represent hyperlinks between pages.
Characteristics:
1. Bow-Tie Structure:
o The web graph often has a bow-tie structure with:
A strongly connected core of pages that are mutually reachable.
In-links: Pages that link to the core but are not reachable from it.
Out-links: Pages reachable from the core but do not link back.
Disconnected components: Isolated pages or groups of pages.
2. Small-World Phenomenon:
o Most pages are reachable from any other page within a small number of clicks
(high clustering coefficient).
3. Power-Law Distribution:
o The number of links per page follows a power-law distribution, meaning a
small number of pages have a high number of links.
Analyzing a Web Graph
Steps:
1. Representation:
o Use an adjacency matrix or adjacency list to represent the graph.
2. Rank Calculation:
o Apply the PageRank algorithm iteratively.
3. Properties:
o Indegree: Number of links pointing to a page.
o Outdegree: Number of links going out from a page.
o Connectivity: Identify strongly connected components.
4. Traversal:
o Use BFS/DFS to explore the graph for reachability and link structure.
Social Network as a Graph and Social Network Analytics
A social network can be represented as a graph, where nodes (also called vertices)
signify entities such as individuals, groups, or organizations, and edges represent
the relationships or interactions between these entities. For instance, in a social
media platform, users can be represented as nodes, and friendships or follows can
be represented as edges. These edges may be undirected (e.g., mutual friendships)
or directed (e.g., one-way follows). Analyzing social networks using graph theory
enables us to understand connectivity, influence, and the overall structure of the
network.
Social network analytics refers to the study and interpretation of social networks
using computational and mathematical methods. This includes identifying key
individuals, understanding community structures, and uncovering patterns of
interaction. Tools like clustering, similarity measures, and community detection
are commonly used to reveal insights such as the most influential nodes, densely
connected communities, and trends within the network.
Clustering involves grouping nodes in a network such that nodes within the same
cluster are more interconnected than those in different clusters. Clustering helps
identify communities or subgroups within the network. Popular clustering methods
include modularity-based clustering, spectral clustering, and hierarchical
clustering. Modularity-based clustering measures the quality of clustering by
evaluating the density of links inside clusters compared to those between clusters.
This technique is widely used for discovering tightly-knit groups in large social
networks.
SimRank
SimRank is a link-based similarity measure: two nodes are considered similar if they are
linked to (referenced by) nodes that are themselves similar. It is computed iteratively over the
graph structure and is commonly used for recommendations, such as suggesting friends or
content.
Counting Triangles and Graph Matching
Triangles in a social network graph are formed when three nodes are mutually
connected. Counting triangles is essential for analyzing the local clustering
coefficient, which indicates the likelihood of two neighbors of a node being
connected. This measure reflects the level of interconnectedness and transitivity in
the network. Graph matching, on the other hand, involves finding specific
subgraph patterns within a larger graph. This is useful in identifying motifs or
recurring structures in the network, such as organizational hierarchies or friend
groups.
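Counting triangles can be sketched with a brute-force enumeration of node triples, which is adequate for small graphs (large graphs need smarter methods, for example MapReduce-based counting). The Java example below uses an invented four-node friendship graph.

// Counts triangles in a small undirected social graph stored as an adjacency matrix.
// A triangle is a set of three nodes that are all mutually connected.
public class TriangleCount {
    public static void main(String[] args) {
        boolean[][] adj = new boolean[4][4];
        int[][] edges = {{0, 1}, {1, 2}, {0, 2}, {2, 3}}; // mutual friendships
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }

        int triangles = 0;
        int n = adj.length;
        // Enumerate node triples i < j < k so each triangle is counted exactly once.
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                for (int k = j + 1; k < n; k++)
                    if (adj[i][j] && adj[j][k] && adj[i][k])
                        triangles++;

        System.out.println("Triangles: " + triangles); // prints 1 for this graph
    }
}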
Community detection aims to identify groups of nodes in a social network that are densely
connected internally but have sparse connections with nodes in other groups. Techniques
such as the Girvan-Newman algorithm, which removes edges with the highest betweenness
centrality, and the Louvain method, which maximizes modularity, are popular for discovering
communities. These methods help in understanding the underlying social dynamics and
targeting specific groups for marketing or information dissemination.
The insights gained from analyzing social networks are applied in various domains.
Clustering helps in detecting communities for marketing campaigns, SimRank is used for
friend and content recommendations, triangle counting measures social cohesion, and
community detection identifies influential subgroups for spreading awareness or
advertisements. Overall, social network analytics enables businesses, researchers, and
organizations to harness the power of networked data effectively.
Clique
It is a subset of vertices within a graph such that every two distinct vertices in the clique are
adjacent; in other words, it forms a complete subgraph. This means that every member of the
clique is directly connected to every other member. Cliques are significant in various fields,
including social network analysis, where they can represent groups of individuals who all
know each other.
------------------------------------END OF MODULE 5--------------------------------------------------
Question Bank
Module 2:
Introduction to Hadoop, Hadoop Distributed File System Basics, Essential Hadoop
Tools
1. What is Hadoop? Explain the core components of Hadoop.
2. Explain Hadoop Ecosystem with a neat Diagram
3. What are the features of Hadoop?
4. Explain Hadoop Physical Organisation
5. Explain Hadoop MapReduce Framework and Programming Model
6. Brief about YARN-Based Execution Model
Module 4
1. Explain Map Reduce Map tasks with the Map reduce programming model
2. Discuss, how to compose Map-reduce for calculations
3. Illustrate different Relational algebraic operations in Map reduce
4. Discuss HIVE
i) Features
ii) Architecture
iii) Installation Process
5. Compare HIVE and RDBMS
6. Explain HIVE Datatypes and file format
7. Discuss Hive Data Model with data flow sequences
8. Explain Hive Built in functions
9. Define HiveQL. Write a program to create, show, drop and query operations taking a
database for toy company
10. Explain Table partitioning, bucketing, views, join and aggregation in Hive QL
11. Explain PIG architecture with applications and features.
12. Give the differences between
i) Pig and Map reduce
ii) Pig and SQL
13. Explain Pig Latin Data Model with pig installation steps
14. Explain Pig Relational operations
Module 5
Machine Learning Algorithms for Big Data Analytics, Text, Web Content, Link and
Social Network Analytics
1. Explain the following
i) Text mining with the text analytics process pipeline
ii) Text mining process and phases
iii) Text mining challenges
2. Discuss the following
i) Naïve Bayes analysis
ii) Support vector machines
iii) Binary classification
3. Discuss
i) Web Mining
ii) Web content
iii) Web usage Analytics
4. Explain
i) Page rank
ii) Structure of Web and Analysing a Web graph authorities
5. What are Hubs and Authorities?
6. Explain Social Network as Graph and Social network analytics
7. Discuss
i) Clustering in social networks
ii) Sim rank
iii) Counting triangles and graph matches
iv) Direct discovery of communities
8. Discuss Analysis of Variances (ANOVA) and correlation indicators of linear relationship
9. Describe how regression analysis predicts the value of the dependent variable in the case of
linear regression
10. In Machine Learning, Explain Linear and Non-Linear Relationships with Graphs
11. Explain Multiple Regression and give examples of its use in forecasting and optimisation
12. Explain with neat diagram K-means clustering.
13. Explain Naïve Bayes Theorem with example.
14. Explain the Apriori Algorithm and how it generates and evaluates candidate itemsets