CH 6 BDA


BIG DATA ANALYTICS

PECAIML601A

CHAPTER-6
1 MARK QUESTIONS

1. What is Pig Storage?

• PigStorage is the default load (and store) function in Pig. We use PigStorage whenever we want to load data from a
file system into Pig.
• While loading data using PigStorage, we can also specify the delimiter of the data (how the fields in each record are
separated), as well as the schema of the data along with the type of each field.
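
As an illustration, a minimal sketch (the file name, delimiter, and schema below are assumptions):

students = LOAD 'students.txt' USING PigStorage('\t') AS (id:int, name:chararray, marks:float);
-- PigStorage also works as a store function; here the output is written comma-separated
STORE students INTO 'students_out' USING PigStorage(',');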

2. What is the difference between Pig Latin and SQL?

• Pig Latin is a procedural language while SQL is a declarative language. Additionally, Pig Latin is designed for
processing large datasets, while SQL is designed for working with structured data.

3. What is a relation in Pig Latin?

• In Pig Latin, a relation is a dataset consisting of a set of tuples.

4. What is the main purpose of PIG in Big Data Analytics?

• The main purpose of PIG in Big Data Analytics is to provide a high-level language for processing large datasets on
Apache Hadoop.

5. What is the function of the FOREACH statement in Pig Latin?

• The FOREACH statement in Pig Latin is used to apply a transformation to each tuple in a relation.
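
For example (hypothetical relation and field names):

-- students has the schema (name:chararray, marks:int)
results = FOREACH students GENERATE name, marks * 2 AS double_marks;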

6. What is the default execution mode of Pig Latin?

• The default execution mode of Pig Latin is MapReduce.
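
The execution mode can also be chosen explicitly with the -x option when launching Pig (the script name below is an assumption):

pig -x local myscript.pig        (runs the script locally in a single JVM, useful for testing)
pig -x mapreduce myscript.pig    (runs the script as MapReduce jobs on the cluster; this is the default)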

7. What is HIVE?

• Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was developed
by Facebook.
• Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It
runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
• Using Hive, we can skip the requirement of the traditional approach of writing complex MapReduce programs. Hive
supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDF).

8. What is a metastore in Hive?

• A metastore in Hive is a database that stores metadata about the data stored in Hadoop, such as schema
information, table partitions, and column statistics.

9. What is the purpose of the EXPLAIN statement in Hive?

• The EXPLAIN statement in Hive is used to analyze the execution plan for a query and identify any performance
bottlenecks or optimization opportunities.
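
For example, using the employee_info table referenced later in this chapter:

EXPLAIN SELECT department, AVG(salary) FROM employee_info GROUP BY department;

The output describes the plan of stages (such as the map and reduce tasks) that Hive will run for the query.
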
10. What is the use of partitioning in Hive?

• Partitioning in Hive is used to divide data into smaller, more manageable parts based on a specified key or set of
keys. This can improve query performance by reducing the amount of data that needs to be processed.
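
A minimal HiveQL sketch (the table and column names are assumptions):

CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (sale_date STRING);

-- A query that filters on the partition column reads only the matching partition directories
SELECT SUM(amount) FROM sales WHERE sale_date = '2023-01-01';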

5 MARKS QUESTIONS

1. Compare and contrast Hive and Pig. What are the key differences between the two tools?

• Hive and Pig are two popular data processing tools in the Hadoop ecosystem, but they have different approaches to
data processing and different use cases. Here are the key differences between Hive and Pig:

[1] LANGUAGE:
• Hive uses SQL-like language, called HiveQL, which makes it easy for users with SQL knowledge to work with Hive.
Pig, on the other hand, uses a high-level scripting language, called Pig Latin, which is more similar to a programming
language.

[2] DATA PROCESSING:


• Hive is designed to handle structured data that is stored in tables, while Pig can handle both structured and
unstructured data. Pig also has more advanced data processing capabilities, such as the ability to handle complex
data flows and custom data processing functions.

[3] QUERY EXECUTION:


• Hive is optimized for ad-hoc querying and analysis of large datasets, while Pig is designed for data processing tasks
such as ETL (Extract, Transform, Load) and data pipelines.

[4] PERFORMANCE:
• Hive is generally slower than Pig for iterative data processing tasks because it is optimized for SQL-like queries, which
are not well-suited for iterative processing. Pig is designed for iterative processing and is often faster than Hive for
these types of tasks.

[5] ECOSYSTEM INTEGRATION:


• Hive is tightly integrated with the Hadoop ecosystem and can work seamlessly with other Hadoop components such
as HDFS, HBase, and Spark. Pig can also work with these components, but it has more limited integration with some
of them.

Pig | Hive
Pig operates on the client side of a cluster. | Hive operates on the server side of a cluster.
Pig uses the Pig Latin language. | Hive uses the HiveQL language.
Pig is a Procedural Data Flow Language. | Hive is a Declarative SQL-like Language.
It was developed by Yahoo. | It was developed by Facebook.
It is used by researchers and programmers. | It is mainly used by data analysts.
It is used to handle structured and semi-structured data. | It is mainly used to handle structured data.
It is used for programming. | It is used for creating reports.
Pig scripts end with the .pig extension. | In Hive, all extensions are supported.
It does not support partitioning. | It supports partitioning.
It loads data quickly. | It loads data slowly.
It does not support JDBC. | It supports JDBC.
It does not support ODBC. | It supports ODBC.
Pig does not have a dedicated metadata database. | Hive makes use of a dedicated SQL-DDL language by defining tables beforehand.
It supports the Avro file format. | It does not support the Avro file format.
Pig is suitable for complex and nested data structures. | Hive is suitable for batch-processing OLAP systems.
Pig does not support schema to store data. | Hive supports schema for data insertion in tables.
It is very easy to write UDFs to calculate matrices. | It supports UDFs, but they are much harder to debug.

2. Give an example of a Pig Latin script to join two datasets and explain how it works.

ASSUME WE HAVE TWO DATASETS: CUSTOMERS AND TRANSACTIONS.


• The customers dataset contains customer information with columns id, name, and age. The transactions dataset
contains transaction information with columns id, date, and amount. We want to join these datasets based on the
id column to get a dataset that contains customer and transaction information together.

HERE'S THE PIG LATIN SCRIPT TO DO THIS JOIN:


• customers = LOAD 'customers.csv' USING PigStorage(',') AS (id:int, name:chararray, age:int);
• transactions = LOAD 'transactions.csv' USING PigStorage(',') AS (id:int, date:chararray, amount:float);
• joined_data = JOIN customers BY id, transactions BY id;
• DUMP joined_data;

HERE'S HOW THIS SCRIPT WORKS:


• The first two lines load the customers and transactions datasets into Pig, using the LOAD command. The datasets
are loaded from CSV files using the PigStorage function, which specifies that the files are comma-separated.
• The next line defines the joined_data variable and uses the JOIN command to join the customers and transactions
datasets based on the id column. The BY id clause specifies that the join should be done based on the id column in
both datasets.
• The final line uses the DUMP command to output the joined dataset to the console.

3. What is Pig Latin and how does it differ from SQL in the context of big data analysis?

• Pig Latin is a high-level scripting language used in the Hadoop ecosystem for analyzing large datasets. It is used with
Apache Pig, a platform that allows users to write data processing pipelines and execute them on Hadoop clusters.
Pig Latin provides a set of operators and functions for performing data transformations, filtering, aggregation, and
joining operations.
• In the context of big data analysis, Pig Latin differs from SQL in several ways:

Feature | Apache Pig | SQL (Structured Query Language)
Language Paradigm | Procedural data flow language | Declarative query language
Data Processing Model | Data flow programming model | Relational algebra and set-oriented approach
Data Manipulation | Schema-less, semi-structured data manipulation | Structured data manipulation
Data Types | Supports complex data types (e.g., nested, bag, map) | Primarily designed for structured data types
Ecosystem Integration | Part of the Hadoop ecosystem | Widely supported across database systems
Performance Optimization | Data pipelining for parallel execution | Query optimization for efficient execution
Custom Functions | User-defined functions (UDFs) for complex operations | User-defined functions (UDFs) for operations
Data Sources | Hadoop Distributed File System (HDFS), HBase, etc. | Various databases, including relational DBs
Tool Compatibility | Integrates well with other Hadoop tools (e.g., Hive) | Native querying language for most databases
Query Expressiveness | Powerful for complex data transformations and joins | Well-suited for structured data querying
Learning Curve | Steeper learning curve due to procedural nature | Familiar syntax for developers and analysts
Community Support | Active open-source community and documentation | Widespread support and extensive resources
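
To make the contrast concrete, here is the same aggregation expressed in both languages (the orders data and column names are assumptions):

Pig Latin (procedural, step by step):
orders = LOAD 'orders.csv' USING PigStorage(',') AS (customer:chararray, amount:float);
grouped = GROUP orders BY customer;
totals = FOREACH grouped GENERATE group AS customer, SUM(orders.amount) AS total;
DUMP totals;

SQL (declarative, single statement):
SELECT customer, SUM(amount) AS total
FROM orders
GROUP BY customer;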

4. How can Pig be used to process unstructured data in Hadoop?

• Pig is an excellent tool for processing unstructured data in Hadoop. Here are some of the ways in which Pig can
be used for this purpose (a combined example script is shown after the list):

[1] LOADING DATA:


• Pig supports a wide range of file formats, including CSV, TSV, JSON, and XML, which makes it easy to load
unstructured data into a Hadoop cluster.

[2] EXTRACTING DATA:


• Pig provides a set of built-in functions for extracting data from unstructured data sources. For example, you can
use the TOKENIZE function to split text data into individual words, or use the REGEX_EXTRACT function to extract
specific patterns from text data.

[3] FILTERING DATA:


• Pig provides a variety of filtering operators that can be used to remove unwanted data from unstructured data
sources. For example, you can use the FILTER operator to remove records that do not match a certain criteria,
or use the DISTINCT operator to remove duplicates from a dataset.

[4] TRANSFORMING DATA:


• Pig provides a rich set of operators for transforming unstructured data into a structured format. For example,
you can use the GROUP operator to group data based on a certain criteria, or use the JOIN operator to combine
data from multiple sources.

[5] STORING DATA:


• Pig allows you to store the results of your data processing pipeline in a variety of file formats, including CSV, TSV,
and JSON. This makes it easy to integrate your data with other tools and platforms.
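
The sketch below ties these steps together for a hypothetical text file of tweets; the file name and field names are assumptions:

lines = LOAD 'tweets.txt' USING TextLoader() AS (line:chararray);
-- Extract individual words from each line of unstructured text
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Filter out empty or very short tokens
clean = FILTER words BY word IS NOT NULL AND SIZE(word) > 2;
-- Group by word and count occurrences, turning unstructured text into a structured result
grouped = GROUP clean BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(clean) AS freq;
STORE counts INTO 'word_counts' USING PigStorage('\t');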

5. Compare and contrast Hive and Pig. What are the key differences between the two tools?

• Hive and Pig are both data processing tools that are part of the Apache Hadoop ecosystem, but they have some
key differences in terms of their purpose, language, and usage.
• Hive is primarily designed for data warehousing and SQL-like querying. It allows users to write SQL-like queries,
known as Hive Query Language (HQL), which are translated into MapReduce jobs or executed by more recent
query engines like Apache Tez or Apache Spark. Hive is well-suited for structured and semi-structured data
analysis and is often used by analysts and SQL-savvy users.
• On the other hand, Pig is a data scripting tool that uses a procedural language called Pig Latin. It enables users
to write data transformations and analysis in a procedural and data flow-oriented manner. Pig is suitable for
processing and analyzing semi-structured and unstructured data, and it is often used by developers and data
engineers who prefer a more programmatic approach.
• Here's a comparison highlighting the key differences between Hive and Pig:

Feature | Hive | Pig
Purpose | Data warehousing and SQL-like querying | Data scripting and data flow programming
Language | Hive Query Language (HQL), similar to SQL | Pig Latin, a procedural data flow language
Data Processing Model | SQL-like, declarative query language | Data flow programming model
Data Manipulation | Schema-on-read, structured and semi-structured data | Schema-on-read, semi-structured and unstructured data
Data Types | Supports structured and semi-structured data types | Supports semi-structured and unstructured data
Ecosystem Integration | Part of the Hadoop ecosystem | Part of the Hadoop ecosystem
Performance Optimization | Query optimization for efficient execution | Data pipelining for parallel execution
Custom Functions | User-defined functions (UDFs) for complex operations | User-defined functions (UDFs) for operations
Tool Compatibility | Integrates well with other Hadoop tools | Integrates well with other Hadoop tools
Learning Curve | Familiar SQL-like syntax | Procedural language with a steeper learning curve
Community Support | Active open-source community and documentation | Active open-source community and documentation

15 MARKS QUESTIONS

1. Question

a) What is Apache Pig and how is it used for data processing?


b) Consider a dataset containing information about students and their grades in different subjects.
The dataset is stored in a CSV file on HDFS. Write the steps and the corresponding script in Pig
Latin to find the average grade for each student and each subject.
c) Explain the features of Apache Pig

Answer a) APACHE PIG


• Apache Pig is an open-source platform used for analyzing and processing large datasets. It provides a high-
level language called Pig Latin for writing data processing programs. Pig Latin is a scripting language that is
compiled into MapReduce jobs and executed on a Hadoop cluster.
• The data processing flow in Pig involves the following operations:
o Data Loading: reading data from HDFS or other sources into relations
o Data Transformation: applying operators such as FILTER, FOREACH, GROUP, and JOIN
o Data Processing Pipelines: chaining transformations together into a data flow
o User-Defined Functions (UDFs): extending Pig with custom processing logic
o Execution on Hadoop: compiling the script into MapReduce jobs that run on the cluster
o Integration with Other Tools: working with HDFS, HBase, Hive, and other ecosystem components
• Pig supports a wide range of data sources, including HDFS, HBase, Amazon S3, and relational databases. Pig
Latin provides a rich set of data manipulation operators, such as filter, group, join, and sort.
Answer b)

WE CAN USE PIG TO PROCESS THIS DATA AS FOLLOWS:


1. Load the data from the CSV file using the LOAD operator.
2. Use the FILTER operator to remove any records that contain invalid data.
3. Use the GROUP operator to group the data by student and subject.
4. Use the FOREACH operator to compute the average grade for each group.
5. Store the results in an output file using the STORE operator.

THE PIG LATIN SCRIPT FOR THIS EXAMPLE WOULD LOOK SOMETHING LIKE THIS:
-- Load the data from the CSV file, declaring the schema of each record
student_grades = LOAD 'student_grades.csv' USING PigStorage(',') AS (student:chararray, subject:chararray, grade:double);
-- Filter out any records with invalid grades
clean_data = FILTER student_grades BY grade >= 0 AND grade <= 100;
-- Group the data by student and subject
grouped_data = GROUP clean_data BY (student, subject);
-- Compute the average grade for each group
average_grades = FOREACH grouped_data GENERATE group, AVG(clean_data.grade) AS avg_grade;
-- Store the results in an output file
STORE average_grades INTO 'output';

Answer c) THE FEATURES of APACHE PIG

LESS DEVELOPMENT TIME


• Pig greatly reduces development time, which is one of its major advantages.
• This is especially true when compared with the complexity, development time, and maintenance effort of vanilla MapReduce programs.

EASY TO LEARN
• The learning curve of Apache Pig is not steep.
• This means that anyone who does not know how to write vanilla MapReduce, or even SQL, can still pick Pig up
and write MapReduce jobs with it.

PROCEDURAL LANGUAGE
• Apache Pig is a procedural language, not a declarative one like SQL, so we can easily follow the commands. It also
offers better expressiveness in the transformation of data at every step.
• Moreover, compared with vanilla MapReduce, it reads much more like the English language.
• It is also very concise; unlike Java, it feels more like Python.

DATAFLOW
• Pig is a data flow language: everything revolves around the data, even though we sacrifice control
structures such as for loops or if statements.
• Data transformation is a first-class citizen; we cannot create loops that are independent of the data.
• We always transform and manipulate the data step by step.

EASY TO CONTROL EXECUTION


• Because Pig is procedural in nature, we can control the execution of every step.
• A benefit of this is that the flow is straightforward to follow.
• It also means we can write our own UDF (User Defined Function) and inject it into one specific part of the pipeline.

UDFS
• It is possible to write our own UDFs.

LAZY EVALUATION
• As the name suggests, a Pig script is not evaluated until an output is produced, for example by a STORE or DUMP
statement.
• This is a benefit of the logical plan: the optimizer can examine the program from beginning to end and produce an
efficient plan to execute.

USAGE OF HADOOP FEATURES


• Through Pig, we can enjoy everything that Hadoop offers, such as parallelization and fault tolerance, along with
many relational-database-like features.

EFFECTIVE FOR UNSTRUCTURED DATA


• Pig is quite effective for unstructured and messy large datasets.
• Basically, Pig is one of the best tools for turning large amounts of unstructured data into structured data.

BASE PIPELINE
• When we have UDFs that we want to parallelize and apply to large amounts of data, we can use Pig as a base
pipeline that does the hard work.
• We then simply apply our UDF in the step of the pipeline where we want it.

2. Question

a) Write down the features and limitations of Hive.


b) Draw the Hive Architecture
c) Explain the components of Hive Architecture

Answer a)

FEATURES OF HIVE:
• SQL-like language: Hive provides a SQL-like language called HiveQL to query and analyze data stored in HDFS.
• Schema-on-read: Hive follows a schema-on-read approach, which allows users to store data in any format and
define the schema at the time of querying.
• Distributed computing: Hive uses Hadoop MapReduce to perform distributed computing on large datasets.
• Data processing: Hive supports various data processing operations, such as filtering, sorting, aggregation, and join,
among others.
• Extensibility: Hive is highly extensible and allows users to write their own user-defined functions (UDFs) and
custom MapReduce scripts to process data.
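
A brief HiveQL sketch illustrating the schema-on-read approach and basic processing (the table, path, and columns are assumptions):

-- The schema is declared over files that already exist at the given location;
-- it is applied when the data is read, not when it is loaded
CREATE EXTERNAL TABLE logs (ip STRING, url STRING, bytes BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';

SELECT url, COUNT(*) AS hits
FROM logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;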

LIMITATIONS OF HIVE:
• Latency: Hive is not suitable for real-time data processing as it has high latency due to its batch processing nature.
• Limited support for transactions: Hive has limited support for transactions, which makes it difficult to handle
complex data operations.
• Limited support for updates and deletes: Hive does not support updates and deletes on data, which makes it
difficult to modify data once it has been loaded.
• Limited performance optimization: Hive's performance can suffer when dealing with complex queries due to
limitations in query optimization.
• Limited support for complex data types: Hive has limited support for complex data types such as arrays, maps,
and structs, which can make it challenging to work with data that is not in a tabular format.

Answer b) Hive Architecture

Answer c) Components of Hive Architecture


• There are three core parts of Hive architecture:
o Hive Clients
o Hive Services
o Hive Storage and Computing
• Hive Clients
o Hive provides multiple drivers so that many types of applications can communicate with it. Hive supports
applications written in programming languages such as Python, C++, Java, etc.
o These clients are categorized into three types:
▪ Hive Thrift Clients
▪ Hive JDBC Driver
▪ Hive ODBC Driver
o Hive Thrift Client
▪ Because the Apache Hive server is Thrift-based, it can serve requests from all languages that
support Thrift.
o Hive JDBC Driver
▪ Apache Hive provides a JDBC driver so that Java applications can connect to it; the driver class is
org.apache.hadoop.hive.jdbc.HiveDriver.
o Hive ODBC Driver
▪ The ODBC driver enables applications that support the ODBC protocol to connect to Hive. Like the
JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
• Hive Services
o Client interactions with Hive are carried out through Hive Services. If a client wishes to perform any
Hive operation, it must communicate through Hive Services.
o Hive Services are categorized into four types:
▪ Hive CLI (Command Line Interface)
▪ Apache Hive Web Interface
▪ Hive Server
▪ Apache Hive Driver
o Hive CLI (Command Line Interface)
▪ This is a standard shell provided by Hive, where you can execute Hive queries and enter
commands directly.
o Apache Hive Web Interface
▪ In addition to the command-line interface, Hive also offers a web-based GUI for executing Hive
queries and commands.
o Hive Server
▪ The Hive server is built on Apache Thrift and is therefore also referred to as the Thrift Server; it
allows different clients to submit requests to Hive and retrieve the final result.
o Apache Hive Driver
▪ The driver receives queries of all types (ODBC, JDBC, CLI, web interface, and other client-specific
requests) submitted through Hive Services. It then processes these queries, consulting the
metastore and the file system, so that the results can be returned to the client.
• Hive Storage and Computing
o Hive services such as the metastore, the file system, and the job client in turn communicate with Hive
storage and perform the following actions:
o Metadata of the tables created in Hive is stored in the Hive metastore database.
o Query results and the data loaded into the tables are stored on HDFS in the Hadoop cluster.

3. Question

a) Explain briefly about different Hive Data Types.


b) What is the difference between Internal Table and External Table in Hive?
c) Write a Hive query to calculate the average salary of employees in each department from the
following table "employee_info":
employee_id | employee_name | department | salary
1 | John | Sales | 50000
2 | Mike | Marketing | 60000
3 | Sally | Sales | 55000
4 | Tom | Marketing | 65000
5 | Sarah | Finance | 70000
6 | Bob | Finance | 75000

Answer a) Hive Data Types


• Hive provides support for various data types to handle data of different formats and structures. Here's a brief
explanation of the different data types in Hive:
• Primitive Types: These are the basic data types that are supported by most programming languages. Hive supports
the following primitive data types:
o INT: Signed 32-bit integer
o BIGINT: Signed 64-bit integer
o FLOAT: Single-precision floating-point number
o DOUBLE: Double-precision floating-point number
o BOOLEAN: Boolean values, true or false
o STRING: A sequence of characters
• Complex Types: Hive also supports complex data types to handle data in non-tabular formats. These data types
include:
o ARRAY: An ordered collection of elements of the same data type
o MAP: An unordered collection of key-value pairs, where the keys and values can be of different data
types
o STRUCT: A collection of named fields, where each field can have a different data type
o UNION: A data type that can hold different data types in different rows.
• Date and Time Types: Hive provides support for storing and querying date and time values. These data types
include:
o DATE: Stores date values in the format 'yyyy-MM-dd'
o TIMESTAMP: Stores both date and time values in the format 'yyyy-MM-dd HH:mm:ss.SSS'
• Decimal Types: Hive provides support for decimal data types to store decimal values with high precision. These
data types include:
o DECIMAL: Stores decimal values with a specified precision and scale.
• Binary Types: Hive also provides support for binary data types to store data in binary format. These data types
include:
o BINARY: Stores binary data in a byte array format.
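
A hedged example of a table definition that uses several of these types (the table and field names are assumptions):

CREATE TABLE employee_profile (
  name STRING,
  skills ARRAY<STRING>,
  phones MAP<STRING, STRING>,
  address STRUCT<city:STRING, zip:INT>,
  joined DATE,
  salary DECIMAL(10,2)
);

-- Accessing elements of the complex types
SELECT name, skills[0], phones['home'], address.city FROM employee_profile;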

Answer b) The difference between Internal Table and External Table in Hive
• Hive Internal Tables-
o It is also known as Managed table. When we create a table in Hive, it by default manages the data. This
means that Hive moves the data into its warehouse directory.
Usage:
o We want Hive to completely manage the lifecycle of the data and table.
o Data is temporary
• Hive External Tables-
o We can also create an external table. It tells Hive to refer to the data that is at an existing location outside
the warehouse directory.
Usage:
o Data is used outside of Hive. For example, the data files are read and processed by an existing program
that does not lock the files.
o We are not creating a table based on the existing table.
• The main differences between them are as follows:

Feature | Internal Table | External Table
Data Storage Location | Stored in a default or user-specified directory | Stored in a user-specified directory or external system
Data Persistence | Data is managed and controlled by Hive | Data is not managed or controlled by Hive
Data Lifecycle | Data is deleted when the table is dropped | Data is not deleted when the table is dropped
Metadata | Hive manages metadata, including schema and statistics | Hive manages metadata, including schema and statistics
Data Movement | Data movement is not allowed between clusters or databases | Data movement is possible between clusters or databases
Access Control | Permissions are managed by Hive | Permissions are managed by the underlying storage system or external system
Data Durability | Data is stored in the Hive warehouse directory | Data can be stored anywhere, including external systems
Performance | Generally offers better performance due to internal storage | Performance may vary depending on the external storage system
Data Loading | Data loading is performed through Hive operations | Data loading can be done directly into the external location
Backup and Recovery | Data is backed up and recovered along with the Hive metadata | Data needs to be separately backed up and recovered

Answer c)
• The following Hive query can be used to calculate the average salary of employees in each department:
o SELECT department, AVG(salary) AS avg_salary
o FROM employee_info
o GROUP BY department;

department | avg_salary
Sales | 52500
Marketing | 62500
Finance | 72500
