CH 6 BDA
PECAIML601A
CHAPTER-6
1 MARK QUESTIONS
• PigStorage is the default load (and store) function in Pig. We use PigStorage whenever we want to load data from a file system into Pig.
• While loading data with PigStorage, we can also specify the delimiter of the data (how the fields in each record are separated), as well as the schema of the data along with the type of each field.
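For instance, a minimal sketch (the file name 'students.txt' and its schema are hypothetical):
-- load a comma-delimited file, declaring field names and types
students = LOAD 'students.txt' USING PigStorage(',') AS (id:int, name:chararray, gpa:double);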
• Pig Latin is a procedural language while SQL is a declarative language. Additionally, Pig Latin is designed for
processing large datasets, while SQL is designed for working with structured data.
• The main purpose of PIG in Big Data Analytics is to provide a high-level language for processing large datasets on
Apache Hadoop.
• The FOREACH statement in Pig Latin is used to apply a transformation to each tuple in a relation.
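For example, reusing the hypothetical 'students' relation from the sketch above:
-- emit one output tuple per input tuple, projecting name and a derived field
report = FOREACH students GENERATE name, gpa * 25.0 AS percentage;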
7. What is HIVE?
• Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was originally developed by Facebook.
• Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, written in HQL (Hive Query Language), which are internally converted into MapReduce jobs.
• Using Hive, we can avoid the traditional approach of writing complex MapReduce programs by hand. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDFs).
• A metastore in Hive is a database that stores metadata about the data stored in Hadoop, such as schema
information, table partitions, and column statistics.
• The EXPLAIN statement in Hive is used to analyze the execution plan for a query and identify any performance
bottlenecks or optimization opportunities.
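For example, prefixing any query with EXPLAIN prints its execution plan instead of running it (the employee_info table here is the one queried in the 15-mark answers below):
EXPLAIN SELECT department, AVG(salary) AS avg_salary FROM employee_info GROUP BY department;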
10. What is the use of partitioning in Hive?
• Partitioning in Hive is used to divide data into smaller, more manageable parts based on a specified key or set of
keys. This can improve query performance by reducing the amount of data that needs to be processed.
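A hedged HiveQL sketch (the table and column names are hypothetical):
-- each distinct value of the partition key gets its own directory on HDFS
CREATE TABLE sales (id INT, amount DOUBLE) PARTITIONED BY (sale_country STRING);
-- a filter on the partition key lets Hive skip every other partition
SELECT SUM(amount) FROM sales WHERE sale_country = 'IN';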
5 MARKS QUESTIONS
1. Compare and contrast Hive and Pig. What are the key differences between the two tools?
• Hive and Pig are two popular data processing tools in the Hadoop ecosystem, but they have different approaches to
data processing and different use cases. Here are the key differences between Hive and Pig:
[1] LANGUAGE:
• Hive uses a SQL-like language called HiveQL, which makes it easy for users with SQL knowledge to work with Hive. Pig, on the other hand, uses a high-level scripting language called Pig Latin, which reads more like a procedural programming language.
[4] PERFORMANCE:
• Hive is generally slower than Pig for iterative data processing tasks because it is optimized for SQL-like queries, which
are not well-suited for iterative processing. Pig is designed for iterative processing and is often faster than Hive for
these types of tasks.
Pig | Hive
Pig operates on the client side of a cluster. | Hive operates on the server side of a cluster.
It is used to handle structured and semi-structured data. | It is mainly used to handle structured data.
Pig scripts end with the .pig extension. | Hive scripts can use any extension (commonly .hql).
It does not support partitioning. | It supports partitioning.
It supports the Avro file format natively via AvroStorage. | Hive reads Avro through its AvroSerDe.
Pig is suitable for complex and nested data structures. | Hive is suitable for batch-processing OLAP systems.
Pig does not require a schema to store data. | Hive requires a schema for data insertion into tables.
It is very easy to write UDFs to calculate metrics. | Hive supports UDFs, but they are much harder to debug.
2. Give an example of a Pig Latin script to join two datasets and explain how it works.
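• An illustrative sketch, since the file names, schemas, and join key below are all hypothetical:
-- load the two datasets, declaring their schemas
customers = LOAD 'customers.csv' USING PigStorage(',') AS (cust_id:int, name:chararray);
orders = LOAD 'orders.csv' USING PigStorage(',') AS (order_id:int, cust_id:int, amount:double);
-- inner join the relations on the shared key
joined = JOIN customers BY cust_id, orders BY cust_id;
DUMP joined;
• Pig compiles the JOIN into MapReduce: the map phase tags each tuple with its join key, and the reduce phase combines tuples from both relations that share the same key.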
3. What is Pig Latin and how does it differ from SQL in the context of big data analysis?
• Pig Latin is a high-level scripting language used in the Hadoop ecosystem for analyzing large datasets. It is used with
Apache Pig, a platform that allows users to write data processing pipelines and execute them on Hadoop clusters.
Pig Latin provides a set of operators and functions for performing data transformations, filtering, aggregation, and
joining operations.
• In the context of big data analysis, Pig Latin differs from SQL in several ways:
Aspect | Pig Latin | SQL
Data Processing Model | Data-flow programming model | Relational algebra and set-oriented approach
Data Manipulation | Schema-less, semi-structured data manipulation | Structured data manipulation
Data Types | Supports complex data types (e.g., nested tuples, bags, maps) | Primarily designed for structured data types
Ecosystem Integration | Part of the Hadoop ecosystem | Widely supported across database systems
Performance Optimization | Data pipelining for parallel execution | Query optimization for efficient execution
Custom Functions | User-defined functions (UDFs) for complex operations | User-defined functions (UDFs) for custom operations
Data Sources | Hadoop Distributed File System (HDFS), HBase, etc. | Various databases, including relational DBs
Tool Compatibility | Integrates well with other Hadoop tools (e.g., Hive) | Native querying language for most databases
Query Expressiveness | Powerful for complex data transformations and joins | Well-suited for structured data querying
Learning Curve | Steeper learning curve due to procedural nature | Familiar syntax for developers and analysts
Community Support | Active open-source community and documentation | Widespread support and extensive resources
• Pig is an excellent tool for processing unstructured data in Hadoop, since Pig Latin can operate on data without a predefined schema.
5. Compare and contrast Hive and Pig. What are the key differences between the two tools?
• Hive and Pig are both data processing tools that are part of the Apache Hadoop ecosystem, but they have some
key differences in terms of their purpose, language, and usage.
• Hive is primarily designed for data warehousing and SQL-like querying. It allows users to write SQL-like queries,
known as Hive Query Language (HQL), which are translated into MapReduce jobs or executed by more recent
query engines like Apache Tez or Apache Spark. Hive is well-suited for structured and semi-structured data
analysis and is often used by analysts and SQL-savvy users.
• On the other hand, Pig is a data scripting tool that uses a procedural language called Pig Latin. It enables users
to write data transformations and analysis in a procedural and data flow-oriented manner. Pig is suitable for
processing and analyzing semi-structured and unstructured data, and it is often used by developers and data
engineers who prefer a more programmatic approach.
• Here's a comparison highlighting the key differences between Hive and Pig:
Aspect | Hive | Pig
Learning Curve | Familiar SQL-like syntax | Procedural language with a steeper learning curve
Community Support | Active open-source community and documentation | Active open-source community and documentation
15 MARKS QUESTIONS
1. Question
THE PIG LATIN SCRIPT FOR THIS EXAMPLE WOULD LOOK SOMETHING LIKE THIS:
-- Load the data from the CSV file; the schema (student, subject, grade) is assumed from context
student_grades = LOAD 'student_grades.csv' USING PigStorage(',')
    AS (student:chararray, subject:chararray, grade:int);
-- Filter out any records with invalid grades
clean_data = FILTER student_grades BY grade >= 0 AND grade <= 100;
-- Group the data by student and subject
grouped_data = GROUP clean_data BY (student, subject);
-- Compute the average grade for each group
average_grades = FOREACH grouped_data GENERATE group, AVG(clean_data.grade);
-- Store the results in an output directory
STORE average_grades INTO 'output';
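To run the script (assuming it is saved as, say, average_grades.pig):
pig average_grades.pig
The STORE statement writes the results as part files inside the 'output' directory on HDFS.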
EASY TO LEARN
• The learning curve of Apache Pig is not steep.
• That means anyone who does not know how to write vanilla MapReduce code, or SQL for that matter, can pick Pig up and write MapReduce jobs with it.
PROCEDURAL LANGUAGE
• Apache Pig is a procedural language, not a declarative one like SQL. Hence, the commands are easy to follow, and each step of the data transformation is expressed explicitly.
• Moreover, compared to vanilla MapReduce, it reads much more like plain English.
• In addition, it is very concise: less like Java, more like Python.
DATAFLOW
• Pig Latin is a data-flow language: everything revolves around the data, even though we sacrifice control structures such as for-loops and if-statements.
• Data transformation is a first-class citizen; we cannot write a loop that is not driven by data.
• Every statement transforms or manipulates data.
UDFS
• It is possible to write our own UDFs (for example in Java) and call them from Pig Latin, as sketched below.
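A hedged sketch of registering and calling a Java UDF (the jar, class, and data file are hypothetical):
-- make the jar containing the UDF visible to Pig and give the class a short alias
REGISTER 'myudfs.jar';
DEFINE ToUpper com.example.pig.ToUpper();
names = LOAD 'names.txt' AS (name:chararray);
-- invoke the UDF once per tuple
upper_names = FOREACH names GENERATE ToUpper(name);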
LAZY EVALUATION
• As the name suggests, nothing is evaluated until an output is actually requested, for example by a STORE or DUMP statement.
• This is a benefit of building a logical plan first: Pig can optimize the program from beginning to end, and the optimizer can produce an efficient execution plan.
BASE PIPELINE
• When we have UDFs that we want to parallelize over large amounts of data, we can use Pig as the base pipeline that does the hard work.
• We then just apply our UDF at the step where we want it.
2. Question
Answer a)
FEATURES OF HIVE:
• SQL-like language: Hive provides a SQL-like language called HiveQL to query and analyze data stored in HDFS.
• Schema-on-read: Hive follows a schema-on-read approach, which allows users to store data in any format and define the schema at the time of querying (see the sketch after this list).
• Distributed computing: Hive uses Hadoop MapReduce to perform distributed computing on large datasets.
• Data processing: Hive supports various data processing operations, such as filtering, sorting, aggregation, and join,
among others.
• Extensibility: Hive is highly extensible and allows users to write their own user-defined functions (UDFs) and
custom MapReduce scripts to process data.
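A hedged illustration of schema-on-read (the table, columns, and path are hypothetical): the files already sit in HDFS, and the DDL merely projects a schema onto them; nothing is validated or rewritten until the data is read.
CREATE EXTERNAL TABLE raw_events (ts STRING, user_id INT, action STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw_events';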
LIMITATIONS OF HIVE:
• Latency: Hive is not suitable for real-time data processing as it has high latency due to its batch processing nature.
• Limited support for transactions: Hive has limited support for transactions, which makes it difficult to handle
complex data operations.
• Limited support for updates and deletes: Hive does not support updates and deletes on data, which makes it
difficult to modify data once it has been loaded.
• Limited performance optimization: Hive's performance can suffer when dealing with complex queries due to
limitations in query optimization.
• Limited support for complex data types: Hive has limited support for complex data types such as arrays, maps,
and structs, which can make it challenging to work with data that is not in a tabular format.
3. Question
Answer b) The difference between Internal Table and External Table in Hive
• Hive Internal Tables-
o It is also known as a managed table. When we create a table in Hive, Hive manages the data by default. This means that Hive moves the data into its warehouse directory.
Usage:
o We want Hive to completely manage the lifecycle of the data and table.
o Data is temporary
• Hive External Tables-
o We can also create an external table. It tells Hive to refer to the data that is at an existing location outside
the warehouse directory.
Usage:
o Data is used outside of Hive. For example, the data files are read and processed by an existing program
that does not lock the files.
o We are not creating a table based on the existing table.
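A hedged HiveQL sketch contrasting the two (the table names and path are hypothetical):
-- managed: Hive stores and owns the data under its warehouse directory
CREATE TABLE logs_managed (ts STRING, msg STRING);
-- external: Hive only records the location; the files stay where they are
CREATE EXTERNAL TABLE logs_external (ts STRING, msg STRING)
LOCATION '/data/logs';
Dropping logs_managed deletes its data; dropping logs_external leaves the files in /data/logs untouched.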
• The main differences between them are as follows:
Aspect | Internal (Managed) Table | External Table
Data location | Hive's warehouse directory | A user-specified location outside the warehouse
DROP TABLE | Deletes both the metadata and the data | Deletes only the metadata; the data files remain
Answer c)
• The following Hive query can be used to calculate the average salary of employees in each department:
SELECT department, AVG(salary) AS avg_salary
FROM employee_info
GROUP BY department;
department | avg_salary
Sales | 52500
Marketing | 62500
Finance | 72500
Finance 72500