Apache Hive Cookbook - Sample Chapter
Hive was developed by Facebook and later open sourced in the Apache community. Hive provides a SQL-like interface to run queries on Big Data frameworks. Hive provides a SQL-like syntax called Hive Query Language (HQL) that includes all SQL capabilities, such as analytical functions, which are quintessential in today's Big Data world.
This book provides easy installation steps for the different types of metastore supported by Hive. It has simple and easy-to-learn recipes for configuring Hive clients and services. You will also learn different Hive optimizations, including Partitions and Bucketing. The book also covers a source code explanation of the latest version of Hive.
Hive Query Language is used by other frameworks, including Spark. Towards the end, you will cover the integration of Hive with these frameworks.
Understand the workings and structure of Hive internals
Get an insight into the latest developments in Hive
Solve real-world problems efficiently
Hanish Bansal
Shrey Mehrotra
Saurabh Chauhan
Shrey Mehrotra has 6 years of IT experience, and for the past 4 years he has been designing and architecting cloud and big data solutions for the government and financial domains.
Having worked with big data R&D labs and Global Data and Analytical Capabilities, he has gained insights into Hadoop, focusing on HDFS, MapReduce, and YARN. His technical strengths also include Hive, Pig, Spark, Elasticsearch, Sqoop, Flume, Kafka, and Java.
He likes spending time performing R&D on different big data technologies. He is the co-author of the book Learning YARN, a certified Hadoop developer, and has also written various technical papers. In his free time, he listens to music, watches movies, and spends time with friends.
Preface
Hive is an open source big data framework in the Hadoop ecosystem. It provides an SQL-like interface to query data stored in HDFS. Under the hood, it runs MapReduce programs corresponding to the SQL query. Hive was initially developed by Facebook and later added to the Hadoop ecosystem.
Hive is currently the most preferred framework to query data in Hadoop. Because most of the historical data is stored in RDBMS data stores, including Oracle and Teradata, it is convenient for developers to run similar SQL statements in Hive to query data.
Along with simple SQL statements, Hive supports a wide variety of windowing and analytical functions, including rank, row number, dense rank, lead, and lag.
Hive is considered the de facto big data warehouse solution. It provides a number of techniques to optimize the storage and processing of terabytes or petabytes of data in a cost-effective way.
Hive can be easily integrated with a majority of other frameworks, including Spark and HBase. Hive allows developers and analysts to execute SQL on it. Hive also supports querying data stored in different formats, such as JSON.
Chapter 3, Understanding the Hive Data Model, takes you through the details of the different data types provided by Hive, which are helpful in data modeling.
Chapter 4, Hive Data Definition Language, helps you understand the syntax and semantics
of creating, altering, and dropping different objects in Hive, including databases, tables,
functions, views, indexes, and roles.
Chapter 5, Hive Data Manipulation Language, gives you a complete understanding of Hive interfaces for data manipulation. This chapter also includes some of the latest features in Hive related to CRUD operations. It explains row-level insert, update, and delete operations, available in Hive 0.14 and later versions.
Chapter 6, Hive Extensibility Features, covers a majority of the advanced concepts in Hive. This chapter explains concepts such as SerDes, Partitions, Bucketing, Windowing and Analytics, and File Formats in Hive with detailed examples.
Chapter 7, Joins and Join Optimization, gives you a detailed explanation of the types of joins supported by Hive. It also provides detailed information about the different types of join optimizations available in Hive.
Chapter 8, Statistics in Hive, shows you how to capture and analyze table-, partition-, and column-level statistics. This chapter covers the configurations and commands used to capture these statistics.
Chapter 9, Functions in Hive, gives you a detailed overview of the extensive set of inbuilt functions supported by Hive, which can be used directly in queries. This chapter also covers how to create a custom user-defined function and register it in Hive.
Chapter 10, Hive Tuning, helps you optimize complex queries to reduce throughput time. It covers different optimization techniques such as predicate pushdown, reducing the number of maps, and sampling.
Chapter 11, Hive Security, covers concepts for securing data from unauthorized access. It explains the different mechanisms of authentication and authorization that can be implemented in Hive for security purposes. In the case of critical or sensitive data, security is the first thing that needs to be considered.
Chapter 12, Hive Integration with Other Frameworks, takes you through the integration mechanisms of Hive with some other popular frameworks, such as Spark, HBase, Accumulo, and Apache Drill.
Chapter 7
table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

join_condition:
    ON equality_expression
table_reference: Is the table name or the joining table that is used in the join query. table_reference can also be a query alias.
table_factor: Is the same as table_reference. It is a table name used in a join query. It can also be a sub-query alias.
join_condition: Is the join clause that joins two or more tables based on an equality condition. The AND keyword is used when the join is based on more than one condition.
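Putting the grammar together, a minimal join over the Sales and Sales_orc tables used later in this chapter maps onto these clauses as follows (a sketch; the column names are illustrative):

-- table_factor: Sales (alias a) and Sales_orc (alias b)
-- join_condition: ON a.id = b.id
SELECT a.id, a.fname FROM Sales a JOIN Sales_orc b ON a.id = b.id;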
Getting ready
This recipe requires having Hive installed as described in the Installing Hive recipe of
Chapter 1, Developing Hive. You will also need the Hive CLI or Beeline client to run
the commands.
How to do it
Follow these steps to create a join in Hive:
SELECT a.* FROM ...
SELECT a.* FROM ...
SELECT a.* FROM ...
SELECT a.*, b.* ...
SELECT a.fname, b.lname FROM Sales a JOIN Sales_orc b ON a.id = b.id;
SELECT a.* FROM Sales a JOIN Sales_orc b ON a.id = b.id AND a.fname = b.fname;
SELECT a.fname, b.lname, c.address FROM Sales a JOIN Sales_orc b ON a.id = b.id JOIN Sales_info c ON c.id = b.id;
SELECT a.fname, b.lname, c.address FROM Sales a JOIN Sales_orc b ON a.id = b.id JOIN Sales_info c ON c.address = b.address;
How it works
First, let us see the count of records in all three tables (Sales, Sales_orc, and Sales_info) used in the preceding examples, as shown in the following screenshots:
The first statement is a simple join statement that joins two tables: Sales and Sales_orc.
This works in the same manner as in a traditional RDBMS. The output is shown in the
following screenshot:
The second statement throws an error as Hive supports only equality join conditions and not
non-equality conditions. The output is as shown next:
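A condition of the following shape, for example, would be rejected by Hive (this is an illustrative query, not one of the statements above):

-- only equality conditions are supported in the ON clause; this fails
SELECT a.* FROM Sales a JOIN Sales_orc b ON a.id > b.id;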
The fourth statement displays all the columns from both the tables Sales and Sales_orc.
The output is as shown next:
The fifth statement displays the first name from the Sales table and the last name from
the Sales_orc table. This is in comparison to the earlier statement, which displays all the
columns from both the tables. The output is as shown next:
The sixth statement shows that we can have multiple join conditions in a single join
statement separated by an AND clause just like in a traditional RDBMS. The output is as
shown next:
The seventh statement joins three tables: Sales, Sales_orc, and Sales_info. For this statement, only a single map/reduce job is run because, in Hive, if the join clauses use the same column of a table, only one map/reduce job is needed. In this example, Sales_orc uses the id column in both join clauses, so only one map/reduce job is created. The output is as shown next:
The last statement also joins three tables, but this time two map/reduce jobs are run instead of one. The first map/reduce job joins the Sales and Sales_orc tables, and its result is then joined with the Sales_info table in the second map/reduce job. The output is as shown next:
(Figure: Venn diagram of a left outer join between table 1 and table 2)
A right outer join behaves the opposite to a left outer join. In this join, all the rows from the right table and the matching rows from the left table are displayed. Rows from the left table that have no match are dropped, and NULL is displayed in place of the left table's columns for right-table rows that have no match.
A right outer join is as follows:
(Figure: Venn diagram of a right outer join between table 1 and table 2)
In a full outer join, all the rows will be displayed from both the tables. A full outer join
combines the result of the left outer join and the right outer join.
A full outer join is as follows:
(Figure: Venn diagram of a full outer join between table 1 and table 2)
The general syntax for the left/right/full outer join is as follows:
SELECT [alias1].column_name(s), [alias2].column_name(s)
FROM table_name [alias1]
LEFT/RIGHT/FULL OUTER JOIN table_name2 [alias2]
ON [alias1].column_name = [alias2].column_name;
The following clauses used in the left/right/full outer join syntax are explained here:
[alias1]: Is an optional clause. The table name can be used instead of the alias name.
[alias2]: Is an optional clause. The table name can be used instead of the alias name.
How to do it
Follow these steps to create a left/right/full outer join in Hive:
SELECT * FROM Sales a LEFT OUTER JOIN Sales_orc b ON a.id = b.id;
SELECT * FROM Sales a RIGHT OUTER JOIN Sales_orc b ON a.id = b.id;
SELECT * FROM Sales a FULL OUTER JOIN Sales_orc b ON a.id = b.id;
SELECT * FROM Sales a LEFT OUTER JOIN Sales_orc b ON a.id = b.id WHERE a.fname = 'John';
SELECT * FROM Sales a RIGHT OUTER JOIN Sales_orc b ON a.id = b.id WHERE a.fname = 'John';
How it works
The first statement is an example of a left outer join. In this example, all the rows from the Sales table and the matching rows from the Sales_orc table are displayed. Where there is no matching row in Sales_orc, NULL is displayed for its columns. The output is as shown next:
The third statement is an example of a full outer join. In this example, all the rows from the
Sales_orc table and the Sales table are displayed. Null is displayed where the joining
condition is not met. The output is as shown next:
The fourth statement first joins the two tables based on the left outer join and then filters out
the rows based on the WHERE clause. The output is as shown next:
The sixth statement first joins the two tables based on the right outer join and then filters out
the rows based on the WHERE clause. The output is as shown next:
Where:
table_reference: Is the table name or the joining table that is used in the join query. table_reference can also be a query alias.
join_condition: Is the join clause that joins two or more tables based on an equality condition. The AND keyword is used when the join is based on more than one condition.
How to do it
Run the following commands to create a left semi join in Hive:
SELECT a.* FROM Sales a LEFT SEMI JOIN Sales_orc b ON a.id = b.id;
SELECT a.*, b.* FROM Sales a LEFT SEMI JOIN Sales_orc b ON a.id = b.id;
SELECT a.* FROM Sales a LEFT SEMI JOIN Sales_orc b ON a.id = b.id WHERE b.id = 1;
How it works
The first statement returns the rows from the Sales table that have a matching id in the Sales_orc table. This statement works exactly the same as the one mentioned next:
SELECT a.* FROM Sales a WHERE a.id IN (SELECT b.id FROM Sales_orc b);
The third statement also throws an error, FAILED: SemanticException [Error 10009]: Line 1:12 Invalid table alias 'b'. As mentioned earlier, in a left semi join, the right-hand side table cannot be used in a WHERE clause. The output of the query is shown next:
Where:
table_reference: Is the table name or the joining table that is used in the join query. table_reference can also be a query alias.
join_condition: Is the join clause that joins two or more tables based on an equality condition. The AND keyword is used when the join is based on more than one condition.
How to do it
Cross joins can be implemented using either the JOIN keyword or the CROSS JOIN keyword. If the CROSS keyword is not specified and no join condition is given, a cross join is applied by default.
The following are examples of using cross joins on tables:
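The statements below are purely illustrative; the filter value and the Location join key are assumptions rather than the book's exact statements:

SELECT * FROM Sales a JOIN Sales_orc b;
SELECT * FROM Sales a JOIN Sales_orc b WHERE a.id = 1;
SELECT * FROM Sales a CROSS JOIN Sales_orc b;
SELECT * FROM Sales a CROSS JOIN Sales_orc b JOIN Location c ON a.id = c.id;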
How it works
The first statement pairs all rows from one table with the rows of another table. The output of
the query is shown next:
The second statement takes as much time in execution as the one in the first example, even
though the result set is filtered out with the help of the WHERE clause. This means that the
cross join is processed first, then the WHERE clause. The output of the query is shown next:
We can also use the CROSS keyword for CROSS joins. The third statement gives the same
result as the one in the first example. The output of the query is shown next:
We can also club multiple join clauses into a single statement, as shown in the fourth statement. In this example, the cross join is first performed between the Sales and Sales_orc tables, and the result set is then joined with the Location table. The output of the query is shown next:
How to do it
There are two ways of using map-side joins in Hive.
One is to use the /*+ MAPJOIN(<table_name>)*/ hint just after the select keyword.
table_name has to be the table that is smaller in size. This is the old way of using
map-side joins.
The other way of using a map-side join is to set the following property to true and then run
a join query:
set hive.auto.convert.join=true;
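As a sketch of both approaches (assuming Sales_orc is the smaller of the two tables):

-- hint-based form: Sales_orc is loaded into memory on each mapper
SELECT /*+ MAPJOIN(Sales_orc) */ a.fname, b.lname FROM Sales a JOIN Sales_orc b ON a.id = b.id;

-- property-based form: the same join is automatically converted to a map join
set hive.auto.convert.join=true;
SELECT a.fname, b.lname FROM Sales a JOIN Sales_orc b ON a.id = b.id;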
How it works
Let us first run the set hive.auto.convert.join=true; command on the Hive shell.
The output of this command is shown next:
The first statement uses the MAPJOIN hint to optimize the execution time of the query. In this
example, the Sales_orc table is smaller compared to the Sales table. The output of the
first statement is shown in the following screenshot. The highlighted statement shows that
there are no reducers used while processing this query. The total time taken by this query is
40 seconds:
The second statement does not use the MAPJOIN hint. In this case, the property hive.auto.convert.join is set to true. With this property set, all eligible join queries are treated as map join queries, whereas the hint applies only to a specific query:
Now, let us run the set hive.auto.convert.join=false; command on the Hive shell and run the second statement. The output of the second command is shown next:
In this type of join, not only must the tables be bucketed, but the data must also be bucketed while it is inserted. For this, the following property needs to be set before inserting the data:
set hive.enforce.bucketing = true;
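A minimal end-to-end sketch of this (the column list, bucket counts, and the Sales_staging source table are illustrative assumptions):

-- Sales is bucketed into a multiple of the number of buckets of Sales_orc
CREATE TABLE Sales (id INT, fname STRING, lname STRING) CLUSTERED BY (id) INTO 4 BUCKETS;
CREATE TABLE Sales_orc (id INT, fname STRING, lname STRING) CLUSTERED BY (id) INTO 2 BUCKETS STORED AS ORC;

-- enforce bucketing while loading data into the bucketed table
set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE Sales SELECT * FROM Sales_staging;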
The general syntax for a bucket map join is as follows:
SELECT /*+ MAPJOIN(table2) */ column1, column2, column3
FROM table1 [alias_name1] JOIN table2 [alias_name2]
ON [alias_name1].key = [alias_name2].key;
Getting ready
This recipe requires having Hive installed as described in the Installing Hive recipe of
Chapter 1, Developing Hive. You will also need the Hive CLI or Beeline client to run
the commands.
How to do it
Follow these steps to use a bucket map join in Hive:
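As an illustrative sketch (assuming both tables are bucketed on the id column, bucket map join optimization is enabled, and the Location join key is an assumption), the queries discussed next look similar to the following:

-- enable bucket map join optimization
set hive.optimize.bucketmapjoin = true;

SELECT /*+ MAPJOIN(Sales_orc) */ a.* FROM Sales a JOIN Sales_orc b ON a.id = b.id;
SELECT /*+ MAPJOIN(Sales_orc, Location) */ a.* FROM Sales a JOIN Sales_orc b ON a.id = b.id JOIN Location c ON a.id = c.id;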
How it works
In the first statement, Sales_orc has less data compared to the Sales table. The Sales table has its number of buckets as a multiple of the number of buckets of Sales_orc. Only the matching buckets are replicated onto each mapper.
The second statement works in the same manner as the first one. The only difference is that the preceding statement joins more than two tables. The Sales_orc buckets and Location buckets are fetched, or replicated, onto the mappers of the Sales table, so the joins are performed on the mapper side only.
Getting ready
This recipe requires having Hive installed as described in the Installing Hive recipe of
Chapter 1, Developing Hive. You will also need the Hive CLI or Beeline client to run
the commands.
154
Chapter 7
How to do it
Follow these steps to use a bucket sort merge map join in Hive:
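As an illustrative sketch (assuming both tables are bucketed and sorted on the id column, and using the commonly required SMB join properties):

-- enable sort merge bucket (SMB) map join
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT /*+ MAPJOIN(Sales_orc) */ a.* FROM Sales a JOIN Sales_orc b ON a.id = b.id;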
How it works
In the first statement, Sales_orc has the same number of buckets as the Sales table, and both tables are bucketed and sorted on the join column. Each mapper reads a bucket from the Sales table and the corresponding bucket from the Sales_orc table, and then performs a bucket sort merge map join.
The second statement works in the same manner as the first one. The only difference is that
in the preceding statement there is a join on more than two tables.
How to do it
Run the following command to use a skew join in Hive:
SELECT a.* FROM Sales a JOIN Sales_orc b ON a.id = b.id;
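For the skew join optimization to take effect, the relevant properties must be enabled before running the query; a minimal sketch (the threshold shown is Hive's default and is illustrative here):

set hive.optimize.skewjoin = true;
-- keys appearing more often than this threshold are treated as skewed
set hive.skewjoin.key = 100000;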
How it works
Let us suppose that there are two tables, Sales and Sales_orc, as shown next:
There is a join that needs to be performed on the ID column, which is present in both tables. The Sales table has a column ID that is highly skewed on the value 10; that is, the value 10 appears far more often than the other values of that column. The Sales_orc table also has rows with the value 10 for the ID column, but not as many as the Sales table. Considering this, the Sales_orc table is read first and the rows with ID = 10 are stored in an in-memory hash table. Once this is done, a set of mappers reads the Sales table, the rows having ID = 10 are compared against the hash table built from the Sales_orc table, and the partial output is computed at the mapper itself; no data needs to go to the reducer, which improves performance drastically.
In this way, we end up reading Sales_orc only twice. The skewed keys in Sales are read and processed by the mapper only, and are not sent to the reducer. The rest of the keys in Sales go through only a single map/reduce job. The assumption is that Sales_orc has few rows with keys that are skewed in Sales, so these rows can be loaded into memory.