Big Data Lab Manual and Syllabus
Course Outcomes:
1. Understand the Hadoop working environment.
2. Implement commands to transfer files from the local file system to HDFS.
3. Apply MapReduce programs to real-world problems.
4. Implement scripts using Pig to solve real-world problems.
5. Analyze datasets using Hive queries.
Textbook(s)
1. Big Data Black Book by DT Editorial Services, Dreamtech Publications, 2016.
2. Learning PySpark by Tomasz Drabas and Denny Lee, Packt Publishing, 2017.
3. Tom White, "Hadoop: The Definitive Guide", 3/e, 4/e, O'Reilly, 2015.
Reference Book(s)
1. Bill Franks, Taming the Big Data Tidal Wave, 1/e, Wiley, 2012.
2. Frank J. Ohlhorst, Big Data Analytics, 1/e, Wiley, 2012.
Lab Manual for Big Data
Prerequisite Test
=============================
# install Java 8 (required by Hadoop)
sudo apt update
sudo apt install openjdk-8-jdk -y
java -version; javac -version
# install the OpenSSH server and client
sudo apt install openssh-server openssh-client -y
# create a dedicated Hadoop user and switch to it
sudo adduser hdoop
su - hdoop
# set up passwordless SSH to localhost
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost
Downloading Hadoop (note: the download link below was updated to a newer Hadoop version on 6th May 2022)
===============================
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
tar xzf hadoop-3.2.3.tar.gz
Editing 6 important files
=================================
1st file
===========================
sudo nano .bashrc
# Note: this may fail with a message saying hdoop is not a sudo user.
# If so, switch back to your original sudo user (aman in this example) and add hdoop to the sudo group:
su - aman
sudo adduser hdoop sudo
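The lines to append to .bashrc are not listed in this manual; the following are the typical Hadoop environment variables for this kind of single-node setup (a sketch that assumes Hadoop 3.2.3 was extracted to /home/hdoop/hadoop-3.2.3; adjust the path to your own extraction directory). After saving, run source ~/.bashrc.
# typical Hadoop environment variables (the HADOOP_HOME path is an assumption)
export HADOOP_HOME=/home/hdoop/hadoop-3.2.3
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"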
2nd File
============================
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
#Add the following line at the end of this file
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
3rd File
===============================
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
#Add the following lines to this file (between <configuration> and </configuration>)
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system.</description>
</property>
4th File
====================================
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
#Add the following lines to this file (between <configuration> and </configuration>)
<property>
<name>dfs.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
5th File
================================================
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
#Add the following lines to this file (between <configuration> and </configuration>)
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
6th File
==================================================
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
#Add the following lines to this file (between <configuration> and </configuration>)
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
Launching Hadoop
hdfs namenode -format
./start-dfs.sh
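Since YARN was configured above, you will typically also start the YARN daemons and check that everything is running (a brief sketch; both scripts live in $HADOOP_HOME/sbin):
./start-yarn.sh   # starts the ResourceManager and NodeManager
jps               # should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager
Once the daemons are up, the NameNode web UI is available at http://localhost:9870 and the YARN ResourceManager UI at http://localhost:8088.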
Reference: https://www.youtube.com/watch?v=Ih5cuJYYz6Y
1.1 Hadoop Installation (Windows)
Cloudera QuickStart VM Installation
Step 1: Download and install VMware Workstation Player.
Step 2: Download the Cloudera QuickStart VM.
Step 3: Open the Cloudera QuickStart VM in VMware Workstation Player.
(NOTE: It takes time to load, depending on system performance.)
Cloudera Exploration: Hue provides a GUI-based query editor.
2. Hadoop Commands
Hadoop is an open-source distributed framework used to store and process large datasets. To
store data, Hadoop uses HDFS, and to process data, it uses MapReduce and YARN.
hadoop fs and hdfs dfs are the file system commands used to interact with HDFS.
These commands are very similar to Unix commands.
o Create a text file in your local machine and write some text into it.
$ nano data.txt
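The full HDFS command listing for this exercise is not reproduced here; a typical sequence for copying data.txt into HDFS and inspecting it looks like the following (the HDFS paths are placeholders, adjust them to your environment):
hdfs dfs -mkdir -p /user/cloudera/input        # create a directory in HDFS
hdfs dfs -put data.txt /user/cloudera/input    # copy the local file into HDFS
hdfs dfs -ls /user/cloudera/input              # list the directory contents
hdfs dfs -cat /user/cloudera/input/data.txt    # print the file stored in HDFS
hdfs dfs -get /user/cloudera/input/data.txt copy.txt   # copy it back to the local file system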
WC_Mapper.java
package com.javatpoint;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,
Text,IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()){
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
WC_Reducer.java
package com.javatpoint;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable>{
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException{
int sum = 0;
while (values.hasNext()){
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
WC_Runner.java
package com.javatpoint;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WC_Runner {
public static void main(String[] args) throws IOException{
JobConf conf = new JobConf(WC_Runner.class);
conf.setJobName("WordCount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(WC_Mapper.class);
conf.setCombinerClass(WC_Reducer.class);
conf.setReducerClass(WC_Reducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
}
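Once the three classes are compiled into a jar, the job is typically submitted to Hadoop as follows (the jar name and HDFS paths are placeholders; the input directory is the one data.txt was copied to above):
hadoop jar wordcount.jar com.javatpoint.WC_Runner /user/cloudera/input /user/cloudera/wc_output
hdfs dfs -cat /user/cloudera/wc_output/part-00000    # view the word counts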
Hive Basics
Commonly used commands include USE <database> and TRUNCATE TABLE <table>.
Before moving forward, note that Hive commands are case-insensitive:
CREATE DATABASE is the same as create database.
1) Open Hue -> Under Query -> Choose Editor -> Click on Hive
Note: To execute a query, select the row and click on the Run button.
2) To check the available databases (Hive contains a default database):
SHOW DATABASES;
5) DESCRIBE <table name>;
6) DROP <database or table name>;
7) Before creating a table, check the following HDFS path:
/user/hive/warehouse/
(ii) External Table
Example data file: /home/cloudera/hive_practice/sample.txt
For an internal (managed) table, the default storage path is /user/hive/warehouse/…
For an external table, the data location is given explicitly when the table is created:
create external table <table name> (abc STRING) location '/user/cloudera/<folder name>/<table name>';
To drop a table: DROP TABLE <table name>;
Note: Observe the two table paths, i.e., the managed table path and the external table path. Dropping a managed table removes its data from the warehouse, while dropping an external table leaves the data in place.
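A minimal HiveQL sketch of the two table types (the table names, column, and file path here are placeholders, not necessarily the ones used in the lab session):
-- managed (internal) table: data is stored under /user/hive/warehouse/
CREATE TABLE managed_demo (line STRING);
LOAD DATA LOCAL INPATH '/home/cloudera/hive_practice/sample.txt' INTO TABLE managed_demo;
-- external table: data stays at the HDFS location given in the LOCATION clause
CREATE EXTERNAL TABLE external_demo (line STRING) LOCATION '/user/cloudera/hive_practice/external_demo';
-- dropping a managed table deletes its data from the warehouse; dropping an external table leaves the HDFS data in place
DROP TABLE managed_demo;
DROP TABLE external_demo;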
5. Implement matrix multiplication with Hadoop Map Reduce.
Mapper Logic:
import java.io.IOException;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
public class MatrixMapper extends Mapper<LongWritable,Text,Text,Text>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
Configuration conf = context.getConfiguration();
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
//each input line has the form: matrixName,row,column,value
String[] indicesAndValue = value.toString().split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("M")){
//M(i,j) is needed for every output cell (i,k), k = 0..p-1
for (int k = 0; k < p; k++){
outputKey.set(indicesAndValue[1] + "," + k);
outputValue.set("M," + indicesAndValue[2] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
} else {
//N(j,k) is needed for every output cell (i,k), i = 0..m-1
for (int i = 0; i < m; i++){
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("N," + indicesAndValue[1] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}
}
}
Reducer Logic :
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
public class MatrixReducer extends Reducer<Text,Text,Text,Text>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
String[] value;
//hashA holds M(i,j) values indexed by j, hashB holds N(j,k) values indexed by j
HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
for (Text val : values){
value = val.toString().split(",");
if (value[0].equals("M")){
hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
} else {
hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
}
}
//result(i,k) = sum over j of M(i,j) * N(j,k)
int n = Integer.parseInt(context.getConfiguration().get("n"));
float result = 0.0f;
for (int j = 0; j < n; j++){
float a = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
float b = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
result += a * b;
}
if (result != 0.0f)
context.write(null, new Text(key.toString() + "," + Float.toString(result)));
}
}
Driver Logic :
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class MatrixDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
//dimensions of the matrices: M is m x n and N is n x p
conf.set("m", "2");
conf.set("n", "2");
conf.set("p", "2");
Job job = Job.getInstance(conf, "MatrixMultiplication");
job.setJarByClass(MatrixDriver.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MatrixMapper.class);
job.setReducerClass(MatrixReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.submit();
}
}
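The mapper above expects each input line in the form matrixName,row,column,value. For the 2 x 2 matrices configured in the driver, an illustrative input file and run command would look roughly like this (the values, jar name and HDFS paths are placeholders):
M,0,0,1
M,0,1,2
M,1,0,3
M,1,1,4
N,0,0,5
N,0,1,6
N,1,0,7
N,1,1,8
hadoop jar matrixmul.jar MatrixDriver /user/cloudera/matrix_input /user/cloudera/matrix_output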
7. Given the following table schema:
Employee_table {ID: INT, Name: VARCHAR(10), Age: INT, Salary: INT}
Loan_table {LoanID: INT, ID: INT, Loan_applied: BOOLEAN, Loan_amt: INT}
a. Create a database and the above tables in Hive.
b. Insert records into the tables.
c. Write an SQL query to retrieve the details of the employees who have applied for a loan (a sample sketch follows below).
Display the Employee ID, Name, Department Name, Job ID, and Salary of all employees.
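A sample solution sketch for parts (a)-(c); the database name and the inserted records are illustrative only, and INSERT ... VALUES assumes Hive 0.14 or later:
CREATE DATABASE IF NOT EXISTS company;
USE company;
CREATE TABLE Employee_table (ID INT, Name VARCHAR(10), Age INT, Salary INT);
CREATE TABLE Loan_table (LoanID INT, ID INT, Loan_applied BOOLEAN, Loan_amt INT);
INSERT INTO TABLE Employee_table VALUES (1, 'Ravi', 30, 40000), (2, 'Sita', 28, 55000);
INSERT INTO TABLE Loan_table VALUES (101, 1, true, 200000), (102, 2, false, 0);
-- employee details of those who have applied for a loan
SELECT e.* FROM Employee_table e JOIN Loan_table l ON e.ID = l.ID WHERE l.Loan_applied = true;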
9. Given the following tables:
CUSTOMERS
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
ORDERS
+-----+---------------------+-------------+--------+
| OID | DATE                | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
Create the above tables in Hive and insert transaction records into them. Write an SQL query to find the details of the customers who have made an order (a sample sketch follows below).
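One possible sketch (the column types and the example rows are assumptions; the DATE column is escaped with backticks because DATE is a type keyword in Hive):
CREATE TABLE customers (id INT, name STRING, age INT, address STRING, salary FLOAT);
CREATE TABLE orders (oid INT, `date` STRING, customer_id INT, amount FLOAT);
INSERT INTO TABLE customers VALUES (1, 'Ramesh', 32, 'Ahmedabad', 2000.0), (2, 'Khilan', 25, 'Delhi', 1500.0);
INSERT INTO TABLE orders VALUES (102, '2009-10-08', 1, 3000.0);
-- details of the customers who have made at least one order
SELECT DISTINCT c.* FROM customers c JOIN orders o ON c.id = o.customer_id;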
10. Understanding Spark
A Spark program is written in terms of operations on RDDs. An RDD is
partitioned and distributed across the cluster nodes, and can be stored on disk or in
memory. RDDs are manipulated using a set of parallel transformations and
actions. Spark keeps track of how each RDD was created (its lineage) so that it can be rebuilt
in case of job failures or slow workers.
What is an RDD?
An RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark.
Each dataset in an RDD is divided into logical partitions, which may be
computed on different nodes of the cluster.
The Spark context (sc) is used to create RDDs.
RDDs are immutable, i.e., once created they cannot be changed; any change
to an RDD produces a new RDD.
# create an RDD from a Python list: parallelize(list, #partitions)
rdd2 = sc.parallelize(range(5, 20))
type(rdd2)
<class 'pyspark.rdd.RDD'>
rdd2.collect()
out[]: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
Operations on RDDs:
Actions: execute the transformations and get the data from the workers to the driver.
1 first()
2 take(n)
3 collect()
4 count()
5 reduce(func)
7 takeOrdered(n, [ordering])
#first action returns the first element of the RDD
rdd2.first()
Out[25]: 5
#collect action will get all the elements of the RDD to the driver
rdd2.collect()
out[]: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
#count action returns the number of elements in the RDD
rdd2.count()
Out[]: 15
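The remaining actions in the list can be tried the same way; a short sketch (the values in the comments are what these calls return for rdd2 above):
rdd2.take(3)                           # [5, 6, 7] - the first three elements
rdd2.reduce(lambda a, b: a + b)        # 180 - the sum of 5..19
rdd2.takeOrdered(3, key=lambda x: -x)  # [19, 18, 17] - the three largest elements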
S.No Transformations
1 distinct([numTasks])
2 filter(func)
3 map(func)
4 flatMap(func)
5 union(otherDataset)
6 intersection(otherDataset)
7 groupByKey([numTasks])
8 reduceByKey(func, [numTasks])
9 join(otherDataset, [numTasks])
10 cartesian(otherDataset)
# create a list
lst=[('modi','pm'),('Babu','cm'),
('Naidu','vp'),('kcr','cm'),
('narasimham','governor'),
('uma','minister'),
('ktr','minister'),('Babu','cm'),
('narasimham','governor')]
# convert the list to an RDD
rdd_lst = sc.parallelize(lst)
rdd_lst.collect()
Out[3]: [('modi', 'pm'),
('Babu', 'cm'),
('Naidu', 'vp'),
('kcr', 'cm'),
('narasimham', 'governor'),
('uma', 'minister'),
('ktr', 'minister'),
('Babu', 'cm'),
('narasimham', 'governor')]
# distinct returns a new RDD containing only the distinct elements
distinct_rddlist = rdd_lst.distinct()
distinct_rddlist.collect()
Out[7]: [('uma', 'minister'),
('kcr', 'cm'),
('Babu', 'cm'),
('Naidu', 'vp'),
('modi', 'pm'),
('ktr', 'minister'),
('narasimham', 'governor')]
# filter returns a new RDD with only the elements that satisfy the predicate; here, pairs whose value is 'cm'
filter_rddlist = rdd_lst.filter(lambda x: x[1] == 'cm')
filter_rddlist.collect()
Out[9]: [('Babu', 'cm'),
('kcr', 'cm'),
('Babu', 'cm')]
# map applies a function to every element; here each (name, designation) pair is swapped to (designation, name)
map_rddlist = rdd_lst.map(lambda x: (x[1], x[0]))
map_rddlist.collect()
Out[30]: [('pm', 'modi'),
('cm', 'Babu'),
('vp', 'Naidu'),
('cm', 'kcr'),
('governor', 'narasimham'),
('minister', 'uma'),
('minister', 'ktr'),
('cm', 'Babu'),
('governor', 'narasimham')]
sortbykey_rdd = map_rddlist.sortByKey()
sortbykey_rdd.collect()
Out[60]: [('cm', 'Babu'),
('cm', 'kcr'),
('cm', 'Babu'),
('governor', 'narasimham'),
('governor', 'narasimham'),
('minister', 'uma'),
('minister', 'ktr'),
('pm', 'modi'),
('vp', 'Naidu')]
#Group the values for each key in the RDD into a single sequence
groupbykey_rdd = map_rddlist.groupByKey()
groupbykey_rdd.collect()
Out[62]: [('vp', <pyspark.resultiterable.ResultIterable at
0x7f75041de450>), ('minister', <pyspark.resultiterable.ResultIterable at
0x7f75040fe810>), ('governor', <pyspark.resultiterable.ResultIterable at
0x7f75040fe990>), ('cm', <pyspark.resultiterable.ResultIterable at
0x7f75040fe850>), ('pm', <pyspark.resultiterable.ResultIterable at
0x7f75040fe9d0>)]
for x in groupbykey_rdd.collect():
print x[0], list(x[1])
vp ['Naidu']
minister ['uma', 'ktr']
governor ['narasimham', 'narasimham']
cm ['Babu', 'kcr', 'Babu']
pm ['modi']
rdd1 = sc.parallelize(range(5, 10))
rdd1.collect()
Out[70]: [5, 6, 7, 8, 9]
rdd2 = sc.parallelize(range(8, 15))
rdd2.collect()
Out[71]: [8, 9, 10, 11, 12, 13, 14]
rdd3 = sc.parallelize(range(1, 6))
rdd3.collect()
Out[72]: [1, 2, 3, 4, 5]
#flatMap transformation is similar to map, but each input item can be mapped to 0 or more output items
flatmap_rdd3.takeOrdered(7)
Out[75]: [1, 2, 3, 4, 5, 5, 6]
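The notebook cell that defined flatmap_rdd3 is not shown above; as a separate illustration of flatMap (using rdd3 from earlier and an arbitrary function, not necessarily the one used above):
rdd3.flatMap(lambda x: [x, x * 10]).collect()
# [1, 10, 2, 20, 3, 30, 4, 40, 5, 50] - each element expands to two output items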
# union returns a new RDD that contains the union of the elements in the source dataset and the argument
# it includes duplicate elements also
rddunion=rdd1.union(rdd2)
rddunion.collect()
Out[77]: [5, 6, 7, 8, 9, 8, 9, 10, 11, 12, 13, 14]
#intersection returns a new RDD that contains the intersection of elements in the source dataset and the argument.
rddintersection = rdd1.intersection(rdd2)
rddintersection.collect()
Out[82]: [8, 9]
#cartesian returns the Cartesian product: all pairs (a, b) with a from rdd1 and b from rdd2
rddcartesian=rdd1.cartesian(rdd2)
rddcartesian.collect()
Out[83]: [(5, 8), (5, 9), (5, 10), (5, 11), (5,
12), (5, 13), (5, 14), (6, 8), (6, 9), (6, 10),
(6, 11), (6, 12), (6, 13), (6, 14), (7, 8), (7,
9), (7, 10), (7, 11), (7, 12), (7, 13), (7,
14), (8, 8), (8, 9), (8, 10), (8, 11), (8, 12),
(8, 13), (8, 14), (9, 8), (9, 9), (9, 10), (9,
11), (9, 12), (9, 13), (9, 14)]
#JOIN: when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
#Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
rdd1=sc.parallelize([(1,8.5),(2,7.7),(3,9),(5,6.5)])
rdd2=sc.parallelize([(1,'cse'),(3,'ece'),(4,'me'),(5,'eee')])
rddjoin=rdd1.join(rdd2)
rddjoin.collect()
Out[]: [(1, (8.5, 'cse')),
#leftOuterJoin: performs a join starting with the first (left-most) RDD and then any matching second (right-most) RDD elements.
rddleftouter=rdd1.leftOuterJoin(rdd2)
rddleftouter.collect()
Out[]: [(1, (8.5, 'cse')),
rddrightouter=rdd1.rightOuterJoin(rdd2)
rddrightouter.collect()
Out[]: [(1, (8.5, 'cse')),
#fullOuterJoin: returns all elements from both RDDs, pairing them where the keys match and filling in None where they do not.
rddfullouter=rdd1.fullOuterJoin(rdd2)
rddfullouter.collect()
Out[42]: [(1, (8.5, 'cse')),
# word count using Spark: read the input file from HDFS
x = sc.textFile("/user/cloudera/wc")
x.collect()
# split each line into words
a = x.flatMap(lambda line: line.split(" "))
a.collect()
# pair each word with a count of 1
b = a.map(lambda word: (word, 1))
b.collect()
# sum the counts for each word
c = b.reduceByKey(lambda a, b: a + b)
c.collect()
# save the result back to HDFS
c.saveAsTextFile("/user/cloudera/output")
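The saved result can then be viewed from the HDFS shell (a typical check; the part file name may vary):
hdfs dfs -ls /user/cloudera/output
hdfs dfs -cat /user/cloudera/output/part-00000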