BDA Practical


PRACTICAL NO: 01

AIM:
HDFS basics and Hadoop ecosystem tools overview:
• Installing Hadoop
• Copying files to Hadoop
• Copying from the Hadoop file system and deleting files
• Moving and displaying files in HDFS
• Programming exercises on Hadoop

THEORY:
Hadoop is an open-source framework that is used to efficiently store and process large datasets
ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store
and process the data, Hadoop allows clustering multiple computers to analyze massive datasets
in parallel more quickly.
Features of Hadoop:

1. Hadoop is an open-source project, which means its source code is available free of cost
for inspection, modification and analysis, allowing enterprises to modify the code as
per their requirements.
2. A Hadoop cluster is scalable: we can add any number of nodes (horizontal scaling)
or increase the hardware capacity of individual nodes (vertical scaling) to achieve higher
computation power. This provides horizontal as well as vertical scalability to the
Hadoop framework.
3. Fault tolerance is the most important feature of Hadoop. HDFS in Hadoop 2 uses a
replication mechanism to provide fault tolerance.
4. Replication also ensures high availability of the data, even in unfavorable
conditions.
5. A Hadoop cluster consists of inexpensive commodity-hardware nodes, which makes it a
cost-effective solution for storing and processing big data.
6. Hadoop stores data in a distributed fashion, which allows the data to be processed
in parallel on a cluster of nodes. This gives the Hadoop framework very fast processing
capability.
7. Hadoop is popularly known for its data locality feature: computation logic is moved
to the data rather than data to the computation logic. This feature of Hadoop
reduces the bandwidth utilization in the system.
8. Hadoop can process unstructured data, so users can analyze data of any format and size.
9. Hadoop is easy to use, as clients don't have to worry about distributed computing;
the processing is handled by the framework itself.
10. Due to the replication of data in the cluster, data is stored reliably on the
cluster machines despite machine failures.

Hadoop Software Can Be Installed in Three Modes of Operation:

• Stand Alone Mode: Hadoop is distributed software designed to run on a cluster of
commodity machines. However, we can install it on a single node in stand-alone mode. In
this mode, Hadoop runs as a single monolithic Java process. This mode is extremely
useful for debugging: you can first test-run your MapReduce application in this mode
on small data before actually executing it on a cluster with big data.
• Pseudo Distributed Mode: In this mode also, Hadoop is installed on a single
node, but the various Hadoop daemons run on that machine as separate Java processes.
Hence all the daemons, namely NameNode, DataNode, SecondaryNameNode, JobTracker and
TaskTracker, run on a single machine.
• Fully Distributed Mode: In fully distributed mode, the daemons NameNode,
JobTracker and SecondaryNameNode (optional; it can be run on a separate node) run on the
master node, while the DataNode and TaskTracker daemons run on the slave nodes.

HADOOP INSTALLATION:

Steps for Installation:


1. sudo apt-get update

2. In this step, we will install the latest version of the JDK on the machine.

The Oracle JDK is the official JDK; however, it is no longer provided by Oracle as a
default installation for Ubuntu. You can still install it using apt-get. To install any
version, first execute the following commands:
a. sudo apt-get install python-software-properties
b. sudo add-apt-repository ppa:webupd8team/java
c. sudo apt-get update

Then, depending on the version you want to install, execute one of the following
commands:
Oracle JDK 7: sudo apt-get install oracle-java7-installer
Oracle JDK 8: sudo apt-get install oracle-java8-installer

3. Now, let us set up a new user account for the Hadoop installation. This step is optional, but
recommended because it gives you the flexibility of a separate account for the Hadoop
installation, keeping it apart from other software installations.
a. sudo adduser hadoop_dev (Upon executing this command, you will be prompted to
enter a new password for this user. Enter the password and the other details, and
don't forget to save the details at the end.)
b. su - hadoop_dev (Switches from the current user to the newly created user, i.e.
hadoop_dev.)

4. Edit the configuration file /home/hadoop_dev/hadoop2/etc/hadoop/hadoop-env.sh and
set JAVA_HOME in that file.
a. vim /home/hadoop_dev/hadoop2/etc/hadoop/hadoop-env.sh
b. uncomment JAVA_HOME and update it with the following line:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle (check your own Java installation path
and set this value accordingly; recent versions of Hadoop require JDK 1.7 or later)

5. This finishes the Hadoop setup in stand-alone mode.

6. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/core-site.xml as below:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

7. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/hdfs-site.xml as below:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

8. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/mapred-site.xml as below:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

9. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/yarn-site.xml as below:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

Basic Operations:

• copyFromLocal: copy a file from the local file system to HDFS.

Usage:

hadoop fs -copyFromLocal <localsrc> URI

Example:

hadoop fs -copyFromLocal C:\Hadoop_File\airport-codes-na.txt

• copyToLocal: copy a file from HDFS to the local file system.

Usage:

hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

• mv: move files within HDFS.

Usage:
hadoop fs -mv <src> <dst>

Example:
hadoop fs -mv C:\Hadoop_File\airport-codes-na.txt

• rm: remove a file or directory in HDFS. Removes the files specified as arguments;
deletes a directory only when it is empty.

Usage:
hadoop fs -rm <path>

Example:

hadoop fs -rm C:\airport-codes-na.txt
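
The same basic operations can also be scripted. Below is a minimal Python sketch (not part of the
original exercise) that simply shells out to the hadoop CLI; it assumes Hadoop is on the PATH, and
the local file and HDFS paths are placeholders to adapt:

import subprocess

def hdfs(*args):
    # Run a "hadoop fs" sub-command and raise an error if it fails.
    subprocess.run(["hadoop", "fs", *args], check=True)

# Placeholder paths -- adjust to your own local file and HDFS layout.
hdfs("-copyFromLocal", "airport-codes-na.txt", "/tmp/airport-codes-na.txt")
hdfs("-ls", "/tmp")
hdfs("-copyToLocal", "/tmp/airport-codes-na.txt", "airport-codes-copy.txt")
hdfs("-rm", "/tmp/airport-codes-na.txt")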

Conclusion: Studied basic Hadoop and Hadoop ecosystem tools.

PRACTICAL NO: 02
AIM:
Use of Sqoop tool to transfer data between Hadoop and relational database servers.
a. Sqoop - Installation.
b. To execute basic commands of Hadoop eco system component Sqoop.

THEORY:

Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is
used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to
export data from the Hadoop file system to relational databases.
Sqoop has two main functions: importing and exporting.
Sqoop Import: The import tool imports individual tables from an RDBMS to HDFS. Each row in
a table is treated as a record in HDFS. All records are stored as text data in text files or as
binary data in Avro and Sequence files.
Sqoop Export: The export tool exports a set of files from HDFS back to an RDBMS. The files
given as input to Sqoop contain records, which are called rows in the table. These are read and
parsed into a set of records and delimited with a user-specified delimiter.
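
For reference, a typical import and export invocation looks like the following. This is a hedged
Python sketch that just builds and runs the sqoop command line; the JDBC URL, table names,
credentials and HDFS directory are hypothetical placeholders, not values from this practical:

import subprocess

# Hypothetical connection details -- replace with your own database and paths.
import_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://localhost/testdb",
    "--username", "root", "-P",                  # -P prompts for the password
    "--table", "employees",
    "--target-dir", "/user/hadoop_dev/employees",
    "-m", "1",                                   # one mapper is enough for a small table
]
export_cmd = [
    "sqoop", "export",
    "--connect", "jdbc:mysql://localhost/testdb",
    "--username", "root", "-P",
    "--table", "employees_copy",
    "--export-dir", "/user/hadoop_dev/employees",
]
subprocess.run(import_cmd, check=True)   # RDBMS table -> HDFS
subprocess.run(export_cmd, check=True)   # HDFS files  -> RDBMS table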

Sqoop Features:

Sqoop has several features, which make it helpful in the Big Data world:
1. Parallel Import/Export: Sqoop uses the YARN framework to import and export data. This provides fault
tolerance on top of parallelism.
2. Import Results of an SQL Query: Sqoop enables us to import the results returned from an SQL query
into HDFS.
3. Connectors for all major RDBMS databases: Sqoop provides connectors for multiple RDBMSs, such
as MySQL and Microsoft SQL Server.
4. Kerberos security integration: Sqoop supports the Kerberos computer network authentication protocol,
which enables nodes communicating over an insecure network to authenticate users securely.
5. Provides full and incremental load: Sqoop can load an entire table or only parts of the table.

Advantages of Sqoop:
• With the help of Sqoop, we can perform transfer operations of data with a variety of structured data
stores like Oracle, Teradata, etc.
• Sqoop helps us to perform ETL operations in a very fast and cost-effective manner.
• With the help of Sqoop, we can process data in parallel, which speeds up the overall
process.
• Sqoop uses the MapReduce mechanism for its operations, which also provides fault tolerance.

Disadvantages of Sqoop:
• A failure that occurs during an operation needs a special solution to handle the
problem.
• Sqoop uses a JDBC connection to connect to the relational database management
system, which can be inefficient.
• The performance of a Sqoop export operation depends upon the hardware configuration of the
relational database management system.

SQOOP INSTALLATIONS:

Step 1: Open Windows PowerShell

Step 2: Verifying JAVA Installation

Step 3: Run this commands on windows Powershell


>> Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
>> irm get.scoop.sh | iex

Scoop has been installed successfully!!

SQOOP COMMANDS:

Scoop install Firefox:

Scoop install python:

Scoop cat gifski:

Conclusion: Used Sqoop tool to transfer data between Hadoop and relational database servers.

PRACTICAL NO: 03
AIM:

To install and configure MongoDB/Cassandra/HBase/Hypertable to execute NoSQL commands.


Theory:
MongoDB is a source-available cross-platform document-oriented database program. Classified as a
NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is
developed by MongoDB Inc. and licensed under the Server-Side Public License which is deemed non-
free by several distributions.
INSTALLATION:
STEP 1:
Download and install MongoDB version 6.0.1 (current); a lower version will also run the software.
Description: MongoDB Installation
MongoDB is a NoSQL database server available for most of the latest operating systems. Use
one of the following tutorials to install MongoDB on your system:

• Install MongoDB on Ubuntu
• Install MongoDB on Debian
• Install MongoDB on CentOS & Fedora

Windows users can download the MongoDB installer from its official website.

STEP 2:
Now download and install the MongoDB Shell from
(https://www.mongodb.com/try/download/shell).
Extract the MongoDB Shell archive to a local disk and run the software from the bin folder
(bin -> mongosh.exe).
Description: What is Mongo Shell?
MongoDB provides an interactive mongo shell based on JavaScript. This provides a command line
interface between users and MongoDB databases.

What is the mongo shell?

The mongo shell is a command line interface between the user and the database. You can connect to your
MongoDB server using this shell and manage your databases and collections. You can also perform
administrative tasks on MongoDB.

Connect to the MongoDB Shell

Type mongosh (or mongo on older installations) in your system terminal. It will automatically connect
to your local MongoDB server running on port 27017.

STEP 3:
Open the MongoDB compass application

STEP 4:
COPY THE URI OF THE NEW CONNECTION (e.g. mongodb://localhost:27017)

Description: You need to specify the hostname to connect to a remote database. Also specify the port
if MongoDB is running on a non-default port.

STEP 5:
Paste the copied URI (e.g. mongodb://localhost:27017) into the mongosh.exe application and press
Enter. After pressing Enter, the new connection to localhost will be established in the shell.

localhost:27017 is now successfully connected through the shell.


STEP 6:
Now start applying NoSQL commands in the mongosh.exe application.
1) MongoDB Create Database
MongoDB does not have an explicit command to create a database. You can use the statement
use dbName to select (and implicitly create) a database in the mongo shell. Use the following example:

Syntax:
> use DATABASE_NAME

Limitations
MongoDB database names cannot be empty and must have fewer than 64 characters.
Create Database
First of all, use the command below to select the database. To create a new database, make
sure it does not already exist.
> use exp3
You now have a database selected. You need to insert at least one document to keep this
database. As you have already executed the use command above, all further statements will be
executed on this database.
> db.users.insert({ id: 1 })

Show Databases – after inserting the first record, the database will be listed by the show
dbs command.
2) MongoDB show databases
In MongoDB, you can use the show dbs command to list all databases on a MongoDB server.
This will show you the database name, as well as the size of the database in gigabytes.
Example:
> show dbs;
Output

>show collections

Create or insert some more data: db.exp3.insert({"lastname": "Jaiswal"})

>show collections

3) MongoDB Delete Database
The dropDatabase() command is used to delete the current database, including all the
associated data files. To remove a database, first select that database. You can
use the show dbs command to list the available databases.

> db.dropDatabase()

4)MongoDB Create Collection


MongoDB stores documents in collections. The collections are analogous to tables in relational
databases like MySQL. Generally, you don’t need to create a collection, MongoDB does this
automatically during insertion of the first document.

Syntax

Use the db.createCollection() command to create a collection. You need to specify the name of the
collection and can also provide optional settings such as memory size and indexing.

> db.createCollection(name, options)


INPUT
> db.createCollection("ARMIET")
OUTPUT

5) MongoDB Show Collections

Use the show collections command from the MongoDB shell to list all collections created in the
current database. First, select the database whose collections you want to view, then run the
show collections command to list the available collections in the MongoDB database.

> show collections


Output:

6) MongoDB Get Collection Information
To find detailed information about the database collections, use the db.getCollectionInfos() function. This
provides an array of documents with collection information, such as name and options.

> db.getCollectionInfos();
You can also pass a filter to the above command; for example, to fetch the details of a specific collection
by name, use the following command:

INPUT

> db.getCollectionInfos({ name: "accounts" });


Output:

7) MongoDB Rename Collection


Use db.collection.renameCollection() method to rename existing collection in MongoDB
database.
Syntax:
db.collection.renameCollection(target, dropTarget)
Example:
For example, to rename the collection "exp3" to "exp10", run the following
command in the mongo shell:
INPUT
> db.exp3.renameCollection("exp10")
OUTPUT

8) MongoDB Drop Collection


Use db.collection.drop() statement to delete existing collection. The result will be true on
successful deletion.
Syntax
> db.COLLECTION_NAME.drop()

MongoDB Drop Collection

Finally, delete the collection named exp5 from the current database using the method below.

> db.exp5.drop();
OUTPUT

You can also use the drop() method with getCollection() to remove an existing collection, as
follows.
> db.getCollection("exp10").drop();
OUTPUT

9) MongoDB Insert Document


Use the db.collection.insert() method to insert a new document into a MongoDB collection. You don't need to
create the collection first; the insert method automatically creates the collection if it does not exist.
Syntax
> db.COLLECTION_NAME.insert(document)
Insert Single Document
Insert a single document using the insert() method. The document is passed as an argument in JSON
format.

10) MongoDB Query Document


Use the db.collection.find() method to query a collection for its documents. You can use the show
collections command to view the available collections in your database.
Syntax
> db.COLLECTION_NAME.find(condition)
Search All Documents

Execute the find() function on the collection without any condition to get all the available
documents in the collection.
> db.user.find()

You can also use the pretty() function with the above command to show formatted output.
> db.users.find().pretty();

Search Specific Documents

You can define conditions so that only the documents matching them are selected.
> db.users.find({"id": 1001})
The above command will show all the fields of the documents having {"id": 1001}.

11) MongoDB Update Document


Use the db.collection.update() method to update a document in a MongoDB database.
Syntax:
> db.users.update(CONDITION, UPDATED DATA, OPTIONS)
Find Current Document
First, find the current document in the collection with user_name = Rohitkumar.

Update Document
Now change location from India to Australia for the document where id = 1001.

12) MongoDB Delete Document


Use the db.collection.remove() method to delete documents from a MongoDB collection on the
basis of a condition.
Syntax:
> db.collection.remove(CONDITION)
Delete Matching Documents
The following command deletes the documents from the MongoDB collection that have
user_name = Rohitkumar.
> db.users.remove({"user_name":"Rohitkumar"})
The above command removes all documents having that user_name. To remove only the first
matching document from the collection, use the following command:
> db.users.remove({"user_name":"Rohitkumar"},1)
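
The same CRUD operations can also be driven from Python. The sketch below uses pymongo and is an
illustration only (it assumes pip install pymongo and a local server on port 27017; the database,
collection and field names mirror the shell examples above):

from pymongo import MongoClient

# Connect with the same URI used in MongoDB Compass / mongosh.
client = MongoClient("mongodb://localhost:27017")
db = client["exp3"]                  # database selected with "use exp3" above
users = db["users"]

users.insert_one({"id": 1001, "user_name": "Rohitkumar", "location": "India"})
print(users.find_one({"id": 1001}))                           # query one document
users.update_one({"id": 1001}, {"$set": {"location": "Australia"}})
users.delete_one({"user_name": "Rohitkumar"})                 # remove the document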

13) MongoDB limit() Method


Use the limit() method together with find() to show a limited number of documents from a collection.
Syntax:
db.COLLECTION_NAME.find().limit(NUMBER)
Example:
For example, the users collection has 19 documents and you need only the first 2 documents
from the collection.
> db.users.find().limit(2);

14) MongoDB db.hostInfo

db.hostInfo() returns a document with information about the underlying system that
the mongod or mongos process runs on. Some of the returned fields are only included on some
platforms.

db.hostInfo() is a mongosh helper method around the hostInfo command. The output
of db.hostInfo() on a Linux system will resemble the following:
>db.hostInfo

>db.hostInfo()

15) MongoDB db.cloneDatabase


Definition

db.cloneDatabase("hostname")
Deprecated since version 4.0. Starting in version 4.2, MongoDB removes
the clone command. The deprecated db.cloneDatabase(), which wraps the clone command,
can only be run against MongoDB 4.0 or earlier versions. For behavior and examples, refer to
the 4.0 or earlier version of the manual.
>db.cloneDatabase

16) MongoDB db.help
Returns: text output listing common methods on the db object.
> db.help()

CONCLUSION:

Hence, we successfully studied how to install and configure MongoDB and execute NoSQL
commands.

PRACTICAL NO: 04
AIM: Experiment on Hadoop MapReduce: write a program to implement word count
using MapReduce.

Theory:
MapReduce is a processing technique and a programming model for distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of
data and converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs). Secondly, reduce task, which takes the output from a map as an input and combines
those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the
reduce task is always performed after the map job. The major advantage of MapReduce is that it is easy
to scale data processing over multiple computing nodes. Under the MapReduce model, the data
processing primitives are called mappers and reducers. Decomposing a data processing application into
mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce
form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in
a cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage.
Map stage: The map or mapper’s job is to process the input data. Generally, the input data is in the
form of file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the
mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s
job is to process the data that comes from the mapper. After processing, it produces a new set of output,
which will be stored in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the
cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task completion,
and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an appropriate
result, and sends it back to the Hadoop server.
Inputs and Outputs (Java Perspective): The MapReduce framework operates on <key, value> pairs; that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value>
pairs as the output of the job, conceivably of different types.

The key and value classes need to be serializable by the framework and hence must
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. Input and output types of a MapReduce
job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).

Stage      Input               Output
Map        <k1, v1>            list(<k2, v2>)
Reduce     <k2, list(v2)>      list(<k3, v3>)

Terminology
PayLoad - Applications implement the Map and Reduce functions and form the core of the job.
Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where the data is present in advance, before any processing takes place.
MasterNode - Node where the JobTracker runs and which accepts job requests from clients.
SlaveNode - Node where the Map and Reduce programs run.
JobTracker - Schedules jobs and tracks the jobs assigned to the TaskTracker.
TaskTracker - Tracks the task and reports status to the JobTracker.
Job - An execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
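
Before building the Hadoop job below, the map / shuffle / reduce flow described above can be
illustrated with a small in-memory Python sketch (this is not Hadoop code; it only mimics the data
movement, using the same sample lines that are loaded into HDFS later in this practical):

from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every token in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework does between stages.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: sum the counts collected for one key.
    return key, sum(values)

lines = ["rohit", "rohit jaiswal", "Dinesh", "Dinesh Bharambe", "Amit", "Amit Kori"]
intermediate = [pair for line in lines for pair in map_phase(line)]
for key, values in sorted(shuffle(intermediate).items()):
    print(reduce_phase(key, values))    # e.g. ('Amit', 2), ('Bharambe', 1), ...
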
A) INSTALLATION
STEP 1: First download Cloudera QuickStart VM 5.13.0 and
VirtualBox.
After downloading, install the latest VirtualBox (6.1.38) on the device.
STEP 2: Import the Cloudera 5.13.0 OVF file into VirtualBox and
connect Cloudera to VirtualBox.

STEP 3: Cloudera is now connected to VirtualBox

STEP 4: QuickStart the Cloudera OVF file

STEP 5: Click and Open the eclipse Application From Cloudera quickstart-vm-
5.13.0-0-virtualbox(Running)-Oracle VM VirtualBox

STEP 6: Create a new Java project, give it the name WordCount and click to go to the next
step

STEP 7: New Java Project name is created

STEP 8: Go to Libraries and add External JARS (activation.jar) file from the device
and finish the process

STEP 9: Now create java project

B) IMPLEMENTATION
STEP 1: Type the necessary Java Code for the WordCount project

STEP 2: Following are the Java Code for the program

STEP 3:- Save the WordCount Java Code file to the System

STEP 4: Click and Open the Terminal from the system and type all the Linux command for
MapReduce

STEP 5: DATA IS EXECUTING

STEP 6: TYPE ALL THE NECESSARY COMMANDS TO LOAD THE DATA FOR MAPREDUCE

STEP 7: Job Counters and Map-Reduce Framework command is executing

STEP 8: - Final Output of MapReduce Is executed successfully

INPUT
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: tokenize each input line and emit (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Output:
[cloudera@quickstart ~]$ ls
cloudera-manager eclipse Music Videos
cm_api.py enterprise-deployment.json parcels WordCount.jar
Desktop express-deployment.json Pictures workspace
Documents kerberos Public
Downloads lib Templates
[cloudera@quickstart ~]$ pwd
/home/cloudera
[cloudera@quickstart ~]$ cat > /home/cloudera/processfile1.txt
rohit
rohit jaiswal
Dinesh
Dinesh Bharambe
Amit
Amit Kori
^Z
[1]+ Stopped cat > /home/cloudera/processfile1.txt
[cloudera@quickstart ~]$ cat /home/cloudera/processfile1.txt
rohit
rohit jaiswal

Dinesh
Dinesh Bharambe
Amit
Amit Kori
[cloudera@quickstart ~]$ hdfs dfs -ls
[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 6 items
drwxrwxrwx - hdfs supergroup 0 2017-10-23 09:15 /benchmarks
drwxr-xr-x - hbase supergroup 0 2022-09-24 04:25 /hbase
drwxr-xr-x - solr solr 0 2017-10-23 09:18 /solr
drwxrwxrwt - hdfs supergroup 0 2022-09-24 04:26 /tmp
drwxr-xr-x - hdfs supergroup 0 2017-10-23 09:17 /user
drwxr-xr-x - hdfs supergroup 0 2017-10-23 09:17 /var
[cloudera@quickstart ~]$ hdfs dfs -mkdir /inputfolder1
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/processfile1.txt
/inputfolder1/
[cloudera@quickstart ~]$ hdfs dfs -cat /inputfolder1/processfile1.txt
rohit
rohit jaiswal
Dinesh
Dinesh Bharambe
Amit
Amit Kori
[cloudera@quickstart ~]$
[cloudera@quickstart ~]$ hadoop jar /home/cloudera/WordCount.jar WordCount
/inputfolder1/processfile1.txt /out1
22/09/24 08:22:13 INFO client.RMProxy: Connecting to ResourceManager at
/0.0.0.0:8032
22/09/24 08:22:16 WARN mapreduce.JobResourceUploader: Hadoop command-
line option parsing not performed. Implement the Tool interface and execute your
application with ToolRunner to remedy this.
22/09/24 08:22:18 INFO input.FileInputFormat: Total input paths to process : 1
22/09/24 08:22:18 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)

at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFS
OutputStream.java:967)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutput
Stream.java:705)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStrea
m.java:894)
22/09/24 08:22:18 INFO mapreduce.JobSubmitter: number of splits:1
22/09/24 08:22:20 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1664018365437_0001
22/09/24 08:22:24 INFO impl.YarnClientImpl: Submitted application
application_1664018365437_0001
22/09/24 08:22:24 INFO mapreduce.Job: The url to track the job:
http://quickstart.cloudera:8088/proxy/application_1664018365437_0001/
22/09/24 08:22:24 INFO mapreduce.Job: Running job:
job_1664018365437_0001
22/09/24 08:24:17 INFO mapreduce.Job: Job job_1664018365437_0001 running
in uber mode : false
22/09/24 08:24:18 INFO mapreduce.Job: map 0% reduce 0%
22/09/24 08:25:33 INFO mapreduce.Job: map 100% reduce 0%
22/09/24 08:26:21 INFO mapreduce.Job: map 100% reduce 100%
22/09/24 08:26:22 INFO mapreduce.Job: Job job_1664018365437_0001
completed successfully
22/09/24 08:26:23 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=82
FILE: Number of bytes written=286867
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=185
HDFS: Number of bytes written=52
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1

Total time spent by all maps in occupied slots (ms)=66083
Total time spent by all reduces in occupied slots (ms)=43237
Total time spent by all map tasks (ms)=66083
Total time spent by all reduce tasks (ms)=43237
Total vcore-milliseconds taken by all map tasks=66083
Total vcore-milliseconds taken by all reduce tasks=43237
Total megabyte-milliseconds taken by all map tasks=67668992
Total megabyte-milliseconds taken by all reduce
tasks=44274688
Map-Reduce Framework
Map input records=6
Map output records=9
Map output bytes=94
Map output materialized bytes=82
Input split bytes=126
Combine input records=9
Combine output records=6
Reduce input groups=6
Reduce shuffle bytes=82
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=1060
CPU time spent (ms)=5440
Physical memory (bytes) snapshot=331444224
Virtual memory (bytes) snapshot=3015163904
Total committed heap usage (bytes)=226365440
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0

WRONG_REDUCE=0
File Input Format Counters
Bytes Read=59
File Output Format Counters
Bytes Written=52
[cloudera@quickstart ~]$ hdfs dfs -ls /out1
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2022-09-24 08:26 /out1/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 52 2022-09-24 08:26 /out1/part-r-00000
[cloudera@quickstart ~]$ hdfs dfs -cat /out1/part-r-00000
Amit 2
Bharambe 1
Dinesh 2
Kori 1
jaiswal 1
rohit 2
[cloudera@quickstart ~]$

CONCLUSION:
Hence, we successfully studied and implemented a word count program using MapReduce.

PRACTICAL NO: 05
AIM:

Implementing simple algorithms in Map-Reduce: Matrix multiplication.

Theory:
1. What is MapReduce?
● A software framework for distributed processing of large data sets.
● The framework takes care of scheduling tasks, monitoring them
and re-executing any failed tasks.
● It splits the input data set into independent chunks that are
processed in a completely parallel manner.
● The MapReduce framework sorts the outputs of the maps, which are
then input to the reduce tasks. Typically, both the input and the
output of the job are stored in a file system.
2. What is a Map Task?
● A map task processes one input split and transforms its records into intermediate <key, value> pairs.

3. What is a Reduce Task?
● A reduce task receives the intermediate pairs grouped by key and merges the values for each key to
produce the final output.

MapReduce Steps:
● Input reader – divides the input into appropriate-size splits, which get
assigned to a Map function.
● Map function – maps the file data to smaller, intermediate <key, value>
pairs.
● Shuffle – the framework fetches the relevant partitions of the output of all
mappers via HTTP.
● Sort – the framework groups the Reducer inputs by key.
● Reduce – reduce is called on each <key, (value list)> pair.
● Output – the framework writes the sorted output.

Example:

Matrix Multiplication With 1 MapReduce Step

It has 2 important parts:


• Mapper: It takes raw data as input and organizes it into key, value pairs. For example, in a
dictionary you search for the word "Data" and its associated meaning is "facts and statistics
collected together for reference or analysis". Here the key is "Data" and the value associated
with it is "facts and statistics collected together for reference or analysis".
• Reducer: It is responsible for processing the data in parallel and producing the final output.
• Let us consider a matrix multiplication example to visualize MapReduce. Consider the
following matrices:

2×2 matrices A and B (here A = [1 2; 3 4] and B = [5 6; 7 8], the values used in the computation below).
Here matrix A is a 2×2 matrix, which means the number of rows (i) = 2 and the number of columns (j) = 2.
Matrix B is also a 2×2 matrix, where the number of rows (j) = 2 and the number of columns (k) = 2. Each cell of
a matrix is labelled as Aij or Bjk, e.g. element 3 in matrix A is called A21, i.e. 2nd row, 1st column.
Now One step matrix multiplication has 1 mapper and 1 reducer.
The formulas are:
Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i
Therefore, computing the mapper for Matrix A:
• i, j and k each run over the dimensions of the matrices; here all are 2, so when k = 1,
i can take the 2 values 1 and 2, and each case can have the 2 further values j = 1 and j = 2.
• Substituting all values in the formula:

k=1: i=1, j=1: ((1, 1), (A, 1, 1))
     i=1, j=2: ((1, 1), (A, 2, 2))
     i=2, j=1: ((2, 1), (A, 1, 3))
     i=2, j=2: ((2, 1), (A, 2, 4))
k=2: i=1, j=1: ((1, 2), (A, 1, 1))
     i=1, j=2: ((1, 2), (A, 2, 2))
     i=2, j=1: ((2, 2), (A, 1, 3))
     i=2, j=2: ((2, 2), (A, 2, 4))

Computing the mapper for Matrix B:
i=1: j=1, k=1: ((1, 1), (B, 1, 5))
     j=1, k=2: ((1, 2), (B, 1, 6))
     j=2, k=1: ((1, 1), (B, 2, 7))
     j=2, k=2: ((1, 2), (B, 2, 8))
i=2: j=1, k=1: ((2, 1), (B, 1, 5))
     j=1, k=2: ((2, 2), (B, 1, 6))
     j=2, k=1: ((2, 1), (B, 2, 7))
     j=2, k=2: ((2, 2), (B, 2, 8))
The formula for the Reducer is:
Reducer(k, v): for each key (i, k), make a sorted Alist and Blist,
compute the summation of (Aij * Bjk) over j,
and output ((i, k), sum).

Therefore, computing the reducer:
• We can observe from the mapper computation that 4 keys are common: (1, 1), (1, 2), (2, 1) and (2, 2).
• Make a separate list for Matrix A and for Matrix B, with the adjoining values taken from the
mapper step above:
(1, 1) =>Alist ={(A, 1, 1), (A, 2, 2)}
Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(1*5) + (2*7)] =19 -------(i)
(1, 2) =>Alist ={(A, 1, 1), (A, 2, 2)}
Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(1*6) + (2*8)] =22 -------(ii)
(2, 1) =>Alist ={(A, 1, 3), (A, 2, 4)}
Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(3*5) + (4*7)] =43 -------(iii)
(2, 2) =>Alist ={(A, 1, 3), (A, 2, 4)}
Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(3*6) + (4*8)] =50 -------(iv)
From (i), (ii), (iii) and (iv) we conclude that
((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)

Therefore, the Final Matrix is:

Final output of Matrix multiplication.
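
The mapper and reducer described above can be checked with a small in-memory Python sketch (an
illustration only, not Hadoop code; it uses the 2×2 example matrices A = [1 2; 3 4] and
B = [5 6; 7 8] from the worked example):

from collections import defaultdict

# 2x2 example matrices from the worked example above (1-indexed cells).
A = {(1, 1): 1, (1, 2): 2, (2, 1): 3, (2, 2): 4}
B = {(1, 1): 5, (1, 2): 6, (2, 1): 7, (2, 2): 8}
I = J = K = 2

# Map: A[i][j] is emitted once for every k, B[j][k] once for every i.
mapped = []
for (i, j), a in A.items():
    for k in range(1, K + 1):
        mapped.append(((i, k), ('A', j, a)))
for (j, k), b in B.items():
    for i in range(1, I + 1):
        mapped.append(((i, k), ('B', j, b)))

# Shuffle: group records by the (i, k) key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: for each (i, k), sum A[i][j] * B[j][k] over j.
for (i, k), values in sorted(groups.items()):
    a_list = {j: v for tag, j, v in values if tag == 'A'}
    b_list = {j: v for tag, j, v in values if tag == 'B'}
    total = sum(a_list[j] * b_list[j] for j in a_list if j in b_list)
    print(((i, k), total))   # ((1, 1), 19), ((1, 2), 22), ((2, 1), 43), ((2, 2), 50)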

Conclusion:
Thus, we have studied how to implement MapReduce.
My input matrices A and B are:

Matrix A      Matrix B
0 0 0         6 7 4
0 1 6         9 1 3
7 8 9         7 6 2

Program:
/** The job 1 mapper class (an excerpt from the full matrix multiplication job; the
 * IndexPair, Key and Value types and the constants such as IB, KB, JB, NIB and NJB
 * are defined elsewhere in the program). */
private static class Job1Mapper
        extends Mapper<IndexPair, IntWritable, Key, Value>

{
private Path path;
private boolean matrixA;
private Key key = new Key();
private Value value = new Value();

public void setup (Context context) {


init(context);
FileSplit split = (FileSplit)context.getInputSplit();
path = split.getPath();
matrixA = path.toString().startsWith(inputPathA);
if (DEBUG) {
System.out.println("##### Map setup: matrixA = " + matrixA + " for " + path);
System.out.println(" strategy = " + strategy);
System.out.println(" R1 = " + R1);
System.out.println(" I = " + I);
System.out.println(" K = " + K);
System.out.println(" J = " + J);
System.out.println(" IB = " + IB);
System.out.println(" KB = " + KB);
System.out.println(" JB = " + JB);
}
}

private void printMapInput (IndexPair indexPair, IntWritable el) {


System.out.println("##### Map input: (" + indexPair.index1 + "," +
indexPair.index2 + ") " + el.get());
}

private void printMapOutput (Key key, Value value) {


System.out.println("##### Map output: (" + key.index1 + "," +
key.index2 + "," + key.index3 + "," + key.m + ") (" +
value.index1 + "," + value.index2 + "," + value.v + ") ");
}

private void badIndex (int index, int dim, String msg) {


System.err.println("Invalid " + msg + " in " + path + ": " + index + " " + dim);
System.exit(1);
}

public void map (IndexPair indexPair, IntWritable el, Context context)


throws IOException, InterruptedException
{
if (DEBUG) printMapInput(indexPair, el);
int i = 0;
int k = 0;
int j = 0;
if (matrixA) {
i = indexPair.index1;
if (i < 0 || i >= I) badIndex(i, I, "A row index");
k = indexPair.index2;
if (k < 0 || k >= K) badIndex(k, K, "A column index");
} else {
k = indexPair.index1;
if (k < 0 || k >= K) badIndex(k, K, "B row index");

j = indexPair.index2;
if (j < 0 || j >= J) badIndex(j, J, "B column index");
}
value.v = el.get();
if (matrixA) {
key.index1 = i/IB;
key.index3 = k/KB;
key.m = 0;
value.index1 = i % IB;
value.index2 = k % KB;
for (int jb = 0; jb < NJB; jb++) {
key.index2 = jb;
context.write(key, value);
if (DEBUG) printMapOutput(key, value);
}
} else {
key.index2 = j/JB;
key.index3 = k/KB;
key.m = 1;
value.index1 = k % KB;
value.index2 = j % JB;
for (int ib = 0; ib < NIB; ib++) {
key.index1 = ib;
context.write(key, value);
if (DEBUG) printMapOutput(key, value);
}
}
}
}
Output:
Map output for Matrix A:

Map setup: matrixA = true for hdfs://localhost/user/hadoop-user/A


strategy = 4
R1 = 4
I=3
K=3
J=3
IB = 3
KB = 3
JB = 3
##### Map input: (1,1) 1
##### Map output: (0,0,0,0) (1,1,1)
##### Map input: (1,2) 6
##### Map output: (0,0,0,0) (1,2,6)
##### Map input: (2,0) 7
##### Map output: (0,0,0,0) (2,0,7)
##### Map input: (2,1) 8
##### Map output: (0,0,0,0) (2,1,8)
##### Map input: (2,2) 9
##### Map output: (0,0,0,0) (2,2,9)

Map Output for matrixB:

Map setup: matrixA = false for hdfs://localhost/user/hadoop-user/B


strategy = 4
R1 = 4
I=3
K=3
J=3
IB = 3
KB = 3
JB = 3
##### Map input: (0,0) 6
##### Map output: (0,0,0,1) (0,0,6)
##### Map input: (0,1) 7
##### Map output: (0,0,0,1) (0,1,7)
##### Map input: (0,2) 4
##### Map output: (0,0,0,1) (0,2,4)
##### Map input: (1,0) 9
##### Map output: (0,0,0,1) (1,0,9)
##### Map input: (1,1) 1
##### Map output: (0,0,0,1) (1,1,1)
##### Map input: (1,2) 3
##### Map output: (0,0,0,1) (1,2,3)
##### Map input: (2,0) 7
##### Map output: (0,0,0,1) (2,0,7)
##### Map input: (2,1) 6
##### Map output: (0,0,0,1) (2,1,6)
##### Map input: (2,2) 2
##### Map output: (0,0,0,1) (2,2,2)

Output Matrix:
0 0 0
0 0 0
0 0 0

Conclusion:
Implemented simple algorithms in Map-Reduce.

PRACTICAL NO: 06
AIM: Implementing DGIM algorithm using Python Programming Language.

Theory:
The Datar-Gionis-Indyk-Motwani Algorithm (DGIM)
This version of the algorithm uses O(log² N) bits to represent a window of N bits, and allows
us to estimate the number of 1's in the window with an error of no more than 50%. To begin, each bit
of the stream has a timestamp: the position at which it arrives. The first bit has timestamp 1, the
second has timestamp 2, and so on. Since we only need to distinguish positions within the
window of length N, we represent timestamps modulo N, so they can be represented by log2 N bits.
If we also store the total number of bits ever seen in the stream (i.e., the most recent timestamp)
modulo N, then we can determine from a timestamp modulo N where in the current window the bit with
that timestamp is.
We divide the window into buckets, consisting of:

• The timestamp of its right (most recent) end


• The number of 1’s in the bucket. This number must be a power of 2, and we refer to the number
of 1’s as the size of the bucket

To represent a bucket, we need log2 N bits to represent the timestamp (modulo N) of its right end. To
represent the number of 1's we only need log2 log2 N bits. The reason is that we know this number i
is a power of 2, say 2^j, so we can represent i by coding j in binary. Since j is at most log2 N, it
requires log2 log2 N bits. Thus, O(log N) bits suffice to represent a bucket.
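
For the window size used in the program below (N = 1000), these bit counts can be checked quickly
in Python (a small illustrative calculation, not part of the original write-up):

import math

N = 1000                                            # window size used in the program below
timestamp_bits = math.ceil(math.log2(N))            # bits for a timestamp mod N
size_bits = math.ceil(math.log2(math.log2(N)))      # bits for the exponent j in 2^j
print(timestamp_bits, size_bits)                    # 10 and 4 -> about 14 bits per bucket
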
There are six rules that must be followed when representing a stream by buckets:
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).

Fig: A bit-stream divided into buckets following the DGIM rules

Advantages:
1. Stores only O(log² N) bits in total:
• O(log N) bucket counts of O(log N) bits each.
2. Easy to update as more bits enter:
• The error in the count is no greater than the number of 1's in the unknown area.

Drawbacks:
• As long as the 1's are fairly evenly distributed, the error due to the unknown region is small
– no more than 50%.
• But it could be that all the 1's are in the unknown area (the part of the oldest bucket that falls
outside the window). In that case, the error is unbounded.

Program:
import math   # (the original import list also pulled in IPython, sys, itertools and time, which are unused)

def checkAndMergeBucket(bucketList, t):
    # DGIM merge rule: when more than two buckets of size 2^i exist, merge the two
    # oldest of them into one bucket of size 2^(i+1), carrying the newer timestamp
    # up one level (or discarding it at the top level).
    bucketListLength = len(bucketList)
    for i in range(bucketListLength):
        if len(bucketList[i]) > 2:
            bucketList[i].pop(0)
            if i + 1 >= bucketListLength:
                bucketList[i].pop(0)
            else:
                bucketList[i + 1].append(bucketList[i].pop(0))

K, N = 1000, 1000
k = int(math.floor(math.log(N, 2)))
t = 0
onesCount = 0
bucketList = []
for i in range(k + 1):
    bucketList.append(list())    # bucketList[i] holds the timestamps of buckets of size 2^i

with open('engg5108_stream_data.txt') as f:
    while True:
        c = f.read(1)
        if not c:
            # End of stream: list the buckets and estimate the number of 1's.
            for i in range(k + 1):
                for j in range(len(bucketList[i])):
                    print("Size of bucket: %d, timestamp: %d" % (pow(2, i), bucketList[i][j]))
                    earliestTimestamp = bucketList[i][j]
            for i in range(k + 1):
                for j in range(len(bucketList[i])):
                    if bucketList[i][j] != earliestTimestamp:
                        onesCount = onesCount + pow(2, i)
                    else:
                        onesCount = onesCount + 0.5 * pow(2, i)
            print("Number of ones in last %d bits: %d" % (K, onesCount))
            break
        t = (t + 1) % N
        # Expire any bucket whose (modulo-N) timestamp falls out of the window.
        for i in range(k + 1):
            for bucketTimestamp in bucketList[i]:
                if bucketTimestamp == t:
                    bucketList[i].remove(bucketTimestamp)
        if c == '1':
            bucketList[0].append(t)
            checkAndMergeBucket(bucketList, t)
        elif c == '0':
            continue
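
The script reads a file of '0'/'1' characters named engg5108_stream_data.txt. If that file is not
available, a random test stream can be generated first; a minimal sketch (the 5000-bit length is an
arbitrary choice):

import random

# Write a random bit-stream for the DGIM script above to consume.
with open('engg5108_stream_data.txt', 'w') as f:
    f.write(''.join(random.choice('01') for _ in range(5000)))
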
Output:

Conclusion: Thus, we have successfully implemented the DGIM algorithm using the Python programming
language.

PRACTICAL NO: 07

AIM: To visualize data using Hive/Pig/R/Tableau.

Theory:
Data visualization is the technique used to deliver insights from data using visual cues such as graphs,
charts, maps and many others. It is useful as it helps in the intuitive and easy understanding of large
quantities of data, and thereby in making better decisions about it.
Data visualization in R programming language:

• The popular data visualization tools that are available include Tableau, Plotly, R, Google Charts,
Infogram and Kibana.
• R is a language that is designed for statistical computing, graphical data analysis, and scientific
research. It is usually preferred for data visualization as it offers flexibility and minimum
required coding through its packages.

Consider the following air quality data set for visualization in R:

Ozone  Solar.R  Wind  Temp  Month  Day
41     190      7.4   67    5      1
36     118      8.0   72    5      2
12     149      12.6  74    5      3
18     313      11.5  62    5      4
NA     NA       14.3  56    5      5
28     NA       14.9  66    5      6

1. Bar plots:
Bar plots are used to compare groups of values measured against the frequency range of the
variable, to perform a comparative study between the various data categories in the data set,
and to analyze the change of a variable over time in months or years.
Input (for horizontal):

Output:

Input (for vertical):

Output:

2. Histogram:
A histogram is like a bar chart as it uses bars of varying height to represent data distribution.
However, in a histogram, values are grouped into consecutive intervals called bins. In a
Histogram, continuous values are grouped and displayed in these bins whose size can be
varied. It is used to divide values into groups of continuous ranges measured against the
frequency range of the variable.
Input:

Output:

3. Box Plot:

• The statistical summary of the given data is presented graphically using a boxplot. A
boxplot depicts information such as the minimum and maximum data points, the median value,
the first and third quartiles, and the interquartile range.
• It gives a comprehensive statistical description of the data through a visual cue and helps
identify the outlier points that do not lie in the interquartile range of the data.

Input:

Output:

4. Scatter Plot:
A scatter plot is composed of many points on a Cartesian plane. Each point denotes the values taken
by two parameters and helps us easily identify the relationship between them.

Input:

Output:

5. Heatmap:
A heatmap is used to display a relationship between two, three or many variables in a two-dimensional
image. It allows us to explore two dimensions on the axes and a third dimension through the intensity
of color.

Input:

Output:

6. 3D Graphs in R:
Here we use the persp() function, which is used to create 3D surfaces in perspective view. This
function draws perspective plots of a surface over the x-y plane.

Input:

Output:

7. Map visualization in R:
Here we use the maps package to visualize and display geographical maps using the R
programming language.
Input:

Output:

8. Ggplot2:

ggplot2 is a plotting system used to build professional-looking graphs quickly with minimal code.
It takes care of many complicated things that make plotting difficult. ggplot2 is very different
from base R plotting, but it is also very flexible and powerful.

Input:

Output:

Conclusion: Visualized data using R in R studio.

PRACTICAL NO: 08
AIM: To perform Exploratory Data Analysis using Spark/PySpark.

Theory:
EXPLORATORY DATA ANALYSIS (EDA) means understanding data sets by summarizing their main
characteristics, often by plotting them visually. This step is very important, especially when we come to
modeling the data in order to apply machine learning. Plotting in EDA includes histograms, box plots,
scatter plots and many more. It often takes much time to explore the data. Through the process of EDA,
we can define the problem statement or definition for our data set, which is very important.
Exploratory Data Analysis with the Diabetes Database:
• Initializing PySpark in Google Colab and accessing the database.
• Reading the .csv file and showing the top 5 rows of the database.
• Printing the schema in tree format.
• Describing the data frame and converting the summary to pandas.
• Printing the count of each value of the column "Outcome".
• Printing the distribution of the features (the steps up to this point are sketched in the code below).
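
The steps listed above can be reproduced with code along the following lines. This is a minimal
PySpark sketch rather than the exact notebook used; the CSV filename and column names are assumptions
based on the standard Pima Indians diabetes dataset:

from pyspark.sql import SparkSession

# Start a local Spark session (in Google Colab this follows installing pyspark with pip).
spark = SparkSession.builder.appName("DiabetesEDA").getOrCreate()

# Read the CSV file; "diabetes.csv" is an assumed filename.
df = spark.read.csv("diabetes.csv", header=True, inferSchema=True)

df.show(5)                              # top 5 rows
df.printSchema()                        # schema in tree format
print(df.describe().toPandas())         # summary statistics converted to pandas
df.groupBy("Outcome").count().show()    # count of each "Outcome" value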

User defined functions (UDF):
• Defining the function.
• Replacing the "Outcome" column with a new "HaveDiabetes" column.
• Grouping the database using the "Age" column.
• Calculating and showing the user count percentage on the basis of "Glucose".

Correlation:
• Computing the correlation matrix using the Pearson method.
• Drawing the heatmap into the currently active Axes using Seaborn (sketched in the code below).
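
Continuing the sketch, the UDF, grouping and correlation steps could look like this (again an
illustrative sketch, not the exact notebook; the label values and plot styling are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
import seaborn as sns
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("DiabetesEDA").getOrCreate()
df = spark.read.csv("diabetes.csv", header=True, inferSchema=True)   # as in the previous sketch

# UDF that maps the numeric Outcome to a label, then replaces the column.
have_diabetes = udf(lambda outcome: "yes" if outcome == 1 else "no", StringType())
df2 = df.withColumn("HaveDiabetes", have_diabetes(col("Outcome"))).drop("Outcome")
df2.groupBy("Age").count().orderBy("Age").show()     # grouping by the Age column

# Pearson correlation matrix computed via pandas, then drawn as a Seaborn heatmap.
pdf = df.toPandas()
corr = pdf.corr(method="pearson")
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()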

Conclusion: Successfully performed Exploratory Data Analysis using PySpark on Google Colab.

