BDA Practical
AIM:
HDFS basics and Hadoop ecosystem tools overview: installing Hadoop, copying files to Hadoop, copying from the Hadoop file system and deleting files, moving and displaying files in HDFS, and programming exercises on Hadoop.
THEORY:
Hadoop is an open-source framework that is used to efficiently store and process large datasets
ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store
and process the data, Hadoop allows clustering multiple computers to analyze massive datasets
in parallel more quickly.
Features of Hadoop:
1. Hadoop is an open-source project, which means its source code is available free of cost for inspection, modification, and analysis, allowing enterprises to modify the code as per their requirements.
2. A Hadoop cluster is scalable: we can add any number of nodes (horizontal scaling) or increase the hardware capacity of individual nodes (vertical scaling) to achieve high computation power. This provides horizontal as well as vertical scalability to the Hadoop framework.
3. Fault tolerance is the most important feature of Hadoop. HDFS in Hadoop 2 uses a replication mechanism to provide fault tolerance.
4. This replication also ensures high availability of the data, even in unfavorable conditions.
5. A Hadoop cluster consists of nodes of commodity hardware that are inexpensive, thus providing a cost-effective solution for storing and processing big data.
6. Hadoop stores data in a distributed fashion, which allows the data to be processed in parallel on a cluster of nodes. This gives the Hadoop framework very fast processing capability.
7. Hadoop is popularly known for its data locality feature, which means moving the computation logic to the data rather than moving the data to the computation logic. This reduces bandwidth utilization in the system.
8. Hadoop can process unstructured data, making it feasible for users to analyze data of any format and size.
9. Hadoop is easy to use, as clients don't have to worry about distributed computing; the processing is handled by the framework itself.
10. Due to the replication of data in the cluster, data is stored reliably on the cluster machines despite machine failures.
• Standalone (Local) Mode: In this mode, Hadoop runs as a single Java process on one machine, using the local file system instead of HDFS. This mode is useful for debugging purposes: you can first test-run your MapReduce application in this mode on small data before actually executing it on a cluster with big data.
• Pseudo-Distributed Mode: In this mode also, the Hadoop software is installed on a single node. The various Hadoop daemons run on the same machine as separate Java processes. Hence all the daemons, namely NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker, run on a single machine.
• Fully Distributed Mode: In this mode, the daemons NameNode, JobTracker, and SecondaryNameNode (optional, and can be run on a separate node) run on the master node, while the DataNode and TaskTracker daemons run on the slave nodes.
HADOOP INSTALLATION:
Then, depending on the version you want to install, execute one of the following commands:
Oracle JDK 7: sudo apt-get install oracle-java7-installer
Oracle JDK 8: sudo apt-get install oracle-java8-installer
3. Now, let us set up a new user account for the Hadoop installation. This step is optional, but recommended because it gives you the flexibility of having a separate account for the Hadoop installation, separating it from other software installations.
a. sudo adduser hadoop_dev (Upon executing this command, you will be prompted to enter the new password for this user. Please enter the password, enter the other details, and don't forget to save the details at the end.)
b. su - hadoop_dev (Switches from the current user to the newly created user, i.e. hadoop_dev.)
4. Edit the configuration file /home/hadoop_dev/hadoop2/etc/hadoop/hadoop-env.sh and set JAVA_HOME in that file.
a. vim /home/hadoop_dev/hadoop2/etc/hadoop/hadoop-env.sh
b. Uncomment JAVA_HOME and update it with the following line:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle (Please check your relevant Java installation path and set this value accordingly. Recent versions of Hadoop require JDK 1.7 or later.)
5. This finishes the Hadoop setup in stand-alone mode.
6. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/core-site.xml as below:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
8. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/mapred-site.xml as below:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Basic Operations:
• copyFromLocal: copy a file from the local file system to HDFS.
Usage: hadoop fs -copyFromLocal <localsrc> <dst>
Example (illustrative paths): hadoop fs -copyFromLocal C:\Hadoop_File\airport-codes-na.txt /user/hadoop_dev/
• copyToLocal: copy a file from HDFS to the local file system.
Usage: hadoop fs -copyToLocal <src> <localdst>
Example (illustrative paths): hadoop fs -copyToLocal /user/hadoop_dev/airport-codes-na.txt C:\Hadoop_File\
• mv: move a file from source to destination within HDFS.
Usage: hadoop fs -mv <src> <dst>
Example (illustrative paths): hadoop fs -mv /user/hadoop_dev/airport-codes-na.txt /user/hadoop_dev/archive/
• rm: remove a file or directory in HDFS. Removes the files specified as arguments; deletes a directory only when it is empty.
Usage: hadoop fs -rm <path>
Example (illustrative path): hadoop fs -rm /user/hadoop_dev/archive/airport-codes-na.txt
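Putting these operations together, a typical session for copying, listing and displaying files in HDFS might look like the following (the local file name and HDFS paths are illustrative, not the ones from the original run):
hadoop fs -mkdir -p /user/hadoop_dev/input
hadoop fs -put airport-codes-na.txt /user/hadoop_dev/input/
hadoop fs -ls /user/hadoop_dev/input
hadoop fs -cat /user/hadoop_dev/input/airport-codes-na.txt
hadoop fs -get /user/hadoop_dev/input/airport-codes-na.txt ./airport-codes-copy.txt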
PRACTICAL NO: 02
AIM:
Use of Sqoop tool to transfer data between Hadoop and relational database servers.
a. Sqoop - Installation.
b. To execute basic commands of Hadoop eco system component Sqoop.
THEORY:
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is
used to import data from relational databases such as MySQL, Oracle to Hadoop HDFS, and
export from Hadoop file system to relational databases.
Sqoop has two main functions: Importing and Exporting.
Sqoop Import: The import tool imports individual tables from RDBMS to HDFS. Each row in
a table is treated as a record in HDFS. All records are stored as text data in text files or as
binary data in Avro and Sequence files.
Sqoop Export: The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in the table. These are read and parsed into a set of records, delimited with a user-specified delimiter.
Sqoop Features:
Sqoop has several features, which makes it helpful in the Big Data world:
1. Parallel Import/Export: Sqoop uses the YARN framework to import and export data. This provides fault
tolerance on top of parallelism.
2. Import Results of an SQL Query: Sqoop enables us to import the results returned from an SQL query
into HDFS.
3. Connectors For All Major RDBMS Databases: Sqoop provides connectors for multiple RDBMSs, such as MySQL and Microsoft SQL Server.
4. Kerberos Security Integration: Sqoop supports the Kerberos computer network authentication protocol, which enables nodes communicating over an insecure network to authenticate users securely.
5. Provides Full and Incremental Load: Sqoop can load the whole table at once, or load only the rows added since the last import (incremental load).
Advantages of Sqoop:
• With the help of Sqoop, we can perform transfer operations of data with a variety of structured data
stores like Oracle, Teradata, etc.
• Sqoop helps us to perform ETL operations in a very fast and cost-effective manner.
• With the help of Sqoop, we can perform parallel processing of data, which speeds up the overall process.
• Sqoop uses the MapReduce mechanism for its operations which also supports fault tolerance.
Disadvantages of Sqoop:
• A failure during an import or export operation needs special handling and a recovery procedure.
• Sqoop uses a JDBC connection to connect to the relational database management system, which can be inefficient.
• The performance of a Sqoop export operation depends on the hardware configuration of the relational database management system.
SQOOP INSTALLATION:
SQOOP COMMANDS:
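The command screenshots are not reproduced here. As a reference, a basic import and export with Sqoop look roughly like the following; the JDBC URL, credentials, table names, and HDFS paths below are placeholders, not the values used in the original run:
sqoop list-databases --connect jdbc:mysql://localhost --username root -P
sqoop import --connect jdbc:mysql://localhost/employees --username root -P --table emp --target-dir /user/cloudera/emp_import -m 1
sqoop export --connect jdbc:mysql://localhost/employees --username root -P --table emp_backup --export-dir /user/cloudera/emp_import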
Conclusion: Used Sqoop tool to transfer data between Hadoop and relational database servers.
PRACTICAL NO: 03
AIM:
Installation of MongoDB and execution of basic commands of the MongoDB (NoSQL) database using the MongoDB shell and Compass.
STEP 2:
Now you have to download and install MongoDB Shell from
(https://www.mongodb.com/try/download/shell)
Extract the MongoDB Shell archive to the local disk and run the shell from the bin folder (bin -> mongosh.exe).
Description: What is Mongo Shell?
MongoDB provides an interactive mongo shell based on JavaScript. This provides a command line
interface between users and MongoDB databases.
STEP 3:
Open the MongoDB compass application
STEP 4:
COPY THE URI OF THE NEW CONNECTION (e.g. mongodb://localhost:27017)
Description: You need to specify the hostname to connect to a remote database. Also specify the port if MongoDB is running on a different port.
STEP 5:
PASTE THE COPIED URI (e.g. mongodb://localhost:27017)
Description: Type this connection string (port 27017) into the mongosh.exe application and press Enter. After pressing Enter, the new connection to localhost is established in the shell.
Syntax:
> use DATABASE_NAME
Limitations
MongoDB database names cannot be empty and must have fewer than 64 characters.
Create Database
First of all, use the below command to select the database. To create a new database, make sure a database with that name does not already exist.
> use exp3
You now have a database name selected. Next you need to insert at least one document to keep this database. As you have already executed the use command above, all the following statements will be executed on this database.
> db.users.insert({ id: 1 })
Show Databases – After inserting the first record, the database will be listed by the show dbs command.
2) MongoDB show databases
In MongoDB, you can use the show dbs command to list all databases on a MongoDB server.
This will show you the database name, as well as the size of the database in gigabytes.
Example:
> show dbs;
Output
>show collections
3) MongoDB Delete Database
The dropDatabase() command is used to delete the current database along with all the associated data files. To remove a database, first you need to select that database. You can use the show dbs command to list the available databases.
> db.dropDatabase()
Syntax:
Use the db.createCollection() command to create a collection. You need to specify the name of the collection, and you can also provide optional options such as capped size and indexing.
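For example (the collection names and options below are illustrative):
> db.createCollection("users")
> db.createCollection("logs", { capped: true, size: 1048576 })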
6) MongoDB Get Collection Information
To find detailed information about the database collections, use the db.getCollectionInfos() function. This returns an array of documents with collection information, such as name and options.
> db.getCollectionInfos();
You can also use a filter with the above command; for example, to fetch the details of a specific collection by its name, use the following command:
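(The collection name below is illustrative.)
> db.getCollectionInfos({ name: "users" });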
MongoDB Drop Collection
Finally, delete the collection (here named exp5) from the current database using the method below.
> db.exp5.drop();
OUTPUT
You can also use the drop() method with getCollection() to remove an existing collection, as follows.
> db.getCollection("exp10").drop();
OUTPUT
Execute the find() function on the collection without any condition to get all the available documents in the collection.
> db.users.find()
You can also use the pretty() function with the above command to show formatted output.
> db.users.find().pretty();
Update Document
Now change the location from India to Australia for the document where id = 1001, as shown below.
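(A sketch using updateOne; the collection and field names follow the surrounding examples and may differ from the original screenshot.)
> db.users.updateOne({ id: 1001 }, { $set: { location: "Australia" } })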
14) MongoDB db.hostInfo
Returns a document with information about the underlying system that the mongod or mongos runs on. Some of the returned fields are only included on some platforms.
db.hostInfo() is a mongosh helper method around the hostInfo command. The output of db.hostInfo() on a Linux system will resemble the following:
> db.hostInfo()
db.cloneDatabase("hostname")
Deprecated since version 4.0. Starting in version 4.2, MongoDB removes
the clone command. The deprecated db.cloneDatabase(), which wraps the clone command,
can only be run against MongoDB 4.0 or earlier versions. For behavior and examples, refer to
the 4.0 or earlier version of the manual.
> db.cloneDatabase("hostname")
16) MongoDB db.help
Returns: text output listing the common methods on the db object.
> db.help()
CONCLUSION:
Successfully installed MongoDB and executed basic shell commands for creating, listing, and dropping databases, collections, and documents.
PRACTICAL NO: 04
AIM: Experiment on Hadoop MapReduce: write a program to implement word count using MapReduce.
Theory:
MapReduce is a processing technique and a program model for distributed computing based on java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of
data and converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs). Secondly, reduce task, which takes the output from a map as an input and combines
those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the
reduce task is always performed after the map job. The major advantage of MapReduce is that it is easy
to scale data processing over multiple computing nodes. Under the MapReduce model, the data
processing primitives are called mappers and reducers. Decomposing a data processing application into
mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce
form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in
a cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides. A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Map stage: The map or mapper’s job is to process the input data. Generally, the input data is in the
form of file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the
mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s
job is to process the data that comes from the mapper. After processing, it produces a new set of output,
which will be stored in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the
cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task completion,
and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an appropriate
result, and sends it back to the Hadoop server.
Inputs and Outputs (Java Perspective): The MapReduce framework operates on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes should be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. The input and output types of a MapReduce job are: (input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (output).
Terminology
PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
Mapper - Mapper maps the input key/value pairs to a set of intermediate key/value pair.
NameNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where data is presented in advance before any processing takes place.
MasterNode - Node where JobTracker runs and which accepts job requests from clients.
SlaveNode - Node where Map and Reduce program runs.
JobTracker - Schedules jobs and tracks the assigned jobs to the TaskTracker.
Task Tracker - Tracks the task and reports status to JobTracker.
Job - An execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
A) INSTALLATION
STEP 1: First, download CLOUDERA 5.13.0 and VirtualBox. After downloading, install VirtualBox (version 6.1.38 was used here) on the device.
STEP 2: Import the Cloudera 5.13.0 OVF file into VirtualBox and connect (start) the Cloudera VM in VirtualBox.
STEP 5: Click and open the Eclipse application from the running Cloudera quickstart-vm-5.13.0-0-virtualbox Oracle VM VirtualBox window.
STEP 6: Create a new Java project (or package), give it the name WordCount, and click to go to the next step.
STEP 7: The new Java project is created.
STEP 8: Go to Libraries, add the external JAR file (activation.jar) from the device, and finish the process.
B) IMPLEMENTATION
STEP 1: Type the necessary Java Code for the WordCount project
STEP 3: Save the WordCount Java code file to the system.
STEP 4: Click and open the terminal from the system and type all the Linux commands for MapReduce.
STEP 6: Type all the necessary commands to load the data for MapReduce.
INPUT
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver (omitted from the original listing): the standard Hadoop WordCount
  // main method that the "hadoop jar WordCount.jar WordCount ..." command invokes
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Dinesh
Dinesh Bharambe
Amit
Amit Kori
[cloudera@quickstart ~]$ hdfs dfs -ls
[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 6 items
drwxrwxrwx - hdfs supergroup 0 2017-10-23 09:15 /benchmarks
drwxr-xr-x - hbase supergroup 0 2022-09-24 04:25 /hbase
drwxr-xr-x - solr solr 0 2017-10-23 09:18 /solr
drwxrwxrwt - hdfs supergroup 0 2022-09-24 04:26 /tmp
drwxr-xr-x - hdfs supergroup 0 2017-10-23 09:17 /user
drwxr-xr-x - hdfs supergroup 0 2017-10-23 09:17 /var
[cloudera@quickstart ~]$ hdfs dfs -mkdir /inputfolder1
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/processfile1.txt
/inputfolder1/
[cloudera@quickstart ~]$ hdfs dfs -cat /inputfolder1/processfile1.txt
rohit
rohit jaiswal
Dinesh
Dinesh Bharambe
Amit
Amit Kori
[cloudera@quickstart ~]$
[cloudera@quickstart ~]$ hadoop jar /home/cloudera/WordCount.jar WordCount
/inputfolder1/processfile1.txt /out1
22/09/24 08:22:13 INFO client.RMProxy: Connecting to ResourceManager at
/0.0.0.0:8032
22/09/24 08:22:16 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
22/09/24 08:22:18 INFO input.FileInputFormat: Total input paths to process : 1
22/09/24 08:22:18 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/09/24 08:22:18 INFO mapreduce.JobSubmitter: number of splits:1
22/09/24 08:22:20 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1664018365437_0001
22/09/24 08:22:24 INFO impl.YarnClientImpl: Submitted application
application_1664018365437_0001
22/09/24 08:22:24 INFO mapreduce.Job: The url to track the job:
http://quickstart.cloudera:8088/proxy/application_1664018365437_0001/
22/09/24 08:22:24 INFO mapreduce.Job: Running job:
job_1664018365437_0001
22/09/24 08:24:17 INFO mapreduce.Job: Job job_1664018365437_0001 running
in uber mode : false
22/09/24 08:24:18 INFO mapreduce.Job: map 0% reduce 0%
22/09/24 08:25:33 INFO mapreduce.Job: map 100% reduce 0%
22/09/24 08:26:21 INFO mapreduce.Job: map 100% reduce 100%
22/09/24 08:26:22 INFO mapreduce.Job: Job job_1664018365437_0001
completed successfully
22/09/24 08:26:23 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=82
FILE: Number of bytes written=286867
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=185
HDFS: Number of bytes written=52
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=66083
Total time spent by all reduces in occupied slots (ms)=43237
Total time spent by all map tasks (ms)=66083
Total time spent by all reduce tasks (ms)=43237
Total vcore-milliseconds taken by all map tasks=66083
Total vcore-milliseconds taken by all reduce tasks=43237
Total megabyte-milliseconds taken by all map tasks=67668992
Total megabyte-milliseconds taken by all reduce tasks=44274688
Map-Reduce Framework
Map input records=6
Map output records=9
Map output bytes=94
Map output materialized bytes=82
Input split bytes=126
Combine input records=9
Combine output records=6
Reduce input groups=6
Reduce shuffle bytes=82
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=1060
CPU time spent (ms)=5440
Physical memory (bytes) snapshot=331444224
Virtual memory (bytes) snapshot=3015163904
Total committed heap usage (bytes)=226365440
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=59
File Output Format Counters
Bytes Written=52
[cloudera@quickstart ~]$ hdfs dfs -ls /out1
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2022-09-24 08:26 /out1/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 52 2022-09-24 08:26 /out1/part-r-00000
[cloudera@quickstart ~]$ hdfs dfs -cat /out1/part-r-00000
Amit 2
Bharambe 1
Dinesh 2
Kori 1
jaiswal 1
rohit 2
[cloudera@quickstart ~]$
CONCLUSION:
Hence, we successfully studied and implemented a word count program using MapReduce.
PRACTICAL NO: 05
AIM: Implementing a simple algorithm (matrix multiplication) using MapReduce.
Theory:
1. What is MapReduce?
● A software framework for distributed processing of large data sets
● The framework takes care of scheduling tasks, monitoring them
and re-executing any failed tasks.
● It splits the input data set into independent chunks that are
processed in a completely parallel manner.
● MapReduce framework sorts the outputs of the maps, which are
then input to the reduced tasks. Typically, both the input and the
output of the job is stored in a file system.
2. What is a Map Task?
MapReduce Steps:
● Input reader – divides input into appropriate size splits which get
assigned to a Map function
Example:
2×2 matrices A and B
Here matrix A is a 2×2 matrix which means the number of rows(i)=2 and the number of columns(j)=2.
Matrix B is also a 2×2 matrix where number of rows(j)=2 and number of columns(k)=2. Each cell of
the matrix is labelled as Aij and Bij. Ex. element 3 in matrix A is called A21 i.e. 2nd-row 1st column.
Now One step matrix multiplication has 1 mapper and 1 reducer.
The formula is:
Mapper for Matrix A: (key, value) = ((i, k), (A, j, Aij)) for all k
Mapper for Matrix B: (key, value) = ((i, k), (B, j, Bjk)) for all i
Therefore, computing the mapper for Matrix A: k, i and j each take 2 values. When k = 1, i can take the values 1 and 2, and each of these cases can further have j = 1 and j = 2. Substituting all the values in the formula gives the mapper output.
Therefore, computing the reducer: we can observe from the mapper computation that 4 keys are common, namely (1, 1), (1, 2), (2, 1) and (2, 2). Make separate lists for matrices A and B with the adjoining values taken from the mapper step above:
(1, 1) =>Alist ={(A, 1, 1), (A, 2, 2)}
Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(1*5) + (2*7)] =19 -------(i)
(1, 2) =>Alist ={(A, 1, 1), (A, 2, 2)}
Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(1*6) + (2*8)] =22 -------(ii)
(2, 1) =>Alist ={(A, 1, 3), (A, 2, 4)}
Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(3*5) + (4*7)] =43 -------(iii)
(2, 2) =>Alist ={(A, 1, 3), (A, 2, 4)}
Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(3*6) + (4*8)] =50 -------(iv)
From (i), (ii), (iii) and (iv) we conclude that
((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)
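To check the arithmetic, the following short Python sketch (not part of the original Hadoop program) simulates the one-step map, shuffle and reduce for the 2×2 example above and reproduces the results 19, 22, 43 and 50:

from collections import defaultdict

# 2x2 input matrices from the example above: A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]
A = {(1, 1): 1, (1, 2): 2, (2, 1): 3, (2, 2): 4}
B = {(1, 1): 5, (1, 2): 6, (2, 1): 7, (2, 2): 8}
K = 2  # number of columns of B (and of rows/columns of A)

# Map step: A emits ((i, k), ('A', j, Aij)) for all k; B emits ((i, k), ('B', j, Bjk)) for all i
mapped = []
for (i, j), a in A.items():
    for k in range(1, K + 1):
        mapped.append(((i, k), ('A', j, a)))
for (j, k), b in B.items():
    for i in range(1, K + 1):
        mapped.append(((i, k), ('B', j, b)))

# Shuffle: group the mapped values by key (i, k)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce step: for each key, multiply the A and B values with matching j and sum
result = {}
for (i, k), values in groups.items():
    a_vals = {j: v for m, j, v in values if m == 'A'}
    b_vals = {j: v for m, j, v in values if m == 'B'}
    result[(i, k)] = sum(a_vals[j] * b_vals[j] for j in a_vals)

print(result)  # {(1, 1): 19, (1, 2): 22, (2, 1): 43, (2, 2): 50}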
Conclusion:
Thus, we have studied how to implement MapReduce.
My input matrices A and B are:
Matrix A:
0 0 0
0 1 6
7 8 9
Matrix B:
6 7 4
9 1 3
7 6 2
Program:
/** The job 1 mapper class. */
{
private Path path;
private boolean matrixA;
private Key key = new Key();
private Value value = new Value();
j = indexPair.index2;
if (j < 0 || j >= J) badIndex(j, J, "B column index");
}
value.v = el.get();
if (matrixA) {
key.index1 = i/IB;
key.index3 = k/KB;
key.m = 0;
value.index1 = i % IB;
value.index2 = k % KB;
for (int jb = 0; jb < NJB; jb++) {
key.index2 = jb;
context.write(key, value);
if (DEBUG) printMapOutput(key, value);
}
} else {
key.index2 = j/JB;
key.index3 = k/KB;
key.m = 1;
value.index1 = k % KB;
value.index2 = j % JB;
for (int ib = 0; ib < NIB; ib++) {
key.index1 = ib;
context.write(key, value);
if (DEBUG) printMapOutput(key, value);
}
}
}
}
Output:
Map output for Matrix A:
Map Output for matrixB:
Output Matrix:
000
000
000
Conclusion:
Implemented simple algorithms in Map-Reduce.
PRACTICAL NO: 06
AIM: Implementing DGIM algorithm using Python Programming Language.
Theory:
The Datar-Gionis-Indyk-Motwani Algorithm (DGIM)
This version of the algorithm uses O(log² N) bits to represent a window of N bits, and allows us to estimate the number of 1's in the window with an error of no more than 50%. To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has timestamp 1, the second has timestamp 2, and so on. Since we only need to distinguish positions within the window of length N, we shall represent timestamps modulo N, so they can be represented by log₂ N bits. If we also store the total number of bits ever seen in the stream (i.e., the most recent timestamp) modulo N, then we can determine from a timestamp modulo N where in the current window the bit with that timestamp is.
We divide the window into buckets, each consisting of:
1. The timestamp of its right (most recent) end.
2. The number of 1's in the bucket, which must be a power of 2; this number is called the size of the bucket.
To represent a bucket, we need log₂ N bits to represent the timestamp (modulo N) of its right end. To represent the number of 1's we only need log₂ log₂ N bits. The reason is that we know this number i is a power of 2, say 2^j, so we can represent i by coding j in binary. Since j is at most log₂ N, it requires log₂ log₂ N bits. Thus, O(log N) bits suffice to represent a bucket.
There are six rules that must be followed when representing a stream by buckets:
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).
Advantages:
1. Stores only O(log² N) bits: O(log N) counts of O(log N) bits each.
2. Easy to update as more bits enter.
3. The error in the count is no greater than the number of 1's in the unknown area.
Drawbacks:
• As long as the 1s are fairly evenly distributed, the error due to the unknown region is small (no more than 50%).
• But it could be that all the 1s are in the unknown area (the part of the oldest bucket that has already slid out of the window). In that case, the error is unbounded.
Program:
import IPython, sys, itertools, time, math

K, N = 1000, 1000
k = int(math.floor(math.log(N, 2)))
t = 0
onesCount = 0
bucketList = []
for i in range(k+1):
    bucketList.append(list())

with open('engg5108_stream_data.txt') as f:
    while True:
        c = f.read(1)
        if not c:
            for i in range(k+1):
                for j in range(len(bucketList[i])):
                    print("Size of bucket: %d, timestamp: %d" % (pow(2, i), bucketList[i][j]))
                    earliestTimestamp = bucketList[i][j]
            for i in range(k+1):
                for j in range(len(bucketList[i])):
                    if bucketList[i][j] != earliestTimestamp:
                        onesCount = onesCount + pow(2, i)
                    else:
                        onesCount = onesCount + 0.5 * pow(2, i)
            print("Number of ones in last %d bits: %d" % (K, onesCount))
            break
        t = (t + 1) % N
        for i in range(k+1):
            for bucketTimestamp in bucketList[i]:
                if bucketTimestamp == t:
                    bucketList[i].remove(bucketTimestamp)
        if c == '1':
            bucketList[0].append(t)
            checkAndMergeBucket(bucketList, t)
        elif c == '0':
            continue
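The listing above calls checkAndMergeBucket(), whose definition is not reproduced in the original. A sketch of what it needs to do (merge the two oldest buckets whenever more than two buckets of one size exist), to be defined before the main loop, is:

def checkAndMergeBucket(bucketList, t):
    # DGIM invariant: at most two buckets of any one size.
    # When a third bucket of size 2^i appears, merge the two OLDEST buckets of
    # that size into one bucket of size 2^(i+1); the merged bucket keeps the
    # more recent (second-oldest) timestamp as its right end.
    # (t is unused here but kept to match the call site above.)
    for i in range(len(bucketList) - 1):
        if len(bucketList[i]) > 2:
            bucketList[i].pop(0)            # oldest bucket's timestamp is dropped
            merged = bucketList[i].pop(0)   # second-oldest timestamp survives
            bucketList[i + 1].append(merged)
        else:
            break  # if this size does not overflow, larger sizes cannot overflow now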
Output:
Conclusion: Thus, we have successfully implemented the DGIM algorithm using the Python programming language.
PRACTICAL NO: 07
AIM: To perform data visualization using the R programming language.
Theory:
Data visualization is the technique used to deliver insights in data using visual cues such as graphs, charts, maps and many others. This is useful as it helps in intuitive and easy understanding of large quantities of data and thereby helps in making better decisions regarding it.
Data visualization in R programming language:
• The popular data visualization tools that are available are Tableau, Plotly, R, Google Charts, Infogram, and Kibana.
• R is a language that is designed for statistical computing, graphical data analysis, and scientific
research. It is usually preferred for data visualization as it offers flexibility and minimum
required coding through its packages.
1. Bar plots:
A bar plot represents categorical data with rectangular bars whose heights are proportional to the values they represent. It is used to perform a comparative study between the various data categories in the data set and to analyze the change of a variable over time, in months or years.
Input (for horizontal):
Output:
Input (for vertical):
Output:
2. Histogram:
A histogram is like a bar chart as it uses bars of varying height to represent data distribution.
However, in a histogram, values are grouped into consecutive intervals called bins. In a
Histogram, continuous values are grouped and displayed in these bins whose size can be
varied. It is used to divide values into groups of continuous ranges measured against the
frequency range of the variable.
Input:
Output:
3. Box Plot:
• The statistical summary of the given data is presented graphically using a boxplot. A boxplot depicts information like the minimum and maximum data points, the median value, the first and third quartiles, and the interquartile range.
• It gives a comprehensive statistical description of the data through a visual cue and helps identify the outlier points that do not lie in the interquartile range of the data.
Input:
Output:
4. Scatter Plot:
A scatter plot is composed of many points on a Cartesian plane. Each point denotes the values taken by two parameters and helps us easily identify the relationship between them.
Input:
Output:
5. Heatmap:
A heatmap displays the relationship between two, three, or many variables in a two-dimensional image. It allows us to explore two dimensions on the axes and a third dimension through the intensity of color.
Input:
Output:
6. 3D Graphs in R:
Here we will use the persp() function, which is used to create 3D surfaces in perspective view. This function draws perspective plots of a surface over the x-y plane.
Input:
Output:
7. Map visualization in R:
Here we are using the maps package to visualize and display geographical maps using the R programming language.
Input:
Output:
8. ggplot2:
ggplot2 is a popular R package for creating graphics based on the grammar of graphics, where plots are built up in layers.
Input:
Output:
PRACTICAL NO: 08
AIM: To perform Exploratory Data Analysis using Spark/ Pyspark.
Theory:
EXPLORATORY DATA ANALYSIS (EDA) is understanding the data sets by summarizing their main characteristics, often plotting them visually. This step is very important, especially when we arrive at modeling the data in order to apply machine learning. Plotting in EDA consists of histograms, box plots, scatter plots and many more. It often takes much time to explore the data. Through the process of EDA, we can define the problem statement or definition on our data set, which is very important.
Exploratory Data Analysis with Diabetes Database:
• Initializing PySpark in Google Colab and accessing the database.
• Reading the CSV file and showing the top 5 rows of the dataset (a sketch of these steps follows below).
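A minimal sketch of these two steps; the file name diabetes.csv and the Colab install line are assumptions, so adjust them to your environment:

# In Google Colab, install PySpark first:  !pip install pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DiabetesEDA").getOrCreate()

# Read the diabetes CSV file (file name assumed) and show the top 5 rows
df = spark.read.csv("diabetes.csv", header=True, inferSchema=True)
df.show(5)
df.printSchema()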
• Describing the data frame and converting the summary to pandas (see the sketch below).
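A sketch of this step, assuming the DataFrame df from above:

# Summary statistics of all columns, converted to pandas for easier viewing
summary_pd = df.describe().toPandas()
print(summary_pd)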
User defined functions (UDF):
• Defining a function and registering it as a UDF.
• Calculating and showing the user count percentage on the basis of "Glucose" (a sketch follows below).
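A sketch of these steps; the UDF logic and the glucose threshold of 140 are illustrative assumptions, not the original notebook's code:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hypothetical UDF: bucket the Glucose reading into coarse categories
def glucose_level(g):
    if g is None:
        return "unknown"
    return "high" if g >= 140 else "normal"

glucose_level_udf = F.udf(glucose_level, StringType())
df2 = df.withColumn("GlucoseLevel", glucose_level_udf(F.col("Glucose")))

# User count per glucose level, with the percentage of the total
total = df2.count()
(df2.groupBy("GlucoseLevel")
    .count()
    .withColumn("percentage", F.round(F.col("count") / total * 100, 2))
    .show())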
• Correlation between the numeric columns.
• Drawing the heatmap into the currently-active Axes using Seaborn (a sketch follows below).
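A sketch of these steps; converting to pandas on the driver is one simple way to compute the correlation matrix for a small dataset:

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations of the numeric columns via pandas
pdf = df.toPandas()
corr = pdf.corr()

# Draw the heatmap into the currently-active Axes using Seaborn
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()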
Conclusion: Successfully performed Exploratory Data Analysis using PySpark on Google Colab.