Big Data Lab Manual and Syllabus

The document provides details about setting up a Hadoop cluster and performing various operations. It discusses installing Hadoop, configuring important files, launching services, and provides examples of commands and MapReduce programs. It also lists lab experiments including installation, file operations in HDFS, writing MapReduce programs, and using Pig, Hive and Spark on datasets.

GITAM DEEMED TO BE UNIVERSITY HYDERABAD

19ECS442P: BIG DATA LAB


Course objectives:

1. Provide the knowledge to set up a Hadoop cluster.
2. Discuss Hadoop commands to process big data.
3. Impart knowledge to develop programs using MapReduce.
4. Discuss Pig, Pig Latin and HiveQL to process big data.
5. Present the latest big data frameworks and applications using Spark.

Course Outcomes:
1. Understand the Hadoop working environment.
2. Implement commands to transfer files from a local server to HDFS.
3. Apply MapReduce programs to real-world problems.
4. Implement scripts using Pig to solve real-world problems.
5. Analyze datasets using Hive queries.

Lab Experiments for Big Data

1 Installation of Hadoop Cluster

2 Perform file management tasks in Hadoop:
a. Create a directory.
b. List the contents of a directory.
c. Upload and download a file.
d. See the contents of a file.
e. Copy a file from source to destination.
f. Move a file from source to destination.

3 MapReduce programming
a. Word count program using Java.
b. Word count program using Python.

4 Databases, Tables, Views, Functions, and Indexes

5 Write a program to perform matrix multiplication in Hadoop with a matrix size of n x n, where n > 1000.

7 Given the following table schema:
Employee_table {ID: INT, Name: Varchar(10), Age: INT, Salary: INT}
Loan_table {LoanID: INT, ID: INT, Loan_applied: Boolean, Loan_amt: INT}
a. Create a database and the above tables in Hive.
b. Insert records into the tables.
c. Write an SQL query to retrieve the details of employees who have applied for a loan.

8 Write a query to create a table that stores the records of employees working in the same department together in the same sub-directory in HDFS. The schema for the table is: Emp_table {id, name, dept, yoj}

9 Given the following table schemas:
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
+-----+---------------------+-------------+--------+
| OID | DATE                | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
Create the above tables in Hive and insert transaction records into them. Write an SQL query to find the details of the customers who have placed an order.
10 Understanding Spark

Textbook(s)
1. DT Editorial Services, Big Data Black Book, Dreamtech Publications, 2016.
2. Tomasz Drabas and Denny Lee, Learning PySpark, Packt Publishing, 2017.
3. Tom White, Hadoop: The Definitive Guide, 3/e, 4/e, O'Reilly, 2015.

Reference Book(s)
1. Bill Franks, Taming the Big Data Tidal Wave, 1/e, Wiley, 2012.
2. Frank J. Ohlhorst, Big Data Analytics, 1/e, Wiley, 2012.

Lab Manual for Big Data

1. Hadoop Installation (Linux)

Prerequisite Test
=============================
sudo apt update
sudo apt install openjdk-8-jdk -y
java -version; javac -version
sudo apt install openssh-server openssh-client -y
sudo adduser hdoop
su - hdoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost
Downloading Hadoop (note: the link below was updated to a newer Hadoop version on 6 May 2022)
===============================
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
tar xzf hadoop-3.2.3.tar.gz
Editing 6 important files
=================================
1st file
===========================
sudo nano .bashrc    (this may fail with a message that hdoop is not a sudoer)
If that happens, switch back to your admin user (here 'aman') and add hdoop to the sudo group:
su - aman
sudo adduser hdoop sudo
su - hdoop

sudo nano .bashrc


#Add below lines in this file
#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.2.3
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

#After saving the file, apply the changes:
source ~/.bashrc

2nd File
============================
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
#Add below line in this file in the end
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

3rd File
===============================
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
#Add the lines below to this file, between <configuration> and </configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system.</description>
</property>

4th File
====================================
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
#Add the lines below to this file, between <configuration> and </configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

5th File
================================================
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
#Add the lines below to this file, between <configuration> and </configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

6th File
==================================================
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
#Add the lines below to this file, between <configuration> and </configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
Launching Hadoop
hdfs namenode -format
./start-dfs.sh
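To run MapReduce jobs on YARN (as configured in mapred-site.xml and yarn-site.xml above), also start the YARN daemons and verify the running services with jps. A minimal sketch, assuming the scripts are run from the hadoop-3.2.3/sbin directory used above:

./start-yarn.sh
jps    # should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager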
Reference: https://www.youtube.com/watch?v=Ih5cuJYYz6Y
1.1 Hadoop Installation (Windows)
Cloudera QuickStart VM Installation
Step 1: Download and install VMware Workstation Player
Step 2: Download the Cloudera QuickStart VM
Step 3: Install the Cloudera QuickStart VM on VMware Workstation

1) Download and install VMware Workstation Player

Download VMware Workstation Player from the link below:

https://www.vmware.com/in/products/workstation-player/workstation-player-evaluation.html
Step 2: Download the Cloudera QuickStart VM

https://downloads.cloudera.com/demo_vm/vmware/cloudera-quickstart-vm-5.13.0-0-vmware.zip

Extract the file using 7-Zip or WinRAR.

Step 3: Install the Cloudera QuickStart VM on VMware Workstation
(Note: loading the VM takes time, depending on system performance.)

Cloudera exploration: Hue provides a GUI-based query editor.
2. Hadoop Commands

 Hadoop is an open-source distributed framework used to store and process large datasets. Hadoop stores data in HDFS and processes it using MapReduce and YARN.
 hadoop fs and hdfs dfs are the file system commands used to interact with HDFS.
 These commands are very similar to Unix commands.

Unix commands                          Hadoop commands
default path: /home/cloudera           default path: /user/cloudera

a) ls command: lists all the files in a directory.

Unix: [cloudera@quickstart ~]$ ls
HDFS: [cloudera@quickstart ~]$ hdfs dfs -ls
  or  [cloudera@quickstart ~]$ hadoop fs -ls

NOTE: Both Unix and HDFS commands can be typed from the same terminal.

b) mkdir: creates a directory.

[cloudera@quickstart ~]$ cd demoLocal
[cloudera@quickstart demoLocal]$ cd ..
[cloudera@quickstart ~]$ clear      --- to clear the screen

Note: the default path for HDFS files is /user/cloudera

[cloudera@quickstart ~]$ hdfs dfs -mkdir demoHdfs
[cloudera@quickstart ~]$ hdfs dfs -ls

To check whether the demoHdfs directory was created, do the following:
i) open the browser -> click HUE (username and password: cloudera)
[cloudera@quickstart ~]$ hadoop version
Hadoop 2.6.0-cdh5.13.0
Subversion http://github.com/cloudera/hadoop -r 42e8860b182e55321bd5f5605264da4adc8882be
Compiled by jenkins on 2017-10-04T18:08Z
Compiled with protoc 2.5.0
From source with checksum 5e84c185f8a22158e2b0e4b8f85311
This command was run using /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.13.0.jar
[cloudera@quickstart ~]$
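The remaining file-management tasks from experiment 2 (upload, download, view, copy, move) can be performed with the commands below; a minimal sketch in which the file and directory names are placeholders:

[cloudera@quickstart ~]$ hdfs dfs -put sample.txt demoHdfs          # upload a local file to HDFS
[cloudera@quickstart ~]$ hdfs dfs -get demoHdfs/sample.txt .        # download a file from HDFS
[cloudera@quickstart ~]$ hdfs dfs -cat demoHdfs/sample.txt          # see the contents of a file
[cloudera@quickstart ~]$ hdfs dfs -cp demoHdfs/sample.txt demoHdfs/sample_copy.txt     # copy within HDFS
[cloudera@quickstart ~]$ hdfs dfs -mv demoHdfs/sample_copy.txt demoHdfs/sample_moved.txt   # move/rename within HDFS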
3. MapReduce Programming
a. MapReduce Word Count Using Java

Steps to execute MapReduce word count

o Create a text file in your local machine and write some text into it.
$ nano data.txt

o Check the text written in the data.txt file.


$ cat data.txt

o Create a directory in HDFS where the text file will be kept.


$ hdfs dfs -mkdir /test
o Upload the data.txt file on HDFS in the specific directory.
$ hdfs dfs -put /home/codegyani/data.txt /test
Write the MapReduce program using Eclipse.

WC_Mapper.java
package com.javatpoint;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (word, 1) for every token in the input line
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

WC_Reducer.java
package com.javatpoint;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum the counts emitted for each word
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

WC_Runner.java
package com.javatpoint;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WC_Runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}



o Create the jar file of this program and name it wordcountdemo.jar.
o Run the jar file:
hadoop jar /home/codegyani/wordcountdemo.jar com.javatpoint.WC_Runner /test/data.txt /r_output
o The output is stored in /r_output/part-00000.

o Now execute the command to see the output.


hdfs dfs -cat /r_output/part-00000
b. Word Count Using Python
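A common way to run word count in Python is Hadoop Streaming. The sketch below is a minimal example; the script names mapper.py and reducer.py, the streaming jar path (/usr/lib/hadoop-mapreduce/hadoop-streaming.jar on the Cloudera VM), and the /py_output directory are illustrative assumptions, so adjust them to your installation. The input is the data.txt uploaded to /test earlier.

mapper.py
#!/usr/bin/env python
# mapper.py - emit (word, 1) for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

reducer.py
#!/usr/bin/env python
# reducer.py - sum the counts for each word (the streaming framework sorts lines by key)
import sys

current_word = None
current_count = 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))

Run the job and view the result:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file mapper.py -mapper "python mapper.py" \
    -file reducer.py -reducer "python reducer.py" \
    -input /test/data.txt -output /py_output

hdfs dfs -cat /py_output/part-00000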
4. Databases, Tables, Views, Functions, and Indexes
1. CREATE
2. SHOW
3. DESCRIBE
4. USE
5. DROP
6. ALTER
7. TRUNCATE

Table-1 Hive DDL commands


DDL Command Use With

CREATE Database, Table

SHOW Databases, Tables, Table Properties, Partitions, Functions, Index

DESCRIBE Database, Table, view

USE Database

DROP Database, Table

ALTER Database, Table

TRUNCATE Table

Before moving forward, note that the Hive commands are case-insensitive.
CREATE DATABASE is the same as create database.
1) Open Hue -> Under Query -> Choose Editor -> Click on Hive

Note: To execute a query, select its row and click the Run button.
2) To Check Available Databases (Hive contains a default database)
SHOW DATABASES

3) To create a new database

CREATE DATABASE GITAM

4) To verify that the database is available, run the first row again, i.e., SHOW DATABASES

5) USE: switch to the new database, e.g., USE GITAM

6) DESCRIBE: show the database details, e.g., DESCRIBE DATABASE GITAM

7) DROP: delete a database, e.g., DROP DATABASE GITAM

8) Before creating a table, check the following path:
/user/hive/warehouse/

Tables can be created in two ways


(i) Internal Table, also known as Managed Table.

(ii) External Table

(i) Internal Table Syntax:

CREATE table table_name(schema);

Example: create table gitam_table(abc STRING);

(ii) External Table Syntax:

CREATE external table table_name(schema);

Example: create external table hyd_table(abc STRING);

(i) Internal Table Example


SHOW Tables

Create a text document at /home/cloudera/(folder name)/(text file name)

Example: /home/cloudera/hive_practice/sample.txt
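To load that file into the managed table, run the following (a minimal sketch; the path follows the example above):

load data local inpath 'file:///home/cloudera/hive_practice/sample.txt' into table gitam_table;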

To check the content inside the table:

SELECT * FROM gitam_table

To check whether the table was created, check the following path:


/user/hive/warehouse/…

External table:
For an internal (managed) table, the default path to store the table data is /user/hive/warehouse/…

For an external table, we can store the table data at a path we specify.

To create an external table

create external table (table name)(abc STRING) location '/user/cloudera/(folder name)/(table name)';

To load the data into the table


load data local inpath 'file:///home/cloudera/(folder name)/sample.txt' into table (table name);
To retrieve the records of the table
SELECT * FROM (table name)

To drop a table: drop table (table name);
Note: Observe the two table paths, i.e., the managed table path and the external table path.
5. Implement matrix multiplication with Hadoop Map Reduce.
Mapper Logic:

import java.io.IOException;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Input lines are of the form: M,i,j,value  or  N,j,k,value
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m"));   // number of rows of M
        int p = Integer.parseInt(conf.get("p"));   // number of columns of N
        String line = value.toString();
        String[] indicesAndValue = line.split(",");
        Text outputKey = new Text();
        Text outputValue = new Text();
        if (indicesAndValue[0].equals("M")) {
            // Element M(i,j) contributes to every output cell (i,k), k = 0..p-1
            for (int k = 0; k < p; k++) {
                outputKey.set(indicesAndValue[1] + "," + k);
                outputValue.set("M," + indicesAndValue[2] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        } else {
            // Element N(j,k) contributes to every output cell (i,k), i = 0..m-1
            for (int i = 0; i < m; i++) {
                outputKey.set(i + "," + indicesAndValue[2]);
                outputValue.set("N," + indicesAndValue[1] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        }
    }
}
Reducer Logic:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class MatrixReducer extends Reducer<Text, Text, Text, Text> {
    // For each output cell (i,k), compute sum over j of M(i,j) * N(j,k)
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String[] value;
        HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
        HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
        for (Text val : values) {
            value = val.toString().split(",");
            if (value[0].equals("M")) {
                hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            } else {
                hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            }
        }
        int n = Integer.parseInt(context.getConfiguration().get("n")); // shared dimension
        float result = 0.0f;
        float a_ij;
        float b_jk;
        for (int j = 0; j < n; j++) {
            a_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
            b_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
            result += a_ij * b_jk;
        }
        if (result != 0.0f) {
            context.write(null, new Text(key.toString() + "," + Float.toString(result)));
        }
    }
}
Driver Logic:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MatrixDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // M is an m-by-n matrix; N is an n-by-p matrix.
        conf.set("m", "2");
        conf.set("n", "2");
        conf.set("p", "2");
        Job job = Job.getInstance(conf, "MatrixMultiplication");
        job.setJarByClass(MatrixDriver.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(MatrixMapper.class);
        job.setReducerClass(MatrixReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.submit();
    }
}
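The mapper above expects one matrix element per input line in the form matrixname,row,column,value. A minimal sketch of a sample input file (matrix.txt) and the run commands follows; the jar name, HDFS paths and sample values are illustrative assumptions:

M,0,0,1
M,0,1,2
M,1,0,3
M,1,1,4
N,0,0,5
N,0,1,6
N,1,0,7
N,1,1,8

hdfs dfs -mkdir /matrix_input
hdfs dfs -put matrix.txt /matrix_input
hadoop jar matrixmultiplication.jar MatrixDriver /matrix_input /matrix_output
hdfs dfs -cat /matrix_output/part-r-00000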
7. Given the following table schema:
Employee_table {ID: INT, Name: Varchar(10), Age: INT, Salary: INT}
Loan_table {LoanID: INT, ID: INT, Loan_applied: Boolean, Loan_amt: INT}
a. Create a database and the above tables in Hive.
b. Insert records into the tables.
c. Write an SQL query to retrieve the details of employees who have applied for a loan.

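A minimal HiveQL sketch covering the three steps (the database name and sample records are illustrative assumptions):

CREATE DATABASE IF NOT EXISTS emp_db;
USE emp_db;

CREATE TABLE employee_table (id INT, name VARCHAR(10), age INT, salary INT);
CREATE TABLE loan_table (loanid INT, id INT, loan_applied BOOLEAN, loan_amt INT);

INSERT INTO employee_table VALUES (1, 'Ravi', 30, 50000), (2, 'Sita', 28, 45000);
INSERT INTO loan_table VALUES (101, 1, true, 200000);

-- Employees who have applied for a loan
SELECT e.* FROM employee_table e JOIN loan_table l ON e.id = l.id
WHERE l.loan_applied = true;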
8. Write a query to create a table that stores the records of employees working in the same department together in the same sub-directory in HDFS. The schema for the table is given below: Emp_table: {id, name, dept, yoj}

Display the Employee ID, Name, Department, Job ID, and Salary of all employees.
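A minimal sketch using a partitioned table, so that the rows of each department are stored in their own sub-directory under the table's warehouse directory (the sample values are illustrative assumptions):

CREATE TABLE emp_table (id INT, name STRING, yoj INT)
PARTITIONED BY (dept STRING);

INSERT INTO emp_table PARTITION (dept='CSE') VALUES (1, 'Ravi', 2019);
INSERT INTO emp_table PARTITION (dept='ECE') VALUES (2, 'Sita', 2020);

-- Each department now has its own sub-directory, e.g. /user/hive/warehouse/emp_table/dept=CSE
SELECT * FROM emp_table;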
9. Given the following table schemas:

+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+

+-----+---------------------+-------------+--------+
| OID | DATE                | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+

Create the above tables in Hive and insert transaction records into them. Write an SQL query to find the details of the customers who have placed an order.
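A minimal HiveQL sketch (the table names and sample records are illustrative assumptions; the DATE column is renamed order_date to avoid the reserved word, and the other columns follow the schemas above):

CREATE TABLE customers (id INT, name STRING, age INT, address STRING, salary FLOAT);
CREATE TABLE orders (oid INT, order_date STRING, customer_id INT, amount INT);

INSERT INTO customers VALUES (1, 'Ramesh', 32, 'Hyderabad', 20000);
INSERT INTO orders VALUES (102, '2009-10-08', 1, 3000);

-- Customers who have placed at least one order
SELECT DISTINCT c.* FROM customers c JOIN orders o ON c.id = o.customer_id;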
10. Understanding Spark
A Spark program is written in terms of operations on RDDs. An RDD is partitioned and distributed across the cluster nodes and can be stored on disk or in memory. RDDs are manipulated using a set of parallel transformations and actions. Spark keeps track of how each RDD was created, so that it can be rebuilt in case of job failures or slow workers.

What is an RDD?
 RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark.
 Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
 The Spark context (sc) is used to create RDDs.
 RDDs are immutable, i.e., once created they cannot be changed; changes made to an RDD can only be stored in another RDD.

RDDs can be constructed in 3 ways


1) Create from python list
#create python list
lst=range(5,20)
lst

out[]: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Syntax: sc.parallelize(list, numPartitions)

#create RDD using python list.


list_rdd1=sc.parallelize(lst,3)
print type(list_rdd1)

<class 'pyspark.rdd.RDD'>

2) By transforming the existing RDD

#any transformation on an existing RDD produces a new RDD, e.g. the identity map

rdd2 = list_rdd1.map(lambda i: i)

3) By reading from a file

#to read from a file

r3 = sc.textFile('/user/cloudera/mr/file1')       # path in HDFS
sc.textFile('file:///home/cloudera/path/')        # path on the local file system

Operations on RDDs:

We can perform two types of operations on RDDs namely transformations


and actions.
 A Spark transformation is a function that produces a new RDD from existing RDDs.
 In other words, transformations are functions that take an RDD as the input and produce one or more RDDs as the output.
 Transformations are lazy, i.e., they are not computed immediately but are executed only when an action is run.
 Spark remembers the set of transformations applied to the base dataset, applies optimizations, and executes them when an action is applied.
 This also helps in automatic recovery from failed or slow machines.

Actions: execute the transformations and bring the data from the workers to the driver.

The following table lists actions, which return values to the driver.

S.No  Action                       Meaning
1     first()                      Returns the first element of the RDD.
2     take(n)                      Returns the first n elements of the RDD.
3     collect()                    Returns all the elements of the RDD to the driver.
4     count()                      Returns the number of elements in the RDD.
5     reduce(func)                 Aggregates the elements of the RDD using func.
7     takeOrdered(n, [ordering])   Returns the first n elements of the RDD in the specified order.

#first() returns the first element of the RDD.


list_rdd1.first()

Out[25]: 5

#take will display first N elements of the RDD.


list_rdd1.take(3)
Out[]: [5,6,7]

#collect action will get all the elements of the RDD to the driver
rdd2.collect()

out[]: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

# count action counts the number of elements in the RDD


rdd2.count()

Out[]: 15

#reduce action reduces to a single value by applying a function

x = rdd2.reduce(lambda i, j: i + j)

print type(x), x

<type 'int'> 180

S.No  Transformation
1     distinct([numTasks])
2     filter(func)
3     map(func)
4     flatMap(func)
5     union(otherDataset)
6     intersection(otherDataset)
7     groupByKey([numTasks])
8     reduceByKey(func, [numTasks])
9     join(otherDataset, [numTasks])
10    cartesian(otherDataset)

# create a list

lst=[('modi','pm'),('Babu','cm'),
('Naidu','vp'),('kcr','cm'),
('narasimham','governor'),
('uma','minister'),
('ktr','minister'),('Babu','cm'),
('narasimham','governor')]

# convert the list to an RDD
rdd_lst = sc.parallelize(lst)
rdd_lst.collect()
Out[3]: [('modi', 'pm'),
('Babu', 'cm'),
('Naidu', 'vp'),
('kcr', 'cm'),
('narasimham', 'governor'),
('uma', 'minister'),
('ktr', 'minister'),
('Babu', 'cm'),
('narasimham', 'governor')]

#distinct transformation is used to select unique elements of RDD

distinct_rddlist = rdd_lst.distinct()
distinct_rddlist.collect()
Out[7]: [('uma', 'minister'),
('kcr', 'cm'),
('Babu', 'cm'),
('Naidu', 'vp'),
('modi', 'pm'),
('ktr', 'minister'),
('narasimham', 'governor')]

#filter transformation is used to select the elements satisfying the #function


filter_rddlist = rdd_lst.filter(lambda element: element[1] == 'cm')

filter_rddlist.collect()
Out[9]: [('Babu', 'cm'),
('kcr', 'cm'),
('Babu', 'cm')]

#map: returns a new distributed dataset, formed by passing each element
#of the source through a function

map_rddlist = rdd_lst.map(lambda (x, y): (y, x))   # swap key and value (Python 2 lambda syntax)

map_rddlist.collect()
Out[30]: [('pm', 'modi'),
('cm', 'Babu'),
('vp', 'Naidu'),
('cm', 'kcr'),
('governor', 'narasimham'),
('minister', 'uma'),
('minister', 'ktr'),
('cm', 'Babu'),
('governor', 'narasimham')]

#SortByKey: When called on a dataset of (K, V) pairs where K implements


Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or
descending order, as specified in the Boolean ascending argument.

#sort the input RDD by the key value

sortbykey_rdd = map_rddlist.sortByKey()

sortbykey_rdd.collect()
Out[60]: [('cm', 'Babu'),
('cm', 'kcr'),
('cm', 'Babu'),
('governor', 'narasimham'),
('governor', 'narasimham'),
('minister', 'uma'),
('minister', 'ktr'),
('pm', 'modi'),
('vp', 'Naidu')]

#groupByKey(): When called on a dataset of (K, V) pairs, returns a dataset


of #(K, Iterable<V>) pairs.

#Group the values for each key in the RDD into a single sequence
groupbykey_rdd = map_rddlist.groupByKey()
groupbykey_rdd.collect()
Out[62]: [('vp', <pyspark.resultiterable.ResultIterable at
0x7f75041de450>), ('minister', <pyspark.resultiterable.ResultIterable at
0x7f75040fe810>), ('governor', <pyspark.resultiterable.ResultIterable at
0x7f75040fe990>), ('cm', <pyspark.resultiterable.ResultIterable at
0x7f75040fe850>), ('pm', <pyspark.resultiterable.ResultIterable at
0x7f75040fe9d0>)]

for x in groupbykey_rdd.collect():
    print x[0], list(x[1])

vp ['Naidu']
minister ['uma', 'ktr']
governor ['narasimham', 'narasimham']
cm ['Babu', 'kcr', 'Babu']
pm ['modi']

#reduceByKey() : When called on a dataset of (K, V) pairs, returns a dataset


of #(K, V) pairs where the values for each key are aggregated using the
given reduce #function func, which must be of type (V, V) ⇒ V.

#Merge the values for each key (applied to rdd_lst, whose keys are the names)

reducebykey_rdd = rdd_lst.reduceByKey(lambda x, y: x + y)
reducebykey_rdd.collect()

Out[58]: [('modi', 'pm'),


('narasimham', 'governorgovernor'),
('Naidu', 'vp'),
('uma', 'minister'),
('Babu', 'cmcm'),
('ktr', 'minister'),
('kcr', 'cm')]
rdd1=sc.parallelize(range(5,10))
rdd2=sc.parallelize(range(8,15))
list1=[1,2,3,4,5]
rdd3 = sc.parallelize(list1)

rdd1.collect()

Out[70]: [5, 6, 7, 8, 9]

rdd2.collect()
Out[71]: [8, 9, 10, 11, 12, 13, 14]

rdd3.collect()
Out[72]: [1, 2, 3, 4, 5]
#flatMap transformation is similar to map, but each input item can be
#mapped to 0 or more output items

flatmap_rdd3=rdd3.flatMap(lambda i : (i,i+5, i*5))


#apply flat map
flatmap_rdd3.collect()
Out[74]: [1, 6, 5, 2, 7, 10, 3, 8, 15, 4, 9, 20, 5, 10, 25]

#takeOrdered : Get the N elements from a RDD ordered in ascending order or


as specified by #the optional key function.

flatmap_rdd3.takeOrdered(7)
Out[75]: [1, 2, 3, 4, 5, 5, 6]

#to order in descending order


flatmap_rdd3.takeOrdered(flatmap_rdd3.count(), lambda x: -x)
Out[92]: [25, 20, 15, 10, 10, 9, 8, 7, 6, 5, 5, 4, 3, 2, 1]

#Union and Intersection

# union returns a new RDD that contains the union of the elements in the
source dataset and the argument
# it includes duplicate elements also

rddunion=rdd1.union(rdd2)
rddunion.collect()
Out[77]: [5, 6, 7, 8, 9, 8, 9, 10, 11, 12, 13, 14]
#intersection returns a new RDD that contains the intersection of elements in
the source dataset and the argument.

rddintersection = rdd1.intersection(rdd2)
rddintersection.collect()
Out[82]: [8, 9]

# Cartesian: When called on datasets of types T and U, returns a dataset of


# (T, U) pairs (all pairs of elements).

rddcartesian=rdd1.cartesian(rdd2)
rddcartesian.collect()
Out[83]: [(5, 8), (5, 9), (5, 10), (5, 11), (5,
12), (5, 13), (5, 14), (6, 8), (6, 9), (6, 10),
(6, 11), (6, 12), (6, 13), (6, 14), (7, 8), (7,
9), (7, 10), (7, 11), (7, 12), (7, 13), (7,
14), (8, 8), (8, 9), (8, 10), (8, 11), (8, 12),
(8, 13), (8, 14), (9, 8), (9, 9), (9, 10), (9,
11), (9, 12), (9, 13), (9, 14)]

#JOIN: When called on datasets of type (K, V) and (K, W), returns a dataset
of
# (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are
supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

rdd1=sc.parallelize([(1,8.5),(2,7.7),(3,9),(5,6.5)])
rdd2=sc.parallelize([(1,'cse'),(3,'ece'),(4,'me'),(5,'eee')])
rddjoin=rdd1.join(rdd2)
rddjoin.collect()
Out[]: [(1, (8.5, 'cse')),

(3, (9, 'ece')),

(5, (6.5, 'eee'))]

#leftOuterJoin: performs a join starting with the first (left-most) RDD and
#then any matching second (right-most) RDD elements.

rddleftouter=rdd1.leftOuterJoin(rdd2)
rddleftouter.collect()
Out[]: [(1, (8.5, 'cse')),

(2, (7.7, None)),

(3, (9, 'ece')),

(5, (6.5, 'eee'))]

#rightOuterJoin: performs a join starting with the second (right-most) RDD


and then any matching first (left-most) RDD elements.

rddrightouter=rdd1.rightOuterJoin(rdd2)
rddrightouter.collect()
Out[]: [(1, (8.5, 'cse')),

(3, (9, 'ece')),

(4, (None, 'me')),


(5 , (6.5, 'eee'))]

#fullOuterJoin: returns all matching elements from both RDDs whether the
other RDD matches or not.
rddfullouter=rdd1.fullOuterJoin(rdd2)
rddfullouter.collect()
Out[42]: [(1, (8.5, 'cse')),

(2, (7.7, None)),

(3, (9, 'ece')),

(4, (None, 'me')),

(5, (6.5, 'eee'))]

# Word count in PySpark
x = sc.textFile("/user/cloudera/wc")
x.collect()
a = x.flatMap(lambda line: line.split(" "))
a.collect()
b = a.map(lambda word: (word, 1))
b.collect()
c = b.reduceByKey(lambda a, b: a + b)
c.collect()
c.saveAsTextFile("/user/cloudera/output")
