Big Data Syllabus for Theory and Lab


19ECS442: BIG DATA  

The course largely involves collecting data from different sources, managing it so that it becomes available for consumption by analysts, and finally delivering data products useful to the organization's business. The process of converting large amounts of unstructured raw data, retrieved from different sources, into a data product useful to organizations forms the core of Big Data Analytics.
Course objectives: 
1. To introduce an in-depth understanding of the concepts related to Big Data and its uses.
2. To provide insight into the underlying technologies for handling Big Data and the Hadoop ecosystem.
3. To explore the layers of the Big Data stack and YARN functionality.
4. To understand the architecture, benefits, and properties of Hive and Pig.
5. To provide learners with deep and systematic knowledge of Spark.
 
Module I: Getting an Overview of Big Data

Big Data definition, History of Data Management, Structuring Big Data, Elements of Big Data, Big Data Analytics.
 
Exploring the Use of Big Data in a Business Context: Use of Big Data in Social Networking, Use of Big Data in Preventing Fraudulent Activities in the Insurance Sector and in the Retail Industry.

Learning Outcomes:  
After completion of this unit, the student will be able to: 
 
1. Learn various sources of data and forms of data generation. (L2) 
2. Understand the evolution and elements of Big Data. (L2) 
3. Explore different opportunities available in the career path. (L3) 
4. Understand the role and importance of Big Data in various domains. (L2) 
 
Module II: Handling Big Data Number of hours (LTP) 6 0 6

Distributed and parallel computing for Big Data, Introducing Hadoop, Cloud computing and
Big Data, In-memory Computing Technology for Big Data. 

Understanding the Hadoop Ecosystem: Hadoop Ecosystem, Hadoop Distributed File System, MapReduce, Hadoop YARN, Introducing HBase, Combining HBase and HDFS, Hive, Pig and Pig Latin, Sqoop, ZooKeeper, Flume, Oozie.
 
Learning Outcomes:  
After completion of this unit, the student will be able to: 
 
1. Identify the difference between distributed and parallel computing. (L3) 
2. Learn the importance of Virtualization in Big Data. (L2) 
3. Learn the details of Hadoop and Cloud Computing. (L2) 
4. Learn the architecture and features of HDFS. (L2) 
 
Module III: Understanding Big Data Technology Foundations Number of hours (LTP) 6 0 6

The MapReduce Framework, Techniques to Optimize MapReduce Jobs, Uses of MapReduce, Role of HBase in Big Data Processing.
Exploring the Big Data Stack, Virtualization and Big Data, Virtualization approaches. 

Learning Outcomes:  
After completion of this unit, the student will be able to: 
1. Understand Hadoop Ecosystem, MapReduce and HBase. (L2) 
2. Apply the technique in optimizing MapReduce jobs. (L3) 
3. Explore the layers of Big Data Stack. (L2) 
4. Learn virtualization approaches in handling Big Data operations. (L2) 
 
Module IV: HIVE and PIG Number of hours (LTP) 6 0 6 

Exploring Hive: Introducing Hive, Getting Started with Hive, Hive Services, Data Types, Built-in Functions, Hive DDL, Data Manipulation, Data Retrieval Queries, Using Joins.
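
As a taste of the Hive workflow this unit covers, here is a minimal sketch of HiveQL DDL and query statements, issued through PySpark's spark.sql interface (one of several front ends; the Hive CLI or Beeline work equally well). The table and column names are illustrative, not part of the syllabus.

# A minimal Hive sketch run through PySpark; assumes Spark built with Hive support.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-sketch").enableHiveSupport().getOrCreate()

# DDL: create a managed table (illustrative schema).
spark.sql("CREATE TABLE IF NOT EXISTS employee (id INT, name STRING, salary INT)")

# Data manipulation: insert a few rows.
spark.sql("INSERT INTO employee VALUES (1, 'Asha', 52000), (2, 'Ravi', 61000)")

# Data retrieval: a simple query with a predicate.
spark.sql("SELECT name, salary FROM employee WHERE salary > 55000").show()

spark.stop()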

Analysing Data with Pig: Introducing Pig, Running Pig, Getting Started with Pig Latin, Working with Operators in Pig, Debugging Pig, Working with Functions in Pig, Error Handling in Pig.
 
Learning Outcomes:  
After completion of this unit, the student will be able to: 
1. Learn the working of Hive and query execution. (L2) 
2. Learn the importance of Pig. (L2) 
3. Choose appropriate operators in Pig. (L2)

Module V: SPARK Number of hours (LTP) 6 0 6 


Introduction, Spark Jobs and APIs, Spark 2.0 Architecture, Resilient Distributed Datasets: Internal Working, Creating RDDs, Transformations, Actions. DataFrames: Python to RDD Communications, Speeding up PySpark with DataFrames, Creating DataFrames and Simple DataFrame Queries, Interoperating with RDDs, Querying with DataFrames.
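
The core ideas of this module fit in a few lines of PySpark. The following hedged sketch (names and data are illustrative) shows an RDD with a lazy transformation and an action, then a DataFrame, whose schema lets Spark's optimizer close much of the Python-to-JVM efficiency gap:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sketch").getOrCreate()
sc = spark.sparkContext

# RDD: transformations are lazy; actions trigger the actual job.
rdd = sc.parallelize([1, 2, 3, 4, 5])
squares = rdd.map(lambda x: x * x)          # transformation (lazy)
print(squares.reduce(lambda a, b: a + b))   # action -> prints 55

# DataFrame: schema-aware queries.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.filter(df.id > 1).show()

# Interoperating with RDDs: a DataFrame exposes its rows as an RDD.
print(df.rdd.take(1))

spark.stop()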

Learning Outcomes:  
After completion of this unit, the student will be able to: 
 
1. Get an overview of Spark technology and the organization of Spark jobs. (L2)
2. Understand the schema-less data structure (the RDD) available in PySpark. (L3)
3. Get an overview of DataFrames, which bridge the efficiency gap between Scala and Python. (L2)
4. Handle a real-time Big Data application. (L4)
 
Textbook(s)
1. DT Editorial Services, Big Data (Black Book), Dreamtech Press, 2016.
2. Tomasz Drabas and Denny Lee, Learning PySpark, Packt Publishing, 2017.
3. Tom White, Hadoop: The Definitive Guide, 4/e, O'Reilly, 2015.

Reference Book(s) 
1. Bill Franks, Taming the Big Data Tidal Wave, 1/e, Wiley, 2012.
2. Frank J. Ohlhorst, Big Data Analytics, 1/e, Wiley, 2012.

Course Outcomes:
1. Demonstrate Big Data concepts for real-world data analysis. (L1)
2. Develop MapReduce concepts. (L2)
3. Learn how Pig Latin is used for programming in Hadoop. (L3)
4. Illustrate the Hadoop API for the MapReduce framework. (L4)
5. Develop basic programs for the MapReduce framework, particularly driver code, mapper code, and reducer code. (L5)
6. Learn Apache Spark fundamentals: RDDs and DataFrames. (L6)

Lab Experiments for Big Data


1. Installation of Hadoop Cluster:
a. Standalone Mode,  b. Pseudo-Distributed Mode,  c. Fully Distributed Mode
2. Perform file management tasks in Hadoop (a hedged Python sketch follows the sub-tasks):
a. Creating directory 
b. List the contents of a directory 
c. Upload and download a file 
d. See contents of a file 
e. Copy a file from source to destination 
f. Move file from source to destination. 
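
These tasks are normally run with the hdfs dfs shell; the following hedged Python wrapper simply invokes it via subprocess, so each flag maps one-to-one onto a sub-task above. Paths are illustrative and assume a running HDFS with a writable /user/student directory.

import subprocess

def hdfs(*args):
    # Run an `hdfs dfs` subcommand and raise if it fails.
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/user/student/demo")                    # a. create a directory
hdfs("-ls", "/user/student")                                  # b. list directory contents
hdfs("-put", "local.txt", "/user/student/demo/")              # c. upload a file
hdfs("-get", "/user/student/demo/local.txt", "download.txt")  # c. download a file
hdfs("-cat", "/user/student/demo/local.txt")                  # d. see file contents
hdfs("-cp", "/user/student/demo/local.txt", "/user/student/demo/copy.txt")  # e. copy
hdfs("-mv", "/user/student/demo/copy.txt", "/user/student/moved.txt")       # f. move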
3. MapReduce programming (a hedged Python sketch for part b follows):
a. Word count program using Java
b. Word count program using Python
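
For part b, a common route is Hadoop Streaming, which pipes HDFS data through stdin/stdout. A hedged sketch follows; file names and the streaming jar path vary by installation.

# mapper.py: emit (word, 1) for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py: sum counts; Streaming sorts by key, so equal words arrive contiguously.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print(current + "\t" + str(count))

A typical (installation-dependent) invocation: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /input -output /output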
4. Databases, Tables, Views, Functions, and Indexes
5. Write a program to perform matrix multiplication in Hadoop with a matrix size of n×n, where n > 1000 (a hedged MapReduce sketch follows).
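
One well-known approach is the single-pass MapReduce matrix multiply, sketched here for Hadoop Streaming under the assumption that input lines look like "A,i,k,value" or "B,k,j,value" and that the dimension N is known up front; both are assumptions, not requirements of the exercise.

# mm_mapper.py: route each element to every output cell that needs it.
import sys

N = 1024  # assumed matrix dimension (n > 1000 per the exercise)

for line in sys.stdin:
    m, r, c, v = line.strip().split(",")
    if m == "A":                  # A[i][k] contributes to every cell (i, j)
        for j in range(N):
            print(f"{r},{j}\tA,{c},{v}")
    else:                         # B[k][j] contributes to every cell (i, j)
        for i in range(N):
            print(f"{i},{c}\tB,{r},{v}")

# mm_reducer.py: for each cell (i, j), sum A[i][k] * B[k][j] over k.
import sys

def emit(cell, a, b):
    print(cell + "\t" + str(sum(a[k] * b[k] for k in a if k in b)))

current, a, b = None, {}, {}
for line in sys.stdin:
    cell, val = line.strip().split("\t")
    if cell != current:
        if current is not None:
            emit(current, a, b)
        current, a, b = cell, {}, {}
    tag, k, v = val.split(",")
    (a if tag == "A" else b)[k] = float(v)
if current is not None:
    emit(current, a, b)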
7. Given the following table schemas:
Employee_table {ID: INT, Name: VARCHAR(10), Age: INT, Salary: INT}
Loan_table {LoanID: INT, ID: INT, Loan_applied: BOOLEAN, Loan_amt: INT}
a. Create a database and the above tables in Hive.
b. Insert records into the tables.
c. Write an SQL query to retrieve the details of employees who have applied for a loan.
8. Write a query to create a table that stores the records of employees working in the same department together in the same HDFS sub-directory (a hedged sketch follows). The schema for the table is given below: Emp_table {id, name, dept, yoj}
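
The grouping described here is exactly what Hive partitioning provides: each distinct value of the partition column becomes its own HDFS sub-directory. A hedged sketch of the DDL, again issued through spark.sql (the Hive shell works just as well); the sample row is illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Partitioning by dept stores each department's rows under .../dept=<value>/
spark.sql("""
    CREATE TABLE IF NOT EXISTS emp_table (id INT, name STRING, yoj INT)
    PARTITIONED BY (dept STRING)
""")
spark.sql("INSERT INTO emp_table PARTITION (dept='sales') VALUES (1, 'Asha', 2019)")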
9. Given the following table schemas:

+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+

+-----+---------------------+-------------+--------+
| OID | DATE                | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+

Create these tables in Hive and insert transaction records into them. Write an SQL query to find the details of customers who have placed an order.
10. Understanding Spark
 
