Business Intelligence & Big Data Analytics - CSE3124Y
ENVIRONMENT FOR BIG DATA

LECTURE 2
Learning Outcomes
•Explain the terms computer cluster and distributed computing
•Determine the importance of virtualisation for big data applications
•List the requirements for setting up a big data environment
•Choose a Hadoop distribution and version from:
◦Apache Software Foundation, or
◦Cloudera’s Distribution Including Apache Hadoop (CDH)
Computer Clusters
•A computer cluster is a single logical unit consisting of multiple computers linked through a fast local area network (LAN).
•The components of a cluster, called nodes (computers used as servers), each run their own instance of an operating system.
•A node typically includes CPU, memory, and disk storage.
Distributed Computing (1)
•Distributed computing is a technique that allows individual computers to be networked together.
•A distributed file system is a client/server application that allows clients to access and process data stored on the server as if it were stored on their own computer.
•File systems that manage storage across a network of machines are called distributed file systems.
•In distributed computing, multiple computers appear as one supercomputer, communicate with each other by message passing, and operate together to achieve a common goal.
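
To make the message-passing idea concrete, here is a minimal Python sketch (illustrative only, not part of the lecture material): worker processes receive chunks of work, compute partial results, and send them back over queues so the group achieves a common goal (summing 1..100).

# Minimal illustration of message passing: workers receive tasks,
# compute partial sums, and return them to a coordinator.
from multiprocessing import Process, Queue

def worker(tasks: Queue, results: Queue) -> None:
    while True:
        chunk = tasks.get()
        if chunk is None:          # sentinel: no more work
            break
        results.put(sum(chunk))    # partial result from this "node"

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    workers = [Process(target=worker, args=(tasks, results)) for _ in range(3)]
    for w in workers:
        w.start()

    chunks = [list(range(1, 101))[i::3] for i in range(3)]
    for chunk in chunks:
        tasks.put(chunk)
    for _ in workers:
        tasks.put(None)            # one sentinel per worker

    total = sum(results.get() for _ in chunks)
    for w in workers:
        w.join()
    print(total)                   # 5050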
Distributed Computing (2)
•Challenges
– Heterogeneity
– Openness
– Security
– Scalability
– Concurrency
– Fault tolerance
– Transparency
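
One of these challenges, fault tolerance, can be shown with a small Python sketch (an illustrative assumption, not from the lecture): instead of failing on the first error, a caller retries a flaky "remote" operation a bounded number of times.

# Tiny illustration of fault tolerance: retry a flaky remote call
# a bounded number of times before giving up.
import random
import time

def flaky_remote_call() -> str:
    # Stands in for a network request to another node;
    # fails randomly to simulate node/link failures.
    if random.random() < 0.5:
        raise ConnectionError("node unreachable")
    return "ok"

def call_with_retries(attempts: int = 5, delay: float = 0.1) -> str:
    for attempt in range(1, attempts + 1):
        try:
            return flaky_remote_call()
        except ConnectionError as err:
            print(f"attempt {attempt} failed: {err}")
            time.sleep(delay)      # back off before retrying
    raise RuntimeError("all retries exhausted")

print(call_with_retries())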
Virtualisation
•Virtualization refers to the creation of a virtual resource such as a server, desktop, operating system, file, storage or network.
•The main goal of virtualization is to manage workloads by radically transforming traditional computing to make it more scalable.
Importance of Virtualisation to Big Data (1)
•Solving big data challenges requires managing large volumes of highly distributed data stores along with compute- and data-intensive applications.
•Virtualization provides the added level of efficiency needed to make big data platforms a reality.
•Although virtualization is technically not a requirement for big data analysis, software frameworks are more efficient in a virtualized environment.
Importance of Virtualisation to Big Data (2)
Virtualization has three characteristics that support the scalability and operating efficiency required for big data environments:
•Partitioning: In virtualization, many applications and operating systems are supported in a single physical system by partitioning the available resources.
•Isolation: Each virtual machine is isolated from its host physical system and from other virtual machines. Because of this isolation, if one virtual instance crashes, the other virtual machines and the host system aren’t affected. In addition, data isn’t shared between one virtual instance and another.
•Encapsulation: A virtual machine can be represented as a single file, so you can identify it easily based on the services it provides.
BIG DATA SERVER VIRTUALIZATION
•In server virtualization, one physical server is partitioned into multiple virtual servers.
•The hardware and resources of a machine (including the random access memory (RAM), CPU, hard drive, and network controller) can be virtualized into a series of virtual machines, each running its own applications and operating system.
•A virtual machine (VM) is a software representation of a physical machine that can execute the same functions as the physical machine.
•Server virtualization provides efficiency in the use of physical resources.
•Of course, installation, configuration, and administrative tasks are associated with setting up these virtual machines (see the sketch below).
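
As a rough sketch of the partitioning idea (purely illustrative; the class and the resource numbers below are invented for this example, not a real hypervisor API), a host's fixed CPU and RAM are carved into isolated VMs, and an allocation that exceeds the remaining capacity is refused.

# Illustrative model of server partitioning: a physical host's CPU
# and RAM are divided among virtual machines, and allocation fails
# once the remaining capacity is exhausted.
from dataclasses import dataclass, field

@dataclass
class PhysicalHost:
    cpus: int
    ram_gb: int
    vms: list = field(default_factory=list)

    def allocate_vm(self, name: str, cpus: int, ram_gb: int) -> bool:
        used_cpus = sum(vm["cpus"] for vm in self.vms)
        used_ram = sum(vm["ram_gb"] for vm in self.vms)
        if used_cpus + cpus > self.cpus or used_ram + ram_gb > self.ram_gb:
            return False               # not enough capacity left
        self.vms.append({"name": name, "cpus": cpus, "ram_gb": ram_gb})
        return True

host = PhysicalHost(cpus=16, ram_gb=64)
print(host.allocate_vm("hadoop-master", cpus=4, ram_gb=16))   # True
print(host.allocate_vm("hadoop-worker1", cpus=8, ram_gb=32))  # True
print(host.allocate_vm("hadoop-worker2", cpus=8, ram_gb=32))  # False: over capacity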
Setting Up the Environment for Big Data
•Requirements
•The Virtual Machine
•Hadoop Environment
Apache Hadoop
•Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware.
•Open-source versions are available:
–From the Apache Software Foundation
–As distributions such as Cloudera’s Distribution Including Apache Hadoop (CDH)
•Distributed storage with HDFS:
–Handles massive amounts of data
–HDFS sits on top of a native file system, such as ext4 on Linux

Note: Proprietary options such as IBM’s BigInsights are also available.
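
As a quick illustration of accessing HDFS data as if it were local, the sketch below uses the third-party hdfs Python package (HdfsCLI); the NameNode URL, user name, and paths are placeholder assumptions for a typical cluster, not values from the lecture.

# Illustrative HDFS access from Python using the third-party
# "hdfs" package (HdfsCLI). The URL, user, and paths below are
# placeholders; adjust them to your cluster.
from hdfs import InsecureClient

# WebHDFS endpoint of the NameNode (port 9870 is a common default).
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file into HDFS, read it back, and list the directory,
# much as you would with a local file system.
client.write("/user/hadoop/hello.txt", data=b"hello, HDFS", overwrite=True)

with client.read("/user/hadoop/hello.txt") as reader:
    print(reader.read())

print(client.list("/user/hadoop"))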


Activity 1
Explain the importance of a virtual machine for big data applications
List the requirements for setting up a big data environment
Cloudera’s Distribution Including Apache Hadoop (CDH)
•Is a single install package built from the Apache Hadoop core repository
•Includes a stable version of Hadoop, critical bug fixes, and solid new features from the development version
•Includes the following components:
◦Apache Hadoop
◦Hive, Pig, HBase, Solr, Mahout, Spark, YARN
◦Flume, Hue, Oozie, and Sqoop
◦ZooKeeper
Introduction to the Hortonworks Virtual Machine (VM)
•Hortonworks is a company that provides a virtual environment pre-configured with Hadoop.
•The Hortonworks Sandbox includes the latest developments from the Hortonworks Data Platform (HDP) distribution.
•With this environment you save a lot of installation and configuration time. Hortonworks also provides tutorials you can use to start learning Hadoop.
•With the Sandbox it is very easy to learn Hadoop, and no extra PC is required.
•It is safe to perform any experimentation in the virtual environment, so your original system remains unaffected.