Business Intelligence & Big Data Analytics - CSE3124Y
ENVIRONMENT FOR BIG DATA
LECTURE 2
Learning Outcomes
Explain the terms computer cluster and distributed computing
Determine the importance of virtualisation for big data
applications
List the requirements for setting up a big data
environment
Choose a Hadoop Distribution and Version using:
◦Apache Software Foundation, or
◦Cloudera’s Distribution Including Apache Hadoop (CDH)
Computer Clusters
•A computer cluster is a single logical unit consisting of
multiple computers that are linked through a fast local area
network (LAN).
•The components of a cluster, called nodes (computers used as servers), each run their own instance of an operating system.
•A node typically includes a CPU, memory, and disk storage.
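As a rough sketch (the class and attribute names are illustrative, not from any real cluster manager), a cluster can be modelled as a collection of nodes, each contributing its own resources:

from dataclasses import dataclass

@dataclass
class Node:
    """One machine in the cluster, running its own OS instance."""
    hostname: str
    cpu_cores: int
    memory_gb: int
    disk_tb: float

class Cluster:
    """A single logical unit made up of many LAN-linked nodes."""
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def total_memory_gb(self) -> int:
        # The cluster's capacity is the sum of its nodes' resources.
        return sum(n.memory_gb for n in self.nodes)

# Three commodity servers linked over a fast LAN form one logical cluster.
cluster = Cluster(Node(f"node{i}", cpu_cores=8, memory_gb=32, disk_tb=4.0)
                  for i in range(3))
print(cluster.total_memory_gb())  # 96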
Distributed Computing (1)
•Distributed computing is a technique in which individual computers are networked together and cooperate on a shared task.
•A distributed file system is a client/server application that
allows clients to access and process data stored on the
server as if it were stored on their own computer.
•File systems that manage the storage across a network of
machines are called distributed file systems.
Distributed computing: multiple computers appear as one supercomputer; they communicate with each other by message passing and operate together to achieve a common goal, as the sketch below illustrates.
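A minimal sketch of this idea in Python, using local processes as stand-ins for networked machines (real distributed systems exchange messages over sockets or RPC rather than a local queue):

from multiprocessing import Process, Queue

def worker(name: str, numbers: list, out_queue: Queue) -> None:
    """Each 'computer' processes its own partition of the data and
    sends the partial result to the coordinator as a message."""
    out_queue.put((name, sum(numbers)))

if __name__ == "__main__":
    data = list(range(1, 101))                     # common goal: sum 1..100
    queue = Queue()
    parts = [data[0::3], data[1::3], data[2::3]]   # split work across 3 "nodes"
    procs = [Process(target=worker, args=(f"node{i}", part, queue))
             for i, part in enumerate(parts)]
    for p in procs:
        p.start()
    messages = [queue.get() for _ in procs]        # receive one message per node
    for p in procs:
        p.join()
    # The three machines together behave like one computer summing the list.
    print(sum(total for _, total in messages))     # 5050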
Distributed Computing (2)
•Challenges
– Heterogeneity
– Openness
– Security
– Scalability
– Concurrency
– Fault tolerance
– Transparency
Virtualisation
•Virtualization refers to the creation of a virtual
resource such as a server, desktop, operating
system, file, storage or network.
•The main goal of virtualization is to manage workloads by decoupling computing from the underlying physical hardware, making it more scalable.
Importance of Virtualisation to Big Data (1)
Solving big data challenges requires the management of
large volumes of highly distributed data stores along with
the use of compute- and data-intensive applications.
Virtualization provides the added level of efficiency to
make big data platforms a reality.
Although virtualization is technically not a requirement for big data analysis, big data software frameworks run more efficiently in a virtualized environment.
Importance of Virtualisation to Big Data (2)
Virtualization has three characteristics that support the scalability
and operating efficiency required for big data environments:
Partitioning: In virtualization, many applications and operating
systems are supported in a single physical system by partitioning the
available resources.
Isolation: Each virtual machine is isolated from its host physical
system and other virtualized machines. Because of this isolation, if
one virtual instance crashes, the other virtual machines and the host
system aren’t affected. In addition, data isn’t shared between one
virtual instance and another.
Encapsulation: A virtual machine can be represented as a single file,
so you can identify it easily based on the services it provides.
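A toy illustration of encapsulation (the file format and field names are invented for this example; real hypervisors use formats such as OVF/OVA plus disk images): the entire machine description travels as one easily copied file, identifiable by the services it provides.

import json
import shutil

# Hypothetical VM description; the point is that it fits in a single file.
vm = {
    "name": "analytics-vm",
    "os": "Linux",
    "cpu_cores": 4,
    "memory_gb": 16,
    "services": ["hdfs-datanode", "yarn-nodemanager"],
}

with open("analytics-vm.json", "w") as f:
    json.dump(vm, f, indent=2)

# Encapsulation: copying one file is enough to duplicate or move the VM.
shutil.copy("analytics-vm.json", "analytics-vm-backup.json")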
Big Data Server Virtualization
•In server virtualization, one physical server is partitioned into multiple
virtual servers.
• The hardware and resources of a machine (including random access memory (RAM), CPU, hard drive, and network controller) can be virtualized into a series of virtual machines, each of which runs its own applications and operating system.
•A virtual machine (VM) is a software representation of a physical
machine that can execute or perform the same functions as the physical
machine.
•Server virtualization provides efficiency in the use of physical resources (see the partitioning sketch below).
• Of course, installation, configuration, and administrative tasks are associated
with setting up these virtual machines.
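A minimal sketch of the partitioning constraint (names are hypothetical, and real hypervisors can also overcommit resources): the virtual servers carved out of one physical server can collectively use at most what the hardware provides.

from dataclasses import dataclass

@dataclass
class HostServer:
    """The physical machine being partitioned."""
    cpu_cores: int
    ram_gb: int

@dataclass
class VMRequest:
    """One virtual server to be carved out of the host."""
    name: str
    cpu_cores: int
    ram_gb: int

def can_host(host: HostServer, vms: list) -> bool:
    # Partitioning: the VMs share, and cannot exceed, the host's resources.
    return (sum(vm.cpu_cores for vm in vms) <= host.cpu_cores
            and sum(vm.ram_gb for vm in vms) <= host.ram_gb)

host = HostServer(cpu_cores=32, ram_gb=128)
vms = [VMRequest("web", 8, 32), VMRequest("db", 16, 64), VMRequest("etl", 8, 32)]
print(can_host(host, vms))  # True: 32 cores and 128 GB exactly fill the host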
Setting Up the Environment for Big Data
Requirements
The Virtual Machine
Hadoop Environment
Apache Hadoop
Apache Hadoop is an open-source software framework for distributed storage and
distributed processing of Big Data on clusters of commodity hardware.
Open-source available:
–From the Apache Software Foundation
–As Distributions such as Cloudera’s Distribution Including Apache Hadoop (CDH)
Distributed storage with HDFS
–Massive amounts of data
–HDFS sits on top of a native file system, such as ext3 or ext4 on Linux
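A small sketch of working with HDFS from Python by shelling out to the standard hdfs dfs commands. It assumes a configured Hadoop client on the PATH and a running cluster; the paths and file name are hypothetical.

import subprocess

def hdfs(*args: str) -> str:
    """Run an 'hdfs dfs' subcommand and return its standard output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

hdfs("-mkdir", "-p", "/user/student/input")        # create a directory in HDFS
hdfs("-put", "sales.csv", "/user/student/input")   # copy a local file into HDFS
print(hdfs("-ls", "/user/student/input"))          # list the HDFS directory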