Cloud computing has emerged as a promising new approach for ad-hoc parallel data processing. Major cloud computing companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. However, the processing frameworks currently in use stem from the field of cluster computing and disregard the particular nature of a cloud. As a result, the allocated compute resources may be inadequate for large parts of the submitted job and unnecessarily increase processing time and cost. In this paper we discuss the opportunities and challenges for efficient parallel data processing in clouds and present our ongoing research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's compute clouds for both task scheduling and execution. It allows the individual tasks of a processing job to be assigned to different types of virtual machines and takes care of their instantiation and termination during job execution. Based on this new framework, we perform evaluations on a compute cloud system and compare the results to the existing data processing framework Hadoop.

Existing System: The vast amount of data that companies such as operators of Internet search engines have to deal with every day has made traditional database solutions prohibitively expensive. Instead, these companies have popularized an architectural paradigm based on a large number of commodity servers. Problems like processing crawled documents or regenerating a web index are split into several independent subtasks, distributed among the available nodes, and computed in parallel. To simplify the development of distributed applications on top of such architectures, many of these companies have also built customized data processing frameworks. These frameworks can be classified by terms like high-throughput computing (HTC) or many-task computing (MTC), depending on the amount of data and the number of tasks involved in the computation. Only recently, Amazon has integrated Hadoop as one of its core infrastructure services. However, instead of embracing the cloud's dynamic resource allocation, current data processing frameworks expect the cloud to imitate the static nature of the cluster environments they were originally designed for. For example, at the moment the types and number of VMs allocated at the beginning of a compute job cannot be changed in the course of processing, although the tasks the job consists of might have completely different demands on the environment. As a result, rented resources may be inadequate for large parts of the processing job, which may lower the overall processing performance and increase the cost.
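The cost effect of a fixed allocation can be illustrated with a back-of-the-envelope calculation. The Python sketch below compares a static allocation, where one set of VMs is rented for the whole job, with an allocation that rents only what each stage needs while it runs. The stage names, VM types, prices, and durations are invented for illustration and are not taken from any evaluation; billing granularity and VM start-up time are ignored.

```python
from dataclasses import dataclass

# Hypothetical hourly prices per VM type (illustrative values only).
PRICE_PER_HOUR = {"small": 0.10, "large": 0.40}


@dataclass(frozen=True)
class Stage:
    name: str
    vm_type: str   # VM type the stage actually needs
    vm_count: int  # number of VMs the stage can make use of
    hours: float   # how long the stage runs


def static_cost(stages, vm_type, vm_count):
    """Cost when one fixed set of VMs is rented for the whole job."""
    total_hours = sum(s.hours for s in stages)
    return PRICE_PER_HOUR[vm_type] * vm_count * total_hours


def per_stage_cost(stages):
    """Cost when each stage rents only the VMs it needs, while it runs."""
    return sum(PRICE_PER_HOUR[s.vm_type] * s.vm_count * s.hours for s in stages)


# Hypothetical two-stage job: a demanding stage followed by a light one.
JOB = [
    Stage("sort", "large", 4, 1.0),
    Stage("aggregate", "small", 1, 2.0),
]

if __name__ == "__main__":
    print(f"static:    {static_cost(JOB, 'large', 4):.2f}")
    print(f"per stage: {per_stage_cost(JOB):.2f}")
```

In this made-up job the static allocation keeps four large VMs rented during the light second stage, which is where the extra cost comes from.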

Proposed System: In this paper, we have discussed the challenges and opportunities for efficient parallel data processing in cloud environments and presented Nephele, the first data processing framework to exploit the dynamic resource provisioning offered by today's IaaS clouds. We have described Nephele's basic architecture and presented a performance comparison to the well-established data processing framework Hadoop. The performance evaluation gives a first impression of how the ability to assign specific virtual machine types to specific tasks of a processing job, as well as the possibility to automatically allocate and deallocate virtual machines in the course of a job execution, can help to improve the overall resource utilization and, consequently, reduce the processing cost. With a framework like Nephele at hand, there are a variety of open research issues, which we plan to address in future work. In particular, we are interested in improving Nephele's ability to adapt automatically to resource overload or underutilization during job execution.

Dynamic allocation and deallocation of different compute resources from the cloud.

Each task is executed by a set of instances that share the task.

Allocating and deallocating resources per task makes processing cost- and time-efficient.
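The three points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Nephele's actual implementation or API: `Task`, `FakeCloud`, and `run_job` are hypothetical names, the cloud is a stand-in that only records which VMs are running, and the instances of a task are executed one after another to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    vm_type: str    # VM type requested for this task
    instances: int  # number of instances that share the task


@dataclass
class FakeCloud:
    """Hypothetical stand-in for an IaaS API; it only tracks running VMs."""
    running: dict = field(default_factory=dict)  # vm_id -> vm_type
    log: list = field(default_factory=list)      # history of (action, vm_type)
    next_id: int = 0

    def allocate(self, vm_type):
        vm_id = self.next_id
        self.next_id += 1
        self.running[vm_id] = vm_type
        self.log.append(("allocate", vm_type))
        return vm_id

    def deallocate(self, vm_id):
        vm_type = self.running.pop(vm_id)
        self.log.append(("deallocate", vm_type))


def run_job(tasks, cloud, work):
    """Run tasks in order, renting VMs only while each task is running."""
    results = {}
    for task in tasks:
        # Allocate exactly the VM type and count this task asks for.
        vm_ids = [cloud.allocate(task.vm_type) for _ in range(task.instances)]
        try:
            # Each instance processes its own share of the task.
            results[task.name] = [work(task, i) for i in range(task.instances)]
        finally:
            # Release the VMs as soon as the task ends, even on failure.
            for vm_id in vm_ids:
                cloud.deallocate(vm_id)
    return results


if __name__ == "__main__":
    cloud = FakeCloud()
    job = [Task("extract", "small", 2), Task("sort", "large", 1)]
    print(run_job(job, cloud, lambda task, i: f"{task.name}-{i}"))
    print(cloud.log)
```

Releasing the VMs in a `finally` block is a design choice of this sketch: it keeps a failed task from leaving rented machines running.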

System Requirement Specification

Software Interface

JDK 1.5
Java Swing
SQL Server

Hardware Interface


RAM : 512 MB DDR RAM
Hard disk : 40 GB

