Unit V - IBM InfoSphere


IBM InfoSphere BigInsights and Streams

• BigInsights is an analytics platform that enables companies to turn complex Internet-scale information sets into insights.
• It consists of a packaged Apache Hadoop distribution, with a greatly simplified installation
process, and associated tools for application development, data movement, and cluster
management.
• Other open-source technologies included in BigInsights are:
◦ Pig
▪ A platform that provides a high-level language for expressing programs that analyze
large datasets.
▪ Pig has a compiler that translates Pig programs into sequences of MapReduce jobs
that the Hadoop framework executes (the map/reduce pattern itself is sketched after this list).
◦ Hive
▪ A data-warehousing solution built on top of the Hadoop environment.
▪ It brings familiar relational-database concepts, such as tables, columns, and partitions,
along with a subset of SQL (HiveQL), to the unstructured world of Hadoop.
▪ Hive queries are compiled into MapReduce jobs executed using Hadoop.
◦ Jaql
▪ An IBM-developed query language designed for JavaScript Object Notation (JSON)
that provides a SQL-like interface.
◦ HBase
▪ A column-oriented NoSQL data-storage environment designed to support large,
sparsely populated tables in Hadoop.
◦ Flume
▪ A distributed, reliable, available service for efficiently moving large amounts of data
as it is produced.
▪ Flume is well-suited to gathering logs from multiple systems and inserting them into
the Hadoop Distributed File System (HDFS) as they are generated.
◦ Avro
▪ A data-serialization technology that uses JSON for defining data types and protocols,
and serializes data in a compact binary format.
◦ Lucene
▪ A search-engine library that provides high-performance and full-featured text search.
◦ ZooKeeper
▪ A centralized service for maintaining configuration information and naming,
providing distributed synchronization and group services.
◦ Oozie
▪ A workflow scheduler system for managing and orchestrating the execution of
Apache Hadoop jobs.
• In addition, the BigInsights distribution includes the following IBM-specific technologies:
◦ BigSheets
▪ A browser-based spreadsheet-like interface that enables business users to gather and
analyze data easily.
▪ Users can work with data in several common formats, such as CSV and TSV
(tab-separated values).
◦ Text analytics
▪ A pre-built library of text annotators.

◦ Adaptive MapReduce
▪ An IBM Research solution for speeding up the execution of small MapReduce jobs
by changing how MapReduce tasks are handled.
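
Because Pig and Hive programs are ultimately compiled into sequences of MapReduce jobs, it helps to see the pattern they target. The following is a minimal, purely illustrative Python sketch of the map/reduce idea applied to a word count; it is not BigInsights or Hadoop code, and the records and function names are invented for the example.

from collections import defaultdict

# Hypothetical input: each "record" is a line of text.
records = ["big data on hadoop", "hadoop runs mapreduce", "pig compiles to mapreduce"]

def map_phase(record):
    """Map: emit a (key, value) pair for every word in the record."""
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return (key, sum(values))

# Shuffle: group intermediate pairs by key (Hadoop's framework does this step).
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

# Reduce each group to a final (word, count) pair.
counts = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(counts)  # e.g. {'hadoop': 2, 'mapreduce': 2, ...}

In a real cluster the map and reduce phases run in parallel across many nodes; Pig and Hive simply generate chains of such jobs from higher-level statements.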

Stream Computing

• Stream computing is a new paradigm necessitated by new data-generating scenarios, such as the ubiquity of mobile devices, location services, and sensor pervasiveness.
• In static data computation, questions are asked of data that is already stored (data at rest).
• In streaming data computation, continuously arriving data is evaluated against static, standing questions (see the sketch below).
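
The contrast can be illustrated with a small, purely conceptual Python sketch (the query, threshold, and data are made up): with data at rest the query runs once over stored records, while with data in motion the same standing query is applied to every event as it arrives.

# Data at rest: the question is asked once, over data already stored.
stored_readings = [42, 87, 65, 103, 91]
print([r for r in stored_readings if r > 90])  # one-off query -> [103, 91]

# Data in motion: the question is fixed; the data keeps arriving.
def standing_query(event):
    """The 'static question': flag any reading above the threshold."""
    return event > 90

def incoming_events():
    """Stands in for an endless sensor feed (finite here so the example terminates)."""
    yield from [55, 96, 12, 140, 78]

for event in incoming_events():
    if standing_query(event):
        print("alert:", event)  # evaluated continuously, as each event arrives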

The InfoSphere platform

• InfoSphere is a comprehensive information-integration platform that includes data warehousing and analytics, information integration, master data management, life-cycle management, and data security and privacy.
• The InfoSphere Streams platform
◦ supports real-time processing of streaming data,
◦ enables the results of continuous queries to be updated over time, and
◦ can detect insights within data streams that are still in motion.
• The main design goals of InfoSphere Streams are to:
◦ Respond quickly to events and changing business conditions and requirements.
◦ Support continuous analysis of data at rates that are orders of magnitude greater than
those of existing systems.
◦ Adapt rapidly to changing data forms and types.
◦ Manage high availability, heterogeneity, and distribution for the new stream paradigm.
◦ Provide security and information confidentiality for shared information.
• InfoSphere Streams
◦ Provides a programming model and IDE for defining data sources.
◦ Supplies software analytic modules, called operators, which are fused into processing execution units.
◦ Provides the infrastructure to support the composition of scalable stream-processing
applications from these components.
• The main platform components are:
• Runtime environment— This includes platform services and a scheduler for deploying and
monitoring Streams applications across a single host or set of integrated hosts.
• Programming model— You can write Streams applications using the Streams Processing
Language (SPL), a declarative language. In this model, a Streams application is represented
as a graph consisting of operators and the streams that connect them (a conceptual sketch follows this list).
• Monitoring tools and administrative interfaces— Streams applications process data at
speeds much higher than those that the normal collection of operating system monitoring
utilities can efficiently handle. InfoSphere Streams provides the tools that can deal with this
environment.
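
To make the "graph of operators connected by streams" idea concrete, here is a hedged, conceptual Python sketch; it is not SPL and not the Streams runtime. Each operator is written as a generator that consumes one stream and produces another, and all names (source, parse, high_readings, sink) are invented for illustration.

# Conceptual dataflow: Source -> Parse -> Filter -> Sink, each edge being a "stream".

def source():
    """Source operator: produces a stream of raw tuples (here, fixed CSV lines)."""
    for line in ["sensor1,21.5", "sensor2,99.0", "sensor1,18.2"]:
        yield line

def parse(stream):
    """Transform-style operator: converts each tuple into a structured form."""
    for line in stream:
        sensor, value = line.split(",")
        yield {"sensor": sensor, "value": float(value)}

def high_readings(stream, threshold=50.0):
    """Filter-style operator: passes only tuples whose value exceeds the threshold."""
    for tup in stream:
        if tup["value"] > threshold:
            yield tup

def sink(stream):
    """Sink operator: consumes the final stream (here, it just prints)."""
    for tup in stream:
        print("out:", tup)

# Composing the application is just wiring the operators together into a graph.
sink(high_readings(parse(source())))

In SPL the same shape is expressed declaratively, and the runtime distributes the fused operators across hosts rather than running them in a single process.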

Streams Processing Language


• SPL, the programming language for InfoSphere Streams, is a distributed data-flow
composition language.
• It is an extensible and full-featured language like C++ or Java.
• The basic building blocks of SPL programs (illustrated conceptually in the sketch after this list):
• Stream— An infinite sequence of structured tuples. It can be consumed by operators
on a tuple-by-tuple basis or through the definition of a window.
• Tuple— A structured list of attributes and their types. Each tuple on a stream has the
form dictated by its stream type.
• Stream type— Specifies the name and data type of each attribute in the tuple.
• Window— A finite, sequential group of tuples. It can be based on count, time,
attribute value, or punctuation marks.
• Operator— The fundamental building block of SPL; operators process data from
streams and can produce new streams.
• Processing element (PE)— The fundamental execution unit. A PE can encapsulate a
single operator or many fused operators.
• Job— A Streams application deployed for execution. It consists of one or more PEs.
In addition to a set of PEs, the SPL compiler also generates an Application
Description Language (ADL) file that describes the structure of the application. The
ADL file includes details about each PE, such as which binary file to load and
execute, scheduling restrictions, stream formats, and an internal operator data-flow
graph.
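
The vocabulary above (stream, tuple, window, operator) can be illustrated with a short conceptual Python sketch. This is not SPL syntax; the tuple attributes, the temperature values, and the count-based sliding window of size 3 are all invented for the example.

from collections import deque

def readings():
    """A 'stream': a sequence of structured tuples, here dicts with a fixed shape (the stream type)."""
    for i, temp in enumerate([20.1, 20.4, 21.0, 35.7, 36.2, 20.9]):
        yield {"seq": i, "temp": temp}  # each dict plays the role of a tuple

def sliding_average(stream, size=3):
    """A windowed operator: keeps the last `size` tuples (a count-based sliding
    window) and emits a new tuple with their average each time a tuple arrives."""
    window = deque(maxlen=size)  # the finite, sequential group of tuples
    for tup in stream:
        window.append(tup)
        avg = sum(t["temp"] for t in window) / len(window)
        yield {"last_seq": tup["seq"], "avg_temp": avg}

# The operator consumes one stream and produces a new one.
for out in sliding_average(readings()):
    print(out)

In SPL the window would be declared on the operator invocation (for example, a count-based sliding window on an aggregate operator), and the compiler would fuse such operators into processing elements described by the generated ADL file.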

Data at Rest vs. Data in Motion

(Refer to the separate PDF: Data at Rest vs. Data in Motion.)
