Real-time Analytics with Storm and Cassandra
Ebook, 424 pages, 2 hours


About this ebook

About This Book
  • Create your own data processing topology and implement it in various real-time scenarios using Storm and Cassandra
  • Build highly available and linearly scalable applications using Storm and Cassandra that will process voluminous data at lightning speed
  • A pragmatic and example-oriented guide to implement various applications built with Storm and Cassandra
Who This Book Is For

If you want to efficiently use Storm and Cassandra together and excel at developing production-grade, distributed real-time applications, then this book is for you. No prior knowledge of using Storm and Cassandra together is necessary. However, a background in Java is expected.

Language: English
Release date: Mar 27, 2015
ISBN: 9781784390006

    Book preview

    Real-time Analytics with Storm and Cassandra - Shilpi Saxena

    Table of Contents

    Real-time Analytics with Storm and Cassandra

    Credits

    About the Author

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why subscribe?

    Free access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Errata

    Piracy

    Questions

    1. Let's Understand Storm

    Distributed computing problems

    Real-time business solution for credit or debit card fraud detection

    Aircraft Communications Addressing and Reporting system

    Healthcare

    Other applications

    Solutions for complex distributed use cases

    The Hadoop solution

    A custom solution

    Licensed proprietary solutions

    Other real-time processing tools

    A high-level view of various components of Storm

    Delving into the internals of Storm

    Quiz time

    Summary

    2. Getting Started with Your First Topology

    Prerequisites for setting up Storm

    Components of a Storm topology

    Spouts

    Bolts

    Streams

    Tuples – the data model in Storm

    Executing a sample Storm topology – local mode

    WordCount topology from the Storm-starter project

    Executing the topology in the distributed mode

    Set up Zookeeper (V 3.3.5) for Storm

    Setting up Storm in the distributed mode

    Launching Storm daemons

    Executing the topology from Command Prompt

    Tweaking the WordCount topology to customize it

    Quiz time

    Summary

    3. Understanding Storm Internals by Examples

    Customizing Storm spouts

    Creating FileSpout

    Tweaking WordCount topology to use FileSpout

    The SocketSpout class

    Anchoring and acking

    The unreliable topology

    Stream groupings

    Local or shuffle grouping

    Fields grouping

    All grouping

    Global grouping

    Custom grouping

    Direct grouping

    Quiz time

    Summary

    4. Storm in a Clustered Mode

    The Storm cluster setup

    Zookeeper configurations

    Cleaning up Zookeeper

    Storm configurations

    Storm logging configurations

    The Storm UI

    Section 1

    Section 2

    Section 3

    Section 4

    The visualization section

    Storm monitoring tools

    Quiz time

    Summary

    5. Storm High Availability and Failover

    An overview of RabbitMQ

    Installing the RabbitMQ cluster

    Prerequisites for the setup of RabbitMQ

    Setting up a RabbitMQ server

    Testing the RabbitMQ server

    Creating a RabbitMQ cluster

    Enabling the RabbitMQ UI

    Creating mirror queues for high availability

    Integrating Storm with RabbitMQ

    Creating a RabbitMQ feeder component

    Wiring the topology for the AMQP spout

    Building high availability of components

    High availability of the Storm cluster

    Guaranteed processing of the Storm cluster

    The Storm isolation scheduler

    Quiz time

    Summary

    6. Adding NoSQL Persistence to Storm

    The advantages of Cassandra

    Columnar database fundamentals

    Types of column families

    Types of columns

    Setting up the Cassandra cluster

    Installing Cassandra

    Multiple data centers

    Prerequisites for setting up multiple data centers

    Installing Cassandra data centers

    Introduction to CQLSH

    Introduction to CLI

    Using different client APIs to access Cassandra

    Storm topology wired to the Cassandra store

    The best practices for Storm/Cassandra applications

    Quiz time

    Summary

    7. Cassandra Partitioning, High Availability, and Consistency

    Consistent hashing

    One or more node goes down

    One or more node comes back up

    Replication in Cassandra and strategies

    Cassandra consistency

    Write consistency

    Read consistency

    Consistency maintenance features

    Quiz time

    Summary

    8. Cassandra Management and Maintenance

    Cassandra – gossip protocol

    Bootstrapping

    Failure scenario handling – detection and recovery

    Cassandra cluster scaling – adding a new node

    Cassandra cluster – replacing a dead node

    The replication factor

    The nodetool commands

    Cassandra fault tolerance

    Cassandra monitoring systems

    JMX monitoring

    Datastax OpsCenter

    Quiz time

    Summary

    9. Storm Management and Maintenance

    Scaling the Storm cluster – adding new supervisor nodes

    Scaling the Storm cluster and rebalancing the topology

    Rebalancing using the GUI

    Rebalancing using the CLI

    Setting up workers and parallelism to enhance processing

    Scenario 1

    Scenario 2

    Scenario 3

    Storm troubleshooting

    The Storm UI

    Storm logs

    Quiz time

    Summary

    10. Advance Concepts in Storm

    Building a Trident topology

    Understanding the Trident API

    Local partition manipulation operation

    Functions

    Filters

    partitionAggregate

    Sum aggregate

    CombinerAggregator

    ReducerAggregator

    Aggregator

    Operations related to stream repartitioning

    Data aggregations over the streams

    Grouping over a field in a stream

    Merge and join

    Examples and illustrations

    Quiz time

    Summary

    11. Distributed Cache and CEP with Storm

    The need for distributed caching in Storm

    Introduction to memcached

    Setting up memcache

    Building a topology with a cache

    Introduction to the complex event processing engine

    Esper

    Getting started with Esper

    Integrating Esper with Storm

    Quiz time

    Summary

    A. Quiz Answers

    Chapter 1

    Chapter 2

    Chapter 3

    Chapter 4

    Chapter 5

    Chapter 6

    Chapter 7

    Chapter 8

    Chapter 9

    Chapter 10

    Chapter 11

    Index

    Real-time Analytics with Storm and Cassandra

    Copyright © 2015 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: March 2015

    Production reference: 1240315

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78439-549-0

    www.packtpub.com

    Credits

    Author

    Shilpi Saxena

    Reviewers

    Sourav Gulati

    Saurabh Gupta

    Ranjeet Kumar Jha

    Mark Kerzner

    Sonal Raj

    Commissioning Editor

    Akram Hussain

    Acquisition Editor

    Larissa Pinto

    Content Development Editor

    Shweta Pant

    Technical Editor

    Saurabh Malhotra

    Copy Editors

    Pranjali Chury

    Merilyn Pereira

    Project Coordinator

    Shipra Chawhan

    Proofreaders

    Simran Bhogal

    Maria Gould

    Paul Hindle

    Indexer

    Mariammal Chettiyar

    Graphics

    Sheetal Aute

    Valentina D'silva

    Abhinash Sahu

    Production Coordinator

    Manu Joseph

    Cover Work

    Manu Joseph

    About the Author

    Shilpi Saxena is a seasoned professional who leads in management while also serving as a technology evangelist. She is an engineer with exposure to a variety of domains (the machine-to-machine space, health care, telecom, hiring, and manufacturing). She has experience in all aspects of the conception and execution of enterprise solutions. She has been architecting, managing, and delivering solutions in the big data space for the last 3 years, handling high-performance, geographically distributed teams of elite engineers.

    Shilpi has more than 12 years of experience (3 years in the big data space) in the development and execution of various facets of enterprise solutions, in both the product and services dimensions of the software industry. An engineer by degree and profession, she has worn varied hats (developer, technical leader, product owner, tech manager, and so on) and has seen all the flavors the industry has to offer.

    She has architected and worked through some of the pioneering production implementations in big data on Storm and Impala with autoscaling in AWS.

    To know more about her, visit her LinkedIn profile at http://in.linkedin.com/pub/shilpi-saxena/4/552/a30.

    I would like to thank my husband, Sachin Saxena, and my mother, Manju Saxena, for their constant support and encouragement while writing this book. A sincere word of thanks to Impetus and all my mentors, who gave me a chance to innovate and learn as part of the big data group.

    About the Reviewers

    Sourav Gulati is an MCA and has been working in the IT industry for about 5 years. He has worked on technologies such as Java and Unix shell scripting and has also worked on big data technologies such as Hadoop, Cassandra, Storm, and so on. Initially, he started working for Tech Mahindra in 2010 and then moved to Impetus in 2012. Currently, he is working as a senior software engineer at Impetus.

    I would really like to thank Shilpi Saxena and Packt Publishing for giving me the chance to be a part of this book. This book is packed with practical knowledge and experience. I would also like to wish Shilpi a lot of success with this book.

    Saurabh Gupta is the lead software engineer at Impetus Technologies and has around 8 years of experience in IT. He started his career with Java/J2EE and headed toward NoSQL and big data technologies. He loves to read about new technologies or tools on the market. He believes that there are no secrets to success, but rather that it is the result of preparation, hard work, and learning from failure.

    I want to thank my wife, Nalini, and the rest of my family, who supported and encouraged me in spite of all the time it kept me away from them.

    Ranjeet Kumar Jha has over 12 years of experience in various phases of project life cycles, including the development and design phases, and has also been part of production support for Java/J2EE and big data-based applications. He has more than 6 years of experience as a technical architect in Java technologies and more than 3 years in big data stacks. He has worked in various domains such as finance, insurance, e-commerce, digital media, and online advertisements.

    Ranjeet has worked as a programmer, designer, and mentor and now works as an architect in all types of projects related to Java, especially J2EE and big data.

    His LinkedIn profile is available at https://www.linkedin.com/in/jharanjeet.

    His certifications include:

    OCM-JEA 5 (Oracle Certified Master, Java Enterprise Architect) with a 94 percent score in 2011

    OCE-WSD (Oracle Certified Expert, JAVA EE 6 Web Services Developer) in 2013

    SCJP (Sun Certified Java Programmer) in 2004

    SCWCD (Sun Certified Web Component Developer) in 2004

    Java Development with Apache Cassandra from DataStax in 2014

    MongoDB for Java Developers from MongoDB University in 2014

    The companies he has worked for include the following:

    EtechAces Consulting and Marketing Pvt Ltd. Gurgaon (Delhi NCR)

    Times Internet Ltd (TimesGroup), Noida (Delhi NCR)

    Ebusinessware Inc (now Xoriant Corporation), Gurgaon (Delhi NCR)

    WIPRO, Gurgaon (Delhi NCR)

    AgreeYa Solution Pvt Ltd, Noida (Delhi NCR)

    INCA Informatics, Noida (Delhi NCR)

    I would like to thank my family—my wife, Anila Jha, and two kids, Anushka Jha and Tanisha Jha, for their constant support, encouragement, and patience. Without you, I wouldn't have achieved so much! Love you all immensely.

    Mark Kerzner holds degrees in law, math, and computer science. He is a software architect who has been working on Hadoop-based systems since 2008. Mark is a cofounder of Elephant Scale, a big data training and consulting company. He is a coauthor of the open source books Hadoop Illuminated and HBase Design Patterns, both by Packt Publishing. He has also authored and coauthored other books and patents, which can be found at http://www.amazon.com.

    I would like to acknowledge the help of my colleagues, in particular, Sujee Maniyam, and last but not least, my multitalented family.

    Sonal Raj is a hacker, Pythonista, big data believer, and a technology dreamer. He has a passion for design and is an artist at heart. He blogs about technology, design, and gadgets at http://www.sonalraj.com/. When not working on projects, he can be found traveling, stargazing, or reading.

    He has pursued engineering in computer science and loves to work on community projects. He has been a research fellow at SERC, IISc, Bangalore, and has taken up projects on graph computations using Neo4j and Storm. Sonal has been a speaker at PyCon India and local meets on Neo4j and has also published articles and research papers in leading magazines and international journals. He has contributed to several open source projects.

    Sonal has been actively involved in the development of machine learning frameworks and has worked on technologies such as NoSQL databases including MongoDB and streaming using Apache Spark. He is currently working at Goldman Sachs.

    I am grateful to the author for patiently listening to my critiques and I'd like to thank the open source community for keeping their passion alive and contributing to such remarkable projects. A special thank you to my parents, without whom I would never have grown to love learning as much as I do.

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Free access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

    Preface

    Storm, initially a project from the house of Twitter, has graduated to the league of Apache and has thus been rechristened from Twitter Storm to Apache Storm. It is the brainchild of Nathan Marz and has now been adopted by the likes of Cloudera's Distribution Including Apache Hadoop (CDH), the Hortonworks Data Platform (HDP), and so on.

    Apache Storm is a highly scalable, distributed, fast, and reliable real-time computing system designed to process very high velocity data. Cassandra complements this computing capability by providing lightning-fast reads and writes, and this is the best combination currently available for data
