Real-time Analytics with Storm and Cassandra
About this ebook
- Create your own data processing topology and implement it in various real-time scenarios using Storm and Cassandra
- Build highly available and linearly scalable applications using Storm and Cassandra that will process voluminous data at lightning speed
- A pragmatic and example-oriented guide to implement various applications built with Storm and Cassandra
If you want to efficiently use Storm and Cassandra together and excel at developing production-grade, distributed real-time applications, then this book is for you. No prior knowledge of using Storm and Cassandra together is necessary. However, a background in Java is expected.
Real-time Analytics with Storm and Cassandra - Shilpi Saxena
Table of Contents
Real-time Analytics with Storm and Cassandra
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Let's Understand Storm
Distributed computing problems
Real-time business solution for credit or debit card fraud detection
Aircraft Communications Addressing and Reporting system
Healthcare
Other applications
Solutions for complex distributed use cases
The Hadoop solution
A custom solution
Licensed proprietary solutions
Other real-time processing tools
A high-level view of various components of Storm
Delving into the internals of Storm
Quiz time
Summary
2. Getting Started with Your First Topology
Prerequisites for setting up Storm
Components of a Storm topology
Spouts
Bolts
Streams
Tuples – the data model in Storm
Executing a sample Storm topology – local mode
WordCount topology from the Storm-starter project
Executing the topology in the distributed mode
Set up Zookeeper (V 3.3.5) for Storm
Setting up Storm in the distributed mode
Launching Storm daemons
Executing the topology from Command Prompt
Tweaking the WordCount topology to customize it
Quiz time
Summary
3. Understanding Storm Internals by Examples
Customizing Storm spouts
Creating FileSpout
Tweaking WordCount topology to use FileSpout
The SocketSpout class
Anchoring and acking
The unreliable topology
Stream groupings
Local or shuffle grouping
Fields grouping
All grouping
Global grouping
Custom grouping
Direct grouping
Quiz time
Summary
4. Storm in a Clustered Mode
The Storm cluster setup
Zookeeper configurations
Cleaning up Zookeeper
Storm configurations
Storm logging configurations
The Storm UI
Section 1
Section 2
Section 3
Section 4
The visualization section
Storm monitoring tools
Quiz time
Summary
5. Storm High Availability and Failover
An overview of RabbitMQ
Installing the RabbitMQ cluster
Prerequisites for the setup of RabbitMQ
Setting up a RabbitMQ server
Testing the RabbitMQ server
Creating a RabbitMQ cluster
Enabling the RabbitMQ UI
Creating mirror queues for high availability
Integrating Storm with RabbitMQ
Creating a RabbitMQ feeder component
Wiring the topology for the AMQP spout
Building high availability of components
High availability of the Storm cluster
Guaranteed processing of the Storm cluster
The Storm isolation scheduler
Quiz time
Summary
6. Adding NoSQL Persistence to Storm
The advantages of Cassandra
Columnar database fundamentals
Types of column families
Types of columns
Setting up the Cassandra cluster
Installing Cassandra
Multiple data centers
Prerequisites for setting up multiple data centers
Installing Cassandra data centers
Introduction to CQLSH
Introduction to CLI
Using different client APIs to access Cassandra
Storm topology wired to the Cassandra store
The best practices for Storm/Cassandra applications
Quiz time
Summary
7. Cassandra Partitioning, High Availability, and Consistency
Consistent hashing
One or more node goes down
One or more node comes back up
Replication in Cassandra and strategies
Cassandra consistency
Write consistency
Read consistency
Consistency maintenance features
Quiz time
Summary
8. Cassandra Management and Maintenance
Cassandra – gossip protocol
Bootstrapping
Failure scenario handling – detection and recovery
Cassandra cluster scaling – adding a new node
Cassandra cluster – replacing a dead node
The replication factor
The nodetool commands
Cassandra fault tolerance
Cassandra monitoring systems
JMX monitoring
Datastax OpsCenter
Quiz time
Summary
9. Storm Management and Maintenance
Scaling the Storm cluster – adding new supervisor nodes
Scaling the Storm cluster and rebalancing the topology
Rebalancing using the GUI
Rebalancing using the CLI
Setting up workers and parallelism to enhance processing
Scenario 1
Scenario 2
Scenario 3
Storm troubleshooting
The Storm UI
Storm logs
Quiz time
Summary
10. Advance Concepts in Storm
Building a Trident topology
Understanding the Trident API
Local partition manipulation operation
Functions
Filters
partitionAggregate
Sum aggregate
CombinerAggregator
ReducerAggregator
Aggregator
Operations related to stream repartitioning
Data aggregations over the streams
Grouping over a field in a stream
Merge and join
Examples and illustrations
Quiz time
Summary
11. Distributed Cache and CEP with Storm
The need for distributed caching in Storm
Introduction to memcached
Setting up memcache
Building a topology with a cache
Introduction to the complex event processing engine
Esper
Getting started with Esper
Integrating Esper with Storm
Quiz time
Summary
A. Quiz Answers
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Chapter 11
Index
Real-time Analytics with Storm and Cassandra
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: March 2015
Production reference: 1240315
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-549-0
www.packtpub.com
Credits
Author
Shilpi Saxena
Reviewers
Sourav Gulati
Saurabh Gupta
Ranjeet Kumar Jha
Mark Kerzner
Sonal Raj
Commissioning Editor
Akram Hussain
Acquisition Editor
Larissa Pinto
Content Development Editor
Shweta Pant
Technical Editor
Saurabh Malhotra
Copy Editors
Pranjali Chury
Merilyn Pereira
Project Coordinator
Shipra Chawhan
Proofreaders
Simran Bhogal
Maria Gould
Paul Hindle
Indexer
Mariammal Chettiyar
Graphics
Sheetal Aute
Valentina D'silva
Abhinash Sahu
Production Coordinator
Manu Joseph
Cover Work
Manu Joseph
About the Author
Shilpi Saxena is a seasoned professional who leads in management with the added edge of being a technology evangelist. She is an engineer with exposure to a variety of domains (the machine-to-machine space, healthcare, telecom, hiring, and manufacturing). She has experience in all aspects of the conception and execution of enterprise solutions. She has been architecting, managing, and delivering solutions in the big data space for the last 3 years, handling high-performance, geographically distributed teams of elite engineers.
Shilpi has more than 12 years of experience (3 of them in the big data space) in the development and execution of various facets of enterprise solutions, in both the product and services dimensions of the software industry. An engineer by degree and profession, she has worn varied hats (developer, technical leader, product owner, tech manager, and so on) and has seen all the flavors the industry has to offer.
She has architected and worked through some of the pioneering production implementations in big data on Storm and Impala with autoscaling in AWS.
To know more about her, visit her LinkedIn profile at http://in.linkedin.com/pub/shilpi-saxena/4/552/a30.
I would like to thank my husband, Sachin Saxena, and my mother, Manju Saxena, for their constant support and encouragement while writing this book. A sincere word of thanks to Impetus and all my mentors, who gave me a chance to innovate and learn as part of the big data group.
About the Reviewers
Sourav Gulati is an MCA and has been working in the IT industry for about 5 years. He has worked on technologies such as Java and Unix shell scripting and has also worked on big data technologies such as Hadoop, Cassandra, Storm, and so on. Initially, he started working for Tech Mahindra in 2010 and then moved to Impetus in 2012. Currently, he is working as a senior software engineer at Impetus.
I would really like to thank Shilpi Saxena and Packt Publishing for giving me the chance to be a part of this book. This book is packed with practical knowledge and experience. I would also like to wish Shilpi a lot of success with this book.
Saurabh Gupta is the lead software engineer at Impetus Technologies and has around 8 years of experience in IT. He started his career with Java/J2EE and headed toward NoSQL and big data technologies. He loves to read about new technologies or tools on the market. He believes that there are no secrets to success, but rather that it is the result of preparation, hard work, and learning from failure.
I want to thank my wife, Nalini, and the rest of my family, who supported and encouraged me in spite of all the time it kept me away from them.
Ranjeet Kumar Jha has over 12 years of experience in various phases of project life cycles, including the development and design phases, and has also been part of production support for Java/J2EE and big data-based applications. He has more than 6 years of experience as a technical architect in Java technologies and more than 3 years in big data stacks. He has worked in various domains such as finance, insurance, e-commerce, digital media, and online advertisements.
Ranjeet has worked as a programmer, designer, and mentor and now works as an architect in all types of projects related to Java, especially J2EE and big data.
His LinkedIn profile is available at https://www.linkedin.com/in/jharanjeet.
His certifications include:
OCM-JEA 5 (Oracle Certified Master, Java Enterprise Architect) with a 94 percent score in 2011
OCE-WSD (Oracle Certified Expert, JAVA EE 6 Web Services Developer) in 2013
SCJP (Sun Certified Java Programmer) in 2004
SCWCD (Sun Certified Web Component Developer) in 2004
Java Development with Apache Cassandra from DataStax in 2014
MongoDB for Java Developers from MongoDB University in 2014
The companies he has worked for include the following:
EtechAces Consulting and Marketing Pvt Ltd. Gurgaon (Delhi NCR)
Times Internet Ltd (TimesGroup), Noida (Delhi NCR)
Ebusinessware Inc (now Xoriant Corporation), Gurgaon (Delhi NCR)
WIPRO, Gurgaon (Delhi NCR)
AgreeYa Solution Pvt Ltd, Noida (Delhi NCR)
INCA Informatics, Noida (Delhi NCR)
I would like to thank my family—my wife, Anila Jha, and two kids, Anushka Jha and Tanisha Jha, for their constant support, encouragement, and patience. Without you, I wouldn't have achieved so much! Love you all immensely.
Mark Kerzner holds degrees in law, math, and computer science. He is a software architect who has been working on Hadoop-based systems since 2008. Mark is a cofounder of Elephant Scale, a big data training and consulting company. He is a coauthor of the open source books Hadoop Illuminated and HBase Design Patterns, both by Packt Publishing. He has also authored and coauthored other books and patents, which can be found at http://www.amazon.com.
I would like to acknowledge the help of my colleagues, in particular, Sujee Maniyam, and last but not least, my multitalented family.
Sonal Raj is a hacker, Pythonista, big data believer, and a technology dreamer. He has a passion for design and is an artist at heart. He blogs about technology, design, and gadgets at http://www.sonalraj.com/. When not working on projects, he can be found traveling, stargazing, or reading.
He has pursued engineering in computer science and loves to work on community projects. He has been a research fellow at SERC, IISc, Bangalore, and has taken up projects on graph computations using Neo4j and Storm. Sonal has been a speaker at PyCon India and local meets on Neo4j and has also published articles and research papers in leading magazines and international journals. He has contributed to several open source projects.
Sonal has been actively involved in the development of machine learning frameworks and has worked on technologies such as NoSQL databases including MongoDB and streaming using Apache Spark. He is currently working at Goldman Sachs.
I am grateful to the author for patiently listening to my critiques and I'd like to thank the open source community for keeping their passion alive and contributing to such remarkable projects. A special thank you to my parents, without whom I would never have grown to love learning as much as I do.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
Storm, initially a project from the house of Twitter, has graduated to become an Apache project and has accordingly been renamed from Twitter Storm to Apache Storm. It is the brainchild of Nathan Marz and has since been adopted by distributions such as Cloudera's Distribution Including Apache Hadoop (CDH) and the Hortonworks Data Platform (HDP).
Apache Storm is a highly scalable, distributed, fast, and reliable real-time computing system designed to process very high-velocity data. Cassandra complements this computing capability by providing lightning-fast reads and writes, and this is the best combination currently available for data