Scalable Data Streaming with Amazon Kinesis: Design and secure highly available, cost-effective data streaming applications with Amazon Kinesis
Ebook, 537 pages, 4 hours


About this ebook

Amazon Kinesis is a collection of secure, serverless, durable, and highly available purpose-built data streaming services. These data streaming services provide APIs and client SDKs that enable you to produce and consume data at scale.
Scalable Data Streaming with Amazon Kinesis begins with a quick overview of the core concepts of data streams, along with the essentials of the AWS Kinesis landscape. You'll then explore the requirements of the use cases shown throughout the book to help you get started and cover the key pain points encountered in the data stream life cycle. As you advance, you'll get to grips with the architectural components of Kinesis, understand how they are configured to build data pipelines, and delve into the applications that connect to them for consumption and processing. You'll also build a Kinesis data pipeline from scratch and learn how to implement and apply practical solutions. Moving on, you'll learn how to configure Kinesis on a cloud platform. Finally, you'll learn how other AWS services can be integrated into Kinesis. These services include Amazon Redshift, Amazon DynamoDB, Amazon S3, Amazon Elasticsearch Service, and third-party applications such as Splunk.
By the end of this AWS book, you'll be able to build and deploy your own Kinesis data pipelines with Kinesis Data Streams (KDS), Kinesis Data Firehose (KDF), Kinesis Video Streams (KVS), and Kinesis Data Analytics (KDA).

Language: English
Release date: Mar 31, 2021
ISBN: 9781800564336


    Book preview

    Scalable Data Streaming with Amazon Kinesis - Tarik Makota


    BIRMINGHAM—MUMBAI

    Scalable Data Streaming with Amazon Kinesis

    Copyright © 2021 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Kunal Parikh

    Publishing Product Manager: Devika Battike

    Senior Editor: Mohammed Yusuf Imaratwale

    Content Development Editors: Sean Lobo and Tazeen Shaikh

    Technical Editor: Devanshi Deepak Ayare

    Copy Editor: Safis Editing

    Project Coordinator: Aparna Ravikumar Nair

    Proofreader: Safis Editing

    Indexer: Tejal Daruwale Soni

    Production Designer: Shankar Kalbhor

    First published: March 2021

    Production reference: 1300321

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham

    B3 2PB, UK.

    ISBN 978-1-80056-540-1

    www.packt.com

    Contributors

    About the authors

    Tarik Makota hails from a small town in Bosnia. He is a principal solutions architect with AWS, a builder, a writer, and the self-proclaimed best fly fisherman at AWS. Never a perfect student, he managed to earn an MSc in software development and management from RIT. When he is not doing the cloud or writing, Tarik spends most of his time fly fishing in pursuit of slippery trout. He feeds his addiction by spending summers in Montana. Tarik lives in New Jersey with his family, Mersiha, Hana, and two exceptionally perfect dogs.

    Brian Maguire is a solutions architect at AWS, where he is focused on helping customers build solutions in the cloud. He is a technologist, writer, teacher, and student who loves learning. Brian lives in New Hope, Pennsylvania, with his family, Lorna, Ciara, Chris, and several cats.

    Danny Gagne is a solutions architect at AWS. He has extensive experience in the design and implementation of large-scale, high-performance analysis systems. He lives in New York City.

    Rajeev Chakrabarti is a principal developer advocate with the Amazon Kinesis and the Amazon MSK team. He has worked for many years in the big data and data streaming space. Before joining the Amazon Kinesis team, he was a streaming specialist solutions architect helping customers build streaming pipelines. He lives in New Jersey with his family, Shaifalee and Anushka.

    About the reviewers

    Ritesh Gupta works as a software development manager with AWS, leading the control plane and data plane teams on the Kinesis Data Streams service. He has over 20 years of experience in leading and delivering geographically distributed web-scale applications and highly available distributed systems supporting millions of transactions per second; he has 10 years of experience in managing engineers and managers. Prior to Amazon, he worked at Microsoft, EA Games, Dell, and a few successful start-ups. His technical expertise cuts across building web-scale applications, enterprise software, and big data. I thank my wife, Jyothi, and daughter, Udita, for putting up with the late-night learning sessions that have allowed me to be where I am.

    Randy Ridgley is an experienced technology generalist working with organizations in the media and entertainment, casino gaming, and public sector fields that are looking to adopt cloud technologies. He started his journey into software development at a young age, building BASIC programs on the Commodore 64. In his professional career, he started by building Windows applications, eventually graduating to Linux with multiple programming languages. Currently, you can find Randy spending most of his time building end-to-end real-time streaming solutions on AWS using serverless technologies and IoT.

    Table of Contents

    Preface

    Section 1: Introduction to Data Streaming and Amazon Kinesis

    Chapter 1: What Are Data Streams?

    Introducing data streams

    Sources of data

    The value of real-time data in analytics

    Decoupling systems

    Challenges associated with distributed systems

    Transactions per second

    Scaling

    Latency

    Fault tolerance/high availability

    Overview of messaging concepts

    Overview of core messaging components

    Messaging concepts

    Examples of data streaming

    Application log processing

    Internet of Things

    Real-time recommendations

    Video streams

    Summary

    Further reading

    Chapter 2: Messaging and Data Streaming in AWS

    Amazon Kinesis Data Streams (KDS)

    Encryption, authentication, and authorization

    Producing and consuming records

    Data delivery guarantees

    Integration with other AWS services

    Monitoring

    Amazon Kinesis Data Firehose (KDF)

    Encryption, authentication, and authorization

    Monitoring

    Producers

    Delivery destinations

    Transformations

    Amazon Kinesis Data Analytics (KDA)

    Amazon KDA for SQL

    Amazon Kinesis Data Analytics for Apache Flink (KDA Flink)

    Amazon Kinesis Video Streams (KVS)

    Amazon Simple Queue Service (SQS)

    Amazon Simple Notification Service (SNS)

    Amazon SNS integrations with other AWS services

    Encryption at rest

    Amazon MQ for Apache ActiveMQ

    IoT Core

    Device software

    Control services

    Analytics services

    Amazon Managed Streaming for Apache Kafka (MSK)

    Apache Kafka

    Amazon MSK

    Amazon EventBridge

    Service comparison summary

    Summary

    Chapter 3: The SmartCity Bike-Sharing Service

    The mission for sustainable transportation

    SmartCity new mobile features

    SmartCity data pipeline

    SmartCity data lake

    SmartCity operations and analytics dashboard

    SmartCity video

    The AWS Well-Architected Framework

    Summary

    Further reading

    Section 2: Deep Dive into Kinesis

    Chapter 4: Kinesis Data Streams

    Technical requirements

    Discovering Amazon Kinesis Data Streams

    Creating streams and shards

    Creating a stream producer application

    Creating a stream consumer application

    Data pipelines with Amazon Kinesis Data Streams

    Data pipeline design (simple)

    Data pipeline design (intermediate)

    Data pipeline design (full design)

    Designing for scalable and reliable analytics pipelines

    Monitoring and scaling with Amazon Kinesis Data Streams

    X-Ray tracing with Amazon Kinesis Data Streams

    Scaling up with Amazon Kinesis Data Streams

    Securing Amazon Kinesis Data Streams

    Implementing least-privilege access

    Summary

    Further reading

    Chapter 5: Kinesis Firehose

    Technical requirements

    Setting up the AWS account

    Using a local development environment

    Using an AWS Cloud9 development environment

    Code examples

    Discovering Amazon Kinesis Firehose

    Understanding KDF delivery streams

    Understanding encryption in KDF

    Using data transformation in KDF with a Lambda function

    Understanding delivery stream destinations

    Amazon S3

    Amazon Redshift

    Amazon Elasticsearch Service

    Splunk destination

    HTTP endpoint destination

    Understanding data format conversion in KDF

    Deserialization

    Schema

    Serializer

    Data format conversion errors

    Understanding monitoring in KDF

    Use-case example – Bikeshare station data pipeline with KDF

    Steps to recreate the example

    Summary

    Further reading

    Chapter 6: Kinesis Data Analytics

    Technical requirements

    AWS account setup

    AWS CDK

    Java and Java IDE

    Code examples

    Discovering Amazon KDA

    Working on SmartCity bike share analytics use cases

    Creating operational insights using SQL Engine

    Core concepts and capabilities

    Creating operational insights using Apache Flink

    Options for running Flink applications in AWS Cloud

    Flink applications on KDA

    Building bike ride analytic applications

    Setting up a producer application

    Building a KDA SQL application

    Building a KDA Flink application

    Monitoring KDA applications

    Summary

    Further reading

    Blogs

    Workshops

    Chapter 7: Amazon Kinesis Video Streams

    Technical requirements

    AWS account setup

    Using a local development environment

    Code examples

    Understanding video fundamentals

    Containers

    Codecs

    Discovering Amazon Kinesis Video Streams WebRTC

    Core concepts and connection patterns

    Creating a signaling channel

    Establishing a connection

    Discovering Amazon KVS

    Key components of KVS

    Stream

    Kinesis producer

    Consuming

    Creating a stream

    Producing

    Integration with Rekognition

    Building video-enabled applications with KVS

    Summary

    Further reading

    Section 3: Integrations

    Chapter 8: Kinesis Integrations

    Technical requirements

    AWS account setup

    AWS CLI

    Kinesis Data Generator

    Code examples

    Amazon services that can produce data to send to Kinesis

    Amazon Connect

    Amazon Aurora database activity

    DynamoDB activity

    Processing Kinesis data with Apache Spark

    Amazon services that consume data from Kinesis

    Serverless data lake

    Amazon services that transform Kinesis data

    Routing events with EventBridge

    Third-party integrations with Kinesis

    Splunk

    Summary

    Further reading

    Why subscribe?

    Other Books You May Enjoy

    Preface

    Amazon Kinesis is a collection of secure, serverless, durable, and highly available purpose-built data streaming services. These data streaming services provide APIs and client SDKs to enable you to produce and consume data at scale.

    Scalable Data Streaming with Amazon Kinesis begins with a quick overview of the core concepts of data streams along with the essentials of the AWS Kinesis landscape. You'll then explore the requirements of the use cases shown throughout the book to help you get started, and cover the key pain points encountered in the data stream life cycle. As you advance, you'll get to grips with the architectural components of Kinesis, understand how they are configured to build data pipelines, and delve into the applications that connect to them for consumption and processing. You'll also build a Kinesis data pipeline from scratch and learn how to implement and apply practical solutions. Moving on, you'll learn how to configure Kinesis on a cloud platform. Finally, you'll learn how other AWS services can be integrated into Kinesis. These services include Amazon Redshift, Amazon DynamoDB, Amazon S3, Amazon Elasticsearch Service, and third-party applications such as Splunk.

    By the end of this AWS book, you'll be able to build and deploy your own Kinesis data pipelines with Kinesis Data Streams (KDS), Kinesis Data Firehose (KDF), Kinesis Video Streams (KVS), and Kinesis Data Analytics (KDA).

    Who this book is for

    This book is for solutions architects, developers, system administrators, data engineers, and data scientists looking to evaluate and choose the most performant, secure, scalable, and cost-effective data streaming technology to overcome their data ingestion and processing challenges on AWS. Prior knowledge of cloud architectures on AWS, data streaming technologies, and architectures is expected.

    What this book covers

    Chapter 1, What Are Data Streams?, covers core streaming concepts so that you will have a detailed understanding of their application in distributed systems.

    Chapter 2, Messaging and Data Streaming in AWS, takes a brief look at the ecosystem of AWS services in the messaging space. After reading this chapter, you will have a good understanding of the various services, be able to differentiate them, and understand the strengths of each service.

    Chapter 3, The SmartCity Bike-Sharing Service, reviews the existing bike-sharing application and how the city plans to modernize it. This chapter will provide the background information for the examples used throughout the book.

    Chapter 4, Kinesis Data Streams, teaches concepts and capabilities, common deployment patterns, monitoring and scaling, and how to secure KDS. We will step through a data streaming solution that will ingest, process, and feed data from multiple SmartCity data systems.

    Chapter 5, Kinesis Firehose, teaches the concepts, common deployment patterns, monitoring and scaling, and security in KDF.

    Chapter 6, Kinesis Data Analytics, covers the concepts and capabilities, approaches for common deployment patterns, monitoring and scaling, and security in KDA. You will learn how real-time streaming data can be queried like a database with SQL or code.

    Chapter 7, Amazon Kinesis Video Streams, explores the concepts, monitoring and scaling, security, and deployment patterns for real-time communication and data ingestion. We will step through a solution that will provide real-time access to a video stream and ingest video data for the SmartCity data system.

    Chapter 8, Kinesis Integrations, reviews how to integrate Kinesis with several Amazon services, such as Amazon Redshift, Amazon DynamoDB, AWS Glue, Amazon Aurora, Amazon Athena, and other third-party services such as Splunk. We will integrate a wide variety of services to create a serverless data lake.

    To get the most out of this book

    All of the examples in the chapters in this book are run using an AWS account to access services such as Amazon Kinesis, DynamoDB, and Amazon S3. Readers will need a Windows, Mac, or Linux computer with an internet connection. Many of the examples in the book use a command-line terminal such as PuTTY, macOS Terminal, GNOME Terminal, or iTerm2 to run commands and change configuration. The examples written in Python are written for the Python 3 interpreter and may not work with Python 2. For the examples written for the Java platform, readers are encouraged to use Java version 11 and AWS Java SDK version 1.11. We make extensive use of the AWS CLI v2 and will also use Docker for some examples. In addition to software, a webcam or IP camera and Android device will be needed to fully execute some of the examples.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    Download the example code files

    You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Streaming-Data-Solutions-with-Amazon-Kinesis. In case there's an update to the code, it will be updated on the existing GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Download the color images

    We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781800565401_ColorImages.pdf.

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: In this command, we'll send the test2.mkv file we downloaded to the KVS stream.

    A block of code is set as follows:

    aws glue create-database --database-input {\"Name\":\"smartcitybikes\"}

    aws glue create-table --database-name smartcitybikes --table-input file://SmartCityGlueTable.json

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    mediaSource.start();

    Any command-line input or output is written as follows:

    aws rekognition start-stream-processor --name kvsprocessor

    Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: Once you have entered the appropriate information, all that's left is to click Create signaling channel.

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.

    Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Reviews

    Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

    For more information about Packt, please visit packt.com.

    Section 1: Introduction to Data Streaming and Amazon Kinesis

    In this section, you will be introduced to the concept of data streams and how they are used to create scalable data solutions. 

    This section comprises the following chapters:

    Chapter 1, What Are Data Streams?

    Chapter 2, Messaging and Data Streaming in AWS

    Chapter 3, The SmartCity Bike-Sharing Service

    Chapter 1: What Are Data Streams?

    A data stream is a system in which data continuously flows from multiple sources, just as water flows through a stream. The data is often produced and collected simultaneously in a continuous flow of many small files or records. Data streams are utilized by a wide range of business, medical, government, social media, and mobile applications. These applications include financial applications for the stock market and e-commerce ordering systems that collect orders and track their fulfillment and delivery.

    In the entertainment space, live data is produced by sensing devices embedded in player equipment, video game players generate data at massive scale, and new social media posts appear thousands of times per second. Governments also leverage streaming data and geospatial services to monitor land, wildlife, and other activities.

    Data volume and velocity are increasing at faster rates, creating new challenges in data processing and analytics. This book will detail these challenges and demonstrate how Amazon Kinesis can be used to address them. We will begin by discussing key concepts related to messaging in a technology-agnostic form to provide a solid foundation for building your Kinesis knowledge.

    Incorporating data streams into your application architecture will allow you to deliver high-performance solutions that are secure, scalable, and fast. In this chapter, we will cover core streaming concepts so that you will have a detailed understanding of their application to distributed systems. You will learn what a data stream is, how to leverage data streams to scale, and examine a number of high-level use cases.

    This chapter covers the following topics:

    Introducing data streams

    Challenges associated with distributed systems

    Overview of messaging concepts

    Examples of data streaming

    Introducing data streams

    Data streams are a way of storing a sequence of messages. They enable us to design systems where we think about state as a series of events instead of only entities and values, or rows and columns in a database. This shift in mindset and technology enables real-time analytics, which extract value from data by acting on it before it goes stale. Data streams also enable organizations to design and develop resilient software based on microservice architectures by helping them decouple systems. We will begin with an overview of streaming data sources, why real-time data analysis is valuable, and how streams can be used architecturally to decouple systems. We will then review the core challenges associated with distributed systems and conclude with an overview of key messaging concepts and some high-level examples. Messages can contain a wide variety of information and come from different sources, so let's look at the primary sources and data formats.
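    Before we do, here is a minimal sketch of what appending a single event to a stream can look like in practice, using the AWS SDK for Python (boto3). The stream name, region, and payload are illustrative placeholders rather than values used elsewhere in this book:

    import json

    import boto3

    # A minimal sketch, not production code: each record is appended to
    # the stream as an immutable event, and the partition key determines
    # which shard of the stream receives the record.
    kinesis = boto3.client("kinesis", region_name="us-east-1")

    event = {"deviceid": "device001", "temp": 68.4}

    kinesis.put_record(
        StreamName="example-stream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["deviceid"],
    )

    Later chapters cover production-grade producers and consumers in depth; for now, the point is simply that state is captured as an ordered sequence of small, self-describing events.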

    Sources of data

    Data steadily proliferates from sources such as social media, IoT devices, web clickstreams, application logs, and video cameras. This data poses challenges to most systems because it is typically high-velocity, intermittent, and bursty, which makes downstream systems difficult to provision and design adequately. Payloads are generally small, except when they contain audio or video data, and come in a variety of formats.

    In this book, we will be focusing on three data formats. These formats include the following:

    JavaScript Object Notation (JSON)

    Log files

    Time-encoded binary files such as video

    JSON streams

    JSON has become the dominant format for message serialization over the past 10 years. It is a lightweight data interchange format that is easy for humans to read and write and is based on JavaScript object syntax. It has two data structures: hash tables and lists. A hash table consists of key-value pairs, {key: value}, where the keys must be unique. A list is a set of values in a specific order, [value 1, value 2]. The following is a sample IoT JSON message:

    {
        "deviceid": "device001",
        "eventTime": -192778200,
        "temp": 68.4,
        "humidity": 77.3,
        "coords": {
            "latitude": 32.779039,
            "longitude": -96.808660
        }
    }
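    To connect these structures back to code, here is a short sketch of reading a message like the one above with Python's standard json module; the hash table becomes a dictionary, and the nested coords object becomes a nested dictionary:

    import json

    message = ('{"deviceid": "device001", "temp": 68.4, '
               '"coords": {"latitude": 32.779039, "longitude": -96.808660}}')

    # Parse the JSON text into Python data structures.
    record = json.loads(message)
    print(record["deviceid"])            # device001
    print(record["coords"]["latitude"])  # 32.779039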

    Log file streams

    Log files come in a variety of formats. Common ones include Apache Commons Logging, Apache Combined Log, Apache Error Log, and RFC 3164 syslog. They are plain text, and each line, delineated by a newline ('\n') character, is usually a separate log entry. The following sample log line records an HTTP GET request and contains the IP address (10.13.37.01), the datetime of the request, the HTTP verb, the URL fragment, the HTTP version, the response code, and the size of the response.

    The sample log line in Apache Commons Logging format is as follows:

    10.13.37.01 - - [03/Sep/2017:12:00:01 +0830] GET /mailman/listinfo/test HTTP/1.1 200 2457
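    As a sketch of how these fields can be recovered in code, the following Python snippet parses the sample line with a regular expression; the pattern is written for this exact layout and is not a general-purpose log parser:

    import re

    line = ("10.13.37.01 - - [03/Sep/2017:12:00:01 +0830] "
            "GET /mailman/listinfo/test HTTP/1.1 200 2457")

    # Named groups mirror the fields described in the text: IP address,
    # datetime, HTTP verb, URL fragment, HTTP version, response code, size.
    pattern = re.compile(
        r"(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] "
        r"(?P<verb>\S+) (?P<url>\S+) (?P<version>\S+) "
        r"(?P<status>\d{3}) (?P<size>\d+)"
    )

    match = pattern.match(line)
    if match:
        print(match.group("ip"), match.group("status"))  # 10.13.37.01 200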

    Time-encoded binary streams

    Time-encoded binary streams consist of a time series of records where each record is related to the adjacent records (prior and subsequent records). These can be used for a wide variety of sensor data, from audio streams and RADAR signals to video streams. Throughout this book, the primary focus will be video streams and their applications.

    Figure 1.1 – Time-encoded video data

    As shown in Figure 1.1, video streams are composed of fragments, where each fragment is a self-contained sequence of media frames. There are no dependencies between fragments. We will discuss video streams in more detail in Chapter 7, Amazon Kinesis Video Streams. Now that we've covered the types of data that we'll be processing, let's take a step back to understand the value of real-time data in analytics.

    The value of real-time data in analytics

    Analysis is done to support decision making by individuals, organizations, or computer programs. Traditionally, data analysis has been done on batches of data, usually in long-running jobs that occur overnight or periodically at predetermined times: nightly, weekly, quarterly, and so on. This not only limits the scope of actions available to decision makers, but also provides them with only a representation of the past environment. Information is now available seconds after it is produced, so we need to design systems that provide decision makers with the freshest data available to make timely decisions.

    The OODA loop (Observe, Orient, Decide, Act) is a conceptual decision-making framework that describes how decisions are made when reacting to an event. By breaking it down into these four components,
