AWS Playbook-V2

The document provides an overview of cloud computing concepts and AWS services for data extraction, transformation, querying and visualization.

The purpose is to better understand the NYC taxi system to improve city commute efficiency.

The steps outlined are to download the data files, insert into DynamoDB, connect DynamoDB to Redshift and analyze the data on Redshift.

AWS Playbook

Version:02
March 2021

L&D | AWS Playbook


Contents

Introduction to Cloud Computing
AWS Fundamentals
Hands on Setup
AWS Learning Journey
AWS Learning Plan
AWS Certifications
Assignments
Case Studies

Deloitte Touche Tohmatsu India LLP L&D | AWS Playbook


Introduction to Cloud Computing

Introduction to Cloud

What is Cloud?
Cloud is a key enabler of digital business models, big data and analytics, Internet of Things (IoT), artificial intelligence, and more. Establishing a cloud infrastructure improves internal processes by facilitating cost reduction, optimizing service delivery, and increasing organizational agility. It also improves external performance by enabling business innovation, increasing market responsiveness, and enhancing customer experience. Cloud adoption may involve strategy and readiness; migration of products, services, or data; and custom or package implementation. It can comprise cloud solutions and managed services.

On Premise vs Cloud
As your resources move from on-premises to off-premises, your costs are reduced and your administration requirements decrease (as depicted in the picture on the right).



Cloud Computing

What is Cloud Computing?
Cloud Computing is the delivery of computing services (including servers, storage, databases, networking, software, analytics, and intelligence) over the Internet ("the cloud") to offer faster innovation, flexible resources, and economies of scale. You typically pay only for the cloud services you use, which helps you lower your operating costs, run your infrastructure more efficiently, and scale as your business needs change.
In other words, Cloud Computing is a way to rent compute power and storage from someone else's datacenter. You can treat cloud resources as you would the resources in your own datacenter. When you're done using them, you give them back. You're billed only for what you use. Instead of maintaining CPUs and storage in your datacenter, you rent them for the time that you need them. The cloud provider takes care of maintaining the underlying infrastructure for you. The cloud enables you to quickly solve your toughest business challenges and bring cutting-edge solutions to your users.

Advantages of Cloud Computing

Cost
Cloud computing eliminates the capital expense of buying hardware and software and setting up and running on-site datacenters: the racks of servers, the round-the-clock electricity for power and cooling, and the IT experts for managing the infrastructure. It adds up fast.

Speed
Most cloud services are provided self-service and on demand, so even large amounts of computing resources can be provisioned in minutes.

Global Scale
Cloud services can scale elastically, delivering the right amount of compute, storage, and bandwidth when it is needed and from the right geographic location.

Productivity
On-site datacenters require hardware setup, software patching, and other time-consuming IT chores. Cloud computing removes many of these tasks, so IT teams can spend their time on business goals.

Performance
Major cloud services run on a worldwide network of datacenters that are regularly upgraded to current hardware, which reduces network latency for applications and brings greater economies of scale.

Reliability
Cloud computing makes data backup, disaster recovery, and business continuity easier and less expensive, because data can be mirrored at multiple redundant sites on the provider's network.



Different Cloud Computing Models

IAAS
This cloud service model is the closest to managing physical servers. A cloud provider keeps the hardware up to date, but operating system maintenance and network configuration are left to the cloud tenant. For example, Azure virtual machines are fully operational virtual compute devices running in Microsoft's datacenters. An advantage of this cloud service model is rapid deployment of new compute devices. Setting up a new virtual machine is considerably faster than procuring, installing, and configuring a physical server.

PAAS
This cloud service model is a managed hosting environment. The cloud provider manages the virtual machines and networking resources, and the cloud tenant deploys their applications into the managed hosting environment. For example, Azure App Services provides a managed hosting environment where developers can upload their web applications without having to deal with the physical hardware and software requirements.

SAAS
In this cloud service model, the cloud provider manages all aspects of the application environment, such as virtual machines, networking resources, data storage, and applications. The cloud tenant only needs to provide their data to the application managed by the cloud provider. For example, Office 365 provides a fully working version of Office that runs in the cloud. All that you need to do is create your content, and Office 365 takes care of everything else.



Deployment Models for Cloud Computing
There are three deployment models for cloud computing: public cloud, private cloud, and hybrid cloud. Each deployment model has different aspects that you should consider as you migrate to the cloud.

Public Cloud
Services are offered over the public internet and are available to anyone who wants to purchase them. Cloud resources like servers and storage are owned and operated by a third-party cloud service provider and delivered over the internet.

Private Cloud
Private cloud solutions are dedicated to one organization or business, and often have much more specific security controls than a public cloud. Many medical offices, banking institutions, and similar organizations use a private cloud. Using private cloud storage allows them to control highly sensitive data like medical records, trade secrets, or other classified information. Private cloud solutions use infrastructure that is either owned and controlled by the organization, or managed by a vendor who is contractually required to meet the organization's specific criteria.

Hybrid Cloud
This computing environment combines a public cloud and a private cloud by allowing data and applications to be shared between them. An example of a hybrid cloud solution is an organization that wants to keep confidential information secured on its private cloud, but make more general, customer-facing content available on a public cloud.

Examples
Public Cloud: AWS, Azure, GCP
Private Cloud: HP Data Centres, Telstra Cloud, Ubuntu



Brief Comparison between Cloud Service Providers
Service Category

Compute
• AWS: AWS provides a wide range of machine types, CPUs, serverless, containers and event driven compute options. AWS is the only provider to offer SKUs across a select set of global regions and native support for VMWare workloads.
• Azure: Azure provides a wide range of machine types, CPUs, serverless, containers and event driven compute options. Azure offers a bare metal instance only for use with SAP HANA, and the capability to run VMWare workloads using a 3rd party solution by CloudSimple.
• GCP: There is a narrow range of standard machine types, CPUs and memory amounts supported, with on-demand, reserved and transient instance pricing. GCP doesn't offer a bare metal instance or capabilities to natively run VMWare workloads.

Storage
• AWS: There is a wide range of data storage options for object, block, file and blob storage as well as a hybrid storage gateway. The network storage option is available in NFS and SMB formats.
• Azure: There is a wide range of data storage options for object, block, file and blob storage as well as a hybrid storage gateway. There is an offering of SMB based network storage.
• GCP: There is a similar range of offerings for storage types compared to other providers. However, network storage is only available as NFS and there are no hybrid storage and cold storage options.

Data
• AWS: AWS is a good option for data processing such as batch and streaming, data migration services and in-memory computing. However, AWS data stores are relatively expensive and do not scale well when compared to Azure.
• Azure: Azure's data platforms score better than AWS and GCP and fit well with user requirements. Azure data stores can also scale better and handle more concurrent queries.
• GCP: GCP data platforms do not score well when compared to AWS and Azure, as they don't scale well and are not optimal for indexing, scanning and handling concurrent queries.

Analytics & Data Science
• AWS: AWS provides a wide variety of native advanced analytics services such as ML Workbench, image processing, a healthcare focused NLP toolset, speech and chatbots. However, AWS services are not optimal for standard reporting and BI solutions.
• Azure: Azure provides an optimal mix of standard advanced analytics capabilities that are tightly integrated across the whole ecosystem. However, Azure currently doesn't provide capabilities in healthcare centric AI tools.
• GCP: GCP provides a rich set of capabilities for predictive analytics and deep learning, especially using readily available open-source frameworks. Alphabet's subsidiaries continuously build and evolve strategic healthcare features, which makes GCP a robust provider for specific AI/ML use cases.

Infrastructure & Ops
• AWS: AWS is a strong contender in the networking and Platform Ops categories but is not optimal for IAM, Logging and Billing Management.
• Azure: Azure has the most optimal combination of networking, security, dev, data platform Ops capabilities and Disaster Recovery capability.
• GCP: GCP services score less compared to Azure and AWS, as they do not offer a native dev environment, have no native management capability, and have no current solutions for backups and disaster recovery.



Strength & Weakness

AWS
Strengths
• Dominant Market Position: "AWS has been the market share leader in cloud IaaS for over 10 years."
• Extensive Mature Offerings: AWS has a huge and growing array of available services, as well as the most comprehensive network of worldwide data centers
• Support for Large Firms: It has the deepest capabilities for governing a large number of users and resources
• Global Reach: AWS is the most mature, enterprise-ready provider
Weaknesses
• Difficult to Use: many enterprises find it difficult to understand the company's cost structure
• Cost Management: many enterprises find it difficult to manage those costs effectively when running a high volume of workloads on the service
• Overwhelming options

Azure
Strengths
• Second Largest Provider
• Integration with Microsoft Tools & Software: enterprises that use a lot of Microsoft software often find that it also makes sense for them to use Azure
• Broad Feature Set: rich set of API and developer tools
• Hybrid Cloud: uses a mix of on-premises, private cloud and third-party, public cloud services with orchestration between the two platforms
• Support for open source
Weaknesses
• Less Enterprise Ready: clients report that the service experience feels less enterprise-ready than they expected, given Microsoft's long history as an enterprise vendor
• Incomplete Management Tooling: Azure doesn't offer as much support for DevOps approaches as some of the other cloud platforms

GCP
Strengths
• Designed for cloud native businesses
• Commitment to open source and portability: GCP specializes in high compute offerings like Big Data, analytics and machine learning
• Deep discounts and flexible contracts
• DevOps Expertise
Weaknesses
• Late Entrant to IaaS Market
• Fewer features and services: it doesn't offer as many different services and features as AWS and Azure
• Fewer worldwide data centers



AWS Fundamentals

Introduction to AWS

What is AWS
Amazon Web Services (AWS) is one of the world's most comprehensive, widely adopted, and easy-to-use cloud computing platforms, offered by Amazon. The platform combines infrastructure as a service (IaaS), platform as a service (PaaS), and packaged software as a service (SaaS) offerings.

AWS offers over 200 fully featured services from data centers globally. AWS products include services for security, analytics, development tools, databases, storage, networking, migration, and enterprise applications.

Reference Tutorials
• https://www.youtube.com/watch?v=a9__D53WsUs
• https://www.youtube.com/watch?v=wWeyzYzd17o
• https://aws.amazon.com/what-is-aws/



AWS Reference Architecture
BI/Analytics/IOT Workloads

[Figure: AWS reference architecture diagram. The recoverable labels are listed by layer below.]

Data Sources
• Business Applications: Customer & Distribution; Marketing, Sales & Distribution; ERP; HR & Finance; Master Data Mgmt.
• Other RDBMS, Other Data Sources, IOT Data, Unstructured Data, Geospatial Data, External Data, Live Streams

Ingest
• Data Movement: AWS Direct Connect, AWS Storage Gateway, AWS Database Migration Service, AWS CLI S3, AWS S3 Transfer Acceleration
• Ingestion Methods: Batch ETL & ELT with AWS Glue (Informatica, Talend); File (batch, intra-day batch, mini-batch); Messaging Batch with AWS Kinesis Firehose; Web Services with NiFi; Stream Ingestion with Data Streams/Sync Messaging on AWS Kinesis (Confluent Kafka/NiFi)

Data Storage and Processing
• Landing Zone and Data Lake (AWS S3): Production Data Lake with Raw Layer (raw files), Processed Layer (DQ applied files) and Consumption Layer (consumption ready analytical datasets); External Data; Research Data Lake (Processing, Analysis, Visualization)
• Analytical Processing: Batch/Micro-Batch Processing (AWS EMR, Data Bricks), Stream Processing (AWS EMR, Kinesis), IOT Processing (AWS IoT Analytics), Analytics/ML Model Repository
• Data Storage and Provisioning: Analytics Marts on AWS Redshift, Atomic Data Marts/ODS on AWS RDS, In-Memory Processing with Presto on EMR, Interactive Querying with AWS Athena, Real-time Search with AWS Elastic Search

Information Delivery
• Business Intelligence Platform: QuickSight (AtScale, Cognos, MicroStrategy, etc.) for visualization, dashboards and analytics
• Advanced Analytics/ML: SageMaker, AWS ML, Rekognition, etc. for text analytics, predictive modeling and data mashup
• Analytics and Visualization: QuickSight (Tableau, Qlik)
• Streaming Analytics (Kibana) for real-time dashboards and transactional applications
• Enterprise Search

Information Consumers
• Functional Users: Business Consumers/Analytics, Finance and Accounting, Corporate Business Functions and HR, Data Scientist Community, External Consumers
• Information Access Channels: Portals, Interactive, Mobile, Enterprise Applications

Developer and Management Tools
AWS Identity & Access Management, AWS Directory Service, AWS Key Management Service, AWS CloudTrail, AWS CloudWatch (Data Dog), AWS Management Console, AWS CloudFormation (Ansible), Code Repository (Git, Bit Bucket), AWS Code Deploy (Jenkins/Circle CI)

Enterprise Data Governance
Enterprise Content Management, Data Quality Management, Metadata Management, Data Security & Privacy, Master and Reference Data Management, Business Rules Management, Audit, Balance and Control, Data Catalog and Discovery



AWS Components and Services

[Figure: AWS components and services diagram. The recoverable labels are listed below; the placement of individual services within layers is not recoverable from the extracted text.]

Layers: Data Sources, Data Ingestion, Data Storage and Processing, Data Provisioning, Information Delivery, Consumption Layer, Data Governance and Operations

Data Sources: Structured Data (Flat Files, RDBMS), Semi-Structured Data (JSON Files, XML Files), Un-Structured Data (Images, Videos)

Services shown: S3, Redshift, Glacier, Data Pipeline, DMS, Snowball, Batch, Glue, Lambda, Kinesis Firehose, Kinesis Streams, Kinesis Analytics, Storage Gateway, Direct Connect, DynamoDB, RDS, EFS, EBS, EMR, ES, EC2, LightSail, Virtual Private Cloud, Athena, Machine Learning, QuickSight, Pinpoint, CloudFront, API Gateway, IoT, Connect, Chime, Trusted Advisor, Systems Manager

Other labels: Traditional Data Warehouse (Raw Data Staging, Data Marts), Hadoop Platform, Enterprise ETL Tools, Batch Data Integration, Stream and Real Time Integration, Data API/Subscription, Datahub, Enrichment Engines, Compute Layer (AI & ML Deployment), Data Discovery, Operational Business Applications, Business Processes, External Channels, External Gateways

Data Governance and Operations: AWS Identity & Access Management, AWS Service Catalog, Business Glossary, Data Stewardship, AWS Organizations, AWS OpsWorks, AWS Glue Data Catalog, Amazon CloudTrail, AWS Artifact



AWS Components List

Data Analytics
• EMR
• Lambda
• Athena
• Elastic Search

Data Ingestion/Integration
• Glue
• Kinesis
• Data Pipeline
• SNS
• SQS

Advanced analytics, ML and IOT
• SageMaker
• Comprehend
• Lex
• IoT Analytics
• Greengrass
• Machine Learning
• TensorFlow on AWS

Data Movement
• Direct Connect
• Storage Gateway
• Database Migration Service
• S3 Transfer Acceleration

DevOps and operations management
• CodeCommit
• CodeDeploy
• CloudTrail
• CloudWatch
• Directory Service

Storage
• S3
• Glacier



AWS Service Areas

• Data Ingestion & Integration
• Data Storage
• DevOps & operations management
• Advanced analytics, ML & IOT
• Data Movement
• Databases & Data Management
• Data Processing & Compute
• Interactive Querying & analytics



Data Movement

AWS Direct Connect
• A cloud service solution that makes it easy to establish a dedicated network connection from your premises to AWS.
• Lets you establish private connectivity between AWS and your datacenter, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.

AWS Storage Gateway
• A hybrid cloud storage service that gives you on-premises access to virtually unlimited cloud storage.
• Connects an on-premises software appliance with cloud-based storage to provide seamless integration, with data security features, between your on-premises IT environment and the AWS storage infrastructure.
• Stores data in the AWS Cloud for scalable and cost-effective storage that helps maintain data security.

AWS Database Migration Service
• Helps to migrate databases to AWS quickly and securely.

• The source database remains fully operational during the migration, minimizing downtime to applications
that rely on the database.
• Supports homogeneous migrations such as Oracle to Oracle, as well as heterogeneous migrations between
different database platforms.

AWS S3 Transfer Acceleration


• A bucket-level feature that enables fast, easy, and secure transfers of files over long distances between your client
and an S3 bucket

• Reduces the variability in Internet routing, congestion and speeds that can affect transfers, and logically shortens the
distance to S3 for remote applications.
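
As a small illustration of the bucket-level nature of S3 Transfer Acceleration: once the feature is enabled on a bucket, clients reach it through a dedicated accelerate endpoint of the form bucketname.s3-accelerate.amazonaws.com, and the bucket name must be DNS-compliant and contain no periods. The sketch below only builds that endpoint string; the function name and the example bucket name are hypothetical, and the check is a simplified reading of the naming rules rather than a complete validator.

```python
import re

# Simplified check: 3-63 characters, lowercase letters, digits and hyphens,
# starting and ending with a letter or digit. Periods are rejected because
# buckets used with Transfer Acceleration must not contain them.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def accelerate_endpoint(bucket_name: str, dualstack: bool = False) -> str:
    """Return the Transfer Acceleration endpoint URL for a bucket (hypothetical helper)."""
    if not _BUCKET_RE.fullmatch(bucket_name):
        raise ValueError(f"bucket name not usable with Transfer Acceleration: {bucket_name!r}")
    suffix = (
        "s3-accelerate.dualstack.amazonaws.com"
        if dualstack
        else "s3-accelerate.amazonaws.com"
    )
    return f"https://{bucket_name}.{suffix}"


if __name__ == "__main__":
    print(accelerate_endpoint("example-bucket"))
```

Keeping the check in one place makes it easy to reject names such as my.bucket before any transfer is attempted.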

Data Ingestion & Integration

AWS Glue
• A fully managed, serverless extract, transform, and load (ETL) service that makes it easy to discover, prepare, and combine data for analytics.
• Includes the Glue Data Catalog and crawlers that scan data stores, infer schemas, and register table definitions for use by other services.

AWS Kinesis Data Firehose
• A fully managed service for delivering real-time streaming data to destinations such as Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.
• Scales automatically to match the throughput of your data, and can batch, compress, transform, and encrypt data before loading it.

AWS Kinesis
• A massively scalable and durable real-time data streaming service that continuously captures data from sources such as website clickstreams, database event streams, financial transactions, and social media feeds.
• Gigabytes of data can be captured per second, and collected data can be available in milliseconds to enable real-time analytics use cases.

AWS Data Pipeline
• A web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals.
• Lets you define data-driven workflows in which tasks depend on the successful completion of previous tasks.
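
To make the scaling model of Kinesis streams more concrete: a stream is divided into shards, and each record carries a partition key that is hashed with MD5 into a 128-bit integer, which decides the shard that receives the record. The sketch below illustrates that mapping locally. It assumes the hash-key space is split evenly across shards (which may not hold after resharding), and the function name and sample keys are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128-bit integers


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash-key range contains the key.

    Assumption: the 128-bit hash space is divided evenly across shards.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Scale the hash into the range [0, shard_count).
    return hash_value * shard_count // HASH_SPACE


if __name__ == "__main__":
    for key in ("taxi-001", "taxi-002", "taxi-003"):
        print(key, "-> shard", shard_for_key(key, shard_count=4))
```

Records with the same partition key always land on the same shard, which is why a key with good spread (for example a device or trip identifier) helps distribute load.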

AWS Storage

S3 – Simple Storage Service
• An object storage service. Users create buckets to store data.
• Offers industry-leading scalability, data availability, security, and performance.
• Provides easy-to-use management features so you can organize your data and configure finely tuned access controls to meet your specific business, organizational, and compliance requirements.
• Designed for 99.999999999% (11 9's) of durability, and stores data for millions of applications for companies all around the world.

AWS Glacier
• A secure, durable, and extremely low-cost Amazon S3 cloud storage class for data archiving and long-term backup.
• Users create archives and vaults for storage.
• An archive can be any data, such as a photo, video, or document, and is the base unit of storage in S3 Glacier.
• Designed to deliver 99.999999999% durability and provide comprehensive security and compliance capabilities that can help meet even the most stringent regulatory requirements.
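
The 11 9's durability figure is easier to appreciate with a short calculation. The sketch below assumes the figure is read as the annual probability that a single object survives, and uses a hypothetical count of 10 million stored objects; the function names are illustrative.

```python
from fractions import Fraction

# Durability quoted above: 99.999999999% ("11 nines"), read here as the
# per-object, per-year survival probability (an interpretive assumption).
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_loss(object_count: int) -> Fraction:
    """Expected number of objects lost per year at the stated durability."""
    return object_count * (1 - DURABILITY)


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_loss(object_count)


if __name__ == "__main__":
    objects = 10_000_000  # hypothetical number of stored objects
    print(float(expected_annual_loss(objects)))  # 0.0001
    print(int(years_per_lost_object(objects)))   # 10000
```

Under these assumptions, storing 10 million objects corresponds to an expected loss of one object roughly every 10,000 years.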



Databases and Data Management

AWS RDS
• A managed relational database service that makes it easy to set up, operate, and scale a relational database in the cloud.
• Automates time-consuming administration tasks such as hardware provisioning, database setup, patching, and backups.
• Supports several database engines, including Amazon Aurora, MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server.

AWS Redshift
• A fully managed, petabyte-scale data warehouse service in the cloud.
• Uses columnar storage and massively parallel processing to run analytic queries with standard SQL.
• Redshift Spectrum extends queries to data stored in Amazon S3.

AWS Aurora
• A MySQL- and PostgreSQL-compatible relational database built for the cloud and offered through Amazon RDS.
• Replicates storage across multiple Availability Zones and scales storage automatically as data grows.

AWS Dynamo DB
• A fully managed, serverless key-value and document (NoSQL) database designed for single-digit millisecond performance at any scale.

AWS ElastiCache
• A fully managed in-memory caching service, compatible with Redis and Memcached, used to speed up applications by keeping frequently accessed data in memory.
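
As a brief illustration of how DynamoDB represents data: each item is a set of named attributes, and the low-level API tags every value with a type descriptor such as S (string), N (number, sent as a string), BOOL, NULL, L (list), or M (map). The sketch below converts a plain Python record into that shape. The helper names and the taxi-trip fields are hypothetical, and only a subset of the data types is covered.

```python
from decimal import Decimal


def to_attribute_value(value):
    """Convert a Python value into DynamoDB's typed attribute-value form (subset of types)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"BOOL": value}
    if value is None:
        return {"NULL": True}
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float, Decimal)):
        return {"N": str(value)}  # numbers travel as strings
    if isinstance(value, list):
        return {"L": [to_attribute_value(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def to_item(record: dict) -> dict:
    """Convert a flat or nested record into a DynamoDB-style item."""
    return {key: to_attribute_value(val) for key, val in record.items()}


if __name__ == "__main__":
    trip = {"trip_id": "t-001", "passenger_count": 2, "fare": Decimal("12.50"), "shared": False}
    print(to_item(trip))
```

In practice an SDK usually performs this conversion for you; writing it out once simply shows what is stored.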
Data Processing and Compute

EMR – Elastic Map Reduce
• A managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
• Using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, we can process data for analytics purposes and business intelligence workloads.
• Used to transform and move large amounts of data in and out of other AWS data stores and databases.

AWS Lambda
• Serverless compute service that lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes.
• Can run code for virtually any type of application or backend service, all with zero administration.
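
To show what "running code without managing servers" looks like in practice, here is a minimal sketch of a Python Lambda function. The service calls a handler with two arguments, the triggering event and a context object with runtime metadata. The event field name and the response shape (statusCode and body, as commonly used behind an HTTP front end) are assumptions for illustration.

```python
import json


def lambda_handler(event, context):
    """Entry point: `event` carries the input payload, `context` carries runtime metadata."""
    name = event.get("name", "world")  # hypothetical input field
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


if __name__ == "__main__":
    # Local invocation for testing; in Lambda the service supplies both arguments.
    print(lambda_handler({"name": "AWS"}, None))
```

Because the handler is an ordinary function, it can be unit tested locally before it is deployed.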



Interactive querying and analytics

AWS Elastic Search
• A fully managed service that makes it easy for you to deploy, secure, and run Elasticsearch cost effectively at scale.
• Lets you build, monitor, and troubleshoot your applications using the tools you love, at the scale you need.
• Provides support for open source Elasticsearch APIs, managed Kibana, integration with Logstash and other AWS services, and built-in alerting and SQL querying.

AWS Athena
• An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. No need for complex ETL jobs to prepare your data for analysis.
• Serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
• Point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds.

AWS QuickSight
• A scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud.
• Helps in creating and publishing interactive BI dashboards that include machine learning-powered insights.

Presto on EMR
• An open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
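
The "define the schema, then query with standard SQL" workflow described for Athena and Presto can be tried out locally. The sketch below uses Python's built-in sqlite3 module purely as a stand-in engine, so it does not touch S3 or any AWS service; the table, columns, and sample rows are hypothetical.

```python
import sqlite3

# In-memory database as a local stand-in for an interactive SQL engine.
conn = sqlite3.connect(":memory:")

# Step 1: define the schema (hypothetical taxi-trip table).
conn.execute("CREATE TABLE trips (pickup_zone TEXT, fare REAL)")

# Step 2: make some sample data available to the engine.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Midtown", 12.5), ("Midtown", 7.5), ("Harlem", 20.0)],
)

# Step 3: query with standard SQL.
rows = conn.execute(
    """
    SELECT pickup_zone, COUNT(*) AS trip_count, AVG(fare) AS avg_fare
    FROM trips
    GROUP BY pickup_zone
    ORDER BY pickup_zone
    """
).fetchall()

print(rows)  # [('Harlem', 1, 20.0), ('Midtown', 2, 10.0)]
```

With a service such as Athena, the schema would describe files that already sit in S3 instead of inserted rows, while the aggregation query would stay in the same standard SQL style.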



Advanced analytics, ML, and IOT

AWS SageMaker
A fully managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. It includes hosted Jupyter notebooks that make it easy to explore and visualize the training data stored in Amazon S3.

AWS Comprehend
A continuously trained Natural Language Processing (NLP) service that uses machine learning to find insights and relationships across unstructured text like customer reviews and news articles.

AWS Lex
A service for building conversational interfaces into any application using voice and text, enabling developers to bring sophisticated, natural language chatbots to applications.

AWS Machine Learning
A managed service for building ML models and generating predictions that enable the development of robust, scalable smart applications.

AWS IoT Analytics
A fully managed service that makes it easy to run sophisticated analytics on massive volumes of IoT data. It simplifies running analytics on IoT data to get insights for better and more accurate decisions for IoT applications and machine learning use cases.

AWS Greengrass
Amazon's IoT service that lets devices process the data they generate locally, while still taking advantage of AWS services when an internet connection is available.

TensorFlow on AWS
A deep learning framework available on AWS. It is a popular choice for deep learning research and application development, particularly in areas such as computer vision, natural language understanding, and speech translation.



DevOps and Operations management

AWS Identity and Access Management (IAM)
A service that lets you manage users and their level of access to the AWS console, and that integrates with many different AWS services. It provides centralized control and shared access to AWS accounts, granular permissions, identity federation, multifactor authentication, and a password rotation policy.

AWS Key Management Service
A managed service to create and control the encryption keys used to encrypt data. It uses FIPS 140-2 validated hardware security modules to protect the security of the keys.

AWS CloudFormation
A service that lets you quickly and easily model and provision infrastructure resources and applications in an automated and secure manner on AWS.

AWS CodeCommit
A fully managed source control service that makes it easy for clients to host secure and highly scalable private Git repositories.

AWS CodeDeploy
A service that automates software deployments to a variety of compute services, including Amazon EC2, AWS Lambda, and instances running on-premises.

AWS CloudTrail
A service that enables governance, compliance, and operational auditing. It lets you log, continuously monitor, and retain account activity related to your AWS infrastructure.

AWS CloudWatch
A monitoring service for AWS cloud resources and the applications that run on AWS. CloudWatch can be used to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources.

AWS Directory Service
AWS Directory Service for Microsoft AD enables your directory-aware workloads and AWS resources to use managed Active Directory in the AWS Cloud.
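
The "granular permissions" mentioned for IAM are expressed as JSON policy documents made of statements, each with an Effect, a list of Actions, and the Resources they apply to. The sketch below builds a read-only policy for a single S3 bucket as a Python dictionary. The helper name, the Sid labels, and the bucket name are hypothetical; the document version string and the two S3 actions are standard.

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build an IAM policy document granting read-only access to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                # Listing applies to the bucket itself.
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Reading applies to the objects inside the bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(read_only_bucket_policy("example-bucket"), indent=2))
```

Keeping list and read permissions in separate statements mirrors the fact that one targets the bucket and the other targets its objects.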
Hands On Setup

Set up using AWS Free Tier

Setting up a Free-Tier Account on Amazon Web Services (AWS)
• Follow the aws-sign-up link to navigate to the AWS Free Tier home page.
• Navigate to the section "Create a Free Account" displayed on the home page.
• Page 1: Enter the organizational email and choose a password as per the requirement displayed. Enter the account name of your choice.
• Page 2: Fill in the required details in the account creation fields.
• Pending/Skeptical, as both personal and business accounts require credit/debit card details.

AWS Free Tier Offerings
AWS provides three types of free offers depending upon the product used. Below are the details.
• Always Free: Offers do not expire and are available to all AWS customers. Please follow the link to get the list of all products which are available in this offering.
• 12 months Free: This offering includes products for 12 months from the sign-up date. See the list here.
• Trials: Short-term free trial offers start from the date of activating a particular service. More details are available here.



Setup Training Platforms
We will be leveraging the platforms below to access various learning materials in this playbook. You are required to create an account or sign in with your Deloitte ID on all platforms to access all the content free of cost.

Udemy
Visit Udemy for a learning content repository (that can be accessed through Cura) that provides technical courses on topics like Cloud, AI, Analytics, and Big Data.

LinkedIn
Visit LinkedIn for a learning content repository. Explore a variety of courses available via LinkedIn Learning.

Microsoft Learner Experience Portal (LxP)
Your one-stop access to a variety of learning choices: instructor-led training, guided self-paced learning through MS Learn, and access to Microsoft Certification exams.
Set Up Microsoft LxP:

Cura
Cura is Deloitte's new personalized learning platform. It uses machine learning to bring you the most relevant content based on your interests, skills, and development needs.
Use Cura to:
• Reskill or upskill quickly through continuous learning opportunities
• Find just-in-time information on topics you need to learn more about
• Access Udemy, LinkedIn Learning and other training materials



AWS Learning Journeys
Data Engineer

BEGINNER
1. What is Big Data?
2. What is a Data Lake?
3. Cloud Computing on AWS
4. Data Lake in AWS
5. Learning the AWS Well-Architected Framework
6. Diving into AWS Web Services, AWS S3
7. AWS S3 Glacier Developers Guide
8. AWS EC2
9. Introduction to Python
10. SQL for Beginners
11. Diving into AWS Web Services
12. Basics of Machine Learning

INTERMEDIATE
1. ETL in AWS
2. Serverless ETL & BI on AWS
3. AWS Redshift
4. DynamoDB
5. AWS EMR
6. AWS Athena
7. AWS Aurora
8. Monitoring
9. AWS CloudWatch
10. Docker in AWS
11. Introduction to Kubernetes
12. Exploring Networking in AWS

ADVANCED
1. Mastering AWS Glue, Quicksight, Athena & Redshift Spectrum
2. AWS Lambda
3. What is Streaming Data?
4. AWS Kinesis
5. Database Migration Services
6. Database Design & Deployment

** Links embedded for each topic, click to explore


Solution Architect

BEGINNER/INTERMEDIATE
1. What is a Data Lake?
2. Cloud Computing on AWS
3. Networking in AWS
4. Security Fundamentals
5. Well Architected Framework: Security
6. Well Architected Framework: Reliability
7. Network and Storage Design
8. Virtual Machine Migration
9. DynamoDB Auto-scaling
10. CI CD in AWS tutorial

ADVANCED
1. Kubernetes on AWS
2. AWS Advanced Security
3. AWS Enterprise Security
4. AWS Disaster Recovery
5. AWS Code Pipeline
6. Professional Solution Architect
7. Hybrid Cloud with AWS
8. Setting up CI CD in AWS
9. Migrating Multi-tier environments using AWS Server Migration Service

** Links embedded for each topic, click to explore


Data Scientist

BEGINNER
1. What is a Data Lake?
2. Cloud Computing on AWS

INTERMEDIATE
1. Elements of Data Science
2. AWS Machine Learning Deep Dive
3. AWS Machine Learning Essentials
4. AWS Machine Learning by Example
5. AWS Sagemaker
6. AWS SageMaker Practical for Beginners

ADVANCED
1. Amazon AI Services
2. Codeguru
3. Amazon Polly
4. Amazon Transcribe
5. Amazon Lex
6. Amazon Comprehend
7. Amazon Personalize
8. Amazon Forecast
9. Amazon Kendra
10. Amazon Rekognition
11. Amazon Fraud Detector
12. Amazon Textract
13. Building Expense Tracker using AWS Textract
14. Amazon Translate
15. AWS Machine Learning

** Links embedded for each topic, click to explore


AWS Learning Plan
Data Engineer | Beginner
Days Title Duration(min) Portal
Day1 What is Big Data? 10 AWS
Day1 What is a Data Lake? 210 AWS
Day2 Cloud Computing on AWS 500 Udemy
Day2 Cloud Computing on AWS 500 Udemy
Day3 Data Lake in AWS 210 Udemy
Day3 Learning the AWS Well-Architected Framework 60 LinkedIn
Day4 Diving into AWS Web Services 100 LinkedIn
Day4 AWS S3 210 Udemy
Day5 AWS S3 210 Udemy
Day5 AWS S3 Glacier Developers Guide 10 AWS
Day6 AWS EC2 300 Udemy
Day7 AWS EC2 300 Udemy
Day7 Introduction to Python 190 Udemy
Day8 SQL for Beginners 450 Udemy
Day9 SQL for Beginners 450 Udemy
Day10 Diving into AWS Web Services 100 LinkedIn
Day10 Basics of Machine Learning 30 AWS



Data Engineer | Intermediate
Days Title Duration(min) Portal
Day11 ETL in AWS 270 LinkedIn
Day12 Serverless ETL & BI on AWS 390 Udemy
Day13 Serverless ETL & BI on AWS 390 Udemy
Day13 AWS Redshift 90 AWS
Day14 DynamoDB 90 LinkedIn
Day14 AWS EMR 30 AWS
Day14 AWS Athena 90 Udemy
Day15 AWS Aurora 60 AWS
Day15 Monitoring 90 LinkedIn
Day15 AWS CloudWatch 80 LinkedIn
Day16 Docker in AWS 480 Udemy
Day16 Docker in AWS 480 Udemy
Day17 Introduction to Kubernetes 330 Udemy
Day18 Introduction to Kubernetes 330 Udemy
Day19 Exploring Networking in AWS 450 Udemy
Day20 Exploring Networking in AWS 450 Udemy



Data Engineer | Advanced
Days Title Duration(min) Portal
Day21 Mastering AWS Glue, Quicksight, Athena & Redshift Spectrum 1200 Udemy
Day22 Mastering AWS Glue, Quicksight, Athena & Redshift Spectrum 1200 Udemy
Day23 Mastering AWS Glue, Quicksight, Athena & Redshift Spectrum 1200 Udemy
Day24 Mastering AWS Glue, Quicksight, Athena & Redshift Spectrum 1200 Udemy
Day25 Mastering AWS Glue, Quicksight, Athena & Redshift Spectrum 1200 Udemy
Day26 AWS Lambda 420 Udemy
Day27 AWS Lambda 420 Udemy
Day28 What is Streaming Data? 30 AWS
Day28 AWS Kinesis 150 LinkedIn
Day29 Database Migration Services 60 AWS
Day29 Database Design & Deployment 140 LinkedIn



Solution Architect | Beginner / Intermediate
Days Title Duration(min) Portal
Day1 What is a Data Lake? 210 Udemy
Day2 Cloud Computing on AWS 500 Udemy
Day2 Cloud Computing on AWS 500 Udemy
Day3 Networking in AWS 480 Udemy
Day3 Networking in AWS 480 Udemy
Day4 Security Fundamentals 170 LinkedIn
Day4 Well Architected Framework: Security 90 LinkedIn
Day5 Well Architected Framework: Reliability 90 LinkedIn
Day5 Well Architected Framework: Operational Excellence 90 LinkedIn
Day5 Well Architected Framework: Cost Optimization 60 LinkedIn
Day6 Network and Storage Design 150 LinkedIn
Day6 Virtual Machine Migration 90 LinkedIn
Day7 DynamoDB Auto-scaling 30 AWS
Day7 Serverless APIs and Apps 500 Udemy
Day8 Serverless APIs and Apps 500 Udemy
Day9 AWS Cloudfront (Reference - Caching) 43 Udemy
Day9 AWS VPC (Reference - VPC) 80 Udemy
Day9 CI CD in AWS tutorial 100 YouTube



Solution Architect | Advanced
Days Title Duration(min) Portal
Day10 Kubernetes on AWS 90 LinkedIn
Day10 AWS Advanced Security 300 Udemy
Day11 AWS Advanced Security 300 Udemy
Day11 AWS Enterprise Security 285 LinkedIn
Day12 AWS Enterprise Security 285 LinkedIn
Day12 AWS Disaster Recovery 120 LinkedIn
Day13 AWS Code Pipeline 330 Udemy
Day13 AWS Code Pipeline 330 Udemy
Day13 Professional Solution Architect 1920 Udemy
Day14 Professional Solution Architect 1920 Udemy
Day15 Professional Solution Architect 1920 Udemy
Day16 Professional Solution Architect 1920 Udemy
Day17 Professional Solution Architect 1920 Udemy
Day18 Professional Solution Architect 1920 Udemy
Day19 Professional Solution Architect 1920 Udemy
Day20 Professional Solution Architect 1920 Udemy
Day21 Hybrid Cloud with AWS 60 AWS
Day21 Hybrid Cloud with AWS 50 YouTube
Day21 Setting up CI CD in AWS 45 AWS
Day21 Migrating Multi-tier environments using AWS Server Migration Service 90 AWS



Data Scientist | Beginner
Days Title Duration(min) Portal
Day1 What is a Data Lake? 210 Udemy
Day2 Cloud Computing on AWS 500 Udemy
Day2 Cloud Computing on AWS 500 Udemy



Data Scientist | Intermediate
Days Title Duration(min) Portal
Day3 Elements of Data Science 480 AWS
Day4 Elements of Data Science 480 AWS
Day5 AWS Machine Learning Deep Dive 5 AWS
Day5 AWS Machine Learning by Example 90 LinkedIn
Day5 AWS Machine Learning Essentials 180 LinkedIn
Day6 AWS Machine Learning Essentials 180 LinkedIn
Day6 AWS Sagemaker 90 LinkedIn
Day8 AWS SageMaker Practical for Beginners 900 Udemy
Day9 AWS SageMaker Practical for Beginners 900 Udemy
Day10 AWS SageMaker Practical for Beginners 900 Udemy
Day11 AWS SageMaker Practical for Beginners 900 Udemy



Data Scientist | Advanced
Days Title Duration(min) Portal
Day12 Amazon AI Services 60 AWS
Day12 Codeguru 60 AWS
Day12 Amazon Polly 60 AWS
Day12 Amazon Transcribe 60 AWS
Day13 Amazon Lex 60 AWS
Day13 Amazon Comprehend 60 AWS
Day13 Amazon Personalize 60 AWS
Day13 Amazon Forecast 60 AWS
Day14 Amazon Kendra 60 AWS
Day14 Amazon Rekognition 60 AWS
Day14 Amazon Fraud Detector 60 AWS
Day14 Amazon Textract 60 AWS
Day15 Building Expense Tracker using AWS Textract 24 LinkedIn
Day15 Amazon Translate 60 AWS
Day15 AWS Machine Learning 600 Udemy
Day16 AWS Machine Learning 600 Udemy
Day17 AWS Machine Learning 600 Udemy



AWS Certifications
Role Based Certification Program



Certify by Level

BEGINNER
1. AWS Certified Cloud Practitioner – 90 min - 7600
2. AWS Certified Solutions Architect – Associate – 130 min - 11400
3. AWS Certified Developer – Associate – 130 min - 11400

ADVANCED
1. AWS Certified Solutions Architect – Professional – 180 min - 22800
2. AWS Certified DevOps Engineer – Professional – 180 min - 22800

SPECIALIZATION
1. AWS Certified Data Analytics – Specialty – 170 min - 22800
2. AWS Certified Database – Specialty – 170 min - 22800
3. Machine Learning – Specialty – 170 min - 22800
4. Advanced Networking – Specialty – 170 min - 22800
5. Security – Specialty – 170 min - 22800
6. Alexa Skill Builder – Specialty – 170 min - 22800

Note: The above costs for certifications are as of Feb 2021. Prices are subject to change.
** Links embedded for each topic, click to explore
Assignments
Introduction
• This section contains 4 assignments
• Each assignment consists of one or more labs, and each lab has a specific result which can be used in the subsequent labs/assignments
• Please do not skip any lab, as each lab is equally important
• An assignment will be marked complete if all the labs under it are complete
• Please attempt the labs and assignments in the sequential order in which they are defined
• The assignments use two datasets:
• Flight_Weather Dataset: A csv file which contains flight data for the years 2011 & 2012
• Ecommerce Sales Dataset: A csv file which contains the ecommerce sales data

Location:
• Flight Weather dataset: Flight_weather.csv
• Ecommerce sales dataset: Ecommerce_sales.csv



Assignment 1: ‘Hello World in AWS’

Lab1: Set up an AWS EC2 instance and execute the helloworld.py Python script on EC2. Use the below information to set up the VM:
▪ Choose an instance type in the free tier category
▪ Select the number of instances as 1

Lab2: Set up AWS CLI and execute the same helloworld.py script from your local system on AWS using AWS CLI

Lab3: Set up AWS S3 storage. Upload the data file to an AWS S3 bucket using the UI and AWS CLI

Lab4: Set up and connect to AWS RDS, access the data file (flights data) uploaded in the above step and calculate the following:
1. No of flights departed from NY in 2011
2. Which state has the maximum traffic in 2011
3. Top 5 maximum duration flights in 2012
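
The three calculations above can be prototyped locally before running them on RDS. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the RDS database. The table name `flights`, the column names (`year`, `origin_state`, `duration_min`) and the sample rows are assumptions made for illustration; the actual columns of the flights file may differ, and "maximum traffic" is interpreted here as the highest number of departures.

```python
import sqlite3

# Hypothetical schema and sample rows; the real flights file may use other columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (year INTEGER, origin_state TEXT, duration_min INTEGER)"
)
sample_rows = [
    (2011, "NY", 120),
    (2011, "NY", 95),
    (2011, "CA", 300),
    (2011, "CA", 310),
    (2011, "CA", 45),
    (2012, "NY", 400),
    (2012, "TX", 180),
    (2012, "CA", 365),
]
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", sample_rows)

# 1. Number of flights that departed from NY in 2011
ny_2011 = conn.execute(
    "SELECT COUNT(*) FROM flights WHERE year = 2011 AND origin_state = 'NY'"
).fetchone()[0]

# 2. State with the maximum traffic in 2011 (assumed to mean most departures)
busiest_state = conn.execute(
    "SELECT origin_state FROM flights WHERE year = 2011 "
    "GROUP BY origin_state ORDER BY COUNT(*) DESC LIMIT 1"
).fetchone()[0]

# 3. Top 5 maximum duration flights in 2012
top5_2012 = [
    row[0]
    for row in conn.execute(
        "SELECT duration_min FROM flights WHERE year = 2012 "
        "ORDER BY duration_min DESC LIMIT 5"
    )
]

print(ny_2011, busiest_state, top5_2012)
```

The same SELECT statements should carry over to an RDS engine such as MySQL or PostgreSQL with little or no change, once the table and column names are adjusted to the real schema.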



Assignment 2: Working on AWS EMR

In this assignment, we will set up an EMR cluster, write PySpark code to process the flight_data.csv file in S3 and execute the code on the EMR cluster.

Lab1: Set up an AWS EMR cluster using the AWS UI. Give the cluster a role with S3 access
Lab2: Set up an SSH connection to the cluster from your local system
Lab3: Write a PySpark script to read the flight_data.csv file from S3 and calculate the following:
1. No of flights departed from NY in 2011
2. Which state has the maximum traffic in 2011
3. Top 5 maximum duration flights in 2012
Lab4: Execute the PySpark script on AWS EMR using SSH.



Assignment 3: AWS Redshift with S3 and DBeaver

In this exercise, we will set up an AWS Redshift cluster and copy data from an S3 bucket to Redshift. We will then use DBeaver to query the data in Redshift and do the calculations.

Lab1: Create an AWS Redshift cluster
• Choose the smallest instance possible
• Create a new IAM role with AmazonS3ReadOnlyAccess
Lab2: Set up DBeaver on your local system and establish a connection with your AWS Redshift cluster
Lab3: Create a table in Redshift through DBeaver, using the flights_data.csv file as the reference for the table schema
Lab4: Copy the data from the S3 bucket to the Amazon Redshift cluster
Lab5: Query the data in Redshift using DBeaver and calculate the following:
1. No of flights departed from NY in 2011
2. Which state has the maximum traffic in 2011
3. Top 5 maximum duration flights in 2012
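
For Lab4, the load from S3 is typically done with the Redshift COPY command, run from DBeaver. The sketch below only assembles such a statement as a string so that its parts are easy to see. The table name, bucket name and IAM role ARN are placeholders invented for illustration, and the options assume a comma-separated file with a single header row.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement for a CSV file that has one header row."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CSV\n"
        "IGNOREHEADER 1;"
    )


# Placeholder values: replace with your own table, bucket and role ARN.
sql = build_copy_statement(
    table="flights",
    s3_uri="s3://example-bucket/flights_data.csv",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftS3ReadRole",
)
print(sql)
```

The role referenced in IAM_ROLE would be the one created in Lab1 with AmazonS3ReadOnlyAccess, and it has to be attached to the cluster for the COPY to succeed.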



Case studies
Case Study 1: BRING YOUR OWN DATA LABS (BYOD)

Bring Your Own Data (BYOD) labs help you build a serverless data pipeline based on your own data.
Starting from a sample of your data saved in Amazon S3, you go through an intensive workshop that focuses on transforming, analyzing and visualizing your data.
At the end, you will have a POC which you can continue to evolve into a more complex data pipeline and use to derive more insights.
We will be leveraging:
• AWS Glue for the data catalogue and to run ETL on the data lake
• Amazon Athena to query the data lake
• Amazon QuickSight for data visualization

Links: Bring Your Own Data Labs (BYOD) (workshop.aws); GitHub - aws-samples/bring-your-own-data-labs: Bring your own data Labs: Build a serverless data pipeline based on your own data



Pre-requisites
1. Choose your preferred dataset (in csv format). Websites like https://www.kaggle.com/ are a good source.
2. Your dataset size should not exceed 2 – 3 GB max.
3. Data row size should not exceed 1 MB.
4. Data with multiple related tables via foreign keys is supported in the context of this workshop.
5. Data with nested fields like JSON structures is NOT supported in the context of this workshop.
6. Structure your data in Amazon S3 so that each table is in a separate folder, with the whole data in a separate bucket.
7. Before uploading your data files to Amazon S3, make sure the files are in UTF-8 encoding format.
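
Prerequisites 3 and 7 can be checked locally before uploading. The following is a minimal sketch of such a check, not part of the workshop material: the function name `check_csv` and the sample file are made up for illustration, and each physical line is treated as one row, which is only an approximation if quoted fields contain line breaks.

```python
import os
import tempfile

MAX_ROW_BYTES = 1024 * 1024  # 1 MB row limit from the prerequisites


def check_csv(path: str) -> dict:
    """Report whether a file decodes as UTF-8 and the size of its largest line."""
    is_utf8 = True
    max_row = 0
    with open(path, "rb") as handle:
        for raw_line in handle:
            # Line length in bytes, including the trailing newline
            max_row = max(max_row, len(raw_line))
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError:
                is_utf8 = False
    return {
        "utf8": is_utf8,
        "max_row_bytes": max_row,
        "rows_within_limit": max_row <= MAX_ROW_BYTES,
    }


# Usage on a small throwaway file
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("id,city\n1,Zürich\n".encode("utf-8"))
    sample_path = tmp.name

report = check_csv(sample_path)
os.remove(sample_path)
print(report)
```

Reading the file in binary mode keeps the byte counts exact and avoids loading the whole dataset into memory, which matters for files close to the 2 – 3 GB limit.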



Data Preparation with AWS Glue DataBrew
• Set up AWS Glue DataBrew and create a new project.
• Run and view a data profiling job.
• Clean and transform the data.

For further details, refer to this link



Data Ingestion with AWS Glue
1. Configure the permissions for the resources that we are going to use.
2. Create a data catalog from the raw files with AWS Glue Crawler.
3. Transform the raw files into Apache Parquet format (https://parquet.apache.org/) using AWS Glue jobs.
4. Create a data catalog from the curated files to be used by Athena.
5. EXTRA: Run a local container for a Jupyter PySpark environment for local testing.
6. OPTIONAL: Create a Development Endpoint.

For further details, refer to this link.



Orchestrating The Data Pipeline with AWS Glue
Convert the raw data you extracted into a curated data set by using AWS Glue Workflows. Follow these steps:
1. Orchestrate the data pipeline using the Workflow feature
◦ Create your workflow
◦ Add crawlers
◦ Add Transform Job
◦ Add crawler for the curated layer
2. Review the results

For further details, refer to this link


Interactive Querying with Amazon Athena
1. Set up Amazon S3 and Athena to store query results
2. Join tables
3. Store SQL join results
4. Review Athena best practices
5. Create an Amazon Athena database and table
6. Detect new partitions
7. Detect language and sentiment, and extract entities from text fields

For further details, refer to this link


Visualization with Amazon QuickSight
1. Sign up for Amazon QuickSight Enterprise Edition here
2. Review QuickSight definitions
3. Configure Amazon QuickSight to use Amazon Athena as a data source
4. Prepare your data
5. Visualize data using Amazon QuickSight

For further details, refer to this link


Case Study 2: NYC Taxi and Limousine Commission (TLC)

Company Introduction: The New York City Taxi and Limousine Commission (NYC TLC) is an agency of the New York City government that licenses and regulates the medallion taxis and for-hire vehicle industries, including app-based companies. The TLC's regulatory landscape includes medallion (yellow) taxicabs, green or Boro taxicabs, black cars (including both traditional and app-based services), community-based livery cars, commuter vans, paratransit vehicles (ambulettes), and some luxury limousines.

Problem Statement: The purpose of this case study is to get a better understanding of the taxi system so that the city of New York can improve the efficiency of in-city commutes. The analysis covers the number of trips taken, trips per hour, mode of payment, and average distance travelled per hour.



Step 1: Download Source Files
Access the data from the link below:
https://azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/
The yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

Step 2: Set up AWS DynamoDB
• Insert the data from the files downloaded above into AWS DynamoDB
• Run queries on the AWS portal for the below:
• Total trips taken per month
• Total trips taken per hour
• Average speed of Yellow Taxis per hour of trips
• Average distance travelled by Yellow Taxis per hour
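
Before running these on DynamoDB, the four metrics can be worked out on a handful of records to confirm what each one means. The following is a minimal sketch in plain Python: the four sample trips are invented, the tuple layout (pick-up time, drop-off time, trip distance in miles) is an assumption modelled on the fields described in Step 1, and "per hour" is interpreted as grouping by the hour of day of the pick-up.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample: (pick-up time, drop-off time, trip distance in miles)
trips = [
    ("2019-01-05 08:10:00", "2019-01-05 08:40:00", 6.0),
    ("2019-01-05 08:30:00", "2019-01-05 08:45:00", 2.0),
    ("2019-01-12 17:00:00", "2019-01-12 18:00:00", 10.0),
    ("2019-02-01 17:15:00", "2019-02-01 17:45:00", 5.0),
]
FMT = "%Y-%m-%d %H:%M:%S"

trips_per_month = defaultdict(int)
trips_per_hour = defaultdict(int)
distance_by_hour = defaultdict(list)
speed_by_hour = defaultdict(list)

for pickup_text, dropoff_text, miles in trips:
    pickup = datetime.strptime(pickup_text, FMT)
    dropoff = datetime.strptime(dropoff_text, FMT)
    hours = (dropoff - pickup).total_seconds() / 3600
    trips_per_month[pickup.strftime("%Y-%m")] += 1
    trips_per_hour[pickup.hour] += 1
    distance_by_hour[pickup.hour].append(miles)
    if hours > 0:  # skip zero-length trips to avoid division by zero
        speed_by_hour[pickup.hour].append(miles / hours)

# Averages keyed by pick-up hour of day
avg_distance_per_hour = {h: sum(v) / len(v) for h, v in distance_by_hour.items()}
avg_speed_per_hour = {h: sum(v) / len(v) for h, v in speed_by_hour.items()}

print(dict(trips_per_month), dict(trips_per_hour))
print(avg_distance_per_hour, avg_speed_per_hour)
```

Average speed is computed here per trip (distance divided by trip duration) and then averaged within each hour; dividing total distance by total duration per hour is an equally defensible definition and gives different numbers on real data.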
Step 3: Access DynamoDB through Redshift
• Connect to AWS DynamoDB through Redshift
• Analyze the data on Redshift
• Identify the day of the week with the highest total trip duration.
• Identify the time range in which the maximum number of trips are taken.
• Which is the most preferred mode of payment?
• Do multiple travelers tip more compared to solo travelers?

Note: Pick a month and year of your choice to answer the above questions from the dataset.
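
The last two questions reduce to a frequency count and a comparison of two averages. The following is a minimal sketch on invented records: the tuple layout (passenger count, payment type, tip amount) and the sample values are assumptions for illustration, and on Redshift the same logic would be expressed with GROUP BY and AVG.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample: (passenger_count, payment_type, tip_amount)
trips = [
    (1, "Credit card", 2.0),
    (1, "Cash", 0.0),
    (1, "Credit card", 3.0),
    (2, "Credit card", 4.0),
    (3, "Cash", 0.0),
    (2, "Credit card", 5.0),
]

# Most preferred mode of payment: the payment type with the most trips
payment_counts = Counter(payment for _, payment, _ in trips)
most_preferred_payment = payment_counts.most_common(1)[0][0]

# Compare the average tip of solo trips with trips carrying several passengers
solo_tips = [tip for passengers, _, tip in trips if passengers == 1]
group_tips = [tip for passengers, _, tip in trips if passengers > 1]
avg_solo_tip = mean(solo_tips)
avg_group_tip = mean(group_tips)
groups_tip_more = avg_group_tip > avg_solo_tip

print(most_preferred_payment, avg_solo_tip, avg_group_tip, groups_tip_more)
```

When interpreting the tip comparison on the real data, check how tips are recorded for each payment type, since trips paid in cash may show a zero tip regardless of what the passenger actually gave.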



Thank You
