Internship 1
A report submitted in partial fulfillment of the requirements for the skill-oriented course
BACHELOR OF TECHNOLOGY
IN
Name: V BEULAH
Roll No: 21811A05E1
2024-2025
AVANTHI INSTITUTE OF ENGINEERING & TECHNOLOGY
(Accredited by NAAC A+, Approved by AICTE and Permanently Affiliated to JNTU-GV, Vizianagaram, AP)
This paper concludes with scenarios that showcase the analytics options in use, as well as additional resources for getting started with big data analytics on AWS.
Objectives:
1. Data Integration and Centralization:
•Unify diverse datasets from educational institutions, encompassing student records, faculty information, academic performance, and administrative data.
3. Scalable Infrastructure:
•Design and deploy a scalable data infrastructure on AWS, ensuring it can adapt to the
growing data volumes from an expanding network of technical institutions.
CONTENTS
Abstract
List of Figures
Week 1: Overview of AWS Academy Data Engineering
Conclusion
References
COURSE MODULES
Course objectives:
This course prepares you to do the following:
•Summarize the role and value of data science in a data-driven organization.
•Recognize how the elements of data influence decisions about the infrastructure of a data pipeline.
•Illustrate a data pipeline by using AWS services to meet a generalized use case.
•Identify the risks and approaches to secure and govern data at each step and each transition of the data pipeline.
•Identify scaling considerations and best practices for building pipelines that handle large-scale datasets.
•Design and build a data collection process while considering constraints such as scalability, cost, fault tolerance, and latency.
CodeWhisperer code generation offers many benefits for software development organizations. It accelerates application development for faster delivery of software solutions. By automating repetitive tasks, it optimizes the use of developer time, so developers can focus on more critical aspects of the project. Additionally, code generation helps mitigate security vulnerabilities, safeguarding the integrity of the codebase. CodeWhisperer also protects open source intellectual property by providing the open source reference tracker. CodeWhisperer enhances code quality and reliability, leading to robust and efficient applications, and it supports an efficient response to evolving software threats, keeping the codebase up to date with the latest security practices. CodeWhisperer has the potential to increase development speed, security, and the quality of software.
WEEK- 3:
DATA DRIVEN ORGANIZATIONS
Another key characteristic of deriving insights by using your data pipeline is that the process
will almost always be iterative. You have a hypothesis about what you expect to find in the
data, and you need to experiment and see where it takes you. You might develop your
hypothesis by using BI tools to do initial discovery and analysis of data that has already been
collected. You might iterate within a pipeline segment, or you might iterate across the entire
pipeline. For example, in this illustration, the initial iteration (number 1) yielded a result that
wasn't as well defined as desired. Therefore, the data scientist refined the model and
reprocessed the data to get a better result (number 2). After reviewing those results, they
determined that additional data could improve the detail available in their result, so an
additional data source was tapped and ingested through the pipeline to produce the desired
result (number 3). A pipeline often has iterations of storage and processing. For example, after
the external data is ingested into pipeline storage, iterative processing transforms the data into
different levels of refinement for different needs.
WEEK- 4:
THE ELEMENTS OF DATA, DESIGN PRINCIPLES & PATTERNS FOR DATA PIPELINES
The reality is that a modern architecture might include all of these elements. The key to a
modern data architecture is to apply the three-pronged strategy that you learned about earlier.
Modernize the technology that you are using. Unify your data sources to create a single source
of truth that can be accessed and used across the organization. And innovate to get higher
value analysis from the data that you have.
The architecture illustrates the following other AWS purpose-built services that integrate with Amazon S3 and map to each component that was described previously:
•Amazon Redshift is a fully managed data warehouse service.
•Amazon OpenSearch Service is a purpose-built data store and search engine that is
optimized for real-time analytics, including log analytics.
•Amazon EMR provides big data processing and simplifies some of the most complex
elements of setting up big data processing.
•Amazon Aurora provides a relational database engine that was built for the cloud.
•Amazon DynamoDB is a fully managed nonrelational database that is designed to run
high-performance applications.
•Amazon SageMaker is an AI/ML service that democratizes access to the ML process.
Data being ingested into the Amazon S3 data lake arrives at the landing zone, where it is first
cleaned and stored into the raw zone for permanent storage. Because data that is destined for
the data warehouse needs to be highly trusted and conformed to a schema, the data needs to
be processed further. Additional transformations would include applying the schema and partitioning (structuring), as well as other transformations that are required to make the data
conform to requirements that are established for the trusted zone. Finally, the processing layer
prepares the data for the curated zone by modeling and augmenting it to be joined with other
datasets (enrichment) and then stores the transformed, validated data in the curated layer.
Datasets from the curated layer are ready to be ingested into the data warehouse to make them
available for low-latency access or complex SQL querying.
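The zone-to-zone flow described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a hypothetical bucket named datalake-demo, CSV data, and the boto3 and pandas libraries; in practice these transformations would typically run in AWS Glue or Amazon EMR rather than in a single script.

# Minimal sketch of promoting a dataset through data lake zones in Amazon S3.
# Bucket, prefixes, file names, and column names are illustrative assumptions.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "datalake-demo"

def promote(src_key, dst_key, transform=None):
    """Read an object from one zone, optionally transform it, and write it to the next zone."""
    body = s3.get_object(Bucket=BUCKET, Key=src_key)["Body"].read()
    df = pd.read_csv(io.BytesIO(body))
    if transform:
        df = transform(df)
    out = io.BytesIO()
    df.to_csv(out, index=False)
    s3.put_object(Bucket=BUCKET, Key=dst_key, Body=out.getvalue())

# Landing -> raw: keep an unmodified copy for permanent storage.
promote("landing/tickets.csv", "raw/tickets.csv")

# Raw -> trusted: apply the schema (drop malformed rows) and write under a partition key.
promote("raw/tickets.csv", "trusted/date=2024-01-01/tickets.csv",
        transform=lambda df: df.dropna(subset=["customer_id"]))

# Trusted -> curated: enrich the validated data so it is ready for the warehouse.
promote("trusted/date=2024-01-01/tickets.csv", "curated/tickets_enriched.csv",
        transform=lambda df: df.assign(ingest_source="support_system"))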
Producers ingest records onto the stream. Producers are integrations that collect data from
a source and load it onto the stream. Consumers process records. Consumers read data from
the stream and perform their own processing on it. The stream itself provides a temporary but
durable storage layer for the streaming solution. In the pipeline depicted in this example,
Amazon CloudWatch Events is the producer that puts CloudWatch Events event data onto
the stream. Kinesis Data Streams provides the storage. The data is then available to multiple
consumers.
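A producer and a consumer can be sketched with boto3 as follows. The stream name demo-events and the single-shard read are assumptions made for illustration; real consumers would more commonly use AWS Lambda or the Kinesis Client Library.

# Hedged sketch of a Kinesis Data Streams producer and consumer.
import json

import boto3

kinesis = boto3.client("kinesis")
STREAM = "demo-events"  # hypothetical stream name

# Producer: put one record onto the stream.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"source": "cloudwatch", "detail": "example event"}).encode("utf-8"),
    PartitionKey="event-1",
)

# Consumer: read records back from the first shard of the stream.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator)["Records"]:
    print(record["Data"])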
WEEK- 5:
SECURING & SCALING DATA PIPELINE
Data wrangling:
Transforming large amounts of unstructured or structured raw data from multiple sources with
different schemas into a meaningful set of data that has value for downstream processes or
users.
Data Structuring:
For the scenario that was described previously, the structuring step includes exporting a .json
file from the customer support ticket system, loading the .json file into Excel, and letting
Excel parse the file. For the mapping step for the supp2 data, the data engineer would modify
the cust num field to match the customer id field in the data warehouse.
For this example, you would perform additional data wrangling steps before compressing
the file for upload to the S3 bucket.
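The structuring, mapping, and compression steps described above can also be sketched with pandas instead of Excel. The file name, the column names (cust_num and customer_id), and the bucket name below are assumptions used only to show the shape of the work.

# Illustrative wrangling of the exported support-ticket data before upload to S3.
import boto3
import pandas as pd

# Structuring: parse the semi-structured .json export into a table.
tickets = pd.read_json("supp2_tickets.json")

# Mapping: rename the source field to match the data warehouse field.
tickets = tickets.rename(columns={"cust_num": "customer_id"})

# Compress the result and upload it to the landing zone of the S3 bucket.
tickets.to_csv("supp2_tickets.csv.gz", index=False, compression="gzip")
boto3.client("s3").upload_file(
    "supp2_tickets.csv.gz", "wrangling-demo-bucket", "landing/supp2_tickets.csv.gz"
)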
Data Cleaning:
WEEK- 6:
INGESTING BY BATCH OR BY STREAM
5.1 Comparing batch and stream ingestion:
Use Amazon AppFlow to ingest data from a software as a service (SaaS) application. You can do the following with Amazon AppFlow:
•Create a connector that reads from a SaaS source and includes filters.
•Map fields in each source object to fields in the destination and perform transformations.
•Perform validation on records to be transferred.
•Securely transfer to Amazon S3 or Amazon Redshift.
You can trigger an ingestion on demand, on event, or on a schedule. An example use case for Amazon AppFlow is to ingest customer support ticket data from the Zendesk SaaS product, as sketched below.
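A minimal sketch of the on-demand trigger with boto3 follows; the flow name zendesk-tickets-to-s3 is hypothetical, and the flow itself (connector, filters, field mappings, validations, and S3 destination) is assumed to have been configured beforehand.

# Trigger an on-demand run of a preconfigured Amazon AppFlow flow.
import boto3

appflow = boto3.client("appflow")
response = appflow.start_flow(flowName="zendesk-tickets-to-s3")
print("Started flow execution:", response.get("executionId"))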
WEEK- 7:
STORING AND ORGANIZING DATA:
7.1 Storage in the modern data architecture:
Data in cloud object storage is handled as objects. Each object is assigned a key, which is a
unique identifier. When the key is paired with metadata that is attached to the objects, other
AWS services can use the information to unlock a multitude of capabilities. Thanks to
economies of scale, cloud object storage comes at a lower cost than traditional storage.
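The pairing of a unique key with user-defined metadata can be illustrated with boto3; the bucket name, key, and metadata values below are assumptions.

# Store an object under a unique key with metadata, then read the metadata back.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="object-storage-demo",
    Key="sales/2024/06/orders.csv",  # the unique key for this object
    Body=b"order_id,amount\n1001,250\n",
    Metadata={"source-system": "pos", "schema-version": "2"},
)

# Other services and jobs can inspect the metadata without downloading the object body.
head = s3.head_object(Bucket="object-storage-demo", Key="sales/2024/06/orders.csv")
print(head["Metadata"])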
Apache Hadoop:
Apache Spark:
ML Concepts:
ML Life Cycle:
Collecting Data:
Data characteristics:
•How much data is there?
•At what speed and volume does it arrive?
•How frequently is it updated?
•How quickly is it processed?
•What type of data is it?
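For a tabular batch, several of these questions can be answered quickly with pandas. This is only a sketch, assuming a hypothetical CSV file with an event_time column.

# Inspect basic characteristics of an incoming dataset.
import pandas as pd

df = pd.read_csv("incoming_batch.csv", parse_dates=["event_time"])

print("How much data:", df.shape, df.memory_usage(deep=True).sum(), "bytes in memory")
print("What type of data:", df.dtypes.to_dict())

# Approximate arrival speed from the timestamps in this batch.
span_hours = (df["event_time"].max() - df["event_time"].min()).total_seconds() / 3600
print("Approximate records per hour:", len(df) / max(span_hours, 1))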
WEEK- 10:
AUTOMATING THE PIPELINE
Automating Infrastructure deployment:
If you build infrastructure with code, you gain the benefits of repeatability and reusability
while you build your environments. In the example shown, a single template is used to deploy
Network Load Balancers and Auto Scaling groups that contain Amazon Elastic Compute
Cloud (Amazon EC2) instances. Network Load Balancers distribute traffic evenly across
targets.
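Repeatability can be illustrated by deploying the same template more than once with boto3 and AWS CloudFormation. The template URL, stack names, and parameter below are placeholders; the template itself would declare the Network Load Balancer, Auto Scaling group, and Amazon EC2 resources described above.

# Deploy one infrastructure template into several environments.
import boto3

cloudformation = boto3.client("cloudformation")

for env in ("dev", "test", "prod"):
    cloudformation.create_stack(
        StackName=f"web-tier-{env}",
        TemplateURL="https://example-bucket.s3.amazonaws.com/web-tier-template.yaml",
        Parameters=[{"ParameterKey": "Environment", "ParameterValue": env}],
        Capabilities=["CAPABILITY_NAMED_IAM"],  # needed if the template creates named IAM roles
    )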
CI/CD:
CI/CD can be pictured as a pipeline, where new code is submitted on one end, tested over a
series of stages (source, build, test, staging, and production), and then published as
production-ready code.
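The stage-by-stage view of such a pipeline can be inspected programmatically; the sketch below assumes an existing AWS CodePipeline pipeline named data-platform-pipeline.

# Print the current status of each stage in a CI/CD pipeline.
import boto3

codepipeline = boto3.client("codepipeline")

state = codepipeline.get_pipeline_state(name="data-platform-pipeline")
for stage in state["stageStates"]:
    latest = stage.get("latestExecution", {})
    print(stage["stageName"], latest.get("status"))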
With Step Functions, you can use visual workflows to coordinate the components of
distributed applications and microservices.
You define a workflow, which is also referred to as a state machine, as a series of
steps and transitions between each step.
Step Functions is integrated with Athena to facilitate building workflows that include
Athena queries and data processing operations.
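A state machine with a single Athena step might look like the following sketch. The query, workgroup, output location, and IAM role ARN are placeholders, and error handling and retries are omitted.

# Create a minimal Step Functions state machine that runs one Athena query.
import json

import boto3

definition = {
    "StartAt": "RunAthenaQuery",
    "States": {
        "RunAthenaQuery": {
            "Type": "Task",
            # Optimized Step Functions integration for Athena; .sync waits for the query to finish.
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
                "QueryString": "SELECT COUNT(*) FROM curated.tickets",
                "WorkGroup": "primary",
                "ResultConfiguration": {"OutputLocation": "s3://query-results-demo/"},
            },
            "End": True,
        }
    },
}

stepfunctions = boto3.client("stepfunctions")
stepfunctions.create_state_machine(
    name="athena-query-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsAthenaRole",  # placeholder role
)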
CONCLUSION
Data engineering is a critical component of the modern data landscape, playing a central
role in the success of data-driven decision-making and analytics. As we draw conclusions
about data engineering, several key points come to the forefront: