Module 1

Module 1
Microsoft Azure Data Fundamentals: Explore

core data concepts
Over the last few decades, the amount of data generated by systems, applications, and devices has increased
significantly. Data is everywhere, in a multitude of structures and formats.
Identify data formats

Data is a collection of facts such as numbers, descriptions, and observations used to record
information. Data structures in which this data is organized often represents entities that are
important to an organization (such as customers, products, sales orders, and so on). Each entity
typically has one or more attributes, or characteristics (for example, a customer might have a name, an
address, a phone number, and so on).
You can classify data as structured, semi-structured, or unstructured.
Structured data
Structured data is data that adheres to a fixed schema, so all of the data has the same fields or
properties. Most commonly, the schema for structured data entities is tabular - in other words, the
data is represented in one or more tables that consist of rows to represent each instance of a data
entity, and columns to represent attributes of the entity.
Semi-structured data
One common format for semi-structured data is JavaScript Object Notation (JSON). The example below shows a
pair of JSON documents that represent customer information.
Unstructured data
Not all data is structured or even semi-structured. For example, documents, images, audio and video
data, and binary files might not have a specific structure. This kind of data is referred to
as unstructured data.
Data stores
Organizations typically store data in structured, semi-structured, or unstructured format to record

details of entities (for example, customers and products), specific events (such as sales transactions),
or other information in documents, images, and other formats. The stored data can then be retrieved
for analysis and reporting later.
There are two broad categories of data store in common use:
 File stores
 Databases
Explore file storage

The ability to store data in files is a core element of any computing system. Files can be stored in local
file systems on the hard disk of your personal computer, and on removable media such as USB drives;
but in most organizations, important data files are stored centrally in some kind of shared file storage
system. Increasingly, that central storage location is hosted in the cloud, enabling cost-effective,
secure, and reliable storage for large volumes of data.
The specific file format used to store data depends on a number of factors, including:
 The type of data being stored (structured, semi-structured, or unstructured).

 The applications and services that will need to read, write, and process the data.
 The need for the data files to be readable by humans, or optimized for efficient storage and
processing.
Some common file formats are discussed below.
Delimited text files
Data is often stored in plain text format with specific field delimiters and row terminators. The most
common format for delimited data is comma-separated values (CSV) in which fields are separated by
commas, and rows are terminated by a carriage return / new line.
JavaScript Object Notation (JSON)
JSON is a ubiquitous format in which a hierarchical document schema is used to define data entities
(objects) that have multiple attributes. Each attribute might be an object (or a collection of objects);
making JSON a flexible format that's good for both structured and semi-structured data.
Extensible Markup Language (XML)
XML is a human-readable data format that was popular in the 1990s and 2000s. It's largely been
superseded by the less verbose JSON format, but there are still some systems that use XML to
represent data. XML uses tags enclosed in angle-brackets (<../>) to define elements and attributes
Binary Large Object (BLOB)
Ultimately, all files are stored as binary data (1's and 0's), but in the human-readable formats
discussed above, the bytes of binary data are mapped to printable characters (typically through a
character encoding scheme such as ASCII or Unicode). Some file formats however, particularly for
unstructured data, store the data as raw binary that must be interpreted by applications and rendered.
Common types of data stored as binary include images, video, audio, and application-specific
documents.
Optimized file formats
While human-readable formats for structured and semi-structured data can be useful, they're typically
not optimized for storage space or processing. Over time, some specialized file formats that enable
compression, indexing, and efficient storage and processing have been developed.
Some common optimized file formats you might see include Avro, ORC, and Parquet:
 Avro is a row-based format. It was created by Apache. Each record contains a header that
describes the structure of the data in the record. This header is stored as JSON. The data is
stored as binary information. An application uses the information in the header to parse the
binary data and extract the fields it contains. Avro is a good format for compressing data and
minimizing storage and network bandwidth requirements.
 ORC (Optimized Row Columnar format) organizes data into columns rather than rows. It was
developed by HortonWorks for optimizing read and write operations in Apache Hive (Hive is a
data warehouse system that supports fast data summarization and querying over large datasets).
An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A
stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds
statistical information (count, sum, max, min, and so on) for each column.
 Parquet is another columnar data format. It was created by Cloudera and Twitter. A Parquet file
contains row groups. Data for each column is stored together in the same row group. Each row
group contains one or more chunks of data. A Parquet file includes metadata that describes the
set of rows found in each chunk. An application can use this metadata to quickly locate the
correct chunk for a given set of rows, and retrieve the data in the specified columns for these
rows. Parquet specializes in storing and processing nested data types efficiently. It supports very
efficient compression and encoding schemes.
Explore databases
A database is used to define a central system in which data can be stored and queried. In a simplistic sense, the
file system on which files are stored is a kind of database; but when we use the term in a professional data
context, we usually mean a dedicated system for managing data records rather than files.
Relational databases
Relational databases are commonly used to store and query structured data. The data is stored in
tables that represent entities, such as customers, products, or sales orders. Each instance of an entity
is assigned a primary key that uniquely identifies it; and these keys are used to reference the entity
instance in other tables.
Non-relational databases
Key-value databases in which each record consists of a unique key and an associated value, which can be in
any format.
Document databases, which are a specific form of key-value database in which the value is a JSON document
(which the system is optimized to parse and query)
Column family databases, which store tabular data comprising rows and columns, but you can
divide the columns into groups known as column-families. Each column family holds a set of columns
that are logically related together.
Graph databases, which store entities as nodes with links to define relationships between them.
Explore transactional data processing

A transactional data processing system is what most people consider the primary function of business
computing. A transactional system records transactions that encapsulate specific events that the
organization wants to track. A transaction could be financial, such as the movement of money
between accounts in a banking system, or it might be part of a retail system, tracking payments for
goods and services from customers. Think of a transaction as a small, discrete, unit of work.
Transactional systems are often high-volume, sometimes handling many millions of transactions in a
single day. The data being processed has to be accessible very quickly. The work performed by
transactional systems is often referred to as Online Transactional Processing (OLTP).
OLTP solutions rely on a database system in which data storage is optimized for both read and write
operations in order to support transactional workloads in which data records are created, retrieved,
updated, and deleted (often referred to as CRUD operations). These operations are applied
transactionally, in a way that ensures the integrity of the data stored in the database. To accomplish
this, OLTP systems enforce transactions that support so-called ACID semantics:
 Atomicity – each transaction is treated as a single unit, which succeeds completely or fails
completely. For example, a transaction that involved debiting funds from one account and
crediting the same amount to another account must complete both actions. If either action can't
be completed, then the other action must fail.
 Consistency – transactions can only take the data in the database from one valid state to
another. To continue the debit and credit example above, the completed state of the transaction
must reflect the transfer of funds from one account to the other.
 Isolation – concurrent transactions cannot interfere with one another, and must result in a
consistent database state. For example, while the transaction to transfer funds from one account
to another is in-process, another transaction that checks the balance of these accounts must
return consistent results - the balance-checking transaction can't retrieve a value for one
account that reflects the balance before the transfer, and a value for the other account that
reflects the balance after the transfer.
 Durability – when a transaction has been committed, it will remain committed. After the
account transfer transaction has completed, the revised account balances are persisted so that
even if the database system were to be switched off, the committed transaction would be
reflected when it is switched on again.
OLTP systems are typically used to support live applications that process business data - often
Explore analytical data processing

referred to as line of business (LOB) applications.
Analytical data processing typically uses read-only (or read-mostly) systems that store vast volumes of historical
data or business metrics. Analytics can be based on a snapshot of the data at a given point in time, or a series
of snapshots.
1. Operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis.
2. Data is loaded into a schema of tables - typically in a Spark-based data lakehouse with tabular
abstractions over files in the data lake, or a data warehouse with a fully relational SQL engine.
3. Data in the data warehouse may be aggregated and loaded into an online analytical processing
(OLAP) model, or cube. Aggregated numeric values (measures) from fact tables are calculated for
intersections of dimensions from dimension tables. For example, sales revenue might be totaled
by date, customer, and product.
4. The data in the data lake, data warehouse, and analytical model can be queried to produce
reports, visualizations, and dashboards.
Data lakes are common in large-scale data analytical processing scenarios, where a large volume of
file-based data must be collected and analyzed.
Data warehouses are an established way to store data in a relational schema that is optimized for read
operations – primarily queries to support reporting and data visualization. Data Lakehouses are a
more recent innovation that combine the flexible and scalable storage of a data lake with the
relational querying semantics of a data warehouse. The table schema may require some
denormalization of data in an OLTP data source (introducing some duplication to make queries
perform faster).
An OLAP model is an aggregated type of data storage that is optimized for analytical workloads. Data
aggregations are across dimensions at different levels, enabling you to drill up/down to view
aggregations at multiple hierarchical levels; for example to find total sales by region, by city, or for an
individual address. Because OLAP data is pre-aggregated, queries to return the summaries it contains
can be run quickly.
Different types of user might perform data analytical work at different stages of the overall
architecture. For example:
 Data scientists might work directly with data files in a data lake to explore and model data.
 Data Analysts might query tables directly in the data warehouse to produce complex reports and
visualizations.
 Business users might consume pre-aggregated data in an analytical model in the form of reports
or dashboards.
Check your knowledge

1. How is data in a relational table organized?
Rows and Columns
That's correct. Structured data is typically tabular data that is represented by rows and columns in a
database table.
Header and Footer
Pages and Paragraphs
2. Which of the following is an example of unstructured data?
An Employee table with columns EmployeeID, EmployeeName, and EmployeeDesignation
Audio and Video files
That's correct. Audio and video files are unstructured data.

A table within a relational database
3. What is a data warehouse?
A nonrelational database optimized for read and write operations
A relational database optimized for read operations
That's correct. A data warehouse is a relational database in which the schema is optimized for queries
that read data.
A storage location for unstructured data files
Explore job roles in the world of data
The three key job roles that deal with data in most organizations are:
 Database administrators manage databases, assigning permissions to users, storing backup

copies of data and restore data in the event of a failure.
 Data engineers manage infrastructure and processes for data integration across the
organization, applying data cleaning routines, identifying data governance rules, and
implementing pipelines to transfer and transform data between systems.
 Data analysts explore and analyze data to create visualizations and charts that enable
organizations to make informed decisions.
Explore job roles in the world of data

The three key job roles that deal with data in most organizations are:
 Database administrators manage databases, assigning permissions to users, storing backup

copies of data and restore data in the event of a failure.
 Data engineers manage infrastructure and processes for data integration across the
organization, applying data cleaning routines, identifying data governance rules, and
implementing pipelines to transfer and transform data between systems.
 Data analysts explore and analyze data to create visualizations and charts that enable
organizations to make informed decisions.
Identify data services

Microsoft Azure is a cloud platform that powers the applications and IT infrastructure for some of the
world's largest organizations. It includes many services to support cloud solutions, including
transactional and analytical data workloads.
Some of the most commonly used cloud services for data are described below.
Azure SQL
Azure SQL is the collective name for a family of relational database solutions based on the Microsoft
SQL Server database engine. Specific Azure SQL services include:
 Azure SQL Database – a fully managed platform-as-a-service (PaaS) database hosted in Azure
 Azure SQL Managed Instance – a hosted instance of SQL Server with automated maintenance,
which allows more flexible configuration than Azure SQL DB but with more administrative
responsibility for the owner.
 Azure SQL VM – a virtual machine with an installation of SQL Server, allowing maximum
configurability with full management responsibility.
Azure Database for open-source relational databases
Azure includes managed services for popular open-source relational database systems, including:
 Azure Database for MySQL - a simple-to-use open-source database management system that
is commonly used in Linux, Apache, MySQL, and PHP (LAMP) stack apps.
 Azure Database for MariaDB - a newer database management system, created by the original
developers of MySQL. The database engine has since been rewritten and optimized to improve
performance. MariaDB offers compatibility with Oracle Database (another popular commercial
database management system).
 Azure Database for PostgreSQL - a hybrid relational-object database. You can store data in
relational tables, but a PostgreSQL database also enables you to store custom data types, with
their own non-relational properties.
As with Azure SQL database systems, open-source relational databases are managed by database
administrators to support transactional applications, and provide a data source for data engineers
building pipelines for analytical solutions and data analysts creating reports.
Azure Cosmos DB
Azure Cosmos DB is a global-scale non-relational (NoSQL) database system that supports

multiple application programming interfaces (APIs), enabling you to store and manage data as
JSON documents, key-value pairs, column-families, and graphs.
Azure Storage
Azure Storage is a core Azure service that enables you to store data in:
 Blob containers - scalable, cost-effective storage for binary files.

 File shares - network file shares such as you typically find in corporate networks.
 Tables - key-value storage for applications that need to read and write data values quickly.
Data engineers use Azure Storage to host data lakes - blob storage with a hierarchical namespace
that enables files to be organized in folders in a distributed file system.
Azure Data Factory
Azure Data Factory is used by data engineers to build extract, transform, and load (ETL) solutions that
populate analytical data stores with data from transactional systems across the organization.
Azure Synapse Analytics
Azure Synapse Analytics is a comprehensive, unified Platform-as-a-Service (PaaS) solution for data
analytics that provides a single service interface for multiple analytical capabilities, including:
 Pipelines - based on the same technology as Azure Data Factory.

 SQL - a highly scalable SQL database engine, optimized for data warehouse workloads.
 Apache Spark - an open-source distributed data processing system that supports multiple
programming languages and APIs, including Java, Scala, Python, and SQL.
 Azure Synapse Data Explorer - a high-performance data analytics solution that is optimized for
real-time querying of log and telemetry data using Kusto Query Language (KQL).
Azure Databricks
Azure Databricks is an Azure-integrated version of the popular Databricks platform, which combines
the Apache Spark data processing platform with SQL database semantics and an integrated
management interface to enable large-scale data analytics.
Data engineers can use existing Databricks and Spark skills to create analytical data stores in Azure
Databricks.
Data Analysts can use the native notebook support in Azure Databricks to query and visualize data in
an easy to use web-based interface.
Azure HDInsight
Azure HDInsight is an Azure service that provides Azure-hosted clusters for popular Apache open-
source big data processing technologies, including:
 Apache Spark - a distributed data processing system that supports multiple programming languages
and APIs, including Java, Scala, Python, and SQL.
 Apache Hadoop - a distributed system that uses MapReduce jobs to process large volumes of data
efficiently across multiple cluster nodes. MapReduce jobs can be written in Java or abstracted by
interfaces such as Apache Hive - a SQL-based API that runs on Hadoop.
 Apache HBase - an open-source system for large-scale NoSQL data storage and querying.
 Apache Kafka - a message broker for data stream processing.
Data engineers can use Azure HDInsight to support big data analytics workloads that depend on
multiple open-source technologies.
Azure Stream Analytics
Azure Stream Analytics is a real-time stream processing engine that captures a stream of data
from an input, applies a query to extract and manipulate data from the input stream, and
writes the results to an output for analysis or further processing.
Data engineers can incorporate Azure Stream Analytics into data analytics architectures that capture
streaming data for ingestion into an analytical data store or for real-time visualization.
Azure Data Explorer
Azure Data Explorer is a standalone service that offers the same high-performance querying of log
and telemetry data as the Azure Synapse Data Explorer runtime in Azure Synapse Analytics.
Data analysts can use Azure Data Explorer to query and analyze data that includes a timestamp
attribute, such as is typically found in log files and Internet-of-things (IoT) telemetry data.
Microsoft Purview
Microsoft Purview provides a solution for enterprise-wide data governance and discoverability. You
can use Microsoft Purview to create a map of your data and track data lineage across multiple data
sources and systems, enabling you to find trustworthy data for analysis and reporting.
Data engineers can use Microsoft Purview to enforce data governance across the enterprise and
ensure the integrity of data used to support analytical workloads.
Microsoft Fabric
Microsoft Fabric is a unified Software-as-a-Service (SaaS) analytics platform based on open and
governed lakehouse that includes functionality to support:
 Data ingestion and ETL

 Data lakehouse analytics
 Data warehouse analytics
 Data Science and machine learning
 Realtime analytics
 Data visualization
 Data governance and management
Check your knowledge
1. Which one of the following tasks is the responsibility of a database administrator?
Backing up and restoring databases
Correct. Database Administrators back up the database and restore it when data is lost or corrupted.
Creating dashboards and reports
Creating pipelines to process data in a data lake
2. Which role is most likely to use Azure Data Factory to define a data pipeline for an ETL process?
Database Administrator
Data Engineer
Correct. Data engineers create data pipelines.

Data Analyst
3. Which services would you use as a SaaS solution for data analytics?
Azure SQL Database
Azure Synapse Analytics
Microsoft Fabric
Correct. Microsoft Fabric is a SaaS platform for data analytics.

Module 1

Uploaded by

Module 1

Uploaded by

Module 1

Microsoft Azure Data Fundamentals: Explore

Identify data formats

You can classify data as structured, semi-structured, or unstructured.

Organizations typically store data in structured, semi-structured, or unstructured format to record

There are two broad categories of data store in common use:

Explore file storage

 The type of data being stored (structured, semi-structured, or unstructured).

Some common file formats are discussed below.

Delimited text files

JavaScript Object Notation (JSON)

Extensible Markup Language (XML)

Optimized file formats

Explore transactional data processing

Explore analytical data processing

Check your knowledge

Rows and Columns

Pages and Paragraphs

2. Which of the following is an example of unstructured data?

An Employee table with columns EmployeeID, EmployeeName, and EmployeeDesignation

Audio and Video files

That's correct. Audio and video files are unstructured data.

3. What is a data warehouse?

A nonrelational database optimized for read and write operations

A relational database optimized for read operations

 Database administrators manage databases, assigning permissions to users, storing backup

Explore job roles in the world of data

 Database administrators manage databases, assigning permissions to users, storing backup

Identify data services

Azure Cosmos DB is a global-scale non-relational (NoSQL) database system that supports

 Blob containers - scalable, cost-effective storage for binary files.

Azure Data Factory

 Pipelines - based on the same technology as Azure Data Factory.

Azure Data Explorer

 Data ingestion and ETL

Backing up and restoring databases

Creating pipelines to process data in a data lake

Correct. Data engineers create data pipelines.

Azure SQL Database

Azure Synapse Analytics

Correct. Microsoft Fabric is a SaaS platform for data analytics.

You might also like