Cheat Sheet DP900
A data lake is a centralized data repository for unstructured & semi-structured data
• A data lake is intended to store vast amounts of data
• Data lakes generally use objects (blobs) or files as their storage medium
Azure Data Lake Storage (Gen2)
• Azure Blob Storage that has been extended to support big data analytics workloads
• To access data efficiently, Data Lake Storage adds a hierarchical namespace to Azure Blob Storage
o ACLs, Throttling Management, Performance Optimizers
• You can access the data lake via (Blob) wasbs:// or (File system) abfs://
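The two URI schemes follow a fixed pattern, which the sketch below assembles. The account, container, and path values are hypothetical placeholders. The host suffixes `blob.core.windows.net` and `dfs.core.windows.net` are the usual public-cloud endpoints, and `abfss://` is the TLS-secured form of `abfs://`.

```python
def blob_uri(account: str, container: str, path: str) -> str:
    """Build a wasbs:// URI that goes through the Blob endpoint."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def adls_uri(account: str, filesystem: str, path: str, secure: bool = True) -> str:
    """Build an abfs:// or abfss:// URI that goes through the Data Lake (dfs) endpoint."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical account and container names, for illustration only.
print(blob_uri("mystorageacct", "raw", "sales/2024.csv"))
print(adls_uri("mystorageacct", "raw", "sales/2024.csv"))
```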
Azure Synapse Analytics – a data warehouse and unified analytics platform
• Has two underlying transformation engines: SQL Pools & Spark Pools
• Synapse SQL is T-SQL but designed to be distributed
o SQL Dedicated Pools – reserves compute for processing
o Serverless Endpoints – on-demand, no guarantee of performance
• Data is stored on Azure Data Lake Storage (Gen2)
• Operations are performed within Azure Synapse Studio
• PolyBase – enables your SQL Server instance to query external data with T-SQL (used to connect many relational data sources)
3. Account Storage
Azure Storage Accounts – an umbrella service for various forms of managed storage:
• Azure Tables
• Azure Blob Storage
• Azure Files
Azure Blob Storage – Object storage that is distributed across many machines
• Supports three blob types
o Block blobs – store text & binary data as blocks that can be managed individually, up to about 4.75 TiB (about 190.7 TiB with newer service versions)
o Append blobs – optimized for append operations, ideal for logging
o Page blobs – store random-access files up to 8 TiB in size
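The block blob size limit follows from the block count and the block size. The quick check below assumes a limit of 50,000 committed blocks per blob and a maximum block size of 100 MiB (older service versions) or 4000 MiB (newer ones). These figures are assumptions to verify against current Azure documentation.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4
MAX_BLOCKS = 50_000  # assumed maximum number of committed blocks per block blob


def max_block_blob_tib(block_size_mib: int) -> float:
    """Largest block blob, in TiB, for a given maximum block size in MiB."""
    return MAX_BLOCKS * block_size_mib * MIB / TIB


print(round(max_block_blob_tib(100), 2))   # older limit, usually quoted as roughly 4.75 TiB
print(round(max_block_blob_tib(4000), 2))  # newer limit, roughly 190.7 TiB
```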
Azure Storage Explorer – a standalone cross-platform app to access various storage formats within
Azure Storage accounts
4. Power BI
Business Intelligence (BI) – both a data-analysis strategy and technology for business info. Helps
organizations make data-driven decisions
Power BI Desktop – A desktop app to design interactive reports from various data sources; reports can be published to Power BI Service
Power BI Service – A web-app to view reports, and create interactive shareable dashboards by pinning
various dataset and report visualizations
Power BI Mobile – Mobile apps to view BI reports on the go
Power BI Report Builder – A Windows app to build pixel-perfect printable reports (used to build paginated reports)
Power BI Embedded – embed Power BI visualizations into web-apps
Interactive Reports – Reports in Power BI, drag visualizations, load data from many data sources (Both
in Desktop & Service)
Dashboards – Build shareable dashboards by pinning various Power BI visualizations (a single-page report designed for a screen). Available only in the Service
Visualizations – A visualization is a chart or graph that is backed by a dataset.
5. Relational Databases
Structured Query Language (SQL) – designed to access and maintain data for a relational database
management system (RDBMS)
Online Transaction Processing (OLTP) – frequent & short queries for transactional information (e.g. Databases)
Online Analytical Processing (OLAP) – complex queries on large databases to produce reports & analytics (e.g. Data Warehouses)
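The difference between the two workloads can be sketched with Python's built-in `sqlite3` module. The `orders` table and its rows are made up for illustration. SQLite stands in for a real OLTP database or data warehouse here, so only the query shapes carry over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# OLTP-style work: small, short transactions and point lookups by key.
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("alice", 20.0), ("bob", 35.5), ("alice", 12.5)],
    )
one_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# OLAP-style work: scan many rows and aggregate them for a report.
totals = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

print(one_order)
print(totals)
```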
MySQL – a pure relational database (RDBMS) that is easy to set up & use; the most popular open-source relational db.
Postgres – an object-relational db (ORDBMS) that is more advanced & well liked among developers
Read Replicas – a duplicate of your database kept in sync with the primary to reduce the read load on your primary database
Azure SQL – An umbrella service for different offerings of MS SQL database hosting services
• SQL VMs – for lift-and-shift when you want OS access & control, or you need to bring-your-own-license (BYOL) for Azure Hybrid Benefit
• SQL Managed Instance (Managed SQL) – for lift-and-shift when you want the broadest compatibility with SQL Server versions
▪ You can use Azure Arc to run this service on-premises
▪ Gives you many of the benefits of a fully managed database
• SQL Databases – Fully managed SQL databases
▪ Run as a single database
▪ Run as a database server (collection of databases)
▪ Run in an Elastic Pool (databases of different sizes residing on one server to save costs)
Connection Policy
• Three modes:
1. Default – chooses Proxy or Redirect depending on whether the client is within or outside the Azure Network
2. Proxy – outside the Azure network, proxied through a gateway
a. Connect on port 1433 when using Proxy mode through a gateway from outside the Azure Network
3. Redirect – redirected within the Azure Network
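The three modes can be sketched as a small decision function. The function and constant names are hypothetical. The port numbers (1433 for the gateway, the standard SQL Server port, plus 11000-11999 for redirected connections) are assumptions to check against the Azure SQL connectivity documentation.

```python
GATEWAY_PORT = 1433                   # standard SQL Server port, used by the gateway
REDIRECT_PORTS = range(11000, 12000)  # assumed range for redirected connections


def effective_mode(policy: str, client_in_azure: bool) -> str:
    """Resolve the connection policy to the mode a client actually gets."""
    policy = policy.lower()
    if policy == "default":
        return "redirect" if client_in_azure else "proxy"
    if policy in ("proxy", "redirect"):
        return policy
    raise ValueError(f"unknown connection policy: {policy!r}")


def ports_to_allow(mode: str) -> list:
    """Outbound (first, last) port ranges a client firewall would need to allow."""
    if mode == "proxy":
        return [(GATEWAY_PORT, GATEWAY_PORT)]
    # Redirect still contacts the gateway first, then talks to the node directly.
    return [(GATEWAY_PORT, GATEWAY_PORT),
            (REDIRECT_PORTS.start, REDIRECT_PORTS.stop - 1)]


print(effective_mode("Default", client_in_azure=False))
print(ports_to_allow(effective_mode("Default", client_in_azure=True)))
```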
6. T-SQL
Transact-SQL (T-SQL) is a set of programming extensions from Sybase & Microsoft that add several
features to the Structured Query Language (SQL).
For Microsoft SQL Server there are five groups of SQL Commands:
• Data Definition Language (DDL)
o Used to define the database schema
• Data Query Language (DQL)
o Used for performing queries on the data
• Data Manipulation Language (DML)
o Manipulation of data in the database
• Data Control Language (DCL)
o Rights, permissions and other controls of the database
• Transaction Control Language (TCL)
o Transactions within the database
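As a memory aid, the five groups can be mapped to typical leading keywords. This classifier is a simplification that looks only at the first word of a statement. The keyword lists are common examples, not an exhaustive or official grouping.

```python
COMMAND_GROUPS = {
    "DDL": {"CREATE", "ALTER", "DROP", "TRUNCATE"},
    "DQL": {"SELECT"},
    "DML": {"INSERT", "UPDATE", "DELETE", "MERGE"},
    "DCL": {"GRANT", "REVOKE", "DENY"},
    "TCL": {"BEGIN", "COMMIT", "ROLLBACK", "SAVE"},
}


def classify(statement: str) -> str:
    """Return the command group for a SQL statement, based on its first keyword."""
    words = statement.strip().split()
    if not words:
        raise ValueError("empty statement")
    keyword = words[0].upper()
    for group, keywords in COMMAND_GROUPS.items():
        if keyword in keywords:
            return group
    return "unknown"


for sql in ("CREATE TABLE t (id INT)", "select * from t",
            "GRANT SELECT ON t TO reader", "COMMIT"):
    print(sql, "->", classify(sql))
```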
7. Database Security
Azure Defender for SQL – a unified package of advanced SQL security capabilities, covering Vulnerability Assessment and Advanced Threat Protection
Server Firewall Rules – an internal firewall that resides on the db server; all connections to the db are rejected by default
Always Encrypted – a feature that encrypts columns in an Azure SQL Database or SQL Server
SQL DB Contributor – Manage SQL databases, but not access to them; can’t manage their security-related policies or their parent SQL servers
SQL Managed Instance Contributor – Manage SQL Managed Instances and required network configuration; can’t give access to others
SQL Security Manager – Manage the security-related policies of SQL servers and databases, but not access to them
SQL Server Contributor – Manage SQL servers and databases, but not access to them
Transparent Data Encryption (TDE) – encrypts data-at-rest for Microsoft Databases
Dynamic Data Masking – you can choose the db columns that will be masked (obscured) for specific users
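The idea behind masking can be sketched in a few lines: the stored value stays intact, and only what a non-privileged user sees is obscured. The function names, the user names, and the rule for short values are hypothetical choices for this illustration, not the behavior of the Azure SQL feature itself.

```python
def partial_mask(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Keep the first `prefix` and last `suffix` characters, hide the middle."""
    if len(value) <= prefix + suffix:
        return padding  # assumption: hide very short values completely
    tail = value[len(value) - suffix:] if suffix else ""
    return value[:prefix] + padding + tail


def read_email(stored: str, user: str, unmasked_users: set) -> str:
    """Return the real value to privileged users and a masked one to everyone else."""
    if user in unmasked_users:
        return stored
    return partial_mask(stored, 1, "XXXX", 4)


stored_email = "alice@example.com"
print(read_email(stored_email, "dba", {"dba"}))      # privileged user sees the real value
print(read_email(stored_email, "analyst", {"dba"}))  # other users see a masked value
```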
Azure Private Link – allows you to establish secure connections between Azure resources so traffic remains within the Azure Network
Cosmos DB – a fully managed NoSQL service that supports multiple NoSQL engines, called APIs
• Core SQL API (default) – a document database; you can use SQL to query documents
• Graph API – a graph db that you can query with Gremlin to traverse the nodes and edges
• MongoDB API – a MongoDB database (document db)
• Table API – Azure Tables Key/Value
Apache TinkerPop – an open-source framework that provides a vendor-agnostic way to talk to many graph databases
• Gremlin – Graph traversal language to traverse nodes & edges
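To make "traversing nodes and edges" concrete, here is a toy in-memory graph walked in plain Python. The vertices, edges, and helper names are invented for illustration. The Gremlin query in the comment is the rough equivalent of the first traversal.

```python
# A tiny property graph: vertices keyed by id, edges as (source, label, target).
vertices = {
    1: {"label": "person", "name": "alice"},
    2: {"label": "person", "name": "bob"},
    3: {"label": "person", "name": "carol"},
}
edges = [(1, "knows", 2), (2, "knows", 3), (1, "knows", 3)]


def has_name(name: str) -> list:
    """Find vertex ids by name, similar in spirit to has('name', ...)."""
    return [vid for vid, props in vertices.items() if props["name"] == name]


def out(vertex_ids: list, label: str) -> list:
    """Follow outgoing edges with the given label, similar in spirit to out(...)."""
    return [dst for src in vertex_ids
            for (s, edge_label, dst) in edges
            if s == src and edge_label == label]


def names(vertex_ids: list) -> list:
    return [vertices[vid]["name"] for vid in vertex_ids]


# Roughly: g.V().has('name', 'alice').out('knows').values('name')
friends = names(out(has_name("alice"), "knows"))
# Two hops: the people that alice's friends know.
friends_of_friends = names(out(out(has_name("alice"), "knows"), "knows"))
print(friends, friends_of_friends)
```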
9. Hadoop
Apache Hadoop – open-source framework for distributed processing of large data sets
• Hadoop Distributed File System (HDFS) – a resilient and redundant file storage distributed on
clusters of common hardware
• Hadoop MapReduce – lets you write apps that process multi-terabyte data in parallel on large clusters of common hardware
• HBase – a distributed, scalable, big data store
• YARN – manages resources, nodes, containers and performs scheduling
• Hive – used for generating reports using a SQL-like language
• Pig – a high-level scripting language to write complex data transformations
• Apache Spark – can run up to 100x faster in memory and up to 10x faster on disk than Hadoop MapReduce, supports ETL, streaming and ML flows
• Apache Kafka – a streaming pipeline and analytics service
• HDInsight – a managed service to run popular open-source analytics services. It is a fully managed Hadoop system
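The MapReduce model listed above can be illustrated with the classic word count, written here in plain Python on a single machine. On a real cluster the map and reduce phases run in parallel across nodes, and the function names below are just illustrative.

```python
from collections import defaultdict


def map_phase(line: str) -> list:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list) -> dict:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: list) -> dict:
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big clusters of data"]))
```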
10. Azure and Databricks
Apache Spark – an open-source unified analytics engine for big data and machine learning
• 100x faster in memory than Hadoop
• 10x faster on disk than Hadoop
• Perform ETL (batch), streaming and ML workloads
• The Apache Spark ecosystem is composed of:
o Spark Core – The underlying engine and API
o Spark SQL – Use SQL and a newer data structure called DataFrame to work with data
o Spark Streaming – ingest data from many streaming services
o GraphX – distributed graph-processing framework
o Machine Learning Library (MLlib) – a distributed machine-learning framework
o Resilient Distributed Dataset (RDD) – the core data abstraction, with a domain-specific language (DSL) style API to execute various parallel operations on an Apache Spark cluster.
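A key property of RDD operations is that transformations such as `map` and `filter` are lazy, while actions such as `collect` and `reduce` trigger the work. The toy class below imitates that behavior in plain Python. It is not PySpark, the class name is invented, and the partitions are processed sequentially rather than on a cluster.

```python
from functools import reduce as fold


class ToyRDD:
    """A single-machine imitation of lazy transformations over partitioned data."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one per "partition"
        self.ops = tuple(ops)         # recorded transformations, not yet executed

    def map(self, fn):
        return ToyRDD(self.partitions, self.ops + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.partitions, self.ops + (("filter", fn),))

    def _run(self, partition):
        items = iter(partition)
        for kind, fn in self.ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        """Action: execute the recorded transformations on every partition."""
        return [x for part in self.partitions for x in self._run(part)]

    def reduce(self, fn):
        """Action: combine all results into one value."""
        return fold(fn, self.collect())


rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)  # nothing runs yet
print(evens_squared.collect())
print(evens_squared.reduce(lambda a, b: a + b))
```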
Databricks is a software company specializing in providing fully managed Apache Spark clusters.
Azure Databricks is a partnership between Microsoft and Databricks to offer the Databricks platform
within the Azure Portal running on Azure compute services
Azure Data Factory is a managed service for ETL, ELT and data integration
• Create data-driven workflows for orchestrating data movement and transforming data at scale
• Build ELT pipelines visually without writing any code via a web-interface
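The difference between ETL and ELT is only where the transformation happens. The sketch below uses Python's built-in `sqlite3` as a stand-in target store. The table names and sample rows are made up, and a real Data Factory pipeline would be configured in the service rather than written like this.

```python
import sqlite3

# Raw source rows: amounts arrive as untidy text, one of them unusable.
raw_rows = [("alice", " 20.50 "), ("bob", "35"), ("carol", "n/a")]


def is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def etl(conn: sqlite3.Connection) -> None:
    """ETL: transform in the pipeline first, then load only clean rows."""
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
    clean = [(name, float(amount)) for name, amount in raw_rows if is_number(amount)]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", clean)


def elt(conn: sqlite3.Connection) -> None:
    """ELT: load the raw rows as-is, then transform inside the target with SQL."""
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw_rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT customer, CAST(TRIM(amount) AS REAL) AS amount "
        "FROM sales_raw WHERE TRIM(amount) GLOB '[0-9]*'"
    )


conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)
etl_rows = conn.execute("SELECT * FROM sales_etl ORDER BY customer").fetchall()
elt_rows = conn.execute("SELECT * FROM sales_elt ORDER BY customer").fetchall()
print(etl_rows, elt_rows)
```

Both routes end with the same clean table. ELT additionally keeps the raw data in the target store, which is why it is common with data lakes and warehouses.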
SQL Server Integration Services (SSIS) – a platform for building enterprise-level data integration and data transformation solutions
• A low-code tool for building ELT pipelines, very similar to Azure Data Factory but in existence 15 years earlier
• Integrates with Azure Data Factory
Azure Data Studio – an IDE similar to Visual Studio Code, that is cross-platform and works with SQL
and non-relational data, has many extensions.
SQL Server Management Studio (SSMS) – an IDE for managing any SQL infrastructure that only works on Windows. More mature than Data Studio.
SQL Server Data Tools (SSDT) – a Visual Studio extension to visually design and work with SQL databases