0% found this document useful (0 votes)
185 views8 pages

Microsoft Azure Data Fundamentals Explore Core Data Concepts

Download as docx, pdf, or txt
0% found this document useful (0 votes)
185 views8 pages

Microsoft Azure Data Fundamentals Explore Core Data Concepts

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1/ 8

Microsoft Azure Data Fundamentals: Explore core data concepts  

In this module you will learn how to:

 Identify common data formats


 Describe options for storing data in files
 Describe options for storing data in databases
 Describe characteristics of transactional data processing solutions
 Describe characteristics of analytical data processing solutions

Identify data formats


Data is a collection of facts such as numbers, descriptions, and observations used to
record information. Data structures in which this data is organized often
represents entities that are important to an organization (such as customers,
products, sales orders, and so on). Each entity typically has one or more attributes, or
characteristics (for example, a customer might have a name, an address, a phone
number, and so on).

You can classify data as structured, semi-structured, or unstructured.

Structured data

Structured data is data that adheres to a fixed schema, so all of the data has the
same fields or properties. Most commonly, the schema for structured data entities
is tabular - in other words, the data is represented in one or more tables that consist
of rows to represent each instance of a data entity, and columns to represent
attributes of the entity. For example, the following image shows tabular data
representations for Customer and Product entities.
Structured data is often stored in a database in which multiple tables can reference
one another by using key values in a relational model; which we'll explore in more
depth later.

Semi-structured data

Semi-structured data is information that has some structure, but which allows for
some variation between entity instances. For example, while most customers may
have an email address, some might have multiple email addresses, and some might
have none at all.

One common format for semi-structured data is JavaScript Object Notation (JSON).


The example below shows a pair of JSON documents that represent customer
information. Each customer document includes address and contact information, but
the specific fields vary between customers.

JSONCopy
// Customer 1
{
"firstName": "Joe",
"lastName": "Jones",
"address":
{
"streetAddress": "1 Main St.",
"city": "New York",
"state": "NY",
"postalCode": "10099"
},
"contact":
[
{
"type": "home",
"number": "555 123-1234"
},
{
"type": "email",
"address": "joe@litware.com"
}
]
}

// Customer 2
{
"firstName": "Samir",
"lastName": "Nadoy",
"address":
{
"streetAddress": "123 Elm Pl.",
"unit": "500",
"city": "Seattle",
"state": "WA",
"postalCode": "98999"
},
"contact":
[
{
"type": "email",
"address": "samir@northwind.com"
}
]
}

 Note

JSON is just one of many ways in which semi-structured data can be represented.
The point here is not to provide a detailed examination of JSON syntax, but rather to
illustrate the flexible nature of semi-structured data representations.

Unstructured data

Not all data is structured or even semi-structured. For example, documents, images,
audio and video data, and binary files might not have a specific structure. This kind of
data is referred to as unstructured data.
Data stores

Organizations typically store data in structured, semi-structured, or unstructured


format to record details of entities (for example, customers and products), specific
events (such as sales transactions), or other information in documents, images, and
other formats. The stored data can then be retrieved for analysis and reporting later.

There are two broad categories of data store in common use:

 File stores
 Databases

We'll explore both of these types of data store in subsequent topics.


Explore file storage
The ability to store data in files is a core element of any computing system. Files can
be stored in local file systems on the hard disk of your personal computer, and on
removable media such as USB drives; but in most organizations, important data files
are stored centrally in some kind of shared file storage system. Increasingly, that
central storage location is hosted in the cloud, enabling cost-effective, secure, and
reliable storage for large volumes of data.

The specific file format used to store data depends on a number of factors, including:

 The type of data being stored (structured, semi-structured, or unstructured).


 The applications and services that will need to read, write, and process the data.
 The need for the data files to be readable by humans, or optimized for efficient
storage and processing.

Some common file formats are discussed below.

Delimited text files

Data is often stored in plain text format with specific field delimiters and row
terminators. The most common format for delimited data is comma-separated values
(CSV) in which fields are separated by commas, and rows are terminated by a
carriage return / new line. Optionally, the first line may include the field names. Other
common formats include tab-separated values (TSV) and space-delimited (in which
tabs or spaces are used to separate fields), and fixed-width data in which each field is
allocated a fixed number of characters. Delimited text is a good choice for structured
data that needs to be accessed by a wide range of applications and services in a
human-readable format.

The following example shows customer data in comma-delimited format:

Copy
FirstName,LastName,Email
Joe,Jones,joe@litware.com
Samir,Nadoy,samir@northwind.com
JavaScript Object Notation (JSON)

JSON is a ubiquitous format in which a hierarchical document schema is used to


define data entities (objects) that have multiple attributes. Each attribute might be an
object (or a collection of objects); making JSON a flexible format that's good for both
structured and semi-structured data.

The following example shows a JSON document containing a collection of


customers. Each customer has three attributes (firstName, lastName, and contact),
and the contact attribute contains a collection of objects that represent one or more
contact methods (email or phone). Note that objects are enclosed in braces ({..}) and
collections are enclosed in square brackets ([..]). Attributes are represented
by name : value pairs and separated by commas (,).

JSONCopy
{
"customers":
[
{
"firstName": "Joe",
"lastName": "Jones",
"contact":
[
{
"type": "home",
"number": "555 123-1234"
},
{
"type": "email",
"address": "joe@litware.com"
}
]
},
{
"firstName": "Samir",
"lastName": "Nadoy",
"contact":
[
{
"type": "email",
"address": "samir@northwind.com"
}
]
}
]
}

Extensible Markup Language (XML)

XML is a human-readable data format that was popular in the 1990s and 2000s. It's
largely been superseded by the less verbose JSON format, but there are still some
systems that use XML to represent data. XML uses tags enclosed in angle-brackets
(<../>) to define elements and attributes, as shown in this example:

XMLCopy
<Customers>
<Customer name="Joe" lastName="Jones">
<ContactDetails>
<Contact type="home" number="555 123-1234"/>
<Contact type="email" address="joe@litware.com"/>
</ContactDetails>
</Customer>
<Customer name="Samir" lastName="Nadoy">
<ContactDetails>
<Contact type="email" address="samir@northwind.com"/>
</ContactDetails>
</Customer>
</Customers>

Binary Large Object (BLOB)

Ultimately, all files are stored as binary data (1's and 0's), but in the human-readable
formats discussed above, the bytes of binary data are mapped to printable characters
(typically through a character encoding scheme such as ASCII or Unicode). Some file
formats however, particularly for unstructured data, store the data as raw binary that
must be interpreted by applications and rendered. Common types of data stored as
binary include images, video, audio, and application-specific documents.

When working with data like this, data professionals often refer to the data files
as BLOBs (Binary Large Objects).

Optimized file formats

While human-readable formats for structured and semi-structured data can be


useful, they're typically not optimized for storage space or processing. Over time,
some specialized file formats that enable compression, indexing, and efficient
storage and processing have been developed.

Some common optimized file formats you might see include Avro, ORC, and Parquet:

 Avro is a row-based format. It was created by Apache. Each record


contains a header that describes the structure of the data in the record.
This header is stored as JSON. The data is stored as binary information.
An application uses the information in the header to parse the binary
data and extract the fields it contains. Avro is a good format for
compressing data and minimizing storage and network bandwidth
requirements.
 ORC (Optimized Row Columnar format) organizes data into columns
rather than rows. It was developed by HortonWorks for optimizing read
and write operations in Apache Hive (Hive is a data warehouse system
that supports fast data summarization and querying over large datasets).
An ORC file contains stripes of data. Each stripe holds the data for a
column or set of columns. A stripe contains an index into the rows in the
stripe, the data for each row, and a footer that holds statistical
information (count, sum, max, min, and so on) for each column.

 Parquet is another columnar data format. It was created by Cloudera and


Twitter. A Parquet file contains row groups. Data for each column is
stored together in the same row group. Each row group contains one or
more chunks of data. A Parquet file includes metadata that describes the
set of rows found in each chunk. An application can use this metadata to
quickly locate the correct chunk for a given set of rows, and retrieve the
data in the specified columns for these rows. Parquet specializes in
storing and processing nested data types efficiently. It supports very
efficient compression and encoding schemes.

You might also like