Microsoft Azure Data Fundamentals Explore Core Data Concepts
Structured data
Structured data is data that adheres to a fixed schema, so all of the data has the
same fields or properties. Most commonly, the schema for structured data entities
is tabular - in other words, the data is represented in one or more tables that consist
of rows to represent each instance of a data entity, and columns to represent
attributes of the entity. For example, the following image shows tabular data
representations for Customer and Product entities.
Structured data is often stored in a database in which multiple tables can reference
one another by using key values in a relational model, which we'll explore in more
depth later.
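As a minimal sketch of the tabular idea, the following Python snippet uses the standard-library sqlite3 module to create Customer and Product tables with fixed columns. The column names, the product rows, and the Purchase table that links the two entities by key value are hypothetical, chosen only for illustration.

```python
import sqlite3

# In-memory database; all table and column names here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Customer (
        ID        INTEGER PRIMARY KEY,
        FirstName TEXT NOT NULL,
        LastName  TEXT NOT NULL
    );
    CREATE TABLE Product (
        ID    INTEGER PRIMARY KEY,
        Name  TEXT NOT NULL,
        Price REAL NOT NULL
    );
    -- Hypothetical table that references the other two by key value.
    CREATE TABLE Purchase (
        CustomerID INTEGER NOT NULL REFERENCES Customer(ID),
        ProductID  INTEGER NOT NULL REFERENCES Product(ID),
        Quantity   INTEGER NOT NULL
    );
""")

# Every row in a table supplies a value for the same set of columns.
conn.executemany("INSERT INTO Customer VALUES (?, ?, ?)",
                 [(1, "Joe", "Jones"), (2, "Samir", "Nadoy")])
conn.executemany("INSERT INTO Product VALUES (?, ?, ?)",
                 [(101, "Hammer", 2.99), (102, "Screwdriver", 3.49)])
conn.executemany("INSERT INTO Purchase VALUES (?, ?, ?)",
                 [(1, 101, 2), (2, 102, 1)])

# Follow the key values from Purchase back to the rows they reference.
rows = conn.execute("""
    SELECT c.FirstName, p.Name, pu.Quantity
    FROM Purchase AS pu
    JOIN Customer AS c ON c.ID = pu.CustomerID
    JOIN Product  AS p ON p.ID = pu.ProductID
    ORDER BY c.ID
""").fetchall()
print(rows)
```

Because the schema is fixed, an insert that omits a required column or references a missing key is rejected, which is the property that distinguishes structured data from the more flexible forms described next.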
Semi-structured data
Semi-structured data is information that has some structure, but which allows for
some variation between entity instances. For example, while most customers may
have an email address, some might have multiple email addresses, and some might
have none at all.
JSON
// Customer 1
{
  "firstName": "Joe",
  "lastName": "Jones",
  "address":
  {
    "streetAddress": "1 Main St.",
    "city": "New York",
    "state": "NY",
    "postalCode": "10099"
  },
  "contact":
  [
    {
      "type": "home",
      "number": "555 123-1234"
    },
    {
      "type": "email",
      "address": "joe@litware.com"
    }
  ]
}
// Customer 2
{
  "firstName": "Samir",
  "lastName": "Nadoy",
  "address":
  {
    "streetAddress": "123 Elm Pl.",
    "unit": "500",
    "city": "Seattle",
    "state": "WA",
    "postalCode": "98999"
  },
  "contact":
  [
    {
      "type": "email",
      "address": "samir@northwind.com"
    }
  ]
}
Note
JSON is just one of many ways in which semi-structured data can be represented.
The point here is not to provide a detailed examination of JSON syntax, but rather to
illustrate the flexible nature of semi-structured data representations.
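To make the variation between entity instances concrete, here is a small Python sketch that loads the two customer documents with the standard-library json module and compares their fields. The variable names are assumptions for illustration. The // comment lines are left out because standard JSON has no comment syntax and json.loads would reject them.

```python
import json

# The two customer documents from the example above, without the // comments.
customer_1 = json.loads("""
{
  "firstName": "Joe",
  "lastName": "Jones",
  "address": {"streetAddress": "1 Main St.", "city": "New York",
              "state": "NY", "postalCode": "10099"},
  "contact": [
    {"type": "home", "number": "555 123-1234"},
    {"type": "email", "address": "joe@litware.com"}
  ]
}
""")
customer_2 = json.loads("""
{
  "firstName": "Samir",
  "lastName": "Nadoy",
  "address": {"streetAddress": "123 Elm Pl.", "unit": "500", "city": "Seattle",
              "state": "WA", "postalCode": "98999"},
  "contact": [
    {"type": "email", "address": "samir@northwind.com"}
  ]
}
""")

# Address fields that appear for customer 2 but not for customer 1.
extra_fields = set(customer_2["address"]) - set(customer_1["address"])

# Each customer can carry a different number of contact entries.
contact_counts = [len(c["contact"]) for c in (customer_1, customer_2)]

print(extra_fields)    # {'unit'}
print(contact_counts)  # [2, 1]
```

Both documents describe the same kind of entity, yet neither is invalid: the second simply has an extra address field and fewer contact entries.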
Unstructured data
Not all data is structured or even semi-structured. For example, documents, images,
audio and video data, and binary files might not have a specific structure. This kind of
data is referred to as unstructured data.
Data stores
File stores
Databases
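The specific file format used to store data depends on a number of factors, including
the type of data being stored (structured, semi-structured, or unstructured), the
applications and services that need to read and write it, and whether the files need to
be human-readable or optimized for efficient storage and processing.
Delimited text files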
Data is often stored in plain text format with specific field delimiters and row
terminators. The most common format for delimited data is comma-separated values
(CSV) in which fields are separated by commas, and rows are terminated by a
carriage return / new line. Optionally, the first line may include the field names. Other
common formats include tab-separated values (TSV) and space-delimited (in which
tabs or spaces are used to separate fields), and fixed-width data in which each field is
allocated a fixed number of characters. Delimited text is a good choice for structured
data that needs to be accessed by a wide range of applications and services in a
human-readable format.
FirstName,LastName,Email
Joe,Jones,joe@litware.com
Samir,Nadoy,samir@northwind.com
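As a rough illustration of how an application might consume delimited text, the following Python sketch reads the sample above with the standard-library csv module. The data is held in an in-memory string rather than a file, and the variable names are assumptions.

```python
import csv
import io

# The sample data from above; the first line holds the field names.
csv_text = (
    "FirstName,LastName,Email\n"
    "Joe,Jones,joe@litware.com\n"
    "Samir,Nadoy,samir@northwind.com\n"
)

# DictReader uses the first line as field names and yields one mapping per row.
reader = csv.DictReader(io.StringIO(csv_text))
customers = list(reader)

print(reader.fieldnames)       # ['FirstName', 'LastName', 'Email']
print(customers[1]["Email"])   # samir@northwind.com

# Tab-separated data is read the same way with a different delimiter.
tsv_rows = list(csv.reader(io.StringIO("Joe\tJones\n"), delimiter="\t"))
print(tsv_rows)                # [['Joe', 'Jones']]
```

Every value comes back as a string, because delimited text carries no type information. Converting fields to numbers or dates is left to the consuming application.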
JavaScript Object Notation (JSON)
{
  "customers":
  [
    {
      "firstName": "Joe",
      "lastName": "Jones",
      "contact":
      [
        {
          "type": "home",
          "number": "555 123-1234"
        },
        {
          "type": "email",
          "address": "joe@litware.com"
        }
      ]
    },
    {
      "firstName": "Samir",
      "lastName": "Nadoy",
      "contact":
      [
        {
          "type": "email",
          "address": "samir@northwind.com"
        }
      ]
    }
  ]
}
Extensible Markup Language (XML)
XML is a human-readable data format that was popular in the 1990s and 2000s. It's
largely been superseded by the less verbose JSON format, but there are still some
systems that use XML to represent data. XML uses tags enclosed in angle-brackets
(<../>) to define elements and attributes, as shown in this example:
<Customers>
  <Customer name="Joe" lastName="Jones">
    <ContactDetails>
      <Contact type="home" number="555 123-1234"/>
      <Contact type="email" address="joe@litware.com"/>
    </ContactDetails>
  </Customer>
  <Customer name="Samir" lastName="Nadoy">
    <ContactDetails>
      <Contact type="email" address="samir@northwind.com"/>
    </ContactDetails>
  </Customer>
</Customers>
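For comparison with the other text formats, here is a minimal Python sketch that reads the XML example with the standard-library xml.etree.ElementTree module and collects each customer's email addresses. The variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The XML example from above.
xml_text = """
<Customers>
  <Customer name="Joe" lastName="Jones">
    <ContactDetails>
      <Contact type="home" number="555 123-1234"/>
      <Contact type="email" address="joe@litware.com"/>
    </ContactDetails>
  </Customer>
  <Customer name="Samir" lastName="Nadoy">
    <ContactDetails>
      <Contact type="email" address="samir@northwind.com"/>
    </ContactDetails>
  </Customer>
</Customers>
""".strip()

root = ET.fromstring(xml_text)

# Elements are reached by tag name; attributes are read with .get().
emails = {}
for customer in root.findall("Customer"):
    emails[customer.get("name")] = [
        contact.get("address")
        for contact in customer.iter("Contact")
        if contact.get("type") == "email"
    ]

print(emails)
```

Here the data values live in attributes (name, type, address) rather than in element text, so they are read with .get() instead of the element's text property.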
Binary Large Object (BLOB)
Ultimately, all files are stored as binary data (1s and 0s), but in the human-readable
formats discussed above, the bytes of binary data are mapped to printable characters
(typically through a character encoding scheme such as ASCII or Unicode). Some file
formats, however, particularly for unstructured data, store the data as raw binary that
must be interpreted by applications and rendered. Common types of data stored as
binary include images, video, audio, and application-specific documents.
When working with data like this, data professionals often refer to the data files
as BLOBs (Binary Large Objects).
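The difference between encoded text and raw binary can be sketched in a few lines of Python. The snippet below encodes a short string as UTF-8 to show the underlying bytes, then tries to decode the 8-byte signature that begins every PNG image file as if it were text. The variable names are assumptions for illustration.

```python
# Text: each character maps to one or more bytes through an encoding scheme.
text = "Joe"
encoded = text.encode("utf-8")
print(list(encoded))                        # [74, 111, 101]
print(" ".join(f"{b:08b}" for b in encoded))  # 01001010 01101111 01100101

# Characters outside ASCII need more than one byte in UTF-8.
accented = "é".encode("utf-8")
print(len(accented))                        # 2

# Raw binary: the signature at the start of a PNG file is not valid UTF-8 text,
# so an application has to interpret the bytes according to the file format.
png_signature = b"\x89PNG\r\n\x1a\n"
try:
    png_signature.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)                            # False
```

The same bytes are meaningful only to software that knows the format, which is why binary files are handled as opaque objects by general-purpose tools.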
Optimized file formats
Some common optimized file formats you might see include Avro, ORC, and Parquet: