
1. The Importance of Data Visualization and Data Exploration
Overview
This chapter introduces you to the basics of the statistical analysis of a dataset.
We will look at basic operations for calculating the mean, median, and variance
of different datasets and use NumPy and pandas to filter, sort, and shape the
datasets to our requirements. The concepts we will cover will serve as a base of
knowledge for the upcoming visualization chapters, in which we'll work with real-
world datasets.
By the end of this chapter, you will be able to explain the importance of data
visualization, calculate basic statistical values (such as the median, mean, and
variance), and use NumPy and pandas for data wrangling.

Introduction
Unlike machines, people are usually not equipped to interpret large amounts of
information from raw numbers and text. Of all our cognitive capabilities, we
understand things best through the visual processing of information. When data is
represented visually, the probability of understanding complex structures and
numbers increases.
Python has recently emerged as a programming language that performs well for data
analysis. It has applications across the data science pipeline: converting data into a
usable format (with pandas), analyzing it (with NumPy), and extracting useful
conclusions from the data to represent it in a visually appealing manner (with
Matplotlib or Bokeh). Python provides data visualization libraries that can help you
assemble graphical representations efficiently.
In this book, you will learn how to use Python in combination with various libraries,
such as NumPy, pandas, Matplotlib, seaborn, and geoplotlib, to create impactful data
visualizations using real-world data. Besides that, you will also learn about the features
of different types of charts and compare their advantages and disadvantages. This will
help you choose the chart type that's suited to visualizing your data.
Once we understand the basics, we can cover more advanced concepts, such as
interactive visualizations and how Bokeh can be used to create animated visualizations
that tell a story. Upon completing this book, you will be able to perform data wrangling,
extract relevant information, and visualize your findings descriptively.

Introduction to Data Visualization


Computers and smartphones store data such as names and numbers in a digital
format. Data representation refers to the form in which you can store, process,
and transmit data.
Representations can narrate a story and convey fundamental discoveries to your
audience. Without appropriately modeling your information, its value for making
meaningful findings is reduced. Creating representations helps us achieve a more
precise, more concise, and more direct perspective of the information, making it
easier for anyone to understand the data.
Information isn't equivalent to data. Representations are a useful tool for deriving
insights from the data. Thus, representations transform data into useful information.

The Importance of Data Visualization


Instead of just looking at data in the columns of an Excel spreadsheet, we get a better
idea of what our data contains by using visualization. For instance, it's easy to see
a pattern emerge from the numerical data that's given in the following scatter plot.
It shows the correlation between body mass and the maximum longevity of various
animals grouped by class. There is a positive correlation between body mass and
maximum longevity:

Figure 1.1: A simple example of data visualization

Visualizing data has many advantages, such as the following:


• Complex data can be easily understood.
• A simple visual representation of outliers, target audiences, and future markets
can be created.
• Storytelling can be done using dashboards and animations.
• Data can be explored through interactive visualizations.

Data Wrangling
Data wrangling is the process of transforming raw data into a suitable representation
for various tasks. It is the discipline of augmenting, cleaning, filtering, standardizing,
and enriching data in a way that allows it to be used in a downstream task, which in our
case is data visualization.
Look at the following data wrangling process flow diagram to understand how accurate
and actionable data can be obtained for business analysts to work on. The following
steps explain the flow of the data wrangling process:
1. First, the Employee Engagement data is in its raw form.
2. Then, the data gets imported as a DataFrame and is later cleaned.
3. The cleaned data is then transformed into graphs, from which findings can
be derived.
4. Finally, we analyze this data to communicate the final results.
For example, employee engagement can be measured based on raw data gathered
from feedback surveys, employee tenure, exit interviews, one-on-one meetings, and so
on. This data is cleaned and made into graphs based on parameters such as referrals,
faith in leadership, and scope of promotions. The percentages, that is, information
derived from the graphs, help us reach our result, which is to determine the measure of
employee engagement:

Figure 1.2: Data wrangling process to measure employee engagement



Tools and Libraries for Visualization


There are several approaches to creating data visualizations. Depending on your
requirements, you might want to use a non-coding tool such as Tableau, which allows
you to get a good feel for your data. Besides Python, which will be used in this book,
MATLAB and R are widely used in data analytics.
However, Python is the most popular language in the industry. Its ease of use and the
speed at which you can manipulate and visualize data, combined with the availability of
a number of libraries, make Python the best choice for data visualization.

Note
MATLAB (https://www.mathworks.com/products/matlab.html),
R (https://www.r-project.org), and Tableau (https://www.tableau.com) are not
part of this book; we will only cover the relevant tools and libraries for Python.

Overview of Statistics
Statistics is a combination of the analysis, collection, interpretation, and representation
of numerical data. Probability is a measure of the likelihood that an event will occur and
is quantified as a number between 0 and 1.
A probability distribution is a function that provides the probability for every possible
event. A probability distribution is frequently used for statistical analysis. The higher the
probability, the more likely the event. There are two types of probability distributions,
namely discrete and continuous.

A discrete probability distribution shows all the values that a random variable can
take, together with their probability. The following diagram illustrates an example of
a discrete probability distribution. If we have a six-sided die, we can roll each number
between 1 and 6. We have six events that can occur based on the number that's
rolled. There is an equal probability of rolling any of the numbers, and the individual
probability of any of the six events occurring is 1/6:

Figure 1.3: Discrete probability distribution for die rolls



A continuous probability distribution defines the probabilities of each possible value
of a continuous random variable. The following diagram provides an example of a
continuous probability distribution. This example illustrates the distribution of the time
needed to drive home. In most cases, around 60 minutes is needed, but sometimes, less
time is needed because there is no traffic, and sometimes, much more time is needed if
there are traffic jams:

Figure 1.4: Continuous probability distribution for the time taken to reach home

Measures of Central Tendency


Measures of central tendency are often called averages and describe central or typical
values for a probability distribution. We are going to discuss three kinds of averages in
this chapter:
• Mean: The arithmetic average is computed by summing up all measurements and
dividing the sum by the number of observations. The mean is calculated as follows:

Figure 1.5: Formula for mean
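For reference, the formula shown in Figure 1.5 is the standard arithmetic mean of n observations x_1, ..., x_n:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i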

• Median: This is the middle value of the ordered dataset. If there is an even number
of observations, the median will be the average of the two middle values. The
median is less prone to outliers compared to the mean, where outliers are distinct
values in data.
• Mode: Our last measure of central tendency, the mode is defined as the most
frequent value. There may be more than one mode in cases where multiple values
are equally frequent.
For example, a die was rolled 10 times, and we got the following numbers: 4, 5, 4, 3, 4, 2,
1, 1, 2, and 1.
The mean is calculated by summing all the events and dividing them by the number of
observations: (4+5+4+3+4+2+1+1+2+1)/10=2.7.
To calculate the median, the die rolls have to be ordered according to their values. The
ordered values are as follows: 1, 1, 1, 2, 2, 3, 4, 4, 4, 5. Since we have an even number of
die rolls, we need to take the average of the two middle values. The average of the two
middle values is (2+3)/2=2.5.
The modes are 1 and 4 since they are the two most frequent events.
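As a quick check, these values can be reproduced with NumPy. NumPy has no built-in mode function, so this sketch counts occurrences with np.unique instead:

import numpy as np

rolls = np.array([4, 5, 4, 3, 4, 2, 1, 1, 2, 1])

print(np.mean(rolls))    # 2.7
print(np.median(rolls))  # 2.5

# mode: the value(s) with the highest count
values, counts = np.unique(rolls, return_counts=True)
print(values[counts == counts.max()])  # [1 4]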

Measures of Dispersion
Dispersion, also called variability, is the extent to which a probability distribution is
stretched or squeezed.

The different measures of dispersion are as follows:


• Variance: The variance is the expected value of the squared deviation from the
mean. It describes how far a set of numbers is spread out from their mean. Variance
is calculated as follows:

Figure 1.6: Formula for variance
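For reference, the formula shown in Figure 1.6 is the population variance of n observations with mean \bar{x}:

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2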

• Standard deviation: This is the square root of the variance.


• Range: This is the difference between the largest and smallest values in a dataset.
• Interquartile range: Also called the midspread or middle 50%, this is the
difference between the 75th and 25th percentiles, or between the upper and
lower quartiles.
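A minimal NumPy sketch of the last two measures, assuming a one-dimensional array of made-up values:

import numpy as np

data = np.array([1, 3, 5, 7, 100])

print(data.max() - data.min())  # range: 99, dominated by the outlier
print(np.percentile(data, 75) - np.percentile(data, 25))  # interquartile range: 4.0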

Correlation
The measures we have discussed so far only considered single variables. In contrast,
correlation describes the statistical relationship between two variables:
• In a positive correlation, both variables move in the same direction.
• In a negative correlation, the variables move in opposite directions.
• In zero correlation, the variables are not related.

Note
One thing you should be aware of is that correlation does not imply causation.
Correlation describes the relationship between two or more variables, while
causation describes how one event is caused by another. For example, ice cream
sales are correlated with the number of drowning deaths. But that doesn't mean
that ice cream consumption causes drowning. There is a third variable, namely
temperature, that's responsible for this correlation. Higher temperature causes
increasing ice cream sales and more people engaging in swimming, which
eventually results in drowning.
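To illustrate, Pearson's correlation coefficient can be computed with NumPy; the temperature and sales figures below are made up for illustration:

import numpy as np

temperature = np.array([20, 25, 30, 35, 40])
ice_cream_sales = np.array([110, 140, 210, 260, 300])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the correlation between the two variables
print(np.corrcoef(temperature, ice_cream_sales)[0, 1])  # close to +1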

Example
You want to find a decent apartment to rent that is not too expensive compared to
other apartments you've found. The other apartments you found on a website are
priced as follows: $700, $850, $1,500, and $750 per month:
• The mean is ($700 + $850 + $1,500 + $750) / 4 = $950.
• The median is ($750 + $850) / 2 = $800.
• The standard deviation is approximately $322 (treating the four listed prices as the whole population).
• The range is $1,500 - $700 = $800.
• The median is a better statistical measure in this case since it is less prone to
outliers (the rent amount of $1,500).
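These values can be reproduced with NumPy; note that np.std defaults to the population standard deviation used above:

import numpy as np

rents = np.array([700, 850, 1500, 750])

print(np.mean(rents))             # 950.0
print(np.median(rents))           # 800.0
print(np.std(rents))              # ~322.1
print(rents.max() - rents.min())  # 800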

Types of Data
It is important to understand what kind of data you are dealing with so that you can
select both the right statistical measure and the right visualization. We categorize
data as categorical/qualitative and numerical/quantitative. Categorical data describes
characteristics, for example, the color of an object or a person's gender. We can further
divide categorical data into nominal and ordinal data. In contrast to nominal data,
ordinal data has an order.
Numerical data can be divided into discrete and continuous data. We speak of discrete
data if the data can only have certain values, whereas continuous data can take any
value (sometimes limited to a range).
Another aspect to consider is whether the data has a temporal domain – in other words,
is it bound to time or does it change over time? If the data is bound to a location, it
might be interesting to show the spatial relationship, so you should keep that in mind
as well:

Figure 1.7: Classification of types of data

Summary Statistics
In real-world applications, we often encounter enormous datasets. Therefore, summary
statistics are used to summarize important aspects of data. They are necessary to
communicate large amounts of information in a compact and simple way.
We have already covered measures of central tendency and dispersion, which are both
summary statistics. It is important to know that measures of central tendency show a
center point in a set of data values, whereas measures of dispersion show how much
the data varies.
The following table gives an overview of which measure of central tendency is best
suited to a particular type of data:

Figure 1.8: Best suited measures of central tendency for different types of data

In the next section, we will learn about the NumPy library and implement a few
exercises using it.

NumPy
When handling data, we often need a way to work with multidimensional arrays. As we
discussed previously, we also have to apply some basic mathematical and statistical
operations on that data. This is exactly where NumPy positions itself. It provides
support for large n-dimensional arrays and has built-in support for many high-level
mathematical and statistical operations.

Note
Before NumPy, there was a library called Numeric. However, it's no longer used,
because NumPy's signature ndarray allows the performant handling of large and
high-dimensional matrices.

Ndarrays are the essence of NumPy. They are what makes it faster than using Python's
built-in lists. Unlike the built-in list data type, ndarrays provide a strided view of
memory (comparable to int[] in Java). Since they are homogeneously typed, meaning
all the elements must be of the same type, the stride is consistent, which results in less
memory wastage and better access times.
The stride is the number of memory locations between the beginnings of two adjacent
elements in an array. Strides are normally measured in bytes or in units of the size of
the array elements. A stride can be larger than or equal to the size of the element, but
not smaller; otherwise, it would intersect the memory location of the next element.
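As a quick sketch, the strides of an array can be inspected directly:

import numpy as np

arr = np.zeros((24, 8), dtype=np.float64)

# each float64 element occupies 8 bytes; stepping to the next row
# skips a whole row of 8 elements (64 bytes), stepping to the next
# column skips 1 element (8 bytes)
print(arr.strides)   # (64, 8)
print(arr.itemsize)  # 8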

Note
Remember that NumPy arrays have a defined data type. This means you are not
able to insert strings into an integer type array. NumPy is mostly used with double-
precision data types.

The following are some of the built-in methods that we will use in the exercises and
activities of this chapter.

mean
NumPy provides implementations of all the mathematical operations we covered in the
Overview of Statistics section of this chapter. The mean, or average, is the one we will
look at in more detail in this exercise:
np.mean(dataset)          # mean value for the whole dataset
np.mean(dataset[0])       # mean value of the first row
np.mean(dataset[:, 0])    # mean value of the whole first column
np.mean(dataset[1, 0:10]) # mean value of the first 10 elements of the second row

median
Several of the mathematical operations have the same interface. This makes them easy
to interchange if necessary. The median, var, and std methods will be used in the
upcoming exercises and activities:
np.median(dataset)        # median value for the whole dataset
np.median(dataset[-1])    # median value of the last row using reverse indexing
np.median(dataset[5:, 0]) # median value of rows from index 5 onward in the first column

Note that we can index every element from the end of our dataset, just as we can from
the front, by using reverse indexing. It's a simple way to get the last element, or the last
several elements, of a list. Instead of [0] for the first element, reverse indexing starts
with dataset[-1] for the last element and decreases down to dataset[-len(dataset)],
which is the first element in the dataset.
var
As we mentioned in the Overview of Statistics section, the variance describes how far a
set of numbers is spread out from their mean. We can calculate the variance using the
var method of NumPy:
np.var(dataset) # variance value for the whole dataset
np.var(dataset, axis=0) # axis used to get variance per column
np.var(dataset, axis=1) # axis used to get variance per row

std
One of the advantages of the standard deviation is that it remains in the scalar system
of the data. This means that the unit of the deviation will have the same unit as the data
itself. The std method works just like the others:
np.std(dataset)          # standard deviation for the whole dataset
np.std(dataset[:2, :2])  # std of the values in the first 2 rows and columns
np.std(dataset, axis=1)  # axis used to get standard deviation per row

Now we will do an exercise to load a dataset and calculate the mean using
these methods.

Note
All the exercises and activities in this chapter will be developed in Jupyter
Notebooks. Please download the GitHub repository with all the prepared
templates from https://packt.live/31USkof.

Exercise 1.01: Loading a Sample Dataset and Calculating the Mean using
NumPy
In this exercise, we will be loading the normal_distribution.csv dataset and
calculating the mean of each row and each column in it:
1. Open the Exercise1.01.ipynb Jupyter Notebook from the Chapter01 folder to
do this exercise. In the command-line Terminal, type jupyter-lab. You will now
see a browser window open, showing the content of the directory you called in the
previous command.
2. Click on Exercise1.01.ipynb. The notebook for Chapter01 should now be open
and ready for you to modify.
3. Import numpy with an alias:
import numpy as np

4. Use the genfromtxt method of NumPy to load the dataset:


dataset = np.genfromtxt('../../Datasets/normal_distribution.csv', delimiter=',')

In order to load the dataset, we will use the genfromtxt method call in the
following cell. This method helps load the data from a given text or .csv file. If
everything works as expected, the generation should run through without any
error or output.

Note
The numpy.genfromtxt method is less efficient than the pandas.read_csv method.
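If you prefer the faster parser, a minimal sketch of that alternative (assuming the file has no header row) looks like this:

import pandas as pd

# let pandas parse the file, then hand the raw values to NumPy
dataset = pd.read_csv('../../Datasets/normal_distribution.csv', header=None).to_numpy()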

5. Check the data you just imported by simply writing the name of the ndarray in the
next cell. Simply executing a cell that returns a value, such as an ndarray, will use
Jupyter formatting, which looks nice and, in most cases, displays more information
than using print:
# looking at the dataset
dataset

The output of the preceding code is as follows:

Figure 1.9: The first few rows of the normal_distribution.csv file



6. Print the shape using the dataset.shape command to get a quick overview of our
dataset. This will give us output in the form (rows, columns):
dataset.shape

We can also call the rows as instances and the columns as features. This means
that our dataset has 24 instances and 8 features. The output of the preceding code
is as follows:
(24, 8)

7. Calculate the mean after loading and checking our dataset. The first row in
a NumPy array can be accessed by simply indexing it with zero; for example,
dataset[0]. As we mentioned previously, NumPy has some built-in functions for
calculations such as the mean. Call np.mean() and pass in the dataset's first row to
get the result:
# calculating the mean for the first row
np.mean(dataset[0])

The output of the preceding code is as follows:


100.177647525

8. Now, calculate the mean of the first column by using np.mean() in combination
with the column indexing dataset[:, 0]:
np.mean(dataset[:, 0])

The output of the preceding code is as follows:


99.76743510416668

Whenever we want to define a range to select from a dataset, we can use a colon,
:, to provide start and end values for the selection. If we don't provide start and
end values, the default of 0 to n is used, where n is the length of the current axis.

9. Calculate the mean for every single row, aggregated in a list, using the axis tools
of NumPy. Note that by simply passing the axis parameter in the np.mean() call,
we can define the dimension our data will be aggregated on. axis=0 aggregates
down the columns (one value per column), while axis=1 aggregates across the
columns (one value per row). Get the result for each row by using axis=1:
np.mean(dataset, axis=1)

The output of the preceding code is as follows:

Figure 1.10: Mean of the elements of each row

Get the mean of each column by using axis=0:


np.mean(dataset, axis=0)

The output of the preceding code is as follows:

Figure 1.11: Mean of elements for each column

10. Calculate the mean of the whole matrix by aggregating over all of its values at once:
np.mean(dataset)

The output of the preceding code is as follows:


100.16536917390624

You are already one step closer to using NumPy in combination with plotting libraries
and creating impactful visualizations. Since we've now covered the very basics and
calculated the mean, it's now up to you to solve the upcoming activity.

Activity 1.01: Using NumPy to Compute the Mean, Median, Variance, and
Standard Deviation of a Dataset
In this activity, we will use the skills we've learned to import datasets and perform
some basic calculations (mean, median, variance, and standard deviation) to compute
our tasks.
Perform the following steps to implement this activity:
1. Open the Activity1.01.ipynb Jupyter Notebook from the Chapter01 folder to
do this activity. Import NumPy and give it the alias np.
2. Load the normal_distribution.csv dataset by using the genfromtxt method
from NumPy.
3. Print a subset of the first two rows of the dataset.
4. Calculate the mean of the third row of the loaded dataset. Access the third row by
using index 2, dataset[2].
5. Index the last element of an ndarray in the same way a regular Python list can be
accessed. dataset[:, -1] will give us the last column of every row.
6. Get a submatrix of the first three elements of every row of the first three columns
by using the double-indexing mechanism of NumPy.
7. Calculate the median for the last row of the dataset.
8. Use reverse indexing to define a range to get the last three columns. We can use
dataset[:, -3:] here.
9. Aggregate the values along an axis to calculate the mean of each row. We can use axis=1 here.
10. Calculate the variance for each column using axis 0.
11. Calculate the variance of the intersection of the last two rows and the first
two columns.
12. Calculate the standard deviation for the dataset.

Note
The solution for this activity can be found on page 348.

You have now completed your first activity using NumPy. In the following activities, this
knowledge will be consolidated further.

Basic NumPy Operations


In this section, we will learn about basic NumPy operations such as indexing, slicing,
splitting, and iterating and implement them in an exercise.

Indexing
Indexing elements in a NumPy array, at a high level, works the same as with built-in
Python lists. Therefore, we can index elements in multi-dimensional matrices:
dataset[0] # index single element in outermost dimension
dataset[-1] # index in reversed order in outermost dimension
dataset[1, 1] # index single element in two-dimensional data
dataset[-1, -1] # index in reversed order in two-dimensional data

Slicing
Slicing has also been adapted from Python's lists. Being able to easily slice parts of lists
into new ndarrays is very helpful when handling large amounts of data:
dataset[1:3] # rows 1 and 2
dataset[:2, :2] # 2x2 subset of the data
dataset[-1, ::-1] # last row with elements reversed
dataset[-5:-1, :6:2] # last 4 rows, every other element up to index 6

Splitting
Splitting data can be helpful in many situations, from plotting only half of your time-
series data to separating test and training data for machine learning algorithms.
There are two ways of splitting your data, horizontally and vertically. Horizontal
splitting can be done with the hsplit method. Vertical splitting can be done with the
vsplit method:
np.hsplit(dataset, (3)) # split horizontally in 3 equal lists
np.vsplit(dataset, (2)) # split vertically in 2 equal lists

Iterating
Iterating the NumPy data structures, ndarrays, is also possible. It steps over the
whole list of data one after another, visiting every single element in the ndarray once.
Considering that they can have several dimensions, indexing gets very complex.
The nditer is a multi-dimensional iterator object that iterates over a given number
of arrays:
# iterating over whole dataset (each value in each row)
for x in np.nditer(dataset):
    print(x)

The ndenumerate gives us the multi-dimensional index of each element, returning
(0, 1) for the second value in the first row:
# iterating over the whole dataset with indices matching the position in the dataset
for index, value in np.ndenumerate(dataset):
    print(index, value)

Now, we will perform an exercise using these basic NumPy operations.

Exercise 1.02: Indexing, Slicing, Splitting, and Iterating


In this exercise, we will use the features of NumPy to index, slice, split, and iterate
ndarrays to consolidate what we've learned. Our client wants us to prove that our
dataset is nicely distributed around the mean value of 100.
Let's use the features of NumPy to index, slice, split, and iterate ndarrays.
Indexing
1. Import the necessary libraries:
import numpy as np

2. Load the normal_distribution_splittable.csv dataset using NumPy. Have a look at
the ndarray to verify that everything works:
dataset = np.genfromtxt('../../Datasets/normal_distribution_splittable.csv', delimiter=',')

3. First, use simple indexing for the second row, as we did in our first exercise. For a
clearer understanding, all the elements are saved to a variable:
second_row = dataset[1]
np.mean(second_row)

The output of the preceding code is as follows:


96.90038836444445

4. Now, reverse index the last row and calculate the mean of that row. Always
remember that providing a negative number as the index value will index the list
from the end:
last_row = dataset[-1]
np.mean(last_row)

The output of the preceding code is as follows:


100.18096645222221

5. Index the first value of the first row using the Python standard syntax of [0][0]:
first_val_first_row = dataset[0][0]
np.mean(first_val_first_row)

The output of the preceding code is as follows:


99.14931546

6. Use reverse indexing to access the last value of the second last row (we want to use
the combined access syntax here). Remember that -1 means the last element:
last_val_second_last_row = dataset[-2, -1]
np.mean(last_val_second_last_row)

The output of the preceding code is as follows:


101.2226037

Slicing
7. Create a 2x2 matrix that starts at the second row and second column using
[1:3, 1:3]:
# slicing an intersection of 4 elements (2x2) starting at the second row and second column
subsection_2x2 = dataset[1:3, 1:3]
np.mean(subsection_2x2)

The output of the preceding code is as follows:


95.63393608250001

8. In this task, we want every other element of the fifth row. Provide ::2 as our
second index to get every second element of the given row:
every_other_elem = dataset[4, ::2]
np.mean(every_other_elem)

The output of the preceding code is as follows:


98.35235805800001

Introducing a second index adds another layer of complexity. The third value in a
slice (here, 2) selects only certain values, such as every other element: it skips the
values in between and takes only each second element from the used list.
9. Reverse the elements in a slice using negative numbers:
reversed_last_row = dataset[-1, ::-1]
np.mean(reversed_last_row)

The output of the preceding code is as follows:


100.18096645222222

Splitting
10. Use the hsplit method to split our dataset into three equal parts:
hor_splits = np.hsplit(dataset,(3))

Note that if the dataset can't be split with the given number of slices, it will throw
an error.
11. Split the first third into two equal parts vertically. Use the vsplit method to
vertically split the dataset in half. It works like hsplit:
ver_splits = np.vsplit(hor_splits[0],(2))

12. Compare the shapes. We can see that the subset has half of the rows and a third
of the columns of the original dataset:
print("Dataset", dataset.shape)
print("Subset", ver_splits[0].shape)

The output of the preceding code is as follows:


Dataset (24, 9)
Subset (12, 3)

Iterating
13. Iterate over the whole dataset (each value in each row):
curr_index = 0
for x in np.nditer(dataset):
    print(x, curr_index)
    curr_index += 1

The output of the preceding code is as follows:

Figure 1.12: Iterating the entire dataset

Looking at the given piece of code, we can see that the index is simply incremented
with each element. This only gives us a flat, one-dimensional position; if we want the
multi-dimensional index of each element, this approach won't work.

14. Use the ndenumerate method to iterate over the whole dataset. It provides two
positional values, index and value:
for index, value in np.ndenumerate(dataset):
    print(index, value)

The output of the preceding code is as follows:

Figure 1.13: Enumerating the dataset with multi-dimensional data

We've already covered most of the basic data wrangling methods for NumPy. In the next
exercise, we'll take a look at more advanced features that will give you the tools you
need to get better at analyzing your data.

Advanced NumPy Operations


In this section, we will learn about advanced NumPy operations such as filtering,
sorting, combining, and reshaping and implement them in an exercise.

Filtering
Filtering is a very powerful tool that can be used to clean up your data if you want to
avoid outlier values.
In addition to the dataset[dataset > 10] shorthand notation, we can use the
built-in NumPy extract method, which does the same thing using a different notation,
but gives us greater control with more complex examples.

If we only want to extract the indices of the values that match a given condition, we
can use the built-in where method. For example, np.where(dataset > 5) will return
a list of indices of the values from the initial dataset that are bigger than 5:
dataset[dataset > 10]                    # values bigger than 10
np.extract((dataset < 3), dataset)       # alternative – values smaller than 3
dataset[(dataset > 5) & (dataset < 10)]  # values bigger than 5 and smaller than 10
np.where(dataset > 5)                    # indices of values bigger than 5 (rows and cols)

Sorting
Sorting each row of a dataset can be really useful. Using NumPy, we are also able to sort
on other dimensions, such as columns.
In addition, argsort gives us a list of indices that would result in a sorted list:
np.sort(dataset) # values sorted on last axis
np.sort(dataset, axis=0) # values sorted on axis 0
np.argsort(dataset) # indices of values in sorted list

Combining
Stacking rows and columns onto an existing dataset can be helpful when you have two
datasets of the same dimension saved to different files.
Given two datasets, we use vstack to stack dataset_1 on top of dataset_2, which will
give us a combined dataset with all the rows from dataset_1, followed by all the rows
from dataset_2.
If we use hstack, we stack our datasets "next to each other," meaning that the elements
from the first row of dataset_1 will be followed by the elements of the first row of
dataset_2. This will be applied to each row:
np.vstack([dataset_1, dataset_2]) # combine datasets vertically
np.hstack([dataset_1, dataset_2]) # combine datasets horizontally
np.stack([dataset_1, dataset_2], axis=0) # combine datasets on axis 0

Reshaping
Reshaping can be crucial for some algorithms. Depending on the nature of your data, it
might help you to reduce dimensionality to make visualization easier:
dataset.reshape(-1, 2)       # reshape dataset to as many rows as needed x 2 columns
np.reshape(dataset, (1, -1)) # reshape dataset to 1 row x as many columns as needed

Here, -1 stands for an unknown dimension that NumPy infers automatically. It divides
the total number of elements by the dimensions that are given and uses the result for
the -1 dimension, making sure the reshaped array holds exactly the same data.
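For example, given a 4x6 array of 24 elements, NumPy resolves the -1 from the total element count:

import numpy as np

dataset = np.arange(24).reshape(4, 6)      # 24 elements in total

print(dataset.reshape(-1, 2).shape)        # (12, 2): 24 / 2 = 12 rows inferred
print(np.reshape(dataset, (1, -1)).shape)  # (1, 24): 24 columns inferred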
Next, we will perform an exercise using advanced NumPy operations.

Exercise 1.03: Filtering, Sorting, Combining, and Reshaping


This final exercise for NumPy provides some more complex tasks to consolidate our
learning. It will also combine most of the previously learned methods as a recap.
Let's use the filtering features of NumPy for sorting, stacking, combining, and reshaping
our data:
1. Import the necessary libraries:
import numpy as np

2. Load the normal_distribution_splittable.csv dataset using NumPy. Make


sure that everything works by having a look at the ndarray:
dataset = np.genfromtxt('../../Datasets/normal_distribution_splittable.csv', delimiter=',')

Filtering
3. Get values greater than 105 by supplying the condition > 105 in the brackets:
vals_greater_105 = dataset[dataset > 105]

4. Extract the values of our dataset that are between the values 90 and 95. To use
more complex conditions, we might want to use the extract method of NumPy:
vals_between_90_95 = np.extract((dataset > 90) & (dataset < 95), dataset)

5. Use the where method to get the indices of values that differ from 100 by less
than 1. Use those indices (row, col) in a list comprehension and print them out:
rows, cols = np.where(abs(dataset - 100) < 1)
one_away_indices = [[rows[index], cols[index]] for (index, _) in np.ndenumerate(rows)]

The where method from NumPy allows us to get indices (rows, cols) for each
of the matching values.

Note
List comprehensions are Python's way of mapping over data. They're a handy
notation for creating a new list with some operation applied to every element of
the old list.
For example, if we want to square every element of the list numbers = [1, 2, 3, 4, 5],
we would use a list comprehension like this: squared_list = [x * x for x in numbers].
This gives us the following list: [1, 4, 9, 16, 25]. To get a better understanding of
list comprehensions, please visit
https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions.

Sorting
6. Sort each row in our dataset by using the sort method:
row_sorted = np.sort(dataset)

As described before, by default, the last axis will be used. In a two-dimensional
dataset, this is axis 1, which means each row is sorted individually. So we can omit
the axis=1 argument in the np.sort method call.
7. With multi-dimensional data, we can use the axis parameter to define which
axis should be sorted. Use axis=0 to sort the values by column:
col_sorted = np.sort(dataset, axis=0)

8. Create a sorted index list and use fancy indexing to get access to sorted elements
easily. To keep the order of our dataset and obtain only the values of a sorted
dataset, we will use argsort:
index_sorted = np.argsort(dataset[0])
dataset[0][index_sorted]

Figure 1.14: First row with sorted values from argsort

Combining
9. Prepare the pieces that we will recombine in the following steps. Split the dataset
into thirds column-wise, then split the first third in half row-wise:
thirds = np.hsplit(dataset, (3))
halfed_first = np.vsplit(thirds[0], (2))

halfed_first[0]

The output of the preceding code is as follows:

Figure 1.15: Splitting the dataset



10. Use vstack to vertically combine the halfed_first datasets:


first_col = np.vstack([halfed_first[0], halfed_first[1]])

After vstacking the second half of our split dataset, we have one-third of our initial
dataset stacked together again. Now, we want to add the other two remaining
datasets to our first_col dataset.
11. Use the hstack method to combine our already combined first_col with the
second of the three split datasets:
first_second_col = np.hstack([first_col, thirds[1]])

12. Use hstack to combine the last third with our dataset. This is the same thing we
did with the second third in the previous step:
full_data = np.hstack([first_second_col, thirds[2]])

Reshaping
13. Reshape our dataset into a single list using the reshape method:
single_list = np.reshape(dataset, (1, -1))

14. Provide a -1 for the dimension. This tells NumPy to figure the dimension out itself:
# reshaping to a matrix with two columns
two_col_dataset = dataset.reshape(-1, 2)

You have now used many of the basic operations that are needed to analyze a dataset.
Next, we will learn about pandas, which provides several advantages when working
with data that is more complex than simple multi-dimensional numerical data. pandas
also supports different data types in datasets, meaning that we can have columns that
hold strings and others that hold numbers. NumPy, as you've seen, has some powerful
tools, and some of them are even more powerful when combined with pandas DataFrames.

pandas
The pandas Python library provides data structures and methods for manipulating
different types of data, such as numerical and temporal data. These operations are easy
to use and highly optimized for performance.

Data formats, such as CSV and JSON, and databases can be used to create DataFrames.
DataFrames are the internal representations of data and are very similar to tables
but are more powerful since they allow you to efficiently apply operations such as
multiplications, aggregations, and even joins. Importing and reading both files and
in-memory data is abstracted into a user-friendly interface. When it comes to handling
missing data, pandas provide built-in solutions to clean up and augment your data,
meaning it fills in missing values with reasonable values.
Integrated indexing and label-based slicing in combination with fancy indexing (what
we already saw with NumPy) make handling data simple. More complex techniques,
such as reshaping, pivoting, and melting data, together with the possibility of easily
joining and merging data, provide powerful tooling so that you can handle your
data correctly.
If you're working with time-series data, operations such as date range generation,
frequency conversion, and moving window statistics can provide an advanced
interface for data wrangling.
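A minimal sketch of these time-series helpers, using made-up daily values:

import pandas as pd

days = pd.date_range("2019-01-01", periods=7, freq="D")
series = pd.Series([3, 4, 2, 5, 6, 4, 7], index=days)

print(series.resample("2D").mean())     # frequency conversion: 2-day means
print(series.rolling(window=3).mean())  # moving window statistics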

Note
The installation instructions for pandas can be found here:
https://pandas.pydata.org/. The latest version is v0.25.3 (used in this book);
however, every v0.25.x should be suitable.

Advantages of pandas over NumPy


The following are some of the advantages of pandas:
• High level of abstraction: pandas has a higher abstraction level than NumPy,
which gives it a simpler interface for users to interact with. It abstracts away some
of the more complex concepts, such as high-performance matrix multiplications
and joining tables, and makes it easier to use and understand.
• Intuitive methods: Many methods, such as joining, selecting, and loading files,
can be used intuitively, without taking away much of the powerful nature
of pandas.
• Faster processing: The internal representation of DataFrames allows faster
processing for some operations. Of course, this always depends on the data and
its structure.
• Easy DataFrame design: DataFrames are designed for operations with and on
large datasets.

Disadvantages of pandas
The following are some of the disadvantages of pandas:
• Less applicable: Due to its higher abstraction, it's generally less applicable than
NumPy. Especially when used outside of its scope, operations can get complex.
• Higher memory usage: Due to the internal representation of DataFrames and the
way pandas trades memory for more performant execution, the memory usage of
complex operations can spike.
• Performance problems: Especially when doing heavy joins, which is
not recommended, memory usage can get critical and might lead to
performance problems.
• Hidden complexity: Less experienced users often tend to overuse methods and
execute them several times instead of reusing what they've already calculated. This
hidden complexity makes users think that the operations themselves are simple,
which is not the case.

Note
Always try to think about how to design your workflows instead of excessively
using operations.

Now, we will do an exercise to load a dataset and calculate the mean using pandas.

Exercise 1.04: Loading a Sample Dataset and Calculating the Mean Using pandas
In this exercise, we will be loading the world_population.csv dataset and calculating
the mean of some rows and columns. Our dataset holds the yearly population density
for every country. Let's use pandas to perform this exercise:
1. Open the Exercise1.04.ipynb Jupyter Notebook from the Chapter01 folder to
implement this exercise and import the pandas libraries:
import pandas as pd

2. Use the read_csv method to load the aforementioned dataset. We want to use
the first column, containing the country names, as our index. We will use the
index_col parameter for that:
dataset = pd.read_csv('../../Datasets/world_population.csv', index_col=0)

3. Now, check the data you just imported by simply writing the name of the dataset
in the next cell. pandas uses a data structure called DataFrames. Print some of the
rows. To avoid filling the screen, use the pandas head() method:
dataset.head()

The output of the preceding code is as follows:

Figure 1.16: The first five rows of our dataset

Both head() and tail() let you provide a number, n, as a parameter, which
describes how many rows should be returned.

Note
Simply executing a cell that returns a value such as a DataFrame will use Jupyter
formatting, which looks nicer and, in most cases, displays more information than
using print.

4. Print out the shape of the dataset to get a quick overview using the dataset.shape
command. This works the same as it does with NumPy ndarrays. It will give us
the output in the form (rows, columns):
dataset.shape

The output of the preceding code is as follows:


(264, 60)

5. Index the column with the year 1961. pandas DataFrames have built-in functions
for calculations, such as the mean, so we can simply call mean() on the selected
column to get the result:
dataset["1961"].mean()

The output of the preceding code is as follows:


176.91514132840555

6. Check the difference in population density over the years by repeating the previous
step with the column for the year 2015 (the population more than doubled in the
given time range):
# calculating the mean for 2015 column
dataset["2015"].mean()

The output of the preceding code is as follows:


368.70660104001837

7. To get the mean for every single country (row), we can make use of the pandas axis
tools. Use the mean() method on the dataset with axis=1, which aggregates over the
columns to give one value per row, and return the first 10 rows using the head() method:
dataset.mean(axis=1).head(10)

The output of the preceding code is as follows:

Figure 1.17: Mean of elements in the first 10 countries (rows)

8. Get the mean for each column and return the last 10 entries:
dataset.mean(axis=0).tail(10)

The output of the preceding code is as follows:

Figure 1.18: Mean of elements for the last 10 years (columns)

9. Calculate the mean of the whole DataFrame:


# calculating the mean for the whole matrix
dataset.mean()

The output of the preceding code is as follows:

Figure 1.19: Mean of elements for each column

Since pandas DataFrames can have different data types in each column, aggregating
this value on the whole dataset out of the box makes no sense. By default, axis=0
will be used, which means that this will give us the same result as the cell prior
to this.
We've now seen that the interface of pandas has some similar methods to NumPy,
which makes it really easy to understand. We have now covered the very basics, which
will help you solve the first exercise using pandas. In the following exercise, you will
consolidate your basic knowledge of pandas and use the methods you just learned to
solve several computational tasks.

Exercise 1.05: Using pandas to Compute the Mean, Median, and Variance of a
Dataset
In this exercise, we will take the previously learned skills of importing datasets and
basic calculations and apply them to solve the tasks of our first exercise using pandas.
Let's use pandas features such as mean, median, and variance to make some
calculations on our data:
1. Import the necessary libraries:
import pandas as pd

2. Use the read_csv method to load the aforementioned dataset and use the index_
col parameter to define the first column as our index:
dataset = pd.read_csv('../../Datasets/world_population.csv', index_col=0)

3. Print the first two rows of our dataset:


dataset[0:2]

The output of the preceding code is as follows:

Figure 1.20: The first two rows, printed

4. Now, index the third row by using dataset.iloc[[2]]. Use the axis parameter
to get the mean of the country rather than the yearly column:
dataset.iloc[[2]].mean(axis=1)

The output of the preceding code is as follows:

Figure 1.21: Calculating the mean of the third row

5. Index the last row of the DataFrame by using -1 as the index for the
iloc() method:
dataset.iloc[[-1]].mean(axis=1)

The output of the preceding code is as follows:

Figure 1.22: Calculating the mean of the last row

6. Calculate the mean value of the values labeled as Germany using loc, which works
based on the index column:
dataset.loc[["Germany"]].mean(axis=1)

The output of the preceding code is as follows:

Figure 1.23: Indexing a country and calculating the mean of Germany

7. Calculate the median value of the last row by using reverse indexing and axis=1 to
aggregate the values in the row:
dataset.iloc[[-1]].median(axis=1)

The output of the preceding code is as follows:

Figure 1.24: Usage of the median method on the last row



8. Use reverse indexing to get the last three rows with dataset[-3:] and
calculate the median for each of them:
dataset[-3:].median(axis=1)

The output of the preceding code is as follows:

Figure 1.25: Median of the last three rows

9. Calculate the median population density values for the first 10 countries of the list
using the head and median methods:
dataset.head(10).median(axis=1)

The output of the preceding code is as follows:

Figure 1.26: Usage of the axis to calculate the median of the first 10 rows

When handling larger datasets, the order in which methods are executed
matters. Think about what head(10) does for a moment. It simply takes your
dataset and returns the first 10 rows in it, cutting down the input to the median()
method drastically.
The last method we'll cover here is the variance. pandas provides a consistent API,
which makes it easy to use.

10. Calculate the variance of the dataset and return only the last five columns:
dataset.var().tail()

The output of the preceding code is as follows:

Figure 1.27: Variance of the last five columns

11. Calculate the mean for the year 2015 using both NumPy and pandas separately:
# NumPy pandas interoperability
import numpy as np
print("pandas", dataset["2015"].mean())
print("numpy", np.mean(dataset["2015"]))

The output of the preceding code is as follows:

Figure 1.28: Using NumPy's mean method with a pandas DataFrame

This example of how to use NumPy's mean method with a pandas DataFrame shows
that, in some cases, NumPy has better functionality. However, the DataFrame format of
pandas is more applicable, so we combine both libraries to get the best out of both.
You've completed your first exercise with pandas, which showed you some of the
similarities, and also differences when working with NumPy and pandas. In the
following exercise, this knowledge will be consolidated. You'll also be introduced to
more complex features and methods of pandas.

Basic Operations of pandas


In this section, we will learn about the basic pandas operations, such as indexing,
slicing, and iterating, and implement them with an exercise.

Indexing
Indexing with pandas is a bit more complex than with NumPy. We can only access
columns with a single bracket. To use the indices of the rows to access them, we need
the iloc method. If we want to access them with index_col (which was set in the
read_csv call), we need to use the loc method:
dataset["2000"] # index the 2000 col
dataset.iloc[-1] # index the last row
dataset.loc["Germany"] # index the row with index Germany
dataset[["2015"]].loc[["Germany"]] # index row Germany and column 2015

Slicing
Slicing with pandas is even more powerful. We can use the default slicing syntax we've
already seen with NumPy or use multi-selection. If we want to slice different rows or
columns by name, we can simply pass a list into the brackets:
dataset.iloc[0:10] # slice of the first 10 rows
dataset.loc[["Germany", "India"]] # slice of rows Germany and India
# subset of Germany and India with years 1970/90
dataset.loc[["Germany", "India"]][["1970", "1990"]]

Iterating
Iterating DataFrames is also possible. Considering that they can have several
dimensions and dtypes, the indexing is very high level and iterating over each row
has to be done separately:
# iterating the whole dataset
for index, row in dataset.iterrows():
    print(index, row)

Series
A pandas Series is a one-dimensional labeled array that is capable of holding any type of
data. We can create a Series by loading datasets from a .csv file, Excel spreadsheet, or
SQL database. There are many different ways to create them, such as the following:
• NumPy arrays:
# import pandas
import pandas as pd
# import numpy
import numpy as np
# creating a numpy array
numarr = np.array(['p','y','t','h','o','n'])
ser = pd.Series(numarr)
print(ser)

• Python lists:
# import pandas
import pandas as pd
# creating a Python list
plist = ['p','y','t','h','o','n']
ser = pd.Series(plist)
print(ser)
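• Dictionaries, where the keys become the index (the density values below are made up for illustration):
# import pandas
import pandas as pd
# creating a Series from a dictionary
ser = pd.Series({'Germany': 232.1, 'Singapore': 7908.7})
print(ser)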

Now, we will use basic pandas operations in an exercise.

Exercise 1.06: Indexing, Slicing, and Iterating Using pandas


In this exercise, we will use the previously discussed pandas features to index, slice,
and iterate DataFrames using pandas Series. To derive some insights from our dataset,
we need to be able to explicitly index, slice, and iterate our data. For example, we can
compare several countries in terms of population density growth.
Let's use the indexing, slicing, and iterating operations to display the population density
of Germany, Singapore, United States, and India for the years 1970, 1990, and 2010.

Indexing
1. Import the necessary libraries:
import pandas as pd

2. Use the read_csv method to load the world_population.csv dataset and use the
first column (containing the country names) as our index via the index_col parameter:
dataset = pd.read_csv('../../Datasets/world_population.csv', index_col=0)

3. Index the row with the index_col "United States" using the loc method:
dataset.loc[["United States"]].head()

The output of the preceding code is as follows:

Figure 1.29: A few columns from the output showing indexing United States with the loc method

4. Use reverse indexing in pandas to index the second to last row using the
iloc method:
dataset.iloc[[-2]]

The output of the preceding code is as follows:

Figure 1.30: Indexing the second to last row

5. Columns are indexed using their header. This is the first line of the CSV file. Index
the column with the header of 2000 as a Series:
dataset["2000"].head()

The output of the preceding code is as follows:

Figure 1.31: Indexing all 2000 columns

Remember, the head() method simply returns the first five rows.
6. First, get the data for the year 2000 as a DataFrame, and then select India with the
loc() method by chaining the calls:
dataset[["2000"]].loc[["India"]]

The output of the preceding code is as follows:

Figure 1.32: Getting the population density of India in 2000



Since the double brackets notation returns a DataFrame once again, we can chain
method calls to get distinct elements.
7. Use the single brackets notation to get the distinct value for the population density
of India in 2000:
dataset["2000"].loc["India"]

If we want to only retrieve a Series object, we must replace the double brackets
with single ones. The output of the preceding code is as follows:
354.326858357522

Slicing
8. Create a slice with the rows 2 to 5 using the iloc() method again:
# slicing countries of rows 2 to 5
dataset.iloc[1:5]

The output of the preceding code is as follows:

Figure 1.33: The countries in rows 2 to 5



9. Use the loc() method to access several rows in the DataFrame and use the
nested brackets to provide a list of elements. Slice the dataset to get the rows for
Germany, Singapore, United States, and India:
dataset.loc[["Germany", "Singapore", "United States", "India"]]

The output of the preceding code is as follows:

Figure 1.34: Slicing Germany, Singapore, United States, and India



10. Use chaining to get the rows for Germany, Singapore, United States, and India and
return only the values for the years 1970, 1990, and 2010. Since the double bracket
queries return new DataFrames, we can chain methods and therefore access
distinct subframes of our data:
country_list = ["Germany", "Singapore", "United States", "India"]

dataset.loc[country_list][["1970", "1990", "2010"]]

The output of the preceding code is as follows:

Figure 1.35: Slices some of the countries and their population density for 1970, 1990, and 2010

Iterating
11. Iterate our dataset and print out the countries up until Angola using the
iterrows() method. The index will be the name of our row, and the row will
hold all the columns:
for index, row in dataset.iterrows():
    # only printing the rows until Angola
    if index == 'Angola':
        break
    print(index, '\n', row[["Country Code", "1970", "1990", "2010"]],
'\n')

The output of the preceding code is as follows:

Figure 1.36: Iterating all countries until Angola

We've already covered most of the underlying data wrangling methods using pandas. In
the next exercise, we'll take a look at more advanced features such as filtering, sorting,
and reshaping to prepare you for the next chapter.

Advanced pandas Operations


In this section, we will learn about some advanced pandas operations such as filtering,
sorting, and reshaping and implement them in an exercise.

Filtering
Filtering in pandas has a higher-level interface than NumPy. You can still use the
simple brackets-based conditional filtering. However, you're also able to use more
complex queries, for example, filter rows based on labels using likeness, which allows
us to search for a substring using the like argument and even full regular expressions
using regex:
dataset.filter(items=["1990"]) # only column 1994
dataset[(dataset["1990"] < 10)] # countries population density < 10 in
1999
dataset.filter(like="8", axis=1) # years containing an 8
dataset.filter(regex="a$", axis=0) # countries ending with a

Sorting
Sorting rows or columns based on the values of a given column will help you analyze
your data better and find the ranking within a dataset. With pandas, we are able to
do this pretty easily. Sorting in ascending or descending order is controlled by the
ascending parameter; the default order is ascending. Of course, you can do more
complex sorting by providing more than one value in the by=[...] list. Those values
will then be used to sort entries for which the first value is the same:
dataset.sort_values(by=["1999"]) # values sorted by 1999 ascending
# values sorted by 1999 descending
dataset.sort_values(by=["1999"], ascending=False)

Reshaping
Reshaping can be crucial for easier visualization and algorithms. However, depending
on your data, this can get really complex:
dataset.pivot(index=["1999"] * len(dataset), columns="Country Code", values="1999")

Now, we will use advanced pandas operations to perform an exercise.

Exercise 1.07: Filtering, Sorting, and Reshaping


This exercise provides some more complex tasks and also combines most of the
methods we learned about previously as a recap. After this exercise, you should be able
to read the most basic pandas code and understand its logic.
Let's use pandas to filter, sort, and reshape our data.
Filtering
1. Import the necessary libraries:
# importing the necessary dependencies
import pandas as pd

2. Use the read_csv method to load the dataset, again defining our first column as an
index column:
# loading the dataset
dataset = pd.read_csv('../../Datasets/world_population.csv', index_col=0)

3. Use filter instead of using the bracket syntax to filter for specific items. Filter the
dataset for columns 1961, 2000, and 2015 using the items parameter:
# filtering columns 1961, 2000, and 2015
dataset.filter(items=["1961", "2000", "2015"]).head()

The output of the preceding code is as follows:

Figure 1.37: Filtering data for 1961, 2000, and 2015

4. Use conditions to get all the countries that had a higher population density than
500 in 2000. Simply pass this condition in brackets:
# filtering countries that had a greater population density than 500 in 2000
dataset[(dataset["2000"] > 500)][["2000"]]

The output of the preceding code is as follows:

Figure 1.38: Filtering out values that are greater than 500 in the 2000 column

5. Search for arbitrary columns or rows (depending on the index given) that match a
certain regex. Get all the columns that start with 2 by passing ^2 (meaning the string
starts with 2):
dataset.filter(regex="^2", axis=1).head()

The output of the preceding code is as follows:

Figure 1.39: Retrieving all columns starting with 2



6. Filter the rows instead of the columns by passing axis=0. This will be helpful for
situations when we want to filter all the rows that start with A:
dataset.filter(regex="^A", axis=0).head()

The output of the preceding code is as follows:

Figure 1.40: Retrieving the rows that start with A



7. Use the like query to find only the countries that contain the word land, such
as Switzerland:
dataset.filter(like="land", axis=0).head()

The output of the preceding code is as follows:

Figure 1.41: Retrieving all countries containing the word "land"



Sorting
8. Use the sort_values or sort_index method to get the countries with the lowest
population density for the year 1961:
dataset.sort_values(by=["1961"])[["1961"]].head(10)

The output of the preceding code is as follows:

Figure 1.42: Sorting by the values for the year 1961

9. Just for comparison, carry out sorting for 2015:


dataset.sort_values(by=["2015"])[["2015"]].head(10)

The output of the preceding code is as follows:

Figure 1.43: Sorting based on the values of 2015

We can see that the order of the countries with the lowest population density has
changed a bit, but the first three entries remain unchanged.
10. Sort column 2015 in descending order to show the biggest values first:
dataset.sort_values(by=["2015"], ascending=False)[["2015"]].head(10)

The output of the preceding code is as follows:

Figure 1.44: Sorting in descending order

Reshaping
11. Get a DataFrame where the columns are country codes and the only row is the
year 2015. Since we only have one 2015 label, we need to duplicate it as many
times as our dataset's length:
# reshaping to 2015 as row and country codes as columns
dataset_2015 = dataset[["Country Code", "2015"]]
dataset_2015.pivot(index=["2015"] * len(dataset_2015), columns="Country Code", values="2015")

The output of the preceding code is as follows:

Figure 1.45: Reshaping the dataset into a single row for the values of 2015

You now know the basic functionality of pandas and have already applied it to a real-
world dataset. In the final activity for this chapter, we will try to analyze a forest fire
dataset to get a feeling for mean forest fire sizes and whether the temperature of each
month is proportional to the number of fires.

Activity 1.02: Forest Fire Size and Temperature Analysis


In this activity, we will use pandas features to derive some insights from a forest fire
dataset. We will get the mean size of forest fires, find the largest recorded fire in
our dataset, and check whether the number of forest fires grows proportionally with
the temperature in each month.
Our forest fires dataset has the following structure:
• X: X-axis spatial coordinate within the Montesinho park map: 1 to 9
• Y: Y-axis spatial coordinate within the Montesinho park map: 2 to 9
• month: Month of the year: 'jan' to 'dec'
• day: Day of the week: 'mon' to 'sun'
• FFMC: FFMC index from the FWI system: 18.7 to 96.20
• DMC: DMC index from the FWI system: 1.1 to 291.3
• DC: DC index from the FWI system: 7.9 to 860.6
• ISI: ISI index from the FWI system: 0.0 to 56.10
• temp: Temperature in degrees Celsius: 2.2 to 33.30
• RH: Relative humidity in %: 15.0 to 100
• wind: Wind speed in km/h: 0.40 to 9.40
• rain: Outside rain in mm/m2: 0.0 to 6.4
• area: The burned area of the forest (in ha): 0.00 to 1090.84

Note
We will only be using the month, temp, and area columns in this activity.

The following are the steps for this activity:


1. Open the Activity1.02.ipynb Jupyter Notebook from the Chapter01 folder to
complete this activity. Import pandas using the pd alias.
2. Load the forestfires.csv dataset using pandas.
3. Print the first two rows of the dataset to get a feeling for its structure.
Derive insights from the sizes of forest fires
1. Filter the dataset so that it only contains entries that have an area larger than 0.
2. Get the mean, min, max, and std of the area column and see what information this
gives you.
3. Sort the filtered dataset using the area column and print the last 20 entries using
the tail method to see how many huge values it holds.
4. Then, get the median of the area column and visually compare it to the mean value.
Finding the month with the most forest fires
1. Get a list of unique values from the month column of the dataset.
2. Get the number of entries for the month of March using the shape member of
our DataFrame.
3. Now, iterate over all the months, filter our dataset for the rows containing the
given month, and calculate the mean temperature. Print a statement with the
number of fires, the mean temperature, and the month.

Note
The solution for this activity can be found on page 351.

You have now completed this topic about pandas, which concludes this chapter. We
have learned about the essential tools that help you wrangle and work with data. pandas
is an incredibly powerful and widely used tool for wrangling and understanding data.

Summary
NumPy and pandas are essential tools for data wrangling. Their user-friendly
interfaces and performant implementation make data handling easy. Even though
they only provide a little insight into our datasets, they are valuable for wrangling,
augmenting, and cleaning our datasets. Mastering these skills will improve the quality
of your visualizations.
In this chapter, we learned about the basics of NumPy, pandas, and statistics. Even
though the statistical concepts we covered are basic, they are necessary to enrich
our visualizations with information that, in most cases, is not directly provided in our
datasets. This hands-on experience will help you implement the exercises and activities
in the following chapters.
In the next chapter, we will focus on the different types of visualizations and how to
decide which visualization would be best for our use case. This will give you theoretical
knowledge so that you know when to use a specific chart type and why. It will also lay
down the fundamentals of the remaining chapters in this book, which will focus on
teaching you how to use Matplotlib and seaborn to create the plots we have discussed
here. After we have covered basic visualization techniques with Matplotlib and seaborn,
we will dive more in-depth and explore the possibilities of interactive and animated
charts, which will introduce an element of storytelling into our visualizations.
2
All You Need to Know about Plots

Overview
In this chapter, we will learn the basics of different types of plots. You will design
attractive, tangible visualizations, and learn how to identify the best plot type for a
given dataset and scenario.

Introduction
In the previous chapter, we learned how to work with new datasets and get familiar
with their data and structure. We also got hands-on experience of how to analyze and
transform them using different data wrangling techniques such as filtering, sorting, and
reshaping. All of these techniques will come in handy when working with further real-
world datasets in the coming activities.
In this chapter, we will focus on various visualizations and identify which visualization
is best for showing certain information for a given dataset. We will describe every
visualization in detail and give practical examples, such as comparing different stocks
over time or comparing the ratings for different movies. Starting with comparison plots,
which are great for comparing multiple variables over time, we will look at their types
(such as line charts, bar charts, and radar charts).
We will then move onto relation plots, which are handy for showing relationships
among variables. We will cover scatter plots for showing the relationship between two
variables, bubble plots for three variables, correlograms for variable pairs, and finally,
heatmaps for visualizing multivariate data.
The chapter will further explain composition plots (used to visualize variables that are
part of a whole), as well as pie charts, stacked bar charts, stacked area charts, and Venn
diagrams. To give you a deeper insight into the distribution of variables, we will discuss
distribution plots, describing histograms, density plots, box plots, and violin plots.
Finally, we will talk about dot maps, connection maps, and choropleth maps, which
can be categorized into geoplots. Geoplots are useful for visualizing geospatial data.
Let's start with the family of comparison plots, including line charts, bar charts, and
radar charts.

Note
The data used in this chapter has been provided to demonstrate the different
types of plots available to you. In each case, the data itself will be revisited and
explained more fully in a later chapter.

Comparison Plots
Comparison plots include charts that are ideal for comparing multiple variables
or variables over time. Line charts are great for visualizing variables over time. For
comparison among items, bar charts (also called column charts) are the best way to
go. For a shorter time period (say, fewer than 10 time points), vertical bar charts can be
used as well. Radar charts or spider plots are great for visualizing multiple variables for
multiple groups.

Line Chart
Line charts are used to display quantitative values over a continuous time period and
show information as a series. A line chart is ideal for a time series that is connected by
straight-line segments.
The value being measured is placed on the y-axis, while the x-axis is the timescale.

Uses
• Line charts are great for comparing multiple variables and visualizing trends for
both single as well as multiple variables, especially if your dataset has many time
periods (more than 10).
• For smaller time periods, vertical bar charts might be the better choice.
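As a quick preview of the Matplotlib syntax that the next chapter covers in depth, the following minimal sketch draws a line chart; the years and prices here are invented purely for illustration:
import matplotlib.pyplot as plt

# hypothetical real estate prices in million USD
years = [2016, 2017, 2018, 2019, 2020]
prices = [1.1, 1.4, 1.35, 1.6, 1.9]

plt.plot(years, prices, marker="o")
plt.xlabel("Year")
plt.ylabel("Price (million USD)")
plt.show()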

The following diagram shows a trend of real estate prices (per million US dollars) across
two decades. Line charts are ideal for showing data trends:

Figure 2.1: Line chart for a single variable



Example
The following figure is a multiple-variable line chart that compares the stock-closing
prices for Google, Facebook, Apple, Amazon, and Microsoft. A line chart is great for
comparing values and visualizing the trend of the stock. As we can see, Amazon shows
the highest growth:

Figure 2.2: Line chart showing stock trends for five companies

Design Practices
• Avoid too many lines per chart.
• Adjust your scale so that the trend is clearly visible.

Note
For plots with multiple variables, a legend should be given to describe
each variable.

Bar Chart
In a bar chart, the bar length encodes the value. There are two variants of bar charts:
vertical bar charts and horizontal bar charts.

Use
While they are both used to compare numerical values across categories, vertical bar
charts are sometimes used to show a single variable over time.
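The following is a minimal Matplotlib sketch of a vertical bar chart; the student names and marks are invented:
import matplotlib.pyplot as plt

students = ["A", "B", "C", "D", "E"]
marks = [67, 82, 54, 91, 76]  # hypothetical marks out of 100

plt.bar(students, marks)
plt.ylim(0, 100)  # the numerical axis starts at zero
plt.ylabel("Marks")
plt.show()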

Don'ts of Bar Charts


• Don't confuse vertical bar charts with histograms. Bar charts compare different
variables or categories, while histograms show the distribution for a single variable.
Histograms will be discussed later in this chapter.
• Another common mistake is to use bar charts to show central tendencies among
groups or categories. Use box plots or violin plots to show statistical measures or
distributions in these cases.

Examples
The following diagram shows a vertical bar chart. Each bar shows the marks out of 100
that 5 students obtained in a test:

Figure 2.3: Vertical bar chart using student test data



The following diagram shows a horizontal bar chart. Each bar shows the marks out of
100 that 5 students obtained in a test:

Figure 2.4: Horizontal bar chart using student test data



The following diagram compares movie ratings, giving two different scores. The
Tomatometer is the percentage of approved critics who have given a positive review
for the movie. The Audience Score is the percentage of users who have given a score
of 3.5 or higher out of 5. As we can see, The Martian is the only movie with both a high
Tomatometer and Audience Score. The Hobbit: An Unexpected Journey has a relatively
high Audience Score compared to the Tomatometer score, which might be due to a
huge fan base:

Figure 2.5: Comparative bar chart

Design Practices
• The axis corresponding to the numerical variable should start at zero. Starting with
another value might be misleading, as it makes a small value difference look like a
big one.
• Use horizontal labels—that is, as long as the number of bars is small, and the chart
doesn't look too cluttered.
• The labels can be rotated to different angles if there isn't enough space to
present them horizontally. You can see this on the labels of the x-axis of the
preceding diagram.

Radar Chart
Radar charts (also known as spider or web charts) visualize multiple variables with
each variable plotted on its own axis, resulting in a polygon. All axes are arranged
radially, starting at the center with equal distances between one another, and have
the same scale.

Uses
• Radar charts are great for comparing multiple quantitative variables for a single
group or multiple groups.
• They are also useful for showing which variables score high or low within a dataset,
making them ideal for visualizing performance.
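Matplotlib has no dedicated radar chart function, but a polar subplot can be used to build one. The following minimal sketch assumes invented subject scores for a single student:
import numpy as np
import matplotlib.pyplot as plt

labels = ["Math", "English", "History", "Physics", "Chemistry"]
scores = [78, 85, 60, 70, 90]  # hypothetical scores

# one angle per axis; repeat the first point to close the polygon
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
ax = plt.subplot(polar=True)
ax.plot(angles + angles[:1], scores + scores[:1])
ax.fill(angles + angles[:1], scores + scores[:1], alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(labels)
plt.show()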

Examples
The following diagram shows a radar chart for a single variable. This chart displays data
about a student scoring marks in different subjects:

Figure 2.6: Radar chart for one variable (student)



The following diagram shows a radar chart for two variables/groups. Here, the chart
explains the marks that were scored by two students in different subjects:

Figure 2.7: Radar chart for two variables (two students)



The following diagram shows a radar chart for multiple variables/groups. Each chart
displays data about a student's performance in different subjects:

Figure 2.8: Radar chart with faceting for multiple variables (multiple students)

Design Practices
• Try to display 10 factors or fewer on a single radar chart to make it easier to read.
• Use faceting (displaying each variable in a separate plot) for multiple variables/
groups, as shown in the preceding diagram, in order to maintain clarity.
In the first section, we learned which plots are suitable for comparing items. Line charts
are great for comparing something over time, whereas bar charts are for comparing
different items. Last but not least, radar charts are best suited for visualizing multiple
variables for multiple groups. In the following activity, you can check whether you
understood which plot is best for which scenario.

Activity 2.01: Employee Skill Comparison


You are given scores of four employees (Alex, Alice, Chris, and Jennifer) for five
attributes: efficiency, quality, commitment, responsible conduct, and cooperation. Your
task is to compare the employees and their skills. This activity will foster your skills in
choosing the best visualization when it comes to comparing items.
1. Which charts are suitable for this task?
2. You are given the following bar and radar charts. List the advantages and
disadvantages of both charts. Which is the better chart for this task in your
opinion, and why?
The following diagram shows a bar chart for the employee skills:

Figure 2.9: Employee skills comparison with a bar chart



The following diagram shows a radar chart for the employee skills:

Figure 2.10: Employee skills comparison with a radar chart

3. What could be improved in the respective visualizations?

Note
The solution to this activity can be found on page 356.

Concluding the activity, you hopefully have a good understanding of deciding which
comparison plots are best for the situation. In the next section, we will discuss different
relation plots.

Relation Plots
Relation plots are perfectly suited to showing relationships among variables. A scatter
plot visualizes the correlation between two variables for one or multiple groups. Bubble
plots can be used to show relationships between three variables. The additional third
variable is represented by the dot size. Heatmaps are great for revealing patterns or
correlations between two qualitative variables. A correlogram is a perfect visualization
for showing the correlation among multiple variables.

Scatter Plot
Scatter plots show data points for two numerical variables, displaying a variable on
both axes.

Uses
• You can detect whether a correlation (relationship) exists between two variables.
• They allow you to plot the relationship between multiple groups or categories
using different colors.
• A bubble plot, which is a variation of the scatter plot, is an excellent tool for
visualizing the correlation of a third variable.
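A scatter plot of this kind can be sketched in Matplotlib as follows; the height and weight values are invented for illustration:
import matplotlib.pyplot as plt

heights = [160, 165, 170, 175, 180, 185]  # hypothetical heights in cm
weights = [55, 61, 66, 72, 74, 85]        # hypothetical weights in kg

plt.scatter(heights, weights)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.show()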

Examples
The following diagram shows a scatter plot of height and weight of persons belonging
to a single group:

Figure 2.11: Scatter plot with a single group



The following diagram shows the same data as in the previous plot but differentiates
between groups. In this case, we have different groups: A, B, and C:

Figure 2.12: Scatter plot with multiple groups



The following diagram shows the correlation between body mass and the maximum
longevity for various animals grouped by their classes. There is a positive correlation
between body mass and maximum longevity:

Figure 2.13: Correlation between body mass and maximum longevity for animals

Design Practices
• Start both axes at zero to represent data accurately.
• Use contrasting colors for data points and avoid using symbols for scatter plots
with multiple groups or categories.

Variants: Scatter Plots with Marginal Histograms


In addition to the scatter plot, which visualizes the correlation between two numerical
variables, you can plot the marginal distribution for each variable in the form of
histograms to give better insight into how each variable is distributed.

Examples
The following diagram shows the correlation between body mass and the maximum
longevity for animals in the Aves class. The marginal histograms are also shown, which
helps to get a better insight into both variables:

Figure 2.14: Correlation between body mass and maximum longevity of the Aves class with marginal histograms

Bubble Plot
A bubble plot extends a scatter plot by introducing a third numerical variable. The
value of the variable is represented by the size of the dots. The area of the dots is
proportional to the value. A legend is used to link the size of the dot to an actual
numerical value.

Use
Bubble plots help to show a correlation between three variables.
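In Matplotlib, a bubble plot is simply a scatter plot whose s parameter maps the third variable to the marker area. A minimal sketch with invented values:
import matplotlib.pyplot as plt

ages = [10, 20, 30, 40, 50]
heights = [140, 170, 175, 174, 172]  # hypothetical heights in cm
weights = [35, 60, 75, 80, 78]       # hypothetical weights in kg

# the marker area encodes the weight; the scaling factor is arbitrary
plt.scatter(ages, heights, s=[w * 5 for w in weights], alpha=0.5)
plt.xlabel("Age")
plt.ylabel("Height (cm)")
plt.show()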

Example
The following diagram shows a bubble plot that highlights the relationship between
heights and age of humans to get the weight of each person, which is represented by
the size of the bubble:

Figure 2.15: Bubble plot showing the relation between height and age of humans

Design Practices
• The design practices for the scatter plot are also applicable to the bubble plot.
• Don't use bubble plots for very large amounts of data, since too many bubbles make
the chart difficult to read.

Correlogram
A correlogram is a combination of scatter plots and histograms. Histograms will be
discussed in detail later in this chapter. A correlogram or correlation matrix visualizes
the relationship between each pair of numerical variables using a scatter plot.
The diagonals of the correlation matrix represent the distribution of each variable in
the form of a histogram. You can also plot the relationship between multiple groups or
categories using different colors. A correlogram is a great chart for exploratory data
analysis to get a feel for your data, especially the correlation between variable pairs.
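pandas ships a convenient helper, scatter_matrix, that produces exactly this kind of plot. The following minimal sketch uses randomly generated data in place of a real dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 100),
    "weight": rng.normal(70, 12, 100),
    "age": rng.integers(18, 80, 100),
})

# scatter plots off the diagonal, histograms on the diagonal
pd.plotting.scatter_matrix(df, diagonal="hist")
plt.show()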

Examples
The following diagram shows a correlogram for the height, weight, and age of humans.
The diagonal plots show a histogram for each variable. The off-diagonal elements show
scatter plots between variable pairs:

Figure 2.16: Correlogram with a single category



The following diagram shows the correlogram with data samples separated by color
into different groups:

Figure 2.17: Correlogram with multiple categories

Design Practices
• Start both axes at zero to represent data accurately.
• Use contrasting colors for data points and avoid using symbols for scatter plots
with multiple groups or categories.

Heatmap
A heatmap is a visualization where values contained in a matrix are represented as
colors or color saturation. Heatmaps are great for visualizing multivariate data (data in
which analysis is based on more than two variables per observation), where categorical
variables are placed in the rows and columns and a numerical or categorical variable is
represented as colors or color saturation.

Use
The visualization of multivariate data can be done using heatmaps as they are great for
finding patterns in your data.
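A minimal heatmap sketch in Matplotlib, where the matrix of units sold is invented for illustration:
import numpy as np
import matplotlib.pyplot as plt

sites = ["Site A", "Site B", "Site C"]
products = ["Phone", "Laptop", "Tablet"]
units_sold = np.array([[120, 95, 40],
                       [80, 130, 60],
                       [45, 70, 150]])  # hypothetical sales matrix

plt.imshow(units_sold, cmap="Blues")
plt.xticks(range(len(products)), products)
plt.yticks(range(len(sites)), sites)
plt.colorbar(label="Units sold")
plt.show()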

Examples
The following diagram shows a heatmap for the most popular products on the
electronics category page across various e-commerce websites, where the color shows
the number of units sold. In the following diagram, we can see that darker
colors represent more units sold, as shown in the key:

Figure 2.18: Heatmap for popular products in the electronics category



Variants: Annotated Heatmaps


Let's see the same example we saw previously in an annotated heatmap, where the
color shows the number of units sold:

Figure 2.19: Annotated heatmap for popular products in the electronics category

Design Practice
• Select colors and contrasts that will be easily visible to individuals with vision
problems so that your plots are more inclusive.
In this section, we introduced various plots for relating a variable to other variables and
looked at their uses, and multiple examples for the different relation plots were given.
The following activity will give you some practice in working with heatmaps.

Activity 2.02: Road Accidents Occurring over Two Decades


You are given a diagram that provides information about the road accidents that
have occurred over the past two decades during the months of January, April, July,
and October. The aim of this activity is to understand how you can use heatmaps to
visualize multivariate data.
1. Identify the two years during which the number of road accidents occurring was
the least.
2. For the past two decades, identify the month for which accidents showed a
marked decrease:

Figure 2.20: Total accidents over 20 years

Note
The solution to this activity can be found on page 356.

Composition Plots
Composition plots are ideal if you think about something as a part of a whole. For static
data, you can use pie charts, stacked bar charts, or Venn diagrams. Pie charts or donut
charts help show proportions and percentages for groups. If you need an additional
dimension, stacked bar charts are great. Venn diagrams are the best way to visualize
overlapping groups, where each group is represented by a circle. For data that changes
over time, you can use either stacked bar charts or stacked area charts.

Pie Chart
Pie charts illustrate numerical proportions by dividing a circle into slices. Each arc
length represents a proportion of a category. The full circle equates to 100%. For
humans, it is easier to compare bars than arc lengths; therefore, it is recommended to
use bar charts or stacked bar charts the majority of the time.

Use
To compare items that are part of a whole.
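A minimal pie chart sketch in Matplotlib; the categories and percentages are invented:
import matplotlib.pyplot as plt

labels = ["Toilet", "Shower", "Faucet", "Leakage", "Other"]
shares = [26, 17, 19, 12, 26]  # hypothetical percentages

plt.pie(shares, labels=labels, autopct="%1.0f%%")
plt.show()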

Examples
The following diagram shows household water usage around the world:

Figure 2.21: Pie chart for global household water usage



Design Practices
• Arrange the slices according to their size in increasing/decreasing order, either in
a clockwise or counterclockwise manner.
• Make sure that every slice has a different color.

Variants: Donut Chart


An alternative to a pie chart is a donut chart. In contrast to pie charts, it is easier to
compare the size of slices, since the reader focuses more on reading the length of
the arcs instead of the area. Donut charts are also more space-efficient because the
center is cut out, so it can be used to display information or further divide groups
into subgroups.
The following diagram shows a basic donut chart:

Figure 2.22: Donut chart



The following diagram shows a donut chart with subgroups:

Figure 2.23: Donut chart with subgroups

Design Practice
• Use the same color that's used for the category for the subcategories. Use varying
brightness levels for the different subcategories.

Stacked Bar Chart


Stacked bar charts are used to show how a category is divided into subcategories
and the proportion of the subcategory in comparison to the overall category. You can
either compare total amounts across each bar or show a percentage of each group. The
latter is also referred to as a 100% stacked bar chart and makes it easier to see relative
differences between quantities in each group.

Use
• To compare variables that can be divided into sub-variables
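In Matplotlib, a stacked bar chart can be built by passing the heights of the previous series as the bottom parameter of the next one. A minimal sketch with invented sales figures:
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
laptops = [120, 150, 170, 200]  # hypothetical unit sales
mobiles = [300, 320, 350, 400]

plt.bar(quarters, laptops, label="Laptops")
plt.bar(quarters, mobiles, bottom=laptops, label="Mobiles")
plt.ylabel("Units sold")
plt.legend()
plt.show()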

Examples
The following diagram shows a generic stacked bar chart with five groups:

Figure 2.24: Stacked bar chart to show sales of laptops and mobiles

The following diagram shows a 100% stacked bar chart with the same data that was
used in the preceding diagram:

Figure 2.25: 100% stacked bar chart to show sales of laptops, PCs, and mobiles

The following diagram illustrates the daily total sales of a restaurant over several
days. The daily total sales of non-smokers are stacked on top of the daily total sales
of smokers:

Figure 2.26: Daily total restaurant sales categorized by smokers and non-smokers

Design Practices
• Use contrasting colors for stacked bars.
• Ensure that the bars are adequately spaced to eliminate visual clutter. The ideal
space guideline between each bar is half the width of a bar.
• Categorize data alphabetically, sequentially, or by value, to uniformly order it and
make things easier for your audience.

Stacked Area Chart


Stacked area charts show trends for part-of-a-whole relations. The values of several
groups are illustrated by stacking individual area charts on top of one another. It helps
to analyze both individual and overall trend information.

Use
To show trends for time series that are part of a whole.
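Matplotlib's stackplot function stacks several series into one area chart. A minimal sketch with invented net profits:
import matplotlib.pyplot as plt

years = [2015, 2016, 2017, 2018, 2019]
company_a = [2, 3, 4, 5, 7]  # hypothetical net profits
company_b = [1, 2, 2, 3, 4]
company_c = [1, 1, 2, 2, 3]

plt.stackplot(years, company_a, company_b, company_c,
              labels=["A", "B", "C"], alpha=0.8)
plt.ylabel("Net profit")
plt.legend(loc="upper left")
plt.show()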

Examples
The following diagram shows a stacked area chart with the net profits of Google,
Facebook, Twitter, and Snapchat over a decade:

Figure 2.27: Stacked area chart to show net profits of four companies

Design Practice
• Use transparent colors to improve information visibility. This will help you to
analyze the overlapping data and you will also be able to see the grid lines.
In this section, we covered various composition plots and we will conclude this section
with the following activity.

Activity 2.03: Smartphone Sales Units


You want to compare smartphone sales units for the five biggest smartphone
manufacturers over time and see whether there is any trend. In this activity, we also
want to look at the advantages and disadvantages of stacked area charts compared to
line charts:
1. Looking at the following line chart, analyze the sales of each manufacturer and
identify the one whose fourth-quarter performance is exceptional when compared
to the third quarter.
2. Analyze the performance of all manufacturers and make a prediction about two
companies whose sales units will show a downward and an upward trend:

Figure 2.28: Line chart of smartphone sales units

3. What would be the advantages and disadvantages of using a stacked area chart
instead of a line chart?

Note
The solution to this activity can be found on page 357.

Venn Diagram
Venn diagrams, also known as set diagrams, show all possible logical relations between
a finite collection of different sets. Each set is represented by a circle. The circle size
illustrates the importance of a group. The size of overlap represents the intersection
between multiple groups.

Use
To show overlaps for different sets.

Example
The following diagram shows a Venn diagram for students in two groups taking the
same class in a semester:

Figure 2.29: Venn diagram showing students taking the same class

From the preceding diagram, we can note that there are eight students in just group A,
four students in just group B, and one student in both groups.

Design Practice
• It is not recommended to use Venn diagrams if you have more than three groups.
It would become difficult to understand.
Moving on from composition plots, we will cover distribution plots in the
following section.

Distribution Plots
Distribution plots give a deep insight into how your data is distributed. For a single
variable, a histogram is effective. For multiple variables, you can either use a box plot or
a violin plot. The violin plot visualizes the densities of your variables, whereas the box
plot just visualizes the median, the interquartile range, and the range for each variable.

Histogram
A histogram visualizes the distribution of a single numerical variable. Each bar
represents the frequency for a certain interval. Histograms help get an estimate
of statistical measures. You see where values are concentrated, and you can easily
detect outliers. You can either plot a histogram with absolute frequency values or,
alternatively, normalize your histogram. If you want to compare distributions of
multiple variables, you can use different colors for the bars.

Use
Get insights into the underlying distribution for a dataset.
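A minimal histogram sketch in Matplotlib, using simulated IQ scores in place of real measurements:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
iq_scores = rng.normal(loc=100, scale=15, size=100)  # simulated data

plt.hist(iq_scores, bins=10)
plt.xlabel("IQ")
plt.ylabel("Frequency")
plt.show()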

Example
The following diagram shows the distribution of the Intelligence Quotient (IQ) for a
test group. The dashed lines represent the standard deviation each side of the mean
(the solid line):

Figure 2.30: Distribution of IQ for a test group of a hundred adults



Design Practice
• Try different numbers of bins (data intervals), since the shape of the histogram can
vary significantly.

Density Plot
A density plot shows the distribution of a numerical variable. It is a variation of a
histogram that uses kernel smoothing, allowing for smoother distributions. One
advantage density plots have over histograms is that they convey the distribution
shape better, since the shape of a histogram depends heavily on the number of bins
(data intervals).

Use
To compare the distribution of several variables by plotting the density on the same axis
and using different colors.
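One way to sketch a density plot is to compute a Gaussian kernel density estimate with SciPy and plot the resulting curve; the sample here is simulated:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=200)  # simulated data

kde = gaussian_kde(sample)  # kernel-smoothed density estimate
xs = np.linspace(sample.min() - 1, sample.max() + 1, 200)
plt.plot(xs, kde(xs))
plt.ylabel("Density")
plt.show()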

Example
The following diagram shows a basic density plot:

Figure 2.31: Density plot



The following diagram shows a basic multi-density plot:

Figure 2.32: Multi-density plot

Design Practice
• Use contrasting colors to plot the density of multiple variables.

Box Plot
The box plot shows multiple statistical measurements. The box extends from the lower
to the upper quartile values of the data, thus allowing us to visualize the interquartile
range (IQR). The horizontal line within the box denotes the median. The parallel
extending lines from the boxes are called whiskers; they indicate the variability outside
the lower and upper quartiles. There is also an option to show data outliers, usually as
circles or diamonds, past the end of the whiskers.

Use
Compare statistical measures for multiple variables or groups.
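A minimal box plot sketch in Matplotlib for two simulated groups of heights:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
adults = rng.normal(loc=170, scale=10, size=100)      # simulated heights
non_adults = rng.normal(loc=130, scale=20, size=100)

plt.boxplot([adults, non_adults])
plt.xticks([1, 2], ["Adults", "Non-adults"])
plt.ylabel("Height (cm)")
plt.show()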

Examples
The following diagram shows a basic box plot that shows the height of a group
of people:

Figure 2.33: Box plot showing a single variable



The following diagram shows a basic box plot for multiple variables. In this case, it
shows heights for two different groups – adults and non-adults:

Figure 2.34: Box plot for multiple variables

In the next section, we will learn what the features, uses, and best practices are of the
violin plot.

Violin Plot
Violin plots are a combination of box plots and density plots. Both the statistical
measures and the distribution are visualized. The thick black bar in the center
represents the interquartile range, while the thin black line corresponds to the whiskers
in a box plot. The white dot indicates the median. On both sides of the centerline, the
density is visualized.

Use
Compare statistical measures and density for multiple variables or groups.
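A minimal violin plot sketch in Matplotlib for two simulated score distributions:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
english = rng.normal(loc=62, scale=8, size=100)   # simulated scores
math = rng.normal(loc=55, scale=15, size=100)

plt.violinplot([english, math], showmedians=True)
plt.xticks([1, 2], ["English", "Math"])
plt.ylabel("Score")
plt.show()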

Examples
The following diagram shows a violin plot for a single variable and shows how students
have performed in Math:

Figure 2.35: Violin plot for a single variable (Math)

From the preceding diagram, we can see that most of the students scored around
40-60 in the Math test.

The following diagram shows a violin plot for two variables and shows the performance
of students in English and Math:

Figure 2.36: Violin plot for multiple variables (English and Math)

From the preceding diagram, we can see that, on average, the students scored higher
in English than in Math, but the highest individual score was achieved in Math.

The following diagram shows a violin plot for a single variable divided into three
groups, and shows the performance of three divisions of students in English based
on their score:

Figure 2.37: Violin plot with multiple categories (three groups of students)

From the preceding diagram, we can note that on average, division C has scored the
highest, division B has scored the lowest, and division A is, on average, in between
divisions B and C.

Design Practice
• Scale the axes accordingly so that the distribution is clearly visible and not flat.
In this section, distribution plots were introduced. In the following activity, we will have
a closer look at histograms.

Activity 2.04: Frequency of Trains during Different Time Intervals


You are provided with a histogram that states the number of trains arriving at
different time intervals in the afternoon to determine the maximum number of trains
arriving in 2-hour time intervals. The goal of this activity is to gain a deeper insight
into histograms:
1. Looking at the following histogram, can you identify the interval during which a
maximum number of trains arrive?
2. How would the histogram change if in the morning, the same total number of
trains arrive as in the afternoon, and if you have the same frequencies for all
time intervals?

Figure 2.38: Frequency of trains during different time intervals

Note
The solution to this activity can be found on page 358.

With that activity, we conclude the section about distribution plots and we will
introduce geoplots in the next section.

Geoplots
Geological plots are a great way to visualize geospatial data. Choropleth maps can be
used to compare quantitative values for different countries, states, and so on. If you
want to show connections between different locations, connection maps are the way
to go.

Dot Map
In a dot map, each dot represents a certain number of observations. Each dot has the
same size and value (the number of observations each dot represents). The dots are
not meant to be counted; they are only intended to give an impression of magnitude.
The size and value are important factors for the effectiveness and impression of the
visualization. You can use different colors or symbols for the dots to show multiple
categories or groups.

Use
To visualize geospatial data.
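One library that supports dot maps is geoplotlib. The following minimal sketch assumes a hypothetical poi.csv file; the exact rendering options depend on the library version:
import geoplotlib
from geoplotlib.utils import read_csv

# 'poi.csv' is a placeholder; it must provide 'lat' and 'lon' columns
data = read_csv('poi.csv')
geoplotlib.dot(data)
geoplotlib.show()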

Example
The following diagram shows a dot map where each dot represents a certain amount of
bus stops throughout the world:

Figure 2.39: Dot map showing bus stops worldwide

Design Practices
• Do not show too many locations. You should still be able to see the map to get a
feel for the actual location.
• Choose a dot size and value so that in dense areas, the dots start to blend. The dot
map should give a good impression of the underlying spatial distribution.

Choropleth Map
In a choropleth map, each tile is colored to encode a variable. For example, a tile
represents a geographic region for counties and countries. Choropleth maps provide a
good way to show how a variable varies across a geographic area. One thing to keep in
mind for choropleth maps is that the human eye naturally gives more attention to larger
areas, so you might want to normalize your data by the area of each region.

Use
To visualize geospatial data grouped into geological regions—for example, states
or countries.

Example
The following diagram shows a choropleth map of a weather forecast in the USA:

Figure 2.40: Choropleth map showing a weather forecast for the USA

Design Practices
• Use darker colors for higher values, as they are perceived as being higher
in magnitude.
• Limit the color gradation, since the human eye is limited in how many colors it can
easily distinguish between. Seven color gradations should be enough.

Connection Map
In a connection map, each line represents a certain number of connections between
two locations. The link between the locations can be drawn with a straight or rounded
line, representing the shortest distance between them.

Each line has the same thickness and value (the number of connections each line
represents). The lines are not meant to be counted; they are only intended to give an
impression of magnitude. The size and value of a connection line are important factors
for the effectiveness and impression of the visualization.
You can use different colors for the lines to show multiple categories or groups, or you
can use a colormap to encode the length of the connection.

Use
To visualize connections.

Examples
The following diagram shows a connection map of flight connections around the world:

Figure 2.41: Connection map showing flight connections around the world

Design Practices
• Do not show too many connections as it will be difficult for you to analyze the data.
You should still see the map to get a feel for the actual locations of the start and
end points.
• Choose a line thickness and value so that the lines start to blend in dense
areas. The connection map should give a good impression of the underlying
spatial distribution.
Geoplots are special plots that are great for visualizing geospatial data. In the following
section, we want to briefly talk about what's generally important when it comes to
creating good visualizations.

What Makes a Good Visualization?


There are multiple aspects to what makes a good visualization:
• Most importantly, the visualization should be self-explanatory and visually
appealing. To make it self-explanatory, use a legend, descriptive labels for your
x-axis and y-axis, and titles.
• A visualization should tell a story and be designed for your audience. Before
creating your visualization, think about your target audience; create simple
visualizations for a non-specialist audience and more technical detailed
visualizations for a specialist audience. Think about a story to tell with your
visualization so that your visualization leaves an impression on the audience.

Common Design Practices


• Use colors to differentiate variables/subjects rather than symbols, as colors are
more perceptible.
• To show additional variables on a 2D plot, use color, shape, and size.
• Keep it simple and don't overload the visualization with too much information.

Activity 2.05: Analyzing Visualizations


The following visualizations are not ideal as they do not represent data well. Answer the
following questions for each visualization. The aim of this activity is to sharpen your
skills in choosing the most suitable plot for a scenario.
1. What are the bad aspects of these visualizations?
2. How could we improve the visualizations? Sketch the right visualization for
both scenarios.
The first visualization is supposed to illustrate the top 30 YouTube music channels
according to their number of subscribers:

Figure 2.42: Pie chart showing the top 30 YouTube music channels

The second visualization is supposed to illustrate the number of people playing a
certain game in a casino over 2 days:

Figure 2.43: Line chart displaying casino data for 2 days

Note
The solution to this activity can be found on page 359.

Activity 2.06: Choosing a Suitable Visualization


In this activity, we are using a dataset to visualize the median, the interquartile ranges,
and the underlying density of populations from different income groups. The dataset
used here is available at https://packt.live/2HgHxeK. Select the most suitable plot
from the following options.

The following diagram shows the population by different income groups using a
density plot:

Figure 2.44: Density plot



The following diagram shows the population by different income groups using a
box plot:

Figure 2.45: Box plot



The following diagram shows the population by different income groups using a
violin plot:

Figure 2.46: Violin plot

Note
The solution to this activity can be found on page 360.

Summary
This chapter covered the most important visualizations, categorized into comparison,
relation, composition, distribution, and geological plots. For each plot, a description,
practical examples, and design practices were given. Comparison plots, such as line
charts, bar charts, and radar charts, are well suited to comparing multiple variables
or variables over time. Relation plots are perfectly suited to show relationships
between variables. Scatter plots, bubble plots, which are an extension of scatter plots,
correlograms, and heatmaps were considered.

Composition plots are ideal if you need to think about something as part of a whole.
We first covered pie charts and continued with stacked bar charts, stacked area charts,
and Venn diagrams. For distribution plots that give a deep insight into how your data
is distributed, histograms, density plots, box plots, and violin plots were considered.
Regarding geospatial data, we discussed dot maps, connection maps, and choropleth
maps. Finally, some remarks were provided on what makes a good visualization.
In the next chapter, we will dive into Matplotlib and create our own visualizations. We
will start by introducing the basics, followed by talking about how you can add text
and annotations to make your visualizations more comprehensible. We will continue
creating simple plots and using layouts to include multiple plots within a visualization.
At the end of the next chapter, we will explain how you can use Matplotlib to
visualize images.
