Data Visualization With Python
1. Preface
2. Chapter 1
3. The Importance of Data Visualization and Data Exploration
1. Introduction
2. Overview of Statistics
3. NumPy
4. pandas
5. Summary
4. Chapter 2
5. All You Need to Know About Plots
1. Introduction
2. Comparison Plots
1. Line Chart
2. Bar Chart
3. Radar Chart
4. Activity 7: Employee Skill Comparison
3. Relation Plots
1. Scatter Plot
2. Bubble Plot
3. Correlogram
4. Heatmap
5. Activity 8: Road Accidents Occurring over Two Decades
4. Composition Plots
1. Pie Chart
2. Stacked Bar Chart
3. Stacked Area Chart
4. Activity 9: Smartphone Sales Units
5. Venn Diagram
5. Distribution Plots
1. Histogram
2. Density Plot
3. Box Plot
4. Violin Plot
5. Activity 10: Frequency of Trains during Different Time Intervals
6. Geo Plots
1. Dot Map
2. Choropleth Map
3. Connection Map
7. What Makes a Good Visualization?
8. Summary
6. Chapter 3
7. A Deep Dive into Matplotlib
1. Introduction
2. Overview of Plots in Matplotlib
3. Pyplot Basics
1. Creating Figures
2. Closing Figures
3. Format Strings
4. Plotting
5. Plotting Using pandas DataFrames
6. Displaying Figures
7. Saving Figures
8. Exercise 3: Creating a Simple Visualization
4. Basic Text and Legend Functions
1. Labels
2. Titles
3. Text
4. Annotations
5. Legends
6. Activity 12: Visualizing Stock Trends by Using a Line Plot
5. Basic Plots
1. Bar Chart
2. Activity 13: Creating a Bar Plot for Movie Comparison
3. Pie Chart
4. Exercise 4: Creating a Pie Chart for Water Usage
5. Stacked Bar Chart
6. Activity 14: Creating a Stacked Bar Plot to Visualize Restaurant Performance
7. Stacked Area Chart
8. Activity 15: Comparing Smartphone Sales Units Using a Stacked Area Chart
9. Histogram
10. Box Plot
11. Activity 16: Using a Histogram and a Box Plot to Visualize the Intelligence Quotient
12. Scatter Plot
13. Activity 17: Using a Scatter Plot to Visualize Correlation Between Various Animals
14. Bubble Plot
6. Layouts
1. Subplots
2. Tight Layout
3. Radar Charts
4. Exercise 5: Working on Radar Charts
5. GridSpec
6. Activity 18: Creating Scatter Plot with Marginal Histograms
7. Images
8. Chapter 4
9. Simplifying Visualizations Using Seaborn
1. Introduction
1. Advantages of Seaborn
2. Controlling Figure Aesthetics
3. Color Palettes
4. Interesting Plots in Seaborn
1. Bar Plots
2. Activity 22: Movie Comparison Revisited
3. Kernel Density Estimation
4. Plotting Bivariate Distributions
5. Visualizing Pairwise Relationships
6. Violin Plots
7. Activity 23: Comparing IQ Scores for Different Test Groups by Using a Violin Plot
5. Multi-Plots in Seaborn
1. FacetGrid
2. Activity 24: Top 30 YouTube Channels
6. Regression Plots
7. Squarify
8. Summary
10. Chapter 5
11. Plotting Geospatial Data
1. Introduction
2. Tile Providers
3. Custom Layers
4. Summary
12. Chapter 6
13. Making Things Interactive with Bokeh
1. Introduction
1. Concepts of Bokeh
2. Interfaces in Bokeh
3. Output
4. Bokeh Server
5. Presentation
6. Integrating
7. Exercise 9: Plotting with Bokeh
8. Exercise 10: Comparing the Plotting and Models Interfaces
2. Adding Widgets
3. Summary
14. Chapter 7
15. Combining What We Have Learned
1. Introduction
2. Summary
16. Appendix
DATA VISUALIZATION
WITH PYTHON
Copyright © 2019 Packt Publishing
ISBN: 978-1-78995-646-7
Preface
About
This section briefly introduces the author, the coverage of this
book, the technical skills you'll need to get started, and the
hardware and software requirements required to complete all of
the included activities and exercises.
OBJECTIVES
Get an overview of various plots
and their best use cases
AUDIENCE
This book is aimed at developers or scientists who want to get
into data science or want to use data visualizations to enrich
their personal and professional projects. Prior experience in
data analytics and visualization is not needed; however, some
knowledge of Python and high-school level math is
recommended. Even though this is a beginner-level book on
data visualization, more experienced students will benefit from
improving their Python skills by working with real-world data.
APPROACH
This book thoroughly explains the technology in easy-to-understand language while perfectly balancing theory and exercises. Each chapter is designed to build on the learning
from the previous chapter. The book also contains multiple
activities that use real-life business scenarios for you to
practice and apply your new skills in a highly relevant context.
MINIMUM HARDWARE
REQUIREMENTS
For the optimal student experience, we recommend the following hardware configuration:
OS: Windows 7 SP1 32/64-bit, Windows 8.1 32/64-bit, or Windows 10 32/64-bit; Ubuntu 14.04 or later; or macOS Sierra or later
SOFTWARE
REQUIREMENTS
You'll also need the following software installed in advance:
Conda
Python 3
INSTALLATION AND
SETUP
Before you start this book, we'll install Python 3.6, pip, and the
other libraries used throughout this book. You will find the
steps to install them here.
Installing Python
Installing pip
python get-pip.py
Installing libraries
WORKING WITH
JUPYTERLAB AND
JUPYTER NOTEBOOK
You'll be working on different exercises and activities in
JupyterLab. These exercises and activities can be downloaded
from the associated GitHub repository.
cd Data-Visualization-with-Python/<your current chapter>
For example:
cd Data-Visualization-with-Python/chapter01/
cd Activity01
IMPORTING PYTHON
LIBRARIES
Every exercise and activity in this book will make use of various libraries. Importing libraries into Python is very simple, and here's how we do it:
1. To import libraries such as NumPy and pandas, we have to run the following code. This will import the whole numpy library into our current file:
import numpy as np # import numpy and assign alias np
In this chapter, you will also learn about the basic operations of
NumPy and pandas.
Introduction
Unlike machines, people are not usually equipped to interpret a lot of information from a random set of numbers and messages in a given piece of data. While they may know what the data is basically comprised of, they might need help to understand it completely. Of all our cognitive capabilities, we understand things best by processing visual information. When data is represented visually, the probability of understanding complex structures and numbers increases.
INTRODUCTION TO DATA
VISUALIZATION
Computers and smartphones store data such as names and
numbers in a digital format. Data representation refers to the
form in which you can store, process, and transmit data.
THE IMPORTANCE OF
DATA VISUALIZATION
Visual data is very easy to understand compared to data in any
other form. Instead of just looking at data in the columns of an
Excel spreadsheet, we get a better idea of what our data
contains by using a visualization. For instance, it's easy to see a
pattern emerge from the numerical data that's given in the
following graph:
DATA WRANGLING
To draw conclusions from visualized data, we need to handle
our data and transform it into the best possible representation.
This is where data wrangling is used. It is the discipline of
augmenting, transforming, and enriching data in a way that
allows it to be displayed and understood by machine learning
algorithms.
Note
MATLAB (https://www.mathworks.com/products/matlab.html),
R (https://www.r-project.org), and Tableau
(https://www.tableau.com) are not part of this book, so we will
only cover the highlighted tools and libraries for Python.
Overview of Statistics
Statistics is a combination of the analysis, collection,
interpretation, and representation of numerical data.
Probability is a measure of the likelihood that an event will
occur and is quantified as a number between 0 and 1.
MEASURES OF CENTRAL
TENDENCY
Measures of central tendency are often called averages and
describe central or typical values for a probability distribution.
There are three kinds of averages that we are going to discuss in this chapter: the mean, the median, and the mode.
Example:
A die was rolled ten times and we got the following numbers:
4, 5, 4, 3, 4, 2, 1, 1, 2, and 1.
The modes are 1 and 4 since they are the two most frequent
events.
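The three averages for the die-roll example above can be checked with Python's standard-library statistics module (multimode, which returns every most frequent value, requires Python 3.8 or later) — a minimal sketch, not part of the book's exercises:

```python
from statistics import mean, median, multimode

# the ten die rolls from the example above
rolls = [4, 5, 4, 3, 4, 2, 1, 1, 2, 1]

print(mean(rolls))       # arithmetic mean: 2.7
print(median(rolls))     # middle of the sorted data: 2.5
print(multimode(rolls))  # the most frequent values: 4 and 1
```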
MEASURES OF
DISPERSION
Dispersion, also called variability, is the extent to which a
probability distribution is stretched or squeezed.
CORRELATION
The measures we have discussed so far only considered single
variables. In contrast, correlation describes the statistical
relationship between two variables:
Note
One thing you should be aware of is that correlation does not
imply causation. Correlation describes the relationship
between two or more variables, while causation describes how
one event is caused by another. For example: sleeping with
your shoes on is correlated with waking up with a headache.
This does not mean that sleeping with your shoes on causes a
headache in the morning. There might be a third, hidden
variable, for example, someone was up working late the
previous night, which caused both them falling asleep with
their shoes on and waking up with a headache.
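A correlation coefficient can be computed with NumPy's np.corrcoef; the toy sleep/score data below is invented purely for illustration. Note that a coefficient of 1.0 only tells us the two variables move together, not that one causes the other:

```python
import numpy as np

# hypothetical data: hours of sleep and a test score that rises with sleep
sleep = np.array([4, 5, 6, 7, 8])
score = np.array([60, 65, 70, 75, 80])

# Pearson correlation coefficient, always between -1 and 1
r = np.corrcoef(sleep, score)[0, 1]
print(r)  # 1.0 for this perfectly linear relationship
```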
Example:
The mean is
The median is
The range is
TYPES OF DATA
It is important to understand what kind of data you are dealing
with so that you can select both the right statistical measure
and the right visualization. We categorize data as
categorical/qualitative and numerical/quantitative. Categorical
data describes characteristics, for example, the color of an
object or a person's gender. We can further divide categorical
data into nominal and ordinal data. In contrast to nominal data,
ordinal data has an order.
NumPy
When handling data, we often need a way to work with
multidimensional arrays. As we discussed previously, we also
have to apply some basic mathematical and statistical
operations on that data. This is exactly where NumPy positions
itself. It provides support for large n-dimensional arrays and has built-in support for many high-level mathematical and statistical operations.
Note
Before NumPy, there was a library called Numeric. However,
it's no longer used, as NumPy's signature ndarray allows for
the performant handling of large and high-dimensional
matrices.
Note
Remember that NumPy arrays have a fixed datatype. This means you are not able to insert strings into an integer-type array. NumPy is mostly used with double-precision datatypes.
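The fixed-datatype behavior can be seen directly: assigning a float to an integer array silently truncates it, and assigning a non-numeric string raises an error. A small sketch:

```python
import numpy as np

arr = np.array([1, 2, 3])  # an integer array -> its dtype is fixed
print(arr.dtype)           # an integer dtype, e.g. int64

arr[0] = 9.7               # a float is silently truncated to fit ...
print(arr[0])              # 9

try:
    arr[1] = "hello"       # ... but a non-numeric string fails
except ValueError as err:
    print("ValueError:", err)
```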
EXERCISE 1: LOADING A
SAMPLE DATASET AND
CALCULATING THE MEAN
Note
All exercises and activities will be developed in the Jupyter
Notebook. Please download the GitHub repository with all the
prepared templates from https://github.com/TrainingByPackt/Data-Visualization-with-Python.
In this exercise, we will be loading the
normal_distribution.csv dataset and calculating the
mean of each row and each column in it:
# importing the necessary dependencies
import numpy as np
4. Look for the cell that has a comment saying "loading the dataset." This is the place where you want to insert the genfromtxt method call. This method helps in loading the data from a given text or .csv file:
dataset = np.genfromtxt('./data/normal_distribution.csv', delimiter=',')
dataset # display the loaded ndarray
dataset.shape # dimensions of the dataset
np.mean(dataset[0]) # mean of the first row
np.mean(dataset[:, 0]) # mean of the first column
np.mean(dataset, axis=1) # mean of each row
np.mean(dataset, axis=0) # mean of each column
np.mean(dataset) # mean of the whole dataset
ACTIVITY 1: USING
NUMPY TO COMPUTE
THE MEAN, MEDIAN,
VARIANCE, AND
STANDARD DEVIATION
FOR THE GIVEN
NUMBERS
In this activity, we will use the skills we've learned to import
datasets and perform some basic calculations (mean, median,
variance, and standard deviation) to compute our tasks.
3. Load the
normal_distribution.csv
dataset by using the genfromtxt
method of numpy.
BASIC NUMPY
OPERATIONS
In this section, we will learn basic NumPy operations such as
indexing, slicing, splitting, and iterating and implement them
in an activity.
Indexing
Slicing
Slicing has also been adapted from Python's lists. Being able to easily slice parts of lists into new ndarrays is very helpful when handling large amounts of data:
Splitting
Iterating
for x in np.nditer(dataset):
    print(x)
for index, value in np.ndenumerate(dataset):
    print(index, value)
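The four basic operations can be sketched on a small stand-in array (np.arange here replaces the book's CSV dataset, purely for illustration):

```python
import numpy as np

dataset = np.arange(12).reshape(3, 4)   # a small 3x4 stand-in dataset

print(dataset[0, 1])                    # indexing: row 0, column 1 -> 1
print(dataset[:, :2])                   # slicing: all rows, first two columns
left, right = np.hsplit(dataset, 2)    # splitting into two 3x2 halves
print(left.shape, right.shape)

for x in np.nditer(dataset):            # iterating over every element
    pass
for index, value in np.ndenumerate(dataset):  # element plus its index
    pass
```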
ACTIVITY 2: INDEXING,
SLICING, SPLITTING, AND
ITERATING
In this activity, we will use the features of NumPy to index,
slice, split, and iterate ndarrays to consolidate what we've
learned. Our client wants us to prove that our dataset is nicely
distributed around the mean value of 100:
3. Load the
normal_distribution.csv
dataset using NumPy. Make sure
that everything works by having a
look at the ndarray, as in the
previous activity. Follow the task
description in the notebook.
ADVANCED NUMPY
OPERATIONS
In this section, we will learn advanced NumPy operations such as filtering, sorting, combining, and reshaping, and implement them in an activity.
Filtering
Sorting
Combining
np.vstack([dataset_1, dataset_2]) # combine datasets vertically
np.hstack([dataset_1, dataset_2]) # combine datasets horizontally
Note
Combining with stacking can take some getting used to. Please look at the examples in the NumPy documentation for further information: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.hstack.html
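The difference between the two stacking directions is easiest to see in the resulting shapes; a minimal sketch with two invented 2x2 arrays:

```python
import numpy as np

dataset_1 = np.array([[1, 2], [3, 4]])
dataset_2 = np.array([[5, 6], [7, 8]])

vertical = np.vstack([dataset_1, dataset_2])    # stacked on top of each other
horizontal = np.hstack([dataset_1, dataset_2])  # stacked side by side

print(vertical.shape)    # (4, 2)
print(horizontal.shape)  # (2, 4)
```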
Reshaping
ACTIVITY 3: FILTERING,
SORTING, COMBINING,
AND RESHAPING
This last activity for NumPy provides some more complex
tasks to consolidate our learning. It will also combine most of
the previously learned methods as a recap. Perform the
following steps:
pandas
The pandas Python library offers data structures and methods
to manipulate different types of data, such as numerical and
temporal. These operations are easy to use and highly
optimized for performance.
Note
Installation instructions for pandas can be found here:
https://pandas.pydata.org/.
ADVANTAGES OF
PANDAS OVER NUMPY
The following are some of the advantages of pandas:
DISADVANTAGES OF
PANDAS
The following are some of the disadvantages of pandas:
EXERCISE 2: LOADING A SAMPLE DATASET AND CALCULATING THE MEAN
In this exercise, we will load the world_population.csv dataset and calculate the mean of some of its rows and columns:
# importing the necessary dependencies
import pandas as pd
# looking at the dataset
dataset.head()
The output of the preceding code is as follows:
dataset.shape # dimensions of the dataset
dataset["1961"].mean() # mean of the 1961 column
dataset["2015"].mean() # mean of the 2015 column
dataset.mean(axis=1).head(10) # mean of each row (first 10 rows)
dataset.mean(axis=0).tail(10) # mean of each column (last 10 columns)
dataset.mean() # mean of each column
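The axis semantics can be checked on a tiny, invented stand-in for the world_population data (the column names and values below are hypothetical):

```python
import pandas as pd

# a tiny stand-in for the world population dataset
dataset = pd.DataFrame({"1961": [2.0, 4.0], "2015": [6.0, 8.0]},
                       index=["Country A", "Country B"])

print(dataset["1961"].mean())  # mean of one column: 3.0
print(dataset.mean(axis=0))    # mean of every column (the default)
print(dataset.mean(axis=1))    # mean of every row
```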
ACTIVITY 4: USING
PANDAS TO COMPUTE
THE MEAN, MEDIAN, AND
VARIANCE FOR THE
GIVEN NUMBERS
In this activity, we will take the previously learned skills of
importing datasets and doing some basic calculations and apply
them to solve the tasks of our first activity using pandas.
3. Load the
world_population.csv
dataset using the read_csv
method of pandas.
BASIC OPERATIONS OF
PANDAS
In this section, we will learn basic pandas operations such as indexing, slicing, and iterating, and implement them in an activity.
Indexing
dataset[["2015"]].loc[["Germany"]] # index row Germany and column 2015
Slicing
dataset.loc[["Germany", "India"]] # slice of rows Germany and India
dataset.loc[["Germany", "India"]][["1970", "1990"]] # slice of rows and columns
Iterating
for index, row in dataset.iterrows():
    print(index, row)
SERIES
A pandas Series is a one-dimensional labelled array that is
capable of holding any type of data. We can create a Series by
loading datasets from a .csv file, Excel spreadsheet, or SQL
database. There are many different ways to create them. For
example:
NumPy arrays:
# import pandas
import pandas as pd
# import numpy
import numpy as np
# creating a numpy array
numarr = np.array(['p','y','t','h','o','n'])
ser = pd.Series(numarr)
print(ser)
Python lists:
# import pandas
import pandas as pd
# creating a Python list
plist = ['p','y','t','h','o','n']
ser = pd.Series(plist)
print(ser)
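A dictionary works as well and is often the most convenient source, since its keys become the Series index; a short sketch (the keys and values are invented):

```python
import pandas as pd

# a Python dictionary: its keys become the index labels
pdict = {"a": 1, "b": 2, "c": 3}
ser = pd.Series(pdict)
print(ser["b"])  # 2
```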
ACTIVITY 5: INDEXING,
SLICING, AND ITERATING
USING PANDAS
In this activity, we will use previously discussed pandas
features to index, slice, and iterate DataFrames using pandas
Series. To get some understandable insights into our dataset,
we need to be able to explicitly index, slice, and iterate our
data. For example, we can compare several countries in terms
of population density growth.
3. Load the
world_population.csv
dataset using pandas. Make sure
everything works by having a look
at the DataFrames.
ADVANCED PANDAS
OPERATIONS
In this section, we will learn advanced pandas operations such
as filtering, sorting, and reshaping and implement them in an
activity.
Filtering
dataset.filter(items=["1990"]) # only the column 1990
dataset.filter(regex="a$", axis=0) # countries ending with a
Sorting
dataset.sort_values(by=["1999"]) # values sorted by 1999
dataset.sort_values(by=["1994"], ascending=False) # values sorted by 1994, descending
Reshaping
dataset.pivot(index=["1999"] * len(dataset), columns="Country Code", values="1999")
Note
Reshaping is a very complex topic. If you want to dive deeper
into it, this is a good resource to get started:
https://bit.ly/2SjWzaB.
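Filtering and sorting can be sketched on a small, invented DataFrame (the country names and numbers below are hypothetical, chosen only to mirror the calls above):

```python
import pandas as pd

dataset = pd.DataFrame(
    {"1990": [10, 30, 20], "1994": [15, 25, 35]},
    index=["Austria", "Germany", "India"])

only_1990 = dataset.filter(items=["1990"])        # keep only one column
ends_with_a = dataset.filter(regex="a$", axis=0)  # row labels ending in "a"
by_1994_desc = dataset.sort_values(by=["1994"], ascending=False)

print(only_1990.columns.tolist())  # ['1990']
print(ends_with_a.index.tolist())  # ['Austria', 'India']
print(by_1994_desc.index[0])       # 'India' (largest 1994 value)
```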
ACTIVITY 6: FILTERING,
SORTING, AND
RESHAPING
This last activity for pandas provides some more complex tasks
and also combines most of the methods learned previously as a
recap. After this activity, students should be able to read the
most basic pandas code and understand its logic:
Summary
NumPy and pandas are essential tools for data wrangling. Their
user-friendly interfaces and performant implementation make
data handling easy. Even though they provide only a first glimpse into our datasets, they are absolutely valuable for wrangling, augmenting, and cleaning our datasets. Mastering
these skills will improve the quality of your visualizations.
Introduction
In this chapter, we will focus on various visualizations and
identify which visualization is best to show certain information
for a given dataset. We will describe every visualization in
detail and give practical examples, such as comparing different
stocks over time or comparing the ratings for different movies.
Starting with comparison plots, which are great for comparing
multiple variables over time, we will look at their types, such
as line charts, bar charts, and radar charts. Relation plots are
handy to show relationships among variables. We will cover
scatter plots for showing the relationship between two
variables, bubble plots for three variables, correlograms for
variable pairs, and, finally, heatmaps.
Comparison Plots
Comparison plots include charts that are well-suited for
comparing multiple variables or variables over time. For a
comparison among items, bar charts (also called column
charts) are the best way to go. Line charts are great for
visualizing variables over time. For a certain time period (say,
less than ten time points), vertical bar charts can be used as
well. Radar charts or spider plots are great for visualizing
multiple variables for multiple groups.
LINE CHART
Line charts are used to display quantitative values over a
continuous time period and show information as a series. A
line chart is ideal for a time series, which is connected by
straight-line segments.
Uses:
Line charts are great for comparing
multiple variables and visualizing
trends for both single as well as
multiple variables, especially if
your dataset has many time periods
(roughly more than ten).
Design practices:
BAR CHART
The bar length encodes the value. There are two variants of bar
charts: vertical bar charts and horizontal bar charts.
Uses:
Design practices:
RADAR CHART
Radar charts, also known as spider or web charts, visualize
multiple variables with each variable plotted on its own axis,
resulting in a polygon. All axes are arranged radially, starting
at the center with equal distances between one another and
have the same scale.
Uses:
Examples:
Design practices:
ACTIVITY 7: EMPLOYEE
SKILL COMPARISON
You are given scores of four employees (A, B, C, and D) for
five attributes: Efficiency, Quality, Commitment, Responsible
Conduct, and Cooperation. Your task is to compare the
employees and their skills:
Relation Plots
Relation plots are perfectly suited to show relationships
among variables. A scatter plot visualizes the correlation
between two variables for one or multiple groups. Bubble plots
can be used to show relationships between three variables. The
additional third variable is represented by the dot size.
Heatmaps are great for revealing patterns or correlations between two qualitative variables. A correlogram is a perfect visualization to show the correlation among multiple variables.
SCATTER PLOT
Scatter plots show data points for two numerical variables,
displaying a variable on both axes.
Uses:
Examples:
Design practices:
BUBBLE PLOT
A bubble plot extends a scatter plot by introducing a third
numerical variable. The value of the variable is represented by
the size of the dots. The area of the dots is proportional to the
value. A legend is used to link the size of the dot to an actual
numerical value.
Uses:
To show a correlation between
three variables.
Example:
Design practices:
CORRELOGRAM
A correlogram is a combination of scatter plots and
histograms. Histograms will be discussed in detail later in this
chapter. A correlogram or correlation matrix visualizes the
relationship between each pair of numerical variables using a
scatter plot.
Examples:
Design practices:
HEATMAP
A heatmap is a visualization where values contained in a
matrix are represented as colors or color saturation. Heatmaps
are great for visualizing multivariate data, where categorical
variables are placed in the rows and columns and a numerical
or categorical variable is represented as colors or color
saturation.
Uses:
Examples:
ACTIVITY 8: ROAD
ACCIDENTS OCCURRING
OVER TWO DECADES
You are given a diagram that gives information about the road
accidents that have occurred over the past two decades during
the months of January, April, July, and October:
Note:
The solution for this activity can be found on page 241.
Composition Plots
Composition plots are ideal if you think about something as a
part of a whole. For static data, you can use pie charts, stacked
bar charts, or Venn diagrams. Pie charts or donut charts help
show proportions and percentages for groups. If you need an
additional dimension, stacked bar charts are great. Venn
diagrams are the best way to visualize overlapping groups,
where each group is represented by a circle. For data that
changes over time, you can use either stacked bar charts or
stacked area charts.
PIE CHART
Pie charts illustrate numerical proportion by dividing a circle
into slices. Each arc length represents a proportion of a
category. The full circle equals 100%. For humans, it is easier to compare bar lengths than arc lengths; therefore, it is recommended to use bar charts or stacked bar charts most of
the time.
Uses:
Examples:
Design practices:
STACKED BAR CHART
Uses:
Examples:
Design practices:
STACKED AREA CHART
Uses:
Examples:
The following diagram shows a stacked area chart with the net
profits of companies like Google, Facebook, Twitter, and
Snapchat over a decade:
Design practices:
Using transparent colors might
improve information visibility.
ACTIVITY 9:
SMARTPHONE SALES
UNITS
You want to compare smartphone sales units for the five
biggest smartphone manufacturers over time and see whether
there is any trend:
VENN DIAGRAM
Venn diagrams, also known as set diagrams, show all possible logical relations between a finite collection of different sets. Each set is represented by a circle. The circle
size illustrates the importance of a group. The size of an
overlap represents the intersection between multiple groups.
Uses:
Example:
Design practices:
Distribution Plots
Distribution plots give a deep insight into how your data is
distributed. For a single variable, a histogram is well-suited.
For multiple variables, you can either use a box plot or a violin
plot. The violin plot visualizes the densities of your variables,
whereas the box plot just visualizes the median, the
interquartile range, and the range for each variable.
HISTOGRAM
A histogram visualizes the distribution of a single numerical
variable. Each bar represents the frequency for a certain
interval. Histograms help get an estimate of statistical
measures. You see where values are concentrated and you can
easily detect outliers. You can either plot a histogram with
absolute frequency values or alternatively normalize your
histogram. If you want to compare distributions of multiple
variables, you can use different colors for the bars.
Uses:
Design practices:
Try different numbers of bins, since
the shape of the histogram can vary
significantly.
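The effect of the bin count can be seen even without drawing anything, by binning the same data with numpy.histogram; the normally distributed sample below is generated only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=10, size=1000)  # synthetic sample

# the same data summarized with different bin counts
for bins in (5, 20, 100):
    counts, edges = np.histogram(data, bins=bins)
    print(bins, counts.max())  # coarser bins -> fewer, taller bars
```

With 5 bins the distribution looks like a crude block; with 100 bins it becomes noisy, which is why trying several bin counts is worthwhile.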
DENSITY PLOT
A density plot shows the distribution of a numerical variable.
It is a variation of a histogram that uses kernel smoothing,
allowing for smoother distributions. An advantage they have
over histograms is that density plots are better at determining
the distribution shape, since the distribution shape for
histograms heavily depends on the number of bins (data
intervals).
Uses:
Example:
Design practices:
BOX PLOT
The box plot shows multiple statistical measurements. The box
extends from the lower to the upper quartile values of the data,
thus allowing us to visualize the interquartile range. The
horizontal line within the box denotes the median. The
whiskers extending from the box show the range of the data. It
is also an option to show data outliers, usually as circles or
diamonds, past the end of the whiskers.
Uses:
Examples:
VIOLIN PLOT
Violin plots are a combination of box plots and density plots.
Both the statistical measures and the distribution are
visualized. The thick black bar in the center represents the
interquartile range, the thin black line shows the 95%
confidence interval, and the white dot shows the median. On
both sides of the center-line, the density is visualized.
Uses:
Examples:
Design practices:
Note:
The solution for this activity can be found on page 242.
Geo Plots
Geo plots are a great way to visualize geospatial data.
Choropleth maps can be used to compare quantitative values
for different countries, states, and so on. If you want to show
connections between different locations, connection maps are
the way to go.
DOT MAP
In a dot map, each dot represents a certain number of
observations. Each dot has the same size and value (the
number of observations each dot represents). The dots are not
meant to be counted—they are only intended to give an
impression of magnitude. The size and value are important
factors for the effectiveness and impression of the
visualization. You can use different colors or symbols for the
dots to show multiple categories or groups.
Uses:
Example:
Design practices:
Uses:
Example:
Design practices:
You can use different colors for the lines to show multiple
categories or groups, or you can use a colormap to encode the
length of the connection.
Uses:
Examples:
Design practices:
Note:
The solution for this activity can be found on page 243.
Summary
In this chapter, the most important visualizations were
discussed. The visualizations were categorized into
comparison, relation, composition, distribution, and geo
plots. For each plot, a description, practical examples, and
design practices were given. Comparison plots, such as line
charts, bar charts, and radar charts, are well-suited for
comparing multiple variables or variables over time. Relation
plots are perfectly suited to show relationships between
variables. Scatter plots, bubble plots, which are an extension of
scatter plots, correlograms, and heatmaps were considered.
Composition plots are ideal if you think about something as a
part of a whole. We first covered pie charts and continued with
stacked bar charts, stacked area charts, and Venn diagrams. For
distribution plots that give a deep insight into how your data is
distributed, histograms, density plots, box plots, and violin
plots were considered. Regarding geospatial data, we discussed
dot maps, connection maps, and choropleth maps. Finally,
some remarks were given on what makes a good visualization.
In the next chapter, we will dive into Matplotlib and create our
own visualizations. We will cover all the plots that we have
discussed in this chapter.
Chapter 3
A Deep Dive into
Matplotlib
Learning Objectives
By the end of this chapter, you will be able to:
Introduction
Matplotlib is probably the most popular plotting library for
Python. It is used for data science and machine learning
visualizations all around the world. John Hunter began
developing Matplotlib in 2003. It aimed to emulate the
commands of the MATLAB software, which was the scientific
standard back then. Several features such as the global style of
MATLAB were introduced into Matplotlib to make the
transition to Matplotlib easier for MATLAB users.
Overview of Plots in
Matplotlib
Plots in Matplotlib have a hierarchical structure that nests
Python objects to create a tree-like structure. Each plot is
encapsulated in a Figure object. This Figure is the top-level container of the visualization. It can have multiple axes,
which are basically individual plots inside this top-level
container.
Figure
Axes
The Axes is an actual plot, or
subplot, depending on whether you
want to plot single or multiple
visualizations. Its sub-objects
include the x and y axis, spines, and
legends.
Pyplot Basics
pyplot contains a simpler interface for creating visualizations,
which allows the users to plot the data without explicitly
configuring the Figure and Axes themselves. They are
implicitly and automatically configured to achieve the desired
output. It is handy to use the alias plt to reference the
imported submodule, as follows:
import matplotlib.pyplot as plt
CREATING FIGURES
We use plt.figure() to create a new Figure. This
function returns a Figure instance, but it is also passed to the
backend. Every Figure-related command that follows is applied
to the current Figure and does not need to know the Figure
instance.
CLOSING FIGURES
Figures that are not used anymore should be closed by
explicitly calling plt.close(), which also cleans up
memory efficiently.
FORMAT STRINGS
Before we actually plot something, let's quickly discuss format
strings. They are a neat way to specify colors, marker types,
and line styles. A format string is specified as "[color]
[marker][line]", where each item is optional. If color is
the only item in the format string, any color spelling from
matplotlib.colors can be used, such as full color names.
Matplotlib recognizes the following formats, among others:
PLOTTING
With plt.plot([x], y, [fmt]), you can plot data
points as lines and/or markers. The function returns a list of
Line2D objects representing the plotted data. By default, if you
do not provide a format string, the data points will be
connected with straight, solid lines. plt.plot([0, 1, 2,
3], [2, 4, 6, 8]) produces a plot, as shown in the
following diagram. Since x is optional and default values are
[0, …, N-1], plt.plot([2, 4, 6, 8]) results in
the same plot:
PLOTTING USING
PANDAS DATAFRAMES
It is pretty straightforward to use pandas.DataFrame as a
data source. Instead of providing x and y values, you can
provide the pandas.DataFrame in the data parameter and
give keys for x and y, as follows:
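A minimal sketch of this pattern; the year and sales columns are made-up stand-in data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data standing in for a real dataset
df = pd.DataFrame({'year': [2016, 2017, 2018, 2019],
                   'sales': [120, 150, 170, 210]})

# The column names are passed where x and y would normally go,
# and the DataFrame itself is passed via the data parameter
lines = plt.plot('year', 'sales', data=df)
```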
DISPLAYING FIGURES
plt.show() is used to display a Figure or multiple Figures.
To display Figures within a Jupyter Notebook, simply add the
%matplotlib inline magic command at the beginning of the
code.
SAVING FIGURES
plt.savefig(fname) saves the current Figure. There are
some useful optional parameters you can specify, such as dpi,
format, or transparent. The following code snippet
gives an example of how you can save a Figure:
plt.figure()
plt.savefig('lineplot.png', dpi=300, bbox_inches='tight')
# bbox_inches='tight' removes the outer white margins
Note
All exercises and activities will be developed in the Jupyter
Notebook. Please download the GitHub repository with all the
prepared templates from
https://github.com/TrainingByPackt/Data-Visualization-with-
Python.
EXERCISE 3: CREATING A
SIMPLE VISUALIZATION
In this exercise, we will create our first simple plot using
Matplotlib:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(dpi=200)
plt.plot([1, 2, 4, 5], [1, 3, 4, 3], '-o')
plt.savefig('exercise03.png')
plt.show()
LABELS
Matplotlib provides a few label functions that we can use for
setting labels to the x and y axes. The plt.xlabel() and
plt.ylabel() functions are used to set the label for the
current axes. The set_xlabel() and set_ylabel()
functions are used to set the label for specified axes.
Example:
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
TITLES
A title describes a particular chart/graph. The titles are
placed above the axes in the center, left edge, or right edge.
There are two options for titles: you can either set the Figure
title or the title of an Axes. The suptitle() function sets
the title for the current or a specified Figure, while the
title() function sets the title for the current Axes
(set_title() sets it for a specified Axes).
Example:
fig = plt.figure()
fig.suptitle('Suptitle', fontsize=10,
fontweight='bold')
This creates a bold figure title with a text suptitle and a font
size of 10.
TEXT
There are two options for text – you can either add text to a
Figure or text to an Axes. The figtext(x, y, text) function
adds text at location x, y in figure coordinates, whereas
text(x, y, text) adds text at location x, y in the data
coordinates of an Axes.
Example:
ANNOTATIONS
Compared to text that is placed at an arbitrary position on the
Axes, annotations are used to annotate some features of the
plot. In an annotation, there are two locations to consider: the
annotated location, xy, and the location of the annotation
text, xytext. It is useful to specify the arrowprops
parameter, which results in an arrow pointing to the annotated
location.
Example:
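A minimal sketch that annotates the maximum of a sine curve; the coordinates and text are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))

# xy is the annotated point; xytext places the text elsewhere;
# arrowprops draws an arrow from the text to the annotated point
annotation = plt.annotate('maximum',
                          xy=(np.pi / 2, 1.0),
                          xytext=(np.pi / 2 + 1, 1.25),
                          arrowprops=dict(arrowstyle='->'))
plt.ylim(-1.5, 1.5)
```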
LEGENDS
Example:
plt.legend()
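plt.legend() only shows entries for artists that were given a label; a minimal sketch with made-up data:

```python
import matplotlib.pyplot as plt

plt.figure()
plt.plot([1, 2, 3], [1, 4, 9], label='squares')
plt.plot([1, 2, 3], [2, 4, 6], label='doubles')

# legend() collects every labeled artist on the current Axes
legend = plt.legend(loc='upper left')
```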
Note:
The solution for this activity can be found on page 244.
Basic Plots
In this section, we are going to go through the different types
of basic plots.
BAR CHART
plt.bar(x, height, [width]) creates a vertical bar
plot. For horizontal bars, use the plt.barh() function.
Important parameters:
Example:
Example:
x = np.arange(len(labels))
width = 0.4
plt.xticks(x)
ax = plt.gca()
ax.set_xticklabels(labels)
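The snippet above only shows the tick handling; a self-contained sketch of a grouped bar chart built the same way, with hypothetical categories and values:

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ['A', 'B', 'C']          # hypothetical categories
group1 = [20, 35, 30]             # hypothetical values
group2 = [25, 32, 34]

x = np.arange(len(labels))
width = 0.4

plt.figure()
# Shifting each group by half the bar width places the bars side by side
plt.bar(x - width / 2, group1, width, label='Group 1')
plt.bar(x + width / 2, group2, width, label='Group 2')
plt.xticks(x)
ax = plt.gca()
ax.set_xticklabels(labels)
plt.legend()
```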
Note:
The solution for this activity can be found on page 245.
PIE CHART
The plt.pie(x, [explode], [labels],
[autopct]) function creates a pie chart.
Important parameters:
Example:
EXERCISE 4: CREATING A
PIE CHART FOR WATER
USAGE
In this exercise, we will use a pie chart to visualize water
usage:
# Import statements
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Load dataset
data = pd.read_csv('./data/water_usage.csv')
# Create figure
plt.figure(figsize=(8, 8), dpi=300)
# Create pie plot
plt.pie('Percentage', explode=(0, 0, 0.1, 0, 0, 0), labels='Usage', data=data, autopct='%.0f%%')
# Add title
plt.title('Water usage')
# Show plot
plt.show()
plt.bar(x, bars1)
plt.bar(x, bars2, bottom=bars1)
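The two calls above assume x, bars1, and bars2 already exist; a self-contained sketch with made-up numbers:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(3)
bars1 = np.array([10, 20, 15])   # hypothetical bottom series
bars2 = np.array([5, 12, 8])     # hypothetical stacked series

plt.figure()
plt.bar(x, bars1, label='Series 1')
# bottom=bars1 lifts the second series so it stacks on the first
container = plt.bar(x, bars2, bottom=bars1, label='Series 2')
plt.legend()
```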
Note:
The solution for this activity can be found on page 247.
Example:
Note:
The solution for this activity can be found on page 248.
HISTOGRAM
plt.hist(x) creates a histogram.
Important parameters:
Example:
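A minimal sketch using normally distributed random values as stand-in data:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
data = np.random.normal(loc=100, scale=15, size=1000)  # stand-in data

plt.figure()
# bins controls how many equally wide bins the value range is split into
counts, bin_edges, patches = plt.hist(data, bins=30)
```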
BOX PLOT
plt.boxplot(x) creates a box plot.
Important parameters:
showfliers: Optional; by default, it is True, and outliers
are plotted beyond the caps.
Example:
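A minimal sketch with three random samples as stand-in data; showfliers=False hides the outliers mentioned above:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
# Three stand-in samples with different centers
samples = [np.random.normal(mu, 1.0, size=100) for mu in (0, 2, 4)]

plt.figure()
# One box per sample; the returned dict maps artist groups to lists
result = plt.boxplot(samples, showfliers=False)
```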
Note
plt.axvline(x, [color=…], [linestyle=…])
draws a vertical line at position x.
Note:
The solution for this activity can be found on page 249.
SCATTER PLOT
plt.scatter(x, y) creates a scatter plot of y versus x
with optionally varying marker size and/or color.
Important parameters:
Example:
plt.scatter(x, y)
Note
Axes.set_xscale('log') and
Axes.set_yscale('log') change the scale of the x-axis
and y-axis to a logarithmic scale, respectively.
Note:
The solution for this activity can be found on page 252.
BUBBLE PLOT
The plt.scatter function is used to create a bubble plot.
To visualize a third or a fourth variable, the parameters s
(scale) and c (color) can be used.
Example:
plt.colorbar()
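A minimal sketch where marker area (s) encodes a third variable and color (c) a fourth; all values here are random stand-ins:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.random.rand(30)
y = np.random.rand(30)
sizes = np.random.rand(30) * 300   # third variable -> marker area
colors = np.random.rand(30)        # fourth variable -> marker color

plt.figure()
sc = plt.scatter(x, y, s=sizes, c=colors, alpha=0.6)
plt.colorbar(sc)
```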
Layouts
There are multiple ways to define a visualization layout in
Matplotlib. We will start with subplots and how to use the
tight layout to create visually appealing plots and then cover
GridSpec, which offers a more flexible way to create multi-
plots.
SUBPLOTS
It is often useful to display several plots next to each other.
Matplotlib offers the concept of subplots, which are multiple
Axes within a Figure. These plots can be grids of plots, nested
plots, and so forth.
plt.subplots(nrows,
ncols) creates a Figure and a set
of subplots.
plt.subplot(nrows,
ncols, index) or equivalently
plt.subplot(pos) adds a
subplot to the current Figure. The
index starts at 1.
plt.subplot(2, 2, 1) is
equivalent to
plt.subplot(221).
Figure.subplots(nrows,
ncols) adds a set of subplots to
the specified Figure.
Figure.add_subplot(nrows
, ncols, index) or
equivalently
Figure.add_subplot(pos)
adds a subplot to the specified
Figure.
Example 1:
for i in range(4):
    plt.subplot(2, 2, i + 1)
    plt.plot(series[i])
Example 2:
fig, axes = plt.subplots(2, 2)
axes = axes.ravel()
for i, ax in enumerate(axes):
    ax.plot(series[i])
TIGHT LAYOUT
plt.tight_layout() adjusts subplot parameters so that
the subplots fit well in the Figure.
Examples:
fig, axes = plt.subplots(2, 2)
axes = axes.ravel()
for i, ax in enumerate(axes):
    ax.plot(series[i])

fig, axes = plt.subplots(2, 2)
axes = axes.ravel()
for i, ax in enumerate(axes):
    ax.plot(series[i])
plt.tight_layout()
…
RADAR CHARTS
Radar charts, also known as spider or web charts, visualize
multiple variables, with each variable plotted on its own axis,
resulting in a polygon. All axes are arranged radially around
the center, with equal angles between them, and share the
same scale.
EXERCISE 5: WORKING
ON RADAR CHARTS
In this exercise, it is shown step-by-step how to create a radar
chart:
# Import statements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Sample data
# Attributes: Efficiency, Quality, Commitment, Responsible Conduct, Cooperation
data = pd.DataFrame({
    # The employee column was missing from the snippet; these names are placeholders
    'Employee': ['A', 'B', 'C', 'D'],
    'Efficiency': [5, 4, 4, 3],
    'Quality': [5, 5, 3, 3],
    'Commitment': [5, 4, 4, 4],
    'Responsible Conduct': [4, 4, 4, 3],
    'Cooperation': [4, 3, 4, 5]
})
attributes = list(data.columns[1:])
values = list(data.values[:, 1:])
employees = list(data.values[:, 0])
# One angle per attribute, evenly spaced around the circle
angles = [n / float(len(attributes)) * 2 * np.pi for n in range(len(attributes))]
# Repeat the first angle and value to close the polygon
angles += angles[:1]
values = np.asarray(values, dtype=float)
values = np.concatenate([values, values[:, 0:1]], axis=1)
# Create figure
plt.figure(figsize=(8, 8), dpi=150)
# Create subplots
for i in range(4):
    ax = plt.subplot(2, 2, i + 1, polar=True)
    ax.plot(angles, values[i])
    ax.set_yticks([1, 2, 3, 4, 5])
    # Drop the duplicated closing angle so the tick count matches the labels
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(attributes)
    ax.set_title(employees[i], fontsize=14, color='r')
# Improve the layout
plt.tight_layout()
# Show plot
plt.show()
GRIDSPEC
matplotlib.gridspec.GridSpec(nrows, ncols)
specifies the geometry of the grid in which a subplot will be
placed.
Example:
…
gs = matplotlib.gridspec.GridSpec(3, 4)
ax1.plot(series[0])
ax2.plot(series[1])
ax3.plot(series[2])
ax4.plot(series[3])
plt.tight_layout()
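The snippet above omits how ax1 through ax4 were placed on the grid; a self-contained sketch in which the grid spans are illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec

np.random.seed(0)
series = [np.random.rand(10) for _ in range(4)]  # stand-in data

fig = plt.figure()
gs = matplotlib.gridspec.GridSpec(3, 4)

# Slices of the GridSpec decide which grid cells each Axes occupies
ax1 = fig.add_subplot(gs[:2, :2])   # top-left block
ax2 = fig.add_subplot(gs[:2, 2:])   # top-right block
ax3 = fig.add_subplot(gs[2, :2])    # bottom-left row
ax4 = fig.add_subplot(gs[2, 2:])    # bottom-right row

for ax, s in zip((ax1, ax2, ax3, ax4), series):
    ax.plot(s)
plt.tight_layout()
```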
Note:
The solution for this activity can be found on page 253.
Images
In case you want to include images in your visualizations or in
case you are working with image data, Matplotlib offers
several functions to deal with images. In this section, we will
show you how to load, save, and plot images with Matplotlib.
Note
The images that are used in this topic are from
https://unsplash.com/.
BASIC IMAGE
OPERATIONS
Following are the basic operations that are used when working
with images:
Loading Images
Saving Images
plt.imshow(img, cmap='jet')
plt.colorbar()
for i in range(2):
    axes[i].imshow(imgs[i])

for i in range(2):
    axes[i].imshow(imgs[i])
    axes[i].set_xticks([])
    axes[i].set_yticks([])
    axes[i].set_xlabel(labels[i])
…
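A self-contained version of the loop above, with synthetic random images standing in for loaded files:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
# Two synthetic 32x32 RGB "images" standing in for loaded files
imgs = [np.random.rand(32, 32, 3) for _ in range(2)]
labels = ['Image A', 'Image B']   # hypothetical labels

fig, axes = plt.subplots(1, 2)
for i in range(2):
    axes[i].imshow(imgs[i])
    # Pixel indices are rarely informative, so drop the ticks
    axes[i].set_xticks([])
    axes[i].set_yticks([])
    axes[i].set_xlabel(labels[i])
```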
Note:
The solution for this activity can be found on page 254.
Writing Mathematical
Expressions
In case you need to write mathematical expressions within the
code, Matplotlib supports TeX. You can use it in any text by
placing your mathematical expression in a pair of dollar signs.
There is no need to have TeX installed since Matplotlib comes
with its own parser.
plt.xlabel('$x$')
plt.ylabel(r'$\cos(x)$')
TeX examples:
'$\alpha_i>\beta_i$' produces the subscripted inequality α_i > β_i
'$\sum_{i=0}^\infty x_i$' produces the sum of x_i over i from 0 to ∞
'$\sqrt[3]{8}$' produces the cube root of 8
'$\frac{3 - \frac{x}{2}}{5}$' produces the nested fraction (3 - x/2)/5
Summary
In this chapter, we provided a detailed introduction to
Matplotlib, one of the most popular visualization libraries for
Python. We started off with the basics of pyplot and its
operations, and then followed up with a deep insight into the
numerous possibilities that help to enrich visualizations with
text. Using practical examples, this chapter covered the most
popular plotting functions that Matplotlib offers out of the box,
including comparison charts, composition, and distribution
plots. This chapter is rounded off with how to visualize images
and write mathematical expressions.
Introduction
Unlike Matplotlib, Seaborn is not a standalone Python library.
It is built on top of Matplotlib and provides a higher-level
abstraction to make visually appealing statistical visualizations.
A neat feature of Seaborn is the ability to integrate with
DataFrames from the pandas library.
With Seaborn, we attempt to make visualization a central part
of data exploration and understanding. Internally, Seaborn
operates on DataFrames and arrays that contain the complete
dataset. This enables it to perform semantic mappings and
statistical aggregations that are essential for displaying
informative visualizations. Seaborn can also be solely used to
change the style and appearance of Matplotlib visualizations.
Dataset-oriented interface
ADVANTAGES OF
SEABORN
Seaborn is built on top of Matplotlib, and also addresses some
of the main pain points of working with Matplotlib.
import pandas as pd
import seaborn as sns
sns.set(style="ticks")
data = pd.read_csv("data/salary.csv")
sns.relplot(x="Salary", y="Age", hue="Education", style="Education", col="Gender", data=data)
Controlling Figure
Aesthetics
As we mentioned previously, Matplotlib is highly
customizable. But this also has the effect that it is difficult to
know what settings to tweak to achieve a visually appealing
plot. In contrast, Seaborn provides several customized themes
and a high-level interface for controlling the appearance of
Matplotlib figures.
%matplotlib inline
plt.figure()
plt.show()
%matplotlib inline
sns.set()
plt.figure()
plt.legend()
plt.show()
SEABORN FIGURE
STYLES
To control the style, Seaborn provides two methods:
set_style(style, [rc]) and axes_style(style,
[rc]).
Parameters:
Here is an example:
%matplotlib inline
sns.set_style("whitegrid")
plt.figure()
plt.legend()
plt.show()
Here is an example:
%matplotlib inline
sns.set()
plt.figure()
with sns.axes_style('dark'):
plt.legend()
plt.show()
seaborn.despine(fig=None, ax=None,
top=True, right=True, left=False,
bottom=False, offset=None, trim=False)
removes the top and right spines from the plot.
%matplotlib inline
sns.set_style("white")
plt.figure()
sns.despine()
plt.legend()
plt.show()
CONTEXTS
A separate set of parameters controls the scale of plot elements.
This is a handy way to use the same code to create plots that
are suited for use in contexts where larger or smaller plots are
necessary. To control the context, two functions can be used.
seaborn.set_context(context, [font_scale],
[rc]) sets the plotting context parameters. This does not
change the overall style of the plot, but affects things such as
the size of the labels, lines, and so on. The base context is
notebook, and the other contexts are paper, talk, and
poster, which are versions of the notebook parameters
scaled by 0.8, 1.3, and 1.6, respectively.
context: A dictionary of
parameters or the name of one of
the following preconfigured sets:
paper, notebook, talk, or poster
%matplotlib inline
sns.set_context("poster")
plt.figure()
plt.legend()
plt.show()
The preceding code generates the following output:
seaborn.plotting_context(context,
[font_scale], [rc]) returns a parameter dictionary to
scale elements of the Figure. This function can be used with a
statement to temporarily change the context parameters.
context: A dictionary of
parameters or the name of one of
the following preconfigured sets:
paper, notebook, talk, or poster
Note:
The solution for this activity can be found on page 255.
Color Palettes
Color is a very important factor for your visualization. Color
can reveal patterns in the data if used effectively or hide
patterns if used poorly. Seaborn makes it easy to select and use
color palettes that are suited to your task. The
color_palette() function provides an interface for many
of the possible ways to generate colors.
seaborn.color_palette([palette],
[n_colors], [desat]) returns a list of colors, thus
defining a color palette.
You can set the palette for all plots with set_palette().
This function accepts the same arguments as
color_palette(). In the following sections, we will
explain how color palettes are divided into different groups.
CATEGORICAL COLOR
PALETTES
Categorical palettes are best for distinguishing discrete data
that does not have an inherent ordering. There are six default
themes in Seaborn: deep, muted, bright, pastel, dark,
and colorblind. The code and output for each theme is
provided in the following code:
import seaborn as sns
palette1 = sns.color_palette("deep")
sns.palplot(palette1)
palette2 = sns.color_palette("muted")
sns.palplot(palette2)
palette3 = sns.color_palette("bright")
sns.palplot(palette3)
palette4 = sns.color_palette("pastel")
sns.palplot(palette4)
palette5 = sns.color_palette("dark")
sns.palplot(palette5)
palette6 = sns.color_palette("colorblind")
sns.palplot(palette6)
SEQUENTIAL COLOR
PALETTES
Sequential color palettes are appropriate when the data ranges
from relatively low or uninteresting values to relatively high or
interesting values. The following code snippets, along with
their respective outputs, give us a better insight into sequential
color palettes:
custom_palette2 = sns.light_palette("brown")
sns.palplot(custom_palette2)
custom_palette3 = sns.light_palette("brown", reverse=True)
sns.palplot(custom_palette3)
DIVERGING COLOR
PALETTES
Diverging color palettes are used for data that has a
well-defined midpoint, with emphasis placed on both high and
low values. For example, if you are plotting population
changes for a particular region from some baseline
population, it is best to use diverging colormaps to show the
relative increase and decrease. The following code snippet
and output provide a better understanding of diverging
palettes, wherein we use the coolwarm template, which is
built into Matplotlib:
custom_palette4 = sns.color_palette("coolwarm", 7)
sns.palplot(custom_palette4)
custom_palette5 = sns.diverging_palette(440, 40, n=7)
sns.palplot(custom_palette5)
Note:
The solution for this activity can be found on page 257.
Interesting Plots in
Seaborn
In the previous chapter, we discussed various plots in
Matplotlib, but there are still a few visualizations left that we
want to discuss.
BAR PLOTS
In the last chapter, we already explained how to create bar plots
with Matplotlib. Creating bar plots with subgroups was quite
tedious, but Seaborn offers a very convenient way to create
various bar plots. In Seaborn, bar plots can also be used to
represent estimates of central tendency with the height of each
rectangle and to indicate the uncertainty around that estimate
using error bars.
import pandas as pd
import seaborn as sns
data = pd.read_csv("data/salary.csv")
sns.set(style="whitegrid")
sns.barplot(x="Education", y="Salary", hue="District", data=data)
Note:
The solution for this activity can be found on page 258.
KERNEL DENSITY
ESTIMATION
It is often useful to visualize how variables of a dataset are
distributed. Seaborn offers handy functions to examine
univariate and bivariate distributions. One possible way to look
at a univariate distribution in Seaborn is by using the
distplot() function. This will draw a histogram and fit a
kernel density estimate (KDE), as illustrated in the following
example:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
x = np.random.normal(size=50)
sns.distplot(x)
sns.kdeplot(x, shade=True)
PLOTTING BIVARIATE
DISTRIBUTIONS
For visualizing bivariate distributions, we will introduce
three different plots. The first two plots use the
jointplot() function, which creates a multi-panel figure
that shows both the joint relationship between both variables
and the corresponding marginal distributions.
import pandas as pd
import seaborn as sns
data = pd.read_csv("data/salary.csv")
sns.set(style="white")
sns.jointplot(x="Salary", y="Age", data=data)
# subdata: a filtered subset of the data (the filtering step is omitted here)
sns.jointplot('Salary', 'Age', data=subdata, kind='kde', xlim=(0, 500000), ylim=(0, 100))
VISUALIZING PAIRWISE
RELATIONSHIPS
For visualizing multiple pairwise bivariate distributions in a
dataset, Seaborn offers the pairplot() function. This
function creates a matrix where off-diagonal elements visualize
the relationship between each pair of variables and the
diagonal elements show the marginal distributions.
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
mydata = pd.read_csv("data/basic_details.csv")
sns.set(style="ticks", color_codes=True)
g = sns.pairplot(mydata, hue="Groups")
VIOLIN PLOTS
A different approach to visualizing statistical measures is by
using violin plots. They combine box plots with the kernel
density estimation procedure that we described previously. It
provides a richer description of the variable's distribution.
Additionally, the quartile and whisker values from the box plot
are shown inside the violin.
import pandas as pd
import seaborn as sns
data = pd.read_csv("data/salary.csv")
sns.set(style="whitegrid")
sns.violinplot('Education', 'Salary', hue='Gender', data=data, split=True, cut=0)
Note:
The solution for this activity can be found on page 259.
Multi-Plots in Seaborn
In the previous topic, we introduced a multi-plot, namely the
pair plot. In this topic, we want to talk about a different way to
create flexible multi-plots.
FACETGRID
The FacetGrid is useful for visualizing a certain plot for
multiple variables separately. A FacetGrid can be drawn with
up to three dimensions: row, col, and hue. The first two
have the obvious correspondence to the rows and columns of
an array. The hue is the third dimension and shown with
different colors. The FacetGrid class has to be initialized
with a DataFrame and the names of the variables that will form
the row, column, or hue dimensions of the grid. These variables
should be categorical or discrete.
Initializing the grid does not draw anything on them yet. For
visualizing data on this grid, the FacetGrid.map() method
has to be used. You can provide any plotting function and the
name(s) of the variable(s) in the data frame to plot.
import pandas as pd
import seaborn as sns
data = pd.read_csv("data/salary.csv")
# subdata: a filtered subset of the data (the filtering step is omitted here)
g = sns.FacetGrid(subdata, col='District')
Note:
The solution for this activity can be found on page 261.
Regression Plots
Many datasets contain multiple quantitative variables, and the
goal is to find a relationship among those variables. We
previously mentioned a few functions that show the joint
distribution of two variables. It can be helpful to estimate
relationships between two variables. We will only cover linear
regression in this topic; however, Seaborn provides a wider
range of regression functionality if needed.
To visualize linear relationships, determined through linear
regression, the regplot() function is offered by Seaborn.
The following code snippet gives a simple example:
import numpy as np
import seaborn as sns
x = np.arange(100)
y = x + np.random.normal(0, 5, size=100)
sns.regplot(x, y)
Note:
The solution for this activity can be found on page 263.
Squarify
At this point, we will briefly talk about tree maps. Tree maps
display hierarchical data as a set of nested rectangles. Each
group is represented by a rectangle whose area is
proportional to its value. Using color schemes, it is possible to
represent hierarchies: groups, subgroups, and so on. Compared
to pie charts, tree maps efficiently use space. Matplotlib and
Seaborn do not offer tree maps, and so the Squarify library
that is built on top of Matplotlib is used. Seaborn is a great
addition for creating color palettes.
The following code snippet is a basic tree map example. It
requires the Squarify library:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
colors = sns.light_palette("brown", 4)
# The squarify.plot() call with the sizes and labels is omitted in the source
plt.axis("off")
plt.show()
Note:
The solution for this activity can be found on page 264.
Summary
In this chapter, we demonstrated how Seaborn helps create
visually appealing figures. We discussed various options for
controlling figure aesthetics, such as figure style, controlling
spines, and setting the context of visualizations. We talked
about color palettes in detail. Further visualizations were
introduced for visualizing univariate and bivariate
distributions. Moreover, we discussed FacetGrids, which can
be used for creating multi-plots, and regression plots as a way
to analyze the relationships between two variables. Finally, we
discussed the Squarify library, which is used to create tree
maps. In the next chapter, we will show you how to visualize
geospatial data in various ways by using the Geoplotlib library.
Chapter 5
Plotting Geospatial
Data
Learning Objectives
By the end of this chapter, you will be able to:
Introduction
Geoplotlib is an open source Python library for geospatial data
visualizations. It has a wide range of geographical
visualizations and supports hardware acceleration. It also
provides performant rendering even for large datasets with
millions of data points. As discussed in the earlier chapters,
Matplotlib provides ways to visualize geographical data.
However, Matplotlib is not designed for this task, as its
interfaces for maps are complex and inconvenient to use.
Matplotlib also restricts the ways in which geographical data
can be displayed. The Basemap and Cartopy libraries enable
you to plot on a world map, but these packages do not support
drawing on map tiles.
Note
For a better understanding of the available features of
Geoplotlib, you can visit the following link:
https://github.com/andrea-cuttone/geoplotlib/wiki/User-Guide.
import geoplotlib
from geoplotlib.utils import read_csv
dataset = read_csv('./data/poaching_points_cleaned.csv')
geoplotlib.dot(dataset)
geoplotlib.show()
Integration: Geoplotlib
visualizations are purely Python-
based. This means that the generic
Python code can be executed and
other libraries such as pandas can
be used for data wrangling
purposes. We can manipulate and
enrich our datasets using pandas
DataFrames and later simply
convert them into a Geoplotlib
DataAccessObject, which we
need for optimum compatibility,
like this:
import pandas as pd
from geoplotlib.utils import DataAccessObject
pd_dataset = pd.read_csv('./data/poaching_points_cleaned.csv')
dataset = DataAccessObject(pd_dataset)
Geoplotlib fully integrates into the
Python ecosystem. This enables us
to even plot geographical data
inline inside our Jupyter
Notebooks. This possibility allows
us to design our visualizations
quickly and iteratively.
Performance: As we mentioned
before, Geoplotlib is able to handle
large amounts of data due to the
usage of NumPy for accelerated
numerical operations and OpenGL
for accelerated graphical rendering.
GEOSPATIAL
VISUALIZATIONS
Choropleth plot, Voronoi tessellation, and Delaunay
triangulation are a few of the geospatial visualizations that
will be used in this chapter. The explanation for each of them is
provided here:
Choropleth Plot
Delaunay triangulation
EXERCISE 6: VISUALIZING
SIMPLE GEOSPATIAL
DATA
In this exercise, we'll be looking at the basic usage of
Geoplotlib's plot methods for DotDensity, Histograms, and
Voronoi diagrams. For this, we will make use of the data of
various poaching incidents that have taken place all over the
world:
# Importing the necessary dependencies
import geoplotlib
from geoplotlib.utils import read_csv
# Loading the dataset
dataset = read_csv('./data/poaching_points_cleaned.csv')
Note
# Looking at the dataset structure
dataset
# The same dataset loaded with pandas
import pandas as pd
pd_dataset = pd.read_csv('./data/poaching_points_cleaned.csv')
pd_dataset.head()
The following figure shows the output of the preceding code:
Note
# Plotting the dataset with dots
geoplotlib.dot(dataset)
geoplotlib.show()
# Plotting a histogram with a bin size of 20
geoplotlib.hist(dataset, binsize=20)
geoplotlib.show()
# Plotting a voronoi map
geoplotlib.voronoi(dataset, cmap='Blues_r', max_area=1e5, alpha=255)
geoplotlib.show()
Note:
The solution for this activity can be found on page 264.
EXERCISE 7:
CHOROPLETH PLOT WITH
GEOJSON DATA
In this exercise, we not only want to work with GeoJSON
data, but also want to see how we can create a choropleth
visualization. They are especially useful for displaying
statistical variables in shaded areas. In our case, the areas will
be the outlines of the states of the USA. Let's create a
choropleth visualization with the given GeoJSON data:
1. Open the Jupyter Notebook exercise07.ipynb from the Lesson05 folder to implement this exercise.
# Importing the necessary dependencies
import json
import geoplotlib
from geoplotlib.colors import ColorMap
from geoplotlib.utils import BoundingBox
# Displaying one of the entries for the states
with open('data/National_Obesity_By_State.geojson') as data:
    dataset = json.load(data)
first_state = dataset.get('features')[0]
first_state['geometry']['coordinates'] = first_state['geometry']['coordinates'][0][0]
print(json.dumps(first_state, indent=4))
Note
with open('data/National_Obesity_By_State.geojson') as data:
    dataset = json.load(data)
states = [feature['properties']['NAME'] for feature in dataset.get('features')]
print(states)
# Plotting the information from the geojson file
geoplotlib.geojson('data/National_Obesity_By_State.geojson')
geoplotlib.show()
# Converting the obesity value into a color
cmap = ColorMap('Reds', alpha=255, levels=40)
def get_color(properties):
    return cmap.to_color(properties['Obesity'], maxvalue=40, scale='lin')
# Our BoundingBox should focus on the USA
geoplotlib.geojson('data/National_Obesity_By_State.geojson', fill=True, color=get_color)
geoplotlib.geojson('data/National_Obesity_By_State.geojson', fill=False, color=[255, 255, 255, 255])
geoplotlib.set_bbox(BoundingBox.USA)
geoplotlib.show()
A new window will open, displaying the country USA with the
areas of its states filled with different shades of red. The darker
areas represent higher obesity percentages.
Note
To give the user some more information for this plot, we could
also use the f_tooltip parameter to provide a tooltip for
each state, thus displaying the name and the percentage of
obese people.
Tile Providers
Geoplotlib supports the usage of different tile providers. This
means that any OpenStreetMap tile server can be used as a
backdrop to our visualization. Some of the popular free tile
providers are Stamen Watercolor, Stamen Toner, Stamen
Toner Lite, and DarkMatter.
geoplotlib.tiles_provider({
    'url': lambda zoom, xtile, ytile:
        'http://a.tile.stamen.com/watercolor/%d/%d/%d.png' % (zoom, xtile, ytile),
    'tiles_dir': 'tiles_dir',
    'attribution': 'Python Data Visualization | Packt'
})
EXERCISE 8: VISUALLY
COMPARING DIFFERENT
TILE PROVIDERS
This quick exercise will teach you how to switch the map tile
provider for your visualizations. Geoplotlib provides mappings
for some of the available and most popular map tiles. However,
we can also provide a custom object that contains the url of
some tile providers:
# Importing the necessary dependencies
import geoplotlib
geoplotlib.show()
geoplotlib.show()
Note
geoplotlib.tiles_provider({
    # the 'url' entry with the tile server URL is omitted in the source
    'tiles_dir': 'custom_tiles',
    'attribution': 'Custom Tiles Provider - Humanitarian map style | Packt Courseware'
})
geoplotlib.show()
The next topic will cover how to create custom layers that can
go far beyond the ones we have described in this book. We'll
look at the basic structure of the BaseLayer class and what it
takes to create a custom layer.
Custom Layers
Now that we have covered the basics of visualizing geospatial
data with the built-in layers, and the methods to change the tile
provider, we will now focus on defining our own custom
layers. Custom layers allow you to create more complex data
visualizations. They also help with adding more interactivity
and animation to them. Creating a custom layer starts by
defining a new class that extends the BaseLayer class that's
provided by Geoplotlib. Besides the __init__ method that
initializes the class level variables, we also have to at least
extend the draw method of the already provided BaseLayer
class.
Note
Since Geoplotlib operates on OpenGL, this process is highly
performant and can even draw complex visualizations quickly.
For more examples on how to create custom layers, visit the
following GitHub repository of Geoplotlib:
https://github.com/andrea-
cuttone/geoplotlib/tree/master/examples.
3. Load the flight_tracking.csv dataset using pandas.
Summary
In this chapter, we covered the basic and advanced concepts
and methods of Geoplotlib. It gave us a quick insight into the
internal processes and how to practically apply the library to
our own problem statements. Most of the time, the built-in
plots should suit your needs pretty well. Once you're interested
in having animated or even interactive visualizations, you will
have to create custom layers that enable those features.
Introduction
Bokeh has been around since 2013, with version 1.0.4 being
released in 2018. It targets modern web browsers to present
interactive visualizations to users rather than static images. The
following are some of the features of Bokeh:
Simple visualizations: Through its different interfaces, Bokeh targets users of many skill levels, providing an API for quick and simple visualizations as well as for more complex and highly customizable ones.
Excellent animated visualizations: It provides high performance and can therefore work on large or even streaming datasets, which makes it the go-to choice for animated visualizations and data analysis.
Inter-visualization interactivity: Since it is web-based, it's easy to combine several plots and create unique and impactful dashboards with visualizations that can be interconnected to create inter-visualization interactivity.
CONCEPTS OF BOKEH
The basic concept of Bokeh, in some ways, is comparable to
that of Matplotlib. In Bokeh, we have a figure as our root
element, which has sub-elements such as a title, axes, and
glyphs. Glyphs have to be added to a figure and can take on
different shapes, such as circles, bars, and triangles, to
represent the data. The following hierarchy shows the different
concepts of Bokeh:
Figure 6.1: Concepts of Bokeh
INTERFACES IN BOKEH
The interface-based approach provides different levels of
complexity, both for users who simply want to create some
basic plots with very few customizable parameters and for
users who want full control over their visualizations and want
to customize every single element of their plots. This layered
approach is divided into two levels:
Plotting: This customizable layer is exposed through the bokeh.plotting interface.
Models: This lower-level layer is exposed through the bokeh.models interface.
OUTPUT
Outputting Bokeh charts is straightforward. There are three
ways this can be done, depending on your needs:
BOKEH SERVER
As we mentioned before, Bokeh creates scene graph JSON
objects that will be interpreted by the BokehJS library to create
the visualization output. This process allows you to have a
unified format for other languages to create the same Bokeh
plots and visualizations, independent of the language used.
PRESENTATION
In Bokeh, presentations help make the visualization more
interactive by using different features such as interactions,
styling, tools, and layouts.
Interactions
Passive interactions are actions that users can take that change
neither the data nor the displayed data. In Bokeh, this is
called the Inspector. As we mentioned before, the inspector
contains attributes such as zooming, panning, and hovering
over data. This tooling allows users to inspect their data
further and potentially get better insights by looking at only a
zoomed-in subset of the visualized data points.
INTEGRATING
Embedding Bokeh visualizations can take two forms, as
follows:
Note
One interesting feature is the to_bokeh method, which
allows you to plot Matplotlib figures with Bokeh without
configuration overhead. Further information about this method
is available at the following link:
https://bokeh.pydata.org/en/0.12.3/docs/user_guide/compat.html.
Note
All the exercises and activities in this chapter are developed
using Jupyter Notebook and Jupyter Lab. The files can be
downloaded from the following link: https://bit.ly/2T3Afn1.
EXERCISE 9: PLOTTING
WITH BOKEH
In this exercise, we will use the higher-level plotting interface,
which is focused on providing a simple interface for quick
visualization creation. Refer to the introduction to check back
on the different interfaces of Bokeh. In this exercise, we will
be using the world_population dataset. This dataset
shows the population of different countries over the years. We
will use the plotting interface to get some insights into the
population densities of Germany and Switzerland:
1. Open the exercise09_solution.ipynb Jupyter Notebook from the Lesson06 folder to implement this exercise. To do that, you need to navigate to the path of this file in the command-line terminal and type in jupyter-lab.
# importing the necessary dependencies
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()
4. Use pandas to load our world_population dataset:
dataset = pd.read_csv('./data/world_population.csv', index_col=0)
# looking at the dataset
dataset.head()
# plotting the population density change in Germany in the given years
plot = figure(title='Population Density of Germany', x_axis_label='Year', y_axis_label='Population Density')
plot.line(years, de_vals, line_width=2, legend='Germany')
show(plot)
ch_vals = [dataset.loc[['Switzerland']][year] for year in years]
plot = figure(title='Population Density of Germany and Switzerland', x_axis_label='Year', y_axis_label='Population Density')
plot.line(years, de_vals, line_width=2, legend='Germany')
plot.line(years, ch_vals, line_width=2, color='orange', legend='Switzerland')
plot.circle(years, ch_vals, size=4, line_color='orange', fill_color='white', legend='Switzerland')
show(plot)
The following figure shows the output of the preceding code:
Figure 6.5: Adding Switzerland to the plot
The following figure shows the
output of the preceding code:
Figure 6.5: Adding Switzerland to the
plot
# plotting both plots in a gridplot so that they are interconnected in terms of viewport
from bokeh.layouts import gridplot
plot_de = figure(title='Population Density of Germany', x_axis_label='Year', y_axis_label='Population Density', plot_height=300)
plot_ch = figure(title='Population Density of Switzerland', x_axis_label='Year', y_axis_label='Population Density', plot_height=300, x_range=plot_de.x_range, y_range=plot_de.y_range)
plot_de.line(years, de_vals, line_width=2)
plot_ch.line(years, ch_vals, line_width=2)
plot = gridplot([[plot_de, plot_ch]])
show(plot)
plot_v = gridplot([[plot_de], [plot_ch]])
show(plot_v)
The following screenshot shows the
output of the preceding code:
Figure 6.7: Using the gridplot method to arrange the
visualizations vertically
EXERCISE 10:
COMPARING THE
PLOTTING AND MODELS
INTERFACES
In this exercise, we want to compare the two interfaces:
plotting and models. We will compare them by creating a
basic plot with the high-level plotting interface and then
recreate this plot by using the lower-level models interface.
This will show us the differences between these two interfaces
and give us a good direction for the later exercises to
understand how to use the models interface:
# importing the necessary dependencies
import numpy as np
import pandas as pd
from bokeh.io import output_notebook
output_notebook()
dataset = pd.read_csv('./data/world_population.csv', index_col=0)
# looking at the dataset
dataset.head()
# importing the plotting dependencies
from bokeh.plotting import figure, show
mean_pop_vals = [np.mean(dataset[year]) for year in years]
jp_vals = [dataset.loc[['Japan']][year] for year in years]
plot = figure(title='Global Mean Population Density compared to Japan', x_axis_label='Year', y_axis_label='Population Density')
plot.line(years, mean_pop_vals, line_width=2, legend='Global Mean')
plot.cross(years, jp_vals, legend='Japan', line_color='red')
show(plot)
from bokeh.models.grids import Grid
from bokeh.models.plots import Plot
from bokeh.models.axes import LinearAxis
from bokeh.models.ranges import Range1d
from bokeh.models.glyphs import Line, Cross
from bokeh.models.sources import ColumnDataSource
from bokeh.models.tickers import SingleIntervalTicker, YearsTicker
from bokeh.models.renderers import GlyphRenderer
from bokeh.models.annotations import Title, Legend, LegendItem
extracted_mean_pop_vals = [val for i, val in enumerate(mean_pop_vals) if i not in [0, len(mean_pop_vals) - 1]]
extracted_jp_vals = [jp_val['Japan'] for i, jp_val in enumerate(jp_vals) if i not in [0, len(jp_vals) - 1]]
min_pop_density = min(extracted_mean_pop_vals)
min_jp_density = min(extracted_jp_vals)
min_y = int(min(min_pop_density, min_jp_density))
max_pop_density = max(extracted_mean_pop_vals)
max_jp_density = max(extracted_jp_vals)
max_y = int(max(max_jp_density, max_pop_density))
xdr = Range1d(int(years[0]), int(years[-1]))
ydr = Range1d(min_y, max_y)
axis_def = dict(axis_line_color='#222222', axis_line_width=1, major_tick_line_color='#222222', major_label_text_color='#222222', major_tick_line_width=1)
x_axis = LinearAxis(ticker=SingleIntervalTicker(interval=10), axis_label='Year', **axis_def)
y_axis = LinearAxis(ticker=SingleIntervalTicker(interval=50), axis_label='Population Density', **axis_def)
title = Title(align='left', text='Global Mean Population Density compared to Japan')
plot = Plot(x_range=xdr, y_range=ydr, plot_width=650, plot_height=600, title=title)
show(plot)
show(plot)
line_source = ColumnDataSource(dict(x=years, y=mean_pop_vals))
line_glyph = Line(x='x', y='y', line_color='#2678b2', line_width=2)
cross_source = ColumnDataSource(dict(x=years, y=jp_vals))
cross_glyph = Cross(x='x', y='y', line_color='#fc1d26')
plot.add_layout(x_axis, 'below')
plot.add_layout(y_axis, 'left')
line_renderer = plot.add_glyph(line_source, line_glyph)
cross_renderer = plot.add_glyph(cross_source, cross_glyph)
show(plot)
legend_items = [LegendItem(label='Global Mean', renderers=[line_renderer]), LegendItem(label='Japan', renderers=[cross_renderer])]
legend = Legend(items=legend_items, location='top_right')
x_grid = Grid(dimension=0, ticker=x_axis.ticker)
y_grid = Grid(dimension=1, ticker=y_axis.ticker)
plot.add_layout(legend)
plot.add_layout(x_grid)
plot.add_layout(y_grid)
show(plot)
Adding Widgets
One of the most powerful features of Bokeh is its ability to use
widgets to interactively change the data that's displayed in a
visualization. To understand the importance of interactivity in
your visualizations, imagine a static visualization about
stock prices that only shows data for the last year. If this is what
you specifically searched for, it's suitable enough, but if you're
interested in seeing the current year, or in visually comparing it
to recent years, those plots won't work and will create additional
work, since you'd have to create one for every year. Compare
this to a simple plot that lets the user select the desired date
range, and the advantages are clear. There are endless
options for combining widgets to tell your story. You can guide
the user by restricting values and only displaying what you
want them to see. Developing a story behind your visualization
is very important, and doing so is much easier if the user has
ways of interacting with the data.
1. Open the exercise11_solution.ipynb Jupyter Notebook from the Lesson06 folder to implement this exercise. Since we need to use Jupyter Notebook in this example, we will type in the following at the command line: jupyter notebook.
# importing the necessary dependencies
import pandas as pd
4. Again, we want to display our plots inside a Jupyter Notebook, so we have to import and call the output_notebook method from the io interface of Bokeh:
from bokeh.io import output_notebook
output_notebook()
dataset = pd.read_csv('./data/stock_prices.csv')
# looking at the dataset
dataset.head()
The following screenshot shows the output of the preceding code:
# method to shorten a full timestamp to the year-month-day format
from datetime import datetime
def shorten_time_stamp(timestamp):
    shortened = timestamp[0]
    if len(shortened) > 10:
        parsed_date = datetime.strptime(shortened, '%Y-%m-%d %H:%M:%S')
        shortened = datetime.strftime(parsed_date, '%Y-%m-%d')
    return shortened
dataset['short_date'] = dataset.apply(lambda x: shorten_time_stamp(x), axis=1)
# looking at the dataset with shortened date
dataset.head()
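As a self-contained illustration of the same idea, here is a minimal sketch on a toy DataFrame; the column name and sample values are made up for this example and are not part of the stock_prices dataset:

```python
from datetime import datetime

import pandas as pd

def shorten(ts):
    """Truncate a full timestamp string to its date part."""
    if len(ts) > 10:
        parsed = datetime.strptime(ts, '%Y-%m-%d %H:%M:%S')
        ts = datetime.strftime(parsed, '%Y-%m-%d')
    return ts

# toy data: one full timestamp and one already-short date
df = pd.DataFrame({'date': ['2016-06-01 14:30:00', '2016-06-02']})
df['short_date'] = df['date'].apply(shorten)
print(df['short_date'].tolist())  # ['2016-06-01', '2016-06-02']
```

The `len(...) > 10` check simply skips values that are already in the ten-character `YYYY-MM-DD` form.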
# importing the widgets
from ipywidgets import interact, interact_manual, IntSlider
# creating a checkbox
@interact(Value=False)
def checkbox(Value=False):
    print(Value)
Note
@interact() is called a decorator; it wraps the annotated method into the interact component. This allows us to display and react to changes in the widget's value. The method will be executed every time the value changes.
# creating a dropdown
options = ['Option1', 'Option2', 'Option3', 'Option4']
@interact(Value=options)
def dropdown(Value=options[0]):
    print(Value)
# creating a text input
@interact(Value='Input Text')
def text_input(Value):
    print(Value)
# multiple widgets with default layout
options = ['Option1', 'Option2', 'Option3', 'Option4']
@interact(Select=options, Display=False)
def uif(Select, Display):
    print(Select, Display)
# creating an int slider with dynamic updates
@interact(Value=(0, 100))
def slider(Value=0):
    print(Value)
# creating an int slider that only triggers on mouse release
slider = IntSlider(min=0, max=100, continuous_update=False)
@interact(Value=slider)
def slider(Value=0.0):
    print(Value)
# creating a float slider with 0.5 steps and a manual update trigger
@interact_manual(Value=(0.0, 100.0, 0.5))
def slider(Value=0.0):
    print(Value)
Note
Compared to the previous cells, this one contains the
interact_manual decorator instead of interact. This adds
an execution button that triggers the update of the value,
instead of triggering on every change. This can be
really useful when working with larger datasets, where the
recalculation time would be long. Because of this, you don't
want to trigger the execution for every small step, but only once
you have selected the right value.
# importing the necessary dependencies
from bokeh.models.widgets import Panel, Tabs
from bokeh.plotting import figure, show
def get_plot(stock):
    stock_name = stock['symbol'].unique()[0]
    line_plot = figure(title='Stock prices', x_axis_label='Date', x_range=stock['short_date'], y_axis_label='Price in $USD')
    line_plot.line(stock['short_date'], stock['high'], legend=stock_name)
    line_plot.xaxis.major_label_orientation = 1
    circle_plot = figure(title='Stock prices', x_axis_label='Date', x_range=stock['short_date'], y_axis_label='Price in $USD')
    circle_plot.circle(stock['short_date'], stock['high'], legend=stock_name)
    circle_plot.xaxis.major_label_orientation = 1
    line_tab = Panel(child=line_plot, title='Line')
    circle_tab = Panel(child=circle_plot, title='Circles')
    tabs = Tabs(tabs=[line_tab, circle_tab])
    return tabs
stock_names = dataset['symbol'].unique()
# creating the dropdown interaction and building the plot based on selection
@interact(Stock=stock_names)
def get_stock_for(Stock='AAPL'):
    stock = dataset[dataset['symbol'] == Stock][:25]
    show(get_plot(stock))
Note
We can already see that each date is displayed on the x-axis. If
we want to display a bigger time range, we have to customize
the ticks on our x-axis. This can be done using ticker objects.
Note
If you want to learn more about using widgets and which
widgets can be used in Jupyter, you can refer to these links:
https://bit.ly/2Sx9txZ and https://bit.ly/2T4FcM1.
Summary
In this chapter, we have looked at another option for creating
visualizations with a whole new focus: web-based Bokeh plots.
We also discovered ways in which we can make our
visualizations more interactive and really give the user the
chance to explore data in a whole different way. As we
mentioned in the first part of this chapter, Bokeh is a
comparatively new tool that empowers developers to use their
favorite language to create easily portable visualizations for the
web. After working with Matplotlib, Seaborn, geoplotlib, and
Bokeh, we can see some common interfaces and similar ways
to work with those libraries. After understanding the tools that
are covered in this book, it will be simple to understand new
plotting tools.
Introduction
To consolidate what we have learned, we will provide you with
three sophisticated activities. Each activity uses one of the
libraries that we have covered in this book, and each uses a
bigger dataset than those we have worked with before, which
will prepare you for larger datasets.
Note
All activities will be developed in the Jupyter Notebook or
Jupyter Lab. Please download the GitHub repository with all
the prepared templates from https://bit.ly/2SswjqE.
ACTIVITY 30:
IMPLEMENTING
MATPLOTLIB AND
SEABORN ON NEW YORK
CITY DATABASE
In this activity, we will visualize data about New York City
(NYC) and compare it to the state of New York and the United
States (US). The American Community Survey (ACS) Public-
Use Microdata Samples (PUMS) dataset (one-year estimate
from 2017) from https://www.census.gov/programs-
surveys/acs/technical-
documentation/pums/documentation.2017.html is used. For
this activity, you can use either Matplotlib or Seaborn, or a
combination of both.
# PUMA ranges
manhattan = [3801, 3810]
staten_island = [3901, 3903]
brooklyn = [4001, 4018]
nyc = [bronx[0], queens[1]]  # bronx and queens hold the corresponding PUMA ranges
# Function for a 'weighted' median
def weighted_frequency(values, weights):
    weighted_values = []
    # repeat each value according to its weight
    for value, weight in zip(values, weights):
        weighted_values.extend(np.repeat(value, weight))
    return weighted_values

def weighted_median(values, weights):
    return np.median(weighted_frequency(values, weights))
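To check what the weighted median computes, it helps to run the two functions on a small, hand-checkable example; the numbers below are made up for illustration:

```python
import numpy as np

def weighted_frequency(values, weights):
    # repeat each value as many times as its integer weight
    weighted_values = []
    for value, weight in zip(values, weights):
        weighted_values.extend(np.repeat(value, weight))
    return weighted_values

def weighted_median(values, weights):
    return np.median(weighted_frequency(values, weights))

# 30 has weight 2, so the expanded list is [10, 20, 30, 30]
print(weighted_median([10, 20, 30], [1, 1, 2]))  # 25.0
```

Expanding each value by its weight and then taking the plain median is what makes the result a population-weighted median rather than a median over the raw rows.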
occ_categories = ['Management,\nBusiness,\nScience,\nand Arts\nOccupations',
                  'Service\nOccupations',
                  'Sales and\nOffice\nOccupations',
                  'Natural Resources,\nConstruction,\nand Maintenance\nOccupations',
                  'Production,\nTransportation,\nand Material Moving\nOccupations']
occ_ranges = {'Management, Business, Science, and Arts Occupations': [10, 3540],
              'Service Occupations': [3600, 4650],
              'Production, Transportation, and Material Moving Occupations': [7700, 9750]}
occ_subcategories = {'Management,\nBusiness,\nand Financial': [10, 950],
                     'Computer, Engineering,\nand Science': [1000, 1965],
                     'Education,\nLegal,\nCommunity Service,\nArts,\nand Media': [2000, 2960],
                     'Healthcare\nPractitioners\nand\nTechnical': [3000, 3540],
                     'Service': [3600, 4650],
                     'Sales\nand Related': [4700, 4965],
                     'Office\nand Administrative\nSupport': [5000, 5940],
                     'Construction\nand Extraction': [6200, 6940],
                     'Installation,\nMaintenance,\nand Repair': [7000, 7630],
                     'Production': [7700, 8965],
                     'Transportation\nand Material\nMoving': [9000, 9750]}
BOKEH
Stock price data is one of the most interesting types of data for
many people. When thinking about its nature, we can see that it
is highly dynamic and constantly changing. To understand it,
we need high levels of interactivity: not only to look at the
stocks of interest, but also to compare different stocks, see their
traded volume and the highs/lows of the given dates, and see
whether a stock rose or sank the day before.
GEOPLOTLIB
The dataset used in this activity is from Airbnb and is
publicly available online. Accommodation listings have two
predominant features: latitude and longitude. These two features
allow us to create geospatial visualizations that give us a
better understanding of attributes such as the distribution of
accommodations across each city.
3. Load the airbnb_new_york.csv dataset using pandas. If your system is a little slow, use the airbnb_new_york_smaller.csv dataset instead, which contains fewer data points.
8. Create a new DataAccessObject with the newly created subsection of the dataset. Use it to plot a dot map.
Summary
This chapter gave us a short overview and recap of everything
that was covered in this book on the basis of three
extensive practical activities. In Chapter 1, The Importance of
Data Visualization and Data Exploration, we started a
Python library journey that we used as a guide throughout the
whole book. We first talked about the importance of data
and of visualizing it to get meaningful insights, and
gave a quick recap of different statistics concepts. In several
activities, we learned how to import and handle datasets with
NumPy and pandas. In Chapter 2, All You Need to Know about
Plots, we discussed various plots and charts and
which visualizations are best for displaying certain information. We
mentioned the use case, design practices, and practical
examples for each plot type.
# importing the necessary dependencies
import numpy as np
2. Load the normal_distribution.csv dataset by using the genfromtxt method of NumPy:
dataset = np.genfromtxt('./data/normal_distribution.csv', delimiter=',')
dataset[0:2]
np.mean(dataset[2])
np.mean(dataset[:, -1])
np.mean(dataset[0:3, 0:3])
np.median(dataset[-1])
np.median(dataset[:, -3:])
np.median(dataset, axis=1)
# calculate the variance of each column
np.var(dataset, axis=0)
np.var(dataset[-2:, :2])
np.std(dataset)
ACTIVITY 2: INDEXING,
SLICING, SPLITTING, AND
ITERATING
Solution:
Indexing
# importing the necessary dependencies
import numpy as np
2. Load the normal_distribution.csv dataset using NumPy. Make sure that everything works by having a look at the ndarray, like in the previous activity:
dataset = np.genfromtxt('./data/normal_distribution.csv', delimiter=',')
second_row = dataset[1]
np.mean(second_row)
last_row = dataset[-1]
np.mean(last_row)
first_val_first_row = dataset[0][0]
np.mean(first_val_first_row)
last_val_second_last_row = dataset[-2, -1]
np.mean(last_val_second_last_row)
The output of the preceding code is as follows:
Slicing
# slicing an intersection of 4 elements (2x2) of the first two rows and first two columns
subsection_2x2 = dataset[1:3, 1:3]
np.mean(subsection_2x2)
# selecting every second element of the fifth row
every_other_elem = dataset[6, ::2]
np.mean(every_other_elem)
reversed_last_row = dataset[-1, ::-1]
np.mean(reversed_last_row)
Splitting
# splitting up our dataset horizontally on indices one-third and two-thirds
hor_splits = np.hsplit(dataset, (3))
# splitting up our dataset vertically on index 2
ver_splits = np.vsplit(hor_splits[0], (2))
# requested subsection of our dataset which has only half the amount of rows and only a third of the columns
print("Dataset", dataset.shape)
print("Subset", ver_splits[0].shape)
Iterating
curr_index = 0
for x in np.nditer(dataset):
    print(x, curr_index)
    curr_index += 1
for index, value in np.ndenumerate(dataset):
    print(index, value)
ACTIVITY 3: FILTERING,
SORTING, COMBINING,
AND RESHAPING
Solution:
# importing the necessary dependencies
import numpy as np
2. Load the normal_distribution.csv dataset using NumPy. Make sure that everything works by having a look at the ndarray, like in the previous activity:
dataset = np.genfromtxt('./data/normal_distribution.csv', delimiter=',')
Filtering
vals_greater_five = dataset[dataset > 105]
vals_between_90_95 = np.extract((dataset > 90) & (dataset < 95), dataset)
# indices of values that have a delta of less than 1 to 100
rows, cols = np.where(abs(dataset - 100) < 1)
one_away_indices = [[rows[index], cols[index]] for (index, _) in np.ndenumerate(rows)]
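The np.where pattern above is easy to verify on a toy array; the values here are illustrative and are not the normal_distribution data:

```python
import numpy as np

toy = np.array([[99.5, 50.0],
                [100.8, 100.4]])
# positions whose value differs from 100 by less than 1
rows, cols = np.where(abs(toy - 100) < 1)
one_away = [[rows[i], cols[i]] for (i, _) in np.ndenumerate(rows)]
print(one_away)  # [[0, 0], [1, 0], [1, 1]]
```

np.where returns one index array per axis, so zipping rows and cols (here via np.ndenumerate) recovers the (row, column) pairs.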
Sorting
row_sorted = np.sort(dataset)
col_sorted = np.sort(dataset, axis=0)
# indices of positions for each row
index_sorted = np.argsort(dataset)
Combining
9. Use combining features to add the second half of the first column back together, add the second column to our combined dataset, and add the third column to our combined dataset.
# split up dataset from activity03
thirds = np.hsplit(dataset, (3))
halfed_first = np.vsplit(thirds[0], (2))
halfed_first[0]
first_col = np.vstack([halfed_first[0], halfed_first[1]])
first_second_col = np.hstack([first_col, thirds[1]])
full_data = np.hstack([first_second_col, thirds[2]])
Reshaping
# reshaping to a list of values
single_list = np.reshape(dataset, (1, -1))
# reshaping to a matrix with two columns
two_col_dataset = dataset.reshape(-1, 2)
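The effect of these reshape calls is easiest to see on a tiny array (toy data, not the normal_distribution dataset):

```python
import numpy as np

toy = np.arange(6)                   # [0 1 2 3 4 5]
two_cols = toy.reshape(-1, 2)        # -1 lets NumPy infer the row count
print(two_cols.shape)                # (3, 2)
single_list = np.reshape(two_cols, (1, -1))
print(single_list.shape)             # (1, 6)
```

The -1 placeholder asks NumPy to compute that dimension from the total element count, so the same six values can flow between a 3x2 matrix and a single row without copying data.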
ACTIVITY 4: USING
PANDAS TO COMPUTE
THE MEAN, MEDIAN, AND
VARIANCE FOR THE
GIVEN NUMBERS
Solution:
import pandas as pd
dataset = pd.read_csv('./data/world_population.csv', index_col=0)
dataset[0:2]
dataset.iloc[[2]].mean(axis=1)
dataset.iloc[[-1]].mean(axis=1)
dataset.loc[["Germany"]].mean(axis=1)
dataset.iloc[[-1]].median(axis=1)
dataset[-3:].median(axis=1)
dataset.head(10).median(axis=1)
# calculate the variance of the last 5 columns
dataset.var().tail()
import numpy as np
print("pandas", dataset["2015"].mean())
print("numpy", np.mean(dataset["2015"]))
ACTIVITY 5: INDEXING,
SLICING, AND ITERATING
USING PANDAS
Solution:
Indexing
# importing the necessary dependencies
import pandas as pd
dataset = pd.read_csv('./data/world_population.csv', index_col=0)
dataset.loc[["United States"]].head()
dataset.iloc[[-2]]
dataset["2000"].head()
# indexing the population density of India in 2000 (DataFrame)
dataset[["2000"]].loc[["India"]]
# indexing the population density of India in 2000 (Series)
dataset["2000"].loc["India"]
Slicing
# slicing countries of rows 2 to 5
dataset.iloc[1:5]
The output of the preceding code is as follows:
Figure 1.57: The countries in rows 2 to 5
# slicing rows Germany, Singapore, United States, and India
dataset.loc[["Germany", "Singapore", "United States", "India"]]
# slicing a subset of Germany, Singapore, United States, and India
country_list = ["Germany", "Singapore", "United States", "India"]
dataset.loc[country_list][["1970", "1990", "2010"]]
Iterating
for index, row in dataset.iterrows():
    if index == 'Angola':
        break
    print(index, '\n', row[["Country Code", "1970", "1990", "2010"]], '\n')
Filtering
# importing the necessary dependencies
import pandas as pd
dataset = pd.read_csv('./data/world_population.csv', index_col=0)
# filtering columns 1961, 2000, and 2015
dataset.filter(items=["1961", "2000", "2015"]).head()
# filtering countries that had a greater population density than 500 in 2000
dataset[(dataset["2000"] > 500)][["2000"]]
dataset.filter(regex="^2", axis=1).head()
# filtering countries that start with A
dataset.filter(regex="^A", axis=0).head()
# filtering countries that contain the word land
dataset.filter(like="land", axis=0).head()
Sorting
# values sorted by column 1961
dataset.sort_values(by=["1961"])[["1961"]].head(10)
# values sorted by column 2015
dataset.sort_values(by=["2015"])[["2015"]].head(10)
# values sorted by column 2015 in descending order
dataset.sort_values(by=["2015"], ascending=False)[["2015"]].head(10)
Reshaping
# reshaping to 2015 as row and country codes as columns
dataset_2015 = dataset[["Country Code", "2015"]]
dataset_2015.pivot(index=["2015"] * len(dataset_2015), columns="Country Code", values="2015")
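The intent of the pivot call is clearer on a toy frame; the column names and numbers below are illustrative, not the world_population data:

```python
import pandas as pd

df = pd.DataFrame({'year': ['2015', '2015'],
                   'Country Code': ['DEU', 'USA'],
                   'density': [232.1, 33.2]})
# one row per year, one column per country code
pivoted = df.pivot(index='year', columns='Country Code', values='density')
print(pivoted.loc['2015', 'DEU'])  # 232.1
```

pivot turns the long format (one row per country) into a wide format (one column per country), which is the same single-row reshape performed on the full dataset above.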
ACTIVITY 8: ROAD
ACCIDENTS OCCURRING
OVER TWO DECADES
Solution:
Design practices:
ACTIVITY 9:
SMARTPHONE SALES
UNITS
Solution:
1. Suggested response: If we
compare the performance of each
manufacturer in the third and fourth
quarters, we come to the conclusion
that Apple has performed
exceptionally well. Their sales units
have risen at a higher rate from the
third quarter to the fourth quarter
for both 2016 and 2017, when
compared with that of other
manufacturers.
1. Open the activity12_solution.ipynb Jupyter Notebook from the Lesson03 folder to implement this activity.
# Import statements
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
# load datasets
google = pd.read_csv('./data/GOOGL_data.csv')
facebook = pd.read_csv('./data/FB_data.csv')
apple = pd.read_csv('./data/AAPL_data.csv')
amazon = pd.read_csv('./data/AMZN_data.csv')
microsoft = pd.read_csv('./data/MSFT_data.csv')
4. Use Matplotlib to create a line chart that visualizes the closing prices for the past five years (whole data sequence) for all five companies. Add labels, titles, and a legend to make the visualization self-explanatory. Use the plt.grid() function to add a grid to your plot:
# Create figure
plt.figure(figsize=(16, 8), dpi=300)
# Plot data
plt.plot('date', 'close', data=google, label='Google')
plt.plot('date', 'close', data=facebook, label='Facebook')
plt.plot('date', 'close', data=apple, label='Apple')
plt.plot('date', 'close', data=amazon, label='Amazon')
plt.plot('date', 'close', data=microsoft, label='Microsoft')
plt.xticks(np.arange(0, 1260, 40), rotation=70)
plt.yticks(np.arange(0, 1450, 100))
plt.title('Stock trend', fontsize=16)
plt.ylabel('Closing price in $', fontsize=14)
# Add grid
plt.grid()
# Add legend
plt.legend()
# Show plot
plt.show()
1. Open the activity13_solution.ipynb Jupyter Notebook from the Lesson03 folder to implement this activity.
# Import statements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
3. Use pandas to read the data located in the data folder:
# Load dataset
movie_scores = pd.read_csv('./data/movie_scores.csv')
# Create figure
plt.figure(figsize=(10, 5), dpi=300)
pos = np.arange(len(movie_scores['MovieTitle']))
width = 0.3
plt.bar(pos - width / 2, movie_scores['Tomatometer'], width, label='Tomatometer')
plt.bar(pos + width / 2, movie_scores['AudienceScore'], width, label='Audience Score')
plt.xticks(pos, rotation=10)
plt.yticks(np.arange(0, 101, 20))
ax = plt.gca()
ax.set_xticklabels(movie_scores['MovieTitle'])
ax.set_yticklabels(['0%', '20%', '40%', '60%', '80%', '100%'])
ax.yaxis.grid(which='major')
ax.yaxis.grid(which='minor', linestyle='--')
# Add title
plt.title('Movie comparison')
# Add legend
plt.legend()
# Show plot
plt.show()
1. Open the activity14_solution.ipynb Jupyter Notebook from the Lesson03 folder to implement this activity.
# Import statements
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Load dataset
bills = sns.load_dataset('tips')
# day and smoker categories of the tips dataset
days = ['Thur', 'Fri', 'Sat', 'Sun']
smoker = ['Yes', 'No']
days_range = np.arange(len(days))
bills_by_days = [bills[bills['day'] == day] for day in days]
bills_by_days_smoker = [[bills_by_days[day][bills_by_days[day]['smoker'] == s] for s in smoker] for day in days_range]
total_by_days_smoker = [[bills_by_days_smoker[day][s]['total_bill'].sum() for s in range(len(smoker))] for day in days_range]
totals = np.asarray(total_by_days_smoker)
# Create figure
plt.figure(figsize=(10, 5), dpi=300)
plt.bar(days_range, totals[:, 0], label='Smoker')
plt.bar(days_range, totals[:, 1], bottom=totals[:, 0], label='Non-smoker')
# Add legend
plt.legend()
plt.xticks(days_range)
ax = plt.gca()
ax.set_xticklabels(days)
ax.yaxis.grid()
plt.ylabel('Daily total sales in $')
plt.title('Restaurant performance')
# Show plot
plt.show()
1. Open the activity15_solution.ipynb Jupyter Notebook from the Lesson03 folder to implement this activity.
# Import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Load dataset
sales = pd.read_csv('./data/smartphone_sales.csv')
# Create figure
plt.figure(figsize=(10, 6), dpi=300)
# manufacturer columns used in the stacked area chart
labels = ['Apple', 'Samsung', 'Huawei', 'Xiaomi', 'OPPO']
plt.stackplot('Quarter', 'Apple', 'Samsung', 'Huawei', 'Xiaomi', 'OPPO', data=sales, labels=labels)
# Add legend
plt.legend()
plt.xlabel('Quarters')
plt.ylabel('Sales units in thousands')
plt.title('Smartphone sales units')
# Show plot
plt.show()
1. Open the activity16_solution.ipynb Jupyter Notebook from the Lesson03 folder to implement this activity.
# Import statements
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# IQ samples
# Create figure
plt.figure(figsize=(6, 4), dpi=150)
# Create histogram
plt.hist(iq_scores, bins=10)
plt.axvline(x=100, color='r')
plt.axvline(x=115, color='r', linestyle='--')
plt.axvline(x=85, color='r', linestyle='--')
plt.xlabel('IQ score')
plt.ylabel('Frequency')
plt.title('IQ scores for a test group of a hundred adults')
# Show plot
plt.show()
# Create figure
plt.figure(figsize=(6, 4), dpi=150)
# Create box plot
plt.boxplot(iq_scores)
ax = plt.gca()
ax.set_xticklabels(['Test group'])
plt.ylabel('IQ score')
plt.title('IQ scores for a test group of a hundred adults')
# Show plot
plt.show()
# Create figure
plt.figure(figsize=(6, 4), dpi=150)
# Create box plot for each group
plt.boxplot([group_a, group_b, group_c, group_d])
ax = plt.gca()
ax.set_xticklabels(['Group A', 'Group B', 'Group C', 'Group D'])
plt.ylabel('IQ score')
plt.title('IQ scores for different test groups')
# Show plot
plt.show()
1. Open the activity17_solution.ipynb Jupyter Notebook from the Lesson03 folder to implement this activity.
# Import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Load dataset
data = pd.read_csv('./data/anage_data.csv')
# Preprocessing
longevity = 'Maximum longevity (yrs)'
mass = 'Body mass (g)'
data = data[np.isfinite(data[longevity]) & np.isfinite(data[mass])]
# Sort according to class
amphibia = data[data['Class'] == 'Amphibia']
aves = data[data['Class'] == 'Aves']
mammalia = data[data['Class'] == 'Mammalia']
reptilia = data[data['Class'] == 'Reptilia']
# Create figure
plt.figure(figsize=(10, 6), dpi=300)
plt.scatter(amphibia[mass], amphibia[longevity], label='Amphibia')
plt.scatter(aves[mass], aves[longevity], label='Aves')
plt.scatter(mammalia[mass], mammalia[longevity], label='Mammalia')
plt.scatter(reptilia[mass], reptilia[longevity], label='Reptilia')
# Add legend
plt.legend()
# Log scale
ax = plt.gca()
ax.set_xscale('log')
ax.set_yscale('log')
# Add labels
plt.xlabel('Body mass in grams')
plt.ylabel('Maximum longevity in years')
# Show plot
plt.show()
1. Open the activity18_solution.ipynb Jupyter Notebook from the Lesson03 folder to implement this activity.
# Import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Load dataset
data = pd.read_csv('./data/anage_data.csv')
# Preprocessing
longevity = 'Maximum longevity (yrs)'
mass = 'Body mass (g)'
data = data[np.isfinite(data[longevity]) & np.isfinite(data[mass])]
# Sort according to class
aves = data[data['Class'] == 'Aves']
aves = aves[aves[mass] < 20000]
# Create figure
fig = plt.figure(figsize=(8, 8), dpi=150, constrained_layout=True)
# Create gridspec
gs = fig.add_gridspec(4, 4)
# Specify subplots
histx_ax = fig.add_subplot(gs[0, :-1])
histy_ax = fig.add_subplot(gs[1:, -1])
scatter_ax = fig.add_subplot(gs[1:, :-1])
# Create plots
scatter_ax.scatter(aves[mass], aves[longevity])
histx_ax.hist(aves[mass], bins=20, density=True)
histx_ax.set_xticks([])
histy_ax.hist(aves[longevity], bins=20, density=True, orientation='horizontal')
histy_ax.set_yticks([])
plt.ylabel('Maximum longevity in years')
fig.suptitle('Scatter plot with marginal histograms')
# Show plot
plt.show()
1. Open the activity19_solution.ipynb Jupyter Notebook from the Lesson03 folder to implement this activity.
# Import statements
import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline
# Load images
img_filenames = os.listdir('data')
imgs = [mpimg.imread(os.path.join('data', img_filename)) for img_filename in img_filenames]
# Create subplot
fig, axes = plt.subplots(2, 2, figsize=(6, 6), dpi=150)
axes = axes.ravel()
# Specify labels
labels = ['coast', 'beach', 'building', 'city at night']
# Plot images
for i in range(len(imgs)):
    axes[i].imshow(imgs[i])
    axes[i].set_xticks([])
    axes[i].set_yticks([])
    axes[i].set_xlabel(labels[i])
Chapter 4: Simplifying Visualizations Using Seaborn
ACTIVITY 20: COMPARING IQ SCORES FOR DIFFERENT TEST GROUPS BY USING A BOX PLOT
Solution:
1. Open the activity20_solution.ipynb Jupyter Notebook from the Lesson04 folder to implement this activity. Navigate to the path of this file and type in the following at the command-line terminal: jupyter-lab.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydata = pd.read_csv("./data/scores.csv")
group_a = mydata[mydata.columns[0]].tolist()
group_b = mydata[mydata.columns[1]].tolist()
group_c = mydata[mydata.columns[2]].tolist()
group_d = mydata[mydata.columns[3]].tolist()
print(group_a)
print(group_b)
print(group_c)
Data values of Group C are shown in the following figure:
print(group_d)
data = pd.DataFrame({'Groups': ['Group A'] * len(group_a) + ['Group B'] * len(group_b) + ['Group C'] * len(group_c) + ['Group D'] * len(group_d),
                     'IQ score': group_a + group_b + group_c + group_d})
plt.figure(dpi=150)
# Set style
sns.set_style('whitegrid')
# Create boxplot
sns.boxplot('Groups', 'IQ score', data=data)
# Despine
sns.despine(left=True, right=True, top=True)
# Add title
plt.title('IQ scores for different test groups')
# Show plot
plt.show()
From the box plot in Figure 4.8, we can conclude that Group A has higher IQ scores than the other groups.
Let's find the patterns in the flight passengers' data with the help of a heatmap:
1. Open the activity21_solution.ipynb Jupyter Notebook from the Lesson04 folder to implement this activity. Navigate to the path of this file and type in the following at the command-line terminal: jupyter-lab.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydata = pd.read_csv("./data/flight_details.csv")
sns.set()
plt.figure(dpi=150)
sns.heatmap(mydata, cmap=sns.light_palette("orange", as_cmap=True, reverse=True))
plt.title("Flight Passengers from 2001 to 2012")
plt.show()
1. Open the activity22_solution.ipynb Jupyter Notebook from the Lesson04 folder to implement this activity. Navigate to the path of this file and type in the following at the command-line terminal: jupyter-lab.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydata = pd.read_csv("./data/movie_scores.csv")
movie_scores = pd.DataFrame({"Movie Title": list(mydata["MovieTitle"]) * 2,
                             "Score": list(mydata["AudienceScore"]) + list(mydata["Tomatometer"]),
                             "Type": ["Audience Score"] * len(mydata["AudienceScore"]) + ["Tomatometer"] * len(mydata["Tomatometer"])})
plt.figure(figsize=(10, 5), dpi=300)
ax = sns.barplot("Movie Title", "Score", hue="Type", data=movie_scores)
plt.xticks(rotation=10)
# Add title
plt.title("Movie Scores Comparison")
plt.xlabel("Movies")
plt.ylabel("Scores")
# Show plot
plt.show()
1. Open the activity23_solution.ipynb Jupyter Notebook from the Lesson04 folder to implement this activity. Navigate to the path of this file and type in the following at the command-line terminal: jupyter-lab.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydata = pd.read_csv("./data/scores.csv")
group_a = mydata[mydata.columns[0]].tolist()
group_b = mydata[mydata.columns[1]].tolist()
group_c = mydata[mydata.columns[2]].tolist()
group_d = mydata[mydata.columns[3]].tolist()
print(group_a)
print(group_b)
print(group_c)
Figure 4.41: Values of Group C
print(group_d)
data = pd.DataFrame({'Groups': ['Group A'] * len(group_a) + ['Group B'] * len(group_b) + ['Group C'] * len(group_c) + ['Group D'] * len(group_d),
                     'IQ score': group_a + group_b + group_c + group_d})
plt.figure(dpi=150)
# Set style
sns.set_style('whitegrid')
# Create violin plot
sns.violinplot('Groups', 'IQ score', data=data)
# Despine
sns.despine(left=True, right=True, top=True)
# Add title
plt.title('IQ scores for different test groups')
# Show plot
plt.show()
1. Open the activity24_solution.ipynb Jupyter Notebook from the Lesson04 folder to implement this activity. Navigate to the path of this file and type in the following at the command-line terminal: jupyter-lab.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydata = pd.read_csv("./data/youtube.csv")
channels = mydata[mydata.columns[0]].tolist()
subs = mydata[mydata.columns[1]].tolist()
views = mydata[mydata.columns[2]].tolist()
print(channels)
print(subs)
print(views)
data = pd.DataFrame({'YouTube Channels': channels + channels, 'Subscribers in millions': subs + views, 'Type': ['Subscribers'] * len(subs) + ['Views'] * len(views)})
sns.set()
g = sns.FacetGrid(data, col='Type', hue='Type', sharex=False, height=8)
g.map(sns.barplot, 'Subscribers in millions', 'YouTube Channels')
plt.show()
1. Open the activity25_solution.ipynb Jupyter Notebook from the Lesson04 folder to implement this activity. Navigate to the path of this file and type in the following at the command-line terminal: jupyter-lab.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydata = pd.read_csv("./data/anage_data.csv")
longevity = 'Maximum longevity (yrs)'
mass = 'Body mass (g)'
data = mydata[mydata['Class'] == 'Mammalia']
data = data[np.isfinite(data[longevity]) & np.isfinite(data[mass]) & (data[mass] < 200000)]
# Create figure
sns.set()
plt.figure(figsize=(10, 6), dpi=300)
# Show plot
plt.show()
Let's visualize the water usage by using a tree map, which can be created with the help of the Squarify library:
1. Open the activity26_solution.ipynb Jupyter Notebook from the Lesson04 folder to implement this activity. Navigate to the path of this file and type in the following at the command-line terminal: jupyter-lab.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
mydata = pd.read_csv("./data/water_usage.csv")
# Create figure
plt.figure(dpi=200)
labels = mydata['Usage'] + ' (' + mydata['Percentage'].astype('str') + '%)'
squarify.plot(sizes=mydata['Percentage'], label=labels, color=sns.light_palette('green', mydata.shape[0]))
plt.axis('off')
# Add title
plt.title('Water usage')
# Show plot
plt.show()
Chapter 5: Plotting Geospatial Data
ACTIVITY 27: PLOTTING GEOSPATIAL DATA ON A MAP
Solution:
Let's plot the geospatial data on a map and find the densely populated areas for cities in Europe that have a population of more than 100k:
# importing the necessary dependencies
import numpy as np
import pandas as pd
import geoplotlib
3. Load the dataset using pandas:
dataset = pd.read_csv('./data/world_cities_pop.csv', dtype={'Region': str})
dataset.dtypes
dataset.head()
# mapping Latitude to lat and Longitude to lon
dataset['lat'] = dataset['Latitude']
dataset['lon'] = dataset['Longitude']
geoplotlib.dot(dataset)
geoplotlib.show()
# amount of countries and cities
print(len(dataset.groupby(['Country'])), 'Countries')
print(len(dataset), 'Cities')
dataset.groupby(['Country']).size().head(20)
dataset.groupby(['Country']).size().agg('mean')
dataset_with_pop = dataset[(dataset['Population'] > 0)]
print('Full dataset:', len(dataset))
print('Cities with population information:', len(dataset_with_pop))
dataset_with_pop.head()
geoplotlib.dot(dataset_with_pop)
geoplotlib.show()
dataset_100k = dataset_with_pop[(dataset_with_pop['Population'] >= 100_000)]
print('Cities with a population of 100k or more:', len(dataset_100k))
# displaying all cities >= 100k population with a fixed bounding box (WORLD) in a dot density plot
from geoplotlib.utils import BoundingBox
geoplotlib.dot(dataset_100k)
geoplotlib.set_bbox(BoundingBox.WORLD)
geoplotlib.show()
geoplotlib.voronoi(dataset_100k, cmap='hot_r', max_area=1e3, alpha=255)
geoplotlib.show()
dataset_europe = dataset_100k[(dataset_100k['Country'] == 'de') | (dataset_100k['Country'] == 'gb')]
print('Cities in Germany or GB with population >= 100k:', len(dataset_europe))
geoplotlib.delaunay(dataset_europe, cmap='hot_r')
geoplotlib.show()
# importing the necessary dependencies
import pandas as pd
from datetime import datetime
dataset = pd.read_csv('./data/flight_tracking.csv')
dataset.head()
Figure 5.29: First five elements of the dataset
# renaming columns latitude to lat and longitude to lon
dataset = dataset.rename(index=str, columns={"latitude": "lat", "longitude": "lon"})
dataset.head()
def to_epoch(date, time):
    try:
        timestamp = round(datetime.strptime('{} {}'.format(date, time), '%Y/%m/%d %H:%M:%S.%f').timestamp())
        return timestamp
    except ValueError:
        return round(datetime.strptime('2017/09/11 17:02:06.418', '%Y/%m/%d %H:%M:%S.%f').timestamp())

# creating a new column called timestamp with the to_epoch method applied
dataset['timestamp'] = dataset.apply(lambda x: to_epoch(x['date'], x['time']), axis=1)
dataset.head()
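As a quick check of the conversion, to_epoch round-trips through the local timezone; the function is repeated here so the snippet is self-contained, and the sample date is simply the fallback value from the except branch:

```python
from datetime import datetime

def to_epoch(date, time):
    # Parse 'YYYY/MM/DD' + 'HH:MM:SS.fff' into rounded Unix seconds,
    # falling back to a fixed timestamp on malformed input.
    try:
        return round(datetime.strptime('{} {}'.format(date, time),
                                       '%Y/%m/%d %H:%M:%S.%f').timestamp())
    except ValueError:
        return round(datetime.strptime('2017/09/11 17:02:06.418',
                                       '%Y/%m/%d %H:%M:%S.%f').timestamp())

epoch = to_epoch('2017/09/11', '17:02:06.418')
# Converting back recovers the wall-clock time (fractional second rounded away)
print(datetime.fromtimestamp(epoch))  # 2017-09-11 17:02:06
```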
import geoplotlib
from geoplotlib.layers import BaseLayer
from geoplotlib.core import BatchPainter
from geoplotlib.colors import colorbrewer
from geoplotlib.utils import epoch_to_str, BoundingBox

class TrackLayer(BaseLayer):
    def __init__(self, dataset, bbox=BoundingBox.WORLD):
        self.data = dataset
        self.cmap = colorbrewer(self.data['hex_ident'], alpha=200)
        self.time = self.data['timestamp'].min()
        self.painter = BatchPainter()
        self.view = bbox

    def draw(self, proj, mouse_x, mouse_y, ui_manager):
        self.painter = BatchPainter()
        df = self.data.where((self.data['timestamp'] > self.time) & (self.data['timestamp'] <= self.time + 180))
        for element in set(df['hex_ident']):
            grp = df.where(df['hex_ident'] == element)
            self.painter.set_color(self.cmap[element])
            x, y = proj.lonlat_to_screen(grp['lon'], grp['lat'])
            self.painter.points(x, y, 15, rounded=True)
        self.time += 1
        if self.time > self.data['timestamp'].max():
            self.time = self.data['timestamp'].min()
        self.painter.batch_draw()
        ui_manager.info('Current timestamp: {}'.format(epoch_to_str(self.time)))

    def bbox(self):
        return self.view
from geoplotlib.utils import BoundingBox
leeds_bbox = BoundingBox(north=53.8074, west=-3, south=53.7074, east=0)
# displaying our custom layer using add_layer
from geoplotlib.utils import DataAccessObject
data = DataAccessObject(dataset)
geoplotlib.add_layer(TrackLayer(data, bbox=leeds_bbox))
geoplotlib.show()
Congratulations! You've completed the custom layer activity using Geoplotlib. We applied several pre-processing steps to shape the dataset the way we wanted it, and we wrote a custom layer to display spatial data in the temporal space. Our custom layer even has a level of animation. We'll look into this further in the following chapter, about Bokeh.
Chapter 6: Making Things Interactive with Bokeh
ACTIVITY 29: EXTENDING PLOTS WITH WIDGETS
Solution:
1. Open the activity29_solution.ipynb Jupyter Notebook from the Lesson06 folder to implement this activity.
# importing the necessary dependencies
import pandas as pd
from bokeh.io import output_notebook
output_notebook()
dataset = pd.read_csv('./data/olympia2016_athletes.csv')
# looking at the dataset
dataset.head()
# importing the necessary dependencies
from bokeh.plotting import figure, show
# extract countries and group Olympians by country and their sex
countries = dataset['nationality'].unique()
athletes_per_country = dataset.groupby('nationality').size()
medals_per_country = dataset.groupby('nationality')[['gold', 'silver', 'bronze']].sum()
max_medals = medals_per_country.sum(axis=1).max()
max_athletes = athletes_per_country.max()
# setting up the interaction elements
import ipywidgets as widgets
from ipywidgets import interact
max_athletes_slider = widgets.IntSlider(value=max_athletes, min=0, max=max_athletes, step=1, description='Max. Athletes:', continuous_update=False, orientation='vertical', layout={'width': '100px'})
max_medals_slider = widgets.IntSlider(value=max_medals, min=0, max=max_medals, step=1, description='Max. Medals:', continuous_update=False, orientation='horizontal')
# creating the interact method
@interact(max_athletes=max_athletes_slider, max_medals=max_medals_slider)
def get_olympia_stats(max_athletes, max_medals):
    show(get_plot(max_athletes, max_medals))
def get_plot(max_athletes, max_medals):
    filtered_countries = []
    for country in countries:
        if (athletes_per_country[country] <= max_athletes and medals_per_country.loc[country].sum() <= max_medals):
            filtered_countries.append(country)
    data_source = get_datasource(filtered_countries)
    TOOLTIPS = [('Country', '@countries'), ('Num of Athletes', '@y'), ('Gold', '@gold'), ('Silver', '@silver'), ('Bronze', '@bronze')]
    plot = figure(title='Rio Olympics 2016 - Medal comparison', x_axis_label='Number of Medals', y_axis_label='Num of Athletes', plot_width=800, plot_height=500, tooltips=TOOLTIPS)
    plot.circle('x', 'y', source=data_source, size=20, color='color', alpha=0.5)
    return plot
import random
from bokeh.models import ColumnDataSource

def get_random_color():
    return '#%06x' % random.randint(0, 0xFFFFFF)

def get_datasource(filtered_countries):
    return ColumnDataSource(data=dict(
        color=[get_random_color() for _ in filtered_countries],
        countries=filtered_countries,
        gold=[medals_per_country.loc[country]['gold'] for country in filtered_countries],
        silver=[medals_per_country.loc[country]['silver'] for country in filtered_countries],
        bronze=[medals_per_country.loc[country]['bronze'] for country in filtered_countries],
        x=[medals_per_country.loc[country].sum() for country in filtered_countries],
        y=[athletes_per_country.loc[country].sum() for country in filtered_countries]
    ))
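A quick self-contained check of the random-color helper: Bokeh accepts CSS-style color strings, so the sketch below assumes the '#RRGGBB' format (the '#' prefix is an assumption, added here for CSS compatibility):

```python
import random

def get_random_color():
    # Format a random 24-bit integer as a CSS-style '#RRGGBB' hex color.
    return '#%06x' % random.randint(0, 0xFFFFFF)

random.seed(42)
color = get_random_color()
print(color)  # a 7-character string: '#' followed by six lowercase hex digits
```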
# Import statements
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
sns.set()
p_ny = pd.read_csv('./data/pny.csv')
h_ny = pd.read_csv('./data/hny.csv')
# PUMA ranges
bronx = [3701, 3710]
manhatten = [3801, 3810]
staten_island = [3901, 3903]
brooklyn = [4001, 4017]
queens = [4101, 4114]
nyc = [bronx[0], queens[1]]

def puma_filter(data, puma_ranges):
    return data.loc[(data['PUMA'] >= puma_ranges[0]) & (data['PUMA'] <= puma_ranges[1])]
h_bronx = puma_filter(h_ny, bronx)
h_manhatten = puma_filter(h_ny, manhatten)
h_staten_island = puma_filter(h_ny, staten_island)
h_brooklyn = puma_filter(h_ny, brooklyn)
h_queens = puma_filter(h_ny, queens)
p_nyc = puma_filter(p_ny, nyc)
h_nyc = puma_filter(h_ny, nyc)
# Function for a 'weighted' median
def weighted_frequency(values, weights):
    weighted_values = []
    for value, weight in zip(values, weights):
        weighted_values.extend(np.repeat(value, weight))
    return weighted_values

def weighted_median(values, weights):
    return np.median(weighted_frequency(values, weights))
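The idea behind the weighted median is simply to repeat each value by its weight before taking the ordinary median. A pure-Python sketch of the same logic (stdlib only, independent of the NumPy version above):

```python
from statistics import median

def weighted_median(values, weights):
    # Repeat each value 'weight' times, then take the ordinary median.
    repeated = [v for v, w in zip(values, weights) for _ in range(w)]
    return median(repeated)

print(weighted_median([10, 20, 30], [1, 1, 4]))  # expands to [10, 20, 30, 30, 30, 30] -> 30
```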
Lesson07/Activity30/activity30_solution.ipynb
def median_housing_income(data):
//[…]
h_queens_income_median = median_housing_income(h_queens)
occ_categories = ['Management,\nBusiness,\nScience,\nand Arts\nOccupations', 'Service\nOccupations', 'Sales and\nOffice\nOccupations', 'Natural Resources,\nConstruction,\nand Maintenance\nOccupations', 'Production,\nTransportation,\nand Material Moving\nOccupations']
//[…]
wages_female = wage_by_gender_and_occupation(p_nyc, 2)
wage_bins = {'<$10k': [0, 10000], '$10-20k': [10000, 20000], '$20-30k': [20000, 30000], '$30-40k': [30000, 40000], '$40-50k': [40000, 50000], '$50-60k': [50000, 60000], '$60-70k': [60000, 70000], '$70-80k': [70000, 80000], '$80-90k': [80000, 90000], '$90-100k': [90000, 100000], '$100-150k': [100000, 150000], '$150-200k': [150000, 200000], '>$200k': [200000, np.infty]}
//[…]
wages_ny = wage_frequency(p_ny)
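The wage bins map a salary to a label via a simple range lookup. A minimal stand-alone sketch of that lookup (the truncated bin layout and the wage_label helper are illustrative, not the book's code; float('inf') stands in for np.infty):

```python
wage_bins = {'<$10k': [0, 10000], '$10-20k': [10000, 20000],
             '$20-30k': [20000, 30000], '>$30k': [30000, float('inf')]}

def wage_label(salary):
    # Return the label of the first bin whose [lower, upper) range contains salary.
    for label, (lower, upper) in wage_bins.items():
        if lower <= salary < upper:
            return label
    return None

print(wage_label(25000))  # '$20-30k'
```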
# Median household income in the US
us_income_median = 60336
# Median household income
ax1.set_title('Median Household Income', fontsize=14)
//[…]
ax1.set_xlabel('Yearly household income in $')
# Wage by gender in common jobs
ax2.set_title('Wage by Gender for different Job Categories', fontsize=14)
x = np.arange(5) + 1
//[…]
ax2.set_ylabel('Average Salary in $')
# Wage distribution
ax3.set_title('Wage Distribution', fontsize=14)
x = np.arange(len(wages_nyc)) + 1
width = 0.4
//[…]
ax3.vlines(x=9.5, ymin=0, ymax=15, linestyle='--')
# Overall figure
fig.tight_layout()
plt.show()
https://bit.ly/2StchfL
Lesson07/Activity30/activity30_solution.ipynb
occ_subcategories = {'Management,\nBusiness,\nand Financial': [10, 950],
//[..]

def occupation_percentage(data):
    percentages = []
    overall_sum = np.sum(data.loc[(data['OCCP'] >= 10) & (data['OCCP'] <= 9750), ['PWGTP']].values)
    for occ in occ_subcategories.values():
        query = data.loc[(data['OCCP'] >= occ[0]) & (data['OCCP'] <= occ[1]), ['PWGTP']].values
        percentages.append(np.sum(query) / overall_sum)
    return percentages

occ_percentages = occupation_percentage(p_nyc)
# Visualization of tree map
plt.figure(figsize=(16, 6), dpi=300)
//[..]
plt.axis('off')
plt.title('Occupations in New York City', fontsize=24)
plt.show()
https://bit.ly/2StchfL
difficulties = {'Self-care difficulty': 'DDRS', 'Hearing difficulty': 'DEAR', 'Vision difficulty': 'DEYE', 'Independent living difficulty': 'DOUT', 'Ambulatory difficulty': 'DPHY', 'Veteran service connected disability': 'DRATX', 'Cognitive difficulty': 'DREM'}
age_groups = {'<5': [0, 4], '5-11': [5, 11], '12-14': [12, 14], '15-17': [15, 17], '18-24': [18, 24], '25-34': [25, 34],
//[..]

def difficulty_age_array(data):
    array = np.zeros((len(difficulties.values()), len(age_groups.values())))
    for d, diff in enumerate(difficulties.values()):
        for a, age in enumerate(age_groups.values()):
            age_sum = np.sum(data.loc[(data['AGEP'] >= age[0]) & (data['AGEP'] <= age[1]), ['PWGTP']].values)
            query = data.loc[(data['AGEP'] >= age[0]) & (data['AGEP'] <= age[1]) & (data[diff] == 1), ['PWGTP']].values
            array[d, a] = np.sum(query) / age_sum
    return array

array = difficulty_age_array(p_nyc)
# Heatmap
plt.figure(dpi=300)
ax = sns.heatmap(array * 100)
ax.set_yticklabels(difficulties.keys(), rotation=0)
ax.set_xticklabels(age_groups.keys(), rotation=90)
ax.set_xlabel('Age Groups')
ax.set_title('Percentage of NYC population with difficulties', fontsize=14)
plt.show()
# importing the necessary dependencies
import pandas as pd
from datetime import datetime
from bokeh.io import output_notebook
output_notebook()
# looking at the dataset
dataset.head()

def shorten_time_stamp(timestamp):
    shortened = timestamp[0]
    if len(shortened) > 10:
        parsed_date = datetime.strptime(shortened, '%Y-%m-%d %H:%M:%S')
        shortened = datetime.strftime(parsed_date, '%Y-%m-%d')
    return shortened

dataset['short_date'] = dataset.apply(lambda x: shorten_time_stamp(x), axis=1)
# looking at the dataset with shortened date
dataset.head()
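The helper above keeps only the date part of a full timestamp string. A self-contained check, operating on a plain list since the DataFrame rows above are indexed with timestamp[0]:

```python
from datetime import datetime

def shorten_time_stamp(timestamp):
    # timestamp is a sequence whose first element is a date or date-time string;
    # anything longer than 'YYYY-MM-DD' is parsed and reduced to the date part.
    shortened = timestamp[0]
    if len(shortened) > 10:
        parsed_date = datetime.strptime(shortened, '%Y-%m-%d %H:%M:%S')
        shortened = datetime.strftime(parsed_date, '%Y-%m-%d')
    return shortened

print(shorten_time_stamp(['2016-03-01 09:30:00']))  # 2016-03-01
print(shorten_time_stamp(['2016-03-01']))           # already short: 2016-03-01
```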
# importing the necessary dependencies
from bokeh.plotting import figure, show
# extracting the necessary data
stock_names = dataset['symbol'].unique()
dates_2016 = dataset[dataset['short_date'] >= '2016-01-01']['short_date']
unique_dates_2016 = sorted(dates_2016.unique())
value_options = ['open-close', 'volume']
# setting up the interaction elements
drp_1 = widgets.Dropdown(options=stock_names, value='AAPL', description='Compare:')
drp_2 = widgets.Dropdown(options=stock_names, value='AON', description='to:')
5. Then, we need SelectionRangeSlider, which will allow us to select a range of dates from the extracted list of unique 2016 dates. By default, the first 25 dates should be selected, named From-To. Make sure to disable the continuous_update parameter. Adjust the layout width to 500px to make sure that the dates are displayed correctly:
range_slider = widgets.SelectionRangeSlider(options=unique_dates_2016, index=(0, 25), continuous_update=False, description='From-To', layout={'width': '500px'})
value_radio = widgets.RadioButtons(options=value_options, value='open-close', description='Metric')
# creating the interact method
@interact(stock_1=drp_1, stock_2=drp_2, date=range_slider, value=value_radio)
def get_stock_for_2016(stock_1, stock_2, date, value):
    show(get_plot(stock_1, stock_2, date, value))
def add_candle_plot(plot, stock_name, stock_range, color):
    inc_1 = stock_range.close > stock_range.open
    dec_1 = stock_range.open > stock_range.close
    w = 0.5
    plot.segment(stock_range['short_date'], stock_range['high'], stock_range['short_date'], stock_range['low'], color="grey")
    plot.vbar(stock_range['short_date'][inc_1], w, stock_range['high'][inc_1], stock_range['close'][inc_1], fill_color="green", line_color="black", legend=('Mean price of ' + stock_name), muted_alpha=0.2)
    plot.vbar(stock_range['short_date'][dec_1], w, stock_range['high'][dec_1], stock_range['close'][dec_1], fill_color="red", line_color="black", legend=('Mean price of ' + stock_name), muted_alpha=0.2)
    stock_mean_val = stock_range[['high', 'low']].mean(axis=1)
    plot.line(stock_range['short_date'], stock_mean_val, legend=('Mean price of ' + stock_name), muted_alpha=0.2, line_color=color, alpha=0.5)
Note
Lesson07/Activity31/activity31_solution.ipynb
//[..]
plot.xaxis.major_label_orientation = 1
plot.grid.grid_line_alpha = 0.3
if value == 'open-close':
    add_candle_plot(plot, stock_1_name, stock_1_range, 'blue')
    add_candle_plot(plot, stock_2_name, stock_2_range, 'orange')
if value == 'volume':
    plot.line(stock_1_range['short_date'], stock_1_range['volume'], legend=stock_1_name, muted_alpha=0.2)
    plot.line(stock_2_range['short_date'], stock_2_range['volume'], legend=stock_2_name, muted_alpha=0.2, line_color='orange')
plot.legend.click_policy = "mute"
return plot
https://bit.ly/2GRneWR
Note
To make our legend interactive, please take a look at the documentation for the legend feature: https://bokeh.pydata.org/en/latest/docs/user_guide/interaction/legends.html.
Congratulations!
1. Open the activity32_solution.ipynb Jupyter Notebook from the Lesson07 folder to implement this activity. Import NumPy, pandas, and Geoplotlib first:
# importing the necessary dependencies
import numpy as np
import pandas as pd
import geoplotlib
dataset = pd.read_csv('./data/airbnb_new_york.csv')
# dataset = pd.read_csv('./data/airbnb_new_york_smaller.csv')
dataset.head()
# mapping Latitude to lat and Longitude to lon
dataset['lat'] = dataset['latitude']
dataset['lon'] = dataset['longitude']

# convert string of type $<number> to <number> of type float
def convert_to_float(x):
    try:
        value = str.replace(x[1:], ',', '')
        return float(value)
    except:
        return 0.0
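A quick self-contained check of the price parser above: the leading '$' is stripped via x[1:], thousands separators are removed, and anything malformed falls back to 0.0:

```python
def convert_to_float(x):
    # Strip the leading '$', drop comma separators, parse as float;
    # any malformed value falls back to 0.0.
    try:
        value = str.replace(x[1:], ',', '')
        return float(value)
    except:
        return 0.0

print(convert_to_float('$1,200.50'))  # 1200.5
print(convert_to_float('$0.0'))       # 0.0
print(convert_to_float('n/a'))        # 0.0 (fallback)
```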
# create new dollar_price column with the price as a number
dataset['price'] = dataset['price'].fillna('$0.0')
dataset['review_scores_rating'] = dataset['review_scores_rating'].fillna(0.0)
dataset['dollar_price'] = dataset['price'].apply(lambda x: convert_to_float(x))
for col in dataset.columns:
    print('{}\t{}'.format(col, dataset[col][0]))
columns = ['id', 'lat', 'lon', 'dollar_price', 'review_scores_rating']
sub_data = dataset[columns]
sub_data.head()
# import DataAccessObject and create a data object as an instance of that class
from geoplotlib.utils import DataAccessObject
data = DataAccessObject(sub_data)
geoplotlib.dot(data)
geoplotlib.show()
Lesson07/Activity32/activity32_solution.ipynb
# custom layer creation
import pyglet
import geoplotlib
//[..]

class ValueLayer(BaseLayer):
    def __init__(self, dataset, bbox=BoundingBox.WORLD):
        //[..]

    def invalidate(self, proj):
        self.painter = BatchPainter()
        max_val = max(self.data[self.display])
        scale = 'log' if self.display == 'dollar_price' else 'lin'
        for index, id in enumerate(self.data['id']):
            //[..]

    def draw(self, proj, mouse_x, mouse_y, ui_manager):
        # display the ui manager info
        ui_manager.info('Use left and right to switch between the displaying of price and ratings. Currently displaying: {}'.format(self.display))
        self.painter.batch_draw()

    def on_key_release(self, key, modifiers):
        //[..]

    def bbox(self):
        return self.view
https://bit.ly/2VoQveT
from geoplotlib.utils import BoundingBox
ny_bbox = BoundingBox(north=40.897994, west=-73.999040, south=40.595581, east=-73.95040)
# displaying our custom layer using add_layer
geoplotlib.tiles_provider('darkmatter')
geoplotlib.add_layer(ValueLayer(data, bbox=ny_bbox))
geoplotlib.show()
Congratulations!