Machine Learning Projects in Python
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
ISBN 978-0-9997730-2-4
Python Machine Learning Projects
Written by Lisa Tagliaferri, Michelle Morales, Ellie Birbeck, and
Alvin Wan, with editing by Brian Hogan and Mark Drake
1. Foreword
2. Setting Up a Python Programming Environment
3. An Introduction to Machine Learning
4. How To Build a Machine Learning Classifier in Python with Scikit-
learn
5. How To Build a Neural Network to Recognize Handwritten Digits with
TensorFlow
6. Bias-Variance for Deep Reinforcement Learning: How To Build a Bot
for Atari with OpenAI Gym
Foreword
Prerequisites
This tutorial will be based on working with a Linux or Unix-like (*nix)
system and the use of a command line or terminal environment. Both macOS
and Windows (specifically via the PowerShell program) should be able to
achieve similar results.
python3 -V
You’ll receive output in the terminal window that will let you know
the version number. While this number may vary, the output will be
similar to this:
Output
Python 3.7.2
Now that you have pip installed, you can download Python packages
with the following command:
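pip3 install package_name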
Here, package_name can refer to any Python package or library, such
as Django for web development or NumPy for scientific computing. So if
you would like to install NumPy, you can do so with the command pip3
install numpy.
There are a few more packages and development tools to install to
ensure that we have a robust set-up for our programming environment:
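On Ubuntu 18.04, for example, this typically means installing the compiler toolchain and the Python development headers; the exact package names below are an assumption and may differ on your system:
sudo apt install build-essential libssl-dev libffi-dev python3-dev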
Once Python is set up, and pip and other tools are installed, we can set
up a virtual environment for our development projects.
mkdir environments
cd environments
Once you are in the directory where you would like the environments
to live, you can create an environment. You should use the version of
Python that is installed on your machine as the first part of the command
(the output you received when typing python3 -V). If that version was
Python 3.6.3, you can type the following:
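python3.6 -m venv my_env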
If, instead, your computer has Python 3.7.3 installed, use the
following command:
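python3.7 -m venv my_env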
Once you run the appropriate command, you can verify that the
environment is set up by continuing.
Essentially, pyvenv sets up a new directory that contains a few items
which we can view with the ls command:
ls my_env
Output
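bin include lib lib64 pyvenv.cfg share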
Together, these files work to make sure that your projects are isolated
from the broader context of your local machine, so that system files and
project files don’t mix. This is good practice for version control and to
ensure that each of your projects has access to the particular packages
that it needs. Python Wheels, a built-package format for Python that can
speed up your software production by reducing the number of times you
need to compile, will be in the Ubuntu 18.04 share directory.
To use this environment, you need to activate it, which you can achieve
by typing the following command that calls the activate script:
source my_env/bin/activate
Your command prompt will now be prefixed with the name of your
environment, in this case it is called my_env. Depending on what version
of Debian Linux you are running, your prefix may appear somewhat
differently, but the name of your environment in parentheses should be
the first thing you see on your line:
(my_env) sammy@sammy:~/environments$
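To test the environment, create a short program with the nano text editor (the file name hello.py here is just an example):
nano hello.py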
Once the text file opens up in the terminal window we’ll type out our
program:
print("Hello, World!")
Exit nano by typing the CTRL and X keys, and when prompted to save
the file press y.
Once you exit out of nano and return to your shell, let’s run the
program:
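python hello.py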
Output
Hello, World!
Conclusion
At this point you have a Python 3 programming environment set up on
your machine and you can now begin a coding project!
If you would like to learn more about Python, you can download our
free How To Code in Python 3 eBook via do.co/python-book.
In this tutorial, we’ll look into the common machine learning methods
of supervised and unsupervised learning, and common algorithmic
approaches in machine learning, including the k-nearest neighbor
algorithm, decision tree learning, and deep learning. We’ll explore which
programming languages are most used in machine learning, providing
you with some of the positive and negative attributes of each.
Additionally, we’ll discuss biases that are perpetuated by machine
learning algorithms, and consider what can be kept in mind to prevent
these biases when building algorithms.
Approaches
As a field, machine learning is closely related to computational statistics,
so having a background knowledge in statistics is useful for
understanding and leveraging machine learning algorithms.
For those who may not have studied statistics, it can be helpful to first
define correlation and regression, as they are commonly used techniques
for investigating the relationship among quantitative variables.
Correlation is a measure of association between two variables that are not
designated as either dependent or independent. Regression at a basic
level is used to examine the relationship between one dependent and one
independent variable. Because regression statistics can be used to
anticipate the dependent variable when the independent variable is
known, regression enables prediction capabilities.
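As a brief illustration of the difference, the following sketch uses NumPy on made-up data (hours studied versus exam score) to compute a correlation coefficient and then fit a one-variable regression line that can be used for prediction:

import numpy as np

# Example data: hours studied (independent) and exam scores (dependent)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
scores = np.array([52.0, 58.0, 61.0, 67.0, 74.0, 79.0])

# Correlation: a symmetric measure of association between the two variables
correlation = np.corrcoef(hours, scores)[0, 1]

# Regression: fit scores as a linear function of hours (slope and intercept),
# which can then be used to predict the dependent variable
slope, intercept = np.polyfit(hours, scores, deg=1)
predicted_score = slope * 7.0 + intercept  # prediction for 7 hours of study

print(correlation, slope, intercept, predicted_score)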
Approaches to machine learning are continuously being developed.
For our purposes, we’ll go through a few of the popular approaches that
are being used in machine learning at the time of writing.
k-nearest neighbor initial data set
When a new object is added to the space — in this case a green heart —
we will want the machine learning algorithm to classify the heart to a
certain class.
k-nearest neighbor data set with new object to classify
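A minimal sketch of this kind of classification with scikit-learn's KNeighborsClassifier, using invented two-dimensional points rather than the data pictured in the figures:

from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: each point has two features and a class label
points = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],   # class 0 (e.g. diamonds)
          [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]   # class 1 (e.g. stars)
labels = [0, 0, 0, 1, 1, 1]

# k=3: the new object is assigned the majority class of its 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(points, labels)

new_object = [[1.0, 1.1]]       # the "green heart" to classify
print(knn.predict(new_object))  # -> [0]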
Conclusion
This tutorial reviewed some of the use cases of machine learning,
common methods and popular approaches used in the field, suitable
machine learning programming languages, and also covered some things
to keep in mind in terms of unconscious biases being replicated in
algorithms.
Because machine learning is a field that is continuously being
innovated, it is important to keep in mind that algorithms, methods, and
approaches will continue to change.
Currently, Python is one of the most popular programming languages
to use with machine learning applications in professional fields. Other
languages you may wish to investigate include Java, R, and C++.
Decision tree learning uses a decision tree as a predictive model which
maps observations about an item to conclusions about the value of a target
based on input variables.
In the predictive model, the data’s attributes that are determined
through observation are represented by the branches, while the
conclusions about the data’s target value are represented in the leaves.
When “learning” a tree, the source data is divided into subsets based
on an attribute value test, which is repeated on each of the derived
subsets recursively. The recursion process is complete once the subset at a
node has the same value as its target.
Let’s look at an example of various conditions that can determine
whether or not someone should go fishing. This includes weather
conditions as well as barometric pressure conditions.
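As a rough sketch of how such a tree could be learned with scikit-learn (the weather and pressure encodings and labels below are invented for illustration; export_text requires scikit-learn 0.21 or later):

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented observations: [weather (0=rainy, 1=overcast, 2=sunny), pressure (0=low, 1=high)]
conditions = [[2, 1], [2, 0], [1, 1], [0, 1], [0, 0], [1, 0]]
go_fishing = [1, 0, 1, 1, 0, 0]  # 1 = go fishing, 0 = stay home

tree = DecisionTreeClassifier(random_state=0).fit(conditions, go_fishing)
print(export_text(tree, feature_names=['weather', 'pressure']))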
How To Build a Machine Learning Classifier in
Python with Scikit-learn
Written by Michelle Morales
Edited by Brian Hogan
Prerequisites
To complete this tutorial, we’ll use Jupyter Notebooks, which are a useful
and interactive way to run machine learning experiments. With Jupyter
Notebooks, you can run short blocks of code and see the results quickly,
making it easy to test and debug your code.
To get up and running quickly, you can open up a web browser and
navigate to the Try Jupyter website: jupyter.org/try. From there, click on
Try Jupyter with Python, and you will be taken to an interactive Jupyter
Notebook where you can start to write Python code.
If you would like to learn more about Jupyter Notebooks and how to
set up your own Python programming environment to use with Jupyter,
you can read our tutorial on How To Set Up Jupyter Notebook for
Python 3.
To evaluate how well the classifier performs, the data is split into a
training set and a test set, and a Gaussian Naive Bayes model is
initialized and trained on the training portion:
ML Tutorial
...
# Split our data
train, test, train_labels, test_labels = train_test_split(features,
                                                           labels,
                                                           test_size=0.33,
                                                           random_state=42)

ML Tutorial
...
# Initialize our classifier and train it on the training data
gnb = GaussianNB()
model = gnb.fit(train, train_labels)
After we train the model, we can then use the trained model to make
predictions on our test set, which we do using the predict() function.
The predict() function returns an array of predictions for each data
instance in the test set. We can then print our predictions to get a sense of
what the model determined.
Use the predict() function with the test set and print the results:
ML Tutorial
...
# Make predictions
preds = gnb.predict(test)
print(preds)
Jupyter Notebook with Python cell that prints the predicted values of the Naive Bayes classifier
If you have been following along, the complete notebook for this tutorial
looks like this:
ML Tutorial
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()

# Organize our data
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

# Look at our data
print(label_names)
print(feature_names)
print(features[0])

# Split our data
train, test, train_labels, test_labels = train_test_split(features,
                                                           labels,
                                                           test_size=0.33,
                                                           random_state=42)

# Initialize our classifier
gnb = GaussianNB()

# Train our classifier
model = gnb.fit(train, train_labels)

# Make predictions
preds = gnb.predict(test)
print(preds)

# Evaluate accuracy
print(accuracy_score(test_labels, preds))
Now you can continue to work with your code to see if you can make
your classifier perform even better. You could experiment with different
subsets of features or even try completely different algorithms. Check out
Scikit-learn’s website at scikit-learn.org/stable for more machine learning
ideas.
Conclusion
In this tutorial, you learned how to build a machine learning classifier in
Python. Now you can load data, organize data, train, predict, and
evaluate machine learning classifiers in Python using Scikit-learn. The
steps in this tutorial should help you facilitate the process of working
with your own data in Python.
The dataset used in this tutorial is the Breast Cancer Wisconsin Diagnostic
Database, which includes various information about breast cancer tumors, as
well as classification labels of malignant or benign.
Scikit-learn comes installed with various datasets which we can load
into Python, and the dataset we want is included. Import and load the
dataset:
ML Tutorial
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()
Prerequisites
To complete this tutorial, you’ll need a local or remote Python 3
development environment that includes pip for installing Python
packages, and venv for creating virtual environments.
mkdir tensorflow-demo
cd tensorflow-demo
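Create the virtual environment with Python's built-in venv module:
python3 -m venv tensorflow-demo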
source tensorflow-demo/bin/activate
Next, install the libraries you’ll use in this tutorial. We’ll use specific
versions of these libraries by creating a requirements.txt file in the
project directory which specifies the requirement and the version we
need. Create the requirements.txt file:
Open the file in your text editor and add the following lines to specify
the Image, NumPy, and TensorFlow libraries and their versions:
requirements.txt
image==1.5.20
numpy==1.14.3
tensorflow==1.4.0
Save the file and exit the editor. Then install these libraries with the
following command:
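pip install -r requirements.txt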
Let’s create a Python program to work with this dataset. We will use
one file for all of our work in this tutorial. Create a new file called
main.py:
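touch main.py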
Now open this file in your text editor of choice and add this line of
code to the file to import the TensorFlow library:
main.py
import tensorflow as tf
Add the following lines of code to your file to import the MNIST
dataset and store the image data in the variable mnist:
main.py
...
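from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)  # load MNIST with one-hot labels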
main.py
...
learning_rate = 1e-4
n_iterations = 1000
batch_size = 128
dropout = 0.5
The learning rate represents how much the parameters will adjust at
each step of the learning process. These adjustments are a key component
of training: after each pass through the network we tune the weights
slightly to try and reduce the loss. Larger learning rates can converge
faster, but also have the potential to overshoot the optimal values as they
are updated. The number of iterations refers to how many times we go
through the training step, the batch size refers to how many training
examples we use at each step, and the dropout value represents a threshold
at which we eliminate some units at random.
The weights for each layer are defined as TensorFlow variables initialized
from a truncated normal distribution with a small standard deviation:
main.py
...
# n_input, n_hidden1-3 and n_output are the layer sizes defined earlier in the original tutorial
weights = {
    'w1': tf.Variable(tf.truncated_normal([n_input, n_hidden1], stddev=0.1)),
    'w2': tf.Variable(tf.truncated_normal([n_hidden1, n_hidden2], stddev=0.1)),
    'w3': tf.Variable(tf.truncated_normal([n_hidden2, n_hidden3], stddev=0.1)),
    'out': tf.Variable(tf.truncated_normal([n_hidden3, n_output], stddev=0.1))
}
For the bias, we use a small constant value to ensure that the tensors
activate in the initial stages and therefore contribute to the propagation.
The weights and bias tensors are stored in dictionary objects for ease of
access. Add this code to your file to define the biases:
main.py
...
biases = {
    'b1': tf.Variable(tf.constant(0.1, shape=[n_hidden1])),
    'b2': tf.Variable(tf.constant(0.1, shape=[n_hidden2])),
    'b3': tf.Variable(tf.constant(0.1, shape=[n_hidden3])),
    'out': tf.Variable(tf.constant(0.1, shape=[n_output]))
}
Next, set up the layers of the network by defining the operations that
will manipulate the tensors. Add these lines to your file:
main.py
...
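# A sketch of the layer operations (assumed formulation; X and keep_prob are
# the input and dropout placeholders defined earlier in the original script):
layer_1 = tf.add(tf.matmul(X, weights['w1']), biases['b1'])
layer_2 = tf.add(tf.matmul(layer_1, weights['w2']), biases['b2'])
layer_3 = tf.add(tf.matmul(layer_2, weights['w3']), biases['b3'])
layer_drop = tf.nn.dropout(layer_3, keep_prob)
output_layer = tf.matmul(layer_drop, weights['out']) + biases['out']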
main.py
...
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(
labels=Y, logits=output_layer
))
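The training loop shown later uses two operations, train_step and accuracy, that do not appear in the excerpts above. A minimal sketch of how they can be defined from this loss, assuming the one-hot label placeholder Y used elsewhere and the Adam optimizer (the optimizer choice here is an assumption):
main.py
...
train_step = tf.train.AdamOptimizer(learning_rate).minimize(cross_entropy)

correct_pred = tf.equal(tf.argmax(output_layer, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))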
How To Build a Neural Network to Recognize
Handwritten Digits with TensorFlow
Written by Ellie Birbeck
Edited by Brian Hogan
Neural networks are used as a method of deep learning, one of the many
subfields of artificial intelligence. They were first proposed around 70
years ago as an attempt at simulating the way the human brain works,
though in a much more simplified form. Individual ‘neurons’ are
connected in layers, with weights assigned to determine how the neuron
responds when signals are propagated through the network. Previously,
neural networks were limited in the number of neurons they were able to
simulate, and therefore the complexity of learning they could achieve.
But in recent years, due to advancements in hardware development, we
have been able to build very deep networks, and train them on enormous
datasets to achieve breakthroughs in machine intelligence.
These breakthroughs have allowed machines to match and exceed the
capabilities of humans at performing certain tasks. One such task is
object recognition. Though machines have historically been unable to
match human vision, recent advances in deep learning have made it
possible to build neural networks which can recognize objects, faces, text,
and even emotions.
In this tutorial, you will implement a small subsection of object
recognition—digit recognition. Using TensorFlow
(https://www.tensorflow.org/), an open-source Python library
developed by the Google Brain labs for deep learning research, you will
build and train a neural network to recognize handwritten digits. As
training progresses you should see a reduction in loss, and eventually you
can stop training and use the network as a model for testing new data.
Add this code to the file:
main.py
...
for i in range(n_iterations):
    batch_x, batch_y = mnist.train.next_batch(batch_size)
    sess.run(train_step, feed_dict={
        X: batch_x, Y: batch_y, keep_prob: dropout
        })

    # print loss and accuracy (per minibatch)
    if i % 100 == 0:
        minibatch_loss, minibatch_accuracy = sess.run(
            [cross_entropy, accuracy],
            feed_dict={X: batch_x, Y: batch_y, keep_prob: 1.0}
            )
        print(
            "Iteration",
            str(i),
            "\t| Loss =",
            str(minibatch_loss),
            "\t| Accuracy =",
            str(minibatch_accuracy)
            )
After 100 iterations of each training step in which we feed a mini-batch
of images through the network, we print out the loss and accuracy of that
batch. Note that we should not be expecting a decreasing loss and
increasing accuracy here, as the values are per batch, not for the entire
model. We use mini-batches of images rather than feeding them through
individually to speed up the training process and allow the network to
see a number of different examples before updating the parameters.
Once the training is complete, we can run the session on the test
images. This time we are using a keep_prob dropout rate o f 1.0 to
ensure all units are active in the testing process.
Add this code to the file:
main.py
...
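# A sketch of the final evaluation (assumed names): run the accuracy operation
# on the full MNIST test set, with keep_prob at 1.0 so no units are dropped
test_accuracy = sess.run(accuracy, feed_dict={
    X: mnist.test.images, Y: mnist.test.labels, keep_prob: 1.0})
print("\nAccuracy on test set:", test_accuracy)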
It’s now time to run our program and see how accurately our neural
network can recognize these handwritten digits. Save the main.py file
and execute the following command in the terminal to run the script:
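python main.py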
Output
Iteration 0 | Loss = 3.67079 | Accuracy = 0.140625
To try and improve the accuracy of our model, or to learn more about
the impact of tuning hyperparameters, we can test the effect of changing
the learning rate, the dropout threshold, the batch size, and the number
of iterations. We can also change the number of units in our hidden
layers, and change the amount of hidden layers themselves, to see how
different architectures increase or decrease the model accuracy.
To demonstrate that the network is actually recognizing the hand-
drawn images, let’s test it on a single image of our own.
If you are on a local machine and you would like to use your own
hand-drawn number, you can use a graphics editor to create your own
28x28 pixel image of a digit. Otherwise, you can use curl to download
the following sample test image to your server or computer:
main.py
import numpy as np
from PIL import Image
...
Then at the end of the file, add the following line of code to load the
test image of the handwritten digit:
main.py
...
img = np.invert(Image.open("test_img.png").convert('L')).ravel()
The open function of the Image library loads the test image as a 4D
array containing the three RGB color channels and the Alpha
transparency. This is not the same representation we used previously
when reading in the dataset with TensorFlow, so we’ll need to do some
extra work to match the format.
First, we use the convert function with the L parameter to reduce the
4D RGBA representation to one grayscale color channel. We store this as
a numpy array and invert it using np.invert, because the current
matrix represents black as 0 and white as 255, whereas we need the
opposite. Finally, we call ravel to flatten the array.
Now that the image data is structured correctly, we can run a session in
the same way as previously, but this time only feeding in the single test
image.
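A minimal sketch of that final step, assuming the input placeholder X and the keep_prob placeholder defined earlier in main.py:

prediction = sess.run(tf.argmax(output_layer, 1),
                      feed_dict={X: [img], keep_prob: 1.0})
print("Prediction for test image:", np.squeeze(prediction))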
cd ~/AtariBot
Then create a new virtual environment for the project. You can name
this virtual environment anything you’d like; here, we will name it
ataribot:
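python3 -m venv ataribot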
source ataribot/bin/activate
nano bot_2_random.py
Note: Throughout this guide, the bots’ names are aligned with the Step
number in which they appear, rather than the order in which they
appear. Hence, this bot is named bot_2_random.py rather than
bot_1_random.py.
Start this script by adding the following highlighted lines. These lines
include a comment block that explains what this script will do and two
import statements that will import the packages this script will
ultimately need in order to function:
/AtariBot/bot_2_random.py
"""
"""
import gym
import random
/AtariBot/bot_2_random.py
. . .
import gym
import random
def main():
env = gym.make('SpaceInvaders-v0')
env.reset()
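Each call to env.step(...) applies one action to the game and returns four values, described in the list that follows. A minimal sketch of the call, using a randomly sampled action:

action = env.action_space.sample()
state, reward, done, info = env.step(action)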
state: The new state of the game, after applying the provided
action.
reward: The increase in score that the state incurs. By way of
example, this could be when a bullet has destroyed an alien, and the
score increases by 50 points. Then, reward = 50. In playing any
score-based game, the player’s goal is to maximize the score. This is
synonymous with maximizing the total reward.
done: Whether or not the episode has ended, which usually occurs
when a player has lost all lives.
info: Extraneous information that you’ll put aside for now.
You will use reward to count your total reward. You’ll also use done
to determine when the player dies, which will be when done returns
True.
Add the following game loop, which instructs the game to loop until
the player dies:
/AtariBot/bot_2_random.py
. . .
def main():
    env = gym.make('SpaceInvaders-v0')
    env.reset()
    episode_reward = 0
    while True:
        action = env.action_space.sample()
        _, reward, done, _ = env.step(action)
        episode_reward += reward
        if done:
            break
When the episode ends, print its cumulative reward, and add the standard
__main__ guard so the script calls main() when executed:
/AtariBot/bot_2_random.py
. . .
def main():
    . . .
        if done:
            print('Reward: %s' % episode_reward)
            break


if __name__ == '__main__':
    main()
Save the file and exit the editor. If you’re using nano, do so by pressing
CTRL+X, Y, then ENTER. Then, run your script by typing:
python bot_2_random.py
Your program will output a number, akin to the following. Note that
each time you run the file you will get a different result:
Output
Reward: 210.0
nano bot_2_random.py
"""
"""
import gym
import random
random.seed(0)
def main():
env = gym.make('SpaceInvaders-v0')
env.seed(0)
env.reset()
episode_reward = 0
while True:
action = env.action_space.sample()
episode_reward += reward
if done:
break
if **name** == '**main**':
main()
Save the file and close your editor, then run the script by typing the
following in your terminal:
python bot_2_random.py
Output
Reward: 555.0
This is your very first bot, although it’s rather unintelligent since it
doesn’t account for the surrounding environment when it makes
decisions. For a more reliable estimate of your bot’s performance, you
could have the agent run for multiple episodes at a time, reporting
rewards averaged across multiple episodes. To configure this, first reopen
the file:
nano bot_2_random.py
/AtariBot/bot_2_random.py
. . .
random.seed(0)
num_episodes = 10
. . .
/AtariBot/bot_2_random.py
. . .
env.seed(0)
rewards = []
. . .
Nest all code from env.reset() to the end of main() in a for loop,
iterating num_episodes times. Make sure to indent each line from
env.reset() to break by four spaces:
/AtariBot/bot_2_random.py
. . .
def main():
env = gym.make('SpaceInvaders-v0')
env.seed(0)
rewards = []
for _ in range(num_episodes):
env.reset()
episode_reward = 0
while True:
...
Right before break, currently the last line of the main game loop, add
the current episode’s reward to the list of all rewards:
/AtariBot/bot_2_random.py
. . .
if done:
rewards.append(episode_reward)
break
. . .
Finally, after the for loop ends, print the average reward across all
episodes:
/AtariBot/bot_2_random.py
. . .
def main():
    ...
            break
    print('Average reward: %.2f' % (sum(rewards) / len(rewards)))
. . .
Your file will now align with the following. Please note that the
following code block includes a few comments to clarify key parts of the
script:
/AtariBot/bot_2_random.py
"""
Bot 2 -- Make a random, baseline agent for the SpaceInvaders game.
"""

import gym
import random

random.seed(0)  # make results reproducible
num_episodes = 10


def main():
    env = gym.make('SpaceInvaders-v0')  # create the game
    env.seed(0)  # make results reproducible
    rewards = []

    for _ in range(num_episodes):
        env.reset()
        episode_reward = 0
        while True:
            action = env.action_space.sample()  # random action
            _, reward, done, _ = env.step(action)
            episode_reward += reward
            if done:
                print('Reward: %s' % episode_reward)
                rewards.append(episode_reward)
                break
    print('Average reward: %.2f' % (sum(rewards) / len(rewards)))


if __name__ == '__main__':
    main()
Save the file, exit the editor, and run the script:
python bot_2_random.py
Output
. . .
state0 shoot 10
state0 right 3
state0 left 3
policy: state -> look at Q-table, pick action with greatest reward
However, most games have too many states to list in a table. In such
cases, the Q-learning agent learns a Q-function instead of a Q-table. We
use this Q-function similarly to how we used the Q-table previously.
Rewriting the table entries as functions gives us the following:
Q(state0, shoot) = 10
Q(state0, right) = 3
Q(state0, left) = 3
Each Q-value is then updated as a weighted average of its old value and a
new target:
Q[state, action] = (1 - learning_rate) * Q[state, action] + learning_rate * Q_target
The player starts at the top left, denoted by S, and works its way to the
goal at the bottom right, denoted by G. The available actions are right,
left, up, and down, and reaching the goal results in a score of 1. There are
a number of holes, denoted H, and falling into one immediately results in
a score of 0.
In this section, you will implement a simple Q-learning agent. Using
what you’ve learned previously, you will create an agent that trades off
between exploration and exploitation. In this context, exploration means
the agent acts randomly, and exploitation means it uses its Q-values to
choose what it believes to be the optimal action. You will also create a
table to hold the Q-values, updating it incrementally as the agent acts
and learns.
Make a copy of your script from Step 2:
cp bot_2_random.py bot_3_q_table.py
Begin by updating the comment at the top of the file that describes the
script’s purpose. Because this is only a comment, this change isn’t
necessary for the script to function properly, but it can be helpful for
keeping track of what the script does:
/AtariBot/bot_3_q_table.py
"""
"""
. . .
Before you make functional modifications to the script, you will need
to import numpy for its linear algebra utilities. Right underneath import
gym, add the highlighted line:
/AtariBot/bot_3_q_table.py
"""
"""
import gym
import numpy as np
import random
. . .
Underneath random.seed(0), add a seed for numpy:
/AtariBot/bot_3_q_table.py
. . .
import random
np.random.seed(0)
. . .
Next, make the game states accessible. Update the env.reset() line
to say the following, which stores the initial state of the game in the
variable state:
/AtariBot/bot_3_q_table.py
. . .
for _ in range(num_episodes):
state = env.reset()
. . .
/AtariBot/bot_3_q_table.py
. . .
while True:
action = env.action_space.sample()
state2, reward, done, _ = env.step(action)
. . .
/AtariBot/bot_3_q_table.py
. . .
while True:
. . .
episode_reward += reward
state = state2
if done:
. . .
In the if done block, delete the print statement which prints the
reward for each episode. Instead, you’ll output the average reward over
many episodes. The if done block will then look like this:
/AtariBot/bot_3_q_table.py
. . .
if done:
rewards.append(episode_reward)
break
. . .
After these modifications your game loop will match the following:
/AtariBot/bot_3_q_table.py
. . .
    for _ in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        while True:
            action = env.action_space.sample()
            state2, reward, done, _ = env.step(action)
            episode_reward += reward
            state = state2
            if done:
                rewards.append(episode_reward)
                break
. . .
Next, add the ability for the agent to trade off between exploration and
exploitation. Right before your main game loop (which starts with
for...), create the Q-value table:
/AtariBot/bot_3_q_table.py
. . .
Q = np.zeros((env.observation_space.n, env.action_space.n))
for _ in range(num_episodes):
. . .
/AtariBot/bot_3_q_table.py
. . .
Q = np.zeros((env.observation_space.n, env.action_space.n))
. . .
Inside the while True: inner game loop, create noise. Noise, or
meaningless, random data, is sometimes introduced when training deep
neural networks because it can improve both the performance and the
accuracy of the model. Note that the higher the noise, the less the values
in Q[state, :] matter. As a result, the higher the noise, the more likely
that the agent acts independently of its knowledge of the game. In other
words, higher noise encourages the agent to explore random actions:
/AtariBot/bot_3_q_table.py
. . .
        while True:
            noise = np.random.random((1, env.action_space.n)) / (episode**2.)
            action = env.action_space.sample()
. . .

Note that this requires a named loop counter, so change the outer loop to
for episode in range(1, num_episodes + 1):. Then, replace the random
action with the action that maximizes the noisy Q-values:
/AtariBot/bot_3_q_table.py
. . .
            noise = np.random.random((1, env.action_space.n)) / (episode**2.)
            action = np.argmax(Q[state, :] + noise)
. . .
Your full game loop, with noise-driven action selection, will now match the
following:
/AtariBot/bot_3_q_table.py
. . .
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for episode in range(1, num_episodes + 1):
        state = env.reset()
        episode_reward = 0
        while True:
            noise = np.random.random((1, env.action_space.n)) / (episode**2.)
            action = np.argmax(Q[state, :] + noise)
            state2, reward, done, _ = env.step(action)
            episode_reward += reward
            state = state2
            if done:
                rewards.append(episode_reward)
                break
. . .
Next, you will update your Q-value table using the Bellman update
equation, an equation widely used in machine learning to find the
optimal policy within a given environment.
The Bellman equation incorporates two ideas that are highly relevant
to this project. First, taking a particular action from a particular state
many times will result in a good estimate for the Q-value associated with
that state and action. To this end, you will increase the number of
episodes this bot must play through in order to return a stronger Q-value
estimate. Second, rewards must propagate through time, so that the
original action is assigned a non-zero reward. This idea is clearest in
games with delayed rewards; for example, in Space Invaders, the player
is rewarded when the alien is blown up and not when the player shoots.
However, the player shooting is the true impetus for a reward. Likewise,
the Q-function must assign (state0, shoot) a positive reward.
First, update num_episodes to equal 4000:
/AtariBot/bot_3_q_table.py
. . .
np.random.seed(0)
num_episodes = 4000
. . .
Then, add the necessary hyperparameters to the top of the file in the
form of two more variables:
/AtariBot/bot_3_q_table.py
. . .
num_episodes = 4000
discount_factor = 0.8
learning_rate = 0.9
. . .
Compute the new target Q-value, right after the line containing
env.step(...):
/AtariBot/bot_3_q_table.py
. . .
            Qtarget = reward + discount_factor * np.max(Q[state2, :])
            episode_reward += reward
. . .
On the line directly after Qtarget, update the Q-value table using a
weighted average of the old and new Q-values:
/AtariBot/bot_3_q_table.py
. . .
            Q[state, action] = (
                1-learning_rate
            ) * Q[state, action] + learning_rate * Qtarget
            episode_reward += reward
. . .
Check that your main game loop now matches the following:
/AtariBot/bot_3_q_table.py
. . .
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for episode in range(1, num_episodes + 1):
        state = env.reset()
        episode_reward = 0
        while True:
            noise = np.random.random((1, env.action_space.n)) / (episode**2.)
            action = np.argmax(Q[state, :] + noise)
            state2, reward, done, _ = env.step(action)
            Qtarget = reward + discount_factor * np.max(Q[state2, :])
            Q[state, action] = (
                1-learning_rate
            ) * Q[state, action] + learning_rate * Qtarget
            episode_reward += reward
            state = state2
            if done:
                rewards.append(episode_reward)
                break
. . .
Our logic for training the agent is now complete. All that’s left is to add
reporting mechanisms.
Even though Python does not enforce strict type checking, add types to
your function declarations for cleanliness. At the top of the file, before the
first line reading import gym, import the List type:
/AtariBot/bot_3_q_table.py
. . .
from typing import List
import gym
. . .
/AtariBot/bot_3_q_table.py
. . .
learning_rate = 0.9
report_interval = 500
report = '100-ep Average: %.2f . Best 100-ep Average: %.2f . Average: %.2f ' \
         '(Episode %d)'


def main():
. . .
Before the main function, add a new function that will populate this
report string, using the list of all rewards:
/AtariBot/bot_3_q_table.py
. . .
report = '100-ep Average: %.2f . Best 100-ep Average: %.2f . Average: %.2f ' \
         '(Episode %d)'


def print_report(rewards: List, episode: int):
    """Print rewards report for current episode"""
    print(report % (
        np.mean(rewards[-100:]),
        max([np.mean(rewards[i:i+100]) for i in range(len(rewards) - 100)]),
        np.mean(rewards),
        episode))


def main():
. . .
/AtariBot/bot_3_q_table.py
. . .
def main():
. . .
/AtariBot/bot_3_q_table.py
. . .
if done:
rewards.append(episode_reward)
if episode % report_interval == 0:
print_report(rewards, episode)
. . .
At the end of the main() function, report both averages once more.
Do this by replacing the line that reads print('Average reward:
%.2f' % (sum(rewards) / len(rewards))) with the following
highlighted line:
/AtariBot/bot_3_q_table.py
. . .
def main():
...
break
print_report(rewards, -1)
. . .
Finally, you have completed your Q-learning agent. Check that your
script aligns with the following:
/AtariBot/bot_3_q_table.py
"""
Bot 3 -- Build simple q-learning agent for FrozenLake
"""

from typing import List
import gym
import numpy as np
import random

random.seed(0)  # make results reproducible
np.random.seed(0)  # make results reproducible

num_episodes = 4000
discount_factor = 0.8
learning_rate = 0.9
report_interval = 500
report = '100-ep Average: %.2f . Best 100-ep Average: %.2f . Average: %.2f ' \
         '(Episode %d)'


def print_report(rewards: List, episode: int):
    """Print rewards report for current episode"""
    print(report % (
        np.mean(rewards[-100:]),
        max([np.mean(rewards[i:i+100]) for i in range(len(rewards) - 100)]),
        np.mean(rewards),
        episode))


def main():
    env = gym.make('FrozenLake-v0')  # create the game
    env.seed(0)  # make results reproducible
    rewards = []

    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for episode in range(1, num_episodes + 1):
        state = env.reset()
        episode_reward = 0
        while True:
            noise = np.random.random((1, env.action_space.n)) / (episode**2.)
            action = np.argmax(Q[state, :] + noise)
            state2, reward, done, _ = env.step(action)
            Qtarget = reward + discount_factor * np.max(Q[state2, :])
            Q[state, action] = (
                1-learning_rate
            ) * Q[state, action] + learning_rate * Qtarget
            episode_reward += reward
            state = state2
            if done:
                rewards.append(episode_reward)
                if episode % report_interval == 0:
                    print_report(rewards, episode)
                break
    print_report(rewards, -1)

if __name__ == '__main__':
    main()
Save the file, exit your editor, and run the script:
python bot_3_q_table.py
Output
(Episode 500)
(Episode 1000)
(Episode 1500)
(Episode 2000)
(Episode 2500)
(Episode 3000)
(Episode 3500)
(Episode -1)
You now have your first non-trivial bot for games, but let’s put this
average reward of 0.78 into perspective. According to the Gym
FrozenLake page, “solving” the game means attaining a 100-episode
average of 0.78. Informally, “solving” means “plays the game very
well”. While not in record time, the Q-table agent is able to solve
FrozenLake in 4000 episodes.
However, the game may be more complex. Here, you used a table to
store all of the 144 possible states, but consider tic tac toe in which there
are 19,683 possible states. Likewise, consider Space Invaders where there
are too many possible states to count. A Q-table is not sustainable as
games grow increasingly complex. For this reason, you need some way to
approximate the Q-table. As you continue experimenting in the next step,
you will design a function that can accept states and actions as inputs
and output a Q-value.
"""
import gym
import random
num_episodes = 10
def main():
rewards = []
for _ in range(num_episodes):
env.reset()
episode_reward = 0
while True:
action = env.action_space.sample()
episode_reward += reward
if done:
rewards.append(episode_reward)
break
main()
Save the file, exit the editor, and run the script:
python bot_2_random.py
Output
. . .
state0 shoot 10
state0 right 3
state0 left 3
policy: state -> look at Q-table, pick action with greatest reward
However, most games have too many states to list in a table. In such
cases, the Q-learning agent learns a Q-function instead of a Q-table. We
use this Q-function similarly to how we used the Q-table previously.
Rewriting the table entries as functions gives us the following:
Q(state0, shoot) = 10
Q(state0, right) = 3
Next, you will rewrite your algorithm logic using Tensorflow’s
abstractions. Before doing that, though, you’ll need to first create
placeholders for your data.
In your main function, directly beneath rewards=[], insert the
following highlighted content. Here, you define placeholders for your
observation at time t (as obs_t_ph) and time t+1 (as obs_tp1_ph), as
well as placeholders for your action, reward, and Q target:
/AtariBot/bot_4_q_network.py
. . .
def main():
    rewards = []

    # 1. Setup placeholders
    n_obs, n_actions = env.observation_space.n, env.action_space.n
    obs_t_ph = tf.placeholder(shape=[1, n_obs], dtype=tf.float32)
    obs_tp1_ph = tf.placeholder(shape=[1, n_obs], dtype=tf.float32)
    act_ph = tf.placeholder(tf.int32, shape=())
    rew_ph = tf.placeholder(shape=(), dtype=tf.float32)
    q_target_ph = tf.placeholder(shape=[1, n_actions], dtype=tf.float32)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
. . .
Directly beneath the line beginning with q_target_ph =, insert the
following highlighted lines. This code starts your computation by
computing Q(s, a) for all a to make q_current and Q(s’, a’) for all a’ to
make q_target:
/AtariBot/bot_4_q_network.py
. . .
    # 2. Setup computation graph
    W = tf.Variable(tf.random_uniform([n_obs, n_actions], 0, 0.01))
    q_current = tf.matmul(obs_t_ph, W)
    q_target = tf.matmul(obs_tp1_ph, W)

    Q = np.zeros((env.observation_space.n, env.action_space.n))
. . .
Again directly beneath the last line you added, insert the following
highlighted code. The first two lines are equivalent to the line added in
Step 3 that computes Qtarget, where Qtarget = reward +
discount_factor * np.max(Q[state2, :]). The next two lines
set up your loss, while the last line computes the action that maximizes
your Q-value:
/AtariBot/bot_4_q_network.py
. . .
    q_current = tf.matmul(obs_t_ph, W)
    q_target = tf.matmul(obs_tp1_ph, W)

    q_target_max = tf.reduce_max(q_target_ph, axis=1)
    q_target_sa = rew_ph + discount_factor * q_target_max
    q_current_sa = q_current[0, act_ph]
    error = tf.reduce_sum(tf.square(q_target_sa - q_current_sa))
    pred_act_ph = tf.argmax(q_current, 1)

    Q = np.zeros((env.observation_space.n, env.action_space.n))
. . .
After setting up your algorithm and the loss function, define your
optimizer:
/AtariBot/bot_4_q_network.py
. . .
    pred_act_ph = tf.argmax(q_current, 1)

    # 3. Setup optimization
    trainer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    update_model = trainer.minimize(error)

    Q = np.zeros((env.observation_space.n, env.action_space.n))
Next, set up the body of the game loop. To do this, pass data to the
Tensorflow placeholders and Tensorflow’s abstractions will handle the
computation on the GPU, returning the result of the algorithm.
Start by deleting the old Q-table and logic. Specifically, delete the lines
that define Q (right before the for loop), noise (in the while loop),
action, Qtarget, and Q[state, action]. Rename state to obs_t
and state2 to obs_tp1 to align with the Tensorflow placeholders you
set previously. When finished, your for loop will match the following:
/AtariBot/bot_4_q_network.py
. . .
    # 3. Setup optimization
    trainer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    update_model = trainer.minimize(error)

    for episode in range(1, num_episodes + 1):
        obs_t = env.reset()
        episode_reward = 0
        while True:

            obs_tp1, reward, done, _ = env.step(action)

            episode_reward += reward
            obs_t = obs_tp1
            if done:
                ...
Directly above the for loop, add the following two highlighted lines.
These lines initialize a Tensorflow session which in turn manages the
resources needed to run operations on the GPU. The second line
initializes all the variables in your computation graph; for example,
initializing weights to 0 before updating them. Additionally, you will
nest the for loop within the with statement, so indent the entire for
loop by four spaces:
/AtariBot/bot_4_q_network.py
. . .
    trainer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    update_model = trainer.minimize(error)

    with tf.Session() as session:
        session.run(tf.global_variables_initializer())

        for episode in range(1, num_episodes + 1):
            obs_t = env.reset()
            ...
. . .
            while True:
                # 4. Take step using best action or random action
                obs_t_oh = one_hot(obs_t, n_obs)
                action = session.run(pred_act_ph, feed_dict={obs_t_ph:
                    obs_t_oh})[0]
                if np.random.rand(1) < exploration_probability:
                    action = env.action_space.sample()
. . .
/AtariBot/bot_4_q_network.py
. . .
                # 5. Train model
                obs_tp1_oh = one_hot(obs_tp1, n_obs)
                q_target_val = session.run(q_target, feed_dict={
                    obs_tp1_ph: obs_tp1_oh
                })
                session.run(update_model, feed_dict={
                    obs_t_ph: obs_t_oh,
                    rew_ph: reward,
                    q_target_ph: q_target_val,
                    act_ph: action
                })
                episode_reward += reward
. . .
/AtariBot/bot_4_q_network.py
"""
"""
import gym
import numpy as np
import random
import tensorflow as tf
random.seed(0)
np.random.seed(0)
tf.set_random_seed(0)
num_episodes = 4000
discount_factor = 0.99
learning_rate = 0.15
report_interval = 500
report = '100-ep Average: %.2f . Best 100-ep Average: %.2f . Average: %.2f ' \
         '(Episode %d)'


def one_hot(i: int, n: int) -> np.array:
    """Implements one-hot encoding by selecting the ith standard basis
    vector"""
    return np.identity(n)[i].reshape((1, -1))


def print_report(rewards: List, episode: int):
    """Print rewards report for current episode"""
    print(report % (
        np.mean(rewards[-100:]),
        max([np.mean(rewards[i:i+100]) for i in range(len(rewards) - 100)]),
        np.mean(rewards),
        episode))
def main():
rewards = []
# 1. Setup placeholders
q_current = tf.matmul(obs_t_ph, W)
q_target = tf.matmul(obs_tp1_ph, W)
pred_act_ph = tf.argmax(q_current, 1)
    # 3. Setup optimization
    trainer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    update_model = trainer.minimize(error)
session.run(tf.global_variables_initializer())
episode_reward = 0
while True:
obs_t_oh})[0]
action = env.action_space.sample()
# 5. Train model
obs_tp1_ph: obs_tp1_oh
})
session.run(update_model, feed_dict={
obs_t_ph: obs_t_oh,
rew_ph: reward,
q_target_ph: q_target_val,
act_ph: action
})
episode_reward += reward
obs_t = obs_tp1
if done:
rewards.append(episode_reward)
if episode % report_interval == 0:
print_report(rewards, episode)
break
print_report(rewards, -1)
if __name__ == '__main__':
main()
Save the file, exit your editor, and run the script:
python bot_4_q_network.py
Output
(Episode 500)
(Episode 1000)
(Episode 1500)
(Episode 2000)
(Episode 2500)
(Episode 3000)
(Episode 4000)
(Episode -1)
You’ve now trained your very first deep Q-learning agent. For a game
as simple as FrozenLake, your deep Q-learning agent required 4000
episodes to train. Imagine if the game were far more complex. How
many training samples would that require to train? As it turns out, the
agent could require millions of samples. The number of samples required
is referred to as sample complexity, a concept explored further in the next
section.
Say we have two models, one simple and one extremely complex. For
both models to attain the same performance, bias-variance tells us that
the extremely complex model will need exponentially more samples to
train. Case in point: your neural network-based Q-learning agent
required 4000 episodes to solve FrozenLake. Adding a second layer to the
neural network agent quadruples the number of necessary training
episodes. With increasingly complex neural networks, this divide only
grows. To maintain the same error rate, increasing model complexity
increases the sample complexity exponentially. Likewise, decreasing
sample complexity decreases model complexity. Thus, we cannot
maximize model complexity and minimize sample complexity to our
heart’s desire.
We can, however, leverage our knowledge of this tradeoff. For a visual
interpretation of the mathematics behind the bias-variance
decomposition, see Understanding the Bias-Variance Tradeoff. At a high
level, the bias-variance decomposition is a breakdown of “true error”
into two components: bias and variance. We refer to “true error” as mean
squared error (MSE), which is the expected difference between our
predicted labels and the true labels. As model complexity increases, bias
falls while variance grows, so the "true error" first decreases and then
rises again once the model becomes too complex.
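For reference, this decomposition of the mean squared error can be written as

\mathrm{MSE} = \mathbb{E}\left[(y - \hat{y})^2\right] = \mathrm{Bias}(\hat{y})^2 + \mathrm{Var}(\hat{y}) + \sigma^2,

where \hat{y} is the model's prediction, y is the true label, and \sigma^2 is the irreducible noise in the data.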
Open the new file:
nano bot_5_ls.py
Again, update the comment at the top of the file describing what this
script will do:
/AtariBot/bot_5_ls.py
"""
"""
. . .
Before the block of imports near the top of your file, add two more
imports for type checking:
/AtariBot/bot_5_ls.py
. . .
from typing import Tuple
from typing import Callable
from typing import List
import gym
. . .
/AtariBot/bot_5_ls.py
. . .
num_episodes = 5000
discount_factor = 0.85
learning_rate = 0.9
w_lr = 0.5
report_interval = 500
. . .
/AtariBot/bot_5_ls.py
. . .
report_interval = 500
report = '100-ep Average: %.2f . Best 100-ep Average: %.2f . Average: %.2f ' \
         '(Episode %d)'


def makeQ(model: np.array) -> Callable[[np.array], np.array]:
    """Returns a Q-function, which takes state -> distribution over
    actions"""
    return lambda X: X.dot(model)
. . .

After makeQ, add an initialize function that creates a randomly
initialized model and wraps it in a Q-function:
/AtariBot/bot_5_ls.py
. . .
def initialize(shape: Tuple):
    """Initialize model"""
    W = np.random.normal(0.0, 0.1, shape)
    Q = makeQ(W)
    return W, Q
. . .
After the initialize block, add a train method that computes the
ridge regression closed-form solution, then weights the old model with
the new one. It returns both the model and the abstracted Q-function:
/AtariBot/bot_5_ls.py
. . .
def initialize(shape: Tuple):
    ...
    return W, Q


def train(X: np.array, y: np.array, W: np.array) -> Tuple[np.array,
                                                           Callable]:
    """Train the model, using solution to ridge regression"""
    I = np.eye(X.shape[1])
    newW = np.linalg.inv(X.T.dot(X) + 10e-4 * I).dot(X.T.dot(y))
    W = w_lr * newW + (1 - w_lr) * W
    Q = makeQ(W)
    return W, Q
. . .
/AtariBot/bot_5_ls.py
. . .
def train(X: np.array, y: np.array, W: np.array) -> Tuple[np.array,
                                                           Callable]:
    ...
    return W, Q


def one_hot(i: int, n: int) -> np.array:
    """Implements one-hot encoding by selecting the ith standard basis
    vector"""
    return np.identity(n)[i]
. . .
Following this, you will need to modify the training logic. In the
previous script you wrote, the Q-table was updated every iteration. This
script, however, will collect samples and labels every time step and train
a new model every 10 steps. Additionally, instead of holding a Q-table or
a neural network, it will use a least squares model to predict Q-values.
Go to the main function and replace the definition of the Q-table (Q =
np.zeros(...)) with the following:
/AtariBot/bot_5_ls.py
. . .
def main():
    ...
    rewards = []

    n_obs, n_actions = env.observation_space.n, env.action_space.n
    W, Q = initialize((n_obs, n_actions))
    states, labels = [], []
. . .
Scroll down before the for loop. Directly below this, add the following
lines which reset the states and labels lists if there is too much
information stored:
/AtariBot/bot_5_ls.py
. . .
def main():
    ...
    for episode in range(1, num_episodes + 1):
        if len(states) >= 10000:
            states, labels = [], []
. . .
Modify the line directly after this one, which defines state =
env.reset(), so that it becomes the following. This will one-hot
encode the state immediately, as all of its usages will require a one-hot
vector:
/AtariBot/bot_5_ls.py
. . .
        state = one_hot(env.reset(), n_obs)
. . .
Before the first line in your while main game loop, amend the list of
states:
/AtariBot/bot_5_ls.py
. . .
...
episode_reward = 0
while True:
states.append(state)
. . .
/AtariBot/bot_5_ls.py
. . .
while True:
states.append(state)
. . .
/AtariBot/bot_5_ls.py
. . .
while True:
...
state2, reward, done, _ = env.step(action)
. . .
Directly after the env.step(...) line, compute the Q-value target, build
the label for this state from the current Q-function, and store it:
/AtariBot/bot_5_ls.py
. . .
            state2 = one_hot(state2, n_obs)
            Qtarget = reward + discount_factor * np.max(Q(state2))
            label = Q(state)
            label[action] = (1 - learning_rate) * label[action] + \
                learning_rate * Qtarget
            labels.append(label)
            episode_reward += reward
. . .
/AtariBot/bot_5_ls.py
. . .
state = state2
if len(states) % 10 == 0:
W, Q = train(np.array(states), np.array(labels), W)
if done:
. . .
/AtariBot/bot_5_ls.py
"""
"""
import gym
import numpy as np
import random
num_episodes = 5000
discount_factor = 0.85
learning_rate = 0.9
w_lr = 0.5
report_interval = 500
report = '100-ep Average: %.2f . Best 100-ep Average: %.2f . Average:
%.2f ' \
'(Episode %d)'
actions"""
"""Initialize model"""
Q = makeQ(W)
return W, Q
Callable]:
I = np.eye(X.shape[1])
Q = makeQ(W)
return W, Q
def one_hot(i: int, n: int) -> np.array:
vector"""
return np.identity(n)[i]
"""
print(report % (
np.mean(rewards[-100:]),
100)]),
np.mean(rewards),
episode))
def main():
rewards = []
W, Q = initialize((n_obs, n_actions))
episode_reward = 0
while True:
states.append(state)
label = Q(state)
learning_rate * Qtarget
labels.append(label)
episode_reward += reward
state = state2
if len(states) % 10 == 0:
W, Q = train(np.array(states), np.array(labels), W)
if done:
rewards.append(episode_reward)
if episode % report_interval == 0:
print_report(rewards, episode)
break
print_report(rewards, -1)
if __name__ == '__main__':
main()
Then, save the file, exit the editor, and run the script:
python bot_5_ls.py
Output
(Episode 500)
(Episode 1000)
(Episode 1500)
(Episode 2000)
(Episode 2500)
(Episode 3000)
(Episode 3500)
(Episode 4000)
(Episode 5000)
(Episode -1)
You won’t implement these yourself, but you will load pretrained
models that trained with these solutions. To do this, create a new
directory where you will store these models’ parameters:
mkdir models
wget http://models.tensorpack.com/OpenAIGym/SpaceInvaders-v0.tfmodel -P models
We will address these constraints in more detail later on. For now,
download the script by typing:
wget https://github.com/alvinwan/bots-for-atari-games/raw/master/src/bot_6_a3c.py
You will now run this pretrained Space Invaders agent to see how it
performs. Unlike the past few bots we’ve used, you will write this script
from scratch.
Create a new script file:
nano bot_6_dqn.py
/AtariBot/bot_6_dqn.py
"""
"""
import cv2
import gym
import numpy as np
import random
import tensorflow as tf
def main():
if **name** == '**main**':
main()
Directly after your imports, set random seeds to make your results
reproducible. Also, define a hyperparameter num_episodes which will
tell the script how many episodes to run the agent for:
/AtariBot/bot_6_dqn.py
. . .
import tensorflow as tf
tf.set_random_seed(0)
num_episodes = 10
def main():
. . .
Directly after num_episodes, add a downsample() helper that shrinks each
game frame before it is stored and fed to the model:
. . .
num_episodes = 10


def downsample(state):
    # downsample the raw game frame to 84x84 before it is fed to the model
    return cv2.resize(state, (84, 84), interpolation=cv2.INTER_LINEAR)[None]


def main():
. . .
Create the game environment at the start of your main function and
seed the environment so that the results are reproducible:
/AtariBot/bot_6_dqn.py
. . .
def main():
    env = gym.make('SpaceInvaders-v0')  # create the game
    env.seed(0)  # make results reproducible
. . .
Directly after the environment seed, initialize an empty list to hold the
rewards:
/AtariBot/bot_6_dqn.py
. . .
def main():
    ...
    rewards = []
. . .
/AtariBot/bot_6_dqn.py
. . .
def main():
rewards = []
model = a3c_model(load='models/SpaceInvaders-v0.tfmodel')
. . .
Next, add some lines telling the script to iterate for num_episodes
times to compute average performance and initialize each episode’s
reward to 0. Additionally, add a line to reset the environment
(env.reset()), collecting the new initial state in the process,
downsample this initial state with downsample(), and start the game
loop using a while loop:
/AtariBot/bot_6_dqn.py
. . .
def main():
rewards = []
    model = a3c_model(load='models/SpaceInvaders-v0.tfmodel')
    for _ in range(num_episodes):
episode_reward = 0
states = [downsample(env.reset())]
while True:
. . .
Inside the while loop, pick an action: act randomly until four frames have
been collected, then concatenate the last four downsampled frames and let
the pretrained model choose the action:
/AtariBot/bot_6_dqn.py
. . .
        while True:
            if len(states) < 4:
                action = env.action_space.sample()
            else:
                frames = np.concatenate(states[-4:], axis=3)
                action = np.argmax(model([frames]))
. . .
Then take an action and update the relevant data. Add a downsampled
version of the observed state, and update the reward for this episode:
/AtariBot/bot_6_dqn.py
. . .
        while True:
            ...
            action = np.argmax(model([frames]))
            state, reward, done, _ = env.step(action)
            states.append(downsample(state))
            episode_reward += reward
. . .
Next, add the following lines which check whether the episode is done
and, if it is, print the episode’s total reward and amend the list of all
results and break the while loop early:
/AtariBot/bot_6_dqn.py
. . .
        while True:
            ...
            episode_reward += reward
            if done:
                print('Reward: %d' % episode_reward)
                rewards.append(episode_reward)
                break
. . .
Outside of the while and for loops, print the average reward. Place
this at the end of your main function:
/AtariBot/bot_6_dqn.py
. . .
def main():
    ...
            break
    print('Average reward: %.2f' % (sum(rewards) / len(rewards)))
. . .

Check that your completed script matches the following:
/AtariBot/bot_6_dqn.py
"""
Bot 6 -- Fully featured deep q-learning network.
"""

import cv2
import gym
import numpy as np
import random
import tensorflow as tf
from bot_6_a3c import a3c_model

random.seed(0)  # make results reproducible
tf.set_random_seed(0)

num_episodes = 10


def downsample(state):
    # downsample the raw game frame to 84x84 before it is fed to the model
    return cv2.resize(state, (84, 84), interpolation=cv2.INTER_LINEAR)[None]


def main():
    env = gym.make('SpaceInvaders-v0')  # create the game
    env.seed(0)  # make results reproducible
    rewards = []

    model = a3c_model(load='models/SpaceInvaders-v0.tfmodel')
    for _ in range(num_episodes):
        episode_reward = 0
        states = [downsample(env.reset())]
        while True:
            if len(states) < 4:
                action = env.action_space.sample()
            else:
                frames = np.concatenate(states[-4:], axis=3)
                action = np.argmax(model([frames]))
            state, reward, done, _ = env.step(action)
            states.append(downsample(state))
            episode_reward += reward
            if done:
                print('Reward: %d' % episode_reward)
                rewards.append(episode_reward)
                break
    print('Average reward: %.2f' % (sum(rewards) / len(rewards)))


if __name__ == '__main__':
    main()
Save the file and exit your editor. Then, run the script:
python bot_6_dqn.py
Output
. . .
Reward: 1230
Reward: 4510
Reward: 1860
Reward: 2555
Reward: 515
Reward: 1830
Reward: 4100
Reward: 4350
Reward: 1705
Reward: 4905
Compare this to the result from the first script, where you ran a
random agent for Space Invaders. The average reward in that case was
only about 150, meaning this result is over twenty times better. However,
the agent is fairly slow to run, and an average over only a handful of
episodes is not a reliable metric. Averaged over these 10
episodes, the reward is 2756; over 100 episodes, the average is around
2500. Only with these averages can you comfortably conclude that your
agent is indeed performing an order of magnitude better, and that you
now have an agent that plays Space Invaders reasonably well.
However, recall the issue that was raised in the previous section
regarding sample complexity. As it turns out, this Space Invaders agent
takes millions of samples to train. In fact, this agent required 24 hours on
four Titan X GPUs to train up to this current level; in other words, it took
a significant amount of compute to train it adequately. Can you train a
similarly high-performing agent with far fewer samples? The previous
steps should arm you with enough knowledge to begin exploring this
question. Using far simpler models and reasoning about the bias-variance tradeoff, it
may be possible.
Conclusion
In this tutorial, you built several bots for games and explored a
fundamental concept in machine learning called bias-variance. A natural
next question is: Can you build bots for more complex games, such as
StarCraft 2? As it turns out, this is a pending research question,
supplemented with open-source tools from collaborators across Google,
DeepMind, and Blizzard. If these are problems that interest you, see the
open calls for research at OpenAI for current problems.
The main takeaway from this tutorial is the bias-variance tradeoff. It is
up to the machine learning practitioner to consider the effects of model
complexity. Whereas it is possible to leverage highly complex models and
layer on excessive amounts of compute, samples, and time, reduced
model complexity could significantly reduce the resources required.