Martin Jones - Advanced Python For Biologists (2016) PDF
Martin Jones
All rights reserved. This book or any portion thereof may not be
reproduced or used in any manner whatsoever without the express
written permission of the publisher except for the use of brief
quotations in a book review.
ISBN-13: 978-1495244377
ISBN-10: 1495244377
http://pythonforbiologists.com
Set in PT Serif and Source Code Pro
About the author
Martin started his programming career by learning Perl during the
course of his PhD in evolutionary biology, and started teaching other
people to program soon after. Since then he has taught introductory
programming to hundreds of biologists, from undergraduates to PIs,
and has maintained a philosophy that programming courses must be
friendly, approachable, and practical.
In his academic career, Martin mixed research and teaching at the
University of Edinburgh, culminating in a two year stint as Lecturer
in Bioinformatics. He now runs programming courses for biological
researchers as a full time freelancer.
You can get in touch with Martin at
1: Introduction
Chapter 1: Introduction
A second, more persuasive reason is that all the features of Python that
we are going to discuss in this book have been added to the language for
good reasons – because they make code easier to write, easier to
maintain, easier to test, faster, or more efficient. You don't have to use
objects when modelling biological systems – but it will make
development much easier. You don't have to use comprehensions when
transforming data – but doing so will allow you to express your ideas
much more concisely. You don't have to use recursive functions when
processing tree-like data – but your code will be much more readable if
you do.
Yet another reason is that knowing about features of Python opens up
new approaches to programming, which will allow you to think about
problems in a new light. For example, two large chapters in this book are
devoted to object oriented programming and functional programming.
The aim of these chapters is to introduce you not only to object oriented
and functional features, but also to object oriented and functional
approaches to tackling real life problems.
Hopefully, as you encounter new tools and techniques in this book the
biological examples will convince you that they're useful things to know
about. I have tried, for each new concept introduced, to point out why and
in what circumstances it is a better way of doing things than the way that
you might already know.
there are inevitably some cases where material from one chapter relies on
material from a later chapter. I've tried to minimize such cases, and have
added footnotes to point out connections between chapters whenever
possible. If there's a particular chapter that sounds interesting then it's
fine to jump in and start reading there; just be aware that you'll probably
have to skip around in the book a bit to fill in any gaps in your current
knowledge.
Chapters tend to follow a predictable structure. They generally start with
a few paragraphs outlining the motivation behind the features that they will
cover – why do they exist, what problems do they allow us to solve, and
why are they useful in biology specifically? These are followed by the
main body of the chapter in which we discuss the relevant features and
how to use them. The length of the chapters varies quite a lot –
sometimes we want to cover a topic briefly, other times we need more
depth. Each chapter ends with a brief recap outlining what we have
learned, followed by exercises and solutions (more on that topic below).
The book assumes that you're familiar with all the material in Python for
Biologists. If you have some Python experience, but haven't read Python
for Biologists, then it's probably worth downloading a free copy and at
least looking over the chapter contents to make sure you're comfortable
with them. I will sometimes refer in the text or in footnotes to sections of
Python for Biologists – rather than repeating the URL where you can get a
copy1, I'll simply give it here:
http://pythonforbiologists.com
1 If you're reading this book as an ebook (as opposed to a physical book) then you should have
received a copy of Python for Biologists in your download.
Formatting
A book like this has lots of special types of text – we'll need to look at
examples of Python code and output, the contents of files, and technical
terms. Take a minute to note the typographic conventions we'll be using.
In the main text of this book, bold type is used to emphasize important
points and italics for technical terms and filenames. Where code is mixed
in with normal text it's written in a monospaced font like this
with a grey background. Occasionally there are footnotes1 to provide
additional information that is interesting to know but not crucial to
understanding, or to give links to web pages.
Example Python code is highlighted with a solid border and the name of
the matching example file is written just underneath the example to the
right:
example.py
Not every bit of code has a matching example file – much of the time we'll
be building up a Python program bit by bit, in which case there will be a
single example file containing the finished version of the program. The
example files are in separate folders, one for each chapter, to make them
easy to find.
Sometimes it's useful to refer to a specific line of code inside an example.
For this, we'll use numbered circles like this❶:
1 Like this.
Example output (i.e. what we see on the screen when we run the code) is
highlighted with a dotted border:
Often we want to look at the code and the output it produces together. In
these situations, you'll see a solid-bordered code block followed
immediately by a dotted-bordered output block.
Other blocks of text (usually file contents or typed command lines) don't
have any kind of border and look like this:
contents of a file
http://pythonforbiologists.com/index.php/exercise-files/
on the command line and editing them in a text editor. For notes on
setting up your environment, see the introductory chapter in Python for
Biologists. I've tried to ensure that all code samples and exercise solutions
will work in both Python 2 and Python 3 – where there are differences
between versions, I have noted it in the text.
There are two chief differences that are relevant to multiple examples and
exercises. Firstly, to carry out floating point division in Python 2 we need
to include the line
from __future__ import division
at the start of our programs. Secondly, the way that we get input from the
user is slightly different: in Python 3 we use the input function and in
Python 2 we use the raw_input function.
Joined-up programming
Finally, a quick note about synergy between chapters. When explaining
new concepts, I've made the examples as simple as possible in the
interests of clarity and avoided using multiple "advanced" techniques in a
single example. For instance, in the chapter on object oriented
programming I've given examples of class methods that don't use any of
the techniques from other chapters.
This makes it easier for the reader to concentrate on the new material
while being able to easily understand the context. But it can lead to the
misconception that these programming techniques are mutually
exclusive. In fact, nothing could be further from the truth: the real power
of the tools that we're going to be discussing comes when they are used
together. So it's certainly possible, for example, to write a class (chapter 4)
that stores some data in a list of dicts (chapter 3) and has methods that
use recursion (chapter 2) and raise custom exceptions (chapter 7). You
just won't see such code in the examples, because it wouldn't be a very
good way of introducing the new material!
Getting in touch
Learning to program is a difficult task, and my one goal in writing this
book is to make it as easy and accessible as possible to get started. So, if
you find anything that is hard to understand, or you think may contain an
error, please get in touch – just drop me an email at
and I promise to get back to you. If you find the book useful, then please
also consider leaving an Amazon review to help other people find it.
Chapter 2: Recursion and trees
def generate_trimers():
    bases = ['A', 'T', 'G', 'C']
    result = []
    for base1 in bases:
        for base2 in bases:
            for base3 in bases:
                result.append(base1 + base2 + base3)
    return result
generate_3mers.py
In the above example, we have three nested for loops which iterate over
the same list of four possible bases. The outer for loop defines the first
base of the 3mer, the middle for loop defines the second base, and the
inner for loop defines the third base. Because each for loop repeats
four times (one for each base) the append() line gets called 4x4x4 = 64
times, so we end up with a list of all possible 3mers.
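As a quick check on the 4x4x4 arithmetic, the sketch below repeats the function and compares its output with `itertools.product` from the standard library. The cross-check is our own addition and is not part of the original example file.

```python
import itertools

def generate_trimers():
    bases = ['A', 'T', 'G', 'C']
    result = []
    for base1 in bases:
        for base2 in bases:
            for base3 in bases:
                result.append(base1 + base2 + base3)
    return result

trimers = generate_trimers()

# Cross-check (an addition, not from the book): the Cartesian product
# of the four bases with themselves, three times over.
expected = [''.join(p) for p in itertools.product('ATGC', repeat=3)]

print(len(trimers))         # number of 3mers generated
print(trimers == expected)  # same sequences in the same order?
```

Both approaches visit the bases in the same order, so the two lists should match element for element.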
Suppose we wanted to modify this code to generate 4mers instead of
3mers. We would simply add another for loop inside the original three.
But what if we wanted to write a function that would do the same job, but
for any value of k – in other words, for sequences of any length? This is a
very tricky problem. We know that we need to use a for loop to add bases
onto the end of a growing sequence, but the difficult part is making sure
that we keep all the possible sequences between iterations. Here's one
possible solution:
def generate_kmers(length):
    result = ['']  # ❶
    for i in range(length):  # ❷
        new_result = []  # ❹
        for kmer in result:
            for base in ['A', 'T', 'G', 'C']:
                new_result.append(kmer + base)  # ❸
        result = new_result  # ❺
    return result
generate_kmers.py
The function works in the following way: we start off with a list
containing a single empty string – this is the starting point for each of our
final sequences❶ . Then, we extend each element in that list (initially just
one element, but as we'll see, the size of the list will grow) in an iterative
process (controlled using a for loop and a range❷). To extend a
sequence, we add each of the four possible bases onto the end❸. Because
we want to end up with the final sequences (i.e. we don't want the
intermediate steps) we have to create a new temporary list ❹ to hold the
extended sequences for each round of extension, and then use that list as
the list of sequences for the next round❺. After each iteration of the for
loop, two things happen: the result list contains four times as many
elements as before (because we've added each of the four possible bases)
and those elements are one base longer than before.
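To see the growth described above, here is a small sketch that follows the same logic but records the size of the list after each round of extension. The function name `kmer_counts_per_round` is hypothetical and introduced only for this illustration.

```python
def kmer_counts_per_round(length):
    """Return the size of the result list before and after each round."""
    result = ['']
    sizes = [len(result)]            # one empty string to start with
    for i in range(length):
        new_result = []
        for kmer in result:
            for base in ['A', 'T', 'G', 'C']:
                new_result.append(kmer + base)
        result = new_result
        sizes.append(len(result))    # four times larger than last round
    return sizes

print(kmer_counts_per_round(3))
```

For a length of three the sizes should be 1, 4, 16 and 64, which matches the count we got from the three nested loops.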
Why does this function look so different to the generate_trimers()
function? In the generate_trimers() function we use nested loops to
generate the sequences, because we know in advance how long we want
the sequences to be – three bases means that we need three nested loops.
But in the function above, we can't use nested loops, because we don't
know in advance how many loops we'll need. The number of loops
depends on the length of the kmers. What we require is a way to
express the idea of nesting code an arbitrary number of times.
Recursion is the way to express this, but to understand it, we need to
think about the problem in a slightly different way. Here's an English
translation of the generate_kmers() function above:
"Start with a list containing a single empty string. Next, extend
each element in the list by adding each of the four possible bases
onto the end. Repeat this extension process as many times as
necessary until the sequences are the length you require"
At first glance, this doesn't seem like a very helpful solution. Rather than
telling us how to figure out the answer, it just describes what the
answer looks like. And it assumes that we have a way to magically
calculate the list of all possible sequences that are one base shorter than
the length we want. But – remarkably – when we write a function using
this description, it actually works:
def generate_kmers_rec(length):
    if length == 1:  # ❶
        return ['A', 'T', 'G', 'C']  # ❷
    else:
        result = []
        for seq in generate_kmers_rec(length - 1):  # ❸
            for base in ['A', 'T', 'G', 'C']:  # ❹
                result.append(seq + base)
        return result
generate_kmers_recursive.py
"If the length is one❶ then the result is simply a list of the four
bases❷. To get the result when the length is more than one, take
the list of all possible sequences whose length is one less than the
length you're looking for❸, and add each of the four possible
bases to it❹."
the function immediately returns the list ['A', 'T', 'G', 'C']
and the function call is over. Now the call to generate_kmers_rec(2)
can carry on running. It takes that list, adds each of the four bases to each
of the four sequences to generate a list of the 16 dinucleotides, and
returns that list. The list of dinucleotides is received by
generate_kmers_rec(3), which finishes the job by adding each of the
four bases to each of the 16 dinucleotides to create a list of 64
trinucleotides, and returning it.
When looking at a recursive function like the one above, an obvious
question is: why doesn't the function run forever? If every time the function
is called, it calls itself again, then why don't we end up with an infinite
number of function calls which never return1? In our example above, the
answer lies in the special case where the length is one. We can see by
looking at the function definition that each call to the function will
trigger another call to the function unless the length is one. Couple that
fact with the fact that we decrease the length by one whenever the
function calls itself, and we can see that whatever length is supplied as
the argument to the initial function call, eventually the function will be
called with a length of one, at which point the function calls will start to
return.
In general, these two criteria are necessary for any recursive function to
work properly: there must be a special case that causes the function to
return without calling itself, and there must be some guarantee that this
special case will eventually be reached.
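The second criterion is easy to break by accident. In the kmer function, a starting length of zero would step straight past the special case, and the calls would pile up until Python stops the program with a recursion depth error. The sketch below adds a guard for that situation; the name `generate_kmers_checked` and the choice of `ValueError` are our own assumptions, not part of the original example.

```python
def generate_kmers_checked(length):
    # Hypothetical guard: lengths below one can never reach the
    # special case by repeatedly subtracting one.
    if length < 1:
        raise ValueError('length must be at least 1')
    if length == 1:
        # Special case: return without calling ourselves.
        return ['A', 'T', 'G', 'C']
    result = []
    # Each call works on a length one smaller, so the special case
    # is guaranteed to be reached.
    for seq in generate_kmers_checked(length - 1):
        for base in ['A', 'T', 'G', 'C']:
            result.append(seq + base)
    return result

print(len(generate_kmers_checked(3)))

try:
    generate_kmers_checked(0)
except ValueError as error:
    print('rejected: ' + str(error))
```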
1 It is actually quite easy to write functions that never return when getting started with
recursive programming!
In the introduction to this chapter, we said that recursion is good for
solving problems that have a tree-like structure. The problem of
generating all kmers of a given length is an example, although the tree-like
nature of the problem isn't obvious. To make it clearer, imagine the
process of choosing a single 3mer by selecting one base at a time. At the
start of the process, we have four options – four different branches – one
for each base:
We choose one of these paths – for instance, the one labelled T – and then
are faced with four more branches:
This time we pick the branch labelled TG. Finally, we have a choice of four
different branches to end up at one of four different 3mers:
Viewed like this, it's clear why we can consider this a tree-like problem:
generating all possible kmers of a given length is equivalent to visiting all
the leaves of the tree.
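One way to make the tree explicit in code is to write a function that starts at a given point in the tree and walks down every branch until it reaches the leaves. This is only a sketch: the name `collect_leaves` and its two arguments are hypothetical, chosen to mirror the branch-by-branch description above.

```python
def collect_leaves(prefix, depth):
    """Return the leaves found `depth` levels below the node `prefix`."""
    if depth == 0:
        return [prefix]          # a leaf: no more bases left to choose
    leaves = []
    for base in ['A', 'T', 'G', 'C']:
        # Follow one branch, then gather everything underneath it.
        leaves.extend(collect_leaves(prefix + base, depth - 1))
    return leaves

# The four 3mers reachable from the branch labelled TG
print(collect_leaves('TG', 1))

# All leaves of the full tree, starting from the root
print(len(collect_leaves('', 3)))
```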
In the above example, there's not a clear winner between the iterative
solution and the recursive solution – both are roughly equally easy to
read. In the next section, we'll consider some data that are more explicitly
tree-like and see some examples of problems where recursion is clearly
the best solution.
Child-to-parent trees
Suppose we want to store some information about taxonomic
relationships among primates. If we were describing the taxonomy of
primates using natural language, here are two statements that we might
make: "Homo sapiens is a member of the group Homo, which is a member of
the group Homininae" and "Primates contains two groups – Haplorrhini and
Strepsirrhini".
These two statements are similar – they are both talking about group
membership – but they approach the problem of description in different
ways. The first expresses the relationship in child to parent terms, and the
second in parent to child. How might we store these relationships in a
Python program? The first can be stored quite simply: we can create a
dictionary which holds child → parent relationships:
tax_dict = {
    'Homo sapiens' : 'Homo',
    'Homo' : 'Homininae'
}
In the above code we have a dictionary with two elements, each one
describing a single child-to-parent relationship.
Storing parent-to-child relationships looks a bit different. To store the
second set of relationships we can create a list and store it in a variable:
Primates = ['Haplorrhini', 'Strepsirrhini']
Note that this looks a little less satisfying – the name of the parent
(Primates) is a variable name, while the names of the child taxa are
strings.
The outlook for parent-to-child relationships looks even worse when we
start to consider how we would store additional relationships. To add the
relationships from the first statement requires us to create two new
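variables:
Homo = ['Homo sapiens']
Homininae = ['Homo']
With the child-to-parent dictionary, by contrast, adding the relationships
from the second statement only requires two new items: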
tax_dict = {
    'Homo sapiens' : 'Homo',
    'Homo' : 'Homininae',
    'Haplorrhini' : 'Primates',
    'Strepsirrhini' : 'Primates'
}
Primates
    Haplorrhini
        Simiiformes
            Hominoidea
                Pan troglodytes
                Pongo abelii
        Tarsiiformes
            Tarsius tarsier
    Strepsirrhini
        Lorisidae
            Loris tardigradus
        Lemuriformes
            Allocebus trichotis
        Lorisiformes
            Galago alleni
            Galago moholi
tax_dict = {
    'Pan troglodytes'     : 'Hominoidea',
    'Pongo abelii'        : 'Hominoidea',
    'Hominoidea'          : 'Simiiformes',
    'Simiiformes'         : 'Haplorrhini',
    'Tarsius tarsier'     : 'Tarsiiformes',
    'Haplorrhini'         : 'Primates',
    'Tarsiiformes'        : 'Haplorrhini',
    'Loris tardigradus'   : 'Lorisidae',
    'Lorisidae'           : 'Strepsirrhini',
    'Strepsirrhini'       : 'Primates',
    'Allocebus trichotis' : 'Lemuriformes',
    'Lemuriformes'        : 'Strepsirrhini',
    'Galago alleni'       : 'Lorisiformes',
    'Lorisiformes'        : 'Strepsirrhini',
    'Galago moholi'       : 'Lorisiformes'
}
Even with some white space added to make the items in the dict line up,
it's not particularly easy to read! Nevertheless, every relationship in the
tree is also present in the dict, so we can be confident that it represents
the whole taxonomy.
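Rather than checking the dict by eye, we can ask Python to summarize it. The sketch below is our own addition: it finds the taxa that are never a child (the roots) and the taxa that are never a parent (the leaves). A well-formed taxonomy should have exactly one root, so a typo such as a stray space inside a name would show up here as an unexpected extra root.

```python
tax_dict = {
    'Pan troglodytes': 'Hominoidea', 'Pongo abelii': 'Hominoidea',
    'Hominoidea': 'Simiiformes', 'Simiiformes': 'Haplorrhini',
    'Tarsius tarsier': 'Tarsiiformes', 'Haplorrhini': 'Primates',
    'Tarsiiformes': 'Haplorrhini', 'Loris tardigradus': 'Lorisidae',
    'Lorisidae': 'Strepsirrhini', 'Strepsirrhini': 'Primates',
    'Allocebus trichotis': 'Lemuriformes', 'Lemuriformes': 'Strepsirrhini',
    'Galago alleni': 'Lorisiformes', 'Lorisiformes': 'Strepsirrhini',
    'Galago moholi': 'Lorisiformes',
}

children = set(tax_dict.keys())    # every taxon that has a parent
parents = set(tax_dict.values())   # every taxon that has a child

roots = parents - children         # parents with no parent of their own
leaves = children - parents        # children with no children of their own

print(sorted(roots))
print(sorted(leaves))
```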
What can we do with this data structure once it's been created? Let's try
writing a function that will list all the parents1 of a given taxon. When
given the name of a taxon as input, it will return a list of all the taxa of
which that taxon is a member. For example, given the input 'Galago
alleni' it should return the list ['Lorisiformes',
'Strepsirrhini', 'Primates']. If we know the number of
ancestors of the node in advance, we can write a function that will do the
job in a slightly clunky way:
def get_ancestors(taxon):
    first_parent = tax_dict.get(taxon)
    second_parent = tax_dict.get(first_parent)
    third_parent = tax_dict.get(second_parent)
    return [first_parent, second_parent, third_parent]
get_three_parents.py
We use the dictionary to look up the parent for the node, then use it again
to look up the parent of the parent, etc. etc. Obviously this will fail when
we try to use it on a node that doesn't have exactly three ancestors.
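The failure is a quiet one, because `get()` returns `None` for a missing key instead of raising an error. The sketch below runs the function on a small subset of the taxonomy (trimmed to keep the example short) to show what happens for a taxon with only one ancestor.

```python
# A subset of the taxonomy dict, just enough for this demonstration
tax_dict = {
    'Galago alleni': 'Lorisiformes',
    'Lorisiformes': 'Strepsirrhini',
    'Strepsirrhini': 'Primates',
    'Haplorrhini': 'Primates',
}

def get_ancestors(taxon):
    first_parent = tax_dict.get(taxon)
    second_parent = tax_dict.get(first_parent)   # None if there is no such key
    third_parent = tax_dict.get(second_parent)
    return [first_parent, second_parent, third_parent]

# Exactly three ancestors: works as intended
print(get_ancestors('Galago alleni'))

# Only one ancestor: the list is padded with None values
print(get_ancestors('Haplorrhini'))
```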
Here's an alternative way to write the function that doesn't rely on us
knowing the number of ancestors in advance. We'll use a while loop to
keep going up the tree until we reach Primates:
def get_ancestors(taxon):
    result = []
    while taxon != 'Primates':
        result.append(tax_dict.get(taxon))
        taxon = tax_dict.get(taxon)  # ❶
    return result
get_parents_while.py
Notice how each time round the loop❶ we set the value of the taxon
variable to be the name of the parent taxon, which then becomes the
child in the next iteration of the while loop1.
Here's another function that does the same job using recursion:
def get_ancestors(taxon):
    if taxon == 'Primates':  # ❶
        return []  # ❷
    else:
        parent = tax_dict.get(taxon)  # ❸
        parent_ancestors = get_ancestors(parent)
        return [parent] + parent_ancestors  # ❹
get_parents_recursive.py
1 This is very similar to the way that the long sequences of a given iteration become the short
sequences of the next iteration in our kmer example at the start of the chapter.
Just as in our kmers example, the recursive function works because there
is a special case for which the function doesn't call itself (the case where
the input taxon is 'Primates'). Let's see how this function works in detail
by adding a few print() statements and calling it:
def get_ancestors(taxon):
    print('calculating ancestors for ' + taxon)
    if taxon == 'Primates':
        print('taxon is Primates, returning an empty list')
        return []
    else:
        print('taxon is not Primates, looking up the parent')
        parent = tax_dict.get(taxon)
        print('the parent is ' + parent + ' ')
        print('looking up ancestors for ' + parent)
        parent_ancestors = get_ancestors(parent)
        print('parent ancestors are ' + str(parent_ancestors))
        result = [parent] + parent_ancestors
        print('about to return the result: ' + str(result))
        return result

get_ancestors('Galago alleni')
get_parents_verbose.py
This looks like a lot of extra code, but all we have done is added a
temporary result variable to hold the result, and added a lot of print()
statements. If we look at the output from this code we can see exactly
what is going on:
get_ancestors('Galago alleni', 0)
get_parents_indented.py
In this version of the function, at the start of each function call we create
a variable called spacer which is just a string of space characters, then
print the spacer at the start of each line of output. Now the output clearly
shows how the levels of function calls build up, then collapse:
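The idea can be sketched as follows. This is a reconstruction from the description above rather than the contents of the example file: the parameter name `depth`, the choice of four spaces per level, and the trimmed-down dict are all assumptions.

```python
# A subset of the taxonomy dict, just enough for this demonstration
tax_dict = {
    'Galago alleni': 'Lorisiformes',
    'Lorisiformes': 'Strepsirrhini',
    'Strepsirrhini': 'Primates',
}

def get_ancestors(taxon, depth):
    # Assumed layout: four spaces of indentation per level of function call
    spacer = ' ' * 4 * depth
    print(spacer + 'calculating ancestors for ' + taxon)
    if taxon == 'Primates':
        print(spacer + 'taxon is Primates, returning an empty list')
        return []
    else:
        parent = tax_dict.get(taxon)
        print(spacer + 'the parent is ' + parent)
        # The recursive call is one level deeper, so it gets a longer spacer
        parent_ancestors = get_ancestors(parent, depth + 1)
        result = [parent] + parent_ancestors
        print(spacer + 'about to return the result: ' + str(result))
        return result

ancestors = get_ancestors('Galago alleni', 0)
```

Each recursive call passes on a depth one greater than its own, so its output is pushed further to the right, and the lines move back to the left as the calls return.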
1 The technical term for this is the depth of the call stack.
Parent-to-child trees
In the previous section, we started off by comparing two different ways of
storing tree-like data, and concluded that storing child-to-parent
relationships was easier than storing parent-to-child relationships. That's
certainly the case for the example code that we looked at, but here I want
to introduce an approach that makes parent-to-child relationships look a
lot better.
Let's start with a simple question: why, in the examples in the previous
section, could we use a dictionary to store child → parent relationships,
but not parent → child relationships? Answer: because keys in a
dictionary have to be unique, so we can't store multiple key value pairs for
parents that have more than one child. In other words, we can't store the
three children of Strepsirrhini like this:
tax_dict = {
    'Strepsirrhini' : 'Lorisidae',
    'Strepsirrhini' : 'Lemuriformes',
    'Strepsirrhini' : 'Lorisiformes'
}
However, we can store all three children under a single key by making the
value a list:
tax_dict = {
    'Strepsirrhini' : ['Lorisidae', 'Lemuriformes', 'Lorisiformes']
}
Using this approach, we can store the exact same set of relationships as
before in a parent → child manner:
new_tax_dict = {
    'Primates': ['Haplorrhini', 'Strepsirrhini'],
    'Tarsiiformes': ['Tarsius tarsier'],
    'Haplorrhini': ['Tarsiiformes', 'Simiiformes'],
    'Simiiformes': ['Hominoidea'],
    'Lorisidae': ['Loris tardigradus'],
    'Lemuriformes': ['Allocebus trichotis'],
    'Lorisiformes': ['Galago alleni', 'Galago moholi'],
    'Hominoidea': ['Pongo abelii', 'Pan troglodytes'],
    'Strepsirrhini': ['Lorisidae', 'Lemuriformes', 'Lorisiformes']
}
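Because both dicts hold the same relationships, the parent → child version does not have to be typed out by hand. Here is one possible way to build it from a child → parent dict, shown on a subset of the taxonomy. The use of `setdefault()` is our own choice for this sketch and is not taken from the original text.

```python
# A subset of the child -> parent dict from the previous section
tax_dict = {
    'Galago alleni': 'Lorisiformes',
    'Galago moholi': 'Lorisiformes',
    'Lorisiformes': 'Strepsirrhini',
    'Lemuriformes': 'Strepsirrhini',
    'Lorisidae': 'Strepsirrhini',
    'Strepsirrhini': 'Primates',
    'Haplorrhini': 'Primates',
}

new_tax_dict = {}
for child, parent in tax_dict.items():
    # Start an empty list the first time we meet a parent,
    # then add the child to that parent's list.
    new_tax_dict.setdefault(parent, []).append(child)

# Sorted, because the order of children depends on the dict's iteration order
print(sorted(new_tax_dict['Strepsirrhini']))
print(sorted(new_tax_dict['Primates']))
```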
1 This idea – of storing a list in a dict – is explored in much more depth in the chapter on
complex data structures.
Now we can start to address a problem that is a mirror of the one in the
previous section: given a taxon, how do we find all its children? Just like
before, we'll look at both iterative and recursive solutions. Here's an
iterative function that returns a list of all the children of the taxon which
is given as its argument:
def get_children(taxon):
    result = []
    stack = [taxon]  # ❶
    while len(stack) != 0:
        current_taxon = stack.pop()  # ❷
        current_taxon_children = new_tax_dict.get(current_taxon, [])
        stack.extend(current_taxon_children)  # ❸
        result.append(current_taxon)  # ❹
    return result
get_children.py
print(get_children('Strepsirrhini'))
print(get_children('Lorisiformes'))
1 A stack is the traditional computer science name for a list where elements are added and
removed from the top – picture a stack of dinner plates.
And here's the output. The result of the first call is wrapped over two lines
as it's too long to fit on the page, but it's still just a single list. Notice that
the input taxon is included in the list – we could modify the function to
remove it, but it's not important for the purposes of this discussion:
The best way to understand how the function works is to picture the fate
of a single taxon as we encounter it. Each taxon is first added onto the
stack variable at the point where we are processing its parent. It is then
(at some point in the future) transferred from the stack variable to the
result variable, having its own children added to the stack variable in
the process. In this manner, taxa are repeatedly added onto and removed
from the stack, until the stack is empty, the while loop ends and the
function can return.
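To watch this happen, we can add a print() statement that shows the stack after each taxon has been processed. The sketch below uses a hypothetical name, `get_children_traced`, and only the Strepsirrhini part of the taxonomy.

```python
# The Strepsirrhini part of the parent -> child dict
new_tax_dict = {
    'Strepsirrhini': ['Lorisidae', 'Lemuriformes', 'Lorisiformes'],
    'Lorisidae': ['Loris tardigradus'],
    'Lemuriformes': ['Allocebus trichotis'],
    'Lorisiformes': ['Galago alleni', 'Galago moholi'],
}

def get_children_traced(taxon):
    result = []
    stack = [taxon]
    while len(stack) != 0:
        current_taxon = stack.pop()                          # take from the end
        stack.extend(new_tax_dict.get(current_taxon, []))    # add its children
        result.append(current_taxon)
        print('processed ' + current_taxon + ', stack is now ' + str(stack))
    return result

children = get_children_traced('Lorisiformes')
print(children)
```

Because pop() removes the last element of the list, the most recently added taxon is always processed first, which is why 'Galago moholi' appears in the result before 'Galago alleni'.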
Contrast the code above with the recursive function:
def get_children_rec(taxon):
    result = [taxon]  # ❶
    children = new_tax_dict.get(taxon, [])  # ❷
    for child in children:
        result.extend(get_children_rec(child))  # ❸
    return result
get_children_recursive.py
Here, we create a single result list❶, which at the start of the function
contains just the taxon that was given as the argument. Then, we look up
the children of that taxon❷ , and for each child, we add its children to the
result using a recursive function call❸. Then we simply return the list. Or
in other words:
"The list of all children of a taxon is the taxon itself plus, for each
child, a list of their children"
Why is the recursive solution so much simpler and clearer than the iterative
solution for this case? One way to look at it is to consider the nature of a
tree-like data structure such as the one we're using to store the taxonomy.
We can describe a taxonomic tree like this: a node in a tree has a name
and some children. The children of a node are also nodes themselves. The
description of a tree is itself recursive – we cannot describe the children
of a node without referring to the definition of a node. In other words, a
node can have children, and those children are also nodes that can have
children, and those children are also nodes, etc. etc. When we are dealing
with a data structure that is fundamentally recursive, like a taxonomic
tree, we shouldn't be surprised to find that the best ways to manipulate it
also turn out to be recursive.
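The recursive description of a node can be written down almost word for word. In the sketch below, which is our own illustration and not code from the book, a node is a dict with a name and a list of children, each of which is itself a node. The function names `build_node` and `count_nodes` are hypothetical.

```python
# The Strepsirrhini part of the parent -> child dict
new_tax_dict = {
    'Strepsirrhini': ['Lorisidae', 'Lemuriformes', 'Lorisiformes'],
    'Lorisidae': ['Loris tardigradus'],
    'Lemuriformes': ['Allocebus trichotis'],
    'Lorisiformes': ['Galago alleni', 'Galago moholi'],
}

def build_node(name):
    """A node has a name and some children, which are also nodes."""
    children = [build_node(child) for child in new_tax_dict.get(name, [])]
    return {'name': name, 'children': children}

def count_nodes(node):
    """Count a node plus all the nodes below it."""
    return 1 + sum(count_nodes(child) for child in node['children'])

tree = build_node('Strepsirrhini')
print(tree['children'][2]['name'])
print(count_nodes(tree))
```

A taxon with no entry in the dict gets an empty list of children, which plays the role of the special case: the function returns without calling itself.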
Recap
We started this chapter by comparing two different ways to generate lists
of all possible DNA sequences of a given length. In doing so, we came
across the idea of a recursive function – one that calls itself on a simpler
version of the input. We saw that recursive functions have two important
properties: a special condition under which they don't call themselves,
and an assurance that the special condition will eventually be reached –
usually when the simplification of the input reaches its limit. Using
recursion to solve the kmer-generating problem revealed its tree-like
structure.
We then looked at a couple of different ways of storing tree-like data in
Python and discovered that, under the right circumstances, both
parent → child and child → parent schemes can be useful1. We examined
two common tree operations – finding lists of parents, and finding lists of
children – and showed how they can both be carried out using either
iterative or recursive functions. Although we were investigating these
operations in the context of taxonomic relationships, they are actually
applicable to many different tree-like data types.
1 See the section on nested lists in the chapter on complex data structures for another way of
storing tree data.
Exercise
Solution
Looking through both lists, we can see that the first taxon that occurs in
both is 'Haplorrhini'.
Implementing this as a function is quite straightforward. We already have
a function that returns a list of parents from earlier in the chapter, so
we'll use that to generate the two lists of parents1. The easiest way to
identify the first taxon that appears in both lists is to go through the
second list one element at a time and return as soon as we find an
element that's also in the first list:
Haplorrhini
Hominoidea
Primates
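The pairwise approach described above can be sketched as follows. This is a reconstruction under stated assumptions, not the code from the exercises folder: the helper name `get_lineage` is hypothetical, it includes the taxon itself at the start of the list so that the function still works when one argument is an ancestor of the other, and the pairs of taxa used in the calls are our own choices.

```python
tax_dict = {
    'Pan troglodytes': 'Hominoidea', 'Pongo abelii': 'Hominoidea',
    'Hominoidea': 'Simiiformes', 'Simiiformes': 'Haplorrhini',
    'Tarsius tarsier': 'Tarsiiformes', 'Haplorrhini': 'Primates',
    'Tarsiiformes': 'Haplorrhini', 'Loris tardigradus': 'Lorisidae',
    'Lorisidae': 'Strepsirrhini', 'Strepsirrhini': 'Primates',
    'Allocebus trichotis': 'Lemuriformes', 'Lemuriformes': 'Strepsirrhini',
    'Galago alleni': 'Lorisiformes', 'Lorisiformes': 'Strepsirrhini',
    'Galago moholi': 'Lorisiformes',
}

def get_lineage(taxon):
    # Hypothetical helper: the taxon itself, followed by its ancestors
    result = [taxon]
    while taxon in tax_dict:
        taxon = tax_dict[taxon]
        result.append(taxon)
    return result

def get_lca(taxon1, taxon2):
    lineage1 = get_lineage(taxon1)
    # Walk up from the second taxon and stop at the first shared name
    for taxon in get_lineage(taxon2):
        if taxon in lineage1:
            return taxon

print(get_lca('Pan troglodytes', 'Tarsius tarsier'))
print(get_lca('Pan troglodytes', 'Galago alleni'))
print(get_lca('Hominoidea', 'Pongo abelii'))
```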
Now for the interesting part – writing a function that uses the above
function to find the last common ancestor for a list of taxa. As suggested
in the problem description, we'll try both an iterative and a recursive
solution. For the iterative solution, we'll remove taxa from the list one by
one and repeatedly find the last common ancestor of the current taxon
and the current last common ancestor. Here's the code for a function that
implements this approach, along with a test function call:
def get_lca_list(taxa):
    taxon1 = taxa.pop()
    while len(taxa) > 0:
        taxon2 = taxa.pop()
        lca = get_lca(taxon1, taxon2)
        print('LCA of ' + taxon1 + ' and ' + taxon2 + ' is ' + lca)
        taxon1 = lca
    return taxon1
get_lca_iterative.py
def get_lca_list_rec(taxa):
    print("getting lca for " + str(taxa))
    if len(taxa) == 2:
        return get_lca(taxa[0], taxa[1])
    else:
        taxon1 = taxa.pop()
        taxon2 = get_lca_list_rec(taxa)
        return get_lca(taxon1, taxon2)
get_lca_recursive.py
Let's analyse how this function works in terms of the requirements for
recursive functions that we saw earlier in the chapter. First, the special
case: for this function, the special case (for which the function doesn't
call itself) occurs when the list of taxa has only two elements. For this
special case, we can find the last common ancestor of the list by using the
existing get_lca() function. Second, the simplification process: for
this function, the simplification takes place because one taxon is removed
from the list at each function call. This guarantees that the special case
will eventually be reached.
1 This is the reason why our get_lca function has to work correctly when one of the
arguments is also the result.
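One side effect worth knowing about is that pop() removes taxa from the list that the caller passed in, so the original list is shortened by the time the function returns. If that matters, the same simplification can be done with a slice, which makes a shorter copy instead. The sketch below is a variation of our own; `get_lineage` and this version of `get_lca` are assumed helpers written for the illustration, and the dict is a subset of the taxonomy.

```python
# A subset of the child -> parent dict
tax_dict = {
    'Pan troglodytes': 'Hominoidea', 'Pongo abelii': 'Hominoidea',
    'Hominoidea': 'Simiiformes', 'Simiiformes': 'Haplorrhini',
    'Tarsius tarsier': 'Tarsiiformes', 'Tarsiiformes': 'Haplorrhini',
    'Haplorrhini': 'Primates',
}

def get_lineage(taxon):
    # Assumed helper: the taxon itself, followed by its ancestors
    result = [taxon]
    while taxon in tax_dict:
        taxon = tax_dict[taxon]
        result.append(taxon)
    return result

def get_lca(taxon1, taxon2):
    lineage1 = get_lineage(taxon1)
    for taxon in get_lineage(taxon2):
        if taxon in lineage1:
            return taxon

def get_lca_list_rec(taxa):
    if len(taxa) == 2:
        # Special case: a pair can be handled directly
        return get_lca(taxa[0], taxa[1])
    # taxa[1:] is a copy that is one element shorter, so the special
    # case is still reached and the caller's list is left untouched
    return get_lca(taxa[0], get_lca_list_rec(taxa[1:]))

taxa = ['Pan troglodytes', 'Tarsius tarsier', 'Pongo abelii']
print(get_lca_list_rec(taxa))
print(len(taxa))
```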
When looking at the solution code in the exercises folder, pay attention to
the structure. There are three separate functions: one which calculates a
list of ancestors for a single taxon, one which calculates the last common
ancestor for a pair of taxa, and one which calculates the last common
ancestor for a list of many taxa.
The two solutions we came up with above show quite nicely the
relative strengths and weaknesses of the iterative and recursive
approaches. The recursive approach is easier to read and provides less
scope for bugs, but the execution logic can be hard to follow if you're not
used to recursion. The iterative approach, on the other hand, requires the
programmer to carefully manage the taxon list. In the end, the choice of
recursion or iteration to solve a given problem is in the hands of the
programmer, and often comes down to how you feel more comfortable
expressing the solution to a problem. Arguably, iterative programming
involves telling the computer how to find the solution to a given
problem, whereas recursive programming involves telling the computer
what the solution looks like.
Chapter 3: Complex data structures
Tuples
Tuples are a built-in type of data structure that is part of the core Python
distribution. At first glance, tuples appear very similar to lists – they have
multiple elements, we can retrieve a single element using square brackets,
and we can iterate over the elements of a tuple. The only apparent
difference is that we define them using parentheses rather than square
brackets:
t = (4, 5, 6)
print(t[1])
for e in t:
    print(e+1)
tuple.py
5
5
6
7
Our first clue about the role of tuples comes when we try to change one of
the elements:
t[1] = 9
The reason for this error is that a tuple cannot be changed once it has
been created. Not only are we not allowed to change one of the elements,
but we also can't append or remove elements, reverse or sort the
elements, or carry out any other operation that changes the tuple. Data
structures that have this property are said to be immutable1.
It's not clear at first why this is a useful property to have – surely the
point of variables is that they should be able to vary? But there are
advantages to using tuples in some situations. Knowing that the value of
a particular variable can't change after assignment can make it easier to
reason about the behaviour of your code2, and can allow Python to make
certain optimizations which can sometimes result in faster or more
memory-efficient code. Also, being immutable allows tuples to be used as
the keys to a dict – something which is not possible with lists.
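As a minimal sketch of these two points, the code below shows what happens when we try to change a tuple, and how a tuple (but not a list) can act as a dict key. The snp_bases dict and its contents are invented purely for illustration.

```python
t = (4, 5, 6)

# Trying to change an element of a tuple raises a TypeError
try:
    t[1] = 9
except TypeError as error:
    print('Error: ' + str(error))

# A tuple can be used as a dict key, e.g. a (chromosome, position) pair
snp_bases = {('chr1', 1042): 'a', ('chr2', 77): 'g'}
print(snp_bases[('chr1', 1042)])

# A list cannot be used as a key, because lists are mutable
try:
    bad = {['chr1', 1042]: 'a'}
except TypeError as error:
    print('Error: ' + str(error))
```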
As a rule, tuples work well for heterogeneous data: sequences of elements
which represent different bits of information, and where the position of
each element tells you something about what the element stores. For
example, here's a bit of code that creates a number of 3-element tuples to
represent DNA sequence records. Each tuple stores a sequence, an
accession number, and a genetic code, and they are always in the same
order:
1 Strings are also immutable in Python, which seems odd but doesn't generally cause problems.
2 This is a similar idea to that of pure functions – see the chapter on functional programming
for more discussion.
t1 = ('actgctagt', 'ABC123', 1)
t2 = ('ttaggttta', 'XYZ456', 1)
t3 = ('cgcgatcgt', 'HIJ789', 5)
Sets
A fairly frequent task in programming is to keep a list of items that share
some common property. For example, imagine we are writing a program
that processes a long list of accession numbers:
processed = []
for acc in accessions:
    if acc not in processed:
        # do some processing
        processed.append(acc)
The problem with this approach is that the operation of testing whether a
particular value is in a list takes a long time if the list is large. The above
code will start out fast, but as the size of the processed list grows, so
will the time required to check it on each iteration.
1 Or, if you're familiar with object oriented programming, lightweight immutable objects.
A faster alternative is to use a dict. We know that it's very quick to look up
a value in a dict, so this approach will be much faster1:
processed = {}
for acc in accessions:
    if acc not in processed:
        # do some processing
        processed[acc] = 1
However, it's still not quite satisfactory – we are wasting a lot of memory
storing all those values, all of which are 1, which we never look up.
Python's set type is like a dict that doesn't store any values – it simply
stores a collection of keys and allows us to very rapidly check whether or
not a particular key is in the set. Using a set is straightforward; we create
an empty one using the set() function, add elements to it using add(),
and check if a given element is in the set using the in operator:
processed = set()
for acc in accessions:
    if acc not in processed:
        # do some processing
        processed.add(acc)
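We can also create a set that already contains some elements by writing
them inside curly brackets:
my_set = {4,7,6,12}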
Although this looks very similar to the way that we create a dict, it's easy
to spot the difference – inside the brackets are individual elements, not
key-value pairs.
1 On my computer it's around one thousand times faster when processed holds a million
elements.
Sets also have various useful methods for carrying out set operations like
intersections, differences and unions, and can very quickly answer
questions like "are all elements in my first set also in my second set?".
See the chapter on performance in Effective Python development for
biologists for an in-depth look at the relative speed of lists and sets for
different jobs.
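Here is a short sketch of the set operations mentioned above, using small made-up sets of numbers. The variable names are chosen just for this example.

```python
first = {1, 2, 3, 4, 5, 6}
second = {4, 5, 6, 7, 8}
small = {2, 3}

# Elements that are in both sets
in_both = first.intersection(second)
# Elements that are in at least one of the sets
in_either = first.union(second)
# Elements that are in the first set but not the second
only_in_first = first.difference(second)

print(in_both)
print(in_either)
print(only_in_first)

# "Are all elements in my first set also in my second set?"
print(small.issubset(first))
print(small.issubset(second))
```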
Lists of lists
Let's look at a couple of examples of lists:
[1,2,3,4]
['a', 'b', 'c']
You've almost certainly encountered lists of numbers and strings like this
in your programming so far. However, we're certainly not restricted to
numbers and strings when constructing lists: we can make a list of file
objects:
[open('one.txt'), open('two.txt')]
import re
[re.search(r'[^ATGC]', 'ACTRGGT'), re.search(r'[^ATGC]', 'ACTYGGT')]
[[1,2,3],[4,5,6],[7,8,9]]
# more readably
[[1,2,3],
 [4,5,6],
 [7,8,9]]
The data structure in the code above is a list of lists, sometimes known as
a two-dimensional list. It may help to think of it like a table or a
spreadsheet, where the elements of the outer list are rows, and the
elements of the inner lists are cells. Although it looks weird, a list of lists
behaves just like any other list. We can retrieve a single element using the
normal syntax:
lol = [[1,2,3],[4,5,6],[7,8,9]]
print(lol[1])
# prints [4,5,6]
and then manipulate the returned list in exactly the same way:
lol = [[1,2,3],[4,5,6],[7,8,9]]
l = lol[1]
print(l[2])
# prints 6
lol = [[1,2,3],[4,5,6],[7,8,9]]
print(lol[1][2])
# prints 6
seq = aln[2]
char = aln[2][3]
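The definition of aln does not appear alongside the two lines above. As a minimal sketch, assume that it is a list of lists in which each inner list holds the characters of one aligned sequence (the sequences here are invented). The same indexing then picks out a row, a single character, or, with a list comprehension, a whole column.

```python
# Hypothetical alignment: one inner list of characters per sequence
aln = [
    list('ATTCGGAT'),
    list('ATACGG-T'),
    list('ATCCGGAT'),
]

seq = aln[2]        # the third sequence (a list of characters)
char = aln[2][3]    # the fourth character of the third sequence

# One character from each row gives a column of the alignment
column = [row[2] for row in aln]

print(''.join(seq))
print(char)
print(column)
```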
records = [
    {'name' : 'actgctagt', 'accession' : 'ABC123', 'genetic_code' : 1},
    {'name' : 'ttaggttta', 'accession' : 'XYZ456', 'genetic_code' : 1},
    {'name' : 'cgcgatcgt', 'accession' : 'HIJ789', 'genetic_code' : 5}
]
list_of_dicts.py
The dicts that make up the elements of this list are different from most of
the ones we've seen before in two important ways. Firstly, they don't have
names – in other words, they are not assigned to a variable (we call values
like this anonymous, so each element of the list is an anonymous dict). This
looks strange because we're used to storing dicts in variables, but in fact
it's no stranger than the fact that the elements of the list [1,2,3] are
anonymous integers.
1 Take a look at the chapter on generators for an explanation of this particular bit of syntax.
Secondly, each value in one of the dicts above represents a different type
of information – a DNA sequence, an accession number, and a genetic
code – and they are a mixture of strings and numbers. In previous
examples where we've used dicts, they've been storing pairs of data where
each pair stores the same kind of thing – for example, restriction enzyme
names and their cut motifs:
enzymes = {
    'EcoRI' : r'GAATTC',
    'AvaII' : r'GG(A|T)CC',
    'BisI' : r'GC[ATGC]GC'
}
In the dicts that comprise the elements of our list, the data are stored
very differently: the keys are simply labels which describe their values.
Just as with lists of lists, we can refer to an entire dict just by giving its
index:
one_record = records[2]
but we're usually more interested in iterating over the dicts. The use of
label-type keys leads to a very readable way of processing the dicts – for
example, to print out the accession number and genetic code for each
record:
list_of_dicts.py
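A loop of the kind just described might look something like the sketch below. It uses the same records list as above, and the wording of the printed labels is an assumption.

```python
records = [
    {'name' : 'actgctagt', 'accession' : 'ABC123', 'genetic_code' : 1},
    {'name' : 'ttaggttta', 'accession' : 'XYZ456', 'genetic_code' : 1},
    {'name' : 'cgcgatcgt', 'accession' : 'HIJ789', 'genetic_code' : 5}
]

# Each record is a dict, so we look up values using the label-type keys
for record in records:
    print('accession number : ' + record['accession'])
    print('genetic code: ' + str(record['genetic_code']))
```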
Recall from earlier in the chapter that tuples are good for storing this
kind of heterogeneous data. We could also store our collection of records
using a list of tuples:
records = [
    ('actgctagt', 'ABC123', 1),
    ('ttaggttta', 'XYZ456', 1),
    ('cgcgatcgt', 'HIJ789', 5)
]
list_of_tuples.py
which avoids the need to store strings like 'accession' multiple times,
and instead relies on the ordering of the elements in each tuple to
identify them. We can refer to individual elements of each tuple using the
index:
list_of_tuples.py
This is known as unpacking the tuple, and leads to very readable code
when the number of elements in the tuple is small.
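The two styles can be compared in the following sketch, which processes the same list of tuples first by index and then by unpacking. The variable names used for unpacking are illustrative choices.

```python
records = [
    ('actgctagt', 'ABC123', 1),
    ('ttaggttta', 'XYZ456', 1),
    ('cgcgatcgt', 'HIJ789', 5)
]

# Referring to elements by index: we have to remember what each position means
for record in records:
    print('accession number : ' + record[1])
    print('genetic code: ' + str(record[2]))

# Unpacking: each element of the tuple gets its own descriptive name
accessions = []
for sequence, accession, genetic_code in records:
    print('accession number : ' + accession)
    print('genetic code: ' + str(genetic_code))
    accessions.append(accession)
```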
Dicts of sets
When we are dealing with multiple sets in a program, storing them all in a
dict offers a convenient way to label them without creating a bunch of
extra variables. Imagine we have collected lists of genes (identified by
accession numbers) that are over-expressed by some organism of interest
when exposed to various types of heavy metal contaminants. We'll store
the gene lists as a dict of sets, where the keys of the dict are the names of
the heavy metals, and the values are the sets of genes1:
1 Obviously, in a real life analysis these would be created by reading the gene lists from a file
rather than hard-coding them.
gene_sets = {
    'arsenic' : {1,2,3,4,5,6,8,12},
    'cadmium' : {2,12,6,4},
    'copper' : {7,6,10,4,8},
    'mercury' : {3,2,4,5,1}
}
dict_of_sets.py
3 in gene_sets['arsenic']
# True
mercury
arsenic
set_one.issubset(set_two)
We can use two loops to iterate over our dict and carry out a pairwise
comparison of our gene sets to identify conditions where all the over-
expressed genes are also over-expressed in some other condition:
dict_of_sets.py
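A pairwise comparison like the one described could be sketched as follows. The subset_pairs list and the wording of the printed message are assumptions added for illustration.

```python
gene_sets = {
    'arsenic' : {1,2,3,4,5,6,8,12},
    'cadmium' : {2,12,6,4},
    'copper' : {7,6,10,4,8},
    'mercury' : {3,2,4,5,1}
}

subset_pairs = []
for condition1, set1 in gene_sets.items():
    for condition2, set2 in gene_sets.items():
        # Skip comparing a condition with itself
        if condition1 != condition2 and set1.issubset(set2):
            print(condition1 + ' is a subset of ' + condition2)
            subset_pairs.append((condition1, condition2))
```

With these data, the cadmium and mercury sets are each entirely contained within the arsenic set.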
Dicts of tuples
Think back to the examples we looked at earlier in the chapter for storing
a collection of DNA sequence records. We saw how these data could be
stored using a list of tuples:
records = [
    ('actgctagt', 'ABC123', 1),
    ('ttaggttta', 'XYZ456', 1),
    ('cgcgatcgt', 'HIJ789', 5)
]
This approach worked well when we wanted to iterate over the records,
but is not very good if we want to retrieve a specific record. Finding a
record for which we know the accession number, for example, requires us
to look at each record in turn until we find the one we want:
This is not only an extra chunk of typing, but the time required to carry
out the search will grow linearly with the number of records.
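A linear search of this kind might look like the following sketch (the variable name my_record is an illustrative choice):

```python
records = [
    ('actgctagt', 'ABC123', 1),
    ('ttaggttta', 'XYZ456', 1),
    ('cgcgatcgt', 'HIJ789', 5)
]

# Look at each record in turn until we find the accession we want
my_record = None
for record in records:
    if record[1] == 'XYZ456':
        my_record = record
        break

print(my_record)
```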
If we know that we're frequently going to want to look up a record using a
particular element – and, importantly, we are confident that element is
unique to each record – we can store the data as a dict of tuples instead.
To do this, we take the tuple element that uniquely identifies each record,
and turn it into the key in a dict. The remaining tuple elements are the
value:
records = {
    'ABC123' : ('actgctagt', 1),
    'XYZ456' : ('ttaggttta', 1),
    'HIJ789' : ('cgcgatcgt', 5)
}
dict_of_tuples.py
Now looking up the record for an accession is simply a matter of using the
usual dict get method:
my_record = records.get('XYZ456')
and we can even combine this with tuple unpacking to achieve a very
clear, readable syntax:
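As a sketch of what that combination might look like (the variable names and the made-up accession 'NOPE000' are assumptions):

```python
records = {
    'ABC123' : ('actgctagt', 1),
    'XYZ456' : ('ttaggttta', 1),
    'HIJ789' : ('cgcgatcgt', 5)
}

# Look up the tuple and unpack its two elements in one statement
my_sequence, my_genetic_code = records.get('XYZ456')
print(my_sequence)
print(my_genetic_code)

# get() returns None for a missing key, so unpacking would fail in that case
missing = records.get('NOPE000')
print(missing)
```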
Dicts of lists
One of the frustrations that beginners tend to run into when using dicts is
the restriction that keys must be unique. At first glance, this makes dicts
seem a lot less useful than they ought to be, as often in programming we
want rapid lookup of multiple values associated with a single key. Using
lists as the values in a dict offers a way round this restriction, and exposes
the full usefulness of dicts.
Let's look at an example we've seen before in Python for Biologists: kmer
counting. Previously, our goal has always been to count the number of
times each kmer appears in a given DNA sequence, and we've usually
ended up with a dict where the keys are the kmers and the values are the
counts. But what if we're interested not just in the number of times a
particular kmer occurs, but in the positions where it occurs? Let's remind
ourselves of the standard way we're used to tackling this problem:
dna = 'aattggaattggaattg'
k = 4
kmer2count = {}
for start in range(len(dna) - k + 1): ❶
    kmer = dna[start:start + k] ❷
    current_count = kmer2count.get(kmer, 0) ❸
    kmer2count[kmer] = current_count + 1 ❹
print(kmer2count)
In the above code we iterate over each possible start position using a
range❶ and extract the kmer❷. We then look up the current count for
that kmer from the dict❸, using a default value of zero if the kmer isn't
already in the dict. Finally, we update the value in the dict for the kmer to
be the current count plus one❹. We can see from the output that the
result of running the code is a dict where the keys are kmers and the
values are counts:
for a kmer that isn't currently in the dict will be an empty list, and rather
than adding one to the count in each iteration, we'll append a position to
the list:
dna = 'aattggaattggaattg'
k = 4
kmer2list = {}
for start in range(len(dna) - k + 1):
    kmer = dna[start:start + k]
    list_of_positions = kmer2list.get(kmer, [])
    list_of_positions.append(start)
    kmer2list[kmer] = list_of_positions
print(kmer2list)
dict_of_lists.py
As we can see from the output, what we end up with is a dict of lists:
{'ggaa': [4, 10], 'aatt': [0, 6, 12], 'gaat': [5, 11], 'tgga':
[3, 9], 'attg': [1, 7, 13], 'ttgg': [2, 8]}
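Notice how the order in which the elements are stored in the dict bears no
relation to the order of the start positions – remember, dicts had no
inherent ordering in older versions of Python (from Python 3.7 onwards they
keep insertion order, so your output may be ordered differently). We can
manipulate the items in the dict using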
standard tools, as long as we remember that every value is itself a list. For
example, we can reconstruct our dict of kmer counts from the dict of start
positions using a dict comprehension1 which asks for the length of each
start position list:
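A dict comprehension along those lines could be sketched like this, rebuilding kmer2list first so that the example runs on its own:

```python
dna = 'aattggaattggaattg'
k = 4

# Build the dict of start positions as before
kmer2list = {}
for start in range(len(dna) - k + 1):
    kmer = dna[start:start + k]
    list_of_positions = kmer2list.get(kmer, [])
    list_of_positions.append(start)
    kmer2list[kmer] = list_of_positions

# The count for each kmer is just the length of its list of positions
kmer2count = {kmer: len(positions) for kmer, positions in kmer2list.items()}
print(kmer2count)
```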
(dog,(raccoon,bear),((sea_lion,seal),((monkey,cat),weasel)));
If we replace the parentheses with square brackets and put quotes around
the names of the taxa, we get a valid bit of Python code that describes a
nested set of lists:
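Carrying out exactly that replacement on the Newick string above would give something like the following (the variable name tree is an assumption made so that the list can be used later):

```python
# Each pair of parentheses becomes a pair of square brackets,
# and each taxon name becomes a quoted string
tree = ['dog', ['raccoon', 'bear'], [['sea_lion', 'seal'], [['monkey', 'cat'], 'weasel']]]

print(tree[0])
print(tree[1])
print(tree[2][1][0])
```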
Splitting the list definition over multiple lines doesn't change it, but
makes it easier to read, and shows the tree-like structure a bit better:
1 http://evolution.genetics.washington.edu/phylip/newicktree.html
[
    'dog',
    [
        'raccoon','bear'
    ],
    [
        [
            'sea_lion','seal'
        ],
        [
            [
                'monkey','cat'
            ],
            'weasel'
        ]
    ]
]
1 An alternative way to do this would be simply to flatten the list (i.e. turn it into a one-
dimensional list) and look for the element.
leaf_in_subtree.py
We start off by defining a variable to hold the result, which will be False
by default❶. The function then looks at each element in the input list and
asks whether or not that element is itself a list using the isinstance()
function❷. If it is a list, then it recursively calls itself on the element to
check if it contains the target. If not then it determines whether the
element is the target that we're looking for. If either of these possibilities
is true, then we know that the list does contain the target, so the result is
set to True❸. If it gets to the end of the input list without finding the
target, then the result remains False. Either way, the value of result is
returned.
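The description above can be sketched as the following function. The function name contains matches the calls used in this section, but the argument names and the exact layout are assumptions rather than the contents of leaf_in_subtree.py.

```python
def contains(subtree, target):
    # The result is False unless we find the target somewhere
    result = False
    for element in subtree:
        if isinstance(element, list):
            # The element is itself a list, so search inside it recursively
            if contains(element, target):
                result = True
        elif element == target:
            # The element is a leaf, so compare it directly to the target
            result = True
    return result

print(contains(['dog', ['raccoon', 'bear']], 'bear'))
print(contains(['dog', ['raccoon', 'bear']], 'cat'))
```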
Because the function is recursive we can use it on nested lists of any
depth and elements of any type:
assert contains([1,2,3], 2)
assert contains([1,[2,3],[4,5]], 5)
assert contains([['sea_lion','seal'],['monkey','cat'], 'weasel'], 'cat')
In programming terms, what we have done here is found a list of all the
sublists that contain the strings 'monkey' and 'cat'. In phylogenetic
terms, we have found a list of the clades that contain these two
def count_leaves(subtree):
    total = 0
    for element in subtree:
        if isinstance(element, list):
            total = total + count_leaves(element)
        else:
            total = total + 1
    return total
The first element of the sorted list of subtrees – the one with the fewest
leaf nodes – is the smallest possible clade which contains our two taxa,
and hence represents finding the last common ancestor of the two:
['monkey', 'cat']
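Putting the pieces together, the whole process might be sketched as below. The all_subtrees helper, the variable names, and the tree (written out from the Newick string earlier in the chapter) are assumptions made for illustration. contains() and count_leaves() follow the descriptions given above.

```python
tree = ['dog', ['raccoon', 'bear'],
        [['sea_lion', 'seal'], [['monkey', 'cat'], 'weasel']]]

def contains(subtree, target):
    # True if target is a leaf anywhere inside subtree
    for element in subtree:
        if isinstance(element, list):
            if contains(element, target):
                return True
        elif element == target:
            return True
    return False

def count_leaves(subtree):
    # Number of leaves (non-list elements) at any depth
    total = 0
    for element in subtree:
        if isinstance(element, list):
            total = total + count_leaves(element)
        else:
            total = total + 1
    return total

def all_subtrees(subtree):
    # The subtree itself plus every list nested inside it
    result = [subtree]
    for element in subtree:
        if isinstance(element, list):
            result.extend(all_subtrees(element))
    return result

# Keep only the subtrees that contain both taxa, smallest first
candidates = [s for s in all_subtrees(tree)
              if contains(s, 'monkey') and contains(s, 'cat')]
candidates.sort(key=count_leaves)
print(candidates[0])
```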
Recap
This chapter has been all about ways of storing data: an introduction to
two new data types and a discussion of complex data structures.
The new data types are intended for fairly specific uses, and you can get
away without using them for most programs – we can use lists instead of tuples,
and dicts instead of sets – but having them in your toolbox leads to code
that's more readable and robust. If you're used to programming with just
lists and dicts, it might not be obvious when you encounter good
opportunities to use tuples and sets, so keep them at the back of your
mind when writing code in future (hopefully the examples in the rest of
this book will provide some inspiration).
Complex data structures are conceptually quite simple, but thinking and
reasoning about them takes a bit of getting used to. We've seen in this
chapter examples of the most common and useful types of complex data
structures, involving lists, dicts, sets and tuples in various combinations.
The examples that we've discussed illustrate an important point in
programming: that picking the right representation for your data can
make a big difference in how difficult it is to process them. Experience,
along with exposure to real life examples, will make it clear to you which
data structures are suitable for tackling which types of problems.
Exercises
Distance matrix
One of the questions we might want to ask of our heavy metal gene
response data from earlier in the chapter is: which types of
contamination provoke similar responses in expression levels? To answer
it, we need to come up with a way of measuring how similar the sets of
over-expressed genes are for any two given conditions. A straightforward
metric is to divide the number of genes shared between the two lists (the
intersection) by the total number of genes in both lists (the union). Write a
program that will start with a dict of sets (use the heavy metal data as an
example) and produce a pairwise similarity matrix stored in a dict of dicts
– i.e. we should be able to get the similarity score for two conditions by
writing something like:
score = similarity_matrix['arsenic']['cadmium']
Solutions
Distance matrix
There are two parts to this problem: calculating the similarity scores and
creating the dict to hold the results. It's quite easy to calculate the
similarity score for a given pair of sets – the names of the methods we
need to use (union and intersection) are helpfully mentioned in the
problem description, so we just have to divide the number of elements in
the intersection by the number of elements in the union1:
gene_sets = {
    'arsenic' : {1,2,3,4,5,6,8,12},
    'cadmium' : {2,12,6,4},
    'copper' : {7,6,10,4,8},
    'mercury' : {3,2,4,5,1}
}
set1 = gene_sets['arsenic']
set2 = gene_sets['mercury']
similarity = len(set1.intersection(set2)) / len(set1.union(set2))
# similarity == 0.625
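To apply the same calculation to every pair of conditions we can use two nested loops. The following is a sketch of how that might look. The pairs list is an extra added here for illustration, and the code assumes Python 3, where / always gives a floating point result.

```python
gene_sets = {
    'arsenic' : {1,2,3,4,5,6,8,12},
    'cadmium' : {2,12,6,4},
    'copper' : {7,6,10,4,8},
    'mercury' : {3,2,4,5,1}
}

pairs = []
for condition1, set1 in gene_sets.items():
    for condition2, set2 in gene_sets.items():
        # Don't compare a condition with itself
        if condition1 != condition2:
            similarity = len(set1.intersection(set2)) / len(set1.union(set2))
            print(condition1, condition2, similarity)
            pairs.append((condition1, condition2))
```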
Notice from the output that we are considering each pair in both
directions (e.g. mercury vs. copper and copper vs. mercury). We could avoid
this, but it doesn't matter for the purpose of this exercise and it might be
handy to be able to look up similarity scores in either direction, so we'll
leave it.
Next, we come to the problem of storing the results. It's tempting to think
that we can just create an empty dict before we start our loops and add
elements at each iteration:
similarity_scores = {}
for condition1, set1 in gene_sets.items():
    for condition2, set2 in gene_sets.items():
        if condition1 != condition2:
            similarity = len(set1.intersection(set2)) / len(set1.union(set2))
            similarity_scores[condition1][condition2] = similarity
but this gives us a KeyError. The problem here is that the final line
requires there to be a value associated with the key condition1 in the
similarity_scores dict, but we have never created one. Some
programming languages (notably Perl) will automatically create keys in a
similarity_scores = {}
for condition1, set1 in gene_sets.items():
    similarity_scores[condition1] = {} ❶
    for condition2, set2 in gene_sets.items():
        if condition1 != condition2:
            similarity = len(set1.intersection(set2))/len(set1.union(set2))
            similarity_scores[condition1][condition2] = similarity
or we create a temporary dict at the start of the outer loop to hold the
scores for the current iteration❶, and assign it to condition1 at the end
of the outer loop after it's been populated with scores❷:
similarity_scores = {}
for condition1, set1 in gene_sets.items():
    single_similarity = {} ❶
    for condition2, set2 in gene_sets.items():
        if condition1 != condition2:
            similarity = len(set1.intersection(set2))/len(set1.union(set2))
            single_similarity[condition2] = similarity
    similarity_scores[condition1] = single_similarity ❷
distance_matrix.py
Either way will work, and both will result in the same complex data
structure being stored in similarity_scores:
{'mercury':
{'copper': 0.1111111111111111,
'arsenic': 0.625,
'cadmium': 0.2857142857142857},
'copper':
{'mercury': 0.1111111111111111,
'arsenic': 0.3,
'cadmium': 0.2857142857142857},
'arsenic':
{'mercury': 0.625,
'copper': 0.3,
'cadmium': 0.5},
'cadmium':
{'mercury': 0.2857142857142857,
'copper': 0.2857142857142857,
'arsenic': 0.5}
}
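One way to tidy up the inner loop is to build each inner dict with a dict comprehension while keeping the outer for loop. The following is a sketch of how that might look, under the assumption that the variable names match the procedural version:

```python
gene_sets = {
    'arsenic' : {1,2,3,4,5,6,8,12},
    'cadmium' : {2,12,6,4},
    'copper' : {7,6,10,4,8},
    'mercury' : {3,2,4,5,1}
}

similarity_scores = {}
for condition1, set1 in gene_sets.items():
    # The comprehension builds the whole inner dict for condition1
    similarity_scores[condition1] = {
        condition2 : len(set1.intersection(set2)) / len(set1.union(set2))
        for condition2, set2 in gene_sets.items()
        if condition1 != condition2
    }

print(similarity_scores['arsenic']['cadmium'])
```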
In the above code, the dict comprehension has been split over multiple
lines to make it easier to read, but it's a single Python statement. Whether
you consider the procedural version or the comprehension version easier
to read is a matter of personal preference and background.
We can even take things a step further and replace the outer for loop
with another dict comprehension. Using two nested comprehensions in
this way looks odd, but the key to reading it is to start with the innermost
set of brackets. Using this approach we can transform our collection of
gene sets into a collection of pairwise similarity scores in a single
statement:
similarity_scores = {
    c1: {
        c2 : len(s1.intersection(s2)) / len(s1.union(s2))
        for c2,s2 in gene_sets.items()
        if c1 != c2
    }
    for c1,s1 in gene_sets.items()
}
taxon2 but not taxon3. We can reuse most of the code for the function
we saw earlier in the chapter to iterate over subtrees:
monophyly.py
Chapter 4: object oriented Python
Introduction
If you've worked your way through Python for Biologists1 you'll be vaguely
aware that there are different "types" of things in Python – strings,
numbers, files, regular expressions, etc. You may also have heard
references to something called object oriented programming, which is
often presented as a scary and esoteric technique that involves a lot of
complicated-sounding concepts like inheritance, composition, and
polymorphism. While it's true that there are many corners of the object
oriented world which are daunting for the novice programmer, at its heart
object oriented programming is simply the practice of creating new
types of things.
In many ways, learning the tools of object oriented programming is much
like learning to write functions. When we first learn to write small
programs as complete beginners, we are content to use the built in
functions and methods of Python as building blocks. Later, we learn to
write our own functions, and find that this allows us to write larger and
more complex programs more easily. It's the same with types: it's
perfectly possible to write good and useful programs using only the types
provided by the Python language, but learning how to create our own
makes it much easier to solve a wide range of problems.
Indeed, the advantages of writing our own classes and writing our own
functions are very similar. When we write our own functions, the basic
building blocks that we have to use are the built in Python functions and
methods. Writing our own functions doesn't allow us to do anything that
we couldn't do before, it just allows us to structure our code in a way
that's much easier to read and write, and hence to write much larger and
more sophisticated programs (think back to the concept of encapsulation).
Similarly, learning to create our own types of object doesn't allow us to do
anything fundamentally different – but it does let us structure code in a
way that allows for much greater flexibility.
1 Or pretty much any other introductory Python book
Before we dive in, a few sentences about nomenclature. So far, in both
Python for Biologists and in this book, we've tended to use the word type to
refer to the fact that values in Python come in different flavours, and the
word thing to refer to a bit of data like a string or a file. For the purposes
of this chapter, where our goal is to explicitly talk about objects, we are
going to have to be a bit more precise. From now on, we will refer to an
individual thing as an object, and the type of thing as its class. Instead of
saying that a particular thing is of a particular type, we will say that a
given object is an instance of a class. For example, in this line of code:
input = open("somedata.txt")
we will say that the input variable refers to an object that is an instance
of the File class.
One common pitfall when learning about objects is getting confused
about the difference between objects and classes, so I'll spell it out
explicitly here. A class is like a blueprint for building objects. Defining a
class doesn't cause any objects to be created directly, it simply describes
what an instance of that class will look like when we do create it. Once
we've defined a class, we can create as many instances – i.e. as many
objects that use that class as a blueprint – as we like.
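We can see the distinction using a class that we already know. In this small sketch, list is the class and the two variables (whose names are chosen just for this example) refer to two separate instances of it:

```python
# Two separate objects built from the same blueprint (the list class)
first_list = [1, 2, 3]
second_list = [1, 2, 3]

# Both objects are instances of the list class...
print(isinstance(first_list, list))
print(isinstance(second_list, list))

# ...but they are two different objects, even though their contents are equal
print(first_list is second_list)
print(first_list == second_list)

# Changing one instance has no effect on the other
first_list.append(4)
print(second_list)
```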
Some of the nomenclature for talking about classes we already know – for
example, we know from Python for Biologists that a piece of code which
belongs to objects of a particular class is a method.
def get_AT(my_dna):
    length = len(my_dna)
    a_count = my_dna.count('A')
    t_count = my_dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content
def complement(my_dna):
    replacement1 = my_dna.replace('A', 't')
    replacement2 = replacement1.replace('T', 'a')
    replacement3 = replacement2.replace('C', 'g')
    replacement4 = replacement3.replace('G', 'c')
    return replacement4.upper()
two_functions.py
dna_sequence = "ACTGATCGTTACGTACGAGTCAT"
print(get_AT(dna_sequence))
print(complement(dna_sequence))
0.5652173913043478
TGACTAGCAATGCATGCTCAGTA
dna_sequence = "ACTGATCGTTACGTACGAGTCAT"
species = "Drosophila melanogaster"
gene_name = "ABC1"
print("Looking at the " + species + " " + gene_name + " gene")
print("AT content is " + str(get_AT(dna_sequence)))
print("complement is " + complement(dna_sequence))
two_functions.py
But this obviously will not scale. To store a large collection of sequences
along with their gene names and species names requires a different
approach. We could create two dictionaries – one to store sequences and
gene names, and one to store sequences and species names. But what if
we want to store two sequences that are the same, but belong to different
species? The dictionary approach won't work, since keys have to be
unique. And it seems unlikely that we're going to want to look up the gene
name for a given sequence anyway – if anything, it's likely to be the other
way round.
What we need in order to solve this problem elegantly is a way of
wrapping up all three bits of information – the sequence, gene name, and
species name – into one big ball of data which can be treated as a unit.
One way to do this is with a complex data structure – for example, a list of
dictionaries1, where each dictionary represents a single sequence record
and has three items corresponding to the three bits of information we
need to store. A much better way is to create a class that represents a
DNA sequence, instances of which can be created and passed around in
our programs as discrete objects.
Defining a class is straightforward, but first we have to decide what
instance variables and methods it will have. Instance variables are
variables that belong to a particular object (we'll see how to use them
soon). We already know what methods are – we've been using them on
many of the built in Python classes. We want our class to have three
instance variables (a DNA sequence, a gene name, and a species name)
and two methods (the ones we saw previously: get_AT() and
complement()). For this example, our three instance variables are
going to be strings, but they could also be File objects, dicts, lists, etc.
Here's a bit of code that defines our new class, creates an instance, and
calls some methods on it:
1 Take a look at the chapter on complex data structures for some examples of this approach.
class DNARecord(object):
    sequence = 'ACGTAGCTGACGATC'❶
    gene_name = 'ABC1'
    species_name = 'Drosophila melanogaster'
    def complement(self): ❷
        replacement1 = self.sequence.replace('A', 't')
        replacement2 = replacement1.replace('T', 'a')
        replacement3 = replacement2.replace('C', 'g')
        replacement4 = replacement3.replace('G', 'c')
        return replacement4.upper()
    def get_AT(self): ❸
        length = len(self.sequence)
        a_count = self.sequence.count('A')
        t_count = self.sequence.count('T')
        at_content = (a_count + t_count) / length
        return at_content
d = DNARecord() ❹
print('Created a record for ' + d.gene_name + ' from ' + d.species_name)
print('AT is ' + str(d.get_AT()))
print('complement is ' + d.complement())
dna_record.py
variable is how we refer to the object inside the method – so, to refer to
the DNA sequence of the record, we use the variable name
self.sequence. We don't have to worry about how the self variable
gets created – Python automatically takes care of setting the value of the
self variable whenever we make a method call on our object. We make
use of the self variable again in the get_AT() method❸.
The next few lines of code are where we start to use our new class. We
create a new instance of our DNARecord class by writing the name of the
class followed by a pair of parentheses (we'll learn more about the reason
for these shortly)❹. Once the new object has been created, and assigned
to the variable d, we can access its attributes using the pattern
variablename.attributename. So to get the gene name of the
DNARecord referred to by the variable d, we simply write d.gene_name.
To call a method on our new object, we use the same pattern.
Now we've seen what a class definition looks like, let's see what can be
done to improve it.
Constructors
An obvious limitation of the class as we've written it above is that the
three members – sequence, gene_name and species_name – are set
as part of the class definition. This means that every instance of this class
we create will have the same values set for these variables, which is
unlikely to be useful. Of course, once we've created an object we can
change its member variables, so if we want two different DNA records
with different properties then we can simply set them after the objects
have been created:
d1 = DNARecord()
d1.sequence = 'ATATATTATTATATTATA'
d1.gene_name = 'COX1'
d1.species_name = 'Homo sapiens'
d2 = DNARecord()
d2.sequence = 'CGGCGGCGCGGCGCGGCG'
d2.gene_name = 'ATP6'
d2.species_name = 'Gorilla gorilla'
We're using the exact same class definition as above, but this time after
creating each DNARecord object we set its properties, before using a loop
to iterate over the two records and print their information. We can see
from the output how the updated values for the member variables are
used when we ask for the AT content or for the complement:
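The complete example, including a loop of the kind described, might be sketched as follows. The class definition is repeated so that the code runs on its own, and the exact wording of the printed messages is an assumption.

```python
class DNARecord(object):
    sequence = 'ACGTAGCTGACGATC'
    gene_name = 'ABC1'
    species_name = 'Drosophila melanogaster'

    def complement(self):
        replacement1 = self.sequence.replace('A', 't')
        replacement2 = replacement1.replace('T', 'a')
        replacement3 = replacement2.replace('C', 'g')
        replacement4 = replacement3.replace('G', 'c')
        return replacement4.upper()

    def get_AT(self):
        length = len(self.sequence)
        a_count = self.sequence.count('A')
        t_count = self.sequence.count('T')
        return (a_count + t_count) / length

d1 = DNARecord()
d1.sequence = 'ATATATTATTATATTATA'
d1.gene_name = 'COX1'
d1.species_name = 'Homo sapiens'

d2 = DNARecord()
d2.sequence = 'CGGCGGCGCGGCGCGGCG'
d2.gene_name = 'ATP6'
d2.species_name = 'Gorilla gorilla'

# The methods use whatever values each object currently holds
for d in [d1, d2]:
    print('Created a record for ' + d.gene_name + ' from ' + d.species_name)
    print('AT is ' + str(d.get_AT()))
    print('complement is ' + d.complement())
```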
creating another method for our object whose job is to set its variables.
Here's what it might look like:
class DNARecord(object):
    sequence = 'ACGTAGCTGACGATC'
    gene_name = 'ABC1'
    species_name = 'Drosophila melanogaster'
    def complement(self):
        ...
    def get_AT(self):
        ...
d1 = DNARecord()
d1.set_variables('ATATATTATTATATTATA','COX1','Homo sapiens') ❷
d2 = DNARecord()
d2.set_variables('CGGCGGCGCGGCGCGGCG','ATP6','Gorilla gorilla') ❷
set_variables.py
The new method❶ follows the normal rule for object methods – the first
argument is self – and sets the three variables using the remaining
arguments. Later we can see how this method allows us to set all the
variables in one statement❷.
Now that we've made it so easy to set the variables, there's no need to
have them as part of the class definition, so we can tidy up our class by
removing them. Everything will still work fine as long as we remember to
set the variables for an object after we create it:
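class DNARecord(object):
    def complement(self):
        ...

    def get_AT(self):
        ...

    def set_variables(self, new_seq, new_gene_name, new_species_name):
        self.sequence = new_seq
        self.gene_name = new_gene_name
        self.species_name = new_species_name

d1 = DNARecord()
d1.set_variables('ATATATTATTATATTATA','COX1','Homo sapiens')
d2 = DNARecord()
d2.set_variables('CGGCGGCGCGGCGCGGCG','ATP6','Gorilla gorilla')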
The danger is that we might forget to set the variables before calling a method:
d1 = DNARecord()
print(d1.complement())
The above code will give us an error letting us know that Python can't find
the sequence in order to calculate the complement:
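To see the failure without stopping the program, we can catch the error and print its message. This is only a sketch for Python 3: the complement() body is taken from the full listing later in the chapter, and the try/except wrapper is an addition for illustration.

```python
class DNARecord(object):
    def complement(self):
        replacement1 = self.sequence.replace('A', 't')
        replacement2 = replacement1.replace('T', 'a')
        replacement3 = replacement2.replace('C', 'g')
        replacement4 = replacement3.replace('G', 'c')
        return replacement4.upper()

d1 = DNARecord()
try:
    print(d1.complement())
except AttributeError as error:
    # nothing has set d1.sequence, so the lookup inside complement() fails
    message = str(error)
    print(message)
```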
To avoid running into this problem, Python has a special kind of method
called a constructor. The job of a constructor is to create a new object and
set its variables all in one statement, and it uses a bit of special syntax:
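class DNARecord(object):
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

    def complement(self):
        ...

    def get_AT(self):
        ...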
constructor.py
d2 = DNARecord()
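As a sketch of how a constructor of this kind behaves in Python 3 (the argument order follows the set_variables() calls above, and get_AT() is filled in from the full listing later in the chapter), creating and initializing an object becomes a single statement, while leaving the arguments out is now an immediate error rather than a problem that surfaces later:

```python
class DNARecord(object):
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

    def get_AT(self):
        length = len(self.sequence)
        a_count = self.sequence.count('A')
        t_count = self.sequence.count('T')
        return (a_count + t_count) / length

# create the object and set its variables in one statement
d1 = DNARecord('ATATATTATTATATTATA', 'COX1', 'Homo sapiens')
print(d1.gene_name + ' AT content is ' + str(d1.get_AT()))

# forgetting the arguments fails straight away
try:
    d2 = DNARecord()
except TypeError as error:
    missing_args_error = str(error)
    print(missing_args_error)
```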
It's worth pausing at this point and comparing the object oriented code
we get when using the class definition above with the imperative style
code we saw at the start of the chapter:
# imperative code
dna_sequence = "ACTGATCGTTACGTACGAGT"
species = "Drosophila melanogaster"
gene_name = "ABC1"
print("Looking at the " + species + " " + gene_name + " gene")
print("AT content is " + str(get_AT(dna_sequence)))
print("complement is " + complement(dna_sequence))
Notice the difference in how the data are stored, and how they are
processed. In the imperative code, we create three variables to hold the
three bits of data, and then pass them to the functions to get the answers
we want. In the object oriented style, we package up the three bits of data
into one object, then ask the object for the answers we want. The object is
responsible for both storing its own variables, and calculating the AT and
complement. In other words, when we want to know the AT content of a
DNARecord object, we don't ask for the sequence and then pass it to a
function, we simply ask for the AT content directly, and it's the object's
job to tell us.
Once we've defined a new class, it behaves just the same as a built in class
– there's no difference in how we use a DNARecord object compared to a
File object or a String object or an Integer object. We can store a
def translate_dna(dna_record):
    gencode = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}
    last_codon_start = len(dna_record.sequence) - 2
    protein = ""
    for start in range(0,last_codon_start,3):
        codon = dna_record.sequence[start:start+3]
        aa = gencode.get(codon.upper(), 'X')
        protein = protein + aa
    return protein
translate_record.py
1 See the dictionaries chapter in Python for Biologists for a reminder of how this function works.
Inheritance
What other useful methods could we add to our DNARecord class? How
about a method which returns the record in FASTA format? We'll combine
the gene_name and species_name member variables to construct the
header, replacing any spaces in the species name with underscores¹, add
a greater-than symbol at the start, and separate the header and sequence
with a newline character²:
class DNARecord(object):
    def get_fasta(self):
        safe_species_name = self.species_name.replace(' ','_')
        header = '>' + self.gene_name + '_' + safe_species_name
        return header + '\n' + self.sequence + '\n'
fasta_method.py
A quick check will allow us to make sure that the method's working as
expected:
>COX1_Homo_sapiens
ATATATTATTATATTATA
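A check along these lines might look like the following sketch, which assumes the three-argument constructor introduced earlier and reuses the COX1 record from the previous examples:

```python
class DNARecord(object):
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

    def get_fasta(self):
        safe_species_name = self.species_name.replace(' ', '_')
        header = '>' + self.gene_name + '_' + safe_species_name
        return header + '\n' + self.sequence + '\n'

d1 = DNARecord('ATATATTATTATATTATA', 'COX1', 'Homo sapiens')
fasta = d1.get_fasta()
print(fasta)
```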
¹ Some sequence analysis tools are fussy about not allowing spaces in FASTA headers.
² We also add another newline character at the end so that we can create a multi-sequence
FASTA file simply by writing the result of several get_fasta() method calls consecutively.
objects, we can generate a FASTA file containing just the sequences with a
high AT content:
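A minimal sketch of this kind of filtering follows. It assumes a list of DNARecord objects built with the constructor; the helper name write_high_at_fasta(), the threshold of 0.6 and the use of an in-memory buffer in place of a real file are all made up for illustration.

```python
import io

class DNARecord(object):
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

    def get_AT(self):
        length = len(self.sequence)
        a_count = self.sequence.count('A')
        t_count = self.sequence.count('T')
        return (a_count + t_count) / length

    def get_fasta(self):
        safe_species_name = self.species_name.replace(' ', '_')
        header = '>' + self.gene_name + '_' + safe_species_name
        return header + '\n' + self.sequence + '\n'

def write_high_at_fasta(records, output, threshold):
    # write only the records whose AT content is above the threshold
    for record in records:
        if record.get_AT() > threshold:
            output.write(record.get_fasta())

records = [
    DNARecord('ATATATTATTATATTATA', 'COX1', 'Homo sapiens'),
    DNARecord('CGGCGGCGCGGCGCGGCG', 'ATP6', 'Gorilla gorilla'),
]

# any object with a write() method will do here; a real script would
# more likely pass the result of open('high_at.fasta', 'w')
output = io.StringIO()
write_high_at_fasta(records, output, 0.6)
print(output.getvalue())
```

Because get_fasta() already ends each record with a newline, the records can be written one after another with no extra formatting.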
Now that we've seen how useful objects can be, we might want to create a
similar class to represent a protein record – let's call it ProteinRecord
for consistency. Just like the DNARecord class, it will have a gene_name,
a species_name, and a sequence. What methods should our
ProteinRecord class have? Obviously it doesn't make any sense to ask
for the complement of a protein sequence, or to ask for its AT content.
Instead, we'll give it a method that calculates the proportion of the amino
acid residues that are hydrophobic¹. We'll also include the method that
generates the FASTA sequence – since DNA and protein FASTA formats
look the same, we can just reuse our get_fasta() method.
Here's a first attempt at the code for our ProteinRecord class:
¹ Take a look at the exercise in the chapter of Python for Biologists on writing our own functions
for a discussion of how this method works.
class ProteinRecord(object):
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

    def get_fasta(self):
        safe_species_name = self.species_name.replace(' ','_')
        header = '>' + self.gene_name + '_' + safe_species_name
        return header + '\n' + self.sequence + '\n'

    def get_hydrophobic(self):
        aa_list=['A','I','L','M','F','W','Y','V']
        protein_length = len(self.sequence)
        total = 0
        for aa in aa_list:
            aa = aa.upper()
            aa_count = self.sequence.count(aa)
            total = total + aa_count
        percentage = total * 100 / protein_length
        return percentage
protein_record.py
>COX1_Homo_sapiens
MSRSLLLRFLLFLLLLPPLP
65
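class SequenceRecord(object):
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

    def get_fasta(self):
        safe_species_name = self.species_name.replace(' ','_')
        header = '>' + self.gene_name + '_' + safe_species_name
        return header + '\n' + self.sequence + '\n'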
inheritance.py
class DNARecord(SequenceRecord):
    def complement(self):
        ...

    def get_AT(self):
        ...
inheritance.py
class ProteinRecord(SequenceRecord):
    def get_hydrophobic(self):
        ...
inheritance.py
Let's look at where this leaves us. We have one base class –
SequenceRecord – which holds the methods (the __init__()
constructor and get_fasta()) which are common to both sequence
types. Then we have two subclasses – DNARecord and ProteinRecord
– that inherit these methods, and add their own. Let's look at the object
oriented code in full:
class SequenceRecord(object):
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

    def get_fasta(self):
        safe_species_name = self.species_name.replace(' ','_')
        header = '>' + self.gene_name + '_' + safe_species_name
        return header + '\n' + self.sequence + '\n'

class ProteinRecord(SequenceRecord):
    def get_hydrophobic(self):
        aa_list=['A','I','L','M','F','W','Y','V']
        protein_length = len(self.sequence)
        total = 0
        for aa in aa_list:
            aa = aa.upper()
            aa_count = self.sequence.count(aa)
            total = total + aa_count
        return total * 100 / protein_length

class DNARecord(SequenceRecord):
    def complement(self):
        replacement1 = self.sequence.replace('A', 't')
        replacement2 = replacement1.replace('T', 'a')
        replacement3 = replacement2.replace('C', 'g')
        replacement4 = replacement3.replace('G', 'c')
        return replacement4.upper()

    def get_AT(self):
        length = len(self.sequence)
        a_count = self.sequence.count('A')
        t_count = self.sequence.count('T')
        return (a_count + t_count) / length
inheritance.py
The benefit of structuring things in this way is that all our methods are
only defined once, but can be used by all the appropriate classes, allowing
us to easily mix and match different sequence types in a script:
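A short sketch of what such a script might look like, assuming the class definitions above (condensed here so that the example runs on its own) and two made-up records for the same gene:

```python
class SequenceRecord(object):
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

    def get_fasta(self):
        safe_species_name = self.species_name.replace(' ', '_')
        header = '>' + self.gene_name + '_' + safe_species_name
        return header + '\n' + self.sequence + '\n'

class ProteinRecord(SequenceRecord):
    def get_hydrophobic(self):
        total = 0
        for aa in ['A', 'I', 'L', 'M', 'F', 'W', 'Y', 'V']:
            total = total + self.sequence.count(aa)
        return total * 100 / len(self.sequence)

class DNARecord(SequenceRecord):
    def get_AT(self):
        a_count = self.sequence.count('A')
        t_count = self.sequence.count('T')
        return (a_count + t_count) / len(self.sequence)

my_dna = DNARecord('ATATATTATTATATTATA', 'COX1', 'Homo sapiens')
my_protein = ProteinRecord('MSRSLLLRFLLFLLLLPPLP', 'COX1', 'Homo sapiens')

# get_fasta() is inherited, so it works on both kinds of record
for record in [my_dna, my_protein]:
    print(record.get_fasta())

# the specialized methods only exist on the appropriate subclass
print(my_dna.get_AT())
print(my_protein.get_hydrophobic())
```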
def translate_dna(dna_record):
    gencode = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}
    last_codon_start = len(dna_record.sequence) - 2
    protein = ""
    for start in range(0,last_codon_start,3):
        codon = dna_record.sequence[start:start+3]
        aa = gencode.get(codon.upper(), 'X')
        protein = protein + aa
    # wrap the translation in a new ProteinRecord that keeps the DNA record's names
    protein_record = ProteinRecord(protein, dna_record.gene_name, dna_record.species_name)
    return protein_record
Overriding
Occasionally we'll want a subclass to behave in a slightly different way to
its superclass – the mechanism that allows us to do this is called
overriding. Suppose that we want our DNARecord objects to have a
genetic_code variable, which stores the number of the genetic code for
the sequence using the NCBI numbering scheme¹. We cannot simply add
this variable to the constructor for the SequenceRecord class, as it
doesn't make sense to have a genetic code for a protein sequence. Instead,
what we need to do is supply the DNARecord class with its very own,
specialized constructor, which will take a genetic code as one of its
arguments. That way, when we create a new DNARecord object the
__init__() method defined in DNARecord will be used, but when we
call get_fasta() on the object, it will still use the method defined in
SequenceRecord. Let's look at the code:
class DNARecord(SequenceRecord):
    def __init__(self, sequence, gene_name, species_name, genetic_code):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name
        self.genetic_code = genetic_code

    def complement(self):
        ...

    def get_AT(self):
        ...
overriding.py
¹ http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
import re

class SequenceRecord(object):
    def get_fasta(self):
        ...
This works fine, but we run into a problem – we would like this
functionality to be shared by all subclasses of SequenceRecord (i.e.
1 See the regular expressions chapter in Python for Biologists if you need a refresher.
2 There is a much better way to handle this type of situation, which we will learn about in the
chapter on exceptions.
class DNARecord(SequenceRecord):
    def complement(self):
        ...

    def get_AT(self):
        ...
calling_superclass.py
Now we have the best of both worlds. Our DNARecord class is able to take
advantage of the improvements to the SequenceRecord constructor,
and still implement its own specialized behaviour.
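One way this arrangement might be written is sketched below. The details are assumptions rather than the definitive listing: the regular expression and warning message in the SequenceRecord constructor are invented for illustration, and the superclass constructor is called explicitly by name (Python's built-in super() would be an alternative).

```python
import re

class SequenceRecord(object):
    def __init__(self, sequence, gene_name, species_name):
        # assumed check: warn about anything other than upper-case letters
        if not re.match(r'[A-Z]+$', sequence):
            print('warning: sequence contains unexpected characters')
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

    def get_fasta(self):
        safe_species_name = self.species_name.replace(' ', '_')
        header = '>' + self.gene_name + '_' + safe_species_name
        return header + '\n' + self.sequence + '\n'

class DNARecord(SequenceRecord):
    def __init__(self, sequence, gene_name, species_name, genetic_code):
        # let the superclass run its checks and set the shared variables...
        SequenceRecord.__init__(self, sequence, gene_name, species_name)
        # ...then add the variable that only makes sense for DNA
        self.genetic_code = genetic_code

d = DNARecord('ATATATTATTATATTATA', 'COX1', 'Homo sapiens', 2)
print(d.get_fasta())
print(d.genetic_code)
```

Any later improvement to the SequenceRecord constructor is picked up by DNARecord automatically, because the subclass delegates to it rather than copying its body.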
Polymorphism
Polymorphism is a complicated name for a simple concept: code that does
different things depending on the type of data on which it's operating.
Here's a somewhat contrived example: imagine that we want to add a
method to our sequence objects that will return the length of the protein
sequence that they represent. Obviously we are going to need a different
method for DNA and protein sequences – for protein sequences, we just
need to return the length of the sequence variable, but for DNA
sequences we need to return the length of the sequence variable divided
by three. Because we need different methods for each type of sequence,
we can't add the method to the SequenceRecord class definition, but
must instead add it separately to both the DNARecord and
ProteinRecord class definitions:
class ProteinRecord(SequenceRecord):
    def get_protein_length(self):
        return len(self.sequence)
    ...

class DNARecord(SequenceRecord):
    def get_protein_length(self):
        return len(self.sequence) / 3
    ...
Now suppose that we have a list of sequence records that are a mixture of
DNA and protein sequences, and we want to do something to just the
ones whose protein length is greater than one hundred amino acid
residues. Rather than having to examine each record and check whether it
is a DNARecord or a ProteinRecord, we can simply call the get_protein_length() method on each record and let the object work out the answer.
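A minimal sketch of this kind of polymorphic call, using made-up gene names and sequences and a condensed version of the classes above:

```python
class SequenceRecord(object):
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

class ProteinRecord(SequenceRecord):
    def get_protein_length(self):
        return len(self.sequence)

class DNARecord(SequenceRecord):
    def get_protein_length(self):
        return len(self.sequence) / 3

records = [
    DNARecord('ATG' * 150, 'geneA', 'Homo sapiens'),     # 450 bases
    ProteinRecord('M' * 80, 'geneB', 'Homo sapiens'),    # 80 residues
    ProteinRecord('M' * 120, 'geneC', 'Homo sapiens'),   # 120 residues
    DNARecord('ATG' * 30, 'geneD', 'Homo sapiens'),      # 90 bases
]

# the loop never checks which class a record belongs to
long_records = []
for record in records:
    if record.get_protein_length() > 100:
        long_records.append(record.gene_name)
print(long_records)
```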
Recap
Object oriented programming is a big topic, and whole libraries of books
have been written about its ramifications. In this chapter we've seen a
brief overview of the ways that basic object oriented features work in
Python. We've looked at how we can use simple class definitions to
package data and code together in logical units which we can create and
pass around in our programs. We've also seen how we can allow our
classes to share functionality using inheritance, and how we can give them
specialized behaviour by overriding methods in their base class. Finally,
we've looked at one important benefit of object oriented thinking – the
ability of functions to handle different types of data transparently.
Because object oriented programming is such a complex topic, there are
many aspects worth reading up on that are beyond the scope of this book.
Composition is an alternative to inheritance which also allows classes to
share functionality, but in a different way. Unlike many languages, Python
allows for multiple inheritance, an easily abused technique that allows
classes to inherit from multiple parents. Much thought has been given to
solving common abstract problems in an object oriented style and the
result is design patterns – a set of best practice techniques that are
1 Though design patterns tend to be used less often in Python than in other languages as its
dynamic nature makes many of them unnecessary.
Exercise
Write an object oriented program that simulates evolution at three loci in
a population of one hundred haploid individuals¹. Each locus has two
alleles which differ slightly in fitness and the overall fitness for an
individual can be calculated from the fitness of its three loci using a
multiplicative model (i.e. if the fitness scores for the alleles of a given
individual are 1, 0.9 and 0.8 then the individual's fitness is 1 * 0.9 * 0.8 =
0.72).
In every generation, the simulation proceeds in two stages. Firstly, to
represent selection, each individual is potentially killed with a probability
of one minus its fitness – in other words, for each individual,
pick a random number between 0 and 1 and if that number is greater than
the individual's fitness, it dies and is removed from the population.
Secondly, to represent reproduction, new individuals are added to the
population to make the numbers back up to one hundred. Rather than
simulating recombination etc. we will simply say that the alleles for each
new individual are chosen by randomly selecting alleles from the current
population – in other words, the chance of selecting a given allele is
proportional to its frequency in the population as a whole.
At each generation, your program should calculate the frequency of all
alleles and write them to a text file. At the end of the simulation, we'll be
able to plot the frequencies on a chart to show how they change over
time.
¹ Readers with a background in population genetics will, I hope, forgive the many shortcomings
of this simulation!
Solution
This is a big exercise with a lot of parts, and there are many different ways
to structure it. The solution we'll look at here is just one way; you may
come up with something completely different.
The goal of this exercise is to write a program that combines object
oriented and procedural code. We will start by tackling the object oriented
part and defining some classes. We'll use three different classes – one to
represent an individual, one to represent a locus, and one to represent an
allele. Let's begin with the simplest class, the one that represents a single
allele. It has a name, and a fitness score:
class Allele(object):
    def __init__(self, name, fitness):
        self.name = name
        self.fitness = fitness

class Locus(object):
    def __init__(self, name):
        self.name = name
        self.alleles = []

    def add_allele(self, allele):
        self.alleles.append(allele)
class Individual(object):
    def __init__(self, alleles):
        self.alleles = alleles
Having defined our classes, we can start experimenting with them. Let's
start off with something simple – here's how we define a locus (which
we'll imaginatively call locus one) with two alleles. As is customary, we'll
use a capital letter A as the name for the most-fit allele (with a fitness of
1), and a lower-case a as the name for the less-fit allele (with a fitness
slightly less than 1):
allele_A = Allele('A', 1)
allele_a = Allele('a', 0.94)
locus1 = Locus('locus one')
locus1.add_allele(allele_A)
locus1.add_allele(allele_a)
The first thing that we notice about this bit of code is that the variable
names of the two alleles don't really serve any purpose – we create the
Allele objects and then immediately add them to the Locus object. We
can simplify the code a bit by calling the constructor for the alleles and
then passing the returned value immediately to the add_allele()
method all in one statement:
This has the exact same effect but is a little easier to read. Let's go ahead
and create the other two loci in the same way, which we'll use for the rest
of the exercise. We'll also create a list to hold all three Locus objects:
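As a sketch, here is the one-statement form applied to all three loci. The variable names locus2, locus3 and all_loci are the ones used in the rest of this solution; the names 'locus two' and 'locus three' are assumptions, and the fitness values for b and c are inferred from the fitness output shown later in the chapter (for example 'AbC' at 0.76).

```python
class Allele(object):
    def __init__(self, name, fitness):
        self.name = name
        self.fitness = fitness

class Locus(object):
    def __init__(self, name):
        self.name = name
        self.alleles = []

    def add_allele(self, allele):
        self.alleles.append(allele)

# each Allele is created and handed straight to add_allele()
locus1 = Locus('locus one')
locus1.add_allele(Allele('A', 1))
locus1.add_allele(Allele('a', 0.94))

locus2 = Locus('locus two')
locus2.add_allele(Allele('B', 1))
locus2.add_allele(Allele('b', 0.76))

locus3 = Locus('locus three')
locus3.add_allele(Allele('C', 1))
locus3.add_allele(Allele('c', 0.81))

all_loci = [locus1, locus2, locus3]
```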
Now we have our loci and alleles, we can create some individuals. Our
Individual constructor requires that we pass in a list of alleles as the
argument, so we need some way to get hold of the allele objects.
Remember that we can't refer to the allele objects using their variable
names, because we created them in such a way that they don't have
variable names! Here's one way to do it – we could just grab the first
element of the alleles list from each locus:
first_allele = locus1.alleles[0]
second_allele = locus2.alleles[0]
third_allele = locus3.alleles[0]
ind = Individual([first_allele, second_allele, third_allele])
We can get the same result with less typing by looping over our list of loci:
alleles_for_individual = []
for locus in all_loci:
    alleles_for_individual.append(locus.alleles[0])
ind = Individual(alleles_for_individual)
This works fine, but if we give all our one hundred individuals exactly the
same set of alleles, then our simulation is going to be a bit boring! What
we really need is a way of randomly picking an allele for each locus. A
useful tool for this is the random.choice() function, which takes a list and returns one of its elements, chosen at random.
import random

def get_random_allele(my_locus):
    return random.choice(my_locus.alleles)
class Locus(object):
    def __init__(self, name):
        ...

    def get_random_allele(self):
        return random.choice(self.alleles)
Notice the difference between the two approaches: in the first approach,
we get the information (the list of alleles) from the locus and then process
it (pick a random allele) whereas in the second, we let the object use the
information that it has (its list of alleles) to generate the answer for us.
The distinction is subtle, but important.
Now we have a way of randomly picking alleles, we can write a function
that creates and returns Individuals with randomly-picked alleles,
given a set of loci:
def create_individual(loci):
    alleles_for_individual = []
    for locus in loci:
        alleles_for_individual.append(locus.get_random_allele())
    i = Individual(alleles_for_individual)
    return i
We can now create a starting population of any size we like just by calling
this function inside a loop:
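A sketch of such a loop follows, with the classes condensed so that the example runs on its own. The name my_population is the one used later in this solution; the way the loci are built here, and the fitness values for the second and third loci, are assumptions.

```python
import random

class Allele(object):
    def __init__(self, name, fitness):
        self.name = name
        self.fitness = fitness

class Locus(object):
    def __init__(self, name):
        self.name = name
        self.alleles = []

    def add_allele(self, allele):
        self.alleles.append(allele)

    def get_random_allele(self):
        return random.choice(self.alleles)

class Individual(object):
    def __init__(self, alleles):
        self.alleles = alleles

def create_individual(loci):
    alleles_for_individual = []
    for locus in loci:
        alleles_for_individual.append(locus.get_random_allele())
    return Individual(alleles_for_individual)

# build three loci, each with a fit and a less-fit allele
all_loci = []
for locus_name, fit_name, unfit_name, unfit_fitness in [
        ('locus one', 'A', 'a', 0.94),
        ('locus two', 'B', 'b', 0.76),
        ('locus three', 'C', 'c', 0.81)]:
    locus = Locus(locus_name)
    locus.add_allele(Allele(fit_name, 1))
    locus.add_allele(Allele(unfit_name, unfit_fitness))
    all_loci.append(locus)

# one call to create_individual() per member of the starting population
my_population = []
for i in range(100):
    my_population.append(create_individual(all_loci))
print(len(my_population))
```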
def get_genotype_for_individual(ind):
    result = ''
    for allele in ind.alleles:
        result = result + allele.name
    return result
class Individual(object):
    def get_genotype(self):
        result = ''
        for a in self.alleles:
            result = result + a.name
        return result
We can use this method to, for instance, print out the genotypes of each
individual in the population in quite a natural way:
ABC
abC
aBc
aBC
AbC
abc
ABc
AbC
...etc...
class Individual(object):
    def get_genotype(self):
        ...

    def get_fitness(self):
        final_fitness = 1
        for a in self.alleles:
            final_fitness = final_fitness * a.fitness
        return final_fitness
The output looks good – we can see that, as expected, individuals with
more capital letters in their genotype tend to have higher fitness than
those with more lower case letters:
('Abc', 0.6156)
('aBc', 0.7614)
('ABC', 1)
('AbC', 0.76)
('aBc', 0.7614)
('aBC', 0.94)
('abc', 0.578664)
...
Now we've seen how to look at the data for individuals, let's tackle the
problem of summarizing the population as a whole, starting with
something easy – calculating the frequency of a given allele in the
population. We simply iterate over the list of all individuals and ask, for
To use this function we first have to get a reference to one of our alleles.
Remember that we don't have variables that point to the alleles, but we
do have variables that point to the loci, so we can just grab the first allele
in a locus's list of alleles and calculate its frequency:
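As a sketch of what a frequency function and its use might look like: get_allele_frequency() is a hypothetical name, and the four-member population is made up so that the answer is easy to check by eye.

```python
class Allele(object):
    def __init__(self, name, fitness):
        self.name = name
        self.fitness = fitness

class Locus(object):
    def __init__(self, name):
        self.name = name
        self.alleles = []

    def add_allele(self, allele):
        self.alleles.append(allele)

class Individual(object):
    def __init__(self, alleles):
        self.alleles = alleles

def get_allele_frequency(population, allele):
    # count the individuals that carry this particular Allele object
    allele_count = 0
    for individual in population:
        if allele in individual.alleles:
            allele_count = allele_count + 1
    return allele_count / len(population)

locus1 = Locus('locus one')
locus1.add_allele(Allele('A', 1))
locus1.add_allele(Allele('a', 0.94))

# three individuals carry A and one carries a
my_population = []
for index in [0, 0, 0, 1]:
    my_population.append(Individual([locus1.alleles[index]]))

# grab the first allele via the locus, since it has no variable of its own
first_allele = locus1.alleles[0]
frequency = get_allele_frequency(my_population, first_allele)
print((first_allele.name, frequency))
```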
summarize_population_alleles(my_population, all_loci)
The output shows pretty much what we'd expect – in the initial population,
all alleles are hovering at a frequency of around 0.5, with some variation
due to chance:
('A', 0.53)
('a', 0.47)
('B', 0.48)
('b', 0.52)
('C', 0.45)
('c', 0.55)
def single_generation(population):
    # iterate over a copy so that removing individuals doesn't skip any
    for individual in list(population):
        if random.random() > individual.get_fitness():
            population.remove(individual)

summarize_population_alleles(my_population, all_loci)
for i in range(10):
    print('at generation ' + str(i))
    print('population size is ' + str(len(my_population)))
    single_generation(my_population)
summarize_population_alleles(my_population, all_loci)
simulation1.py
1 Remember that the make up of the starting population and the removal of individuals are
both partly controlled by random numbers, so if you try running this code you'll get different
results.
('A', 0.49)
('a', 0.51)
('B', 0.47)
('b', 0.53)
('C', 0.58)
('c', 0.42)
at generation 0
population size is 100
at generation 1
population size is 77
at generation 2
population size is 64
at generation 3
population size is 50
at generation 4
population size is 40
at generation 5
population size is 36
at generation 6
population size is 30
at generation 7
population size is 28
at generation 8
population size is 25
at generation 9
population size is 23
('A', 0.6086956521739131)
('a', 0.391304347826087)
('B', 0.8695652173913043)
('b', 0.13043478260869565)
('C', 0.9565217391304348)
('c', 0.043478260869565216)
Now all we need is to fill in the last bit of the simulation – adding new
individuals to the population. As specified in the exercise description, we
create a new individual by picking alleles randomly from the current
population. There are a few different ways to do this, but the simplest one
is probably to make a list, for each locus, of all the current alleles in the
population belonging to that locus, then pick a random element from that
list:
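A sketch of a function that follows this description is given below. The name individual_from_population() and its two arguments come from the call in the listing that follows; the body is one possible implementation, and the small example population is made up.

```python
import random

class Allele(object):
    def __init__(self, name, fitness):
        self.name = name
        self.fitness = fitness

class Locus(object):
    def __init__(self, name):
        self.name = name
        self.alleles = []

    def add_allele(self, allele):
        self.alleles.append(allele)

class Individual(object):
    def __init__(self, alleles):
        self.alleles = alleles

def individual_from_population(population, loci):
    alleles_for_individual = []
    for locus in loci:
        # one entry per individual, so an allele carried by many
        # individuals is proportionally more likely to be picked
        current_alleles = []
        for individual in population:
            for allele in individual.alleles:
                if allele in locus.alleles:
                    current_alleles.append(allele)
        alleles_for_individual.append(random.choice(current_alleles))
    return Individual(alleles_for_individual)

locus1 = Locus('locus one')
locus1.add_allele(Allele('A', 1))
locus1.add_allele(Allele('a', 0.94))
locus2 = Locus('locus two')
locus2.add_allele(Allele('B', 1))
locus2.add_allele(Allele('b', 0.76))

# a tiny population in which every individual carries A and b
population = []
for i in range(5):
    population.append(Individual([locus1.alleles[0], locus2.alleles[1]]))

new_individual = individual_from_population(population, [locus1, locus2])
for allele in new_individual.alleles:
    print(allele.name)
```

Since only A and b are present in this example population, they are the only alleles the new individual can receive.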
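def single_generation(pop):
    # iterate over a copy so that removing individuals doesn't skip any
    for individual in list(pop):
        if random.random() > individual.get_fitness():
            pop.remove(individual)
    # top the population back up to one hundred individuals
    for i in range(100 - len(pop)):
        pop.append(individual_from_population(pop, all_loci))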
simulation2.py
('A', 0.49)
('a', 0.51)
('B', 0.47)
('b', 0.53)
('C', 0.57)
('c', 0.43)
at generation 0
population size is 100
...
at generation 9
population size is 100
('A', 0.62)
('a', 0.38)
('B', 0.87)
('b', 0.13)
('C', 0.92)
('c', 0.08)
Having a snapshot of the allele frequencies at the start and end of the
simulation is useful for testing, but it doesn't make for a very interesting
result – what we would really like to be able to do is look at the change in
allele frequencies as the simulation progresses. To do that we'll have to
switch from printing the frequency information on screen to writing it to
a file. The simplest way to do this is just to write a line containing six
comma-separated fields – one per allele – to a file after each generation.
To make sense of the result, we'll need an extra bit of code to write a
header line which will let us keep track of which field corresponds to
which allele. Here's a function that will print a header line to an output
file:
And here's a modified version of our earlier function that writes a single
line summarizing the allele frequencies at a given moment:
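Sketches of both functions follow. summarize_alleles_header() takes its name and argument order from the call in the script below; the name summarize_alleles(), the exact separators and the in-memory buffer standing in for the output file are assumptions made for illustration.

```python
import io

class Allele(object):
    def __init__(self, name, fitness):
        self.name = name
        self.fitness = fitness

class Locus(object):
    def __init__(self, name):
        self.name = name
        self.alleles = []

    def add_allele(self, allele):
        self.alleles.append(allele)

class Individual(object):
    def __init__(self, alleles):
        self.alleles = alleles

def summarize_alleles_header(loci, output_file):
    # one field per allele, in the same order as the data lines
    for locus in loci:
        for allele in locus.alleles:
            output_file.write(allele.name + ', ')
    output_file.write('\n')

def summarize_alleles(population, loci, output_file):
    # one comma-separated line of frequencies for the current population
    for locus in loci:
        for allele in locus.alleles:
            allele_count = 0
            for individual in population:
                if allele in individual.alleles:
                    allele_count = allele_count + 1
            frequency = allele_count / len(population)
            output_file.write(str(frequency) + ', ')
    output_file.write('\n')

locus1 = Locus('locus one')
locus1.add_allele(Allele('A', 1))
locus1.add_allele(Allele('a', 0.94))
population = []
for index in [0, 0, 0, 1]:
    population.append(Individual([locus1.alleles[index]]))

alleles_output = io.StringIO()   # stands in for open('alleles.csv', 'w')
summarize_alleles_header([locus1], alleles_output)
summarize_alleles(population, [locus1], alleles_output)
print(alleles_output.getvalue())
```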
# open the alleles frequency output file and write the header line
alleles_output = open('alleles.csv', 'w')
summarize_alleles_header( all_loci, alleles_output)
simulation2.py
And here's what the output file alleles.csv looks like – the first line tells us
the order of the allele frequencies and subsequent lines each represent a
single generation¹:
A , a , B , b , C , c ,
0.54, 0.46, 0.41, 0.59, 0.4, 0.6,
0.52, 0.48, 0.49, 0.51, 0.46, 0.54,
0.6, 0.4, 0.59, 0.41, 0.48, 0.52,
0.62, 0.38, 0.67, 0.33, 0.46, 0.54,
0.64, 0.36, 0.73, 0.27, 0.47, 0.53,
0.65, 0.35, 0.83, 0.17, 0.47, 0.53,
0.66, 0.34, 0.83, 0.17, 0.53, 0.47,
0.7, 0.3, 0.81, 0.19, 0.58, 0.42,
0.72, 0.28, 0.88, 0.12, 0.62, 0.38,
0.74, 0.26, 0.92, 0.08, 0.66, 0.34,
¹ Ignore the trailing comma at the end of each line – we could remove it, but it would require
more code and most spreadsheet programs will not care about it.
I've added the relative fitness for each allele (as set in the simulation code
above) to the legend so we can see how the less-fit alleles with the lowest
fitness (b and c) disappear from the population relatively early on,
whereas the less-fit allele which has fairly high relative fitness (a) takes
much longer to disappear (and even, due to chance, is more frequent than
its fitter partner A for a time).
109
Chapter 5: Functional Python
5: Functional Python
Introduction
If you spend any time at all reading programming websites or blogs, then
you can hardly have avoided discussions of functional programming.
Although it's a very old idea, it's a hot topic right now and much progress
has been made recently in making it accessible to novice programmers.
Functional programming is a tricky thing to define, however, and there
are a few different ways to think about it. We'll start this chapter with a
quick tour of important functional programming concepts.
x = 0
for i in range(10):
    x = x + i
print(x)
We can tell that this program has state, because the value stored in the
variable x changes as the program runs. When the program starts, the
value of x is 0. After the first iteration of the loop (when i is 0), the value
of x is still 0. After the second iteration it's 1, after the third it's 3, and
so on. Here's a program that does the same thing, without using state:
x = sum(range(10))
print(x)
It works by using the built in sum() function, which takes as its argument
a list¹ and returns the sum of all elements. Notice how, in this piece of
code, the value of the variable x is set once, and then never changed (we'll
see later in this chapter why this might be a desirable thing). From this
perspective, functional programming appears to be the opposite of object
oriented programming². In object oriented programming, we create
objects that have various attributes that describe their state (e.g. DNA
sequences that have names and genetic codes) and we are mostly
concerned with manipulating that state.
Side effects
Another way to think about functional programming is that it's a style of
programming that avoids writing functions with side effects. A function is
said to have side effects if, when you run it, it changes the state of the
variables in the program. Here's an example of a function with side
effects:
def my_function(i):
    i.extend(['a', 'b', 'c'])
    return(i)
This function takes a single argument, which is a list, and adds three
more elements on to the end using the list extend() method before
returning the extended list. The side effect, in this case, is that it changes
the value of the list that's given as the argument. We can see this happen
if we print the value of the list before and after running the function:
¹ In fact, the argument can be anything that behaves like a list i.e. any iterable type.
² If you haven't looked at the chapter on object oriented programming, now would be a good
time to do so.
x = [1,2,3]
print(x)
print(my_function(x))
print(x)
[1, 2, 3]
[1, 2, 3, 'a', 'b', 'c']
[1, 2, 3, 'a', 'b', 'c']
After the function has run, the variable x contains three additional
elements. Here's an example of a similar function that gives the same
return value, but without the side effect – this function doesn't alter the
value of the variable that's passed in as the argument:
def my_function(i):
    return(i + ['a', 'b', 'c'])
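Repeating the earlier experiment with this version shows the difference. In this short sketch the variable name result is an addition; the + operator builds a new list, so x is the same before and after the call:

```python
def my_function(i):
    # build and return a new list; the argument itself is left untouched
    return i + ['a', 'b', 'c']

x = [1, 2, 3]
print(x)
result = my_function(x)
print(result)
print(x)
```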
Why are side effects considered bad? The easiest answer is to pose the
following question: imagine we have a variable x and we pass it as the
argument to some function:
x = [1,4,9]
some_function(x)
# what is the value of x now?
If the function has side effects, then we have no way of knowing what the
value of x is after the function call without going and looking at the code.
Even worse, what if we look inside the definition of some_function()
and we find that it calls a bunch of other functions:
def some_function(input):
    some_other_function(input)
    another_function(input)
    yet_another_function(input)
def my_function(i):
    return(i + to_add)
The two calls to my_function in the above code have exactly the same
argument, but they return different results:
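A sketch of how this can happen, assuming that to_add is a list defined outside the function (the particular values, and the names first_result and second_result, are made up):

```python
def my_function(i):
    # the result depends on to_add, which lives outside the function
    return i + to_add

to_add = ['a', 'b', 'c']
first_result = my_function([1, 2, 3])
print(first_result)

# changing the outside variable changes what the function returns
to_add = ['x', 'y', 'z']
second_result = my_function([1, 2, 3])
print(second_result)
```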
Functions that satisfy the two criteria we have discussed above – they
always return the same value when called with the same arguments, and
they don't have any side effects – are called pure functions. For the reasons
outlined above, it's generally much easier to reason about the behaviour
of pure functions than functions that aren't pure.
Functions as objects
Yet another way of thinking about functional programming is the idea
that functions are objects that can be passed around programs like any
other type of object – they can be stored in variables, passed to other
functions as arguments, and returned from other functions as return
values. Python makes it quite easy to do this. By way of an example, here's
a function that takes two arguments – a list and the name of a function –
and prints out the result of running the function on each element of the
list:
list_function.py
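A minimal sketch of a function along these lines, using the names print_list_with_function() and my_function that appear in the surrounding text (the first parameter name and the example list of strings are assumptions):

```python
def print_list_with_function(my_list, my_function):
    # my_function is an ordinary argument that happens to be a function
    for element in my_list:
        print(my_function(element))

# a made-up list of three strings
my_words = ['abc', 'defghi', 'kl']

# note that we pass len itself, without parentheses, rather than calling it
print_list_with_function(my_words, len)
```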
This looks odd if you've not encountered it before, but the syntax should
be familiar. The function iterates over the list, and for each element runs
my_function() with the element as the input and passes the return
value straight to the print() function. A function that takes another
function as one of its arguments, as in the example above, is known as a
higher order function.
Let's see what happens when we use it. Here, we create a list and pass it to
our print_list_with_function() function, along with the name of
the built in Python function len(), which returns the length of a string:
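Here is a sketch of what the function and the call might look like. The function name and the my_function argument come from the description above; the name input_list and the three strings in it are assumptions chosen for illustration:

```python
def print_list_with_function(my_list, my_function):
    # Run my_function on each element and print whatever it returns.
    for element in my_list:
        print(my_function(element))

# Made-up example strings.
input_list = ['abc', 'defghi', 'kl']

# Note that we pass len itself, without parentheses; we are not calling it here.
print_list_with_function(input_list, len)
```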
As expected, the output contains the lengths of the three elements of our
input list:
3
6
2
Here's where it gets interesting though; we're not restricted to using built
in functions as the second argument to
print_list_with_function(). We can supply any function we like
as long as it takes a single string argument, including functions that we
define. For example, here's a function that returns the second character of
its argument:
def get_second(input):
    return input[1]

print_list_with_function(input, get_second)
b
e
l
1 Other programming languages have support for anonymous functions, which work the same
way.
What is to be calculated
The final way I want to suggest thinking about functional programming is
that it places the emphasis on specifying what the answer looks like,
rather than how to calculate it. Consider these two bits of code that add
up the first ten integers and print out the result:
# procedural code
total = 0
for i in range(11):
    total = total + i
print(total)

# functional code
print(sum(range(11)))
In the first bit of code, we are giving the steps required to calculate the
answer. If we were to translate this code into natural English, we might
write:
Create a variable to hold a running total, and set it to zero. Then, for
each number between zero and ten, add that number to the total. Finally,
print the total.
By contrast, in the second bit of code, we are simply describing the result:
The result is the sum of the numbers between zero and ten.
and we are prepared to let the computer worry about how to actually
calculate the answer. This idea is very similar to the difference between
iterative and recursive approaches to programming – take a look at the
chapter on recursion if you haven't already done so.
The remainder of this chapter is divided into two main parts. In the first
part, we will look at some built in higher order functions that allow us to
carry out common programming tasks by using the techniques outlined
above. In the second part, we'll see how we can use these same techniques
in our own functions.
As you work through the rest of this chapter, bear in mind that, unlike in
some other languages, functional programming in Python is not an all-or-
nothing affair. It's not really feasible to write entire programs in a
functional style1, so when you use functional programming features in
your programs they will generally be mixed in with procedural code.
map
Consider the very common situation where you have a list of data, and
you want to create a new list by carrying out some operation on each
element of the old list. For example, you have a list of DNA sequences,
and you want to create a list of their lengths. It's quite straightforward to
do this with a for loop – we create an empty list to hold the result, then
iterate over the input list adding a single element to the result on each
iteration:
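As a minimal sketch, assuming a made-up list called dna_list, the loop might look like the first half of the code below. The second half applies exactly the same pattern to a different calculation, the AT content of each sequence:

```python
# In Python 2, add this line first: from __future__ import division

# Made-up example sequences.
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

# Build a list of lengths.
lengths = []
for dna in dna_list:
    lengths.append(len(dna))
print(lengths)

# Build a list of AT contents using the same pattern.
at_contents = []
for dna in dna_list:
    at_contents.append((dna.count('A') + dna.count('T')) / len(dna))
print(at_contents)
```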
1 For one thing, a ban on side effects means that a purely functional program could never
produce any output, since printing to the screen or to a file is a side effect!
map.py
These two examples share a lot of code between them. They both follow a
general pattern – we start off by creating an empty list to hold the result,
then we iterate over the list of DNA sequences, and for each sequence
calculate some value and append it to the result list. The name for this
general pattern, where we want to apply some function to each element
of a list to generate a new list, is a map2, and it's implemented in Python
as a function called, unsurprisingly, map().
To use the Python map() function, we have to supply a function that will
take as its argument a single element of the input list and return the
corresponding element in the output list (we'll call this the
transformation function). For our first example – turning a list of DNA
sequences into a list of their lengths – the built in len() function will do
the job. For the second example – turning a list of DNA sequences into a
list of their AT contents – we can write a simple function that returns the
AT content of its argument:
1 Remember to include from __future__ import division if you want to run this code
in Python 2.
2 So-called because there's a one-to-one mapping between elements in the original list and the
new list.
Now we simply have to call the map() function with the name of our
transformation function as the first argument, and name of the original
list as the second argument:
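A sketch of both calls, assuming a made-up dna_list and a get_at() function written along the lines described above:

```python
# In Python 2, add this line first: from __future__ import division

def get_at(dna):
    # Proportion of the bases that are A or T.
    return (dna.count('A') + dna.count('T')) / len(dna)

# Made-up example sequences.
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

# Wrapping the result in list() lets us print it under Python 3 as well.
lengths = list(map(len, dna_list))
at_contents = list(map(get_at, dna_list))

print(lengths)
print(at_contents)
```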
map.py
The map() function takes care of setting up the empty results list,
iterating over the original list, and running the transformation function
on each element. The benefit of processing lists in this way is not simply
that it involves less typing: rather, it's another way of achieving
encapsulation. We have separated the bit of the code responsible for
handling the iteration (the map() function) from the bit of code
responsible for transforming a single element (the get_at() function).
Because the transformation functions that we pass to map() are often
very short, it's quite common to use lambda expressions to do the job
instead. Here's our AT content example written as a lambda expression.
It's formatted over a few lines to make it easier to read:
at_contents = map(
    lambda dna : (dna.count('A') + dna.count('T')) / len(dna),
    dna_list
)
One final note about map(): its behaviour is subtly different in Python 2
and 3. In Python 2, the result of running a map() function is a
straightforward list, but in Python 3, the result is a map object, which we
can iterate over. This means that we can treat the returned value in pretty
much the same way – this type of code will work fine in all versions of
Python:
l = list(range(100000))
m = map(lambda x : 2 ** x, l)
Under Python 2, this statement will take a very long time to execute –
around 30 seconds on my desktop computer1. However, under Python 3,
the exact same statement executes in no time at all, because it doesn't
start to actually calculate the elements until they are needed – for
example, when we start to iterate over the map object:
l = list(range(100000))
m = map(lambda x : 2 ** x, l)
for i in m:
    print(i)
1 See the section on performance in Effective Python development for Biologists for a detailed
explanation of how to measure execution time.
1
2
4
8
16
32
64
128
256
512
...
Notice how this behaviour can catch us out if we try to use map() with a
function that depends on side effects. Here's a bit of code where we define
a transformation function that has the side-effect of appending the string
'a' to a list, x. We then call map() on a list of ten elements and print the
value of x:
x = []

def square(input):
    x.append('a')
    return input ** 2

m = map(square, [0,1,2,3,4,5,6,7,8,9])
print(x)
Under Python 2, our square() function is run once for each element of
our list when we call the map() function, so when we print x it contains
ten elements:
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']
But under Python 3, the square() function never runs, because we never
access the elements of m, so the output shows us that even after the
map() statement has completed, x is still an empty list:
[]
This ability of Python 3 to carry out so-called lazy evaluation, which saves
time and memory, is a nice illustration of the power of functional
programming to simplify the process of reasoning about computation.
filter
A closely-related pattern to map is filter, used where we have a list from
which we want to select only the elements that satisfy some condition.
Imagine we have a list of DNA sequences, and we want to create a new list
containing only the sequences longer than five bases. The iterative
solution is quite straightforward:
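A minimal sketch of the loop, assuming a made-up dna_list and a result list called long_dna:

```python
# Made-up example sequences.
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

long_dna = []
for dna in dna_list:
    # Keep only the sequences longer than five bases.
    if len(dna) > 5:
        long_dna.append(dna)

print(long_dna)
```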
filter.py
But if we look at another example – creating a new list that contains only
the sequences whose AT content is less than 0.6 – we can see how
repetitive this kind of code is:
at_poor_dna = []
for dna in dna_list:
    print(dna, get_at(dna))
    if (dna.count('A') + dna.count('T')) / len(dna) < 0.6:
        at_poor_dna.append(dna)
Just as with map, Python has a built in function for doing this kind of
filtering, and it works in a similar way. We supply Python's filter()
function with the name of a function that takes a single element as its
argument and returns True or False to indicate whether or not that
element should be included in the result list. Here's how we use it to
select only DNA sequences longer than five bases1:
def is_long(dna):
    return len(dna) > 5
filter.py
And here's how we use it to select only DNA sequences whose AT content
is less than 0.6:
def is_at_poor(dna):
    at = (dna.count('A') + dna.count('T')) / len(dna)
    return at < 0.6
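With both functions defined, the calls to filter() might look like the sketch below. The dna_list contents and the result variable names are assumptions:

```python
# In Python 2, add this line first: from __future__ import division

def is_long(dna):
    return len(dna) > 5

def is_at_poor(dna):
    at = (dna.count('A') + dna.count('T')) / len(dna)
    return at < 0.6

# Made-up example sequences.
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

# list() turns the Python 3 filter object into a list we can print.
long_dna = list(filter(is_long, dna_list))
at_poor_dna = list(filter(is_at_poor, dna_list))

print(long_dna)
print(at_poor_dna)
```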
sorted
You've probably already encountered the sorted() function and used it
for sorting lists in alphabetical or numerical order. The sorted()
function is actually capable of sorting elements using any type of custom
ordering, and it does so by acting as a higher order function. Sorting is a
little bit more complicated to understand than mapping or filtering: the
sorting algorithm used by Python is quite complicated so, unlike map()
and filter(), we can't show a simple imperative version of the code.
Nevertheless, the same principle of encapsulation applies: just as with
map() (where we supply a function that tells Python how to transform
a single input element and Python takes care of producing the output list)
and filter() (where we supply a function that tells Python whether to
include a single input element and Python takes care of producing the
filtered list), with sorted() we supply a function that tells Python what
property of each input element we want to sort on, and Python takes care
of producing the sorted list.
A few examples will make this clearer. Let's start by sorting our list of
DNA sequences using the default order (i.e. without supplying a custom
function):
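A sketch of the default sort, assuming a made-up dna_list. The first line of output is the sorted copy and the second is the original list:

```python
# Made-up example sequences.
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

# With no key function, strings are sorted alphabetically.
sorted_dna = sorted(dna_list)

print(sorted_dna)
print(dna_list)
```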
sorted.py
It's very important to note that, as we can see from the second line of
output, the original list is unchanged. There is another way of sorting a
list – we can call the sort() method on the list – but we're going to avoid
using that method in this section for two reasons. First, using sorted()
to create a sorted copy of the list is more compatible with the functional
programming ideas of avoiding state and mutability. Second, the
sorted() function is more flexible as it's not restricted to lists – we can
call sorted() on any iterable data type (strings, files, etc).
Now let's look at sorting in a different order – for example, by length. To
do this, we supply sorted() with a key function. The key function must
take a single argument, and return the value that we want to sort on. By
convention, we supply the key function as a keyword argument like this:
sorted(some_list, key=my_key_function)
For sorting by length, we can use the built in len() function as our key
function. The len() function takes a single argument, and returns a
single value, so it satisfies the requirements for a key function and we can
use it like this:
sorted.py
If we want them in the reverse order then we can simply pass a reverse
keyword argument to the sorted function:
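A sketch of both calls, sorting by length and then by length in reverse, assuming a made-up dna_list:

```python
# Made-up example sequences.
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

# Shortest first: len is called once on each element to get its sort key.
by_length = sorted(dna_list, key=len)

# Longest first.
by_length_reversed = sorted(dna_list, key=len, reverse=True)

print(by_length)
print(by_length_reversed)
```

Sequences with the same length keep their original relative order in both results.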
Let's look at something a bit more complicated that requires a custom key
function: sorting by AT content. We already have a function that takes a
single DNA sequence and returns the AT content from our map()
example above, so we can just reuse it for sorted():
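A sketch of this call, assuming a made-up dna_list and a get_at() function like the one described in the map() section:

```python
# In Python 2, add this line first: from __future__ import division

def get_at(dna):
    # Proportion of the bases that are A or T.
    return (dna.count('A') + dna.count('T')) / len(dna)

# Made-up example sequences.
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

by_at = sorted(dna_list, key=get_at)
print(by_at)
```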
As the output shows, we get the DNA sequences sorted from lowest AT
content to highest:
import re

def poly_a_length(dna):
    poly_a_match = re.search(r'A+$', dna)
    if poly_a_match:
        return len(poly_a_match.group())
    else:
        return 0
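Because poly_a_length() takes a single sequence and returns a single number, it can be used as a key function. A sketch, using made-up sequences:

```python
import re

def poly_a_length(dna):
    # Length of the run of A bases at the very end of the sequence.
    poly_a_match = re.search(r'A+$', dna)
    if poly_a_match:
        return len(poly_a_match.group())
    else:
        return 0

# Made-up example sequences with poly-A tails of different lengths.
sequences = ['ATCGAAAA', 'ATCG', 'ATCGAA', 'AAAAAAC']

by_poly_a = sorted(sequences, key=poly_a_length)
print(by_poly_a)
```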
All of the above examples involve sorting strings, but we can use sorted to
sort any type of data. Imagine we have a list of tuples, each of which
contains the name of a gene, and an expression level measurement under
two different conditions:
measurements = [
    ('gene1', 121, 98),
    ('gene2', 56, 32),
    ('gene3', 1036, 1966),
    ('gene4', 543, 522)
]
sort_tuples.py
Since we are interested in the genes for which the ratio is highest, we
should probably pass the reverse=True parameter to sorted(), so
that the genes with the highest ratio (i.e. the most over-expression in
condition two) appear at the top of the list:
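A sketch of the whole thing, assuming a key function called get_ratio() (a hypothetical name) that divides the condition two measurement by the condition one measurement:

```python
# In Python 2, add this line first: from __future__ import division

def get_ratio(measurement):
    # Hypothetical key function: condition two divided by condition one.
    return measurement[2] / measurement[1]

measurements = [
    ('gene1', 121, 98),
    ('gene2', 56, 32),
    ('gene3', 1036, 1966),
    ('gene4', 543, 522)
]

by_ratio = sorted(measurements, key=get_ratio, reverse=True)
for measurement in by_ratio:
    print(measurement[0], get_ratio(measurement))
```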
loci = [
    (4, 9200, 'gene1'),
    (6, 63788, 'gene2'),
    (4, 7633, 'gene3'),
    (2, 8766, 'gene4')
]
sort_chromosomes.py
We want to sort the loci by chromosome number and then, within each
chromosome, by base position. We start off by defining functions which
will return, for a given locus, either the chromosome number or the base
position (simply by returning the first or second element of the tuple):
def get_chromosome(locus):
    return locus[0]

def get_base_position(locus):
    return locus[1]
[(2, 8766, 'gene4'), (4, 7633, 'gene3'), (4, 9200, 'gene1'), (6,
63788, 'gene2')]
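One way to get this ordering relies on the fact that Python's sort is stable: elements with equal keys keep their existing relative order. So we can sort by the secondary property (base position) first, and then by the primary property (chromosome). This is a sketch of that approach, with assumed variable names:

```python
def get_chromosome(locus):
    return locus[0]

def get_base_position(locus):
    return locus[1]

loci = [
    (4, 9200, 'gene1'),
    (6, 63788, 'gene2'),
    (4, 7633, 'gene3'),
    (2, 8766, 'gene4')
]

# First pass: order everything by base position.
by_position = sorted(loci, key=get_base_position)

# Second pass: order by chromosome; ties keep their base position order.
by_chromosome = sorted(by_position, key=get_chromosome)

print(by_chromosome)
```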
reduce
The final higher order function that we'll look at is reduce() – probably
the least commonly-used of Python's higher order functions, and one that
has to be imported from the functools module in Python 3. Just
like the other higher order functions we've looked at, reduce() takes
two arguments – a function, and a list. It then starts using the function to
reduce the list to a single value (hence its name). First it calls the function
with the first two elements of the list as arguments and stores the result.
Then it repeatedly calls the function using the result of the last call and
the next element in the list as arguments, repeating this until it runs out
of elements in the list, at which point the result is returned. We can see
from this description of reduce() that it differs from the other higher
order functions in two important ways: the function that we pass in as
the first argument must take two arguments and return a single value,
and the overall result of calling reduce() will be a single value rather
than a list.
def multiply(x,y):
    return x * y
We can then take this function and pass it to reduce() along with our
list of numbers:
from functools import reduce  # required in Python 3, also works in Python 2.6+

numbers = [2,6,3,8,5,4]
print(reduce(multiply, numbers))
reduce.py
and follow what happens. First, reduce() will call multiply() using
the first two elements of the list – 2 and 6 – as arguments and get the
result 12. It will then call multiply() using the third element of the list
(3) and the result of the last call (12) as arguments and get the result 36.
It will then call multiply() using the fourth element of the list (8) and
the result of the last call (36) as arguments, and so on, until all elements
have been multiplied and it returns the final answer, 5760.
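To make the sequence of calls concrete, here is a sketch of a hand-written equivalent. The name my_reduce() is hypothetical, and this version ignores the optional starting value that the real reduce() accepts:

```python
def my_reduce(function, items):
    # Hypothetical re-implementation, for illustration only.
    iterator = iter(items)
    result = next(iterator)              # start with the first element
    for item in iterator:
        # Combine the result so far with the next element.
        result = function(result, item)
    return result

def multiply(x, y):
    return x * y

numbers = [2, 6, 3, 8, 5, 4]
product = my_reduce(multiply, numbers)
print(product)
```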
Real life examples of situations where reduce is useful are hard to come
by, but we have already encountered an example in this book. Recall that
in the chapter on recursion, our solution to the last common ancestor
exercise involved the same strategy as reduce(). To find the last
common ancestor of a list of nodes in a tree given a function that can find
the last common ancestor of any two nodes, we first found the last
common ancestor of the first two nodes, then found the last common
ancestor of that result and the third node, and so on. We can concisely
express this strategy using reduce():
def find_lca_of_list(node_list):
    return reduce(find_lca_of_two, node_list)
def get_4mers(dna):
    four_mers = []  # variable names can't start with a digit
    for i in range(len(dna) - 3):
        four_mers.append(dna[i:i+4])
    return four_mers

def get_6mers(dna):
    six_mers = []
    for i in range(len(dna) - 5):
        six_mers.append(dna[i:i+6])
    return six_mers
It's very obvious, looking at the two functions, that they are doing the
same thing with one small difference: the length of the kmers that they
are generating. So we abstract that part of the function's behaviour by
turning it into an argument:
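A sketch of the generalized function, assuming we call it get_kmers() and call the new argument k:

```python
def get_kmers(dna, k):
    kmers = []
    # There are len(dna) - k + 1 positions where a kmer of length k can start.
    for i in range(len(dna) - k + 1):
        kmers.append(dna[i:i+k])
    return kmers

print(get_kmers('ATCGATC', 4))
print(get_kmers('ATCGATC', 6))
```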
Now we find ourselves in the same situation as before – we have two very
similar functions, and we want to combine them to make a single, flexible
function. But what is it that needs to be abstracted in this case? In other
words, what is it that differs between the two functions? It's not a simple
variable, but rather the process that is applied to each kmer to
generate a single element of the result list. So to make our flexible,
generic function, we take this process – let's call it the analyze kmer
function – and turn it into an argument:
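A sketch of such a function, using the name get_kmers_f() and the argument name analyze_kmer. The variable names inside the function and the example sequence are assumptions:

```python
def get_kmers_f(dna, k, analyze_kmer):
    result = []
    for i in range(len(dna) - k + 1):
        kmer = dna[i:i+k]
        # Store the result of analysing the kmer, not the kmer itself.
        result.append(analyze_kmer(kmer))
    return result

# Count the A bases in each 4mer of a made-up sequence.
a_counts = get_kmers_f('ATCGATCA', 4, lambda kmer: kmer.count('A'))
print(a_counts)
```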
analyze_kmers.py
The above version of the function takes three arguments – the DNA
sequence, the kmer length, and the name of the function which analyses a
kmer – and returns a list containing the result of running the
analyze_kmer() function on each kmer generated from the input
sequence. Just like with map(), filter() and sorted(), the
analyze_kmer() function has to follow a specific set of rules. It must
take as its argument a DNA string, and it must return a single value. Our
get_at() function from earlier in the chapter follows these rules,
therefore we can pass it as the third argument to get_kmers_f(). In
dna = 'ATCGATCATCGGCATCGATCGGTATCAGTACGTAC'
at_scores = get_kmers_f(dna, 8, get_at)
dna = 'ATCGATCATCGGCATCGATCGGTATCAGTACGTAC'
cg_counts = get_kmers_f(dna, 8, lambda dna : dna.count('CG'))
What have we actually achieved by structuring our code in this way? It all
comes back to the idea of encapsulation: separating out code that does
different jobs. Rather than having a function that does two jobs as in the
case of get_kmers_at() (generating kmers and calculating AT scores),
we now have one function whose job is to generate kmers, and a separate
function whose job is to calculate AT scores. We can use these single-
purpose functions as building blocks to easily make more complex pieces
of code. In fact, what we have really built here in the form of the
get_kmers_f function is a specialized kind of map function – one
designed to work on DNA sequences.
There's one other aspect to higher order functions that you're less likely
to encounter: we can write a function that returns another function – a
kind of function factory. Here's an example: imagine we want to write a
function that will take a DNA sequence as its argument, identify cut sites
for the EcoRI restriction enzyme (which cuts at the pattern GAATTC), and
return a list of the DNA fragments that would be produced by an EcoRI
digest of the input sequence. The function is quite easy to write using
Python's regular expression module1 – we just have to be careful to add an
1 We don't actually need a regular expression, as the EcoRI cut site motif has no variation, but
the re.finditer function is a useful way to iterate over pattern matches.
offset of one to the match position to allow for the fact that EcoRI cuts
after the first base, and not forget to include the final fragment in the
return list:
import re

def ecori_digest(dna):
    current_position = 0
    result = []
    for m in re.finditer('GAATTC', dna):
        # EcoRI cuts after the first base of the motif, hence the offset of one
        result.append(dna[current_position:m.start() + 1])
        current_position = m.start() + 1
    # add the final fragment
    result.append(dna[current_position:])
    return result
def make_digester(pattern, offset):
    def digester(dna):
        current_position = 0
        result = []
        for m in re.finditer(pattern, dna):
            result.append(dna[current_position:m.start() + offset])
            current_position = m.start() + offset
        result.append(dna[current_position:])
        return result
    return digester  # ❶
digester.py
1 See the regular expression chapter of Python for Biologists for a discussion of this.
ecori_digester = make_digester('GAATTC', 1)
print(ecori_digester(dna))
ecori_digester = make_digester('GAATTC', 1)
print(ecori_digester(dna))
ecorv_digester = make_digester('GATATC', 3)
print(ecorv_digester(dna))
['CGATG', 'AATTCTATCGATATCGTGA']
['CGATGAATTCTATCGAT', 'ATCGTGA']
Notice how when we call the newly created functions, we don't have to
pass in the pattern and offset as arguments, since these are effectively
already part of the function definition.
It's rare to see this technique in real world code, since the situations in
which it is useful are often better handled using normal functions. Even
the example above could be implemented more easily using partial
function application, a functional programming technique whereby a
function has some of its arguments fixed (here, pattern and offset) to
yield another function with a smaller number of arguments.
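In Python, partial function application is available through functools.partial(). The sketch below assumes a three-argument function called digest() (a hypothetical name), and a dna string chosen to be consistent with the fragments printed earlier:

```python
import re
from functools import partial

def digest(pattern, offset, dna):
    # Hypothetical three-argument version of the digest function.
    current_position = 0
    result = []
    for m in re.finditer(pattern, dna):
        result.append(dna[current_position:m.start() + offset])
        current_position = m.start() + offset
    result.append(dna[current_position:])
    return result

# Fix the first two arguments, leaving functions that only need the DNA.
ecori_digester = partial(digest, 'GAATTC', 1)
ecorv_digester = partial(digest, 'GATATC', 3)

dna = 'CGATGAATTCTATCGATATCGTGA'  # assumed example sequence
print(ecori_digester(dna))
print(ecorv_digester(dna))
```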
Recap
We started this chapter with a brief overview of a few different ways of
looking at functional programming. The concepts introduced here – like
immutability and side effects – are useful to know about, even if you don't
follow a particularly functional style of programming. We also covered a
feature of the Python language that makes functional programming
possible: the ability to manipulate functions like any other data type.
We then took a quick look at three built in Python list manipulation
functions that exploit Python functional features: map(), filter() and
sorted(). All three take functions as arguments, making them higher
order functions, and all three are highly flexible as a result.
Finally, we saw how we can use those same functional features to write
higher order functions of our own. As with so many programming
techniques, the value of higher order functions lies in their ability to
encapsulate code and allow for separation of concerns.
Exercises
BLAST processor
The file blast_result.txt in the functional_python folder of the exercises
download contains a BLAST result in tabular format. Each row represents
a hit and the fields, in order, give:
1. the name of the query sequence
2. the name of the subject sequence
3. the percentage of positions that are identical between the two
sequences
4. the alignment length
5. the number of mismatches
6. the number of gap opens
7. the position of the start of the match on the query sequence
8. the position of the end of the match on the query sequence
9. the position of the start of the match on the subject sequence
10. the position of the end of the match on the subject sequence
11. the evalue for the hit
12. the bit score for the hit
Use a combination of map, filter and sorted to answer the following
questions:
• How many hits have fewer than 20 mismatches?
• List the subject sequence names for the ten matches with the
lowest percentage of identical positions
• For matches where the subject sequence name includes the string
"COX1", list the start position on the query as a proportion of the
length of the match
FASTA processor
Write a function that copies FASTA format sequences from an input file to
an output file while allowing for arbitrary modification of both the header
and the sequence. Your function should take four arguments: the name of
the input file, the name of the output file, a header-modification function
and a sequence-modification function.
Write some code that uses your FASTA copying function to fix these
common FASTA file problems, one at a time:
• The sequence is in lower case and you need it in upper case
• The sequence contains unknown bases that should be removed
• The headers contain spaces that should be changed to underscores
• The headers are too long and need to be truncated to ten
characters
Write some code that uses your FASTA copying function to modify the
header for each sequence. Try the following, one at a time:
• Append the length of the sequence to the header
• Append the AT content of the sequence to the header
• If the sequence starts with ATG and ends with a poly-A tail, append
the phrase "putative transcript" to the header
Solutions
BLAST processor
We can make a fair bit of progress on this problem just by thinking about
the overall strategy for solving it. We know that map(), filter() and
sorted() work on lists, so we know that we're going to have to read in
our BLAST result file and turn it into a list, where each element
represents a single hit.
But how should we store each individual element – should it be a string, a
tuple, a dict, an object? All of these approaches will work, but for now let's
just take the simplest possible approach and store each hit as a string
read directly from the file. We can do this by creating an empty list,
opening the file, reading each line, and appending it to the list:
lines = []
for line in open('blast_result.txt'):
    lines.append(line)
But wait: map(), filter() and sorted() don't just work on lists, they
work on any iterable type (i.e. any type of data that we can use in a loop).
We know that file objects are iterable, so we don't have to bother creating
a list – we can just ask map(), filter() or sorted() to process a file
object directly, and it will "see" the individual lines. In other words, we can
just write something like:
f = open('some_file.txt')
g = filter(some_function, f)
With that in mind, we can start working on a filter function to answer the
first question – how many hits have fewer than 20 mismatches. This
seems straightforward: we know that the fields in each line are separated
by tab characters, and that the number of mismatches is the fifth field, so
all we have to do in our filter function is split the line using tabs, take the
fifth element of the resulting list, and ask if it's less than 20:
def mismatch_filter(hit_string):
    mismatch_count = hit_string.split("\t")[4]
    return mismatch_count < 20
and if we take another look at the input file, we can see why:
# BLASTX 2.2.27+
# Query: gi|322830704:1426-2962 Boreus elegans mitochondrion...
# Database: nem.fasta
# Fields: query id, subject id, % identity, ...
# 405 hits found
gi|322830704:1426-2962 gi|225622197|ref|YP...
The first five lines are comments, which give information on the version
of BLAST which generated the file, the name of the database, etc. Because
these lines don't follow the tab-separated standard expected by the
mismatch_filter() function, splitting them on tabs returns a list with
fewer than five elements.
To remove these comment lines, we could add an extra check to our
mismatch_filter() function:
def mismatch_filter(hit_string):
    if hit_string.startswith('#'):
        return False
    mismatch_count = hit_string.split("\t")[4]
    return mismatch_count < 20
but, thinking ahead, these lines are going to be a problem for all parts of
this exercise, so why not create a filter function just to remove them? This
way, we'll be able to reuse it for the other parts of the exercise:
def comment_filter(line):
    return not line.startswith('#')
f = filter(mismatch_filter, hit_lines)
print(len(f))
Unfortunately, this bit of code prints zero2 – clearly not the correct result.
The problem is that in the mismatch_filter() function we're
comparing the fifth field of the input file (which is a string) with the value
20 (which is an integer). Since, under the rules of Python 2, any string is
always "bigger" than any integer, the function always returns False. We
can fix the problem by turning the mismatch_count variable into an
integer:
1 Remember that in Python 3, the result of filter is not a list but a filter object, so to get this bit
of code to run under Python 3 we need to convert the filter object to a list before asking for the
length: print(len(list(f)))
2 Unless you're using Python 3, in which case it will cause a TypeError
def mismatch_filter(hit_string):
    mismatch_count = int(hit_string.split("\t")[4])
    return mismatch_count < 20
blast_filter.py
def get_percent_id(hit_string):
    return float(hit_string.split("\t")[2])
s = sorted(hit_lines, key=get_percent_id)
The next step is to take the first ten elements of the sorted list and assign
them to a variable:
low_id_hits = s[0:10]
Finally, we need to turn each complete hit string into a subject sequence
name using a mapping function. The code for the mapping function is
very similar to the code for our filtering function: in both cases, we are
simply taking a string, splitting it, and returning one of the resulting
elements (in this case, the second element):
def get_subject(hit_string):
    return hit_string.split("\t")[1]
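To see how the filter, sort, slice and map steps chain together without needing the real file, here is a sketch that runs them on a few made-up hit lines. The make_hit() helper, the subject names and all of the field values are invented for illustration:

```python
def comment_filter(line):
    return not line.startswith('#')

def get_percent_id(hit_string):
    return float(hit_string.split("\t")[2])

def get_subject(hit_string):
    return hit_string.split("\t")[1]

def make_hit(subject, percent_id):
    # Hypothetical helper: builds a 12-field, tab-separated line with
    # placeholder values everywhere except the subject and percent identity.
    fields = ['query1', subject, str(percent_id), '100', '5', '0',
              '1', '100', '1', '100', '1e-20', '150']
    return "\t".join(fields)

# Made-up lines standing in for the contents of the BLAST result file.
lines = [
    '# a comment line',
    make_hit('subjectA', 91.5),
    make_hit('subjectB', 45.0),
    make_hit('subjectC', 78.2),
]

hit_lines = filter(comment_filter, lines)
s = sorted(hit_lines, key=get_percent_id)
low_id_hits = s[0:2]    # the exercise asks for ten; two are enough here
subjects = list(map(get_subject, low_id_hits))
print(subjects)
```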
def comment_filter(line):
    return not line.startswith('#')

def get_percent_id(hit_string):
    return float(hit_string.split("\t")[2])

def get_subject(hit_string):
    return hit_string.split("\t")[1]
blast_filter.py
gi|336287915|gb|AEI30246.1|
gi|336287919|gb|AEI30248.1|
gi|336287881|gb|AEI30229.1|
gi|336287897|gb|AEI30237.1|
gi|336287895|gb|AEI30236.1|
gi|336287917|gb|AEI30247.1|
gi|336287921|gb|AEI30249.1|
gi|336287923|gb|AEI30250.1|
gi|336287885|gb|AEI30231.1|
gi|336287889|gb|AEI30233.1|
Now for the last bit of the exercise – for matches where the subject
sequence name includes the string "COX1", list the start position on the
query as a proportion of the length of the match. This obviously involves
a filter for the first part (selecting only hits with "COX1" in the subject
name) and, though it may not be obvious at first, we can use map() to
address the second part.
First the filter, and most of this code looks quite familiar by now. We split
the input line using tabs, get the element we're looking for, and return
True or False depending on whether or not the element contains the
string "COX1":
def cox1_filter(hit_string):
    subject = hit_string.split("\t")[1]
    if "COX1" in subject:
        return True
    else:
        return False
Now the map. For this function, we need to extract two bits of
information from the hit line – the query start and the length – then
divide one by the other and return the result:
def start_ratio(hit_string):
    query_start = int(hit_string.split("\t")[6])
    hit_length = int(hit_string.split("\t")[3])
    return query_start / hit_length
Having written our two functions, getting the answer is just a case of
applying them in the right order (not forgetting to filter out the comment
lines):
def comment_filter(line):
    return not line.startswith('#')

def cox1_filter(hit_string):
    subject = hit_string.split("\t")[1]
    if "COX1" in subject:
        return True
    else:
        return False

def start_ratio(hit_string):
    query_start = int(hit_string.split("\t")[6])
    hit_length = int(hit_string.split("\t")[3])
    return query_start / hit_length
blast_filter.py
0.0226244343891
0.00900900900901
0.0226244343891
0.0226244343891
0.0226244343891
0.0226244343891
0.00779727095517
0.0430839002268
FASTA processor
This exercise is an example of a task that crops up pretty regularly in
bioinformatics workflows. We want to parse some complex file format,
tinker with a specific bit of data, then put it all back together again in the
same format.
Let's start off by writing a function that does nothing but read records
from a FASTA file, split them into header and sequence, then write the
header and sequence out to another file. This function won't do anything
useful, but it will provide a nice framework for solving the rest of the
exercise. To keep things simple, we'll assume that the sequence for each
FASTA record is on a single line – i.e. the FASTA file looks like this:
>sequence1
actgatcgatcgatcgatcaatcgatcgacgatcgattacgtacgatcgtacgtacgtc
>sequence2
ttagcagtgactgtactctgtactacgtgctagtagctgtagctagtacc
rather than being wrapped over several lines like this:
>sequence1
actgatcgatcgatcg
atcaatcgatcgacga
tcgattacgtacgatc
gtacgtacgtc
>sequence2
ttagcagtgactgtac
tctgtactacgtgcta
gtagctgtagctagta
cc
There's not too much going on here. We just iterate over the input file line
by line, checking to see if it starts with a greater-than symbol. If it does,
then it's a header line, in which case we set the value of the header
variable to be the contents of the line starting at the second character (i.e.
we remove the greater-than symbol). If it doesn't, then it's a sequence
line, in which case we write out the header and the sequence in FASTA
format to the output file (remembering to put the greater-than symbol
back on the start of the header).
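A minimal sketch of such a function, written from the description above. It assumes one sequence line per record, and the temporary file names used in the demonstration are arbitrary.

```python
import os
import tempfile

def fasta_copy(source, destination):
    """Copy single-line FASTA records from source to destination."""
    with open(source) as fasta_in, open(destination, "w") as fasta_out:
        header = None
        for line in fasta_in:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # header line: keep everything after the greater-than symbol
                header = line[1:]
            elif line:
                # sequence line: write the complete record in FASTA format
                sequence = line
                fasta_out.write(">" + header + "\n")
                fasta_out.write(sequence + "\n")

# small demonstration using temporary files
folder = tempfile.mkdtemp()
source = os.path.join(folder, "in.fasta")
destination = os.path.join(folder, "out.fasta")
with open(source, "w") as f:
    f.write(">sequence1\nactgatcg\n>sequence2\nttagcagt\n")

fasta_copy(source, destination)
with open(destination) as f:
    copied = f.read()
print(copied)
```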
The next step is to turn our fasta_copy() function into a true higher
order function by allowing it to modify the header and sequence before
they're written to the output file. Doing this requires surprisingly little
change in the code – we just have to add a process_header() function
and a process_sequence() function as arguments, and to run the
header and sequence through the appropriate functions before writing
them to the output:
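As a sketch of what that change might look like, assuming the same single-line FASTA layout as before (add_copy_suffix() and the file names are invented for the demonstration):

```python
import os
import tempfile

def fasta_copy(source, destination, process_header, process_sequence):
    """Copy FASTA records, passing each part through a processing function."""
    with open(source) as fasta_in, open(destination, "w") as fasta_out:
        header = None
        for line in fasta_in:
            line = line.rstrip("\n")
            if line.startswith(">"):
                header = line[1:]
            elif line:
                # the only change: process both parts before writing them
                new_header = process_header(header)
                new_sequence = process_sequence(line)
                fasta_out.write(">" + new_header + "\n")
                fasta_out.write(new_sequence + "\n")

def add_copy_suffix(header):
    # hypothetical header function used only for this demonstration
    return header + "_copy"

folder = tempfile.mkdtemp()
source = os.path.join(folder, "in.fasta")
destination = os.path.join(folder, "out.fasta")
with open(source, "w") as f:
    f.write(">sequence1\nactgatcg\n")

# functions are passed by name, without parentheses
fasta_copy(source, destination, add_copy_suffix, str.upper)
with open(destination) as f:
    processed = f.read()
print(processed)
```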
Attempting to use this code to solve the first bit of the exercise – change
the sequence from lower case to upper case – we run into a problem. The
fasta_copy() function demands that we supply a function to process
the header, but for this particular job, we don't want to change the header.
We can't simply call fasta_copy() with only three arguments (source,
destination, and a process_sequence function) because that will
cause an error – it requires four. The solution is to define a "do-nothing"
function that simply returns its input1:
def do_nothing(x):
    return x
Now we can define a function that converts its input to upper case:
def to_upper(dna):
    return dna.upper()
1 This is technically known as an identity function, and some languages (though not Python)
have one as part of their standard library.
>sequence_in_lowercase
ACTGATCATCATCACTCGATCGACTACTATCGATGTCGATCTCATCGTAG
On to the next bit of the exercise: removing unknown bases from the
sequences. There are many different ways to do this, but the simplest is
probably using the re.sub() function from the regular expression
module to replace all non-ATGC characters with an empty string. As
before, we don't want to change the header, so we pass in the
do_nothing() function as the third argument to fasta_copy():
import re

def remove(dna):
    return re.sub(r'[^ATGCatgc]', '', dna)

def fix_spaces(header):
    return header.replace(' ', '_')
Similarly, for the fourth bit of the exercise, we write a function to truncate
the headers, and pass it to our fasta_copy() function as the third
argument:
def truncate(header):
    return header[0:10]
Looking at the next bit of the exercise, we run up against the limitations
of our current implementation of fasta_copy(). We're being asked to
write a function that appends the length of the sequence to the header, so
that this record in the input:
>normal_sequence
ACTGGCATGCATCGTACGTACGATCGATCATGCGATGCTACGATCGACGTGTATATCC
>normal_sequence_58
ACTGGCATGCATCGTACGTACGATCGATCATGCGATGCTACGATCGACGTGTATATCC
copy_fasta.py
Now back to the problem: appending the sequence length to the header.
Here's our header processing function and a call to fasta_copy that
uses it:
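One possible shape for this, assuming fasta_copy() has been changed so that it passes both the header and the sequence to process_header (that calling convention is an assumption made for this sketch, as are the temporary file locations):

```python
import os
import tempfile

def fasta_copy(source, destination, process_header, process_sequence):
    with open(source) as fasta_in, open(destination, "w") as fasta_out:
        header = None
        for line in fasta_in:
            line = line.rstrip("\n")
            if line.startswith(">"):
                header = line[1:]
            elif line:
                # assumption: the header function also receives the sequence
                new_header = process_header(header, line)
                new_sequence = process_sequence(line)
                fasta_out.write(">" + new_header + "\n")
                fasta_out.write(new_sequence + "\n")

def append_length(header, sequence):
    return header + "_" + str(len(sequence))

def do_nothing(x):
    return x

folder = tempfile.mkdtemp()
source = os.path.join(folder, "sequences.fasta")
destination = os.path.join(folder, "corrected.fasta")
with open(source, "w") as f:
    f.write(">normal_sequence\n")
    f.write("ACTGGCATGCATCGTACGTACGATCGATCATGCGATGCTACGATCGACGTGTATATCC\n")

fasta_copy(source, destination, append_length, do_nothing)
with open(destination) as f:
    corrected = f.read()
print(corrected)
```

With this convention every header function has to accept two arguments, even if it ignores the sequence.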
Looking at the first few lines of corrected.fasta shows us that it's working:
>normal_sequence_58
ACTGGCATGCATCGTACGTACGATCGATCATGCGATGCTACGATCGACGTGTATATCC
>sequence_in_lowercase_50
actgatcatcatcactcgatcgactactatcgatgtcgatctcatcgtag
...
1 There are other possible ways to fix this – for example, using introspection to count the
number of arguments expected by the process_header function and passing the correct
number – but these are much more complicated and well beyond the scope of this exercise.
The final bit of the exercise involves a slightly longer header modification
function which uses a regular expression to check whether the header
should be changed, but it's still quite easy to read:
fasta_copy('sequences.fasta', 'corrected.fasta',
           process_header=check_trans)
And we can even simply copy FASTA records without modifying anything
by omitting both the process_header and the process_sequence
arguments:
fasta_copy('from.fasta', 'to.fasta')
def fasta_copy(
    source,
    destination,
    process_header=lambda x: x,
    process_sequence=lambda x: x
):
    ...
copy_fasta.py
Chapter 6: Iterators, comprehensions & generators
Defining lists
Take a look back at the chapter on functional programming, and you'll
notice that the bulk of the text talks about functions for manipulating
lists of elements. This is not too surprising – the functional style of
programming lends itself well to operations on lists of data. Also
remember that in Python many things can be considered lists: dicts are
lists of key/value pairs, strings are lists of characters, files are lists of
lines, etc.
Looking back at map() and filter() specifically, it's clear that what we
are doing when we use those functions is defining lists (although not
necessarily creating them: recall that, in Python 3 at least, the map()
operation is lazy so elements are not created until they are needed). For
example, when we write this bit of code:
at_contents = map(get_at, dna_list)
we are defining the list at_contents as being the result of calling the
get_at() function on the elements of dna_list. Similarly, when we
write:
long_dna = filter(is_long, dna_list)
we are defining the list long_dna as being the elements of dna_list for
which the function is_long() returns True.
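To make the two definitions concrete, here is a self-contained sketch. The bodies of get_at() and is_long() and the contents of dna_list are assumptions chosen for illustration.

```python
def get_at(dna):
    # proportion of A and T bases in the sequence
    return (dna.count('A') + dna.count('T')) / len(dna)

def is_long(dna):
    # hypothetical threshold: "long" means more than five bases
    return len(dna) > 5

dna_list = ['ATGCATGC', 'AATT', 'GGCC', 'ATATATATAT']

# in Python 3 these two lines define the results without computing them
at_contents = map(get_at, dna_list)
long_dna = filter(is_long, dna_list)

# the elements are only created when something asks for them
at_values = list(at_contents)
long_values = list(long_dna)
print(at_values)
print(long_values)
```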
The important thing to notice about the above code is that we don't care
exactly what type of iterable object first() is. It could be a list, or an
iterator, or something else entirely. All that matters is that first() is
iterable and can therefore be passed to the filter() function.
It turns out that the combination of map() and filter() in this way is
pretty common, and Python has a special type of syntax for defining lists
in this way. These special expressions are called comprehensions.
1 The exact kind of iterable object returned by keys() depends on the version. In Python 2 it
returns a list, whereas in Python 3 it returns a view object (which is iterable, but is not itself an iterator).
List comprehensions
List comprehensions allow us to define a list just like we would using the
map() function. Syntactically, list comprehensions resemble a back to
front for loop. Here's an example of a list comprehension that defines
the list of lengths of sequences in the dna_list variable:
list_comprehension.py
Compare the list comprehension above with the equivalent for loop:
When writing the for loop, we write for dna in dna_list first, then
carry out some processing on the loop variable in the body (len(dna))
and append the result to the final list. When writing the same expression
as a list comprehension, we write the processing part first, then for dna
in dna_list, and enclose the whole thing in square brackets. Just as
with lambda expressions, the processing part of a list comprehension has
to be a single expression. For completeness, here's the same code
expressed as a map() (we've already seen this code in the functional
programming chapter):
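The three forms compared in this section can be sketched side by side as follows (the contents of dna_list are made up for the example):

```python
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

# list comprehension: processing first, then the for part, in square brackets
lengths = [len(dna) for dna in dna_list]

# equivalent for loop: the for part first, then processing and appending
loop_lengths = []
for dna in dna_list:
    loop_lengths.append(len(dna))

# equivalent map(), wrapped in list() because map() is lazy in Python 3
map_lengths = list(map(len, dna_list))

print(lengths)
```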
Let's look at a couple more examples that we've already seen how to write
using map(). Here's a list comprehension that defines a list of the AT
contents of the DNA sequences:
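A sketch of how such a comprehension might be written, assuming upper case sequences and an invented dna_list:

```python
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

# the processing part can be any single expression involving dna
at_contents = [
    (dna.count('A') + dna.count('T')) / len(dna)
    for dna in dna_list
]
print(at_contents)
```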
l = [2 ** x for x in range(100000)]
l = list(range(100000))
m = map(lambda x : 2 ** x, l)
Notice that, because we want to use the DNA sequence itself and not the
length in the condition, we couldn't achieve the same effect by using
filter() on a list of lengths. Conditions in a list comprehension need
to be a single expression (just like when we are writing a lambda
expression), so if we want to implement a condition that requires several
def is_at_poor(dna):
    at = (dna.count('A') + dna.count('T')) / len(dna)
    return at < 0.6
at_comprehension_filter.py
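A function like is_at_poor() can then be used as the condition of a comprehension. In this sketch (dna_list is invented) the result holds lengths, while the condition is tested on the sequence itself:

```python
def is_at_poor(dna):
    at = (dna.count('A') + dna.count('T')) / len(dna)
    return at < 0.6

dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

# the expression (len(dna)) and the condition (is_at_poor(dna)) both
# have access to the loop variable
at_poor_lengths = [len(dna) for dna in dna_list if is_at_poor(dna)]
print(at_poor_lengths)
```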
fasta_comprehension.py
Generator expressions
One drawback to using list comprehensions over map() and filter() is
that they are not lazy1. Happily, Python has a built in lazy equivalent to
the list comprehension called the generator expression. The syntax is
1 See the section on map() in the chapter on functional programming for an explanation of
what laziness is and why it's a good thing.
exactly the same, but it uses parentheses rather than square brackets.
Here's our long-running map example from before, written as first a list
comprehension, then as a generator expression:
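A sketch of the contrast. The list comprehension here uses a smaller range than the earlier example so that it finishes quickly, and itertools.islice() is used as one convenient way to take the first few elements of a lazy sequence.

```python
import itertools

# list comprehension: every element is computed and stored straight away
powers_list = [2 ** x for x in range(1000)]

# generator expression: same syntax with parentheses, nothing computed yet
powers_gen = (2 ** x for x in range(100000))

# elements are only calculated as we ask for them
first_five = list(itertools.islice(powers_gen, 5))
print(first_five)
```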
Nested comprehensions
We'll finish our survey of list comprehensions by taking a quick look at
one final feature: we can use them to iterate over multiple variables at
once by adding extra for expressions. This is the exact equivalent of
using nested for loops in procedural code. For example, we can generate
a list of all possible dinucleotides by iterating over a list of bases twice
(once using the base1 variable and once using the base2 variable) and
concatenating the two bases at each iteration:
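A minimal sketch of that comprehension, using the variable names from the description (the order of the bases list is an arbitrary choice):

```python
bases = ['A', 'T', 'G', 'C']

# the first for behaves like the outer loop, the second like the inner loop
dinucleotides = [base1 + base2 for base1 in bases for base2 in bases]
print(dinucleotides)
```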
r = [
    process(a, b)
    for a in object_list
    for b in object_list
    if a != b
]
Dictionary comprehensions
As I pointed out previously, dictionaries can be thought of as simple lists
of key/value pairs (with the special ability that the value for a given key
can be looked up very quickly), so it makes sense that there's an
analogous type of comprehension for defining them. The expression has
to be represented as key:value, and the whole comprehension is
surrounded by curly brackets:
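As an illustration, here is a sketch that maps each DNA sequence to its AT content. The input list is invented and deliberately contains one duplicate.

```python
dna_list = ['AATGC', 'CGCG', 'ATAT', 'CGCG']

# key:value expression first, then the for part, all inside curly brackets
at_by_sequence = {
    dna: (dna.count('A') + dna.count('T')) / len(dna)
    for dna in dna_list
}
print(at_by_sequence)
```

The duplicate sequence contributes only one key, so the dict has three entries rather than four.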
dict_comprehension.py
We have to remember, when doing this, that we'll get one key/value
pair in the resulting dict for each unique DNA sequence in the input list.
Any duplicate values will simply be overridden.
Another very useful thing we can do with dict comprehensions is to
define a dict which allows us to rapidly look up an object based on one of
its properties. Imagine we have a list of DNASequence objects1, each of
which has a name field, and all the name fields are unique. We can create
a name->object dict like this:
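A sketch of the idea, using a minimal stand-in for the DNASequence class (the constructor, the list name sequences and the two records are assumptions made for this example):

```python
class DNASequence:
    # minimal stand-in with just the two fields the example needs
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

sequences = [
    DNASequence('ABC123', 'ATGC'),
    DNASequence('XYZ789', 'GGCC'),
]

# key is the name field, value is the object itself
name2object = {s.name: s for s in sequences}

my_object = name2object.get('ABC123')
print(my_object.sequence)
```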
my_object = name2object.get('ABC123')
Set comprehensions
We encountered sets in the chapter on complex data structures. Like a
list, each element in a set is a single item, but like a dict, elements have to
be unique. Internally, set elements are stored in a way that allows us to
check for the existence of an item in a set very rapidly.
We write set comprehensions in exactly the same way as list
comprehensions, but using curly brackets rather than square ones:
For example, we can create a set of all the DNA sequences in a list that are
longer than 100 base pairs:
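A short sketch with an invented dna_list that contains one duplicate and one short sequence:

```python
dna_list = ['A' * 150, 'ATGC', 'A' * 150, 'GC' * 60]

# same layout as a list comprehension, but with curly brackets
long_dna = {dna for dna in dna_list if len(dna) > 100}

print(len(long_dna))
print('ATGC' in long_dna)
```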
We could, of course, create a list of the DNA sequences rather than a set:
but if the resulting list had a very large number of elements, checking to
see whether a particular DNA sequence was in it could become very slow,
whereas with a set the same operation is very fast regardless of the
number of elements.
class DNASequence():
    sequence = 'atgccgcat'

    def __iter__(self): ❶
        return iter(self.sequence)

my_seq = DNASequence()
for base in my_seq: ❷
    print(base)
iterator.py
class DNASequence():
    position = 0
    sequence = 'atgccgcat'

    def __iter__(self):
        return self

    def next(self): ❶
        if self.position < (len(self.sequence) - 2): ❷
            codon = self.sequence[self.position:self.position+3]
            self.position += 3
            return codon
        else:
            raise StopIteration ❸

    # Python 3 looks for __next__() rather than next()
    __next__ = next

my_seq = DNASequence()
for codon in my_seq:
    print(codon)
codon_iterator.py
The above code is quite complicated, so let's take a look at it step by step.
The job of the next() method❶ is to return the next element in the
sequence (i.e. the next codon). To do that, it needs an extra variable –
position – which keeps track of the current position in the sequence. If
there are at least three bases remaining after the current position❷ we
extract them, add three to the position and then return the codon. If,
however, the current position is within two bases of the end of the
sequence, then we have reached the end and there are no more elements
to return, so we tell Python to stop iterating❸. If we look at the output
from this code then we can see that the individual elements when
iterating over a DNASequence are now codons rather than individual
bases:
atg
ccg
cat
def get_4mers(dna):
    # variable names can't start with a digit, so we spell out the name
    four_mers = []
    for i in range(len(dna) - 3):
        four_mers.append(dna[i:i+4])
    return four_mers

def generate_4mers(dna):
    for i in range(len(dna) - 3):
        yield dna[i:i+4]

for x in generate_4mers('actggcgtgcatg'):
    print(x)
generator.py
Now we've seen how generators work, we can rewrite our DNASequence
example to take advantage of them. Here's a DNASequence class which
allows three types of iteration – by base, by codon, or by kmer:
class DNASequence():
    sequence = 'atgccgcat'

    def bases(self):
        return iter(self.sequence)

    def codons(self):
        for i in range(0, len(self.sequence) - 2, 3):
            yield self.sequence[i:i+3]

    def kmers(self, k):
        # every overlapping window of k bases
        for i in range(len(self.sequence) - k + 1):
            yield self.sequence[i:i+k]

my_seq = DNASequence()
for base in my_seq.bases():
    print(base)
for codon in my_seq.codons():
    print(codon)
for kmer in my_seq.kmers(5):
    print(kmer)
multiple_generators.py
a
t
g
c
c
g
c
a
t
atg
ccg
cat
atgcc
tgccg
gccgc
ccgca
cgcat
Recap
We started this chapter by giving a little bit of extra context to our
previous look at functional programming, and saw that often what we are
interested in is defining sequences of data. We then looked at how, in
many cases, Python's special list comprehension syntax could replace
map() and filter() in a concise and readable way. We then extended
the basic idea to look at comprehensions for other data types – dicts and
sets – along with a lazy equivalent for lists.
Finally, we saw how to exploit the power of Python's sophisticated
iteration system for our own classes and objects, allowing us to
encapsulate the complexities of iteration inside objects and write cleaner,
more readable code.
Exercises
BLAST processor
Rewrite your solutions to the BLAST processor exercises from the
previous chapter to use list comprehensions. Here's a reminder of the
questions we want to answer:
• How many hits have fewer than 20 mismatches?
• List the subject sequence names for the ten matches with the
lowest percentage of identical positions
• For matches where the subject sequence name includes the string
"COX1", list the start position on the query as a proportion of the
length of the match
Primer search
Write a generator which will generate all possible primers of a given
length (hint: look back at the chapter on recursion for an example of a
function that will act as a starting point). Write a second generator which
uses the first to generate all possible pairs of such primers.
Solutions
BLAST processor
This is a pretty straightforward exercise – we're just taking our solutions
to a previous set of problems and expressing them in a different way.
However, it's a very useful one to do, as seeing comprehensions and
map()/filter() side by side is a great way to explore the differences in
syntax. For each of the code samples below I'll show the
map()/filter() solution first, followed by the equivalent
comprehension.
First of all, recall that we have to filter out lines that start with "#" as
these lines contain comments rather than BLAST hit data. Our original
solution used filter(), but the same logic applies nicely to the list
comprehension solution:
blast_filter.py
Notice that the expression that gets evaluated for each line is just the
line itself (l), since for this comprehension we don't want to alter the
lines at all.
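The two versions of the comment filter can be sketched side by side like this, assuming the file has already been read into a list called lines (the sample contents are invented):

```python
# hypothetical sample data standing in for the lines of the BLAST output file
lines = [
    "# BLASTX 2.2.27+",
    "q1\tsp|COX1_A\t57.99\t442\t180\t5\t10\t1326\t1\t440\t1e-100\t300",
    "q1\tsp|NAD1_B\t91.30\t115\t10\t0\t5\t349\t1\t115\t1e-50\t200",
]

def comment_filter(line):
    return not line.startswith('#')

# filter() version
filtered_hits = list(filter(comment_filter, lines))

# equivalent comprehension: the expression is just the loop variable l
hit_lines = [l for l in lines if not l.startswith('#')]
print(len(hit_lines))
```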
Now we can tackle the first question. In our original solution we used
filter() to select only the hit lines where the fifth element after
splitting (i.e. the number of mismatches) was less than 20. For the new
solution, we apply exactly the same logic using a comprehension:
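A sketch of both versions, assuming hit_lines already holds the hits with the comment lines removed (the three sample hits are invented):

```python
# hypothetical sample hits in BLAST tabular format
hit_lines = [
    "q1\tsp|COX1_A\t57.99\t442\t180\t5\t10\t1326\t1\t440\t1e-100\t300",
    "q1\tsp|NAD1_B\t91.30\t115\t10\t0\t5\t349\t1\t115\t1e-50\t200",
    "q1\tsp|COX1_C\t88.00\t50\t6\t0\t2\t151\t1\t50\t1e-10\t80",
]

def mismatch_filter(hit_string):
    # the fifth tab-separated field is the number of mismatches
    return int(hit_string.split("\t")[4]) < 20

# filter() version
filtered = list(filter(mismatch_filter, hit_lines))

# equivalent comprehension
few_mismatches = [
    hit for hit in hit_lines
    if int(hit.split("\t")[4]) < 20
]
print(len(few_mismatches))
```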
# get subject names for the ten hits with the lowest percent id
def get_subject(hit_string):
    return hit_string.split("\t")[1]
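One way to put get_subject() to work in both styles is sketched below. The helper get_percent_id() and the sample hits are assumptions made for this example; sorted() with a key function does the ordering in both versions.

```python
def get_subject(hit_string):
    return hit_string.split("\t")[1]

def get_percent_id(hit_string):
    # hypothetical helper: the third field is the percentage of identical positions
    return float(hit_string.split("\t")[2])

hit_lines = [
    "q1\tsp|COX1_A\t57.99\t442\t180\t5\t10\t1326\t1\t440\t1e-100\t300",
    "q1\tsp|NAD1_B\t91.30\t115\t10\t0\t5\t349\t1\t115\t1e-50\t200",
    "q1\tsp|COX1_C\t88.00\t50\t6\t0\t2\t151\t1\t50\t1e-10\t80",
]

# the ten hits with the lowest percent id (fewer if the input is shorter)
lowest = sorted(hit_lines, key=get_percent_id)[:10]

# map() version
subjects_map = list(map(get_subject, lowest))

# equivalent comprehension
subjects = [get_subject(hit) for hit in lowest]
print(subjects)
```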
The final question involves both a filter() and a map(). First we select
the hits where the subject name contains the string "COX1", then we map
those lines to their ratio of query start position to hit length. We can fit
the whole thing into a single list comprehension – below I've split the
comprehension over multiple lines to make it easier to read:
def start_ratio(hit_string):
    query_start = int(hit_string.split("\t")[6])
    hit_length = int(hit_string.split("\t")[3])
    return query_start / hit_length
blast_filter.py
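A sketch of how the filter and the map can be folded into one comprehension, using start_ratio() as defined above and three invented sample hits:

```python
def start_ratio(hit_string):
    query_start = int(hit_string.split("\t")[6])
    hit_length = int(hit_string.split("\t")[3])
    return query_start / hit_length

hit_lines = [
    "q1\tsp|COX1_A\t57.99\t442\t180\t5\t10\t1326\t1\t440\t1e-100\t300",
    "q1\tsp|NAD1_B\t91.30\t115\t10\t0\t5\t349\t1\t115\t1e-50\t200",
    "q1\tsp|COX1_C\t88.00\t50\t6\t0\t2\t151\t1\t50\t1e-10\t80",
]

cox1_ratios = [
    start_ratio(hit)                  # the map part
    for hit in hit_lines
    if "COX1" in hit.split("\t")[1]   # the filter part
]
print(cox1_ratios)
```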
Primer search
Since the interesting part of this exercise is not the generation of possible
primers per se, but the use of generators, we'll start with the recursive
kmer generating program from the chapter on recursion. Let's remind
ourselves of what it looks like:
def generate_primers(length):
    if length == 1:
        return ['A', 'T', 'G', 'C']
    else:
        result = []
        for seq in generate_primers(length - 1):
            for base in ['A', 'T', 'G', 'C']:
                result.append(seq + base)
        return result
We won't go into the details of how it works – you'll find an in-depth look
at the implementation in the chapter on recursion. Instead, let's
concentrate on the result that is produced. We know that the output from
this function is a list of all possible combinations of the four DNA bases
of a given length, which allows us to make some confident predictions
about the size of the output. The number of elements in the returned list
will be four raised to the power of the sequence length. So, a call to
generate_primers(3) will return a list with 64 elements, but if we
double the length to 6 then we'll get over four thousand elements in the
returned list, and if we double it again to 12 we get over sixteen million
elements. For realistic primer lengths of around twenty bases, the
number of elements in the returned list is on the order of one trillion,
which is obviously not going to fit into memory1.
This function, therefore, is a good candidate for being rewritten as a
generator. If we can do that, then we'll only ever have to store a single
element in memory at any one time. Remarkably, converting this function
to a generator requires very few changes. We simply replace each instance
where we return a value, or add a new element to the result list, with a
yield statement:
1 Of course, any program that attempts to carry out any kind of processing on a list of one
trillion elements is probably going to be prohibitively slow regardless of how the elements are
generated, but we'll overlook that for this example.
def generate_primers(length):
    if length == 1:
        for base in ['A', 'T', 'G', 'C']:
            yield base
    else:
        for seq in generate_primers(length - 1):
            for base in ['A', 'T', 'G', 'C']:
                yield seq + base
primer_search.py
The magic of iteration will ensure that it continues to run as before, even
though the return value of generate_primers() is no longer a list but
a generator.
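To see the benefit, we can ask for just the first few primers of a realistic length. This sketch uses itertools.islice() (one convenient way to take a slice of a generator) and repeats the generator definition so that it runs on its own.

```python
import itertools

def generate_primers(length):
    if length == 1:
        for base in ['A', 'T', 'G', 'C']:
            yield base
    else:
        for seq in generate_primers(length - 1):
            for base in ['A', 'T', 'G', 'C']:
                yield seq + base

# only three of the roughly one trillion possible 20-mers are ever created
first_primers = list(itertools.islice(generate_primers(20), 3))
for primer in first_primers:
    print(primer)
```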
Now we can tackle the final bit of this problem: writing another generator
whose job is to generate all possible pairs of primers. It's surprisingly
straightforward: we just iterate over all possible forward primers, and all
possible reverse primers, and yield each pair as a tuple in turn:
def generate_pairs(length):
    for forward in generate_primers(length):
        for reverse in generate_primers(length):
            yield (forward, reverse)
primer_search.py
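A short usage sketch, again relying on itertools.islice() to look at the first few pairs without generating all of them (both generators are repeated here so the example is self-contained):

```python
import itertools

def generate_primers(length):
    if length == 1:
        for base in ['A', 'T', 'G', 'C']:
            yield base
    else:
        for seq in generate_primers(length - 1):
            for base in ['A', 'T', 'G', 'C']:
                yield seq + base

def generate_pairs(length):
    for forward in generate_primers(length):
        for reverse in generate_primers(length):
            yield (forward, reverse)

# take the first three (forward, reverse) tuples for primers of length two
first_pairs = list(itertools.islice(generate_pairs(2), 3))
print(first_pairs)

# counting every pair is still feasible for a length this small
pair_count = sum(1 for pair in generate_pairs(2))
print(pair_count)
```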
Of course, for this particular question we want the bases joined together
to make a DNA sequence:
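One way to do this, assuming the tuples of bases come from itertools.product() (the use of that function and the variable names are assumptions made for this sketch):

```python
import itertools

bases = ['A', 'T', 'G', 'C']

# product() yields tuples such as ('A', 'A', 'T'); join() turns each one
# into a string
kmers = [''.join(combo) for combo in itertools.product(bases, repeat=3)]

for kmer in kmers[:9]:
    print(kmer)
```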
AAA
AAT
AAG
AAC
ATA
ATT
ATG
ATC
AGA
...
Chapter 7: Exception handling
7: Exception handling
Something that becomes clear depressingly quickly when we first start to
learn to code is that our programs often don't behave exactly as we like.
Several different types of problems occur. There are straightforward
syntax errors, where we forget a colon or accidentally leave a line
unindented:
As we know from experience, syntax errors will prevent our program from
running at all.
There are also typos and incorrect function and variable names, and
things like trying to use an integer as a string:
print(dna.converttouppercase)
print('abc' + 3)
These are different from syntax errors because they will not stop the
program running entirely, but they will cause it to exit with an error
message when it reaches that point in the code.
Then there are bugs – more subtle errors that will not prevent the
program from running or create an error message, but will not do quite
what you want:
dna = 'atctgcatattgcgtctgatg'
a_count = dna.count('A') #whoops, the sequence is in lower case
What all these types of errors have in common is that they are an intrinsic
property of the code. In other words, if we run a piece of code that
contains one of these errors then we'll encounter the same problem every
time in a very predictable way.
However, there's another class of errors that are not intrinsic to the code,
but instead are the result of some external situation. For example,
consider the common error that you get when you try to open a file that
doesn't exist:
This error is not a property of the code, but of the environment in which
it is being run. If we were to take the exact same piece of code that caused
the error and run it at a different time1 or in a different folder it might run
perfectly well. In programming, we refer to situations like this as
exceptions, and Python's built in mechanism for handling them is called
exception handling2.
A quick note before we dive in: the first section of this chapter uses a
rather boring non-biological example to illustrate how to use exceptions.
That's because, for reasons that will become clear later on in the chapter,
it's easier to get to grips with the basic exception system using built in
functions. Later in the chapter we switch to biological examples.
Catching exceptions
The "No such file" message is what the user of a program will see if the
code tries to open a non-existent file. The message is part of Python's
response to an exception (the program stopping is another part of the
response). It's relatively helpful for the programmer, as it identifies the
exact error that occurred, but it's not very helpful for the user. If we're
writing a program that reads data from files, we might want to intercept,
or catch, the exception that caused the error message to be printed and
handle it in the code. To catch an exception, we enclose the bit of the code
that has the potential to cause the exception (in this case, the open()
function) in a try block, and add the code that we want to run in the case
of an exception in an except block:
try:
    f = open('missing.txt')
    print('file contents: ' + f.read())
except:
    print("sorry, couldn't find the file")
try_except.py
try and except blocks work just like the for/if/function blocks that
we're already familiar with – they end with a colon and the lines inside
them are indented. When we run the above code, the lines in the try
block are executed and if one of them causes an exception, the program
jumps directly to the except block and starts executing the code there.
In the above code, the open() function call will create an exception when
the file is not found (we say that the open() function raises an exception)
and so the print() line will not be executed. Because we have caught
the exception, we don't get the usual error message; instead, our
customized error message is printed:
sorry, couldn't find the file
This allows the program to try to recover from the error. If we add a line
of code at the end of our example:
try:
    f = open('missing.txt')
    print('file contents: ' + f.read())
except:
    print("sorry, couldn't find the file")
print("continuing....")

try:
    f = open('my_file.txt') ❶
    my_number = int(f.read()) ❷
    print(my_number + 5)
except:
    print("sorry, couldn't find the file")
Now there are two possible situations that could cause an exception: the
file could be missing (in which case we will get an IOError when we try to
open it❶) or the contents could be a string that can't be parsed into an
integer (in which case we'll get a ValueError when we try to convert
it❷). What happens if we create a new file called my_file.txt which
contains the text "twenty-three" and run the code? The call to int() will
raise a ValueError, which will get caught by the except block and we
will see the very misleading error message:
sorry, couldn't find the file
try:
    f = open('my_file.txt')
    my_number = int(f.read())
    print(my_number + 5)
except IOError:
    print("sorry, couldn't find the file")
Now when we run the code with our my_file.txt present the exception,
which is a ValueError, is not handled by our except block, and causes
the correct default error message to be printed:
but we still get our custom error message if the file is missing.
We can make our code even better by writing separate except blocks for
the two possible errors. We just place the except blocks one after
another:
try:
    f = open('my_file.txt')
    my_number = int(f.read())
    print(my_number + 5)
except IOError:
    print("sorry, couldn't find the file")
except ValueError:
    print("sorry, couldn't parse the number")
exception_types.py
try:
    f = open('my_file.txt')
    my_number = int(f.read())
    print(my_number + 5)
except (IOError, ValueError):
    print("sorry, something went wrong")
1 Take a look at the chapter on complex data structures if you've never heard of tuples before.
and we can then use the variable ex to refer to the exception object. The
details of what we can do with exception objects differ according to the
type of exception. For IOError exceptions, we can get a string
description of the error by referencing the strerror field. Here's an
updated version of our example that, when handling an IOError, prints
out the error string as part of the error message:
try:
    f = open('my_file.txt')
    my_number = int(f.read())
    print(my_number + 5)
except IOError as ex:
    print("sorry, couldn't open the file: " + ex.strerror)
except ValueError:
    print("sorry, couldn't parse the number")
1 Consequently, the following explanation will make more sense if you've read the chapter on
object oriented programming.
Now when we run the code and encounter an IOError, the first part of
the error message will always be the same, but the second part will
pinpoint the specific nature of the problem:
To figure out what properties are available for a given exception class we
can consult the Python documentation1, but there's also a more generic
mechanism for getting the details of an error. Exception objects have a
field called args2, which is a list of details for the error. For most types of
exception, one element of the list will be a string giving details of the
problem. For ValueError objects, it's the first element, so we can modify
our ValueError handler thus:
try:
    f = open('my_file.txt')
    my_number = int(f.read())
    print(my_number + 5)
except IOError as ex:
    print("sorry, couldn't find the file: " + ex.strerror)
except ValueError as ex:
    print("sorry, couldn't parse the number: " + ex.args[0])
exception_details.py
and when we run the code we'll get a more detailed error as part of our
error message if the contents of the file can't be parsed:
sorry, couldn't parse the number: invalid literal for int() with
base 10: '12.56\n'
1 http://docs.python.org/2/library/exceptions.html
2 So called because it holds the arguments used to construct the object – see the chapter on
object oriented programming for more details on constructors.
try:
    f = open('my_file.txt')
    my_number = int(f.read())
except IOError as ex:
    print("sorry, couldn't find the file: " + ex.strerror)
except ValueError as ex:
    print("sorry, couldn't parse the number: " + ex.args[0])
print(my_number + 5)
1 For example, if we are running this bit of code using shell redirection and we run out of space
on the output file device.
try:
    f = open('my_file.txt')
    my_number = int(f.read())
except IOError as ex:
    print("sorry, couldn't find the file: " + ex.strerror)
except ValueError as ex:
    print("sorry, couldn't parse the number: " + ex.args[0])
else:
    print(my_number + 5)
else.py
The else block is run only if there were no exceptions raised in the try
block. It's a useful technique for a very specific scenario: when we want to
run code only in the absence of earlier exceptions without catching
exceptions for the code itself.
import os
rewrite the program using the techniques we learned above to catch the
exceptions, but where should we put the clean up code? One solution is to
place it at the end of the try block and all the except blocks:
import os

t = open('temp.txt', 'w')
t.write('some important temporary text')
t.close()
try:
    f = open('my_file.txt')
    my_number = int(f.read())
    print(my_number + 5)
    os.remove('temp.txt') # delete the temp file
except IOError as ex:
    print("sorry, couldn't find the file: " + ex.strerror)
    os.remove('temp.txt') # delete the temp file
except ValueError as ex:
    print("sorry, couldn't parse the number: " + ex.args[0])
    os.remove('temp.txt') # delete the temp file
but this isn't a great idea. Not only do we have to repeat the code three
times, but it still won't get run in some circumstances – for instance, if an
exception is raised inside the try block that doesn't get caught by one of
our except blocks. We could put the clean up code after the try/except
section:
import os

t = open('temp.txt', 'w')
t.write('some important temporary text')
t.close()
try:
    f = open('my_file.txt')
    my_number = int(f.read())
    print(my_number + 5)
except IOError as ex:
    print("sorry, couldn't find the file: " + ex.strerror)
except ValueError as ex:
    print("sorry, couldn't parse the number: " + ex.args[0])
os.remove('temp.txt')
This solves the first problem but not the second – if an exception is raised
inside the try block and not caught by one of our except blocks, the
clean up code won't be run.
The correct way to handle this situation in Python is using a finally
block. A finally block is guaranteed to run after the try block has
finished, regardless of whether an exception is raised or not. The
finally block has to come after the try/except/else blocks:
import os

t = open('temp.txt', 'w')
t.write('some important temporary text')
t.close()
try:
    f = open('my_file.txt')
    my_number = int(f.read())
    print(my_number + 5)
except IOError as ex:
    print("sorry, couldn't find the file: " + ex.strerror)
except ValueError as ex:
    print("sorry, couldn't parse the number: " + ex.args[0])
finally:
    os.remove('temp.txt')
finally.py
and is the best way to ensure that code is run even in the event of an
unhandled exception. finally blocks are typically used for cleaning up
or releasing resources like threads and database connections.
Blocks of code that use exception handling can become quite complex, so
here's a generic example with a quick summary:
try:
    # code in here will be run until an exception is raised
except ExceptionTypeOne:
    # code in here will be run if an ExceptionTypeOne
    # is raised in the try block
except ExceptionTypeTwo:
    # code in here will be run if an ExceptionTypeTwo
    # is raised in the try block
else:
    # code in here will be run after the try block
    # if it doesn't raise an exception
finally:
    # code in here will always be run
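The summary above can be checked with a small runnable sketch (assuming Python 3; the function name trace is made up for illustration). It records which blocks execute for a valid and an invalid input:

```python
def trace(text):
    """Return the names of the blocks that run while converting text to an int."""
    steps = []
    try:
        steps.append('try')
        int(text)
    except ValueError:
        steps.append('except')
    else:
        steps.append('else')
    finally:
        steps.append('finally')
    return steps

print(trace('42'))     # ['try', 'else', 'finally']
print(trace('forty'))  # ['try', 'except', 'finally']
```

The else block runs only when the try block finishes without an exception, while the finally block runs in both cases.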
Context managers
Some common operations are pretty much always carried out inside
try/finally blocks, to ensure that resources used inside the try block
are released. The most obvious example is reading a file – the pattern is
nearly always:
f = open('somefile.txt')
try:
    # do something with f
finally:
    f.close()
This is to ensure that the file is always closed regardless of any exceptions
that might occur while it is open. A feature of Python called context
managers allows this type of pattern to be encapsulated in a class and
reused. Context managers are invoked using the with statement. The
following bit of code is equivalent to the one above:
with open('somefile.txt') as f:
    # do something with f
but is a lot more readable. The file context manager is by far the most
widely used, but there are several other built in context managers, and
we can define our own¹.
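As a quick check that the with statement really does close the file when something goes wrong, here is a minimal sketch assuming Python 3. The temporary file and the deliberately raised RuntimeError are illustrative only:

```python
import os
import tempfile

# create a small file to read from
handle, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(handle, 'w') as t:
    t.write('ATGC')

try:
    with open(path) as f:
        contents = f.read()
        # simulate a problem that occurs while the file is still open
        raise RuntimeError('something went wrong while the file was open')
except RuntimeError:
    pass

# the context manager closed the file on the way out of the with block
print(f.closed)
os.remove(path)
```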
try:
    f = open('my_file.txt')    # this line might raise an IOError
    my_number = int(f.read())  # this line might raise a ValueError
except IOError:
    print('cannot open file!')
except ValueError:
    print('not an integer!')
finally:
    f.close()
but that won't work reliably. If open() raises an IOError, the assignment
to f never happens, so the call to f.close() in the finally block fails
with a NameError². To achieve the result we want, we need two nested try
blocks:
1 A discussion of how and why to do this is beyond the scope of this book, but if you're
interested, take a look at the contextlib module.
2 Python has no block-level scope: a name assigned inside a try block is visible in the finally
block, but only if the assignment was actually executed.
try:
    f = open('my_file.txt')
    try:
        my_number = int(f.read())
    except ValueError:
        print('not an integer!')
    finally:
        f.close()
except IOError:
    print('cannot open file')
nested_try.py
Because the inner try block is only entered once open() has succeeded, f
is guaranteed to exist when the inner finally block runs. It's a good idea
to use this feature sparingly – if
you find yourself writing code that requires more than two layers of
nested try blocks, then it's probably better to encapsulate some of that
complexity inside a separate function.
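One way to avoid the nesting, sketched here under the assumption of Python 3, is to let a with statement handle closing the file and to wrap the whole thing in a function. The name read_integer is hypothetical:

```python
import os
import tempfile

def read_integer(path):
    """Return the integer stored in the file at path, or None on failure."""
    try:
        with open(path) as f:   # f is closed for us, even if int() raises
            return int(f.read())
    except IOError:             # IOError is an alias of OSError in Python 3
        print('cannot open file')
    except ValueError:
        print('not an integer!')
    return None

# a file that exists and holds a valid integer
handle, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(handle, 'w') as t:
    t.write('37')
value = read_integer(path)
print(value)

# after deleting the file, opening it fails and the function returns None
os.remove(path)
print(read_integer(path))
```

Only a single level of try is needed, because the with statement does the job of the inner try/finally.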
Exceptions bubble up
Imagine you have some function that calls another function:
def function_one():
    # do some processing...
    return 5

def function_two():
    my_number = function_one()
    return my_number + 2

print(function_two())
def function_one():
    try:
        # do some processing...
        return 5
    except SomeException:
        # handle the exception...

def function_two():
    my_number = function_one()
    return my_number + 2

print(function_two())
but what happens if we don't? The answer is that the exception will be
passed up to the bit of code that called function_one – which in our
case is function_two. So we have a second chance to handle the
exception there:
def function_one():
    # do some processing...
    return 5

def function_two():
    try:
        my_number = function_one()
        return my_number + 2
    except SomeException:
        # handle the exception...

print(function_two())
def function_one():
    # do some processing...
    return 5

def function_two():
    my_number = function_one()
    return my_number + 2

try:
    print(function_two())
except SomeException:
    # handle the exception
When describing this behaviour, we often say that exceptions bubble up.
The best place to handle a given exception depends on what the program
is doing, and a discussion of best-practice exception handling is beyond
the scope of this book. However, it's often the case that it's easier to
handle an exception at a higher level (i.e. in function_two or the top
level code in our above example). As a general rule, your code should
handle exceptions in the place where it can do something about
them.
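Here is a runnable version of the same idea, as a sketch assuming Python 3. The function names parse_count and total_count and the sample data are made up for illustration. Neither function contains a handler, so the ValueError raised by int() travels up through both of them to the top level code:

```python
def parse_count(text):
    # int() raises ValueError here; nothing in this function catches it
    return int(text)

def total_count(fields):
    # no handler here either, so the exception keeps travelling upwards
    return sum(parse_count(field) for field in fields)

try:
    print(total_count(['12', '7', 'n/a']))
except ValueError as ex:
    # the top level code is the first place that handles the problem
    message = 'skipping record: ' + ex.args[0]
    print(message)
```

The top level is a sensible place for the handler in this case, because only the calling code knows what to do with a record that can't be parsed.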
Sometimes we want to take some action in response to an exception – for
example, print a warning message – but we still want to allow code at a
higher level to "see" the exception and respond to it. Python has a handy
shorthand for doing this: the statement raise, on its own, will cause the
exception that's currently being handled to be re-raised. This allows us to
write something like this:
def function_one():
    try:
        # do some processing...
        return 5
    except SomeException:
        print("warning: something went wrong")
        raise

def function_two():
    my_number = function_one()
    return my_number + 2

try:
    print(function_two())
except SomeException:
    # handle the exception
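A runnable sketch of re-raising, assuming Python 3 (the names log and parse_count are illustrative). The inner handler records a warning and then uses a bare raise, so the calling code still sees the original ValueError:

```python
log = []

def parse_count(text):
    try:
        return int(text)
    except ValueError:
        log.append('warning: could not parse ' + repr(text))
        raise  # re-raise the exception that is currently being handled

try:
    parse_count('n/a')
except ValueError as ex:
    log.append('caller saw: ' + type(ex).__name__)

print(log)
```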
Raising exceptions
As we've seen above, exceptions are the way that Python's built in
methods and functions signal that something has gone wrong. Writing
exception-handling code in the form of try/except blocks is our way of
intercepting those signals and responding to them.
We can also write our own code that is capable of signalling when
something has gone wrong. Just as with built in functions, our own
functions can indicate a problem by raising an exception. To do this we
create a new exception object, and then use it in a raise statement:
e = ValueError()
raise e
raise ValueError()
The error message that we get from the above line of code is very
unhelpful:
ValueError:
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content
import re
def get_at_content(dna):
    if re.search(r'[^ATGC]', dna):
        raise ValueError('Sequence cannot contain non-ATGC bases')
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content
check_bases.py
print(get_at_content('ACGTACGTGAC'))
print(get_at_content('ACTGCTNAACT'))
0.454545454545
Traceback (most recent call last):
...
ValueError: Sequence cannot contain non-ATGC bases
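Calling code can respond to this exception just as it would to one raised by a built in function. The following sketch assumes Python 3, where / is true division. The results dict and the loop are illustrative additions, and the function body is condensed from the version above:

```python
import re

def get_at_content(dna):
    if re.search(r'[^ATGC]', dna):
        raise ValueError('Sequence cannot contain non-ATGC bases')
    return (dna.count('A') + dna.count('T')) / len(dna)

results = {}
for seq in ['ACGTACGTGAC', 'ACTGCTNAACT']:
    try:
        results[seq] = get_at_content(seq)
    except ValueError as ex:
        # the message we supplied when raising is available in ex.args
        print('skipping ' + seq + ': ' + ex.args[0])

print(results)
```

Only the first sequence ends up in results; the second is reported and skipped rather than crashing the program.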
import re

class AmbiguousBaseError(Exception):
    pass

def get_at_content(dna):
    if re.search(r'[^ATGC]', dna):
        raise AmbiguousBaseError('Sequence cannot contain non-ATGC bases')
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content
custom_error.py
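One benefit of a custom exception class is that calling code can tell our error apart from unrelated ones. A minimal sketch, assuming Python 3 (the describe function is a hypothetical caller, and get_at_content is condensed from the listing above):

```python
import re

class AmbiguousBaseError(Exception):
    pass

def get_at_content(dna):
    if re.search(r'[^ATGC]', dna):
        raise AmbiguousBaseError('Sequence cannot contain non-ATGC bases')
    return (dna.count('A') + dna.count('T')) / len(dna)

def describe(dna):
    try:
        return 'AT content: ' + str(round(get_at_content(dna), 2))
    except AmbiguousBaseError:
        # raised explicitly by our own validation check
        return 'ambiguous bases present'
    except ZeroDivisionError:
        # raised by Python when the sequence is empty
        return 'empty sequence'

print(describe('ATGC'))
print(describe('ATNC'))
print(describe(''))
```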
1 If that sentence doesn't make sense, take a look at the chapter on object oriented
programming for an explanation of inheritance.
2 In Python, pass is just a place holder bit of code that means "do nothing".
Recap
Whenever we write code that relies on some data or resources being
supplied by an external source, we have to consider ways in which those
data or resources might cause a problem for our code. Two common
examples are when writing code that relies on user input (or on the
contents of external files), and when writing library-type code that might
be used by another programmer. Python's exception system offers an
elegant way both to respond to problems that occur in built in functions
and methods, and to report problems that occur in our own code.
We started the chapter by looking at how exception handlers allow us to
catch and deal with exceptions in a very flexible way – we can choose
exactly what kinds of exceptions we wish to handle and can write
arbitrary code to do so, and can choose at which level of the code
exceptions should be handled. We can even interrogate exceptions to get
extra information about the problem that occurred.
Raising exceptions, on the other hand, allows us to signal that something
has gone wrong in our code, resulting in (hopefully) helpful error
messages, and giving the calling code a chance to correct it. Python has a
range of built in exception types to represent common problems, but if we
need something more specific then we can easily create our own.
1 See the solution to the second exercise in this chapter for a fairly minimal example.
Using exceptions – rather than lengthy if/else conditions and print()
statements – to handle errors results in better code. Code that uses
exceptions tends to be more robust, since it allows us to deal with
problems when they actually arise, rather than trying to pre-emptively
catch them. It also tends to be more readable, since error-handling code is
clearly demarcated and the syntax of the exception handling system
makes it clear which type of errors are being handled.
Some readers might find the examples presented in this chapter
unconvincing. This is likely a reflection of the fact that exception
handling is most valuable in large projects and library code, neither of
which lends itself to concise examples (or to exercises in
programming books). As your programming projects become larger and
more complicated, you'll find that the encapsulation offered by
exceptions far outweighs the extra mental overhead of thinking about
them.
Exercises
Responding to exceptions
A Python programmer has written a piece of code that reads a DNA
sequence from a file and splits it up into a set number of equal-sized
pieces (ignoring any incomplete pieces at the end of the sequence). It asks
the user to enter the name of the file and the number of pieces, calculates
the length of each piece (by dividing the total length by the number of
pieces), then uses a range() to print out each piece:
# ask the user for the filename, open it and read the DNA sequence
input_file = raw_input('enter filename:\n')
f = open(input_file)
dna = f.read().rstrip("\n")
# ask the user for the number of pieces and calculate the piece length
pieces = int(raw_input('enter number of pieces:\n'))
piece_length = len(dna) / pieces
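print('piece length is ' + str(piece_length))
# print out each piece of DNA in turn
for start in range(0, len(dna)-piece_length+1, piece_length):
    print(dna[start:start+piece_length])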
original.py
The code works well enough, but after playing around with it for a while,
the programmer realizes that it's quite easy to make it crash by, for
example, giving it the name of a non-existent file, or entering zero when
asked for the number of pieces – or indeed, entering something that isn't
a number at all when asked for the number of pieces.
The programmer decides to make the code more robust by checking for
these three errors at each step of the program before proceeding to the
next step. That way, if it looks like the user has entered an invalid file
import os
import sys
# do the processing
piece_length = len(dna) / pieces
print('piece length is ' + str(piece_length))
for start in range(0, len(dna)-piece_length+1, piece_length):
    print(dna[start:start+piece_length])
original_with_if.py
class SequenceRecord(object):
    def get_fasta(self):
        ...

class ProteinRecord(SequenceRecord):
    def get_hydrophobic(self):
        ...

class DNARecord(SequenceRecord):
    def complement(self):
        ...
    def get_AT(self):
        ...
Solutions
Responding to exceptions
The first step in rewriting the code is to go back to the original code and
figure out exactly what kind of errors we are dealing with. We know from
the examples earlier in the chapter that we'll get an IOError if we try to
open a file that isn't there, and that we'll get a ValueError if we try to
turn an inappropriate string into an integer using the int() function. We
can easily find out what happens when we ask for the DNA to be split into
zero pieces by running the first version of the code and giving the
relevant input:
enter filename:
test.dna
enter number of pieces:
0
Traceback (most recent call last):
  File "adding.py", line 8, in <module>
    piece_length = len(dna) / pieces
ZeroDivisionError: integer division or modulo by zero
try:
    # ask the user for the filename, open it and read the DNA sequence
    input_file = raw_input('enter filename:\n')
    f = open(input_file)
    dna = f.read().rstrip("\n")
    # ask the user for the number of pieces and calculate the piece length
    pieces = int(raw_input('enter number of pieces:\n'))
    piece_length = len(dna) / pieces
    print('piece length is ' + str(piece_length))
except IOError:
    print("Couldn't open the file")
except ValueError:
    print("Not a valid number")
except ZeroDivisionError:
    print("Number of pieces can't be zero")
else:
    # print out each piece of DNA in turn
    for start in range(0, len(dna)-piece_length+1, piece_length):
        print(dna[start:start+piece_length])
When comparing the code above with the solution offered in the exercise
description there are two things to notice. Firstly, although it's not any
shorter, it's much easier to read because the code for dealing with input
errors is all collected in one place (the group of except blocks) rather
than being mixed up with the rest of the code.
Secondly, it is able to deal with a wider range of potential problems. For
instance, consider the case where the specified input file exists, but its
permissions are set such that it isn't readable by the program. Testing if
the file exists using os.path.exists, as was done in the previous
solution, will return True, but the program will still produce an error
when trying to open it. However, in our approach above, the IOError
that is raised when trying to open the file will still be caught and dealt
with. In light of this fact, we can probably make our error message in the
event of an IOError more helpful, by printing out the details alongside
our own error message:
...
except IOError as ex:
    print("Couldn't open the file: " + ex.strerror)
...
We can also make the ValueError message more helpful by printing out
its details – recall that these are stored in a tuple called args and that
the first element is the message we want:
...
except ValueError as ex:
    print("Not a valid number: " + ex.args[0])
...
try:
    # ask the user for the filename, open it and read the DNA sequence
    input_file = raw_input('enter filename:\n')
    f = open(input_file)
    dna = f.read().rstrip("\n")
    # ask the user for the number of pieces and calculate the piece length
    pieces = int(raw_input('enter number of pieces:\n'))
    piece_length = len(dna) / pieces
    print('piece length is ' + str(piece_length))
except IOError as ex:
    print("Couldn't open the file: " + ex.strerror)
except ValueError as ex:
    print("Not a valid number: " + ex.args[0])
except ZeroDivisionError as ex:
    print("Number of pieces can't be zero")
else:
    # print out each piece of DNA in turn
    for start in range(0, len(dna)-piece_length+1, piece_length):
        print(dna[start:start+piece_length])
responding.py
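The listing above is written for Python 2. As a rough sketch of how the same logic might look in Python 3, where raw_input() has become input() and / on two integers returns a float, here is a version wrapped in a function so that it can be tried without typing any input. The function name split_dna and the choice to return an empty list on error are assumptions made for this illustration:

```python
def split_dna(dna, pieces_text):
    """Return a list of equal-sized pieces of dna, or [] if the input is bad."""
    try:
        pieces = int(pieces_text)
        piece_length = len(dna) // pieces  # // keeps the length an integer
    except ValueError as ex:
        print('Not a valid number: ' + ex.args[0])
        return []
    except ZeroDivisionError:
        print("Number of pieces can't be zero")
        return []
    else:
        # collect each complete piece of DNA in turn
        return [dna[start:start + piece_length]
                for start in range(0, len(dna) - piece_length + 1, piece_length)]

print(split_dna('ATGCATGCAT', '3'))
print(split_dna('ATGCATGCAT', '0'))
print(split_dna('ATGCATGCAT', 'three'))
```

This sketch does not try to cover every bad input. A piece count larger than the sequence length gives a piece length of zero, and range() then raises a ValueError inside the else block, which is deliberately left uncaught here.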
class SequenceRecord(object):
    def get_fasta(self):
        ...
Note that we could have created a custom error type – perhaps called
EmptyGeneNameError – to raise here. The choice of whether to use a
built in Python exception or a custom one generally boils down to: are
1 Take a look at the section on inheritance in the chapter on object oriented programming for a
reminder about base and derived classes.
but notice that the exception is raised by the call to len() (since integers
don't have a length) and hence is a TypeError rather than a
ValueError. There are a couple of different approaches we can take to
the possibility of a non-string gene name. We might want to be flexible,
and allow a SequenceRecord object to be created with any type of
object as a gene name, in which case we need to convert the gene_name
variable to a string before checking its length:
class SequenceRecord(object):
    def get_fasta(self):
        ...
type of data as the gene name argument, since all Python objects can be
represented as strings – which may not be what we want.
Alternatively, we can decide to be very strict and only accept strings as
gene name arguments. We can enforce this by adding another validation
check which raises a TypeError if the gene name isn't a string:
class SequenceRecord(object):
    def get_fasta(self):
        ...
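The strict approach might look something like the following standalone sketch, assuming Python 3 (in Python 2 the type check would typically use basestring rather than str). The function names check_gene_name and describe and the error messages are hypothetical, not taken from the class above:

```python
def check_gene_name(gene_name):
    """Return gene_name unchanged if it is a non-empty string."""
    # check the type first, so that len() is only ever called on a string
    if not isinstance(gene_name, str):
        raise TypeError('gene name must be a string')
    if len(gene_name) == 0:
        raise ValueError('gene name cannot be empty')
    return gene_name

def describe(value):
    """Report which kind of exception, if any, the check raises."""
    try:
        check_gene_name(value)
        return 'ok'
    except TypeError as ex:
        return 'TypeError: ' + ex.args[0]
    except ValueError as ex:
        return 'ValueError: ' + ex.args[0]

print(describe('COX1'))
print(describe(42))
print(describe(''))
```

Using two different exception types lets the calling code distinguish an argument of the wrong type from a string with an unacceptable value.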
r"[A-Z][a-z]+ [a-z]+"
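which we'll use in the re.match() function rather than the more usual
re.search() because we want the match to start at the beginning of the
string rather than anywhere inside it (note that re.match() anchors only
the start, so rejecting trailing characters as well requires a $ at the
end of the pattern):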
import re
class SequenceRecord(object):
    def get_fasta(self):
        ...
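The difference between the matching functions is easy to check interactively. This sketch assumes Python 3.4 or later (for re.fullmatch()); the variable name species_pattern and the test strings are made up for illustration:

```python
import re

species_pattern = r"[A-Z][a-z]+ [a-z]+"

name = 'Homo sapiens 123'
# re.match() only requires the pattern to match at the start of the string
match_start = bool(re.match(species_pattern, name))
# adding $ or using re.fullmatch() requires the whole string to match
match_anchored = bool(re.match(species_pattern + '$', name))
match_full = bool(re.fullmatch(species_pattern, name))

sentence = 'the species Homo sapiens'
# re.search() finds the pattern anywhere; re.match() fails at the first character
search_anywhere = bool(re.search(species_pattern, sentence))
match_sentence = bool(re.match(species_pattern, sentence))

print(match_start, match_anchored, match_full)  # True False False
print(search_anywhere, match_sentence)          # True False
```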
if re.search(r'[^ATGC]', some_dna_sequence):
    # raise an exception
if re.search(r'[^FLSYCWPHQRIMTNKVADEG]', some_protein_sequence):
    # raise an exception
but the question is where to put them. There are two good options¹:
either we could override the base class constructor in each of the derived
classes, or we could add a validation method to the derived classes which
can be called by the superclass constructor. The first approach is probably
the most object oriented: it follows the principle of allowing derived
classes to inherit general functionality from the base class while adding
functionality that is specific to themselves. Take a look at the chapter on
object oriented programming, specifically the section on overriding
methods in the base class, for an example of this type of validation.
Because we've already seen an example of the first approach in a previous
chapter, we'll try the second one here so that we have seen an example of
both. To do this we simply add a sequence validation method to each of
the derived classes (DNARecord and ProteinRecord) and call it in the
constructor of the parent class (SequenceRecord) as the last step before
actually assigning the arguments. The magic of inheritance ensures that
when the sequence validation method is called in the constructor, the
appropriate subclass method is executed depending on whether we are
creating a DNARecord or a ProteinRecord.
Because we're intending the sequence validation method to only be called
in the base class constructor, we can take advantage of the Python
formatting convention that methods beginning with an underscore are for
internal use only. This isn't enforced by the language in any way, but it's a
useful hint to anyone looking at the source code that they shouldn't call
the sequence validation method for any other reason. Here's the code:
1 And more than a few bad ones, which we won't discuss here!
import re
class SequenceRecord(object):
    ...

class ProteinRecord(SequenceRecord):
    ...

class DNARecord(SequenceRecord):
    ...
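Since the class bodies are elided above, here is a minimal sketch of how the pieces might fit together, assuming Python 3. The constructor arguments, the method name _validate_sequence, the error messages, and the helper try_build are all assumptions made for illustration rather than the book's own code:

```python
import re

class SequenceRecord(object):
    def __init__(self, gene_name, species_name, sequence):
        # calls whichever _validate_sequence the subclass defines
        self._validate_sequence(sequence)
        self.gene_name = gene_name
        self.species_name = species_name
        self.sequence = sequence

class DNARecord(SequenceRecord):
    def _validate_sequence(self, sequence):
        if re.search(r'[^ATGC]', sequence):
            raise ValueError('DNA sequence contains non-ATGC characters')

class ProteinRecord(SequenceRecord):
    def _validate_sequence(self, sequence):
        if re.search(r'[^FLSYCWPHQRIMTNKVADEG]', sequence):
            raise ValueError('protein sequence contains invalid characters')

def try_build(record_class, sequence):
    """Return 'ok' if the record can be created, else the error message."""
    try:
        record_class('COX1', 'Homo sapiens', sequence)
        return 'ok'
    except ValueError as ex:
        return ex.args[0]

print(try_build(DNARecord, 'ATGC'))
print(try_build(DNARecord, 'ATGX'))
print(try_build(ProteinRecord, 'MKVL'))
print(try_build(ProteinRecord, 'MKVB'))
```

The base class never needs to know which kind of record is being created; the method lookup on self picks the right validation check.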
Because these two types of error fall into a natural hierarchy – they are
both examples of invalid sequences – let's create a few custom exception
classes to represent them. We'll write a base class,
InvalidCharacterError, which inherits from the Exception class,
then add two derived classes to represent errors in DNA and protein
sequences which will inherit from the base class. Here are the class
definitions:
class InvalidCharacterError(Exception):
    pass

class InvalidBaseError(InvalidCharacterError):
    pass

class InvalidAminoAcidError(InvalidCharacterError):
    pass
sequence_record.py
And here are the modifications that we have to make to the validation
methods:
class ProteinRecord(SequenceRecord):
    ...

class DNARecord(SequenceRecord):
    ...
sequence_record.py
except InvalidCharacterError:
    # deal with the invalid sequence
as well as exception handlers that only catch particular types (i.e. DNA or
protein) of error.
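A short sketch of how the hierarchy behaves in practice, assuming Python 3 (the classify function exists only to demonstrate the handlers). A handler for a derived class catches just that kind of error, while a handler for the base class catches both:

```python
class InvalidCharacterError(Exception):
    pass

class InvalidBaseError(InvalidCharacterError):
    pass

class InvalidAminoAcidError(InvalidCharacterError):
    pass

def classify(error):
    """Raise the given exception and report which handler caught it."""
    try:
        raise error
    except InvalidBaseError:
        # specific handler: only DNA errors end up here
        return 'bad DNA'
    except InvalidCharacterError:
        # general handler: catches any other invalid sequence error
        return 'bad sequence'

print(classify(InvalidBaseError('N')))
print(classify(InvalidAminoAcidError('B')))
```

The order of the handlers matters: Python uses the first except block that matches, so the more specific class has to be listed before its base class.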
Afterword
This is the end of Advanced Python for biologists; I hope you have
enjoyed the book, and found it useful. Remember that if you have
any comments on the book – good or bad – I'd love to hear them;
drop me an email at
If you think that you might end up using the techniques you've
learned in this book to build larger Python programs, then take a
look at the companion book Effective Python development for
Biologists, which contains detailed discussions of topics that you're
likely to run into – things like how to test your code, and how to
build a user interface.
If you've found the book useful, please also consider leaving an
Amazon review. Reviews will help other people to find the book, and
hopefully make learning Python a bit easier for everyone.
Index
A
alleles................................................................................................ 93
AmbiguousBaseError.......................................................................202
anonymous expression.....................................................................116
anonymous values..............................................................................43
args.................................................................................................. 188
AT content...................................................................79, 119, 123, 155
AttributeError.................................................................................... 74
autovivification.................................................................................. 61
B
base class..................................................................................... 70, 81
BLAST............................................................................................. 139
bugs................................................................................................. 181
C
call stack............................................................................................ 24
catching exceptions..........................................................................183
child to parent....................................................................................17
children.............................................................................................. 27
clades................................................................................................. 55
class............................................................................................. 66, 70
comma-separated fields...................................................................106
comments......................................................................................... 143
complex sorts................................................................................... 129
composition....................................................................................... 65
comprehensions...............................................................................159
constructor...................................................................................75, 87
context managers.............................................................................193
D
defining a class...................................................................................69
derived class.......................................................................................81
dict comprehension............................................................................62
dict comprehensions...........................................................................51
dictionary comprehensions...............................................................164
dicts of lists........................................................................................ 49
dicts of sets........................................................................................ 46
dicts of tuples.....................................................................................48
difference........................................................................................... 41
dinucleotides...................................................................................... 15
distance matrices...............................................................................58
duplicate elements.............................................................................39
E
else.................................................................................................. 189
encapsulation.................................................................................... 66
except....................................................................................... 183, 185
exception bubbling...........................................................................195
exceptions........................................................................................ 182
exhaustible expressions....................................................................163
extend()............................................................................................ 111
F
FASTA format.............................................................................77, 140
filter.......................................................................................... 123, 147
finally............................................................................................... 190
fitness................................................................................................ 93
footnotes.............................................................................................. 4
function factory................................................................................135
functional programming...................................................................110
G
generation........................................................................................ 106
generator expressions.......................................................................162
generators........................................................................................ 170
H
heterogeneous data............................................................................38
higher order functions......................................................................132
higher-order functions......................................................................114
homogeneous data.............................................................................39
I
immutable.......................................................................................... 38
inheritance................................................................................... 65, 77
instance............................................................................................. 66
instance variables..............................................................................69
intersection.................................................................................. 41, 58
InvalidAminoAcidError....................................................................218
InvalidBaseError..............................................................................218
InvalidCharacterError......................................................................217
IOError............................................................................................ 184
isdigit............................................................................................... 206
isinstance()........................................................................................ 54
issubset method................................................................................. 47
iter................................................................................................... 167
iterable objects.................................................................................159
iterable types....................................................................................167
iterator interface..............................................................................167
itertools........................................................................................... 179
K
key functions.................................................................................... 126
kmer counting.................................................................................... 50
kmers......................................................................................... 10, 133
kmers()............................................................................................. 171
L
lambda expression....................................................................116, 135
last common ancestor...........................................................31, 56, 131
lazy evaluation.................................................................................123
list comprehension.............................................................................47
list comprehensions..........................................................................160
lists of dicts........................................................................................ 43
lists of lists......................................................................................... 41
lists of tuples...................................................................................... 43
M
map................................................................................................. 118
metadata........................................................................................... 68
methods............................................................................................. 69
mismatches...................................................................................... 143
monophyly.......................................................................................... 58
multiple sequence alignment..............................................................42
mutability........................................................................................ 110
N
nested comprehensions...............................................................63, 163
nested lists......................................................................................... 52
Newick format....................................................................................52
next.................................................................................................. 168
node................................................................................................... 29
O
object................................................................................................. 66
object oriented programming.............................................................65
os.path.exists................................................................................... 210
overriding........................................................................................... 86
P
pairwise comparison..........................................................................48
pairwise comparisons.......................................................................164
parent to child....................................................................................17
parents.............................................................................................. 20
partial function application..............................................................137
phylogenetic trees...............................................................................52
poly-A tail.................................................................................127, 140
polymorphism..............................................................................65, 90
pop()................................................................................................ 27
population......................................................................................... 93
primers............................................................................................ 173
pure functions.................................................................................. 114
R
raise................................................................................................. 198
raising exceptions.............................................................................198
random.............................................................................................. 97
re.match........................................................................................... 215
Recursion........................................................................................... 10
reduce.............................................................................................. 130
restriction enzymes..........................................................................135
S
self..................................................................................................... 70
set comprehensions..........................................................................166
Sets.................................................................................................... 39
side effects....................................................................................... 111
simulation.......................................................................................... 93
sliding window................................................................................. 133
sorted............................................................................................... 125
stable sort........................................................................................ 129
stack.................................................................................................. 28
state................................................................................................. 110
StopIteration....................................................................................168
strerror............................................................................................ 187
sub................................................................................................... 153
subclass............................................................................................. 81
subsets............................................................................................... 47
subtrees............................................................................................. 54
sum.................................................................................................. 111
superclass.......................................................................................... 81
syntax errors.................................................................................... 181
T
taxonomy........................................................................................... 17
transformation function...................................................................119
translation......................................................................................... 85
tree.................................................................................................... 17
trees................................................................................................... 52
trinucleotides..................................................................................... 15
try.................................................................................................... 183
Tuples................................................................................................ 37
two-dimensional list...........................................................................42
TypeError......................................................................................... 213
typography........................................................................................... 4
U
union............................................................................................ 41, 58
unpacking tuples................................................................................46
V
ValueError....................................................................................... 184
Y
yield................................................................................................. 170
Z
ZeroDivisionError............................................................................209
_
__init__()............................................................................................ 75
__iter__............................................................................................. 167