0% found this document useful (0 votes)
90 views18 pages

Getting Started With Python in The Lab

The document introduces Python as a versatile programming language well-suited for tasks from small scripts to large applications. It demonstrates how to write a simple Python program to calculate the molecular weight of DNA sequences. Basic Python concepts are covered, such as creating strings, using for loops to iterate over strings, slicing strings, and using dictionaries to map nucleotides to their molecular weights. Finally, it shows how to organize the code into a reusable function to calculate the molecular weight of any input DNA sequence.

Uploaded by

Angroys
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
Download as pdf or txt
0% found this document useful (0 votes)
90 views18 pages

Getting Started With Python in The Lab

The document introduces Python as a versatile programming language well-suited for tasks from small scripts to large applications. It demonstrates how to write a simple Python program to calculate the molecular weight of DNA sequences. Basic Python concepts are covered, such as creating strings, using for loops to iterate over strings, slicing strings, and using dictionaries to map nucleotides to their molecular weights. Finally, it shows how to organize the code into a reusable function to calculate the molecular weight of any input DNA sequence.

Uploaded by

Angroys
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
Download as pdf or txt
Download as pdf or txt
You are on page 1/ 18

Getting

started with
Python in
the lab

Gordon Webster, Amber Biology LLC


Getting started with Python in the lab

It has been my impression for some time, that many life science researchers
are missing out on some of the wonderful computational tools that are out
there, and that could be invaluable in helping them to manage and analyze their
laboratory data. This may be due to a lack of awareness that these tools exist
and that many of them are inexpensive or free, or it may be out of fear that
learning to use them will take too much time out of an already hectic schedule.
In starting this series of code tutorials for life science computing, I hope to
point life scientists with relatively little exposure to programming languages, in
the right direction to be able to start using these tools to write useful code that
can solve real problems.

Getting started with Python in the lab


Getting started with Python in the lab
In this tutorial we will create a simple program that can predict the pattern of
single-strand DNA fragments that you would expect on your electrophoresis
gel after doing a restriction digest. So without further ado, let’s introduce one
o f our favorite programming languages Python. Python is an incredibly
versatile programming environment, being well suited for creating solutions
from small scripts of a few lines in length, to large scale applications that are
used throughout global organizations. The Python development environment
is freely available for download from the of cial Python web site. For users of
Windows and Mac OS X platforms, Python comes in nice self-installing
packages that pretty much do everything for you. For Linux users doing a
manual install, you will need to unzip the install package and run a shell script
or (more conveniently), you could just do a one-click install with one of the
excellent software package managers that come with most Linux OS
distributions. The good news for Linux users is that chances are, your Linux OS
already comes with Python included and if you’re running your own Linux
system, you probably don’t need our help installing stuff anyway!

We’ll be writing our code using the nice clean IDLE development environment
that comes with Python. IDLE is really a kind of specialized text editor that can
actually run the Python code you type into it. On Mac OS X or Linux, it can be
launched from a terminal window (usually by just typing “idle” if everything is
installed and con gured correctly). On Windows you will probably need to
launch it from the command line (“run”) window.

Getting started with Python in the lab


Getting started with Python in the lab
Is IDLE running now? Congratulations, you just ran your rst Python
application because IDLE itself is actually written in Python!

Just Do It

So let’s start by creating a (very) simple DNA sequence in Python. In IDLE,


type the following line and hit the return key

mysequence = “atcg”

In programming jargon, ‘mysequence’ is a string of 4 characters. IDLE will


probably color code the “atcg” part to show this. Now type the following and
hit return

len(mysequence)

If everything went well, IDLE should now display the number 4 which is the
length of the string in characters returned by the ‘len’ function. Now try typing
the following lines. Make sure you indent the ‘print ch’ line by one tab space
just as I have done and you will have to hit the return key twice after the ‘print
ch’ line to actually run the code.

for ch in mysequence:
print ch

You should now see your sequence printed out in order, one character per line
like this

Getting started with Python in the lab


Getting started with Python in the lab
a
t
c
g

The code you just typed translates into english as “for each item in
‘mysequence’, give the item the temporary name ‘ch’ then print ‘ch'”. This kind
of loop construction is known as an iterator in Python. Python knows that
strings are made up of smaller items, in this case characters, so when you
iterate over a string using the ‘for’ command, the iterator returns each
character in the string, in order of occurrence. I used the name ‘ch’ because the
items in this case are characters, but I could just have easily used any other
name like ‘item’ or ‘c’ or ‘roger_the_shrubber‘.

The indent in the ‘print ch’ line is signi cant. Indents are the way to create
blocks of code in Python that are to be executed all together. Every line of
code that is indented by one level underneath the ‘for’ statement, gets
executed on each round trip through the loop, so if we had for example, also
wanted to print the position of the nucleotide in the sequence, we could have
added one or two more indented lines under the ‘for’ statement like this

i=0
for ch in mysequence:
print ch
print i
i += 1

Getting started with Python in the lab


Getting started with Python in the lab
In this modi ed version, a variable i is declared before the loop and then also
printed and incremented by 1 on each pass through the loop. The ‘i += 1’
syntax is Python shorthand for ‘i = i + 1’. The important thing to note here is
that all 3 of the indented lines are executed for each pass through the loop.The
output from the loop now looks like this

a
0
t
1
c
2
g
3

If we were to unindent the ‘i += 1’ line, this would remove it from the loop and
it would become the rst line to be executed after all the passes through the
loop had been completed. Can you figure out how this would change the output
printed by the loop?

Let’s bring in the string section

Another feature of strings (and all list-based objects in Python) is that their
characters/items can also be accessed according to their numerical position in
the object, starting from 0 (all lists, sequences, arrays and so on in Python are
numbered from 0), up to 1 minus the length of the object. So mysequence[0] =
‘a’, mysequence[1] = ‘t’ and so on. You can also slice strings up using [from:to]
syntax like this

Getting started with Python in the lab


Getting started with Python in the lab
name = “Arthur King”
print len(name)
11
print name[0:6]
Arthur
print name[:6]
Arthur
print name[7:11]
King
print name[7:]
King
print name[-4:]
King

Note the following features of Python indices: the last number in the ‘from:to’
range is exclusive; if the ‘from’ or ‘to’ index is excluded, the beginning or end of
the string/list is assumed; a negative index for a string/list denotes the position
-n from the end of the string.

Dictionaries are your friend

So now we know how to create DNA sequences as strings and how to iterate
over them, let’s introduce the Python dictionary. Dictionaries resemble in
some aspects the kind of list object that we already encountered in strings, but
instead of the items being labeled according to their numerical positions, they
can be labeled pretty much any old way we choose. For example …

mydictionary = {“galahad”:”pure”,”lancelot”:”brave”}

Now if you type

Getting started with Python in the lab


Getting started with Python in the lab
print mydictionary[“lancelot”]

you get

‘brave’

and so on. Dictionaries are composed of ‘key:value’ pairs and the stored value
for each key can be accessed using the mydictionary[key] syntax. Dictionaries
can also be added to like this

mydictionary[“bedevere”] = “wise”

So now let’s create a dictionary to store the molecular weights of the 4


standard DNA nucleotides

nucleotides = {‘a’:131.2,’t’:304.2,’c’:289.2,’g’:329.2}

Note that we used single quotes here. Python allows both single and double
quotes to be used interchangeably but the closing quote must be the same as
the opening quote. What this dictionary does in effect, is to translate the
symbol for the nucleotide into a molecular weight, for example

print nucleotides[‘t’]

prints the value

304.2

So now we are in a position to actually do a useful calculation. By combing the


‘for’ iterator for our sequences and the nucleotide dictionary, we can calculate
the molecular weight of any DNA sequence. Here’s how …

Getting started with Python in the lab


Getting started with Python in the lab
molecularweight = 0.0
for ch in mysequence:
molecularweight += nucleotides[ch]
print molecularweight

When you run this code, you get the molecular weight for our little 4
nucleotide sequence

1053.8

Now we could also do all kinds of fancy stuff like adding the molecular weight
of a 5′ phosphate if one is present, or accounting for DNA adducts or
chemically modified bases, but you get the general idea.

Organize your code with functions

Since we’re going to want to use the same molecular weight calculation on
different sequences, let’s create a Python function that takes a DNA sequence
as input and returns its molecular weight.

def calculateMolecularWeight(sequence):
molecularWeight = 0.0
for ch in sequence:
molecularWeight += nucleotides[ch]
return molecularWeight

Getting started with Python in the lab


Getting started with Python in the lab
In Python the ‘def’ statement de nes a function by its name
‘calculateMolecularWeight’ and by the parameters that it takes as input – in
this case a single parameter ‘sequence’. Note that Python is a dynamically
typed language that does not require you to de ne what kind of data
‘sequence’ is before you run it, so the calculateMolecularWeight function
would happily accept an integer like 2 as the input parameter, but it would
raise an error when it tried to process it since an integer has no items in it for
the ‘for’ loop to iterate over. Notice also how all of the code that forms the
body of the function is indented (with the code in the ‘for’ loop being indented
one more level still). The ‘return’ statement in a function does two things – it
exits the function and if it is accompanied by a parameter as in this case, it sets
the return value of the function to that parameter.

Now that we have our function, we can calculate the molecular weight for any
DNA sequence that we give it as an input parameter, like this

mwt = calculateMolecularWeight(mysequence)

or we can even supply the sequence parameter explicitly like this

mwt = calculateMolecularWeight(“gatgctgtggataa”)

So now let’s look at adding the nal feature – the ability to detect restriction
sites in the sequence and calculate the molecular weights of all the DNA
fragments produced by a restriction digest.

We can de ne restriction sites as sequences in a similar way as we did for DNA


sequences. For example

bamH1 = “ggatcc”
sma1 = “cccggg”

Getting started with Python in the lab


Getting started with Python in the lab
but in addition to the restriction enzyme’s recognition motif, we also need to
include information about where the enzyme cuts the DNA strand within the
motif, so let’s exapnd our deinition of a restriction site by using the very
versatile list format that Python offers.

bamH1 = [“ggatcc”,0]
sma1 = [“cccggg”,2]

Lists can be formed by grouping items in square brackets as shown above.


Python lists can contain other Python objects of any kind – even other Python
lists! The second item (an integer) in our restriction site de nition is the
position in the motif after which the enzyme cuts the DNA (again numbering
from 0 as the rst position). The items in lists can be accessed by their
positions, just like the characters in a string, so typing

print sma1[1]

returns the value

Remember that Python lists are numbered from 0, so ‘sma1[1]’ is actually the
2nd item in the list. Now that we have our restriction sites, we need a way to
search for them in a DNA sequence. Fortunately there is a very handy ‘ nd’
method for Python strings that does exactly what we want to do. The ‘ nd’
method is a property of strings in Python and such a property can be accessed
using the dot notation like this

position = mysequence.find(“tag”)

Getting started with Python in the lab


Getting started with Python in the lab
This looks for the next occurrence of the amber stop codon in ‘mysequence’
and returns its position in the sequence. If the search motif is not found, ‘ nd’
returns the value -1. Notice the phrase ‘next occurrence’ above. The ‘ nd’
method returns the position of the rst occurrence of the search motif that it
nds. If you want to look for multiple occurrences, an additional parameter
can be supplied to ‘ nd’ to tell it which position in the sequence to start
searching from, like this

position = mysequence.find(“tag”,172)

Then for example, if you nd the rst stop codon at position 172, you can
search again from that position to look for the next occurrence and so on. If no
position parameter is supplied to ‘ nd’ (as in the rst example), the search is
always done from the beginning of the string.

The . nd syntax works for any Python string because in object-oriented


programming (OOP) terms, ‘mysequence’ belongs to the general string class
and ‘ nd’ is a method de ned for that class. We will hardly delve at all into
OOP in this introductory tutorial, except to say that Python is an OOP
language and with a little OOP knowledge there are ways that we could write
the code in this tutorial much more cleanly and elegantly in an OOP style. For
instance, we could de ne our own DNA sequence and restriction site classes,
with their own properties and methods. Fortunately for us however, unlike
some other OOP languages, Python does not force the programmer to work in
an OOP style all the time, so we can learn just the general nuts and bolts of
Python for now and save OOP for another day.

We already have our molecular weight function, so let’s add another function
for doing restriction digests of our sequences. This function will take a
sequence and a restriction site de nition as input and return a list of the
digested DNA fragments. Here it is …

Getting started with Python in the lab


Getting started with Python in the lab
def digestDNA(sequence,rs):
frags = []
last = 0
while 1:
i = sequence.find(rs[0],last)
if i == -1:
frags.append(sequence[last:])
break
else:
frags.append(sequence[last:i+rs[1]])
last = i+rs[1]
return frags

The line ‘frags = []’ creates an empty list that we will use to store our DNA
fragments. The variable ‘last’ will be used to record the last place that we
found the restriction site de ned in ‘rs’. Initially, ‘last’ must obviously be set to
0 so that the search starts at the beginning of the sequence.

The Python ‘while’ command is another kind of iterative loop that works
differently from the ‘for’ loop that we have already encountered. The ‘while’
loop will continue to cycle for as long as the condition after ‘while’ is true. It is
easier to understand how the ‘while’ loop works by studying this example

x=0
while x < 10:
print x
x += 1

Getting started with Python in the lab


Getting started with Python in the lab
This loop will print the numbers from 0 to 9 and then stop once x reaches 10. In
our function however, we don’t seem to have any condition at all after ‘while’,
but just a rather mysterious integer, the number 1. The reason for this is that
integers in Python can be used as shorthand for true or false where an integer
greater or less than 0 is ‘true’, with 0 being ‘false’.

In our function then, the loop would seem to run forever since 1 is always true
and never gets changed. If you look in the indented body of the loop however,
you will see the ‘break’ command which exits the loop. The loop in our function
i s essentially con gured to run forever, until we encounter the situataion
where there are no more occurrences of the restriction site motif (in which
case sequence. nd() returns -1) and then we simply add the remainder of the
sequence to our list of fragments and exit the loop using the ‘break’ command.

Getting started with Python in the lab


Getting started with Python in the lab
The ‘if’ construct tests whether the value returned from ‘ nd’ equals -1.
Notice that in the conditional statement we use ‘i == 1’ which means “return
‘true’ if i equals 1 or ‘false’ if not”, as opposed to “i=1” which means “set the
value of i to 1”. If the condition following ‘if’ is true, everything in the indented
‘if’ block is done. The ‘else’ block is a way to tell Python what to do in the event
that the ‘if’ condition is false. In our case, if ‘ nd’ does nd another occurrence
of the restriction site, we use the restriction site cut position to define the new
fragment we’re generating, we add the new fragment to our fragment list using
the ‘append’ method that is a property of Python lists and we update the ‘last’
variable so that next search for the restriction site will continue from the
position of its last occurrence. Pretty simple right?

All together now …

Now we can nally put it all together to create a Python script that will
calculate the molelcular weights of the DNA fragments produced by a
restriction digest. You will notice that the ‘digestDNA’ function returns a list of
DNA sequences, so after we have run the function, we can use our iterative
‘for’ loop to go through the returned list of fragments one at a time and
calculate the molecular weight of each one, like this

for fragment in digestDNA(s,sma1):


print item
print calculateMolecularWeight(item)

Getting started with Python in the lab


Getting started with Python in the lab
Notice that instead of assigning the return value of the ‘digestDNA’ function to
a variable and then iterating over the variable like this

fragmentList = digestDNA(s,sma1)
for fragment in fragmentList:

we just used the function itself directly where the variable would go. This is
perfectly OK to do in Python and in fact, it is often good form to do this to
make the code more concise and readable.

So let’s create a fake sequence with a couple of SmaI restriction sites in it and
test our code.

s = “atcgatcgatcgcccgggatcgatcgcccgggatcgatcgatcg”
for fragment in digestDNA(s,sma1):
print fragment
print calculateMolecularWeight(fragment)

If everything went well, running this code should yield the following output

atcgatcgatcgcc
3739.8
cgggatcgatcgcc
3962.8
cgggatcgatcgatcg
4438.2

Getting started with Python in the lab


Getting started with Python in the lab
If we wanted to do a multiple digest, for each additional restriction site, we
could apply the same approach to the fragments generated by the previous
digest(s). In this case, we could write our code in such a way that it could be
applied recursively to our original sequence and the subsequent fragments.
Recursive methods are easy to write in Python, but we’ll save recursion for a
future tutorial.

One of the great things about Python is that it is a very popular language with a
large and ourishing community. This means that for most problem domains,
including the life sciences, somebody somewhere has probably already
developed a library of Python functions for your particular problem. As you
might have guessed, there already exist Python libraries for the kind of
bioinformatics problems that we have tackled here. Since the goal here was to
learn enough basic Python to be able to write some useful code, we did not rely
on any of them for our simple problem. Here are some links however, to give
you an idea of the kinds of life science libraries that are available in Python.

Biopython – a set of freely available tools for biological computation


Biskit – an object-oriented library for structural bioinformatics research
SciPy – a library for mathematics, science, and engineering, with many useful
algorithms for life science computing

Getting started with Python in the lab


Getting started with Python in the lab
This list is by no means exhaustive, but it gives some idea of the universe of
useful Python tools that are available to the life scientist.

Don’t stop there!

So now that you have hopefully seen enough to appreciate how easy it can be
to write useful code, we encourage you not to stop here. We will be publishing
future tutorials, but in the meantime, why not head on over and browse the
many fantastic references, tutorials and other resources for new Python users
that can be found at the documentation area of the Python web site.

If you really want to take your Python to the next level, and are interested in
learning enough Python to be able to really incorporate it into your work
and/or your research, check out our book Python For The Life Sciences,
scheduled for publication in the Summer/Fall of 2016.

Getting started with Python in the lab

You might also like