PROGRAMMING

Why Python Rocks for Research

By HOYT KOEPKE

The following is an account of my own experience with Python. Because that experience was so positive, this is an unabashed attempt to promote the use of Python for general scientific research and development. About four years ago, I dropped MATLAB in favor of Python as my primary language for coding research projects. This article is a personal account of how rewarding I found that experience to be.

As I describe in the next sections, the variety and quality of Python’s features have spoiled me. Even in small scripts, I now rely on Python’s numerous data structures, classes, nested functions, iterators, the flexible function calling syntax, an extensive kitchen-sink-included standard library, great scientific libraries, and outstanding documentation.

To clarify, I am not advocating just Python as the perfect scientific programming environment; I am advocating Python plus a handful of mature third-party open source libraries, namely NumPy/SciPy for numerical operations, Cython for low-level optimization, IPython for interactive work, and Matplotlib for plotting. Later, I describe these and others in more detail, but I introduce these four here so I can weave discussion of them throughout this article.

Given these libraries, many features in MATLAB that enable one to quickly write code for machine learning and artificial intelligence (my primary area of research) are essentially a small subset of those found in Python. After a day learning Python, I was still able to use most of the matrix tricks I had learned in MATLAB, but could also use more powerful data structures and design patterns when needed.

Holistic Language Design
I once believed that the perfect language for research was one that allowed concise and direct translation from notepad scribblings to code. On the surface, this is reasonable. The more barriers between generating ideas and trying them out, the slower research progresses. In other words, the less one has to think about the actual coding, the better. I now believe, however, that this attitude is misguided.

MATLAB’s language design is focused on matrix and linear algebra operations; for turning such equations into one-liners, it is pretty much unsurpassed. However, move beyond these operations and it often becomes an exercise in frustration. R is beautiful for interactive data analysis, and its open library of statistical packages is amazing. However, the language design can be unnatural, and even maddening, for larger development projects. While Mathematica is perfect for interactive work with pure math, it is not intended for general purpose coding.

The problem with the “perfect match” approach is that you lose generalizability very quickly. When the criteria for language design are too narrow, you inevitably choose excellence for one application over greatness for many. This is why universities have graduate programs in computer language design: navigating the pros and cons of various design decisions is extremely difficult to get right. The extensive use of Python in everything from system administration and website design to number-crunching shows that it has, indeed, hit the sweet spot. In fact, I’ve anecdotally observed that becoming better at R leads to skill at interacting with data, becoming better at MATLAB leads to skill at quick-and-dirty scripting, but becoming better at Python leads to genuine programming skill.

Practically, in my line of work, the downside is that some matrix operators that are expressible using syntactical constructs in MATLAB become function calls (e.g. x = solve(A, y) instead of x = A \ y). In exchange for this extra verbosity, which I have not found problematic, one gains incredible flexibility and a language that is natural for everything from automating system processes to scientific research. The coder doesn’t have to switch to another language when writing non-scientific code, and can easily leverage other libraries (e.g. databases) for scientific research.

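To make the point about leveraging other libraries a little more concrete, here is a rough sketch that stores the results of a few experiment runs in a database and queries them back, using only the standard library's sqlite3 module. The table name, column names, and numbers are hypothetical and chosen purely for illustration; they do not come from any real experiment.

```python
import sqlite3

# Hypothetical experiment log: (run id, learning rate, test error).
runs = [(1, 0.1, 0.31), (2, 0.01, 0.24), (3, 0.001, 0.27)]

# ":memory:" keeps the whole database in RAM; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER, rate REAL, error REAL)")
conn.executemany("INSERT INTO runs VALUES (?, ?, ?)", runs)

# Let the database find the run with the lowest error instead of
# hand-writing the search loop.
best = conn.execute(
    "SELECT id, rate, error FROM runs ORDER BY error LIMIT 1"
).fetchone()
print(best)

conn.close()
```

The same script could go on to do numerical work on the selected run, which is the practical benefit described above: the bookkeeping and the science live in one language.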
Furthermore, Python allows one to easily leverage object oriented and functional design patterns. Just as different problems call for different ways of thinking, so also different problems call for different programming paradigms. There is no doubt that a linear, procedural style is natural for many scientific problems. However, an object oriented style that builds on classes having internal functionality and external behavior is a perfect design pattern for others. For this, classes in Python are full-featured and practical. Functional programming, which builds on the power of iterators and functions-as-variables, makes many programming solutions concise and intuitive. Brilliantly, in Python, everything can be passed around as an object, including functions, class definitions, and modules. Iterators are a key language component and Python comes with a full-featured iterator library. While it doesn’t go as far in any of these categories as flagship paradigm languages such as Java or Haskell, it does allow one to use some very practical tools from these paradigms. These features combine to make the language very flexible for problem solving, one key reason for its popularity.

Readability
To reiterate a recurrent point, Python’s syntax is very well thought out. Unlike many scripting languages (e.g. Perl), readability was a primary consideration when Python’s syntax was designed. In fact, the joke is that turning pseudocode into correct Python code is a matter of correct indentation.

This readability has a number of beneficial effects. Guido van Rossum, Python’s original author, writes:

“This emphasis on readability is no accident. As an object-oriented language, Python aims to encourage the creation of reusable code. Even if we all wrote perfect documentation all of the time, code can hardly be considered reusable if it’s not readable. Many of Python’s features, in addition to its use of indentation, conspire to make Python code highly readable.”

In addition, I’ve found it encourages collaboration, and not just by lowering the barrier to contributing to an open source Python project. If you can easily discuss your code with others in your office, the result can be better code and better coders.

As an example of this, consider the following code snippet:

    def classify(values, boundary=0):
        "Classifies values as being below (False) or above (True) a boundary."
        return [(True if v > boundary else False) for v in values]

    # Call the above function
    classify(my_values, boundary=0.5)

Let me list four aspects of this code. First, it is a small, self-contained function that only requires three lines to define, including documentation (the string following the function). Second, a default argument for the boundary is specified in a way that is instantly readable (and yes, that does show up when using Sphinx for automatic documentation). Third, the list processing syntax is designed to be readable. Even if you are not used to reading Python code, it is easy to parse this code: a new list is defined and returned from the list values, using True if a particular value v is above boundary and False otherwise. Finally, when calling functions, Python allows named arguments; this universally promotes clarity and reduces stupid bookkeeping bugs, particularly with functions requiring more than one or two arguments.

Permit me to contrast these features with MATLAB. With MATLAB, globally available functions are put in separate files, which discourages the use of smaller functions and, in practice, often promotes cut-and-paste programming, the bane of debugging. Default arguments are a pain, requiring conditional coding to set unspecified arguments. Finally, specifying arguments by name when calling is not an option, though one popular but artificial construct (alternating names and values in an argument list) allows this to some extent.

Balance of High Level and Low Level Programming
The ease of balancing high-level programming with low-level optimization is a particularly strong point of Python code. Python code is meant to be as high level as reasonable; I’ve heard that in writing similar algorithms, on average you would write six lines of C/C++ code for every line of Python. However, as with most high-level languages, you often sacrifice code speed for programming speed.

One sensible approach around this is to deal with higher level objects, such as matrices and arrays, and optimize operations on these objects to make the program acceptably fast. This is MATLAB’s approach and is one of the keys to its success; it is also natural with Python. In this context, speeding code up means vectorizing your algorithm to work with arrays of numbers instead of with single numbers, thus reducing the overhead of the language when array operations are optimized.

Abstractions such as these are absolutely essential for good scientific coding. Focusing on higher-level operations over higher-level data types generally leads to massive gains in coding speed and coding accuracy. Python’s extension type system seamlessly allows libraries to be designed around this idea. NumPy’s array type is a great example.

However, existing abstractions are not always enough when you’re developing new algorithms or coding up new ideas. For example, vectorizing code through the use of arrays is powerful but limited. In many cases, operations really need loops, recursion, or other coding structures that are extremely efficient in optimized, compiled machine code but are not in most interpreted languages. As variables in many interpreted languages are not statically typed, the code can’t easily be compiled into optimized machine code. In the scientific context, Cython provides the perfect balance between the two by allowing either.

Cython works by first translating Python code into equivalent C code that runs the Python interpreter through the Python C API. It then uses a C compiler to create a shared library that can be loaded as a Python module. Generally, this module is functionally
equivalent to the original Python module and usually runs marginally faster. The advantage, however, is that Cython allows one to statically type variables; e.g. cdef int i declares i to be an integer. This gives massive speedups, as typed variables are now treated using low-level types rather than Python variables. With these annotations, your “Python” code can be as fast as C while requiring very little actual knowledge of C.

Practically, a few type declarations can give you incredible speedups. For example, suppose you have the following code:

    def foo(A):
        for i in range(A.shape[0]):
            for j in range(A.shape[1]):
                A[i,j] += i*j

where A is a 2d NumPy array. This code uses interpreted loops and thus runs fairly slowly. However, add type information and use Cython:

    # Cython needs the compile-time array type declared via cimport
    from numpy cimport ndarray

    def cyfoo(ndarray[double, ndim=2] A):
        cdef size_t i, j

        for i in range(A.shape[0]):
            for j in range(A.shape[1]):
                A[i,j] += i*j

Cython translates necessary Python operations into calls to the Python C-API, but the looping and array indexing operations are turned into low level C code. For a 1000 x 1000 array, on my 2.4 GHz laptop, the Python version takes 1.67 seconds, while the Cython version takes only 3.67 milliseconds (a vectorized version of the above using an outer product took 15.1 ms).

A general rule of thumb is that your program spends 80% of its time running 20% of the code. Thus a good strategy for efficient coding is to write everything, profile your code, and optimize the parts that need it. Python’s profilers are great, and Cython allows you to do the latter step with minimal effort.

Language Interoperability
As a side effect of its universality, Python excels at gluing other languages together. One can call MATLAB functions from Python (through the MATLAB engine) using MLabWrap, easing transitions from MATLAB to Python. Need to use that linear regression package in R? RPy puts it at your fingertips. Have fast FORTRAN code for a particular numerical algorithm? F2py will effortlessly generate a wrapper. Have general C or C++ libraries you want to call? Ctypes, Cython, and SWIG are three ways to easily interface to them (my favorite is Cython). Now, if only all these were two way streets...

Documentation System
Brilliantly, Python incorporates module, class, function, and method documentation directly into the language itself. In essence, there are two levels of comments: programming level comments (which start with #) that are ignored by the compiler, and documentation comments that are specified by a doc string after the function or method name. These documentation strings add tags to the methods which are accessible by anyone using an interactive Python shell or by automatic documentation generators.

The beauty of Python’s system becomes apparent when using Sphinx, a documentation generation system originally built for Python language documentation. To allow sufficient presentation flexibility, it uses reStructuredText, a simple, readable markup language that is becoming widely used in code documentation. Sphinx works easily with embedded doc-strings, but it is useful beyond documentation; for example, my personal website, my course webpages when I teach, my code documentation sites, and, of course, Python’s main website are generated using Sphinx.

One helpful feature for scientific programming is the ability to put LaTeX equations and plots directly in code documentation. For example, if you write:

    .. math:: \Gamma(z) = \int_0^\infty x^{z-1} e^{-x}\,dx

in the doc string, it is rendered in the webpage as a typeset equation.

[Figure: the rendered equation for the gamma function]

Including plots is easy. The following doc-string code:

    .. plot::

       import numpy as np
       import matplotlib.pyplot as plt

       x = np.linspace(0, 2, 1000)

       plt.figure()
       plt.plot(x, np.sqrt(x), label=r"Skiing: $\sqrt{x}$")
       plt.plot(x, x**2, label=r"Snowboarding: $x^2$")
       plt.title("Learning Curves for Snowboarding and Skiing")
       plt.xlabel("Time")
       plt.ylabel("Skill")
       plt.legend(loc="upper left")
       plt.show()

gives the rendered plot directly in the documentation page.

[Figure: “Learning Curves for Snowboarding and Skiing”, plotting Skill against Time for the two curves]

In essence, this enables not only comments about the code, but also comments about the science and research behind your code, to be interwoven into the coding file.

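As a minimal sketch of the two levels of comments described in the Documentation System section, the following attaches a doc string to a function and reads it back at runtime. The function name gamma_fn is made up for illustration, and the math directive inside the doc string is only typeset when a tool such as Sphinx processes it; at runtime it is plain text.

```python
import inspect
import math


def gamma_fn(z):
    r"""Evaluate the gamma function at z.

    .. math:: \Gamma(z) = \int_0^\infty x^{z-1} e^{-x}\,dx
    """
    # Programming-level comment: discarded by Python, unlike the doc string.
    return math.gamma(z)


# The doc string is stored on the function object itself ...
raw_doc = gamma_fn.__doc__

# ... and inspect.getdoc returns it with the indentation cleaned up,
# which is roughly what interactive help and doc generators display.
clean_doc = inspect.getdoc(gamma_fn)
print(clean_doc)
```

The raw-string prefix (r) keeps the backslashes in the LaTeX markup from being treated as Python escape sequences.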
Hierarchical Module System
Python uses modular programming, a popular system that naturally organizes functions and classes into hierarchical namespaces. Each Python file defines a module. Classes, functions, or variables that are defined in or imported into that file show up in that module’s namespace. Importing a module either creates a local dictionary holding that module’s objects or pulls some of the module’s objects into the local namespace. For example, import hashlib binds hashlib.md5 to hashlib’s md5 checksum function; alternately, from hashlib import md5 binds md5 to this function. This helps programming namespaces to follow a hierarchical organization.

On the coding end, a Python file defines a module. Similarly, a directory containing an __init__.py Python file is treated the same way; files in that directory can define submodules, and so on. Thus the code is arranged in a hierarchical structure for both the programmer and the user.

Permit me a short rant about MATLAB to help illustrate why this is a great feature. In MATLAB, all functions are declared in the global namespace, with names determined by filenames in the current path variable. First, this discourages code reusability by making the programmer do extra work keeping disparate program components separate. In other words, without a hierarchical structure to the program, it’s difficult to extract and reuse specific functionality. Second, programmers must either give their functions long names, essentially doing what a decent hierarchical system inherently does, or risk namespace conflicts which can be difficult to resolve and result in subtle errors. While this may help one to throw something together quickly, it is a horrible system from a programming language perspective.

Data Structures
Good programming requires having and using the correct data structures for your algorithm. This is almost universally underemphasized in research-oriented coding. While proof-of-concept code often doesn’t need optimal data structures, such code causes problems when used in production. This often limits the scalability of a lot of existing code, though that is rarely stated or known explicitly. Furthermore, when such features are not natural in a language’s design, coders often avoid them and fail to learn and use good design patterns.

Python has lists, tuples, sets, dictionaries, strings, thread-safe queues, and many other types built-in. Lists hold arbitrary data objects and can be sliced, indexed, joined, split, and used as stacks. Sets hold unordered, unique items. Dictionaries map from a unique key to anything and form the real basis of the language. Heaps are available as operations on top of lists (similar to the C++ STL heaps). Add in NumPy, and one has an n-dimensional array structure that supports optimized and flexible broadcasting and matrix operations. Add in SciPy, and you have sparse matrices, kd-trees, image objects, time-series, and more.

Available Libraries
Python has an impressive standard library packaged with the program. Its philosophy is “batteries-included”, and a standard Python distribution comes with built-in database functionality, a variety of data persistence features, routines for interfacing with the operating system, website interfacing, email and networking tools, data compression support, cryptography, XML support, regular expressions, unit testing, multithreading, and much more. In short, if I want to take a break from writing a bunch of matrix manipulation code and automate an operating system task, I don’t have to switch languages.

Numerous libraries provide the needed functionality for scientific coding. The following is a list of the ones I use regularly and find to be well-tested and mature:

• NumPy/SciPy: This pair of libraries provides array and matrix structures, linear algebra routines, numerical optimization, random number generation, statistics routines, differential equation modeling, Fourier transforms and signal processing, image processing, sparse and masked arrays, spatial computation, and numerous other mathematical routines. Together, they cover most of MATLAB’s basic functionality and parts of many of the toolkits, and include support for reading and writing MATLAB files. Additionally, they now have great documentation (vastly improved from a few years ago) and a very active community.
• IPython: One of the best things in Python is IPython, an enhanced interactive Python shell that makes debugging, profiling code, and interactive plotting easier. It supports tab completion on objects, integrated debugging, module finding, and more; essentially, it does almost everything you’d expect a command line programming interface to do.
• Cython: Referenced earlier, Cython is a painless way of embedding compiled, optimized bits of code in a larger Python program.
• SQLAlchemy: SQLAlchemy makes leveraging the power of a database incredibly simple and intuitive. It is essentially a wrapper around an SQL database. You build queries using intuitive operators, then it generates the SQL, queries the database, and returns an iterator over the results. Combining it with sqlite, which is embedded in Python’s standard library, allows one to leverage databases for scientific work with impressive ease. And, if you tell sqlite to build its database in memory, you’ve got another powerful data structure. To slightly plagiarize xkcd, SQLAlchemy makes databases fun again.
• PyTables: PyTables is a great way of managing large amounts of data in an organized, reliable, and efficient fashion. It optimizes resources, automatically transferring data between disk and memory as needed. It also supports on-the-fly (de)compression and works seamlessly with NumPy arrays.
• PyQt: For writing user interfaces in C++, it is, in my experience, difficult to beat Qt. PyQt brings the ease of Qt to Python. And I do mean ease: using the interactive Qt designer, I’ve built a reasonably complex GUI-driven scientific application with only a few dozen lines of custom GUI code. The
entire thing was done in a few days. The code is cross-platform over Linux, Mac OS X, and Windows. If you need to develop a front end to your data framework, and don’t mind the license (GPL for PyQt, LGPL for Qt), this is, in my experience, the easiest way to do so.
• TreeDict: Without proper foresight and planning, larger research projects are particularly prone to the second law of thermodynamics: over time, the organization of parameters, options, data, and results becomes increasingly random. TreeDict is a Python data structure I designed to fight this. It stores hierarchical collections of parameters, variables, or data, and supports splicing, joining, copying, hashing, and other operations over tree structures. The hierarchical structure promotes organization that naturally tracks the conceptual divisions in the program; for example, a single file can define all parameters while reflecting the structure of the rest of the code.
• Sage: Sage doesn’t really fit on this list as it packages many of the above packages into a single framework for mathematical research. It aims to be a complete solution to scientific programming, and it incorporates over a hundred open source scientific libraries. It builds on these with a great notebook concept that can really streamline the thought process and help organize general research. As an added bonus, it has an online interface for trying it out. As a complete package, I recommend newcomers to scientific Python programming try Sage first; it does a great job of unifying available tools in a consistent presentation.
• Enthought Python Distribution: Also packaging many of these libraries into a complete package for scientific computing, the Enthought Python Distribution is distributed by a company that contributes heavily to developing and maintaining these libraries. While there are commercial support options, it is free for academic use.

Testing Framework
I do not feel comfortable releasing code without an accompanying suite of tests. This attitude, of course, reflects practical programmer wisdom; code that is guaranteed to function a certain way, as encapsulated in these unit tests, is reusable and dependable. While packaging test code does not always equate with code quality, there is a strong correlation. Unfortunately, the research community does not often emphasize writing proper test code, due partly to that emphasis being directed, understandably, towards technique, theory, and publication. But this is exactly why a no-boilerplate, practical and solid testing framework and simple testing constructs like assert statements are so important. Python provides a built-in, low barrier-to-entry testing framework that encourages good test coverage by making the fastest workflow, including debugging time, involve writing test cases. In this way, Python again distinguishes itself from its competitors for scientific code.

Downsides
No persuasive essay is complete without an honest presentation of the counterpoints, and indeed several can be made here. In fact, many of my arguments invite a counterargument: with so many options available at every corner, where does one start? Having to make decisions at each turn could paralyze productivity. For most applications, wouldn’t a language with a rigid but usually adequate style, like MATLAB, be better?

While one can certainly use a no-flair scripting style in Python, I agree with this argument, at least to a certain extent. However, the situation is not uniformly bad; rather, it’s a bit like learning to ski versus learning to snowboard. The first day or two learning to snowboard is always horrid, while one can pick up basic skiing quite quickly. However, fast-forward a few weeks, and while the snowboarder is perfecting impressive tricks, the skier is still working on not doing the splits. An exaggerated analogy, perhaps, but the principle still holds: investment in Python yields impressive rewards, but be prepared for a small investment in learning to leverage its power.

The other downside with using Python for general scientific coding is the current landscape of conventions and available resources. Since MATLAB is so common in many fields, it is often conventional to publish open research code in MATLAB (except in some areas of mathematics, where Python is more common on account of Sage; or in statistics, where R is the lingua franca). While MLabWrap makes this fairly workable, it does mean that a Python programmer may need to work with both languages and possess a MATLAB license. Anyone considering a switch should be aware of this potential inconvenience; however, there seems to be a strong movement within scientific research towards Python, largely for the reasons outlined here.

A Complete Programming Solution
In summary, and reiterating my point that Python is a complete programming solution, I mention three additional points, each of which would make a great final thought. First, it is open source and completely free, even for commercial use, as are many of the key scientific libraries. Second, it runs natively on Windows, Mac OS, Linux, and others, as do its standard library and the third party libraries I’ve mentioned here. Third, it fits quick scripting and large development projects equally well. A quick perusal of some success stories on Python’s website showcases the diversity of environments in which Python provides a scalable, well-supported, and complete programming solution for research and scientific coding. However, the best closing thought is due to Randall Munroe, the author of xkcd: “Programming is fun again!”

Hoyt Koepke is a PhD student in the Statistics Department at the University of Washington studying optimization, ranking models, probability theory, and machine learning/artificial intelligence. As a teen, he learned to program when his parents would only let him play computer games he wrote himself, and subsequently got an MSc in computer science from the University of British Columbia following a BA in physics at the University of Colorado. He can be contacted at [email protected] or visited online at www.stat.washington.edu/~hoytak.

Reprinted with permission of the original author. First appeared in hn.my/python.
