Python module for automatic testing of programming assignments
Karl-Aksel Puulmann
Tartu 2014
Abstract:
This thesis describes a Python module for automatically assessing programming assignments in introductory programming courses. Most notably, the module makes it possible to test both input-output based tasks and functions at the same time.
In the first part, existing automatic assessment systems are analyzed. Then a guide is given on how to use the module for testing different task types, how to extend it and how to use it within other grading systems. Lastly, the thesis covers implementation decisions, how testing is secured and usage experiences from two different courses.
Keywords:
Automated grading; Automated testing; Education; Python; Docker
Contents
Introduction
1 Problem statement
  1.1 Background
  1.2 Analysis
    1.2.1 Input-output based testing frameworks
    1.2.2 Functional testing framework - unittest
  1.3 Summary
3 Implementation
  3.1 General overview
    3.1.1 Test registration
    3.1.2 Preparations for test run
    3.1.3 Test run, results communication
  3.2 Synchronization
    3.2.1 Tradeoffs
  3.3 Pre- and postprocessing
  3.4 Security
    3.4.1 Requirements, threat model
    3.4.2 Docker
    3.4.3 Securing module using Docker
    3.4.4 Using sandboxes with grader module
4 Usage in courses
  4.1 MTAT.03.100 Computer Programming
  4.2 MTAT.03.256 Introduction to Programming II
    4.2.1 Integration with the Moodle study environment
  4.3 Web IDE
  4.4 Future development
Conclusion
Bibliography
A Installation guide
Introduction
This thesis describes the design of a Python module created by the author that makes
writing automated tests for introductory programming courses simpler.
Chapter 1 describes the background of the work by analyzing existing testing mod-
ules and programs and their shortcomings for testing typical homework tasks. The last
section gathers the requirements and goals for the proposed solution.
Chapter 2 presents the basic design decisions of the module and shows how to write tests using it. It also introduces ways of extending the module to support new task types and how to use the web IDE created as part of this thesis.
Chapter 3 shows how the module was implemented by analyzing each component in depth and explaining the reasons for some design decisions. The latter part of the chapter discusses the security problems of testing untrusted code and how the module is sandboxed.
Chapter 4 discusses how the system has been used in various courses and gives an outlook for future work.
Chapter 1
Problem statement
1.1 Background
The Python programming language is currently used in many universities, including
the University of Tartu, as a basis for introductory programming courses. While there
are many testing frameworks that support testing Python code, most of the testing of
homeworks and programming tests in those courses is still done by hand.
In the author’s opinion, the main reason for that is the large diversity of task types
that need to be tested and the flexibility required in assessing beginners’ programs.
For example, the following task types were used in Computer Programming course
homeworks in autumn 2013:
1. Input-output (IO) based tasks - in this task type, the program asks for input from the user, does some simple calculations based on the input and outputs an answer in a prescribed format. This might also involve reading and writing files.
• Interactive IO tasks - IO based tasks where the user and the program communicate continually. Examples: a tic-tac-toe game against the computer, or the computer trying to guess a number between 1 and 10000 selected by the user.
2. Function-based tasks - the student has to create a function (or several functions) which behaves according to the task statement. This usually involves returning some value from the function.
3. Drawing tasks - draw a picture onto the screen using the turtle module, inspired by Logo turtle graphics [1].
4. Limited tasks - used in combination with the previous task types. The student may not use some construct of the Python language, e.g. loops when writing a function. This can be tested with static analysis, such as parsing the tested program into a representation called an Abstract Syntax Tree (AST) and looking for forbidden constructs there, as sketched below.
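For illustration, a minimal sketch of such a static check using Python's built-in ast module could look as follows (this is not the grader module's implementation):

import ast

def uses_forbidden_loops(source_code):
    # Parse the program into its AST and look for loop constructs.
    tree = ast.parse(source_code)
    return any(isinstance(node, (ast.For, ast.While))
               for node in ast.walk(tree))

program = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
print(uses_forbidden_loops(program))  # True - the function uses a for loop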
1.2 Analysis
To get a list of requirements for such a testing tool, existing solutions were analyzed
for their strengths and weaknesses, taking into account what task types the solution
needs to support.
Testing systems fall broadly into two types:
1. Input-output based testing where the program has a fixed input and the out-
put of the program is checked character-by-character against the expected output,
sometimes ignoring whitespace differences.
2. Functional testing where functions are tested by calling them with specific argu-
ments and variables and classes are checked against expected values.
These systems will be evaluated using the following criteria:
1. What task types are easy to test? Are any impossible? Do we need to change
existing tasks to use the system?
2. How easy is it to write tests? Is there a need to write extra code each time or some
other extra requirements?
3. Can the system be extended to support new task types?
4. How helpful is the feedback provided to the student? Can the feedback be used to
fix issues with the solution?
5. Can we safely test untrusted code? What extra steps do we need to take for secur-
ing the system?
1.2.1 Input-output based testing frameworks
Examples of input-output based testing systems include the Moodle virtual programming lab [2], the Mooshak contest system [3] and the Sphere online judge [4].
These systems have also been used for testing homeworks in algorithmics courses [5].
Tests for these systems are specified by an input and expected output file. Testing is
done by first compiling the program being tested and then piping the input data to the
program. The output of the program is compared against the expected output character
by character, sometimes ignoring whitespace.
Doing fixed IO tests is most natural in these systems. Some systems like Jutge.org [6]
and Sphere online judge also provide support for testing interactive tasks - the task
creator can provide scripts which interact with the program. Grading limited tasks is
also sometimes possible if the system supports custom build scripts which can do static
analysis. Outside of this, the system is quite rigid and generally not extendable.
Test input data for these systems are usually prepared by hand for smaller inputs
and programmatically for larger inputs by writing custom generation scripts. Expected
output data is generated by running a sample solution on the input data. This method
of preparing tests usually has one unfortunate side-effect: since test data generators are
not used by the testing system, they are usually not published. This makes it harder to
change existing tasks or create new similar tasks which have slightly different require-
ments since the test generators often need to be rewritten.
Feedback in systems mentioned before is mostly given in the form of a few stan-
dard messages such as Accepted, Wrong answer, Compile error and Runtime error.
Some systems also expose the input data and expected answer to students. This sort
of feedback is not very helpful, especially on larger input data sets where it is hard to
understand what the correct answer should be without first visualizing the underlying
structure of the input data.
These systems are usually well secured against malicious users, leveraging tools such as chroot() jails and rlimits, or by using separate virtual machines for testing [7].
1.2.2 Functional testing framework - unittest
Using the Python unittest module [8], testing is done by writing functions which check the behaviour of the tested program. Errors are reported by raising exceptions, usually through assert-like functions¹ or by propagating exceptions caused by the tested code.
¹ For example, such a function might check whether its two arguments are equal. If not, an exception is raised, with the error message containing both argument values.
This means that it is most convenient for testing function-based tasks, but the ap-
proach is also extendable for testing limited tasks. However, by default all tests for a
program are run in the same process and in the same thread which means that if the
tested code has some significant side-effects (such as modifying global variables or files
it relies on), the next tests may fail even if the code behaves correctly under normal
conditions.
The module is extendable by allowing the user to write their own subclasses for
testing, making it possible to reuse code for similar tasks. The negative side of this is
that tests written with this module are quite verbose.
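For illustration, a unittest-style test for a simple add function might look like the following sketch (the module name solution is a placeholder for the student's file):

import unittest
from solution import add  # placeholder name for the student's module

class TestAdd(unittest.TestCase):
    def test_small_numbers(self):
        # assertEqual shows both values in the error message on failure
        self.assertEqual(add(2, 4), 6)

    def test_negative_numbers(self):
        self.assertEqual(add(-3, 1), -2)

if __name__ == "__main__":
    unittest.main()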
Testing tasks which do IO is not impossible with this module, but it is hard, as it requires writing code which starts the tested program and communicates with it, feeding
it new input data and processing the output. Implementing this would be a large and
non-trivial task which would require about as much effort as writing a new testing en-
gine.
Feedback for testing is given in the form of a traceback, which contains information
about the functions being executed when an error was raised and the error message. In
case the tested code itself raised an error (e.g. a command references a variable which
does not exist), the traceback helps the programmer find the offending line easily. In case the testing code reports an error, the error message displays information about what went wrong, provided the assertion methods bundled with unittest are used. For example, if a function returned the wrong answer, the error message in the traceback contains both the expected value and the returned value.
It should be noted that while tracebacks are very helpful for an experienced program-
mer, new students often have trouble understanding them.
The module itself is not secured by default, as it was built under the assumption that
only trusted code is ever tested. This means that proper sandboxing is necessary, and care must also be taken so that the tested code cannot override methods the testing framework relies on. Securing the module would become easier if the framework were extended so that each test runs in a separate process.
1.3 Summary
As seen from the analysis, neither of the two common testing system types completely
satisfies all of our requirements. This makes it hard to test homeworks in introductory
programming courses taught at University of Tartu without making heavy modifications
to the course contents or significantly modifying the framework being used.
However, such a system can be built by putting aside some aspects of the previous two approaches (such as independence of the programming language) and combining their other ideas.
Chapter 2
For the purpose of automatically testing homework assignments, a new Python mod-
ule named grader was created. This chapter discusses the main design decisions and
how to write tests using the module by working through various examples.
2.1 Scope
As seen from the analysis in Chapter 1, existing testing engines vary greatly in scope, ranging from small modules such as unittest to full educational systems, such as Jutge.org and the Sphere online judge, which in addition to testing also provide course content management and grading.
While larger systems are very much needed, this thesis focuses only on creating a
module which:
• allows teachers to specify tests for all task types in the same language and pro-
gramming style.
• is expressive and can be extended to support new task types (Section 2.6).
• gives valuable feedback to the student which helps them debug their code.
• can be used within existing grading systems.
• can be used securely for testing homeworks (Section 3.4).
2.2 Design decisions
The API design of the grader module is largely inspired by the Flask web framework [9] and by the Python unittest [8] and nose [10] modules.
The main design decisions made were as follows:
1. All tests are encapsulated as functions. This yields shorter and more readable
tests.
2. Each test should have a unique human-readable description, which describes the
test data and the expected result (e.g. the description "function add(5, 2) should return 7").
3. Test functions have access to the output, variables and functions of the tested
solution, and can also insert strings into the solution’s input stream.
4. Python function decorators¹ are used to manipulate test configuration: setting time limits, adding extra parameters to the test function or creating temporary files. Another use of decorators is wrapping tests, for example to create several tests with the same test code but different arguments.
5. Each test function runs in parallel with the tested code. This allows the framework
to do IO testing and functional testing at the same time. This is explained in detail
in Chapter 3.
6. Notifying of failed tests happens through exceptions. As the error messages are shown to the student, they should contain relevant information about what went wrong. Raising exceptions is usually done using the assert statement, but the tester also notifies the student if their code raised an error (e.g. division by zero).

7. Tests are run in isolation, each in a separate process and within a separate directory. The need for this is explained in Section 3.1.2.

8. The output of running the tests is a JSON-serializable dictionary that can be further processed by other programs.

¹ Python decorators are functions which take a function as an argument and return another function which then replaces the original function. They are often used to make functions cache their results or to log return values and exceptions. For example:

def log_this(function):
    def replaced_function(argument):
        result = function(argument)
        print("Call with argument {0} returned {1}"
              .format(argument, result))
        return result
    return replaced_function

@log_this
def f(x):
    return 2 * x
Write a program which tries to guess an integer between 1 and 10000 that the user
picks with as few guesses as possible. More specifically, each time the program outputs
a number, the user will answer that the number is either "too large", "too small" or
"correct".
Example session (the numbers are output by the program, the other lines are typed by the user):
5
too small
7
too large
6
correct
2.3.2 Analysis
# Search for the number using binary search.
bottom = 1
top = 10000
while True:
    guess = (top + bottom) // 2
    print(guess)
    # ask the user how the number compares
    answer = input()
    if answer == "too small":
        bottom = guess + 1
    elif answer == "too large":
        top = guess - 1
    else:  # "correct"
        break
Listing 2.3 shows the tester code for this task. In it, the functions-as-tests principle comes into play in the search_tester function. Since the function is decorated with test_cases and given 12 different arguments, the module registers 12 different tests, each of which calls search_tester with a different number to search for.
Inside the function, the argument m is an object which contains the standard output and input streams of the solution. Using these, the tester communicates with the solution as if it were a user running the program.
Note that the contents of search_tester are quite similar to the contents of a
program that makes the user guess a number between 1 and 10000. In other words, the
tester code is the inverse of the solution and simulates how the user is communicating
with the tested program. Everything else surrounding the function is configuration -
information about how the tests are run and what the description shown to the user
should be.
# import functions and decorators from the tested module
from grader import *

# This creates 12 tests, each for searching a number
@test_cases(
    # list all the 12 numbers to search for;
    # each one gets passed into the function as the
    # second argument - searched_number
    [1, 10000, 5000, 3, 9990, 7265, 8724, 2861, 2117, 811, 6538, 4874],
    # The description of the test, shown to the user.
    # {0} is replaced by the searched number
    description="Searching for number {0}"
)
def search_tester(m, searched_number):
    # test function - the first argument (always given) is a container
    # for the user's program and its stdin/stdout.
    # The second is the searched number (see above).
    found = False
    guesses = []
    while len(guesses) < 15 and not found:
        # Get what the user guessed since the last time we asked.
        guess = int(m.stdout.new())
        guesses.append(guess)
        # Let the program know if the guess was
        # correct, too large or too small.
        if guess < searched_number:
            m.stdin.put("too small")
        elif guess > searched_number:
            m.stdin.put("too large")
        elif guess == searched_number:
            m.stdin.put("correct")
            found = True
    # If the program didn't find the answer fast enough,
    # notify that the program made too many guesses.
    # This raises an AssertionError if found is False.
    assert found, (
        "Program made too many guesses.\n" +
        "Guesses were: {}".format(guesses))
2.3.4 Running the tests
The tester can be run via the command line with the following command:
python -m grader path/to/tester.py path/to/solution.py
The command outputs JSON containing the feedback for testing solution.py with
tester.py. Another way of running the tester is to import the module and call a specific function within it; see the API documentation [11] for the test_module function.
As mentioned previously, the result of running the tests is a JSON dictionary. For each test it contains the test description, a boolean representing whether the test was successful, the execution time and the error message (if any).
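For example, the output for the number guessing tester might look roughly as follows (the exact key names and structure are illustrative; see the API documentation [11] for the precise format):

{
    "results": [
        {"description": "Searching for number 5000",
         "success": true,
         "time": 0.04,
         "error_message": ""},
        {"description": "Searching for number 10000",
         "success": false,
         "time": 0.11,
         "error_message": "Program made too many guesses."}
    ]
}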
While this representation is useful when developing tests on a local machine or for computer-to-computer communication, some values such as tracebacks or long error messages are not easily readable in this format.
When presenting feedback to the user, styled HTML or a picture should be shown instead. See Figure 2.1 for an example of how feedback for this task is shown in the web IDE (Section 4.3). As shown by P. Ihantola in [12], the feedback can also take the form of an image - for example, in a chess task the feedback might be an animation of chess moves on a board.
Figure 2.1: Test results with feedback. Error messages of failed tests are also displayed.
2.4 Example - testing functions
Unlike IO-based tasks, requirements for testing functions are usually very clear and
almost always identical - the program must first finish executing, then the tested func-
tion is called with some arguments and the returned result is checked against the value
returned by a sample solution.
Because of that, the grader module provides a test generator for these sorts of
tasks. Test generators are functions which dynamically create tests when they are
called.
In the example in Listing 2.4, the check_function generator is used to create 4 tests. Each of those tests first checks that a function named add exists and then that it returns the same result as the sample solution for the arguments it was called with. Figure 2.2 shows possible feedback for this task.
from grader import *

def add(x, y):
    return x + y

check_function(add, [2, 4])  # test for if add(2, 4) == 6
check_function(add, [-3, 1])
check_function(add, [9, 0])
# for edge cases, a description explaining the case can be made
check_function(add, ["another", "string"],
               description="Function must work on strings")
2.5 Example - file-based task
The next example deals with testing a program which reads a matrix from a file and
writes the transpose of the said matrix to another file.
There are two main options on how to make files available for the tester:
1. It is possible to make extra files available in addition to the solution and the tester. This method could be used to attach data files and would allow using the test_cases decorator that was used in Section 2.3.3. This option has two negative aspects: it makes invoking the tester harder, and the solution must always ask for the names of the files that are to be processed.
2. Generate the files to be processed just before the solution program is started. This
way all information related to the tests would be in one file and it would be easy
to modify the task.
As option 2 is more natural to use and more flexible, it will be used for this example.
The module provides pre-test hooks (registered with the before_test decorator) which are called right before the solution code and tester code are started. We can use this mechanism to prepare the files for a test. An example is provided in Listing 2.5.
@test  # registers test some_test
@before_test(create_file('sampledata.txt', 'hello'))
def some_test(m):
    # sampledata.txt containing 'hello' is available within the test
    ...
Since the files created for each test should have different contents and file names, generating test functions based on test data is needed to avoid code duplication. We might generate such functions using a loop, but as Python closures are late-binding [13], this would not work as expected.
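The gotcha can be seen in the following self-contained snippet: every function generated in the loop ends up seeing the last value of the loop variable.

filenames = ["a.txt", "b.txt", "c.txt"]
tests = []
for name in filenames:
    def check():
        # "name" is looked up when check() is called, not when it was defined
        return name
    tests.append(check)

print([t() for t in tests])  # ['c.txt', 'c.txt', 'c.txt'] - not a/b/c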
A better way to generate test functions is to write our own test generator as shown
in Listing 2.6 (sample feedback in Figure 2.3).
from grader import *
import os

def matrix_gen(matrix, infile_name="matrix.txt", outfile_name="transposed.txt"):
    # (the default file names above are placeholders;
    # size, stringify and transpose are helper functions not shown here)

    # Description management for the tests
    rows, cols = size(matrix)
    description = "Transposing a {rows} x {cols} matrix ({inf}, {out})"
    description = description.format(rows=rows,
                                     cols=cols,
                                     matrix=matrix,
                                     inf=infile_name,
                                     out=outfile_name)

    # transform the list of lists into a multi-line string
    infile_contents = stringify(matrix)
    expected_outfile = stringify(transpose(matrix))

    # Internally create a function and register it as a test.
    # All the file checking logic goes in there.
    # Before the test is executed, the needed file is generated.
    @test
    @before_test(create_file(infile_name, infile_contents))
    @set_description(description)
    def _inner_test(m):
        m.stdin.put(infile_name)
        m.stdin.put(outfile_name)
        assert os.path.exists(outfile_name), \
            "Solution did not create " + outfile_name
        with open(outfile_name) as f:
            output = f.read()
        # compare output to expected (ignore trailing whitespace)
        assert output.strip() == expected_outfile, (
            "Finding inverse of the following matrix:\n{matrix}\n\n"
            "Expected {outfile} to contain:\n{expected}\n\n"
            "{outfile} contained:\n{got}"
            .format(got=output, expected=expected_outfile,
                    matrix=matrix, outfile=outfile_name)
        )

# create 6 tests
matrix_gen([])
matrix_gen([[1]])
matrix_gen([[1]], infile_name="another_input.txt")
matrix_gen([[1]], outfile_name="another_inverse_file.txt")
matrix_gen([[1, 2], [3, 4]])
matrix_gen([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
Listing 2.6: Code for testing matrix transposing using files. Each of the last 6 lines registers a test with different input data.
Figure 2.3: Example feedback from tester in Listing 2.6. The first error reported is
caused by an exception in solution code.
2.6.1 Improving function testing
Some function-based tasks might have more specific requirements than simply check-
ing return values of functions.
For example, it might be useful to check for common mistakes, such as whether the function printed the answer instead of returning it, whether it modified any global variables, and whether it returns a different result when called multiple times.
It is possible to test for these mistakes by writing our own test generator which ex-
tends the capabilities of check_function used in Section 2.4.
A complete implementation of such a generator is shown in Listing 2.7 (without
helper methods) with example feedback in Figure 2.4. It is also available in the module
grader.extensions.adv_functions [14].
def check_function(
        sample_function, arguments,
        expected_result=None,
        description=None,
        # check for print instead of return
        check_print=True,
        # number of times to call the function
        n_calls=3,
        # check if globals change after calling the function
        check_globals=True):
    # The tested function must have the same name as the sample function.
    # If no expected result is given, it is computed from the sample solution.
    fn_name = sample_function.__name__
    if expected_result is None:
        expected_result = sample_function(*arguments)

    description = get_description_string(fn_name,
                                         arguments,
                                         expected_result,
                                         prefix=description)

    # Internally create a function and register it as a test.
    # All the checking logic goes in there.
    @test
    @set_description(description)
    def _inner_test(m):
        assert m.finished, "Program didn't finish execution"
        assert hasattr(m.module, fn_name), (
            "Function named {} was not defined.".format(repr(fn_name))
        )
        fn = getattr(m.module, fn_name)

        for i in range(1, n_calls + 1):
            start_vars = variables_snapshot(m.module)
            result = fn(*arguments)
            output = m.stdout.read()

            if check_print and result is None and \
                    str(expected_result) in output:
                raise AssertionError(
                    "Function printed out the correct result "
                    "instead of returning it.\n"
                    "Hint: replace print with return."
                )

            # if the answer isn't what was expected, raise an error
            assert result == expected_result, \
                get_error_description(result, expected_result, i)

            # check if any globals changed
            if check_globals:
                # get changed variables as a dictionary
                end_vars = variables_snapshot(m.module)
                diff = dict_diff(start_vars, end_vars)
                # if any variables changed, raise an error accordingly
                assert not diff, globals_error_msg(diff)
Listing 2.7: Advanced generator for function testing. Note that instead of testing a function directly, check_function generates and registers a new test function.
Figure 2.4: Example feedback for an add function tested using Listing 2.7.
2.6.2 Fill in the blanks task type
As mentioned in Chapter 1, some tasks require that the student may not use some
functions or language constructs in their code.
One variation of such a task is as follows: Students are asked to complete a program
given by their instructor, but are only allowed to modify parts of the program.
while True:
    number = int(input("Input a number: "))
    if number > _____:  # fill in the blank here...
        print("Non-negative!")
    else:
        ...  # and here! Can be several lines
Listing 2.8: Template for the example task. Students should output "Negative!" and stop if the number is negative, otherwise output "Non-negative!" and ask for another number.
Such tasks are tested by comparing the parsed representation of the solution program against the template and checking for differences. As seen from Listing 2.8, the template may contain two different kinds of placeholders:
• Underscores, which should be replaced by a single statement or expression.
• Triple dots, which mean that it should be replaced with 0 or more statements.
To write a tester for this task, we first need access to the abstract syntax tree (AST) of
the solution code. This can be added as an argument to a test function using pre-hooks (Section 3.3), as the built-in decorator expose_ast in Listing 2.9 does.
def expose_ast(test_function):
    import ast
    from grader.core import before_test

    def _hook(info):
        code = read_code(info["solution_path"])
        # add the solution's AST as a named argument to the test function
        info["extra_kwargs"]["AST"] = ast.parse(code)

    # add the function _hook as a pre-hook to test_function
    return before_test(_hook)(test_function)
After having the AST of both the solution and the template, the two need to be compared. We can do that recursively by checking the equality of each node (a simplified sketch of such a comparison is shown after this list). Two special cases for the placeholders need to be taken into account, however:
• Since an underscore node matches any other node, recursion can stop there.
• Triple dots are matched similarly to how the regular expression .* is matched - all possible combinations need to be checked. Note that the current implementation requires an exponential number of comparisons in the number of potential statements matched.
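The sketch below illustrates the recursive comparison for the underscore placeholder only; it is not the actual implementation from grader.extensions.ast, and the triple-dot case is omitted.

import ast

BLANK = "_____"

def nodes_match(template_node, solution_node):
    # An underscore placeholder in the template matches any solution node.
    if isinstance(template_node, ast.Name) and template_node.id == BLANK:
        return True
    if type(template_node) is not type(solution_node):
        return False
    # Compare the nodes field by field, recursing into child nodes and lists.
    for field in template_node._fields:
        t_value = getattr(template_node, field)
        s_value = getattr(solution_node, field)
        if isinstance(t_value, ast.AST):
            if not nodes_match(t_value, s_value):
                return False
        elif isinstance(t_value, list):
            if len(t_value) != len(s_value):
                return False
            for t_child, s_child in zip(t_value, s_value):
                if isinstance(t_child, ast.AST):
                    if not nodes_match(t_child, s_child):
                        return False
                elif t_child != s_child:
                    return False
        elif t_value != s_value:
            return False
    return True

template = ast.parse("print(_____)")
solution = ast.parse("print(x + 1)")
print(nodes_match(template, solution))  # True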
The comparison algorithm and a test generator for this task are implemented in the grader.extensions.ast submodule, and a sample tester can be seen in Listing 2.10.
from grader import *
from grader.extensions import ast

template = """
number = int(input("Input a number: "))
if number > _____:  # fill in the blank here...
    print("Non-negative!")
else:
    ...  # and here! Can be several lines
"""
Listing 2.10: Tester for the example task. Note that additional tests are also needed to
verify the behaviour of the program.
Chapter 3
Implementation
This chapter discusses the architecture of the grader testing module. The overall structure of the module is described, explaining the purpose of the different components and how they were built. The last part of the chapter deals with how to secure the module.
3.1 General overview
3.1.1 Test registration
When the tester is started, all the tests are registered first. For that, the file containing the tests is imported, which as a side effect adds all the functions decorated with @test into a dictionary. Later on, the functions can be accessed from the same test case dictionary using the test description. By default the test description is either the function name or its documentation string (if one exists), but this can be overridden by using the @set_description decorator.
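A minimal sketch of this registration idea is shown below (not the module's actual code; it assumes the docstring, when present, takes precedence over the function name):

registered_tests = {}

def test(function):
    # Use the docstring as the description if there is one, else the name.
    description = function.__doc__ or function.__name__
    registered_tests[description] = function
    return function

@test
def add_test(m=None):
    """function add(5, 2) should return 7"""
    ...

print(list(registered_tests))  # ['function add(5, 2) should return 7']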
In case of the interactive search example in Section 2.3.3, the decorator test_cases
internally creates 12 functions, all of which are registered as separate tests. Each of
those functions in turn calls the decorated function search_tester with a specific number that is to be searched for.
3.1.2 Preparations for test run
Hollingsworth, the author of the first known automatic programming grader, observed that it is possible for a student to submit a program which unintentionally does some damage to the grader's execution environment [15]. For example, the student's program might open files for writing and never close them, making reading them impossible; it might overwrite functions in other modules or even overwrite the solution and tester code files.
To avoid such problems, proper isolation is needed. All tested files are copied to a
temporary location at the start of each test and the test function itself is run in a separate
process. Each such process is only allowed to run for a limited time after which it is
automatically killed to avoid issues with infinite loops.
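The process-per-test pattern with a time limit can be sketched with the standard multiprocessing module as follows (a simplified illustration, not the module's actual implementation):

import multiprocessing
import time

def _run_one_test(queue, seconds):
    # Stand-in for executing a single registered test in a fresh process.
    time.sleep(seconds)
    queue.put({"success": True, "error_message": ""})

def run_with_timeout(seconds, timeout=2):
    queue = multiprocessing.Queue()
    process = multiprocessing.Process(target=_run_one_test, args=(queue, seconds))
    process.start()
    process.join(timeout)
    if process.is_alive():
        # The test is stuck (e.g. an infinite loop): kill it and report a timeout.
        process.terminate()
        process.join()
        return {"success": False, "error_message": "Timeout"}
    return queue.get()

if __name__ == "__main__":
    print(run_with_timeout(0.1))  # finishes in time
    print(run_with_timeout(10))   # killed after 2 seconds, reported as Timeout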
To get access to the correct test function, the new process repeats the test registration procedure and finds the test function with the appropriate index that was sent to the process.
A so-called module container object is prepared in a separate thread; it contains references to the faked input and output streams of the tested program. It also contains a reference to the solution module - an object which can be used to access all the functions and variables declared in the solution program. The module is created by starting the solution program using exec, taking care that the resulting module does not have access to the context of the tester code.
3.1.3 Test run, results communication
After the preparations are done, the test function is started in the main thread, with the program container passed as an argument to the function. Note that the solution program is already running in parallel in the other thread since exec was called.
While the test function is running, if an exception is raised in either thread, the test is stopped. The exception is then caught by the tester, and the error message and traceback are extracted. In case both threads raised an error, the earlier one is reported.
After the test function call is complete, JSON containing both the error message and
exception traceback is output. Both are empty strings if no exceptions were raised.
After the test running process exits, the original process reads the JSON output and adds some fields to it, such as the test description, the execution time and a flag indicating whether the test was successful. Should the test process run for too long, however, it is automatically killed and the error message is set to the string 'Timeout'.
This process is repeated for all tests and a JSON dictionary containing all the test
results is output.
3.2 Synchronization
When testing interactive tasks, we need to be sure that the solution program has finished its output by the time the tester tries to read it. Another facet of the same
problem is that all the functions and variables must exist by the time that the tester tries
to access them, otherwise slow but correct programs might get inaccurate feedback (for
example that the tested function or variable does not exist).
To make it easier for the test writer, the grader module does implicit synchroniza-
tion between the solution and tester threads. More specifically, at all times either one or
the other is executing, but never the two at once.
This is achieved by using locking - when the solution program is executed, the solution thread acquires the lock and holds it until the solution tries to read input or reaches the end of its code. It then releases the lock and, in the first case, waits until the tester has acquired and released it before re-acquiring it.
The tester thread behaves in a similar way - after the solution thread has released the lock for the first time, the tester acquires it and starts executing the test code. When it reaches a statement that adds something to the input stream of the solution code, it releases the lock and waits until the lock has been acquired and released before re-acquiring it, or until the solution code finishes executing. Once the solution code finishes executing, the locking mechanism is ignored from that point onward.
This approach guarantees that each time the tester reads the output of the solution program, there will not be any subsequent writes by the solution until the tester adds strings to the solution's input stream.
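The alternation between the two threads can be illustrated with the following self-contained sketch; it uses two events for clarity, whereas the module itself uses a single lock as described above.

import threading

tester_turn = threading.Event()
solution_turn = threading.Event()
tester_turn.set()  # the tester provides the first piece of input
input_line = None

def solution_thread():
    global input_line
    for _ in range(3):
        solution_turn.wait()   # block until the tester has written some input
        solution_turn.clear()
        print("solution read:", input_line)
        tester_turn.set()      # hand control back to the tester

def tester_thread():
    global input_line
    for i in range(3):
        tester_turn.wait()     # block until the solution is waiting for input
        tester_turn.clear()
        input_line = "guess {}".format(i)
        solution_turn.set()    # let the solution continue

threads = [threading.Thread(target=solution_thread),
           threading.Thread(target=tester_thread)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()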
3.2.1 Tradeoffs
3. Testing threaded code is tricky in general, since deadlocks may occur.
Future revisions of the module might allow for different synchronization strategies if there is demand for it.
3.3 Pre- and postprocessing
Listing 3.1 shows an example of using pre- and post-hooks to modify test arguments and results.
def pre_hook_example(test_info):
    # add an extra argument to the test function
    test_info['extra_args'].append('arg')
    # add an extra named argument to the test function
    kwargs = test_info['extra_kwargs']
    kwargs['keyword_arg'] = 'keyword_arg'

def post_hook_example(results):
    # add points to the results
    if results["success"]:
        results["points"] = 1
    else:
        results["points"] = 0

@test
@before_test(pre_hook_example)
@after_test(post_hook_example)
def some_test(m, arg, keyword_arg):
    ...
Listing 3.1: Example of using pre- and post-hooks. The before_test and after_test decorators register their argument function as a pre- and post-hook, respectively.
3.4 Security
The goal of the grader module is to allow testing of untrusted and unreviewed code
submitted by students. This poses a challenge since a student might want to break into
the system or otherwise disturb the behaviour of the application using the module. Beginners might also inadvertently submit solutions which harm the testing system.
By default the grader module uses no sandbox. This makes it easier to install and
use the library and also makes it possible to use the module on non-Linux machines.
However, this chapter shows how to use a Docker sandbox to secure the module.
3.4.1 Requirements, threat model
When running a public service which uses the module, such as a web server, the setup should have a few desired properties:
• Isolation: if multiple solutions are being tested at the same time, they should not
be able to interfere with each other or with the system. Ideally, they should not
even be aware of other processes running on the host machine.
• Resource limiting: each solution being tested should only be able to use a fixed proportion of the host's CPU, memory and disk space.
• Access limitations: the tested program should not have access to the network or to files which it does not need.
• Cleanup: all side-effects of testing the code should be removed automatically. This includes reverting edited files.
• Easy to install: the sandbox used should be easy to install and maintain, and hard to misconfigure.
Note that these requirements do not deal with correctness of test results.
Historically, various techniques have been used to deal with remote code execution.
Chroot() jails with additional resource limiting have been used by many online judges [6] [5] and by HackerRank [16]. This solution has one disadvantage in the author's opinion: namely, it is hard to set up and maintain correctly.
3.4.2 Docker
The grader module can be sandboxed using Docker containers in Linux environments.
Docker [17] is an abstraction built on top of Linux Containers (LXC) and cgroups which allows the creation and teardown of complete Linux environments (containers) within a fraction of a second. The created containers share the host system's kernel, but have their own isolated filesystem, process tree and networking stack. Appropriately,
LXC has been described as "chroot on steroids" [18].
Docker has been used to secure large programming competitions such as Stripe CTF
3 [19] and by various code running services such as codecube.io [20].
3.4.3 Securing module using Docker
Isolation is achieved by creating a fresh container from a base image each time a solution is tested. For efficiency reasons, all tests on a single solution share the same sandbox. In addition to operating system essentials, the created container has only Python and the grader module installed. Also, because each container has its own process tree and file system, the tested code cannot be aware of other tested solutions.
When the container is started, memory and CPU limits are automatically set and
networking is also turned off within the container.
After testing is done, the created container is automatically deleted. This means
that each time tests are run, the testing environment is in exactly the same condition as
described in the docker image.
Installing Docker is described in Appendix A. It should be noted that Docker requires a recent Linux kernel (version 3.8 or above), and since it requires LXC support, some operating systems such as OpenBSD are excluded.
The Docker registry is currently used to host the image files. This means that updating the sandbox is a single command, but the image can also be built manually. The image is also automatically updated each time the grader source repository is changed.
When starting the grader via the command line, it is possible to use the Docker sandbox as follows:
python -m grader --sandbox docker path/tester.py path/solution.py
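A server-side wrapper can invoke this command and parse the results, for example (a sketch; the paths are placeholders):

import json
import subprocess

# Run the sandboxed grader and parse the JSON it prints to standard output.
command = ["python", "-m", "grader", "--sandbox", "docker",
           "path/tester.py", "path/solution.py"]
output = subprocess.check_output(command, universal_newlines=True)
results = json.loads(output)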
Chapter 4
Usage in courses
This chapter discusses how the tester was used in courses at the University of Tartu, the experience gained from those attempts, and ideas for future iterations of the module.
A common problem was that students often tried to solve a slightly different problem
than the task specification described. For example, students might ask for arguments
from standard input on function-based tasks, mix up the order of arguments or ask for a file name when it was fixed in the task description.
The author believes those problems stem from four main issues:
1. Homeworks were not automatically tested, so the grading criteria differed between labs and tests and were probably more lax in the former.
2. Students did not have previous experience using automatic tests, leading them to
make small mistakes the tester stumbled on.
3. Time pressure left students less time to properly finish their solutions.
4. Predicting what mistakes will be made is hard before manually checking some
solutions first. For example, testers for many midterm tasks were revised before
announcing results.
Two of the midterms also included tasks that used the turtle module to draw pictures on the screen. These were not automatically graded, but an animation was made using a script written by the author [21]. As this approach requires a windowing system on the testing machine and is resource-intensive, it will likely not be used when testing tasks on the server.
Grades are also calculated by weighting the tests (with weights defaulting to one) and counting the weights of successful tests towards the maximum grade set in Moodle.
4.4 Future development
As the module is planned to be used again in autumn 2014, its development will continue. This section discusses some long- and short-term ideas surrounding the module.
• Further support for AST searching and manipulation. This would allow supporting more kinds of limited tasks. The Python library macropy [24] could possibly be used to enhance AST searching support.
• Currently the built-in test generators output feedback in English, however all the courses currently using the module are taught in Estonian. This means that some kind of internationalization support is needed in the module.
• Good error messages and test descriptions are essential for the student to under-
stand what went wrong with their solution. Proper guidelines for these are needed.
• Integration with programming book by Aivar Annamaa [25].
• Test parallelization.
• GUI testing support. This would mostly require the testing servers to support windowing systems.
Conclusion
As a result of this thesis, a Python module was created which allows the testing of typical assignments in introductory programming courses. The module allows the teacher to test tasks which mix function-based and input-output based testing, as well as other task types, and it can be used within existing grading systems such as the Moodle Virtual Programming Lab.
This thesis serves as an introduction to the grader module and contains examples of how to use and extend it for testing new task types. It also contains an analysis of existing systems, the design decisions of the module, a description of how the module works and how it can be secured.
At the time of writing, the module has been used in two courses, and testers for over 40 tasks are freely available in its repository. The last part of the thesis described how the tester was used in those courses and the experience gained from that use.
Bibliography
[1] Cynthia J. Solomon and Seymour Papert. A case study of a young child doing
turtle graphics in logo. In Proceedings of the June 7-10, 1976, National Computer
Conference and Exposition, AFIPS ’76, pages 1049–1056, New York, NY, USA,
1976. ACM.
[2] Moodle virtual programming lab. http://vpl.dis.ulpgc.es/
(26.04.2014).
[3] Mooshak contest system. https://mooshak.dcc.fc.up.pt/
(26.04.2014).
[4] Sphere online judge. http://www.spoj.com/info/ (26.04.2014).
[5] Adrian Kosowski, Michal Malafiejski, and Tomasz Noinski. Application of an
online judge & contester system in academic tuition. In Howard Leung, Fred-
erick Li, Rynson Lau, and Qing Li, editors, Advances in Web Based Learning -
ICWL 2007, volume 4823 of Lecture Notes in Computer Science, pages 343–354.
Springer Berlin Heidelberg, 2008.
[6] Jordi Petit, Omer Giménez, and Salvador Roura. Jutge.org: An educational pro-
gramming judge. In Proceedings of the 43rd ACM Technical Symposium on Com-
puter Science Education, SIGCSE ’12, pages 445–450, New York, NY, USA,
2012. ACM.
[7] Petri Ihantola, Tuukka Ahoniemi, Ville Karavirta, and Otto Seppälä. Review of re-
cent systems for automatic assessment of programming assignments. In Proceed-
ings of the 10th Koli Calling International Conference on Computing Education
Research, Koli Calling ’10, pages 86–93, New York, NY, USA, 2010. ACM.
[8] Unittest testing module. https://docs.python.org/3/library/
unittest.html (26.04.2014).
[9] Flask web framework. http://flask.pocoo.org/ (26.04.2014).
[10] nose testing module. https://nose.readthedocs.org/en/latest/
(26.04.2014).
[11] API documentation for grader module. https://macobo.github.io/
python-grader/.
[12] Petri Ihantola. Automated assessment of programming assignments: Visual feed-
back, assignment mobility, and assessment of students’ testing skills. 2011.
[13] Python guide - late binding. http://docs.python-guide.org/en/
latest/writing/gotchas/#late-binding-closures (27.04.2014).
[14] grader module. https://github.com/macobo/python-grader.
[15] Jack Hollingsworth. Automatic graders for programming classes. Commun. ACM,
3(10):528–529, October 1960.
[16] Quora answer about how HackerRank is secured. http://qr.ae/rg6vp
(28.04.2014).
[17] Dirk Merkel. Docker: Lightweight linux containers for consistent development
and deployment. Linux J., 2014(239), March 2014.
[18] Björn Pehrson. Virtualization analysis - CSD Fall, 2011.
[19] Blog post about how stripe capture the flag 3 competition testing architecture.
https://stripe.com/blog/ctf3-architecture (28.04.2014).
[20] Blog post about how codecube.io is secured. http://hmarr.com/2013/
oct/16/codecube-runnable-gists/ (28.04.2014).
[21] Turtlesnap module used to make pictures of turtle module. https://github.
com/macobo/TurtleSnap.
[22] Moodle vpl documentation. http://vpl.dis.ulpgc.es/index.php/
en/documentation/76-general-documentation (27.04.2014).
[23] Grader web ide source code. https://github.com/macobo/
grader-webapp.
[24] Macropy module. https://github.com/lihaoyi/macropy.
[25] Aivar Annamaa. Programming textbook. http://programmeerimine.cs.
ut.ee.
Appendix A
Installation guide
Prerequisites
• Python 3.4 or above installed.
• (for sandbox) Linux operating system supporting Linux Containers.
• (for sandbox) Linux kernel version ≥ 3.8.
Non-exclusive licence to reproduce thesis and make thesis public
I, Karl-Aksel Puulmann,
1. herewith grant the University of Tartu a free permit (non-exclusive licence) to:
1.1 reproduce, for the purpose of preservation and making available to the pub-
lic, including for addition to the DSpace digital archives until expiry of the
term of validity of the copyright, and
1.2 make available to the public via the web environment of the University of
Tartu, including via the DSpace digital archives until expiry of the term of
validity of the copyright,
of my thesis
"Python module for automatic testing of programming assignments",
supervised by Aivar Annamaa and Margus Niitsoo,
Tartu, 14.05.2014