Python Parallel Programming Cookbook - Sample Chapter
Starting by introducing you to the world of parallel computing, the book moves on to cover the fundamentals of Python. This is followed by an exploration of the thread-based parallelism model, using the Python threading module to synchronize threads with locks, mutexes, semaphores, and queues, and covering the GIL and thread pools.

Parallel programming techniques are required for a developer to make the best use of all the computational resources available today and to build efficient software systems. From multicore machines to GPU systems and distributed architectures, computationally demanding programs require appropriate programming tools and software libraries.
Python Parallel Programming Cookbook
Master efficient parallel programming to build powerful applications using Python
Giancarlo Zaccone
Preface
The study of computer science should cover not only the principles on which computational processing is based, but should also reflect the current state of knowledge in these fields. Today, technology requires professionals from all branches of computer science to know both the software and the hardware, whose interaction at all levels is the key to understanding the basics of computational processing.
For this reason, in this book, a special focus is given to the relationship between hardware architectures and software.
Until recently, programmers could rely on the work of the hardware designers, compilers,
and chip manufacturers to make their software programs faster or more efficient without
the need for changes.
This era is over. So now, if a program is to run faster, it must become a parallel program.
Although the goal of many researchers is to ensure that programmers are not aware of the
parallel nature of the hardware for which they write their programs, it will take many years
before this actually becomes possible. Nowadays, most programmers need to thoroughly
understand the link between hardware and software so that the programs can be run
efficiently on modern computer architectures.
To introduce the concepts of parallel programming, the Python programming language has
been adopted. Python is fun and easy to use, and its popularity has grown steadily in recent
years. Python was developed more than 10 years ago by Guido van Rossum, who derived
Python's syntax simplicity and ease of use largely from ABC, which is a teaching language
that was developed in the 80s.
In addition to this specific context, Python was created to solve real-life problems, and it
borrows a wide variety of typical characteristics of programming languages, such as C++,
Java, and Scheme. This is one of its most remarkable features, which has led to its broad
appeal among professional software developers, the scientific research industry, and
computer science educators. One of the reasons why Python is liked so much is because
it provides the best balance between the practical and conceptual approaches. It is an
interpreted language, so you can start doing things immediately without getting lost in the
problems of compilation and linking. Python also provides an extensive software library that can be used for all sorts of tasks, ranging from the Web and graphics to, of course, parallel computing. This practical aspect is a great way to engage readers and allows them to carry out the projects that are central to this book.
This book contains a wide variety of examples inspired by many situations, and these offer you the opportunity to solve real-life problems. This book examines the principles of software design for parallel architectures, stressing the importance of program clarity and avoiding the use of complex terminology in favor of clear and direct examples.
Each topic is presented as part of a complete, working Python program, which is followed by
the output of the program in question.
The modular organization of the various chapters provides a proven path to move from the
simplest arguments to the most advanced ones, but this is also suitable for those who only
want to learn a few specific issues.
I hope that the structure and content of this book provide a useful contribution to your understanding and to the wider adoption of parallel programming techniques.
Chapter 4, Asynchronous Programming, explains the asynchronous model for concurrent
programming. In some ways, it is simpler than the threaded one because there is a single
instruction stream and tasks explicitly relinquish control instead of being suspended
arbitrarily. This chapter will show you how to use the Python asyncio module to organize each
task as a sequence of smaller steps that must be executed in an asynchronous manner.
Chapter 5, Distributed Python, introduces you to distributed computing, which is the process of aggregating several computing units, possibly even geographically distributed, to collaboratively run a single computational task in a transparent and coherent way. This chapter will present some of the solutions Python offers for the implementation of these architectures, using the OO approach, Celery, SCOOP, and remote procedure call libraries such as Pyro4 and RPyC. It will also include different approaches, such as PyCSP, and finally, Disco, which is the Python version of the MapReduce algorithm.
Chapter 6, GPU Programming with Python, describes the modern Graphics Processing
Units (GPUs) that provide breakthrough performance for numerical computing at the cost of
increased programming complexity. In fact, the programming models for GPUs require the
programmer to manually manage the data transfer between a CPU and GPU. This chapter will
teach you, through the programming examples and use cases, how to exploit the computing
power provided by GPU cards, using the powerful Python modules PyCUDA, NumbaPro, and PyOpenCL.
Introduction
This chapter gives you an overview of parallel programming architectures and programming
models. These concepts are useful for inexperienced programmers who are approaching parallel programming techniques for the first time. This chapter can also serve as a basic reference for experienced programmers. The dual characterization of parallel systems is also
presented in this chapter. The first characterization is based on the architecture of the
system and the second characterization is based on parallel programming paradigms.
Parallel programming will always be a challenge for programmers. This programming-based
approach is further described in this chapter, when we present the design procedure of a
parallel program. The chapter ends with a brief introduction of the Python programming
language. The language's ease of use and learning, its extensibility, and the richness of its software libraries and applications make Python a valuable tool for any application and also, of course, for parallel computing. In the final part of the chapter, the
concepts of threads and processes are introduced in relation to their use in the language.
A typical way to solve a large problem is to divide it into smaller, independent parts and solve all the pieces simultaneously. A parallel program is a program that uses this approach, that is, it uses multiple processors working together on a common task. Each processor works on its own section (an independent part) of the problem. Furthermore, data exchange between processors may take place during the computation. Nowadays, many software applications require more computing power. One way to achieve this is to increase the clock speed of the processor or to increase the number of processing cores on the chip. Raising the clock speed increases heat dissipation, thereby decreasing performance per watt, and it also requires special cooling equipment. Increasing the number of cores is the more feasible solution, as it keeps power consumption and heat dissipation within limits while still providing a significant gain in performance.
To address this problem, computer hardware vendors decided to adopt multi-core
architectures, which are single chips that contain two or more processors (cores). On the
other hand, GPU manufacturers also introduced hardware architectures based on multiple computing cores. In fact, today's computers almost always contain multiple, heterogeneous computing units, each formed by a variable number of cores, the most common example being multi-core architectures.
Therefore, to take advantage of the available computational resources, it has become essential to adopt the programming paradigms, techniques, and tools of parallel computing.
Figure: Flynn's taxonomy of computer architectures: SISD (Single Instruction, Single Data), SIMD (Single Instruction, Multiple Data), MISD (Multiple Instruction, Single Data), and MIMD (Multiple Instruction, Multiple Data).
SISD
The SISD computing system is a uniprocessor machine. It executes a single instruction that
operates on a single data stream. In SISD, machine instructions are processed sequentially.
In a clock cycle, the CPU executes the following operations:
Fetch: The CPU fetches the instruction and data from memory and places them in registers.
Decode: The instruction is decoded so that the CPU knows which operation to carry out.
Execute: The instruction is carried out on the data. The result of the operation is stored in another register.
Figure: The SISD model: a control unit sends a single instruction stream to the processor, which exchanges a single data stream with memory.
The algorithms that run on these types of computers are sequential (or serial), since they
do not contain any parallelism. Examples of SISD computers are hardware systems with
a single CPU.
The main elements of these architectures (Von Neumann architectures) are:
Central memory unit: This is used to store both instructions and program data
CPU: This is used to get the instructions and/or data from the memory unit, decode the instructions, and implement them sequentially
The I/O system: This refers to the input data and output data of the program
The conventional single processor computers are classified as SISD systems. The following
figure specifically shows which areas of a CPU are used in the stages of fetch, decode, and
execute:
Figure: The areas of a CPU involved in the fetch, decode, and execute stages: the bus unit, instruction cache, and data cache; the decode unit; the control unit; the registers; and the arithmetic logic unit.
MISD
In this model, n processors, each with their own control unit, share a single memory unit.
In each clock cycle, the data received from the memory is processed by all processors
simultaneously, each in accordance with the instructions received from its control unit. In
this case, the parallelism (instruction-level parallelism) is obtained by performing several
operations on the same piece of data. The types of problems that can be solved efficiently
in these architectures are rather special, such as those regarding data encryption; for this
reason, MISD computers have not found a place in the commercial sector. MISD computers are
more of an intellectual exercise than a practical configuration.
Figure: The MISD model: N control units send separate instruction streams to N processors, all of which operate on the same data stream coming from a single shared memory.
SIMD
A SIMD computer consists of n identical processors, each with its own local memory, where
it is possible to store data. All processors work under the control of a single instruction
stream; in addition to this, there are n data streams, one for each processor. The processors
work simultaneously on each step and execute the same instruction, but on different data
elements. This is an example of data-level parallelism. The SIMD architectures are much more
versatile than MISD architectures. Numerous problems covering a wide range of applications
can be solved by parallel algorithms on SIMD computers. Another interesting feature is that
the algorithms for these computers are relatively easy to design, analyze, and implement. The
limit is that only the problems that can be divided into a number of subproblems (which are
all identical, each of which will then be solved contemporaneously, through the same set of
instructions) can be addressed with a SIMD computer. Among the supercomputers developed according to this paradigm, we must mention the Connection Machine (Thinking Machines, 1985) and the MPP (NASA, 1983). As we will see in Chapter 6, GPU Programming with Python, the advent of the modern graphics processing unit (GPU), built with many embedded SIMD units, has led to a more widespread use of this computational paradigm.
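Although SIMD is a hardware classification rather than a Python feature, the flavor of data-level parallelism is easy to show with NumPy, which is used here purely as an illustration and is not covered in this chapter: a single vectorized operation is applied to every element of an array, and NumPy's compiled loops can in turn exploit the SIMD instructions of the CPU.

import numpy as np

# One logical operation ("multiply by delta") applied to many data elements at once.
B = np.arange(1, 101, dtype=np.float64)
delta = 0.5

# Vectorized, data-parallel style: no explicit Python loop over the elements.
A = B * delta

print(A[:5])  # [0.5 1.  1.5 2.  2.5]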
MIMD
This class of parallel computers is the most general and most powerful class according to
Flynn's classification. There are n processors, n instruction streams, and n data streams
in this. Each processor has its own control unit and local memory, which makes MIMD
architectures more computationally powerful than those used in SIMD. Each processor
operates under the control of a flow of instructions issued by its own control unit; therefore,
the processors can potentially run different programs on different data, solving subproblems
that are different and can be parts of a single larger problem. In MIMD architectures, parallelism is achieved at the level of threads and/or processes. This also means that the processors usually operate asynchronously. The computers in this class are used to solve problems that do not have the regular structure required by the SIMD model. Nowadays, this architecture is used in many PCs, supercomputers, and computer networks. However, there is a drawback that you need to consider: asynchronous
algorithms are difficult to design, analyze, and implement.
Figure: The MIMD model: N control units issue independent instruction streams to N processors, each working on its own data and connected to shared memory through an interconnection network.
Memory organization
Another aspect that we need to consider to evaluate a parallel architecture is memory
organization or rather, the way in which the data is accessed. No matter how fast the
processing unit is, if the memory cannot maintain and provide instructions and data at a
sufficient speed, there will be no improvement in performance. The main problem that must
be overcome to make the response time of the memory compatible with the speed of the
processor is the memory cycle time, which is defined as the time that elapses between two successive memory operations. The cycle time of the processor is typically much shorter than
the cycle time of the memory. When the processor starts transferring data (to or from the
memory), the memory will remain occupied for the entire time of the memory cycle: during
this period, no other device (I/O controller, processor, or even the processor itself that made
the request) can use the memory because it will be committed to respond to the request.
Figure: MIMD memory organization: distributed memory systems (MPP and clusters of workstations) and shared memory systems (UMA, NUMA, NORMA, and COMA).
Shared memory
The schema of a shared memory multiprocessor system is shown in the following figure. The
physical connections here are quite simple. The bus structure allows an arbitrary number of devices to share the same channel. The bus protocols were originally designed to allow a single processor and one or more disks or tape controllers to communicate through the shared memory. Note that each processor has been associated with a cache memory,
as it is assumed that the probability that a processor needs data or instructions present in
the local memory is very high. The problem occurs when a processor modifies data stored in
the memory system that is simultaneously used by other processors. The new value will pass
from the processor cache that has been changed to shared memory; later, however, it must
also be passed to all the other processors, so that they do not work with the obsolete value.
This problem is known as the problem of cache coherency, a special case of the problem of
memory consistency, which requires hardware implementations that can handle concurrency
issues and synchronization similar to those of thread programming.
Figure: A shared memory architecture: several processors, each with its own cache, access a single main memory and I/O system over a common bus.
The memory is the same for all processors, for example, all the processors associated
with the same data structure will work with the same logical memory addresses, thus
accessing the same memory locations.
A shared memory location must not be changed by a task while another task accesses it.
Sharing data is fast; the time required for communication between two tasks is equal to the time needed to read a single memory location (it depends on the speed of memory access).
Uniform memory access (UMA): The fundamental characteristic of this system is that the access time to the memory is constant for each processor and for any area of memory. For this reason, these systems are also called symmetric multiprocessors (SMP). They are relatively simple to implement, but not very scalable; the programmer is responsible for managing synchronization by inserting appropriate controls, semaphores, locks, and so on in the program that manages resources (a minimal sketch of this appears after this list).
Non-uniform memory access (NUMA): These architectures divide the memory into a high-speed access area that is assigned to each processor and a common area for data exchange, with slower access. These systems are also called Distributed Shared Memory (DSM) systems. They are very scalable, but complex to develop.
Cache only memory access (COMA): These systems are equipped with only cache memories. While analyzing NUMA architectures, it was noticed that these architectures kept local copies of the data in the caches and that these data were also stored, as duplicates, in the main memory. COMA architectures remove the duplicates and keep only the cache memories; the memory is physically distributed among the processors (local memory). All local memories are private and can only be accessed by their local processor. The communication between the processors is through a communication protocol for the exchange of messages, the message-passing protocol.
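As noted in the UMA point above, with shared memory the programmer is responsible for synchronization. The following minimal sketch shows the idea in Python using the standard threading module (covered later in the book); the counter variable and the number of threads are arbitrary choices made for this illustration:

import threading

counter = 0                       # data shared by all threads (shared memory)
counter_lock = threading.Lock()   # the programmer inserts the synchronization

def increment(n):
    global counter
    for _ in range(n):
        with counter_lock:        # only one thread at a time updates the value
            counter += 1

threads = [threading.Thread(target=increment, args=(100000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                    # 400000; without the lock the result could be lower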
Distributed memory
In a system with distributed memory, the memory is associated with each processor and a
processor is only able to address its own memory. Some authors refer to this type of system
as "multicomputer", reflecting the fact that the elements of the system are themselves small
complete systems of a processor and memory, as you can see in the following figure:
Figure: A distributed memory architecture: each node consists of a processor with its own cache, local memory, and I/O, and the nodes communicate through an interconnection network.
This kind of organization has several advantages. First, there are no conflicts at the level of the communication bus or switch: each processor can use the full bandwidth of its own local memory without any interference from other processors. Second, the lack of a common bus means that there is no intrinsic limit to the number of processors; the size of the system is limited only by the network used to connect the processors. Third, there are no cache coherency problems: each processor is responsible for its own data and does not have to worry about updating any copies. The main disadvantage is that communication between processors is more difficult to implement. If a processor requires data that resides in the memory of another processor, the two processors must exchange messages via the message-passing protocol. This introduces two sources of slowdown: building and sending a message from one processor to another takes time, and a processor must stop in order to manage the messages received from other processors. A program designed to work on a distributed memory machine must be organized as a set of independent tasks that communicate via messages.
Figure: Communication in a distributed memory system: data is moved from the local memory of one node to the local memory of another.
Synchronization is achieved by moving data (even if it's just the message itself) between processors (communication).
The subdivision of data among the local memories affects the performance of the machine: it is essential to make the subdivision accurate, so as to minimize the communication between the CPUs. In addition to this, the processor that coordinates these operations of decomposition and composition must communicate effectively with the processors that operate on the individual parts of the data structures.
The message-passing protocol is used so that the CPUs can communicate with each other through the exchange of data packets. The messages are discrete units of information, in the sense that they have a well-defined identity, so it is always possible to distinguish them from each other (a minimal sketch of message passing in Python follows this list).
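A distributed memory machine is not needed to get a feel for this style of communication; the following minimal sketch uses a multiprocessing.Pipe between two processes on the same machine to mimic the exchange of discrete messages described above (the function name and the message contents are invented for the illustration):

from multiprocessing import Process, Pipe

def worker(conn):
    data = conn.recv()               # block until a message arrives
    conn.send(sum(data))             # reply with a result message
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send([1, 2, 3, 4])   # each message is a discrete unit of data
    print(parent_conn.recv())        # 10
    p.join()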
A cluster of workstations
These processing systems are based on classical computers that are connected by
communication networks. The computational clusters fall into this classification.
In a cluster architecture, we define a node as a single computing unit that takes part in the
cluster. For the user, the cluster is fully transparent: all the hardware and software complexity is masked, and data and applications are made accessible as if they all came from a single node.
Here, we've identified three types of clusters:
The fail-over cluster: In this, the nodes' activity is continuously monitored, and when one stops working, another machine takes over those activities. The aim is to ensure a continuous service thanks to the redundancy of the architecture.
The load balancing cluster: In this system, a job request is sent to the node that has the least activity. This ensures that less time is taken to complete the process.
Figure: A heterogeneous architecture: a CPU with a few cores (Core 1 to Core 4) alongside a GPU composed of many multiprocessors (Multiprocessor 1 to Multiprocessor N).
In this recipe, we will give you an overview of these models. A more detailed description will be given in the following chapters, which will introduce you to the appropriate Python modules that implement them.
Figure: The message passing paradigm model: tasks on Machine A call send() to transfer data over the network to tasks on Machine B, which call receive().
Figure: The data parallel model: the loop over array A, A(i) = B(i) * delta, is split among three tasks: Task 1 computes i = 1 to 25, Task 2 computes i = 26 to 50, and Task 3 computes i = 51 to 100.
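In Python, the splitting of the loop shown in the figure could be sketched with the multiprocessing module, which is covered later in the book; the chunk boundaries mirror those of the figure, and delta is an arbitrary value chosen for the example:

from multiprocessing import Pool

delta = 2.0
B = list(range(1, 101))                       # 100 input elements, as in the figure

def compute_chunk(bounds):
    start, end = bounds                       # each task works on its own slice of B
    return [B[i] * delta for i in range(start, end)]

if __name__ == '__main__':
    chunks = [(0, 25), (25, 50), (50, 100)]   # Task 1, Task 2, Task 3
    with Pool(processes=3) as pool:
        parts = pool.map(compute_chunk, chunks)
    A = [value for part in parts for value in part]   # recombine the results
    print(len(A))                             # 100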
Figure: The phases of parallel program design: task decomposition, task assignment, agglomeration, and mapping.
Task decomposition
In this first phase, the software program is split into tasks, or sets of instructions, that can then be executed on different processors to implement parallelism. Two methods are used to carry out this subdivision:
Domain decomposition: In this case, the data of the problem is decomposed; each task works on a different portion of the data, applying the same operations.
Functional decomposition: In this case, the problem is split into tasks, where each task will perform a particular operation on all the available data (a minimal sketch of this follows).
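As a quick sketch of functional decomposition (as opposed to domain decomposition, illustrated earlier with the array example), the following fragment runs different operations on the same data in separate processes; the choice of statistics and of concurrent.futures is only for illustration:

from concurrent.futures import ProcessPoolExecutor
import statistics

data = list(range(1, 1001))

# Functional decomposition: each task performs a *different* operation
# on the same available data.
tasks = {'sum': sum, 'mean': statistics.mean, 'maximum': max}

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        futures = {name: executor.submit(fn, data) for name, fn in tasks.items()}
        results = {name: f.result() for name, f in futures.items()}
    print(results)   # {'sum': 500500, 'mean': 500.5, 'maximum': 1000}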
Task assignment
In this step, the mechanism by which the tasks will be distributed among the various processes is specified. This phase is very important because it establishes the distribution of workload among the various processors. Load balancing is crucial here; in fact, all processors must work continuously, avoiding long idle periods. To achieve this, the programmer takes into account the possible heterogeneity of the system and tries to assign more tasks to the better-performing processors. Finally, for greater efficiency of parallelization, it is necessary to limit communication between processors as much as possible, as it is often the source of slowdowns and resource consumption.
Agglomeration
Agglomeration is the process of combining smaller tasks into larger ones in order to improve performance. If the previous two stages of the design process partitioned the problem into a number of tasks that greatly exceeds the number of processors available, and if the computer is not specifically designed to handle a huge number of small tasks (some architectures, such as GPUs, handle this fine and indeed benefit from running millions or even billions of tasks), then the design can turn out to be highly inefficient. Commonly, this is because each task has to be communicated to the processor or thread that will compute it. Most communication has costs that are not only proportional to the amount of data transferred, but also incur a fixed cost for every communication operation (such as the latency inherent in setting up a TCP connection). If the tasks are too small, this fixed cost can easily make the design inefficient.
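The effect described above can be seen even with Python's standard multiprocessing.Pool: the chunksize argument of map() agglomerates many small tasks into fewer, larger messages, so the fixed per-communication cost is paid far less often. The function and the numbers below are arbitrary choices for the illustration:

from multiprocessing import Pool

def tiny_task(x):
    return x * x          # far too little work to justify one message per item

if __name__ == '__main__':
    items = range(100000)
    with Pool() as pool:
        # chunksize=1000 groups the tiny tasks into larger ones, so each worker
        # receives one message per 1,000 items instead of one message per item.
        results = pool.map(tiny_task, items, chunksize=1000)
    print(results[:5])    # [0, 1, 4, 9, 16]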
Mapping
In the mapping stage of the parallel algorithm design process, we specify where each task is
to be executed. The goal is to minimize the total execution time. Here, you must often make
tradeoffs, as the two main strategies often conflict with each other:
The tasks that communicate frequently should be placed on the same processor to
increase locality
The tasks that can be executed concurrently should be placed in different processors
to enhance concurrency
Dynamic mapping
There exist many load balancing algorithms for various problems, both global and local.
Global algorithms require global knowledge of the computation being performed, which
often adds a lot of overhead. Local algorithms rely only on information that is local to the
task in question, which reduces overhead compared to global algorithms, but are usually
worse at finding an optimal agglomeration and mapping. However, the reduced overhead
may reduce the execution time even though the mapping is worse by itself. If the tasks rarely
communicate other than at the start and end of the execution, a task-scheduling algorithm
is often used that simply maps tasks to processors as they become idle. In a task-scheduling
algorithm, a task pool is maintained. Tasks are placed in this pool and are taken from it by
workers.
There are three common approaches in this model, which are explained next.
Manager/worker
This is the basic dynamic mapping scheme in which all the workers connect to a
centralized manager. The manager repeatedly sends tasks to the workers and collects the
results. This strategy is probably the best for a relatively small number of processors. The
basic strategy can be improved by fetching tasks in advance so that communication and
computation overlap each other.
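A minimal manager/worker sketch using the standard multiprocessing queues is shown below; the queue names, the number of workers, and the "poison pill" shutdown convention are choices made for this illustration rather than the book's own code:

from multiprocessing import Process, Queue

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:                     # "poison pill": no more work
            break
        results.put((item, item * item))     # send the result back to the manager

if __name__ == '__main__':
    tasks, results = Queue(), Queue()
    workers = [Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for w in workers:
        w.start()
    for item in range(20):                   # the manager repeatedly sends tasks...
        tasks.put(item)
    for _ in workers:
        tasks.put(None)                      # ...then tells each worker to stop...
    collected = [results.get() for _ in range(20)]   # ...and collects the results
    for w in workers:
        w.join()
    print(len(collected))                    # 20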
Hierarchical manager/worker
This is a variant of manager/worker with a semi-distributed layout: workers are split into groups, each with its own manager. These group managers communicate with the central manager (and possibly among themselves as well), while workers request tasks from their group managers. This spreads the load among several managers and can, as such, handle a larger number of processors than a scheme in which all workers request tasks from the same manager.
Decentralize
In this scheme, everything is decentralized. Each processor maintains its own task pool and
communicates with the other processors in order to request tasks. How the processors choose
other processors to request tasks varies and is determined on the basis of the problem.
Speedup
Speedup is the measure that displays the benefit of solving a problem in parallel. It is defined
as the ratio of the time taken to solve a problem on a single processing element, TS, to the
time required to solve the same problem on p identical processing elements, Tp.
S = Ts / Tp
Efficiency
In an ideal world, a parallel system with p processing elements can give us a speedup equal
to p. However, this is very rarely achieved. Usually, some time is wasted in either idling or
communicating. Efficiency is a performance metric estimating how well-utilized the processors
are in solving a task, compared to how much effort is wasted in communication and
synchronization.
We denote efficiency by E and define it as E = S / p = Ts / (p Tp). The algorithms with linear speedup have the value of E = 1; in other cases, the value of E is less than 1. The three cases are identified as follows:
When E = 1, it is a linear case
When E < 1, it is a real case
When E << 1, it is a problem that is parallelizable with low efficiency
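As a quick numeric illustration of the two definitions (the timings below are invented for the example):

def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    return speedup(t_serial, t_parallel) / p

# Suppose a task takes 100 s on one processor and 30 s on 4 processors.
S = speedup(100.0, 30.0)          # about 3.33
E = efficiency(100.0, 30.0, 4)    # about 0.83, that is, 83% of the ideal speedup of 4
print(S, E)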
Scaling
Scaling is defined as the ability to be efficient on a parallel machine. It identifies the computing power (speed of execution) in proportion to the number of processors. By increasing the size of the problem and, at the same time, the number of processors, there will be no loss in terms of performance. A scalable system, depending on how the different factors are increased, may maintain the same efficiency or even improve it.
Amdahl's law
Amdahl's law is widely used in the design of processors and parallel algorithms. It states that the maximum speedup that can be achieved is limited by the serial component of the program: S = 1 / (1 - P), where 1 - P denotes the serial (non-parallelizable) fraction of the program.
Gustafson's law
Gustafson's law is based on the following considerations:
While increasing the dimension of a problem, its sequential parts remain constant
While increasing the number of processors, the work required on each of them still
remains the same
This states that S(P) = P - α(P - 1), where P is the number of processors, S is the speedup, and α is the non-parallelizable fraction of any parallel process. This is in contrast to Amdahl's
law, which takes the single-process execution time to be the fixed quantity and compares
it to a shrinking per process parallel execution time. Thus, Amdahl's law is based on the
assumption of a fixed problem size; it assumes that the overall workload of a program does
not change with respect to the machine size (that is, the number of processors). Gustafson's
law addresses the deficiency of Amdahl's law, which does not take into account the total
number of computing resources involved in solving a task. It suggests that the best way to proceed is to set the time allowed for the solution of a parallel problem by considering all the computing resources and, on the basis of this information, to fix the size of the problem.
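To make the two laws concrete, here is a small numeric comparison; the general form of Amdahl's law is used (as the number of processors grows, it tends to the 1 / (1 - P) limit given above), and the fraction values are invented for the example:

def amdahl_speedup(parallel_fraction, p):
    # Fixed problem size: speedup is limited by the serial part.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / p)

def gustafson_speedup(alpha, p):
    # Scaled problem size: alpha is the non-parallelizable fraction.
    return p - alpha * (p - 1)

P = 16                                 # number of processors
print(amdahl_speedup(0.95, P))         # about 9.1: held back by the 5% serial part
print(gustafson_speedup(0.05, P))      # 15.25: the problem size grows with P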
Introducing Python
Python is a powerful, dynamic, and interpreted programming language that is used in a wide
variety of applications. Some of its features include:
A very extensive standard library, where through additional software modules, we can
add data types, functions, and objects
Python can be seen as a glue language. Using Python, better applications can be developed
because different kinds of programmers can work together on a project. For example, when
building a scientific application, C/C++ programmers can implement efficient numerical
algorithms, while scientists on the same project can write Python programs that test and use
those algorithms. Scientists don't have to learn a low-level programming language and a C/
C++ programmer doesn't need to understand the science involved.
Getting ready
Python can be downloaded from https://www.python.org/downloads/.
Although you can create Python programs with Notepad or TextEdit, you'll notice that it's much
easier to read and write code using an Integrated Development Environment (IDE).
There are many IDEs designed specifically for Python, including IDLE (http://www.python.org/idle), PyCharm (https://www.jetbrains.com/pycharm/), and Sublime Text (http://www.sublimetext.com/).
How to do it
Let's take a look at some examples of the very basic code to get an idea of the features of
Python. Remember that the symbol >>> denotes the Python shell:
Only for this first example, we will see how the code appears in the Python shell:
Let's see the other basic examples:
Complex numbers:
>>> a=1.5+0.5j
>>> a.real
1.5
>>> a.imag
0.5
>>> abs(a)  # sqrt(a.real**2 + a.imag**2)
1.5811388300841898
Strings manipulation:
>>> word = 'Help' + 'A'
>>> word
'HelpA'
>>> word[4]
'A'
>>> word[0:2]
'He'
>>> word[-1]
'A'
Defining lists:
>>> a = ['spam', 'eggs', 100, 1234]
>>> a[0]
'spam'
>>> a[3]
1234
>>> a[-2]
100
>>> a[1:-1]
['eggs', 100]
>>> len(a)
4
The while loop:
>>> a, b = 0, 1
>>> while b < 10:
...     print(b)
...     a, b = b, a+b
...
1
1
2
3
5
8
The if command:
First we use the input() statement to insert an integer:
>>> x = int(input("Please enter an integer here: "))
Please enter an integer here:
>>> if x < 0:
...     print('Negative')
... elif x == 0:
...     print('Zero')
... elif x == 1:
...     print('Single')
... else:
...     print('More')
...
The for loop:
>>> words = ['cat', 'window', 'defenestrate']
>>> for w in words:
...     print(w, len(w))
...
cat 3
window 6
defenestrate 12
Defining functions:
>>> def fib(n):
...     # print the Fibonacci series up to n
...     a, b = 0, 1
...     while b < n:
...         print(b, end=' ')
...         a, b = b, a+b
...
>>> # Now call the function we just defined:
... fib(2000)
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597
Importing modules:
>>> import math
>>> math.sin(1)
0.8414709848078965
Defining classes:
>>> class Complex:
...     def __init__(self, realpart, imagpart):
...         self.r = realpart
...         self.i = imagpart
...
>>> x = Complex(3.0, -4.5)
>>> x.r, x.i
(3.0, -4.5)
Let's recap:
The threads of the same process share the address space and other resources, while
processes are independent of each other.
Before examining in detail the features and functionality of Python modules for the
management of parallelism via threads and processes, let's first look at how the Python
programming language works with these two entities.
Getting ready
For this first Python application, you simply need to have the Python language installed.
How to do it
To execute this first example, we need to type the following two programs:
called_Process.py
calling_Process.py
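The listings of the two files are not reproduced in this extract. Based on the description in the How it works section below (os.execvp replacing the calling process, a "Good Bye" message that is never printed, and an input() call in the called script), a plausible minimal sketch is the following; the exact messages are hypothetical:

# calling_Process.py (hypothetical reconstruction)
import os

print("Process calling")
# execvp searches for the program along the standard path and replaces
# the current process with it, passing the tuple as its arguments.
os.execvp("python", ("python", "called_Process.py"))
print("Good Bye!!")   # never printed: execvp has replaced this process

# called_Process.py (hypothetical reconstruction)
print("Hello from called_Process.py")
input("Press ENTER to close the OS prompt")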
To run the example, open the calling_Process.py program with the Python IDE and then
press the F5 button on the keyboard.
You will see the following output in the Python shell:
At the same time, the OS prompt displays the following:
We now have two processes running. To close the OS prompt, simply press the Enter button on the keyboard.
How it works
In the preceding example, the execvp function starts a new process, replacing the current
one. Note that the "Good Bye" message is never printed. Instead, it searches for the program
called_Process.py along the standard path, passes the contents of the second argument
tuple as individual arguments to that program, and runs it with the current set of environment
variables. The instruction input() in called_Process.py is only used to manage the
closure of the OS prompt. In the recipe dedicated to process-based parallelism, we will finally see how to manage the parallel execution of multiple processes via the multiprocessing Python module.
How to do it
To execute this first example, we need the program helloPythonWithThreads.py:
## To use threads you need to import Thread using the following code:
from threading import Thread

# create an instance of the CookBook class
hello_Python = CookBook()
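Only the two fragments above survive in this extract. A full script consistent with the How it works section below (a thread that keeps printing its message every two seconds, even after the main program has ended) might look like the following; the class body is a hypothetical reconstruction, not necessarily the book's exact code:

## To use threads you need to import Thread using the following code:
from threading import Thread
from time import sleep

class CookBook(Thread):
    def __init__(self):
        Thread.__init__(self)
        self.message = "Hello Python Parallel Cookbook!!"

    # this method prints the message
    def print_message(self):
        print(self.message)

    # the run method prints the message ten times, pausing two seconds each time
    def run(self):
        print("Thread Starting")
        x = 0
        while x < 10:
            self.print_message()
            sleep(2)
            x += 1
        print("Thread Ended")

print("Process Started")

# create an instance of the CookBook class
hello_Python = CookBook()

# start the thread
hello_Python.start()

print("Process Ended")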
To run the example, open the helloPythonWithThreads.py program with the Python IDE and then press the F5 button on the keyboard.
You will see the following output in the Python shell:
How it works
Although the main program has reached its end, the thread continues printing its message every two seconds. This example demonstrates what threads are: a subtask doing something in a parent process.
A key point when using threads is to always make sure that you never leave any thread running in the background. This is very bad programming practice and can cause you all sorts of pain when you work on bigger applications.