NumPy Cookbook - Second Edition - Sample Chapter
This book will give you a solid foundation in NumPy arrays and universal functions. Starting with the installation
and configuration of IPython, you'll learn about advanced indexing and array concepts along with commonly
used yet effective functions. You will then cover practical concepts such as image processing, special arrays, and
universal functions. You will also learn about plotting with Matplotlib and the related SciPy project with the help of
examples. At the end of the book, you will study how to explore atmospheric pressure and its related techniques.
By the time you finish this book, you'll be able to write clean and fast code with NumPy.
masked arrays
Explore everything you need to know about
image processing
Dive into broadcasting and histograms
features
and problems
Assurance
Learn about exploratory and predictive data
Ivan Idris
Second Edition
Over 90 fascinating recipes to learn and perform mathematical,
scientific, and engineering Python computations with NumPy
NumPy Cookbook
NumPy has the ability to give you speed and high productivity. High performance calculations can be done easily
with clean and efficient code, and it allows you to execute complex algebraic and mathematical computations in
no time.
This second edition adds two new chapters on the new NumPy functionality and data
analysis. We NumPy users live in exciting times. New NumPy-related developments
seem to come to our attention every week, or maybe even daily. At the time of the first
edition, the NumFocus, short for NumPy Foundation for Open Code for Usable Science,
was created. The Numba project, a NumPy-aware dynamic Python compiler using
LLVM, was also announced. Further, Google added NumPy support to their cloud product
called Google App Engine.
In the future, we can expect improved concurrency support for clusters of GPUs and
CPUs. OLAP-like queries will be possible with NumPy arrays. This is wonderful news,
but we have to keep reminding ourselves that NumPy is not alone in the scientific
(Python) software ecosystem. There is SciPy, matplotlib (a very useful Python plotting
library), IPython (an interactive shell), and Scikits. Outside the Python ecosystem,
languages such as R, C, and Fortran are pretty popular. We will cover the details of
exchanging data with these environments.
Chapter 6, Special Arrays and Universal Functions, introduces pretty technical topics.
This chapter explains how to perform string operations, ignore illegal values, and store
heterogeneous data.
Chapter 7, Profiling and Debugging, shows the skills necessary to produce good
software. We demonstrate several convenient profiling and debugging tools.
Chapter 8, Quality Assurance, deserves a lot of attention because it's about quality.
We discuss common methods and techniques, such as unit testing, mocking, and BDD,
using the NumPy testing utilities.
Chapter 9, Speeding Up Code with Cython, introduces Cython, which tries to combine
the speed of C and the strengths of Python. We show you how Cython works from the
NumPy perspective.
Chapter 10, Fun with Scikits, covers Scikits, which are yet another part of the
fascinating scientific Python ecosystem. A quick tour guides you through some of the
most useful Scikits projects.
Chapter 11, Latest and Greatest NumPy, showcases new functionality not covered in
the first edition.
Chapter 12, Exploratory and Predictive Data Analysis with NumPy, presents real-world
analysis of meteorological data. I've added this chapter in the second edition.
Introduction
This chapter is about the commonly used NumPy functions. These are the functions that
you will be using on a daily basis. Obviously, the usage may differ for you. There are so many
NumPy functions that it is virtually impossible to know all of them, but the functions in this
chapter are the bare minimum with which we should be familiar.
This recipe uses a formula based on the golden ratio, which is an irrational number with
special properties comparable to pi. The golden ratio is given by the following formula:
$$\phi = \frac{1 + \sqrt{5}}{2}$$
We will use the sqrt(), log(), arange(), astype(), and sum() functions. The Fibonacci
sequence's recurrence relation has the following solution, which involves the golden ratio:
$$F_n = \frac{\phi^n - (-1/\phi)^n}{\sqrt{5}}$$
How to do it...
The following is the complete code for this recipe from the sum_fibonacci.py file in this
book's code bundle:
import numpy as np
#Each new term in the Fibonacci sequence is generated by adding the
#previous two terms.
#By starting with 1 and 2, the first 10 terms will be:
#1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
#By considering the terms in the Fibonacci sequence whose values do
#not exceed four million,
#find the sum of the even-valued terms.
#1. Calculate phi
phi = (1 + np.sqrt(5))/2
print("Phi", phi)
#2. Find the index below 4 million
n = np.log(4 * 10 ** 6 * np.sqrt(5) + 0.5)/np.log(phi)
print(n)
#3. Create an array of 1-n
n = np.arange(1, n)
print(n)
#4. Compute Fibonacci numbers
fib = (phi**n - (-1/phi)**n)/np.sqrt(5)
print("First 9 Fibonacci Numbers", fib[:9])
#5. Convert to integers
# optional
fib = fib.astype(int)
print("Integers", fib)
#6. Select even-valued terms
eventerms = fib[fib % 2 == 0]
print(eventerms)
#7. Sum the selected terms
print(eventerms.sum())
2. Next, in the recipe, we need to find the index of the Fibonacci number below 4 million.
A formula for this is given in the Wikipedia page, and we will compute it using that
formula. All we need to do is convert log bases with the log() function. We don't
need to round the result down to the closest integer. This is automatically done for us
in the next step of the recipe:
n = np.log(4 * 10 ** 6 * np.sqrt(5)
+ 0.5)/np.log(phi)
print(n)
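The index computation can be checked against a plain loop. The following is a minimal sketch, not part of the book's code bundle; the names limit, n_float, n_index, and count are introduced here for illustration only. It also shows why no explicit rounding is needed: arange() excludes its stop value, so a fractional stop still yields whole-numbered indices.

```python
import numpy as np

phi = (1 + np.sqrt(5)) / 2
limit = 4 * 10 ** 6

# Change of base: log_phi(x) = ln(x) / ln(phi)
n_float = np.log(limit * np.sqrt(5) + 0.5) / np.log(phi)
n_index = int(np.floor(n_float))

# arange() stops before the (fractional) stop value, so the last
# element is the largest whole index below n_float.
indices = np.arange(1, n_float)

# Cross-check by walking the sequence 1, 1, 2, 3, ... with a loop.
a, b, count = 1, 1, 1
while b <= limit:
    a, b = b, a + b
    count += 1

print(n_index, indices[-1], count, a)
```

Here a ends up as the largest Fibonacci number that does not exceed the limit, and count is its index in the sequence.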
3. The arange() function is a very basic function that many people know. Still, we will
mention it here for completeness:
n = np.arange(1, n)
I could have made a unit test instead of a print statement. A unit test is
a test that tests a small unit of code, such as a function. This variation
of the recipe is left as an exercise for you.
We are not starting with the number 0 here, by the way. The aforementioned code
gives us a series as expected:
First 9 Fibonacci Numbers [  1.   1.   2.   3.   5.   8.  13.  21.  34.]
You can plug this right into a unit test, if you want.
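One possible shape for such a unit test is sketched below. This is an assumption about how you might structure it, not the book's solution: the fibonacci() helper and the test function name are hypothetical, and the comparison uses numpy.testing.assert_allclose because the closed-form values are floats.

```python
import numpy as np


def fibonacci(n_terms):
    """Return the first n_terms Fibonacci numbers (1, 1, 2, ...) as floats."""
    phi = (1 + np.sqrt(5)) / 2
    n = np.arange(1, n_terms + 1)
    return (phi ** n - (-1 / phi) ** n) / np.sqrt(5)


def test_first_nine():
    expected = [1, 1, 2, 3, 5, 8, 13, 21, 34]
    # Raises AssertionError if any value differs beyond the tolerance.
    np.testing.assert_allclose(fibonacci(9), expected)


test_first_nine()
```

A test runner would normally discover and call test_first_nine() for you; it is called directly here to keep the sketch self-contained.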
5. Convert to integers.
This step is optional. I think it's nice to have an integer result at the end. Okay, I
actually wanted to show you the astype() function:
fib = fib.astype(int)
print("Integers", fib)
This code gives us the following result, after snipping a bit for brevity:
Integers [ ... 13 21 34 ... 514229 ... ]
There we go:
[      2       8      34     144     610    2584   10946   46368  196418  832040 3524578]
How it works...
In this recipe, we used the sqrt(), log(), arange(), astype(), and sum() functions.
Their description is as follows:
Function    Description
sqrt()      This function calculates the square root of array elements (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.sqrt.html)
log()       This function calculates the natural logarithm of array elements (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.log.html#numpy.log)
arange()    This function creates an array with the specified range (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html)
astype()    This function converts array elements to a specified data type (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.chararray.astype.html)
sum()
See also
The Indexing with Booleans recipe in Chapter 2, Advanced Indexing and Array Concepts
$$N = cd = (a + b)(a - b) = a^2 - b^2$$
We can apply the factorization recursively until we get the required prime factors.
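A single factorization step can be illustrated on a small number. The sketch below is an illustration under stated assumptions, not the recipe's code: fermat_split() and its tries parameter are hypothetical names, and the square check uses integer arithmetic rather than the modf() approach of the recipe. For N = 5959, the first trial value whose a² - N is a perfect square is a = 80 (6400 - 5959 = 441 = 21²).

```python
import numpy as np


def fermat_split(n, tries=1000):
    """Split odd n into factors (a + b, a - b) with a**2 - b**2 == n."""
    start = int(np.ceil(np.sqrt(n)))
    a = np.arange(start, start + tries, dtype=np.int64)
    b2 = a ** 2 - n
    # Round the square root and verify it exactly with integers.
    b = np.rint(np.sqrt(b2)).astype(np.int64)
    hits = np.flatnonzero(b * b == b2)
    if hits.size == 0:
        raise ValueError("no square found; increase tries")
    a0, b0 = int(a[hits[0]]), int(b[hits[0]])
    return a0 + b0, a0 - b0


print(fermat_split(5959))  # (101, 59)
```

Applying the same step to each returned factor until one of them is 1 gives the recursion described above.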
How to do it...
The following is the entire code needed to solve the problem of finding the largest prime factor
of 600851475143 (see the fermatfactor.py file in this book's code bundle):
from __future__ import print_function
import numpy as np
#The prime factors of 13195 are 5, 7, 13 and 29.
#What is the largest prime factor of the number 600851475143 ?
N = 600851475143
LIM = 10 ** 6
def factor(n):
    #1. Create array of trial values
    a = np.ceil(np.sqrt(n))
    lim = min(n, LIM)
    a = np.arange(a, a + lim)
    b2 = a ** 2 - n

    #2. Check whether b is a square
    fractions = np.modf(np.sqrt(b2))[0]

    #3. Find 0 fractions
    indices = np.where(fractions == 0)

    #4. Find the first occurrence of a 0 fraction
    a = np.ravel(np.take(a, indices))[0]
    # Or a = a[indices][0]

    a = int(a)
    b = np.sqrt(a ** 2 - n)
    b = int(b)
    c = a + b
    d = a - b

    if c == 1 or d == 1:
        return

    print(c, d)
    factor(c)
    factor(d)

factor(N)
We used the ceil() function to return the ceiling of the input, element-wise.
2. Get the fractional part of the b array.
We are now supposed to check whether b is a square. Use the NumPy modf()
function to get the fractional part of the b array:
fractions = np.modf(np.sqrt(b2))[0]
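To see what modf() returns, here is a small sketch on made-up values (the values array is an assumption chosen for illustration). The function returns two arrays, the fractional parts and the integral parts, so a zero fractional part of the square root marks a perfect square. For very large inputs, floating-point rounding could make this check unreliable, so treat it as a sketch rather than a general test.

```python
import numpy as np

values = np.array([16.0, 17.0, 25.0, 26.5])

# modf() splits each element into (fractional part, integral part).
fractional, integral = np.modf(np.sqrt(values))

is_square = fractional == 0
print(is_square)
print(integral)
```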
How it works...
We applied the Fermat factorization recursively using the ceil(), modf(), where(),
ravel(), and take() NumPy functions. The description of these functions is as follows:
Function    Description
ceil()      Calculates the ceiling of array elements (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.ceil.html)
modf()
where()
ravel()
take()
How to do it...
The following is the complete program from the palindromic.py file in this book's
code bundle:
import numpy as np
#A palindromic number reads the same both ways.
#The largest palindrome made from the product of two 2-digit numbers
#is 9009 = 91 x 99.
#Find the largest palindrome made from the product of two 3-digit
#numbers.
How it works...
We saw the outer() function in action. This function returns the outer product of two arrays
(http://en.wikipedia.org/wiki/Outer_product). The outer product of two vectors
(one-dimensional lists of numbers) creates a matrix. This is the opposite of an inner product,
which returns a scalar number for two vectors. The outer product is used in physics, signal
processing, and statistics. The sort() function returns a sorted copy of an array.
There's more...
It might be a good idea to check the result. Find out which two 3-digit numbers produce
our palindromic number by modifying the code a bit. Try implementing the last step in the
NumPy way.
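One possible approach to this exercise is sketched below. It is not the book's palindromic.py; the variable names are illustrative, and the palindrome check itself is done with plain Python strings rather than in a vectorized way. The where() call on the two-dimensional product matrix recovers the row and column indices, and therefore the two factors.

```python
import numpy as np

a = np.arange(100, 1000)

# products[i, j] == a[i] * a[j]
products = np.outer(a, a)

# unique() returns the distinct products in ascending order.
candidates = np.unique(products)
palindromes = [int(p) for p in candidates if str(p) == str(p)[::-1]]
largest = palindromes[-1]

# Locate the largest palindrome in the matrix to recover its factors.
rows, cols = np.where(products == largest)
factors = (int(a[rows[0]]), int(a[cols[0]]))

print(largest, factors)  # 906609 (913, 993)
```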
Ax = x
Another way to look at this is as the eigenvector (see http://en.wikipedia.org/wiki/
Eigenvalues_and_eigenvectors) for eigenvalue 1. Eigenvalues and eigenvectors are
fundamental concepts of linear algebra with applications in quantum mechanics, machine
learning, and other sciences.
How to do it...
The following is the complete code for the steady state vector example from the steady_
state_vector.py file in this book's code bundle:
from __future__ import print_function
from matplotlib.finance import quotes_historical_yahoo
from datetime import date
import numpy as np
today = date.today()
Now we need to obtain the data:
1. Obtain 1 year of data.
One way we can do this is with matplotlib (refer to the Installing matplotlib recipe in
Chapter 1, Winding Along with IPython, if necessary). We will retrieve the data of the
last year. Here is the code to do this:
today = date.today()
start = (today.year - 1, today.month, today.day)
quotes = quotes_historical_yahoo('AAPL', start, today)
The close price is the fifth number in each tuple. We should have a list of about
253 close prices now.
3. Determine the states.
We can determine the states by subtracting the price of sequential days with the
diff() NumPy function. The state is then given by the sign of the difference. The
sign() NumPy function returns -1 for a negative number, 1 for a positive number,
and 0 otherwise:
states = np.sign(np.diff(close))
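A tiny example with made-up prices (not real quotes) shows how the two functions combine. Note that diff() returns one element fewer than its input.

```python
import numpy as np

# Hypothetical close prices for five consecutive days.
close = np.array([10.0, 10.5, 10.5, 10.2, 10.8])

changes = np.diff(close)   # day-to-day price differences
states = np.sign(changes)  # 1 = up, 0 = flat, -1 = down

print(states)  # [ 1.  0. -1.  1.]
```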
Up
Flat
Down
What the aforementioned code does is compute the transition probabilities for each
possible transition based on the number of occurrences and additive smoothing.
On one of the test runs, I got the following matrix:
[[ 0.5047619   0.00952381  0.48571429]
 [ 0.33333333  0.33333333  0.33333333]
 [ 0.33774834  0.00662252  0.65562914]]
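The code for this step is not part of this sample, so the following is only a sketch of how transition probabilities with additive smoothing could be computed. The function name transition_matrix(), the smoothing constant k=1, the state order (up, flat, down), and the example states are all assumptions made for illustration.

```python
import numpy as np


def transition_matrix(states, k=1.0):
    """Estimate P[i, j] = P(next is signs[j] | current is signs[i])."""
    signs = [1, 0, -1]  # assumed order: up, flat, down
    counts = np.zeros((3, 3))
    for i, s_from in enumerate(signs):
        # Positions in state s_from that have a successor.
        idx = np.where(states[:-1] == s_from)[0]
        for j, s_to in enumerate(signs):
            counts[i, j] = np.sum(states[idx + 1] == s_to)
    # Additive smoothing: add k to every count so no probability is zero.
    return (counts + k) / (counts.sum(axis=1, keepdims=True) + 3 * k)


states = np.array([1, 1, -1, 1, -1, -1, 1, 0, 1])
P = transition_matrix(states)
print(P)
print(P.sum(axis=1))
```

With smoothing, a state that never occurs gets a uniform row of one third each, which is the pattern visible in the middle row of the matrix above.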
The eig() function returns an array containing the eigenvalues and another array
containing the eigenvectors:
(array([ 1.        ,  0.16709381,  0.32663057]), array([[  5.77350269e-01,   7.31108409e-01,   7.90138877e-04],
       [  5.77350269e-01,  -4.65117036e-01,  -9.99813147e-01],
       [  5.77350269e-01,  -4.99145907e-01,   1.93144030e-02]]))
0.57735027
0.57735027
0.57735027]
0.57735027]
How it works...
The values for the eigenvector we get are not normalized. Since we are dealing with
probabilities, they should sum up to one. The diff(), sign(), and eig() functions were
introduced in this example. Their descriptions are as follows:
Function    Description
diff()      Calculates the discrete difference; by default, of the first order (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.diff.html).
sign()
eig()
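The normalization mentioned above can be sketched as follows, using the matrix printed earlier in this recipe. The helper name unit_sum_eigenvector() is hypothetical. Which of the matrix or its transpose is the right input depends on the row/column convention of your transition matrix; that choice is an assumption here, so the sketch shows both.

```python
import numpy as np

P = np.array([[0.5047619, 0.00952381, 0.48571429],
              [0.33333333, 0.33333333, 0.33333333],
              [0.33774834, 0.00662252, 0.65562914]])


def unit_sum_eigenvector(A):
    """Return the eigenvector for the eigenvalue closest to 1, scaled to sum to 1."""
    eigenvalues, eigenvectors = np.linalg.eig(A)
    k = np.argmin(np.abs(eigenvalues - 1))
    v = np.real(eigenvectors[:, k])  # eigenvectors are stored as columns
    return v / v.sum()


x = unit_sum_eigenvector(P)     # satisfies P x = x
pi = unit_sum_eigenvector(P.T)  # satisfies pi P = pi

print(x, x.sum())
print(pi, pi.sum())
```

Dividing by the sum also takes care of the arbitrary sign that eig() may give to an eigenvector.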
See also
$$y = cx^k$$
The Pareto principle (see http://en.wikipedia.org/wiki/Pareto_principle) for
instance, is a power law. It states that wealth is unevenly distributed. This principle tells us that
if we group people by their wealth, the size of the groups will vary exponentially. To put it simply,
there are not a lot of rich people, and there are even fewer billionaires; hence the one percent.
Assume that there is a power law in the log returns of closing stock prices. This is a big
assumption, of course, but power law assumptions seem to pop up all over the place.
We don't want to trade too often, because of the transaction costs involved in each trade.
Let's say that we would prefer to buy and sell once a month based on a significant correction
(in other words, a big drop). The issue is to determine an appropriate signal given that we
want to initiate a transaction for every 1 out of about 20 days.
How to do it...
The following is the complete code from the powerlaw.py file in this book's code bundle:
from matplotlib.finance import quotes_historical_yahoo
from datetime import date
import numpy as np
import matplotlib.pyplot as plt
#1. Get close prices.
today = date.today()
start = (today.year - 1, today.month, today.day)
quotes = quotes_historical_yahoo('IBM', start, today)
close = np.array([q[4] for q in quotes])
#2. Get positive log returns.
logreturns = np.diff(np.log(close))
pos = logreturns[logreturns > 0]
#3. Get frequencies of returns.
counts, rets = np.histogram(pos)
# Indices of the bins with non-zero counts
indices0 = np.where(counts != 0)
rets = rets[:-1] + (rets[1] - rets[0])/2
# Could generate divide by 0 warning
freqs = 1.0/counts
freqs = np.take(freqs, indices0)[0]
rets = np.take(rets, indices0)[0]
freqs = np.log(freqs)
#4. Fit the frequencies and returns to a line.
p = np.polyfit(rets,freqs, 1)
#5. Plot the results.
plt.title('Power Law')
plt.plot(rets, freqs, 'o', label='Data')
plt.plot(rets, p[0] * rets + p[1], label='Fit')
plt.xlabel('Log Returns')
plt.ylabel('Log Frequencies')
plt.legend()
plt.grid()
plt.show()
We get a nice plot of the linear fit, returns, and frequencies, like this:
How it works...
The histogram() function calculates the histogram of a dataset. It returns the histogram
values and bin edges. The polyfit() function fits data to a polynomial of a given order. In
this case, we chose a linear fit. We discovered a power law; you have to be careful making
such claims, but the evidence looks promising.
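Both functions can be tried on synthetic data where the answer is known in advance. The sketch below is independent of the stock data: the constants c_true and k_true are made-up values for a law of the form y = c x^k, and the random sample exists only to show the shapes that histogram() returns.

```python
import numpy as np

# Synthetic data that follows y = c * x**k exactly.
c_true, k_true = 3.0, -2.0
x = np.linspace(1.0, 10.0, 50)
y = c_true * x ** k_true

# Taking logs turns the power law into a line:
# log(y) = k * log(x) + log(c), so a degree-1 fit recovers k and c.
k_fit, log_c_fit = np.polyfit(np.log(x), np.log(y), 1)
print(k_fit, np.exp(log_c_fit))

# histogram() returns one more bin edge than it returns counts.
rng = np.random.default_rng(0)
counts, edges = np.histogram(rng.normal(size=500), bins=10)
centers = (edges[:-1] + edges[1:]) / 2
print(len(counts), len(edges), counts.sum())
```

The bin centers computed here play the same role as the shifted rets array in the recipe.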
See also
Getting ready
If necessary, install matplotlib and SciPy. Refer to the See also section for the
corresponding recipes.
How to do it...
The following is the complete code from the periodic.py file in this book's code bundle:
from __future__ import print_function
from matplotlib.finance import quotes_historical_yahoo
from datetime import date
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
#1. Get close prices.
today = date.today()
start = (today.year - 1, today.month, today.day)
quotes = quotes_historical_yahoo('AAPL', start, today)
close = np.array([q[4] for q in quotes])
#2. Get log returns.
logreturns = np.diff(np.log(close))
#3. Calculate breakout and pullback
freq = 0.02
breakout = scipy.stats.scoreatpercentile(logreturns, 100 * (1 - freq))
pullback = scipy.stats.scoreatpercentile(logreturns, 100 * freq)
#4. Generate buys and sells
buys = np.compress(logreturns < pullback, close)
sells = np.compress(logreturns > breakout, close)
print(buys)
print(sells)
print(len(buys), len(sells))
print(sells.sum() - buys.sum())
#5. Plot a histogram of the log returns
plt.title('Periodic Trading')
plt.hist(logreturns)
plt.grid()
plt.xlabel('Log Returns')
plt.ylabel('Counts')
plt.show()
[  77.76375466   76.69249773  102.72        101.2          98.57      ]
[  74.95502967   76.55980292   74.13759123   80.93512599   98.22      ]
5 5
-52.1387025726
Thus, we have a loss of 52 dollars if we buy and sell an AAPL share five times. When
I ran the script, the entire market was in recovery mode after a correction. You may
want to look at not just the AAPL stock price but maybe the ratio of AAPL and SPY.
SPY can be used as a proxy for the U.S. stock market.
3. Plot a histogram of the log returns.
Just for fun, let's plot the histogram of the log returns with matplotlib:
plt.title('Periodic Trading')
plt.hist(logreturns)
plt.grid()
plt.xlabel('Log Returns')
plt.ylabel('Counts')
plt.show()
How it works...
We encountered the compress() function, which returns an array containing the array
elements of the input that satisfy a given condition. The input array remains unchanged.
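The behavior of compress() is easiest to see on a few made-up prices (the close array below is hypothetical). One detail worth noticing is alignment: the log returns array is one element shorter than the price array, and compress() then truncates its output to the length of the condition, so condition element i selects price i.

```python
import numpy as np

close = np.array([100.0, 98.0, 101.0, 97.0, 103.0])
logreturns = np.diff(np.log(close))  # four returns for five prices

condition = logreturns < 0

# condition[i] lines up with close[i], the price before the drop.
before_drop = np.compress(condition, close)

# Slicing off the first price aligns each return with the day it ends on.
after_drop = np.compress(condition, close[1:])

print(before_drop)  # [100. 101.]
print(after_drop)   # [98. 97.]
```

Which alignment you want depends on the trading rule you have in mind; the input array is left untouched either way.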
See also
The Installing SciPy recipe in Chapter 2, Advanced Indexing and Array Concepts
Getting ready
If necessary, install matplotlib. Refer to the See also section for the corresponding recipe.
How to do it...
The following is the complete code from the random_periodic.py file in this book's
code bundle:
from __future__ import print_function
from matplotlib.finance import quotes_historical_yahoo
from datetime import date
import numpy as np
import matplotlib.pyplot as plt
def get_indices(high, size):
    #2. Generate random indices
    return np.random.randint(0, high, size)
#1. Get close prices.
2. Simulate trades.
You can simulate trades with the random indices from the previous step. Use the
take() NumPy function to extract random close prices from the array of step 1:
buys = np.take(close, get_indices(len(close), nbuys))
sells = np.take(close, get_indices(len(close), nbuys))
profits[i] = sells.sum() - buys.sum()
plt.hist(profits)
plt.xlabel('Profits')
plt.ylabel('Counts')
plt.grid()
plt.show()
Here is a screenshot of the resulting histogram of 2,000 simulations for AAPL, with
five buys and sells in a year:
How it works...
We used the randint() function, which can be found in the numpy.random module. This
module contains more convenient random generators, as described in the following table:
Function    Description
rand()      Creates an array from a uniform distribution over [0, 1) with a shape based on dimension parameters. If no dimensions are specified, a single float is returned (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.rand.html).
randn()     Samples values from the normal distribution with mean 0 and variance 1. The dimension parameters function the same as for rand() (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.randn.html).
randint()   Returns an integer array given a low boundary, an optional high bound, and an optional output shape (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.randint.html).
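The three generators can be compared side by side. In this sketch the seed, the shapes, and the made-up close array are assumptions chosen for illustration; the last two lines mirror how the recipe feeds random indices into take().

```python
import numpy as np

np.random.seed(42)  # fixed seed so repeated runs give the same numbers

u = np.random.rand(2, 3)            # uniform floats in [0, 1), shape (2, 3)
z = np.random.randn(1000)           # standard normal samples
idx = np.random.randint(0, 10, 5)   # five integers from 0 up to, not including, 10

# Random indices select random elements, as in the simulated trades.
close = np.linspace(100.0, 110.0, 10)
picked = np.take(close, idx)

print(u.shape, z.shape, idx, picked)
```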
See also
How to do it...
The first mandatory step is to create a list of natural numbers:
1. Create a list of consecutive integers. NumPy has the arange() function for that:
a = np.arange(i, i + LIM, 2)
def sieve_primes(a, p):
    #2. Sieve out multiples of p
    a = a[a % p != 0]
    return a

for i in range(3, N, LIM):
    #1. Create a list of consecutive integers
    a = np.arange(i, i + LIM, 2)

    while len(primes) < P:
        a = sieve_primes(a, p)
        primes.append(p)
        p = a[0]

print(len(primes), primes[P-1])
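Because the setup of this program is not shown in this sample, the following self-contained sketch illustrates the same sieving idea on a smaller scale. The function first_primes() and its limit parameter are hypothetical names, and the structure differs slightly from the listing above: it works on a single array of odd numbers and raises an error if that array runs out.

```python
import numpy as np


def first_primes(count, limit=200000):
    """Return the first `count` primes, searching below `limit`."""
    a = np.arange(3, limit, 2)  # odd candidates; 2 is handled separately
    primes = [2]
    while len(primes) < count:
        if a.size == 0:
            raise ValueError("limit too small for the requested count")
        # The smallest survivor has no smaller prime factor, so it is prime.
        p = int(a[0])
        primes.append(p)
        # Boolean indexing removes p itself and all of its multiples.
        a = a[a % p != 0]
    return primes


print(first_primes(6))  # [2, 3, 5, 7, 11, 13]
```

Requesting more primes works the same way, as long as limit is large enough to contain them.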