Python Data Processing
5.1 Introduction
Data preparation is the process of constructing a clean dataset from one or
more sources such that the data can be fed into subsequent stages of a data
science pipeline. Common data preparation tasks include handling missing
values, outlier detection, feature/variable scaling and feature encoding. Data
preparation is often a time-consuming process.
Python provides a specialized library, Pandas, for data preparation and
analysis. Understanding the functions of this library is therefore of utmost
importance for beginners in data science. In this chapter, we shall mostly be
using Pandas along with Numpy and Matplotlib.
Since we shall be working with Numpy and Pandas, we import both. The
general practice for importing these libraries is as follows:
import pandas as pd
import numpy as np
Thus, each time we see pd and np, we are referring to an object or method
of these two libraries.
Pandas provides two main data structures:
• Series,
• DataFrame.
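A Series is a one-dimensional array of indexed data. The command that
created the Series shown below is not included in the source; a call
consistent with the output is:
myseries1 = pd.Series([1, -3, 5, -20])
myseries1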
Output:
0 1
1 -3
2 5
3 -20
dtype: int64
dtype: int64 means the data type of values in a Series is integer of 64 bits.
The structure of a Series object is simple. It consists of two arrays, index and
value, associated with each other. The first array (column) stores the index
of the data, whereas the second array (column) stores the actual values.
Pandas assigns numerical indices starting from 0 onwards if we do not
specify any index. It is sometimes preferable to create a Series object using
descriptive and meaningful labels. In this case, we can assign indices during
the constructor call as follows:
myseries2 = pd.Series([1, -3, 5, 20], index = ['a', 'b', 'c', 'd'])
myseries2
Output:
a 1
b -3
c 5
d 20
dtype: int64
Two attributes of the Series data structure, index and values, can be used
to view the index and the values separately.
myseries2.values
Output:
array([ 1, -3, 5, 20], dtype=int64)
myseries2.index
Output:
Index(['a', 'b', 'c', 'd'], dtype='object')
The elements of a Series can be accessed in the same way we access
array elements.
myseries2[2]
Output:
5
myseries2[0:2]
Output:
a 1
b -3
dtype: int64
myseries2['b']
Output:
-3
myseries2[['b','c']]
Output:
b -3
c 5
dtype: int64
Note the double brackets in the indexing above: we pass a list of labels
inside the square brackets.
We can select the value by index or label, and assign a different value to it,
for example:
myseries2['c'] = 10
myseries2
Output:
a 1
b -3
c 10
d 20
dtype: int64
We can create a Series object from an existing array as follows:
myarray = np.array([1,2,3,4])
myseries3 = pd.Series(myarray)
myseries3
Output:
0 1
1 2
2 3
3 4
dtype: int32
Most operations that can be performed on simple Numpy arrays are also
valid on a Series data structure. To facilitate data processing, additional
functions are provided by the Series data structure. We can use conditional
operators to filter or select values. For example, to get values greater than
2, we may use:
myseries3[myseries3 > 2]
Output:
2 3
3 4
dtype: int32
Mathematical operations can be performed on the data stored in a Series.
For example, to take the logarithm of the values stored in myseries2, we
enter the following command that uses the log function defined in the
Numpy library.
np.log(myseries2)
Output:
a 0.000000
b NaN
c 2.302585
d 2.995732
dtype: float64
Note that the logarithm of a negative number is undefined; it is returned as
NaN, standing for Not a Number. Numpy therefore issues the following
warning.
C:\Anaconda3_Python\lib\site-packages\pandas\core\series.py:679:
RuntimeWarning: invalid value encountered in log
result = getattr(ufunc, method)(*inputs, **kwargs)
NaN values are used to indicate an empty field or a quantity that is not
definable. We can define NaN values by typing np.NaN. The isnull() and
notnull() functions of Pandas are useful to identify the indices without a
value, i.e., NaN.
We create a Series mycolors to perform some common operations that can
be applied to a Series.
mycolors = pd.Series([1,2,3,4,5,4,3,2],
index=['white','black','blue','green','green','yellow','black','red'])
mycolors
Output:
white 1
black 2
blue 3
green 4
green 5
yellow 4
black 3
red 2
dtype: int64
The Series mycolors contains some duplicate values. We can get the
unique values from the Series by typing:
mycolors.unique()
Output:
array([1, 2, 3, 4, 5], dtype=int64)
Another useful function, value_counts(), returns how many times each
value is present in a Series.
mycolors.value_counts()
Output:
4 2
3 2
2 2
5 1
1 1
dtype: int64
This output indicates that the values 2, 3 and 4 are present twice each,
whereas values 1 and 5 are present once only.
To find whether particular values are contained in a Series, we use the
isin() function that evaluates membership. It returns the Boolean values
True or False, which can be used to filter the data present in a Series. For
example, we search for the values 5 and 7 in the Series mycolors by typing:
mycolors.isin([5,7])
Output:
white False
black False
blue False
green False
green True
yellow False
black False
red False
dtype: bool
We can use the Boolean values returned by mycolors.isin([5,7]) as indices
to the Series mycolors to get the filtered Series.
mycolors[mycolors.isin([5,7])]
Output:
green 5
dtype: int64
A Series can be created from an already defined dictionary.
mydict = {'White': 1000, 'Black': 500, 'Red': 200, 'Green': 1000}
myseries = pd.Series(mydict)
myseries
Output:
White 1000
Black 500
Red 200
Green 1000
dtype: int64
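A DataFrame is a two-dimensional tabular data structure that can be created
from a dictionary whose keys become the column names. The original
definition of mydata and myframe is not shown in the source; an illustrative
version (the employee names and experience values below are hypothetical)
is:
mydata = {'Employee Name': ['A', 'B', 'C', 'D', 'E'],
'Experience (years)': [5, 10, 2, 8, 3]}
myframe = pd.DataFrame(mydata)
myframe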
Output:
We can select a few columns from the DataFrame in any arbitrary order,
using the columns option. The columns will be created in the order we specify
irrespective of how they are stored within the dictionary object. For example,
myframe2 = pd.DataFrame(mydata, columns = ['Experience (years)',
'Employee Name'])
myframe2
Output:
If we use the index option, we can specify indices of our choice to the
DataFrame.
myframe3 = pd.DataFrame(mydata, index=['zero', 'one','two','three','four'])
myframe3
Output:
A DataFrame can also be created by passing three arguments to its
constructor:
1. a data matrix,
2. an array of labels for the indices (index option), and
3. an array containing the column names (columns option).
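First, we create a data matrix using Numpy (this command is reconstructed
from its later use in the constructor call below):
np.arange(15).reshape((3,5))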
Output:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
This is a 2-dimensional array or a matrix of size 3x5.
myframe4 = pd.DataFrame(np.arange(15).reshape((3,5)),
index=['row0','row1','row2'], columns=['col0','col1','col2','col3', 'col4'])
myframe4
Output:
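Pandas allows us to concatenate Series and DataFrame objects. To
demonstrate this, we work with two Series of random numbers. Their
creation is not shown in the source; a version consistent with the outputs
below is (the random values will differ on each run):
myseries1 = pd.Series(np.random.rand(5))
myseries1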
Output:
0 0.865165
1 0.305467
2 0.692341
3 0.859180
4 0.004683
dtype: float64
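Similarly, the second Series (reconstructed here) uses indices 5 through 9:
myseries2 = pd.Series(np.random.rand(5), index=[5, 6, 7, 8, 9])
myseries2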
Output:
5 0.670931
6 0.762998
7 0.200184
8 0.266258
9 0.296408
dtype: float64
To concatenate myseries1 and myseries2, type the following command.
pd.concat([myseries1,myseries2])
Output:
0 0.865165
1 0.305467
2 0.692341
3 0.859180
4 0.004683
5 0.670931
6 0.762998
7 0.200184
8 0.266258
9 0.296408
dtype: float64
The concat() function works on axis = 0 (rows), by default, to return a Series.
If we set axis = 1 (columns), then the result will be a DataFrame.
pd.concat([myseries1, myseries2],axis=1)
Output:
If we use another useful option, keys, along the axis = 1, the provided keys
become the column names of the DataFrame.
pd.concat([myseries1, myseries2],axis=1, keys=['series1','series2'])
Output:
The function concat() is applicable to DataFrames as well. Let us create
two data frames as follows.
myframe1 = pd.DataFrame({'Student Name': ['A','B','C'], 'Sex':['M','F','M'],
'Age': [10, 16, 17], 'School': ['Primary','High', 'High']})
myframe1
Output:
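The second frame, myframe2, is created similarly. Its exact contents are
not shown in the source; an illustrative frame that shares the Student Name
and School columns with myframe1, and adds a hypothetical Grade column
of its own, is:
myframe2 = pd.DataFrame({'Student Name': ['A', 'B', 'D'],
'Grade': [5, 10, 11], 'School': ['Primary', 'High', 'High']})
myframe2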
Output:
We concatenate these 2 data frames as follows.
pd.concat([myframe1, myframe2])
Output:
Note that NaN values have been placed in those columns whose
information is not present in the individual data frames.
5.4.2 Merging Data
The process of merging consists of combining data through the connection
of rows using one or more keys. The keys are common columns in 2
DataFrames to be merged. Based on the keys, it is possible to obtain new
data in a tabular form. The merge() function performs this kind of operation.
We may merge myframe1 and myframe2 by typing the following command.
pd.merge(myframe1, myframe2)
Output:
Note the difference between the outputs of concat() and merge(). The
merge operation combines only those rows for which the key entry,
Student Name, is the same. However, the concat() operation returns all
the rows, even with NaN values.
Consider the case when we have multiple key columns, and we want to
merge on the basis of only one column. In this case, we can use the option
on to specify the key for merging the data.
pd.merge(myframe1, myframe2, on='School')
Output:
In this case, the merge operation renames those key attributes which are
common to both data frames but not used for merging. These become
Student Name_x and Student Name_y. If we merge on the basis of Student
Name, we get a completely different result.
pd.merge(myframe1, myframe2, on='Student Name')
Output:
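Sometimes concatenation and merging cannot combine the data the way
we want, for example, when two Series have partially overlapping indices.
The combine_first() function handles this case. The creation of the two
Series used below is not shown in the source; the values here are
reconstructed from the outputs that follow:
myseries1 = pd.Series([50, 40, 30, 20, 10], index=[1, 2, 3, 4, 5])
myseries1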
Output:
1 50
2 40
3 30
4 20
5 10
dtype: int64
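The second Series (reconstructed) is:
myseries2 = pd.Series([100, 200, 300, 400], index=[3, 4, 5, 6])
myseries2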
Output:
3 100
4 200
5 300
6 400
dtype: int64
To keep the values from myseries1, we combine both series as given below.
myseries1.combine_first(myseries2)
Output:
1 50.0
2 40.0
3 30.0
4 20.0
5 10.0
6 400.0
dtype: float64
To keep the values from myseries2, we combine both series as given below.
myseries2.combine_first(myseries1)
Output:
1 50.0
2 40.0
3 100.0
4 200.0
5 300.0
6 400.0
dtype: float64
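We can remove rows from a DataFrame with the drop() function. The
creation of myframe5 is not shown in the source; a minimal sketch with a
default integer index is:
myframe5 = pd.DataFrame(np.arange(9).reshape(3, 3),
columns=['col0', 'col1', 'col2'])
myframe5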
To remove the row indexed 1, we use:
myframe5.drop(1)
Output:
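Datasets often contain duplicate rows. The creation of the DataFrame
item_frame used below is not shown in the source; an illustrative frame in
which rows 0 and 4 are identical is:
item_frame = pd.DataFrame({'color': ['white', 'red', 'blue', 'red', 'white'],
'value': [2, 1, 3, 4, 2]})
item_frame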
We note that the rows indexed 0 and 4 are duplicates. To detect duplicate
rows, we use the duplicated() function.
item_frame.duplicated()
Output:
0 False
1 False
2 False
3 False
4 True
dtype: bool
To display the duplicate entries only, we pass the Boolean Series returned
by item_frame.duplicated() as an index to the DataFrame item_frame.
item_frame[item_frame.duplicated()]
Output:
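To study outlier detection, we use a DataFrame, student_frame, that stores
the ages of seven students. Its creation is not shown in the source; a
minimal reconstruction whose Age column reproduces the statistics quoted
below is:
student_frame = pd.DataFrame({'Age': [11, 12, 13, 14, 15, 16, 60]})
The describe() function computes summary statistics of the numerical
columns.
student_frame.describe()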
Note that the statistics of the numerical columns are calculated and displayed. In
our case, Age is the only numerical column in student_frame. The presence
of outliers shifts the statistics. We expect the mean or average age of
students to be around 15 years. However, the average age of students is
20.142857 due to the presence of the outlier, i.e., 60 years.
The statistic count gives the number of elements in the column, mean gives
the average value, std provides standard deviation which is the average
deviation of data points from the mean of data, min is the minimum value,
max is the maximum value, 25%, 50% and 75% give the 25th percentile (the
first quartile - Q1), the 50th percentile (the median - Q2) and the 75th
percentile (the third quartile - Q3) of the values.
There are different ways to detect outliers. For example, if the difference
between the mean and the median values is too high, it can be an indication
of the presence of outliers. A better approach to detect numeric outliers is
to use the InterQuartile Range (IQR). For the IQR, we divide our data into
four quarters after we sort it in ascending order.
Any data point that lies outside some small multiple of the difference between
the third and the first quartiles is considered as an outlier. For example,
IQR = Q3 – Q1 = 15.5 – 12.5 = 3
Using a typical interquartile multiplier value k=1.5, we can find the lower and
upper values beyond which data points can be considered as outliers.
IQR x 1.5 = 4.5
We subtract this value, 4.5, from the Q1 to find the lower limit, and add 4.5
to the Q3 to find the upper limit. Thus,
Lower limit = Q1-4.5 = 8
Upper limit = Q3+4.5 = 20
Now any value lesser than 8 or greater than 20 can be treated as an outlier.
A popular plot that shows these quartiles is known as a Box and Whisker plot
shown in Figure 5.1.
Figure 5.1: A Box and Whisker plot. The length of the box represents IQR.
To calculate lower and upper limits, we can enter the following script.
Q1 = student_frame.quantile(0.25) # 25% or quartile 1 (Q1)
Q3 = student_frame.quantile(0.75) # 75% or quartile 3 (Q3)
IQR = Q3 - Q1 # InterQuartile Range (IQR)
IQR_mult = IQR*1.5
lower= Q1 - IQR_mult
upper= Q3 + IQR_mult
print("The lower limit is = ", lower)
print("The upper limit is = ", upper)
Output:
The lower limit is = Age 8.0
dtype: float64
The upper limit is = Age 20.0
dtype: float64
Now we are able to filter our DataFrame, student_frame, to remove outliers.
We access the column Age using student_frame['Age'], and compare it with
int(lower) and int(upper). The result is used as an index to student_frame.
Finally, student_frame is updated by making an assignment as follows:
student_frame = student_frame[student_frame['Age'] > int(lower)]
student_frame = student_frame[student_frame['Age'] < int(upper)]
student_frame
Output:
Missing values in a dataset can be:
• ignored,
• filled-in, or
• removed (dropped).
Ignoring the missing values is often not a good solution because it leads to
erroneous results. Let us create a Series object with missing values.
myseries4 = pd.Series([10, 20, 30, None, 40, 50, np.NaN],
index=[0,1,2,3,4,5,6])
print (myseries4.isnull())
myseries4
Output:
0 False
1 False
2 False
3 True
4 False
5 False
6 True
dtype: bool
0 10.0
1 20.0
2 30.0
3 NaN
4 40.0
5 50.0
6 NaN
dtype: float64
To get the indices where values are missing, we may type:
myseries4[myseries4.isnull()]
Output:
3 NaN
6 NaN
dtype: float64
We can drop the missing values by using the function dropna().
myseries4_dropped = myseries4.dropna()
myseries4_dropped
Output:
0 10.0
1 20.0
2 30.0
4 40.0
5 50.0
dtype: float64
The process of filling-in the missing values is called “data imputation”. One
of the widely used techniques is mean value imputation.
myseries4_filled = myseries4.fillna(myseries4.mean())
myseries4_filled
Output:
0 10.0
1 20.0
2 30.0
3 30.0
4 40.0
5 50.0
6 30.0
dtype: float64
Here, the values at indices 3 and 6 are filled in by the mean or average of
the rest of the valid values, where the mean is calculated as
Mean = (10+20+30+40+50)/5 = 30.
Besides the mean, we can impute the missing data using the median by
typing myseries4.median() in place of myseries4.mean().
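The examples in this section use a DataFrame, myframe, with the columns
color, object and price. Its creation is not shown in the source; an
illustrative version (the price values below are hypothetical) is:
myframe = pd.DataFrame({'color': ['blue', 'green', 'yellow', 'red', 'white'],
'object': ['ball', 'pencil', 'pen', 'mug', 'paper'],
'price': [1.2, 1.0, 0.6, 0.9, 1.7]})
myframe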
We create a dictionary to perform mapping.
mymap = {'blue':'dark blue', 'green': 'light green'}
Next, this dictionary is provided as an input argument to the replace()
function.
myframe.replace(mymap)
Output:
Note that the original colors blue and green have been replaced by dark
blue and light green as mapped inside the dictionary mymap. The function
replace() can also be used to replace the NaN values contained inside a
data structure.
myseries = pd.Series([1,2,np.nan,4,5,np.nan])
myseries.replace(np.nan,0)
Output:
0 1.0
1 2.0
2 0.0
3 4.0
4 5.0
5 0.0
dtype: float64
To add a new column to an existing DataFrame, we again create a dictionary
object that serves as a map.
mymap2 = {'ball':'round', 'pencil':'long', 'pen': 'long', 'mug': 'cylindrical',
'paper':'rectangular'}
myframe['shape']=myframe['object'].map(mymap2)
myframe
Output:
We use the map() function that takes the dictionary as its input argument,
and maps a particular column of the DataFrame to create a new column. In
our case, the column named object is used for the mapping.
Finally, we can rename the indices of a DataFrame using the function
rename(). We create the new indices using a dictionary.
reindex = {0: 'first', 1: 'second', 2: 'third', 3: 'fourth', 4: 'fifth'}
myframe = myframe.rename(reindex)
myframe
Output:
Note that we rename the indices, and assign the result of the right-hand side
to myframe to update it. If this assignment operation is not performed,
myframe will not be updated.
5.5.5 Discretization and Binning
Occasionally when we have a large amount of data, we want to transform
this into discrete categories to facilitate the analysis. For instance, we can
divide the range of values of the data in relatively smaller intervals or
categories to discover the statistics within each interval. Suppose we gather
data from an experimental study, and store it in a list.
readings = [34, 39, 82, 75, 16, 17, 15, 74, 37, 68, 22, 92, 99, 54, 39, 96, 17,
36, 91, 86]
We find that the range of data values is 0 to 100. Thus, we can uniformly
divide this interval, suppose, into four equal parts (bins):
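The bin edges and the call to cut() are not shown in the source; they are
reconstructed here from their later use (the names bins and mycategory
appear in the commands that follow):
bins = [0, 25, 50, 75, 100]
mycategory = pd.cut(readings, bins)
mycategory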
Output:
[(25, 50], (25, 50], (75, 100], (50, 75], (0, 25], ..., (75, 100], (0, 25], (25, 50],
(75, 100], (75, 100]]
Length: 20
Categories (4, interval[int64]): [(0, 25] < (25, 50] < (50, 75] < (75, 100]]
We get 4 categories or intervals when we run the function cut(), i.e., [(0, 25]
< (25, 50] < (50, 75] < (75, 100]]. Note that each category has the lower limit
with a parenthesis and the upper limit with a bracket. This is consistent with
the mathematical notation used to indicate intervals. In the case of a square
bracket, the number belongs to the range, and in the case of a parenthesis,
the number does not belong to the interval. In (0, 25], 0 is excluded, whereas
25 is included. To count the number of elements in each bin, we may write:
pd.value_counts(mycategory)
Output:
(75, 100] 6
(25, 50] 5
(0, 25] 5
(50, 75] 4
dtype: int64
In place of numbers, we can give meaningful names to the bins.
bin_names = ['Poor','Below Average','Average','Good']
pd.cut(readings, bins, labels=bin_names)
Output:
[Below Average, Below Average, Good, Average, Poor, ..., Good, Poor,
Below Average, Good, Good]
Length: 20
Categories (4, object): [Poor < Below Average < Average < Good]
The Pandas library provides the function qcut() that divides the data into
quantiles. qcut() ensures that the number of occurrences in each bin is
equal, but the ranges of the bins may vary.
pd.qcut(readings, 4)
Output:
[(31.0, 46.5], (31.0, 46.5], (46.5, 83.0], (46.5, 83.0], (14.999, 31.0], ..., (83.0,
99.0], (14.999, 31.0], (31.0, 46.5], (83.0, 99.0], (83.0, 99.0]]
Length: 20
Categories (4, interval[float64]): [(14.999, 31.0] < (31.0, 46.5] < (46.5, 83.0]
< (83.0, 99.0]]
To check the number of elements in each bin, we type:
pd.value_counts(pd.qcut(readings,4))
Output:
(83.0, 99.0] 5
(46.5, 83.0] 5
(31.0, 46.5] 5
(14.999, 31.0] 5
dtype: int64
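The grouping examples below use a fresh DataFrame myframe with the
columns color, object and price. Its creation is not shown in the source; the
values here are reconstructed from the group statistics that follow:
myframe = pd.DataFrame({'color': ['blue', 'white', 'red', 'red', 'white'],
'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
'price': [1.2, 1.0, 0.6, 0.9, 1.7]})
myframe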
Note that the column color has 2 entries for both white and red. If we want
to group the data based upon the column color, for example, we may type.
mygroup = myframe['price'].groupby(myframe['color'])
mygroup.groups
Output:
{'blue': Int64Index([0], dtype='int64'),
'red': Int64Index([2, 3], dtype='int64'),
'white': Int64Index([1, 4], dtype='int64')}
Thus, we get 3 distinct groups, blue, red and white, by invoking the attribute
groups.
mygroup.mean()
Output:
color
blue 1.20
red 0.75
white 1.35
Name: price, dtype: float64
mygroup.sum()
Output:
color
blue 1.2
red 1.5
white 2.7
Name: price, dtype: float64
Data aggregation can be performed using more than one column. For
instance, we may group data by both color and object. This is called
hierarchical grouping. We may type the following commands.
mygroup2 = myframe['price'].groupby([myframe['color'],myframe['object']])
mygroup2.groups
Output:
{('blue', 'ball'): Int64Index([0], dtype='int64'),
('red', 'paper'): Int64Index([3], dtype='int64'),
('red', 'pencil'): Int64Index([2], dtype='int64'),
('white', 'mug'): Int64Index([4], dtype='int64'),
('white', 'pen'): Int64Index([1], dtype='int64')}
Let us create a new dataframe myframe2 that is the same as myframe
except for an extra entry ['red','pencil',0.8] at index 5. Note that the
assignment below does not copy the data; myframe2 and myframe refer to
the same underlying object (use myframe.copy() for an independent copy).
myframe2 = myframe
myframe2.loc[5]=['red','pencil',0.8]
myframe2
Output:
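Because the price Series extracted earlier by myframe['price'] does not see
the newly added row, the grouping is re-evaluated before computing the
statistics (this step is implied by the outputs below, which include the new
entry):
mygroup2 = myframe2['price'].groupby([myframe2['color'], myframe2['object']])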
mygroup2.mean()
Output:
color object
blue ball 1.2
red paper 0.9
pencil 0.7
white mug 1.7
pen 1.0
Name: price, dtype: float64
mygroup2.sum()
Output:
color object
blue ball 1.2
red paper 0.9
pencil 1.4
white mug 1.7
pen 1.0
Name: price, dtype: float64
5.6 Selection of Data
Sometimes we have to work with a subset of a dataset. In this case, we
select data of interest from the dataset. Let us work on an already created
DataFrame, myframe4.
myframe4.columns
Output:
Index(['col0', 'col1', 'col2', 'col3', 'col4'], dtype='object')
myframe4.index
Output:
Index(['row0', 'row1', 'row2'], dtype='object')
myframe4.values
Output:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
We can select a single column.
myframe4['col2']
Output:
row0 2
row1 7
row2 12
Name: col2, dtype: int32
Alternatively, we can use the column name as an attribute of our DataFrame.
myframe4.col2
Output:
row0 2
row1 7
row2 12
Name: col2, dtype: int32
It is possible to extract or select a few rows from the DataFrame. To extract
rows with index 1 and 2 (3 excluded), type the following command.
myframe4[1:3]
Output:
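A single row can also be selected by its label. The command is not shown
in the source; with the loc indexer it would be:
myframe4.loc['row1']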
Output:
col0 5
col1 6
col2 7
col3 8
col4 9
Name: row1, dtype: int32
The rows and columns of a DataFrame can be given meaningful names.
myframe4.index.name = 'Rows'
myframe4.columns.name = 'Columns'
myframe4
Output:
We can add columns to the existing DataFrame by using a new column name
and assigning value(s) to this column.
myframe4['col5'] = np.random.randint(100, size = 3)
myframe4
Output:
Similar to the Series, we use the function isin() to check the membership of
a set of values. For instance,
myframe4.isin([1,4,99])
Output:
We can delete a column from the existing DataFrame using the keyword del.
del myframe4['col5']
myframe4
Output:
Hands-on Time