Python Data Processing
5.1 Introduction
Data preparation is the process of constructing a clean dataset from one or
more sources such that the data can be fed into subsequent stages of a data
science pipeline. Common data preparation tasks include handling missing
values, outlier detection, feature/variable scaling and feature encoding. Data
preparation is often a time-consuming process.
Python provides a specialized library, Pandas, for data preparation and
analysis. Understanding the functions of this library is therefore of utmost
importance for beginners in data science. In this chapter, we shall mostly be
using Pandas along with Numpy and Matplotlib.
Since we shall be working with Numpy and Pandas, we import both. The
general practice for importing these libraries is as follows:
import pandas as pd
import numpy as np
Thus, each time we see pd and np, we are referring to an object or method
of these two libraries.
Pandas provides two main data structures:
• Series,
• DataFrame.
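A Series is a one-dimensional array of indexed data. The command that
created the Series shown below is not included in the source; a call
consistent with the output is:
myseries1 = pd.Series([1, -3, 5, -20])
myseries1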
Output:
0 1
1 -3
2 5
3 -20
dtype: int64
dtype: int64 means the data type of values in a Series is integer of 64 bits.
The structure of a Series object is simple. It consists of two arrays, index and
value, associated with each other. The first array (column) stores the index
of the data, whereas the second array (column) stores the actual values.
Pandas assigns numerical indices starting from 0 onwards if we do not
specify any index. It is sometimes preferable to create a Series object using
descriptive and meaningful labels. In this case, we can assign indices during
the constructor call as follows:
myseries2 = pd.Series([1, -3, 5, 20], index = ['a', 'b', 'c', 'd'])
myseries2
Output:
a 1
b -3
c 5
d 20
dtype: int64
Two attributes of the Series data structure, index and values, can be used
to view the index and the values separately.
myseries2.values
Output:
array([ 1, -3, 5, 20], dtype=int64)
myseries2.index
Output:
Index(['a', 'b', 'c', 'd'], dtype='object')
The elements of a Series can be accessed in the same way we access
array elements.
myseries2[2]
Output:
5
myseries2[0:2]
Output:
a 1
b -3
dtype: int64
myseries2['b']
Output:
-3
myseries2[['b','c']]
Output:
b -3
c 5
dtype: int64
Note the double brackets in the indexing above: we pass a list of labels
inside the square brackets.
We can select the value by index or label, and assign a different value to it,
for example:
myseries2['c'] = 10
myseries2
Output:
a 1
b -3
c 10
d 20
dtype: int64
We can create a Series object from an existing array as follows:
myarray = np.array([1,2,3,4])
myseries3 = pd.Series(myarray)
myseries3
Output:
0 1
1 2
2 3
3 4
dtype: int32
Most operations that can be performed on simple Numpy arrays are also
valid on a Series data structure. To facilitate data processing, additional
functions are provided by the Series data structure. We can use conditional
operators to filter or select values. For example, to get values greater than
2, we may use:
myseries3[myseries3 > 2]
Output:
2 3
3 4
dtype: int32
Mathematical operations can be performed on the data stored in a Series.
For example, to take the logarithm of the values stored in myseries2, we
enter the following command that uses the log function defined in the
Numpy library.
np.log(myseries2)
Output:
a 0.000000
b NaN
c 2.302585
d 2.995732
dtype: float64
Note that the logarithm of a negative number is undefined; it is returned as
NaN, standing for Not a Number. Numpy therefore issues the following
warning.
C:\Anaconda3_Python\lib\site-packages\pandas\core\series.py:679:
RuntimeWarning: invalid value encountered in log
result = getattr(ufunc, method)(*inputs, **kwargs)
NaN values are used to indicate an empty field or a quantity that is not
definable. We can define NaN values by typing np.NaN. The isnull() and
notnull() functions of Pandas are useful to identify the indices without a
value, i.e., NaN.
We create a Series mycolors to perform some common operations that can
be applied to a Series.
mycolors = pd.Series([1,2,3,4,5,4,3,2],
index=['white','black','blue','green','green','yellow','black','red'])
mycolors
Output:
white 1
black 2
blue 3
green 4
green 5
yellow 4
black 3
red 2
dtype: int64
The Series mycolors contains some duplicate values. We can get the
unique values from the Series by typing:
mycolors.unique()
Output:
array([1, 2, 3, 4, 5], dtype=int64)
Another useful function, value_counts(), returns how many times each
value is present in a Series.
mycolors.value_counts()
Output:
4 2
3 2
2 2
5 1
1 1
dtype: int64
This output indicates that the values 2, 3 and 4 are present twice each,
whereas values 1 and 5 are present once only.
To find whether particular values are contained in a Series, we use the
isin() function that evaluates membership. It returns the Boolean values
True or False, which can be used to filter the data present in a Series. For
example, we search for the values 5 and 7 in the Series mycolors by typing:
mycolors.isin([5,7])
Output:
white False
black False
blue False
green False
green True
yellow False
black False
red False
dtype: bool
We can use the Boolean values returned by mycolors.isin([5,7]) as indices
to the Series mycolors to get the filtered Series.
mycolors[mycolors.isin([5,7])]
Output:
green 5
dtype: int64
A Series can be created from an already defined dictionary.
mydict = {'White': 1000, 'Black': 500, 'Red': 200, 'Green': 1000}
myseries = pd.Series(mydict)
myseries
Output:
White 1000
Black 500
Red 200
Green 1000
dtype: int64
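A DataFrame is a two-dimensional tabular data structure that can be created
from a dictionary whose keys become the column names. The original
definition of mydata and myframe is not shown in the source; an illustrative
version (the employee names and experience values below are hypothetical)
is:
mydata = {'Employee Name': ['A', 'B', 'C', 'D', 'E'],
'Experience (years)': [5, 10, 2, 8, 3]}
myframe = pd.DataFrame(mydata)
myframe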
Output:
We can select a few columns from the DataFrame in any arbitrary order,
using the columns option. The columns will be created in the order we specify
irrespective of how they are stored within the dictionary object. For example,
myframe2 = pd.DataFrame(mydata, columns = ['Experience (years)',
'Employee Name'])
myframe2
Output:
If we use the index option, we can specify indices of our choice to the
DataFrame.
myframe3 = pd.DataFrame(mydata, index=['zero', 'one','two','three','four'])
myframe3
Output:
A DataFrame can also be created by passing three arguments to its
constructor:
1. a data matrix,
2. an array of labels for the indices (index option), and
3. an array containing the column names (columns option).
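First, we create a data matrix using Numpy (this command is reconstructed
from its later use in the constructor call below):
np.arange(15).reshape((3,5))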
Output:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
This is a 2-dimensional array or a matrix of size 3x5.
myframe4 = pd.DataFrame(np.arange(15).reshape((3,5)),
index=['row0','row1','row2'], columns=['col0','col1','col2','col3', 'col4'])
myframe4
Output:
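Pandas allows us to concatenate Series and DataFrame objects. To
demonstrate this, we work with two Series of random numbers. Their
creation is not shown in the source; a version consistent with the outputs
below is (the random values will differ on each run):
myseries1 = pd.Series(np.random.rand(5))
myseries1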
Output:
0 0.865165
1 0.305467
2 0.692341
3 0.859180
4 0.004683
dtype: float64
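Similarly, the second Series (reconstructed here) uses indices 5 through 9:
myseries2 = pd.Series(np.random.rand(5), index=[5, 6, 7, 8, 9])
myseries2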
Output:
5 0.670931
6 0.762998
7 0.200184
8 0.266258
9 0.296408
dtype: float64
To concatenate myseries1 and myseries2, type the following command.
pd.concat([myseries1,myseries2])
Output:
0 0.865165
1 0.305467
2 0.692341
3 0.859180
4 0.004683
5 0.670931
6 0.762998
7 0.200184
8 0.266258
9 0.296408
dtype: float64
The concat() function works on axis = 0 (rows), by default, to return a Series.
If we set axis = 1 (columns), then the result will be a DataFrame.
pd.concat([myseries1, myseries2],axis=1)
Output:
If we use another useful option, keys, along the axis = 1, the provided keys
become the column names of the DataFrame.
pd.concat([myseries1, myseries2],axis=1, keys=['series1','series2'])
Output:
The function concat() is applicable to DataFrames as well. Let us create
two data frames as follows.
myframe1 = pd.DataFrame({'Student Name': ['A','B','C'], 'Sex':['M','F','M'],
'Age': [10, 16, 17], 'School': ['Primary','High', 'High']})
myframe1
Output:
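The second frame, myframe2, is created similarly. Its exact contents are
not shown in the source; an illustrative frame that shares the Student Name
and School columns with myframe1, and adds a hypothetical Grade column
of its own, is:
myframe2 = pd.DataFrame({'Student Name': ['A', 'B', 'D'],
'Grade': [5, 10, 11], 'School': ['Primary', 'High', 'High']})
myframe2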
Output:
We concatenate these 2 data frames as follows.
pd.concat([myframe1, myframe2])
Output:
Note that NaN values have been placed in those columns whose
information is not present in the individual data frames.
5.4.2 Merging Data
The process of merging consists of combining data through the connection
of rows using one or more keys. The keys are common columns in 2
DataFrames to be merged. Based on the keys, it is possible to obtain new
data in a tabular form. The merge() function performs this kind of operation.
We may merge myframe1 and myframe2 by typing the following command.
pd.merge(myframe1, myframe2)
Output:
Note the difference between the outputs of concat() and merge(). The
merge operation combines only those rows for which the key entry,
Student Name, is the same. However, the concat() operation returns all
the rows, even with NaN values.
Consider the case when we have multiple key columns, and we want to
merge on the basis of only one column. In this case, we can use the option
on to specify the key for merging the data.
pd.merge(myframe1, myframe2, on='School')
Output:
In this case, the merge operation renames those key attributes which are
common to both data frames but not used for merging. These become
Student Name_x and Student Name_y. If we merge on the basis of Student
Name, we get a completely different result.
pd.merge(myframe1, myframe2, on='Student Name')
Output:
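Sometimes concatenation and merging cannot combine the data the way
we want, for example, when two Series have partially overlapping indices.
The combine_first() function handles this case. The creation of the two
Series used below is not shown in the source; the values here are
reconstructed from the outputs that follow:
myseries1 = pd.Series([50, 40, 30, 20, 10], index=[1, 2, 3, 4, 5])
myseries1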
Output:
1 50
2 40
3 30
4 20
5 10
dtype: int64
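The second Series (reconstructed) is:
myseries2 = pd.Series([100, 200, 300, 400], index=[3, 4, 5, 6])
myseries2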
Output:
3 100
4 200
5 300
6 400
dtype: int64
To keep the values from myseries1, we combine both series as given below.
myseries1.combine_first(myseries2)
Output:
1 50.0
2 40.0
3 30.0
4 20.0
5 10.0
6 400.0
dtype: float64
To keep the values from myseries2, we combine both series as given below.
myseries2.combine_first(myseries1)
Output:
1 50.0
2 40.0
3 100.0
4 200.0
5 300.0
6 400.0
dtype: float64
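We can remove rows from a DataFrame with the drop() function. The
creation of myframe5 is not shown in the source; a minimal sketch with a
default integer index is:
myframe5 = pd.DataFrame(np.arange(9).reshape(3, 3),
columns=['col0', 'col1', 'col2'])
myframe5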
To remove the row indexed 1, we use:
myframe5.drop(1)
Output:
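Datasets often contain duplicate rows. The creation of the DataFrame
item_frame used below is not shown in the source; an illustrative frame in
which rows 0 and 4 are identical is:
item_frame = pd.DataFrame({'color': ['white', 'red', 'blue', 'red', 'white'],
'value': [2, 1, 3, 4, 2]})
item_frame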
We note that the rows indexed 0 and 4 are duplicates. To detect duplicate
rows, we use the duplicated() function.
item_frame.duplicated()
Output:
0 False
1 False
2 False
3 False
4 True
dtype: bool
To display the duplicate entries only, we pass the Boolean Series returned
by item_frame.duplicated() as an index to the DataFrame item_frame.
item_frame[item_frame.duplicated()]
Output:
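To study outlier detection, we use a DataFrame, student_frame, that stores
the ages of seven students. Its creation is not shown in the source; a
minimal reconstruction whose Age column reproduces the statistics quoted
below is:
student_frame = pd.DataFrame({'Age': [11, 12, 13, 14, 15, 16, 60]})
The describe() function computes summary statistics of the numerical
columns.
student_frame.describe()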
Note that the statistics of the numerical columns are calculated and displayed. In
our case, Age is the only numerical column in student_frame. The presence
of outliers shifts the statistics. We expect the mean or average age of
students to be around 15 years. However, the average age of students is
20.142857 due to the presence of the outlier, i.e., 60 years.
The statistic count gives the number of elements in the column, mean gives
the average value, std provides standard deviation which is the average
deviation of data points from the mean of data, min is the minimum value,
max is the maximum value, 25%, 50% and 75% give the 25th percentile (the
first quartile - Q1), the 50th percentile (the median - Q2) and the 75th
percentile (the third quartile - Q3) of the values.
There are different ways to detect outliers. For example, if the difference
between the mean and the median values is too high, it can be an indication
of the presence of outliers. A better approach to detect numeric outliers is
to use the InterQuartile Range (IQR). For the IQR, we divide our data into
four quarters after we sort it in ascending order.
Any data point that lies outside some small multiple of the difference between
the third and the first quartiles is considered as an outlier. For example,
IQR = Q3 – Q1 = 15.5 – 12.5 = 3
Using a typical interquartile multiplier value k=1.5, we can find the lower and
upper values beyond which data points can be considered as outliers.
IQR x 1.5 = 4.5
We subtract this value, 4.5, from the Q1 to find the lower limit, and add 4.5
to the Q3 to find the upper limit. Thus,
Lower limit = Q1-4.5 = 8
Upper limit = Q3+4.5 = 20
Now any value lesser than 8 or greater than 20 can be treated as an outlier.
A popular plot that shows these quartiles is known as a Box and Whisker plot
shown in Figure 5.1.
Figure 5.1: A Box and Whisker plot. The length of the box represents IQR.
To calculate lower and upper limits, we can enter the following script.
Q1 = student_frame.quantile(0.25) # 25% or quartile 1 (Q1)
Q3 = student_frame.quantile(0.75) # 75% or quartile 3 (Q3)
IQR = Q3 - Q1 # InterQuartile Range (IQR)
IQR_mult = IQR*1.5
lower= Q1 - IQR_mult
upper= Q3 + IQR_mult
print("The lower limit is = ", lower)
print("The upper limit is = ", upper)
Output:
The lower limit is = Age 8.0
dtype: float64
The upper limit is = Age 20.0
dtype: float64
Now we are able to filter our DataFrame, student_frame, to remove outliers.
We access the column Age using student_frame['Age'], and compare it with
int(lower) and int(upper). The result is used as an index to student_frame.
Finally, student_frame is updated by making an assignment as follows:
student_frame = student_frame[student_frame['Age'] > int(lower)]
student_frame = student_frame[student_frame['Age'] < int(upper)]
student_frame
Output:
Missing values in a dataset can be:
• ignored,
• filled-in, or
• removed (dropped).
Ignoring the missing values is often not a good solution because it leads to
erroneous results. Let us create a Series object with missing values.
myseries4 = pd.Series([10, 20, 30, None, 40, 50, np.NaN],
index=[0,1,2,3,4,5,6])
print (myseries4.isnull())
myseries4
Output:
0 False
1 False
2 False
3 True
4 False
5 False
6 True
dtype: bool
0 10.0
1 20.0
2 30.0
3 NaN
4 40.0
5 50.0
6 NaN
dtype: float64
To get the indices where values are missing, we may type:
myseries4[myseries4.isnull()]
Output:
3 NaN
6 NaN
dtype: float64
We can drop the missing values by using the function dropna().
myseries4_dropped = myseries4.dropna()
myseries4_dropped
Output:
0 10.0
1 20.0
2 30.0
4 40.0
5 50.0
dtype: float64
The process of filling-in the missing values is called “data imputation”. One
of the widely used techniques is mean value imputation.
myseries4_filled = myseries4.fillna(myseries4.mean())
myseries4_filled
Output:
0 10.0
1 20.0
2 30.0
3 30.0
4 40.0
5 50.0
6 30.0
dtype: float64
Here, the values at indices 3 and 6 are filled in by the mean or average of
the rest of the valid values, where the mean is calculated as
Mean = (10+20+30+40+50)/5 = 30.
Besides the mean, we can impute the missing data using the median by
typing myseries4.median() in place of myseries4.mean().
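The examples in this section use a DataFrame, myframe, with the columns
color, object and price. Its creation is not shown in the source; an
illustrative version (the price values below are hypothetical) is:
myframe = pd.DataFrame({'color': ['blue', 'green', 'yellow', 'red', 'white'],
'object': ['ball', 'pencil', 'pen', 'mug', 'paper'],
'price': [1.2, 1.0, 0.6, 0.9, 1.7]})
myframe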
We create a dictionary to perform mapping.
mymap = {'blue':'dark blue', 'green': 'light green'}
Next, this dictionary is provided as an input argument to the replace()
function.
myframe.replace(mymap)
Output:
Note that the original colors blue and green have been replaced by dark
blue and light green as mapped inside the dictionary mymap. The function
replace() can also be used to replace the NaN values contained inside a
data structure.
myseries = pd.Series([1,2,np.nan,4,5,np.nan])
myseries.replace(np.nan,0)
Output:
0 1.0
1 2.0
2 0.0
3 4.0
4 5.0
5 0.0
dtype: float64
To add a new column to an existing DataFrame, we again create a dictionary
object that serves as a map.
mymap2 = {'ball':'round', 'pencil':'long', 'pen': 'long', 'mug': 'cylindrical',
'paper':'rectangular'}
myframe['shape']=myframe['object'].map(mymap2)
myframe
Output:
We use the map() function that takes the dictionary as its input argument,
and maps a particular column of the DataFrame to create a new column. In
our case, the column named object is used for the mapping.
Finally, we can rename the indices of a DataFrame using the function
rename(). We create the new indices using a dictionary.
reindex = {0: 'first', 1: 'second', 2: 'third', 3: 'fourth', 4: 'fifth'}
myframe = myframe.rename(reindex)
myframe
Output:
Note that we rename the indices, and assign the result of the right-hand side
to myframe to update it. If this assignment operation is not performed,
myframe will not be updated.
5.5.5 Discretization and Binning
Occasionally when we have a large amount of data, we want to transform
this into discrete categories to facilitate the analysis. For instance, we can
divide the range of values of the data in relatively smaller intervals or
categories to discover the statistics within each interval. Suppose we gather
data from an experimental study, and store it in a list.
readings = [34, 39, 82, 75, 16, 17, 15, 74, 37, 68, 22, 92, 99, 54, 39, 96, 17,
36, 91, 86]
We find that the range of data values is 0 to 100. Thus, we can uniformly
divide this interval, suppose, into four equal parts (bins):
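The bin edges and the call to cut() are not shown in the source; they are
reconstructed here from their later use (the names bins and mycategory
appear in the commands that follow):
bins = [0, 25, 50, 75, 100]
mycategory = pd.cut(readings, bins)
mycategory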
Output:
[(25, 50], (25, 50], (75, 100], (50, 75], (0, 25], ..., (75, 100], (0, 25], (25, 50],
(75, 100], (75, 100]]
Length: 20
Categories (4, interval[int64]): [(0, 25] < (25, 50] < (50, 75] < (75, 100]]
We get 4 categories or intervals when we run the function cut(), i.e., [(0, 25]
< (25, 50] < (50, 75] < (75, 100]]. Note that each category has the lower limit
with a parenthesis and the upper limit with a bracket. This is consistent with
the mathematical notation used to indicate intervals. In the case of a square
bracket, the number belongs to the range, and in the case of a parenthesis,
the number does not belong to the interval. In (0, 25], 0 is excluded, whereas
25 is included. To count the number of elements in each bin, we may write:
pd.value_counts(mycategory)
Output:
(75, 100] 6
(25, 50] 5
(0, 25] 5
(50, 75] 4
dtype: int64
In place of numbers, we can give meaningful names to the bins.
bin_names = ['Poor','Below Average','Average','Good']
pd.cut(readings, bins, labels=bin_names)
Output:
[Below Average, Below Average, Good, Average, Poor, ..., Good, Poor,
Below Average, Good, Good]
Length: 20
Categories (4, object): [Poor < Below Average < Average < Good]
The Pandas library provides the function qcut() that divides the data into
quantiles. qcut() ensures that the number of occurrences in each bin is
equal, but the ranges of the bins may vary.
pd.qcut(readings, 4)
Output:
[(31.0, 46.5], (31.0, 46.5], (46.5, 83.0], (46.5, 83.0], (14.999, 31.0], ..., (83.0,
99.0], (14.999, 31.0], (31.0, 46.5], (83.0, 99.0], (83.0, 99.0]]
Length: 20
Categories (4, interval[float64]): [(14.999, 31.0] < (31.0, 46.5] < (46.5, 83.0]
< (83.0, 99.0]]
To check the number of elements in each bin, we type:
pd.value_counts(pd.qcut(readings,4))
Output:
(83.0, 99.0] 5
(46.5, 83.0] 5
(31.0, 46.5] 5
(14.999, 31.0] 5
dtype: int64
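The grouping examples below use a fresh DataFrame myframe with the
columns color, object and price. Its creation is not shown in the source; the
values here are reconstructed from the group statistics that follow:
myframe = pd.DataFrame({'color': ['blue', 'white', 'red', 'red', 'white'],
'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
'price': [1.2, 1.0, 0.6, 0.9, 1.7]})
myframe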
Note that the column color has 2 entries for both white and red. If we want
to group the data based upon the column color, for example, we may type.
mygroup = myframe['price'].groupby(myframe['color'])
mygroup.groups
Output:
{'blue': Int64Index([0], dtype='int64'),
'red': Int64Index([2, 3], dtype='int64'),
'white': Int64Index([1, 4], dtype='int64')}
Thus, we get 3 distinct groups, blue, red and white, by invoking the attribute
groups.
mygroup.mean()
Output:
color
blue 1.20
red 0.75
white 1.35
Name: price, dtype: float64
mygroup.sum()
Output:
color
blue 1.2
red 1.5
white 2.7
Name: price, dtype: float64
Data aggregation can be performed using more than one column. For
instance, we may group data by both color and object. This is called
hierarchical grouping. We may type the following commands.
mygroup2 = myframe['price'].groupby([myframe['color'],myframe['object']])
mygroup2.groups
Output:
{('blue', 'ball'): Int64Index([0], dtype='int64'),
('red', 'paper'): Int64Index([3], dtype='int64'),
('red', 'pencil'): Int64Index([2], dtype='int64'),
('white', 'mug'): Int64Index([4], dtype='int64'),
('white', 'pen'): Int64Index([1], dtype='int64')}
Let us create a new dataframe myframe2 that is the same as myframe
except for an extra entry ['red','pencil',0.8] at index 5. Note that the
assignment below does not copy the data; myframe2 and myframe refer to
the same underlying object (use myframe.copy() for an independent copy).
myframe2 = myframe
myframe2.loc[5]=['red','pencil',0.8]
myframe2
Output:
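Because the price Series extracted earlier by myframe['price'] does not see
the newly added row, the grouping is re-evaluated before computing the
statistics (this step is implied by the outputs below, which include the new
entry):
mygroup2 = myframe2['price'].groupby([myframe2['color'], myframe2['object']])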
mygroup2.mean()
Output:
color object
blue ball 1.2
red paper 0.9
pencil 0.7
white mug 1.7
pen 1.0
Name: price, dtype: float64
mygroup2.sum()
Output:
color object
blue ball 1.2
red paper 0.9
pencil 1.4
white mug 1.7
pen 1.0
Name: price, dtype: float64
5.6 Selection of Data
Sometimes we have to work with a subset of a dataset. In this case, we
select data of interest from the dataset. Let us work on an already created
DataFrame, myframe4.
myframe4.columns
Output:
Index(['col0', 'col1', 'col2', 'col3', 'col4'], dtype='object')
myframe4.index
Output:
Index(['row0', 'row1', 'row2'], dtype='object')
myframe4.values
Output:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
We can select a single column.
myframe4['col2']
Output:
row0 2
row1 7
row2 12
Name: col2, dtype: int32
Alternatively, we can use the column name as an attribute of our DataFrame.
myframe4.col2
Output:
row0 2
row1 7
row2 12
Name: col2, dtype: int32
It is possible to extract or select a few rows from the DataFrame. To extract
rows with index 1 and 2 (3 excluded), type the following command.
myframe4[1:3]
Output:
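A single row can also be selected by its label. The command is not shown
in the source; with the loc indexer it would be:
myframe4.loc['row1']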
Output:
col0 5
col1 6
col2 7
col3 8
col4 9
Name: row1, dtype: int32
The rows and columns of a DataFrame can be given meaningful names.
myframe4.index.name = 'Rows'
myframe4.columns.name = 'Columns'
myframe4
Output:
We can add columns to the existing DataFrame by using a new column name
and assigning value(s) to this column.
myframe4['col5'] = np.random.randint(100, size = 3)
myframe4
Output:
Similar to the Series, we use the function isin() to check the membership of
a set of values. For instance,
myframe4.isin([1,4,99])
Output:
We can delete a column from the existing DataFrame using the keyword del.
del myframe4['col5']
myframe4
Output:
Hands-on Time