1 Introduction Python Programming For Data Science
The programming requirements of data science demand a versatile yet flexible language that is simple to write code in but can handle highly complex mathematical processing. Python is well suited to these requirements, as it has already established itself as a language for both general computing and scientific computing. Moreover, it is continuously upgraded in the form of new additions to its plethora of libraries aimed at different programming requirements. The following are features of Python that make it the preferred language for data science.
i) It is a simple and easy to learn language that achieves results in fewer lines of code than other similar languages like R. Its simplicity also makes it suitable for handling complex scenarios with minimal code and much less confusion about the general flow of the program.
ii) It is cross-platform, so the same code works in multiple environments without needing any change. That makes it well suited to multi-environment setups.
iii) It generally executes faster than other languages commonly used for data analysis, like R and MATLAB.
iv) Its excellent memory management, especially garbage collection, makes it versatile in gracefully managing very large volumes of data during transformation, slicing, dicing, and visualization.
v) Python has a very large collection of libraries that serve as special-purpose analysis tools. E.g. the NumPy package deals with scientific computing, and its arrays need much less memory than conventional Python lists for managing numeric data. The number of such packages is continuously growing.
vi) Python has packages that can directly use code from other languages like Java or C. This helps optimize code performance by reusing existing code from other languages whenever it gives a better result.
Python Machine Learning Ecosystem
The Python machine learning ecosystem is a collection of libraries that enable developers to extract and transform data, perform data wrangling operations, apply existing robust Machine Learning algorithms, and develop custom algorithms easily. These libraries include numpy, scipy, pandas, scikit-learn, statsmodels, tensorflow, keras, etc. The following is a brief description of these libraries:
1. PANDAS: used for data analysis
2. NUMPY: used for numerical analysis, i.e. matrix and vector manipulation
3. MATPLOTLIB: used for data visualization
4. SCIPY: used for scientific computing
5. SEABORN: used for data visualization
6. TENSORFLOW: used in deep learning
7. SCIKIT-LEARN: used in machine learning, i.e. as a source of many machine learning algorithms and utilities
8. KERAS: used for neural networks and deep learning
Setting Up a Python Environment
The starting step for the journey into the world of Data Science is the setup of the Python environment. You
have two options for setting up the environment:
• Install Python and the necessary libraries individually
• Use a pre-packaged Python distribution that comes with necessary libraries, e.g. Anaconda
Anaconda is a packaged distribution of Python that bundles a whole suite of libraries, including the core libraries widely used in Data Science. A major advantage of this distribution is that it requires no elaborate setup and works well on all operating systems and platforms, especially Windows, which can otherwise cause problems when installing specific Python packages. The Anaconda distribution is widely used across industry Data Science environments and comes with a wonderful IDE, Spyder (Scientific Python Development Environment), other useful utilities like Jupyter notebooks and the IPython console, and an excellent package management tool, conda.
Steps
Follow these steps to set up a Python environment using Anaconda:
i) The first step is downloading the required installation package from https://www.anaconda.com/download/.
You can choose from Windows, Mac and Linux OS as per your requirement.
ii) Select the Python version you want to install on your machine. Both 64-bit and 32-bit graphical installers are available.
iii) After you select the OS and Python version, the Anaconda installer will be downloaded to your computer. Double-click the file and the installer will install the Anaconda package.
Installing Libraries
In Python the preferred way to install additional libraries is using the pip installer. The basic syntax to install a
package from Python Package Index (PyPI) using pip is as follows:
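pip install required_package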
This will install required_package if it is present in PyPI. You can also use sources other than PyPI to install packages, but that is generally not required. The Anaconda distribution is already supplemented with a plethora of additional libraries, hence it is very unlikely that we will need additional packages from other sources.
Another way to install packages, limited to Anaconda, is to use the conda install command. This installs packages from the Anaconda package channels and is the recommended method, especially on Windows.
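For example, the same hypothetical package would be installed with:
conda install required_package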
1. Jupyter Notebook
The Jupyter Notebook, formerly known as the IPython notebook, is an interactive environment for running code in the browser. It is a great tool for exploratory data analysis and is widely used by data scientists.
The following are some of the features of Jupyter notebooks that make it one of the best components of the Python ML ecosystem:
• Jupyter notebooks can illustrate the analysis process step by step by arranging code, images, text, output, etc. in sequence.
• They help a data scientist document the thought process while developing the analysis.
• The results can be captured as part of the notebook.
• Jupyter notebooks make it possible to share your work with your peers.
To start a notebook server, run the following command at the command prompt:
C:\>jupyter notebook
This will start a notebook server at the address localhost:8888 of your machine. Once you invoke this command, you can navigate to localhost:8888 in your browser to find the landing page, which can be used to access existing notebooks or create new ones.
2. NumPy
NumPy is the backbone of Machine Learning in Python. It is one of the most important libraries in Python for numerical computation and is used by almost all Machine Learning and scientific computing libraries. The name stands for Numerical Python, and it provides an efficient way to store and manipulate multidimensional arrays in Python. NumPy, used along with SciPy (Scientific Python) and Matplotlib (a plotting library), can also be seen as a replacement for MATLAB.
If you are using the Anaconda distribution, NumPy is already included and can be imported directly:
import numpy as np
On the other hand, if you are using the standard Python distribution, NumPy can be installed with pip as follows:
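pip install numpy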
After installing NumPy, you can import it into your Python script as shown above.
Numpy ndarray
The numeric functionality of numpy is orchestrated by two important constituents of the numpy package, ndarray and ufuncs (universal functions).
• ndarray (simply an array or matrix) is a multidimensional array object that serves as the core data container for all numpy operations. An array is mostly of a single data type (homogeneous) and possibly multidimensional.
• Universal functions (ufuncs) are functions that operate on ndarrays element by element, as sketched below.
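For instance, np.sqrt is a ufunc that is applied to every element of an array at once (the input values here are illustrative):

In [2]: np.sqrt(np.array([1, 4, 9]))
Out[2]: array([1., 2., 3.])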
The shape attribute of the array object returns the size of each dimension in the form (rows, columns), while the size attribute returns the total number of elements in the array:
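Here arr is assumed to have been created as a one-dimensional array of five elements, for example:

In [3]: arr = np.array([1, 2, 3, 4, 5])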
In [4]: arr.shape
Out[4]: (5,)
Unlike Python lists, NumPy arrays can explicitly be multidimensional. A multidimensional array is created
as shown below:
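A sketch of the input cell, assuming the array was built with np.array and printed:

In [5]:
x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))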
Out[5]:
x:
[[1 2 3]
[4 5 6]]
iii). np.arange : creates an array filled with a linear sequence, starting at 0, ending at 20, stepping by 2. This is similar to the built-in range() function:
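A sketch of the call and its result:

In [6]: np.arange(0, 20, 2)
Out[6]: array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])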
iv). np.identity : creates an identity matrix (np.eye, shown below, is an equivalent alternative):
In[10]: np.identity(3)
Out[10]:
array([[ 1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
Alternatively
In[11]: np.eye(3)
Out[11]: array([[ 1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
v). np.random.randn : initializes an array of a specified dimension with random values. To create a 3x3 array of normally distributed random values with mean 0 and standard deviation 1:
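In[12]: np.random.randn(3, 3)   # values are drawn at random and will differ on each run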
3. Pandas
Pandas is one of the most important Python libraries for data wrangling and analysis; its core data structures, Series and Dataframe, are described below. The following example builds a dataframe from a Python dictionary:
In[1]:
import pandas as pd
# create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]}
data_pandas = pd.DataFrame(data)
# IPython.display allows "pretty printing" of dataframes
# in the Jupyter notebook
display(data_pandas)
Notice that the keys of the dictionary are picked up as the column names of the dataframe, and since no index was specified, the default integer index was used.
There are several possible ways to query the table. E.g. the following selects the rows whose Age column is greater than 30:
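In[2]: display(data_pandas[data_pandas.Age > 30])   # boolean indexing; returns Peter and Linda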
i) Series
Series in pandas is a one-dimensional ndarray with axis labels, i.e. its functionality is similar to a simple array. The values in a series carry an index whose labels must be hashable; this requirement matters when we perform manipulation and summarization on the data contained in a series, as sketched below.
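A minimal sketch of a series with a custom string index (the values reuse the people data above):

In[3]:
ages = pd.Series([24, 13, 53, 33], index=["John", "Anna", "Peter", "Linda"])
print(ages["Peter"])   # prints 53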
ii) Dataframe
Dataframe is the most important and useful data structure, used for almost all kinds of data representation and manipulation in pandas. Dataframes are extremely useful for representing raw datasets as well as processed feature sets in Machine Learning and Data Science. All operations can be performed along both axes of a dataframe: rows and columns.
Data Retrieval
Pandas provides numerous ways to retrieve and read in data. You can convert data from CSV files, databases, flat files, etc. into dataframes. You can also convert a list of dictionaries (Python dict) into a dataframe.
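For instance, a minimal sketch of reading a CSV file into a dataframe (the file name here is hypothetical):

df = pd.read_csv("people.csv")   # hypothetical file; each CSV column becomes a dataframe column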
The following are the most important data sources:
Databases to Dataframe
The most important data source for data scientists is often the existing data sources used by their organizations. Relational databases (DBs) and data warehouses are the de facto standard of data storage.
Example:
The following code reads data from a Microsoft SQL Server database. conn is an object that identifies the database server information and the type of database to pandas.
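A minimal sketch, assuming the pyodbc driver (the server, database, and table names are hypothetical):

import pyodbc
import pandas as pd

# conn identifies the server, the database, and the driver type to pandas
conn = pyodbc.connect("DRIVER={SQL Server};SERVER=my_server;DATABASE=my_db;Trusted_Connection=yes")
# read_sql runs the query through conn and returns the result as a dataframe
df = pd.read_sql("SELECT * FROM my_table", conn)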
4. Matplotlib
Matplotlib is the primary scientific plotting library in Python. It provides functions for making publication-quality visualizations such as line charts, histograms, and scatter plots. Visualizing your data and the different aspects of your analysis can give you important insights.
When working inside the Jupyter Notebook, you can show figures directly in the browser by using the %matplotlib notebook or %matplotlib inline magic commands.
Example:
The following code produces a simple line plot:
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
#or you can use "from matplotlib import pyplot as plt"
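# An illustrative sketch: plot a sine wave
import numpy as np
# Generate 100 evenly spaced numbers from -10 to 10
x = np.linspace(-10, 10, 100)
# Compute the sine of each value
y = np.sin(x)
# plot makes a line chart of one array against another
plt.plot(x, y, marker='x')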
5. SciPy
SciPy (Scientific Python) builds on NumPy and provides routines for scientific computing. One important component is the scipy.sparse module, which provides sparse matrices, i.e. matrices that store only their nonzero entries:
In[1]:
from scipy import sparse
# A 2D NumPy array with a diagonal of ones, and zeros everywhere else
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))
Out[1]:
NumPy array:
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
In[2]:
# Convert the NumPy array to a SciPy sparse matrix in CSR format
# Only the nonzero entries are stored
sparse_matrix = sparse.csr_matrix(eye)
print("\nSciPy sparse CSR matrix:\n{}".format(sparse_matrix))
Out[2]:
SciPy sparse CSR matrix:
(0, 0) 1.0
(1, 1) 1.0
(2, 2) 1.0
(3, 3) 1.0
6. Scikit-learn
Scikit-learn is one of the most important and indispensable frameworks for Data Science and Machine Learning in Python. It is built on top of the NumPy and SciPy scientific Python libraries and implements a wide range of Machine Learning algorithms covering major areas of Machine Learning like classification, clustering, and regression. All the mainstream Machine Learning algorithms, like support vector machines, logistic regression, random forests, K-means clustering, and hierarchical clustering, are implemented efficiently in this library. This library arguably forms the foundation of applied and practical Machine Learning. Besides this, its easy-to-use API and code design patterns have been widely adopted across other frameworks.
Core APIs
Scikit-learn is built on a small and simple set of core API ideas and design patterns. The following is a brief description of the core APIs on which the central operations of scikit-learn are based.
i) Dataset representation:
The data representation of most Machine Learning tasks is quite similar. Very often we have a collection of data points represented by data point vectors. A data point vector contains multiple independent variables (or features) and one or more dependent variables (response variables). E.g. a data point in a linear regression problem can be represented as [(X1, X2, X3, X4, ..., Xn), (Y)], where the independent variables (features) are represented by the Xs and the dependent variable (response variable) is represented by Y.
The idea is to predict Y by fitting a model on the features. This data representation resembles a matrix
(considering multiple data point vectors), and a natural way to depict it is by using numpy arrays.
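For instance, a minimal sketch of this representation (the values are illustrative):

import numpy as np
# Feature matrix: 3 data points, each with 2 features (X1, X2)
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 3.4]])
# Response vector: one dependent variable Y per data point
y = np.array([0, 0, 1])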
ii) Estimators: The estimator interface is one of the most important components of the scikit-learn library. All the Machine Learning algorithms in the package implement the estimator interface. Learning is handled in a two-step process. The first step is the initialization of the estimator object; this involves selecting the appropriate class for the algorithm and supplying its parameters or hyperparameters. The second step is applying the fit function to the supplied data (feature set and response variables). The fit function learns the output parameters of the Machine Learning algorithm and exposes them as public attributes of the object for easy inspection of the final model. The data is generally supplied to the fit function as an input-output matrix pair, as sketched below.
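A minimal sketch with one scikit-learn estimator, reusing X and y from above (the hyperparameter value is arbitrary):

from sklearn.linear_model import LogisticRegression
# Step 1: initialize the estimator, supplying its hyperparameters
model = LogisticRegression(C=1.0)
# Step 2: fit it to the data; learned parameters become public attributes
model.fit(X, y)
print(model.coef_, model.intercept_)   # the learned output parameters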
iii) Predictors: The predictor interface is implemented to generate predictions, forecasts, etc. using a learned estimator on unknown data. E.g. in the case of a supervised learning problem, the predictor interface provides predicted classes for the unknown test array supplied to it. A requirement of a predictor implementation is to provide a score function; this function produces a scalar value for the test input provided to it, quantifying the effectiveness of the model used. Such values can later be used for tuning the Machine Learning model.
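Continuing the sketch, prediction and scoring on a (hypothetical) unseen data point:

X_test = np.array([[5.0, 3.2]])   # an unknown data point
print(model.predict(X_test))      # predicted class for the test input
print(model.score(X, y))          # scalar score (mean accuracy) for the supplied data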
iv) Transformers: Transformation of input data before learning a model is a very common task in Machine Learning. Some data transformations are simple, for example replacing missing data with a constant or taking a log transform, while others are similar to learning algorithms themselves (for example, PCA). To simplify such transformations, some estimator objects implement the transformer interface. This interface allows you to perform a non-trivial transformation on the input data and supply the output to the actual learning algorithm, as sketched below.
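A minimal sketch with StandardScaler, a scikit-learn estimator that implements the transformer interface:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()            # estimator object implementing fit/transform
X_scaled = scaler.fit_transform(X)   # learn each feature's mean/std, then standardize X
# X_scaled can now be supplied to the actual learning algorithm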