Getting Started With Python Data Analysis - Sample Chapter
With this book you can get started with Python data analysis
in a practical and example-driven way.
The book starts by introducing the principles of data analysis
and supported Python libraries, along with the basics of
NumPy for numerical data processing. Next, it provides
an overview of Pandas, a powerful library to solve data
processing problems.
Moving on, the book takes you through a brief overview of
the Matplotlib API and some common plotting functions for
visualization. Next, it will teach you how to manipulate time
series data and how to persist data structures to files or
databases. It also shows how you can reshape data to be
able to ask interesting questions about it.
Finally, the book gives a brief introduction to scikit-learn, a popular
machine learning library. How do you represent data, what are supervised
and unsupervised algorithms, and how can we measure prediction
performance? All such questions are answered through examples.
Phuong Vo.T.H
Martin Czygan
Germany. He has been working as a software engineer for more than 10 years. For
the past eight years, he has been diving into Python, and is still enjoying it. In recent
years, he has been helping clients to build data processing pipelines and search
and analytics systems. His consultancy can be found at http://www.xvfz.net.
Preface
The world generates data at an increasing pace. Consumers, sensors, or scientific
experiments emit data points every day. In finance, business, administration and the
natural or social sciences, working with data can make up a significant part of the job.
Being able to efficiently work with small or large datasets has become a valuable skill.
There are a variety of applications for working with data, from spreadsheet
applications, which are widely deployed and used, to more specialized statistical
packages for experienced users, which often support domain-specific extensions for experts.
Python started as a general-purpose language. It has been used in industry for a
long time, but it has been popular among researchers as well. Around ten years
ago, in 2006, the first version of NumPy was released, which made Python a
first-class language for numerical computing and laid the foundation for a prospering
development that led to what we today call the PyData ecosystem: a growing
set of high-performance libraries used in the sciences, finance, business, or
anywhere else you want to work efficiently with datasets.
In contrast to more specialized applications and environments, Python is not
only about data analysis. The list of industrial-strength libraries for many general
computing tasks is long, which makes working with data in Python even more
compelling. Whether your data lives inside SQL or NoSQL databases or is out there
on the Web and must be crawled or scraped first, the Python community has already
developed packages for many of those tasks.
And the outlook seems bright. Working with bigger datasets is getting simpler and
sharing research findings and creating interactive programming notebooks has never
been easier. It is the perfect moment to learn about data analysis in Python. This
book lets you get started with a few core libraries of the PyData ecosystem: NumPy,
Pandas, and matplotlib. In addition, two NoSQL databases are introduced. The final
chapter will take a quick tour through one of the most popular machine learning
libraries in Python.
We hope you find Python a valuable tool for your everyday data work and that we
can give you enough material to get productive in the data analysis space quickly.
Chapter 7, Data Analysis Application Examples, applies many of the things covered
in the previous chapters to deepen your understanding of typical data analysis
workflows. How do you clean, inspect, reshape, merge, or group data? These are
the concerns of this chapter. The library of choice in this chapter will again be Pandas.
Chapter 8, Machine Learning Models with scikit-learn, aims to make you familiar
with a popular machine learning package for Python. While it supports dozens of
models, we only look at four: two supervised and two unsupervised. Even if this
is not mentioned explicitly, this chapter brings together a lot of the existing tools:
Pandas is often used for machine learning data preparation, and matplotlib is used
to create plots to facilitate understanding.
The following figure illustrates the steps from data to knowledge; we call this
process the data analysis process and will introduce it in the next section:
[Figure: the data-to-knowledge pyramid. Data is collected and organized into information; information is analyzed and summarized into knowledge; knowledge is synthesized to support decision making.]
[Figure: a Venn diagram placing data analysis at the overlap of computer science (programming, algorithms), statistics and mathematics, artificial intelligence and machine learning, and knowledge of the data's domain.]
Data cleaning: After being processed and organized, the data may still
contain duplicates or errors. Therefore, we need a cleaning step to reduce
those situations and increase the quality of the results in the following
steps. Common tasks include record matching, deduplication, and column
segmentation. Depending on the type of data, we can apply several types of
data cleaning. For example, a user's history of visits to a news website might
contain a lot of duplicate rows, because the user might have refreshed certain
pages many times. For our specific issue, these rows might not carry any
meaning when we explore the user's behavior, so we should remove them
before saving the data to our database (see the sketch after this list). Another
situation we may encounter is click fraud on news sites: someone just wants to
improve their website ranking or sabotage a website. In this case, the data will
not help us to explore a user's behavior. We can use thresholds to check whether
a visit page event comes from a real person or from malicious software.
Data product: The goal of this step is to build data products that receive data
input and generate output according to the problem requirements. We will
apply computer science knowledge to implement our selected algorithms as
well as manage the data storage.
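The following is a minimal sketch of the cleaning ideas above, using Pandas. The column names, sample rows, and the one-second threshold are illustrative assumptions, not part of any real dataset:

import pandas as pd

# Hypothetical visit log: one row per page-visit event
visits = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "url": ["/home", "/home", "/news", "/home", "/home"],
    "timestamp": pd.to_datetime([
        "2015-01-01 10:00:00", "2015-01-01 10:00:00",   # exact duplicate (refresh)
        "2015-01-01 10:05:00", "2015-01-01 11:00:00",
        "2015-01-01 11:00:00.200",                      # suspiciously fast repeat
    ]),
})

# Deduplication: drop exact duplicate rows
visits = visits.drop_duplicates()

# Threshold check: treat repeat visits to the same page within one
# second as likely refreshes or machine-generated clicks
visits = visits.sort_values(["user_id", "url", "timestamp"])
gaps = visits.groupby(["user_id", "url"])["timestamp"].diff()
clean = visits[~(gaps < pd.Timedelta(seconds=1))]
print(clean)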
Weka: This is the library I first became familiar with when I started learning
about data analysis. It has a graphical user interface that allows you to run
experiments on a small dataset. This is great if you want to get a feel for what
is possible in the data processing space. However, if you build a complex
product, I think it is not the best choice, because of its performance, sketchy
API design, non-optimal algorithms, and little documentation
(http://www.cs.waikato.ac.nz/ml/weka/).
Mallet: This is another Java library that is used for statistical natural
language processing, document classification, clustering, topic modeling,
information extraction, and other machine-learning applications on text.
There is an add-on package for Mallet, called GRMM, that contains support
for inference in general graphical models and training of conditional
random fields (CRFs) with arbitrary graphical structures. In my experience,
the library's performance and algorithms are better than Weka's. However,
it focuses only on text-processing problems. The reference page is at
http://mallet.cs.umass.edu/.
Caffe: The last C++ library we want to mention is Caffe. This is a deep
learning framework made with expression, speed, and modularity in mind.
It is developed by the Berkeley Vision and Learning Center (BVLC) and
community contributors. You can find more information about it at
http://caffe.berkeleyvision.org/.
Orange: This is an open source data visualization and analysis tool for novices
and experts. It is packed with features for data analysis and has add-ons
for bioinformatics and text mining. It contains an implementation of
self-organizing maps, which sets it apart from the other projects as well
(http://orange.biolab.si/).
Theano: This bridges the gap between Python and lower-level languages.
Theano gives very significant performance gains, particularly for large
matrix operations, and is therefore a good choice for deep learning models.
However, it is not easy to debug because of the additional compilation layer
(see the sketch after this list).
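As a minimal sketch of that compilation layer (assuming Theano is installed), a symbolic expression is defined first and then compiled into a callable function:

import theano
import theano.tensor as T

# Define a symbolic matrix product; nothing is computed yet
x = T.dmatrix("x")
y = T.dmatrix("y")
z = T.dot(x, y)

# Compile the expression graph into fast native code
f = theano.function([x, y], z)

# Call the compiled function on concrete arrays
print(f([[1.0, 2.0]], [[3.0], [4.0]]))  # [[ 11.]]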
We could not list all the libraries for data analysis here. However, the libraries
above are enough to keep you busy for a long time learning and building data
analysis applications. We hope you will enjoy them after reading this book.
NumPy
One of the fundamental packages used for scientific computing in Python is NumPy.
Among other things, it contains the following:
A powerful N-dimensional array object
Sophisticated (broadcasting) functions for fast elementwise computation
Tools for integrating C/C++ and Fortran code
Useful linear algebra, Fourier transform, and random number capabilities
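As a quick illustrative taste of the array object and its vectorized operations:

import numpy as np

# A 2x3 array; operations apply elementwise without explicit loops
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(a.mean(axis=0))    # column means: [ 2.5  3.5  4.5]
print(a * 2)             # elementwise scaling
print(np.dot(a, a.T))    # matrix product, shape (2, 2)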
Pandas
Pandas is a Python package that supports rich data structures and functions for
analyzing data and is developed by the PyData Development Team. It is focused on
improving Python's data libraries. Pandas consists of the following things:
A set of labeled array data structures; the primary of which are Series,
DataFrame, and Panel
Input/output tools that load and save data from flat files or PyTables/HDF5
format
Due to these features, Pandas is an ideal tool for systems that need complex
data structures or high-performance time series functions, such as financial
data analysis applications.
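A minimal illustrative sketch of the labeled data structures mentioned above:

import pandas as pd

# Series: a one-dimensional labeled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])                # label-based access: 20

# DataFrame: a two-dimensional labeled table
df = pd.DataFrame({"price": [1.5, 2.0], "volume": [100, 250]},
                  index=pd.to_datetime(["2015-01-01", "2015-01-02"]))
print(df["price"].mean())    # column statistics: 1.75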
Matplotlib
Matplotlib is the single most used Python package for 2D graphics. It provides
both a very quick way to visualize data from Python and publication-quality
figures in many formats: line plots, contour plots, scatter plots, and Basemap plots.
It comes with a set of default settings but allows customization of all kinds of
properties; at the same time, we can easily create a chart using the defaults for
almost every property in matplotlib.
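A minimal illustrative sketch using the defaults:

import numpy as np
import matplotlib.pyplot as plt

# A quick line plot with default settings
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.legend()
plt.savefig("sine.png")   # or plt.show() for an interactive window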
PyMongo
MongoDB is a type of NoSQL database. It is highly scalable, robust, and perfect
for working with JavaScript-based web applications, because we can store data as
JSON documents and use flexible schemas.
PyMongo is a Python distribution containing tools for working with MongoDB.
Many tools have also been written for working with PyMongo to add more features,
such as MongoKit, Humongolus, MongoAlchemy, and Ming.
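A minimal illustrative sketch with PyMongo (it assumes a MongoDB server running on localhost; the database and collection names are made up):

from pymongo import MongoClient

# Connect to a MongoDB server (assumed to run on the default port)
client = MongoClient("localhost", 27017)
collection = client["analysis"]["visits"]   # hypothetical names

# Store a JSON-like document; no fixed schema is required
collection.insert_one({"user_id": 1, "url": "/home"})

# Query matching documents back
for doc in collection.find({"user_id": 1}):
    print(doc)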
Summary
In this chapter, we presented three main points. Firstly, we figured out the
relationship between raw data, information, and knowledge. Secondly, we gave
an overview of the data analysis process and its steps. Finally, we introduced
a few commonly used libraries that are useful for practical data analysis
applications. Among those, in the next chapters, we will focus on the Python
libraries for data analysis.
Practice exercise
The following table describes users' rankings of Snow White movies:

UserID    Sex       Location    Ranking
          Male      Philips
          Male      VN
          Male      Canada
          Male      Canada
          Female    VN
          Female    NY
Exercise 1: What information can we find in this table? What kind of knowledge can
we derive from it?
Exercise 2: Based on the data analysis process in this chapter, try to define the data
requirements and analysis steps needed to predict whether user B likes Maleficent
movies or not.