Python
What is Python?
Python is a widely used high-level, general-purpose, interpreted, dynamic
programming language.
Its design philosophy emphasizes code readability, and its syntax allows
programmers to express concepts in fewer lines of code than possible in
languages such as C++ or Java.
Desktop GUIs
Python is also a scripting language and therefore supports scripts, i.e., programs
written for a special run-time environment that automate the execution of
tasks that could alternatively be executed one by one by a human operator.
A scripting language can also be used to control other programs.
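As a minimal sketch of a script that controls other programs, the example below uses the standard library's subprocess module to run two commands in sequence and collect their output. The helper name run_step and the two commands are illustrative assumptions, not part of the original slides.

```python
import subprocess
import sys


def run_step(args):
    """Run one external command and return its text output, stripped."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout.strip()


# Hypothetical task list: each entry launches a separate Python process.
steps = [
    [sys.executable, "-c", "print('step one done')"],
    [sys.executable, "-c", "print(2 + 3)"],
]

# The script performs the steps one after another, as an operator would by hand.
outputs = [run_step(step) for step in steps]
for line in outputs:
    print(line)
```

The same pattern applies to any command-line program; only the argument lists would change.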
Why Python?
Python is a popular, general-purpose programming language with an
emphasis on being readable and allowing programmers to use fewer lines of
code to accomplish tasks than in older languages.
Python is an excellent tool for data analysis for four reasons:
Open source
Speed
Support
Scope
Python vs C++/Java
Java
Python programs are generally slower than equivalent Java programs.
Python code is usually 3-5 times shorter than equivalent Java code.
A Python programmer wastes no time declaring the types of arguments or variables.
C++
C++ code is generally 5-10 times longer than equivalent Python code.
Summary: Despite its slower runtime, Python is still sometimes preferred to C++/Java
because it avoids complex syntax, which makes programs easier to write and
to read.
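The points about type declarations and brevity can be illustrated with a short sketch. The function name describe and the sample values are made up for illustration.

```python
def describe(values):
    """Return the largest item and the item count; no type declarations needed."""
    return max(values), len(values)


# The same function works for integers and for strings.
int_result = describe([3, 1, 4])
text_result = describe(["pear", "apple", "fig"])

# A variable can be rebound to a value of a different type.
counter = 10
counter = "ten"

# A list comprehension expresses a filter-and-transform loop in one line.
even_squares = [n * n for n in range(10) if n % 2 == 0]

print(int_result, text_result, counter, even_squares)
```

In Java or C++, the function would typically need declared parameter and return types, and the loop would span several lines.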
Installation
There are two approaches to installing Python:
You can download Python directly from its project site,
https://www.python.org/download/, and install the individual components
and libraries you want.
Alternatively, you can download and install a distribution that comes with
pre-installed libraries. I would recommend downloading Anaconda.
Another option could be Enthought Canopy Express.
One of the more popular environments used for Python computing is the IPython/Jupyter
Notebook.
It is an interactive computational environment in which you can combine code execution,
rich text, mathematics, plots, and rich media.
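Whichever installation route is used, a quick check along the following lines can confirm which interpreter is running and which libraries are importable. This is only a sketch: the helper name is_installed and the list of library names are assumptions chosen for illustration.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


print("Python version:", sys.version.split()[0])

# Illustrative list; adjust it to the libraries you plan to use.
for name in ["math", "numpy", "pandas", "matplotlib"]:
    status = "available" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```

With a plain python.org install, third-party libraries will likely show as not installed until they are added; a bundled distribution usually ships many of them.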
Python Libraries
Because Python is an open source language, its developers have been
developing libraries to make various tasks easier.
A library contains multiple modules, which in turn contain sets of dedicated
functions.
Python comes with the Python Standard Library, which contains an extensive set
of built-in functions to carry out various operations.
The libraries can be imported into the code once their package has been
installed on a system.
Once a library is imported, its functions can be called in the program.
Ways to import Python libraries:
In the first manner (import math as m), we define an alias m for the math library. We can now use
various functions from the math library (e.g. factorial) by referencing them through the alias:
m.factorial().
In the second manner (from math import *), you import the entire namespace of math, i.e. you
can use factorial() directly without referring to math.
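The two import styles described above can be sketched as follows. The import statements are inferred from the description (an alias m for math, and a whole-namespace import); the variable names first and second are illustrative.

```python
# First manner: the alias m refers to the math module.
import math as m

first = m.factorial(5)
print(first)

# Second manner: every public name in math is imported directly.
from math import *

second = factorial(5)
print(second)
```

The alias form is generally the safer choice, because a whole-namespace import can silently overwrite names that already exist in your program.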
Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat plots.
You can use the Pylab feature in IPython Notebook (ipython notebook --pylab=inline) to use these
plotting features inline. If you omit the inline option, Pylab turns the IPython environment
into one very similar to MATLAB. You can also use LaTeX commands to add math to
your plot.
Pandas for structured data operations and manipulations. It is extensively used for data
munging and preparation. Pandas was added relatively recently to Python and has been
instrumental in boosting Python's usage in the data science community.
Scikit-learn for machine learning. Built on NumPy, SciPy, and matplotlib, this library contains a
lot of efficient tools for machine learning and statistical modeling, including classification,
regression, clustering, and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to
explore data, estimate statistical models, and perform statistical tests. An extensive list of
descriptive statistics, statistical tests, plotting functions, and result statistics is available for
different types of data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and
informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make
visualization a central part of exploring and understanding data.
SAS vs R vs Python
SAS
Python
Open source counterpart of SAS
Mostly used in academics, research
Latest techniques get released quickly due to open source nature
Well documented
Cost effective
Big Data
Big data is a collection of large datasets that cannot be
processed using traditional computing techniques. Big data is not merely data; it
has become a complete subject that involves various tools, techniques, and
frameworks.
Structured data: relational data.
Semi-structured data: XML data.
Unstructured data: Word, PDF, text, media logs.
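The difference between the three kinds of data can be sketched with the standard library alone. All sample values below are made up for illustration, and real big data workloads would not fit in in-memory strings like these.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: rows with fixed columns, as in a relational table.
table_text = "id,name\n1,Ada\n2,Linus\n"
rows = list(csv.DictReader(io.StringIO(table_text)))

# Semi-structured: XML carries its own tags, but fields may vary per record.
xml_text = "<users><user id='1'><name>Ada</name></user><user id='2'/></users>"
root = ET.fromstring(xml_text)
user_ids = [user.get("id") for user in root.findall("user")]

# Unstructured: free text has no schema, so information must be extracted.
log_text = "ERROR disk full; INFO backup finished"
levels = re.findall(r"\b(ERROR|INFO|WARNING)\b", log_text)

print(rows[0]["name"], user_ids, levels)
```

Note that the second XML record has no name element, which is acceptable in semi-structured data but would violate a fixed relational schema.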
While looking into the technologies that handle big data, we examine the following two
classes of technology:
Operational
Analytical
              Operational    Analytical
Data Scope    Operational    Retrospective
End User      Customer       Data Scientist
Technology    NoSQL
Hadoop
Hadoop is an open-source framework that allows users to store and process big data in a
distributed environment across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
Hadoop runs applications using the MapReduce algorithm, where the data
is processed in parallel on different CPU nodes.
A distributed file system, HDFS (Hadoop Distributed File System), provides
high-throughput access to application data.
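The MapReduce idea can be sketched in plain Python as a word count. This is an in-process illustration under stated assumptions, not Hadoop's API: worker threads stand in for cluster nodes, and the function names map_phase, shuffle_phase, and reduce_phase are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Group all emitted values by their key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Each chunk stands in for a block of data stored on a different node.
chunks = ["big data big", "data tools", "big tools"]

# Threads stand in for nodes: every chunk is mapped independently.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_phase, chunks))

word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because each chunk is mapped without reference to the others, the map step can be spread over as many workers as there are chunks, which is the property that lets the approach scale across a cluster.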
Thank you