18ETCS002122 Assignment (Data Science)
Name: SUDARSHAN C Registration number: 18ETCS002122
Declaration Sheet
Student Name SUDARSHAN C
Reg. No 18ETCS002122
Declaration
The assignment submitted herewith is a result of my own investigations and that I have
conformed to the guidelines against plagiarism as laid out in the Student Handbook. All sections
of the text and results, which have been obtained from other sources, are fully referenced. I
understand that cheating and plagiarism constitute a breach of University regulations and will
be dealt with accordingly.
Signature of the
Date
Student
Signature of the Course Leader and date Signature of the Reviewer and date
Part A
A1.1
Python is an open-source, interpreted, high-level language and provides a great approach to object-oriented programming. It is one of the best languages used by data scientists for various data science projects and applications. Python provides great functionality for dealing with mathematics, statistics and scientific computation, and it offers excellent libraries for data science applications.
One of the main reasons why Python is widely used in the scientific and research communities is its ease of use and simple syntax, which make it easy to adopt for people who do not have an engineering background. It is also well suited for quick prototyping.
• Scalability: Python scales well as projects and teams grow. Among the widely used languages, Python is a leader in this respect, which keeps opening up more possibilities.
• Libraries and Frameworks: Due to its popularity, Python has hundreds of different libraries and frameworks, which is a great addition to your development process. They save a lot of manual effort and can often replace a whole hand-written solution. As a Data Scientist, you will find that many of these libraries are focused on Data Analytics and Machine Learning, and there is also strong support for Big Data. This alone is a strong reason to learn Python as your first language.
• Huge Community: As mentioned before, Python has a powerful community. You might think that this should not be one of the main reasons to select Python, but the truth is the opposite.
A1.2
An identifier is a user-defined name used to represent the basic building blocks of Python. It can be a variable, a function, a class, a module, or any other object.
Examples of valid identifiers:
• num1
• FLAG
Python is completely object-oriented and dynamically typed. You do not need to declare variables before using them, or declare their type. Every variable in Python is an object. A variable is created the moment we first assign a value to it. A variable is a name given to a memory location; it is the basic unit of storage in a program.
A variable name cannot start with a number. A variable name can only contain alphanumeric characters and underscores. Variable names are case-sensitive, e.g. name, Name and NAME are three different variables. Reserved words, i.e. keywords, cannot be used for naming variables. Leading underscores are used by the Python interpreter and in built-in identifiers, so we avoid beginning a variable name with an underscore.
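These rules can be checked with a short sketch (the names here are just illustrations):

```python
# Variables are created on first assignment; no type declaration is needed.
num1 = 10          # valid identifier
FLAG = True        # valid identifier

# Variable names are case-sensitive: these are three different variables.
name = "a"
Name = "b"
NAME = "c"

# Rebinding a name to a value of a different type is allowed (dynamic typing).
num1 = "ten"
print(type(num1).__name__)  # str
```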
A1.3
Namespaces in Python
A namespace is a collection of currently defined symbolic names along with information about the
object that each name references. You can think of a namespace as a dictionary in which the keys
are the object names and the values are the objects themselves. Each key-value pair maps a name to
its corresponding object.
1. Built-In
2. Global
3. Enclosing
4. Local
The built-in namespace contains the names of all of Python’s built-in objects. These are available at
all times when Python is running. The global namespace contains any names defined at the level of
the main program. Python creates the global namespace when the main program body starts, and it
remains in existence until the interpreter terminates. Strictly speaking, this may not be the only
global namespace that exists. The interpreter also creates a global namespace for any module that
your program loads with the import statement.
Not every Python namespace can be accessed from every part of the program. A name is in scope in a part of a program if you can access it there without having to use a prefix.
The lookup order is: the local namespace, then any enclosing namespace, then the global namespace, then the built-in namespace. Also, a nested function creates a nested variable scope inside the outer function's scope.
Example:
a = 1
def func1():
    b = 2
    def func2():
        c = 3
In this code, 'a' is in the global namespace, 'b' is in the local namespace of func1, and 'c' is in the nested local namespace of func2. To func2, 'c' is local, 'b' is nonlocal, and 'a' is global. By nonlocal, we mean it isn't global, but isn't local either. Inside func2 you can write 'c' and read both 'a' and 'b'; but you can't assign to 'a' or 'b' there without the global or nonlocal keyword, because a plain assignment would create a new local variable instead.
There are four kinds of Python namespaces: built-in, global, enclosing, and local, and the same holds for variable scope in Python. The global keyword lets us refer to a name in the global scope; likewise, the nonlocal keyword lets us refer to a name in an enclosing (nonlocal) scope.
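A minimal sketch of the global and nonlocal keywords in action, reusing the names from the example above:

```python
a = 1

def func1():
    b = 2
    def func2():
        nonlocal b   # refer to b in func1's enclosing scope
        global a     # refer to a in the global scope
        b = 20
        a = 10
    func2()
    return b

result = func1()
print(result, a)  # 20 10
```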
A1.4
An exception is an error which happens at the time of execution of a program. While running a program, Python may raise an exception that should be handled to prevent your program from crashing. In Python, exceptions trigger automatically on errors, or they can be raised and intercepted by your code. An exception indicates that, although the event can occur, it happens infrequently. When a method is not able to handle an exception, the exception is propagated to its caller. Eventually, when an exception propagates out of the main function, the program is terminated abruptly.
Python uses try and except keywords to handle exceptions. Both keywords are followed by
indented blocks.
The try block contains one or more statements which are likely to encounter an exception. If the
statements in this block are executed without an exception, the subsequent except: block is
skipped. If the exception does occur, the program flow is transferred to the except: block. The
statements in the except block are meant to handle the cause of the exception appropriately. For
example, returning an appropriate error message.
Try Except in Python: Try and Except statement is used to handle these errors within our code in
Python. The try block is used to check some code for errors i.e. the code inside the try block will
execute when there is no error in the program. Whereas the code inside the except block will
execute whenever the program encounters some error in the preceding try block.
Syntax:
try:
    # Some Code
except:
    # Executed if an error occurs
    # in the try block
Python provides a keyword finally, which is always executed after try and except blocks. The finally
block always executes after normal termination of try block or after try block terminates due to
some exception.
try-except
statements allow one to detect and handle exceptions. There is even an optional else clause for situations where code needs to run only when no exceptions are detected.
try-finally
statements allow only for detection and processing of any obligatory clean-up (whether or not exceptions occur), but otherwise have no facility for dealing with exceptions.
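The full try/except/else/finally flow can be sketched as follows:

```python
def safe_divide(x, y):
    try:
        result = x / y
    except ZeroDivisionError:
        # Runs only when the try block raises this exception.
        return 'division by zero'
    else:
        # Runs only when no exception occurred in the try block.
        return result
    finally:
        # Always runs, whether or not an exception occurred.
        print('division attempted')

print(safe_divide(10, 2))  # 5.0
print(safe_divide(10, 0))  # division by zero
```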
A1.5
1. Pandas:
The pandas library helps you perform data analysis and data manipulation in Python. Additionally, it provides fast and flexible data structures that make it easy to work with relational and structured data.
• It provides sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data.
2. NumPy:
NumPy is mainly used for its support for N-dimensional arrays. Operations on these multi-dimensional arrays can be up to 50 times faster than on Python lists, making NumPy a favourite for data scientists. NumPy is also used by other libraries such as TensorFlow for their internal computation on tensors. NumPy also provides fast precompiled functions for numerical routines which would be hard to implement manually. To achieve better efficiency, NumPy uses array-oriented computation, so working on many values at once becomes easy.
• For numerical data, NumPy arrays are a much more efficient way of storing and
manipulating data
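A minimal sketch of array-oriented computation with NumPy:

```python
import numpy as np

# A 2-dimensional array created from a nested list.
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)  # (2, 3)

# Vectorised computation: every element is doubled without an explicit loop.
doubled = arr * 2
print(int(doubled.sum()))  # 42
```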
3. Scikit-learn:
Scikit-learn is arguably the most important library in Python for machine learning. After cleaning and
manipulating your data with Pandas or NumPy, scikit-learn is used to build machine learning models
as it has tons of tools used for predictive modelling and analysis. There are many reasons to use
scikit-learn. To name a few, you can use it to build several types of machine learning models, supervised and unsupervised, cross-validate the accuracy of models, and conduct feature importance analysis.
4. TensorFlow
TensorFlow is one of the most popular Python libraries for implementing neural networks. It uses multi-dimensional arrays, also known as tensors, which allow it to perform several operations on a particular input. Because it is highly parallel in nature, it can train models across multiple CPUs and GPUs for highly efficient and scalable training; this style of feeding data through the system is often called pipelining.
5. SciPy
As the name suggests, SciPy is mainly used for its scientific and mathematical functions, built on top of NumPy. Some useful functionality this library provides includes statistics functions, optimization functions, and signal processing functions. It also includes functions for computing integrals numerically and solving differential equations.
A1.6:
NumPy is extremely fast for binary data loading and storage, including support for memory-mapped arrays. It is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by Travis Oliphant. It is an open-source project and you can use it freely. NumPy stands for Numerical Python.
Plotly and pandas can be combined to provide interactive features like zooming and panning. The popular pandas data analysis and manipulation tool provides plotting functions on its DataFrame and Series objects, which have historically produced matplotlib plots. Since version 0.25, pandas has provided a mechanism to use different backends, and as of version 4.8 of Plotly, you can use a Plotly-Express-powered backend for pandas plotting. This means you can produce interactive plots directly from a DataFrame, without even needing to import Plotly.
The pandas library offers a flexible and high-performance group-by facility, enabling you to slice and dice, and summarize data sets in a natural way. It is primarily used for data analysis, and it is one of the most commonly used Python libraries. It provides some of the most useful tools to explore, clean, and analyse your data.
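A small sketch of the group-by facility mentioned above (the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'city':  ['A', 'A', 'B', 'B'],
    'sales': [10, 20, 30, 40],
})

# Summarise sales per city in a single expression.
totals = df.groupby('city')['sales'].sum()
print(int(totals['A']), int(totals['B']))  # 30 70
```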
Part B
B.1
B1.1
The Pandas Series Object:
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
A DataFrame object is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects, for example (where population and area are two Series sharing an index):
states = pd.DataFrame({'population': population,
                       'area': area})
Parsing a JSON dataset using pandas is much more convenient. Pandas allows you to convert a list of lists into a DataFrame and to specify the column names separately. A JSON parser, which transforms JSON text into another representation, must accept all texts that conform to the JSON grammar; it may also accept non-JSON forms or extensions.
Working with large JSON datasets can be challenging, particularly when they are too large to fit into memory. In cases like this, a combination of command-line tools and Python can make for an efficient way to explore and analyse the data.
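For JSON files too large for memory, pandas can also iterate over newline-delimited JSON in chunks; a sketch (the file name and contents are made up):

```python
import json
import pandas as pd

# Write a small newline-delimited JSON file to iterate over.
with open('records.json', 'w') as f:
    for i in range(10):
        f.write(json.dumps({'id': i, 'value': i * 2}) + '\n')

# chunksize returns an iterator of DataFrames instead of one large frame.
total = 0
for chunk in pd.read_json('records.json', lines=True, chunksize=4):
    total += int(chunk['value'].sum())

print(total)  # 0 + 2 + ... + 18 = 90
```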
Now you can read the JSON and save it as a pandas data structure, using the command read_json.
import pandas as pd
data = pd.read_json('http://api.population.io/1.0/population/India/today-and-tomorrow/?format=json')
print(data)
pandas.ExcelFile.parse() is also a parsing function.
The pandas.read_csv() function reads data in text format. Passing header=None reads data that does not have a header row, and the names argument lets you initialise the header yourself.
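A sketch of both read_csv cases (the file name and column names are illustrative):

```python
import pandas as pd

# Create a small headerless text file to read.
with open('scores.txt', 'w') as f:
    f.write('18ETCS002112,36\n18ETCS002122,32\n')

# No header row: header=None assigns numeric column labels.
df1 = pd.read_csv('scores.txt', header=None)

# Initialising the header ourselves with the names argument.
df2 = pd.read_csv('scores.txt', header=None, names=['reg_no', 'marks'])

print(list(df1.columns))  # [0, 1]
print(list(df2.columns))  # ['reg_no', 'marks']
```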
B1.2
An SQLite database is normally stored in a single ordinary disk file. However, in certain
circumstances, the database might be stored in memory.
The most common way to force an SQLite database to exist purely in memory is to open the
database using the special filename ":memory:". In other words, instead of passing the name of a
real disk file into one of the sqlite3_open(), sqlite3_open16(), or sqlite3_open_v2() functions, pass in
the string ":memory:". For example:
rc = sqlite3_open(":memory:", &db);
When this is done, no disk file is opened. Instead, a new database is created purely in memory. The
database ceases to exist as soon as the database connection is closed. Every :memory: database is
distinct from every other. So, opening two database connections each with the filename ":memory:"
will create two independent in-memory databases.
The special filename ":memory:" can be used anywhere that a database filename is permitted, for example as the filename in an ATTACH command.
Note that in order for the special ":memory:" name to apply and to create a pure in-memory
database, there must be no additional text in the filename. Thus, a disk-based database can be
created in a file by prepending a pathname, like this: "./:memory:".
The special ":memory:" filename also works when using URI filenames. For example:
rc = sqlite3_open("file::memory:", &db);
In-memory databases are allowed to use shared cache if they are opened using a URI filename. If the unadorned ":memory:" name is used to specify the in-memory database, then that database always has a private cache and is thus only visible to the database connection that originally opened it.
However, the same in-memory database can be opened by two or more database connections as
follows:
rc = sqlite3_open("file::memory:?cache=shared", &db);
This allows separate database connections to share the same in-memory database. Of course, all
database connections sharing the in-memory database need to be in the same process. The
database is automatically deleted and memory is reclaimed when the last connection to the
database closes.
If two or more distinct but shareable in-memory databases are needed in a single process, then the
mode=memory query parameter can be used with a URI filename to create a named in-memory
database:
rc = sqlite3_open("file:memdb1?mode=memory&cache=shared", &db);
When an in-memory database is named in this way, it will only share its cache with another
connection that uses exactly the same name.
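The same behaviour can be observed from Python's built-in sqlite3 module; a sketch mirroring the C calls above:

```python
import sqlite3

# Two plain ":memory:" connections are two independent databases.
con1 = sqlite3.connect(':memory:')
con2 = sqlite3.connect(':memory:')

con1.execute('CREATE TABLE t (x INTEGER)')
con1.execute('INSERT INTO t VALUES (1)')

# con2 cannot see con1's table: the databases are distinct.
try:
    con2.execute('SELECT * FROM t')
    shared = True
except sqlite3.OperationalError:
    shared = False

print(shared)  # False
```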
B1.3
Tasks are the building blocks of Celery applications. A task is a class that can be created out of any
callable. It performs dual roles in that it defines both what happens when a task is called (sends a
message), and what happens when a worker receives that message. Every task class has a unique
name, and this name is referenced in messages so the worker can find the right function to execute.
A task message is not removed from the queue until that message has been acknowledged by a
worker. A worker can reserve many messages in advance and even if the worker is killed – by power
failure or some other reason – the message will be redelivered to another worker.
A minimal task can be defined with the @shared_task decorator:

@shared_task
def add(x, y):
    return x + y

You can also easily create a task from any callable by using the app.task() decorator:

@app.task
def add(x, y):
    return x + y

There are also many options that can be set for the task; these can be specified as arguments to the decorator.
Python decorators allow you to change the behaviour of a function without modifying the function itself. We use a decorator when we need such a change in behaviour; a few good examples are when you want to add logging, test performance, perform caching, verify permissions, and so on.
Decorators are usually placed before the definition of the function you want to decorate. As an example, we can create a simple decorator that converts a sentence to uppercase, by defining a wrapper inside an enclosing function. Python provides an easy way for us to apply decorators: we simply use the @ symbol before the function we'd like to decorate. We can also apply multiple decorators to a single function.
def my_function():
    print('I am a function.')

# Assign the function to a variable without parentheses. We don't want to execute the function.
description = my_function
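The uppercase decorator described above can be sketched as:

```python
def uppercase(func):
    # The wrapper changes behaviour without modifying func itself.
    def wrapper():
        return func().upper()
    return wrapper

@uppercase
def greet():
    return 'hello there'

print(greet())  # HELLO THERE
```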
• At this point, we have a ready environment. Let's test it by sending a task that will calculate the square root of a value and return a result. First, we must define our task module tasks.py inside the server. In the following chunk of code, we have the imports necessary for our function that will calculate the square root:

from math import sqrt
from celery import Celery

• Now, let's create the instance of Celery which will represent our client application, informing the broker:

app = Celery('tasks', broker='redis://192.168.25.21:6379/0')

• Then, we have to set up our result backend, which will also be in Redis, as follows:

app.conf.CELERY_RESULT_BACKEND = 'redis://192.168.25.21:6379/0'
• With the basics ready, let's define our task with the @app.task decorator:
@app.task
def square_root(value):
    return sqrt(value)
• At this point, since we have our tasks.py module defined, we need to initiate our workers
inside the server, where Redis and Celery (with support to Redis) are installed.
• Now, we have a Celery server waiting to receive tasks and send them to workers. The next
step is to create an application on the client side to call tasks.
• In the machine that represents the client, we have our virtual environment celery_env already set up as we did earlier. So, now it is simple to create, step by step, a module task_dispatcher.py, as follows:
1. We import the logging module to exhibit information referring to the execution of the program, and the Celery class inside the celery module, as follows:

import logging
from celery import Celery

2. The next step is to create an instance of the Celery class, informing the module containing the tasks and then the broker, as done on the server side, and to set the result backend. This is done with the following code:

app = Celery('tasks', broker='redis://192.168.25.21:6379/0')
app.conf.CELERY_RESULT_BACKEND = 'redis://192.168.25.21:6379/0'
3. Let us create a function to encapsulate the sending of the square_root(value) task. We will create the manage_sqrt_task(value) function as follows:

def manage_sqrt_task(value):
    result = app.send_task('tasks.square_root', args=(value,))
    logging.info(result.get())

if __name__ == '__main__':
    manage_sqrt_task(4)
Broker
• The broker is definitely a key component in Celery. Through it, we get to send and receive messages and communicate with workers.
• The most complete brokers in terms of functionality are RabbitMQ and Redis. We will use Redis as the broker as well as the result backend.
B1.4
Celery has an architecture based on pluggable components and a mechanism of message exchange
that uses a protocol according to a selected message transport (broker).
• The client components, as presented in the previous diagram, have the function of creating
and dispatching tasks to the brokers.
• A task is defined by using the @app.task decorator, which is accessible through an instance of the Celery application that, for now, will be called app.
• There are several types of tasks: synchronous, asynchronous, periodic, and scheduled. When
we perform a task call, it returns an instance of type AsyncResult.
• The AsyncResult object allows the task status to be checked, as well as its result when it exists. However, to make use of this mechanism, another component, the result backend, has to be active.
• The message transport (broker) is definitely a key component in Celery. Through it, we get to send and receive messages and communicate with workers.
• The most complete brokers in terms of functionality are RabbitMQ and Redis. We will use Redis as the broker as well as the result backend.
• Workers are responsible for executing the tasks they have received. Celery provides a series of mechanisms so that we can find the best way to control how workers behave, such as: concurrency mode, remote control, and revoking tasks.
• The result backend component has the role of storing the status and result of the task to return to the client application. Among the result backends supported by Celery, we can highlight RabbitMQ, Redis, MongoDB, and Memcached, among others.
Virtual environment:
We will set up two machines in Linux. The first one, hostname foshan, will perform the client role, where the Celery app will dispatch the tasks to be executed. The other machine, hostname Phoenix, will perform the role of the broker and result backend, and host the queues consumed by workers.
Client Machine
• We will set up a virtual environment with Python 3.3, using the tool pyvenv. The goal of
pyvenv is to not pollute Python present in the operating system with additional modules, but
to separate the developing environments necessary for each project.
• Now that we have a virtual environment, starting from the point where you already have setuptools or pip installed, we will install the necessary packages for our client. Let's install the Celery framework with the following command:

$ pip install celery

• To set up the server machine, we will start by installing Redis, which will be our broker and result backend. Once installed, the Redis server is started with the following command:

$ redis-server
B.2
B2.1
B2.2
import sqlite3

# The table and column names below are illustrative.
query = """
CREATE TABLE marks
(reg_no VARCHAR(20), sub1 INTEGER, sub2 INTEGER, sub3 INTEGER
);"""

con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()

stmt = "INSERT INTO marks VALUES(?, ?, ?, ?)"
data = [('18ETCS002112', 36, 44, 29),
        ('18ETCS002122', 32, 36, 34)]
con.executemany(stmt, data)
con.commit()

import pandas as pd
cursor = con.execute('SELECT * FROM marks')
df = pd.DataFrame(cursor.fetchall(), columns=[x[0] for x in cursor.description])
df
B2.3
Code for a task that will calculate the square root of a value: we import the Celery class from the celery package and sqrt from the math module, and using the @app.task decorator we register the task (the broker address is illustrative).

from math import sqrt
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def square(n):
    return sqrt(n)

square(9)

Calling square(9) directly runs the function locally; square.delay(9) would send it to a worker instead.