Python
chanddra.p@gmail.com
UVBL5MQSJ8
Ans: color_palette() is a Seaborn function that can be used to give colours to plots
and give them additional artistic appeal.
Q: What are histograms in Seaborn?
Ans: Histograms show the distribution of data by constructing bins throughout the data's range
and then drawing bars to show how many observations fall into each bin.
• Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data
sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or
lack of outliers. A uniform distribution would be the extreme case.
• A kurtosis greater than three indicates positive excess kurtosis (heavier tails than a normal distribution). Kurtosis itself
is never lower than 1 and has no upper limit.
• A kurtosis less than three indicates negative excess kurtosis (lighter tails). Excess kurtosis (kurtosis minus three)
therefore ranges from -2 to infinity. Higher kurtosis is often described as a higher peak, although it is driven mainly by the tails.
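The thresholds above can be checked with a small sketch. It assumes the usual moment-based definition (fourth central moment divided by the squared variance); the function name `kurtosis` and the two sample lists are made up for illustration.

```python
from statistics import fmean

def kurtosis(values):
    """Moment-based kurtosis: fourth central moment / variance squared."""
    mean = fmean(values)
    m2 = fmean((v - mean) ** 2 for v in values)   # variance (population form)
    m4 = fmean((v - mean) ** 4 for v in values)   # fourth central moment
    return m4 / m2 ** 2

light_tails = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # evenly spread, no outliers
heavy_tails = [5, 5, 5, 5, 5, 5, 5, 5, 5, 50]    # one extreme outlier

print(kurtosis(light_tails))   # below 3: light tails
print(kurtosis(heavy_tails))   # above 3: heavy tails
```

Subtracting 3 from either result gives the excess kurtosis.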
This file is meant for personal use by chanddra.p@gmail.com only.
Sharing or publishing the contents in part or full is liable for legal action.
Subjective Question: Example 5
Q: Discuss the importance of Categorical Data Encoding. Illustrate with any example.
Categorical data is a type of data that is used to group information with similar characteristics, while numerical data is a
type of data that expresses information in the form of numbers.
Example of categorical data:
– Weather conditions: “sun”, “rain”, “overcast”, “snow”, etc.
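As an illustration of encoding, the sketch below turns the weather categories into numbers in two common ways, label encoding and one-hot encoding. It uses plain Python only; the variable names and the sample observations are assumptions made for this example.

```python
# Sample observations of a categorical feature (made up for illustration).
weather = ["sun", "rain", "overcast", "snow", "rain"]

# Label encoding: give every category an integer code.
categories = sorted(set(weather))            # ['overcast', 'rain', 'snow', 'sun']
label_codes = {name: code for code, name in enumerate(categories)}
label_encoded = [label_codes[w] for w in weather]

# One-hot encoding: one 0/1 column per category.
one_hot = [[1 if w == name else 0 for name in categories] for w in weather]

print(label_encoded)
for row in one_hot:
    print(row)
```

Most models only accept numerical input, which is why a step like this is usually needed before training on categorical columns.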
• Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Uses include data
cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine
learning, and much more.
• Jupyter has support for over 40 different programming languages and Python is one of them.
Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter Notebook
itself.
If you want to create a machine learning model but don’t have a computer that can take
the workload, Google Colab is the platform for you. Even if you have a GPU or a good computer,
creating a local environment with Anaconda, installing packages and resolving installation
issues is a hassle.
Colaboratory is a free Jupyter notebook environment provided by Google where you can use free
GPUs and TPUs, which avoids all these issues.
Getting Started
To start working with Colab you first need to log in to your Google account, then go to this
link: https://colab.research.google.com.
Sl. No. 1
Question: What is the output:
    def function1(number):
        return number + 25

    function1(5)
    print(number)
A. 25
B. 5
C. 10
D. Name Error
Correct Answer: Name Error
What is the output for the code:
function('Ananda', 25)
Sl. No. 4
Question: Which of the following statements is incorrect about the following code?
A. person1 and person2 are two different instances of the Student class.
B. The __init__ method is used to set initial values for attributes.
C. 'self' is not needed in statement-- "def name_print(self):"
D. person2 has a different value for 'name' than person1.
Correct Answer: 'self' is not needed in statement-- "def name_print(self):"
Sl. No. 7
Question: What’s the output of the following code snippet?
A. Walking
B. Speaking!
C. Lecture!
D. AttributeError: 'Teacher' object has no attribute 'walk'
Correct Answer: Walking
Sl. No. 8
    def Details(self):
        print(self.firstname, self.lastname, self.age)

    class Student(Person):
        pass

    x = Student("Mike", "Olsen","19")
    x.Details()
A. Mike, "Olsen","19"
B. 19
C. Mike Olsen 19
D. Error
Correct Answer: Mike Olsen 19
Sl. No. 9
Question:
    Marks = [24,25,26,30]
You need to convert the above list to an ndarray named "marks_array".
Which of the following is the correct syntax?
A. marks_array = np.array.Marks
B. marks_array = np.array(Marks)
C. marks_array = np.Marks
D. marks_array = np.Marks(array)
Correct Answer: marks_array = np.array(Marks)
Suppose you have following code:
Result
Sl. No. 11
Question: Suppose you have the following code:
    B = np.array([2, 3, 4, 5])
    Result = A+B
    Result
A. [22,33,44,55,60]
B. [22,33,44,55]
C. [202,303,404,505,60]
D. Error
Correct Answer: Error
12 array([90 12 13 15 array([90 13 14 15
arr = np.array([90,12,13,14,15,16]) array([90 14]) array([90 13 15 16]) array([90 14])
16]) 16])
arr[0:6:3]
Sl. No. 13
Question: What is the correct syntax for finding out the element with maximum value
in a multi dimensional array named multi_arr?
A. multi_arr(max)
B. multi_arr.max()
C. multi_arr_max()
D. multi_arr.max(element)
Correct Answer: multi_arr.max()
Consider the following code snippet:
arr=np.arange(1,26)
Sl. No. 17
Question:
    import pandas as pd
    b = {"Name": ["Amita", "Any", "Ravi"],
         "RollNo": [10, 20, 3]}
    Data = pd.DataFrame(b)
In given code dataframe ‘Data’ has how many rows and columns?
A. Row - 3 Column - 3
B. Row - 3 Column - 2
C. Row - 2 Column - 3
D. Row - 2 Column - 2
Correct Answer: Row - 3 Column - 2
Sl. No. 18
Question: The above dataframe has how many rows and columns
A. 3 rows, 2 Columns
B. 2 rows, 3 Columns
C. 1 row, 1 Column
D. 3 rows, 3 Columns
Correct Answer: 3 rows, 2 Columns
Sl. No. 19
Question: What is the after effect of the above code
A. Deletes the entire column 'Score' from dataframe "df1"
B. Deletes the row with 'Score' value = 87 from dataframe "df1"
C. Deletes the row with 'Score' value = 88 from dataframe "df1"
D. Deletes the entire column 'Subject' from dataframe "df1"
Correct Answer: Deletes the row with 'Score' value = 87 from dataframe "df1"
Sl. No. 20
Question: Consider the following lists:
    x = [11,22,33,44,55]
    y = [9,18,27,36,45]
Which of the following is the correct syntax for plotting a scatter plot using the above lists?
Every option begins with the same three lines:
    import matplotlib.pyplot as plt
    x = [11,22,33,44,55]
    y = [9,18,27,36,45]
and then continues as follows:
A. plt.scatter(x,y)
   plt.show()
B. plt.barh(x,y)
   plt.show()
C. plt.bar(x,y)
   plt.show()
D. plt.show()
Correct Answer:
    import matplotlib.pyplot as plt
    x = [11,22,33,44,55]
    y = [9,18,27,36,45]
    plt.scatter(x,y)
    plt.show()
Sl. No. 21
Question: Consider the following lists:
    x = [apple, banana, orange, guava, grapes]
    y = [9,18,27,36,45]
Which of the following is the correct syntax for plotting a bar chart using the above lists?
Every option begins with the same three lines:
    import matplotlib.pyplot as plt
    x = [apple, banana, orange, guava, grapes]
    y = [9,18,27,36,45]
and then continues as follows:
A. plt.histogram(x,y)
   plt.show()
B. plt.bar(x,y)
   plt.show()
C. plt.scatter(x,y)
   plt.show()
D. plt.pie(x,y)
   plt.show()
Correct Answer:
    import matplotlib.pyplot as plt
    x = [apple, banana, orange, guava, grapes]
    y = [9,18,27,36,45]
    plt.bar(x,y)
    plt.show()
Sl. No. 27
Question: Which of the following statements would give result to an output like this one?
A. dataframe.notnull()
B. Either dataframe.notnull() or dataframe.isnull()
C. dataframe.isnull().sum()
D. dataframe.notnull().sum()
Correct Answer: Either dataframe.notnull() or dataframe.isnull()
Sl. No. 28
Question: Which of the following is not shown by dataframe.info statement?
A. Column name
B. Not null count
C. Datatype
D. Min value of each column
Correct Answer: Min value of each column
Output:
The original String is: Data
The modified String is: #a#a
Sample Example
Input :
Enter a Number: 1221
Output :
The given Number 1221 is a palindrome
Input Format:
The first line contains the number n.
Output Format
Print the dictionary in one line.
Example :
6
{1: 1, 2: 8, 3: 27, 4: 64, 5: 125, 6: 216}
8
{1: 1, 2: 8, 3: 27, 4: 64, 5: 125, 6: 216, 7: 343, 8: 512}
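Judging from the two examples, the dictionary maps every integer from 1 to n to its cube. One possible sketch, with the function name `cubes_up_to` chosen only for this illustration:

```python
def cubes_up_to(n):
    """Return {1: 1, 2: 8, ..., n: n**3} using a dictionary comprehension."""
    return {i: i ** 3 for i in range(1, n + 1)}

# In the exercise n would be read with int(input()); it is fixed here
# so that the sketch runs without user interaction.
print(cubes_up_to(6))
```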
Input format:
The first line of the input contains the number n for which you have to find whether it is a power of
2 or not.
Output Format:
Print 'YES' or 'NO' accordingly without quotes.
Sample Example 1
Input :
Enter a number: 90
Output :
Sample Example 2
Input :
Enter a number: 216
Output :
YES
5. Write a Python program to create an array of 5 integers and display the array items. Access
individual element through indexes.
Sample Output:
2
4
6
7
9
Access first three items individually
2
4
6
6. Write a Python program to reverse the order of the items in the array.
Sample Output
Original array: array('i', [1, 3, 5, 3, 7, 1, 9, 3])
Reverse the order of the items:
array('i', [3, 9, 1, 7, 3, 5, 3, 1])
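The `array('i', ...)` notation in the sample output suggests the standard-library `array` module. A minimal sketch under that assumption, with `numbers` as an illustrative variable name:

```python
from array import array

# 'i' is the type code for signed integers.
numbers = array('i', [1, 3, 5, 3, 7, 1, 9, 3])
print("Original array:", numbers)

# reverse() changes the order of the items in place.
numbers.reverse()
print("Reverse the order of the items:")
print(numbers)
```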
Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Introduction to Python
Learning outcomes:
Knowing about python
Features of python
Installing Anaconda Navigator
Launching Jupyter notebook
Basics about Jupyter notebook
Python, now among the most in-demand programming languages, in fact started as a hobby
project by its creator, Guido van Rossum, to keep him occupied during Christmas.
Today, almost all big companies use Python for their services in some way or the other.
Among the renowned ones are Google, Pinterest, Netflix and Quora.
First things first, what is Python?
Python is a programming language, just like C, C++ and Java. It is a scripting language. It is
object-oriented: this means that its paradigm is based on ‘objects’ and ‘classes’. Python is
dynamically typed, meaning the interpreter gives a variable its type at runtime based
on its value, and type checking also happens at runtime.
Features of Python:
Python has various features, the major ones being:
Easy to understand: Python code is easy to understand because the syntax is
uncomplicated and close to English. Python does not use braces to mark blocks such as
functions; it uses indentation, which makes the code look clean and neat, and thus readable.
High-level language: a high-level programming language is one that is user-friendly
and resembles natural human language.
It is an interpreted language: Python code is executed one statement at a time, unlike
C++, which is compiled as a whole before it runs. The interpreter displays the output one line at
a time.
Python IDEs
What is an IDE?
An Integrated Development Environment, in easy words, allows programmers to combine
the various tools needed to write a program in a single GUI-based application. An IDE ideally
consists of a source code editor, build automation tools and a debugger. Some IDEs are
multi-language, like Eclipse and Visual Studio. IDEs are easy to set up and make
development faster and easier, thus saving effort. IDEs also help correct errors and show
where the code is wrong.
In Python, the most frequently used IDEs include Spyder, Jupyter, PyCharm, IDLE and Atom.
For the course, we will be using Jupyter, which is part of the Anaconda distribution.
Anaconda Distribution:
The Anaconda distribution is a Python and R data science distribution. It is easy to download and
is open source. It has over 7500 packages. A package is a collection of modules. All of it is
freely available, and Anaconda also provides community support for any
Python-related queries one has.
Steps to Install Anaconda:
On the website, click on download for your respective operating system (i.e., Windows,
Mac, Linux).
The site should give you a prompt to save the file; select the location where you
want to place the file.
Once downloaded, open it. You should see a prompt like this. Click on Next.
Click on ‘I agree’ and do not change any settings/presets that are there. Click on
Next. Specify a destination on the computer. Click Next and it should start the
installation. Once done, click on Next.
Once you are done downloading the Anaconda Navigator, you will be redirected to a
website. For tutorials you can glance over the website and explore.
The ‘8888’ part in the URL might change if another notebook is open in the background.
The files shown on the page are the ones that are on your computer.
There are three types of cells in the Jupyter notebook, namely, Code,
Markdown, and Raw cells.
Code cells are used to write the code of the program. The code has to be properly
indented and must have correct syntax.
Markdown cells are used to document what you write; they hold descriptive text.
Raw cells are a place where you can write the output directly. These cells are not
evaluated. They are like comments.
Every cell is a Code cell by default. One can change its type using the drop-down on the
Toolbar.
SHORTCUTS
(single-key shortcuts work in command mode, i.e. when the cell is selected but not being edited)
Operation                          Shortcut Key
Run                                Ctrl + Enter
Create a new cell                  A (above) or B (below)
Copy a cell                        C
Paste cell                         V (below) or Shift + V (above)
Delete cell                        Press D twice
Change type of Cell to: Code       Y
Change type of Cell to: Markdown   M
Change type of Cell to: Raw Cell   R
Save (edit checkpoint)             Ctrl + S
Unit Name: Data Structures in Python
Learning Objectives:
1. Variables:
- What are variables?
- Types of variables
2. Operators:
- What are operators?
- Different types of operators- arithmetic, logical, assignment, comparison,
identity, and bitwise.
3. Lists:
- Defining lists
- Adding, deleting, duplicating lists, and other functions of lists.
- Slicing in lists.
4. Tuples:
- Defining tuples.
- Various functions in tuples
- Discussing the difference between tuples and lists.
5. Sets:
- Defining sets.
- Functions of sets.
6. Dictionaries:
- Defining dictionaries
- Functions of dictionaries
7. Converting from one data structure to another.
Conventionally, names in a programming language follow one of the following case styles:
1. snake_case: it is a naming type where every word is separated by an underscore.
Ex- shoe_color, city_location
2. camelCase: naming type where the first word is in lowercase and the initial of every
new word is in uppercase. Ex- shoeColor, cityLocation
3. PascalCase: it is a naming type in which the initial of every word is in uppercase and
all other characters are in lowercase. Ex- ShoeColor, CityLocation.
Usually, in Python, snake_case is used for variable names. However, it must be noted that this
is a convention: people all over the world use snake_case, but it is not compulsory.
Although one can name a variable in whichever way they see fit, there are some rules
that must be followed while naming variables:
1. Variable names are case sensitive. Therefore, apple and Apple are treated as two
different variables.
2. The name of a variable must start with a letter ( a-z in lowercase or uppercase) or an
underscore ( _ ). For ex- alpha, Alpha, _alpha
3. Special characters like +, - , * , /, etc. are not allowed while naming variables.
4. Variable names cannot begin with a number. Ex- 1Alpha will not be a valid variable
name.
5. Python keywords cannot be used as variable names. Keywords include break,
continue, pass, etc.
Keywords are specially reserved words in Python. These keywords have a specific function.
For instance, the ‘continue’ keyword is used inside a loop to skip the rest of the current
iteration and move on to the next one; likewise, ‘break’ is used in situations when the desired
output is obtained and the loop is to be stopped.
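The naming rules can be checked programmatically. The sketch below relies on the built-in `str.isidentifier()` method and the standard `keyword` module; the helper name `is_valid_variable_name` and the candidate names are assumptions made for this illustration.

```python
import keyword

def is_valid_variable_name(name):
    """A name is usable if it is a legal identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)

candidates = ["shoe_color", "_alpha", "Alpha", "1Alpha", "shoe-color", "break"]
for name in candidates:
    print(name, is_valid_variable_name(name))

# The complete list of reserved words for the running Python version:
print(keyword.kwlist)
```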
Operators:
Operators are symbols that carry out arithmetic and logical calculation. So all variables and
numbers are operands and the symbols are operators.
Types of operators:
1. Arithmetic operators: these are used to perform basic mathematical operations.
3. Comparison operators: as the name suggests, these operators are used for
comparison of two variables or values. They return Boolean values.
Operator   Function                    Example
==         Equal to                    c == d
!=         Not equal to                c != d
>          Greater than                c > d
<          Less than                   c < d
>=         Greater than or equal to    c >= d
<=         Less than or equal to       c <= d
5. Identity Operators: they are used to check whether two names refer to the very same
object or not.
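A short sketch of the arithmetic, comparison and identity operators in action. The values and variable names are chosen only for illustration.

```python
a, b = 7, 2

# Arithmetic operators
print(a + b, a - b, a * b)    # addition, subtraction, multiplication
print(a / b, a // b, a % b)   # true division, floor division, remainder
print(a ** b)                 # exponentiation

# Comparison operators always return a Boolean
is_greater = a > b
is_equal = a == b

# Identity operators compare objects, not values
first = [1, 2, 3]
second = first            # second refers to the same list object
third = [1, 2, 3]         # equal contents, but a separate object
same_object = first is second
different_object = first is not third
equal_values = first == third
print(is_greater, is_equal, same_object, different_object, equal_values)
```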
Declaring a variable:
To declare a variable, type a variable name and then assign a value to it. When you print the
variable name, it displays the value stored inside the variable.
Casting of Variables: explicitly specifying the type of data a value should have is called casting of a variable.
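A minimal sketch of declaring variables and casting them with the built-in `int()`, `float()`, `str()` and `bool()` functions. All names and values are made up for this example.

```python
# Declaring variables: name on the left, value on the right.
age = 25
price_text = "19.99"

# Casting: converting a value to another type explicitly.
price = float(price_text)   # str  -> float
whole_price = int(price)    # float -> int (the fractional part is dropped)
age_text = str(age)         # int  -> str
flag = bool(1)              # int  -> bool

print(age, type(age))
print(price, type(price))
print(whole_price, age_text, flag)
```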
2. Boolean:
The Boolean type of data stores only two values: True and False. In arithmetic,
True and False behave like 1 and 0.
3. Strings: store string/text values. They are enclosed within double quotes or single
quotes.
Examples of primitive data type:
All elements inside the list are indexed. Indexing in python starts from 0. Therefore, “Alpha”
in the above example has the index 0 in variable “letters”, while the position of “gamma” is
2.
To view a particular element from the list use the index number along with variable name
while displaying it.
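A small sketch of index access. The original example list is not reproduced in these notes, so the list below is an assumption: only "Alpha" at index 0 and "gamma" at index 2 come from the text, and "beta" is a placeholder.

```python
# "beta" is an assumed placeholder for the element at index 1.
letters = ["Alpha", "beta", "gamma"]

print(letters[0])    # first element, index 0
print(letters[2])    # third element, index 2
print(letters[-1])   # negative indexes count from the end
```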
Since lists are mutable, we can add, delete, duplicate and change list elements.
Lists can also contain duplicate values, because each element is identified by its index rather
than by its value. A few other functions of lists include:
len() – returns the length of the list.
type() – returns the variable datatype.
sort() – sorts the list in ascending order, in place.
reverse() – reverses the order of the list, in place.
copy() – creates a duplicate (shallow copy) of the whole list.
count() – counts how many times a given value occurs in the list, etc.
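The functions listed above can be tried on a small list. The list `marks` and its values are made up for illustration.

```python
marks = [30, 24, 26, 24]

length = len(marks)             # number of elements
kind = type(marks)              # <class 'list'>
occurrences = marks.count(24)   # how many times 24 appears
backup = marks.copy()           # independent shallow copy of the list

marks.sort()                    # ascending order, changes the list itself
ascending = marks.copy()
marks.reverse()                 # reverses the current order in place
descending = marks.copy()

marks.append(28)                # add an element at the end
marks.remove(26)                # delete the first occurrence of 26

print(length, kind, occurrences)
print(backup, ascending, descending, marks)
```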
Slicing in lists:
If you want to get, say, all the elements beginning from the 3rd until the end, then you can
slice the list.
Using slicing you can specify the index range: the point where you want to begin and the
point where you want to end (the element at the end index itself is not included).
For instance, if you want to display 1 to 10 in a list of 1 to 100 elements, then:
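A possible sketch of this slice, assuming the list holds the integers 1 to 100 in order. The variable names are illustrative.

```python
numbers = list(range(1, 101))   # the elements 1 to 100

first_ten = numbers[0:10]       # indexes 0..9, i.e. the values 1 to 10
from_third = numbers[2:]        # from the 3rd element until the end
every_tenth = numbers[9::10]    # optional third value: the step

print(first_ten)
print(from_third[:5])
print(every_tenth)
```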
5. Tuples:
Tuples are another type of data structure used to store multiple items in one
variable. Unlike lists, tuples are immutable, i.e. they cannot be changed. Tuples are also
ordered: their items cannot be shuffled, nor can the position of an item inside a tuple be
changed. The way we used [] to create lists, here, we use (). For instance:
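A minimal sketch of creating a tuple and of what happens when one tries to change it. The tuple `colours` is made up for this example.

```python
colours = ("red", "green", "blue")

first = colours[0]              # index access works as with lists
position = colours.index("blue")
how_many = colours.count("red")

# Tuples are immutable: assigning to an index raises a TypeError.
try:
    colours[0] = "yellow"
except TypeError as error:
    message = str(error)
    print("Cannot change a tuple:", message)
```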
All of these functions look very similar to those of lists. So how are the two different, you might
ask?
One difference is that tuples take up less memory than lists do, which usually makes them
slightly faster to work with. They are immutable, unlike lists. Lists can easily be reordered, while tuples
cannot. And the easiest difference to identify between them is the type of bracket used:
lists use [] while tuples use ().
One should use tuples when the order of items is fixed and the items are certain
not to change. Lists, however, are easy to edit and can be used
when the elements need to be manipulated.
6. Sets:
Sets are the third type of data structure in Python. The key feature of sets is that they are
unordered and unindexed. As in mathematics, sets are written inside curly
braces {}. When we say that sets are unordered, we mean that every time a variable of
Apart from the mentioned functions, sets have numerous other functions like .isdisjoint(),
.issubset(), .issuperset(), .update(), etc.
Sets are helpful when you need to do mathematical operations such as combining or
separating items from two different sets. They help remove duplicates from lists and tuples.
Membership checks are also faster on sets than on lists.
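The set operations mentioned above, sketched on two small sets. The sets and variable names are assumptions made for illustration.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b                 # items in either set
intersection = a & b          # items in both sets
difference = a - b            # items only in a

# Converting to a set removes duplicates from a list.
unique = set([1, 1, 2, 3, 3])

small_is_subset = {1, 2}.issubset(a)
no_overlap = a.isdisjoint({9, 10})

print(union, intersection, difference, unique)
print(small_is_subset, no_overlap, 3 in a)
```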
7. Dictionary:
A dictionary is a data structure that stores data in the key : value format. It is
mutable, so it can be changed. Since Python 3.7 it keeps its items in insertion order.
Keys cannot be duplicated. Dictionaries are written within curly braces. Every key has a
value. Just like indexes in a list, keys are used to identify values here. Dictionaries can be
used for bivariate data.
A list is a collection of values that can be identified by their indexes. For the same
reason, lists are ordered. In a dictionary, however, ‘keys’ do the
work of indexes: values are looked up by key, not by position.
Another important difference is the use of brackets. While
lists use square brackets [], dictionaries use curly braces {}. Inside a list, every
element is separated by a comma. In a dictionary, a key is written, followed by a colon and its
value, and each key : value pair is separated from the next by a comma.
Declaring a dictionary:
To declare a dictionary, you enter the variable name and enter the elements in key:value
form.
To print a specific value, enter the key name after the variable. For example:
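A minimal sketch of declaring a dictionary and reading or changing its values by key. The dictionary `student` and its contents are made up for this example.

```python
# Declaring a dictionary in key: value form.
student = {"name": "Amita", "roll_no": 10}

# Reading a value: put the key in square brackets after the variable.
print(student["name"])

# Dictionaries are mutable: values can be changed and new keys added.
student["roll_no"] = 11
student["city"] = "Pune"

# get() returns a default instead of raising an error for a missing key.
grade = student.get("grade", "not recorded")

print(list(student.keys()))
print(list(student.values()))
```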
Similarly, for the most part, you can convert one data structure to another easily. It must also be
noted that sometimes loops are used to convert an object from one data structure to
another.
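A short sketch of such conversions using the built-in `list()`, `tuple()`, `set()` and `dict()` functions, followed by a loop-based conversion. All names and values are illustrative.

```python
names = ["Asha", "Ravi", "Asha"]

as_tuple = tuple(names)               # list -> tuple
as_set = set(names)                   # list -> set (duplicates disappear)
back_to_list = sorted(as_set)         # set  -> sorted list

# Two lists -> dictionary, pairing keys with values.
keys = ["a", "b"]
values = [1, 2]
as_dict = dict(zip(keys, values))
pairs = list(as_dict.items())         # dictionary -> list of (key, value) tuples

# Conversion with a loop: build a dictionary from a list.
squares = {}
for n in [1, 2, 3]:
    squares[n] = n * n

print(as_tuple, as_set, back_to_list, as_dict, pairs, squares)
```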
In a gist:
Category       Lists                    Tuples                   Sets                        Dictionaries
Mutability     Mutable                  Immutable                Mutable                     Mutable
Ordered        Ordered                  Ordered                  Unordered                   Ordered by insertion (Python 3.7+)
Index-access   Yes                      Yes                      No                          No (access by key)
Braces         []                       ()                       {}                          {}
Duplicates     Can contain duplicates   Can contain duplicates   Cannot contain duplicates   Cannot contain duplicate keys
Unit Name: Conditional Statements
Overview:
Most times, even the simplest questions we ask have two answers: yes or no, this or that. In
situations like these, conditional statements come in. Whether a problem has multiple possible
solutions or the most appropriate option has to be picked out of several,
conditional statements are used in every domain to arrive at one suitable solution.
Outcomes:
1. What are conditional statements?
2. Examples of statements.
3. If statements:
- Meaning of If-Else.
- Syntax of If-else.
- Meaning of If-else if/elif statements.
- Syntax of else if statements.
- Meaning of nested If-else statements.
- Syntax of nested else-if statements.
- pass statement
4. Loops:
- While loop.
- Syntax of while loop.
- range() function
- For loops.
- Syntax of for loops.
- break statement
- continue statement
- nested while loops
- nested for loops
- difference between for loop and while loop.
If-else:
When one or more cases lead to one outcome and all remaining cases lead to
another, an if-else statement comes in handy. The first block of code is
executed only when the condition is true; otherwise the second block, contained inside the
‘else:’, is executed. If the condition is false, the interpreter does not enter the first
block.
For example, suppose a child is told that he will get a chocolate if he does his homework. Here
the condition is the homework: if it is done, the boy gets a chocolate, otherwise he
doesn’t. Similarly, in the world of coding, if-else statements help to execute programs where
either-or or this-or-that types of problems exist.
Here are a few more examples of If-else statements in programs:
1. A person must get grade ‘Pass’ if they score above 45 in a 100 marks test. If they
score below it, they must get a ‘Fail’.
2. If a person’s salary is more than Rs. 200000, then display ‘Pay Tax’; if it is less than
that, display ‘Not liable to Pay tax’.
3. To check whether a number is positive or negative: if the integer is more than zero, then it
is positive; if it is less than zero, then it is negative.
if (condition):
    code of statements if the given condition is true
else:
    code of statements if the given condition is false.
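The syntax above, applied to the grade and tax examples from the list. The function names `grade` and `tax_message` are assumptions made for this sketch.

```python
def grade(score):
    """'Pass' for a score above 45 out of 100, otherwise 'Fail'."""
    if score > 45:
        return "Pass"
    else:
        return "Fail"

def tax_message(salary):
    """Message depending on whether the salary exceeds Rs. 200000."""
    if salary > 200000:
        return "Pay Tax"
    else:
        return "Not liable to Pay tax"

print(grade(72), grade(30))
print(tax_message(350000), tax_message(150000))
```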
Example 1
Example 1
Note that the code will run even without the else part of an if-elif statement. You can also
combine two conditions in the same condition statement using logical and identity
operators.
elif (condition2):
    if (condition 2.1):
        Set of statements 3.
    else:
        Set of statements 4.
else:
    if (condition 3.1):
        Set of statements 5.
In the example, when the first condition is true, the statements inside it are executed.
Next, the if inside it is checked. In case the if within the bigger if is true, the
statements inside the former are executed; otherwise the else block is executed. The
interpreter enters any block of code if and only if the condition on which its execution
depends is true.
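A runnable sketch of nested if-elif-else blocks. The function `describe` and its conditions are made up for illustration and are not the example from the original notes.

```python
def describe(number):
    if number > 0:                      # outer condition 1
        if number % 2 == 0:             # inner condition, checked only if outer is true
            return "positive and even"
        else:
            return "positive and odd"
    elif number < 0:                    # outer condition 2
        if number % 2 == 0:
            return "negative and even"
        else:
            return "negative and odd"
    else:                               # neither outer condition held
        return "zero"

for value in (8, 7, -4, -3, 0):
    print(value, describe(value))
```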
Pass Statement:
The pass statement can be used when a block of code must exist but you have no statements
to put inside it. A pass statement does not have any impact on the program. The interpreter
just continues to the next statement when it reads a pass statement.
pass is a keyword in Python.
The example shows how pass is used. The program displays nothing: it runs
without errors, but because the if block contains only pass and no other statements,
the output is empty. pass is used to execute nothing.
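A minimal sketch of pass as a placeholder. The variable names and the threshold are assumptions made for this example.

```python
marks = 72

if marks > 45:
    pass            # nothing to do yet; without pass this empty block is a syntax error

def to_be_written_later():
    pass            # placeholder body for a function that is not implemented yet

status = "checked"  # execution simply continues after the pass statements
print(status)
```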
While loop:
In a while loop, the set of statements keeps repeating as long as the condition is true. As soon as
the condition is false, the loop ends and the code written after the loop is executed.
It is used in cases when the number of iterations is unknown.
For a while loop, you first initialise a variable (conventionally i=0). Then the while condition
is written. Inside the loop, the set of statements that are to be executed in every iteration is
written. One must note that it is important to increment or decrement the initialised
variable, otherwise the loop will become an infinite loop.
Increment means increasing the value of the variable; decrement means decreasing the
value of a variable.
Example 1
The above example says that i is initialised with value 0, and n with value 10. The while
condition is then checked. According to the example, if i <= n, which in this case is true, the
interpreter enters the loop. It prints Hello for the first time and goes to the next line. i+=1 means that i
is incremented. Now the value of i changes from 0 to 1. Again, the while condition is checked. If
the condition i <= n is satisfied, then the statements inside the loop are executed again.
Since it is true, Hello is printed the second time and i is incremented again. This continues
until the value of i becomes 11. In that case, the condition i <= n will not be satisfied because
11 is not less than or equal to 10. Thus the loop will end there.
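The loop described above can be sketched as follows. The counter `hello_count` is an addition made for this illustration, to show how many times the body runs.

```python
i = 0
n = 10
hello_count = 0          # extra counter, not part of the described example

while i <= n:            # checked before every iteration
    print("Hello")
    hello_count += 1
    i += 1               # without this increment the loop would never end

print("Loop ended with i =", i)
```

Because i takes the values 0 to 10 inclusive, the body runs eleven times.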
range() function:
Before doing for loops, one must know what the range() function does. The range() function is
a built-in Python function. For range(n,m), the loop runs from n up to
m-1; m itself is excluded.
We can also increment the variable by a different integer by specifying a step as the third
value. For example, the statement i in range(1,20,2) starts at 1, increments
by 2 and stops before 20, so the last value is 19.
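Wrapping range() in list() makes the generated values visible. A short sketch with illustrative variable names:

```python
one_to_four = list(range(1, 5))        # start 1, stop before 5
zero_to_four = list(range(5))          # start defaults to 0
odd_numbers = list(range(1, 20, 2))    # step of 2, stops before 20
countdown = list(range(5, 0, -1))      # a negative step counts down

print(one_to_four)
print(zero_to_four)
print(odd_numbers)
print(countdown)
```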
i = 0
for i in sequence/data_structure:
    Block of statements.
    ….
Statements.
Parts of the code:
1. i = 0 : i is called the iteration variable. Here it is initialised before the for loop,
although Python does not require this, because the for loop assigns i itself.
2. for i in seq/data_structure : in is a keyword in Python. The statement reads as ‘for
loop with iteration variable i in the sequence specified’. Here i takes the
first element of the sequence and then the interpreter enters the
block of code within the loop; after every pass, i moves on to the next element
until the sequence is exhausted.
3. Set of codes: as long as there are elements left in the range/sequence mentioned,
the set of codes inside the for loop is executed.
Example 1
In the above example, i is initialised as 0. Then the next line containing the for loop is
executed. i takes the first value of the sequence mentioned and the block of code within the
for loop is executed. After each pass, i moves to the next value and the code inside the loop
is executed again.
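A minimal sketch of a for loop over a list and over a range. The list `fruits` and the other names are assumptions made for this example.

```python
fruits = ["apple", "banana", "orange"]

lengths = []
for fruit in fruits:                 # fruit takes each element in turn
    lengths.append(len(fruit))
    print(fruit, len(fruit))

total = 0
for i in range(1, 6):                # i takes the values 1, 2, 3, 4, 5
    total += i

print("Sum of 1 to 5:", total)
```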
break statement:
The break statement is used when the condition of the loop is specified but one wants to
terminate the loop in between.
Syntax:
for i in range/sequence_Name/list_Name:
    set of codes
    break
statements.
According to the above-mentioned syntax, the for loop runs as usual, but when it
encounters a break statement it exits the loop without completing all its iterations and
starts executing the statements mentioned after the loop. break is a keyword that breaks the
flow of statements inside a loop.
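A possible sketch of break: the loop is set up for 1 to 20 but stops at the first number divisible by 7. The names `visited` and `first_multiple` are made up for this illustration.

```python
visited = []
first_multiple = None

for i in range(1, 21):        # would normally run for 1..20
    visited.append(i)
    if i % 7 == 0:
        first_multiple = i
        break                 # leave the loop immediately

print("Visited:", visited)
print("Stopped at:", first_multiple)
```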
Example 1
In the above example, the iteration variable is i; the loop begins from 1 and is to be
repeated until 20. When the interpreter enters the loop, the ith iteration number is printed. Then the if
block is checked. If the ith number is divisible by 7, the interpreter enters that block;
otherwise it enters the else block. Note that the keyword continue is used here to continue
the flow of the loop without executing any other statements. This loop will continue until 7,
Nested loops:
A loop within a loop is called a nested loop.
Example of a nested while loop:
In the above example, i is initialised at 1. According to the first while loop, 1<=30 is true,
hence the interpreter enters the loop. The second while loop is then checked. Since j <= 15, the
block of statements inside the loop is executed. i+2 and j are printed and then both
variables are incremented. These types of loops, one within another, are called nested
loops.
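The original example is not reproduced in these notes, so the sketch below is a smaller nested while loop written for illustration, with limits of 3 and 2 instead of 30 and 15.

```python
pairs = []

i = 1
while i <= 3:                 # outer loop
    j = 1
    while j <= 2:             # inner loop runs fully for every value of i
        pairs.append((i, j))
        j += 1
    i += 1

print(pairs)
```

The inner loop completes all its iterations before the outer loop moves to its next value.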
Speed: A for loop is generally faster than a while loop.
Unit Name: Functions in Python
Functions:
Functions are used when we want to call a particular set of code in different
parts of a program. For instance, when we want to compute the addition of two
numbers more than once in a program. This is easy with a function: it is
declared once, and then the function is simply called wherever it is
needed.
User-defined functions:
While there are a lot of pre-defined functions, there can be a lot of other sets
of code that a user might want to repeat in a program and that are not pre-defined.
For purposes like these, user-defined functions come in handy.
Through a user-defined function, a programmer can write a set of code in a
function block and call it anywhere in a program, whenever it has to be
reused.
Syntax of a function:
- This block is called the function definition. It is the part where you
write all the features and codes of the function.
- ‘def’ is a keyword in Python. It is short for define/definition.
- It is followed by functionName, which is the name of the function that
is going to be defined by the user.
- functionName is followed by parentheses, which contain the
arguments. Arguments are the variables that receive values and are used within
the function.
These arguments are also called local variables, since they can only be
used within the block of code of the function. They go unrecognised
outside the function. However, one must note that it is not compulsory to
write arguments in a Python function.
- The parentheses (with or without arguments) are followed by a colon. Inside the
block, enter all the conditions/statements for the case that the
function was created for.
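The parts described above, put together in a minimal sketch. The function names `add_numbers` and `greet` are assumptions made for this example.

```python
def add_numbers(a, b):        # def + function name + arguments + colon
    total = a + b             # a, b and total are local to the function
    return total

def greet():                  # arguments are optional
    return "Hello"

# The function is defined once and can be called wherever it is needed.
first_sum = add_numbers(2, 3)
second_sum = add_numbers(10, 20)
print(first_sum, second_sum, greet())
```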
lambda function:
A lambda function is a function written in a single line. It does not have a
name and can take any number of arguments, but it can only have one
expression in it. It is called an anonymous function because it is nameless.
One must note that for a one-line expression, a lambda function and a normal
def function give the same output. def is usually used
when the function has multiple lines of code.
A lambda function does not have a return statement. Since it is a single-expression
function, the value of the expression is returned automatically when the
function is called.
Example:
In the above example, the lambda function is also written as a def function. The
output is the same; however, the number of lines is greater when it is written
as a def function.
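A minimal sketch of the comparison described above, assuming a squaring operation (the original example is not reproduced in these notes, so the names square_lambda and square_def are hypothetical):

```python
# Lambda: one line, one expression, no 'return' keyword.
square_lambda = lambda n: n * n


# The same logic written as a def function takes more lines.
def square_def(n):
    return n * n


# Both give the same output for the same argument.
print(square_lambda(6))
print(square_def(6))
```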
Example: With arguments, without return
1. Calculator, factorial of a number, and reverse of a number
To make a basic calculator using functions:
In the above example, the function calculator contains numerous arithmetic
operations. The function is then called with arguments a=20 and b=5. When
executed, it gives the output for each operation. A lambda function is stored in
variable x. Since only one expression can be evaluated in a lambda function,
a combination of operations is used with local variables a and b. Arithmetic
operations follow operator precedence, and operators of equal precedence are
evaluated from left to right. The solution is then printed.
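The calculator described above might look like the sketch below. The exact operations and the expression inside the lambda are assumptions, since the original code is not reproduced here; only the names calculator, x, a and b and the values 20 and 5 come from the text.

```python
# With arguments, without return: the function only prints its results.
def calculator(a, b):
    print("Sum:", a + b)
    print("Difference:", a - b)
    print("Product:", a * b)
    print("Quotient:", a / b)


calculator(20, 5)

# A lambda can hold only one expression, so several operations are combined.
# Multiplication and division bind tighter than + and -, so this evaluates
# as a + (b * 2) - (a / b) = 20 + 10 - 4.0.
x = lambda a, b: a + b * 2 - a / b
print(x(20, 5))  # 26.0
```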
In the above example, when the function factorial is executed with 8 as the
argument value, the interpreter goes inside the function factorial, with local
variable a. Thus, a=8. Another local variable is assigned, s=1, inside factorial.
Since the loop condition is fulfilled, i.e. 8 != 0, the loop is
entered. s is reassigned as s*a, and then a is decremented. The loop keeps
repeating until the value of a is 0; when the condition is no longer fulfilled,
the loop terminates and the factorial stored in s is displayed.
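The steps described above can be sketched as follows, assuming a while loop and a non-negative integer argument (a negative value would never reach 0 and the loop would not end):

```python
# With arguments, without return: the result is printed inside the function.
def factorial(a):
    s = 1              # running product
    while a != 0:      # loop until a has been counted down to 0
        s = s * a
        a = a - 1
    print("Factorial:", s)


factorial(8)  # prints Factorial: 40320
```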
Examples: Without arguments, with return
1. Finding the sum of all multiple of 5 between 50-200:
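A minimal sketch of this exercise, assuming that both 50 and 200 are included in the range. The function name sum_of_multiples is hypothetical.

```python
# Without arguments, with return: the range is fixed inside the function
# and the result is handed back to the caller.
def sum_of_multiples():
    total = 0
    for num in range(50, 201):   # 201 so that 200 itself is included
        if num % 5 == 0:
            total += num
    return total


result = sum_of_multiples()
print(result)  # 3875
```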
The same factorial problem is executed without arguments and with return.
num is used as the local variable instead of entering arguments.
Number 1: 12321
Number 2: 12345
Example 2:
Armstrong number (without arguments, without return)
When the sum of the cubes of the digits of a three-digit number is equal to the
original number, it is called an Armstrong number.
Output:
Number = 153
=> 1^3 + 5^3 + 3^3
=> 1 + 125 + 27
=> 153. Thus, the number is an Armstrong number.
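The check can be sketched as below. Unlike the course example (without arguments, without return), this hypothetical is_armstrong function takes the number as an argument and returns a Boolean so that it is easy to reuse. It also uses the general definition, in which each digit is raised to the number of digits; for three-digit numbers this is the cube.

```python
def is_armstrong(number):
    digits = str(number)
    power = len(digits)   # 3 for a three-digit number, i.e. cubes
    total = sum(int(d) ** power for d in digits)
    return total == number


print(is_armstrong(153))  # True: 1 + 125 + 27 = 153
print(is_armstrong(154))  # False
```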
Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: OOPs in Python
Learning Outcomes:
- Procedural Programming.
- Object Oriented Programming
- OOPs vs Procedural Programming
- Features of OOPs
- Classes:
Syntax of classes
Example
- Objects
Syntax of objects
Example
- Difference between classes and objects
- Mini project.
Class:
A class is said to be a blueprint for an object. It is a user-defined data type
that describes a collection of objects of the same kind. Data members and
functions are held within a class.
For example, in a company, we evaluate two departments, HR and Accounting,
and we want to check which department a person works in. We write the
program. If, say, we enter the employee ID of person XYZ, we need to retrieve the
name of the employee and the department in which they work. For one entity,
it is easy to do so, but for multiple entries, in the 1000s, it becomes very difficult.
Syntax of Class:
class employee_details():
To define a class, you simply write class, which is a keyword, and then
className() followed by a colon. Inside the class you can write multiple
functions that are part of the class. Python has no enforced access specifiers;
by convention, names that begin with an underscore are treated as internal to
the class rather than as accessible to the rest of the program.
Inside a class, make sure that everything is properly indented, just like in a
conditional statement.
Example of a class:
Objects:
Unless an object of a class is created, no memory is allocated for the data of
that class. Instantiation is creating an object of a class. The object, or
instance, contains the actual information.
object_name = class_name()
In the above example, to create an object of a class, simply assign
className() to the object name. To display the value of the string inside
employee_details, you must simply print objectName.variableName.
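A minimal sketch of a class and an object, using the employee_details name from the syntax above. The variable names and their values are assumptions made for illustration.

```python
# A simple class holding data members (values are hypothetical).
class employee_details():
    employee_name = "XYZ"
    department = "HR"


# Instantiation: object_name = class_name()
emp = employee_details()

# Access a value as objectName.variableName
print(emp.employee_name)
print(emp.department)
```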
Class | Object
Class is used to bind all the data into a single unit. | Objects are like variables of class.
It is a logical entity. | It is a physical entity.
Example: employee_details is a class. | Example: employee_id, employee_name, department.
The above examples are simple and are not usually used in Python. For real
world applications, we use the __init__() function.
__init__() function:
All classes have an __init__() function that is executed when the class is
instantiated. It is called automatically whenever a class is used to create a new
object.
The __init__() function is similar to a constructor. Constructors in Java are used
to initialise the state of the object. A constructor contains a set of statements
that are executed when the object is instantiated.
Example:
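A sketch of __init__() in use, continuing the employee scenario from earlier. The class name Employee, the describe method and all values are hypothetical.

```python
class Employee:
    # __init__ runs automatically each time an object is created
    def __init__(self, employee_id, name, department):
        self.employee_id = employee_id
        self.name = name
        self.department = department

    def describe(self):
        return f"{self.name} ({self.employee_id}) works in {self.department}"


# Each object carries its own state, set through __init__
emp1 = Employee(101, "XYZ", "HR")
emp2 = Employee(102, "ABC", "Accounting")
print(emp1.describe())
print(emp2.describe())
```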
Mini Project:
Example 2: Display 3 book names and their page counts using classes and
objects.
Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Inheritance
Unit Outcomes:
- Inheritance: Introduction.
- Uses of inheritance.
- Advantages of using inheritance
- Types of Inheritance:
1. Single inheritance:
I. Definition, syntax.
II. example
2. Multiple inheritance:
I. Definition. Syntax.
II. Example.
3. Multi-level inheritance:
I. Definition, Syntax.
II. Example
- Difference between different types of inheritance.
- Method overriding:
1. Meaning
2. Example: in Simple Inheritance, in Multiple Inheritance.
- Diamond Problem:
1. What is it?
2. Problem explanation
3. Pseudo-code
4. Code.
One must note that inheritance is transitive in nature. This means that if class X
inherits the qualities of class Y, then all subclasses of class X will also contain the qualities of class Y.
Advantages of Inheritance:
1. It is used to share existing features of a class.
2. Inheritance can be used to arrange functions and methods in a hierarchical form.
When more than one class is derived from a base class, it can be called hierarchical
inheritance.
3. Due to reusability of code, code redundancy is also reduced, increasing elegance.
1. Single Inheritance:
When the features of the child class are derived from only one class (the parent), then it is called
single inheritance.
class parentClassName:
    # statements
class childClassName(parentClassName):
    # statements
objectName = childClassName()
[Figure: Parent Class -> Child Class]
To see an example of single inheritance, we make one class inherit the properties of another
class. We declare a parent class and then a child class, after which we create an object of
the child class. Once that is done, we can call methods of the parent class via objects of the
child class.
In the given example, we declare a parent class, called automobiles, whose function takes
self as its argument. We write a function within it which tells the types of automobiles there are. Since
cars are a subtype of automobiles, we make another class, cars, which can inherit the
properties of the automobiles class. Inside it we define a function that contains the types of
cars that exist. Then we declare an object that is associated with the child class, i.e. cars. We
call the function of the child class via the object. But because of inheritance, we can also call
functions of the parent class via objects of the child class.
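A sketch of the automobiles and cars example described above. The method names and the lists they return are assumptions, since the original code is not reproduced here.

```python
# Parent class
class Automobiles:
    def automobile_types(self):
        return ["car", "bike", "truck"]


# Child class: the parent is named in the parentheses
class Cars(Automobiles):
    def car_types(self):
        return ["hatchback", "sedan", "SUV"]


my_car = Cars()
print(my_car.car_types())         # method of the child class
print(my_car.automobile_types())  # inherited from the parent class
```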
2. Multiple Inheritance:
When there are two or more parent classes of one derived class, it is called multiple
inheritance. As the name suggests, one class takes after more than one class.
Syntax of multiple inheritance:
class parentClassName1:
    # statements
class parentClassName2:
    # statements
class childClassName(parentClassName1, parentClassName2):
    # statements
objectName = childClassName()
In the example given below, a class called calc1 is defined with a function that returns the
multiplication of two local variables a and b. Similarly, calc2 is another class defined with a
function that returns the division of two local variables. Then a derived class is declared
which inherits the features and functions of both classes calc1 and calc2. The derived class also
has a function which returns the modulus value of two local variables a and b. Then an
object of the derived class is made. The same object is used to call the functions mult()
and divide() of the respective parent classes and print their results.
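The description above can be sketched as follows. The names calc1, calc2, mult() and divide() come from the text; the names derived, modulus and obj, and the argument values, are assumptions.

```python
class calc1:
    def mult(self, a, b):
        return a * b


class calc2:
    def divide(self, a, b):
        return a / b


# The derived class lists both parents in its parentheses
class derived(calc1, calc2):
    def modulus(self, a, b):
        return a % b


obj = derived()
print(obj.mult(20, 5))     # 100, inherited from calc1
print(obj.divide(20, 5))   # 4.0, inherited from calc2
print(obj.modulus(20, 6))  # 2, defined in the derived class
```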
Multi-level Inheritance:
In this case, the features of the parent class and the child class are further inherited by
another class, let’s say a grandchild class. Basically, there is an intermediary class between the
parent class and the derived class.
For instance, if a Daughter inherits the belongings of her Mother, then the Daughter's own
daughter (let’s say Anna) inherits the belongings of her mother, i.e. Daughter, including
those that came from Mother.
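The Mother, Daughter and Anna chain can be sketched as below. The method names and return values are hypothetical and only serve to show what each level contributes.

```python
class Mother:
    def house(self):
        return "Mother's house"


# Intermediary class: inherits from Mother
class Daughter(Mother):
    def car(self):
        return "Daughter's car"


# Grandchild class: inherits from Daughter and, through it, from Mother
class Anna(Daughter):
    def bicycle(self):
        return "Anna's bicycle"


anna = Anna()
print(anna.bicycle())  # defined in Anna
print(anna.car())      # inherited from Daughter
print(anna.house())    # inherited from Mother via Daughter
```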
Example:
[Figure: Parent Class 1 and Parent Class 2 -> Child Class / Derived Class]
In the above example, we create a parent class by the name of Details. Then we create a
child class named Name, which inherits from the parent class. Then the object is created for
the child class, and values are passed in as arguments. The flow of statements goes to the
child class. Its __init__ function contains a super() call which redirects the flow to the
__init__ function of Details, inside which the values of name, address and age are assigned
respectively. After the execution of that __init__ function is done, the flow of the program goes
back to the next line of the Name class, to the if statement. After age is checked, the respective
statements are executed.
objectName= childClassName()
Method overriding:
Overriding is a feature that lets a class change the implementation of a function provided by
one of its base classes.
Inheritance can best be used with method overriding. It is the ability of an OOPs language to
allow the child class to provide a specific implementation of a method that is also written in its
parent class.
A real-life example that helps to easily understand method overriding would be how a
daughter gets the inheritance of her mother. She gets her mother’s land, her workspace,
and her cars. Maybe sometime later, the daughter chooses to discard the cars and buys
other cars, but still uses her mum’s land and workspace. In this case, she can use method
overriding to use her own set of cars while also using the land and workspace from her
inheritance.
Here, in Python, you make a class called Mum, inside which you can mention functions like land,
workspace and cars. You then create another class called Daughter which will inherit class Mum.
Example 2:
In the above example, a program similar to the single inheritance example is written. The only
difference here is that the function inside the child class has the same name as the function in the
parent class. This is where overriding comes in. Once the object of the child class is created,
calling the function through it runs the child class's version instead of the parent's.
Similarly, in multiple inheritance, a function of one of the parent classes has the same name
as a function in the child class. When the object of the child class is made and the function is
called, Python again uses the overriding function and, instead of information
about cars, information about bikes is displayed.
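A minimal sketch of overriding, loosely following the cars and bikes wording above. The class names, the method name info and the returned strings are assumptions.

```python
class Cars:
    def info(self):
        return "Information about cars"


class Bikes(Cars):
    # Same name as the parent's method, so this version overrides it
    def info(self):
        return "Information about bikes"

    def parent_info(self):
        # The parent's version is still reachable through super()
        return super().info()


bike = Bikes()
print(bike.info())         # Information about bikes
print(bike.parent_info())  # Information about cars
```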
Diamond problem:
In multiple inheritance, the diamond problem can be found even in the simplest of programs.
In easy language, the diamond problem is a problem that arises due to ambiguity when two
classes U and V inherit from X, and class Y inherits from both U and V. If there is a method A
which is overridden in class U, in class V, or in both of them, then an
ambiguity arises as to which ‘A’ class Y should inherit.
Create an object for class Child, which inherits the features of all classes ParentClass1,
ParentClass2, and ParentClass3.
Call methods via the object.
In case of such problems, there are three types of cases that are followed.
In this case, the same method name is common to all the classes. When the object of the
Child class is made, the ‘method’ function of the Child class is executed, thus overriding
the methods of both its parent classes and subsequently even that of its super-parent class.
You can get the same output without calling each of the classes separately via the object.
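The diamond shape with classes X, U, V and Y can be sketched as below. Python resolves the ambiguity through its method resolution order (MRO), which searches the parents from left to right as they are listed in the class definition. The method name method follows the text; the returned strings are hypothetical.

```python
class X:
    def method(self):
        return "X.method"


class U(X):
    def method(self):
        return "U.method"


class V(X):
    def method(self):
        return "V.method"


# Y does not define method itself, so Python must pick one to inherit
class Y(U, V):
    pass


y = Y()
print(y.method())                       # U.method, because U is listed first
print([c.__name__ for c in Y.__mro__])  # ['Y', 'U', 'V', 'X', 'object']
```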
Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: NumPy Arrays
Overview:
In the real world, when we solve math problems, our brain evaluates
what type the problem is and starts analysing and solving it. In Python, it is
similar, except there is an entire library for solving mathematical problems that
helps in complex calculations: NumPy.
This library has an official website, www.numpy.org. It is an open source
product and, on the website, you can get detailed information about everything related to
NumPy.
Unit Outcomes:
1. Modules: Meaning, creating modules, importing modules.
2. Packages in Python.
3. Libraries in Python: The Python Standard Library.
4. NumPy:
i. Introduction
ii. Installing NumPy
iii. Importing NumPy.
iv. NumPy Arrays:
a. Meaning
b. Creating arrays
c. Single dimension, multidimensional arrays.
d. len() function, arange() function.
e. dimension of an array.
f. Shape and size of an array.
g. Creating arrays using tuples and lists.
h. Indexing of an array.
i. Slicing of an array.
j. Merging and Splitting an array.
k. Reshaping arrays.
Just like defining a function in python, we write a function and save it. Here,
we saved it as add.py.
Now sum is a function that accepts two numbers and returns their addition. It
is in a module named add.
Importing modules:
To import modules, all one needs to do is use the import keyword along with
the name of the module they want to import. The function is then called through the module name.
Syntax to import:
import moduleName
moduleName.functionName(arguments)
import numpy as np
NumPy Arrays:
NumPy arrays are grids of values. All values are of the same type
and are indexed by a tuple of non-negative integers.
The number of dimensions of an array is called the rank of an array.
The tuple of integers that gives the size of the array along each dimension is called
the shape of an array.
How are lists in Python different from NumPy arrays?
NumPy is faster and more efficient, and provides a wider range of options for array creation.
All the elements in a NumPy array should have the same datatype, unlike lists,
where homogeneity isn’t necessary: the elements of a list can be of
different data types. NumPy is used when mathematical operations are to be
performed on the data.
NumPy arrays take up less memory and are convenient to use while also being
compact, and thus are faster than lists. They also store the data type along with
the data, which optimises the data further.
Creating Arrays:
To create an array, we first import the NumPy module and give it an
alias. Conventionally, np is used as the alias. Once you are done
importing, you write variableName=np.array([data]), as shown below.
In the above example, the shape can be explained as the rows and columns of
the matrix. For instance, for variable a, which has only one row with 4
elements, the output shows (4,), but for a 2x4 matrix it shows that the shape is
equal to (2, 4), for two rows and 4 columns.
If you want to get the number of rows or columns, you can get each element of
the tuple. variableName.shape[0] gives the number of rows of a matrix.
variableName.shape[1] gives the number of columns of a matrix.
type(variableName) gives the type of the array object itself, while
variableName.dtype gives the data type of its elements.
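The creation and shape checks described above can be sketched as follows. The variable names a and b follow the text; the element values are assumptions.

```python
import numpy as np

a = np.array([1, 2, 3, 4])                  # one row with 4 elements
b = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])  # 2x4 matrix

print(a.shape)      # (4,)
print(b.shape)      # (2, 4)
print(b.shape[0])   # 2 rows
print(b.shape[1])   # 4 columns
print(b.ndim)       # 2 dimensions (the rank)
print(b.size)       # 8 elements in total
print(type(b))      # the array object type
print(b.dtype)      # the data type of the elements
```

Note that a.shape[1] would raise an IndexError here, because a one-dimensional array has only one entry in its shape tuple.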
Size of an array:
Indexing of Arrays:
Like lists and tuples, arrays can also be indexed. To access array elements, simply
type arrayName[indexNumber]. Indexing in arrays begins from 0.
In the example given below, a[0] gives the value of element 0 in a. Similarly,
b[1,3] gives the value at index 3 of the row at index 1, which translates to
the second row and fourth column of the 2x4 matrix b.
In a one dimensional array, all you need to do is enter the start and end index numbers. In a
two dimensional array, you mention the start and end index numbers for each
dimension to slice it accordingly. In the above example, we have written b[0: ,
1:3], which means that we are asking Python to display the columns from index 1 up to,
but not including, index 3, for the rows starting at index 0 until the end.
When we write 3: it assumes that everything is to be displayed beginning from
index 3, i.e. the 4th element.
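Indexing and slicing can be sketched together as below, assuming the same kind of arrays a (one-dimensional) and b (2x4) that the text refers to. The element values are hypothetical.

```python
import numpy as np

a = np.array([10, 20, 30, 40])
b = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

first = a[0]        # 10: element at index 0
elem = b[1, 3]      # 8: second row, fourth column

# All rows from index 0 onwards, columns at index 1 and 2 (3 is excluded)
block = b[0:, 1:3]  # [[2, 3], [6, 7]]

# Everything from index 3 onwards
tail = a[3:]        # [40]

print(first, elem)
print(block)
print(tail)
```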
Merging Arrays:
We use the concatenate function to merge arrays.
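A minimal sketch of merging with np.concatenate. The array names and values are hypothetical; the axis argument chooses whether two-dimensional arrays are joined row-wise or column-wise.

```python
import numpy as np

first = np.array([1, 2, 3])
second = np.array([4, 5, 6])
merged = np.concatenate((first, second))   # [1 2 3 4 5 6]

m1 = np.array([[1, 2], [3, 4]])
m2 = np.array([[5, 6], [7, 8]])
stacked_rows = np.concatenate((m1, m2), axis=0)   # 4x2: m2 placed below m1
stacked_cols = np.concatenate((m1, m2), axis=1)   # 2x4: m2 placed beside m1

print(merged)
print(stacked_rows)
print(stacked_cols)
```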
Output:
Reshaping an array:
Let’s say you know the range of an array, and it begins from 20 and runs until 40. You
want to reshape it into a 4x5 matrix. Doing this manually will be a big task. For
such reasons we use the reshape function of an array.
The shape of an array is the number of elements in each dimension of an array.
For example:
Output:
Here, we created a variable to store the numbers from 20 up to, but not including, 40. The list is
converted into an array. By default, it is a one dimensional array. This array is
then reshaped using NumPy functions. As a result, it is reshaped into a 4x5
matrix.
Similarly, for the numbers from 105 up to, but not including, 150, a variable containing the range is
stored. It is then converted into a one dimensional array. This array is then
reshaped into a 9x5 and a 3x15 matrix.
Making a 3x15 matrix manually would be tedious; this is where the .reshape()
function comes in handy.
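The reshaping steps described above can be sketched as follows. The variable names are hypothetical. The upper bounds 40 and 150 are excluded, which gives 20 and 45 elements respectively, exactly the counts needed for the target shapes.

```python
import numpy as np

# 20, 21, ..., 39: twenty numbers, converted from a list to a 1-D array
values = np.array(list(range(20, 40)))
grid = values.reshape(4, 5)              # 4 rows x 5 columns

# 105, 106, ..., 149: forty-five numbers
bigger = np.arange(105, 150)
nine_by_five = bigger.reshape(9, 5)
three_by_fifteen = bigger.reshape(3, 15)

print(grid)
print(nine_by_five.shape, three_by_fifteen.shape)
```

The product of the new dimensions must equal the number of elements; otherwise reshape raises a ValueError.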
Output:
Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: NumPy - Functions
Overview:
While covering the basics was easy, NumPy also has a lot of other functions
that help us write programs with ease and provide efficiency.
Unit Outcomes:
- Operations on arrays:
1. Using one element of an array at a time.
2. Sorting of arrays.
- Functions in NumPy: Max, min, square, add, subtraction, product,
division.
- Linspace.
- Broadcasting arrays.
In the above example, the elements inside each dimension have been printed.
Using only the first for loop, however, will return every element of the array
individually.
In the above examples, for a single dimensional array, it shows the indexes
where the number is present followed by the size of the data type. The value
stored in each column is a 64 bit integer.
For a 2-d array, the number is present in both the dimensions, so the first part
of the output shows the rows (sub-arrays) in which the number is present, and the
second part shows the index of the number within each of those rows.
The searchsorted() method tells us the index where a new element would be
inserted in the array to maintain the order. It is performed on a sorted
array:
In the example below, for a one-dimensional array, we write the elements of an
array and the searchsorted() method tells us at what position/index the
given element must be inserted. Similarly, for more than one element, the numbers
are written within square brackets. It gives the index positions where the
values can be inserted respectively.
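A sketch of searchsorted() on a sorted one-dimensional array. The array name and values are hypothetical.

```python
import numpy as np

sorted_values = np.array([10, 20, 30, 40, 50])

# Index at which 35 would have to be inserted to keep the order
position = np.searchsorted(sorted_values, 35)             # 3

# Several values at once are passed within square brackets
positions = np.searchsorted(sorted_values, [5, 25, 60])   # [0 2 5]

print(position)
print(positions)
```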
In the above example, we simply use one variable to store the 1 dimensional
array and another variable to store the sorted array.
To return an array in descending order, one just has to use the – sign along
with the np.sort() method, i.e. -np.sort(-variableName).
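Sorting in both directions can be sketched as below. The names and values are hypothetical. Negating the array, sorting it, and negating the result again reverses the order for numeric data.

```python
import numpy as np

values = np.array([30, 10, 50, 20, 40])

ascending = np.sort(values)       # [10 20 30 40 50]
descending = -np.sort(-values)    # [50 40 30 20 10]

# For a 2-D array, np.sort sorts each row (the last axis) by default
matrix = np.array([[3, 1, 2], [9, 7, 8]])
rows_sorted = np.sort(matrix)     # [[1 2 3], [7 8 9]]

print(ascending)
print(descending)
print(rows_sorted)
```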
Similar is the case for 2d arrays:
6. power function: In this case, the first array is treated as the base and
every element in the first array is raised to the corresponding element in
the second array.
Linspace:
If you were asked how many numbers lie between 1.0 and 2.0, you might say
infinite, because the numbers could be anything from 1.001 to 1.999, or any
other value in between on the real number line. This is where linspace is
helpful. In it we define a start point, an end point and the number of values
to generate. It returns an array of all these numbers. For instance, between 1.0
and 2.0, if we mention num=10, it will return 10 numbers from 1.0 to
2.0 at equal intervals, with both end points included by default. So, in easy words,
it returns evenly spaced numbers between a start point and a stop point.
Syntax:
In the above syntax, the attributes inside the linspace function mean:
In the above example, start and stop point are set to 0 and 20. 50 numbers
between 0 and 20 are printed at equal intervals, including 0 and 20.
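Both cases mentioned in this section can be sketched as follows: 10 values from 1.0 to 2.0, and 50 values from 0 to 20. The variable names are hypothetical.

```python
import numpy as np

# 10 evenly spaced values, with 1.0 and 2.0 both included
ten_points = np.linspace(1.0, 2.0, num=10)

# 50 evenly spaced values from 0 to 20, end points included
fifty_points = np.linspace(0, 20, num=50)

print(ten_points)
print(len(fifty_points), fifty_points[0], fifty_points[-1])
```

Because the end points are included, the spacing is (stop - start) / (num - 1), for example 1/9 in the first case.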
Example 2: linspace() using optional parameters:
In the above example, parameter axis is used. The two sub-arrays are set as
the start and end point. 4 arrays are printed with 3 elements in each array. For
axis=1, column sequence is used to return elements in the range.
Broadcasting Arrays:
Usually, when we perform arithmetic operations on arrays of different
shapes, we get an error. A way to overcome this problem is by duplicating the
smaller array to the size and dimension of the bigger array. This is called
broadcasting. Thus, broadcasting allows us to perform various arithmetic
operations on arrays of different sizes. NumPy does not actually duplicate the
smaller array, however; it makes memory-efficient use of the structures that
already exist in memory to get the same outcome.
Rules for Broadcasting:
1. If the arrays have different dimensions, the shape of the one which has fewer
dimensions is padded with ones on its left side.
Example 1:
Example 2:
Let’s say there is a room with dimensions 8x7x6. The room is being renovated
and reconstructed. The length of the room has to be increased to twice its size
and the breadth to thrice its size. Variable dimension stores the current room dimensions
and variable increase stores the factors by which they need to be multiplied.
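The room example can be sketched as below. The names dimension and increase follow the text; the factor 1 for the height (left unchanged) and the second array rooms are assumptions added to show broadcasting between different shapes.

```python
import numpy as np

dimension = np.array([8, 7, 6])   # length, breadth, height
increase = np.array([2, 3, 1])    # height assumed to stay the same

new_dimension = dimension * increase   # [16 21 6]

# Broadcasting proper: a (2, 3) array times a (3,) array.
# The shape (3,) is padded on the left to (1, 3) and applied to every row.
rooms = np.array([[8, 7, 6], [10, 9, 8]])
new_rooms = rooms * increase           # [[16 21 6], [20 27 8]]

print(new_dimension)
print(new_rooms)
```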
Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Pandas – Series and Dictionaries
Overview:
In today’s day and age, data is gold. We analyse this data, read it
better via visualisations and draw conclusions. As people of data science, we
know datasets are huge. To make visualisations in Python using these
datasets, we need something called Pandas. Like NumPy, Pandas is a library
in Python.
Unit Outcomes:
- Introduction to Pandas
- Series:
a. Creating an empty series
b. Creating a series from an array
c. Creating a series from a list.
d. Converting a dictionary to a series.
e. Accessing elements from a series.
Calling Pandas:
import pandas as pd
Features of Pandas:
- Data handling: Pandas is a fast and efficient library that helps us explore
data. Series and Data Frames help us in handling the data efficiently
while also helping us manipulate it.
- Handles missing data: a dataset, especially crude data, is quite often
difficult to read and interpret. Sometimes data is also missing. The
Pandas library comes in handy here because it has features that
detect and fill in or drop the missing values.
- Cleaning of the data: crude data is data at its first stage, very raw and
messy. Pandas has features for cleaning the data, which make the
data tidy. For better and more accurate results, the data needs to be
cleaner.
- Input – Output tools: Pandas has built-in tools that help in reading and
writing the data. Data can be read from and written to files,
databases, etc. easily via these built-in tools.
- Time series: Pandas provides tools like moving window statistics and
frequency conversion.
- Mathematical operations: Pandas helps in performing mathematical
operations on the data as a whole or a part of the data.
- Maintaining uniqueness of the data: The pandas library helps in
reducing redundancy by considering only unique values. It also masks
data that is not needed in the analysis.
- Alignment, indexing, grouping: as the header speaks for itself, Pandas
helps in aligning, indexing and grouping the data.
Series:
A Series is like a column in a table. It is a one-dimensional array that holds values
of any data type. The labels of the values of a series are called its index.
For example, in Excel, for the details of a class of students, you write, say, Name
and Email-id as heads of two columns. These two columns can be
called two series in Python.
You can name your labels, but if nothing is specified, the series is labelled by
its index number.
Syntax of a series:
import pandas as pd
objectName=pd.Series(["datapoints"]) # the data points can be of any datatype
print(objectName)
Example:
Syntax:
import pandas as pd
objectName=pd.Series()
Example:
Syntax:
import pandas as pd
import numpy as np
objectName=np.array(["array elements"])
objectName1=pd.Series(objectName)
In the above example, two arrays are considered. The first one has numeric
values and the other one has string values. Both are converted into a series
simply by using the pd.Series(objectName) syntax. One can also see the
difference between the arrays and the series.
import pandas as pd
list1=["listValues"]
ser=pd.Series(list1)
Example:
Converting dictionaries to series:
Like lists and arrays, dictionaries can also be converted into a series.
In case of dictionaries, the elements are of the form key:value. Here, the key
becomes the label by which the value can be recognized, instead of its indexes.
Example:
In the above example, we find the value at the 2nd index for list1 and the value
at the 4th index for the dictionary named dict1.
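Creating series from a list and from a dictionary, and then accessing elements, can be sketched as follows. The names list1 and dict1 follow the text; their contents and the other variable names are assumptions.

```python
import pandas as pd

list1 = [10, 20, 30, 40, 50]
dict1 = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}

ser_from_list = pd.Series(list1)   # labelled 0, 1, 2, ... by default
ser_from_dict = pd.Series(dict1)   # the keys become the labels

print(ser_from_list[2])       # 30: value with the default label 2
print(ser_from_dict["d"])     # 4: value accessed by its key label
print(ser_from_dict.iloc[4])  # 5: value accessed by position 4
```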
Example2:
Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Pandas – Data Frames
Overview:
As data scientists, we deal with loads of data. This data is often in the
form of rows and columns. There can be over 100s and 1000s of rows
which have 100s of attributes in the form of columns. To analyse,
evaluate and find different types of correlations and connections
between two columns or rows in python, data frames are used.
Unit Outcomes:
- Data frames
- Empty data frames
- Creating data frames from dictionaries
- Creating data frames from lists
- zip() function
- Creating data frames from arrays
- loc function
- iloc function
- The describe function
- info function
Data Frames:
Data frames in Python are two-dimensional data structures. They are
mutable. They are said to be two-dimensional because they are divided
into labelled axes which are called rows and columns. One can also
perform arithmetic operations on these rows and columns. Data frames
are like an excel sheet. They are used for storing tables. All files with a
dataFrameName= pd.DataFrame()
Example:
Creating a data frame:
Let’s say we use a dictionary to create a data frame.
In the above example, we create a list with seven names. We call the list
‘names’. We convert this list into a data frame with the column head as
‘Names’. It is stored in the variable ‘dataframe’. dataframe is then
printed.
To make a data frame using two lists:
zip() function:
The zip() function can be used when there are two or more iterables whose
elements can be clubbed into tuples. Tuples, lists, sets or dictionaries can be passed
through the zip() function. The output of the zip() function is an
iterator of tuples.
Syntax:
Example:
Creating a data frame from two or more lists using the zip() function:
In the above example, we first declared three lists. These lists are stored
in objects ‘names’, ‘age’ and ‘subject’. One must note that the zip()
function returns an iterator of tuples rather than a list, which is why it is converted
to a list. The object ‘dataframe’ stores the zipped list as a data frame with
column names as given above.
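The zip-based construction described above can be sketched as below. The names names, age, subject and dataframe follow the text; the list contents and the column names are assumptions.

```python
import pandas as pd

names = ["Anna", "Brandon", "Carl"]
age = [21, 22, 23]
subject = ["Maths", "Physics", "English"]

# zip() gives an iterator of tuples, so it is converted to a list first
zipped = list(zip(names, age, subject))
print(zipped[0])   # ('Anna', 21, 'Maths')

dataframe = pd.DataFrame(zipped, columns=["Names", "Age", "Subject"])
print(dataframe)
```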
loc function:
The pandas library has a loc indexer that helps us retrieve data from a
series or data frame. It is used to access groups of rows and columns by
their labels or by a Boolean array.
The permitted index labels can be:
- a single label
- a list or an array of labels
- a slice object with labels
- a Boolean array of the same length as the axis that is sliced
- a function with one argument that can be called and returns a valid
output for indexing.
Syntax:
pandas.DataFrame.loc[index label]
DataFrame.loc['labelName']
Example:
In the below example, we extract the Age, Color and Food preference of
the label named ‘Carl’. A single label is used to retrieve information from
the data frame.
Likewise, similar things can be done with the rest of the labels as well.
Example:
In the below example, we extract the Age, Color and Food preference of
the label named ‘Brandon’ and ‘Greg’. Both labels are used to retrieve
their respective information from the data frame.
Likewise, similar things can be done with the other labels as well.
3. To access a single label of the row and column from the data
frame:
It permits one to extract the column value of the mentioned row.
Syntax:
DataFrame.loc['labelNameRow', 'labelNameColumn']
Example:
In the below example, we extract the Food preference of the labels
named ‘Franco’ and ‘Greg’. The row labels, along with their respective column
label, are used to retrieve information from the data frame.
Example:
In the below example, we write True and False for every alternate row
label. It thus displays the rows that have been assigned True.
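The four kinds of loc selection discussed above can be sketched on one small data frame. The row labels Carl, Brandon, Franco and Greg and the column names Age, Color and Food come from the text; the remaining labels and all the values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [21, 22, 23, 24, 25, 26, 27],
        "Color": ["red", "blue", "green", "black", "white", "pink", "grey"],
        "Food": ["pasta", "pizza", "rice", "salad", "soup", "tacos", "curry"],
    },
    index=["Anna", "Brandon", "Carl", "Dev", "Ella", "Franco", "Greg"],
)

carl = df.loc["Carl"]               # single label: one row as a Series
two = df.loc[["Brandon", "Greg"]]   # list of labels: two rows
food = df.loc["Franco", "Food"]     # row label and column label: one value

# Boolean list: True and False for every alternate row
mask = [True, False, True, False, True, False, True]
alternate = df.loc[mask]

print(carl)
print(two)
print(food)
print(alternate)
```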
Syntax:
dataframeName.iloc[indexPosition]
The iloc indexer accepts two selectors: the row selector and
the column selector.
For the iloc function, we will use ‘names’ as a part of the zipped data
frames’ column labels instead of using them as row labels.
Data frame:
dataframeName.iloc[indexPositionNumber]
You can also use negative indexing. It starts with the last row: -1 refers to the
last (nth) row of the data frame.
Example:
dataframeName.iloc[ : , indexPositionNumber]
Example:
One must remember that when one mentions the end index, the rows
are printed until the (end index-1)th index.
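Position-based selection with iloc can be sketched as below, using a data frame in which the names are an ordinary column, as the text suggests. The contents are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Names": ["Anna", "Brandon", "Carl", "Dev"],
        "Age": [21, 22, 23, 24],
        "Subject": ["Maths", "Physics", "English", "Biology"],
    }
)

first_row = df.iloc[0]          # row at position 0
last_row = df.iloc[-1]          # negative indexing: the last row
second_column = df.iloc[:, 1]   # all rows, column at position 1
first_two = df.iloc[0:2]        # rows 0 and 1; the end index 2 is excluded

print(first_row)
print(last_row)
print(second_column)
print(first_two)
```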
describe() function:
The describe function is used to calculate statistical data of numerical
values of a dataframe. It can be used to find percentiles, central
tendency, standard deviation, etc. of the numerical values. It returns the
statistical summary of the dataframe, like summary() in R.
Syntax:
DataFrame.describe(percentiles=None, include= None, exclude= None)
The output gives the count of the elements in the series, its mean,
standard deviation, minimum value, its 25th percentile, 50th percentile,
75th percentile and maximum value along with the datatype.
chanddra.p@gmail.com
UVBL5MQSJ8
In the above example, we print the dataframe called marksheet, and then
we find the statistical summary of the subject English. Note that the
column label is written before the call to the describe function, so that
only that column is summarised.
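As a rough illustration, the same idea can be sketched with a small made-up marksheet. The marks below are assumptions, not the values from the original example.

```python
import pandas as pd

# Hypothetical marks; only the names 'marksheet' and 'English' follow the text.
marksheet = pd.DataFrame(
    {"English": [60, 70, 80, 90], "Maths": [55, 65, 75, 95]}
)
print(marksheet)

# Select the column by its label, then summarise it.
summary = marksheet["English"].describe()
print(summary)
```

The result is a Series whose labels are count, mean, std, min, 25%, 50%, 75% and max.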
Output:
The output below shows the data frame first, and then the summary of
the data. It shows the range of the dataframe, the number of columns. It
Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Data Visualization - Matplotlib
Overview:
For data scientists, data visualisation plays an important role in analysing
and interpreting data. This visualisation can be in the form of charts, graphs,
plots, etc. In Python, alongside Pandas and NumPy, there is a library called
matplotlib that helps us make all of these graphical charts and enables easy
interpretation of the data. It can be used to analyse the entire dataset overall
or specific columns and rows.
Unit Outcomes:
- Matplotlib- introduction.
- Need for matplotlib
- Line plots
- Scatterplot
- Bar plot
- Histogram
- Boxplot
- Pie Charts
- Doughnut Charts
However, the Anaconda distribution has matplotlib pre-installed, so the above
step is not necessary.
Advantages/Uses of Matplotlib:
1. It is open source: one does not need a licence to access the matplotlib
package, and it can easily be accessed by everybody.
2. It is extensible and can be customised: since matplotlib has a lot of
graphs and features, it can fit almost any circumstance.
3. It is portable and cross-platform: if you write the code on Linux,
it can run on Windows too, which makes sharing code easy.
Matplotlib, being a Python library, can run on any platform that Python supports.
Pyplot:
Pyplot is a submodule of matplotlib. It is a collection of functions that make
matplotlib work like MATLAB, which is a programming language that helps in
plotting data as graphs, implementing algorithms, creating user interfaces and
doing matrix manipulations.
We use pyplot to make graphical visualisations; each of its functions makes
changes to a figure.
To import pyplot:
In the above example, we first created two arrays which act as the points
for the x-axis and y-axis respectively. We created a line chart by using a
pyplot function. To display the chart, the plt.show() function is used,
just as we use print() to display strings and numeric data.
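A minimal sketch of such a line chart follows. The array values are made up, and the non-interactive "Agg" backend is an assumption chosen so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

# Two arrays that act as the x-axis and y-axis points (illustrative values).
x = np.array([1, 2, 3, 4])
y = np.array([2, 4, 6, 8])

lines = plt.plot(x, y)   # draws one line through the (x, y) points
plt.xlabel("x")
plt.ylabel("y")
# In an interactive session, calling plt.show() here displays the figure.
```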
Scatter plot:
As the name suggests, a scatter plot shows all data points scattered on a graph.
It shows the relationship between two types of data, for instance, the height
and weight of students in a class. Let’s say there are 50 students, with height
on the y-axis and weight on the x-axis. Each datapoint is marked at the
corresponding height-weight point on the graph. Scatterplots are often used to
graphically show the correlation between two variables.
Syntax:
Bar Plot:
Histogram:
A histogram, like a bar plot, is a graphical representation of data
divided into buckets (bins). Histograms are used to show the frequency
distribution of a continuous numerical variable, unlike bar plots, which are
generally used for discrete or categorical data.
Syntax:
plt.hist()
Example:
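A small sketch of plt.hist() on made-up height values is given here. The data and the choice of three bins are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical heights (in cm) of ten students.
heights = [150, 152, 155, 158, 160, 161, 163, 165, 170, 178]

# bins=3 splits the range 150-178 into three equal-width buckets.
counts, bin_edges, patches = plt.hist(heights, bins=3)
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
print(counts)
```

plt.hist() returns the count in each bin, the bin edges, and the drawn bars, so the frequencies can be inspected as well as plotted.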
Box Plot:
The main idea of a box plot is to represent the data using a five-number
summary. This includes the minimum of the range, the maximum, the median,
and the first and third quartiles.
A box plot also helps to realise how many outliers our dataset might have. The
extending lines from the boxes show how the data is variable outside the
upper and lower quartiles.
Syntax:
Pie Charts:
The basic idea of a pie chart is that it looks like a pie: it is circular and
divided into slices, each showing the quantity it represents out of the total.
A circle has 360 degrees, and each slice takes up a share of it in proportion
to its representation/frequency in the dataset.
Syntax:
Example:
In the above example, we show a pie chart with 5 slices divided proportionately.
They show the number of people who like each dry fruit. Let’s say we
want to highlight the number of people who like raisins; to do so, we use the
explode parameter to separate that slice from the rest of the pie.
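A sketch of a five-slice pie chart with one exploded slice follows. The dry fruit names other than raisins and all the counts are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical survey: number of people who like each dry fruit.
labels = ["Almonds", "Cashews", "Raisins", "Walnuts", "Pistachios"]
votes = [30, 25, 15, 20, 10]

# A non-zero explode value pulls that slice away from the centre.
explode = [0, 0, 0.2, 0, 0]

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    votes, labels=labels, explode=explode, autopct="%1.0f%%"
)
```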
Doughnut Charts:
Doughnut charts are similar to pie charts. They show categorical data
divided into parts of a whole chart. The only difference here is that the center
part of the chart is absent. A doughnut chart uses the length of the arc to
represent the information, unlike a pie chart, which focuses on comparing the
proportional area of the wedges. The center part can be used to display
additional information about the chart.
Example:
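One common way to draw a doughnut chart in matplotlib, sketched here under the assumption that a pie chart with narrowed wedges is acceptable, is to pass a wedge width through wedgeprops. The labels and values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical categories and values.
labels = ["A", "B", "C", "D"]
values = [40, 30, 20, 10]

fig, ax = plt.subplots()
# width < 1 leaves the centre of the pie empty, giving a doughnut.
wedges, texts = ax.pie(values, labels=labels, wedgeprops=dict(width=0.4))

# The empty centre can carry additional information about the chart.
ax.text(0, 0, "Total: 100", ha="center", va="center")
```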
Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Data Visualization - SeaBorn
Overview:
The purpose of matplotlib is known: to make visualisations. Now imagine a
library built on top of matplotlib that can be used not only to make
visualisations, but also to help in statistical analysis on time series data,
or even to help picture linear regression models. There is one library that
does all of these things and more; it is called seaborn.
Unit Outcomes:
- Introduction
- Difference between seaborn and matplotlib
- Joint plot
- Scatterplot
- Box plot
- Pair plot
- Hex bin plot
- Violin plot
- Strip plot
- Pie chart
---------------------------------------------------------------------------------------------------------
Difference between matplotlib and seaborn:
While being very similar, matplotlib and seaborn vary in a lot of aspects, in
their functionality, flexibility, visualisations, etc.
Feature | Seaborn | Matplotlib
Plots and graphs usually made | All plots in matplotlib plus heatmap, factor plot, density plot, joint plot, etc. | Bar graphs, histograms, pie charts, scatter plots, lines, box plots, etc.
Dealing with multiple figures | It sets a time for creation of every figure. It can also lead to out-of-memory issues. | Multiple figures can be opened and closed simultaneously, although figures are closed distinctly.
Visualisation | It is more congenial to handling Pandas data frames. Basic methods are used to provide graphics. | It is well connected to NumPy and Pandas. Pyplot in matplotlib provides features similar to MATLAB. Matplotlib also acts as a graphics package for data visualisation.
For the above statement, you will get an output displaying all 18 dataset
names:
1. Joint plots:
A joint plot is a plot of two variables with bivariate and univariate graphs.
It has three plots. One displays the bivariate data, which shows how the
dependent variable varies with respect to the independent variable. Above the
bivariate graph is another, horizontally placed graph that shows the
distribution of the independent variable. Yet another plot is on the right
margin of the bivariate plot. Its orientation is vertical, and it shows the
distribution of the dependent variable. All in all, one bivariate chart is
accompanied by a univariate analysis of each of its variables, above and on
the right margin.
jointplot() creates a scatter plot of the bivariate data and two
histograms, above and on the right margin, by default.
The given parameters are not the only ones, but they are amongst the
important ones, where:
x, y – these are names of variables in the ‘data’.
data – it takes the dataframe where x and y are variable names.
kind – takes into account the kind of plot to draw, e.g. hex, scatter,
etc.
color – assigns the color used for the plot elements.
dropna – takes Boolean values. If it is True, missing observations are
removed from ‘x’ and ‘y’.
Return – returns a JointGrid object with the plot on it.
Example:
Input:
Here, we take into consideration a built-in dataset called ‘tips’. The
dataset has 244 rows and 7 columns that, in short, show the total bill, the
tip, the sex of the person, etc.
We write a jointplot function where we compare the ‘total_bill’ with the
‘tip’. We perform a regression analysis and give it an orange color. The
expected output chart shows whether the two variables are
correlated, and we can make judgements accordingly.
Scatter Plot:
One can plot a scatter chart with only the bivariate data as well. A scatter plot
can be used to understand semantic groupings better. It is a 2-dimensional
graph that can be mapped with three additional variables using the size, hue and
style parameters.
Syntax:
seaborn.scatterplot(x, y, hue, style, size, data, palette, markers)
Example:
In this case, for ease of availability, we use a built-in dataframe; one can use
any independent dataset as well.
We use the ‘Tips’ dataframe. We provide the following input, where the
dependent variable ‘y’ is ‘tip’ and the independent variable ‘x’ is ‘total_bill’.
The dataset is loaded into the variable named data. The hue is set to ‘smoker’;
hue segregates elements with different colors. Style is set to ‘smoker’, so
elements are shown with different marker styles, and lastly, size is set to
‘size’, so each datapoint is drawn according to the value of its ‘size’ column.
Box Plot:
A box-and-whisker plot is mainly used to compare quantitative data across
categories. It shows the distribution of the quantitative data in a manner where
its variables can be compared. The five main parts of a box-whisker plot are the
median, first quartile, third quartile, maximum observed value, and minimum
observed value. Thus, the box shows the quartiles, and the whiskers show the
rest of the distribution. Points lying beyond the whiskers are called ‘outliers’.
Syntax:
These are amongst the main arguments of the boxplot() method. Their use is:
x, y – names of variables from the ‘data’
data – the dataframe which has x and y as column heads
hue – name of grouping variable that segregates data with different colors.
Contains nominal type of data.
orient – to set the orientation of the box-whisker plot. Can be horizontal or
For the given input we get an appropriate box and whisker plot where the days
are on the x-axis and the total bill on the y-axis. It takes into account whether a
datapoint/person is a smoker or not. The palette is set to pastel shades, so all
different levels are shown in pastel colors. The output:
These are amongst the most used arguments; however, there are many more. The
function of these arguments is:
data – the dataset where each column is a variable name
hue – grouping variable to plot aspects of the data in different colors
palette – to use a different color palette for different aspects mapped in the
hue variable
dropna – to drop missing values from the data before it is plotted. Boolean
type of variable.
kind – can be ‘scatter’, ‘kde’, ‘hist’, ‘reg’ as per requirements
height – to provide the height of each facet
{x,y}_vars – variables from the dataframe that can separately be used for rows
and columns of the figure
markers – as in scatterplot, markers in pairplot are used to draw each level
of the hue variable with a different marker.
Example:
We use the tips dataset and draw a pairplot with hue ‘smoker’.
Input:
Syntax:
Example:
We take the ‘Tips’ dataset into consideration and draw a hexbin plot for
‘total_bill’ and ‘tip’ variables.
Input:
For the given input, the bivariate hexbin plot shows that more datapoints are
concentrated where the total bill amount is between $10 and $20 and the tip is
between $0 and $4. The univariate chart above shows the histogram for the
x-axis data, and the histogram on the right margin of the chart shows the data
of the ‘tip’ variable.
Syntax:
These are a few of the many parameters that can be used. These
parameters mean:
x,y – variables that are used for plotting datapoints.
hue – grouping variable to plot aspects of the data into different colors.
data – the dataset that has variables x and y as column heads
scale – to scale the width of each violin
Example:
Output: in the output, we see 4 violins that are made according to the day and
their respective ‘total_bill’ values.
Strip plot:
A strip plot complements a box-and-whisker plot or violin plot. It is used to
draw a scatter plot on the basis of categories.
Syntax:
seaborn.stripplot(x,y,data,hue,palette)
All parameters function in a similar pattern as that of violin plot and boxplot.
In the above example, we have taken the ‘Tips’ dataset and we consider the tip
on the y-axis and days on the x-axis. As the name suggests, the plot is like a
strip. The jitter attribute provides displacements on the horizontal axis.
Pie Chart:
We use matplotlib to create a pie chart, but we can use Seaborn palettes to
make it aesthetically better. There is no specific method in seaborn for pie
charts.
Example:
Here, we have randomly assigned values to different subjects and used
seaborn for the color palette of the pie chart.
Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Introduction to EDA
Overview:
As data scientists, our main aim is not just to draw graphs based on the
dataset. Important as it is, that step comes after EDA. EDA
helps analyse the data and determine the goals behind the analysis. Data
scientists identify the relationships between factors/variables in the dataset
and then use tools and software like R and Python to explore the data and
make inferences from it.
Unit Outcomes:
- EDA: Understanding
- Data Manipulation
- Skewness
- Kurtosis
- Pair plot analysis
- Categorical Data encoding
Numerous tools can be used for EDA. In fact, all the libraries that we have
covered until now are direct or indirect tools for EDA. Various
graphical techniques include:
- Bar graphs
- Scatter diagrams
- Stem and leaf charts
- Pareto charts
- Boxplots and histogram
- Odds ratio, etc.
Data Manipulation:
For any type of data collected, discrepancies, fallacies and other human errors
have to be accounted for. There are many datapoints where the information is
simply not available, or incorrect information is entered intentionally, or
unintentionally. To do true justice to the data collected and make proper and
accurate inferences from them, the data is manipulated.
Data manipulation is basically organising data so that it becomes better to read
and understand and its format becomes more structured and uniform. It is
used to optimise the working of the firm/industry. Business operations can be
better carried out if the data and surveys a company collects are organised.
Some of the major objectives of data manipulation include:
1. Apply function:
The apply function in Pandas is used for creating variables and
manipulating a DataFrame. It passes each row or column of a data frame
to a function and returns the resulting values. Once it takes an
input function, it applies it along the whole DataFrame. For tabular
data, specify the axis on which you want your function to act, i.e., 1
for rows and 0 for columns.
Example: For the given technique, we use an in-built iris dataset.
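A minimal sketch of apply on both axes is given here. Rather than loading the full iris dataset, it uses three hand-typed rows with iris-style column names, so the values are assumptions.

```python
import pandas as pd

# Hypothetical rows with iris-style column names.
df = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 6.3], "sepal_width": [3.5, 3.0, 3.3]}
)

# axis=0: the function receives one column at a time.
col_means = df.apply(lambda col: col.mean(), axis=0)

# axis=1: the function receives one row at a time.
row_sums = df.apply(lambda row: row.sum(), axis=1)

print(col_means)
print(row_sums)
```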
2. Sorting a DataFrame:
Data can be sorted based on columns:
3. Merging DataFrames:
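Two dataframes can be combined into one using the Pandas library. The
concat function stacks dataframes along an axis, while the merge function
joins them on common key columns, so the two give the same output only in
special cases.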
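The behaviour of the two functions can be compared with a small sketch. The two data frames and the key column ‘id’ are hypothetical.

```python
import pandas as pd

# Hypothetical frames sharing an 'id' column.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})

# concat stacks the rows; columns missing in one frame are filled with NaN.
stacked = pd.concat([left, right], ignore_index=True)

# merge joins rows whose 'id' values match (inner join here).
joined = pd.merge(left, right, on="id", how="inner")

print(stacked)
print(joined)
```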
4. Crosstab in Pandas
A cross-tabulation creates a contingency table of the
elements/datapoints of a DataFrame, in case of categorical data. For
instance, in the given example, we compare the time and the day on which
most people visit the place in the ‘tips’ dataset.
The output gives us the frequency of visitors on a particular day and
whether they visited the place during lunch or dinner. The margins
attribute adds the row and column totals to the table.
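A sketch of such a cross-tabulation on a handful of made-up visits is given here. The column names follow the ‘tips’ dataset, but the rows are assumptions.

```python
import pandas as pd

# Hypothetical visits with 'day' and 'time' columns as in the 'tips' dataset.
visits = pd.DataFrame(
    {
        "day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat"],
        "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Lunch"],
    }
)

# margins=True adds an 'All' row and column holding the totals.
table = pd.crosstab(visits["day"], visits["time"], margins=True)
print(table)
```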
5. Pivot table:
In Excel, we often use pivot tables to summarise huge chunks of data.
Numerical data can easily be analysed using pivots. These tables can also
be created in Python using the Pandas .pivot() function, or .pivot_table()
when the values need to be aggregated.
Example: in the given example, we see the sum of tips and total_bill
collected on a particular time, on each day, by Males and Females. This
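A pivot of this kind might be sketched as follows, assuming pandas' pivot_table function with a sum aggregation. The four rows are made up and only reuse the column names of the ‘tips’ dataset.

```python
import pandas as pd

# Hypothetical rows with column names borrowed from the 'tips' dataset.
tips_sample = pd.DataFrame(
    {
        "day": ["Thur", "Thur", "Sat", "Sat"],
        "time": ["Lunch", "Lunch", "Dinner", "Dinner"],
        "sex": ["Male", "Female", "Male", "Male"],
        "total_bill": [10.0, 20.0, 30.0, 40.0],
        "tip": [1.0, 3.0, 4.0, 6.0],
    }
)

# Sum of tip and total_bill for each (time, day), split by sex.
summary = pd.pivot_table(
    tips_sample,
    values=["tip", "total_bill"],
    index=["time", "day"],
    columns="sex",
    aggfunc="sum",
)
print(summary)
```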
Boolean Indexing:
Boolean indexing selects data using the actual values in a dataframe. We
can access the data by masking it based on its index value or
column value, that is, by applying a Boolean mask to the dataframe or by
accessing it with a Boolean index. Usually, we create an index which contains
Boolean values, which can then be used with .loc[] or .iloc[] (the older .ix[]
indexer has been removed from recent versions of pandas).
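A short sketch of both styles of Boolean selection follows. The data frame and the threshold of 80 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame(
    {"name": ["Franco", "Greg", "Mira", "Noor"], "score": [71, 85, 90, 64]}
)

# A Boolean mask built from the column values.
mask = df["score"] > 80
high = df.loc[mask]

# A hand-written Boolean list: only rows marked True are returned.
alternate = df.iloc[[True, False, True, False]]

print(high)
print(alternate)
```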
Skewness:
In statistical analysis, skewness means the degree of asymmetry in a
probability distribution, or the shift of the distribution curve of a given
dataset. The distortion that deviates from the symmetric bell curve,
or normal distribution, in a set of data is called skewness. If the curve is
shifted towards its right or left, it is said to be skewed. A normal
distribution is not skewed, i.e., its skewness is zero.
A curve whose bulk lies on the right side, with its tail tapering towards the
left, is called a left-tailed curve and is negatively skewed. On the other hand,
when the bulk of the curve lies on the left, tapering and decreasing on the
right, it is a right-tailed or positively skewed curve.
For a positively skewed curve, the mean will be greater than the median, and
the reverse is true for negative skew, i.e., the median is greater than the
mean.
Skewness can be found using Pearson’s first and second coefficients. In
Pearson’s first coefficient of skewness, we subtract the mode from the mean and
divide the result by the standard deviation.
The general formula is:
$$Sk_1 = \frac{\bar{x} - M}{sd}$$
Where Sk1 is Pearson’s first coefficient of skewness,
x̄ is the mean,
M is the mode,
sd is the standard deviation.
$$Sk_2 = \frac{3(\bar{x} - \text{Median})}{sd}$$
Where Sk2 is Pearson’s second coefficient of skewness,
x̄ is the mean,
sd is the standard deviation.
When the mode is strong, the first coefficient of skewness is used. However,
when a distribution has multiple modes or a weak mode, then Pearson’s
second coefficient is used.
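Both coefficients can be worked out on a small made-up sample, as sketched here. The data values are assumptions, and the sample standard deviation is used for sd, which is also an assumption since the text does not say which form is meant.

```python
import statistics

# Hypothetical right-skewed sample.
data = [2, 3, 3, 4, 5, 7, 11]

mean = statistics.mean(data)      # 5
median = statistics.median(data)  # 4
mode = statistics.mode(data)      # 3
sd = statistics.stdev(data)       # sample standard deviation (assumption)

sk1 = (mean - mode) / sd          # Pearson's first coefficient
sk2 = 3 * (mean - median) / sd    # Pearson's second coefficient

print(round(sk1, 3), round(sk2, 3))
```

Both values come out positive, which agrees with the mean being greater than the median for a positively skewed sample.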
A typical skewed graph looks like:
Kurtosis:
Just as skewness measures the asymmetry of the distribution, kurtosis
measures the heaviness of its tails. It is used to identify whether the
distribution contains extreme values. The kurtosis of a normal distribution is
equal to 3. Excess kurtosis is the difference between the kurtosis of a given
distribution and the kurtosis of a normal distribution.
Excess Kurtosis = Kurtosis - 3
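As a rough worked example, kurtosis can be computed from the second and fourth central moments. The moment-based definition (fourth moment divided by the squared second moment, population form) and the data are assumptions used for illustration.

```python
# Hypothetical, evenly spread sample.
data = [1, 2, 3, 4, 5]

n = len(data)
mean = sum(data) / n

# Second and fourth central moments (population form).
m2 = sum((x - mean) ** 2 for x in data) / n
m4 = sum((x - mean) ** 4 for x in data) / n

kurtosis = m4 / m2 ** 2
excess_kurtosis = kurtosis - 3

print(kurtosis, excess_kurtosis)
```

Here the kurtosis is below 3, so the excess kurtosis is negative, which is what we expect from flat, light-tailed data.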
For the ‘Tips’ dataset, the variable ‘day’ has categorical data with values as days
of the week.
- Nominal data: when the categories don’t have a particular order that
they follow. While encoding nominal data, one must check if the feature
is present or not. For example, the department of employees of a
- Ordinal encoding:
it is used in case of data that is ordinal in nature. Each label is assigned
an integer value. A variable is created containing categories.
For instance, in the given case, we transform the days via ordinal
encoding, and assign an integer to each day.
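The idea of assigning an integer to each day can be sketched with plain pandas. This is a simplified stand-in, not the category_encoders approach; the day values and the chosen order are assumptions.

```python
import pandas as pd

# Hypothetical 'day' values, as in the 'tips' dataset.
days = pd.Series(["Thur", "Fri", "Sat", "Sun", "Sat"])

# Assumed order of the categories; each label gets an integer.
order = {"Thur": 1, "Fri": 2, "Sat": 3, "Sun": 4}

encoded = days.map(order)
print(encoded)
```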
category_encoders is not a pre-installed library in Anaconda. One must
install it by writing the first line, i.e.,
Once written, run the code. Close the kernel and reopen it. Execute the
code and you will receive an output such as:
Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Outlier detection and Treatment
Overview:
Outlier by definition is ‘a person or thing situated away or detached from the
main body or system.’ It is self-explanatory. In real life, we know when
something is different from usual, and cannot be a significant object to draw
conclusions from. For instance, if out of 100, 90 students receive marks
between 60-80, then the other 10 students who receive marks anywhere
between 0 to 45 and above 90 are called outliers because those datapoints are
far away from the main group of data.
Unit Outcomes:
- Introduction: What are outliers?
- Detecting outliers
- Visualisation of outliers
- Missing value detection and treatment
- Outlier treatment
- Project
Examples of outliers:
- When we talk about the revenue of a company, we measure it by the
sales of each type of product and its cost. Suppose product A yields 1200
lakhs, product B yields 1400 lakhs, product C yields 1100 lakhs, product D
yields 1300 lakhs and the best-selling product, product E, yields 2100
lakhs. While evaluating the company’s performance, or the prospective
growth trends or patterns it must follow, product E will be considered
an outlier. If it is included, the calculations will not be based on the
objective of increasing sales of the other 4 products but on the general
sales increase. This variation might hinder the company’s actual
growth.
Detecting Outliers:
1. The simplest way to detect outliers is by sorting the data. It is effective
as it will highlight all the unusual datapoints. Sorting in ascending or
descending order shows the highest and lowest values and will give you
a gist of how spread out the datapoints are with respect to the given
variable/parameter. However, this method doesn’t quantify the degree
of unusualness of an outlier.
3. Using z-scores, one can identify outliers in the data. The unusualness of
an outlier can be quantified using z-scores for a normal distribution. A
z-score shows the number of standard deviations above or below the mean for
each datapoint. A positive z-score ‘x’ indicates that an observation is ‘x’
standard deviations above the mean. A negative z-score ‘-x’ indicates that an
observation is ‘x’ standard deviations below the mean. A z-score of 0
represents no deviation from the mean. The farther the z-score is from 0, the
more unusual the datapoint is said to be. A standard cut-off for z-scores is
±3: z-scores beyond ±3 are extreme.
Z-scores assume that populations are independent of each other and
that they are normally distributed. It is also assumed that the sample
size is greater than or equal to 30.
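A minimal sketch of z-score based detection is given here. The marks are made up, the population standard deviation is used, and the sample is smaller than the 30 observations mentioned above, so it only illustrates the arithmetic.

```python
import numpy as np

# Hypothetical marks of 20 students; one mark is far below the rest.
marks = np.array([62, 65, 67, 68, 69, 70, 70, 71, 71, 72,
                  72, 73, 74, 75, 76, 77, 78, 79, 80, 5])

# z-score: number of standard deviations away from the mean.
z_scores = (marks - marks.mean()) / marks.std()

# Flag datapoints whose z-score lies beyond the cut-off of +/-3.
outliers = marks[np.abs(z_scores) > 3]
print(outliers)
```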
4. Creating outlier fences using the interquartile range can also help
detect outliers. One can use the interquartile range, various quartile
Visualisation of Outliers:
As mentioned before, outliers can be detected by visualising them. They are
mostly visualised using histograms, scatterplots and boxplots.
For bivariate data, we use scatterplots since they show the relation between two
variables. Let’s consider the ‘tips’ dataset. We check whether total_bill and tip
are correlated. Here, it is observable that the datapoints that are far off from
the regression line are ones with unusual combinations of total_bill and tip
values. In this case, the circled points are clearly the outliers, which means
that they are unusual compared to the rest of the datapoints.
Single-variable outlier detection can also be done using histograms. In case of
histograms, a point on the far left or right is an outlier. In general, it can be
said that outliers are points that fall above the third quartile or below the
first quartile by more than 1.5 times the interquartile range.
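The 1.5 times interquartile range rule can be sketched as follows. The bill amounts are made up for illustration.

```python
import numpy as np

# Hypothetical bill amounts; one value is far above the rest.
bills = np.array([10, 12, 13, 15, 16, 18, 20, 21, 50])

q1, q3 = np.percentile(bills, [25, 75])
iqr = q3 - q1

# Outlier fences: 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = bills[(bills < lower_fence) | (bills > upper_fence)]
print(lower_fence, upper_fence, outliers)
```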
Figure 1
The .isnull() method checks whether the values are null. The .notnull() method
does the exact opposite and checks whether the values are not null.
In the following example we check if values in the variable ‘deck’ are null or
not. If they are not null, the output would print the respective rows. According
Once we have found out whether our dataset has null values or not, we move
on to decide what can be done with these rows or those particular datapoints:
1. Ignoring missing values: this can be done when the samples without
missing data are sufficient for analysis even if incomplete cases are not
considered.
2. Deleting rows: The next easiest way to deal with null values is deleting a
particular column if it has more than 70-75 percent missing values, or deleting
a row if it has a null value for a particular characteristic. One has to be
cautious that after deletion there is no bias in the data. One must also
remember that deletion causes loss of information, which might affect
the anticipated results during prediction. It does not work well when the
proportion of missing values is high; when few values are missing, however,
the resulting model can be robust and accurate.
We can use the .dropna() function to drop or delete all NA values. For
instance, in case of the titanic dataset, we first check the number of null
values in each specified variable (figure 1). Then we use the .dropna()
function and check whether there are null values in the dataset. We can
notice that there are no null values, which means all rows with null
values have been dropped/deleted.
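The same check-drop-check sequence can be sketched on a tiny made-up frame. The column names echo the titanic dataset, but the values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with some missing values.
df = pd.DataFrame(
    {"age": [22, np.nan, 35, 41], "deck": ["C", None, "E", None]}
)

# Number of null values in each column before deletion.
nulls_before = df.isnull().sum()

# Drop every row that contains at least one null value.
cleaned = df.dropna()
nulls_after = cleaned.isnull().sum()

print(nulls_before)
print(nulls_after)
```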
6. Predicting missing values: machine learning algorithms can be used to
predict the nulls using values that are not missing. This method has good
accuracy unless a high variance is expected from the missing value. It
gives us unbiased estimates for model parameters. Bias arises not
only from the data being large but also when the conditioning set used
for a categorical variable is incomplete.
One can use the SciKit Learn library to perform the linear regression and KNN
imputation to find missing values.
Outlier treatment:
One way to handle outliers is trimming, i.e., removing them, although this is
not good practice because information is lost. To trim, one can copy the
normal values to another array and discard the remaining outliers.
Flooring and capping based on quantiles floors an outlier at the 10th
percentile value and caps it at the 90th percentile value: outliers that lie
above the 90th percentile are replaced by the value of the 90th percentile,
and outliers below the 10th percentile are replaced by the value of the 10th
percentile.
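Flooring and capping at the 10th and 90th percentiles can be sketched with NumPy. The values are made up for illustration.

```python
import numpy as np

# Hypothetical values with one very low and one very high outlier.
values = np.array([1, 20, 22, 23, 25, 26, 27, 28, 30, 32, 95])

# 10th and 90th percentile values act as the floor and the cap.
floor, cap = np.percentile(values, [10, 90])

# Values below the floor become the floor; values above the cap become the cap.
capped = np.clip(values, floor, cap)
print(floor, cap)
print(capped)
```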
Moving further, to perform Exploratory Data Analysis on any dataset, one must
follow the basic steps:
1. Decide the objective of the analysis. EDA is making sense of all the
numbers in the data. For instance, in the considered dataset, the median
value of owner-occupied homes will not make any sense if we do not
3. Looking at the type of data that it has. It can be noticed that all
columns are of numeric type, some are integer values, but most
others are floating point type values.
For all the variables, the pairplot will make bivariate charts. This
will make it easier for the analyst to figure out which variables are
correlated instead of trying to find the correlation between some
two variables individually.