This document provides an overview of key Python concepts including: 1) Basic calculations, numbers, Boolean, None, and multiple assignments using basic operators and print statements. 2) Strings - accessing characters and getting string length. 3) Functions - defining and calling a simple sum function. 4) Loops - for and while loops to iterate through items. 5) If statements for conditional logic. 6) Lists - creation, accessing/changing elements, sorting, slicing, and iterating. 7) Introduction to NumPy arrays and pandas Series and DataFrames for working with tabular data.


Numbers

# Basic Calculations
1+2
5/6

# Numbers
a = 123.1
print(a)
b = 10
print(b)

a + b

c = a + b
print(c)
1
Boolean and None
Multiple Assignments
# Boolean
a = True
b = False
print(a, b)

# No value
a = None
print(a)

# Multiple Assignments
a, b, c = 1, 2, 3
print(a, b, c)

2
Strings
# Strings
data = 'hello world'

print(data)

print(len(data))  # length of data: 11 characters

print(data[0])  # gives the first letter in the word: 'h'

3
Functions
# Sum function
def mysum(x, y):
    return x + y  # the body specifies what to do with the arguments

# Be very careful with the correct indentation!
# Recommendation: use 4 spaces.

# Test sum function
result = mysum(1, 3)  # literals here; variables could also be passed
print(result)

locals()
# will give you a dictionary of all local variables.
4
Loops
# For-Loop
myrange = range(10)  # the domain/data we are looking at: 0 to 9
for i in myrange:
    print(i)

# A range could be any series of numbers, e.g., (1, 2, 3, 4, 5)

# While-Loop
i = 0
while i < 10:  # the condition set for the value of i
    print(i)   # give the result
    i += 1     # shorthand for i = i + 1; without it, i stays 0 forever
5
If Statements
# If Statement (Boolean)
is_customer = True
if is_customer:
    print("Yes")

# If Statement with else branch
is_customer = "No"
if is_customer == "No":  # use == for comparison (also with text)
    print("No")
else:
    print("Yes")

6
Appendix:
Arithmetic Expressions
• An arithmetic expression consists of operands
and operators combined in a manner that is
already familiar to you from learning algebra

7
Arithmetic Expressions (continued)
• Precedence rules:
– ** has the highest precedence and is evaluated first
– Unary negation is evaluated next
– *, /, and % are evaluated before + and -
– + and - are evaluated before =
– With two exceptions, operations of equal precedence
are left associative, so they are evaluated from left to
right
» ** and = are right associative
– You can use () to change the order of evaluation

8
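The precedence rules above can be checked interactively; a minimal sketch (the values are chosen purely for illustration):

```python
# ** binds tighter than unary negation: -3 ** 2 is -(3 ** 2)
print(-3 ** 2)      # -9

# ** is right associative: 2 ** 3 ** 2 is 2 ** (3 ** 2)
print(2 ** 3 ** 2)  # 512

# *, /, and % are evaluated before + and -
print(1 + 2 * 3)    # 7

# () changes the order of evaluation
print((1 + 2) * 3)  # 9

# equal precedence is left associative: 8 / 4 / 2 is (8 / 4) / 2
print(8 / 4 / 2)    # 1.0
```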
Mathematical Constants
# e, pi
import math
math.e
math.pi

10
Lists
emptylist = []
mylist = [1, 2, 3]
print("Zeroth Value: ", mylist[0])

mylist.append(4)  # use append to add a value to the list

print("List Length: ", len(mylist))

mylist += [5]  # another way to add to the list

print("List Length: ", len(mylist))

for value in mylist:
    print(value)  # gives the values in the list
21
Sorting Lists
# A list can be sorted in-place (without creating a new
# object) by calling its sort function:
a = [7, 2, 5, 1, 3]

a.sort()

print(a)

for i in range(len(a)):
    print(a[i])  # also gives the list of values, one per line, as before

22
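To contrast with the in-place sort above: the built-in sorted() function returns a new sorted list and leaves the original unchanged (a minimal sketch):

```python
a = [7, 2, 5, 1, 3]

b = sorted(a)  # returns a new list; a is not modified
print(a)       # [7, 2, 5, 1, 3]
print(b)       # [1, 2, 3, 5, 7]

a.sort()       # sorts in-place; returns None
print(a)       # [1, 2, 3, 5, 7]
```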
Slicing
# You can select sections of most sequence types by
# using slice notation, which in its basic form consists
# of start:stop passed to the indexing operator []:

seq = [7, 2, 3, 7, 5, 6, 0, 1]
seq[1:5]

# While the element at the start index is included,
# the stop index is not included, so that the number of
# elements in the result is stop - start.

23
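The stop - start rule above can be verified directly with the same sequence:

```python
seq = [7, 2, 3, 7, 5, 6, 0, 1]

part = seq[1:5]
print(part)       # [2, 3, 7, 5]: element at index 1 in, index 5 out
print(len(part))  # 4, i.e., stop - start = 5 - 1
```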
Slicing (Cont’d)
# Slices can also be assigned to with a sequence.
seq[3:4] = [6, 3]
seq

# Check the output carefully!!

seq[3:5] = [0, 1]
seq

# Check again!

24
Slicing (Cont’d)
# Either the start or the stop can be omitted, in which case
# they default to the start of the sequence and the end
# of the sequence, respectively:

seq[:5]

seq[3:]

25
Slicing (Cont’d)
# Negative indices slice the sequence relative to the
# end:

seq[-4:]

seq[-6:-2]

26
Data Sets: Lists in Lists – Not quite!
# access values
dataset = [[1, 2, 3], [3, 4, 5]]  # a list containing two rows
print(dataset)

print("First row: ", dataset[0])  # gives [1, 2, 3]

print("Last row: ", dataset[-1])

print("Specific row and col: ", dataset[0][2])
# gives 3: the first bracket selects row 0, the second selects
# position 2 (of 0, 1, 2) within that row

# How to get an entire column, say the third column?
print("Whole col: ", [row[2] for row in dataset])
27
Next Step: Numpy Arrays
import numpy
mylist = [[1, 2, 3], [3, 4, 5]]
myarray = numpy.array(mylist)  # turns the nested list into a 2-dimensional array
print(myarray)
print(myarray.shape)  # (2, 3): 2 rows, 3 columns

print("First row: ", myarray[0])  # gives the first row

print("Last row: ", myarray[-1])  # gives the last row

print("Specific row and col: ", myarray[0, 2])
# gives the entry in row 0 and column 2

# How to get an entire column, say column 2?
print("Whole col: ", myarray[:, 2])  # ":" means all rows, column 2
28
And then came the pandas …
A Series of them! ??
import pandas as pd  # always run this import first
mylist = [1, 2, 3]
myrownames = ['Ann', 'Ben', 'Clay']
myseries = pd.Series(mylist, index=myrownames)
# joins the values and the row names into one Series
print(myseries)
print(myseries.index)   # gives the names
print(myseries.values)  # gives the values against the names

print(myseries[0])      # gives 1
print(myseries['Ann'])

print(myseries[[0, 2]])
# gives the entries for customers 0 and 2; note the two brackets:
# the outer [] does the retrieval, the inner [] is the list of
# positions, so more than one element can be fetched at once
print(myseries[['Ann', 'Clay']])

30
pandas Series slicing
# Try some numerical slicing, e.g.,
print(myseries[:])
print(myseries[1:3])
print(myseries[1:])
print(myseries[:1])
print(myseries[:-1])
print(myseries[-1:])

# How about using the labels:


print(myseries['Ann':'Clay'])
print(myseries['Ann':])
print(myseries[:'Ann'])
# Note the difference!
# Label slicing includes both extremes, unlike positional slicing above.
31
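The difference noted above can be made explicit: positional slices exclude the stop, while label slices include both endpoints. A small sketch with the same Series:

```python
import pandas as pd

mylist = [1, 2, 3]
myrownames = ['Ann', 'Ben', 'Clay']
myseries = pd.Series(mylist, index=myrownames)

# positional: stop index excluded -> 2 elements ('Ann', 'Ben')
print(myseries[0:2])
print(len(myseries[0:2]))            # 2

# label-based: both extremes included -> all 3 elements
print(myseries['Ann':'Clay'])
print(len(myseries['Ann':'Clay']))   # 3
```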
Explicit is better than implicit!
# In other words, use iloc or loc for indexing or
# slicing.

print(myseries.iloc[0])  # just gives the number against 'Ann'

print(myseries.loc['Ann'])

print(myseries.iloc[[0, 2]])
# accessing with a list gives a sub-Series,
# i.e., the complete entries for 'Ann' and 'Clay'

print(myseries.loc[['Ann', 'Clay']])  # only these two entries

print(myseries.iloc[0:2])  # a positional slice of the rows

print(myseries.loc['Ann':'Clay'])
32
And finally the DataFrame
# dataframe
import pandas as pd
myarray = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # lists within lists: 2-D
rownames = ['Ann', 'Ben', 'Clay']
colnames = ['Amt', 'Bills', 'Clicks']
mydataframe = pd.DataFrame(myarray, index=rownames,
                           columns=colnames)
print(mydataframe)

mydataframe.head(2)  # head gives the first 2 rows

print(mydataframe.index)
print(mydataframe.columns)
print(mydataframe.values)
33
Rows and Columns – to Series
# a row in a DataFrame can be retrieved as a Series
print("row 'Ann':\n" + str(mydataframe.loc['Ann']))
print("row 'Ann':\n" + str(mydataframe.iloc[0]))

# a column in a DataFrame can be retrieved as a Series
print("column 'Amt':\n" + str(mydataframe['Amt']))
print("column 'Amt':\n" + str(mydataframe.Amt))

print("column 'Amt':\n" + str(mydataframe.loc[:, 'Amt']))

print("column 'Amt':\n" + str(mydataframe.iloc[:, 0]))
# [rows, columns] is the format for iloc, so the line above
# selects all rows against column 0 only
34
Rows and Columns – to DataFrame
# rows in a DataFrame can be retrieved as a DataFrame
print(mydataframe[0:2])
print(mydataframe['Ann':'Ben'])
print(mydataframe.iloc[0:2])         # entire rows 0 and 1
print(mydataframe.loc['Ann':'Ben'])  # the same rows, by label

# columns in a DataFrame can be retrieved as a DataFrame
print(mydataframe[['Amt', 'Clicks']])

print(mydataframe.loc[:, ['Amt', 'Clicks']])
# all rows, but only these columns, in the format [rows, columns]
print(mydataframe.iloc[:, [0, 2]])

# How about rows 'Ann' and 'Clay' and columns 'Amt' to 'Clicks'?
35
Rows and Columns – Summary
Bracket Notation (mydataframe[]) works with:
• individual column names or a list of column names
to retrieve columns
• slices of row indices to retrieve rows

So, you might want to use loc or iloc to be explicit
and avoid potential confusion!

36
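The potential confusion can be seen directly: with bracket notation, the same syntax selects columns in one case and rows in another. A minimal sketch with the DataFrame from the earlier slides:

```python
import pandas as pd

myarray = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mydataframe = pd.DataFrame(myarray,
                           index=['Ann', 'Ben', 'Clay'],
                           columns=['Amt', 'Bills', 'Clicks'])

# a column name retrieves a column ...
print(mydataframe['Amt'])

# ... but a slice retrieves rows -- same brackets, different behavior
print(mydataframe[0:2])

# being explicit avoids the ambiguity:
print(mydataframe.loc[:, 'Amt'])  # column, by label
print(mydataframe.iloc[0:2, :])   # rows, by position
```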
Reading Data from CSV file
import pandas as pd
# load CSV using pandas
from pandas import read_csv
filename = 'OnlineRetail.csv'
sales = read_csv(filename, encoding="ISO-8859-1")
print(sales.shape)   # tells how many rows and columns
print(sales.head())  # without an argument, head gives the first 5 rows
print(sales.dtypes)  # tells the data types of the columns
print(sales.info())
# gives a full summary, including the non-null count per column
# (and hence the missing data)

print(sales.count())
38
Reading Data from CSV file
with new column names
# Load CSV using pandas
# first attempt
mynames = ["InvNo", "SKU", "Descr", "Qty", "InvDate",
           "UnitP", "CusID", "Cntry"]  # new column names
sales = read_csv(filename, names=mynames, encoding="ISO-8859-1")
print(sales.shape)
print(sales.head())
print(sales.dtypes)
# second attempt
sales = read_csv(filename, header=0, names=mynames,
                 encoding="ISO-8859-1")
# header=0 makes pandas skip the header names that already exist
# in the file, so they are not read in as a data row
print(sales.shape)
print(sales.head())
39
Changing Column Types (Part I)
You can attempt to change the type of a column.

Isn't the Invoice Number an integer? Try:

sales['InvNo'] = sales['InvNo'].astype(int)

What do you get?

Isn't the Customer ID an integer? Try:

sales['CusID'] = sales['CusID'].astype(int)

What do you get?

Neither conversion works: some invoice numbers contain
letters, and some Customer IDs are missing.
40
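A common workaround, not shown on the slide, is pd.to_numeric with errors='coerce', which turns unparseable entries into NaN instead of raising an error (the sample invoice numbers below are illustrative, not taken from the file):

```python
import pandas as pd

# hypothetical invoice numbers: one contains a letter
s = pd.Series(['536365', 'C536379', '536520'])

converted = pd.to_numeric(s, errors='coerce')
print(converted)  # 'C536379' becomes NaN; the others become numbers
```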
Extracting Data
# Within each transaction:
# Which products, what price, how often?
prodsold = sales[['SKU', 'Descr', 'UnitP', 'Qty']]
# extracting columns from sales
print(prodsold.shape)
print(prodsold.head())
# Note that the order of columns is different from the
# order in the original dataset "sales".

41
Creating/Deleting Columns
# Revenue?
prodsold = prodsold.assign(Rev = prodsold.UnitP *
                           prodsold.Qty)
# Rev is a new column holding this calculation
print(prodsold.shape)
print(prodsold.head())

# Revenue in CAD
prodsold = prodsold.assign(inCAD = prodsold.Rev)
prodsold.inCAD *= 1.5929
print(prodsold.head())

# delete inCAD column
prodsold = prodsold.drop('inCAD', axis=1)
# axis=1 specifies a column; use axis=0 to drop a row
42
Apply Functions
All product prices are given in sterling.
Assume we want to transfer them into Euros.
# function to convert pound sterling to Euro
def toEuro(pound):
    return pound * 1.1663

pd.set_option("display.precision", 2)  # 2 decimal places

# convert all entries in column Unit Price
prodsold['UnitP'] = prodsold['UnitP'].apply(toEuro)

# convert all entries in column Revenue
prodsold['Rev'] = prodsold['Rev'].apply(toEuro)
print(prodsold)
43
Rename Columns (accordingly)
All product prices are given in sterling
Assume we want to transfer them into Euros
Now we also change the column names accordingly

# rename columns:
prodsold.rename(columns={'Rev': 'RevInEuro', 'UnitP':
'UnitP_in_Euro'}, inplace=True)
print(prodsold)

# rename them again:


prodsold.rename(columns={'RevInEuro': 'Rev',
'UnitP_in_Euro': 'UnitP'}, inplace=True)
print(prodsold)
44
Manipulate Strings
# trim/strip white spaces around text:
prodsold['Descr'] = prodsold['Descr'].str.strip()
# replaces Descr with a new Descr that has the
# surrounding spaces removed
print(prodsold)

# change all letters to lower case:


prodsold['Descr'] = prodsold['Descr'].str.lower()
print(prodsold)

# remove full stops:
prodsold['Descr'] = prodsold['Descr'].str.replace('.', '', regex=False)
# regex=False treats '.' literally (as a regex it would match any character)
print(prodsold)

45
Sorting
The default order for sorting is ascending.

# Sorting by Unit Price


sorted = prodsold.sort_values('UnitP')
print(sorted.head(10))

# First sort by Unit Price, then by Quantity


sorted = prodsold.sort_values(['UnitP','Qty'])
print(sorted.head(10))

46
Sorting (cont'd)
# Now in descending order (both columns)
sorted = prodsold.sort_values(['UnitP', 'Qty'],
                              ascending=False)
print(sorted.head(10))

# Now in ascending order for Unit Price and then in
# descending order for Quantity
sorted = prodsold.sort_values(['UnitP', 'Qty'],
                              ascending=[True, False])
print(sorted.head(10))

47
Filtering Data – Single Condition
# Create a Boolean array for all products
# with price > 9.99
atleast10 = prodsold.UnitP > 9.99
print(atleast10)

# filter the prodsold DataFrame accordingly


atleast10_df = prodsold[atleast10]
print(atleast10_df)

48
Missing Values?!
# Several columns are missing data:
print(sales.shape)
print(sales.count())
# count() tells the number of non-missing values per column

49
Missing Values: Series Example
import pandas as pd
from numpy import nan as NA

myseries = pd.Series([1, NA, 3.5, NA, 7])

print(myseries.isnull())

print(myseries.dropna())
# removes the entries containing blanks/NA

# which is equivalent to
print(myseries[myseries.notnull()])

50
Missing Values: DataFrame Example
mydf = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA,
NA, NA], [NA, 6.5, 3.]])
print(mydf)

cleaned = mydf.dropna()
print(cleaned)

# Equivalent to passing how='any':
# a row is dropped if it contains any NA
cleaned = mydf.dropna(how='any')
print(cleaned)

# Passing how='all' will only drop rows that are all NA:
cleaned = mydf.dropna(how='all')
print(cleaned)
51
Missing Values: DataFrame Example
Dropping Columns
# modify the example by adding another column
mydf[3] = NA
print(mydf)

cleaned = mydf.dropna(axis=1)
print(cleaned)

# Passing axis=1 and how='all' will only drop columns
# that are all NA:
cleaned = mydf.dropna(axis=1, how='all')
print(cleaned)

52
Missing Values: DataFrame Example
Dropping Threshold
import numpy as np
mydf = pd.DataFrame(np.random.randn(7, 3))
mydf.iloc[:4, 1] = NA
mydf.iloc[:2, 2] = NA
print(mydf)
cleaned = mydf.dropna()
print(cleaned)

# use a threshold of minimum non-NAs before dropping
cleaned = mydf.dropna(thresh=2)
print(cleaned)
# and for columns
cleaned = mydf.dropna(axis=1, thresh=4)
print(cleaned)
53
Missing Values: DataFrame Example
Filling in!
filled = mydf.fillna(0)
print(filled)

# using a dictionary to fill different columns differently
filled = mydf.fillna({1: 0.5, 2: 0})
print(filled)
# replaces NAs with 0.5 in column 1 and with 0 in column 2

54
First, Read Data from CSV file again
import pandas as pd
# load CSV using pandas
from pandas import read_csv
filename = 'OnlineRetail.csv'
names = ["InvNo", "SKU", "Descr", "Qty", "InvDate",
"UnitP", "CusID", "Cntry"]
sales = read_csv(filename, header=0, names=names,
encoding = "ISO-8859-1")
# drop some column
sales.drop('Descr', axis=1, inplace=True)
print(sales.shape)
print(sales.head())
print(sales.dtypes)

60
Grouping Data by Column(s) and
Aggregate – Step by Step
# group by customer ID
by_customer = sales.groupby('CusID')

# aggregate transactions in sales by "counting SKU"


# Important Note:
# "counting SKU" only means that the number of rows
# that contain an entry for SKU are counted
count_by_customer = by_customer['SKU'].count()
print(count_by_customer)

# Then, how many transactions are there for each
# individual SKU sold to a particular customer?

61
Grouping Data by Column(s) and
Aggregate – Step by Step (cont'd)
# group by customer ID and StockCode
by_customer_SKU = sales.groupby(['CusID','SKU'])

# aggregate transactions in sales by "counting SKU"


# Again:
# "counting SKU" only means that the number of rows
# that contain an entry (not NaN) for SKU are counted

count_by_customer_SKU = by_customer_SKU['SKU'].count()
print(count_by_customer_SKU)

62
Grouping Data:
Series vs DataFrame
Important note:
♦ If you aggregate only one column, a Series will
be returned.
♦ If you aggregate multiple columns, a DataFrame
will be returned.

What if we always want to get a DataFrame?

# group by SKU and sum up Qty


print(pd.DataFrame(sales.groupby('SKU')['Qty'].sum()))

63
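The note above can be verified directly: selecting a single column after groupby yields a Series, while a list of columns (or wrapping in pd.DataFrame) yields a DataFrame. A small sketch with toy data (the values are illustrative, not from the sales file):

```python
import pandas as pd

df = pd.DataFrame({'SKU': ['A', 'A', 'B'], 'Qty': [1, 2, 3]})

# one column aggregated -> Series
one_col = df.groupby('SKU')['Qty'].sum()
print(type(one_col))

# a list of columns aggregated -> DataFrame
many_cols = df.groupby('SKU')[['Qty']].sum()
print(type(many_cols))
```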
Grouping Data by Column(s) and
Aggregate – More Concisely
# group by Country and aggregate all columns
print(sales.groupby('Cntry').count())

# group by SKU and sum up Qty
# or: How many of each SKU have been sold?
# Note: Returns have a negative Qty
print(sales.groupby('SKU')['Qty'].sum())
# groups by SKU, then takes the Qty column and sums it per group

# How many of each SKU have been sold to each customer?
print(sales.groupby(['CusID','SKU'])['Qty'].sum())

# group by InvNo and find the average unit price
print(sales.groupby('InvNo')['UnitP'].mean())

64
Multiple Aggregates of Multiple Columns
# group by Country
by_country = sales.groupby('Cntry')
# select Qty and UnitP
by_country_sub = by_country[['Qty','UnitP']]

# aggregate columns in the sub DF by 'max' and 'median'
# use method ".agg()"
aggregated = by_country_sub.agg(['max', 'median'])

# print the maximum Qty for each country (sorted)
print(aggregated[('Qty', 'max')].sort_values())
# use both column levels as a tuple, since the columns
# have been aggregated into a multi-level column index

# print the median UnitP for each country
print(aggregated[('UnitP', 'median')])

65
Multiple-level Column Index
Have a closer look at aggregated itself:

print(aggregated.head())

This is called a multi-level column index.

66
Multiple-level Row Index
You can also create a multi-level row index, even while
loading the data set:
sales = read_csv(filename, header=0, names=names,
encoding = "ISO-8859-1",
index_col=['Cntry','CusID','InvNo','SKU'])
print(sales.head(12))

To group by Cntry and CusID you should use the level
parameter:
by_country_customer = \
    sales.groupby(level=['Cntry','CusID'])
print(by_country_customer.count())

67
Multiple-level Row Index
– Flattened
If you would rather flatten it, you can reset the index:
by_country_customer = \
    sales.groupby(level=['Cntry','CusID'])
ccc = by_country_customer.count()
ccc = ccc.reset_index()
print(ccc)

# use level= when the index is not flattened, i.e., the
# grouping columns are part of the index

# flatten the index of sales as well, i.e., reset it:
sales = sales.reset_index()

68
Parsing Dates
# parsing InvDate as date, i.e., transforming the column into a date type
sales = read_csv(filename, header=0, names=names,
                 encoding="ISO-8859-1", parse_dates=['InvDate'])
print(sales.head())

# parsing InvDate as date and making it the index
sales = read_csv(filename, header=0, names=names,
                 encoding="ISO-8859-1", index_col='InvDate',
                 parse_dates=True)
print(sales.head())

# For non-standard datetime parsing, use pd.to_datetime after
# pd.read_csv, e.g., to specify the format of the date:
sales['InvDate'] = pd.to_datetime(sales['InvDate'],
                                  format='%Y-%m-%d %H:%M')

69
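The format string passed to pd.to_datetime can be sketched on a small standalone example (the timestamps below are hypothetical, in a day/month/year layout):

```python
import pandas as pd

# hypothetical timestamps in a non-standard day/month/year layout
raw = pd.Series(['01/12/2010 08:26', '03/12/2010 16:50'])

parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')
print(parsed)
print(parsed.dt.strftime('%a'))  # weekday abbreviations, as used on the next slide
```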
Grouping by Weekday
# parsing InvDate as date
sales = read_csv(filename, header=0, names=names,
encoding = "ISO-8859-1", parse_dates=['InvDate'])
print(sales.head())

# grouping by weekday
by_day = sales.groupby(sales.InvDate.dt.strftime('%a'))

# how many units were sold each weekday?


qty_sum = by_day['Qty'].sum()
print(qty_sum.sort_values(ascending=False))

70
Detecting Outliers with Z-Scores
# import zscore
from scipy.stats import zscore

# standardize unit prices


standardized = zscore(sales['Qty'])
print(pd.Series(standardized).head())

71
Detecting Outliers with Z-Scores (cont'd)
# But what are the Z-scores with respect to individual
# customers?

# First, drop the NaNs:
sales = sales.dropna()

# Then, calculate the Z-scores per customer:
standardized = pd.DataFrame(sales.groupby('CusID')
                            ['Qty'].transform(zscore))

72
Detecting Outliers with Z-Scores (cont'd)
# construct a Boolean Series to identify outliers:
outliers = ((standardized['Qty'] < -3) |
(standardized['Qty'] > 3))

# filter by outliers:
sales_outliers = sales[outliers]
print(sales_outliers.head())

73
Descriptive Statistics

Numerical Summaries

74
First, Read Data from CSV file
# Load CSV using pandas
import numpy as np
import pandas as pd
from pandas import read_csv
filename = 'https://raw.githubusercontent.com/GerhardTrippen/DataSets/master/AirBnB.csv'
visits = read_csv(filename, parse_dates=['ts_min', 'ts_max'])
print(visits.head())

print(visits.shape)
print(visits.head())
print(visits.dtypes)
75
How much Time did Visitors spend?
visits['ts_diff'] = visits.ts_max - visits.ts_min

by_visitor = visits.groupby('id_visitor')

visit_mean_time = by_visitor['ts_diff'].mean()
# Hmm, NOT implemented (yet)!?

visit_mean_time = (by_visitor['ts_diff'].sum() /
                   by_visitor['ts_diff'].count())

print(visit_mean_time)

76
How much Time did Visitors spend? (cont'd)
# Now in minutes (represented as float64,
# not as timedelta64 as before)
visits['ts_mins'] = visits.ts_diff / np.timedelta64(1, 'm')
print(visits.head())
# converts the time difference into a float

visit_mean_time = by_visitor['ts_mins'].mean()
print(visit_mean_time)

77
Visitors' Totals
visitors = by_visitor[['ts_mins', 'did_search',
                       'sent_message', 'sent_booking_request']].sum()

print(visitors.head())

# Save CSV using pandas


filename = 'visitors.csv'
visitors.to_csv(filename)

78
Descriptive Statistics
pd.set_option('display.width', 100)
pd.set_option('display.precision', 3)

description = visitors.describe()
print(description)

description = visitors.describe(percentiles=[.05, .25, .5, .75, .95])
print(description)

print(visitors.mode())

79
Descriptive Statistics (cont'd)
# counting categories
booking_request_class_counts = \
    visitors.groupby('sent_booking_request').size()
print(booking_request_class_counts)

# correlation coefficients
correlations = visitors.corr(method = 'pearson')
print(correlations)

# skewness
skew = visitors.skew()
print(skew)

80
