# Basic Calculations
1+2
5/6
# Numbers
a = 123.1
print(a)
b = 10
print(b)
a + b
c = a + b
print(c)
Boolean and None
Multiple Assignments
# Boolean
a = True
b = False
print(a, b)
# No value
a = None
print(a)
# Multiple Assignments
a, b, c = 1, 2, 3
print(a, b, c)
Strings
# Strings
data = 'hello world'
print(data)
Functions
# Sum function
def mysum(x, y):
    return x + y  # the body specifies what to do with the parameters defined above
# Be very careful with the correct indentation!
# Recommendation: use 4 spaces.
locals()
# locals() will give you a dictionary of all local variables.
Loops
# For-Loop
myrange = range(10)  # the range defines the domain/data we are looking at
for i in myrange:
    print(i)
# While-Loop
i = 0
while i < 10:  # the loop keeps running as long as this condition on i holds
    print(i)
    i += 1     # shorthand for i = i + 1; without it the loop would print 0 forever
If Statements
# If Statement (Boolean)
is_customer = True
if is_customer:
    print("Yes")
Appendix:
Arithmetic Expressions
• An arithmetic expression consists of operands
and operators combined in a manner that is
already familiar to you from learning algebra
Arithmetic Expressions (continued)
• Precedence rules:
– ** has the highest precedence and is evaluated first
– Unary negation is evaluated next
– *, /, and % are evaluated before + and -
– + and - are evaluated before =
– With two exceptions, operations of equal precedence
are left associative, so they are evaluated from left to
right
» ** and = are right associative
– You can use () to change the order of evaluation
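A few quick checks of these rules in Python (results shown as comments):
print(2 ** 3 ** 2)   # 512: ** is right associative, so this is 2 ** (3 ** 2)
print(-3 ** 2)       # -9: ** is evaluated before unary negation
print(7 + 3 * 2)     # 13: * before +
print((7 + 3) * 2)   # 20: parentheses change the order of evaluation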
Mathematical Constants
# e, pi
import math
math.e
math.pi
Lists
emptylist = []
mylist = [1, 2, 3]
print("Zeroth Value: ", mylist[0])
mylist.sort()          # sort the list in place
print(mylist)
for i in range(len(mylist)):
    print(mylist[i])
Slicing
# You can select sections of most sequence types by using
# slice notation, which in its basic form consists of
# start:stop passed to the indexing operator []:
seq = [7, 2, 3, 7, 5, 6, 0, 1]
seq[1:5]
Slicing (Cont’d)
# Slices can also be assigned to with a sequence.
seq[3:4] = [6, 3]
seq
seq[3:5] = [0, 1]
seq
# Check again!
Slicing (Cont’d)
# Either the start or stop can be omitted, in which case
# they default to the start of the sequence and the end
# of the sequence, respectively:
seq[:5]
seq[3:]
Slicing (Cont’d)
# Negative indices slice the sequence relative to the end:
seq[-4:]
seq[-6:-2]
Data Sets: Lists in Lists – Not quite!
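The code below refers to myarray (a NumPy array) and myseries (a pandas Series) that were presumably created on earlier slides; a plausible setup, where the values and the index labels other than 'Ann' are assumptions:
import numpy as np
import pandas as pd
myarray = np.array([[1, 2, 3], [3, 4, 5]])                    # 2-D array built from the nested lists
myseries = pd.Series([1, 2, 3], index=['Ann', 'Ben', 'Cat'])  # Series with string labels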
# access values
dataset = [[1, 2, 3], [3, 4, 5]]  # a nested list holding 2 rows
print(dataset)
# NumPy arrays support 2-D indexing: here row 0, column 2
print("Specific row and col: ", myarray[0, 2])
# pandas Series can be indexed by position or by label
print(myseries[0])       # gives 1
print(myseries['Ann'])
pandas Series slicing
# Try some numerical slicing, e.g.,
print(myseries[:])
print(myseries[1:3])
print(myseries[1:])
print(myseries[:1])
print(myseries[:-1])
print(myseries[-1:])
Rows and Columns – to DataFrame
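mydataframe is assumed to have been built on an earlier slide; a plausible construction consistent with the label slicing below (the 'Ann'/'Ben'/'Cat' labels and column names are assumptions):
import pandas as pd
mydataframe = pd.DataFrame({'Age': [25, 30, 35], 'Score': [88, 92, 79]},
                           index=['Ann', 'Ben', 'Cat'])
print(mydataframe)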
# rows in a DataFrame can be retrieved as a DataFrame
print(mydataframe[0:2])
print(mydataframe['Ann':'Ben'])
print(mydataframe.iloc[0:2])         # rows 0 and 1, selected by integer position
print(mydataframe.loc['Ann':'Ben'])  # rows labelled 'Ann' through 'Ben', selected by label
Reading Data from CSV file
import pandas as pd
# load CSV using pandas
from pandas import read_csv
filename = 'OnlineRetail.csv'
sales = read_csv(filename, encoding = "ISO-8859-1")
print(sales.shape)    # number of rows and columns
print(sales.head())   # without an argument, head() shows the first 5 rows
print(sales.dtypes)   # data types of the columns
print(sales.info())   # full summary, including non-null counts (and hence missing data)
print(sales.count())
Reading Data from CSV file with new column names
# Load CSV using pandas
# first attempt
mynames = ["InvNo", "SKU", "Descr", "Qty", "InvDate", "UnitP", "CusID", "Cntry"]  # new column names
sales = read_csv(filename, names=mynames, encoding="ISO-8859-1")
print(sales.shape)
print(sales.head())
print(sales.dtypes)
# second attempt
# header=0 tells read_csv that the first row of the file already contains header
# names, so they are replaced by mynames instead of being read in as a data row
sales = read_csv(filename, header=0, names=mynames, encoding="ISO-8859-1")
print(sales.shape)
print(sales.head())
Changing Column Types (Part I)
You can attempt to change the type of a column:
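A minimal sketch on the sales DataFrame (the exact column and target type from the original slide are not shown here, so these are assumptions):
# convert a column with astype()
sales['CusID'] = sales['CusID'].astype(str)
# or coerce to numeric, turning unparseable values into NaN
sales['Qty'] = pd.to_numeric(sales['Qty'], errors='coerce')
print(sales.dtypes)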
Creating/Deleting Columns
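prodsold comes from slides not included here; a plausible setup (an assumption) so the calculations below can run:
prodsold = sales[['SKU', 'Descr', 'Qty', 'UnitP']].copy()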
# Revenue?
# assign() adds Rev as a new column holding this calculation
prodsold = prodsold.assign(Rev = prodsold.UnitP * prodsold.Qty)
print(prodsold.shape)
print(prodsold.head())
# Revenue in CAD
prodsold = prodsold.assign(inCAD = prodsold.Rev)
prodsold.inCAD *= 1.5929
print(prodsold.head())
# rename columns:
prodsold.rename(columns={'Rev': 'RevInEuro', 'UnitP': 'UnitP_in_Euro'}, inplace=True)
print(prodsold)
# remove full-stops from the description ('.' is a regex metacharacter, so disable regex):
prodsold['Descr'] = prodsold['Descr'].str.replace('.', '', regex=False)
print(prodsold)
Sorting
The default order for sorting is ascending.
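For example, assuming the prodsold DataFrame from before:
sorted = prodsold.sort_values(['UnitP', 'Qty'])   # ascending by default
print(sorted.head(10))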
Sorting (cont'd)
# Now in descending order (both columns)
sorted = prodsold.sort_values(['UnitP', 'Qty'], ascending=False)  # ascending=False gives descending order
print(sorted.head(10))
Filtering Data – Single Condition
# Create a Boolean array for all products
# with price > 9.99
atleast10 = prodsold.UnitP > 9.99
print(atleast10)
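The Boolean Series can then be used to filter the rows themselves; a likely follow-up step, sketched here:
expensive = prodsold[atleast10]
print(expensive.head())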
Missing Values?!
# Several columns are missing data:
print(sales.shape)
print(sales.count())
Missing Values: Series Example
import pandas as pd
from numpy import nan as NA
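# hypothetical example Series with some missing values (the original myseries
# for this slide is assumed to be defined elsewhere):
myseries = pd.Series([1.0, NA, 3.5, NA, 7.0])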
print(myseries.isnull())
# dropping the missing values:
print(myseries.dropna())
# which is equivalent to
print(myseries[myseries.notnull()])
Missing Values: DataFrame Example
mydf = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA,
NA, NA], [NA, 6.5, 3.]])
print(mydf)
cleaned = mydf.dropna()
print(cleaned)
# Passing how='all' will only drop rows that are all NA:
cleaned = mydf.dropna(how='all')
print(cleaned)
Missing Values: DataFrame Example
Dropping Columns
# modify the example: add another column that is entirely NA
mydf[3] = NA
print(mydf)
cleaned = mydf.dropna(axis=1)
print(cleaned)
Missing Values: DataFrame Example
Dropping Threshold
import numpy as np
mydf = pd.DataFrame(np.random.randn(7, 3))
mydf.iloc[:4, 1] = NA
mydf.iloc[:2, 2] = NA
print(mydf)
cleaned = mydf.dropna()
print(cleaned)
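The slide title suggests a threshold; a minimal sketch using the thresh argument (the exact value is an assumption):
# keep only rows with at least 2 non-NA values
cleaned = mydf.dropna(thresh=2)
print(cleaned)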
First, Read Data from CSV file again
import pandas as pd
# load CSV using pandas
from pandas import read_csv
filename = 'OnlineRetail.csv'
names = ["InvNo", "SKU", "Descr", "Qty", "InvDate",
"UnitP", "CusID", "Cntry"]
sales = read_csv(filename, header=0, names=names,
encoding = "ISO-8859-1")
# drop some column
sales.drop('Descr', axis=1, inplace=True)
print(sales.shape)
print(sales.head())
print(sales.dtypes)
Grouping Data by Column(s) and
Aggregate – Step by Step
# group by customer ID
by_customer = sales.groupby('CusID')
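A simple aggregate on the grouped object, for example (the chosen column and function are assumptions):
qty_by_customer = by_customer['Qty'].sum()
print(qty_by_customer)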
Grouping Data by Column(s) and
Aggregate – Step by Step (cont'd)
# group by customer ID and StockCode
by_customer_SKU = sales.groupby(['CusID','SKU'])
count_by_customer_SKU = by_customer_SKU['SKU'].count()
print(count_by_customer_SKU)
Grouping Data:
Series vs DataFrame
Important note:
♦ If you aggregate only one column, a Series will
be returned.
♦ If you aggregate multiple columns, a DataFrame
will be returned.
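A quick way to see the difference, reusing the grouped object from above:
print(type(by_customer_SKU['Qty'].count()))             # one column -> Series
print(type(by_customer_SKU[['Qty', 'UnitP']].count()))  # multiple columns -> DataFrame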
Grouping Data by Column(s) and
Aggregate – More Concisely
# group by Country and aggregate all columns
print(sales.groupby('Cntry').count())
Multiple Aggregates of Multiple Columns
# group by Country
by_country = sales.groupby('Cntry')
# select Qty and UnitP
by_country_sub = by_country[['Qty','UnitP']]
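Several aggregates can then be applied at once; the particular functions here are assumptions, and the aggregated object discussed on the next slide is presumably built this way:
aggregated = by_country_sub.agg(['sum', 'mean'])
print(aggregated.head())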
Multiple-level Column Index
Have a closer look at aggregated itself:
print(aggregated.head())
Multiple-level Row Index
You can also create a multi-level row index, even while
loading the data set:
sales = read_csv(filename, header=0, names=names,
encoding = "ISO-8859-1",
index_col=['Cntry','CusID','InvNo','SKU'])
print(sales.head(12))
Multiple-level Row Index – Flattened
If you would rather flatten it, you can reset the index:
by_country_customer = sales.groupby(level=['Cntry','CusID'])
ccc = by_country_customer.count()
ccc = ccc.reset_index()
print(ccc)
Parsing Dates
# parsing InvDate as a date, i.e. transforming the column into a datetime type
sales = read_csv(filename, header=0, names=names, encoding="ISO-8859-1",
                 parse_dates=['InvDate'])
print(sales.head())
Grouping by Weekday
# parsing InvDate as date
sales = read_csv(filename, header=0, names=names,
encoding = "ISO-8859-1", parse_dates=['InvDate'])
print(sales.head())
# grouping by weekday
by_day = sales.groupby(sales.InvDate.dt.strftime('%a'))
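Then aggregate per weekday, for example (the chosen aggregate is an assumption):
print(by_day['Qty'].sum())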
Detecting Outliers with Z-Scores
# import zscore
from scipy.stats import zscore
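A minimal sketch applying it to the whole Qty column (the column choice is an assumption):
sales['Qty_z'] = zscore(sales['Qty'])
print(sales[['Qty', 'Qty_z']].head())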
Detecting Outliers with Z-Scores (cont'd)
# But what are the Z-Scores with respect to individual customers?
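A sketch of how the standardized DataFrame used below might be computed, grouping by customer and transforming within each group (an assumption about the original slide):
by_customer = sales.groupby('CusID')
standardized = by_customer[['Qty', 'UnitP']].transform(zscore)
print(standardized.head())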
Detecting Outliers with Z-Scores (cont'd)
# construct a Boolean Series to identify outliers:
outliers = ((standardized['Qty'] < -3) |
(standardized['Qty'] > 3))
# filter by outliers:
sales_outliers = sales[outliers]
print(sales_outliers.head())
Descriptive Statistics
Numerical Summaries
First, Read Data from CSV file
# Load CSV using pandas
import numpy as np
import pandas as pd
from pandas import read_csv
filename = 'https://raw.githubusercontent.com/GerhardTrippen/DataSets/master/AirBnB.csv'
visits = read_csv(filename, parse_dates=['ts_min', 'ts_max'])
print(visits.shape)
print(visits.head())
print(visits.dtypes)
How much Time did Visitors spend?
visits['ts_diff'] = visits.ts_max - visits.ts_min
by_visitor = visits.groupby('id_visitor')
visit_mean_time = by_visitor['ts_diff'].mean()
# Hmm, NOT implemented (yet)!?
visit_mean_time = by_visitor['ts_diff'].sum() /
by_visitor['ts_diff'].count()
print(visit_mean_time)
How much Time did Visitors spend? (cont'd)
# Now in minutes (represented as float64,
# not as timedelta64 as before)
visits['ts_mins'] = visits.ts_diff /
np.timedelta64(1,'m')
print(visits.head())
visit_mean_time = by_visitor['ts_mins'].mean()
print(visit_mean_time)
Visitors' Totals
# selecting multiple columns from a groupby requires a list of column names
visitors = by_visitor[['ts_mins', 'did_search', 'sent_message', 'sent_booking_request']].sum()
print(visitors.head())
Descriptive Statistics
pd.set_option('display.width', 100)
pd.set_option('display.precision', 3)
description = visitors.describe()
print(description)
description = visitors.describe(percentiles=[.05, .25, .5, .75, .95])
print(description)
print(visitors.mode())
Descriptive Statistics (cont'd)
# counting categories
booking_request_class_counts =
visitors.groupby('sent_booking_request').size()
print(booking_request_class_counts)
# correlation coefficients
correlations = visitors.corr(method = 'pearson')
print(correlations)
# skewness
skew = visitors.skew()
print(skew)