C1 W3
More label ambiguity examples
Speech recognition example
Major types of data problems
Small data examples:
• Unstructured: manufacturing visual inspection from 100 training examples.
• Structured: housing price prediction based on square footage, etc. from 50 training examples.
With small data, clean labels are critical.
Structured data
! May be more difficult to obtain more data.
! Human labeling may not be possible (with some exceptions).
Small data vs. big data
Small data
! Clean labels are critical.
! Can manually look through dataset and fix labels.
! Can get all the labelers to talk to each other.
Big data
! Emphasis on data processes.
Define data and establish baseline
Small data and label consistency
Why label consistency is important
• Small data, clean (consistent) labels
• Small data, noisy labels
• Big data, noisy labels
Phone defect example: [two plots of Defect? (0 or 1) vs. scratch length (mm), ticks 0.1–0.5 mm, contrasting inconsistently vs. consistently labeled data]
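One way the labels in these plots become consistent is for the labelers to agree on an explicit decision rule; a minimal sketch in Python (the 0.3 mm threshold is illustrative, not taken from the slides):

```python
def label_defect(scratch_length_mm, threshold_mm=0.3):
    """Label a phone defective (1) iff the scratch meets an agreed threshold.

    A single written-down threshold makes every labeler produce the same
    label for the same scratch length. (0.3 mm is an illustrative choice.)
    """
    return 1 if scratch_length_mm >= threshold_mm else 0

# Every labeler applying the rule yields identical labels:
labels = [label_defect(x) for x in (0.1, 0.2, 0.3, 0.4, 0.5)]  # [0, 0, 1, 1, 1]
```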
Big data problems can have small data challenges too
Problems with a large dataset but where there’s a long tail
of rare events in the input will have small data challenges
too.
• Web search
• Self-driving cars
• Product recommendation systems
Define data and establish baseline
Improving label consistency
Improving label consistency
! Have multiple labelers label the same example; where they disagree, discuss and agree on a definition.
! Merge classes that labelers cannot reliably distinguish (e.g., into a single Scratch class).
! Create a new class or label to capture uncertainty:
   • Alternative to 0/1 labels: 0, Borderline, 1.
   • Speech: tag unintelligible audio as a distinct class.
Big data
! Get to a consistent definition with a small group.
! Then send labeling instructions to labelers.
! Can consider having multiple labelers label every example and using
voting or consensus labels to increase accuracy.
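The voting idea in the last bullet can be sketched as a simple majority vote (an illustrative implementation, not a prescribed one):

```python
from collections import Counter

def consensus_label(votes):
    """Return the most common label among several labelers' votes.

    Ties are broken arbitrarily here; in practice, tied examples are good
    candidates to send back for discussion or relabeling.
    """
    return Counter(votes).most_common(1)[0][0]

# Three labelers disagree on one example; the consensus label is 1:
majority = consensus_label([1, 1, 0])
```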
Define data and establish baseline
Human level performance (HLP)
Why measure HLP?
Estimate Bayes error / irreducible error to help with error analysis and
prioritization.
Ground Truth Label | Inspector
1 | 1
1 | 0
1 | 1
0 | 0
0 | 0
0 | 1
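On data like this, HLP is simply the inspector's rate of agreement with the ground truth; a minimal computation over the six rows above:

```python
ground_truth = [1, 1, 1, 0, 0, 0]
inspector    = [1, 0, 1, 0, 0, 1]

# HLP estimate: fraction of examples where the human label matches ground truth.
hlp = sum(g == i for g, i in zip(ground_truth, inspector)) / len(ground_truth)
# 4 of the 6 labels agree, so HLP = 4/6 ≈ 0.667.
```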
Other uses of HLP
! In academia, establish and beat a respectable benchmark to support
publication.
! “Prove” the ML system is superior to humans doing the job and thus
the business or product owner should adopt it.
The problem with beating HLP as a “proof” of ML “superiority”
The 12% apparent performance advantage is not important for the application, and it can also mask more significant errors the ML system may be making.
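One way such a spurious gap arises (a sketch of the course's speech-recognition example, where labelers split 70/30 between two equally acceptable transcriptions):

```python
p = 0.7  # fraction of labelers choosing transcription A; the rest choose B

# If "ground truth" is just another random human label, a second human
# agrees with it with probability p^2 + (1 - p)^2:
hlp = p**2 + (1 - p)**2          # 0.58

# An ML system that always outputs the majority transcription agrees
# with the random ground-truth label 70% of the time:
ml_agreement = p                 # 0.70

# The system "beats HLP" by 12% without being better at anything useful.
advantage = ml_agreement - hlp   # 0.12
```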
Define data and establish baseline
Raising HLP
Raising HLP
When the ground truth label is externally defined, HLP gives an estimate
for Bayes error / irreducible error.
But often ground truth is just another human label.
Scratch length (mm) | Ground Truth Label | Inspector
0.7 | 1 | 1
0.2 | 1 | 0
0.5 | 1 | 1
0.2 | 0 | 0
0.1 | 0 | 0
0.1 | 0 | 1
Raising HLP
! When the label y comes from a human label, HLP << 100% may indicate
ambiguous labeling instructions.
! This makes it harder for ML to beat HLP. But the more consistent labels
will raise ML performance, which is ultimately likely to benefit the actual
application performance.
HLP on structured data
Structured data problems are less likely to involve human labelers,
thus HLP is less frequently used.
Some exceptions:
! User ID merging: Same person?
! Based on network traffic, is the computer hacked?
! Is the transaction fraudulent?
! Spam account? Bot?
! From GPS, what is the mode of transportation – on foot, bike, car, bus?
Label and organize data
Obtaining data
How long should you spend obtaining data?
Model + Hyperparameters + Data
! Get into this iteration loop as quickly as possible.
[Table residue: inventory of data sources (amount, cost, time); surviving row: Owned, 100h, $0, 0]
Data pipeline
Data pipeline example
x = user info
[Diagram: in production, New Data is fed through replicated preprocessing scripts to produce the Product.]
How to replicate?
POC and Production phases
POC (proof-of-concept):
! Goal is to decide if the application is workable and worth deploying.
! Focus on getting the prototype to work!
! It's ok if data pre-processing is manual. But take extensive notes/comments.
Production phase:
! After project utility is established, use more sophisticated tools to make sure the
data pipeline is replicable.
! E.g., TensorFlow Transform, Apache Beam, Airflow,….
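As a toy illustration of what replicable means here (plain Python standing in for the tools above; all names are hypothetical): factor preprocessing into one deterministic function and persist its learned parameters alongside the model, so production can replay the exact transform.

```python
import json

def preprocess(record, params):
    """Deterministic transform: same record + same params => same output."""
    # `params` holds everything derived from the training data (e.g. the
    # mean and std used for normalization), so nothing is recomputed ad hoc.
    return {"sq_ft_norm": (record["sq_ft"] - params["sq_ft_mean"]) / params["sq_ft_std"]}

params = {"sq_ft_mean": 1500.0, "sq_ft_std": 500.0}

# Persist params next to the model; production reloads them and gets
# byte-identical preprocessing behavior.
restored = json.loads(json.dumps(params))
assert preprocess({"sq_ft": 2000}, restored) == preprocess({"sq_ft": 2000}, params)
```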
Label and organize data
Meta-data, data provenance and lineage
Data pipeline example
Task: Predict if someone is looking for a job. ( x = user data, y = looking for a job? )
[Pipeline diagram: a spam dataset trains an anti-spam model (ML code); applying it to the user data gives de-spammed user data. An ID merge dataset trains an ID merge model (ML code); applying it gives clean user data. The clean user data feeds the job search model (ML code), which outputs predictions.]
Useful for:
! Error analysis. Spotting unexpected effects.
! Keeping track of data provenance.
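A lightweight way to track provenance and lineage for a pipeline like the one above is to record, for each derived artifact, its sources and the step that produced it (function and dataset names here are illustrative):

```python
def record_step(registry, name, sources, step):
    """Register a derived artifact with its provenance (sources + step)."""
    registry[name] = {"sources": sources, "step": step}

def lineage(registry, name):
    """Walk the registry backwards to list every upstream artifact."""
    upstream = [name]
    for src in registry.get(name, {}).get("sources", []):
        upstream += lineage(registry, src)
    return upstream

registry = {}
record_step(registry, "de_spammed_user_data", ["user_data", "anti_spam_model"], "spam filtering")
record_step(registry, "clean_user_data", ["de_spammed_user_data", "id_merge_model"], "ID merge")
```

With this record, error analysis can trace any prediction problem back through `lineage(registry, "clean_user_data")` to the raw inputs and models that shaped it.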
Label and organize data
Balanced train/dev/test splits
Balanced train/dev/test splits in small data problems
Visual inspection example: 100 examples, 30 positive (defective)
Train/dev/test split (e.g., 60%/20%/20%):
Random split: each split's positive fraction can drift far from 30%.
Want: every split to contain ~30% positive (defective) examples.
No need to worry about this with large datasets – a random split will be
representative.
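A balanced split can be obtained by splitting positives and negatives separately (a hypothetical helper; libraries such as scikit-learn offer stratified splitting too):

```python
import random

def stratified_split(examples, labels, fracs=(0.6, 0.2, 0.2), seed=0):
    """Split so each subset keeps (about) the same positive fraction.

    With 100 examples and 30 positives, a purely random 60/20/20 split can
    leave dev or test far from 30% positive; shuffling and slicing the
    positives and negatives separately avoids that.
    """
    rng = random.Random(seed)
    pos = [x for x, y in zip(examples, labels) if y == 1]
    neg = [x for x, y in zip(examples, labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    splits, p0, n0 = [], 0, 0
    for f in fracs:
        p1, n1 = p0 + round(f * len(pos)), n0 + round(f * len(neg))
        splits.append(pos[p0:p1] + neg[n0:n1])
        p0, n0 = p1, n1
    return splits
```

For 100 examples with 30 positives, this puts 18/6/6 positives into train/dev/test, i.e. 30% in every split.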
C1W3 Slides (Optional)
Scoping (optional)
What is scoping?
Scoping → Data → Modeling → Deployment
Scoping process
! Identify a business problem (not an AI problem). Ask: "What are the top 3 things you wish were working better?"
   • Increase conversion
   • Reduce inventory
   • Increase margin (profit per item)
! Brainstorm AI solutions.
! Assess the feasibility and value of potential solutions.
! Determine milestones.
Example: increase margin (profit per item) → optimize what to sell (e.g., merchandising), recommend bundles.
Assessing feasibility:
[2×2 grid: Unstructured (e.g., speech, images) vs. Structured (e.g., transactions, records) × New vs. Existing project.]
HLP criterion: Can a human, given the same data, perform the task?
Why use HLP to benchmark? People are very good at unstructured data tasks.
Do we have features that are predictive?
Given past purchases, predict future purchases
Scoping (optional)
Diligence on value
Diligence on value
MLE metrics ↔ Business metrics
Have technical and business teams try to agree on metrics that both are comfortable with.
Ethical considerations
! Is this project creating net positive societal value?
Milestones and resourcing
Milestones
Key specifications: