
C1W3 Slides

Define data and establish baseline
Why is data definition hard?
Iguana detection example

Labeling instructions: "Use bounding boxes to indicate the position of iguanas"
Phone defect detection
Data stage

[ML project lifecycle]
Scoping: Define project
Data: Define data and establish baseline; Label and organize data
Modeling: Select and train model; Perform error analysis
Deployment: Deploy in production; Monitor & maintain system

Current stage: Define data and establish baseline

More label ambiguity examples
Speech recognition example

"Um, nearest gas station"

"Umm, nearest gas station"

"Nearest gas station [unintelligible]"


User ID merge example

Field        Job Board (website)   Resume chat (app)
Email        [email protected]     [email protected]
First Name   Nova                  Nova
Last Name    Ng                    Ng
Address      1234 Jane Way         ?
State        CA                    ?
Zip          94304                 94304

Target label y: 1 if same person, 0 if different
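Framed as supervised learning, the input x is a pair of account records and the label y marks whether they belong to the same person. Below is a minimal sketch of how such a pair might be featurized for a binary classifier; the field names, the 0.5 "unknown" encoding, and the helper itself are hypothetical choices, not from the slides.

```python
# Hypothetical featurization for the user ID merge task: x is a pair of
# account records, y is 1 if they belong to the same person, 0 otherwise.

def featurize_pair(job_board: dict, resume_chat: dict) -> list[float]:
    """Turn two account records into match features for a classifier."""
    def same(field: str) -> float:
        a, b = job_board.get(field), resume_chat.get(field)
        if a is None or b is None:       # missing field (the "?" cells)
            return 0.5                   # unknown -> neutral value
        return 1.0 if str(a).lower() == str(b).lower() else 0.0

    return [same("email"), same("first_name"), same("last_name"),
            same("address"), same("zip")]

x = featurize_pair(
    {"first_name": "Nova", "last_name": "Ng", "zip": "94304",
     "address": "1234 Jane Way"},
    {"first_name": "Nova", "last_name": "Ng", "zip": "94304"},
)
# x can then be fed to any binary classifier that predicts y in {0, 1}.
```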
Data definition questions

! What is the input x?
    ! Lighting? Contrast? Resolution?
    ! What features need to be included?
! What is the target label y?
    ! How can we ensure labelers give consistent labels?
Define data and establish baseline

Major types of data problems
Major types of data problems

             Unstructured                              Structured
Small data   Manufacturing visual inspection           Housing price prediction based on
             from 100 training examples                square footage, etc. from 50 training examples
Big data     Speech recognition from 50 million        Online shopping recommendations
             training examples                         for 1 million users

Unstructured: humans can label data; data augmentation.
Structured: harder to obtain more data.
Small data: clean labels are critical.
Big data: emphasis on data process.
Unstructured vs. structured data
Unstructured data
! May or may not have a huge collection of unlabeled examples x.
! Humans can label more data.
! Data augmentation is more likely to be helpful (see the sketch below).

Structured data
! May be more difficult to obtain more data.
! Human labeling may not be possible (with some exceptions).
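As an illustration of augmentation on unstructured data, here is a minimal NumPy sketch that mixes background noise into a speech clip at a chosen signal-to-noise ratio; the SNR-based mixing scheme is an assumption for illustration, not something the slides prescribe.

```python
import numpy as np

def mix_in_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Augment a speech clip by adding background noise at a target SNR (dB)."""
    noise = noise[: len(speech)]               # trim noise to the clip length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    # Scale noise so 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clip = rng.standard_normal(16_000)   # stand-in for 1 s of 16 kHz audio
cafe = rng.standard_normal(16_000)   # stand-in for background noise
augmented = mix_in_noise(clip, cafe, snr_db=10.0)
```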
Small data vs. big data
Small data
! Clean labels are critical.
! Can manually look through dataset and fix labels.
! Can get all the labelers to talk to each other.

Big data
! Emphasis on the data process.
Define data and establish baseline

Small data and label consistency
Why label consistency is important

[Figure: three plots of Speed (rpm) vs. Voltage for the same task under three conditions:
small data with noisy labels; small data with clean (consistent) labels; big data with noisy labels]

Phone defect example

[Figure: two plots of Defect? (0 or 1) vs. scratch length (mm), over 0.1 to 0.5 mm]
Big data problems can have small data challenges too
Problems with a large dataset but where there’s a long tail
of rare events in the input will have small data challenges
too.
• Web search
• Self-driving cars
• Product recommendation systems
Define data and establish baseline

Improving label consistency
Improving label consistency

! Have multiple labelers label the same example.
! When there is disagreement, have the MLE, subject matter expert (SME), and/or labelers discuss the definition of y to reach agreement.
! If labelers believe that x doesn't contain enough information, consider changing x.
! Iterate until it is hard to significantly increase agreement (one way to track this is sketched below).
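One way to judge when agreement has stopped improving is to measure it directly. A minimal sketch that computes mean pairwise agreement between labelers, a simple stand-in for more formal statistics such as Cohen's or Fleiss' kappa:

```python
from itertools import combinations

def pairwise_agreement(labels_by_labeler: list[list[str]]) -> float:
    """Mean fraction of examples on which each pair of labelers agrees."""
    rates = []
    for a, b in combinations(labels_by_labeler, 2):
        rates.append(sum(x == y for x, y in zip(a, b)) / len(a))
    return sum(rates) / len(rates)

# Three labelers, five examples each:
print(pairwise_agreement([
    ["defect", "ok", "ok", "defect", "ok"],
    ["defect", "ok", "defect", "defect", "ok"],
    ["defect", "ok", "ok", "defect", "defect"],
]))  # 0.733... -> room to improve the label definitions
```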


Examples

! Standardize labels:
    "Um, nearest gas station"
    "Umm, nearest gas station"                → "Um, nearest gas station"
    "Nearest gas station [unintelligible]"

! Merge classes:
    Deep scratch, Shallow scratch → Scratch

! Have a class/label to capture uncertainty:
    Defect: 0 or 1. Alternative: 0, Borderline, 1
    Unintelligible audio: "nearest go", "nearest grocery" → "nearest [unintelligible]"
Small data vs. big data (unstructured data)
Small data
! Usually small number of labelers.
! Can ask labelers to discuss specific labels.

Big data
! Get to a consistent definition with a small group.
! Then send labeling instructions to labelers.
! Can consider having multiple labelers label every example, using voting or consensus labels to increase accuracy (a minimal voting sketch follows).
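A minimal sketch of consensus labeling by majority vote; assuming an odd number of labelers per example avoids most ties:

```python
from collections import Counter

def consensus_label(votes: list[str]) -> str:
    """Majority-vote label; ties go to the most common first-seen label."""
    return Counter(votes).most_common(1)[0][0]

print(consensus_label(["scratch", "scratch", "no defect"]))  # "scratch"
```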
Define data and establish baseline

Human level performance (HLP)
Why measure HLP?
Estimate Bayes error / irreducible error to help with error analysis and
prioritization.

Ground Truth   Inspector Label
1              1
1              0
1              1
0              0
0              0
0              1
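Measured this way, HLP is simply the inspector's agreement rate with the ground truth label. A quick check against the table above:

```python
ground_truth = [1, 1, 1, 0, 0, 0]
inspector    = [1, 0, 1, 0, 0, 1]

# HLP estimate = fraction of examples where the human matches ground truth.
hlp = sum(g == i for g, i in zip(ground_truth, inspector)) / len(ground_truth)
print(hlp)  # 0.666... -> estimated HLP of ~67% on this (tiny) sample
```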
Other uses of HLP
! In academia: establish and beat a respectable benchmark to support publication.
! A business or product owner asks for 99% accuracy. HLP helps establish a more reasonable target.
! "Prove" the ML system is superior to humans doing the job, and thus that the business or product owner should adopt it.
The problem with beating HLP as a "proof" of ML "superiority"

"Um... nearest gas station"
"Um, nearest gas station"

Suppose labelers produce the second transcription 70% of the time and the first 30% of the time.

Two random labelers agree: 0.7² + 0.3² = 0.58
ML that always outputs "Um, nearest gas station" agrees with humans: 0.70

The 12% better performance is not important for anything! It can also mask more significant errors the ML system may be making.
Define data and establish baseline

Raising HLP
Raising HLP
When the ground truth label is externally defined, HLP gives an estimate
for Bayes error / irreducible error.
But often ground truth is just another human label.
Scratch length (mm)   Ground Truth   Inspector Label
0.7                   1              1
0.2                   1              0
0.5                   1              1
0.2                   0              0
0.1                   0              0
0.1                   0              1
Raising HLP
! When the label y comes from a human label, HLP << 100% may indicate
ambiguous labeling instructions.

! Improving label consistency will raise HLP.

! This makes it harder for ML to beat HLP. But the more consistent labels
will raise ML performance, which is ultimately likely to benefit the actual
application performance.
HLP on structured data
Structured data problems are less likely to involve human labelers,
thus HLP is less frequently used.

Some exceptions:
! User ID merging: Same person?
! Based on network traffic, is the computer hacked?
! Is the transaction fraudulent?
! Spam account? Bot?
! From GPS, what is the mode of transportation – on foot, bike, car, bus?
Label and organize data

Obtaining data
How long should you spend obtaining data?

[Diagram: iteration loop of Model + Hyperparameters + Data → Training → Error analysis → back to the start]

! Get into this iteration loop as quickly as possible.
! Instead of asking "How long would it take to obtain m examples?", ask "How much data can we obtain in k days?"
! Exception: if you have worked on the problem before and know from experience that you need m examples.
Inventory data
Brainstorm a list of data sources (speech recognition example):

Source                   Amount   Cost      Time
Owned                    100h     $0        0
Crowdsourced – Reading   1000h    $10,000   14 days
Pay for labels           100h     $6,000    7 days
Purchase data            1000h    $10,000   1 day

Other factors: data quality, privacy, regulatory constraints


Labeling data
! Options: in-house vs. outsourced vs. crowdsourced
! Having MLEs label data is expensive, but doing this for just a few days is usually fine.
! Who is qualified to label?
    Speech recognition: any reasonably fluent speaker
    Factory inspection, medical image diagnosis: SME (subject matter expert)
    Recommender systems: maybe impossible to label well
! Don't increase data by more than 10x at a time.
Label and organize data

Data pipeline
Data pipeline example

x = user info, y = looking for a job?

[Pipeline: Raw data → Data cleaning (spam cleanup, user ID merge) → ML]
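A minimal sketch of that cleaning pipeline as composable pandas steps; the column names ("is_spam", "canonical_id") and filtering rules are hypothetical stand-ins:

```python
import pandas as pd

def remove_spam(df: pd.DataFrame) -> pd.DataFrame:
    """Drop accounts flagged as spam (hypothetical 'is_spam' column)."""
    return df[~df["is_spam"]].copy()

def merge_user_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse duplicate accounts (hypothetical 'canonical_id' column)."""
    return df.groupby("canonical_id", as_index=False).first()

raw = pd.DataFrame({
    "canonical_id": [1, 1, 2],
    "is_spam": [False, False, True],
    "user_info": ["a", "a2", "b"],
})
clean = merge_user_ids(remove_spam(raw))  # raw data -> cleaning -> ML-ready
```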
Data pipeline example

Data Pre- Test set


Development processing s performance
cripts

New Data
Replicate
Production Product
scripts

How to replicate?
POC and Production phases
POC (proof-of-concept):
! Goal is to decide if the application is workable and worth deploying.
! Focus on getting the prototype to work!
! It's ok if data pre-processing is manual. But take extensive notes/comments.

Production phase:
! After project utility is established, use more sophisticated tools to make sure the
data pipeline is replicable.
! E.g., TensorFlow Transform, Apache Beam, Airflow, etc. (a minimal Beam sketch follows).
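As one illustration, here is a minimal Apache Beam sketch of a replicable pre-processing step; the file paths and the normalize step are placeholders. Because the transform lives in pipeline code rather than in manual steps, the same definition can run on development data and, later, on production data.

```python
import apache_beam as beam

def normalize(line: str) -> str:
    """Example pre-processing step: lowercase and trim whitespace."""
    return line.strip().lower()

# The same pipeline code is reused in development and production,
# so pre-processing is replicated rather than re-implemented.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("raw_data.txt")   # placeholder path
        | "Clean" >> beam.Map(normalize)
        | "Write" >> beam.io.WriteToText("clean_data")     # placeholder prefix
    )
```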
Label and organize data

Meta-data, data provenance and lineage
Data pipeline example

Task: Predict if someone is looking for a job. (x = user data, y = looking for a job?)

[Diagram: Spam dataset → Anti-spam model (ML code);
User data + Anti-spam model → De-spammed user data;
ID merge data → ID merge model (ML code);
De-spammed user data + ID merge model → Clean user data;
Clean user data → Job search model (ML code) → Predictions]

Keep track of data provenance (where the data comes from) and lineage (the sequence of steps that produced it).
Meta-data
Examples:
Manufacturing visual inspection: time, factory, line #, camera settings, phone model, inspector ID, …
Speech recognition: device type, labeler ID, VAD model ID, …

Useful for:
! Error analysis. Spotting unexpected effects.
! Keeping track of data provenance.
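A minimal sketch of keeping meta-data and provenance alongside each example; the fields mirror the manufacturing example above, and the record structure itself is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass
class ExampleRecord:
    image_path: str        # the example itself (x)
    label: int             # defect or not (y)
    # Meta-data: useful for error analysis and spotting unexpected effects.
    timestamp: str
    factory: str
    line_number: int
    camera_settings: str
    inspector_id: str
    # Provenance/lineage: where the data came from, and the steps applied.
    source: str            # hypothetical, e.g. "factory_upload_v3"
    pipeline_steps: tuple  # hypothetical, e.g. ("despam_v1", "id_merge_v2")

rec = ExampleRecord("img_001.png", 1, "2024-05-01T10:00", "F1", 4,
                    "exposure=1/500", "insp_07",
                    "factory_upload_v3", ("despam_v1",))
```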
Label and organize data

Balanced train/dev/test splits
Balanced train/dev/test splits in small data problems
Visual inspection example: 100 examples, 30 positive (defective)

Train/dev/test: e.g., 60% / 20% / 20%
Random split: each set can end up with an unrepresentative share of positives by chance
Want: ~30% positives in each split (e.g., 18 / 6 / 6)

No need to worry about this with large datasets – a random split will be representative.
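With stratification, the split preserves the 30% positive rate in every set. A minimal scikit-learn sketch using a 60%/20%/20% split (the ratio is assumed for illustration):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [1] * 30 + [0] * 70            # 100 examples, 30 positive

# First carve off train (60%), then split the rest evenly into dev/test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print(sum(y_train), sum(y_dev), sum(y_test))  # 18 6 6 -> 30% positives in each
```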
C1W3 Slides (Optional): Scoping

Scoping (optional)

What is scoping?

[ML project lifecycle]
Scoping: Define project
Data: Define data and establish baseline; Label and organize data
Modeling: Select and train model; Perform error analysis
Deployment: Deploy in production; Monitor & maintain system

Scoping example: E-commerce retailer looking to increase sales

Possible projects:
! Better recommender system
! Better search
! Improve catalog data
! Inventory management
! Price optimization

Questions:
! What projects should we work on?
! What are the metrics for success?
! What are the resources (data, time, people) needed?

(Example decisions: decide to work on speech recognition for voice search; decide on key metrics such as accuracy and latency/throughput.)
Scoping (optional)

Scoping process

Scoping process

1. Identify a business problem (not an AI problem). Ask: "What are the top 3 things you wish were working better?"
    ! Increase conversion
    ! Reduce inventory
    ! Increase margin (profit per item)
2. Brainstorm AI solutions.
3. Assess the feasibility and value of potential solutions.
4. Determine milestones.
5. Budget for resources.
Separating problem identification from solution

Problem                              Solution
Increase conversion                  Search, recommendations
Reduce inventory                     Demand prediction, marketing
Increase margin (profit per item)    Optimizing what to sell (e.g., merchandising), recommending bundles

What to achieve                      How to achieve it


Scoping (optional)

Diligence on feasibility and value


Feasibility: Is this project technically feasible?
Use external benchmark (literature, other company, competitor)

[Table: assess feasibility by data type (Unstructured: e.g., speech, images; Structured: e.g., transactions, records) and project maturity (New vs. Existing)]
HLP: Can a human, given the same data, perform the task?

Why use HLP to benchmark? People are very good at unstructured data tasks.

Criteria: Can a human, given the same data, perform the task?
Do we have features that are predictive?
Given past purchases, predict future purchases

Given weather, predict shopping mall foot traffic

Given DNA info, predict heart disease

Given social media chatter, predict demand for a clothing style

Given stock price history, predict future price


History of project

[Figure: the project's error rate plotted against time]
Scoping (optional)

Diligence on value
Diligence on value

[Spectrum from MLE metrics to business metrics:
Word-level accuracy → Query-level accuracy → Search result quality → User engagement → Revenue]

Have technical and business teams try to agree on metrics that both are comfortable with.
Ethical considerations
! Is this project creating net positive societal value?

! Is this project reasonably fair and free from bias?

! Have any ethical concerns been openly aired and debated?


Scoping (optional)

Milestones and resourcing
Milestones

Key specifications:
! ML metrics (accuracy, precision/recall, etc.)
! Software metrics (latency, throughput, etc., given compute resources)
! Business metrics (revenue, etc.)
! Resources needed (data, personnel, help from other teams)
! Timeline

If unsure, consider benchmarking against other projects, or building a POC (proof of concept) first.
