
C1W3 Slides

Define data and establish baseline
Why is data definition hard?
Iguana detection example

Labeling instructions: "Use bounding boxes to indicate the position of iguanas"
Phone defect detection
Data stage

[ML project lifecycle]
Scoping: Define project
Data: Define data and establish baseline; Label and organize data
Modeling: Select and train model; Perform error analysis
Deployment: Deploy in production; Monitor & maintain system

Current stage: Define data and establish baseline

More label ambiguity examples
Speech recognition example

"Um, nearest gas station"

"Umm, nearest gas station"

"Nearest gas station [unintelligible]"


User ID merge example

Field        Job Board (website)   Resume chat (app)
Email        [email protected]     [email protected]
First Name   Nova                  Nova
Last Name    Ng                    Ng
Address      1234 Jane Way         ?
State        CA                    ?
Zip          94304                 94304

Target label y: 1 if same person, 0 if different
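Framed as supervised learning, the input x is a pair of account records and the label y marks whether they belong to the same person. Below is a minimal sketch of how such a pair might be featurized for a binary classifier; the field names, the 0.5 "unknown" encoding, and the helper itself are hypothetical choices, not from the slides.

```python
# Hypothetical featurization for the user ID merge task: x is a pair of
# account records, y is 1 if they belong to the same person, 0 otherwise.

def featurize_pair(job_board: dict, resume_chat: dict) -> list[float]:
    """Turn two account records into match features for a classifier."""
    def same(field: str) -> float:
        a, b = job_board.get(field), resume_chat.get(field)
        if a is None or b is None:       # missing field (the "?" cells)
            return 0.5                   # unknown -> neutral value
        return 1.0 if str(a).lower() == str(b).lower() else 0.0

    return [same("email"), same("first_name"), same("last_name"),
            same("address"), same("zip")]

x = featurize_pair(
    {"first_name": "Nova", "last_name": "Ng", "zip": "94304",
     "address": "1234 Jane Way"},
    {"first_name": "Nova", "last_name": "Ng", "zip": "94304"},
)
# x can then be fed to any binary classifier that predicts y in {0, 1}.
```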
Data definition questions

! What is the input x?
    ! Lighting? Contrast? Resolution?
    ! What features need to be included?
! What is the target label y?
    ! How can we ensure labelers give consistent labels?
Define data and establish baseline

Major types of data problems
Major types of data problems

             Unstructured                              Structured
Small data   Manufacturing visual inspection           Housing price prediction based on
             from 100 training examples                square footage, etc. from 50 training examples
Big data     Speech recognition from 50 million        Online shopping recommendations
             training examples                         for 1 million users

Unstructured: humans can label data; data augmentation.
Structured: harder to obtain more data.
Small data: clean labels are critical.
Big data: emphasis on data process.
Unstructured vs. structured data
Unstructured data
! May or may not have a huge collection of unlabeled examples x.
! Humans can label more data.
! Data augmentation is more likely to be helpful (see the sketch below).

Structured data
! May be more difficult to obtain more data.
! Human labeling may not be possible (with some exceptions).
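As an illustration of augmentation on unstructured data, here is a minimal NumPy sketch that mixes background noise into a speech clip at a chosen signal-to-noise ratio; the SNR-based mixing scheme is an assumption for illustration, not something the slides prescribe.

```python
import numpy as np

def mix_in_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Augment a speech clip by adding background noise at a target SNR (dB)."""
    noise = noise[: len(speech)]               # trim noise to the clip length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    # Scale noise so 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clip = rng.standard_normal(16_000)   # stand-in for 1 s of 16 kHz audio
cafe = rng.standard_normal(16_000)   # stand-in for background noise
augmented = mix_in_noise(clip, cafe, snr_db=10.0)
```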
Small data vs. big data
Small data
! Clean labels are critical.
! Can manually look through dataset and fix labels.
! Can get all the labelers to talk to each other.

Big data
! Emphasis on the data process.
Define data and establish baseline

Small data and label consistency
Why label consistency is important

[Figure: three plots of Speed (rpm) vs. Voltage for the same task under three conditions:
small data with noisy labels; small data with clean (consistent) labels; big data with noisy labels]

Phone defect example

[Figure: two plots of Defect? (0 or 1) vs. scratch length (mm), over 0.1 to 0.5 mm]
Big data problems can have small data challenges too
Problems with a large dataset but where there’s a long tail
of rare events in the input will have small data challenges
too.
• Web search
• Self-driving cars
• Product recommendation systems
Define data and establish baseline

Improving label consistency
Improving label consistency

! Have multiple labelers label the same example.
! When there is disagreement, have the MLE, subject matter expert (SME), and/or labelers discuss the definition of y to reach agreement.
! If labelers believe that x doesn't contain enough information, consider changing x.
! Iterate until it is hard to significantly increase agreement (one way to track this is sketched below).
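One way to judge when agreement has stopped improving is to measure it directly. A minimal sketch that computes mean pairwise agreement between labelers, a simple stand-in for more formal statistics such as Cohen's or Fleiss' kappa:

```python
from itertools import combinations

def pairwise_agreement(labels_by_labeler: list[list[str]]) -> float:
    """Mean fraction of examples on which each pair of labelers agrees."""
    rates = []
    for a, b in combinations(labels_by_labeler, 2):
        rates.append(sum(x == y for x, y in zip(a, b)) / len(a))
    return sum(rates) / len(rates)

# Three labelers, five examples each:
print(pairwise_agreement([
    ["defect", "ok", "ok", "defect", "ok"],
    ["defect", "ok", "defect", "defect", "ok"],
    ["defect", "ok", "ok", "defect", "defect"],
]))  # 0.733... -> room to improve the label definitions
```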


Examples

! Standardize labels:
    "Um, nearest gas station"
    "Umm, nearest gas station"                → "Um, nearest gas station"
    "Nearest gas station [unintelligible]"

! Merge classes:
    Deep scratch, Shallow scratch → Scratch

! Have a class/label to capture uncertainty:
    Defect: 0 or 1. Alternative: 0, Borderline, 1
    Unintelligible audio: "nearest go", "nearest grocery" → "nearest [unintelligible]"
Small data vs. big data (unstructured data)
Small data
! Usually small number of labelers.
! Can ask labelers to discuss specific labels.

Big data
! Get to a consistent definition with a small group.
! Then send labeling instructions to labelers.
! Can consider having multiple labelers label every example, using voting or consensus labels to increase accuracy (a minimal voting sketch follows).
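A minimal sketch of consensus labeling by majority vote; assuming an odd number of labelers per example avoids most ties:

```python
from collections import Counter

def consensus_label(votes: list[str]) -> str:
    """Majority-vote label; ties go to the most common first-seen label."""
    return Counter(votes).most_common(1)[0][0]

print(consensus_label(["scratch", "scratch", "no defect"]))  # "scratch"
```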
Define data and establish baseline

Human level performance (HLP)
Why measure HLP?
Estimate Bayes error / irreducible error to help with error analysis and
prioritization.

Ground Truth   Inspector Label
1              1
1              0
1              1
0              0
0              0
0              1
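Measured this way, HLP is simply the inspector's agreement rate with the ground truth label. A quick check against the table above:

```python
ground_truth = [1, 1, 1, 0, 0, 0]
inspector    = [1, 0, 1, 0, 0, 1]

# HLP estimate = fraction of examples where the human matches ground truth.
hlp = sum(g == i for g, i in zip(ground_truth, inspector)) / len(ground_truth)
print(hlp)  # 0.666... -> estimated HLP of ~67% on this (tiny) sample
```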
Other uses of HLP
! In academia: establish and beat a respectable benchmark to support publication.
! A business or product owner asks for 99% accuracy. HLP helps establish a more reasonable target.
! "Prove" the ML system is superior to humans doing the job, and thus that the business or product owner should adopt it.
The problem with beating HLP as a "proof" of ML "superiority"

"Um... nearest gas station"
"Um, nearest gas station"

Suppose labelers produce the second transcription 70% of the time and the first 30% of the time.

Two random labelers agree: 0.7² + 0.3² = 0.58
ML that always outputs "Um, nearest gas station" agrees with humans: 0.70

The 12% better performance is not important for anything! It can also mask more significant errors the ML system may be making.
Define data and establish baseline

Raising HLP
Raising HLP
When the ground truth label is externally defined, HLP gives an estimate
for Bayes error / irreducible error.
But often ground truth is just another human label.
Scratch length (mm)   Ground Truth   Inspector Label
0.7                   1              1
0.2                   1              0
0.5                   1              1
0.2                   0              0
0.1                   0              0
0.1                   0              1
Raising HLP
! When the label y comes from a human label, HLP << 100% may indicate
ambiguous labeling instructions.

! Improving label consistency will raise HLP.

! This makes it harder for ML to beat HLP. But the more consistent labels
will raise ML performance, which is ultimately likely to benefit the actual
application performance.
HLP on structured data
Structured data problems are less likely to involve human labelers,
thus HLP is less frequently used.

Some exceptions:
! User ID merging: Same person?
! Based on network traffic, is the computer hacked?
! Is the transaction fraudulent?
! Spam account? Bot?
! From GPS, what is the mode of transportation – on foot, bike, car, bus?
Label and organize data

Obtaining data
How long should you spend obtaining data?

[Diagram: iteration loop of Model + Hyperparameters + Data → Training → Error analysis → back to the start]

! Get into this iteration loop as quickly as possible.
! Instead of asking "How long would it take to obtain m examples?", ask "How much data can we obtain in k days?"
! Exception: if you have worked on the problem before and know from experience that you need m examples.
Inventory data
Brainstorm a list of data sources (speech recognition example):

Source                   Amount   Cost      Time
Owned                    100h     $0        0
Crowdsourced – Reading   1000h    $10,000   14 days
Pay for labels           100h     $6,000    7 days
Purchase data            1000h    $10,000   1 day

Other factors: data quality, privacy, regulatory constraints


Labeling data
! Options: in-house vs. outsourced vs. crowdsourced
! Having MLEs label data is expensive, but doing this for just a few days is usually fine.
! Who is qualified to label?
    Speech recognition: any reasonably fluent speaker
    Factory inspection, medical image diagnosis: SME (subject matter expert)
    Recommender systems: maybe impossible to label well
! Don't increase data by more than 10x at a time.
Label and organize data

Data pipeline
Data pipeline example

x = user info, y = looking for a job?

[Pipeline: Raw data → Data cleaning (spam cleanup, user ID merge) → ML]
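A minimal sketch of that cleaning pipeline as composable pandas steps; the column names ("is_spam", "canonical_id") and filtering rules are hypothetical stand-ins:

```python
import pandas as pd

def remove_spam(df: pd.DataFrame) -> pd.DataFrame:
    """Drop accounts flagged as spam (hypothetical 'is_spam' column)."""
    return df[~df["is_spam"]].copy()

def merge_user_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse duplicate accounts (hypothetical 'canonical_id' column)."""
    return df.groupby("canonical_id", as_index=False).first()

raw = pd.DataFrame({
    "canonical_id": [1, 1, 2],
    "is_spam": [False, False, True],
    "user_info": ["a", "a2", "b"],
})
clean = merge_user_ids(remove_spam(raw))  # raw data -> cleaning -> ML-ready
```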
Data pipeline example

Data Pre- Test set


Development processing s performance
cripts

New Data
Replicate
Production Product
scripts

How to replicate?
POC and Production phases
POC (proof-of-concept):
! Goal is to decide if the application is workable and worth deploying.
! Focus on getting the prototype to work!
! It's ok if data pre-processing is manual. But take extensive notes/comments.

Production phase:
! After project utility is established, use more sophisticated tools to make sure the
data pipeline is replicable.
! E.g., TensorFlow Transform, Apache Beam, Airflow, etc. (a minimal Beam sketch follows).
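As one illustration, here is a minimal Apache Beam sketch of a replicable pre-processing step; the file paths and the normalize step are placeholders. Because the transform lives in pipeline code rather than in manual steps, the same definition can run on development data and, later, on production data.

```python
import apache_beam as beam

def normalize(line: str) -> str:
    """Example pre-processing step: lowercase and trim whitespace."""
    return line.strip().lower()

# The same pipeline code is reused in development and production,
# so pre-processing is replicated rather than re-implemented.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("raw_data.txt")   # placeholder path
        | "Clean" >> beam.Map(normalize)
        | "Write" >> beam.io.WriteToText("clean_data")     # placeholder prefix
    )
```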
Label and organize data

Meta-data, data provenance and lineage
Data pipeline example

Task: Predict if someone is looking for a job. (x = user data, y = looking for a job?)

[Diagram: Spam dataset → Anti-spam model (ML code);
User data + Anti-spam model → De-spammed user data;
ID merge data → ID merge model (ML code);
De-spammed user data + ID merge model → Clean user data;
Clean user data → Job search model (ML code) → Predictions]

Keep track of data provenance (where the data comes from) and lineage (the sequence of steps that produced it).
Meta-data
Examples:
Manufacturing visual inspection: time, factory, line #, camera settings, phone model, inspector ID, …
Speech recognition: device type, labeler ID, VAD model ID, …

Useful for:
! Error analysis. Spotting unexpected effects.
! Keeping track of data provenance.
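A minimal sketch of keeping meta-data and provenance alongside each example; the fields mirror the manufacturing example above, and the record structure itself is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass
class ExampleRecord:
    image_path: str        # the example itself (x)
    label: int             # defect or not (y)
    # Meta-data: useful for error analysis and spotting unexpected effects.
    timestamp: str
    factory: str
    line_number: int
    camera_settings: str
    inspector_id: str
    # Provenance/lineage: where the data came from, and the steps applied.
    source: str            # hypothetical, e.g. "factory_upload_v3"
    pipeline_steps: tuple  # hypothetical, e.g. ("despam_v1", "id_merge_v2")

rec = ExampleRecord("img_001.png", 1, "2024-05-01T10:00", "F1", 4,
                    "exposure=1/500", "insp_07",
                    "factory_upload_v3", ("despam_v1",))
```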
Label and organize data

Balanced train/dev/test splits
Balanced train/dev/test splits in small data problems
Visual inspection example: 100 examples, 30 positive (defective)

Train/dev/test: e.g., 60% / 20% / 20%
Random split: each set can end up with an unrepresentative share of positives by chance
Want: ~30% positives in each split (e.g., 18 / 6 / 6)

No need to worry about this with large datasets – a random split will be representative.
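With stratification, the split preserves the 30% positive rate in every set. A minimal scikit-learn sketch using a 60%/20%/20% split (the ratio is assumed for illustration):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [1] * 30 + [0] * 70            # 100 examples, 30 positive

# First carve off train (60%), then split the rest evenly into dev/test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print(sum(y_train), sum(y_dev), sum(y_test))  # 18 6 6 -> 30% positives in each
```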
C1W3 Slides (Optional): Scoping

Scoping (optional)

What is scoping?

[ML project lifecycle]
Scoping: Define project
Data: Define data and establish baseline; Label and organize data
Modeling: Select and train model; Perform error analysis
Deployment: Deploy in production; Monitor & maintain system

Scoping example: E-commerce retailer looking to increase sales

Possible projects:
! Better recommender system
! Better search
! Improve catalog data
! Inventory management
! Price optimization

Questions:
! What projects should we work on?
! What are the metrics for success?
! What are the resources (data, time, people) needed?

(Example decisions: decide to work on speech recognition for voice search; decide on key metrics such as accuracy and latency/throughput.)
Scoping (optional)

Scoping process

Scoping process

1. Identify a business problem (not an AI problem). Ask: "What are the top 3 things you wish were working better?"
    ! Increase conversion
    ! Reduce inventory
    ! Increase margin (profit per item)
2. Brainstorm AI solutions.
3. Assess the feasibility and value of potential solutions.
4. Determine milestones.
5. Budget for resources.
Separating problem identification from solution

Problem                              Solution
Increase conversion                  Search, recommendations
Reduce inventory                     Demand prediction, marketing
Increase margin (profit per item)    Optimizing what to sell (e.g., merchandising), recommending bundles

What to achieve                      How to achieve it


Scoping (optional)

Diligence on feasibility and value


Feasibility: Is this project technically feasible?
Use external benchmark (literature, other company, competitor)

[Table: assess feasibility by data type (Unstructured: e.g., speech, images; Structured: e.g., transactions, records) and project maturity (New vs. Existing)]
HLP: Can a human, given the same data, perform the task?

Why use HLP to benchmark? People are very good at unstructured data tasks.

Criteria: Can a human, given the same data, perform the task?
Do we have features that are predictive?
Given past purchases, predict future purchases

Given weather, predict shopping mall foot traffic

Given DNA info, predict heart disease

Given social media chatter, predict demand for a clothing style

Given stock price history, predict future price


History of project

[Figure: the project's error rate plotted against time]
Scoping (optional)

Diligence on value
Diligence on value

[Spectrum from MLE metrics to business metrics:
Word-level accuracy → Query-level accuracy → Search result quality → User engagement → Revenue]

Have technical and business teams try to agree on metrics that both are comfortable with.
Ethical considerations
! Is this project creating net positive societal value?

! Is this project reasonably fair and free from bias?

! Have any ethical concerns been openly aired and debated?


Scoping (optional)

Milestones and resourcing
Milestones

Key specifications:
! ML metrics (accuracy, precision/recall, etc.)
! Software metrics (latency, throughput, etc., given compute resources)
! Business metrics (revenue, etc.)
! Resources needed (data, personnel, help from other teams)
! Timeline

If unsure, consider benchmarking against other projects, or building a POC (proof of concept) first.
