Week 6a


Natural Language Processing

Mahmmoud Mahdi
Precision, Recall, and F measure
The 2-by-2 confusion matrix
Evaluation: Accuracy
Why don't we use accuracy as our metric?
Imagine we saw 1 million tweets
○ 100 of them talked about Delicious Pie Co.
○ 999,900 talked about something else
We could build a dumb classifier that just labels every tweet
"not about pie"
○ It would get 99.99% accuracy!!! Wow!!!!
○ But useless! Doesn't return the comments we are looking for!
○ That's why we use precision and recall instead
Evaluation: Precision
% of items the system detected (i.e., items the system labeled
as positive) that are in fact positive (according to the human
gold labels)
Evaluation: Recall
% of items actually present in the input that were correctly
identified by the system.
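In terms of the 2-by-2 confusion matrix counts (true positives tp, false positives fp, false negatives fn), these two definitions can be written as:

\text{Precision} = \frac{tp}{tp + fp} \qquad \text{Recall} = \frac{tp}{tp + fn}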
Why Precision and recall
Our dumb pie-classifier
○ Just label nothing as "about pie"
Accuracy = 99.99%, but Recall = 0
○ (it doesn't get any of the 100 pie tweets)

Precision and recall, unlike accuracy, emphasize true positives:
○ finding the things that we are supposed to be looking for.
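As a quick sanity check, here is a minimal Python sketch of the pie example above; the counts come straight from the slide, and the metrics are computed by hand rather than with any particular library:

# Dumb classifier: label every tweet "not about pie"
tp = 0          # pie tweets correctly labeled "about pie"
fp = 0          # non-pie tweets labeled "about pie"
fn = 100        # the 100 real pie tweets, all missed
tn = 999_900    # every other tweet, trivially labeled correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.9999
recall = tp / (tp + fn) if (tp + fn) else 0.0       # 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0    # undefined here; treated as 0.0

print(accuracy, precision, recall)   # great accuracy, but finds none of the pie tweets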
A combined measure: F
F measure: a single number that combines P and R:
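Written out, F is the weighted harmonic mean of precision P and recall R:

F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R} \qquad F_1 = \frac{2\,P\,R}{P + R}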

We almost always use balanced F1 (i.e., β = 1)


Evaluation with more than two classes
Confusion Matrix for 3-class classification
How to combine P/R from 3 classes to get one metric
Macroaveraging:
○ compute the performance for each class, and then average over
classes
Microaveraging:
○ collect decisions for all classes into one confusion matrix
○ compute precision and recall from that table.
Macroaveraging and Microaveraging
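A minimal sketch of the difference, assuming we already have per-class (tp, fp, fn) counts from the confusion matrix; the class names and numbers below are made up for illustration:

# Hypothetical per-class counts (tp, fp, fn) from a 3-class confusion matrix
counts = {
    "class_1": (8, 8, 3),
    "class_2": (60, 40, 55),
    "class_3": (200, 51, 33),
}

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

# Macroaveraging: compute precision for each class, then average over classes
macro_precision = sum(precision(tp, fp) for tp, fp, _ in counts.values()) / len(counts)

# Microaveraging: pool all decisions into one table, then compute precision once
pooled_tp = sum(tp for tp, _, _ in counts.values())
pooled_fp = sum(fp for _, fp, _ in counts.values())
micro_precision = precision(pooled_tp, pooled_fp)

print(macro_precision, micro_precision)   # micro is dominated by the frequent classes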
Statistical Significance Testing
Development Test Sets ("Devsets") and Cross-validation
Training set | Development test set ("devset") | Test set

Train on training set, tune on devset, report on test set


○ This avoids overfitting (‘tuning to the test set’)
○ More conservative estimate of performance
○ But paradox: want as much data as possible for training, and as much for
dev; how to split?
Cross-validation: multiple splits
Pool results over splits, compute pooled dev performance
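A minimal sketch of what this looks like in code; train_and_evaluate is a hypothetical function standing in for whatever training and scoring procedure is being used:

# k-fold cross-validation: each fold serves once as the devset
def cross_validate(data, train_and_evaluate, k=10):
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        dev = data[i * fold_size:(i + 1) * fold_size]               # this fold is the devset
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]   # the rest is training data
        scores.append(train_and_evaluate(train, dev))
    return sum(scores) / len(scores)   # pooled (averaged) dev performance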
How do we know if one classifier is better than another?

Given:
○ Classifier A and B
○ Metric M: M(A,x) is the performance of A on test set x
○ 𝛿(x): the performance difference between A, B on x:
■ 𝛿(x) = M(A,x) – M(B,x)
○ We want to know if 𝛿(x)>0, meaning A is better than B
○ 𝛿(x) is called the effect size
○ Suppose we look and see that 𝛿(x) is positive. Are we done?
○ No! This might be just an accident of this one test set, or
circumstance of the experiment.
Statistical Hypothesis Testing
Consider two hypotheses:
• Null hypothesis (H0): A isn't better than B
• Alternative hypothesis: A is better than B
We want to rule out H0
We create a random variable X ranging over test sets
And ask: how likely is it, if H0 is true, that among these test
sets we would see the 𝛿(x) we did see?
• Formalized as the p-value:
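That is, the p-value is the probability, assuming H0, of seeing an effect at least as large as the one we observed:

\text{p-value}(x) = P\big(\delta(X) \ge \delta(x) \mid H_0 \text{ is true}\big)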
Statistical Hypothesis Testing

● In our example, this p-value is the probability that we would see δ(x) assuming H0 (= A is not better than B).
○ If H0 is true but δ(x) is huge, that is surprising! Very low probability!
● A very small p-value means that the difference we observed is
very unlikely under the null hypothesis, and we can reject
the null hypothesis
● Conventional thresholds for "very small": .05 or .01
● A result (e.g., “A is better than B”) is statistically
significant if the δ we saw has a probability below the
threshold, and we therefore reject the null hypothesis.
Statistical Hypothesis Testing
● How do we compute this probability?
● In NLP, we don't tend to use parametric tests
● Instead, we use non-parametric tests based on sampling:
artificially creating many versions of the setup.
● For example, suppose we had created zillions of test sets x'.
○ Now we measure the value of 𝛿(x') on each test set
○ That gives us a distribution
○ Now set a threshold (say .01).
○ So if we see that in 99% of the test sets 𝛿(x) > 𝛿(x')
■ We conclude that our original test set delta was a
real delta and not an artifact.
Statistical Hypothesis Testing
Two common approaches:
● approximate randomization
● bootstrap test
Paired tests:
● Comparing two sets of observations in which each observation
in one set can be paired with an observation in another.
● For example, when looking at systems A and B on the same test
set, we can compare the performance of systems A and B on each
individual observation xi.
The Paired Bootstrap Test
Bootstrap test (Efron and Tibshirani, 1993)

Can apply to any metric (accuracy, precision, recall, F1, etc.).
Bootstrap means to repeatedly draw large numbers of smaller
samples with replacement (called bootstrap samples) from an
original larger sample.
Bootstrap example
Consider a baby text classification example with a test set x
of 10 documents, using accuracy as metric.
Suppose these are the results of systems A and B on x, with 4
outcomes (A & B both right, A & B both wrong, A right/B wrong,
A wrong/B right):
Bootstrap example
Now we create many virtual test sets x(i), say b = 10,000,
each of size n = 10.
To make each x(i), we randomly select a cell from row x, with
replacement, 10 times:
Bootstrap example
Now we have a distribution! We can check how often A has an
accidental advantage, to see if the original 𝛿(x) we saw was
very common.
Now, assuming H0, we would normally expect 𝛿(x') = 0
So we just count how many times the 𝛿(x') we found exceeds the
expected 0 value by 𝛿(x) or more:
Bootstrap example
Alas, it's slightly more complicated.
We didn’t draw these samples from a distribution with 0 mean; we created
them from the original test set x, which happens to be biased (by .20) in
favor of A.
So to measure how surprising our observed δ(x) is, we actually compute the
p-value by counting how often δ(x') exceeds the expected value of δ(x) by
δ(x) or more:
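In symbols, with b bootstrap samples and 1(·) the indicator function:

\text{p-value}(x) = \frac{1}{b} \sum_{i=1}^{b} \mathbf{1}\!\left(\delta(x^{(i)}) \ge 2\,\delta(x)\right)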
Bootstrap example
Suppose:
○ We have 10,000 test sets x(i) and a
threshold of .01
○ And in only 47 of the test sets do we find
that δ(x(i)) ≥ 2δ(x)
○ The resulting p-value is .0047
○ This is smaller than .01, indicating δ(x)
is indeed sufficiently surprising
○ And we reject the null hypothesis and
conclude A is better than B.
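A minimal Python sketch of this procedure, assuming correct_A and correct_B are lists of 0/1 indicators saying whether systems A and B got each test document right (these names are placeholders, not from any particular library):

import random

def paired_bootstrap_pvalue(correct_A, correct_B, b=10_000):
    # correct_A[i], correct_B[i] are 1 if the system got document i right, else 0
    n = len(correct_A)
    delta_x = (sum(correct_A) - sum(correct_B)) / n       # observed δ(x), using accuracy
    count = 0
    for _ in range(b):
        idx = [random.randrange(n) for _ in range(n)]     # bootstrap sample, drawn with replacement
        delta_i = sum(correct_A[j] - correct_B[j] for j in idx) / n
        if delta_i >= 2 * delta_x:                        # exceeds the expected δ(x) by δ(x) or more
            count += 1
    return count / b

# If the returned p-value is below the threshold (say .01), reject H0 and conclude A is better.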
Paired bootstrap example
After Berg-Kirkpatrick et al. (2012)
Avoiding Harms in Classification
Harms in sentiment classifiers
Kiritchenko and Mohammad (2018) found that most sentiment
classifiers assign lower sentiment and more negative emotion
to sentences with African American names in them.
This perpetuates negative stereotypes that associate African
Americans with negative emotions
Harms in toxicity classification
Toxicity detection is the task of detecting hate speech,
abuse, harassment, or other kinds of toxic language
But some toxicity classifiers incorrectly flag non-toxic
sentences as toxic simply because they mention identities
like blind people, women, or gay people.
This could lead to censorship of discussion about these
groups.
What causes these harms?
Can be caused by:
○ Problems in the training data; machine learning systems are known to amplify
the biases in their training data.
○ Problems in the human labels
○ Problems in the resources used (like lexicons)
○ Problems in model architecture (like what the model is trained to optimize)

Mitigation of these harms is an open research area


Meanwhile: model cards
Model Cards (Mitchell et al., 2019)

For each algorithm you release, document:


○ training algorithms and parameters
○ training data sources, motivation, and preprocessing
○ evaluation data sources, motivation, and preprocessing
○ intended use and users
○ model performance across different demographic or other groups and
environmental situations
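As a rough illustration only, the contents of such a model card could be sketched as a structured record like the following; every field value here is a placeholder, not a real model's documentation:

# Hypothetical model card skeleton, mirroring the fields listed above
model_card = {
    "training_algorithm_and_parameters": "...",
    "training_data": {"sources": "...", "motivation": "...", "preprocessing": "..."},
    "evaluation_data": {"sources": "...", "motivation": "...", "preprocessing": "..."},
    "intended_use_and_users": "...",
    "performance_across_groups_and_environments": {"group_or_setting": "metric value"},
}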
