Representing Numbers in NLP: A Survey and A Vision
Avijit Thawani and Jay Pujara and Pedro Szekely and Filip Ilievski
University of Southern California
Information Sciences Institute
{thawani,jpujara,pszekely,ilievski}@isi.edu
Abstract

NLP systems rarely give special consideration to numbers found in text. This starkly contrasts with the consensus in neuroscience that, in the brain, numbers are represented differently from words. We arrange recent NLP work on numeracy into a comprehensive taxonomy of tasks and methods. We break down the subjective notion of numeracy into 7 subtasks, arranged along two dimensions: granularity (exact vs approximate) and units (abstract vs grounded). We analyze the myriad representational choices made by 18 previously published number encoders and decoders. We synthesize best practices for representing numbers in text and articulate a vision for holistic numeracy in NLP, comprised of design trade-offs and a unified evaluation.

1 Introduction

Numbers are an integral part of text. To understand a simple sentence like I woke up at 11, we need not just literacy but also numeracy. We must decode the string 11 to the quantity 11 and infer 11 to denote a time of the day, probably 11 a.m. We need commonsense to reason that 11 a.m. is quite late in the morning. This interpretation of 11 is strongly contextual, as I earn $11 per month evokes different units and value expectations. Note how the semantics remains the same for both sentences if 11 was replaced by 10, i.e., the context is tolerant to some variability.

Numbers are everywhere. Reasoning with quantities and counts is crucial to understanding the world. Evolutionary learning has given numerical cognition skills to several animals, including human beings (Dehaene, 2011). Our ancient ancestors furthered numeracy by developing multiple number systems, similar to but independent from the evolution of languages. Numeracy is an essential skill for language understanding, since numbers are often interspersed in text: the 6 million pages in English Wikipedia have over 150 million numbers.

Numbers are neglected. Word-level NLP models typically map rare numbers, like other out-of-vocabulary words, to a single UNK token. Subword tokenization approaches like BPE (Sennrich et al., 2016) and WordPiece (Wu et al., 2016) instead retain numbers, but split them into arbitrary tokens: 1234, for example, might be split into two tokens as 12-34, 123-4, or 1-234.

Recent work has shown that these are suboptimal number representations (Wallace et al., 2019; Zhang et al., 2020). On the DROP Question Answering benchmark, BERT performs five times worse when the answer is a number instead of a span of text (Dua et al., 2019). Relatively simple strategies, like switching from subword to character-level tokenization (Geva et al., 2020) or from decimal to scientific notation (Zhang et al., 2020), already boost performance. Such results warrant a deeper study into the best number representations.
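To make the splitting concrete, here is a minimal sketch using the HuggingFace transformers library (a tooling choice for illustration only, not one prescribed by the surveyed papers); the exact subword splits depend on the pretrained vocabulary:

```python
# Minimal sketch: how a subword tokenizer fragments numerals.
# Assumes the HuggingFace `transformers` package is installed;
# outputs are indicative and vary with the vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for text in ["1234", "150000000"]:
    # Numerals absent from the vocabulary are split into arbitrary
    # pieces, e.g. ["150", "##000", "##000"], which bear no relation
    # to the number's magnitude.
    print(text, "->", tokenizer.tokenize(text))

# One simple mitigation discussed above: normalize numbers to
# scientific notation before tokenizing (Zhang et al., 2020).
def to_scientific(num: float) -> str:
    return f"{num:.2e}"  # 1234 -> "1.23e+03"

print(to_scientific(1234))
```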
Numbers are important. Given the ubiquity of numbers and their fundamental differences with words, enabling NLP systems to represent them effectively is beneficial for domains like scientific articles (Spithourakis and Riedel, 2018) and financial documents (Chen et al., 2019; Jiang et al., 2020). Number understanding is also useful to detect sarcasm (Dubey et al., 2019) and to model dialogues involving price negotiations (Chawla et al., 2020).

Recent NLP progress towards numeracy has been sporadic but encouraging. In this paper, we survey prior work and highlight the kind of numeracy targeted (e.g., arithmetic, measurement, numeration) as well as the kind of representation used (e.g., value embeddings, DigitRNNs). We provide the first NLP-centric taxonomy of numeracy tasks (Section 2) and of number representations (Section 3) for the reader to succinctly comprehend the challenge posed by numeracy. We synthesize key takeaways (Section 5) and propose a unifying vision for future research (Section 6).
          Benchmarking or Probing Tasks                                      Downstream Applications
          Abstract                     Grounded
  Exact   Simple Arithmetic (2+3=5)    AWP (2 balls + 3 balls = 5 balls),    Question Answering,
                                       Exact Facts (birds have two legs)     Science Problems
  Approx  Numeration ('2' = 2.0),      Measurement (dogs weigh 50 lbs),      Sarcasm Detection,
          Magnitude ('2' < '5')        Numerical Language Modeling           Numeral Categorization

Table 1: Seven numeracy tasks, arranged along the axes of (rows) granularity - exact vs approximate, and (columns) units - abstract vs grounded. We also list downstream applications requiring a similar granularity of numeracy.
Table 2: An overview of numeracy in NLP: Each row is a method (§3.2), arranged as per our taxonomy (§3.1)
split by string and real, further branching into three dimensions each. The last seven columns correspond to
the seven subtasks of numeracy (§2.2), split by Exact and Approximate granularity (§2.1). The cells point to
representative (not exhaustive) works that have experimented with a given method (row) on a given task (column).
Notes: Prototype* is encoder-only but reuses embeddings for the decoder (Jiang et al., 2020). GMM** has been
discretized (Spithourakis and Riedel, 2018) as well as continuous valued (Berg-Kirkpatrick and Spokoyny, 2020).
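To ground two of the method families named above (DigitRNNs and value embeddings), here is a minimal, hypothetical PyTorch sketch; the class names and details are ours and do not reproduce any single surveyed implementation:

```python
# Hypothetical sketches of two number-encoder families from Table 2.
# Assumes PyTorch; dimensions and pooling choices are illustrative.
import math
import torch
import torch.nn as nn

class DigitRNN(nn.Module):
    """Encode a numeral string digit by digit with an LSTM."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.digit_emb = nn.Embedding(10, dim)  # one row per digit 0-9
        self.rnn = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, numeral: str) -> torch.Tensor:
        # Assumes a non-negative integer string, e.g. "1234".
        ids = torch.tensor([[int(d) for d in numeral]])
        out, _ = self.rnn(self.digit_emb(ids))
        return out[:, -1]  # final hidden state as the number's vector

class ValueEmbedding(nn.Module):
    """Project a number's (log-scaled) magnitude into vector space."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(1, dim)

    def forward(self, value: float) -> torch.Tensor:
        x = torch.tensor([[math.log(1.0 + abs(value))]])
        return self.proj(x)

print(DigitRNN()("1234").shape)        # torch.Size([1, 64])
print(ValueEmbedding()(1234.0).shape)  # torch.Size([1, 64])
```

The string-based encoder sees the surface form and must learn magnitude, while the real-based encoder gets magnitude for free but discards the surface form; this is the trade-off along which the taxonomy above is organized.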
Sara Cordes, Rochel Gelman, Charles R Gallistel, and John Whalen. 2001. Variability signatures distinguish verbal from nonverbal counting for both large and small numbers. Psychonomic Bulletin & Review, 8(4):698–707.

Marie-Catherine de Marneffe, Christopher D. Manning, and Christopher Potts. 2010. "Was it good? It was provocative." Learning the meaning of scalar adjectives. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 167–176, Uppsala, Sweden. Association for Computational Linguistics.

Stanislas Dehaene. 2011. The Number Sense: How the Mind Creates Mathematics. OUP USA.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Maxwell Forbes and Yejin Choi. 2017. Verb physics: Relative physical knowledge of actions and objects. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 266–276, Vancouver, Canada. Association for Computational Linguistics.

Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 946–958, Online. Association for Computational Linguistics.

Pranav Goel, Shi Feng, and Jordan Boyd-Graber. 2019. How pre-trained word representations capture commonsense physical comparisons. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 130–135, Hong Kong, China. Association for Computational Linguistics.

David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

Chengyue Jiang, Zhonglin Nian, Kaihao Guo, Shanbo Chu, Yinggong Zhao, Libin Shen, and Kewei Tu. 2020. Learning numeral embedding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2586–2599, Online. Association for Computational Linguistics.

Devin Johnson, Denise Mak, Andrew Barker, and Lexi Loessberg-Zahl. 2020. Probing for multilingual numerical understanding in transformer-based language models. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 184–192, Online. Association for Computational Linguistics.

Katherine D Kinzler and Elizabeth S Spelke. 2007. Core systems in human cognition. Progress in Brain Research, 164:257–264.

Teuvo Kohonen. 1990. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480.

Bill Yuchen Lin, Seyeon Lee, Rahul Khanna, and Xiang Ren. 2020. Birds have four legs?! NumerSense: Probing numerical commonsense knowledge of pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6862–6868, Online. Association for Computational Linguistics.

Richard Diehl Martinez, Scott Novotney, Ivan Bulyko, Ariya Rastrow, and Andreas Stolcke. 2020. Contextual datetime language model adaptation for speech recognition. West Coast NLP Summit.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

George A Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, and Chitta Baral. 2020. Towards question format independent numerical reasoning: A set of prerequisite tasks.

Aakanksha Naik, Abhilasha Ravichander, Carolyn Rose, and Eduard Hovy. 2019. Exploring numeracy in word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3374–3380, Florence, Italy. Association for Computational Linguistics.

Mikhail Nefedov. 2020. Dataset for evaluation of mathematical reasoning abilities in Russian. In Conference on Artificial Intelligence and Natural Language, pages 135–144. Springer.

Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2021. Investigating the limitations of transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019.

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems?

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Marten Postma, Filip Ilievski, and Piek Vossen. 2018. SemEval-2018 task 5: Counting events and participants in the long tail. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 70–80, New Orleans, Louisiana. Association for Computational Linguistics.

Ofir Press and Lior Wolf. 2016. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2474–2484, Hong Kong, China. Association for Computational Linguistics.

Abhilasha Ravichander, Aakanksha Naik, Carolyn Rose, and Eduard Hovy. 2019. EQUATE: A benchmark evaluation framework for quantitative reasoning in natural language inference. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 349–361, Hong Kong, China. Association for Computational Linguistics.

Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752, Lisbon, Portugal. Association for Computational Linguistics.

Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics, 3:1–13.

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Rebecca Sharp, Mithun Paul, Ajay Nagesh, Dane Bell, and Mihai Surdeanu. 2018. Grounding gradable adjectives through crowdsourcing. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Georgios P. Spithourakis and Sebastian Riedel. 2018. Numeracy for language models: Evaluating and improving their ability to predict numbers. CoRR, abs/1805.08154.

Dhanasekar Sundararaman, Shijing Si, Vivek Subramanian, Guoyin Wang, Devamanyu Hazarika, and Lawrence Carin. 2020. Methods for numeracy-preserving word embeddings. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4742–4753, Online. Association for Computational Linguistics.

Mirac Suzgun, Yonatan Belinkov, Stuart Shieber, and Sebastian Gehrmann. 2019. LSTM networks can perform dynamic counting. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 44–54, Florence. Association for Computational Linguistics.

Oyvind Tafjord, Peter Clark, Matt Gardner, Wen-tau Yih, and Ashish Sabharwal. 2019. QuaRel: A dataset and models for answering questions about qualitative relationships. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7063–7071.

Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. 2020. Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge. Advances in Neural Information Processing Systems, 33.

Alberto Testolin, Serena Dolfi, Mathijs Rochus, and Marco Zorzi. 2020. Visual sense of number vs. sense of magnitude in humans and machines. Scientific Reports, 10(1):1–13.

Andrew Trask, Felix Hill, Scott E Reed, Jack Rae, Chris Dyer, and Phil Blunsom. 2018. Neural arithmetic logic units. In Advances in Neural Information Processing Systems, pages 8035–8044.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP models know numbers? Probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5307–5315, Hong Kong, China. Association for Computational Linguistics.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3266–3280.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Sandra Williams and Richard Power. 2010. A fact-aligned corpus of numerical expressions. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Karen Wynn. 1990. Children's understanding of counting. Cognition, 36(2):155–193.

Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth. 2020. Do language embeddings capture scales? In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4889–4896, Online. Association for Computational Linguistics.

Ben Zhou, Qiang Ning, Daniel Khashabi, and Dan Roth. 2020. Temporal common sense acquisition with minimal supervision. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7579–7589, Online. Association for Computational Linguistics.
A Other Numeracy Tasks

Here, we describe certain related tasks that fall outside our taxonomy:

(Numeric) Paraphrasing is what we call the task of identifying one-to-one correspondences between different surface forms of the same number. Twelve is the same as '12', also referred to as a dozen. This task cuts across all the tasks we discussed, since the same number, expressed in several different ways, should nevertheless be identified by an NLP model before any subsequent reasoning. Similar to how WordNet (Miller, 1995) provides a huge list of synonyms, numeric paraphrases can be obtained by libraries³ which convert numerals to words, words to numerals, etc. One could also envision this as a learning task given a large enough corpus, such as the NumGen dataset (Williams and Power, 2010) containing 2000 fact-aligned numeric expressions over 110 articles.

³ Example: https://pypi.org/project/num2words/
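For illustration, the footnoted num2words library handles the numeral-to-word direction; for the reverse direction we assume the word2number package (our choice, not one cited by this paper):

```python
# Generating numeric paraphrases with off-the-shelf libraries.
from num2words import num2words

print(num2words(12))                # "twelve"
print(num2words(12, to="ordinal"))  # "twelfth"
print(num2words(1234))              # "one thousand, two hundred and thirty-four"

# Assumed counterpart for the words -> numerals direction.
from word2number import w2n
print(w2n.word_to_num("twelve"))    # 12
# Idiomatic forms like "a dozen" are not covered by such rules,
# one motivation for framing paraphrasing as a learning task.
```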
Quantity Entailment tasks (Ravichander et al.,
2019; Roy et al., 2015) are analogous to Natural
Language Inference, which requires understanding
of not only equivalence (as in paraphrasing) but
also deeper relations like entailment and contradic-
tion, e.g., the premise ‘he was 16 yrs old’ entails the
hypothesis ‘he was a teenager’. On similar lines,
Mishra et al. (2020) modified the existing QuaRel
dataset (Tafjord et al., 2019) to force models to per-
form quantity entailment, e.g., dog1 is light, dog2
is heavy is replaced with dog1 weighs 70 lbs, dog2
weighs 90 lbs.
Numeral Understanding is the task of cate-
gorizing numbers into percentages, prices, dates,
times, quantities, etc. and their respective subcate-
gories (Chen et al., 2018).
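A small, hypothetical rule-based sketch makes the task concrete (actual systems, e.g., Chen et al. (2018), learn far finer-grained categories than these regexes):

```python
# Hypothetical regex baseline for numeral categorization.
import re

PATTERNS = [
    (r"^\d{1,2}(:\d{2})?\s*(a\.?m\.?|p\.?m\.?)$", "time"),
    (r"^\$\d[\d,]*(\.\d+)?$",                     "price"),
    (r"^\d+(\.\d+)?%$",                           "percentage"),
    (r"^(19|20)\d{2}$",                           "date"),
]

def categorize(numeral: str) -> str:
    for pattern, label in PATTERNS:
        if re.match(pattern, numeral, flags=re.IGNORECASE):
            return label
    return "quantity"  # fallback category

for s in ["11 a.m.", "$11", "7.5%", "1990", "42"]:
    print(s, "->", categorize(s))
```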
Fused-Head Resolution for numbers is essen-
tial to ground them when the context is implicit.
For example, the sentence “I woke up at 11" has
‘a.m.’ or ‘o’clock’ as the fused head to be resolved
(Elazar and Goldberg, 2019).
Counting is the task of keeping track of discrete instances of some object. When kids count a set of objects, they quickly learn to keep a track, say on their fingers, but struggle with realizing the Cardinal Principle, i.e., that the last counter value denotes the number of entities being considered (Wynn, 1990). Similarly, LSTMs (Suzgun et al., 2019) and transformers (Bhattamishra et al., 2020) have been shown to possess counting skills, but in order to answer counting questions they must also learn to map the counts to number words or numerals. Counting tasks have been proposed in computer vision (Testolin et al., 2020) as well as in NLP (Postma et al., 2018; Talmor et al., 2020).

Domain-specific tasks require background knowledge in addition to mathematical skills. Numbergame (Mishra et al., 2020) includes questions on Physics (find the distance travelled in 2 hrs by a train moving at 50 mph) and Chemistry (find the mass percentage of H in C6H6). Project Aristo (Clark et al., 2019) solves elementary and high school science problems, which often involve numeric reasoning.
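As a worked illustration of the two Numbergame examples above (hypothetical helper code; the atomic masses are standard reference values, not taken from the paper):

```python
# Physics: distance travelled in 2 hrs by a train moving at 50 mph.
speed_mph, hours = 50, 2
print(speed_mph * hours)  # 100 (miles)

# Chemistry: mass percentage of H in C6H6 (benzene), using standard
# atomic masses C = 12.011 and H = 1.008.
mass_c, mass_h = 12.011, 1.008
total = 6 * mass_c + 6 * mass_h
print(round(100 * 6 * mass_h / total, 2))  # ~7.74 (percent)
```

Answering either question requires retrieving the right formula or constant before any arithmetic, which is precisely the background knowledge these benchmarks test.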