ETNLP: A Toolkit For Extraction, Evaluation and Visualization of Pre-Trained Word Embeddings



Xuan-Son Vu (1), Thanh Vu (2), Son N. Tran (3), Lili Jiang (1)

(1) Umeå University, Sweden
(2) The Australian E-Health Research Centre, CSIRO, Australia
(3) The University of Tasmania, Australia

{sonvx, lili.jiang}@cs.umu.se, [email protected], [email protected]

Abstract

In this paper, we introduce a comprehensive toolkit, ETNLP, which can evaluate, extract, and visualize multiple sets of pre-trained word embeddings. First, for evaluation, ETNLP analyses the quality of pre-trained embeddings based on an input word analogy list. Second, for extraction, ETNLP provides a subset of the embeddings to be used in downstream NLP tasks. Finally, ETNLP has a visualization module for exploring the embedded words interactively. We demonstrate the effectiveness of ETNLP on our pre-trained word embeddings in Vietnamese. Specifically, we create a large Vietnamese word analogy list to evaluate the embeddings. We then utilize the pre-trained embeddings for the named entity recognition (NER) task in Vietnamese and achieve new state-of-the-art results on a benchmark dataset for the NER task. A video demonstration of ETNLP is available at https://vimeo.com/317599106. The source code and data are available at https://github.com/vietnlp/etnlp.

1 Introduction

Word embedding, also known as word representation, represents a word as a vector capturing both syntactic and semantic information, so that words with similar meanings have similar vectors (Levy and Goldberg, 2014). Classic embedding models such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017) have been shown to improve the performance of existing models in a variety of tasks like parsing (Bansal et al., 2014), topic modeling (Nguyen et al., 2015; Batmanghelich et al., 2016), and document classification (Taddy, 2015). However, each word is associated with a single vector, which is a challenge when the word's meaning varies across linguistic contexts (Peters et al., 2018). To overcome that problem, contextual embeddings (e.g., ELMO of Peters et al. (2018) and BERT of Devlin et al. (2018)) have recently been proposed and help existing models achieve new state-of-the-art results on many NLP tasks. Different from non-contextual embeddings, ELMO and BERT can capture different latent syntactic-semantic information of the same word based on its contextual uses. Thus, this paper incorporates both classical embeddings (i.e., Word2Vec, fastText) and contextual embeddings (i.e., ELMO, BERT) to evaluate their performance.

Given that there are many different types of word embedding models, we argue that building a unified toolkit which can evaluate, extract, and visualize word embeddings for NLP tasks is important. However, to our knowledge, there is no single toolkit which can perform all three tasks of evaluation, extraction, and visualization. For example, the recent framework flair (Akbik et al., 2018) is well known for stacking multiple embeddings, but not for quick evaluation or visualization.

In this paper, we propose ETNLP, a comprehensive embedding toolkit which can extract, evaluate, and visualize pre-trained embeddings. We detail its three main components, the evaluator, extractor, and visualizer, as follows.

• Evaluator: given multiple sets of pre-trained embeddings, how do we choose the embeddings which will potentially work best for the downstream NLP tasks (e.g., NER)? Mikolov et al. (2013) presented a large benchmark for embedding evaluation based on a series of analogies. However, the benchmark is only for English, and there is no publicly available large benchmark for under-resourced languages like Vietnamese.

• Extractor: given multiple sets of pre-trained embeddings, how do we take advantage of all of them? For instance, if people want to use a character embedding to handle the out-of-vocabulary (OOV) issue in the Word2Vec model, they must implement their own extractor to combine the two different sets of embeddings. It is more complicated when they want to evaluate the performance of either each set of embeddings separately or the combination of the two sets. The extractor API in ETNLP handles this process seamlessly for NLP applications.

• Visualizer: when given a new set of word embeddings, people usually want to inspect samples from the embedding set to see how semantically similar different words are. To fulfill this requirement, we employ the well-known Embedding Projector (projector.tensorflow.org) to let users explore the embedding space interactively. Moreover, users can also compare the quality of the word similarity lists of multiple embeddings side by side (see the demo).

To demonstrate the effectiveness of ETNLP, we apply the toolkit to a use case in Vietnamese. Evaluating pre-trained embeddings in Vietnamese is a challenge, as there is no publicly available large [1] lexical resource similar to the English word analogy list with which to evaluate the performance of pre-trained embeddings. Moreover, different from English, where all word analogy records consist of single words (e.g., grandfather | grandmother | king | queen), in Vietnamese (e.g., ông nội | bà ngoại | vua | nữ_hoàng) there are many cases where only compound words can represent the semantic relationship between the two word pairs of a word analogy record.

[1] There are a couple of available datasets (Nguyen et al., 2018b), but they are small, containing only 400 words.

We propose a large word analogy list in Vietnamese which handles these problems. Having constructed that word analogy list, we utilize different embedding models, namely Word2Vec, fastText, ELMO, and BERT, on Vietnamese Wikipedia data to generate different sets of word embeddings. We then use the word analogy list to select suitable sets of embeddings for the named entity recognition (NER) task in Vietnamese. We achieve new state-of-the-art results on VLSP 2016 [2], a Vietnamese benchmark dataset for the NER task.

[2] http://vlsp.org.vn/vlsp2016/eval/ner

Here are our main contributions in this work:

• Provide a general embedding toolkit (ETNLP) to let users evaluate, extract, and visualize multiple sets of word embeddings. The system is designed to be usable for any language.

• Release a large word analogy list in Vietnamese for evaluating multiple word embeddings.

• Train and release multiple sets of word embeddings for NLP tasks in Vietnamese, wherein their effectiveness is verified through new state-of-the-art results on a NER task in Vietnamese.

The rest of this paper is organized as follows. Section 2 describes how the different embedding models are trained. Section 3 shows how to use the toolkit to evaluate, extract, and visualize word embeddings. Section 4 shows how the word embeddings are evaluated on the word analogy task and the NER task. Section 5 concludes the paper, followed by future work.

2 Embedding Models

This section details the word embedding models supported in our ETNLP toolkit.

• Word2Vec (W2V) (Mikolov et al., 2013): a widely used method in NLP for generating word embeddings.

• W2V_C2V: the Word2Vec model faces the OOV issue on unseen text; therefore, we provide a character2vec (C2V) embedding for unseen words by computing embedding vectors at the character level. The C2V embedding can easily be calculated from a W2V model by averaging all vectors of the words in which a character occurs (see the sketch after this list for one way such a fallback could be computed).

• fastText (Bojanowski et al., 2016): fastText associates embeddings with character-based n-grams, and a word is represented as the summation of the representations of its character-based n-grams. Based on this design, fastText attempts to capture morphological information to induce word embeddings, and hence deals better with OOV words.

• ELMO (Peters et al., 2018): a model that generates embeddings for a word based on the contexts in which it appears. Thus, we choose the contexts where the word appears in the training corpus to generate embeddings for each of its occurrences. The final embedding vector is then the average of all its context embeddings.
• BERT_{Base, Large} (Devlin et al., 2018): BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. Different from ELMO and other directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words simultaneously and is therefore considered bidirectional. This characteristic allows the model to learn the context of a word based on all of its surroundings (to the left and right of the word). BERT comes in two configurations, BERT_Base (12 layers) and BERT_Large (24 layers).
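As a concrete illustration of the W2V_C2V fallback described above, the sketch below derives a character vector by averaging the vectors of all vocabulary words in which that character occurs, and represents an OOV word as the mean of its character vectors. This is an illustrative reading of the description, not the toolkit's own implementation, and it assumes a gensim 4.x KeyedVectors model for the underlying W2V vectors.

    import numpy as np
    from gensim.models import KeyedVectors

    def build_char_vectors(w2v: KeyedVectors) -> dict:
        # Average the vectors of every vocabulary word in which a character occurs.
        sums, counts = {}, {}
        for word in w2v.index_to_key:
            vec = w2v[word]
            for ch in set(word):
                sums[ch] = sums.get(ch, np.zeros(w2v.vector_size)) + vec
                counts[ch] = counts.get(ch, 0) + 1
        return {ch: sums[ch] / counts[ch] for ch in sums}

    def oov_vector(word: str, c2v: dict, dim: int) -> np.ndarray:
        # Represent an unseen word as the mean of its character vectors.
        char_vecs = [c2v[ch] for ch in word if ch in c2v]
        return np.mean(char_vecs, axis=0) if char_vecs else np.zeros(dim)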
3 Basic Usages

Figure 1 shows the general process of the toolkit. The four main processes of ETNLP are very simple to call from either the command-line or the Python API.

Figure 1: General process of the ETNLP Toolkit (Emb#1, Emb#2, ..., Emb#n → 1. Pre-processing → 2.1 Evaluator → Evaluation Results; 2.2 Extractor → Extracted Embeddings for NLP Tasks; 2.3 Visualizer → Visualization of Embedding Space).

• Pre-processing: since we use the Word2Vec (W2V) format as the standard format for the whole process of ETNLP, we provide a pre-processing tool for converting different embedding formats to the W2V format. Figure 2 shows the command-line to convert from the GloVe format to the W2V format.

$python3 etnlp_api.py -input <emb_file> -output <emb_out> -args glove2w2v

Figure 2: Conversion from GloVe format to W2V format.

• Evaluator: to evaluate multiple sets of embeddings on the word analogy task, users have to set the location of the word embeddings and of the word analogy list. To make it convenient to represent compound words, we use " | " to separate the different parts of a word analogy record, instead of a space as in the English word analogy list. Figure 3 shows two example records of the Vietnamese word analogy list (left) and their English translations (right); the lower part shows a command-line to evaluate multiple sets of word embeddings on this task.

Vietnamese                               English
ông nội | bà ngoại | ông | bà            grandfather | grandmother | grandpa | grandma
cháu trai | cháu gái | vua | nữ_hoàng    grandson | granddaughter | king | queen

$python3 etnlp_api.py -input_embs "<emb_in#1>;<emb_in#2>" -analoglist <file> -output <eval_results> -args eval

Figure 3: Run evaluation of multiple word embeddings on the word analogy task.
• Extractor: to extract embedding vectors at the word level for other NLP tasks. For instance, the popular implementation of Reimers and Gurevych (2017) for the sequence tagging task allows users to set the location of the word embeddings. The format of the file is text-based, i.e., each line contains the embedding of one word, and the file is compressed in .gz format (see the sketch after Figure 4). Figure 4 shows a command-line to extract multiple embeddings for an NLP task. The option "solveoov:1" tells the extractor to use the Character2Vec (C2V) embedding to solve OOV words in the first embedding "<emb_in#1>". The "-input_c2v" argument can be omitted if users simply wish to extract embeddings from the embedding list given after the "-input_embs" argument.

$python3 etnlp_api.py -input_embs "<emb_in#1>;<emb_in#2>" -input_c2v <emb_in#3> -vocab <file> -output <out_file.gz> -args extract;solveoov:1

Figure 4: Run the extractor to export single or multiple embeddings for NLP tasks.
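The extracted file is described only as text-based, one word embedding per line, gzip-compressed. The loader below therefore assumes the common GloVe-style layout of a word followed by whitespace-separated float components; this layout is an assumption, not something the paper specifies.

    import gzip
    import numpy as np

    def load_extracted_embeddings(path: str) -> dict:
        # One word per line followed by its vector components, gzip-compressed.
        vectors = {}
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                parts = line.rstrip().split()
                if len(parts) < 2:
                    continue  # skip header or malformed lines
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors

    # embeddings = load_extracted_embeddings("out_file.gz")  # the <out_file.gz> from Figure 4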

• Visualizer: to visualize the word embeddings given in the argument "-input_embs". After execution, the embedding vectors are transformed into tensors and visualized with the Embedding Projector. Each word embedding is served on a different local port, from which users can explore the embedding space in a Web browser. Figure 6 shows an example of the interactive visualization of "Hà_Nội" (Hanoi) using ELMO embeddings. Figure 5 shows an example command-line.

$python3 etnlp_api.py -input_embs "<emb_in#1>;<emb_in#2>" -args visualizer

Figure 5: Run the visualizer to explore given pre-trained embedding models.
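ETNLP serves the visualization itself on local ports. If one instead wanted to load the same vectors into the standalone Embedding Projector at projector.tensorflow.org, the site accepts tab-separated vector and metadata files; the sketch below writes those two files. The function name, file names, and the dict-of-vectors input are illustrative, not part of the toolkit's API.

    def export_for_projector(vectors: dict, vec_path: str = "vectors.tsv",
                             meta_path: str = "metadata.tsv") -> None:
        # One tab-separated vector per line, plus a parallel file of word labels.
        with open(vec_path, "w", encoding="utf-8") as vec_file, \
             open(meta_path, "w", encoding="utf-8") as meta_file:
            for word, vec in vectors.items():
                vec_file.write("\t".join(f"{x:.6f}" for x in vec) + "\n")
                meta_file.write(word + "\n")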
Table 1: Evaluation results of different word embeddings on the Word Analogy Task. The P-value column shows results from paired t-tests.

Model       MAP@10   P-value
W2V         0.4796   -
W2V_C2V     0.4796   -
FastText    0.4970   see [1] & [2]
ELMO        0.4999   vs. FastText: 0.95 [1]
BERT_Base   0.4609   -
MULTI       0.4906   vs. FastText: 0.025 [2]

Table 2: Grid search for hyper-parameters.

Hyper-parameter              Search Space
cemb dim (char embedding)    50  100  500
drpt (dropout rate)          0.3  0.5  0.7
lstm-s (LSTM size)           50  100  500
lrate (learning rate)        0.0005  0.001  0.005

Figure 6: Interactive visualization for the word "Hà_Nội" with ELMO embeddings.

4 Evaluations: a use-case in Vietnamese

4.1 Training word embeddings

We trained the embedding models detailed in Section 2 on the Vietnamese Wikipedia dump [3]. We then applied the sentence tokenization and word segmentation provided by VnCoreNLP (Vu et al., 2018; Nguyen et al., 2018a) to pre-process all documents. Note that, for the BERT model, we had to (1) format the data differently for the next-sentence prediction task and (2) use SentencePiece (Kudo and Richardson, 2018) to tokenize the data for learning the pre-trained embedding. It is worth noting that, due to limited computing resources, we could only run BERT_Base for 900,000 update steps and BERT_Large for 60,000 update steps. We therefore do not report the results of BERT_Large, for a fair comparison. We also create MULTI embeddings by concatenating four sets of embeddings (i.e., W2V_C2V, fastText, ELMO, and BERT_Base) [4].

[3] https://goo.gl/8WNfyZ
[4] We do not use W2V here because W2V_C2V is W2V with the use of character embeddings to deal with OOV words.
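The MULTI embedding is described only as a concatenation of the four sets. A minimal sketch of that operation is given below, assuming each set has already been loaded as a word-to-vector dictionary (e.g., with the loader sketched in Section 3) and that OOV handling has been applied so every word is covered in each set.

    import numpy as np

    def concat_word_vector(word: str, *embedding_sets: dict) -> np.ndarray:
        # Concatenate one word's vectors from several embedding sets.
        return np.concatenate([emb[word] for emb in embedding_sets])

    # With W2V_C2V (300) + fastText (300) + ELMO (1024) + BERT_Base (768) dimensions,
    # the MULTI vector has 300 + 300 + 1024 + 768 = 2392 dimensions, matching Table 3.
    # multi_vec = concat_word_vector("Hà_Nội", w2v_c2v, fasttext, elmo, bert_base)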
4.2 Dataset

The named entity recognition (NER) shared task at the 2016 VLSP workshop provides a dataset of 16,861 manually annotated sentences for training and development, and a set of 2,831 manually annotated sentences for testing, with four NER labels: PER, LOC, ORG, and MISC. The data was published in 2016 and recently reported on in Nguyen et al. (2019). It is a standard benchmark for the NER task and has been used in (Vu et al., 2018; Dong and Nguyen, 2018). Note that, in the original dataset, each word representing a full personal name is separated into the syllables that constitute it. Because this annotation scheme results in an unrealistic scenario for a pipeline evaluation (Vu et al., 2018), we tested on a "modified" VLSP 2016 corpus in which we merge contiguous syllables constituting a full name into a single word. A similar setup was also used in (Vu et al., 2018; Dong and Nguyen, 2018), the current state-of-the-art approaches.

4.3 Word Analogy Task

To measure the quality of different sets of embeddings in Vietnamese, similar to Mikolov et al. (2013), we define a word analogy list consisting of 9,802 word analogy records. To create the list, we selected suitable categories from the English word analogy list and translated them into Vietnamese. We also added customized categories which are suitable for Vietnamese (e.g., cities and their zones in Vietnam). Since most of this process is done automatically, it can be applied easily to other languages. To know which set of word embeddings potentially works better for a downstream task, we limit the vocabulary of the embeddings to the vocabulary of the task. Thus, only 3,135 word analogy records are evaluated for the NER dataset (Section 4.2).

Regarding the evaluation metric, Mikolov et al. (2013) used an accuracy metric to measure the quality of word embeddings on this task, in which the model gets +1 to the true positive count only when the expected word is at the top of the prediction list.
Table 3: Performance of the NER task using different embedding models. MULTI_WC_F_E_B is the concatenation of four embeddings: W2V_C2V, fastText, ELMO, and BERT_Base. "wemb dim" is the dimension of the embedding model. VnCoreNLP (*) means we retrain VnCoreNLP with our pre-trained embeddings.

Model                           F1     wemb dim  cemb dim  drpt  lstm-s  lrate
BiLC3 (Ma and Hovy, 2016)       88.28  300       -         -     -       -
VNER (Dong and Nguyen, 2018)    89.58  300       300       0.6   -       0.001
VnCoreNLP (Vu et al., 2018)     88.55  300       -         -     -       -
VnCoreNLP (*)                   91.30  1024      -         -     -       -
BiLC3 + W2V                     89.01  300       50        0.5   100     0.0005
BiLC3 + BERT_Base               88.26  768       500       0.3   100     0.0005
BiLC3 + W2V_C2V                 89.46  300       100       0.5   500     0.0005
BiLC3 + fastText                89.65  300       500       0.3   100     0.001
BiLC3 + ELMO                    89.67  1024      100       0.7   500     0.0005
BiLC3 + MULTI_WC_F_E_B          91.09  2392      100       0.7   100     0.001

However, this is not a well-suited metric for low-resource languages where the training corpus is relatively small, i.e., 233M tokens in the Vietnamese Wikipedia compared to 6B tokens in the Google News corpus. Therefore, we instead use the mean average precision (MAP) metric to measure quality on the word analogy task. MAP is widely used in information retrieval to evaluate results based on the top-K returned results (Manning et al., 2008). We use MAP@10 in this paper. Table 1 shows the evaluation results of different sets of embeddings on the word analogy task. The evaluator of ETNLP also reports P-values from paired t-tests on the raw MAP@10 scores (i.e., before averaging) between different sets of embeddings. The P-values (Table 1) show that the performances of the top three sets of word embeddings (i.e., fastText, ELMO, and MULTI) are significantly better than the remainder, but that there is no significant difference between the three. Therefore, these three sets of embeddings are selected for the NER task.
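For clarity, the MAP@10 computation described above can be sketched as follows: with a single correct answer per analogy record, average precision reduces to the reciprocal rank of the expected word within the top 10 (zero if absent), and the per-record scores are then averaged. This is a standard reading of MAP@K; the toolkit's exact implementation may differ in details.

    def map_at_k(records, k: int = 10) -> float:
        # records: iterable of (expected_word, ranked_prediction_list) pairs,
        # e.g. produced by the offset-method sketch in Section 3.
        scores = []
        for expected, ranked in records:
            top = ranked[:k]
            # With one correct answer per record, AP@k is the reciprocal rank (0 if absent).
            scores.append(1.0 / (top.index(expected) + 1) if expected in top else 0.0)
        return sum(scores) / len(scores) if scores else 0.0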

4.4 NER Task

Model: We apply the currently most well-known neural network architecture for the NER task, that of Ma and Hovy (2016), with no modification to its architecture, namely BiLSTM-CRF+CNN-char (BiLC3). Only in the embedding layer is a different set of word embeddings used, to evaluate their effectiveness. Regarding the experiments, we perform a grid search over hyper-parameters and select the best parameters on the validation set to run on the test set. Table 2 presents the value ranges we used to search for the best hyper-parameters. We also follow the same setting as in (Vu et al., 2018) and use the last 2,000 records of the training data as the validation set. Moreover, since the VnCoreNLP code is available, we also retrain their model with our pre-trained embeddings (VnCoreNLP*).

Main results: Table 3 shows the results of the NER task using different word embeddings. It clearly shows that, by using the pre-trained embeddings trained on Vietnamese Wikipedia data, we achieve new state-of-the-art results on the task. The reason might be that fastText, ELMO, and MULTI can handle OOV words as well as better capture the context of the words. Moreover, learning the embeddings from a formal dataset like Wikipedia is beneficial for the NER task. This is also supported by the fact that using our pre-trained embeddings in VnCoreNLP significantly boosts its performance. Table 3 also shows the F1 scores of the W2V, W2V_C2V, and BERT_Base embeddings, which are worse than those of the three selected embeddings (i.e., fastText, ELMO, and MULTI). This might indicate that using the word analogy task to select embeddings for downstream NLP tasks is sensible.

5 Conclusions

We have presented a new toolkit, ETNLP, for evaluating, extracting, and visualizing multiple pre-trained embeddings. The toolkit was designed with three principles in mind: (1) be easy to use, (2) achieve better performance, and (3) be able to handle unknown vocabulary in real-world data (i.e., using C2V). The evaluation of the toolkit on the Vietnamese NER task showed its effectiveness. In the future, we plan to support more embeddings in different languages, especially low-resource languages. We will also apply the toolkit to other downstream NLP tasks, such as part-of-speech (POS) tagging (Nguyen et al., 2017).
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pages 1638–1649.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 809–815. Association for Computational Linguistics.

Kayhan N. Batmanghelich, Ardavan Saeedi, Karthik Narasimhan, and Samuel Gershman. 2016. Nonparametric spherical topic modeling with word embeddings. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), abs/1604.00126:537–542.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Ngan Dong and Kim Anh Nguyen. 2018. Attentive neural network for named entity recognition in Vietnamese. CoRR, abs/1810.13097.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. CoRR, abs/1808.06226.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2177–2185, Cambridge, MA, USA. MIT Press.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074. Association for Computational Linguistics.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics, 3:299–313.

Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras, and Mark Johnson. 2018a. A fast and accurate Vietnamese word segmenter. In Proceedings of the 2018 LREC, Miyazaki, Japan.

Dat Quoc Nguyen, Thanh Vu, Dai Quoc Nguyen, Mark Dras, and Mark Johnson. 2017. From word segmentation to POS tagging for Vietnamese. In Proceedings of the Australasian Language Technology Association Workshop 2017, pages 108–113, Brisbane, Australia.

Huyen Nguyen, Quyen Ngo, Luong Vu, Vu Tran, and Hien Nguyen. 2019. VLSP shared task: Named entity recognition. Journal of Computer Science and Cybernetics, 34(4):283–294.

Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2018b. Introducing two Vietnamese datasets for evaluating semantic models of (dis-)similarity and relatedness. In Proceedings of the 2018 NAACL: Short Papers, pages 199–205.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 338–348, Copenhagen, Denmark.

Matt Taddy. 2015. Document classification by inversion of distributed language representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 45–49. Association for Computational Linguistics.

Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras, and Mark Johnson. 2018. VnCoreNLP: A Vietnamese natural language processing toolkit. In Proceedings of the 2018 NAACL: Demonstrations, pages 56–60, New Orleans, Louisiana. Association for Computational Linguistics.
