Article
Generation of Cross-Lingual Word Vectors for Low-Resourced
Languages Using Deep Learning and Topological Metrics in a
Data-Efficient Way
Sanjanasri JP 1, *, Vijay Krishna Menon 2 , Soman KP 1 , Rajendran S 1 and Agnieszka Wolk 3,4
1 Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa
Vidyapeetham, Coimbatore 641112, India; [email protected] (S.K.); [email protected] (R.S.)
2 Gadgeon Systems Private Limited, Kochi 682021, India; [email protected]
3 Multimedia Department, Polish-Japanese Academy of Information Technology, Koszykowa 86,
02-008 Warsaw, Poland; [email protected]
4 The Institute of Literary Research of the Polish Academy of Sciences, Nowy Świat 72, 00-330 Warsaw, Poland
* Correspondence: [email protected]
Abstract: Linguists have been focused on a qualitative comparison of the semantics from different languages. Evaluation of the semantic interpretation among disparate language pairs like English and Tamil is an even more formidable task than for Slavic languages. The concept of word embedding in Natural Language Processing (NLP) has enabled a felicitous opportunity to quantify linguistic semantics. Multi-lingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised for the generation of embeddings to assess their effectiveness, using the original embeddings as ground truths. Transferability across other target languages of the proposed model was assessed via pre-trained Word2Vec embeddings from Hindi and Chinese languages. We empirically prove that with a bilingual dictionary of a thousand words and a corresponding small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of generated target embeddings in a few NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that those are not the only possible applications.

Keywords: semantic interpretation; transfer learning; bilingual embedding; cross-lingual embedding; English–Tamil; low-resourced languages; topological measures; ontology engineering

Citation: JP, S.; Menon, V.K.; KP, S.; S, R.; Wolk, A. Generation of Cross-Lingual Word Vectors for Low-Resourced Languages Using Deep Learning and Topological Metrics in a Data-Efficient Way. Electronics 2021, 10, 1372. https://doi.org/10.3390/electronics10121372

Academic Editor: Rui Pedro Lopes
Received: 10 May 2021; Accepted: 26 May 2021; Published: 8 June 2021
1.1. Overview
In this paper, cross-lingual embedding is accomplished by mapping the vectors from
one language’s embedding space into that of the other language through a transfer function.
Multiple experiments with various methodologies are carried out to obtain target word
vectors for English–Tamil language pairs. Recently developed contextual embedding
methods such as Contextual Word Vectors (CoVe), Embeddings from Language Models
(ELMo), and Bidirectional Encoder Representations from Transformers (BERT) do not
support the transfer of knowledge across languages, mainly from resource-rich to
resource-poor languages, as they use robust baseline architectures specific to each task.
Hence, experiments are carried out with the three most popular embedding algorithms,
Word2Vec, GloVe, and FastText. The trained cross-lingual model, Transfer Function-based
Generated Embedding (TFGE), synthesizes new vectors for unknown words by transfer learning with
a minimal seed dictionary (five thousand words) from a resource-rich source language
(English) to a resource-poor target language (Tamil).
A topology-based comparative assessment (neighborhood analysis) was used to assess
the quality of the generated embeddings, as word embeddings have no ground truth data
available [21]. The trained cross-lingual model is language-independent: the built model
can be shared by languages that have substantial syntax and vocabulary in common with
the target language. Hence, pre-trained Hindi and Chinese embeddings (Word2Vec) were
piped through the cross-lingual model on the target side to demonstrate this sharing
property (transferability). The generated embeddings were further validated with real NLP
tasks, namely text summarization, a multi-class Part-Of-Speech (POS) tagging model, and
Bilingual Dictionary Induction (BDI), for low-resource languages featuring Tamil.
1.2. Motivation
The primary downside of vector representations is that they quantify only relative
semantics between words. It is evident that for the same corpus, vectors will manifest
disparately each time a different vector training algorithm is used. These vectors have no
absolute position. In linear algebraic terms, the vector space spanned by each embedding
model is different [22]. The embeddings offer the best opportunity to address NLP use
cases such as machine learning (ML) tasks, which call for comparison of vectors (projections
– Three techniques are used, linear mapping as mentioned in [8] and two deep
learning networks, One Dimensional Convolutional Neural Network (1D-CNN)
and Multi-Layer Perceptron (MLP);
– A standard bilingual embedding algorithm, Bilingual Bag-Of-Words without
Alignments (BilBOWA) is also considered for relative comparison;
• Achieve the above objective in a data-efficient way;
– Transfer functions over the various embeddings are learned using dictionary sizes
as low as 1000 English–Tamil word pairs;
– Parallel or comparable corpora are not used to generate embeddings; only the
learned transfer function is used;
• Evaluation of the generated embeddings for their efficacy;
– Embeddings are evaluated quantitatively using two topological measures, Pairwise
Accuracy and Neighborhood Accuracy;
– Embeddings are verified visually using t-SNE plots for each of the TFGE categories;
– Usability is tested on real NLP use-cases: POS tagging, extractive summarization,
and BDI.
The three most popular contemporary vector training algorithms, namely Word2Vec,
GloVe and FastText [7,28,29], are used to generate cross-lingual embeddings. Bilingually
trained BilBOWA [26] embeddings were used as a baseline reference to compare our TFGEs.
The primary dataset used is the cEnTam English–Tamil corpus [30], which has monolingual
English, monolingual Tamil, and sentence-aligned English–Tamil data.
2.2. Premise
A transfer function (F) enables mapping from a pre-trained source embedding (X) to
the target embedding (Y). The target embedding is derived by the use of a contemporary
vector training algorithm on a limited target corpus. Source embeddings are pre-trained on
a billion-word text corpus, readily retrievable from various online sources. An ML model (M)
is used to learn the mapping from X to Y. However, this is possible only if the bilingual
information of X matches with that of Y. The bilingual information is obtained from the
dictionary (D), which has to be created. This method obviates the use of any aligned
bilingual corpus.
Once the model M is trained, it provides the transfer function F, which is used to
generate vectors for unknown target words by providing a vector of the known source word.
It is imperative to note that the target is the low-resource language. Figure 1 schematically
explains the flow of the working premise. The creation of a bilingual dictionary is easier
and better defined than the creation of an aligned corpus.
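As an illustration of this premise, the following sketch (a hypothetical helper, assuming the pre-trained source and limited target embeddings are available as word-to-vector mappings) builds the paired training matrices from the seed dictionary D:

```python
import numpy as np

def build_training_pairs(source_vectors, target_vectors, seed_dictionary):
    """Collect aligned (source, target) vector pairs for every dictionary entry
    covered by both embeddings; rows of X and Y are paired by index."""
    x_rows, y_rows = [], []
    for src_word, tgt_word in seed_dictionary:      # e.g., ("water", "தண்ணீர்")
        if src_word in source_vectors and tgt_word in target_vectors:
            x_rows.append(source_vectors[src_word])
            y_rows.append(target_vectors[tgt_word])
    X = np.vstack(x_rows)   # one source (English) vector per row
    Y = np.vstack(y_rows)   # the corresponding target (Tamil) vector per row
    return X, Y
```

The model M is then fitted on these aligned pairs to obtain the transfer function F.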
Transfer learning is achieved in two ways. The embedding set (not all words) for a
language is not a feature set by itself, but a fairly good language model. One way is to
use pre-trained source embedding to generate vectors of the unknown target words by
mapping the vector spaces, i.e., by transferring the information in the source embedding to
improve the target embedding. It is of vital significance that the target embedding is obtained
with the same vector training algorithm that was employed for the pre-trained source embedding.
The generated vectors exist in the target embedding space.
Similar languages may roughly align their embedding spaces. This will enable re-
use of the transfer function trained on a target language, directly with another language,
which shares common semantic properties. Empirical proof of this hypothesis is presented
in future sections. The second transfer learning method follows the usual ML sense of the
term, where a model trained with one language pair serves another without augmentation
or retraining. In summary, instead of generating an embedding from a large monolingual
corpus, the pre-trained embedding of one language is used to generate the bulk of the
unknown vectors, using a machine-learned transfer function. The generated cross-lingual
embedding exists in the same space as the monolingual embedding of the target language.
Doing so averts the need to compromise on language-specific semantics, unlike bilingual models.
In order to test the effectiveness and usability of the generated embeddings, two
evaluation metrics are proposed:
• Pairwise cosine accuracy—a measure of the semantics between similar words;
• Global cosine neighbourhood, which measures how well words are separated from each other.
These relative measures compare generated embeddings with the original target
embeddings.
context and decodes them into another language. This model uses a two-layer bidirectional
LSTM-based encoder initialized with GloVe and a two-layer unidirectional LSTM with
an attention mechanism as the decoder. The pre-trained encoder of MT-LSTM, CoVe, is
applied across various downstream NLP tasks such as sentiment analysis and question clas-
sifier based on the transfer learning idea. The unified approach of CoVe and GloVe embeddings
is said to be more reliable than the application of GloVe alone.
Differently from CoVe, the recently developed contextualized word embedding al-
gorithms such as Embeddings from Language Models (ELMo) [35] and Bidirectional
Encoder Representations from Transformers (BERT) [36] learn contextual word represen-
tation by pre-training a language model in an unsupervised way. The vector generated
is a weighted combination of all layers in the network. ELMo uses a combination of in-
dependently trained bidirectional LSTMs. BERT uses the Transformer, a neural network
architecture based on a self-attention mechanism. The Transformer has demonstrated
superior performance in modelling long-term dependencies in the text, compared to the
RNN architecture [37]. Thus, ELMo and BERT capture the syntax, semantics, and
polysemy of a word using the deep embeddings from the language model and can be
used for a multitude of NLP activities. The integration of the contextual word embeddings
into neural architectures has led to consistent improvements in important NLP tasks such
as sentiment analysis, question answering, reading comprehension, textual entailment,
semantic role labelling, co-reference resolution, and dependency parsing.
Although traditional word embedding algorithms, including Word2Vec, GloVe, Fast-
Text and the contextualized embedding, transfer knowledge from a general-purpose source
task to a more specialized target task, there are some significant differences. The features
of attention models such as ELMo and BERT are more specific since they are context-
dependent and cannot be generalized to a new task (transfer learned across languages)
because specific features are less useful for transfer learning. The standard word embed-
ding algorithms are not context-dependent, and the various senses of a word are mixed
into a single, generic representation. The contextualized embedding computes vec-
tors dynamically as a sentence or a sequence of words is being processed, so it is necessary
to provide a model for the downstream tasks. The standard word embedding algorithms
generate a matrix of word vectors that can be plugged into the neural network model to
perform a lookup operation by mapping a word to a vector.
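For instance, such a lookup can be realized in Keras by loading the pre-computed matrix into a frozen Embedding layer; the name `embedding_matrix` (shape: vocabulary size × 300) and the surrounding pipeline are assumptions for illustration:

```python
from tensorflow.keras.layers import Embedding

# embedding_matrix[i] holds the 300-dimensional vector of the word with index i
# (built beforehand from the generated word vectors).
vocab_size, dim = embedding_matrix.shape
lookup = Embedding(input_dim=vocab_size,
                   output_dim=dim,
                   weights=[embedding_matrix],   # plug in the pre-computed vectors
                   trainable=False)              # pure lookup; no fine-tuning
# lookup(word_ids) maps a batch of integer word indices to their vectors.
```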
A few propitious examples of transfer learning in NLP based on topological prop-
erties are noted next. The authors of [38] present a comprehensive study of the machine-
translation-based cross-lingual approach of sentiment analysis in Bengali. The paper
compares and provides a detailed analysis regarding the performance of ML classifiers in
the Bengali and machine-translated datasets (English). The performance of simple transfer
learning that utilizes the cross-domain data is presented. The authors use multiple cross-
domain datasets from the English language, IMDB, TripAdvisor, etc. to train the Logistic
Regression (LR) classifier. The trained model predicts the semantic orientations of reviews
from the machine-translated (Bengali–English) corpus. The authors of [39] use monolingual
resources and unsupervised techniques to induce cross-lingual task-specific word embed-
dings for the tasks of emoji prediction and sentiment classification of micro-blog posts from
Twitter and Sina Weibo. Enormous Mandarin Chinese language datasets were utilized
to train a monolingual model for emoji prediction, and the trained embedding layer was
adapted to support the English language. The cross-lingual English models achieved 11.8%
accuracy at emoji prediction (out of 64 emojis) and 73.2% at binary sentiment classifica-
tion. Lastra-Díaz et al. [40] developed a software library, the Half-Edge Semantic Measures
Library (HESML), implementing various ontology-based semantic similarity measures proposed
to evaluate word embedding models, and have shown improvements in the run time and
scalability of the models. CLassifying Interactively with Multilingual Embeddings (CLIME)
efficiently specializes cross-lingual embeddings using task-specific keywords annotated by
bilingual speakers [41].
4. Dataset Description
We used cEnTam, an English–Tamil bilingual dataset [30]. The dataset consists of
a sentence-aligned English–Tamil corpus, a sizeable crawled monolingual corpus
of Tamil, and a stand-alone comparable corpus of English. We used this dataset for the
generation of Tamil embeddings for all our experiments, and the bilingual embeddings
were generated using BilBOWA algorithms. Details of the dataset used for training cross-
lingual embedding are shown in Table 1. The data are organized as four instances.
Table 1. Details of the datasets used for training the cross-lingual embeddings.

Source Data                    Target Data                     Algorithm   #. of Dic. Words   Attribute
Wikipedia pre-trained (en)     Wikipedia pre-trained (ta)      FastText    10,786             Comparable
GloVe 840b pre-trained (en)    cEnTam-Monolingual Tamil (ta)   GloVe       10,861             Monolingual
Google news pre-trained (en)   cEnTam-Monolingual Tamil (ta)   Word2Vec    10,723             Monolingual
cEnTam (en)                    cEnTam (ta)                     BilBOWA     6088               Parallel/Comparable
The first instance was trained using the Word2Vec algorithm. The English source consists
of pre-trained Google News embeddings trained on 100 billion words. Correspondingly,
the Tamil Word2Vec embeddings were trained on the cEnTam corpus.
The second instance was trained with the GloVe algorithm. Here, the source was a set of
pre-trained English embeddings built on a Common Crawl corpus of 840 billion words.
The Tamil GloVe embeddings were obtained from the cEnTam corpus. For the FastText
models, pre-trained Wikipedia embeddings were used on the source side, and the cEnTam
corpus was used for the Tamil embeddings.
The cEnTam corpus was used for training bilingual embedding using BilBOWA. All
four dictionaries were carefully hand-crafted while maintaining good dispersion of all
categories of words over the Tamil vocabulary.
Pre-trained Hindi and Chinese vectors were sourced online [42], along with the
respective online dictionaries [43,44]. Hindi and Chinese dictionary sizes were capped at
1000 words in order to induce a resource constraint. The purpose of these embeddings is to
demonstrate transferability.
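As an example of how such pre-trained source vectors are typically loaded (the file names below are placeholders for whichever archives are used), gensim's KeyedVectors interface can read both the binary Word2Vec and the textual FastText formats:

```python
from gensim.models import KeyedVectors

# Google News Word2Vec vectors (binary format, 300 dimensions).
en_w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                            binary=True)

# FastText Wikipedia vectors distributed in the textual .vec format.
en_ft = KeyedVectors.load_word2vec_format("wiki.en.vec", binary=False)

print(en_w2v["language"].shape)   # (300,)
```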
Tx = y (1)
Now consider matrix X spanning the embedding space of English and matrix Y
spanning the embedding space of Tamil. Then T can be computed as shown in Equation (2),
where X⁺ is the Moore–Penrose pseudoinverse, X⁺ = (XᵀX)⁻¹Xᵀ. Equation (2) presents
the transfer function as a matrix operator, T [45].

TX = Y
T = YX⁺    (2)
T will be more accurate if X and Y are synthesized appropriately such that their
columns are populated by the most diverse (semantically unrelated) words from the corre-
sponding embedding spaces. Here, we apply the concept of the semantics of a word as a
linear combination of semantics of the other words. The projection TX is the target embed-
ding space. The linear operator is easy to compute as it does not have any iterative training.
T is an m × m square matrix, where m is the embedding dimension and X is m × n, where
m is the word embedding dimension and n is the bilingual vocabulary/dictionary size.
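A minimal sketch of this linear mapping, assuming X and Y are arranged as described above (one paired word vector per column), is:

```python
import numpy as np

def learn_linear_operator(X, Y):
    """X, Y: m x n matrices whose columns are paired source/target word vectors.
    Returns the m x m operator T with T @ X ~= Y, as in Equation (2)."""
    return Y @ np.linalg.pinv(X)    # Moore-Penrose pseudoinverse of X

def project(T, x_source):
    """Generate a target-space vector for an unknown word from its source vector."""
    return T @ x_source
```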
cos_proxinv = 1 − K(ŷ, y),   where   K(ŷ, y) = (ŷᵀ · y) / (‖ŷ‖ × ‖y‖)    (3)
This was implemented in the Keras library in Python. Figure 2 shows the basic architectural
pipeline of the MLP together with the hyper-parameter values used. The MLP architecture
consists of three dense layers with the Rectified Linear Unit (ReLU) as the activation
function; a dense layer is followed by a Dropout layer to avert over-fitting during training.
Cosine proximity is used as the loss function and RMSprop as the optimizer.
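A minimal Keras sketch of this MLP follows. The 300-dimensional input/output, 30% dropout, ReLU activations, cosine-proximity loss (Equation (3)), and RMSprop optimizer come from the description above; the exact layer widths and dropout placement are assumptions:

```python
from tensorflow.keras import layers, models, optimizers, backend as K

def cosine_proximity_loss(y_true, y_pred):
    # Equation (3): 1 - cosine similarity between the true and predicted target vectors.
    y_true = K.l2_normalize(y_true, axis=-1)
    y_pred = K.l2_normalize(y_pred, axis=-1)
    return 1.0 - K.sum(y_true * y_pred, axis=-1)

mlp = models.Sequential([
    layers.Dense(300, activation="relu", input_shape=(300,)),
    layers.Dropout(0.30),                 # 30% dropout to avert over-fitting
    layers.Dense(300, activation="relu"),
    layers.Dense(300),                    # linear output in the target embedding space
])
mlp.compile(optimizer=optimizers.RMSprop(), loss=cosine_proximity_loss)
# mlp.fit(source_dictionary_vectors, target_dictionary_vectors, ...)
```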
Figure 2. MLP architecture: source embedding (300 dimensions) → Dense layer (ReLU) → Dropout (30%) → Dense layer (loss: cosine proximity) → target embedding (300 dimensions).
(Architecture diagram of the 1D-CNN model: source embedding (300 dimensions) → convolutional layer with filters: 22, kernel: 7, strides: 2 → Flatten layer → Dense layer (loss: cosine proximity) → target embedding.)
The network has three layers, a CNN layer followed by a Max Pooling layer and a
Dense layer; each layer uses ReLU as an activation function. The CNN filter defines the
number of features to be learned; this investigation used twenty-two filters of kernel size
seven. Cosine proximity and RMSprop were used as the loss function and optimizer, re-
spectively.
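A corresponding Keras sketch of the 1D-CNN variant follows; the 22 filters, kernel size 7, max pooling, ReLU, cosine-proximity loss, and RMSprop come from the description above, while the stride of 2, the pooling size, and treating the 300-dimensional vector as a one-channel sequence are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

cnn = models.Sequential([
    layers.Reshape((300, 1), input_shape=(300,)),   # 300-dim vector as a 1-channel sequence
    layers.Conv1D(filters=22, kernel_size=7, strides=2, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(300),                              # 300-dimensional target-space output
])
# Minimizing negative cosine similarity plays the role of the cosine-proximity loss.
cnn.compile(optimizer=optimizers.RMSprop(), loss=tf.keras.losses.CosineSimilarity())
```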
7. Evaluation Tasks
Word embeddings have no measurable ground truths to verify their semantic proper-
ties [49]. Evaluation of these vector representations requires conception of precise linguistic
use cases. Word vectors translate semantic relationships to spatial distances. When two
words are semantically related, their respective embeddings are expected to have high
similarity measures. Assessment of word embeddings, using a crowd-sourced scoring
scheme, is detailed in [50].
Here, the original monolingual embedding is treated as ground truth for the evaluation
of TFGE. Embeddings can be evaluated quantitatively with respect to the original ones, and
qualitatively by plotting them on two-dimensional graphs that show the position of words
in relation to other words, using the t-distributed Stochastic Neighbor Embedding (t-SNE)
method [51]. Visual verification of
the quality of the embeddings facilitates precise estimation of their usability. Nevertheless,
one has to have a basic idea of the semantic relation of a target language space prior to
visual inspection. In this case, this gap shall be filled by comprehensive explanation of the
t-SNE plots.
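A short sketch of this qualitative check, projecting a handful of vectors to two dimensions with scikit-learn's t-SNE (the word list and perplexity are illustrative choices), is:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(words, vectors, title):
    """Project the given word vectors to 2D with t-SNE and label each point."""
    coords = TSNE(n_components=2, perplexity=5,
                  random_state=0).fit_transform(np.asarray(vectors))
    plt.figure()
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), word in zip(coords, words):
        plt.annotate(word, (x, y))
    plt.title(title)
    plt.show()
```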
Table 2. A sample word pairs list for computing pairwise cosine accuracy for all three evaluating
languages.
Percentage error = (RMSE / 2) × 100
P.accuracy = 100 − Percentage error
N.accuracy = 100 − Percentage error
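In code, the pairwise accuracy computation sketched above might look as follows (a sketch assuming evaluation word pairs and dictionary-like access to both sets of vectors; cosine distances are taken in [0, 2]):

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance in [0, 2]: 1 - cos(u, v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def pairwise_accuracy(word_pairs, original_vecs, generated_vecs):
    """RMSE between pairwise cosine distances in the original and generated
    embedding spaces, converted to a percentage accuracy (P.accuracy)."""
    errors = []
    for w1, w2 in word_pairs:
        d_orig = cosine_distance(original_vecs[w1], original_vecs[w2])
        d_gen = cosine_distance(generated_vecs[w1], generated_vecs[w2])
        errors.append(d_orig - d_gen)
    rmse = np.sqrt(np.mean(np.square(errors)))
    percentage_error = (rmse / 2.0) * 100.0   # distances live in [0, 2]
    return 100.0 - percentage_error
```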
Figure 6. t-SNE visualization of transfer-learned Tamil vectors (MLP and CNN) and original vectors generated by Word2Vec. (a) Original Embeddings; (b) Embeddings generated by MLP; (c) Embeddings generated by CNN.
(Figure 7: deep-learned POS tagger network; the output layer predicts PoS tag labels and is trained with a categorical cross-entropy loss.)
scenario, the highest TFGE model’s testing accuracy achieved was limited to 62% with any
of the transfer learning models. In spite of a moderately high training error (MSE as high as
30–40%), the model still yielded fairly good embeddings. Figure 8a,b shows the 1D-CNN
TFGE model’s training accuracy/loss vs. epoch. In addition, the TFGE model afforded all
the desirable semantic properties that were verified and evaluated, using the quantitative
(topological) and qualitative metrics, mentioned in Section 7.
Figure 8. Graphical representation of Training Accuracy/Loss vs. Epochs for the 1D-CNN model of TFGE. (a) Training Accuracy vs. Epochs; (b) Training Loss vs. Epochs.
Table 3. Neighborhood and Pairwise Accuracy (N.accuracy and P.accuracy) of English–Tamil TFGE
trained on various models. LM, MLP, and CNN refer to the transfer function learning models Linear
Mapping, Multi-Layer Perceptron, and Convolutional Neural Network, respectively. The subscripts
W, G, and F refer to the vector training algorithms Word2Vec, GloVe, and FastText, respectively.
Table 4. Neighbourhood and Pairwise Accuracy (N.accuracy and P.accuracy) of Asian TFGE Embed-
dings trained by various models.
One of the priorities of transfer learning embedding is data efficiency. This implies
that the transfer function can be trained to obtain TFGEs with dictionaries that contain
no more than 1000 words. All the TFGE models were trained with this premise in mind.
The resource constraint was simulated by reduction of the size of the dictionary and
training the model on varying data instances. As the linear mapping model achieved good
neighborhood accuracy but failed miserably on pairwise accuracy, the data efficiency study
was restricted to deep learning models. This study was limited to pairwise accuracies as
they are based on linguistic ground truth (for every embedding model, the formation of
word pairs is governed by known linguistic relationships). Table 5 shows the pairwise
accuracy of deep-transfer-learned Tamil and English vectors on MLP and CNN networks,
for every embedding model. Albeit computationally intricate, P.accuracy was chosen in
preference to N.accuracy, as it is linguistically verifiable. The loss functions for the same
are shown in Figures 9 and 10.
Table 5. Pairwise Accuracy of MLP and CNN Network over various dictionary lengths. P.accuracy
was chosen because it is more difficult to achieve than N.accuracy and is linguistically verifiable.
#. of Word Pairs    MLP                                  CNN
                    Word2Vec   GloVe     FastText        Word2Vec   GloVe     FastText
1000                80.39      88.88     85.14           82.42      90.99     90.94
2000                81.99      87.34     86.71           83.47      91.21     91.05
3000                81.79      87.28     89.69           83.66      91.58     90.99
5000                82.27      88.20     89.82           83.98      92.55     92.36
7000                82.77      88.10     89.57           84.49      89.75     92.21
8000                82.35      87.70     89.77           85.13      90.93     92.48
10,000              82.68      89.70     90.67           86.34      91.38     92.68
Figure 9. Graphical representation of Pairwise Accuracy of MLP Network over various dictionary
lengths. The X-axis represents the dictionary sizes, and the Y-axis represents the accuracy.
Figure 10. Graphical representation of Pairwise Accuracy of CNN Network over various dictionary
lengths. The X-axis represents the dictionary sizes, and the Y-axis represents the accuracy.
Neighborhood accuracy was also measured across various POS categories. The vo-
cabulary was divided into the top four categories: nouns, verbs, adverbs, and adjec-
tives. Cosine neighborhood accuracies were measured individually, within each category,
as showcased in Table 6.
Table 6. Cosine neighborhood accuracy of TFGEs within each POS category.

Category        Word2Vec   GloVe     FastText
Nouns           80.13      91.79     80.97
Verbs           82.19      91.63     77.85
Adjectives      80.34      89.44     78.77
Adverbs         85.97      90.42     80.40
Figure 11. t-SNE visualization of transfer-learned Tamil vectors (MLP and CNN) and original vectors generated by GloVe. (a) Original Embeddings; (b) Embeddings generated by MLP; (c) Embeddings generated by CNN.
Figure 12. t-SNE visualization of transfer-learned Tamil vectors (MLP and CNN) and original vectors generated by FastText. (a) Original Embeddings; (b) Embeddings generated by MLP; (c) Embeddings generated by CNN.
The POS Tagger, depicted in Figure 7, takes input from an annotated Tamil corpus
of the cEnTam dataset to predict the tag label. Even though the network predicts thirty
tags, the accuracies of only four classes—noun, verb, adverb, and adjective—are compared.
Table 9 tabulates the class-wise accuracy of these four POS categories trained with different
embeddings. The first column provides the accuracies obtained with random vectors;
as expected, these gave very poor results. The original embeddings, however, fared much
better with average accuracy over four classes of 70%, 71%, and 81% for Word2Vec, GloVe,
and FastText, respectively. Linear Mapping performed even better, with an average of over
82%. TFGEs have the highest average prediction accuracy in this whole exercise; they even
outperformed the original embeddings from which they were transfer-learned. They
are trained on a very small monolingual target corpus compared to a pre-trained source
(<100 billion words). However, TFGE was generated by a transfer process, where the
input comes from an embedding space trained on a billion-word corpus (pre-trained
embeddings). Some ineffable properties of the pre-trained embeddings may have been
concurrently transferred to the TFGEs, which gives them an upper hand in the POS
tagging task. TFGE vectors dominated this usability test.
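A minimal sketch of a tagger of this kind, built on a frozen embedding lookup, is shown below; the 300-dimensional vectors, the 30-tag output, and the categorical cross-entropy loss come from the text and Figure 7, whereas the recurrent encoder, its width, and the optimizer are assumptions:

```python
from tensorflow.keras import layers, models

NUM_TAGS = 30                                # the network predicts thirty PoS tags
vocab_size, dim = embedding_matrix.shape     # one row per word; TFGE or original vectors

tagger = models.Sequential([
    layers.Embedding(vocab_size, dim,
                     weights=[embedding_matrix], trainable=False),  # evaluated embeddings
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),   # assumed encoder
    layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax")),
])
tagger.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```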
Table 9. Category-based accuracy for Deep Learned POS Tagger trained on different embedding
models and Linear Mapping on Word2Vec.
Category      Random Vectors   Word2Vec   GloVe   FastText   TFGE   Linear Mapping
Noun          0.53             0.65       0.69    0.70       0.86   0.84
Verb          0.52             0.75       0.71    0.86       0.84   0.74
Adverb        0.57             0.65       0.65    0.89       0.91   0.85
Adjectives    0.69             0.74       0.80    0.81       0.86   0.88
Our next use case task is BDI. The authors of [25] used their bilingual embedding
to perform BDI on three separate language pairs. We conducted a reverse lookup of the
transfer-learned vector using the original monolingual embeddings to elicit a word
translation of the source word. Given a series of source and target words ⟨w_s^i, w_t^i⟩,
their corresponding original monolingual embeddings ⟨wv_s^i, wv_t^i⟩, and the
transfer-learned target embedding wv_t^{i*}, the correct target word w_t^i is identified
for each query source word w_s^i by finding the target embedding wv_t^i that is the closest
neighbor to the transfer-learned/projected target word embedding wv_t^{i*}, where cosine
similarity is used as the measure between the embeddings. The highest accuracy reported
in [25] is 68.9 for a dictionary size of 1000 words. Reference [55] describes a shared
task system, where bilingual pairs are inducted over German–English (de–en) and Tamil–
English (ta–en). The study employed cross-lingual (TFGE) embeddings to induce a bilingual
dictionary, in both cases. Table 10 summarizes BDI accuracy, derived using TFGE [55];
TFGEs performed very well in the BDI task.
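The reverse lookup described above can be sketched as a cosine nearest-neighbor search over the original target vocabulary (assuming the target vectors are stacked into a matrix with one word per row):

```python
import numpy as np

def induce_translation(projected_vec, target_words, target_matrix):
    """Return the target word whose original embedding is the nearest cosine
    neighbor of the projected (transfer-learned) vector."""
    q = projected_vec / np.linalg.norm(projected_vec)
    M = target_matrix / np.linalg.norm(target_matrix, axis=1, keepdims=True)
    sims = M @ q                       # cosine similarity to every target word
    return target_words[int(np.argmax(sims))]
```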
Table 10. Accuracy of German–English and Tamil–English BDI system as reported in [55].
Models              de–en     ta–en
Linear Mapping      73.01     76.05
MLP                 80.67     85.52
CNN                 85.16     90.33
10. Discussions
This paper’s empirical studies entailed trained models on word sets as low as 1000 for
Tamil, the target language. Even with such a low word count and corresponding vectors, a
cross-lingual transfer learning network could be devised that generated reasonably good
quality vectors from English words for unknown Tamil words. The generated vectors
were assessed over an unseen set of word pairs. The cosine distance obtained over the
pair of words using original embeddings and the cosine distance obtained using generated
embeddings were compared. The resultant error was used for calculation of accuracy.
The pairwise accuracy is expressed as the complement of root-mean-square percentage
error (RMSPE). However, the networks were trained on absolute prediction error with the
known monolingual target vectors (of Tamil in this case). The trained model is further
validated with real NLP tasks for verification of the propriety of the generated embeddings.
Summaries from the same text document were generated with the algorithm presented
in [53]. The summary generated with the original embeddings and the transfer learned
embeddings were identical for all embedding types. The generated embeddings were
tested on a POS classification network and used for BDI over German–English and Tamil–English
pairs. In all these cases, the generated embeddings (TFGE) were as effective as the original
embeddings. This conclusively proves the aptness of the learned vectors.
11. Conclusions
The primary objective of this investigation was to devise an efficient transfer learning
scheme for attainment of cross-lingual word embeddings obviating the need for large
monolingual and bilingual corpora. Multiple experiments were conducted, employing
different methodologies, to attain target word vectors for the English–Tamil bilingual
pair. Tamil, a popular Asian language, is linguistically similar to many other south Asian
languages. We created sufficient (monolingual and bilingual) corpora for Tamil for the eval-
uation of the proffered methodologies and empirical outcomes. Furthermore, pre-trained
Hindi and Chinese embeddings were marshalled to validate the transfer learning model.
Target word vectors were successfully generated with a minimal corpus (monolingual)
size of 5000 words, approximately the size of a textbook. Such a modest-sized corpus was
apposite for achievement of useful word vectors—89% using GloVe vectors and at least an
accuracy of 80% using Word2Vec—with proven cross-validated topological (pairwise and
neighborhood) accuracy. The cosines were scaled between [0, 2], and the error was also
computed in the same interval. The accuracies obtained were compared with the standard
bilingual embedding algorithm BilBOWA [26], which uses a sentence-aligned parallel
and comparable corpora. It also considers a minimal word-aligned model to improve the
accuracy of the target vectors. This paper’s investigators are convinced that a bilingual
model is a compromise between the languages’ semantics. The ineluctable semantic gap
between the languages is traded off in the interest of a common vector space. The proffered
model is a cross-lingual transfer learning model that takes a source language vector and
projects it to the target space, maintaining the semantic integrity of the target language.
The deep learning networks essentially learn the semantic gap between the languages and
incorporate the source vector into the target vector space. In contrast, the cross-lingual
model requires only monolingual corpora in both the languages with a dictionary.
As a reconfirmation of the submitted approach, pre-trained Hindi and Chinese embed-
dings (Word2Vec) were piped through the tendered model. The model that was trained on
Tamil as the target language yielded accuracies of 77% with Hindi and 70% with Chinese.
In their own ways, the Hindi and Chinese languages are very distinct from Tamil, as em-
pirically observed from their accuracies. When the models were re-trained with respective
languages using a word set of 1000 words, an accuracy of 83% was reported with Hindi,
versus an accuracy of 88% observed with Chinese. These findings corroborate that the
method put forth is language-independent. We are optimistic that the word embedding
strategy will work seamlessly on syntactically similar target languages, foreclosing the
prerequisite of re-training on the second target language.
Author Contributions: Conceptualization, S.J. and V.K.M.; methodology, S.J. and V.K.M.; software,
S.J.; validation, S.J. and V.K.M.; formal analysis, S.J. and V.K.M.; investigation, S.J. and V.K.M.;
resources, S.J.; data curation, S.J.; writing—original draft preparation, S.J.; writing—review and
editing, V.K.M. and A.W.; supervision, V.K.M., R.S. and S.K.; project administration, V.K.M. and S.K.;
funding acquisition, S.J., V.K.M. and A.W. All authors have read and agreed to the published version
of the manuscript.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Torabi Asr, F.; Zinkov, R.; Jones, M. Querying Word Embeddings for Similarity and Relatedness. In Proceedings of the 2018
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1
(Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 675–684. [CrossRef]
2. Billah Nagoudi, E.M.; Ferrero, J.; Schwab, D.; Cherroun, H. Word Embedding-Based Approaches for Measuring Semantic
Similarity of Arabic-English Sentences. In Proceedings of the 6th International Conference on Arabic Language Processing, Fez,
Morocco, 11–12 October 2017.
3. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci.
1990, 41, 391–407. [CrossRef]
4. Fellbaum, C. WordNet: An Electronic Lexical Database; Language, Speech, and Communication; MIT Press: Cambridge, MA,
USA, 1998.
5. Bengio, Y.; Ducharme, R.; Vincent, P.; Janvin, C. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 2003, 3, 1137–1155.
6. Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P.P. Natural Language Processing (almost) from Scratch.
arXiv 2011, arXiv:1103.0398.
7. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. In Proceedings of the EMNLP, Doha,
Qatar, 25–29 October 2014; Volume 14, pp. 1532–1543.
8. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and their Composi-
tionality. arXiv 2013, arXiv:1310.4546.
9. Wang, S.; Zhou, W.; Jiang, C. A survey of word embeddings based on deep learning. Computing 2020, 102, 717–740. [CrossRef]
10. Treviso, M.V.; Shulby, C.D.; Aluísio, S.M. Evaluating Word Embeddings for Sentence Boundary Detection in Speech Transcripts.
arXiv 2017, arXiv:1708.04704.
11. Bansal, M.; Gimpel, K.; Livescu, K. Tailoring Continuous Word Representations for Dependency Parsing. In Proceedings of
the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational
Linguistics: Stroudsburg, PA, USA, 2014; pp. 809–815.
12. Guo, J.; Che, W.; Wang, H.; Liu, T. Revisiting Embedding Features for Simple Semi-supervised Learning. In Proceedings of the
EMNLP, Doha, Qatar, 25–29 October 2014.
13. Soliman, A.B.; Eissa, K.; El-Beltagy, S.R. AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP. Procedia Comput.
Sci. 2017, 117, 256–265. [CrossRef]
14. Xu, J.; Liu, J.; Zhang, L.; Li, Z.; Chen, H. Improve Chinese Word Embeddings by Exploiting Internal Structure. In Proceedings of
the HLT-NAACL, San Diego, CA, USA, 17 June 2016.
15. Upadhyay, S.; Chang, K.; Taddy, M.; Kalai, A.T.; Zou, J.Y. Beyond Bilingual: Multi-sense Word Embeddings using Multilingual
Context. arXiv 2017, arXiv:1706.08160.
16. Wolk, K. Contemporary Polish Language Model (Version 2) Using Big Data and Sub-Word Approach. In Proceedings of the
Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai,
China, 25–29 October 2020.
17. Bhattacharya, P.; Goyal, P.; Sarkar, S. Using Word Embeddings for Query Translation for Hindi to English Cross Language
Information Retrieval. arXiv 2016, arXiv:1608.01561.
18. Devi, G.R.; Veena, P.; Kumar, M.A.; Soman, K. Entity Extraction for Malayalam Social Media Text Using Structured Skip-gram
Based Embedding Features from Unlabeled Data. Procedia Comput. Sci. 2016, 93, 547–553. [CrossRef]
19. Devi, G.R.; Veena, P.; Kumar, M.A.; Soman, K. AMRITA_CEN@FIRE 2016: Code-Mix Entity Extraction for Hindi-English and
Tamil-English Tweets; Indian Statistical Institute: Kolkata, India, 2016.
20. Ajay, S.G.; Srikanth, M.; Kumar, M.A.; Soman, K.P. Word Embedding Models for Finding Semantic Relationship between Words
in Tamil Language. Indian J. Sci. Technol. 2016, 9, 1–5. [CrossRef]
21. Yin, Z.; Shen, Y. On the Dimensionality of Word Embedding. In Advances in Neural Information Processing Systems 31; Bengio, S.,
Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Dutchess County, NY, USA,
2018; pp. 895–906.
22. Li, B.; Drozd, A.; Guo, Y.; Liu, T.; Matsuoka, S.; Du, X. Scaling Word2Vec on Big Corpus. Data Sci. Eng. 2019, 4, 157–175.
[CrossRef]
23. Chandar, A.P.S.; Lauly, S.; Larochelle, H.; Khapra, M.M.; Ravindran, B.; Raykar, V.C.; Saha, A. An Autoencoder Approach to
Learning Bilingual Word Representations. arXiv 2014, arXiv:1402.1454.
24. Hermann, K.M.; Blunsom, P. A Simple Model for Learning Multilingual Compositional Semantics. arXiv 2013, arXiv:1312.6173.
25. Vulić, I.; Moens, M.F. Bilingual Word Embeddings from Non-Parallel Document-Aligned Data Applied to Bilingual Lexicon
Induction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International
Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, 26–31 July 2015; pp. 719–725.
[CrossRef]
26. Gouws, S.; Bengio, Y.; Corrado, G. BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. In
Proceedings of the ICML, Lille, France, 6–11 July 2015; JMLR Workshop and Conference Proceedings; Volume 37, pp. 748–756.
27. Ruder, S.; Vulić, I.; Søgaard, A. A Survey of Cross-Lingual Word Embedding Models. J. Artif. Int. Res. 2019, 65, 569–630.
[CrossRef]
28. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013,
arXiv:1301.3781.
29. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput.
Linguist. 2017, 5, 135–146. [CrossRef]
30. Sanjanasri, J.P.; Premjith, B.; Menon, V.K.; Soman, K.P. cEnTam: Creation and Validation of a New English-Tamil Bilingual
Corpus. In Proceedings of the 13th Workshop on Building and Using Comparable Corpora, Marseille, France, 11–16 May 2020.
31. Elsisi, M.; Tran, M.Q.; Mahmoud, K.; Lehtonen, M.; Darwish, M.M.F. Deep Learning-Based Industry 4.0 and Internet of Things
towards Effective Energy Management for Smart Buildings. Sensors 2021, 21, 1038. [CrossRef]
32. Elsisi, M.; Mahmoud, K.; Lehtonen, M.; Darwish, M.M.F. An Improved Neural Network Algorithm to Efficiently Track Various
Trajectories of Robot Manipulator Arms. IEEE Access 2021, 9, 11911–11920. [CrossRef]
33. Le, Q.V.; Mikolov, T. Distributed Representations of Sentences and Documents. arXiv 2014, arXiv:1405.4053.
34. McCann, B.; Bradbury, J.; Xiong, C.; Socher, R. Learned in Translation: Contextualized Word Vectors. arXiv 2017, arXiv:1708.00107.
35. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations.
arXiv 2018, arXiv:1802.05365.
36. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
arXiv 2018, arXiv:1810.04805.
37. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers:
State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations, Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA,
USA, 2020; pp. 38–45.
38. Sazzed, S. Cross-lingual sentiment classification in low-resource Bengali language. In Proceedings of the Sixth Workshop on
Noisy User-generated Text (W-NUT 2020), Online, 19 November 2020; Association for Computational Linguistics: Stroudsburg,
PA, USA, 2020; pp. 50–60.
39. Zou, A. Learning Cross-Lingual Word Embeddings for Sentiment Analysis of Microblog Posts. Master’s Thesis, Princeton
University, Princeton, NJ, USA, 2020.
40. Lastra-Díaz, J.J.; Goikoetxea, J.; Taieb, M.A.H.; García-Serrano, A.M.; Benaouicha, M.; Agirre, E. A reproducible survey on word
embeddings and ontology-based methods for word similarity: Linear combinations outperform the state of the art. Eng. Appl.
Artif. Intell. 2019, 85, 645–665. [CrossRef]
41. Yuan, M.; Zhang, M.; Van Durme, B.; Findlater, L.; Boyd-Graber, J. Interactive Refinement of Cross-Lingual Word Embeddings.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November
2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 5984–5996.
42. Pre-Trained, Asian Embeddings. 2019. Available online: https://github.com/Kyubyong/wordvectors (accessed on 10 May
2021).
43. Hindi, Dictionary. 2019. Available online: https://www.shabdkosh.com/dictionary/english-hindi/ (accessed on 10 May 2021).
44. Chinese, Dictionary. 2019. Available online: http://www.mandarintools.com (accessed on 10 May 2021).
45. Mikolov, T.; Le, Q.V.; Sutskever, I. Exploiting Similarities among Languages for Machine Translation. arXiv 2013, arXiv:1309.4168.
46. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in
Neural Information Processing Systems 25; Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., Eds.; Curran Associates, Inc.:
Dutchess County, NY, USA, 2012; pp. 1097–1105.
47. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
48. Jacovi, A.; Shalom, O.S.; Goldberg, Y. Understanding Convolutional Neural Networks for Text Classification. arXiv 2018,
arXiv:1809.08037.
49. Bakarov, A. A Survey of Word Embeddings Evaluation Methods. arXiv 2018, arXiv:1801.09536.
50. Sanjanasri, J.P.; Menon, V.K.; Rajendran, S.; Soman, K.P.; Anand Kumar, M. Intrinsic Evaluation for English-Tamil Bilingual Word
Embeddings. In Intelligent Systems, Technologies and Applications; Springer: Singapore, 2020; pp. 39–51.
51. van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
52. Hill, F.; Reichart, R.; Korhonen, A. SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation. Comput.
Linguist. 2015, 41, 665–695. [CrossRef]
53. Menon, V.; Maniyil, S.; Harikumar, K.; Soman, K. Semantic Analysis Using Pairwise Sentence Comparison with Word Embeddings;
Springer International Publishing: Cham, Switzerland, 2018; pp. 268–278. [CrossRef]
54. Aldarmaki, H.; Mohan, M.; Diab, M. Unsupervised Word Mapping Using Structural Similarities in Monolingual Embeddings.
Trans. Assoc. Comput. Linguist. 2018, 6, 185–196. [CrossRef]
55. Sanjanasri, J.P.; Menon, V.K.; Soman, K.P. BUCC2020: Bilingual Dictionary Induction using Cross-lingual Embedding. In
Proceedings of the 13th Workshop on Building and Using Comparable Corpora, Marseille, France, 11–16 May 2020.