10.1007@978 3 030 41407 8
Semantic Technology
9th Joint International Conference, JIST 2019
Hangzhou, China, November 25–27, 2019
Proceedings
Lecture Notes in Computer Science 12032
Founding Editors
Gerhard Goos
Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis
Cornell University, Ithaca, NY, USA
Editors
Xin Wang, Tianjin University, Tianjin, China
Francesca Alessandra Lisi, University of Bari, Bari, Italy
Guohui Xiao, Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
Elena Botoeva, Imperial College London, London, UK
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This is the first volume of the proceedings of the 9th Joint International Semantic
Technology Conference (JIST 2019), held during November 25–27, 2019, in
Hangzhou, China. JIST is a joint event for regional Semantic-related conferences. Since
its launch in Hangzhou in 2011, it has become the premier Asian forum on
Semantic Web, Knowledge Graph, Linked Data, and AI on the Web. In 2019, JIST
returned to Hangzhou, and the mission was to bring together researchers in the
Knowledge Graph and Semantic Technology research community and other related
areas to present their innovative research results and novel applications. This year’s
theme was “Open Web and Knowledge Graph.”
The proceedings of JIST 2019 are presented in two volumes: the first one in LNCS
and the second one in CCIS. The conference attracted high-quality submissions and
participants from all over the world. There were 70 submissions from 8 countries. The
Program Committee (PC) consisted of 52 members from 13 countries. Each PC member
was assigned four papers on average, and each submission was reviewed by at least
three PC members. The committee decided to accept 24 full papers (34.3%) in volume
1 (LNCS) and 22 other papers (31.4%) in volume 2 (CCIS). In addition to the paper
presentations, the program of JIST 2019 also featured three tutorials, three keynotes,
one special forum on Open Knowledge Graph, and poster presentations.
We are indebted to many people who made this event possible. As the organizers of
JIST 2019, we would like to express our sincere thanks to the PC members and
additional reviewers for their hard work in reviewing the papers. We would also like to
thank the sponsors, support organizations, all the speakers, authors, and participants for
their great contributions. Last but not least, we would like to thank Springer for their
support in producing these proceedings.
General Chairs
Huajun Chen Zhejiang University, China
Diego Calvanese Free University of Bozen-Bolzano, Italy
Program Chairs
Xin Wang Tianjin University, China
Francesca A. Lisi Università degli Studi di Bari, Italy
Workshop Chairs
Yuan-Fang Li Monash University, Australia
Xianpei Han ISCAS, China
Tutorial Chairs
Xiaowang Zhang Tianjin University, China
Jiaoyan Chen Oxford University, UK
viii Organization
Sponsorship Chair
Jinguang Gu Wuhan University of Science and Technology, China
Proceeding Chairs
Guohui Xiao Free University of Bozen-Bolzano, Italy
Elena Botoeva Imperial College London, UK
Publicity Chairs
Meng Wang Southeast University, China
Naoki Fukuta Shizuoka University, Japan
Program Committee
Carlos Bobed everis and NTT Data, Spain
Fernando Bobillo University of Zaragoza, Spain
Huajun Chen Zhejiang University, China
Wenliang Chen Soochow University, China
Gong Cheng Nanjing University, China
Dejing Dou University of Oregon, USA
Jianfeng Du Guangdong University of Foreign Studies, China
Alessandro Faraotti IBM, Italy
Naoki Fukuta Academic Institute Shizuoka University, Japan
Jinguang Gu Wuhan University of Science and Technology, China
Xianpei Han ISCAS, China
Wei Hu Nanjing University, China
Ryutaro Ichise National Institute of Informatics, Japan
Takahiro Kawamura National Agriculture and Food Research Organization,
Japan
Evgeny Kharlamov Bosch Center for Artificial Intelligence and University
of Oslo, Norway
Martin Kollingbaum University of Aberdeen, UK
Kouji Kozaki Osaka Electro-Communication University, Japan
Weizhuo Li Academy of Mathematics and Systems Science, CAS,
China
Yuan-Fang Li Monash University, Australia
Juanzi Li Tsinghua University, China
Francesca A. Lisi Università degli Studi di Bari, Italy
Kang Liu Institute of Automation, CAS, China
Yinglong Ma NCEPU, China
Theofilos Mailis National and Kapodistrian University of Athens,
Greece
Riichiro Mizoguchi Japan Advanced Institute of Science and Technology,
Japan
Additional Reviewers
Abstract. Taxonomic relations (also called “is-A” relations) are key components
in taxonomies, semantic hierarchies and knowledge graphs. Previous works on
identifying taxonomic relations are mostly based on linguistic and distributional
approaches. However, these approaches are limited by the availability of a large
enough corpus that can cover all terms of interest and provide sufficient contextual
information to represent their meanings. Therefore, the generalization abilities
of these approaches are far from satisfactory. In this paper, we propose a novel
neural network model to enhance the semantic representations of term pairs by
encoding their respective definitions for the purpose of taxonomic relation
identification. This has two main benefits: (i) Definitional sentences express
corpus-independent meanings of terms, hence definition-driven approaches
generalize well to unseen terms and to taxonomic relations that are not expressed
in the domain-specific training data; (ii) Global contextual information from a
large corpus and sense-level definitions can provide a richer interpretation of
terms from a broader knowledge base perspective, and benefit the accurate
prediction of the taxonomic relations of term pairs.
The experimental results show that our model outperforms several competitive
baseline methods in terms of F-score on both specific and open domain datasets.
1 Introduction
Taxonomic relation (also called “is-A” relation) identification is the task of determining
whether a specific pair of terms (see footnote 1) holds the taxonomic relation or not. Concretely, given
a pair of terms (x, y), if y denotes a semantically broader category that includes x, we call
y a hypernym of x and x a hyponym of y [26]. For instance, “scientist” is a hypernym
of “Einstein”, “actor” is a hypernym of “Mel Gibson”, and “Paris” is a hyponym of “exciting
city”. The accurate prediction of these taxonomic relations benefits a variety
of downstream applications, such as serving as building blocks for semantic structure
1 This paper uses “terms” to refer to any words or phrases.
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 1–17, 2020.
https://doi.org/10.1007/978-3-030-41407-8_1
2 Y. Sheng et al.
can provide complementary knowledge to the context from the corpus, so that our
proposed model better tolerates unseen terms, rare terms, and terms with
biased sense distribution.
– Our model combines a distributional model with definition encoding in an integrated
way, rather than simply concatenating the two subsystems. This helps generate more
indicative features across distributional contexts and definitions for the accurate
prediction of taxonomic relations of term pairs.
– The experimental results on both general and domain-specific datasets corroborate
the effectiveness and robustness of our model over several competitive models in
terms of F-score.
2 Related Work
Taxonomic relation identification is a well-studied topic in NLP research. The many
approaches explored for this task can be divided into two branches: linguistic
and distributional approaches.
In linguistic approaches, a range of pre-defined rules or lexical-syntactic patterns
is leveraged to extract taxonomic relations from a text corpus. Patterns are either chosen
manually [9] or learnt automatically via bootstrapping [26]. Such approaches
can yield taxonomic relations with relatively high accuracy. Unfortunately, using
patterns as features may result in the sparsity of the feature space [19]. Moreover, these
approaches require the co-occurrence of the two terms in the same sentence, which strongly
hinders their recall. Higher recall can be achieved by distributional
methods.
In distributional approaches, taxonomic relations are identified by studying the relations
between the distributional representations (word embeddings), derived from contexts, of
hyponyms and their respective hypernyms, either by learning semantic prediction
models or through several unsupervised measures [10, 22]. Such approaches
draw primarily on the distributional hypothesis [8], which states that terms appearing in
similar contexts may share a semantic relationship. The main advantage of distributional
approaches is that they can discover relations not directly expressed in the text.
However, such approaches depend on the choice of features, which in turn depends on the
domain specificity of the training data, e.g., an IT corpus hardly mentions “apple” as a
fruit. Furthermore, rare terms are poorly expressed by their sparse global context and,
more generally, these methods would not generalize to the low-resource language setting.
Our proposed model shares the same inspiration as distributional methods. More
importantly, it goes beyond the framework of distributional models, which acquire
context-aware term meanings from a large corpus, and instead explores a novel information
resource, namely definitive sentences, to enhance the robustness of the system.
For the design of the baseline system, we follow the idea of the Siamese Network [21], as
shown in Fig. 1. Concretely, the two identical sentence encoders share the same set of
weights during training, and generate two neural representations. We observe from the
figure that the system mainly consists of three layers from bottom to top: (i) Sentence
input layer; (ii) Sentence encoding layer; (iii) Sentence output layer. We explain
the last two layers in detail in the following subsection. The sentence input layer
is introduced in Sect. 3.2.
Sentence Encoding Layer. Take the sentence encoder on either side as an example.
Given a sentence $S_i$ ($i = 1$ or $2$) with length $L$, our goal is to find a neural representation
of $S_i$. We first map the $L$ words of $S_i$ to a sequence of word embedding vectors $\{x_{ij}\}$
($j = 1, ..., L$, where the dimension of the word embedding is denoted as $d_m$), based on the
pre-trained embeddings described in Sect. 4.1. Then we employ
a Bi-LSTM, which is composed of a forward LSTM and a backward LSTM component,
to process $\{x_{ij}\}$ in the forward left-to-right and the backward right-to-left directions,
respectively. In each direction, the reading of $\{x_{ij}\}$ is modelled as a recurrent process
with a single hidden state. Given an initial value, the state changes its value recurrently,
and each time step consumes an incoming word.
Take the forward LSTM component as an example. Denoting the initial hidden state
as $\overrightarrow{h}_0$, the recurrent state transition values $\{\overrightarrow{h}_1, \overrightarrow{h}_2, ..., \overrightarrow{h}_L\}$
are calculated as it reads the input $\{x_1, x_2, ..., x_L\}$. At time $t$, the current hidden state vector $\overrightarrow{h}_t$
is computed based on the previous hidden state $\overrightarrow{h}_{t-1}$, the previous cell $c_{t-1}$ and the current
input word embedding $x_t$. The detailed computations of the forward LSTM are defined
as follows [6]:

$$
\begin{aligned}
\hat{i}_t &= \delta(W_{xi} x_t + W_{hi} \overrightarrow{h}_{t-1} + b_i), \\
\hat{f}_t &= \delta(W_{xf} x_t + W_{hf} \overrightarrow{h}_{t-1} + b_f), \\
o_t &= \delta(W_{xo} x_t + W_{ho} \overrightarrow{h}_{t-1} + b_o), \\
u_t &= \delta(W_{xu} x_t + W_{hu} \overrightarrow{h}_{t-1} + b_u), \\
i_t, f_t &= \mathrm{softmax}(\hat{i}_t, \hat{f}_t), \\
c_t &= f_t \odot c_{t-1} + i_t \odot u_t, \\
\overrightarrow{h}_t &= o_t \odot \tanh(c_t),
\end{aligned}
\tag{1}
$$

where $i_t$, $f_t$, $o_t$ and $u_t$ are an input gate, a forget gate, an output gate and an actual
input at time $t$, respectively. The activation function $\delta$ of the LSTM is set to tanh. $W_{(\cdot)}$
represent trained weight matrices, $c_t$ is a vector representation of the state in the recurrent cell
at time step $t$, and $b_x$ ($x \in \{i, o, f, u\}$) is a bias vector. $\odot$ denotes the Hadamard product.
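To make the state transition in Eq. (1) concrete, the following is a minimal pure-Python sketch of a single forward step, not the authors' TensorFlow implementation. It assumes that the softmax in Eq. (1) couples the input and forget gates per dimension (so that $i_t + f_t = 1$ element-wise); the helper names `matvec`, `gate`, `lstm_step` and `init_params`, as well as the weight initialization range, are hypothetical.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gate(W_x, W_h, b, x_t, h_prev):
    """delta(W_x x_t + W_h h_{t-1} + b), with delta = tanh as stated in the text."""
    return [math.tanh(a + c + d)
            for a, c, d in zip(matvec(W_x, x_t), matvec(W_h, h_prev), b)]


def lstm_step(params, x_t, h_prev, c_prev):
    """One forward step following Eq. (1); returns (h_t, c_t)."""
    i_hat = gate(*params["i"], x_t, h_prev)
    f_hat = gate(*params["f"], x_t, h_prev)
    o_t = gate(*params["o"], x_t, h_prev)
    u_t = gate(*params["u"], x_t, h_prev)
    # Assumption: the softmax normalizes each (i_hat, f_hat) pair per dimension.
    i_t, f_t = [], []
    for a, b in zip(i_hat, f_hat):
        e_a, e_b = math.exp(a), math.exp(b)
        i_t.append(e_a / (e_a + e_b))
        f_t.append(e_b / (e_a + e_b))
    # Hadamard products: c_t = f_t * c_{t-1} + i_t * u_t, h_t = o_t * tanh(c_t).
    c_t = [f * c + i * u for f, c, i, u in zip(f_t, c_prev, i_t, u_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def init_params(d_in, d_hidden, rng):
    """Hypothetical helper: small random weights (W_x, W_h, b) for each gate."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {k: (mat(d_hidden, d_in), mat(d_hidden, d_hidden), [0.0] * d_hidden)
            for k in "ifou"}


# Read a toy two-word "sentence" with 3-dimensional embeddings.
params = init_params(3, 2, random.Random(0))
h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([0.1, -0.2, 0.3], [0.0, 0.5, -0.1]):
    h, c = lstm_step(params, x, h, c)
```

The backward component would run the same step over the reversed input with its own parameter set.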
The backward LSTM component follows the same recurrent state transition process
as described in Eq. (1). Starting from an initial state $h_{n+1}$, which is a model parameter,
it reads the input $\{x_n, x_{n-1}, ..., x_0\}$, changing its value to $\{\overleftarrow{h}_n, \overleftarrow{h}_{n-1}, ..., \overleftarrow{h}_0\}$,
respectively. A separate set of parameters $W_{(\cdot)}$ and $b_x$ is used for the backward
component.
Finally, the Bi-LSTM concatenates the vector values of $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ to represent the
encoding information of $x_t$ at time $t$, which is denoted as $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.
An additive attention mechanism [2] is applied to the resulting hidden states
$\{h_t\}$ ($t = 1, ..., L$). This is a weighted calculation for learning more
accurate and focused sentence representations, based on the following formulas:

$$g_t = \sum_i \alpha_{ti} h_i, \tag{2}$$

where $h_i$ is the column vector denoting the hidden state of $x_i$, and $\ell_i$ can be regarded as the
intermediate attention representation of $x_i$ in the sentence, obtained from a
linear transformation of $h_i$. $\alpha_{ti}$ denotes the attention weight of $x_i$ (i.e., an entry of the attention
vector $\alpha$) and is computed by combining the weighted values in $\ell_i$:

$$\alpha_{ti} = \frac{e^{u^T \ell_i}}{\sum_j e^{u^T \ell_j}}, \tag{3}$$

$$\ell_i = \tanh(W_a h_i + b_a), \tag{4}$$

where $W_a$ is a trained weight matrix, $u^T$ denotes the transpose of a trained parameter vector $u$,
and $b_a$ is a bias vector.
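The attention computation in Eqs. (2)–(4) can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the intermediate representation is written `ell` here, the function name `attention_pool` is hypothetical, and a single attention vector is computed for the whole sentence, since the right-hand side of Eq. (3) does not depend on $t$.

```python
import math


def attention_pool(H, W_a, b_a, u):
    """Return (alpha, g): attention weights and the weighted sum of hidden states.

    H   : list of hidden-state vectors h_i
    W_a : weight matrix (list of rows), b_a : bias vector, u : parameter vector
    """
    scores = []
    for h in H:
        # Eq. (4): ell_i = tanh(W_a h_i + b_a)
        ell = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
               for row, b in zip(W_a, b_a)]
        # Unnormalized score u^T ell_i
        scores.append(sum(u_k * l_k for u_k, l_k in zip(u, ell)))
    # Eq. (3): softmax over positions (shifted by the max for numerical stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]
    # Eq. (2): weighted sum of the hidden states
    dim = len(H[0])
    g = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(dim)]
    return alpha, g


# Toy example with two 2-dimensional hidden states.
alpha, g = attention_pool(
    H=[[1.0, 0.0], [0.0, 1.0]],
    W_a=[[1.0, 0.0], [0.0, 1.0]],
    b_a=[0.0, 0.0],
    u=[1.0, 0.0],
)
```

In this toy setting the first position receives the larger weight because `u` only looks at the first component of `ell`.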
We concatenate contextual information vector for each time step as follows:
In our task, the indicative features related to the taxonomic relation are likely to appear
in any area of the sentence under different contexts. Hence, we should encode the
sentence further by utilizing all local features and form better neural representations
globally. When using a neural network, the convolution approach is a natural means
of extracting local features with a sliding window of length l over the sentence [4].
Here, typically the size of the sliding window l is 3. Then, it combines all fine-grained
features via a max-pooling operation to obtain a fixed-sized vector for the output of the
convolution operation.
Here, convolution is defined as a matrix multiplication between a sequence of vectors
$g$, which is formed by Eq. (5), a convolution matrix $W \in \mathbb{R}^{d_m \times (d_m \times l)}$ and a bias
vector $b$ with a sliding window [30]. Let us define the vector $q_i \in \mathbb{R}^{d_m \times l}$ as the
concatenation of a sequence of input representations $g$ in the $i$-th window; we have:
Hence, the output of a single convolutional kernel $p_i$ (i.e., for the $i$-th window) can be
expressed as follows:

$$p_i = f(W q_i + b), \tag{7}$$

where $f$ is the activation function. A convolutional layer can comprise $d_c$ convolutional
kernels, which results in an output matrix $P = [p_1, p_2, ..., p_{d_c}] \in \mathbb{R}^{d_c \times (d_m - l + 1)}$.
Then, a max-pooling operation is applied to this matrix to obtain the maximum-valued
features as follows:

$$\hat{p} = \max(P) = \max_i(p_i). \tag{8}$$
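The sliding-window convolution and max-pooling of Eqs. (7) and (8) can be illustrated with a short sketch. It makes several assumptions not fixed by the text: each kernel is stored as one flat weight vector over the concatenated window, the activation $f$ is taken to be tanh, and the function name `conv_max_pool` is hypothetical.

```python
import math


def conv_max_pool(G, kernels, biases, window=3):
    """Slide a window over the vector sequence G and max-pool each kernel's responses.

    G       : list of input vectors g_1..g_L (all of the same dimension)
    kernels : list of flat weight vectors, each of length window * dim
    biases  : one bias per kernel
    Returns one pooled feature per kernel.
    """
    n_windows = len(G) - window + 1
    if n_windows < 1:
        raise ValueError("sequence is shorter than the window")
    pooled = []
    for w, b in zip(kernels, biases):
        responses = []
        for i in range(n_windows):
            # q_i: concatenation of the vectors inside the i-th window
            q_i = [v for vec in G[i:i + window] for v in vec]
            # Eq. (7) with f = tanh (assumed activation)
            responses.append(math.tanh(sum(a * x for a, x in zip(w, q_i)) + b))
        # Eq. (8): keep the maximum response over all windows
        pooled.append(max(responses))
    return pooled


# Four 1-dimensional inputs, two kernels, window length 3 (two windows).
features = conv_max_pool(
    G=[[1.0], [2.0], [3.0], [4.0]],
    kernels=[[1.0, 0.0, 0.0], [0.0, 0.0, -1.0]],
    biases=[0.0, 0.0],
)
```

Max-pooling makes the output size depend only on the number of kernels, so sentences and single terms of different lengths yield vectors of the same size.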
Sentence Output Layer. In our settings, the neural representation of each sentence $S_i$
($i = 1, 2$) (a term can be treated as a short sentence) is generated using Eq. (8), and is
denoted as $\hat{p}_i$ ($i = 1, 2$) for short. Finally, we obtain the overall representation for the
sentence pair by concatenating $\hat{p}_1$ and $\hat{p}_2$.
Inspired by the above observations, we investigate four strategies to define the input
representations of the baseline system:
$(x_{hypo}, y_{hyper})$, $(x_{hypo}, d_y)$, $(d_x, y_{hyper})$ and $(d_x, d_y)$. Accordingly, we obtain four
separate representations as the output: $p_{tt}$ from $(x_{hypo}, y_{hyper})$,
$p_{td}$ from $(x_{hypo}, d_y)$, $p_{dt}$ from $(d_x, y_{hyper})$, and $p_{dd}$ from $(d_x, d_y)$. In the
following, we explain these combinations in more detail:
The baseline system over $(x_{hypo}, y_{hyper})$ takes two vector representations
as the input, and outputs a joint vector representation. This combination intends
to model the mapping from the embedding of a hyponym to that of its hypernym via a network with
weights. It is similar to a pioneering work in this field [5], which employs uniform linear
projection and piecewise linear projection to map the embedding of a hyponym to that of its
hypernym.
The baseline system over $(x_{hypo}, d_y)$ and $(d_x, y_{hyper})$ also outputs a joint vector
representation. These combinations help generate indicative features across
distributional context and definition for discriminating taxonomic relations from other
semantic relations, e.g., as described in the beginning of this subsection, the hyponym or
prominent context words of the hyponym are expected to appear in its hypernym’s
definition, and vice versa. Based on our high-quality training data, this can provide direct
clues for inferring the taxonomic relation between the terms.
The baseline system over $(d_x, d_y)$ may provide alternative evidence for interpreting
the terms. Inherent structure information in definitive sentences is often important
for understanding term meanings and thus for accurately detecting taxonomic relations.
The underlying assumption is that if two sentences exhibit fine-grained structural
alignments at some level, their corresponding terms may
hold the taxonomic relation. For example, the definitions of the terms “apple” and “Malus” in
WordNet are “An apple is a sweet, edible fruit produced by an apple tree (Malus domestica)”
and “apple trees; found throughout temperate zones of the northern hemisphere”,
respectively. Indeed, “Malus” is a hypernym of “apple”.
Heuristic Matching. Inspired in part by the ideas of natural language inference
provided by [14, 18], we combine the outputs from the baseline system into a joint vector
representation via different strategies:

$$p = [p_{tt}; p_{dd}; p_{td} - p_{dt}; p_{td} \odot p_{dt}], \tag{9}$$

where semicolons refer to the concatenation of multiple column vectors, $\odot$ denotes the
element-wise multiplication of two vectors, $p \in \mathbb{R}^{4 d_h}$ is the output of this layer, and $d_h$
is the number of hidden units of the LSTMs.
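Eq. (9) amounts to a concatenation of two vectors with an element-wise difference and an element-wise product. A minimal sketch, with the hypothetical function name `heuristic_match` and plain Python lists standing in for the four representations:

```python
def heuristic_match(p_tt, p_dd, p_td, p_dt):
    """Build the joint vector [p_tt; p_dd; p_td - p_dt; p_td * p_dt] of Eq. (9)."""
    if len(p_td) != len(p_dt):
        raise ValueError("p_td and p_dt must have the same dimension")
    difference = [a - b for a, b in zip(p_td, p_dt)]
    product = [a * b for a, b in zip(p_td, p_dt)]
    return list(p_tt) + list(p_dd) + difference + product


# Toy 2-dimensional representations yield an 8-dimensional joint vector.
p = heuristic_match([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [1.0, 2.0])
```

The difference and product terms expose how the two mixed term/definition views agree or disagree, which a plain concatenation would leave for the classifier to discover.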
Softmax Output. Since our task is to predict whether or not a taxonomic relation
exists for the given term pair, it is modeled as a binary classification problem.
Thus, the feature vector $p$ in Eq. (9) is fed into a softmax classifier for computing the
confidence of each output result:

$$o = W_1 (p \circ r) + b, \tag{10}$$

where $W_1 \in \mathbb{R}^{n_1 \times 4 d_h}$ is the transformation matrix, and $o \in \mathbb{R}^{n_1}$ is the final output of
the network. $n_1$ is equal to 2.
Loss Function and Training. Given an input instance $x_i = (x_{hypo}, d_x, y_{hyper}, d_y, 1/0)$,
our model with parameters $\theta$ outputs the vector $o$ computed in Eq. (10), which is a
2-dimensional vector, where the $i$-th value $o_i$ of $o$ is the probability score for determining
whether the taxonomic relation of the term pair $(x_{hypo}, y_{hyper})$ holds or not, and the
values in $o$ sum to 1. To obtain the conditional probability $p(i \mid x_i, \theta)$, we apply a softmax
operation as:

$$p(i \mid x_i, \theta) = \frac{e^{o_i}}{\sum_k e^{o_k}}, \tag{11}$$

Given all of our training instances $T = \{x_i \mid i = 1, 2, ..., N\}$, we can then define
the negative log-likelihood loss function:

$$J(\theta) = -\sum_{i=1}^{N} \log p(y_i \mid x_i, \theta), \tag{12}$$

where $\theta$ denotes the parameters in our model. We train the model through a simple
optimization technique called stochastic gradient descent (SGD) over shuffled
mini-batches with the Adadelta rule. Regularization is implemented by dropout [11] and
an L2 norm penalty.
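As a small worked illustration of Eqs. (11) and (12), the sketch below computes the softmax probabilities and the summed negative log-likelihood for a batch of 2-dimensional network outputs. The function names are hypothetical, and regularization and the Adadelta update are deliberately left out.

```python
import math


def softmax(o):
    """Eq. (11): turn a score vector into probabilities (max-shifted for stability)."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    z = sum(exps)
    return [e / z for e in exps]


def nll_loss(outputs, labels):
    """Eq. (12): negative log-likelihood summed over all training instances.

    outputs : list of 2-dimensional score vectors o
    labels  : list of gold labels y_i in {0, 1}
    """
    return -sum(math.log(softmax(o)[y]) for o, y in zip(outputs, labels))


# A confident correct prediction contributes less loss than a confident wrong one.
loss_correct = nll_loss([[0.0, 3.0]], [1])
loss_wrong = nll_loss([[0.0, 3.0]], [0])
```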
4.1 Dataset
Random and Lexical Dataset Splits. As pointed out by Levy et al. [13], most supervised
distributional lexical inference methods tend to learn the semantics of a
single term instead of learning the relation between two terms; this is known as the
“lexical memorization” phenomenon. To address this, Levy et al. [13] suggested
splitting the train and test sets such that each of them contains distinct term
pairs, preventing the model from overfitting during training.
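One common way to realize such a lexical split is to partition the vocabulary first and then keep only the pairs whose two terms fall into the same partition. The sketch below follows that reading; it is an assumption about the procedure rather than the authors' exact script, it produces only a train/test partition (no validation set), and the name `lexical_split` and the default fraction are hypothetical.

```python
import random


def lexical_split(pairs, test_fraction=0.25, seed=0):
    """Split (x, y, label) triples so that train and test share no terms."""
    vocab = sorted({term for x, y, _ in pairs for term in (x, y)})
    rng = random.Random(seed)
    rng.shuffle(vocab)
    n_test = max(1, int(len(vocab) * test_fraction))
    test_vocab = set(vocab[:n_test])

    train, test = [], []
    for x, y, label in pairs:
        x_in_test, y_in_test = x in test_vocab, y in test_vocab
        if x_in_test and y_in_test:
            test.append((x, y, label))
        elif not x_in_test and not y_in_test:
            train.append((x, y, label))
        # Pairs that straddle both vocabularies are dropped.
    return train, test


# Toy data, made up for illustration.
toy_pairs = [
    ("cat", "animal", 1), ("dog", "animal", 1),
    ("oak", "plant", 1), ("rose", "plant", 1),
    ("cat", "plant", 0), ("oak", "animal", 0),
]
train_pairs, test_pairs = lexical_split(toy_pairs, test_fraction=0.5, seed=3)
```

Because mixed pairs are discarded, a lexical split typically yields fewer instances than a random split of the same data.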
To investigate such behaviors, we also apply a lexical split to our
datasets. In this case, we maintain roughly a ratio of 14:5:1 for the training set, test set and
validation set, partitioned randomly. Moreover, we maintain roughly a ratio of 8:1 for
positive instances and negative instances in the random or lexical splits of our datasets. The
overall statistics of the datasets used in the experiments are summarized in Table 1. We
briefly summarize each dataset below.
– BLESS dataset [3] (see footnote 3). It consists of 200 distinct, unambiguous concepts, each of
which is involved with other terms, called relata, in some relations. We extract from
BLESS 12,994 pairs of terms for the following four types of relations: taxonomic
relation, “meronymy” (a.k.a. part-of relation), “coordinate” (i.e., two terms having
the same hypernym), and random relations. From these term pairs, we set taxonomic
relations as positive instances, while the other relations form the negative instances.
– Conceptual Graph. This is a popular taxonomic benchmark dataset derived from the
Microsoft Concept Graph project (see footnote 4). It contains more than 5 million unique concepts,
12 million unique entities and 85 million taxonomic relations. We randomly pick
term pairs with possible (direct and indirect) taxonomic relations and high
frequencies as the positive instances, while term pairs whose relation frequencies
are relatively lower serve as negative instances.
– WebIsA-Animal, Plant dataset. WebIsA (see footnote 5) is a publicly available large-scale database
containing more than 400 million taxonomic relation pairs. In this work, we select
a 1.1M subset pertaining to two specified domains (i.e., the classes “Animal”
and “Plant”). The positive instances are created by extracting all possible (direct
and indirect) taxonomic relations from the taxonomies. The negative instances are
generated by randomly pairing two terms that do not have any taxonomic relation.
3 https://sites.google.com/site/geometricalmodels/shared-evaluation.
4 https://concept.research.microsoft.com/Home/Download.
5 http://webdatacommons.org/isadb/.
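The negative-instance generation described for the WebIsA subsets (randomly pairing two terms with no taxonomic relation) could look roughly like the following. This is a sketch under assumptions: the positive set is taken to already contain direct and indirect relations, a pair is rejected if it is related in either direction, and the name `sample_negatives` and its parameters are hypothetical.

```python
import random


def sample_negatives(positives, n, seed=0, max_tries=10000):
    """Randomly pair terms that are not taxonomically related in either direction.

    positives : list of (hyponym, hypernym) pairs, direct and indirect
    n         : number of negative pairs requested
    """
    related = set(positives) | {(y, x) for x, y in positives}
    terms = sorted({t for pair in positives for t in pair})
    rng = random.Random(seed)
    negatives = set()
    for _ in range(max_tries):
        if len(negatives) >= n:
            break
        x, y = rng.sample(terms, 2)  # two distinct terms
        if (x, y) not in related:
            negatives.add((x, y))
    return sorted(negatives)


# Toy positives, made up for illustration.
toy_positives = [("cat", "animal"), ("dog", "animal"), ("oak", "plant")]
toy_negatives = sample_negatives(toy_positives, n=3)
```

The `max_tries` bound keeps the loop finite when the requested number of negatives cannot be reached.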
Term Definition Collection. In this work, we pick two types of structured knowledge
sources, namely WordNet and the complete English Wikipedia, for extracting definitive
descriptions for term pairs. We chose WordNet as the source of textual definitions
partly because WordNet has been used for related tasks before, e.g., Snow et al. [26]
constructed a larger taxonomic relation dataset based on WordNet, and partly because
a number of accurate sense-level definitions of terms are available in WordNet.
However, some studies, e.g., Shwartz et al. [24], claimed that almost all knowledge
resources have only limited coverage, particularly for rare or recent pairs, e.g.,
(“Bullet tuna”, “fish”), (“lead acid battery”, “automobile”). The English Wikipedia can
therefore be viewed as a complementary data source to WordNet. Concretely, for each term pair
in the training set, we first try to extract their respective definitions from WordNet based
on the term strings (see footnote 6). For the few pairs that contain terms not covered by WordNet,
we then switch to the Wikipedia article in which the term is involved, and select the top-2
sentences of the first paragraph in the introductory section as its definitive description.
If both knowledge resources fail, we set the definition to be the term string itself.
As a result, we obtain the training instances in the form of $(x_{hypo}, d_x, y_{hyper}, d_y, 1/0)$
for our experiments, where each term is accompanied by its definition.
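The fallback order described above (WordNet first, then Wikipedia, then the term string itself) can be sketched as follows. The two dictionaries are hypothetical stand-ins for real WordNet and Wikipedia lookups, the toy entries are made up for illustration, and the function names are not from the paper.

```python
def collect_definition(term, wordnet_defs, wikipedia_intro):
    """Return a definition for `term` using the WordNet -> Wikipedia -> term fallback.

    wordnet_defs    : dict mapping a term to its sense definitions, most frequent first
    wikipedia_intro : dict mapping a term to the sentences of its first intro paragraph
    """
    senses = wordnet_defs.get(term)
    if senses:
        return senses[0]  # top-1 sense definition
    sentences = wikipedia_intro.get(term)
    if sentences:
        return " ".join(sentences[:2])  # top-2 introductory sentences
    return term  # both resources failed: fall back to the term string


def build_instance(x, y, label, wordnet_defs, wikipedia_intro):
    """Assemble a training instance (x_hypo, d_x, y_hyper, d_y, label)."""
    d_x = collect_definition(x, wordnet_defs, wikipedia_intro)
    d_y = collect_definition(y, wordnet_defs, wikipedia_intro)
    return (x, d_x, y, d_y, label)


# Toy lookup tables, made up for illustration.
toy_wordnet = {"fish": ["any of various mostly cold-blooded aquatic vertebrates"]}
toy_wikipedia = {"bullet tuna": ["The bullet tuna is a species of tuna.",
                                 "It is found in tropical waters.",
                                 "A third sentence that is ignored."]}
instance = build_instance("bullet tuna", "fish", 1, toy_wordnet, toy_wikipedia)
```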
Pre-trained Word Embeddings. To cover as many words in the terms and definitions as possible
and to provide better initial input for our model, we use two large-scale
textual corpora to train word embeddings. The first contains 570M entity content pages
consisting of approximately 40 million sentences extracted from the English Wikipedia (see footnote 7).
The second is larger in size and derived from the extended abstracts of DBpedia (see footnote 8). We
employ the NLPIR system (see footnote 9) for segmentation and the Skip-gram method [15] for training.
Note that the out-of-vocabulary words in the training set are randomly initialized with
values sampled from a uniform distribution over the range
(−0.05, 0.05). A term or phrase with more than one word is treated as a whole
when learning its embedding, instead of using character-level embeddings.
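The out-of-vocabulary handling described above can be sketched in a few lines. The function name `build_embedding_table`, the dictionary-based layout and the toy vectors are assumptions made for illustration; only the sampling range (−0.05, 0.05) comes from the text.

```python
import random


def build_embedding_table(vocab, pretrained, dim, seed=0):
    """Look up pre-trained vectors and randomly initialize out-of-vocabulary words.

    vocab      : iterable of words (or whole multi-word terms) needed by the model
    pretrained : dict mapping a word to its pre-trained vector
    dim        : embedding dimension used for the random vectors
    """
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            # Out-of-vocabulary word: sample each component uniformly.
            table[word] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return table


# Toy 2-dimensional example; "axolotl" has no pre-trained vector.
toy_table = build_embedding_table(["cat", "axolotl"], {"cat": [0.1, 0.2]}, dim=2)
```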
The pre-trained embeddings used as preliminary inputs of our model are not
updated during training, mainly for two reasons: (i) to reduce the number of parameters
needed in the training stage; (ii) to improve the generalization capability of the model, as it
ensures that the words in training and the new words in the test set lie in the same space.
6 As WordNet sorts sense definitions by sense frequency [17], we only choose the top-1 sense
definition to denote a term.
7 https://dumps.wikimedia.org/wikidatawiki/20180320/.
8 http://tagesnetzwerk.de.
9 https://github.com/NLPIR-team/NLPIR.
Term Definitions for Taxonomic Relation Identification 11
Compared Methods. In a series of previous works [12, 26, 27], several pattern-based,
text-based inference methods have been applied to taxonomic relation identification.
Their experiments showed that these methods achieve an F-score lower than 65% in
most cases, so they are not strong baselines to compare with our approach.
To draw convincing conclusions, in the experiments we use the following
competitive baseline methods for comparison:
– Word2Vec + SVM. This model first obtains the two term embeddings by applying
the Skip-gram method [15] on the same corpus used for training our pre-trained
word embeddings, and then combines their vectors to train an off-the-shelf
SVM classifier for taxonomic relation identification. Note that, in the Skip-gram
model, if a term contains more than one word, its embedding is calculated as the average
of the embeddings of all words in the term.
– DDM + SVM [29]. This learns term embeddings via a dynamic distance-margin
model, and then an SVM classifier is trained on the concatenation of term pair vectors
for taxonomic relation detection.
– DWNN + SVM [1]. This is an extended version of DDM [29]; it not only utilizes the
information of hypernyms and hyponyms, but also considers the contextual
information between them via a dynamic weighting neural network when learning term
embeddings.
– Best unsupervised (dependency-based context) [25]. This is the best unsupervised
method, which implements similarity measurement over weighted dependency-based
context vectors.
– Ours_SubInput. This is a variant of our method. The collection $\{(x_{hypo}, y_{hyper}),
(x_{hypo}, d_y), (d_x, y_{hyper}), (d_x, d_y)\}$ provided for each input vector pair is changed
to its sub-collection $\{(x_{hypo}, y_{hyper}), (d_x, d_y)\}$. We use the subscript SubInput to
denote this setting.
– Ours_Concat. This is a variant of our method. To form the joint representation of
distributional features, we directly concatenate the four separate
representations from term pairs, namely $p_{tt}$ from $(x_{hypo}, y_{hyper})$, $p_{td}$ from $(x_{hypo}, d_y)$,
$p_{dt}$ from $(d_x, y_{hyper})$, and $p_{dd}$ from $(d_x, d_y)$, rather than relying on the heuristic
matching. We use the subscript Concat to denote this setting.
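For the Word2Vec + SVM baseline listed above, the feature construction can be sketched as follows: a multi-word term is represented by the average of its word vectors, and the two term vectors are then combined. Combining by concatenation is an assumption here (the text only states it explicitly for DDM + SVM); the function names and toy vectors are hypothetical, and the SVM training step itself is omitted.

```python
def term_vector(term, word_vectors):
    """Average the vectors of all words in a (possibly multi-word) term."""
    vectors = [word_vectors[w] for w in term.split()]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def pair_features(x, y, word_vectors):
    """Concatenate the hyponym and hypernym vectors into one feature vector."""
    return term_vector(x, word_vectors) + term_vector(y, word_vectors)


# Toy 2-dimensional word vectors, made up for illustration.
toy_vectors = {"bullet": [1.0, 3.0], "tuna": [3.0, 5.0], "fish": [0.0, 1.0]}
features = pair_features("bullet tuna", "fish", toy_vectors)
```

The resulting feature vectors would then be passed, together with the 1/0 labels, to an off-the-shelf SVM implementation.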
relations in the datasets for different methods. As for the Lexical Splits, we report the
Mean Average Precision (P), Mean Average Recall (R), and Mean Average F-score (F).
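For reference, precision, recall and F-score for the positive (taxonomic) class can be computed as in the sketch below. It covers a single evaluation run only; how the reported mean values are averaged (for example over repeated runs) is not specified in this passage, and the function name is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall and F-score for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# One true positive, one false positive, one false negative.
scores = precision_recall_f1(gold=[1, 1, 0, 0], predicted=[1, 0, 1, 0])
```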
Parameter Tuning. We conducted extensive experiments to determine the optimal
configuration of parameters for our model. There are two types of parameters in the model:
the first type includes the weights and biases in the model layers, which are initialized
randomly and learned in each iteration; the second type includes the hyper-parameters
that must be configured manually. In particular, we select the six most common
hyper-parameters for our model, namely the dimension of word embedding $d_m$ in the
input representation layer, the number of hidden units of all the LSTMs $d_h$, the
convolutional filter length $Fl$, the number of filters $Nr$ in the CNN component of the baseline
system, the learning rate $lr$, and the dropout ratio $\rho$. Since the weight $W$ and bias $b$
in each neural layer are learned automatically during training,
we focus on tuning the hyper-parameters $d_m$, $d_h$, $Fl$, $Nr$, $lr$, and $\rho$. In practice, we
train our models with a batch size of 128 for at most 100 epochs for each experiment
and perform a parameter selection strategy (see footnote 10) on the validation set to tune the parameters of
the model for better convergence. Finally, we set $d_m = 300$, $d_h = 300$, $Fl = 3$, $Nr = 2$,
$lr = 10^{-4}$, $\rho = 0.1$, and use early stopping on validation accuracy. Our model is implemented
using the TensorFlow (see footnote 11) machine learning framework.
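The reported settings and the early-stopping rule can be summarized in a small sketch. The dictionary keys, the `patience` parameter and the `run_epoch` callback are hypothetical; the text only states that early stopping is based on validation accuracy, not how many non-improving epochs are tolerated.

```python
# Hyper-parameter values as reported in the text; the key names are illustrative.
CONFIG = {
    "d_m": 300,             # word embedding dimension
    "d_h": 300,             # hidden units of the LSTMs
    "filter_length": 3,     # Fl
    "num_filters": 2,       # Nr
    "learning_rate": 1e-4,  # lr
    "dropout_ratio": 0.1,   # rho
    "batch_size": 128,
    "max_epochs": 100,
}


def train_with_early_stopping(run_epoch, max_epochs=100, patience=5):
    """Stop when validation accuracy has not improved for `patience` epochs.

    run_epoch : callable taking the epoch index, training for one epoch and
                returning the validation accuracy (hypothetical callback)
    Returns (best_epoch, best_accuracy).
    """
    best_acc, best_epoch = -1.0, -1
    for epoch in range(max_epochs):
        acc = run_epoch(epoch)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_acc


# Simulated validation accuracies instead of real training.
simulated = [0.5, 0.6, 0.7, 0.65, 0.6, 0.6]
result = train_with_early_stopping(lambda e: simulated[e],
                                   max_epochs=len(simulated), patience=2)
```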
consideration on the contextual information between the hypernym and its hyponym,
rather than their respective meanings. In contrast, our method can learn more indicative
features related to the taxonomic relations under the guidance of both the distributional
contexts and encoded definitions. Besides, we observe that
the Precision of DDM improves significantly on the open domain datasets compared to
the specific domain datasets. One possible explanation is that the method learns term
embeddings using pre-extracted taxonomic relations from Probase, and if a relation
does not exist in Probase, there is a high possibility that it becomes a negative instance and
is recognized as a non-taxonomic relation by the classifier. Therefore, the training data
extracted from Probase plays an important role in this method. For the open domain datasets
(BLESS [3] and Conceptual Graph (See footnote 4)), approximately 75%–85% of the
taxonomic relations are found in Probase, while only approximately
25%–45% of the relations in the domain-specific datasets (WebIsA-Animal (See footnote 5)
and WebIsA-Plant (See footnote 5)) can be found in Probase. Therefore, DDM
achieves better performance on the open domain datasets than on the specific ones. Our
approach, in contrast, mainly depends on corpus-independent textual definitions as
evidence; thus, it has better generalization capability and can achieve a higher
F-score on domain-specific datasets.
Inaccurate Definition. The error statistics show that this kind of error accounts for
approximately 78%, 83% and 86% of the total errors in BLESS [3], Conceptual Graph (See
footnote 4), and WebIsA (See footnote 5), respectively. For example, our model obtains
the definition “fruit with red or yellow or green skin and sweet to tart crisp whitish flesh”
for the term “Apple” in the pair (“Apple”, “IT company”); however, a correct detection
may require another definition, “Apple Inc. is an American multinational technology
company headquartered in Cupertino”, which comes from the abstract of the corresponding
English Wikipedia article (see footnote 13). This is a common issue due to the ambiguity of entity mentions.
To alleviate this problem, we will explore more advanced entity linking techniques, or
select the most accurate definition from all highly related definitions by combining the current
context with an efficient ranking algorithm.
Other errors stem from rough expressions or misleading information in the definitions.
We illustrate this with two term pairs. The definition of “volleyball” in the first term
pair (“volleyball”, “game”) is “a game in which two teams hit an inflated ball over a
high net using their hands”, and the initial prediction of our model is 1. If we remove
the phrase “a game” from the definition, the model outputs 0. Similarly, for the second
term pair (“mexico”, “latin american country”), the initial prediction of our model is
1, and the definition of “mexico” is “a republic country in the southern of latin
America”. If we remove the phrase “of latin America” from the definition, our model
outputs 0. These examples demonstrate that our model is sensitive to the prominent
context words that depict the taxonomic relation in the definitions. Meanwhile, the
definitions indeed provide richer
13
https://en.wikipedia.org/wiki/Apple_Inc.
Term Definitions for Taxonomic Relation Identification 15
knowledge for understanding the term pairs. However, when the definitions of the terms
do not provide enough information, our model cannot avoid inaccurate or misleading
errors without human-crafted knowledge.
Apart from the above situation, some term and entity pairs are rare,
e.g., (“coma”, “knowledge”), (“bacterium”, “microorganism”), and (“chromium”, “metal”).
As a consequence, it is difficult for the model to make a correct decision when it
only encodes the term meanings given by the definitions.
Other Relations. In this case, the majority of errors stem from confusing meronymy
with taxonomic relations. For example, the term pair (“paws”, “cat”) is in fact a
meronymy relation rather than a taxonomic relation. Such a problem has also been
reported in [23], and one possible solution for reducing this error is to add more
negative instances of this kind to the datasets.
The remaining errors in this case are errors of the reversed type. Based on the
previous study [29], one possible solution for reducing most errors of this type is to
integrate the learning of term embeddings, with a distance measure (e.g., the 1-norm
distance) as a feature, into the model.
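As a rough illustration of the distance feature mentioned above, the following sketch computes the 1-norm distance between two term embeddings and appends it to a simple pair representation. The function names and the toy vectors are hypothetical, and the concatenation scheme is only an assumption; it is not the integration used in [29] or in our model.

```python
from typing import List, Sequence


def l1_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the 1-norm (Manhattan) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return sum(abs(a - b) for a, b in zip(u, v))


def pair_features(hypo: Sequence[float], hyper: Sequence[float]) -> List[float]:
    """Concatenate both embeddings and append their 1-norm distance as one extra feature."""
    return list(hypo) + list(hyper) + [l1_distance(hypo, hyper)]


# Toy 3-dimensional embeddings (made-up values, for illustration only).
cat = [0.5, 1.0, -0.5]
animal = [0.25, 0.5, 0.0]
features = pair_features(cat, animal)
```

A classifier would then receive `features` instead of the two embeddings alone, so that the distance is available as an explicit signal.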
5 Conclusion
In this paper, we presented a neural network model that enhances the representations
of term pairs by incorporating their respective accurate textual definitions in order
to identify the taxonomic relation between the terms of a pair. In our experiments, we
showed that our model outperforms several competitive baseline methods and achieves an
F-score of more than 82% on two domain-specific datasets. Moreover, our model, once
trained, performs competitively on various open-domain datasets. This demonstrates the
good generalization capacity of our model. In addition, we conducted a detailed
analysis to give more insight into the error distribution.
In the future, our work can be extended in two directions. The first is to consider how
to integrate multiple types of knowledge (e.g., word meanings, definitions, knowledge
graph paths, and images) to enhance the representations of term pairs and further
improve performance. Since our model seems straightforwardly applicable to multi-class
classification with some tuning, the second is to investigate whether the model can be
used for the task of classifying multiple semantic relations.
References
1. Anh, T.L., Tay, Y., Hui, S.C., Ng, S.K.: Learning term embeddings for taxonomic relation
identification using dynamic weighting neural network. In: EMNLP, pp. 403–413 (2016)
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align
and translate. In: ICLR (2015)
16 Y. Sheng et al.
3. Baroni, M., Lenci, A.: How we blessed distributional semantic evaluation. In: Proceedings
of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pp.
1–10 (2011)
4. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural lan-
guage processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
5. Fu, R., Guo, J., Qin, B., Che, W., Wang, H., Liu, T.: Learning semantic hierarchies via word
embeddings. In: ACL (Volume 1: Long Papers), pp. 1199–1209 (2014)
6. Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM
and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)
7. Harabagiu, S.M., Maiorano, S.J., Paşca, M.A.: Open-domain textual question answering
techniques. Nat. Lang. Eng. 9(3), 231–267 (2003)
8. Harris, Z.S.: Distributional structure. Word 10(2–3), 146–162 (1954)
9. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: COLING, pp.
539–545. Association for Computational Linguistics (1992)
10. Kiela, D., Rimell, L., Vulić, I., Clark, S.: Exploiting image generality for lexical entailment
detection. In: ACL-IJCNLP (Volume 2: Short Papers), pp. 119–124 (2015)
11. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
12. Kotlerman, L., Dagan, I., Szpektor, I., Zhitomirsky-Geffet, M.: Directional distributional
similarity for lexical inference. Nat. Lang. Eng. 16(4), 359–389 (2010)
13. Levy, O., Remus, S., Biemann, C., Dagan, I.: Do supervised distributional methods really
learn lexical inference relations? In: NAACL, pp. 970–976 (2015)
14. Liu, Y., Sun, C., Lin, L., Wang, X.: Learning natural language inference using bidirectional
LSTM model and inner-attention (2016). https://arxiv.org/abs/1605.09090
15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in
vector space. In: ICLR (Workshop Poster) (2013)
16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of
words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)
17. Miller, G.A.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)
18. Mou, L., et al.: Natural language inference by tree-based convolution and heuristic matching.
In: ACL (2016)
19. Nakashole, N., Weikum, G., Suchanek, F.: Patty: a taxonomy of relational patterns with
semantic types. In: EMNLP-CoNLL, pp. 1135–1145 (2012)
20. Navigli, R., Velardi, P., Faralli, S.: A graph-based algorithm for inducing lexical taxonomies
from scratch. In: IJCAI, pp. 1872–1877 (2011)
21. Neculoiu, P., Versteegh, M., Rotaru, M.: Learning text similarity with Siamese recurrent
networks. In: Proceedings of the 1st Workshop on Representation Learning for NLP, pp.
148–157 (2016)
22. Santus, E., Lenci, A., Lu, Q., Schulte im Walde, S.: Chasing hypernyms in vector spaces
with entropy. In: EACL, pp. 38–42 (2014)
23. Shwartz, V., Goldberg, Y., Dagan, I.: Improving hypernymy detection with an integrated
path-based and distributional method. In: ACL, pp. 2389–2398 (2016)
24. Shwartz, V., Levy, O., Dagan, I., Goldberger, J.: Learning to exploit structured resources for
lexical inference. In: CoNLL, pp. 175–184 (2015)
25. Shwartz, V., Santus, E., Schlechtweg, D.: Hypernyms under Siege: linguistically-motivated
artillery for hypernymy detection. In: EACL, pp. 65–75 (2017)
26. Snow, R., Jurafsky, D., Ng, A.Y.: Learning syntactic patterns for automatic hypernym dis-
covery. In: NIPS, pp. 1297–1304 (2004)
27. Wong, M.K., Abidi, S.S.R., Jonsen, I.D.: A multi-phase correlation search framework for
mining non-taxonomic relations from unstructured text. Knowl. Inf. Syst. 38(3), 641–667
(2014)
28. Wu, W., Li, H., Wang, H., Zhu, K.Q.: Probase: a probabilistic taxonomy for text understand-
ing. In: SIGMOD, pp. 481–492 (2012)
29. Yu, Z., Wang, H., Lin, X., Wang, M.: Learning term embeddings for hypernymy identifica-
tion. In: IJCAI, pp. 1390–1397 (2015)
30. Zeng, D., Liu, K., Lai, S., Zhou, G., Zhao, J., et al.: Relation classification via convolutional
deep neural network. In: COLING, pp. 2335–2344 (2014)
Report on the First Knowledge Graph
Reasoning Challenge 2018
Toward the eXplainable AI System
results, and Sect. 5 introduces related works. Finally, Sect. 6 considers the 2019
challenge.
annotated the semantic roles (five Ws in this case) of each clause in schema-defined
files in Google Sheets. We then normalized the notations of the subject,
verb, object, etc. and added the relationships between scenes, such as temporal
transitions. Finally, we transformed the sheet into a Resource Description Framework
(RDF) file. Thus, the knowledge graph includes the facts written in the story,
the testimonies of characters, and the contents introduced by Holmes's reasoning,
all of which are used to identify criminals. Notably, indirect
information that is not useful for criminal identification, such as emotional
landscapes, should also be incorporated into the knowledge graph. However, we leave
this as an issue for after the first year.
We then opened the knowledge graph to the public and collected methods for
identifying criminals together with their results. Application guidelines were
published on the official website (see footnote 2; in Japanese). After opening the
knowledge graph to the public, we held three orientation meetings in August,
September, and October 2018, in which more than 200 participants, including engineers
in tech ventures and researchers at universities and companies, took part in active
discussions. The application deadline was the end of October 2018, and presentations
of all applications and an awards ceremony were held at an event co-located with the
8th Joint International Semantic Technology Conference (JIST 2018).
– when, then, after, if, because, etc.: the relationship between scenes (the values
are scene IDs)
– time: absolute time at which the scene occurs (xsd:dateTime)
– source: original sentences that describe the scene (Literal in English and
Japanese)
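To make the scene properties listed above more concrete, the following sketch serializes one scene as N-Triples lines. It covers only the three kinds of properties named in the list (scene relationships, time, and source); the namespace `http://example.org/kgrc/`, the scene IDs, and the sample values are hypothetical placeholders, not the IRIs or data of the published knowledge graph, and only an English source literal is handled.

```python
from typing import Dict, List

# Hypothetical namespace; the real IRIs of the challenge data are not given here.
BASE = "http://example.org/kgrc/"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"
SCENE_LINKS = ("when", "then", "after", "if", "because")


def scene_to_ntriples(scene_id: str, props: Dict[str, str]) -> List[str]:
    """Serialize the properties of one scene as N-Triples lines."""
    subject = f"<{BASE}{scene_id}>"
    lines = []
    for name, value in props.items():
        predicate = f"<{BASE}{name}>"
        if name in SCENE_LINKS:
            # Relationship between scenes: the value is another scene ID.
            obj = f"<{BASE}{value}>"
        elif name == "time":
            # Absolute time, typed as xsd:dateTime.
            obj = f'"{value}"^^<{XSD_DATETIME}>'
        elif name == "source":
            # Original sentence, here as an English literal only.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"@en'
        else:
            raise ValueError(f"property not covered by this sketch: {name}")
        lines.append(f"{subject} {predicate} {obj} .")
    return lines


# Made-up scene, for illustration only.
triples = scene_to_ntriples(
    "s1",
    {"then": "s2", "time": "2018-01-01T09:00:00", "source": "A visitor arrives."},
)
```

Each returned string is one triple, so the output can be written line by line to an `.nt` file and loaded by any RDF store.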
3
http://knowledge-graph.jp/visualization/.
Fig. 4. (a) NRI team’s approach. (b) One of the solutions corresponding to the day of
Julia’s death: night (1) and midnight (2). Owing to the lack of facts, the approach cannot
identify whether the small poisonous animal stays in the safe or in Roylott’s room at night.
24 T. Kawamura et al.
SAT problem is a subject for future work. By using the SAT-based solution
method described here, we hope to help explore basic artificial intelligence and
machine learning technology with high interpretability in the future.
CCG2lambda [4] and AlloyAnalyzer greatly contributed to this analysis.
incident, and excluding the scenes which were inferred to occur after the
incident by the property “then”. It is inferred that the five characters who
were near the crime scene were as follows: Julia, Helen and Roylott were in their
own bedrooms and Roma was in the garden.
Next, we infer whether Helen, Roylott and Roma, other than the murdered
Julia, could move to the bedroom where Julia was. To this end, we enumerate the
connections made by the hole and describe the connections that people
cannot pass through.
Means: We implemented one part that narrows down the killing method based on the
condition of the victim and the scene on the night of the incident, and another part
that deduces the person who satisfies the necessary conditions for carrying out the
narrowed-down killing method. It was inferred from this query that the method
of killing was poisoning, and the symptoms were “dizziness”, “pale” and “no
scar”. “Murder with venomous snake” or “Venom killing” is inferred as a feasible
means for Roylott. The reason was the whip found in his room.
Total Judgement
From the above, it is inferred that Roylott killed Julia by the use of a venomous
snake for money.
Motive  | Basis
Tantrum | In a fit of anger, however, caused by some robberies which had been perpetrated in the house, he beat his native butler to death and narrowly escaped a capital sentence
Money   | Nothing was left save a few acres of ground, and the two-hundred-year-old house, which is itself crushed under a heavy mortgage
The result of the criminal prediction did not differ from the general
interpretation of TSB. As for the motives of Roylott’s crime, tantrums and
money problems were extracted, but because Roylott’s crime was premeditated,
a tantrum is inappropriate as a motive. However, since no other team in this
Knowledge Graph Reasoning Challenge considered tantrums as a motive, we found
that machine learning can be used to roughly grasp matters that are difficult
to cover by knowledge. On the other hand, our explanation method is too simple
to explain complicated procedures such as the means of the crime in TSB. In the
future, we have to consider how to construct knowledge that can explain
complicated procedures and how to associate that knowledge with the prediction
results.
1. As shown in (1) of Fig. 6, issue nodes asking who the murderer is for each
victim v appearing in G are appended to the IBIS structure.
2. A hypothesis isKilledBy(v, x), i.e., v is killed by x, is generated for each pair
of a victim v and a murderer x. As shown in (2) of Fig. 6, the hypothesis
isKilledBy(v, x) is appended to the IBIS structure as an idea node.
3. A discussion agent d(v, x) is assigned to each hypothesis isKilledBy(v, x) to
try to generate a detailed explanation of the hypothesis. The agent d(v, x) first
holds a knowledge graph $G_{v,x}$, a duplicate of G. The agent d(v, x) appends
isKilledBy(v, x) to $G_{v,x}$ and the IBIS node.
4. A facilitator agent generates questions such as “How did x kill v?” and
“Why did x kill v?” and appends them to the IBIS structure as issue
nodes, as shown in (4) of Fig. 6.
5. Each discussion agent d(v, x) tries to generate hypotheses that answer
the questions from the facilitator agent, e.g., a hypothesis how(v, x) about
how x killed v and a hypothesis why(v, x) about why x killed v. These
explanations are appended to $G_{v,x}$ and the IBIS structure.
6. Each discussion agent d(v, x) tries to disprove the hypothesis
$\mathrm{isKilledBy}(v, x')$ for each $x' \neq x$. d(v, x) tries to generate a
counterargument against $\mathrm{how}(v, x')$ and $\mathrm{why}(v, x')$. If the
counterargument is successfully generated, d(v, x) appends it to $G_{v,x'}$, the
knowledge of $d(v, x')$, and to the IBIS structure.
7. The facilitator agent evaluates each cnsstcy(v, x), a score representing the
consistency of $G_{v,x}$ including the hypothesis isKilledBy(v, x), and selects a
candidate murderer $x_v = \arg\max_x \mathrm{cnsstcy}(v, x)$ for each victim v. The
selected discussion agent $d(v, x_v)$ outputs the hypothesis $\mathrm{isKilledBy}(v, x_v)$
and its explanations $\mathrm{how}(v, x_v)$ and $\mathrm{why}(v, x_v)$.
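The selection in the last step above can be sketched as a plain arg max over candidate murderers. The function `select_murderers`, the placeholder names, and the scores below are hypothetical; how cnsstcy(v, x) is actually computed from the agents' knowledge graphs is not covered by this sketch.

```python
from typing import Callable, Dict, Iterable


def select_murderers(
    victims: Iterable[str],
    suspects: Iterable[str],
    consistency: Callable[[str, str], float],
) -> Dict[str, str]:
    """For each victim v, return the suspect x that maximizes consistency(v, x)."""
    suspects = list(suspects)
    return {v: max(suspects, key=lambda x: consistency(v, x)) for v in victims}


# Made-up consistency scores for one victim and three candidates.
scores = {("v1", "a"): 0.2, ("v1", "b"): 0.9, ("v1", "c"): 0.4}
selected = select_murderers(["v1"], ["a", "b", "c"], lambda v, x: scores[(v, x)])
```

Passing the score as a callable keeps the selection independent of how each discussion agent evaluates its own knowledge graph.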
4 Evaluation
Designing appropriate metrics is necessary for evaluating estimation and rea-
soning techniques that have explainability. In addition to leading to the correct
answer, several metrics, such as explainability, utility, novelty, and performance,
should be designed. Then, the proposed approaches are evaluated for their advan-
tages and disadvantages based on the metrics, and classified into categories that
correspond to practical use cases. The evaluation is based not only on numerical
metrics, but also on a qualitative comparison of the approaches and the com-
mon recognition of problems through discussion and peer reviews of evaluators
and applicants. The Defense Advanced Research Projects Agency eXplainable
AI (DARPA XAI) described in Sect. 5 states that the current AI techniques
have a trade-off between accuracy and explainability, so both properties should
be measured. In particular, to measure the effectiveness of the explainability,
DARPA XAI rates user satisfaction regarding its clarity and utility. Referring
to such activities, we designed the following metrics for this challenge and will
further improve them for future challenges. We first share the basic information
on the proposed approaches, and then discuss the evaluations by experts and by
the general public.
First, the following information was investigated and shared with experts in
advance. The experts were seven board members of the Special Interest Group
on Semantic Web and Ontology (SIGSWO) in the Japanese Society for Artificial
Intelligence.
Correctness of the Answer: Check whether the resulting criminal was correct,
regardless of the approach. The criminal, in this case, is the one designated in
the novel or story. When several criminals were presented and the criminal
in the novel was included among them, we judged the approach as correct but
made a note.
Feasibility of the Program: Check if the submitted program correctly worked
and the results were reproduced (excluding idea-only submissions).
Performance of the Program: Referential information on the system environment
and performance of the submitted program (excluding idea-only submissions).
Amount of Data/Knowledge to Be Used: How much did the approach use
the knowledge graph (the total number of scene IDs used)? If the approach used
external knowledge and data, we noted information about them.
Over more than a week, the experts evaluated the following aspects according
to five grades (1–5). For estimation and/or reasoning methods, they considered:
Significance: Novelty and technical improvement of the method.
Applicability: Is the approach applicable to other problems? As a guide,
3 means applicable to other novels and stories and 5 means applicable to
other domains.
Extensibility: Is the approach expected to have a further technical extension?
For example, if a problem is solved, can the process or result be further improved?
For use of knowledge and data, they considered the following:
Originality of Knowledge/Data Construction: Originality of knowl-
edge/data construction (amount × quality × process). For example, how much
external knowledge and data were prepared?
Originality of Knowledge/Data Use: How efficiently were the provided
knowledge and self-constructed knowledge used? For example, was a small set of
knowledge used efficiently, or was a large set of knowledge used to simplify the
process?
They also considered the following:
Feasibility of Idea (for Idea Only): Feasibility of idea including algorithms
and data/knowledge construction.
Logical Explainability: Is the explanation logically persuasive? As a guide,
1 indicates no explanation or evidence, 3 indicates that some evidence in any
form is provided, and 5 indicates that there is an explanation that is consistent
with the estimation and reasoning process.
Effort: Amount of effort required for the submission (knowledge/data/system).
a statistically significant difference between the first and second prizes, but the
score for the explainability was not significantly different.
In terms of the results of the experts, the averages of each metric in the first
prize were higher than those of the second prize, except for the explainability
score, which was statistically significantly higher for the second prize according
to the t-test. We should note that the standard deviations of the averages for
each metric were less than 0.1; thus, there were no big differences among their
evaluations. Among the metrics, explainability had the least variance, and the
effort required had the biggest variance.
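For readers who wish to reproduce this kind of comparison, the following sketch computes a t statistic for two sets of grades. It assumes Welch's unequal-variance form for independent samples; the text does not state which variant of the t-test was applied, and the grades below are made-up values on the 1–5 scale, not the experts' actual scores.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Made-up grades for one metric, for illustration only.
grades_first = [3, 4, 5, 4, 4]
grades_second = [2, 3, 4, 3, 3]
t_value = welch_t(grades_first, grades_second)
```

If the same experts graded both submissions, a paired test on the per-expert differences may be the more suitable choice.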
Therefore, the final decision was left to the expert peer review. As a result,
we decided on this prize order, since the first prize's scores on the metrics other
than explainability were higher than or equal to those of the second prize. At the
same time, the evaluation of the estimation and reasoning techniques including
explainability, which was a key goal of this challenge, was left to future
challenges. In addition to the first and second prizes, we gave a best resource
prize and a best idea prize based on the comments of the experts.
5 Related Work
In terms of AI development with explainability, the Defense Advanced Research
Projects Agency (DARPA) started the eXplainable AI (XAI) project in 2017.
DARPA XAI is a research and development project to help soldiers understand,
trust, and manage future AI partners (see footnote 4), and it is developing machine
learning techniques that generate more explainable models while retaining a high
level of learning performance. At the same time, the model should be able to
translate its output into an explanation that is more understandable and useful to
human users, using the latest human-computer interaction (HCI) techniques. The
integration of the eXplainable AI model and human interaction was intended from the
beginning of the project. Specifically, two tasks corresponding to the DARPA
missions, data analytics and autonomy, were set as problems to be solved. The data
analytics task is technically a classification problem on multimedia data, in which
the system indicates the basis for its decision to the human analyst when
automatically identifying targets from images. The autonomy task is a reinforcement
learning problem for an autonomous system, such as the type used in drones and
robots, in which the system presents to human operators using the autopilot mode
why the next action was selected in a given situation. To indicate the reason, three
methods are discussed. Deep explanation shows which features are important for
identification in deep learning [2]. Interpretable models mainly use random forests,
Bayesian networks, and probabilistic logics, and they show the meanings and
correlations of nodes in the constructed network. Model induction handles a model as
a black box and creates a simpler and more analytical model with the same input and
output.
The explainability of AI also has a social need. The Japanese Ministry of
Internal Affairs and Communications prepared ten general principles for AI pro-
motion and its risk reduction in 2018. Although these are not rules, they are
4
https://www.darpa.mil/program/explainable-artificial-intelligence.
expected to stimulate public discussion in and outside Japan. The principle
of transparency (#9) states that service providers and business users of AI
must pay attention to the verifiability of input and output, and to the
explainability of AI system/service results. The principle of accountability (#10)
states that service providers and business users of AI should be accountable
to stakeholders, including consumers and end-users. In the European Union
(EU), Article 22 of the General Data Protection Regulation (GDPR), in force since
May 2018, states that service providers of data-based decision-making are
responsible for safeguarding users' rights, at least the right to obtain human
intervention.
Accordingly, in top conferences on AI and neural networks, such as IJCAI,
AAAI, NIPS, and ICML, papers and workshops that have “explainability” as a
keyword and that analyze the properties of AI models have increased significantly
since 2016. However, there is no research activity like this challenge, which uses
knowledge graphs including social problems as common test sets and tries to
solve the problems with explainability, aiming to integrate inductive estimation
and deductive reasoning.
Although the knowledge graphs and schema constructed for this challenge are our
original work, related works include EventKG [8], ECG [9], and Drammar [10].
Knowledge graphs such as Wikidata and DBpedia focus on entities such as persons
and objects, whereas EventKG is a knowledge graph that describes 690,000 historic
and modern events to support question answering and the generation of histories
(timelines) from specific aspects. It uses a schema that extends temporal relation
expressions based on the Simple Event Model [11]. Although there are similarities to
our schema, e.g., definitions of event relationships, the granularity of their events
is much coarser than that of our scenes; thus, it is difficult to describe who, whom,
and what for each scene using the EventKG schema (see footnote 5). ECG provides a
schema for annotating extracted information when directly constructing a knowledge
graph from news events described in natural language. However, since it is
intended for automatic extraction, the schema is simple and only includes who, what,
where, and when. Drammar focuses on fictional content and aims not only to
express the content sequentially but also to present the narrative content
dramatically, and it defines a schema (or ontology) including conflicts between
characters, story segmentation, emotional expression, and belief. It is an intensive
work constructed after the analysis of several dramas, but it differs from our
schema, which expresses facts and relations in real society.
References
1. Mehdi, G., et al.: Semantic rule-based equipment diagnostics. In: d’Amato, C.,
et al. (eds.) ISWC 2017. LNCS, vol. 10588, pp. 314–333. Springer, Cham (2017).
https://doi.org/10.1007/978-3-319-68204-4_29
2. Biran, O., Cotton, C.: Explanation and justification in machine learning: a survey.
In: Proceedings of IJCAI 2017 Workshop on Explainable AI (2017)
3. Fauna of India Wiki. https://en.wikipedia.org/wiki/Fauna_of_India. Accessed 18
Jan 2019
4. Mineshima, K., Tanaka, R., Gomez, P.M., Miyao, Y., Bekki, D.: Building compo-
sitional semantics and higher-order inference system for a wide-coverage Japanese
CCG parser. In: Proceedings of the 2016 Conference on Empirical Methods in
Natural Language Processing, pp. 2236–2242 (2016)
5. Kitagawa, K., Shiramatsu, S., Kamiya, A.: Developing a method for quantifying
degree of discussion progress towards automatic facilitation of web-based discus-
sion. In: Lujak, M. (ed.) AT 2018. LNCS (LNAI), vol. 11327, pp. 162–169. Springer,
Cham (2019). https://doi.org/10.1007/978-3-030-17294-7_12
6. Ikeda, Y., Shiramatsu, S.: Generating questions asked by facilitator agents using
preceding context in web-based discussion. In: Proceedings of the 2nd IEEE Inter-
national Conference on Agents, pp. 127–132 (2017)
7. Noble, D., Rittel, H.W.: Issue-based information systems for design. In: Proceed-
ings of the Computing in Design Education, pp. 275–286 (1988)
8. Gottschalk, S., Demidova, E.: EventKG: a multilingual event-centric temporal
knowledge graph. In: Gangemi, A., et al. (eds.) ESWC 2018. LNCS, vol. 10843, pp.
272–287. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93417-4_18
9. Rospocher, M., et al.: Building event-centric knowledge graphs from news. J. Web
Semant. 37–38, 132–151 (2016)
10. Lombardo, V., Damiano, R., Pizzo, A.: Drammar: a comprehensive ontological
resource on drama. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11137,
pp. 103–118. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_7
11. van Hage, W.R., Malaise, V., Segers, R., Hollink, L., Schreiber, G.: Design and use
of the simple event model (SEM). J. Web Semant. 9(2), 128–136 (2011)
Violence Identification in Social Media
1 Introduction
In recent years, the rapid growth and popularity of social media, through
social networks and channels, blogs, forums, or any public Internet resource
for interpersonal communication, have motivated people to share
opinions, thoughts, and concerns about the world around them. Users often
transmit emotions implicitly and exhibit patterns of conduct in response to
people or specific topics, which can describe social phenomena. One important
social issue in social media is violence; it may include blaming, verbal
c Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 35–49, 2020.
https://doi.org/10.1007/978-3-030-41407-8_3
36 J. Vizcarra et al.
assault, humiliation, intimidation, etc. Some types of violence are, for instance,
gender, elder, or relationship abuse, threats, intentional frightening, and excessive
criticism, which can lead to physical aggression. Hence the importance of
an in-depth analysis and identification of violent patterns in user-generated
data in order to identify endangered groups and potential aggressors. Once the
identification is performed, prevention mechanisms can be applied through
social plans.
Current efforts in violence identification that motivated our proposal are
carried out by international organizations. For instance, (1) the
World Health Organization (WHO) published in March 2017 its estimation of
mortality caused by interpersonal violence (global health estimates 2015: deaths
by cause, age, sex, by country and by region, 2000–2015) [16]. In this study,
countries such as Mexico, Brazil, Colombia, India, Pakistan, and Nigeria were listed
as having a high number of deaths caused by interpersonal violence.
Furthermore, in January 2016, in a historic summit, (2) the United Nations (UN)
established the sustainable development goals (SDGs) (17 goals to transform
the world) [2], to be accomplished by 2030. Goal 16 encourages the UN member
countries to promote just, peaceful, and inclusive societies where
all forms of violence and related death rates are significantly reduced
everywhere. Moreover, goal 5 stresses the importance of “gender equality
where women and girls continue to suffer discrimination and violence in every
part of the world”. Gender equality is not only a fundamental human right, but
a necessary foundation for a peaceful, prosperous and sustainable world.
Based on these motivations, the present work aims at identifying
violence in comments on social media. To address this gap, the methodology focuses
on a better understanding of content, context, and sense with a semantic approach.
The main contributions are: (1) the identification of violent comments based on
well-defined knowledge graphs; (2) a methodology that computes similarity
measures and conceptual distances in order to discover semantically violent content
on social media.
2 Background
We briefly describe the current state of the art related to our proposal by listing
some research lines and works. Regarding topic analysis in social
media, Garimella et al. [11] constructed a conversation graph about
a topic and measured the amount of controversy from characteristics of the graph.
Analyzing the relation between opinion and topic, Xiong et al. [18]
proposed an opinion model based on topic interactions, individual opinions, and topic
features, which are represented by a multidimensional vector in order to
measure a user’s action towards a specific topic. For topic discovery,
Davis et al. [9] proposed an unsupervised methodology that identifies new
topics prevalent in both social media and news. The work was able to rank topics
by relevance, media focus, user attention, and level of interaction. Georgiou
et al. [12] proposed a topic and community detection algorithm utilizing social
3 Methodology
This section describes our contribution in terms of the four main stages that
compose the methodology. In the first stage, “knowledge base construction”, the
knowledge is described and the types of violence are defined. The second stage,
“social media data collection”, retrieves and stores comments and maps the social
During the violence description, subjects that belong to only one category
related to violence are discarded from further processing because of their low
level of violence. This exclusion aims to reduce the misclassification of comments
with violence.
Natural Language Processing. In this step, each concept in the comment is
pre-processed so that the next stages of the methodology receive terms that
match the knowledge base. The processes involved are as follows:
– Tokenizer. A sequence of strings is divided into individual words called
tokens.
– Removal of stop words. If a concept belongs to a stop word list (words with
little meaning) it is removed. The words related to negation (no, not, etc.)
are first identified in order to compute the negation processing; afterwards
they are removed.
– Lemmatization. The purpose of this processing is to reduce words (inflected
or derived) to their lemma (dictionary form). Each concept is reduced to its
lemma by using Stanford CoreNLP [14].
– Removal of unknown concepts in the knowledge graph. This process reduces the
number of words by discarding concepts that cannot be located in the knowledge
graph, which also avoids extra processing.
– Part-of-speech. This process identifies the part of speech of the words in
each sentence in order to identify the negated concepts and to ease
disambiguation by limiting the number of senses for a word.
– Negation. In this process the concepts affected by negation are identified and
handled.
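The steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop word list, the lemma dictionary (a stand-in for Stanford CoreNLP), the knowledge base vocabulary, and the rule that a negation word affects the next retained concept are all hypothetical, and the part-of-speech step is omitted.

```python
import re

# Toy resources (assumptions for illustration only).
NEGATIONS = {"no", "not", "never"}
STOP_WORDS = {"the", "a", "is", "to", "of", "and"} | NEGATIONS
LEMMAS = {"hitting": "hit", "threatened": "threaten", "killed": "kill"}
KNOWLEDGE_BASE = {"hit", "threaten", "kill"}


def preprocess(comment):
    """Return (lemma, negated) pairs for concepts found in the knowledge base."""
    tokens = re.findall(r"[a-z']+", comment.lower())  # tokenizer
    concepts = []
    negate_next = False
    for token in tokens:
        if token in NEGATIONS:
            # Negation words are noted first, then dropped like other stop words.
            negate_next = True
            continue
        if token in STOP_WORDS:
            continue
        lemma = LEMMAS.get(token, token)  # lemmatization (dictionary stand-in)
        if lemma not in KNOWLEDGE_BASE:
            continue  # unknown concept: discard
        concepts.append((lemma, negate_next))
        negate_next = False
    return concepts


concepts = preprocess("He is not hitting the dog")
```

Here `concepts` holds the single pair `("hit", True)`, i.e. the concept "hit" marked as negated.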
The final step is to select the main type(s) of violence T vx for a comment
SCcommentx by either (1) selecting all the resources T vx linked to a comment
where the value of lvRcommentx is higher than a pre-established threshold value
or (2) selecting the resource T vx with the highest lvRcommentx value (Eqs. 6 and 7).
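The two selection strategies can be illustrated as follows. This is a minimal sketch: the function name, the type names and the score values are hypothetical, and the scores stand in for the level-of-violence values that the methodology computes per type.

```python
def select_violence_types(scores, threshold=None):
    """Pick violence types for one comment.

    scores: dict mapping a violence type to its level-of-violence value.
    threshold: if given, strategy (1) keeps every type scoring above it;
               otherwise strategy (2) keeps only the highest-scoring type.
    """
    if not scores:
        return []
    if threshold is not None:
        return sorted(t for t, value in scores.items() if value > threshold)
    return [max(scores, key=scores.get)]


example = {"physical": 0.6, "verbal": 0.3, "sexual": 0.05}  # made-up values
above_threshold = select_violence_types(example, threshold=0.1)
top_only = select_violence_types(example)
```

With these made-up values, strategy (1) keeps "physical" and "verbal", while strategy (2) keeps only "physical".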
4 Evaluation
The evaluation that considers “level of violence” related to comments for L4,
L3 and baselines is shown in Fig. 6.
Table 2 presents the confusion matrix considering the processing “level L3
and L4”.
Evaluation with Twitter CNN’s Account. This section introduces the results and
evaluation on Twitter for the account “CNN news”. Some relevant examples
processed by our methodology are presented in Table 3.
The evaluation of our methodology on Twitter (level L4) against the baseline
lexical matching (Baseline LM) is presented in Fig. 7.
The confusion matrix for the identification of violent and nonviolent hashtags
is presented in Table 4. It is important to notice that the identification of violent
hashtags performed better than nonviolent using a threshold >0.1 (methodology
Eq. 6).
5 Conclusions
research whose evaluation was performed on the same dataset. In addition, during
the identification we explored increasing the levels of expansion in the
discovery of concepts with violence through levels (iterations) 3 and 4. The
expansion up to level 4 achieved the best precision, but it limited the number of
comments processed due to the high consumption of resources such as memory and
processing time.
Regarding the influence of negation, this processing was included due to the
high number of comments that presented this scenario. The negation handling
improved the performance of our methodology because negation is frequently used
in violent comments.
Some of the main advantages of our proposal are: (1) it attempts to understand
the content at the semantic level; (2) the violence identification executes a
disambiguation process in order to discover the context and the words’ senses in
the estimations; (3) as the methodology is based on a well-structured knowledge
base, the definition of concepts related to violence is more accurate and covers
a wider number of types of violence; and (4) the system is adaptable: it can be
focused on domains other than violence just by modifying the knowledge base.
The results obtained in the present work can be consulted in the GitHub
repository: https://github.com/samscarlet/SBA/tree/master/ViolenceAnalysis.
Acknowledgments. This work was supported in part by the Council for Science,
Technology and Innovation, “Cross-ministerial Strategic Innovation Promotion
Program (SIP), Big-data and AI-enabled Cyberspace Technologies” (funding agency:
NEDO), JSPS KAKENHI Grant Number JP17H01789 and CONACYT.
References
1. Princeton university “about wordnet.” wordnet. Princeton university (2010).
http://wordnet.princeton.edu
2. Assembly, G.: Sustainable development goals. SDGs), Transforming our world: the
2030 (2015)
3. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia:
a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC/ISWC -2007.
LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007).
https://doi.org/10.1007/978-3-540-76298-0_52
4. Birjali, M., Beni-Hssane, A., Erritali, M.: Machine learning and semantic senti-
ment analysis based algorithms for suicide sentiment prediction in social networks.
Procedia Comput. Sci. 113, 65–72 (2017)
5. Bond, F., Baldwin, T., Fothergill, R., Uchimoto, K.: Japanese SemCor: a sense-
tagged corpus of Japanese. In: Proceedings of the 6th Global WordNet Conference
(GWC 2012), pp. 56–63 (2012)
6. Bond, F., Foster, R.: Linking and extending an open multilingual wordnet. In:
Proceedings of the 51st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), vol. 1, pp. 1352–1362 (2013)
7. Cheng, Q., Li, T.M., Kwok, C.L., Zhu, T., Yip, P.S.: Assessing suicide risk and
emotional distress in chinese social media: a text mining and machine learning
study. J. Med. Internet Res. 19(7), e243 (2017)
8. Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection
and the problem of offensive language. In: Proceedings of the 11th International
AAAI Conference on Web and Social Media, ICWSM 2017, pp. 512–515 (2017)
9. Davis, D., Figueroa, G., Chen, Y.S.: SociRank: identifying and ranking prevalent
news topics using social media factors. IEEE Trans. Syst. Man Cybern. Syst. 47(6),
979–994 (2016)
10. Dokuz, A.S., Celik, M.: Discovering socially important locations of social media
users. Expert Syst. Appl. 86, 113–124 (2017)
11. Garimella, K., Morales, G.D.F., Gionis, A., Mathioudakis, M.: Quantifying con-
troversy on social media. ACM Trans. Soc. Comput. 1(1), 3 (2018)
12. Georgiou, T., El Abbadi, A., Yan, X.: Extracting topics with focused communities
for social content recommendation. In: Proceedings of the 2017 ACM Conference
on Computer Supported Cooperative Work and Social Computing, pp. 1432–1443.
ACM (2017)
13. Isahara, H., Bond, F., Uchimoto, K., Utiyama, M., Kanzaki, K.: Development of
the Japanese wordnet (2008)
14. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky,
D.: The stanford CoreNLP natural language processing toolkit. In: Association
for Computational Linguistics (ACL) System Demonstrations, pp. 55–60 (2014).
http://www.aclweb.org/anthology/P/P14/P14-5010
15. Nguyen, T., O’Dea, B., Larsen, M., Phung, D., Venkatesh, S., Christensen, H.:
Using linguistic and topic analysis to classify sub-groups of online depression
communities. Multimed. Tools Appl. 76(8), 10653–10676 (2017)
16. World Health Organization: World health statistics 2015. World Health Organiza-
tion (2015)
17. Vizcarra, J., Kozaki, K., Ruiz, M.T., Quintero, R.: Content-based visualization
system for sentiment analysis on social networks. In: JIST (2018)
18. Xiong, F., Liu, Y., Wang, L., Wang, X.: Analysis and application of opinion model
with multiple topic interactions. Chaos Interdisc. J. Nonlinear Sci. 27(8), 083113
(2017)
19. Yao, H., Xiong, M., Zeng, D., Gong, J.: Mining multiple spatial-temporal paths
from social media data. Future Gener. Comput. Syst. 87, 782–791 (2018)
Event-Oriented Wiki Document
Generation
1 Introduction
An event means a particular thing happening at a specific time and place [1].
An event is usually described by multiple related news articles, which describe
it from different aspects in an unorganized way. Human-constructed Wikipedia
articles for events condense these related news articles into a more organized
form, which is more comprehensive and detailed, helping readers to learn about
the event more efficiently. However, writing a Wikipedia document manually can
be time-consuming and difficult, so automating the writing process is a valuable
research topic.
There have been various methods aiming to automatically generate
Wikipedia documents, which generally employ a two-step structure: first induce
some topics from existing related Wikipedia articles as the content table, then
collect a summary for each topic from web news. However, there exist several
flaws in these methods. First, they usually use a single-layer content table,
neglecting the widespread use of multi-layer content tables as shown in Fig. 1,
which are able to depict the whole event at different granularities. Second,
they fail to explicitly utilize the word distribution of topics. Third, they do
not filter out the noise that is inevitable in web news, posing a potential
threat to the quality of generated documents.
Fig. 1. The data flow of our model. A multi-layer topic template called a topic
tree, along with the word distributions of topics, is induced from existing
Wikipedia documents of a certain event type. Then we collect candidate news
excerpts related to the target event from the Internet, and generate summaries
for each topic to get the Wikipedia article for the target event.
In this paper, we propose a new model named WikiGen to automatically
generate the corresponding Wikipedia document for a new event. This model
consists of two parts: topic tree induction and a two-step summary generation.
Given a certain event type (e.g., earthquake), we first combine structural and
textual relations to induce a topic tree from existing Wikipedia documents
belonging to the given type (e.g., 2010 Haiti Earthquake, 2010 Chile Earthquake,
etc.), then utilize the word distributions of different topics to generate a
unique summary for each topic in the previously generated topic tree: we first
coarsely identify related snippets from web data, and afterwards further select
sentences with an extractive neural network model, thus forming the final
results.
As there is no standard benchmark for event-oriented Wikipedia article gen-
eration, we constructed a dataset on three event types: earthquake, election and
tornado, which contains about 2000 documents and has 190 candidate excerpts
per document on average. We conducted extensive experiments on our dataset to
evaluate the performance of both topic tree induction and summary generation.
Experimental results show that WikiGen is capable of generating fine-grained
52 F. Zhu et al.
topic trees and high-quality Wikipedia documents. Specifically, the topic trees
we inducted retain an accuracy of about 95%, and the documents we generated
significantly outperform previous works on ROUGE-1 F1 score.
The contribution of this paper can be concluded as follows:
– We build a new dataset for event-oriented wikipedia article generation, which
contains three event categories: earthquake, election and tornado;
– We propose WikiGen, a two-step model to automatically induce multi-layer
topic trees from existing Wikipedia articles and then collect topic-specific
summaries for each topic, thus forming high-quality documents;
– We demonstrate that our model outperforms existing models, and is highly
data-efficient and interpretable.
2 Related Work
There have been some works focusing on Wikipedia document generation. In
general, these works share a common two-step structure: first determine a topic
template for the new document, then generate a topic-specific summary for each
topic in the template.
Topic Template Induction. Because human-written articles have different
subtitles due to personal preference, a general topic template should be induced
to ensure consistency. Previous works tend towards generating single-layer
templates: Google’s WikiSum [10] generates only abstracts; Sauper and Barzilay
[12] cluster the titles of existing Wikipedia articles and choose the most
common titles of each cluster as the template. WikiWrite [3] discovers similar
existing articles and copies their content tables as new templates. However,
compared with the multi-layer content tables used in real Wikipedia articles,
single-layer templates fail to reflect the hierarchical structure of points of
interest in real-world events, and lack details in different aspects. To address
this problem, Hu [7] tries to build multi-layer topic templates by combining
structural dependency and textual correlation to judge subtopic relations
between topics.
Topic-Specific Summary Generation. After determining the topic template,
generating a topic-specific summary can be viewed as a document summarization
problem. Sauper and Barzilay [12] made the first attempt in 2009, using
integer linear programming (ILP) to rank excerpts retrieved from the Internet
and then finding optimal excerpts with rank scores for each topic. WikiWrite
[3] took one step further, adding new features like sentence importance,
intra-sentence similarity and linguistic quality, and generated more fluent
documents. With the rise of deep learning, attempts to utilize neural networks
have also been made. There have been extractive models like SummaRunner [11] and
DeepChannel [14] aiming to choose core sentences from raw documents, and
abstractive methods like PointerGenerator [13] trying to capture important
information to generate the summary. In the Wikipedia document generation field,
WikiSum [10] uses a decoder-only transformer based on attention mechanism
[15] instead of ILP to summarize long sequences and achieves a state-of-the-art
3 Method
We aim to generate a Wikipedia document for a given new event name based
on existing human-authored Wikipedia articles. We assume that relevant
information can be found in a wide range of websites across the Internet;
however, noise such as irrelevant pages and advertisements needs to be dealt with.
Formally, we have three parts of inputs:
Provided with the Wikipedia document set $WK$, topic template induction aims
to find latent topics $T_c = \{t\}$ from existing sections $g$, and then identify
subtopic relations $R_c = \{(t_i, t_j)\}$ on the basis of subsection relations
$R_i = \{(g_x, g_y)\}$. Topics and subtopic relations form a topic tree
$H_c = \{T_c, R_c\}$ as the final multi-layer topic template.
Topic Discovery. There exist similar sections in Wikipedia documents, for
example, “Tectonics” and “Tectonic background”, and merging these similar
sections into one topic could greatly reduce the redundancy in our generated
document. The topic discovery process can be viewed as an unsupervised clus-
tering problem whose expected cluster number is unknown.
We use a double-pass incremental clustering algorithm to tackle this
problem. This algorithm is based on the work of Hammouda [6], which combines
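\begin{equation}
\begin{aligned}
HR_{new1} &= \frac{count(sim(d_i, d_j) > \sigma_1)}{count(d_i, d_j)} \\
HR_{new2} &= \frac{count(sim(d_i, d_j) > \sigma_2)}{count(d_i, d_j)}
\end{aligned}
\tag{1}
\end{equation}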
In practice, we use the TF-IDF similarity of all texts under two sections as
the similarity between sections. Titles that occur only once and clusters that
occur fewer than 3 times are discarded. We use the cluster list after the second
pass as the topic list $T_c$, where every topic is denoted by its most
frequently occurring title.
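The histogram ratio in Eq. 1 is the fraction of section pairs in a cluster whose similarity exceeds a threshold. The sketch below illustrates only that ratio; it is not the full double-pass algorithm. The Jaccard word overlap is a simple stand-in for the TF-IDF similarity, and the function names, the threshold value and the convention that a cluster with fewer than two members has ratio 1.0 are assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Toy similarity between two section titles (stand-in for TF-IDF)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def histogram_ratio(members, sigma, similarity=jaccard):
    """Fraction of member pairs whose similarity is above sigma (Eq. 1)."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0  # assumption: a cluster without pairs counts as coherent
    above = sum(1 for a, b in pairs if similarity(a, b) > sigma)
    return above / len(pairs)


tight = histogram_ratio(["Tectonic setting", "Tectonic background"], sigma=0.2)
loose = histogram_ratio(
    ["Tectonic setting", "Tectonic background", "Relief efforts"], sigma=0.2
)
```

Adding the unrelated section "Relief efforts" lowers the ratio from 1.0 to 1/3, which is the kind of drop an incremental clustering step can use to reject a candidate member.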
Subtopic Relation Discovery. After acquiring the topic list Tc , we need to
further identify the subtopic relation set Rc = {(ti , tj )} and combine them to
get the complete topic tree Hc .
Previous work uses a probabilistic model to build the whole tree from the top.
That model aims to maximize the occurrence probability of every topic in the
tree under its father topic, and its goal can be described as Eq. 2.
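\begin{equation}
\begin{aligned}
H^* &= \arg\max_H P(N \mid H) \\
    &= \arg\max_H P(root) \prod_{n \in N \setminus root} P(n \mid par_H(n)) \\
    &= \arg\max_H \sum_{n \in N} \log P(n \mid par_H(n))
\end{aligned}
\tag{2}
\end{equation}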
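\begin{equation}
\begin{aligned}
H^* &= \arg\max_H P(N \mid H) \\
    &= \arg\max_H \prod_{n \in N \setminus root} P(par_H(n) \mid n) \cdot \prod_{n \in N \setminus root} P(n) \\
    &= \arg\max_H \sum_{n \in N \setminus root} \log P(par_H(n) \mid n)
\end{aligned}
\tag{4}
\end{equation}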
– Textual Correlation: For topic $t_i$ and its father topic $t_j$, it can be
expected that the word distribution of $t_i$ resembles that of $t_j$, which
corresponds to the hierarchical Dirichlet model. If we use the normalized bag of
words model to measure word distribution, the weight of textual correlation can
be represented as Eq. 6, where
$Z = \frac{\prod_{w \in V} \Gamma(\alpha \phi_{t_i,w})}{\Gamma(\sum_{w \in V} \alpha \phi_{t_i,w})}$
is the normalizing factor and $\Gamma(\cdot)$ is the standard Gamma function. In
particular, when $t_j$ is the root topic and the root topic does not have any
text description, we set $\log(P_{text}(t_i \mid t_j)) = 0$.
\begin{equation}
P_{text}(t_i \mid t_j) = \frac{1}{Z} \prod_{w \in V} \phi_{t_j,w}^{\beta \phi_{t_i,w} - 1}
\tag{6}
\end{equation}
If we view every topic as a node and every relation as a directed edge, build-
ing the topic tree can be converted into finding a maximum spanning tree in
a directed graph. We utilize the classic Chu-Liu/Edmonds [4,5] algorithm to
extract a maximum spanning tree as the final topic tree H = (T, R).
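To make the conversion concrete, the following is a compact recursive sketch of the Chu-Liu/Edmonds procedure for a maximum spanning arborescence. It is an illustration, not the authors' code: the edge weights are made up (in the setting above they would be the log-probability weights of candidate subtopic relations), the function names are hypothetical, and the sketch assumes that every non-root node is reachable from the root.

```python
def find_cycle(parent):
    """Return the set of nodes on a cycle of parent pointers, or None."""
    for start in parent:
        seen = []
        v = start
        while v in parent and v not in seen:
            seen.append(v)
            v = parent[v]
        if v in seen:
            return set(seen[seen.index(v):])
    return None


def max_arborescence(edges, root):
    """edges: {(u, v): weight}. Returns {child: parent} of the best tree."""
    # Step 1: every non-root node greedily picks its heaviest incoming edge.
    best = {}
    for (u, v), w in edges.items():
        if v == root or u == v:
            continue
        if v not in best or w > edges[(best[v], v)]:
            best[v] = u
    cycle = find_cycle(best)
    if cycle is None:
        return best
    # Step 2: contract the cycle into one node and re-weight entering edges.
    c = ("cycle", frozenset(cycle))
    new_edges, origin = {}, {}
    for (u, v), w in edges.items():
        if u in cycle and v in cycle:
            continue
        if v in cycle:
            key, w2 = (u, c), w - edges[(best[v], v)]
        elif u in cycle:
            key, w2 = (c, v), w
        else:
            key, w2 = (u, v), w
        if key not in new_edges or w2 > new_edges[key]:
            new_edges[key], origin[key] = w2, (u, v)
    # Step 3: solve the smaller problem, then expand the contracted node.
    sub = max_arborescence(new_edges, root)
    result = {}
    for v, u in sub.items():
        orig_u, orig_v = origin[(u, v)]
        result[orig_v] = orig_u
    for v in cycle:
        if v not in result:  # cycle nodes keep their cycle edge
            result[v] = best[v]
    return result


edges = {("root", "A"): 5, ("root", "B"): 1, ("A", "B"): 10, ("B", "A"): 10}
tree = max_arborescence(edges, "root")
```

In this toy graph the greedy choice creates the cycle A, B; after contraction the algorithm breaks it by entering at A, giving `{"A": "root", "B": "A"}` with total weight 15.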
first coarsely filter the corpus with an augmented topic model to identify related
excerpts and reduce the input scale, then utilize neural networks to further filter
those excerpts, generating the text sg for each topic g.
Coarse Filtration. Provided excerpts $e_1, e_2, \ldots, e_n$ and topic $g$, this
step aims to discard excerpts with low relevance to $g$ and reduce the corpus to
a reasonable scale.
A reasonable approach is to rank these excerpts by relevance and pick the top
excerpts as the input of the next step. Google’s WikiSum [10] used the TF-IDF
similarity between excerpts and topic name to measure relevance, but TF-IDF is
not capable of identifying synonyms. Assume that we are looking for excerpts
belonging to the “Damage” topic: TF-IDF will neglect excerpts containing
“injury” or “death”, which often turn out to be the correct choice.
We use an augmented topic model to better weigh the contribution of synonyms.
From the previous step, we have the topic set $T = \{t\}$ and the Wikipedia text
$WK_t$ of each topic $t$. According to the topic model, every topic has its
specific word distribution, and words closely related to the topic appear more
frequently. If the contribution of a certain word to a given topic can be
measured, we can sort excerpts according to the sum of the contributions of all
words in each excerpt.
For each topic $t$, we use the bag of words model to process its Wikipedia
text $WK_t$. Its word probability distribution $F_t = \{(w, p_{w,t})\}$ can be
acquired through tokenizing and normalizing. If a word $w$ appears frequently
under topic $t$, $w$ should have a closer connection to $t$; however, if $w$
simultaneously appears in many other topics, its contribution should be lowered
accordingly. Taking these rules into consideration, we use Eq. 8 to quantify
the contribution of word $w$ to topic $t$.
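\begin{equation}
W(w, t) = p_{w,t} \cdot \frac{p_{w,t}}{\sum_{tt \in T} p_{w,tt}} = \frac{p_{w,t}^2}{\sum_{tt \in T} p_{w,tt}}
\tag{8}
\end{equation}
Specially, if $w$ only occurs under one topic $t$, then
$\frac{p_{w,t}}{\sum_{tt \in T} p_{w,tt}} = 1$ and $W(w, t) = p_{w,t}$,
which corresponds with our expectation. If $w$ doesn’t appear under $t$, we set
$W(w, t) = 0$.
\begin{equation}
Score(e, t) = \frac{\sum_{w \in e} W(w, t)}{|e|}
\tag{9}
\end{equation}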
Considering the features of documents about events, four additional optimizing
steps have been conducted:
– Stemming: The same word may appear in different tenses (for example
“damage” and “damaged”). We use the Snowball stemmer to stem words, reducing
the disturbance of tenses.
– Requiring High-contribution Words: Names of places and people appear
fairly frequently in Wikipedia documents and thus have considerable
contribution, but they have no direct connection with topics. We record the
$k$ words² $w_1, w_2, \ldots, w_k$ with the highest contribution for each topic
$t$, and if an excerpt doesn’t contain any of these $k$ words, its score
$Score(e, t)$ is set to 0.
For every candidate excerpt $e$ of topic $t$, we calculate the arithmetic mean
of the contributions of all words in $e$ as its score (see Eq. 9). After sorting
excerpts by score, we choose the top excerpts with total length no longer than
$L = 1000$ and concatenate them to generate a new text $D$ as the input of
fine-grained filtration.
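The coarse filtration described above (Eqs. 8 and 9, the high-contribution word requirement and the length budget) can be sketched as follows. This is a minimal illustration under assumptions: texts are already tokenized, stemming is omitted, length is counted in tokens, and all function names and the toy data are hypothetical.

```python
from collections import Counter


def word_distributions(topic_tokens):
    """{topic: [tokens]} -> {topic: {word: p_{w,t}}} via normalized counts."""
    dists = {}
    for topic, tokens in topic_tokens.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        dists[topic] = {w: c / total for w, c in counts.items()}
    return dists


def contribution(word, topic, dists):
    """W(w, t) from Eq. 8; 0 if the word never occurs under the topic."""
    p = dists[topic].get(word, 0.0)
    if p == 0.0:
        return 0.0
    return p * p / sum(d.get(word, 0.0) for d in dists.values())


def score(excerpt, topic, dists, k=20):
    """Score(e, t) from Eq. 9, set to 0 without any top-k word of the topic."""
    if not excerpt:
        return 0.0
    ranked = sorted(dists[topic], key=lambda w: contribution(w, topic, dists),
                    reverse=True)
    top_k = set(ranked[:k])
    if not top_k.intersection(excerpt):
        return 0.0
    return sum(contribution(w, topic, dists) for w in excerpt) / len(excerpt)


def select_excerpts(excerpts, topic, dists, max_len=1000):
    """Keep the best-scoring excerpts while the token budget allows."""
    chosen, used = [], 0
    for e in sorted(excerpts, key=lambda e: score(e, topic, dists), reverse=True):
        if used + len(e) > max_len:
            break
        chosen.append(e)
        used += len(e)
    return chosen


dists = word_distributions({
    "Damage": ["damage", "injury", "death", "damage"],
    "Geology": ["fault", "plate", "damage", "fault"],
})
picked = select_excerpts(
    [["fault", "plate"], ["injury", "death"], ["damage", "report"]],
    "Damage", dists, max_len=4,
)
```

In the toy data, "injury" occurs only under "Damage", so its contribution equals its probability 0.25, while "damage" is shared with "Geology" and is discounted to 1/3 instead of 0.5.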
Fine-Grained Filtration. After the coarse filtration step, the scale of the new
input corpus $D$ becomes suitable for neural networks. Due to limited training
data, we place high requirements on the data efficiency of the model. After
comparison, we choose DeepChannel [14] (see Fig. 2), which utilizes a channel
model to select the most significant sentences from the input document and
performs well on small datasets.
4 Dataset Construction
Among previous work, [12] and [3] did not provide datasets, while the dataset
of [10] is not event-oriented and contains only abstracts rather than full
Wikipedia articles, so no existing dataset is available for our task.
Considering that Wikipedia itself changes over time, it is also nearly impossible
to reconstruct previous datasets. We therefore constructed a new dataset for our
task.
² k = 20 in the experiments.
For every existing Wikipedia article ai , we build a web corpus Dj for each
section g in it to simulate the documents collected for a new event. Given that
Wikipedia articles can be viewed as human-authored summaries, we build the
web corpus Dj from two sources:
– Cited references: A well-written Wikipedia document should cite the
source of its important information in the Reference section. For each
Wikipedia article ai , we extract undecorated text snippets from websites
listed in the Reference section as cited corpus Cg .
– Search results: We found that a portion of the cited websites are no longer
available, so additional data needs to be collected. Search engines can
efficiently find relevant information about a certain entity; however,
appropriate search queries have to be provided. Following the query formulation
of [2], we combine document title t and section name g to build an appropriate
search query for each section, for example “2008 Sichuan earthquake” Geology
for section Geology in article 2008 Sichuan earthquake. We use the first 20
result pages of the Bing search engine for each query. After removing results
from Wikipedia websites, we extract text from the remaining pages as the
searched corpus Sg .
We use BeautifulSoup, a Python library, to remove useless information like
scripts and styles from web pages. Moreover, we discard text snippets whose
length is greater than 400 or less than 5 to reduce noise such as comments and
advertisements. The full web corpus Dj is obtained by combining the filtered
Cg and Sg .
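The cleaning step can be illustrated with a dependency-free sketch. Note the swap: instead of BeautifulSoup, this example uses Python's standard `html.parser` module so that it runs without third-party packages. Measuring snippet length in words is an assumption (the unit is not stated above), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class SnippetExtractor(HTMLParser):
    """Collect visible text snippets, skipping <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.snippets = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and self.skip_depth == 0:
            self.snippets.append(text)


def extract_snippets(html, min_len=5, max_len=400):
    """Return snippets whose word count lies within [min_len, max_len]."""
    parser = SnippetExtractor()
    parser.feed(html)
    parser.close()
    return [s for s in parser.snippets if min_len <= len(s.split()) <= max_len]


page = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head><body>"
    "<p>The earthquake struck the region early on Monday morning.</p>"
    "<p>Ad</p></body></html>"
)
snippets = extract_snippets(page)
```

Only the nine-word sentence survives: the script and style contents are skipped, and the one-word snippet "Ad" falls below the minimum length.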
5 Experiments
In this section, we build a new dataset to evaluate the performance of WikiGen.
We conducted experiments on topic template induction and topic summary
generation; the results show that WikiGen outperforms previous work on both
tasks.
5.1 Dataset
We choose three event categories, earthquake, election and tornado, from English
Wikipedia to build our dataset. The data source is XML dump of English
Wikipedia from WikiMedia. See Table 1 for detailed parameters of our dataset.
We compare our method with Hu’s work [7] to evaluate it both quantitatively
and qualitatively. For a more precise comparison, we do not discard
low-frequency topics, following Hu’s experiment.
Evaluation Method. We use F1-score to evaluate the performance of different
methods. Assume that R is the generated subtopic relation set and Rgt is the
ground-truth subsection relation set, we can calculate precision with Eq. 10 and
recall with Eq. 11, where gi , gj are sections and tk , tl are topics.
We note that adding textual correlation does not noticeably improve the
F1-score. A possible reason is that the texts of different topics already differ
greatly from each other after clustering.
Qualitative Analysis. We take the topic tree of category “election” as an
example (see Fig. 3). There are first-level topics like “Results”,
“Preliminaries” and “Campaign” under the root topic, followed by more detailed
topics, for example
Evaluation Method. We use the classic ROUGE [9] score to measure our
model. Because the length of section text varies greatly in our dataset (some
texts are longer than 2000 words while the average length is 100), considering
only precision (which favors short text) or only recall (which favors long text)
would introduce bias. We use the ROUGE-1 F1 score, which combines precision and
recall, to evaluate the results.
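For reference, ROUGE-1 compares the unigrams of a generated text with those of a reference text. The sketch below is a simplified illustration with plain whitespace tokenization and lowercasing; the official ROUGE package [9] offers further options such as stemming, so its scores may differ. The function name and example sentences are hypothetical.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Return (precision, recall, F1) based on clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f1 = rouge1(
    "the quake damaged many homes",
    "the quake destroyed many homes and roads",
)
```

Four of the five candidate words appear among the seven reference words, so precision is 4/5, recall is 4/7 and F1 is 2/3. A short candidate is rewarded by precision and a long one by recall, which is why F1 is the less biased choice here.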
Method of Coarse Filtration. We use three different methods to calculate
the similarity between excerpt e and topic t:
– TF-IDF: We combine document title tid and topic name t as q = (tid + t),
then calculate the TF-IDF cosine similarity between e and q as the final result
Results show that using search results greatly improved the performance
compared with using only citations. Two reasons may contribute to this
phenomenon: first, some cited pages of old events are outdated and no longer
available; second, a percentage of cited pages are written in languages other
than English, whereas the search engine provides us with English results only.
Although using only search results achieves better scores on both F1-score
and recall, we decided to use the combined corpus as the model input. The reason
is twofold: first, cited sources are chosen by human authors and mainly consist
of first-hand news and information from authoritative websites, which matches
what we expect to gather when a new event happens; second, some websites in the
search results copy sentences from Wikipedia articles, making their ROUGE
scores higher than normal. To make the experimental environment closer to real
scenarios, we did not discard cited sources.
Corpus of Fine-Grained Filtration. Considering that we have limited training
data, we merge all three event categories and divide all data into a training
set, a validation set and a test set by the ratio of 80%/10%/10%. We use four
different methods to generate the final document: random, lead-7, DeepChannel
from scratch and pretrained DeepChannel. Moreover, we choose the sentences with
the highest ROUGE-1 F1 scores as the theoretically optimal result. See Table 6
for results.
If we combine the results of the two steps in topic summary generation (see
Fig. 4 for results), we find that WikiGen using pretrained DeepChannel gets the
best result. Lead-7 gets a higher recall score, but its F1 score is far lower
than that of our model, indicating that WikiGen is capable of selecting key
sentences. Meanwhile, fine-grained filtration greatly improves the quality of
generated documents, which matches the way human writers write articles (first
coarsely select relevant materials, then rewrite them) and supports the design
of our model.
                          Precision  Recall  F1
Random                    40.5       43.7    36.4
Lead-7                    42.1       46.9    38.8
WikiGen (from scratch)    45.4       39.4    39.5
WikiGen (pretrained)      52.3       39.2    42.8
Theoretically optimal     68.1       56.7    59.5
Comparison with Previous Work. It is quite difficult to directly compare our
method with previous work for the following reasons: (1) our method focuses
on events rather than general entities, so a direct comparison with
entity-oriented methods would cause bias; (2) no available datasets have been
provided by previous work, and it is impossible to rebuild former datasets due
to the constantly changing Internet and Wikipedia; (3) no executable model code
has been provided by previous work, making a comparison even harder.
Nevertheless, we tried our best to conduct some comparison with those works. We
choose the “Disease” category in previous work, which has a maximum recall of
59 that resembles our dataset, to compare with the results of our model. Table 7
shows the results. Our model far outperforms previous work on F1 score. Its
recall score is somewhat lower, partly because our model tends to choose
sentences that relate closely to the topic, which reduces the breadth of
coverage.
                          Precision  Recall  F1
Sauper                    36         39      37
WikiSum                   25         48      29
WikiGen (from scratch)    45.4       39.4    39.5
WikiGen (pretrained)      52.3       39.2    42.8
Case Analysis. To comprehensively judge the effect of our model, we use the
title “2008 Sichuan Earthquake” as input to generate a full Wikipedia document
with multiple topics. See Fig. 5 for part of the results.
Results show that although the documents we generated are not as fluent as
those written by humans, our model captured the most important information
under different topics, for example the number of people injured in Damage and
the Longmenshan Fault in Tectonics. The results are valuable both as a
first-hand summary and as a reference for human writing.
6 Conclusion
In this paper, we propose a new model named WikiGen to automatically generate
Wikipedia documents for new events. This model induces a multi-layer topic
tree for each event category and generates a summary from gathered news for
each topic. For topic tree generation, we use a double-pass incremental
clustering algorithm to discover topics and convert subtopic relation discovery
into finding a maximum spanning tree in a directed graph. For topic summary
generation, we imitate the way humans write articles, designing a two-step
procedure to generate the full document: first coarsely filter useful
information with an augmented topic model, then generate a reasonable summary
with the DeepChannel model pretrained on the CNN/Daily Mail dataset.
Our method outperforms comparable methods on both topic tree generation and
topic summary generation. Results show that our model is capable of
generating comprehensive and detailed Wikipedia documents, and can be easily
extended to other fields. Our model also shows high data efficiency, being able
to produce high-quality results with little training data. In the future, we
will try to rearrange the selected sentences to obtain more fluent documents.
References
1. Allan, J., Carbonell, J., Doddington, G., Yamron, J., Yang, Y.: Topic detection
and tracking pilot study final report. In: Proceedings of the Darpa Broadcast News
Transcription & Understanding Workshop (1998)
2. Aula, A.: Query formulation in web information search. In: ICWI (2003)
3. Banerjee, S., Mitra, P.: WikiWrite: generating Wikipedia articles automatically.
In: IJCAI (2016)
4. Chu, Y.J., Liu, T.H.: On shortest arborescence of a directed graph. Sci. Sinica
14(10), 1396 (1965)
5. Edmonds, J.: Optimum branchings. J. Res. Nat. Bureau Standard B 71(4), 233–
240 (1967)
6. Hammouda, K.M., Kamel, M.S.: Incremental document clustering using cluster
similarity histograms. In: Proceedings of the IEEE/WIC International Conference
on Web Intelligence (WI 2003). IEEE (2003)
7. Hu, L., et al.: Learning topic hierarchies for Wikipedia categories. In: ACL (2015)
8. Lebret, R., Grangier, D., Auli, M.: Neural text generation from structured data
with application to the biography domain. In: EMNLP (2016)
9. Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: ACL (2004)
10. Liu, P.J., et al.: Generating Wikipedia by summarizing long sequences. arXiv
preprint arXiv:1801.10198 (2018)
11. Nallapati, R., Zhai, F., Zhou, B.: SummaRuNNer: a recurrent neural network based
sequence model for extractive summarization of documents. In: AAAI (2017)
12. Sauper, C., Barzilay, R.: Automatically generating Wikipedia articles: a structure-
aware approach. In: ACL (2009)
13. See, A., Liu, P.J., Manning, C.D.: Get to the point: summarization with pointer-
generator networks. In: ACL (2017)
14. Shi, J., Liang, C., Hou, L., Li, J., Liu, Z., Zhang, H.: DeepChannel: salience esti-
mation by contrastive learning for extractive document summarization. In: AAAI
(2019)
15. Vaswani, A., et al.: Attention is all you need. In: NIPS (2017)
A Linked Data Model-View-* Approach
for Decoupled Client-Server Applications
1 Introduction
Separation of concerns, the separation of parts of programs with distinct
purposes in both architecture and code, is found to be a crucial design
requirement for maintainable, extendable and understandable software [1–5].
For a clean separation within runtime applications, the paradigm of
Aspect-Oriented Programming [2] motivates decoupling application modules to
avoid cross-cutting concerns, in particular code tangling (the logic of a module
directly depends on code implemented in other modules) and code scattering (code
that implements a certain aspect of an application is distributed over several
modules).
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 67–81, 2020.
https://doi.org/10.1007/978-3-030-41407-8_5
68 T. Spieldenner and R. Schubotz
2 Related Work
Enriching Web services and applications with Linked Data APIs and lifting them
to a Linked Data architecture [16] has received much attention in research in
recent years.
Especially in the domain of Internet of Things (IoT), work has been carried
out to find suitable expressive Linked Data representations of devices and
interfaces [13,17,18], up to using the W3C recommendation of the Linked Data
Platform⁵ as integration layer for heterogeneous IoT devices [19] with the goal
to overcome a lack of sufficiently described Web APIs [20,21]. Also for
existing Enterprise applications, the Linked Data Platform has been found to be
a suitable integration layer [22].
An often considered case for Linked Data application development is simplifying
the creation of User Interfaces (UI) for existing datasets. The Information
Workbench by Haase et al. [23] supports widget-based Linked Data application
development, mainly for data integration, providing users with a rich UI that
can be customized through an SDK. LD-R by Khalili et al. [24,25] provides Linked
Data driven Web components (see Footnote 6) for quick bootstrapping of Linked
Data based Web UIs, along with an approach to map SPARQL queries to interactive
UI components [26]. LD Viewer by Lukovnikov et al. [27] is a framework, based on
the DBpedia viewer, for customizable Linked Data UI presentations.
When it comes to connecting clients to servers, research has revolved around
simpler and more versatile usage of the SPARQL query language (see Footnote 7).
This includes work to wrap stored SPARQL queries into HTTP Web APIs, as in
BASIL by Daga et al. [28], and vice versa, to wrap JSON-based Web APIs into
SPARQL-queryable endpoints, as for example in SPARQL Micro Services by
Michel et al. [29,30].
Fafalios et al. present SPARQL-LD [31,32], which generalizes the semantics
of the SPARQL 1.1 SERVICE keyword to dynamically fetch RDF datasets from
Web resources, also during evaluation of the query. Vogelgesang et al. present
SPARQλ [33], which modifies parts of the query semantics of the SPARQL GRAPH
keyword to dynamically specify target datasets during query execution, and
thereby uses pre-stored SPARQL queries as lambda-function-like micro-services.
5 https://www.w3.org/TR/ldp/.
6 https://developer.mozilla.org/en-US/docs/Web/Web_Components.
7 https://www.w3.org/TR/sparql11-overview/.
While existing work mostly considers static datasets, for example legacy
databases [34] or existing RDF datasets [24,25,27], we explicitly provide an
approach that uses a Linked Data Platform representation of run-time application
data as API towards Web clients. Unlike existing approaches that focus on UI
development as View on the Linked Data representations [23–27], our approach
targets general Web application development and does not limit client business
logic to UI rendering and interaction. The semantically enriched Linked Data
representation on the server side of our approach directly supports SPARQL-based
client queries, and thereby profits from findings in the respective research
[31–33]. The architecture is not tied to a specific framework; instead, we
provide a thorough analysis of how any client-server based application benefits
from the advantages of the Linked Data Platform by implementing a Model-View-*
based design pattern.
3 Preliminaries
The architecture proposed in this paper is based on three core technologies
and design choices. We build the architecture around a Model-View-* like design
pattern, more precisely a Model-View-Presenter-ViewModel pattern8. We employ an
Entity-Component-Attribute data model [5,35] as ViewModel on the server side, and
base the server-side View on the server data on the W3C Linked Data Platform
recommendation (see Footnote 5). In the following, we outline these core concepts
as they are to be understood for the remainder of the paper.
10 https://www.unrealengine.com/.
11 https://www.unity.com/.
$\forall (n_a, v, t) \in A_{c,e}$:
    $\nu(n_a)$ rdf:type ldp:RDFResource .
    $\nu(n_a)$ dct:identifier "$n_a$"^^xsd:String .
    $\nu(n_a)$ dct:isPartOf $\nu(n_c)$ .
    $\nu(n_a)$ rdf:value $\nu(n_{v_a})$ .        (➂)
We changed how ECA2LD handles Attribute values: instead of rendering them
directly as a suitable RDF representation in step ➂, we provide each Attribute
value as a separate resource with a resolvable URI $\nu(n_{v_a})$. This allows
the server to specify further interaction methods on Attribute values, as
described in Sects. 4.2 and 4.3.
Moreover, we extend the above representations by generating triple sets that
describe collections of Entities. For this, we assume that an entity collection
$E$ is assigned a unique name $n_E$.
$\forall (n_e, C_e) \in E$:
    $\nu(n_E)$ rdf:type ldp:BasicContainer .
    $\nu(n_E)$ ldp:contains $\nu(n_e)$ .        (➃)
The respective Entity collection resource then serves as entry point for client
applications to explore the server data.
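The triple sets of steps ➂ and ➃ can be illustrated with a small sketch. It is not the ECA2LD implementation: the base URI, the path-based URI minting function `nu` and the `/value` suffix for the Attribute value resource are assumptions made only for this example, and triples are kept as plain tuples of strings.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical server base URI


def nu(*names):
    """Assumed URI minting function: joins the given names into a path under BASE."""
    return BASE + "/".join(quote(n) for n in names)


def attribute_triples(entity, component, attribute):
    """Triples of step 3 for one Attribute; its value is a separate resource."""
    a = nu(entity, component, attribute)
    return [
        (a, "rdf:type", "ldp:RDFResource"),
        (a, "dct:identifier", f'"{attribute}"^^xsd:String'),
        (a, "dct:isPartOf", nu(entity, component)),
        # Assumed naming scheme for the value resource: "<attribute URI>/value"
        (a, "rdf:value", nu(entity, component, attribute, "value")),
    ]


def collection_triples(collection, entity_names):
    """Triples of step 4: the collection is typed as an LDP Basic Container."""
    c = nu(collection)
    triples = [(c, "rdf:type", "ldp:BasicContainer")]
    triples += [(c, "ldp:contains", nu(e)) for e in entity_names]
    return triples
```

For a hypothetical entity `robot1` with component `pose` and attribute `x`, the value resource would be minted as `http://example.org/robot1/pose/x/value`, and the collection resource would link to `http://example.org/robot1` via `ldp:contains`.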
The resulting RDF-modeled Linked Data Platform representation allows the
resources to be further augmented with domain semantic information. For this,
the RDF description of the data structure serves as input for RDF mapping
vocabularies like SPIN SPARQL12, RIF in RDF13, the LDIF framework14 or the R2R
framework15. For a detailed explanation of domain semantic augmentation, we
refer to the original paper [5]. In the context of our approach, additional
domain semantic information on top of the structural description allows clients
to identify relevant resources based on domain semantics.
Fig. 2. Web resources and respective endpoints that are generated for each of the
elements of the runtime application.
16 https://docs.microsoft.com/en-us/dotnet/standard/events/.
the RDF description in ➄, clients can keep their local model consistent with the
server data.
Figure 2 shows the set of created resources, their relation to each other, and
examples of further interaction methods that can be retrieved by clients after
performing the steps outlined in Sects. 4.1, 4.2 and 4.3.
We aim to keep the Business Logic of the client independent of the server’s data
model. To this end, we use the Linked Data View onto the server data from
Sect. 4 to build a Data Access Layer between client business logic and server
data.
Based on both structural and domain semantic information provided on
each of the HTTP resources on the server side, clients identify relevant
resources that provide access to specific pieces of data. For this, clients may
either explore the server data autonomously by following links between the
resources, or perform queries against an RDF query processor provided by the
server. Figure 3 shows a respective (parameterized) SPARQL query to retrieve
Attribute Value endpoints based on field names used for components. The
parameters [$entityID] and [$fieldName] may be resolved and set by the
client business logic before performing the query.
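As an illustration of how client business logic might resolve such parameters, the following sketch fills the `[$entityID]` and `[$fieldName]` placeholders of a query template. The query text is hypothetical and is not the query of Fig. 3: it only reuses the `dct:identifier`, `dct:isPartOf` and `rdf:value` properties from Sect. 4, and it assumes that entities and components carry `dct:identifier` in the same way as attributes. The helper name `resolve` is likewise an assumption.

```python
# Hypothetical query template; placeholders follow the [$name] notation of the text.
QUERY_TEMPLATE = """\
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?valueEndpoint WHERE {
  ?entity    dct:identifier ?entityName .
  ?component dct:isPartOf ?entity .
  ?attribute dct:isPartOf ?component ;
             dct:identifier ?fieldName ;
             rdf:value ?valueEndpoint .
  FILTER (str(?entityName) = "[$entityID]" && str(?fieldName) = "[$fieldName]")
}"""


def resolve(template, **params):
    """Replace every [$name] placeholder with its escaped parameter value."""
    for name, value in params.items():
        # Escape backslashes and double quotes so the value stays inside the literal
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        template = template.replace(f"[${name}]", escaped)
    return template


query = resolve(QUERY_TEMPLATE, entityID="robot1", fieldName="x")
```

The resolved string can then be sent to whatever SPARQL endpoint the server exposes; comparing via `str(...)` sidesteps the datatype of the identifier literals, which is a simplification of this sketch.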
The result of such queries (or of the autonomous exploration) is a
set of URIs that point to Linked Data resources. Once the relevant Linked Data
resources are identified, clients retrieve relevant modes of interaction directly
on these URIs via HTTP OPTIONS requests (cf. Sects. 4.2 and 4.3). Requests to
these URIs are handled by the ECA2LD Presenter on the server side, applied to the
ViewModel, and ultimately to the local server-side domain models.
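A client-side helper for this step could look roughly as follows. The sketch only builds an OPTIONS request with Python's standard library and parses an `Allow` header value; the function names are hypothetical, and it is an assumption that the server advertises its interaction modes in the `Allow` header, as the Linked Data Platform recommendation requires for LDP resources.

```python
from urllib.request import Request


def options_request(resource_uri):
    """Build (but do not send) an HTTP OPTIONS request for a resource URI."""
    return Request(resource_uri, method="OPTIONS")


def allowed_methods(allow_header):
    """Parse an Allow header value such as 'GET, PUT, OPTIONS' into a set."""
    return {m.strip().upper() for m in allow_header.split(",") if m.strip()}


req = options_request("http://example.org/robot1/pose/x/value")
methods = allowed_methods("GET, PUT,OPTIONS")
```

In a live setting the request would be sent with `urllib.request.urlopen(req)` and the header read from `response.headers.get("Allow", "")`; the example URI is made up.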
The Data Access Layer generated in this way reduces interaction between client
and server to basic HTTP request dispatch and handling. All relevant information
for client-server interaction is gathered by clients dynamically from information
provided by the server during run-time. As a result, clients do not require
previous knowledge about server data or APIs to build the Data Access Layer.
6 Discussion
In the following we discuss how the proposed architecture provides a valid imple-
mentation of the Model-View-Presenter-ViewModel pattern (Sect. 6.1), and show
3-RMM compliance of the data-centric server interface (Sect. 6.2).
Fig. 4. The manifestation of the MVPVM pattern in our proposed client-server archi-
tecture.
In this section, we show that with the design choices presented in Sects. 4 and 5,
we implement a Model-View-Presenter-ViewModel pattern as shown in Fig. 4.
For this, we first discuss the realization of ViewModel, View, and Presenter on
the server side. Subsequently, we discuss the realization of the Data Access
Layer, and how the Business Logic accesses it, on the client side.
Presenter : The ECA2LD Library. ECA2LD creates the Linked Data Platform
resources along with a respective RDF graph that describes the resources on
every level of the ECA model. The Presenter handles HTTP requests and pro-
vides subscription channels. Via those, it adapts data in the ViewModel and the
View accordingly.
ViewModel : The domain objects modeled in terms of Entities, Components, and
Attributes. The ViewModel provides an intermediate layer between the native
data Model of the server, and any attached View.
View : The RDF Graph describing the Linked Data Platform resources, relations
between them, and modes of interaction as created by ECA2LD. Changes in the
(View)Model are reflected in the View by the Event system implemented in
ECA2LD.
Data Access Layer : Wrapper around HTTP and subscription endpoints as pro-
vided by steps in Sects. 4.2 and 4.3. The client builds the Data Access Layer
dynamically by exploring the server-side Linked Data View on the data accord-
ing to Sect. 5.
abstract from local domain models, and build the pattern around an explorable
and understandable data and interface representation. By this, we brought the
often proclaimed benefit of Linked Data interfaces to actual application in a
widely applied design pattern.
We considered the automatically generated Linked Data representation of
server run-time data as View on the server data for client applications, the
respective mapping routine as Presenter, and links to the respective data
resources from client to server as Data Access Layer on the client side. The Data
Access Layer can be built by clients autonomously by exploiting knowledge
derived from the Linked Data representation provided in the View. The resulting
architecture removes the need for a fixed server-side API, and instead provides
direct access to server data via HTTP and publish-subscribe mechanisms.
The current architecture does not yet consider secure communication between
client and server. We plan to extend the capabilities of the extendable
subscription channels with secured and encrypted communication protocols. The
current implementation so far only supports XSD-compliant datatypes for Attribute
values. We are working on a proper semantic description of complex structured
data and binary IoT device streams as serialization/de-serialization
instructions for clients on Attribute value endpoints.
Acknowledgements. The work described in this paper has been partially funded
by the German Federal Ministry of Education and Research (BMBF) through the
project Hybr-iT under the grant 01IS16026A, and by the German Federal Ministry for
Economic Affairs and Energy (BMWi) through the project SENSE under the grant
01MT18007A.
References
1. Bower, A., McGlashan, B.: Twisting the triad. Tutorial Paper for European
Smalltalk User Group (ESUP) (2000)
2. Kiczales, G., et al.: Aspect-oriented programming. In: Akşit, M., Matsuoka, S.
(eds.) ECOOP 1997. LNCS, vol. 1241, pp. 220–242. Springer, Heidelberg (1997).
https://doi.org/10.1007/BFb0053381
3. Syromiatnikov, A., Weyns, D.: A journey through the land of model-view-design
patterns. In: 2014 IEEE/IFIP Conference on Software Architecture, pp. 21–30.
IEEE (2014)
4. Spieldenner, T., Byelozyorov, S., Guldner, M., Slusallek, P.: FiVES: an aspect-
oriented approach for shared virtual environments in the web. Vis. Comput. 34(9),
1269–1282 (2018)
5. Spieldenner, T., Schubotz, R., Guldner, M.: ECA2LD: from entity-component-
attribute runtimes to linked data applications. In: Proceedings of the International
Workshop on Semantic Web of Things for Industry 4.0. Extended Semantic Web
Conference (ESWC 2018), International Workshop on Semantic Web of Things for
Industry 4.0, Located at 15th ESWC Conference, Heraklion, Crete, Greece, 3–7
June 2018. Springer (2018)
6. Krasner, G.E., Pope, S.T., et al.: A description of the model-view-controller user
interface paradigm in the smalltalk-80 system. J. Object Oriented Program. 1(3),
26–49 (1988)
24. Khalili, A., Loizou, A., van Harmelen, F.: Adaptive linked data-driven web com-
ponents: building flexible and reusable semantic web interfaces. In: Sack, H.,
Blomqvist, E., d’Aquin, M., Ghidini, C., Ponzetto, S.P., Lange, C. (eds.) ESWC
2016. LNCS, vol. 9678, pp. 677–692. Springer, Cham (2016).
https://doi.org/10.1007/978-3-319-34129-3_41
25. Khalili, A., de Graaf, K.A.: Linked data reactor: towards data-aware user interfaces.
In: Proceedings of the 13th International Conference on Semantic Systems, pp.
168–172. ACM (2017)
26. Khalili, A., Merono-Penuela, A.: WYSIWYQ - what you see is what you query. In:
VOILA@ ISWC, pp. 123–130 (2017)
27. Lukovnikov, D., Stadler, C., Lehmann, J.: LD viewer-linked data presentation
framework. In: Proceedings of the 10th International Conference on Semantic Sys-
tems, pp. 124–131. ACM (2014)
28. Daga, E., Panziera, L., Pedrinaci, C.: A BASILar approach for building web APIs
on top of SPARQL endpoints. In: CEUR Workshop Proceedings, vol. 1359, pp.
22–32 (2015)
29. Michel, F., Zucker, C.F., Gandon, F.: SPARQL micro-services: lightweight integra-
tion of web APIs and linked data. In: LDOW 2018-Linked Data on the Web, pp.
1–10 (2018)
30. Michel, F., Faron-Zucker, C., Gandon, F.: Bridging web APIs and linked data with
SPARQL micro-services. In: Gangemi, A., et al. (eds.) ESWC 2018. LNCS, vol.
11155, pp. 187–191. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-98192-5_35
31. Fafalios, P., Tzitzikas, Y.: SPARQL-LD: a SPARQL extension for fetching and
querying linked data. In: International Semantic Web Conference (Posters &
Demos) (2015)
32. Fafalios, P., Yannakis, T., Tzitzikas, Y.: Querying the web of data with SPARQL-
LD. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds.) TPDL 2016. LNCS,
vol. 9819, pp. 175–187. Springer, Cham (2016).
https://doi.org/10.1007/978-3-319-43997-6_14
33. Vogelgesang, C., Spieldenner, T., Schubotz, R.: SPARQλ: a functional perspec-
tive on linked data services. In: Ichise, R., Lecue, F., Kawamura, T., Zhao, D.,
Muggleton, S., Kozaki, K. (eds.) JIST 2018. LNCS, vol. 11341, pp. 136–152.
Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04284-4_10
34. Groth, P., Loizou, A., Gray, A.J.G., Goble, C., Harland, L., Pettifer, S.: API-
centric linked data integration: the open PHACTS discovery platform case study.
Web Semant. Sci. Serv. Agents World Wide Web 29, 12–18 (2014)
35. Alatalo, T.: An entity-component model for extensible virtual worlds. IEEE Inter-
net Comput. 15(5), 30–37 (2011)
36. Dahl, T., Koskela, T., Hickey, S., Vatjus-Anttila, J.: A virtual world web client
utilizing an entity-component model. In: NGMAST, pp. 7–12. IEEE (2013)
37. Moltchanov, B., Rocha, O.R.: A context broker to enable future IoT applications
and services. In: 2014 6th International Congress on Ultra Modern Telecommuni-
cations and Control Systems and Workshops (ICUMT), pp. 263–268. IEEE (2014)
38. Wiebusch, D., Latoschik, M.E.: Decoupling the entity-component-system pattern
using semantic traits for reusable realtime interactive systems. In: 2015 IEEE 8th
Workshop on Software Engineering and Architectures for Realtime Interactive Sys-
tems (SEARIS), pp. 25–32. IEEE (2015)
39. Parastatidis, S., Webber, J., Silveira, G., Robinson, I.S.: The role of hyperme-
dia in distributed system development. In: Proceedings of the First International
Workshop on RESTful Design, pp. 16–22. ACM (2010)
JECI: A Joint Knowledge Graph
Embedding Model for Concepts
and Instances
1 Introduction
Knowledge graphs organize human knowledge in the form of triple facts (head
entity, relation, tail entity), abbreviated as (h, r, t), which are also commonly
written as (subject, predicate, object). The goal of knowledge graph embedding is
to embed entities and relations into a continuous low-dimensional vector space.
It encodes both the topological structure and the semantic information of the
knowledge graph into the embeddings of entities and relations. This makes the
knowledge graph more computable, which benefits tasks such as knowledge graph
completion and relation extraction.
Recent years have witnessed the rapid development of knowledge graph
embedding [1]. Network-based one-hot representation is simple and interpretable
[2], but it often suffers from computational inefficiency and data sparsity due
to the complicated network structure and the long-tail distribution of knowledge
graphs. To tackle this issue, distributed knowledge graph embedding models have
been proposed to learn low-dimensional embeddings by machine learning and
deep learning. Some of them utilize the triple facts observed in the knowledge
graph to learn embeddings. Among these, translation-based models view the
relation as a translation from the head entity to the tail entity. In TransE [3],
the embedded entities h and t are linked with low error by the embedded relation
r, i.e., h + r ≈ t when (h, r, t) holds in the knowledge graph. TransH
[4], TransR/CTransR [5] and TransD [6] have been proposed to improve on TransE
in dealing with complex relations. DistMult [7], HolE [8] and ComplEx [9] model
the multi-relational data in a knowledge graph as matrices or tensors to capture
the inherent semantics between entities and relations. SLM [10], SME [11] and
ConvE [12] apply neural networks to model connections between entities and
relations. Although matrices, tensors and networks can better capture
the semantics, they also incur expensive computation due to their large number
of parameters. In addition, many models take advantage of multi-source
information besides triple facts, such as entity types [13,14], relation
paths [15], textual descriptions [10,16], logical rules [17,18] and so on.
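The translation idea h + r ≈ t can be made concrete with a short sketch of a TransE-style score. It assumes the L2 norm as the distance measure (TransE can also use L1) and plain Python lists as embeddings; the function name is chosen for this example only.

```python
import math


def transe_score(h, r, t):
    """L2 norm of h + r - t; a lower score means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


perfect = transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])  # h + r equals t
poor = transe_score([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])     # h + r is far from t
```

A triple whose tail lies exactly at h + r gets score 0, while the second example is penalized with the distance between h + r and t.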
Although these knowledge graph embedding models achieve promising experimental
results, most of them ignore the differences between instances and concepts
and treat both equally as entities, which causes the following drawbacks:
• Unique features of concepts and instances are not captured in embeddings.
Concepts are abstract and can be seen as categories, which contain sub-
concepts and similar instances. However, instances are specific and each of
them refers to a unique physical object, which may belong to more than one
concept [19].
• Hierarchical structure of concepts is ignored. Concepts are hierarchical nat-
urally. As shown in Fig. 1(a), (Scientist, subClassOf, Person) and (Writer,
subClassOf, Person) form the hierarchical structure, in which concepts of
different granularities are in different layers.
• Transitivity of isA relations is not preserved. InstanceOf and subClassOf are
special relations in knowledge graphs, called isA [20]. They have the property
of transitivity, which is useful for knowledge graph completion. As shown in
Fig. 1(b), if (Coco, instanceOf, Dog) and (Dog, subClassOf, Animal) are facts
in the knowledge graph, then we can infer that Coco is also an instance of
Animal, which is represented by the dotted line.
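The inference in the last item can be sketched as a small closure computation. This is only an illustration of isA transitivity on explicit triples, not part of the embedding model; the function names are hypothetical and the subClassOf graph is assumed to be finite.

```python
def ancestors(concept, parents):
    """All concepts reachable from `concept` via subClassOf edges."""
    seen, stack = set(), [concept]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def infer_instance_of(instance_of, subclass_of):
    """Return the instanceOf pairs implied by transitivity but not yet stated."""
    parents = {}
    for child, parent in subclass_of:
        parents.setdefault(child, set()).add(parent)
    inferred = set()
    for inst, concept in instance_of:
        for anc in ancestors(concept, parents):
            if (inst, anc) not in instance_of:
                inferred.add((inst, anc))
    return inferred


new_facts = infer_instance_of({("Coco", "Dog")}, {("Dog", "Animal")})
```

For the example of Fig. 1(b), the only inferred pair is (Coco, Animal).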
These problems have been discussed in only a few works. In SSE [13], instances
belonging to the same concept are supposed to lie close to each other in
the embedding space. TKRL [14] incorporates entity types (i.e., concepts) as
auxiliary information for learning embeddings. TransC [21] models each concept
as a sphere and each instance as a point in the same semantic space. The relative
position between the point and the sphere is used to model the relation between
the instance and the concept. However, the sphere is unable to capture the
complex semantics of concepts, since a sphere is a highly symmetrical geometric
shape. Moreover, although the instances are constrained to lie inside the
spheres, TransC still shares the limitations in dealing with complex relations
that exist in most knowledge graph embedding models.
84 J. Zhou et al.
[Fig. 1 (diagram): (a) hierarchical structure of concepts and (b) transitivity of isA relations; surviving labels: Animal, Person, Dog, subClassOf, instanceOf]
In order to reduce the impact caused by differences between concepts and
instances, we propose a novel knowledge graph embedding model, named JECI, that
jointly embeds concepts and instances. For each instance, we generate a
context vector from its neighbors and design a prediction function based on the
context vector, which is formalized as a circular convolution. The prediction
function is used to progressively predict which hierarchical concepts the
instance belongs to, in order from coarse to fine granularity, based on the
subClassOf relation and the instanceOf relation. JECI then locates the instance
in the embedding space using the most fine-grained concept it belongs to. We
minimize the gap between the prediction and the reality, and iteratively learn
the embeddings. In this way, concepts and instances are jointly embedded. For
relational triples, we select triple-based models such as TransE and TransD to
learn the embeddings. Taking TransE, TransH, TransR, TransD, HolE, DistMult,
ComplEx and TransC as baselines, experiments on YAGO39K and M-YAGO39K [21] show
that JECI achieves better performance in most cases. The main contributions of
this paper can be summarized as follows:
• We propose a novel knowledge graph embedding model, which can distinguish
concepts and instances.
• Hierarchical structure of concepts is preserved in our embedding model due
to the progressive predictions for instances.
• Transitivity of isA relations (i.e., subClassOf and instanceOf ) is captured.
• Problem of complex relations is also addressed in our model by incorporating
neighbor information of instances.
The rest of this paper is organized as follows. In Sect. 2, the symbols
and definitions used throughout this paper are listed. In Sect. 3, we introduce
the JECI model in detail. The performance of our model is evaluated
experimentally in Sect. 4. Finally, Sect. 5 concludes the paper and outlines
future work.
2 Preliminaries
For clarity, the symbols used throughout the paper are summarized
in Table 1. Bold italic $\boldsymbol{x}$ denotes the embedding of $x$.
A knowledge graph $KG$ with instances, concepts and relations is formalized as
$KG = \{I, C, R, S\}$. There are three kinds of relations in this
knowledge graph: (1) The instanceOf relation, denoted as $r_e$, which indicates
that an instance is an instantiation of a concept. For example, (Shakespeare,
instanceOf, Writer) indicates that Shakespeare is an instance of Writer. (2) The
subClassOf relation, denoted as $r_c$, which indicates that a concept is a
subconcept of another concept. For example, (Writer, subClassOf, Person)
indicates that a writer is also a person. (3) General relations, which relate two
instances. For example, (Shakespeare, write, Hamlet) indicates that Shakespeare
wrote Hamlet. The relations set $R$ is formalized as $\{r_e, r_c\} \cup R_l$,
where $R_l$ is the set of general relations. Three kinds of triple sets are
then denoted as follows: (1) the instanceOf triples set
$S_e = \{(i, r_e, c) \mid i \in I \wedge c \in C\}$; (2) the subClassOf triples
set $S_c = \{(c_i, r_c, c_j) \mid c_i \in C \wedge c_j \in C\}$; (3) the
relational triples set
$S_l = \{(h, r, t) \mid h \in I \wedge t \in I \wedge r \in R_l\}$. Thus, the
triples set $S$ is composed of these three disjoint triple sets, corresponding
to the three kinds of relations respectively, and is formalized as
$S = S_e \cup S_c \cup S_l$.
Definition 1 (Neighbor context). The neighbor context for an instance $x$ is
defined as the set of its neighbors in the knowledge graph, denoted as
$N(x) = \{i \mid (x, r, i) \in S_l \vee (i, r, x) \in S_l\}$.
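Definition 1 translates directly into a few lines of code. The sketch below assumes that relational triples are given as (head, relation, tail) tuples; the relation `marriedTo` and the instance `Anne` are made up for the example.

```python
def neighbor_context(x, relational_triples):
    """N(x): every instance linked to x by a relational triple, in either direction."""
    neighbors = set()
    for h, _r, t in relational_triples:
        if h == x:
            neighbors.add(t)
        if t == x:
            neighbors.add(h)
    return neighbors


S_l = [
    ("Shakespeare", "write", "Hamlet"),
    ("Anne", "marriedTo", "Shakespeare"),  # hypothetical triple
    ("Hamlet", "write", "Other"),          # hypothetical triple, not involving Shakespeare
]
context = neighbor_context("Shakespeare", S_l)
```

Both outgoing and incoming edges contribute, so the context of Shakespeare here contains Hamlet and Anne.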
Symbols    Descriptions
KG         knowledge graph
I          instances set
C          concepts set
r_e        instanceOf relation
r_c        subClassOf relation
R_l        general relations set
R          relations set
S_l        relational triples set
S_e        instanceOf relation triples set
S_c        subClassOf relation triples set
S          triples set
HT         hierarchical tree
N          neighbor context
[Fig. 2 (diagram): the three parts of JECI; surviving labels: Hierarchical Tree Generator, Context vector Generator, Embeddings Learner, Thing, Supervision information, Embeddings]
3 JECI Model
This paper emphasizes the differences between concepts and instances in
knowledge graphs, and proposes a novel knowledge graph embedding model, named
JECI, to jointly embed concepts and instances into low-dimensional vectors.
As shown in Fig. 2, JECI has three functional parts: the hierarchical tree
generator, the context vector generator and the embeddings learner. The
hierarchical tree generator maps the hierarchical concepts in a knowledge graph
to a tree. The context vector generator extracts the neighbors of a target
instance from the original knowledge graph and uses them to generate a context
vector. The embeddings learner first links instances to the leaf nodes of the
tree, and obtains supervision information from the tree. It then learns
embeddings based on the context vector and the supervision information. The
details of these three parts are described below.
• All concepts in $C$ are each mapped to an independent tree with a single node.
• For each triple $(c_i, r_c, c_j) \in S_c$, the tree with $c_i$ as the root is
mapped to a sub-tree of $c_j$, and $r_c$ is mapped to the branch between $c_i$
and $c_j$. When this step is finished, many independent trees have been
constructed.
• JECI introduces an auxiliary concept Thing satisfying $(c, r_c, Thing)$ for all
concepts in $C$ and maps Thing to a tree with a single node. Then all
independent trees are mapped to sub-trees of Thing.
In this way, all concepts are organized as $HT$ and the hierarchical structure
is preserved in the tree. All nodes in $HT$ correspond to concepts in $C$, and
all branches in $HT$ correspond to subClassOf relations.
$(c_i, subClassOf, c_j)$ indicates that the concept $c_j$ is more coarse-grained
than the concept $c_i$, that is, $c_j$ is more general than $c_i$. If
$(c_i, r_c, c_k) \in S_c \wedge (c_j, r_c, c_k) \in S_c \wedge (c_i, r_c, c_j) \in S_c$,
the order by granularity is $c_i < c_j < c_k$. JECI ensures that the more
coarse-grained concepts lie closer to Thing in $HT$: the tree with $c_i$ as the
root is mapped to a sub-tree of $c_j$, and the tree with $c_j$ as the root is
then mapped to a sub-tree of $c_k$.
For example, Fig. 3(a) shows a part of a knowledge graph, with concepts
represented by squares, instances represented by circles, subClassOf relations
represented by green arrows, instanceOf relations represented by yellow arrows
and general relations represented by blue arrows. The concepts in the knowledge
graph are mapped to a hierarchical tree inside the dotted square (shown in
Fig. 3(b)).
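The construction of the hierarchical tree described above can be sketched as follows. This is an illustrative reading of the three steps, not the authors' code: it assumes that the subClassOf triples are acyclic, that Thing is not itself listed as a concept, and that a concept with several unrelated parents is attached to the alphabetically smallest one (a tie-break invented for this sketch). The concept `Physicist` is a made-up example.

```python
def build_hierarchical_tree(concepts, subclass_of, root="Thing"):
    """Return a child -> parent map for HT (assumes subClassOf is acyclic)."""
    parents = {c: set() for c in concepts}
    for child, parent in subclass_of:
        parents.setdefault(child, set()).add(parent)
        parents.setdefault(parent, set())

    def ancestors(concept):
        seen, stack = set(), [concept]
        while stack:
            for p in parents[stack.pop()]:
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    tree = {}
    for c, ps in parents.items():
        # Keep only the most fine-grained parents: drop any parent that is
        # also an ancestor of another parent (a "shortcut" subClassOf triple).
        direct = [p for p in ps
                  if not any(p in ancestors(q) for q in ps if q != p)]
        # Concepts without a parent become sub-trees of the auxiliary root.
        tree[c] = min(direct) if direct else root
    return tree


HT = build_hierarchical_tree(
    ["Person", "Scientist", "Writer", "Physicist"],
    [("Scientist", "Person"), ("Writer", "Person"),
     ("Physicist", "Scientist"), ("Physicist", "Person")],
)
```

With the shortcut triple (Physicist, subClassOf, Person) present, Physicist is still attached below Scientist, so the coarser concept Person stays closer to Thing.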
[Figure (diagram): context vector generation; surviving labels: part of KG with instances x, i2, i4, i5, i6 and relation r6, neighbor context N(x), embeddings, aggregation with weights w_i / z(x), context vector c_x]
• Generating neighbor context N (x) for x based on the knowledge graph, and
picking up their embeddings.
• Aggregating the embeddings of the instances in N (x) as context vector of x,
denoted as cx .
We have tried several aggregation methods for constructing the context vector,
including addition, multiplication and simple concatenation. The experiments
show that addition is more effective than the others. We assume that not
all neighbors make the same contribution to the target instance. Intuitively, if
$i_1$ and $i_2$ are both neighbors of $x$, and $i_1$ is linked to more instances
than $i_2$, i.e., $|N(i_1)| > |N(i_2)|$, it is reasonable to suppose that $i_2$
makes a larger contribution to $x$ than $i_1$. Based on this point of view, we
define the addition operation for generating the context vector as:
$$\boldsymbol{c}_x = \sum_{i \in N(x)} \frac{w_i}{z(x)} \, \boldsymbol{i} \qquad (1)$$
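A minimal sketch of the weighted addition in Eq. 1 is given below. The definitions of $w_i$ and $z(x)$ are not reproduced in this excerpt, so the code assumes $w_i = 1/|N(i)|$ (neighbors with fewer links weigh more, matching the intuition stated above) and $z(x) = \sum_{i \in N(x)} w_i$ as a normalizer. Both choices are assumptions of this example, as are the instance names.

```python
def context_vector(x, neighbors, embeddings):
    """Weighted sum of neighbor embeddings following the form of Eq. 1.

    neighbors:  dict mapping each instance to its neighbor set N(.)
    embeddings: dict mapping each instance to a list of floats
    """
    # Assumed weight: inverse of the neighbor's own degree
    weights = {i: 1.0 / len(neighbors[i]) for i in neighbors[x]}
    z = sum(weights.values())  # assumed normalizer z(x)
    dim = len(next(iter(embeddings.values())))
    c_x = [0.0] * dim
    for i, w in weights.items():
        for k, value in enumerate(embeddings[i]):
            c_x[k] += w / z * value
    return c_x


N = {"x": {"i1", "i2"}, "i1": {"x", "a", "b", "c"}, "i2": {"x"}}
E = {"i1": [1.0, 0.0], "i2": [0.0, 1.0]}
c_x = context_vector("x", N, E)
```

Here i1 has four neighbors and i2 only one, so i2 dominates the context vector with a normalized weight of 0.8 against 0.2 for i1.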
[Figure (diagram): example of the hasFriend relation among the instances Alice, Mary, Bob, Matt and Era]
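$$[G_{\boldsymbol{c}_x}(\boldsymbol{c}^{(k)})]_t = [\boldsymbol{c}_x \star \boldsymbol{c}^{(k)}]_t = \sum_{i=0}^{d-1} [\boldsymbol{c}_x]_{(i+t) \bmod d} \cdot [\boldsymbol{c}^{(k)}]_i, \quad t = 0, 1, \cdots, d-1 \qquad (3)$$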
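The index pattern of Eq. 3 can be checked with a direct, quadratic-time sketch. It follows the formula exactly as written, i.e. entry $t$ is the dot product of $\boldsymbol{c}_x$ rotated left by $t$ positions with $\boldsymbol{c}^{(k)}$; the function name is chosen for this example and no FFT-based speed-up is attempted.

```python
def circular_op(c_x, c_k):
    """Evaluate Eq. 3: out[t] = sum_i c_x[(i + t) mod d] * c_k[i]."""
    d = len(c_x)
    if len(c_k) != d:
        raise ValueError("both vectors must have the same dimension d")
    return [
        sum(c_x[(i + t) % d] * c_k[i] for i in range(d))
        for t in range(d)
    ]


identity_like = circular_op([1, 2, 3], [1, 0, 0])  # picks c_x[t]
shifted = circular_op([1, 2, 3], [0, 1, 0])        # picks c_x[(t + 1) mod 3]
```

With a one-hot second argument the result is simply a rotation of the first vector, which makes the role of the modular index easy to see.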
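[Fig. 6 (diagram): rows $\boldsymbol{c}_x$, $\boldsymbol{c}_x \ll 1$, $\boldsymbol{c}_x \ll 2$, ..., $\boldsymbol{c}_x \ll n-1$ combined with $\boldsymbol{c}^{(k)}$ to give $G_{\boldsymbol{c}_x}(\boldsymbol{c}^{(k)})$]
Fig. 6. Circular convolution for prediction. Concepts are represented by squares and
instances are represented by circles. The deeper the red, the more fine-grained the
concept. $\star$ represents circular convolution. (Color figure online)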
A lower score indicates that the result of the prediction function is more
precise. We then adopt the margin-based loss function in Eq. 5 as the
optimization objective.
$$L_1 = \sum_{x \in I} \sum_{p \in P(x)} \sum_{k=0}^{m} [\gamma_1 + f_1(\xi_k) - f_1(\xi'_k)]_+ \qquad (5)$$
$$L_2 = \sum_{\tau \in S_l} \sum_{\tau' \in S'_l} [\gamma_2 + f_2(\tau) - f_2(\tau')]_+ \qquad (7)$$
$$L = L_1 + L_2 \qquad (8)$$
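The margin-based terms share the hinge form $[\gamma + f(\text{positive}) - f(\text{negative})]_+$, which the following sketch evaluates for already computed scores. How positive and negative samples are paired and how $f_1$ and $f_2$ are computed is outside this excerpt, so the score lists here are placeholder numbers and the function name is an assumption.

```python
def margin_loss(pos_scores, neg_scores, gamma):
    """Sum of hinge terms [gamma + f(pos) - f(neg)]_+ over paired samples."""
    return sum(max(0.0, gamma + p - n) for p, n in zip(pos_scores, neg_scores))


# Placeholder scores: lower is better, so a positive sample should score
# at least gamma below its negative counterpart to incur no loss.
L1 = margin_loss([0.5, 2.0], [3.0, 2.5], gamma=1.0)
L2 = margin_loss([0.2], [0.4], gamma=1.0)
L = L1 + L2  # overall objective in the form of Eq. 8
```

The first pair already satisfies the margin and contributes nothing; the other pairs contribute by the amount they violate it.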
4 Experiments
4.1 Datasets
Since most datasets used in previous works, such as FB15K [3] and WN18 [11],
mainly consist of either instances or concepts, they are not suitable for
evaluating our model. We adopt the datasets YAGO39K and M-YAGO39K proposed with
TransC [21] and move the subClassOf triples in the validation and test sets
to the training set. Compared to YAGO39K, the validation and test sets
of M-YAGO39K include new triples inferred from the existing training triples
based on the transitivity of isA relations. The detailed statistics of the
resulting YAGO39K and M-YAGO39K are shown in Table 2.
Link prediction is the task of predicting the missing head instance or tail
instance of an incomplete relational triple based on the trained embeddings.
For each test relational triple (h, r, t), we adopt the method proposed
in [23] and replace h and t, respectively, with all instances in I, and use the
scoring function in Eq. 6 to calculate the score of each restructured triple.
After ranking these restructured triples in ascending order of score, we obtain
the rank of (h, r, t).
Following most previous works, we adopt the mean reciprocal rank (MRR)
of all correct instances and the proportion of correct instances that rank
no larger than N (Hits@N) as the evaluation metrics. Note that a restructured
triple may already exist in the relational triples set $S_l$. In order to
eliminate the negative impact on evaluation caused by such false negative
triples, we adopt the filtering method proposed in TransE: the false negative
triples are filtered from the candidate triples before ranking. We call the
filtered results Filter and the unfiltered results Raw. Hits@N is reported on
the Filter results.
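The Raw and Filter settings and the two metrics can be sketched as follows. Candidate scores are passed in as a dictionary, lower scores rank first as in the ascending ordering described above, and ties are resolved optimistically; the function names and the toy scores are assumptions of this example.

```python
def rank_of(scores, correct, known_true=frozenset()):
    """Rank of the correct candidate; candidates in known_true are filtered out.

    scores:     dict mapping candidate instance -> score (lower is better)
    known_true: other candidates that also form true triples (Filter setting);
                leave empty for the Raw setting
    """
    target = scores[correct]
    better = sum(
        1 for cand, s in scores.items()
        if s < target and cand != correct and cand not in known_true
    )
    return 1 + better


def mrr(ranks):
    """Mean reciprocal rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at(ranks, n):
    """Proportion of ranks that are no larger than n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


scores = {"a": 0.1, "b": 0.2, "c": 0.3}
raw_rank = rank_of(scores, "c")                          # a and b rank ahead
filtered_rank = rank_of(scores, "c", known_true={"a"})   # a is a known true triple
```

Filtering removes candidate a from the comparison, so the correct instance c moves from rank 3 (Raw) to rank 2 (Filter).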
We use YAGO39K for training and evaluation, and train the model for 1000 rounds.
The validation set is used to select the learning rate η for SGD among
{0.1, 0.01, 0.001}, the dimension of embeddings d among {20, 50, 100}, the
number of neighbors N among {3, 5, 7}, and the margins γ1 and γ2 among
{0.1, 0.3, 0.5, 1, 2}. The optimal hyperparameters are η = 0.001, d = 100,
N = 5, γ1 = 1 and γ2 = 1. We choose the L2 distance to evaluate the difference
between the prediction and the reality in Eq. 4.
Table 3 shows the results, parts of which are referred from [21]. The cbow and
sg denote strategies of learning instanceOf triples and the subClassOf triples.
JECI outperforms the baselines in most cases, since embeddings of instances
are learned by incorporating the hierarchical concepts they belong to. Neighbor
information is also incorporated to help address the problem of complex relations.
In particular, the sg strategy performs better than the cbow strategy, since
Skip-Gram uses the target instance in turn to locate each instance in the
neighbor context, enabling the target instance to be encoded with more precise
semantic information from the neighbors.
maximizing the classification accuracy on the validation set. For a relational
triple (h, r, t), if the score calculated by Eq. 6 is lower than $\delta_r$, the
triple is classified as positive, otherwise as negative. For an instanceOf
triple $(x, r_e, c)$, we first generate $\boldsymbol{c}_x$, and then use it to
progressively predict which concepts $x$ belongs to. If there exists a path
containing $c$ on which each score calculated by Eq. 4 is lower than
$\delta_{r_c}$ or $\delta_{r_e}$, the triple is classified as positive,
otherwise as negative.
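Selecting a threshold by validation accuracy, and then classifying by comparing a score against it, can be sketched as below. The candidate thresholds (every observed score plus one value above the maximum) and the toy scores are choices made for this example; the source does not state how candidates are enumerated.

```python
def accuracy(scores, labels, delta):
    """Fraction of triples classified correctly when 'score < delta' means positive."""
    return sum((s < delta) == y for s, y in zip(scores, labels)) / len(scores)


def best_threshold(scores, labels):
    """Pick the candidate threshold with the highest validation accuracy."""
    # Assumed candidate set: each observed score, plus one value above all scores
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    return max(candidates, key=lambda d: accuracy(scores, labels, d))


valid_scores = [0.2, 0.4, 0.9, 1.5]
valid_labels = [True, True, False, False]
delta_r = best_threshold(valid_scores, valid_labels)
is_positive = 0.3 < delta_r  # classify a new triple with score 0.3
```

One such threshold would be selected per relation; the sketch shows the selection for a single relation only.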
We use the same method as in link prediction to select hyperparameters. The
optimal hyperparameters for YAGO39K are η = 0.001, d = 100, N = 5, γ1 = 0.1 and
γ2 = 1. The optimal hyperparameters for M-YAGO39K are η = 0.001, d = 100,
N = 5, γ1 = 0.3 and γ2 = 1. Tables 4 and 5 show the results for relational
triples and instanceOf triples, respectively.
4.4 Limitations
Experimental results demonstrate that JECI outperforms state-of-the-art models
in most cases. However, there exist some limitations.
• Neighbor information is incorporated into JECI, which helps solve the problem
of complex relations. However, it is only part of the information in knowledge
graphs. Moreover, when we extract the neighbors of an instance, we treat
different relations equally.
• In fact, most knowledge graphs change dynamically mainly among instances.
In other words, the concepts and the connections between the concepts are
almost constant over time. We construct the hierarchical tree before training
and assume that this structure will not change. Thus, JECI is not suitable
for the small share of knowledge graphs whose concepts change.
5 Conclusion
In this paper, we propose a novel knowledge graph embedding model called JECI.
JECI differentiates the concepts and the instances in the knowledge graph and
jointly embeds them in a low-dimensional space. It encodes the transitivity of
isA relations by progressively predicting the hierarchical concepts which an
instance belongs to, using circular convolution as the prediction function.
Furthermore, JECI takes advantage of the neighbor information of the instances
in the knowledge graph to address the problem of complex relations existing in
some knowledge graph embedding models, e.g., TransE and TransC. The
experimental results show that JECI improves the performance of link prediction
and triple classification in most cases, and in particular outperforms the
major baselines in handling the transitivity of isA relations.
In the future, we will explore the following directions to address the
limitations mentioned above: (1) Taking the differences between kinds of
relations into consideration when extracting the neighbors of an instance.
(2) Incorporating more information into our model to better solve the problem
of complex relations, such as multimodal information of the instances.
(3) Designing an incremental learning method based on our model to learn the
embeddings of unregistered instances and concepts. (4) Learning the structure
of the hierarchical tree dynamically, rather than constructing it directly.
(5) Constructing a new dataset, such as a product knowledge graph, and
evaluating our model by fine-grained entity typing [24], i.e., identifying the
types of a given instance at different granularities.
References
1. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: a survey of
approaches and applications. IEEE Trans. Knowl. Data Eng. 29(12), 2724–2743
(2017)
2. Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general
method for semi-supervised learning. In: Proceedings of the 48th Annual Meeting
of the Association for Computational Linguistics, pp. 384–394 (2010)
3. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating
embeddings for modeling multi-relational data. In: Advances in Neural Information
Processing Systems, pp. 2787–2795 (2013)
4. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by trans-
lating on hyperplanes. In: Proceedings of the 28th AAAI Conference on Artificial
Intelligence, pp. 1112–1119 (2014)
5. Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings
for knowledge graph completion. In: Proceedings of the 29th AAAI Conference on
Artificial Intelligence, pp. 2181–2187 (2015)
6. Ji, G., He, S., Xu, L., Liu, K., Zhao, J.: Knowledge graph embedding via dynamic
mapping matrix. In: Proceedings of the 53rd Annual Meeting of the Association
for Computational Linguistics, vol. 1, pp. 687–696 (2015)
7. Yang, B., Yih, W., He, X., Gao, J., Deng, L.: Embedding entities and relations
for learning and inference in knowledge bases. In: 3rd International Conference on
Learning Representations (2015)
8. Nickel, M., Rosasco, L., Poggio, T.A., et al.: Holographic embeddings of knowledge
graphs. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence, pp.
1955–1961 (2016)
9. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embed-
dings for simple link prediction. In: International Conference on Machine Learning,
pp. 2071–2080 (2016)
10. Socher, R., Chen, D., Manning, C.D., Ng, A.: Reasoning with neural tensor net-
works for knowledge base completion. In: Advances in Neural Information Process-
ing Systems, pp. 926–934 (2013)
11. Bordes, A., Glorot, X., Weston, J., Bengio, Y.: A semantic matching energy func-
tion for learning with multi-relational data. Mach. Learn. 94(2), 233–259 (2014)
12. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D knowledge
graph embeddings. In: Proceedings of the 32nd AAAI Conference on Artificial
Intelligence, pp. 1811–1818 (2018)
13. Guo, S., Wang, Q., Wang, B., Wang, L., Guo, L.: Semantically smooth knowledge
graph embedding. In: Proceedings of the 53rd Annual Meeting of the Association
for Computational Linguistics, pp. 84–94 (2015)
14. Xie, R., Liu, Z., Sun, M.: Representation learning of knowledge graphs with hierar-
chical types. In: Proceedings of the 25th International Joint Conference on Artificial
Intelligence, pp. 2965–2971 (2016)
15. Lin, Y., Liu, Z., Luan, H., Sun, M., Rao, S., Liu, S.: Modeling relation paths for
representation learning of knowledge bases. In: Proceedings of the 2015 Conference
on Empirical Methods in Natural Language Processing, pp. 705–714 (2015)
16. Zhong, H., Zhang, J., Wang, Z., Wan, H., Chen, Z.: Aligning knowledge and text
embeddings by entity descriptions. In: Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing, pp. 267–272 (2015)
17. Guo, S., Wang, Q., Wang, L., Wang, B., Guo, L.: Jointly embedding knowledge
graphs and logical rules. In: Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing, pp. 192–202 (2016)
18. Ding, B., Wang, Q., Wang, B., Guo, L.: Improving knowledge graph embedding
using simple constraints. In: Proceedings of the 56th Annual Meeting of the Asso-
ciation for Computational Linguistics, pp. 110–121 (2018)
19. Asprino, L., Basile, V., Ciancarini, P., Presutti, V.: Empirical analysis of founda-
tional distinctions in linked open data. In: Proceedings of the 27th International
Joint Conference on Artificial Intelligence, pp. 3962–3969 (2018)
20. Miller, G.: WordNet: an on-line lexical database. special issue of the international.
J. Lexicogr. 3(4) (1990)
21. Lv, X., Hou, L., Li, J., Liu, Z.: Differentiating concepts and instances for knowledge
graph embedding. In: Proceedings of the 2018 Conference on Empirical Methods
in Natural Language Processing, pp. 1971–1979 (2018)
22. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. In: 1st International Conference on Learning Represen-
tations (2013)
23. Bordes, A., Weston, J., Collobert, R., Bengio, Y.: Learning structured embeddings
of knowledge bases. In: Proceedings of the 25th AAAI Conference on Artificial
Intelligence, pp. 301–306 (2011)
24. Ling, X., Weld, D.S.: Fine-grained entity recognition. In: Proceedings of the 26th
AAAI Conference on Artificial Intelligence, pp. 94–100 (2012)
Enhanced Entity Mention Recognition
and Disambiguation Technologies for
Chinese Knowledge Base Q&A
Gang Wu1,2(B), Wenfang Wu1, Hangxu Ji1, Xianxian Hou1, and Li Xia1
1 School of Computer Science and Engineering, Northeastern University,
Shenyang 110004, China
[email protected]
2 State Key Laboratory for Novel Software Technology, Nanjing University,
Nanjing 210023, China
1 Introduction
Entity linking (EL) [4], which serves as the underlying technology of Chinese
KBQA, is the process of linking the entity mentions in a text to the
corresponding entities in the knowledge base. It still faces many challenges.
The first important challenge facing Chinese EL is the complexity of Chinese
expression and the lack of contextual information due to short text. Most
existing EL work focuses on long text in English; for example, the traditional
named entity recognition method BiLSTM-CRF [9] has achieved creditable
results in English named entity recognition. However, in the case of short
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 99–115, 2020.
https://doi.org/10.1007/978-3-030-41407-8_7
text which lacks context, the effective use of contextual information is of
vital importance. Another challenge is that a mention can usually refer to
multiple entities in the KB. Entity disambiguation is required to address this
issue. Entity disambiguation is often treated as a ranking problem. For example,
Zheng et al. [16] realized entity disambiguation based on pairwise and listwise
Learning to Rank (L2R) methods, respectively. However, existing methods often
focus on information at the lexical level or at a shallow semantic level. In
response to the above problems, our approach is presented from two aspects,
namely mention recognition and entity disambiguation.
Topic Entity Mention Recognition Module Based on Sequence Annotation.
In view of the above challenge posed by short text, this paper combines various
features with the word embedding to construct a feature vector. We also propose
an improved dedicated sequence labeling model, which obtains the single topic
entity mention of the question. When the BiGRU model is used for sequence
labeling, labels are not assigned to every word of the topic entity mention;
instead, the words before and after the mention are specially labeled to mark
the beginning and end of the mention. Experiments verify that the proposed
algorithm overcomes the influence of the sparse features of the entity
mention's composition in the existing model, and makes the model focus on
learning the contextual information of the topic entity.
Entity Disambiguation Based on Extended Information Similarity
Calculation. For the second challenge, entity disambiguation calculates
information similarity from multiple perspectives. In this paper, the
similarity calculation is realized as the similarity between the user’s
question and the questions related to the candidate entity. To make full use of
the contextual information of the short text, the similarity is calculated at
the lexical and semantic levels respectively, and is then combined with the
popularity of the entity. To obtain deep semantic information, convolutional
neural networks (CNN) are applied. The superiority of the proposed method is
shown by experimental comparison with the classical similarity calculation
method based on the average vector.
The main contributions of this study are: (i) The training speed is improved
by replacing the traditional LSTM model with a simpler GRU model. (ii) In order
to describe the possibility that the words are included in the topic entity mention
from different perspectives, we extract a series of features, such as part of speech,
and then combine these features with the word embedding to construct the
feature vector. (iii) In the aspect of entity disambiguation, we not only expand
the information of the candidate entities, but also take the lexical level similarity,
the semantic level similarity, and the entity popularity into consideration to
make the most out of the contextual information. (iv) This paper applies CNN
to obtain deep semantic information.
2 Related Work
In this section, work related to our study is discussed, including research on
mention recognition and on entity disambiguation.
triad in the question. This process is called the recognition of the topic entity.
In this paper, we consider the topic entity mention recognition as a sequence
labeling problem, and ensure a high recall rate of entity mention recognition by
considering all the n-grams in the question.
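Enumerating all n-grams of a segmented question is straightforward; the sketch
below shows one way to do it. The English tokens in the usage note are stand-ins
for the segmented Chinese question, and the function name and the optional
`max_n` cap are assumptions made for illustration.

```python
def all_ngrams(tokens, max_n=None):
    """Return every contiguous n-gram of `tokens` as a tuple.

    If `max_n` is None, n ranges from 1 up to the full length of the question.
    """
    limit = len(tokens) if max_n is None else min(max_n, len(tokens))
    return [
        tuple(tokens[start:start + n])
        for n in range(1, limit + 1)
        for start in range(len(tokens) - n + 1)
    ]
```

For a question of L tokens this produces L(L + 1)/2 candidates, e.g. 15 for the
five tokens `["Where", "was", "Yun", "Ma", "born"]`, which is why considering
all n-grams gives high recall at the cost of many spurious candidates.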
To describe, from different perspectives, the possibility that a word is part
of the topic entity, this paper extracts a series of features and concatenates
them with the word embedding to form the feature vector. These vectors are
used as the input of the BiGRU-CRF model, and the topic entity mention of
the question is obtained by labeling. The details of the model are presented in
Sect. 3.1.
The process of the topic entity mention recognition algorithm based on the
sequence annotation model is shown in Fig. 1. The annotation set of the model
is defined as: B, the position immediately before the topic entity mention,
which marks the beginning of the mention; E, the position immediately after the
topic entity mention, which marks the end of the mention; O, all other
positions. The question is preprocessed before the task begins, which includes
word segmentation, part-of-speech tagging of each word, named entity
recognition, dependency parsing and semantic role labeling.
Firstly, the input of the model is constructed. There may be multiple candidate
entity mentions in a sentence; for example, in the question “(Where was Yun Ma
of Alibaba born)”, both “Alibaba”1 and “Yun Ma”2 may be taken as the topic
entity. Therefore, this paper extracts some features by preprocessing and
concatenates them with the word embedding to eliminate such interference. The
selected features are detailed in Sect. 3.2. Secondly, the feature vector is
input into the BiGRU model, which is used to construct the sequence annotation
model. The GRU model has a strong ability to learn long-term dependencies
between word sequences. Compared with LSTM [2], GRU has a simpler structure,
fewer parameters, and a faster training speed. The model uses two GRU layers
running in opposite directions, one starting from the beginning and one from
the end of the sequence, and thus learns the forward and backward contextual
features. The hidden states output by the two GRU layers are concatenated as
the output of the BiGRU network.
Thirdly, since the topic entity of the question is composed of consecutive
text fragments, and there is a strong dependence between adjacent words in
the sentence, this paper adds a CRF [3] layer after the BiGRU. The CRF layer
can automatically learn constraint rules from the training data and use
these rules to ensure that the labels output by the BiGRU are legal.
Consequently, the CRF layer predicts the globally optimal label sequence. As
shown in Fig. 1, for the input “<b> <e>”, where “<b>” and “<e>” mark the
beginning and end of the sentence, respectively, the output label
sequence is “OOBEOOOO”.
Finally, the annotation sequence is used to obtain the topic entity mention of
the sentence.
The algorithm is described formally as follows. Let W be the set of all
n-grams constructed from the segmented question, and define p(w|q) as the
probability that an n-gram w is the topic entity of the question q. The
objective function can be defined as:
where H(q) is the output score matrix of the BiGRU, H(q)i,j represents the
score of the j-th label for the i-th word in the question, T represents the
transition matrix of the CRF, and Ti,j represents the transition score from
label i to label j.

1 Alibaba Co.
2 Yun Ma is the creator of Alibaba Co.

Softmax is used to normalize over all possible annotation sequences for the
input:
$$p(y \mid q) = \frac{e^{S(q, y)}}{\sum_{\bar{y} \in Y_q} e^{S(q, \bar{y})}} \tag{3}$$
Yq represents the set of all possible label sequences for the word sequence
of the input question q. The objective is thus to maximize the probability
p(y|q) of the correct label sequence. During training, maximum likelihood
estimation (MLE) is used to maximize the log-probability of the annotated
sequence:
$$\log p(y \mid q) = S(q, y) - \log \Big( \sum_{\bar{y} \in Y_q} e^{S(q, \bar{y})} \Big) \tag{4}$$
Finally, the model outputs the highest-scoring label sequence:
$$y^{*} = \arg\max_{\bar{y} \in Y_q} S(q, \bar{y})$$
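The normalization, log-likelihood and decoding steps can be sketched on a toy
example. Because the definition of S(q, y) is not reproduced in this text, the
sketch assumes the standard linear-chain form, i.e. the sum of the emission
scores H(q)i,yi and the transition scores between consecutive labels; that
assumption, the integer label encoding and the function names are illustrative
only. The normalizer is computed by brute-force enumeration, which is viable
only for very short sequences.

```python
import math
from itertools import product


def score(H, T, y):
    """Assumed S(q, y): emission scores plus transition scores along y."""
    emit = sum(H[i][y[i]] for i in range(len(y)))
    trans = sum(T[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return emit + trans


def log_prob(H, T, y):
    """Eq. (4): log p(y|q) = S(q, y) - log(sum over all sequences of e^S)."""
    n_labels = len(T)
    scores = [score(H, T, cand) for cand in product(range(n_labels), repeat=len(H))]
    top = max(scores)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return score(H, T, y) - log_z


def viterbi(H, T):
    """Return the highest-scoring label sequence by dynamic programming."""
    n_labels = len(T)
    best = list(H[0])          # best score of any path ending in each label
    back = []                  # back-pointers, one list per position after the first
    for i in range(1, len(H)):
        step, pointers = [], []
        for j in range(n_labels):
            k = max(range(n_labels), key=lambda p: best[p] + T[p][j])
            step.append(best[k] + T[k][j] + H[i][j])
            pointers.append(k)
        best = step
        back.append(pointers)
    path = [max(range(n_labels), key=lambda j: best[j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Under this assumed score, `viterbi` returns the same sequence that an
exhaustive search over Yq would select, but in time linear in the question
length instead of exponential.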
Part of Speech. Since the topic entity of a question is always nominal, part
of speech can be added to the model as an important feature to avoid
interference from other words or phrases. In this paper, the part-of-speech
tag of each word is mapped to a real-valued vector of dimension dpos .
The Semantic Role of the Word. In most instances, the topic entity mention
in the question corresponds to only a single semantic role or part of a
semantic role. The semantic role labels of the words in the preprocessing
results are therefore mapped to real-valued vectors of dimension dsrl .
4 Entity Disambiguation
Entity disambiguation refers to the process of generating a set of candidate
entities that a given mention may link to and, by using the available
information, finding the entity that the mention most likely links to in the
current context. In this paper, entity disambiguation is regarded as a ranking
problem. According to the characteristics of Chinese questions, an entity
disambiguation algorithm based on extended information similarity calculation
is proposed, which transforms the similarity calculation between the topic
entity and a candidate entity into the similarity calculation of their extended
information.
first three relevant pages with the candidate entity’s name as the search
keyword. This set of questions can be seen as extended information for the
candidate entity. For instance, for the candidate entity “Smartisan”4 in the
candidate entity set, searching Baidu Knows with its name as the keyword
returns several questions such as “(What is the business philosophy of the
Smartisan?)”. A certain number of questions are selected to form a question
set, which is regarded as the extended information of the candidate entity.
Information Filtering. Firstly, the user’s question and the questions in the
candidate question set are segmented, and the stop words are filtered out. The
Jaccard coefficient [6] is then used to calculate the literal similarity
between the user’s question and each question in the candidate question set.
Finally, given a threshold, the candidate questions whose similarity is lower
than or equal to the threshold are filtered out.
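The filtering step can be sketched as follows. The sketch treats each segmented
question as a set of tokens and uses the Jaccard coefficient |A ∩ B| / |A ∪ B|
as the literal similarity; the function names, the English stand-in tokens and
the example threshold are assumptions for illustration, and segmentation is
taken as already done.

```python
def jaccard(a, b):
    """Jaccard coefficient of two token collections, in [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_candidates(question, candidates, stop_words, threshold):
    """Keep candidate questions whose similarity to `question` exceeds `threshold`.

    `question` and each candidate are lists of tokens (already segmented).
    Stop words are removed from both sides before the comparison.
    """
    def content(tokens):
        return [tok for tok in tokens if tok not in stop_words]

    query = content(question)
    return [cand for cand in candidates if jaccard(query, content(cand)) > threshold]
```

Candidates with similarity exactly equal to the threshold are dropped, matching
the "lower than or equal to" rule stated above.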
The Similarity Calculation at the Lexical Level. For the topic entity
mention and a candidate entity, the first consideration when calculating the
similarity between the two is the lexical feature. The closer the two are
literally, the more synonyms they share, that is, the higher the similarity
between the topic entity and the candidate entity will be (detailed in
Sect. 4.2).
The Similarity Calculation at the Semantic Level. In Chinese natural
language, subtle linguistic differences often lead to large semantic
differences. Therefore, in addition to the similarity calculation at the
lexical level, it is necessary to pay attention to the deep semantic features
of the text. For example, the question “(Who is the actress in the leading
role of Paprika?)” and the question “(Who plays the leading role Paprika?)”
have a high degree of similarity at the lexical level, but their semantics are
very different, and there is no correlation between the topic entities. In the
process of semantic similarity calculation, this paper uses a CNN to represent
the whole sentence as a vector (detailed in Sect. 4.2).
Extract the Entity Popularity Characteristics. In the candidate entity
ranking process, in addition to the context similarity features, the prior
information of the candidate entity is also crucial to the ranking.
Entity popularity refers to the possibility that an entity is mentioned in a
question. In this paper, the entity popularity feature is defined as the
ranking of each candidate entity in the “mention-entity” mapping.
For example, for the entity mention “(Dumplings)”, the corresponding set of
candidate entities can be obtained through the “mention-entity” mapping file,
as shown in Fig. 3, and the ranking is obtained from corpus statistics. When no
other information is available, the popularity implies that the mention is more
likely to link to “(Dumplings (Chinese traditional food))” than to entities
such as “(Dumplings (characters in “Dragon Ball”))” or “(Dumplings (Li Bihua’s
short story collection))”.

4 Smartisan Technology Co., Ltd., commonly known as Smartisan, is a Chinese
multinational technology company headquartered in Beijing and Chengdu.
Candidate Score Calculation. Using the RankNet L2R method, the lexical
similarity, the semantic similarity and the entity popularity feature are integrated
to output the score of the candidate entity.
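The pairwise idea behind RankNet can be sketched in a few lines. RankNet models
the probability that candidate i should rank above candidate j as the sigmoid
of the score difference and minimizes the corresponding cross-entropy. The
sketch below makes strong simplifying assumptions that are not taken from the
paper: the scorer is a plain linear function of the three features (lexical
similarity, semantic similarity, popularity) rather than a neural network, and
the feature values in the usage note are invented for illustration.

```python
import math


def score(features, weights):
    """Linear scorer over [lexical_sim, semantic_sim, popularity] (an assumption)."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_loss(s_pos, s_neg):
    """RankNet cross-entropy when the first candidate should rank higher.

    Equals log(1 + exp(-(s_pos - s_neg))), written in a numerically stable form.
    """
    d = s_pos - s_neg
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def sgd_step(weights, f_pos, f_neg, lr=0.1):
    """One gradient step on a single (correct, incorrect) candidate pair."""
    d = score(f_pos, weights) - score(f_neg, weights)
    grad_d = -1.0 / (1.0 + math.exp(d))  # derivative of the loss w.r.t. d
    return [w - lr * grad_d * (xp - xn) for w, xp, xn in zip(weights, f_pos, f_neg)]
```

Starting from zero weights, repeatedly calling `sgd_step` on a pair such as
`[0.8, 0.9, 1.0]` (correct entity) and `[0.3, 0.2, 0.5]` (incorrect entity)
lowers the loss from log 2 and pushes the correct candidate's score above the
other's.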
candidate question corresponding to the candidate entity, and the space vector
is Vci , and the similarity between Vs and Vci can be represented by the
cosine similarity (Eq. 7).
$$\cos(V_s, V_{c_i}) = \frac{V_s \cdot V_{c_i}}{|V_s|\,|V_{c_i}|} \tag{7}$$
Then, the similarity between the topic entity mention (m) and the candidate
entity (c) can be calculated as the average of the similarities between the
user’s question and the K candidate questions (Eq. 8).
$$sim(m, c) = \frac{1}{K} \sum_{i=1}^{K} \cos(V_s, V_{c_i}) \tag{8}$$
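Eqs. 7 and 8 can be illustrated with a small sketch. How the space vectors are
built is not reproduced in this text, so the sketch assumes simple bag-of-words
term-count vectors; that choice and the function names are assumptions for
illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Eq. (7): cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def lexical_similarity(question, candidate_questions):
    """Eq. (8): mean cosine between the user question and K candidate questions.

    Each question is a list of tokens; vectors are assumed to be term counts.
    """
    if not candidate_questions:
        return 0.0
    v_s = Counter(question)
    sims = [cosine(v_s, Counter(cand)) for cand in candidate_questions]
    return sum(sims) / len(sims)
```

A candidate entity whose related questions share most of their vocabulary with
the user's question therefore receives a lexical similarity close to 1, and one
with no shared words receives 0.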
Similarity Based on Semantic Level. Two parallel CNN [15] models are
used to learn the semantic vector representations of the user’s question and
of the questions related to the candidate entity, respectively, and the
similarity between the two is calculated (see Fig. 4).
The convolutional layer obtains the feature map vector by applying a
convolution operation and an activation function to the word embedding matrix.
Two convolution kernels with different window sizes, 1 and 2, are used in the
convolutional layer to extract local features of different granularities and
maximize information utilization. ReLU is used as the activation function after
the convolutional layer.
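The convolution step can be made concrete with a minimal sketch in plain
Python. It slides one kernel of a given window size over the word embedding
matrix and applies ReLU; the kernel values, the bias term and the function name
are assumptions for illustration, and a real model would learn many kernels per
window size.

```python
def conv1d_relu(embeddings, kernel, bias=0.0):
    """Convolve a (window x dim) kernel over a (length x dim) embedding matrix.

    Returns one ReLU-activated value per window position, i.e. a feature map
    of length `len(embeddings) - window + 1`.
    """
    window = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - window + 1):
        z = bias
        for k in range(window):
            for d in range(len(kernel[k])):
                z += kernel[k][d] * embeddings[start + k][d]
        feature_map.append(max(0.0, z))  # ReLU
    return feature_map
```

A kernel with window size 1 looks at single words and yields one value per
word, while a kernel with window size 2 looks at adjacent word pairs and yields
one value fewer; this difference in output length is what the subsequent
pooling layer has to absorb.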
fixed length output. This paper uses an improved pooling technique called
K-Max mean pooling, which combines the max-pooling and the mean-pooling
methods to reduce the influence of noise while retaining the word order
information and the more important features in the sentence. The K largest
values in each input feature map vector are selected and their average is taken
as the sampling result, so that a fixed-length one-dimensional vector is output.
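K-Max mean pooling as described above reduces each feature map to a single
number, so the output length depends only on the number of feature maps. A
minimal sketch follows; the function names are illustrative, and the behavior
for a feature map shorter than K (averaging whatever values exist) is an
assumption.

```python
def k_max_mean_pool(feature_map, k):
    """Average of the k largest values of one feature map."""
    if not feature_map:
        return 0.0
    top = sorted(feature_map, reverse=True)[:k]
    return sum(top) / len(top)


def pool_all(feature_maps, k):
    """Apply K-Max mean pooling to every feature map; one value per map."""
    return [k_max_mean_pool(fm, k) for fm in feature_maps]
```

With k = 1 this reduces to ordinary max-pooling, and with k equal to the length
of the feature map it reduces to mean-pooling, which is the sense in which the
technique combines the two.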
The output of the pooling operation passes through a Dropout layer. Dropout
sets each feature extracted by the pooling layer to 0 with a certain
probability, which avoids the over-fitting caused by excessive dependence of
the model on certain features. The main features extracted are nonlinearly
recombined by a multi-layer perceptron to obtain semantic vector
representations of the same length for the two input questions, and the
semantic similarity is then represented by their cosine similarity. Finally,
the semantic similarity is normalized by the Softmax layer.
5 Experiments
5.1 Data Sets
This paper uses the knowledge base files, training data and test data provided
by the open-domain Chinese Q&A evaluation task of CCKS 2018. The task uses
PKU BASE as the specified knowledge graph. The file
“pkubase-mention2ent.txt” in PKU BASE assists entity linking; it describes
the mapping between mentions and entities in the knowledge base in the format
“mention\t candidate entity\t ranking of candidate entities”.
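A mapping file in this three-column format can be loaded into a dictionary from
each mention to its ranked candidates. The sketch below is an illustration
under assumptions: fields are separated by tab characters, the ranking is an
integer where a smaller number means a more popular entity, and the sample rows
in the usage note are invented rather than taken from PKU BASE.

```python
def parse_mention_map(lines):
    """Build {mention: [candidate entities ordered by ranking]} from text lines.

    Each line is expected to hold three tab-separated fields:
    mention, candidate entity, ranking. Malformed lines are skipped.
    """
    table = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        mention, entity, rank = (part.strip() for part in parts)
        table.setdefault(mention, []).append((int(rank), entity))
    return {
        mention: [entity for _, entity in sorted(candidates)]
        for mention, candidates in table.items()
    }
```

For example, the hypothetical rows `"Dumplings\tDumplings (Dragon Ball)\t2"` and
`"Dumplings\tDumplings (food)\t1"` would yield
`{"Dumplings": ["Dumplings (food)", "Dumplings (Dragon Ball)"]}`, and the
position of an entity in that list can then serve as its popularity feature.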
The training set and validation set include 1200 and 400 labeled questions,
respectively. The test set consists of about 400 questions without annotated
results. The questions in the data set are all of the single-fact type, that
is, each question can be answered through a single triple in the KB, and the
answer to the question is an entity or attribute in the KB.
$$F_1 = \frac{(\alpha^2 + 1)\, P \cdot R}{\alpha^2 (P + R)} \tag{9}$$
where R is the recall and P is the precision. Both values are between 0 and 1.
Generally speaking, the two are correlated, but sometimes they conflict; in
this case, the F1 value needs to be considered. A higher F1 value indicates
better experimental results. The balance factor α usually has a value of 1 in
the absence of other conditions or assumptions, indicating that recall is as
important as precision.
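With α = 1, Eq. 9 reduces to the harmonic mean of precision and recall,
F1 = 2PR / (P + R). The short sketch below computes it and can be checked
against the precision and recall values reported for the two baselines in the
results table that follows; the function name is illustrative.

```python
def f1(precision, recall):
    """F1 for alpha = 1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Applied to the reported percentages, P = 77.3 and R = 80.9 give an F1 of about
79.1, and P = 83.8 and R = 85.2 give about 84.5, in agreement with the table.
When precision and recall are equal, F1 equals that common value.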
MR method          R
N-Gram             94.8
N-Gram+            92.9
Topic Entity MR    94.9

MR method          R      P      F1
HIT NER            80.9   77.3   79.1
LSTM-CRF           85.2   83.8   84.5
Topic Entity MR    94.9   94.9   94.9
The above experimental results show that: (i) HIT’s tools cover a limited
number of entity categories. In open-domain KBQA, they do not cover all entity
categories, resulting in poor results. (ii) Due to the complexity and
diversity of Chinese natural language expression, the traditional LSTM-CRF
model fails to achieve the expected results. (iii) Unlike the above two
baseline systems, the topic entity mention recognition algorithm produces only
one mention for each question while ensuring a high recall rate, so its recall,
precision and F1 are equal; the same holds in the following experiments. The
algorithm proposed in this paper shows a clear advantage over the two baselines.
This paper also incorporates a variety of features to describe, from different
dimensions, the possibility that a word belongs to the topic entity. In
addition, we compare the effects of the proposed features on the algorithm
(see Table 3). The analysis shows that the semantic role feature, which builds
on part-of-speech tagging and dependency parsing, yields the most obvious
improvement in topic entity mention recognition.
Features R/P/F1
None 89.7
Whether the Word is a Named Entity 91.8
IDF 90.3
The Dependency Parsing Node of the Word 92.1
Part of Speech 91.6
The Semantic Role of the Word 93.7
the experimental results, in order to eliminate the influence of the topic
entity mention recognition module on entity disambiguation, this paper only
performs entity disambiguation on correctly identified topic entities.
Therefore, the recall, precision and F1 of the experimental results are equal.
The analysis of the experimental results shows that the quality of EL directly
affects the performance of the final KBQA system. The semantic-based similarity
method uses a CNN to obtain more contextual semantic information, and thus
performs better than the lexical-based similarity measure. Compared
with the baseline, the topic entity linking method proposed in this paper takes
6 Conclusion
Entity linking is a key step in KBQA. Current entity linking mainly consists of
two parts: mention recognition and entity disambiguation. For the topic
entity recognition part, this paper uses the BiGRU-CRF model for sequence
labeling, and concatenates a series of extracted features with the word
embedding to construct feature vectors and ensure a high recall rate. For the
entity disambiguation part, this paper proposes an entity disambiguation
algorithm based on a similarity calculation with extended information. The
algorithm not only considers the literal similarity feature, but also
calculates the deep semantic similarity based on a CNN, making full use of the
contextual semantic information in the short text question.
Although the entity linking method proposed in this paper has achieved
satisfying results in the knowledge Q&A system, there are still some
shortcomings. In practice, answering a question may in many cases require more
than two triples in the knowledge base. The indirect relationship between the
entities of different triples then has to be inferred through reasoning; such
questions are called complex questions. The topic entity linking technology
proposed in this paper can theoretically be extended to topic entity linking
for complex questions, on which further work can be carried out.
References
1. Basile, P., Caputo, A.: Entity linking for tweets. Encycl. Seman. Comput. Rob.
Intell. 01(01), 1630020 (2017)
2. Graves, A.: Supervised Sequence Labelling with Recurrent Neural Networks. Stud-
ies in Computational Intelligence, vol. 385. Springer, Heidelberg (2008). https://
doi.org/10.1007/978-3-642-24797-2
3. Gutmann, B., Kersting, K.: TildeCRF: conditional random fields for logical
sequences. In: European Conference on Machine Learning (2006)
4. Hachey, B., Radford, W., Nothman, J., Honnibal, M., Curran, J.R.: Evaluating
entity linking with wikipedia. Artif. Intell. 194(3), 130–150 (2013)
5. Han, X., Le, S., Zhao, J.: Collective entity linking in web text: a graph-based
method (2011)
Sheng Bi1, Xiya Cheng1, Jiamin Chen1, Guilin Qi1(B), Meng Wang1,
Youyong Zhou2, and Lusheng Wang2
1 School of Computer Science and Engineering, Southeast University, Nanjing, China
{bisheng,chengxiya,cjm,gqi,meng.wang}@seu.edu.cn
2 School of Law, Southeast University, Nanjing, China
[email protected], [email protected]
1 Introduction
Dispute Generation (DG) deals with the problem of automatically generating
disputes from the materials of a legal case, and plays an important role in
judicial decisions. There is no formal definition of a dispute in law, but we
can consider DG as the task of generating a problem which is disputed in a
case, as shown in Fig. 1. Valid DG not only improves the efficiency of the
court hearing but also provides convenience for mediation between the parties.
Moreover, since the dispute is one component of a legal document, DG
contributes to the writing of legal documents.
As far as we know, no explicit attempts have so far been made to
automatically generate disputes from a case document. However, DG is part
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 116–129, 2020.
https://doi.org/10.1007/978-3-030-41407-8_8
Dispute Generation in Law Documents via Joint Context 117
(1) We are the first to formulate the DG task as a text generation problem and
release a real-world dataset for this task.
(2) We propose a Seq2Seq model with two dispute detection modules. The
context-level detection module is applied to improve generation accuracy,
and the topic-level detection module is used to obtain the overlapping topic
distribution and generate all the disputes.
(3) We conduct extensive experiments on the dataset. The results show that our
model significantly outperforms state-of-the-art models on the same dataset
and improves the accuracy and coverage of the generated disputes.
2 Related Work
Our work is mainly relevant to previous legal-related work and recent studies on
Seq2Seq text generation.
3 Problem Formulation
Dispute generation is formulated as follows: given a PA X_p and a DA X_d, our
goal is to output a dispute context Y. Here, the PA X_p = (x_{p,1}, x_{p,2}, ..., x_{p,r}),
the DA X_d = (x_{d,1}, x_{d,2}, ..., x_{d,s}), and the generated disputes
Y = (y_1, y_2, ..., y_l) are word sequences. Using a hierarchical attention network,
we obtain sentence vectors S_p and S_d from the PA and the DA, respectively. To
keep the model description succinct, Fig. 3 omits how the sentence vectors are
obtained and uses S_p and S_d directly as inputs. We additionally use the
topic distribution of sentences in the PA, T_p = (t_{p,1}, t_{p,2}, ..., t_{p,m}), and the
topic distribution of sentences in the DA, T_d = (t_{d,1}, t_{d,2}, ..., t_{d,n}), as extra
inputs to detect disputes better. The model maximizes the generation probability
of Y conditioned on X_p, X_d, T_p, and T_d.
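By the chain rule of probability, this conditional generation probability can be
written as a product over the output words, which is the form a Seq2Seq decoder
typically models one step at a time. The shorthand y_{<t} for the previously
generated words is introduced here for illustration and does not appear in the
original formulation.

```latex
P(Y \mid X_p, X_d, T_p, T_d) = \prod_{t=1}^{l} P(y_t \mid y_{<t}, X_p, X_d, T_p, T_d)
```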
4 Model
Fig. 2. The overall framework of our model with two dispute detection modules, a
context-level detection module, and a topic-level detection module. Solid arrows
represent the generation process.
In this section, we present our model, whose overall architecture is shown in
Fig. 2. Drawing inspiration from recent work on neural machine translation, we
modify the successful attentional Seq2Seq model [15] by adding context-level and
topic-level dispute detection modules. As shown in Fig. 3, our model consists of
three parts: context-level detection, topic-level detection, and a word sequence
decoder. In context-level detection, we use a hierarchical attention network to
detect sentences in the PA and the DA, and then combine the two sentence
attentions into a joint context attention to match the correct disputes.
Topic-level detection obtains the overlapping topic distribution between
sentences in the PA and the DA and helps detect the correct disputes. We
calculate a joint topic attention by combining the two topic attentions computed
from the topic distributions of the PA and the DA, which are obtained from a
pre-trained LDA model¹. Finally, we concatenate the joint context attention and
the joint topic attention to influence the generation of words in the word
sequence decoder.
Our proposed model is novel in two ways. First, to capture disputes, we design a
context-level detection module and a topic-level detection module to handle the
literal-level and semantic-level challenges described in the Introduction.
Second, we design a novel joint attention mechanism for decoding disputes.
We describe the details of the different components in the following sections.
Fig. 3. Our novel Seq2Seq model with two dispute detection modules, a context-level
detection module, and a topic-level detection module. Given the input of a plaintiff
allegation and a defendant argument, joint context attention and joint topic attention
are used to guide the generation of disputes.
words by summarizing information from both directions for words, and therefore
incorporate the contextual information in the annotation.
The encoder RNN calculates the hidden state at time t by
However, not all words contribute equally to the representation of the sentence
meaning. To tackle this issue, we introduce an attention mechanism that extracts
the words that are important to the meaning of the sentence and aggregates the
representations of those informative words into a sentence vector. The
attention weight is computed as follows:
α_{p,it} through a softmax function. Then we compute the sentence vector g_{p,i} as
a weighted sum of the word annotations based on these weights. The sentence
vectors of the DA encoder are obtained in the same way.
Given the sentence vectors g_{p,i}, a document vector can be obtained
similarly. We use a bidirectional GRU to encode the sentences, and the hidden
state at time t is calculated as:
h_{p,i} = GRU(g_{p,i}, h_{p,i−1}), i ∈ [1, m] (5)
To reward sentences that are clues for generating disputes correctly, we
again use an attention mechanism. The attention weight is computed as
$$\alpha_{p,jb} = \frac{\exp(\eta(s_{j-1}, h_{p,b}))}{\sum_{i} \exp(\eta(s_{j-1}, h_{p,i}))} \qquad (6)$$
where s_{j−1} is the (j−1)-th hidden state of the decoder, h_{p,b} is the b-th hidden
state of the sentence encoder, and η is usually implemented as a multi-layer
perceptron (MLP) with tanh as the activation function. To get the joint context
attention, we combine the sentence attention of the PA, α_{p,jb}, and that of the
DA, α_{d,jb}, into an overall context attention β_{jb}:
$$\beta_{jb} = \alpha_{p,jb} \cdot W_b \cdot \alpha_{d,jb} \qquad (7)$$
where W_b is a matrix.
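The following is a minimal sketch of Eqs. (6) and (7), not the authors'
implementation. Several points are assumptions made for illustration: a plain dot
product stands in for the MLP η, the function names are hypothetical, and
Eq. (7) is read as an elementwise product of the PA attention with the DA
attention after it is mapped by the matrix, since the shape of that matrix is not
stated in the text.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sentence_attention(decoder_state, sentence_states, score=dot):
    """Eq. (6): softmax over score(s_{j-1}, h_i) for all sentence states h_i.
    `score` is an assumed stand-in for the MLP eta."""
    return softmax([score(decoder_state, h) for h in sentence_states])

def joint_context_attention(alpha_p, alpha_d, w_b):
    """One possible reading of Eq. (7): beta_b = alpha_p[b] * (W_b alpha_d)[b]."""
    projected = [dot(row, alpha_d) for row in w_b]  # maps DA weights to PA positions
    return [a * p for a, p in zip(alpha_p, projected)]

# Toy example: 2 PA sentences, 3 DA sentences, hidden size 2.
s_prev = [1.0, 0.0]
h_p = [[1.0, 0.0], [0.0, 1.0]]
h_d = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]]
alpha_p = sentence_attention(s_prev, h_p)
alpha_d = sentence_attention(s_prev, h_d)
w_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical 2x3 matrix
beta = joint_context_attention(alpha_p, alpha_d, w_b)
```

Each attention vector sums to one, and the PA sentence whose state is most
similar to the decoder state receives the largest weight.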
where hp,m is the final hidden state of the input message, which is used to weaken
the effect of topics that are irrelevant to the input message in generation and
highlight the importance of relevant topics. ηo is a multilayer perceptron. The
topic attention of a DA md,jq can be obtained in the same way.
Different from the topic-aware Seq2Seq model, we combine the topic attention of
the PA, m_{p,jq}, and the topic attention of the DA, m_{d,jq}, to get an overall
topic attention α_{jq}. This is particularly useful for finding similar topics
between a PA and a DA and helps generate disputes more accurately. The overall
topic attention is computed by
$$\alpha_{jq} = m_{p,jq} \cdot W_q \cdot m_{d,jq} \qquad (9)$$
where W_q is a matrix.
Then we concatenate the joint context attention β_{jb} and the joint topic
attention α_{jq} to obtain the context vector c_j:
$$c_j = \sum_{j=1}^{m} (\beta_{jb} h_{p,j} + \alpha_{jq} t_{p,j}) + \sum_{j=1}^{n} (\beta_{jb} h_{d,j} + \alpha_{jq} t_{d,j}) \qquad (10)$$
where t_{p,j} is one of the topic embeddings in T_p, and t_{d,j} is one of the
topic embeddings in T_d.
The joint context attention and the joint topic attention form a joint
attention mechanism that allows the input message and the topics to affect the
generation probability jointly. The primary advantage of joint attention is
that it makes the words in disputes relevant not only to the message but also
to the topics of the message.
where σ(·) is tanh, w is a one-hot indicator vector of word ω, and WVd, WVy, and
bV are learnable parameters. $Z = \sum_{v \in V} e^{\Psi_V(s_j, y_{j-1}, v)}$ is a normalizer.
5 Experiments
In this section, we describe the dataset used for training and evaluation, give
implementation details, introduce baseline models, explain how model output is
evaluated and report evaluation results.
5.1 Dataset
² http://wenshu.court.gov.cn.
We employ jieba³ for Chinese word segmentation. The word embedding size is set
to 300, and the embeddings are randomly initialized from a uniform distribution
over [−0.1, 0.1]. In both context-level and topic-level detection, the hidden
size of the GRU is set to 300 for each direction of the Bi-GRU. We choose ROUGE
as the update metric. Adam [7] is adopted to optimize the model with an initial
learning rate of 0.0001, gradient clipping of 0.1, and a dropout rate of 0.5.
Model performance is checked on the validation set after every 1000 training
batches, and we keep the parameters with the highest ROUGE. Training is
terminated if model performance does not improve for eight successive checks.
We repeat all experiments ten times and report the average results.
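The checkpointing and stopping rule described above can be sketched as follows.
This is an illustrative sketch rather than the authors' training code: the
function name, the `evaluate` callback, and the toy scores are hypothetical, and
it assumes that a higher validation ROUGE is better.

```python
def train_with_early_stopping(batches, evaluate, check_every=1000, patience=8):
    """Track the best validation score and stop after `patience`
    consecutive checks without improvement."""
    best_score = float("-inf")
    best_step = None
    stale_checks = 0
    for step, _batch in enumerate(batches, start=1):
        # A real loop would run one optimizer update on `_batch` here.
        if step % check_every != 0:
            continue
        score = evaluate(step)  # e.g. ROUGE on the validation set
        if score > best_score:
            # A real loop would also save the model parameters here.
            best_score, best_step = score, step
            stale_checks = 0
        else:
            stale_checks += 1
            if stale_checks >= patience:
                break
    return best_step, best_score

# Toy usage: the validation score peaks at the third check and then declines.
scores = {1000 * k: s for k, s in enumerate([0.30, 0.35, 0.40] + [0.39] * 10, start=1)}
best_step, best_score = train_with_early_stopping(range(20000), lambda step: scores[step])
```

In the toy run, training stops at the eleventh check, eight checks after the
best score was observed.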
Both automatic and human evaluation metrics are used to analyze the model’s
performance.
³ https://github.com/fxsjy/jieba.
Fig. 4. Case study. We mark three correct disputes and their corresponding content in
the plaintiff allegation and the defendant argument in yellow, green and purple. In
this case, our model generates all correct disputes, whereas the other methods
generate only one correct dispute and even several wrong disputes compared with the
gold standard. (Color figure online)
Context-level detection addresses the situation where the PA and the DA have
distinct disputes in the context. As shown in Fig. 4, without context-level
detection, some baselines generate wrong disputes. For example, the wrong
dispute “Beating and scolding the plaintiff”, generated by HRED, is not
important enough to be a dispute although it is described similarly in both the
PA and the DA. In contrast, Ours_context generates only correct disputes, which
verifies the effectiveness of context-level detection.
Fig. 5. The heatmap represents a soft alignment between the input (right) and the
generated dispute (top). The columns represent the attention distribution over the
input after generating each word. At the topic level, these keywords receive
attention because the LDA model identifies them as topic words.
References
1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning
to align and translate. arXiv preprint arXiv:1409.0473, September 2014
2. Boella, G., Caro, L.D., Humphreys, L.: Using classification to support legal
knowledge engineers in the Eunomos legal document management system. In: Fifth
International Workshop on Juris-Informatics (JURISIN) (2011)
3. Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In:
Proceedings of the ACL-04 Workshop, pp. 74–81. Association for Computational
Linguistics (2004)
4. Gu, J., Lu, Z., Li, H., Li, V.O.: Incorporating copying mechanism in sequence-to-
sequence learning. In: Association for Computational Linguistics (2016)
5. Hu, Z., Li, X., Tu, C., Liu, Z., Sun, M.: Few-shot charge prediction with
discriminative legal attributes. In: Proceedings of COLING (2018)
6. Kim, M.-Y., Xu, Y., Goebel, R.: Legal question answering using ranking SVM
and syntactic/semantic similarity. In: Murata, T., Mineshima, K., Bekki, D. (eds.)
JSAI-isAI 2014. LNCS (LNAI), vol. 9067, pp. 244–258. Springer, Heidelberg (2015).
https://doi.org/10.1007/978-3-662-48119-6_18
7. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
8. Liu, Y.H., Chen, Y.L.: A two-phase sentiment analysis approach for judgement
prediction. J. Inf. Sci. 44, 594–607 (2017)
9. Long, S., Tu, C., Liu, Z., Sun, M.: Automatic judgment prediction via legal reading
comprehension. In: Proceedings of EMNLP (2017)
10. Luo, B., Feng, Y., Xu, J., Zhang, X., Zhao, D.: Learning to predict charges for
criminal cases with legal basis. In: Proceedings of EMNLP (2017)
11. Mi, H., Sankaran, B., Wang, Z., Ittycheriah, A.: Coverage embedding models for
neural machine translation. In: Empirical Methods in Natural Language Processing
(2016)
12. Nallapati, R., Zhou, B., dos Santos, C.N., Gulcehre, C., Xiang, B.: Abstractive text
summarization using sequence-to-sequence RNNs and beyond. In: Computational
Natural Language Learning (2016)
13. Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a method for automatic
evaluation of machine translation. In: the 40th Annual Meeting of the Association
for Computational Linguistics, pp. 311–318 (2002)
14. Raghav, K., Reddy, P.K., Reddy, V.B.: Analyzing the extraction of relevant legal
judgments using paragraph-level and citation information. AI4J Artif. Intell.
Justice (2016)
15. Rush, A.M., Chopra, S., Weston, J.: A neural attention model for abstractive
sentence summarization. In: Empirical Methods in Natural Language Processing
(2015)
16. See, A., Liu, P.J., Manning, C.D.: Get to the point: summarization with pointer-
generator networks. In: Proceedings of the 55th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083, July 2017
17. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
networks. In: Advances in Neural Information Processing Systems 27: Annual
Conference on Neural Information Processing Systems 2014, pp. 3104–3112 (2014)
18. Sutton, R.S., Barto, A.G.: Reinforcement learning: an introduction. J. Artif. Intell.
Res. 47, 253–279 (1998)
19. Tu, Z., Lu, Z., Liu, Y., Liu, X., Li, H.: Modeling coverage for neural machine
translation. In: Association for Computational Linguistics (2016)
20. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Neural Information
Processing Systems (2015)
21. Xing, C., et al.: Topic aware neural response generation. In: Proceedings of AAAI,
pp. 3351–3357 (2017)
22. Ye, H., Jiang, X., Luo, Z., Chao, W.: Interpretable charge predictions for criminal
cases: learning to generate court views from fact descriptions. In: Proceedings of
NAACL-HLT, pp. 1854–1864 (2018)
Richpedia: A Comprehensive
Multi-modal Knowledge Graph
1 Introduction
With the rapid development of Semantic Web technologies, various knowledge
graphs have been published on the Web using the Resource Description Framework
(RDF), such as Wikidata [18] and DBpedia [2]. Knowledge graphs allow RDF links to
be set among different entities, thereby forming a large heterogeneous graph that
supports semantic search [19], question answering [16] and other intelligent
services. Meanwhile, the public availability of visual resource collections has
attracted much attention for different Computer Vision [6,10] (CV) research purposes,
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 130–145, 2020.
https://doi.org/10.1007/978-3-030-41407-8_9
including visual question answering [20], image classification [15], object and
relationship detection [11], etc. We have also witnessed promising results from
encoding entity and relation information of textual knowledge graphs for CV
tasks. However, most knowledge graph construction work in the Semantic Web
and Natural Language Processing (NLP) [3,13,14] communities still focuses on
organizing and discovering only textual knowledge in a structured
representation, and relatively little attention has been paid to utilizing visual
resources for KG research. A visual database is normally a rich source of image
or video data and provides sufficient visual information about entities in KGs.
Considering textual and visual features together allows link prediction and
entity alignment to be carried out in a wider scope and can help models achieve
better performance.
In order to bring the advantages of the Semantic Web to the academic and
industry communities, a number of KGs have been constructed over the last years,
such as Wikidata [18] and DBpedia [2]. These datasets make it possible to explore
the semantic relationships among different entities. However, there are few
visual sources within these textual KGs. In order to improve visual question
answering and image classification performance, several methods [5,7,12,20] have
been developed for connecting textual facts and visual resources, but the RDF links
[4] from different entities and images to objects in the same image are still very
limited. Hence, few existing data resources bridge the gap between
visual resources and textual knowledge graphs.
As mentioned above, general knowledge graphs focus on textual facts. The lack
of a comprehensive multi-modal knowledge graph dataset prohibits further
exploration of textual and visual facts on either side. To fill this gap, we
provide a comprehensive multi-modal dataset (called Richpedia) in this paper,
as shown in Fig. 1.
132 M. Wang et al.
In summary, our Richpedia data resource mainly makes the following contributions:
The rest of this paper is organized as follows. Section 2 describes the
construction details of the proposed dataset. Section 3 gives an overview of the
Richpedia ontology. The statistics and evaluation are reported in Sect. 4.
Section 5 describes related work and finally, Sect. 6 concludes the paper and
identifies topics for further work.
2 Richpedia Construction
A knowledge graph (KG) can often be viewed as a large-scale multi-relational
graph consisting of different entities and their relations. We follow the RDF
model [4] and define the proposed multi-modal knowledge graph, Richpedia, as
follows:
Richpedia Definition: Let E = E_KG ∪ E_IM be a set of general KG entities
E_KG and image entities E_IM, and let R be a set of relations between entities.
E and R are denoted by IRIs (Internationalized Resource Identifiers)². Let L be
the set of literals (denoted by quoted strings, e.g. “London”, “750px”), and B be
the set of blank nodes. A Richpedia triple t = ⟨subject, predicate, object⟩ is a
member of the set (E ∪ B) × R × (E ∪ L ∪ B). Richpedia, i.e., the multi-modal KG,
is a finite set of Richpedia triples.
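The membership condition in the definition can be illustrated with a small
sketch. The class and function names below are hypothetical and serve only to
make the set (E ∪ B) × R × (E ∪ L ∪ B) concrete; because both entities and
relations are denoted by IRIs, the check cannot tell E from R by type alone.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IRI:
    value: str

@dataclass(frozen=True)
class Literal:
    value: str

@dataclass(frozen=True)
class BlankNode:
    label: str

def is_richpedia_triple(subject, predicate, obj):
    """Check membership in (E ∪ B) × R × (E ∪ L ∪ B) by term type."""
    return (
        isinstance(subject, (IRI, BlankNode))              # subject: entity or blank node
        and isinstance(predicate, IRI)                     # predicate: relation
        and isinstance(obj, (IRI, Literal, BlankNode))     # object: entity, literal or blank node
    )

# Identifiers are borrowed from the running text purely as examples.
image = IRI("rp:001564")
ok = is_richpedia_triple(image, IRI("rp:attribute"), Literal("700*1600"))
bad = is_richpedia_triple(Literal("London"), IRI("rp:imageof"), image)
```

A literal is accepted only in the object position, which is why the second
triple is rejected.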
Figure 2 gives an overview of the Richpedia construction pipeline, which
mainly includes three phases: data collection (described in Sect. 2.1), image
processing (described in Sect. 2.2) and relation discovery (described in
Sect. 2.3).
¹ https://jena.apache.org/documentation/fuseki2/index.html.
² https://www.w3.org/TR/rdf11-concepts/#dfn-iri.
[Diagram labels: City; Wikipedia, Bing, Google, Yahoo.]
[Bar chart. x-axis: entity images frequency, in bins of width 5 from 0-5 to 65-70; y-axis: count, from 0 to 4000.]
Fig. 3. The entity image frequencies in Wikipedia. A large portion of entities
have only a few images.
portion of images for KG entities are actually long-tail. In other words, each
KG entity has very little visual information in Wikipedia. Therefore, as
mentioned above, we obtained sufficient images from open sources and processed
them to filter out the final image entities (details are given in Sect. 2.2).
After that, we create an IRI for each image entity. In the current version, we
have collected 2,883,162 image entities and kept 99.2 images per entity on average.
Triple Generation: In Richpedia, we focus on constructing three types of
triples as follows:
where rp:001564 is a picture of Sydney and its pixel information is 700 × 1600.
⟨e_i, rp:relation, e_k⟩ establishes the semantic visual relations (rp:relation)
between two image entities. An example is
Since each IRI is unique, we can directly generate the triples
⟨e_i, rp:imageof, e_k⟩ and ⟨e_i, rp:attribute, l⟩ during data collection. For the
triple ⟨e_i, rp:relation, e_k⟩, we leverage related hyperlinks and text in
Wikipedia to discover the relations (details are given in Sect. 2.3).
For each image cluster, we collect the top-20 images by the following process.
First, the image with the highest visual score is selected as the top-ranked
image. The second image is the one that has the largest distance to the first
image. The third image is chosen as the image with the largest distance to both
previous images, and so on. During this diversity-oriented image retrieval, we
also generate ⟨e_i, rp:attribute, l⟩ triples for the given image entity to provide
its visual features in Richpedia.
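The selection procedure above can be sketched as a greedy farthest-first
selection. This is one plausible reading, stated as an assumption: "largest
distance to both previous images" is interpreted as maximizing the minimum
distance to the images already selected. The function name, the scoring
callback, and the toy 2-D points are hypothetical.

```python
import math

def select_diverse(images, visual_score, distance, k=20):
    """Start with the best-scored image, then repeatedly add the candidate
    whose minimum distance to the already selected images is largest."""
    if not images:
        return []
    selected = [max(images, key=visual_score)]
    remaining = [img for img in images if img is not selected[0]]
    while remaining and len(selected) < k:
        farthest = max(remaining, key=lambda c: min(distance(c, s) for s in selected))
        selected.append(farthest)
        remaining.remove(farthest)
    return selected

# Toy usage with 2-D points standing in for image feature vectors.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (0.0, 3.0), (5.0, 4.0)]
score = {p: s for p, s in zip(points, [0.9, 0.8, 0.5, 0.4, 0.3])}
top3 = select_diverse(points, score.get, math.dist, k=3)
```

The near-duplicate point (0.1, 0.0) is skipped even though it has the second
highest score, which is the intended effect of favoring diversity.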
After acquiring the images, we compute several visual descriptors that
describe pixel-level features of the selected images (for instance, the
gray-level distribution and texture information). We then use these descriptors
to calculate the similarity between images, which can be obtained by
integrating the distances between different descriptors. The descriptors we
compute are the following:
In this section, we mainly introduce the generation of ⟨e_i, rp:relation, e_k⟩
triples. It is hard to detect these semantic relations directly from the pixel
features of different images. The images collected from open sources are
naturally linked to the input crawling seeds, i.e., KG entities and image entities
from Wikipedia and Wikidata. Therefore, we can leverage related hyperlinks
and text in Wikipedia to discover the semantic relations (rp:relation) between
image entities. Next, we take rp:contain and rp:nearBy as examples to illustrate
how to discover semantic relations among the image entities Place de la Concorde,
Obelisk of Luxor, and Fountain of River Commerce and Navigation.
As shown in Fig. 5, images of Place de la Concorde, Obelisk of Luxor, and
Fountain of River Commerce and Navigation are extracted from the Place de la
Concorde Wikipedia article. From the semantic visual perspective, we can find
that Place de la Concorde contains Obelisk of Luxor and Fountain of River
Commerce and Navigation, and that Obelisk of Luxor is near Fountain of River
Commerce and Navigation. To discover these relations, we collect the textual
descriptions around these images and propose three effective rules to extract
the final relations:
[Fig. 6, ontology diagram. Recoverable labels: classes rpo:Image, rpo:KGentity, rpo:Descriptor, rpo:ImageSimilarity, rpo:GHD, rpo:CLD, rpo:CM, rpo:GLCM, rpo:HOGS, rpo:HOGL; properties rpo:sameAs, rpo:contain, rpo:Imageof, rpo:Describes, rpo:sourceImage, rpo:targetImage, rpo:DescriptorType, rdfs:subClassOf; datatype properties rpo:pixel (xsd:string), rpo:height (xsd:float), rpo:width (xsd:float), rpo:value (xsd:string), rpo:Similarity (xsd:float).]
Rule2: If there are multiple hyperlinks in the description, we detect the core
KG entity based on the syntactic parser and the syntactic tree. Then, we take
the core KG entity as input and reduce this case to Rule1.
Rule3: If there is no hyperlink pointing to other articles in the description, we
employ Stanford CoreNLP to find the corresponding KG entities that have
Wikipedia articles and reduce this case to Rule1 and Rule2. Because Rule3
relies on NER results, which are of lower quality than annotated hyperlinks,
its priority is lower than that of the first two rules.
3 Ontology
In this section, we describe the ontology we built for Richpedia, which consists of
comprehensive image entities, multiple descriptors of image entities, and
relationships among image entities. We create a custom lightweight Richpedia
ontology to represent the data in RDF, with all files formatted following the
N-Triples guidelines (https://www.w3.org/TR/n-triples/). All Richpedia resources
are identified under the http://rich.wangmengsd.com/resource/ namespace. The
ontology is described at http://rich.wangmengsd.com/ontology/.
Fig. 6 gives an overview of the Richpedia ontology. Classes are displayed in
boxes. Solid edges represent relations between instances of two classes, and
dotted lines represent relations between the classes themselves; for
conciseness, the datatype properties are listed in the class boxes.
An rpo:KGentity is an existing textual knowledge graph entity that carries a
large amount of existing attribute information. An rpo:Image is an abstract
resource representing an image entity of the Richpedia dataset, describing the
height and width of the image and the URL of the image on the display website.
The data types of rpo:Height and rpo:Width are both xsd:float. An rpo:KGentity links
rp:16.jpg.GHD a rpo:GHD ;
    rpo:Describes rp:16.jpg ;
    rpo:value "[2823.0 , 218.0 , 256.0 , 205.0 , ...]" .
rp:16.jpg.CLD a rpo:CLD ;
    rpo:Describes rp:16.jpg ;
    rpo:value "[189.0 , 62.0 , 66.0 , 49.0 , ...]" .
rp:16.jpg.CM a rpo:CM ;
    rpo:Describes rp:16.jpg ;
    rpo:value "[64.8369 , 92.1214 , 195.548 , ...]" .
rp:16.jpg.GLCM a rpo:GLCM ;
    rpo:Describes rp:16.jpg ;
    rpo:value "[0.0123 , 0.0035 , 0.0011 , ...]" .
rp:16.jpg.HOGL a rpo:HOGL ;
    rpo:Describes rp:16.jpg ;
    rpo:value "[0.0666 , 0.0120 , 0.0033 , 0.0012 , ...]" .
rp:gray_sim1 a rpo:ImageSimilarity ;
    rpo:sourceImage rp:0.jpg ;
    rpo:targetImage rp:3997.jpg ;
    rpo:Similarity 0.7750442028045654 ;
    rpo:DescriptorType rpo:GHD .
Fig. 10. 10 nearest neighbors of an image of Russian Luzhniki Stadium using HOG.
As for the accessibility of the data, due to the capacity limitation of our
website server, the online access platform displays only part of the Richpedia
data, but we provide a Google Drive download link for all data. There, you can
find the link to the full data and the link to the .nt files describing the
data, including the visual relations between the images, the image feature
descriptors and so on. With respect to sustainability, because of the large size
of the dump, we have not yet found a mirror host to replicate the data. We have
a long-term plan for Richpedia, so the dataset will remain in active maintenance
and development. As for updating the dataset, although it is expensive to build
the original dataset, we plan to implement incremental updates. The descriptors
for new images can then be computed, while only the k-nn similarity relations
involving new images (potentially pruning old relations) need to be computed.
Using the visual descriptors of image entities generated in Sect. 2.2, we design
an experiment to calculate the similarity between image entities. First, we use
the OpenCV library to calculate the visual descriptors for each image entity.
Next, we use the visual descriptors to calculate the similarity between images.
For each image entity, we calculate its ten nearest neighbors according to each
visual descriptor. To find the nearest neighbors, we consider both the classical
(exact) algorithm and a fast approximate NN matching algorithm.
Nearest neighbor search is a major problem in many applications, such as image
recognition, data compression, pattern recognition and classification, machine
learning, document retrieval systems, statistics and data analysis. However,
solving this problem in high-dimensional spaces is very difficult, and no
algorithm is clearly superior to standard brute-force search. As a result, more
and more people turn their interest to a
class of algorithms that perform approximate nearest neighbor search. These
methods have proved to give a good enough approximation in most practical
applications while being much faster than exact search. In computer vision and
machine learning, finding the nearest neighbor in the training data is
expensive for high-dimensional features. For such features, the randomized k-d
forest is the most effective method at present.
For the image entities of the online access platform, because the number of
image entities is small (about tens of thousands), we use the classical nearest
neighbor algorithm to calculate the similarity between image entities. The
advantage of this algorithm is that it traverses the whole dataset, so it is
exact and faithfully reflects the similarity between image entities. However,
its shortcoming is also obvious: for each image entity, it must traverse all
other image entities. It is a brute-force search, so it has high time
complexity and consumes a lot of time and computing resources. For the complete
Richpedia dataset, we therefore choose the Fast Library for Approximate Nearest
Neighbors (FLANN) to calculate the similarity between image entities, since it
has been shown to scale to large datasets. Although it decreases accuracy, it is
a suitable choice for large datasets in terms of the trade-off between accuracy
and time complexity.
We design an experiment with 30,000 images. We configured FLANN with a target
precision of 95% and tested it against a brute-force gold standard. Computing
the gold standard with the classical nearest neighbor algorithm took 18 days
with 8 threads, whereas FLANN finished in 15 hours with 1 thread. FLANN achieved
a precision of 76%. In Fig. 10, we show an example of similarity search results
based on the HOG descriptor, which captures information about edges in an image.
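The evaluation idea above, comparing approximate neighbors against a brute-force
gold standard, can be sketched as follows. This is an illustrative sketch and
not the authors' experiment code: the function names are hypothetical, the
Euclidean distance is an assumed choice, the toy vectors stand in for real
descriptors, and the approximate result is hard-coded rather than produced by
FLANN.

```python
import math

def brute_force_knn(vectors, query_index, k):
    """Exact k nearest neighbors of vectors[query_index], found by scanning
    every other vector (the gold standard)."""
    query = vectors[query_index]
    others = [i for i in range(len(vectors)) if i != query_index]
    others.sort(key=lambda i: math.dist(query, vectors[i]))
    return others[:k]

def precision_at_k(approximate, exact):
    """Fraction of the approximate neighbors that also appear in the exact result."""
    if not approximate:
        return 0.0
    return len(set(approximate) & set(exact)) / len(approximate)

# Toy descriptors; real ones would be the GHD, CLD, ... vectors of image entities.
descriptors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (10.0, 10.0)]
gold = brute_force_knn(descriptors, query_index=0, k=3)
approx = [1, 2, 4]  # hypothetical output of an approximate index
precision = precision_at_k(approx, gold)
```

Averaging this precision over all query images gives a single figure comparable
to the precision reported for the approximate search.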
5 Use-Cases
6 Related Work
Amongst the available datasets describing multimedia, the emphasis has been
on capturing the high-level metadata of the multimedia files (e.g., author, date
created, file size, width, duration) rather than audio or visual features of the
multimedia content itself [1,8]. Recently, several methods [5,7,12,17,20] have
been developed for connecting textual facts and visual resources. IMGpedia [5]
is a linked dataset that provides visual descriptors and similarity relationships
for Wikimedia Commons. This dataset is also linked with DBpedia and DBpedia
Commons to provide semantic context and further metadata. Zhu et al. [20]
exploited knowledge graphs for visual question answering, but their resource was
created specifically for that purpose and consequently contains a small number
of very specific images. They also proposed a knowledge base framework to handle
all kinds of visual queries without training new classifiers for new tasks.
These annotations represent the densest and largest dataset of image
descriptions, objects, attributes, relationships, and question answers.
Visual Genome dataset [7] aims
References
1. Addis, M., Allasia, W., Bailer, W., Boch, L., Gallo, F., Wright, R.: 100 million
hours of audiovisual content: digital preservation and access in the prestoprime
project. In: Proceedings of the 1st International Digital Preservation
Interoperability Framework Symposium, p. 3. ACM (2010)
2. Bizer, C., et al.: DBpedia-a crystallization point for the web of data. Web Semant.
Sci. Serv. Agents World Wide Web 7(3), 154–165 (2009)
3. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.:
Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–
2537 (2011)
4. World Wide Web Consortium, et al.: RDF 1.1 concepts and abstract syntax (2014)
5. Ferrada, S., Bustos, B., Hogan, A.: IMGpedia: a linked dataset with content-based
analysis of wikimedia images. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS,
vol. 10588, pp. 84–93. Springer, Cham (2017).
https://doi.org/10.1007/978-3-319-68204-4_8
6. Heaton, R.K., Staff, P.: Wisconsin card sorting test: computer version 2. Odessa
Psychol. Assess. Resour. 4, 1–4 (1993)
7. Krishna, R., et al.: Visual genome: connecting language and vision using
crowdsourced dense image annotations. Int. J. Comput. Vision 123(1), 32–73 (2017)
8. Kurz, T., Kosch, H.: Lifting media fragment URIs to the next level. In:
LIME/SemDev@ ESWC (2016)
9. Lee, S., Xin, J., Westland, S.: Evaluation of image similarity by histogram
intersection. Color Research & Application 30(4), 265–274 (2005)
10. Lejuez, C., Kahler, C.W., Brown, R.A.: A modified computer version of the paced
auditory serial addition task (PASAT) as a laboratory-based stressor. Behav.
Therapist (2003)
11. Liang, X., Lee, L., Xing, E.P.: Deep variation-structured reinforcement learning for
visual relationship and attribute detection. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 848–857 (2017)
12. Liu, Y., Li, H., Garcia-Duran, A., Niepert, M., Onoro-Rubio, D., Rosenblum, D.S.:
MMKG: multi-modal knowledge graphs. arXiv preprint arXiv:1903.05485 (2019)
13. Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., McClosky, D.: The
Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd
Annual Meeting of the Association for Computational Linguistics: System
Demonstrations, pp. 55–60 (2014)
14. Manning, C.D., Schütze, H.: Foundations of Statistical Natural
Language Processing. MIT Press, Cambridge (1999)
15. Marino, K., Salakhutdinov, R., Gupta, A.: The more you know: using knowledge
graphs for image classification. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 2673–2681 (2017)
16. Trivedi, P., Maheshwari, G., Dubey, M., Lehmann, J.: LC-QuAD: a corpus for
complex question answering over knowledge graphs. In: d’Amato, C., et al. (eds.)
ISWC 2017. LNCS, vol. 10588, pp. 210–218. Springer, Cham (2017).
https://doi.org/10.1007/978-3-319-68204-4_22
17. Vaidya, G., Kontokostas, D., Knuth, M., Lehmann, J., Hellmann, S.: DBpedia
commons: structured multimedia metadata from the wikimedia commons. In:
Arenas, M., et al. (eds.) ISWC 2015. LNCS, vol. 9367, pp. 281–289. Springer, Cham
(2015). https://doi.org/10.1007/978-3-319-25010-6_17
18. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase.
Commun. ACM 57(10), 78–85 (2014)
19. Yih, W.t., Chang, M.W., He, X., Gao, J.: Semantic parsing via staged query graph
generation: question answering with knowledge base. In: Proceedings of the 53rd
Annual Meeting of the Association for Computational Linguistics and the 7th
International Joint Conference on Natural Language Processing (Volume 1: Long
Papers), vol. 1, pp. 1321–1331 (2015)
20. Zhu, Y., Zhang, C., Ré, C., Fei-Fei, L.: Building a large-scale multimodal knowledge
base system for answering visual queries. arXiv preprint arXiv:1507.05670 (2015)
DSEL: A Domain-Specific Entity Linking
System
Xinru Zhang1 , Huifang Xu2 , Yixin Cao3 , Yuanpeng Tan2 , Lei Hou1(B) ,
Juanzi Li1 , and Jiaxin Shi1
1
Tsinghua University, Beijing, China
[email protected], {houlei,lijuanzi}@tsinghua.edu.cn,
[email protected]
2
Artificial Intelligence Application Department,
China Electric Power Research Institute, Beijing, China
{xuhuifang,tanyuanpeng}@epri.sgcc.com.cn
3
National University of Singapore, Singapore, Singapore
[email protected]
1 Introduction
Entity linking (EL) aims to link the textual named entity mentions in an
unstructured document to the proper knowledge base (KB) entities. It is a
fundamental task for many NLP problems, such as question answering and relation
extraction. The task has been extensively studied [1,6,11] in recent years, and
many EL systems [9,10,16] have been built.
The main challenge of the entity linking task is the ambiguity of named entity
mentions. A named entity mention may refer to many KB entities, and an entity
often has multiple surface names, such as its full name, partial names, aliases,
abbreviations, and alternate spellings. For example, the entity “Air Jordan” can
be identified by the plain text “Jordan” or “AJ”, while the mention “Jordan” can
refer to 43 entities in Wikipedia. For a given document, an entity linking system
should detect the mentions of interest and link them to the correct entities in
the knowledge base. Another challenge is filtering all recognized named entity
mentions to find the more meaningful, user-relevant ones. Existing entity linking
systems try to link as many named entities as possible, but many of these
linkages are unnecessary. As Fig. 1 shows, for the given document, Babelfy and
TagMe detect as many entity mentions as possible (Babelfy also detects concept
mentions) and link them to the corresponding entities (concepts); however, many
of these linkages are futile, such as “brand” and “style”.
Fig. 1. Linking results of two entity linking systems, Babelfy (left) and TagMe
(right). In the Babelfy results, mentions in orange are concept links and
mentions in blue are entity links. TagMe does not distinguish between concepts
and entities. (Color figure online)
Much research has focused on the challenge of named entity ambiguity. Deep
neural networks (DNNs) have shown their ability to address the problem, and
existing neural models for entity disambiguation follow two paradigms: local
models, which disambiguate mentions independently relying on textual context
information [2,3,7], and global (collective) models, which resolve multiple
mentions in a document simultaneously by encouraging their target entities to be
coherent [4]. Although current models seem very effective on experimental test
sets, it is still difficult to get satisfactory performance in practical
scenarios, especially when the target entities come from specific domains. We
argue that one critical bottleneck is that current EL models are mostly designed
and deployed for the open domain, which contains millions of entities from
totally different real-world domains and is thus too difficult for a single
model to handle. For example, as mentioned above, “Jordan” can refer to 43
entities in Wikipedia, so disambiguation costs considerable time and space. The
second challenge, meaningful linkages, has rarely been considered but is quite
important in practice. To address the two challenges, we propose to move entity
linking systems towards specific domains. By doing so, we can not only reduce
the search space and problem complexity but also leverage domain information
(e.g., domain priors) to boost performance. Besides, the linking results concern
the specified domain, which makes them more meaningful than open-domain results.
To provide domain-specific entity linking, we propose an unsupervised approach
to generate domain data from Wikipedia. We generated and published 12 domain
datasets. As for the model, inspired by [1], we build a candidate graph for
multiple mentions in a document and then utilize graph convolutional networks
for information aggregation. Our models share the same framework but are trained
on different domain-specific datasets. Compared with the model trained on the
open domain, our domain-specific systems are demonstrated to perform
significantly better. Our key contributions in this work are as follows:
1. Provide an unsupervised approach to generate domain data from Wikipedia
utilizing its category system and trained category embeddings. Currently, we
have published 12 domain datasets.
2. Build an entity linking system, including dictionary-based mention parsing
and candidate generation and a domain-specific neural collective entity linking
model, and publish the system as an online service.
where $c$ represents the global context, $\Gamma = \{e_1, \ldots, e_k\}$, and
$e_i$ is one of the candidate KB entities of $m_i$.
[Figure 2 diagram labels: System Components (Mention Detection, Candidate
Generation, DSEL Model); Domain Generator; Domain 1 ... Domain n; Wikipedia
Corpus; input: document; output: linking results]
Fig. 2. System framework. Our system contains three parts: Domain Generation,
Mention Detection & Candidates Generation, and Domain-specific Entity Linking
Model. When the user inputs a document and a domain, our system detects mentions
in the document and maps them to their corresponding KB entities.
Mention Detection aims to find the entity mentions M in the textual document D;
the challenge at this stage lies in the varied surface forms of an entity.
3 Our Approach
3.1 Domain Generation
[Figure 3 example category labels: Concert Masters, Physical Exercise, Orchestra
Leaders, Dance Music Genres]
Fig. 3. Two traits of the category system in Wikipedia. As the left graph shows,
there are cyclic paths in the category system of Wikipedia. The right graph
shows that some categories may cause domain overlap because their parent
categories come from different domains.
domain data for further use. The key idea behind domain data generation is to
traverse the category tree of Wikipedia from the seed categories of a domain to
generate domain categories, and then to obtain the articles under those
categories to construct the domain data. Ideally, given a set of seed categories
of a domain, we can derive the domain categories by traversing the category
tree. However, Wikipedia is an online encyclopedia that everyone can edit.
Inevitably, three impediments may lead to unexpected results (Fig. 3).
1. Given a top category for a domain, retrieve its first-layer sub-categories and
then filter out irrelevant categories such as administration categories.
2. Take the top-k layers of categories as the seed categories of the domain, and
calculate the average category embedding for further use.
3. Traverse the i-th layer, i > k: calculate the cosine similarity between each
category in the layer and the average category embedding, sort the categories
by similarity score, and drop the last droprate_i % of the categories in this
layer.
4. Stop traversing when i > max_depth.
Result: domain_categories
category_queue = top_layer_categories;
level_len = category_queue.size();
curr_depth = 1;
seed_depth = k;
domain_categories = list([]);
while category_queue is not empty do
    head_category = category_queue.pop();
    level_len -= 1;
    push all sub-categories of head_category into category_queue;
    domain_categories.append(head_category);
    if level_len == 0 then
        if curr_depth == seed_depth then
            calculate the average category embedding of the current domain_categories;
        end
        if curr_depth > seed_depth then
            sort the categories in category_queue by cosine similarity with the
            average category embedding;
            drop the last droprate_{curr_depth} % of the categories in the queue;
        end
        curr_depth++;
        if curr_depth > max_depth then
            break;
        end
        level_len = category_queue.size();
    end
end
Algorithm 1: Domain categories generation
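The traversal can also be sketched in runnable form. The following Python sketch is only an illustration under stated assumptions, not the released implementation: it follows the numbered steps (each layer beyond the seed depth is pruned before it is added), the inputs `children`, `embedding`, and `drop_rate` are hypothetical stand-ins for the Wikipedia category data and trained category embeddings, and the `seen` set is an added guard against the cyclic paths shown in Fig. 3.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise average of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def generate_domain_categories(top_categories, children, embedding,
                               seed_depth, max_depth, drop_rate):
    """Level-order traversal with similarity-based pruning below the seed layers.

    children:  dict mapping a category to its sub-categories (hypothetical input)
    embedding: dict mapping a category to its embedding vector (hypothetical input)
    drop_rate: function depth -> fraction in [0, 1] of that layer to drop
    Assumes seed_depth >= 1.
    """
    domain = []
    seen = set(top_categories)   # assumption: skip revisits so that cycles terminate
    level = list(top_categories)
    centroid = None
    depth = 1
    while level and depth <= max_depth:
        domain.extend(level)
        next_level = []
        for category in level:
            for sub in children.get(category, []):
                if sub not in seen:
                    seen.add(sub)
                    next_level.append(sub)
        if depth == seed_depth:
            # Average embedding of the seed categories collected so far.
            centroid = mean_vector([embedding[c] for c in domain])
        if depth >= seed_depth:
            # Rank the next layer by similarity and drop its least similar tail.
            next_level.sort(key=lambda c: cosine(embedding[c], centroid), reverse=True)
            n_drop = int(len(next_level) * drop_rate(depth + 1))
            next_level = next_level[:len(next_level) - n_drop]
        level = next_level
        depth += 1
    return domain
```

Pruning the next layer before it is enqueued keeps dropped categories from contributing their own sub-categories, which is one plausible way to limit the domain overlap described above.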
Input Features. The input of our model contains local features and global
features. Local features represent how compatible the entity is with the context
of its corresponding mention. There are two types of local features for each
candidate entity: string similarity, computed from the edit distance, and
entity-context similarity, computed as the similarity between the entity and the
attention-weighted average of the context words.
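The two local features can be illustrated with a minimal Python sketch. It assumes a length-normalized Levenshtein similarity for the string feature and a softmax attention over dot products for the context feature; the function names, the normalization, and the use of softmax and cosine are illustrative assumptions rather than the exact formulation used in the model.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def string_similarity(mention, entity_name):
    """Edit distance mapped to [0, 1]; the normalization is an assumption."""
    longest = max(len(mention), len(entity_name))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(mention.lower(), entity_name.lower()) / longest


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def context_similarity(entity_vec, context_vecs):
    """Similarity between an entity and the attention-weighted context average."""
    scores = [dot(entity_vec, w) for w in context_vecs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * vec[d] for w, vec in zip(weights, context_vecs))
               for d in range(len(entity_vec))]
    return cosine(entity_vec, context)
```

For instance, `string_similarity("AJ", "Air Jordan")` is low because eight insertions are needed, which is why the context feature is required alongside the string feature.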
Global features represent the topic coherence among the entities, mentions, and
plain text of a given document. We extract two types of global features to
capture the global semantics: neighbor mention compatibility and subgraph
structure. We compute the similarities between the candidate entity and all
neighbor mentions to represent the neighbor mention compatibility. As for the
subgraph structure feature, we first build an entity graph whose nodes are the
candidate entities of all mentions and whose edges are their similarities; then,
for each candidate entity, we extract the candidate entities of its neighbor
mentions to build a subgraph as another global feature.
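A small Python sketch may make the two global features concrete. It assumes that "neighbor mentions" means all other mentions in the document and that similarity is cosine similarity over embedding vectors; both choices, along with the function names and data layout, are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbor_mention_compatibility(entity_vec, mention_vecs, mention_idx):
    """Similarities between a candidate entity and every other mention."""
    return [cosine(entity_vec, vec)
            for i, vec in enumerate(mention_vecs) if i != mention_idx]


def candidate_subgraph(entity, mention_idx, candidates, entity_vecs):
    """Subgraph of one candidate plus the candidates of its neighbor mentions.

    candidates:  list (one item per mention) of candidate entity ids
    entity_vecs: dict mapping an entity id to its embedding vector
    Returns the node list and a dict of weighted edges between node pairs.
    """
    neighbors = [e for i, cands in enumerate(candidates)
                 if i != mention_idx for e in cands]
    nodes = list(dict.fromkeys([entity] + neighbors))   # de-duplicate, keep order
    edges = {(a, b): cosine(entity_vecs[a], entity_vecs[b])
             for pos, a in enumerate(nodes) for b in nodes[pos + 1:]}
    return nodes, edges
```

The resulting nodes and weighted edges are the kind of structure that a graph convolutional layer could aggregate over.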
4 Experiments
4.1 Dataset
The dataset used in our work is the Wikipedia corpus dumped in March 2018. To
the best of our knowledge, there is no proper domain dataset that provides both
sufficient plain text and mention-entity relations for the task of entity
linking, so we built 12 domain datasets from Wikipedia for domain model training
and system usage. Wikipedia is the largest online encyclopedia that everybody
can edit; we extracted 5,133,361 instances and 1,376,896 categories from it for
our work. Table 1 shows the detailed statistics of the 12 domains.
Fig. 4. The number of domain categories and instances increases with traversal
depth.
Fig. 5. The coverage of total domain instances and categories. The vertical axis
is the category/instance coverage ratio of the extended domains, while the
horizontal axis represents the newly added domain. For example, the instance
coverage value at ‘politics’ is the instance coverage of mathematics, music, and
politics.
Some instances are not contained in our pre-defined domains because the domains
are manually set. As Fig. 5 shows, the coverage increases after a new domain is
introduced, and there are three ratio jumps when adding music, law, and sports.
This means that these three domains are more independent of the other domains
and have more content. The experiment below also reveals this phenomenon.
Domain Overlap Ratio. The domain overlap ratio is the proportion of the
categories/instances of one domain that also appear in another domain. It gives
us an insight into the rationality of the domains. Tables 2 and 3 report the
overlap ratios of domain categories and instances, respectively; IDs are domain
IDs, and the mapping from ID to domain name can be found in Table 1. The value
at the i-th row and the j-th column is the proportion of categories/instances in
the i-th domain that appear in the j-th domain.
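The definition can be written down directly. The sketch below assumes each domain is available as a Python set of category or instance identifiers; the function name and the toy sets are hypothetical. Note that the matrix is not symmetric, because each row is normalized by the size of its own domain.

```python
def overlap_matrix(domains):
    """matrix[i][j] = |D_i & D_j| / |D_i| for a list of sets D_0, ..., D_{n-1}."""
    return [[len(d_i & d_j) / len(d_i) if d_i else 0.0 for d_j in domains]
            for d_i in domains]


# Toy example: a small domain mostly contained in a larger one.
small = {"a", "b", "c", "d"}
large = {"a", "b", "c", "v", "w", "x", "y", "z"}
matrix = overlap_matrix([small, large])
```

Here `matrix[0][1]` is 0.75 while `matrix[1][0]` is 0.375, mirroring the asymmetry visible in the tables when one domain is much larger than the other.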
The maximum category overlap ratio is 0.655, at i = 2 and j = 3, which means
that 65.5% of the categories in the politics domain also belong to the law
domain, whereas only 5.7% of the categories in law belong to politics. We also
noticed that the total number of categories is 5,396 for politics compared with
61,367 for law. The instance overlap ratio exceeds 0.5 for three pairs of
domains: 83.1% of politics instances also belong to law, 52.3% of military
instances belong to law, and 50.6% of health instances belong to law.
Intuitively, these results mean that there is more law-related content in
Wikipedia than content about politics, military, and health, and that politics
may be a sub-domain of law.
ID 0 1 2 3 4 5 6 7 8 9 10 11
0 1.000 0.062 0.000 0.025 0.289 0.001 0.012 0.024 0.050 0.000 0.015 0.114
1 0.003 1.000 0.000 0.007 0.015 0.000 0.005 0.003 0.000 0.000 0.009 0.005
2 0.000 0.003 1.000 0.655 0.083 0.088 0.004 0.065 0.035 0.002 0.112 0.040
3 0.000 0.002 0.057 1.000 0.047 0.024 0.005 0.056 0.014 0.005 0.140 0.045
4 0.016 0.018 0.024 0.160 1.000 0.004 0.032 0.077 0.024 0.003 0.141 0.092
5 0.000 0.002 0.070 0.219 0.012 1.000 0.004 0.023 0.000 0.000 0.098 0.022
6 0.000 0.002 0.000 0.006 0.011 0.000 1.000 0.027 0.001 0.000 0.035 0.017
7 0.002 0.006 0.029 0.296 0.120 0.013 0.122 1.000 0.008 0.009 0.073 0.067
8 0.016 0.005 0.061 0.289 0.141 0.001 0.024 0.030 1.000 0.000 0.109 0.136
9 0.000 0.000 0.003 0.091 0.016 0.000 0.001 0.029 0.000 1.000 0.051 0.004
10 0.000 0.003 0.009 0.136 0.041 0.010 0.029 0.013 0.005 0.003 1.000 0.015
11 0.008 0.008 0.015 0.193 0.118 0.010 0.063 0.055 0.030 0.001 0.068 1.000
ID 0 1 2 3 4 5 6 7 8 9 10 11
0 1.000 0.019 0.012 0.114 0.458 0.005 0.023 0.063 0.091 0.006 0.074 0.250
1 0.002 1.000 0.007 0.052 0.053 0.003 0.021 0.017 0.006 0.000 0.062 0.024
2 0.004 0.020 1.000 0.831 0.271 0.183 0.024 0.184 0.123 0.008 0.337 0.174
3 0.004 0.018 0.103 1.000 0.179 0.049 0.030 0.120 0.047 0.011 0.264 0.137
4 0.033 0.034 0.059 0.318 1.000 0.017 0.089 0.141 0.065 0.011 0.203 0.185
5 0.002 0.012 0.239 0.523 0.106 1.000 0.024 0.104 0.016 0.005 0.325 0.064
6 0.001 0.011 0.004 0.046 0.076 0.003 1.000 0.045 0.019 0.000 0.083 0.041
7 0.011 0.026 0.096 0.506 0.335 0.041 0.126 1.000 0.048 0.023 0.200 0.200
8 0.037 0.021 0.149 0.468 0.362 0.015 0.128 0.112 1.000 0.003 0.232 0.319
9 0.008 0.003 0.029 0.320 0.181 0.015 0.015 0.164 0.011 1.000 0.191 0.077
10 0.003 0.023 0.044 0.282 0.122 0.033 0.058 0.050 0.025 0.007 1.000 0.080
11 0.028 0.024 0.059 0.379 0.288 0.017 0.076 0.131 0.090 0.007 0.208 1.000
Table 4. Model validation results. The left side lists results of domain-specific models,
while the right side shows results of open-domain models.
$DC_i^k$ represents the categories randomly selected in the k-th round for
domain i, and n is the number of rounds in which m domain categories are
randomly selected. In this experiment, we set n = 10 and m = 100. The results
are listed in Table 5.
ID 0 1 2 3 4 5 6 7 8 9 10 11
0 0.151 0.021 0.062 0.056 0.069 0.063 0.008 0.072 0.072 0.084 0.044 0.069
1 0.026 0.221 0.019 0.00 0.040 0.00 0.03 0.008 0.021 0.006 0.001 0.00
2 0.042 0.00 0.172 0.149 0.074 0.176 0.033 0.108 0.082 0.110 0.103 0.097
3 0.066 0.003 0.141 0.165 0.075 0.164 0.041 0.086 0.073 0.134 0.125 0.130
4 0.060 0.023 0.078 0.060 0.116 0.075 0.031 0.073 0.031 0.062 0.056 0.075
5 0.070 −0.01 0.201 0.170 0.094 0.201 0.303 0.141 0.073 0.160 0.161 0.127
6 0.013 −0.01 0.015 0.040 0.029 0.054 0.133 0.045 −0.00 0.101 0.057 0.063
7 0.056 0.015 0.097 0.103 0.081 0.109 0.042 0.165 0.038 0.100 0.091 0.114
8 0.063 0.001 0.076 0.065 0.042 0.074 0.003 0.026 0.112 0.038 0.089 0.077
9 0.079 0.025 0.146 0.153 0.094 0.172 0.075 0.172 0.032 0.365 0.142 0.125
10 0.062 0.002 0.115 0.140 0.065 0.147 0.055 0.095 0.051 0.118 0.163 0.121
11 0.079 0.008 0.114 0.106 0.065 0.147 0.074 0.109 0.061 0.124 0.105 0.167
5 System Implementation
6 Related Work
Fig. 6. System screenshot. Users input the plain text document, select a domain,
click the button ‘Link it!’, and then our system will return the linking results
of the document.
based on it. The category system of Wikipedia consists of two parts:
entity-category and category hierarchy. The former refers to adding tags, called
categories, to a Wikipedia article, and the latter is the way all categories in
Wikipedia are organized; both parts are user-generated. The category system has
been well studied [5,12,13] and proved valuable for introducing semantic
information. The category hierarchy is far from perfect, since the definition of
the subcategory relation is not clear enough for a knowledge base, and many
categories are added for the administration of the category system itself. As we
want to capture the domain semantics from the category system, it is acceptable
for us that the subcategory relation is not clear. As for the administration
categories, we directly remove them since they would introduce semantic bias
(Fig. 6).
Entity linking models can be classified into two types: local models and global
models. Local models [2,3,15] resolve mention ambiguity by computing the
similarity between candidate entities and mention contexts, so that the
semantics of the target entity align with the local context. Global models
[1,4,6] solve the problem by modeling the coherence score among all candidate
entities in the given document. Definition 4 reveals that the problem is
difficult to optimize because the quality of the score for one entity depends on
all other entity scores. Researchers have explored various approaches to
limiting the search space in order to improve optimization performance. Raiman
and Raiman [11] proposed a system for integrating symbolic knowledge into the
reasoning process of a neural network through a type system, Wu et al. [14]
designed a label-hierarchy-aware loss function that relies on the ultrametric
tree distance between labels, and Ling et al. [8] used NER types to constrain
the behavior of an entity linking system.
7 Conclusion
In this paper, we build a domain-specific entity linking system and publish it
as an online website. First, we propose an unsupervised method to generate
domain datasets from Wikipedia, including instances, categories, and
mention-candidate entity pairs. Then we build a domain-specific neural
collective entity linking model for each domain. With the domain datasets and
domain models, we build a domain-specific entity linking system and publish it
online. Extensive experiments are conducted to demonstrate the superiority of
our domain-specific models and the validity of category embeddings for domain
generation. We published 12 domain datasets, and our DSEL system is released as
an online website, http://dsel.xlore.org.
References
1. Cao, Y., Hou, L., Li, J., Liu, Z.: Neural collective entity linking. arXiv preprint
arXiv:1811.08603 (2018)
2. Chen, Z., Ji, H.: Collaborative ranking: a case study on entity linking. In:
Proceedings of the Conference on Empirical Methods in Natural Language
Processing, pp. 771–781. Association for Computational Linguistics (2011)
3. Chisholm, A., Hachey, B.: Entity disambiguation with web links. Trans. Assoc.
Comput. Linguist. 3, 145–156 (2015)
4. Durrett, G., Klein, D.: A joint model for entity analysis: coreference, typing, and
linking. Trans. Assoc. Comput. Linguist. 2, 477–490 (2014)
5. Faralli, S., Stilo, G., Velardi, P.: What women like: a gendered analysis of Twitter
users’ interests based on a twixonomy. In: Ninth International AAAI Conference
on Web and Social Media (2015)
6. Han, X., Sun, L., Zhao, J.: Collective entity linking in web text: a graph-based
method. In: Proceedings of the 34th International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 765–774. ACM (2011)
7. Lazic, N., Subramanya, A., Ringgaard, M., Pereira, F.: Plato: a selective context
model for entity resolution. Trans. Assoc. Comput. Linguist. 3, 503–515 (2015)
8. Ling, X., Singh, S., Weld, D.S.: Design challenges for entity linking. Trans. Assoc.
Comput. Linguist. 3, 315–328 (2015)
9. Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge.
In: Proceedings of the Sixteenth ACM Conference on Conference on Information
and Knowledge Management, pp. 233–242. ACM (2007)
10. Moro, A., Raganato, A., Navigli, R.: Entity linking meets word sense
disambiguation: a unified approach. Trans. Assoc. Comput. Linguist. 2, 231–244 (2014)
11. Raiman, J.R., Raiman, O.M.: Deeptype: multilingual entity linking by neural type
system evolution. In: Thirty-Second AAAI Conference on Artificial Intelligence
(2018)
12. Schönhofen, P.: Identifying document topics using the wikipedia category network.
Web Intell. Agent Syst. Int. J. 7(2), 195–207 (2009)
13. Strube, M., Ponzetto, S.P.: Wikirelate! computing semantic relatedness using
Wikipedia. In: AAAI, vol. 6, pp. 1419–1424 (2006)
14. Wu, C., Tygert, M., LeCun, Y.: Hierarchical loss for classification. arXiv preprint
arXiv:1709.01062 (2017)
15. Yamada, I., Shindo, H., Takeda, H., Takefuji, Y.: Joint learning of the
embedding of words and entities for named entity disambiguation. arXiv preprint
arXiv:1601.01343 (2016)
16. Zhang, J., Cao, Y., Hou, L., Li, J., Zheng, H.-T.: XLink: an unsupervised
bilingual entity linking system. In: Sun, M., Wang, X., Chang, B., Xiong, D. (eds.)
CCL/NLP-NABD 2017. LNCS (LNAI), vol. 10565, pp. 172–183. Springer, Cham
(2017). https://doi.org/10.1007/978-3-319-69005-6_15
Exploring the Generalization
of Knowledge Graph Embedding
Liang Zhang1 , Huan Gao1(B) , Xianda Zheng2 , Guilin Qi1,2 , and Jiming Liu3
1
School of Computer Science and Engineering, Southeast University, Nanjing, China
{230169435,gh,gqi}@seu.edu.cn
2
School of Cyber Science and Engineering, Southeast University, Nanjing, China
[email protected]
3
Itibia Technologies, Suzhou, China
[email protected]
1 Introduction
A knowledge graph contains abundant structured information. It represents
real-world things in the form of a directed graph, in which nodes represent
entities and edges between nodes represent relations. Generally speaking, a
knowledge graph contains an enormous number of triple facts, denoted as
(h, r, t), each of which consists of a head entity, a relation, and a tail
entity. Due to the difficulties in dealing with structured information, special
graph algorithms need to be designed for knowledge graphs. However, this
approach leads to inefficiency. Therefore, knowledge representation learning has
been proposed to alleviate this problem. Knowledge representation learning, or
knowledge graph embedding, aims at mapping entities and relations to continuous,
low-dimensional vector spaces for easy computation and analysis. Knowledge graph
embedding has been widely applied in various fields, such as knowledge graph
completion [1], intelligent question answering, and semantic search [2].
Especially in the task of knowledge graph completion, some models have achieved
quite good performance [3,4].
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 162–176, 2020.
https://doi.org/10.1007/978-3-030-41407-8_11
2 Related Work
2.1 Analysis of Translation-Based Knowledge Graph Embedding
Models
In recent years, more and more knowledge graph embedding models have been
proposed. The most representative models are translation-based models, such as
TransE [12], TransH [13], and TransR [14]. TransE interprets relations as
translation operations between head and tail entities in a low-dimensional
vector space. The TransE model is relatively simple; it performs well on 1-to-1
relations but has issues in modeling 1-to-N, N-to-1, and N-to-N relations. To
improve the situation, many improved variants of TransE have been proposed.
TransH attempts to solve the problem of TransE by modeling relations as
hyperplanes and projecting h and t onto the relation-specific hyperplane,
allowing entities to play different roles in different relations. TransR models
entities and relations in distinct semantic spaces and projects entities from
the entity space to the relation space when learning embeddings. Other
translation models, such as TransA [15] and TransD [16], are also
representative.
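The translation idea behind TransE, and its difficulty with 1-to-N relations, can be seen in a short Python sketch. The vectors below are toy values and the function name is hypothetical; the score is the norm of h + r − t, where a lower value indicates a more plausible triple.

```python
def transe_score(h, r, t, p=2):
    """L_p norm of h + r - t for equal-length embedding vectors."""
    diffs = [abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t)]
    return sum(d ** p for d in diffs) ** (1.0 / p)


head = [1.0, 0.0]
relation = [0.0, 1.0]
tail_a = [1.0, 1.0]   # exactly head + relation
tail_b = [1.0, 2.0]   # a second valid tail of the same 1-to-N relation

score_a = transe_score(head, relation, tail_a)
score_b = transe_score(head, relation, tail_b)
```

With a single relation vector, both tails can only reach a zero score if they coincide, so `score_a` is 0.0 while `score_b` stays positive. This is the limitation that the hyperplane projection of TransH and the relation space of TransR are meant to address.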
3 Problem Definition
Our goal is to analyze the generalization ability of knowledge graph embedding
models. By studying generalization, we hope to find some criteria or conclusions
that affect the performance of a model. Although there are many knowledge graph
embedding models, we focus only on translation-based models in this paper.
Translation-based models can be regarded as linear neural network models. Most
of the models are improvements on this basis; essentially, an embedding matrix
is transformed into a zero matrix after elementary transformations of the
matrix. As shown in Fig. 1, the learning process of entities and relations in
the model is a deep learning process.
Regarding the training process, for models with equal training loss, the higher
the complexity of a model, the worse its generalization ability. Therefore, we
hope to find the knowledge graph embedding model with the lowest complexity. We
propose that the generalization ability F of the model consists of an empirical
error and a generalization error:
4 Analytical Methods
of empirical error. For the knowledge graph embedding model, its parameters
are the weights of the fully connected neural network, so we can transform the
capacity of the neural network model into a norm to measure the parameters in
the model. The remaining question becomes how to measure the capacity of the
model.
\[ y = f_w(x). \tag{2} \]
In many cases, we want a robust model that is less sensitive to its input, so
that the generalization ability of the model can be improved accordingly.
Square multiplication of all parameters in the embedded matrix. The formula is
applied on the basis of Lipschitz constraints:
For a specific model, we hope to estimate the expression of $C(w)$; the smaller
$C(w)$ is, the better the generalization will be. Obviously, to ensure that the
left side does not exceed the right side, the absolute value of each element of
the $\partial f / \partial x$ term must not exceed a constant. This requires
activation functions whose derivatives have upper and lower bounds, and the
commonly used activation functions, such as sigmoid, tanh, and ReLU, satisfy
this condition. It is assumed that the gradient of the activation function is
bounded; for the commonly used ReLU activation function in particular, the bound
is 1. So the $\partial f / \partial x$ term contributes only a constant, and for
now only $W(x_1 - x_2)$ needs to be considered.
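For reference, the Lipschitz condition behind this argument is commonly written as below. This is a sketch of the standard form for a single fully connected layer, not a reproduction of the original formulas: the symbols $\phi$ (an element-wise activation with $|\phi'| \le 1$), $W$, and $b$, and the choice of the Frobenius norm as the final bound, are assumptions.

```latex
% Lipschitz condition on the model f_w with constant C(w):
\[
  \lVert f_w(x_1) - f_w(x_2) \rVert \le C(w)\, \lVert x_1 - x_2 \rVert .
\]
% For one layer f_w(x) = \phi(Wx + b) with a 1-Lipschitz activation \phi:
\[
  \lVert \phi(Wx_1 + b) - \phi(Wx_2 + b) \rVert
  \le \lVert W(x_1 - x_2) \rVert
  \le \lVert W \rVert_2 \, \lVert x_1 - x_2 \rVert
  \le \lVert W \rVert_F \, \lVert x_1 - x_2 \rVert ,
  \qquad
  \lVert W \rVert_F = \Bigl( \sum_{i,j} w_{i,j}^2 \Bigr)^{1/2} .
\]
```

Under these assumptions, $C(w)$ can be taken as a norm of the weight matrix, which depends only on the squared entries of $W$.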
After transformation, it is found that C is only related to the norm of weight:
\[ f(h, r, t) = \lVert h + r - t \rVert . \tag{7} \]
We use $M_e$ to denote the embedding matrix of entities and $M_r$ to denote the
embedding matrix of relations. The entity matrix comprises the head entity
matrix $M_h$ and the tail entity matrix $M_t$. According to formula 7, the
embedding matrix of head entities and the embedding matrix of tail entities
cancel each other, because there is a minus sign in front of the tail entity,
and finally only the embedding matrix of relations remains:
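\[
M_h + M_r - M_t =
\begin{bmatrix} e_{1,1} & \cdots & e_{1,m} \\ \vdots & \ddots & \vdots \\ e_{n,1} & \cdots & e_{n,m} \end{bmatrix}
+
\begin{bmatrix} r_{1,1} & \cdots & r_{1,m} \\ \vdots & \ddots & \vdots \\ r_{n,1} & \cdots & r_{n,m} \end{bmatrix}
-
\begin{bmatrix} e_{1,1} & \cdots & e_{1,m} \\ \vdots & \ddots & \vdots \\ e_{n,1} & \cdots & e_{n,m} \end{bmatrix}
=
\begin{bmatrix} r_{1,1} & \cdots & r_{1,m} \\ \vdots & \ddots & \vdots \\ r_{n,1} & \cdots & r_{n,m} \end{bmatrix}
= M_r . \tag{8}
\]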
Therefore, the generalization error $\sigma_{gen}$ can be measured by the $L_p$
norm of the relation matrix $M_r$. Since a matrix can be viewed as a
two-dimensional array, it can be directly converted into vector form:
\[
M_r = \begin{bmatrix} r_{1,1} \\ \vdots \\ r_{n,m} \end{bmatrix} . \tag{9}
\]
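Computing this measure amounts to flattening the relation matrix and taking an $L_p$ norm of the resulting vector. A minimal Python sketch follows, assuming the matrix is given as a non-empty list of rows; the function name is hypothetical.

```python
def lp_norm(matrix, p=2):
    """L_p norm of a matrix viewed as one long vector (p >= 1 or float('inf'))."""
    flat = [abs(x) for row in matrix for x in row]   # flatten row by row
    if p == float("inf"):
        return max(flat)
    return sum(x ** p for x in flat) ** (1.0 / p)


relation_matrix = [[3.0, 4.0],
                   [0.0, 0.0]]
l2 = lp_norm(relation_matrix, p=2)   # Euclidean length of the flattened vector
l1 = lp_norm(relation_matrix, p=1)   # sum of absolute values
```

For p = 2 this coincides with the Frobenius norm of the matrix.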
Here $h'$, $r'$, and $t'$ denote the replaced head entity, relation, and tail
entity, respectively. We use $f(h, r, t)$ for the score of positive triples and
$f(h', r', t')$ for the score of negative triples. The margin is a
hyperparameter used to measure the interval between positive and negative
triples. The error function of the model is as follows:
\[ \sigma_{emp} = \mathit{margin} - f(h, r, t) + f(h', r', t') . \tag{11} \]
According to the error function, which represents the learning model, we use the
Cauchy inequality [20] to transform it and obtain:
\[
\sigma_{emp} = \mathit{margin} - f(h, r, t) + f(h', r', t')
< |\mathit{margin}| + |{-f(h, r, t)}| + f(h', r', t') . \tag{12}
\]
Because neither $|{-f(h, r, t)}|$ nor $f(h', r', t')$ exceeds the maximum of the
scoring function, $\max f(h, r, t)$, we can draw the following conclusion:
\[
\sigma_{emp} = \mathit{margin} - f(h, r, t) + f(h', r', t')
< |\mathit{margin}| + |{-f(h, r, t)}| + f(h', r', t')
< \mathit{margin} + 2 * \max f(h, r, t) . \tag{13}
\]
Through formula 13, we find the upper bound of the empirical error: the margin
plus twice the score of the highest-scoring triple. In this way, the upper bound
of the error can be located quickly. Finally, we hope to find a model with a
smaller upper bound of the empirical risk.
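The bound can be evaluated by scanning the training triples for the largest score. The Python sketch below assumes an L2 TransE-style score and dictionary-based embeddings; the function names and the toy data are hypothetical.

```python
import math


def score(h, r, t):
    """L2 norm of h + r - t (TransE-style scoring function)."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def empirical_error_bound(triples, entity_emb, relation_emb, margin):
    """Upper bound margin + 2 * max f(h, r, t) over the given triples."""
    max_score = max(score(entity_emb[h], relation_emb[r], entity_emb[t])
                    for h, r, t in triples)
    return margin + 2 * max_score


entity_emb = {"a": [0.0, 0.0], "b": [3.0, 4.0]}
relation_emb = {"r": [0.0, 0.0]}
triples = [("a", "r", "b"), ("a", "r", "a")]
bound = empirical_error_bound(triples, entity_emb, relation_emb, margin=1.0)
```

In this toy case the largest score is 5.0, so the bound evaluates to 11.0.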
5 Experiment
In this section, we present the experimental design and results of our methods.
We quantify the generalization error and the empirical error by two measures and
analyze them further.
For the generalization error, according to the proof in Sect. 3, the
generalization error is related only to the Lp norm of the relation matrix: the
smaller the Lp norm, the smaller the generalization error of the model. We
should therefore see in the experimental results that the generalization results
of the model and the relation Lp norm show opposite trends. That is to say, when
the Lp norm is small, the generalization ability of the corresponding model
should be better. To reflect the generalization results more intuitively, we use
MRR to evaluate the generalization ability. The higher the MRR value, the better
the model performs.
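As a reminder of how MRR is obtained, the following Python sketch ranks the true tail among all entities by score and averages the reciprocal ranks. It assumes an L2 TransE-style score where lower is better and an unfiltered ("raw") ranking with ties resolved in favor of the true entity; the function names and these conventions are assumptions for illustration.

```python
import math


def score(h, r, t):
    """L2 norm of h + r - t; a lower score means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def tail_rank(head, relation, true_tail, entity_emb, relation_emb):
    """1-based rank of the true tail among all entities for the query (h, r, ?)."""
    true_score = score(entity_emb[head], relation_emb[relation], entity_emb[true_tail])
    better = sum(1 for vec in entity_emb.values()
                 if score(entity_emb[head], relation_emb[relation], vec) < true_score)
    return 1 + better


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over a non-empty list of ranks."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)
```

Head prediction works the same way with the query (?, r, t), and the ranks from both directions are pooled before averaging.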
For the empirical error, we obtain the corresponding results from the upper
bound of the empirical error and the convergence rate of the model. That is, as
the model converges faster, the upper bound of the empirical error should become
smaller when the optimization objective is reached. To relate the upper bound of
the empirical error to the convergence rate of the model, we run experiments
with different embedding sizes. In Fig. 2, we constrain the loss of the TransE
model to roughly the same level, which facilitates comparison. We adopt the same
strategy for TransH and TransR. In the experimental design, the embedding size
is 50, 100, 150, 200, 250, or 300 under the same loss. The larger the embedding
size of the model, the less time it takes to complete training, and the smaller
the upper bound of the corresponding empirical error. That is to say, according
to the experimental results, we should see that the upper bound of the empirical
error decreases as the embedding size increases. In the training set, the upper
bound is max(h + r + t). We first load the embedding matrix of the model and the
data of the training set. Then the corresponding Lp norm values are calculated
and recorded by the algorithm, and finally the Lp norm curve is formed. To find
the upper bound of the empirical error, we traverse the training set and record
the relevant data.
Fig. 2. The loss of the model is controlled at approximately the same level.
For each model, we obtain the experimental results and show them in three
graphs, which reflect the relation Lp norm, the MRR result, and the upper bound
of the empirical error, respectively. For the generalization error, we use an Lp
norm to measure it and get a reverse concave curve. For the TransE model, the
MRR result is shown in Fig. 3(a), the relation Lp norm in Fig. 3(b), and the
upper bound of the empirical error in Fig. 3(c).
Fig. 3. Experimental results of the TransE model. According to the dimensions set
by the experiment, we record the training of the model in different dimensions, and
calculate the MRR value, the relation Lp norm and the upper bound of empirical error.
Similarly, we experimented with the TransH and TransR models. For the TransH
model, the MRR results are shown in Fig. 4(a), the relation Lp norm in
Fig. 4(b), and the upper bound in Fig. 4(c). The results of the TransR model are
shown in Fig. 5(a) to Fig. 5(c). We also record the time overhead of the three
models under different embedding sizes. For convenience, we use the number of
epochs to measure the training time required under different embedding size
settings when the model reaches approximately the same loss. The results are
given in Tables 2, 3 and 4, respectively. Generally speaking, the complexity of
the TransH model itself is higher than that of TransE, so its convergence is
relatively slow. The complexity of the TransH model itself refers to the model
construction. For example, TransE only assumes that the tail entity is
translated from the head entity through the relation, whereas TransH adds
operations such as the projection of head and tail entities. This is different
from the complexity discussed in this article. Since the TransR model uses
TransE in its training, its convergence is faster.
5.4 Discussion
Fig. 4. Experimental results of the TransH model. Like TransE model, MRR results,
Lp norms and upper bounds of empirical errors are obtained respectively.
Fig. 5. Experimental results of the TransR model. The experimental results of
TransE and TransH are almost the same, while TransR differs slightly from the
former two, possibly because the structure of the model itself affects the
experiment. On the whole, however, the results are in line with our goal.
Because TransR converges relatively fast, we use 50, 75, 100, 125, 150, 175,
200, 225 and 250 as the dimensions on the abscissa in (a), (b) and (c).
We first predict the tail entity, taking (h, r, ?) as input. We substitute
every entity for ?, compute a value for each candidate with the scoring
function, and sort the candidates by score. We then look up the true tail
entity t in the sorted list and record its rank. The head entity is predicted
in the same way. After processing all test triples like this, we average the
reciprocals of the recorded ranks to obtain the MRR. In Fig. 3(a), the MRR
curve first increases and then decreases. The curve therefore has a maximum,
which indicates that there is an interval of embedding sizes in which the
model achieves its best performance. Next, we look at the relation Lp norm of
the TransE model. As shown in Fig. 3(b), the curve of the relation Lp norm
first decreases and then increases, the opposite of the MRR result. We have
assumed that the smaller the relation Lp norm is, the stronger the
generalization ability of the model will be. Comparing the two plots, we find
that their trends are opposite. For example, from 50 to 150 dimensions, the
MRR increases while the relation Lp norm decreases. Beyond roughly 200
dimensions, the MRR decreases while the relation Lp norm increases. In the
range of 150 to 200 dimensions, the MRR and the relation Lp norm of TransE
reach their maximum and minimum, respectively; that is, this is the dimension
interval in which the generalization ability of the model is highest. This is
consistent with the assumption made in the experimental design section: the
lower the relation Lp norm of TransE is, the higher its generalization
ability will be. Since MRR reflects the generalization result of the model,
we conclude that the smaller the relation Lp norm is, the better the
generalization ability of the model will be. In this way, we can find the
best experimental dimension. Since TransH is an improvement on TransE,
designed to handle the complex relations that TransE cannot model, its
experimental results reflect our hypotheses even better. As can be seen from
Fig. 4(a) and (b), the MRR and the relation Lp norm of TransH reach their
maximum and minimum almost simultaneously at about 150 dimensions, and the
relation Lp norm curve in Fig. 4(b) has a more pronounced turning point, so
its minimum is easier to identify. Although the experimental results of
TransR are not as clean as those of TransH, they are also consistent with our
hypotheses. As can be seen from Fig. 5(a) and (b), the MRR and the relation
Lp norm of TransR also reach their maximum and minimum, respectively, in the
range of 150 to 200 dimensions. The curve in Fig. 5(a) is not as smooth as
those of TransE or TransH and fluctuates slightly. This may be related to the
nature of the model itself, since TransR is more complex than the former two,
but that question is beyond the scope of this paper. Through these
experiments, we measured the generalization error σgen of the model.
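The MRR computation described in this discussion can be sketched as below. This is a minimal illustration, not the evaluation code used in the experiments: it assumes a TransE-style L1 scoring function where lower is better, ranks against all entities without filtering known triples, and resolves ties optimistically. The function names are hypothetical.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE-style score (assumed here): L1 distance between h + r and t."""
    return np.abs(h + r - t).sum(axis=-1)

def rank_of_true(scores, true_index):
    """1-based rank of the true candidate when lower scores rank first."""
    return int((scores < scores[true_index]).sum()) + 1

def mean_reciprocal_rank(entities, relations, test_triples):
    reciprocal_ranks = []
    for h, r, t in test_triples:
        # Tail prediction: score (h, r, ?) against every entity.
        tail_scores = transe_score(entities[h], relations[r], entities)
        reciprocal_ranks.append(1.0 / rank_of_true(tail_scores, t))
        # Head prediction: score (?, r, t) against every entity.
        head_scores = transe_score(entities, relations[r], entities[t])
        reciprocal_ranks.append(1.0 / rank_of_true(head_scores, h))
    return float(np.mean(reciprocal_ranks))

# Toy embeddings: entity 0 translated by relation 0 lands exactly on entity 1.
entities = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
relations = np.array([[1.0, 0.0]])
mrr_good = mean_reciprocal_rank(entities, relations, [(0, 0, 1)])
mrr_bad = mean_reciprocal_rank(entities, relations, [(0, 0, 2)])
```

In the toy data, the triple (0, 0, 1) is ranked first in both directions, while (0, 0, 2) is ranked last in both, so the two calls bracket the range of values that MRR can take for three entities.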
Secondly, we measured the upper bound of the empirical error. According to
our hypotheses, this upper bound should decrease as the model is trained,
that is, as the model approaches the minimum error and is optimized. From
Figs. 3(c), 4(c) and 5(c), we can see that the upper bound of the empirical
error decreases as the number of dimensions increases, which indicates that
the model converges faster and finally reaches the optimal result. Tables 2,
3 and 4 also show that the time required to reach a similar loss under
different embedding sizes decreases as the number of dimensions increases.
This coincides with the trend of the upper bound curves: the higher the
dimension is, the smaller the error upper bound of the model will be. For
example, consider the upper bound of the TransE model in Fig. 3(c). As the
number of dimensions increases, the upper bound decreases, and the higher the
dimension is, the faster it declines. According to the inference in Sect. 3,
the upper bound of the empirical error is twice the value that the scoring
function takes at the triple that maximizes it. The smaller the upper bound
of the model is, the faster it converges. Referring to the training time that
the TransE model needs to reach the same loss at different dimensions, we can
see from Table 2 that the higher the dimension is, the shorter the training
time and the faster the convergence of the model. The upper bound in
Fig. 3(c) also decreases as the number of dimensions increases, which shows
that the smaller the upper bound is, the faster the model converges. Our
experiments thus indicate that max(h + r + t) can be used to reflect the
empirical error. TransR and TransH give similar results. Generally speaking,
the upper bound of their empirical error decreases with the increase of
dimensions, which is consistent with our previous assumptions and experimental
Table 2. Convergence time of the TransE model under different embedding sizes
Table 3. Convergence time of the TransH model under different embedding sizes
Table 4. Convergence time of the TransR model under different embedding sizes
the other factors on the training, such as using different optimization methods.
(2) Since our method focuses only on translation-based models, it is necessary
to explore other types of models, such as neural network models for
knowledge graph embedding. (3) Guided by research on model generalization,
we will attempt to construct a new and effective knowledge graph
embedding model.
References
1. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embed-
dings for simple link prediction. In: International Conference on Machine Learning,
pp. 2071–2080 (2016)
2. Szumlanski, S., Gomez, F.: Automatically acquiring a semantic network of related
concepts. In: Proceedings of the 19th ACM International Conference on Informa-
tion and Knowledge Management, pp. 19–28. ACM (2010)
3. Xiao, H., Huang, M., Meng, L., Zhu, X.: SSP: semantic space projection for knowl-
edge graph embedding with text descriptions. In: Thirty-First AAAI Conference
on Artificial Intelligence (2017)
4. Shi, B., Weninger, T.: ProjE: embedding projection for knowledge graph comple-
tion. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
5. Guan, S., Jin, X., Wang, Y., Cheng, X.: Shared embedding based neural networks
for knowledge graph completion. In: Proceedings of the 27th ACM International
Conference on Information and Knowledge Management, pp. 247–256. ACM (2018)
6. Neyshabur, B., Tomioka, R., Srebro, N.: In search of the real inductive bias: on
the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614
(2014)
7. Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-
batch training for deep learning: generalization gap and sharp minima. arXiv
preprint arXiv:1609.04836 (2016)
8. Neyshabur, B., Salakhutdinov, R.R., Srebro, N.: Path-SGD: path-normalized opti-
mization in deep neural networks. In: Advances in Neural Information Processing
Systems, pp. 2422–2430 (2015)
9. Hoffer, E., Hubara, I., Soudry, D.: Train longer, generalize better: closing the gen-
eralization gap in large batch training of neural networks. In: Advances in Neural
Information Processing Systems, pp. 1731–1741 (2017)
10. Neyshabur, B., Tomioka, R., Srebro, N.: Norm-based capacity control in neural
networks. In: Conference on Learning Theory, pp. 1376–1401 (2015)
11. Sharma, A., Talukdar, P., et al.: Towards understanding the geometry of knowledge
graph embeddings. In: Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pp. 122–131 (2018)
12. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating
embeddings for modeling multi-relational data. In: Advances in Neural Information
Processing Systems, pp. 2787–2795 (2013)
176 L. Zhang et al.
13. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by translat-
ing on hyperplanes. In: Twenty-Eighth AAAI Conference on Artificial Intelligence
(2014)
14. Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings
for knowledge graph completion. In: Twenty-Ninth AAAI Conference on Artificial
Intelligence (2015)
15. Xiao, H., Huang, M., Hao, Y., Zhu, X.: TransA: an adaptive approach for knowledge
graph embedding. arXiv preprint arXiv:1509.05490 (2015)
16. Ji, G., He, S., Xu, L., Liu, K., Zhao, J.: Knowledge graph embedding via dynamic
mapping matrix. In: Proceedings of the 53rd Annual Meeting of the Association for
Computational Linguistics and the 7th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers), pp. 687–696 (2015)
17. Neyshabur, B., Bhojanapalli, S., McAllester, D., Srebro, N.: Exploring generaliza-
tion in deep learning. In: Advances in Neural Information Processing Systems, pp.
5947–5956 (2017)
18. Kawaguchi, K., Kaelbling, L.P., Bengio, Y.: Generalization in deep learning. arXiv
preprint arXiv:1710.05468 (2017)
19. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep
learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016)
20. Dragomir, S.S.: A survey on Cauchy-Bunyakovsky-Schwarz type discrete inequal-
ities. J. Inequal. Pure Appl. Math. 4(3), 1–142 (2003)
21. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collabo-
ratively created graph database for structuring human knowledge. In: Proceedings
of the 2008 ACM SIGMOD International Conference on Management of Data, pp.
1247–1250. ACM (2008)
22. Han, X., et al.: OpenKE: an open toolkit for knowledge embedding. In: Proceedings
of the 2018 Conference on Empirical Methods in Natural Language Processing:
System Demonstrations, pp. 139–144 (2018)
23. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In:
Lechevallier, Y., Saporta, G. (eds) Proceedings of COMPSTAT 2010. Physica-
Verlag HD, Heidelberg (2010). https://doi.org/10.1007/978-3-7908-2604-3 16
Incorporating Instance Correlations
in Distantly Supervised Relation
Extraction
1 Introduction
Relation extraction aims to extract semantic relations between pairs of entities
from plain text. Because knowledge graphs (KGs) are powerful yet largely
incomplete, relation extraction has become an important task in KG construction
and completion. It can be modeled as a supervised classification task once the
entity pair has been identified by a named entity recognizer. Formally, given an
entity pair (e1, e2) and the instances (sentences) containing that pair, it aims
to predict the relation label r between e1 and e2 from a predefined relation set.
As shown in Fig. 1, given a bag of instances (S1, S2, ..., Sm) that all contain
the entity pair (Barack Obama, United States), the task is to assign the relation
label president to them.
Supervised relation extraction methods demand large-scale labeled data,
while manual labeling is time-consuming. Therefore, [14] proposes distant
supervision to address the challenge. It assumes that if two entities have a relation
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 177–191, 2020.
https://doi.org/10.1007/978-3-030-41407-8_12
178 L. Zhang et al.
Fig. 1. An example of the entity pair (Barack Obama, United States), their relation
label president, and the corresponding training instances
in a KG, then all instances mentioning the two entities express this relation.
Thus, a large amount of labeled data can be generated automatically by
distant supervision. However, since not all sentences containing the target
entities actually express their relation in the KG, this approach often suffers
from noisy data that are labeled by mistake [16,21].
Recently, significant progress has been made in the use of deep neural
networks for relation extraction [21,22]. To alleviate the noise in distantly
supervised datasets, attention has been utilized by [4,13]. Some efforts have
also been made to leverage relevant side information to improve relation
extraction. [17] uses entity type and relation alias information from KGs.
[11] incorporates entity descriptions to provide background knowledge. Because
the additional relevant information imposes soft constraints during prediction,
these methods achieve better performance.
However, these models treat the instances within a bag independently and
ignore the semantic correlations among them. For example, in Fig. 1, the
instance S1 does not express the relation label president directly, but the
instance S2 can provide significant background knowledge without any other
side information. It is therefore important to model the correlations among
multiple instances. Moreover, graph convolution networks have shown their
superiority in learning structural correlations in social networks.
Therefore, in this paper, we propose a novel GCN-based model, ICRE, that
incorporates instance correlations to improve relation extraction. Inspired
by recent work on GCNs, we note that the semantic structure of an instance
can be built from its dependency tree, as shown in Fig. 2(a). To model the
correlations among the instances within a bag, we construct a graph for each
bag from the dependency trees, pruned by removing stop words, as shown in
Fig. 2(b). After the graph construction, we utilize a graph convolution network
that maps every node into an embedding vector and thereby explores the
correlations among instances. By feeding the learned node (word) embeddings
into the instance encoder, we capture the context information of each instance.
In addition, an attention mechanism is introduced to attend over the bag of
instances for relation classification.
Finally, the learned graph embeddings are used for relation classification.
The contributions of this paper can be summarized as follows:
Fig. 2. The dependency parse tree of two instances and the constructed graph.
(1) We propose a novel GCN-based model, ICRE, that incorporates instance
correlations to improve relation extraction.
(2) The node embeddings learned through GCNs are used as our new word
embeddings, which may contain background knowledge implied by other
instances.
(3) Extensive experiments on a benchmark dataset demonstrate that our model
significantly outperforms the compared baselines.
2 Related Work
In this section, we will detail our proposed GCN-based model with an attention
mechanism for relation extraction. As shown in Fig. 2, our model ICRE consists
of four steps:
(1) Graph Construction. As shown in Fig. 3(a), we first get the dependency
parse tree for all instances in both the training set and testing set through
NLP tools. Then we build a graph with the dependency tree of each instance
in the same bag.
(2) Graph Convolution Layer. For the constructed graph, we exploit the
graph convolution network to learn the node embeddings, shown as Fig. 3(b).
(3) Relation Classifier. As shown in Fig. 3(c), the learned embeddings
are taken as the new representations of words and are used to initialize
each instance embedding. Then a CNN is used as another encoder
to capture the semantic information of each instance. In addition, an attention
mechanism is introduced to attend over the bag of instances for relation
classification. Finally, the learned graph representation is used to train the
relation classifier and obtain the corresponding relation label.
NLP tools to get the base dependency of each instance. After pruning by
removing stop words, the graph G(V, E) of this bag is constructed through the
common words or the entity pair (h, t), where the vertex set
$V = \{v_1, v_2, \dots, v_l\}$, $|V| = l$, consists of words and the dependencies
between words are modeled as edges. In this way, the correlations between
instances are built without losing semantic information. Let
$X = \{x_1, x_2, \dots, x_l\} \in \mathbb{R}^{l \times d_f}$ denote the graph's
feature matrix, with each row representing a vertex, where $d_f$ is the dimension
of the feature vectors. Specifically, we initialize each feature vector $x_i$ with
the pre-trained word vectors.
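The bag-level graph construction can be sketched as follows. This is only an illustration under stated assumptions: the dependency edges are written by hand rather than produced by a parser, the stop-word list is a small hypothetical one, and pruning is simplified to dropping every edge that touches a stop word. Instances become connected because shared words (here the two entities) map to the same vertex.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Illustrative stop-word list only; a real system would use a fuller one.
STOP_WORDS = {"is", "was", "the", "of", "in", "a", "an"}

Edge = Tuple[str, str]  # (head word, dependent word)

def build_bag_graph(parsed_instances: Iterable[List[Edge]]) -> Dict[str, Set[str]]:
    """Merge the dependency edges of all instances in a bag into one undirected graph."""
    adjacency: Dict[str, Set[str]] = {}
    for edges in parsed_instances:
        for head, dependent in edges:
            # Simplified pruning: skip any edge that involves a stop word.
            if head in STOP_WORDS or dependent in STOP_WORDS:
                continue
            adjacency.setdefault(head, set()).add(dependent)
            adjacency.setdefault(dependent, set()).add(head)
    return adjacency

# Hand-written (hypothetical) dependency edges for two instances of one bag.
s1 = [("president", "Barack_Obama"), ("president", "is"), ("president", "the"),
      ("president", "United_States"), ("United_States", "of")]
s2 = [("born", "Barack_Obama"), ("born", "was"), ("born", "United_States"),
      ("United_States", "in"), ("United_States", "the")]
graph = build_bag_graph([s1, s2])
```

After merging, the vertex for `Barack_Obama` is adjacent to both `president` (from the first instance) and `born` (from the second), which is the kind of cross-instance link the text describes.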
For convenience, we introduce the adjacency matrix $A \in \mathbb{R}^{l \times l}$
of G and its degree matrix D, where $D_{ii} = \sum_j A_{ij}$. Note that a word
never connects to itself in a dependency tree, so the information of the word
itself would be lost in the convolution operation. Thus, every node is assumed
to be connected to itself, i.e., the diagonal elements of A are set to 1.
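A small sketch of these matrices is given below. The adjacency matrix with self-loops and the degree computation follow the definitions in the text. The propagation step is an assumption: it uses the symmetric normalization of Kipf and Welling [12] with a ReLU activation, which may differ in detail from the layer used in ICRE. Function names are hypothetical.

```python
import numpy as np

def adjacency_with_self_loops(num_nodes, edges):
    """Symmetric adjacency matrix whose diagonal elements are set to 1."""
    A = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    np.fill_diagonal(A, 1.0)
    return A

def gcn_layer(A, X, W):
    """One graph convolution step, assuming the normalization of [12]."""
    degrees = A.sum(axis=1)               # D_ii = sum_j A_ij
    d_inv_sqrt = 1.0 / np.sqrt(degrees)   # safe: self-loops make every degree >= 1
    A_hat = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ X @ W, 0.0)  # ReLU activation

# Three words in a chain: 0 - 1 - 2.
A = adjacency_with_self_loops(3, [(0, 1), (1, 2)])
H = gcn_layer(A, np.eye(3), np.eye(3))
```

With identity features and weights, `H` equals the normalized adjacency matrix itself, which makes it easy to check how much each word contributes to its neighbours.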
where $v_i$ is the node embedding of word $w_i$, and $p_{i1}$ and $p_{i2}$ are its
position representations, which encode the relative distances to the target
entities (h, t) as $d_p$-dimensional vectors. For example, in the instance
“Barack Obama is the president of United States”, the relative distance from the
word “president” to the head entity Barack Obama is 3 and to the tail entity
United States is 2.
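The relative distances in this example can be reproduced with a few lines. The sketch assumes that each entity mention is treated as a single token and that distances are unsigned, as in the example; many implementations instead keep the sign to distinguish left from right. The helper name is hypothetical.

```python
def relative_distances(tokens, head, tail):
    """Return (distance to head, distance to tail) for every token position."""
    head_index = tokens.index(head)
    tail_index = tokens.index(tail)
    return [(abs(i - head_index), abs(i - tail_index)) for i in range(len(tokens))]

# Entity mentions are merged into single tokens (an assumption of this sketch).
tokens = ["Barack_Obama", "is", "the", "president", "of", "United_States"]
distances = relative_distances(tokens, "Barack_Obama", "United_States")
president_distances = distances[tokens.index("president")]
```

Each distance would then be looked up in a position embedding table to obtain the position representations of the word.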
Then, we use CNN with window size c as another encoder to capture the
semantic information of each instance si .
$$\alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{l} \exp(e_j)}, \tag{7}$$
where $e_i$ is a query-based function that scores how well the vertex $v_i$ and
the predicted relation label $r$ match. Specifically, $e_i$ is calculated as follows:
$$e_i = s_i W_e r, \tag{8}$$
$$s = \sum_{i=1}^{l} \alpha_i s_i. \tag{9}$$
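Equations (7) to (9) can be sketched in NumPy as follows. The shapes are assumptions made for illustration: the representations s_i are stacked as the rows of a matrix S, W_e is a weight matrix, and r is the query vector of the relation. The function name is hypothetical.

```python
import numpy as np

def selective_attention(S, W_e, r):
    """Attention-weighted sum of the rows of S, following Eqs. (7)-(9)."""
    e = S @ W_e @ r                   # e_i = s_i W_e r, Eq. (8)
    e = e - e.max()                   # shift for numerical stability; softmax is unchanged
    weights = np.exp(e)
    alpha = weights / weights.sum()   # Eq. (7)
    s = alpha @ S                     # s = sum_i alpha_i s_i, Eq. (9)
    return alpha, s

# Two representations that match the query equally well get equal weights.
S = np.array([[1.0, 0.0], [0.0, 1.0]])
alpha_equal, s_equal = selective_attention(S, np.eye(2), np.array([1.0, 1.0]))

# A query aligned with the first representation shifts weight towards it.
alpha_first, s_first = selective_attention(S, np.eye(2), np.array([2.0, 0.0]))
```

Subtracting the maximum score before exponentiation is a standard precaution against overflow and is not part of the equations themselves.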
Finally, to compute the confidence of each relation class, we feed the
representation of graph G into a softmax classifier after applying a linear
transformation. Formally,
4 Experiments
4.1 Dataset
Riedel. The dataset was developed by [15] by aligning Freebase with the New York
Times (NYT) corpus and has been widely used for distantly supervised relation
extraction [5,13]. Specifically, the training set consists of sentences from
the years 2005–2006 and the test set of sentences from the year 2007. Stanford
NER¹ is used to annotate the entity mentions. In total, there are 53 relation
labels, including a special relation NA indicating that there is no relation
between the target entity pair.
¹ https://stanfordnlp.github.io/CoreNLP/.
4.2 Baselines
Note that Mintz, MultiR and MIMLRE are based on human-designed features, and
their results on the Riedel dataset are taken from the corresponding papers.
Therefore, on the GIDS dataset, we select only CNN+ATT as the baseline to
compare with our method.
Following previous work [11,13], we evaluate the model in the held-out setting,
comparing the relations discovered from the test corpus with those in Freebase.
We report the Precision-Recall curve and the top-N precision (P@N) metric on the
Riedel dataset. To further evaluate the performance of our model, we use mean
average precision (MAP) and F1 as metrics on the GIDS dataset.
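As an illustration of the P@N metric, the sketch below ranks predictions by confidence and measures the fraction of correct ones among the top N. The input format (pairs of confidence and correctness) and the function name are assumptions made for this example.

```python
def precision_at_n(predictions, n):
    """Fraction of correct predictions among the n most confident ones.

    predictions: iterable of (confidence, is_correct) pairs.
    If fewer than n predictions exist, all of them are used.
    """
    ranked = sorted(predictions, key=lambda item: item[0], reverse=True)
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for _, correct in top if correct) / len(top)

# Hypothetical predictions: (confidence, matches a Freebase fact?)
predictions = [(0.9, True), (0.1, False), (0.8, False), (0.7, True)]
p_at_1 = precision_at_n(predictions, 1)
p_at_2 = precision_at_n(predictions, 2)
p_at_3 = precision_at_n(predictions, 3)
```

In held-out evaluation, correctness is decided by whether the predicted fact appears in the knowledge base, so the metric can underestimate precision when the knowledge base is incomplete.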
Parameter                        Value
Word Dimension dw                50
Position Dimension dp            5
Hidden Layer Dimension dh        230
Learning Rate α                  0.5
Regularization Coefficient η     0.0001
Dropout Probability p            0.5
Layer Number t                   2
For all models, we employ word embeddings pre-trained with the word2vec tool²
on the NYT corpus. We select the learning rate α from {0.1, 0.01, 0.005, 0.0001}
to minimize the loss, and set the other parameters following the settings used in
[5,11]. Dropout is applied to the output layer to prevent overfitting.
All parameters used in our experiments are detailed in Table 2. All experiments
are conducted on a machine with four GPUs (NVIDIA GTX-1080*4).
² https://code.google.com/p/word2vec/.
Fig. 7. P@N evaluation on the dataset which contains one instance for each entity pair
from Riedel.
Fig. 8. P@N evaluation on the dataset which contains two instances for each entity
pair from Riedel.
and use the results to predict the relation label, which are denoted as One and
Two. As shown in Figs. 7 and 8, our model maintains its advantage in all
situations. Note that, in the case of Two, the GCN can propagate valid features
between the two instances. In the case of One, however, there are no other
instances, and ICRE still outperforms CNN+ATT. This demonstrates that our graph
convolution layer over the dependency tree can capture more fine-grained semantic
information even when there is only one instance in the bag.
5 Conclusion
In this work, we leverage the graph convolution network to encode
the dependency tree and learn word embeddings. In this way, the correlations
among instances are built through their common words. Then, a CNN is used as
another encoder to capture the context information of each instance itself.
In addition, an instance-level attention mechanism is introduced to select valid
instances and learn the textual relation embedding. Finally, the learned
embedding is used to train our relation classifier. Our model takes full
advantage of both structural and context information while avoiding the imposed
noise. Experiments on two benchmark datasets demonstrate that our model
significantly outperforms the compared baselines.
In the future, we will explore more advanced encoders, such as graph attention
networks. We also aspire to discover more complex correlations and to utilize
these advanced encoders.
References
1. Bastings, J., Titov, I., Aziz, W., Marcheggiani, D., Simaan, K.: Graph convo-
lutional encoders for syntax-aware neural machine translation. arXiv preprint
arXiv:1704.04675 (2017)
2. Bruna, J., Zaremba, W., Szlam, A., Lecun, Y.: Spectral networks and locally con-
nected networks on graphs. In: International Conference on Learning Representa-
tions (ICLR2014), CBLS, April 2014 (2014)
3. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on
graphs with fast localized spectral filtering. In: Advances in Neural Information
Processing Systems, pp. 3844–3852 (2016)
4. Du, J., Han, J., Way, A., Wan, D.: Multi-level structured self-attentions for dis-
tantly supervised relation extraction. In: EMNLP, pp. 2216–2225 (2018)
5. Han, X., Liu, Z., Sun, M.: Neural knowledge acquisition via mutual attention
between knowledge graph and text. In: AAAI, pp. 4832–4839 (2018)
6. Han, X., Yu, P., Liu, Z., Sun, M., Li, P.: Hierarchical relation extraction with
coarse-to-fine grained attention. In: EMNLP, pp. 2236–2245 (2018)
7. He, Z., Chen, W., Li, Z., Zhang, M., Zhang, W., Zhang, M.: See: syntax-aware
entity embedding for neural relation extraction. In: Thirty-Second AAAI Confer-
ence on Artificial Intelligence (2018)
8. Henaff, M., Bruna, J., LeCun, Y.: Deep convolutional networks on graph-structured
data. arXiv preprint arXiv:1506.05163 (2015)
9. Hoffmann, R., Zhang, C., Ling, X., Zettlemoyer, L., Weld, D.S.: Knowledge-based
weak supervision for information extraction of overlapping relations. In: ACL, pp.
541–550 (2011)
10. Jat, S., Khandelwal, S., Talukdar, P.: Improving distantly supervised relation
extraction using word and entity based attention. arXiv preprint arXiv:1804.06987
(2018)
11. Ji, G., Liu, K., He, S., Zhao, J., et al.: Distant supervision for relation extraction
with sentence-level attention and entity descriptions. In: AAAI, pp. 3060–3066
(2017)
12. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907 (2016)
13. Lin, Y., Shen, S., Liu, Z., Luan, H., Sun, M.: Neural relation extraction with
selective attention over instances. In: ACL, vol. 1, pp. 2124–2133 (2016)
14. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extrac-
tion without labeled data. In: ACL/IJCNLP, pp. 1003–1011 (2009)
15. Riedel, S., Yao, L., McCallum, A.: Modeling relations and their mentions without
labeled text. In: ECML/PKDD, pp. 148–163 (2010)
16. Surdeanu, M., Tibshirani, J., Nallapati, R., Manning, C.D.: Multi-instance multi-
label learning for relation extraction. In: EMNLP-CoNLL, pp. 455–465 (2012)
17. Vashishth, S., Joshi, R., Prayaga, S.S., Bhattacharyya, C., Talukdar, P.: Reside:
improving distantly-supervised neural relation extraction using side information.
In: EMNLP, pp. 1257–1266 (2018)
18. Yao, L., Mao, C., Luo, Y.: Graph convolutional networks for text classification
(2018)
19. Yuan, C., Huang, H., Feng, C., Liu, X., Wei, X.: Distant supervision for relation
extraction with linear attenuation simulation and non-IID relevance embedding.
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp.
7418–7425 (2019)
20. Yuan, Y., et al.: Cross-relation cross-bag attention for distantly-supervised relation
extraction. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol.
33, pp. 419–426 (2019)
21. Zeng, D., Liu, K., Chen, Y., Zhao, J.: Distant supervision for relation extraction
via piecewise convolutional neural networks. In: EMNLP, pp. 1753–1762 (2015)
22. Zeng, D., Liu, K., Lai, S., Zhou, G., Zhao, J.: Relation classification via convolu-
tional deep neural network. In: COLING, pp. 2335–2344 (2014)
23. Zeng, W., Lin, Y., Liu, Z., Sun, M.: Incorporating relation paths in neural relation
extraction. In: EMNLP, pp. 1768–1777 (2017)
A Physical Embedding Model
for Knowledge Graphs
1 Introduction
The number and size of knowledge graphs (KGs) available on the Web and in
companies grow steadily.¹ For example, more than 150 billion facts describing
more than 3 billion things are available in the more than 10,000 knowledge
¹ https://lod-cloud.net/.
This work was supported by the German Federal Ministry of Transport and Digital
Infrastructure project OPAL (GA: 19F2028A) as well as the H2020 Marie Sklodowska-
Curie project KnowGraphs (GA no. 860801).
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 192–209, 2020.
https://doi.org/10.1007/978-3-030-41407-8_13
2 Related Work
A large number of KGE approaches have been developed in the recent past to
address tasks such as link prediction, graph completion and question answering
[7,8,12,13,18]. In the following, we give a brief overview of some of these
approaches. More details can be found in the survey in [19]. RESCAL [13] is
based on computing a three-way factorization of an adjacency tensor representing
the input KG. The adjacency tensor is decomposed into a product of a core tensor
and embedding matrices. RESCAL captures rich interactions in the input KG but is
limited in its scalability. HolE [12] uses circular correlation as its
compositional operator. Holographic embeddings of knowledge graphs yield
state-of-the-art results on the link prediction task while keeping the memory
complexity lower than
² lodstats.aksw.org.
194 C. Demir and A.-C. N. Ngomo
RESCAL and TransR [8]. ComplEx [18] is a KGE model based on latent
factorization, wherein complex-valued embeddings are utilized to handle a large
variety of binary relations, including symmetric and antisymmetric relations.
Energy-based KGE models [1–3] yield competitive performance on link
prediction, graph completion and entity resolution. SE [3] proposes to learn one
low-dimensional vector (in $\mathbb{R}^k$) for each entity and two matrices
($R_1 \in \mathbb{R}^{k \times k}$, $R_2 \in \mathbb{R}^{k \times k}$) for each
relation. Hence, for a given triple (h, r, t), SE aims to minimize the L1
distance, i.e., $f_r(h, t) = \|R_1 h - R_2 t\|$. The approach in [1] embeds
entities and relations into the same embedding space and proposes capturing
correlations between entities and relations by using multiple matrix products.
TransE [2] is a scalable energy-based KGE model wherein a relation r between
entities h and t corresponds to a translation of their embeddings, i.e., h + r ≈ t
provided that (h, r, t) exists in the KG. TransE outperforms state-of-the-art
models in the link prediction task on several benchmark KG datasets while being
able to deal with KGs containing up to 17 million facts. DistMult [22] proposes
to generalize neural-embedding models under a unified learning framework,
wherein relations are bilinear or linear mapping functions between embeddings
of entities.
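The two energy functions mentioned above can be written down directly. The sketch below is an illustration with hypothetical function names and toy vectors, not code from the cited systems. For TransE, the choice of the norm (L1 by default here) is an assumption, since the text only states h + r ≈ t.

```python
import numpy as np

def se_score(R1, R2, h, t):
    """SE energy: L1 distance between the two relation-specific projections."""
    return float(np.abs(R1 @ h - R2 @ t).sum())

def transe_score(h, r, t, p=1):
    """TransE energy: Lp norm of h + r - t (0 when t is exactly h translated by r)."""
    return float(np.linalg.norm(h + r - t, ord=p))

h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
t = np.array([1.0, 1.0])

exact_translation = transe_score(h, r, t)          # t equals h + r
wrong_tail = transe_score(h, r, np.array([3.0, 1.0]))
se_identity = se_score(np.eye(2), np.eye(2), h, r)  # identity projections
```

Lower energies indicate more plausible triples in both models, which is why link prediction ranks candidate entities by ascending score.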
With Pyke, we propose a different take on generating embeddings by combining
a physical model with simulated annealing. Our evaluation suggests that
this simulation-based approach to generating embeddings scales well (i.e.,
linearly in the size of the KG) while outperforming the state of the art in the
type prediction and clustering quality tasks [20,21].
Notation        Description
G               An RDF knowledge graph
R, P, B, L      Set of all RDF resources, predicates, blank nodes and literals respectively
S               Set of all RDF subjects with type information
V               Vocabulary of G
σ               Similarity function on V
$\vec{x}_t$     Embedding of x at time t
Fa, Fr          Attractive and repulsive forces, respectively
K               Threshold for positive and negative examples
$\mathcal{P}$   Function mapping each x ∈ V to a set of attracting elements of V
$\mathcal{N}$   Function mapping each x ∈ V to a set of repulsive elements of V
P               Probability
ω               Repulsive constant
E               System energy
                Upper bound on alteration of locations of x ∈ V across two iterations
Δe              Energy release
The increase of a deforming force on the spring is linearly related to the increase
of the magnitude of the corresponding deformation. In equation form, Hooke’s
law can be expressed as follows:
F = −k Δ (1)
PPMI(a, b) is defined as
$$\mathrm{PPMI}(a, b) = \max\left(0, \log \frac{P(a, b)}{P(a)\,P(b)}\right), \tag{3}$$
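A minimal sketch of Eq. (3) is given below. How the probabilities are estimated is an assumption of this example: they are taken as relative frequencies of ordered pairs in a hypothetical co-occurrence count table, with the natural logarithm. The function name and the toy counts are illustrative only.

```python
import math

def ppmi(pair_counts, a, b):
    """Positive pointwise mutual information of (a, b) from pair counts."""
    total = sum(pair_counts.values())
    joint = pair_counts.get((a, b), 0) / total
    if joint == 0.0:
        return 0.0  # log(0) is undefined; PPMI clamps such pairs to 0
    # Marginals estimated from the same table (an assumption of this sketch).
    p_a = sum(c for (x, _), c in pair_counts.items() if x == a) / total
    p_b = sum(c for (_, y), c in pair_counts.items() if y == b) / total
    return max(0.0, math.log(joint / (p_a * p_b)))

# Hypothetical co-occurrence counts.
counts = {("a", "b"): 2, ("a", "c"): 1, ("d", "b"): 1}
ppmi_ab = ppmi(counts, "a", "b")   # co-occur less often than chance -> clamped to 0
ppmi_db = ppmi(counts, "d", "b")   # co-occur more often than chance -> positive
ppmi_dc = ppmi(counts, "d", "c")   # never co-occur -> 0
```

The clamping at zero is what distinguishes PPMI from plain PMI: pairs that co-occur less often than expected by chance are treated the same as pairs that never co-occur.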
4 Pyke
In this section, we introduce our novel KGE approach dubbed Pyke (a physical
model for knowledge graph embeddings). Section 4.1 presents the intuition
behind our model. In Sect. 4.2, we give an overview of the Pyke framework,
starting from processing the input KG to learning embeddings for the input in
a vector space with a predefined number of dimensions. The workflow of our
model is further elucidated using the running example shown in Fig. 1.
4.1 Intuition
Pyke is an iterative approach that aims to represent each element x of the
vocabulary V of an input KG G as an embedding (i.e., a vector) in the
n-dimensional space $\mathbb{R}^n$. Our approach begins by assuming that each
element of V is mapped to a single point (i.e., its embedding) of unit mass whose
location can be expressed via an n-dimensional vector in $\mathbb{R}^n$ according
to an initial (e.g., random) distribution at iteration t = 0. In the following,
we will use $\vec{x}_t$ to denote the embedding of x ∈ V at iteration t. We also
assume a similarity function σ : V × V → [0, ∞) (e.g., a PPMI-based similarity)
over V to be given. Simply put, our goal is to improve this initial distribution
iteratively over a predefined maximal number of iterations (denoted T) by
ensuring that
1. the embeddings of similar elements of V are close to each other while
2. the embeddings of dissimilar elements of V are distant from each other.
Let $d : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^+$ be the distance (e.g., the Euclidean distance) between
two embeddings in $\mathbb{R}^n$. According to our goal definition, a good iterative
embedding approach should have the following characteristics:
C1: If σ(x, y) > 0, then $d(\vec{x}_t, \vec{y}_t) \le d(\vec{x}_{t-1}, \vec{y}_{t-1})$. This means that the
embeddings of similar terms should become more similar with the number of
iterations. The same holds the other way around:
C2: If σ(x, y) = 0, then $d(\vec{x}_t, \vec{y}_t) \ge d(\vec{x}_{t-1}, \vec{y}_{t-1})$.
We translate C1 into our model as follows: If x and y are similar (i.e., if
σ(x, y) > 0), then a force $F_a(\vec{x}_t, \vec{y}_t)$ of attraction must exist between the masses
which stand for x and y at any time t. $F_a(\vec{x}_t, \vec{y}_t)$ must be proportional to
$d(\vec{x}_t, \vec{y}_t)$, i.e., the attraction between the two masses must grow with the
distance between $\vec{x}_t$ and $\vec{y}_t$.
A Physical Embedding Model for Knowledge Graphs 197
These conditions are fulfilled by setting the following force of attraction between
the two masses:
$$\|F_a(\vec{x}_t, \vec{y}_t)\| = \sigma(x, y) \times d(\vec{x}_t, \vec{y}_t). \quad (4)$$
From the perspective of a physical model, this is equivalent to placing a spring
with a spring constant of σ(x, y) between the unit masses which stand for x and
y. At time t, these masses are hence accelerated towards each other with a total
acceleration proportional to $\|F_a(\vec{x}_t, \vec{y}_t)\|$.
The translation of C2 into a physical model is as follows: If x and y are not
similar (i.e., if σ(x, y) = 0), we assume that they are dissimilar. Correspondingly,
their embeddings should diverge with time. The magnitude of the repulsive force
between the two masses representing x and y should be strong if the masses
are close to each other and should diminish with the distance between the two
masses. We can fulfill this condition by setting the following repulsive force
between the two masses:
$$\|F_r(\vec{x}_t, \vec{y}_t)\| = -\frac{\omega}{d(\vec{x}_t, \vec{y}_t)}, \quad (5)$$
where ω > 0 denotes a constant, which we dub the repulsive constant. At
iteration t, the embeddings of dissimilar terms are hence accelerated away from each
other with a total acceleration proportional to $\|F_r(\vec{x}_t, \vec{y}_t)\|$. This is the inverse
of Hooke’s law, where the magnitude of the repulsive force between the mass
points which stand for two dissimilar terms decreases with the distance between
the two mass points.
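The contrast between the two force laws can be made concrete with a small Python sketch. It assumes the Euclidean distance for d and returns the size ω/d for the repulsion, reading the minus sign of the repulsive force as an indication of direction; the function names are hypothetical.

```python
import math


def attraction_magnitude(similarity, u, v):
    """Spring-like attraction: grows linearly with the distance between u and v."""
    return similarity * math.dist(u, v)


def repulsion_magnitude(omega, u, v):
    """Repulsion: shrinks as the distance between u and v grows."""
    return omega / math.dist(u, v)


if __name__ == "__main__":
    origin = [0.0, 0.0]
    for point in ([0.3, 0.4], [3.0, 4.0], [30.0, 40.0]):
        print(point,
              attraction_magnitude(0.5, origin, point),
              repulsion_magnitude(1.0, origin, point))
```

Moving the second point away from the origin increases the attraction and decreases the repulsion, which is the behaviour the two conditions C1 and C2 ask for.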
Based on these intuitions, we can now formulate the goal of Pyke formally:
We aim to find embeddings for all elements of V which minimize the total dis-
tance between similar elements and maximize the total distance between dissim-
ilar elements. Let P : V → 2V be a function which maps each element of V to
the subset of V it is similar to. Analogously, let N : V → 2V map each element
of V to the subset of V it is dissimilar to. Pyke aims to optimize the following
objective function:
$$J(V) = \left( \sum_{x \in V} \sum_{y \in P(x)} d(\vec{x}, \vec{y}) \right) - \left( \sum_{x \in V} \sum_{y \in N(x)} d(\vec{x}, \vec{y}) \right). \quad (6)$$
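A minimal Python sketch of the objective J may help to see how it is evaluated for a given set of embeddings. It assumes the Euclidean distance and represents embeddings, P and N as plain dictionaries; these data structures and the function name are illustrative choices, not taken from the paper.

```python
import math


def objective(embeddings, positives, negatives):
    """Total distance to attracting elements minus total distance to repulsive ones."""
    attract = sum(math.dist(embeddings[x], embeddings[y])
                  for x in embeddings for y in positives.get(x, ()))
    repel = sum(math.dist(embeddings[x], embeddings[y])
                for x in embeddings for y in negatives.get(x, ()))
    return attract - repel
```

Lower values are better under this formulation: pulling similar elements together shrinks the first sum, and pushing dissimilar elements apart grows the second.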
4.2 Approach
explain each of the steps of the approach in detail. We use the RDF graph shown
in Fig. 1 as a running example.3
Fig. 2. PPMI similarity matrix of resources in the RDF graph shown in Fig. 1
4
We use A for the sake of explanation. For practical applications, this step can be
implemented using priority queues, hence making quadratic space complexity for
storing A unnecessary.
5
Preliminary experiments suggest that applying a singular value decomposition on A
and initializing the embeddings with the latent representation of the elements of the
vocabulary along the n most salient eigenvectors has the potential of accelerating
the convergence of our approach.
200 C. Demir and A.-C. N. Ngomo
Iteration. This is the crux of our approach. In each iteration t, our approach
assumes that the elements of P (x) attract x with a total force
$$F_a(\vec{x}_t) = \sum_{y \in P(x)} \sigma(x, y) \times (\vec{y}_t - \vec{x}_t). \quad (10)$$
On the other hand, the elements of N (x) repulse x with a total force
$$F_r(\vec{x}_t) = -\sum_{y \in N(x)} \frac{\omega}{(\vec{y}_t - \vec{x}_t)}. \quad (11)$$
We assume that exactly one unit of time elapses between two iterations.
The embedding of x at iteration t + 1 can now be calculated by displacing
$\vec{x}_t$ proportionally to $(F_a(\vec{x}_t) + F_r(\vec{x}_t))$. However, implementing this model
directly leads to a chaotic (i.e., non-converging) behavior in most cases. We
enforce the convergence using an approach borrowed from simulated annealing,
i.e., we reduce the total energy of the system by a constant factor Δe after each
iteration. By these means, we can ensure that our approach always terminates,
i.e., we can iterate until J(V) does not decrease significantly or until a maximal
number of iterations T is reached.
Fig. 3. PCA projection of 50-dimensional embeddings for our running example. Left are
the randomly initialized embeddings. The figure on the right shows the 50-dimensional
Pyke embedding vectors for our running example after convergence. Pyke was
configured with K = 3, ω = −0.3, Δe = 0.06 and ε = 10^−3.
Algorithm 1. Pyke
Require: T, V, K, ε, Δe, ω, n
// initialize embeddings
for each x in V do
    $\vec{x}_0$ = random vector in $\mathbb{R}^n$;
end for
// initialize similarity matrix
A = new Matrix[|V|][|V|];
for each x in V do
    for each y in V do
        A_xy = PPMI(x, y);
    end for
end for
// perform positive and negative sampling
for each x in V do
    P(x) = getPositives(A, x, K);
    N(x) = getNegatives(A, x, K);
end for
// iteration
t = 1;
E = 1;
while t < T do
    for each x in V do
        $F_a = \sum_{y \in P(x)} \sigma(x, y) \times (\vec{y}_{t-1} - \vec{x}_{t-1})$;
        $F_r = -\sum_{y \in N(x)} \frac{\omega}{\vec{y}_{t-1} - \vec{x}_{t-1}}$;
        $\vec{x}_t = \vec{x}_{t-1} + E \times (F_a + F_r)$;
    end for
    E = E − Δe;
    if $\sum_{x \in V} \|\vec{x}_t - \vec{x}_{t-1}\| < \epsilon$ then
        break
    end if
    t = t + 1;
end while
return Embeddings $\vec{x}_t$
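The iteration part of Algorithm 1 can be sketched in plain Python as follows. This is an illustration under several assumptions, not the authors' implementation: embeddings are lists of floats, the similarity σ is given as a dictionary keyed by pairs, the division in the repulsive term is applied per coordinate, coordinates that coincide are skipped to avoid a division by zero, and the energy is clamped at zero. The names `pyke_step` and `pyke` are hypothetical.

```python
import math
import random


def pyke_step(emb, sim, positives, negatives, omega, energy):
    """Compute all embeddings at iteration t from those at iteration t-1."""
    updated = {}
    for x, vec in emb.items():
        force = [0.0] * len(vec)
        for y in positives.get(x, ()):
            # Attraction: sigma(x, y) * (y - x).
            for i, (xi, yi) in enumerate(zip(vec, emb[y])):
                force[i] += sim[(x, y)] * (yi - xi)
        for y in negatives.get(x, ()):
            # Repulsion: -omega / (y - x), applied per coordinate.
            for i, (xi, yi) in enumerate(zip(vec, emb[y])):
                if yi != xi:  # assumption: skip coinciding coordinates
                    force[i] -= omega / (yi - xi)
        updated[x] = [xi + energy * fi for xi, fi in zip(vec, force)]
    return updated


def pyke(vocab, sim, positives, negatives, n, max_iter, eps, delta_e, omega, seed=0):
    """Run the iteration until the total displacement drops below eps."""
    rng = random.Random(seed)
    emb = {x: [rng.uniform(-1.0, 1.0) for _ in range(n)] for x in vocab}
    energy = 1.0
    for _ in range(1, max_iter):
        new_emb = pyke_step(emb, sim, positives, negatives, omega, energy)
        moved = sum(math.dist(new_emb[x], emb[x]) for x in vocab)
        emb = new_emb
        # Assumption: clamp the energy at zero so that forces never change sign.
        energy = max(0.0, energy - delta_e)
        if moved < eps:
            break
    return emb
```

All forces in one step are computed from the previous embeddings, so the update is synchronous, as the indices t−1 in the algorithm suggest.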
5 Complexity Analysis
5.1 Space Complexity
Let m = |V|. We would need at most m(m−1)/2 entries to store A, as the matrix is
symmetric and we do not need to store its diagonal. However, there is actually no
need to store A. We can implement P(x) as a priority queue of size K in which
the indexes of the K elements of V most similar to x as well as their similarity to x
are stored. N(x) can be implemented as a buffer of size K which contains only
indexes. Once N(x) reaches its maximal size K, new entries (i.e., y with
PPMI(x, y)) are added randomly. Hence, we need O(Kn) space to store both
P and N. Note that K << m. The embeddings require exactly 2mn space as we
store $\vec{x}_t$ and $\vec{x}_{t-1}$ for each x ∈ V. The force vectors Fa and Fr each require a
space of n. Hence, the space complexity of Pyke lies clearly in O(mn + Kn) and
is hence linear w.r.t. the size of the input knowledge graph G when the number
n of dimensions of the embeddings and the number K of positive and negative
examples are fixed.
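The idea of keeping only the K most similar elements per resource, rather than a full similarity matrix, can be sketched with Python's standard `heapq` module. The function name `top_k_similar` and the callable `similarity` are assumptions made for illustration.

```python
import heapq


def top_k_similar(x, candidates, similarity, k):
    """Return up to k (score, element) pairs with the highest positive similarity to x."""
    scored = ((similarity(x, y), y) for y in candidates if y != x)
    positive = (pair for pair in scored if pair[0] > 0)
    # nlargest keeps a bounded heap of size k, so the full ranking is never stored.
    return heapq.nlargest(k, positive)
```

Because the candidates are consumed as a generator, the memory needed per resource stays proportional to K rather than to the vocabulary size.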
6 Evaluation
The goal of our evaluation was to compare the quality of the embeddings gener-
ated by Pyke with the state of the art. Given that there is no intrinsic measure
for the quality of embeddings, we used two extrinsic evaluation scenarios. In the
first scenario, we measured the type homogeneity of the embeddings generated
by the KGE approaches we considered. We achieved this goal by using a scal-
able approximation of DBScan dubbed HDBSCAN [4]. In our second evaluation
scenario, we compared the performance of Pyke on the type prediction task
against that of 6 state-of-the-art algorithms. In both scenarios, we only consid-
ered embeddings of the subset S of V as done in previous works [10,17]. We
set K = 5, Δe = 0.0414 and ω = 1.45557 throughout our experiments. The
values were computed using a Sobol Sequence optimizer [16]. All experiments
were carried out on a single core of a server running Ubuntu 18.04 with 126 GB
RAM with 16 Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz processors.
6
download.bio2rdf.org/#/release/4/drugbank.
7
Note that we compile the DBpedia datasets by merging the dumps of mapping-based
objects, skos categories and instance types provided in the DBpedia download
folder for version 2016-10 at downloads.dbpedia.org/2016-10.
predicted with x’s known type vector using the cosine similarity:
$$\text{prediction score} = \frac{1}{|S|} \sum_{x \in S} \sum_{y \in \mu nn(x)} \cos\big(\mathrm{type}(x), \mathrm{type}(y)\big), \quad (13)$$
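The prediction score can be computed as in the following Python sketch. It assumes that type information is available as numeric vectors and that μnn(x) is given as a precomputed list of neighbours of x in the embedding space; both the data layout and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def prediction_score(type_vectors, neighbours):
    """Sum the cosine similarities over all neighbours, then divide by |S|."""
    total = sum(cosine(type_vectors[x], type_vectors[y])
                for x in neighbours for y in neighbours[x])
    return total / len(neighbours)
```

The sketch follows the formula literally and divides only by the number of subjects, not by the number of neighbours per subject.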
6.2 Results
Cluster Purity Results. Table 3 displays the cluster purity results for all
competing approaches. Pyke achieves a cluster purity of 0.75 on Drugbank and
clearly outperforms all other approaches. DBpedia turned out to be a more dif-
ficult dataset. Still, Pyke was able to outperform all state-of-the-art approaches
by between 11% and 26% (absolute) on Drugbank and between 9% and 23%
(absolute) on DBpedia. Note that in 3 cases, the implementations available were
unable to complete the computation of embeddings within 24 h.
Table 3. Cluster purity results. The best results are marked in bold. Experiments
marked with * did not terminate after 24 h of computation.
Type Prediction Results. Figures 4 and 5 show our type prediction results
on the Drugbank and DBpedia datasets. Pyke outperforms all state-of-the-
art approaches across all experiments. In particular, it achieves a margin of
up to 22% (absolute) on Drugbank and 23% (absolute) on DBpedia. Like in
the previous experiment, all KGE approaches perform worse on DBpedia, with
prediction scores varying between <0.1 and 0.32.
Fig. 4. Mean results on type prediction scores on 10^5 randomly sampled entities of
DBpedia
Fig. 6. Runtime performances of Pyke on synthetic KGs. Colored lines represent fitted
linear regressions with fixed K values of Pyke. (Color figure online)
K    Coefficient   Intercept   R²
5    4.52          10.74       0.997
10   4.65          13.64       0.996
20   5.23          19.59       0.997
We believe that the good performance of Pyke stems from (1) its sampling
procedure and (2) its being akin to a physical simulation. Employing PPMI to
quantify the similarity between resources seems to yield better sampling results
than generating negative examples using the local closed world assumption that
underlies the sampling procedures of all competing state-of-the-art KG models.
More importantly, positive and negative sampling occur in our approach per
resource rather than per RDF triple. Therefore, Pyke is able to leverage more
from negative and positive sampling. By virtue of being akin to a physical
simulation, Pyke is able to run efficiently even when each resource x is mapped to
45 attractive and 45 repulsive resources (see Table 5) whilst all state-of-the-art
KGE approaches required more computation time.
7 Conclusion
References
1. Bordes, A., Glorot, X., Weston, J., Bengio, Y.: A semantic matching energy func-
tion for learning with multi-relational data. Mach. Learn. 94, 233–259 (2014)
2. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating
embeddings for modeling multi-relational data. Curran Associates, Inc. (2013)
3. Bordes, A., Weston, J., Collobert, R., Bengio, Y.: Learning structured embeddings
of knowledge bases. In: Twenty-Fifth AAAI Conference on Artificial Intelligence
(2011)
4. Campello, R.J.G.B., Moulavi, D., Sander, J.: Density-based clustering based on
hierarchical density estimates. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu,
G. (eds.) PAKDD 2013. LNCS (LNAI), vol. 7819. Springer, Heidelberg (2013).
https://doi.org/10.1007/978-3-642-37456-2_14
5. Guo, Y., Pan, Z., Heflin, J.: LUBM: a benchmark for owl knowledge base systems.
Web Semant. Sci. Serv. Agents World Wide Web 3(2–3), 158–182 (2005)
6. Hitchcock, F.L.: The expression of a tensor or a polyadic as a sum of products. J.
Math. Phys. 6(1–4), 164–189 (1927)
7. Huang, X., Zhang, J., Li, D., Li, P.: Knowledge graph embedding based question
answering. In: Proceedings of the Twelfth ACM International Conference on Web
Search and Data Mining. ACM (2019)
8. Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings
for knowledge graph completion. In: Twenty-Ninth AAAI Conference on Artificial
Intelligence (2015)
9. Manning, C., Raghavan, P., Schütze, H.: Introduction to information retrieval. Nat.
Lang. Eng. (2010)
10. Melo, A., Paulheim, H., Völker, J.: Type prediction in RDF knowledge bases using
hierarchical multilabel classification. In: Proceedings of the 6th International Con-
ference on Web Intelligence, Mining and Semantics, p. 14. ACM (2016)
11. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. In: Advances in Neural
Information Processing Systems (2013)
12. Nickel, M., Rosasco, L., Poggio, T.: Holographic embeddings of knowledge graphs.
In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI
2016 pp. 1955–1961 (2016)
13. Nickel, M., Tresp, V., Kriegel, H.P.: A three-way model for collective learning on
multi-relational data. In: ICML, vol. 11 (2011)
14. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word represen-
tation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP) (2014)
15. Ristoski, P., Paulheim, H.: RDF2Vec: RDF graph embeddings for data mining. In:
International Semantic Web Conference (2016)
16. Saltelli, A., Annoni, P., Azzini, I., Campolongo, F., Ratto, M., Tarantola, S.: Vari-
ance based sensitivity analysis of model output. Design and estimator for the total
sensitivity index. Comput. Phys. Commun. 181(2), 259–270 (2010)
17. Thoma, S., Rettinger, A., Both, F.: Towards holistic concept representations:
embedding relational knowledge, visual attributes, and distributional word seman-
tics. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS, vol. 10587. Springer, Cham
(2017). https://doi.org/10.1007/978-3-319-68288-4_41
18. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embed-
dings for simple link prediction. In: International Conference on Machine Learning
(2016)
19. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: a survey of
approaches and applications. IEEE Trans. Knowl. Data Eng. 29, 2724–2743 (2017)
20. Wang, X., Cui, P., Wang, J., Pei, J., Zhu, W., Yang, S.: Community preserving
network embedding. In: AAAI (2017)
21. Xie, R., Liu, Z., Jia, J., Luan, H., Sun, M.: Representation learning of knowledge
graphs with entity descriptions. In: Proceedings of the Thirtieth AAAI Conference
on Artificial Intelligence, AAAI 2016, pp. 2659–2665. AAAI Press (2016)
22. Yang, B., Yih, W.t., He, X., Gao, J., Deng, L.: Embedding entities and relations for
learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575 (2014)
Iterative Visual Relationship Detection
via Commonsense Knowledge Graph
Hai Wan¹, Jialing Ou¹, Baoyi Wang¹, Jianfeng Du²(B), Jeff Z. Pan³,
and Juan Zeng⁴(B)
¹ School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
[email protected],{oujl5,wangby9}@mail2.sysu.edu.cn
² School of Information Science and Technology/School of Cyber Security,
Guangdong University of Foreign Studies, Guangzhou, China
[email protected]
³ Department of Computing Science, The University of Aberdeen, Aberdeen, UK
[email protected]
⁴ School of Geography and Planning, Sun Yat-sen University, Guangzhou, China
[email protected]
1 Introduction
Visual relationship detection, introduced by [12], aims to capture a wide variety
of interactions between pairs of objects in an image. Visual relations can be
represented as a set of relation triples in the form of (subject, predicate, object),
e.g., (person, ride, horse). Visual relationship detection can be used for many
high-level image understanding tasks such as image captioning [1] and visual QA [6].
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 210–225, 2020.
https://doi.org/10.1007/978-3-030-41407-8_14
Fig. 1. An example image from VRD. The relation between (person, b1 ) and (horse, b2 )
is ride. Although (horse, b3 ) is similar to (horse, b2 ) in visual and positional
features, the relation between (person, b1 ) and (horse, b3 ) is next to, not ride.
image, and a relation is the edge from the head entity to the tail entity. An exam-
ple in VRD is shown in Fig. 1. The relation between (person, b1 ) and (horse, b2 )
is ride. This visual triple is in the form of ((person, b1 ), ride, (horse, b2 )). The
visual triple is shown in SG1 .
However, Fig. 1 also shows that, if only the appearance or spatial features of
(person, b1 ) and (horse, b3 ) are considered, it is more likely that the
relationship between these two objects is incorrectly detected as ride, as shown in
SG1 . To avoid that, we introduce the notion of the commonsense knowledge
graph (CKG), in which each triple is labeled with its conditional probability.
For example, the conditional probability of next to with 0.21 between horse
and horse in CKG is higher than ride with 0.11 between person and horse, so
we can get the correct visual triple ((person, b1 ), next to, (horse, b3 )) after
iteratively updating with the commonsense knowledge graph, as shown in SG2 . This
suggests that it is important to consider CKG in visual relationship detection.
However, the task is difficult, and there are at least three challenges:
1. CKG is a global graph for the image set rather than a graph that aims at one
image, while visual relationship detection focuses on a given image.
2. A pair of object classes may have different relations even in the same image
(e.g., “person” and “horse” shown in the CKG of Fig. 1), making it difficult to
update CKG.
3. CKG and feature information of images should be considered jointly in order
to facilitate visual relationship detection.
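To make the idea of a CKG whose triples carry conditional probabilities more tangible, the following Python sketch derives such probabilities from a list of annotated triples. It assumes that the probability of a predicate is conditioned on the pair of subject and object classes and estimated by relative frequency; this section does not state how the probabilities are obtained, so both the conditioning and the function name are assumptions.

```python
from collections import Counter


def conditional_probabilities(triples):
    """Map each (subject, predicate, object) to P(predicate | subject, object)."""
    pair_counts = Counter((s, o) for s, _, o in triples)
    triple_counts = Counter(triples)
    return {(s, p, o): count / pair_counts[(s, o)]
            for (s, p, o), count in triple_counts.items()}


if __name__ == "__main__":
    annotations = [("person", "ride", "horse")] + [("person", "next to", "horse")] * 3
    for triple, prob in sorted(conditional_probabilities(annotations).items()):
        print(triple, prob)
```

Since the counts are aggregated over the whole image set, the resulting graph is global, which is exactly the first challenge listed above.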
2 Preliminary
In this section, we first recall the definitions of scene graph and visual relationship
detection. Then we give the definition of commonsense knowledge graph. We also
recall the bi-directional recurrent neural network used in our model.
[17] identified the visual triples of scene graph. We only consider entities and
relations without attributes in this paper and give the definition of scene graph
as follows. W.l.o.g. we assume that all images are in a finite set I. All classes in
I are in a finite set C. All predicates1 in I are in a finite set P.
There are 4 objects, 2 predicates, and 4 visual relation triples in the scene
graph TI of Fig. 1, e.g., ((person, b1 ), ride, (horse, b2 )). For simplicity, we write
it as (person, ride, horse).
1 Throughout this paper, we identify the predicate in visual relationship detection
with the relation in the scene graph.
Fig. 2. The overview of our visual relation detection framework. Besides an object
detector that gives a group of detected bounding boxes and their corresponding
classification probabilities, the framework has two modules to perform detection: a feature
module and a commonsense knowledge module. Both modules roll out iteratively while
cross-feeding beliefs. The final prediction f is produced by combining the predictions
with an attention mechanism.
3 Method
In this section, we propose a model named Iterative Visual Relationship Detec-
tion with Commonsense Knowledge Graph (IVRDC). The overall pipeline of
IVRDC (Fig. 2) is divided into object detection and relationship detection. Rela-
tionship detection consists of two modules: feature module and commonsense
knowledge module. Both modules roll-out iteratively while cross-feeding beliefs.
The final prediction is obtained by combining predictions from each iteration
with attention mechanism.
In object detection, for each image I, we use Faster R-CNN [14] to obtain a
group of bounding boxes and their classes and pack each bounding box bI,k with
its class c together to be an object oc,I,k . So for each image, we obtain several
objects labeled with classes and the corresponding boxes.
Visual relation prediction aims to predict visual triples (subject, predicate,
object). The feature module captures the interactions between objects by using
feature vectors, and the commonsense module provides the conditional
probability for reference. We construct memories to store information across iterations.
Then the model combines the outputs of the two modules, fF and fC , to update
the two memories, MF and MC . We will discuss the iteration and attention
mechanism of each module in detail.
3.1 Feature Module
In the feature module, three features are taken into consideration: appearance
feature, spatial feature and word vector. And the module employs Bi-RNN to
learn those features to detect predicates [11].
We encode an image I of shape H × W × C, where H and W denote the height
and the width, and C denotes the number of channels of the image. For our work,
C = 3. For each image I, each candidate object $o_{c,I,k} = (c, b_{I,k}) \in O_I$ has a
bounding box $b_{I,k} = (x_{min}, y_{min}, x_{max}, y_{max})$ and its detected class c. Since the
visual information of an image can imply interactions among objects and is
particularly useful for visual relation detection, we construct an appearance feature
$v_{app}$ to encode visual information, which retains not only object features but also
their context information. For preprocessing, we construct a new, larger bounding
box $b_{o,o'}$ that encompasses the two boxes of an object pair $(o, o')$. We use VGG16 [16]
to encode the region enclosed by $b_{o,o'}$, of shape $H' \times W' \times C'$, where $H' = W' = 224$
and $C' = 3$. The region is passed through the VGG16 network to obtain the
corresponding features, which are then fed into a convolutional network with two
convolution layers and one 300-D fully-connected layer to get the appearance feature $v_{app}$.
Spatial information is also a key factor that influences our detection. The
spatial feature is learned by a convolutional neural network. In an image I, an
object pair $(o_{c,I,k}, o_{c',I,k'})$ contains two bounding boxes $b_{o_{c,I,k}}$ and $b_{o_{c',I,k'}}$. First,
we apply dual spatial masks to the bounding boxes to get two binary masks, one for
object $o_{c,I,k}$ and another for object $o_{c',I,k'}$. Then the masks are down-sampled to
a predefined square (32 × 32) [5]. Finally, a convolutional network with three
convolution layers and a 300-D fully-connected layer takes the masks as inputs to
obtain the 300-D spatial feature $v_{spa}$.
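The binary spatial masks described above can be approximated with a short Python sketch. Instead of building a full-resolution mask and down-sampling it, the sketch samples the box directly at the centres of a 32 × 32 grid, which is a simplification chosen for brevity; the function name `box_mask` is hypothetical.

```python
def box_mask(box, width, height, size=32):
    """Binary size x size mask: 1 where the grid cell centre lies inside the box."""
    xmin, ymin, xmax, ymax = box
    mask = [[0] * size for _ in range(size)]
    for row in range(size):
        for col in range(size):
            # Centre of the grid cell, expressed in image coordinates.
            cx = (col + 0.5) * width / size
            cy = (row + 0.5) * height / size
            if xmin <= cx < xmax and ymin <= cy < ymax:
                mask[row][col] = 1
    return mask


def dual_masks(subject_box, object_box, width, height, size=32):
    """One mask per object of the pair, stacked as a two-channel input."""
    return [box_mask(subject_box, width, height, size),
            box_mask(object_box, width, height, size)]
```

The two-channel result is what a small convolutional network would take as input; the network itself is outside the scope of this sketch.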
The features mentioned above are visual features and express the relation
between two objects. To consider the semantic features and the independence of
objects, we represent an object class as a word vector. In this work, Word2vec
[13] is used to learn the word vectors. For an image I, each object $o_{c,I,k}$ has an
object class c, so we can find the word vector corresponding to the name of
c (e.g., person). The relation between two words is an inherent semantic
relationship rather than the mathematical distance between one-hot vectors. Intuitively,
similar object pairs may have similar relationships. For example, the
relationship between “person” and “sand” is normally “on”, and “horse” and “person” are
similar in semantic space, so the model can infer (horse, on, sand). Similarly,
some infrequent relations can be learned from frequent ones. For a pair of
objects $(o_{c,I,k}, o_{c',I,k'})$ in the image I, we generate two feature vectors, $v_{w(o_{c,I,k})}$
and $v_{w(o_{c',I,k'})}$, for subject and object, abbreviated as $v_{w(s)}$ and $v_{w(o)}$.
Before feeding features into a Bi-RNN, we concatenate the appearance feature
$v_{app}$ and the spatial feature $v_{spa}$ and pass the concatenated vector through a
fully-connected layer to obtain the visual feature $v_{vis}$. Then we combine the visual feature
and the semantic features. Applying the Bi-RNN to predict relationships, we feed the
feature vectors $v_{w(s)}$, $v_{vis}$ and $v_{w(o)}$ to the inputs x1 , x2 , and x3 (shown in Eq. (1) and (2)),
respectively. The Bi-RNN structure is shown in Fig. 3.
Fig. 3. Bi-RNN has three inputs in sequence (vw(s) , vvis and vw(o) ) and one output
(predicate prediction y).
3.3 Iteration
where W1 and W2 are weights, fC,i+1 is the updated probability, fC,i denotes the
result from the commonsense knowledge module, and fF denotes the prediction
from the feature module. Then fC,i+1 can be used to get the updated memories,
$M_C^{i+1}$ and $M_F^{i+1}$.
Then we update the feature memory MF by a convolutional gated recurrent
unit (GRU) [4]. F denotes a memory for a pair of candidate objects. Fup denotes
a memory that we construct to update the memory. We extract the appearance
feature and the spatial feature from memory F . Then we combine fC,i+1 with
addition and convolution layers to form memory Fup . We update the feature
memory according to the following formula:
where u denotes the update gate, r denotes the reset gate, and Fi+1 is the updated
memory. Wf , WF , and b are convolutional weights and bias, and ◦ is the entry-wise
product. σ() is an activation function. After that, Fi+1 is used to update memory
$M_F^{i+1}$.
The new memories, $M_C^{i+1}$ and $M_F^{i+1}$, will lead to another round of updated
fC and fF , and the iteration goes on. In this way, the feature memory can
benefit from the commonsense knowledge graph. At the same time, the subgraph of
the commonsense knowledge graph can get a better sense of the particular image.
3.4 Attention
To refine the model output, we generate the final prediction f by combining
the predictions of all iterations instead of using only the last one. To
combine the predictions from each iteration, we introduce an attention mechanism
[3] to our framework. This means that the final output is a weighted version of all
predictions using attention. Mathematically, if the model iterates n times, it
outputs N = 2n predictions fn (n from the feature module and n from the commonsense
module), each with an attention an . The final output f is represented as:
$$f = \sum_{n=1}^{N} w_n f_n \quad (7)$$
$$w_n = \frac{\exp(a_n)}{\sum_{n'} \exp(a_{n'})} \quad (8)$$
$$a_n = \mathrm{ReLU}(W f_n + b) \quad (9)$$
where fn is the logits before softmax, wn denotes the weight of each prediction, and an
is produced from fn with the activation function ReLU. The introduction of the
attention mechanism enables the model to select feasible predictions from different
modules and iterations.
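The attention-based combination of Eqs. (7) to (9) can be sketched in plain Python. The sketch assumes that W is a weight vector that maps each logit vector to a scalar attention value and that b is a scalar bias; the section does not give their shapes, and the function name `attention_combine` is hypothetical.

```python
import math


def attention_combine(predictions, weight, bias):
    """Combine logit vectors with softmax-normalised ReLU attention scores."""
    # a_n = ReLU(W f_n + b), one scalar per prediction.
    scores = [max(0.0, sum(w * p for w, p in zip(weight, f)) + bias)
              for f in predictions]
    # w_n = softmax(a_n); subtracting the maximum keeps exp() numerically stable.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # f = sum_n w_n f_n, computed per output dimension.
    dim = len(predictions[0])
    combined = [sum(w * f[i] for w, f in zip(weights, predictions))
                for i in range(dim)]
    return combined, weights
```

With three roll-outs of both modules, `predictions` would hold six logit vectors, one per module and iteration.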
3.5 Training
The total loss function consists of the feature module loss LF , the commonsense
module loss LC and the final prediction loss Lf . To pay more attention to the
harder examples, we assign different weights to the examples in the loss, based on the
predictions from previous iterations. Then the cross-entropy loss is represented as:
the spatial feature [5]. For the word vectors of classes, we train our Word2vec model
based on the class set and the triples in CKG. The Bi-RNN has two hidden
layers, and each layer has 128 hidden states. We roll out the feature module and
the commonsense module three times and update the subgraph of the commonsense
knowledge graph at each iteration.
4 Experiments
We evaluate the proposed method on two recently released datasets. We first
introduce datasets and experimental settings, and then analyze the experimental
results in detail.
4.1 Datasets
We evaluate our proposed model on the Visual Relationship Dataset (VRD) [12]
and Visual Genome (VG) [8], shown in Table 1.
VRD contains 5,000 images with 100 object classes and 70 predicates. VRD
contains 37,993 relation annotations with 6,672 triple types in total. Following
the same train/test split as in [12], we split the images into two sets, 4,000 images
for training and 1,000 for testing.
VG contains 99,658 images with 200 object classes and 100 predicates.
In total, VG contains 1,174,692 relation annotations with 19,237 triple types.
Following the experiments in [19], we split the data into 73,801 images for training and
25,857 for testing.
Like [12], we evaluate our proposed method for the following tasks:
– Predicate detection: this task focuses on the accuracy of predicate predic-
tion. The input includes the object classes and the bounding boxes of both
the subject and object. In this condition, we can learn how difficult it is to
predict relationships without the limitations of object detection.
– Phrase/Union detection: the task treats the whole triple (sub, pred, obj)
as a union bounding box which contains the subject and object. A prediction
is considered correct if all the three elements in a triple are correct and the
Intersection over Union (IoU) between the detected box and the ground truth
bounding box is greater than 0.5.
– Relationship detection: this task treats a triple (sub, pred, obj) as three
components. A prediction is considered correct if three elements in a triple
are correct and the IoU of subject and object are both above 0.5 with the
ground-truth bounding box.
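The correctness criteria for phrase detection and relationship detection can be sketched as follows. Boxes are assumed to be tuples (xmin, ymin, xmax, ymax) and a prediction is assumed to be a tuple of (labels, subject box, object box); these representations and the function names are illustrative choices.

```python
def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def union_box(a, b):
    """Smallest box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def phrase_correct(pred, truth, threshold=0.5):
    """Labels match and the union boxes overlap by more than the threshold."""
    return (pred[0] == truth[0]
            and iou(union_box(pred[1], pred[2]), union_box(truth[1], truth[2])) > threshold)


def relationship_correct(pred, truth, threshold=0.5):
    """Labels match and both the subject box and the object box overlap enough."""
    return (pred[0] == truth[0]
            and iou(pred[1], truth[1]) > threshold
            and iou(pred[2], truth[2]) > threshold)
```

A prediction can pass the phrase criterion while failing the relationship criterion, for instance when the subject box is too small but the union box still matches the ground truth.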
[Figure 4: three example images each from VRD (#1 to #3) and VG (#1 to #3), listing for IVRDC-F, IVRDC-FC and IVRDC-FCI the rank of the correct relation and the top-5 predicted relations (Ans#1 to Ans#5).]
Fig. 4. Qualitative examples of relation prediction. We show the correct relation rank-
ings and the top-5 answers from IVRDC-F, IVRDC-FC and IVRDC-FCI on VRD and
VG. Relations in bold, italic, and underline fonts denote the correct, plausible, and
wrong answers respectively.
Since the task of visual relationship detection was proposed, only the VRD dataset
has been publicly released. All previous works conduct experiments on this dataset
and select data from the whole VG dataset [8] by themselves. Recently, VTransE
has released its VG dataset. On the VG dataset, we compare our proposed model
with models that follow the same implementation setup, i.e., that use the same
dataset to train and test, e.g., VTransE, PPR-FCN, DSL, and DSR.
on VRD in zero-shot learning demonstrated in Table 4, we can see that our
proposed model works best on phrase detection. Our best result achieved 6.92 and
8.73 for R@50 and R@100, respectively. As for predicate detection, our method
outperforms DSR for R@50, and our proposed model reaches the average level
of the previous models.
– IVRDC-FC : Part of our model. We combine the feature module and the
commonsense knowledge module without an iteration.
– IVRDC-FCI : Our model introduced in Fig. 2.
From Tables 2 and 3, we observe that our best results outperform the previous
best state-of-the-art results by up to 25%, and even our worst result reaches
the average level. Moreover, our method performs differently on the three
detection tasks depending on the components used. IVRDC-FC is relatively strong
in phrase detection and relation detection, while IVRDC-FCI performs better in
predicate detection.
The results in Tables 2, 3 and 4 show: (1) The joint model IVRDC-FC performs
significantly better in phrase detection and relation detection, which indicates
that the CKG is very useful in visual relation detection. The combination of the
feature module and the commonsense knowledge module considerably outperforms
the model IVRDC-F, which has only the feature module. (2) The model IVRDC-FCI
performs best in predicate detection. This indicates that iteratively using image
features and the CKG benefits predicate detection by exploiting both image
feature information and commonsense knowledge. (3) Relation detection
reaches the average level. Since relation detection depends heavily on the
accuracy of the object detector, our result is probably limited by the
performance of the object detector. Using the same object detector as VTransE,
our result outperforms VTransE by 32.6% in R@100.
Figure 4 further shows the predicted relationships on several example images.
In the example (plate, have, sandwich) from image VG#2 in Fig. 4, IVRDC-F,
which uses image features only, is better at predicting predicates that follow
from the image, e.g., under. IVRDC-FCI is able to learn the meaning of have by
combining the CKG with iterations, bringing it to a higher rank. Since
commonsense knowledge is a statistical result, the more often a predicate
occurs, the higher its probability will be, e.g., on.
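The remark that commonsense knowledge is a statistical result can be illustrated with a frequency-based prior. The sketch below is an assumption about how such statistics might be turned into probabilities (the relative frequency of each predicate for a given subject-object pair); it is not the authors' CKG construction, and the counts are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def predicate_prior(triples: Iterable[Triple]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Relative frequency of each predicate for every (subject, object) pair."""
    counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for subj, pred, obj in triples:
        counts[(subj, obj)][pred] += 1
    return {
        pair: {pred: n / sum(ctr.values()) for pred, n in ctr.items()}
        for pair, ctr in counts.items()
    }


# Hypothetical annotation counts for the pair (person, horse).
observed = (
    [("person", "on", "horse")] * 6
    + [("person", "ride", "horse")] * 3
    + [("person", "next to", "horse")]
)
prior = predicate_prior(observed)
pair_prior = prior[("person", "horse")]
ranked = sorted(pair_prior, key=pair_prior.get, reverse=True)
```

Under this kind of prior a frequent predicate such as on is ranked first regardless of the image, which matches the behaviour described for the knowledge-based variants.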
References
1. Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: semantic propositional
image caption evaluation. In: Leibe, B., Matas, J., Sebe, N., Welling, M.
(eds.) ECCV 2016, Part V. LNCS, vol. 9909, pp. 382–398. Springer, Cham (2016).
https://doi.org/10.1007/978-3-319-46454-1_24
2. Bordes, A., Usunier, N., Garcı́a-Durán, A., Weston, J., Yakhnenko, O.: Translating
embeddings for modeling multi-relational data. In: Proceedings of International
Conference on Neural Information Processing Systems (NIPS2013), pp. 2787–2795
(2013)
3. Chen, L., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: scale-
aware semantic image segmentation. In: Proceedings of CVPR, 2016, pp. 3640–
3649 (2016). https://doi.org/10.1109/CVPR.2016.396
4. Chung, J., Gülçehre, Ç., Cho, K., Bengio, Y.: Empirical evaluation of gated recur-
rent neural networks on sequence modeling. CoRR abs/1412.3555 (2014). http://
arxiv.org/abs/1412.3555
5. Dai, B., Zhang, Y., Lin, D.: Detecting visual relationships with deep relational
networks. In: Proceedings of CVPR, 2017, pp. 3298–3308 (2017). https://doi.org/
10.1109/CVPR.2017.352
6. Dong, L., Wei, F., Zhou, M., Xu, K.: Question answering over freebase with multi-
column convolutional neural networks. In: Proceedings of ACL, 2015, pp. 260–269
(2015). http://aclweb.org/anthology/P/P15/P15-1026.pdf
7. Johnson, J., et al.: Image retrieval using scene graphs. In: Proceedings of CVPR,
pp. 3668–3678 (2015). http://dx.doi.org/10.1109/CVPR.2015.7298990
8. Krishna, R., et al.: Visual genome: connecting language and vision using crowd-
sourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017).
https://doi.org/10.1007/s11263-016-0981-7
9. Liang, K., Guo, Y., Chang, H., Chen, X.: Visual relationship detection with deep
structural ranking. In: Proceedings of AAAI, 2018 (2018). https://www.aaai.org/
ocs/index.php/AAAI/AAAI18/paper/view/16491
10. Liang, X., Lee, L., Xing, E.P.: Deep variation-structured reinforcement learning
for visual relationship and attribute detection. In: Proceedings of CVPR, 2017,
pp. 4408–4417 (2017). https://doi.org/10.1109/CVPR.2017.469
11. Liao, W., Lin, S., Rosenhahn, B., Yang, M.Y.: Natural language guided visual
relationship detection. CoRR abs/1711.06032 (2017). http://arxiv.org/abs/1711.
06032
12. Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual relationship detection with
language priors. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016,
Part I. LNCS, vol. 9905, pp. 852–869. Springer, Cham (2016).
https://doi.org/10.1007/978-3-319-46448-0_51
13. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word rep-
resentations in vector space. CoRR abs/1301.3781 (2013). http://arxiv.org/abs/
1301.3781
14. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object
detection with region proposal networks. In: Proceedings of NIPS, 2015, pp.
91–99 (2015). http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-
object-detection-with-region-proposal-networks
Iterative Visual Relationship Detection via Commonsense Knowledge Graph 225
15. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans.
Signal Process. 45(11), 2673–2681 (1997). https://doi.org/10.1109/78.650093
16. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. CoRR abs/1409.1556 (2014). http://arxiv.org/abs/1409.1556
17. Wan, H., Luo, Y., Peng, B., Zheng, W.: Representation learning for scene graph
completion via jointly structural and visual embedding. In: Proceedings of IJCAI,
2018, pp. 949–956 (2018). https://doi.org/10.24963/ijcai.2018/132
18. Yu, R., Li, A., Morariu, V.I., Davis, L.S.: Visual relationship detection with internal
and external linguistic knowledge distillation. In: Proceedings of ICCV, 2017, pp.
1068–1076 (2017). https://doi.org/10.1109/ICCV.2017.121
19. Zhang, H., Kyaw, Z., Chang, S., Chua, T.: Visual translation embedding network
for visual relation detection. In: Proceedings of CVPR, 2017, pp. 3107–3115 (2017).
https://doi.org/10.1109/CVPR.2017.331
20. Zhang, H., Kyaw, Z., Yu, J., Chang, S.: PPR-FCN: weakly supervised visual rela-
tion detection via parallel pairwise R-FCN. In: Proceedings of IEEE, 2017, pp.
4243–4251 (2017). http://doi.ieeecomputersociety.org/10.1109/ICCV.2017.454
21. Zhu, Y., Jiang, S.: Deep structured learning for visual relationship detection. In:
Proceedings of AAAI, 2018 (2018). https://www.aaai.org/ocs/index.php/AAAI/
AAAI18/paper/view/16475
A Dynamic and Informative Intelligent
Survey System Based
on Knowledge Graph
1 Introduction
This paper is about how to use a knowledge graph to build an intelligent survey
system. In fields such as Linguistics, Psychology, and Medicine, researchers rely
on data from human participants, which are gathered by verbal communication,
written questionnaires, or Internet-based questionnaires. Online surveys
are particularly popular in contemporary research due to their global reach, flexi-
bility, ease of data analysis, and low administration cost, among other advantages
[10]. However, research suggests that participant motivation in surveys decreases
over time such that respondents are likely to engage in a sub-optimal way, low-
ering the overall quality of data collected [15]. Respondents may be reluctant
to complete surveys due to low interest in participation, resulting in decreased
response rates overall [18,28]. Internet-based questionnaires are also by nature
less interactive than face-to-face data collection, limiting researchers’ ability to
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 226–241, 2020.
https://doi.org/10.1007/978-3-030-41407-8_15
2 Background
2.1 Knowledge Graph
A knowledge graph G = (D , S ) consists of a data sub-graph D of interconnected
typed entities and their attributes as well as a schema sub-graph S that defines
228 P. Bansky et al.
Previous linguistic studies on the AEP (without the presence of to be) point
to need being the most commonly used main verb, followed by want and like
[19]. Moreover, inanimate subjects seem to be more acceptable with
want and like in the AEP than in the StEP [7]. However, these findings are based
on studies conducted only on the North American population using American
English, and therefore may not apply to Scottish and Northern Irish speakers
who use the AEP.
A Dynamic System. The dynamic system should serve the purposes of the infor-
mative system. In other words, the algorithms which deliver the questions to
respondents ought to do so in such a way as to realise the requirements which
make the system informative. This is in contrast to typical grammaticality judge-
ment questionnaires, where the same questions are asked in every survey, not
taking into account the participants’ responses. In these traditional static sur-
veys the order of the questions is predefined or entirely randomised in advance,
and therefore cannot be changed as the survey is conducted. The fixed presenta-
tion of questions does not allow for a more tailored experience for the respondent
and does not allow for user feedback in the form of comments to be taken into
account. Additionally, the number of questions is limited, which means that a
researcher may only be able to cover a select few variables of interest.
Two key ontologies are designed for the proposed system: a general purpose
Survey Ontology and a domain specific ontology, such as a Linguistic Feature
Ontology.
The Survey Ontology contains classes such as SurveyQuestion, AnswerOption,
SurveyAnswer, User, Participation and Hypothesis. It contains properties
such as hasSurveyUser, hasSurveyQuestion and hasSurveyAnswer. We refer the
reader to [31] for more details of the Survey Ontology.
The Linguistic Feature Ontology has classes such as Sentence, POS, Subject
(Subject ⊑ POS), AnimateSubject (AnimateSubject ⊑ Subject), InanimateSubject
(InanimateSubject ⊑ Subject), DefiniteSubject (DefiniteSubject ⊑ Subject),
IndefiniteSubject (IndefiniteSubject ⊑ Subject), Verb (Verb ⊑ POS), MainVerb
(with instances need/want/like, MainVerb ⊑ Verb), AEP (AEP ⊑ POS) and
StEP (StEP ⊑ POS). The Linguistic Feature Ontology has properties such as
hasPOS and hasString.
When a linguistic researcher annotates survey questions (such as the one
containing Sentence S1, The dog needs walked), a set of statements is
constructed in the knowledge graph:
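The statements themselves are not reproduced in this text. As an illustrative sketch only, the annotation of S1 could yield triples along the following lines. The class names and the properties hasPOS and hasString come from the Linguistic Feature Ontology described above; the identifiers such as `ex:S1` and `ex:S1_subject`, the helper `annotate_sentence`, and the exact shape of the statements are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def annotate_sentence(
    sid: str,
    text: str,
    subject_text: str,
    subject_classes: Sequence[str],
    main_verb: str,
    construction: str,
) -> List[Triple]:
    """Build knowledge-graph statements for one annotated survey sentence."""
    subj = f"{sid}_subject"            # hypothetical node for the subject
    verb = f"{sid}_mainverb"           # hypothetical node for the main verb
    cons = f"{sid}_construction"       # hypothetical node for AEP / StEP
    triples: List[Triple] = [
        (sid, "rdf:type", "Sentence"),
        (sid, "hasString", text),
        (sid, "hasPOS", subj),
        (subj, "hasString", subject_text),
        (sid, "hasPOS", verb),
        (verb, "rdf:type", "MainVerb"),
        (verb, "hasString", main_verb),
        (sid, "hasPOS", cons),
        (cons, "rdf:type", construction),
    ]
    # One typing statement per subject feature, e.g. AnimateSubject.
    triples += [(subj, "rdf:type", cls) for cls in subject_classes]
    return triples


s1 = annotate_sentence(
    "ex:S1",
    "The dog needs walked",
    "The dog",
    ["AnimateSubject", "DefiniteSubject"],
    "need",
    "AEP",
)
```

Because AnimateSubject and DefiniteSubject are subclasses of Subject, and Subject of POS, a reasoner could infer the more general types from these statements without extra annotation.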
4.2 Algorithms
In this sub-section, we present two algorithms that responsively
select the sentence for the next question, with the help of the sentence
classification discussed in the previous sub-section (cf. the discussion of S1).
Algorithm 1 considers the effects of linguistic features such as the choice
of the main verb, namely need, like, want, as well as whether the subject is
Animate/Inanimate or Definite/Indefinite. Along with these variables, the
presence/absence of the non-finite passive auxiliary to be gives a total of
3 × 2 × 2 × 2 = 24 families of sentences.
The key challenge is how to select a subset of the 144 sentences for a survey,
which typically includes about 30 questions. The main idea is to use results from
some baseline studies of these 144 sentences to learn the acceptability ranking
of these sentences and their related families. Instead of covering every one of
the 144 sentences, Algorithm 1 selects the next sentence based on user judgements
of the current sentence (lines 4 and 7), resulting in 2–4 sentences per family
group.
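The size of the sentence pool follows from the feature combinations. The short sketch below enumerates them and pairs each AEP family with its StEP counterpart, mirroring the 24 families and 12 groups used in the first case study. The variable names are illustrative; only the feature values and the total of 144 sentences come from the text.

```python
from itertools import product

MAIN_VERBS = ["need", "want", "like"]
ANIMACY = ["animate", "inanimate"]
DEFINITENESS = ["definite", "indefinite"]
TO_BE = [False, True]  # False: AEP (no "to be"); True: StEP (with "to be")

# One family per combination of the four variables.
families = list(product(MAIN_VERBS, ANIMACY, DEFINITENESS, TO_BE))

# A group pairs the AEP and StEP families sharing the other three features.
groups = {}
for verb, animacy, definiteness, to_be in families:
    groups.setdefault((verb, animacy, definiteness), []).append(to_be)

TOTAL_SENTENCES = 144
sentences_per_family = TOTAL_SENTENCES // len(families)
```

With six sentences per family and roughly 30 questions per survey, a participant can see only a fraction of the pool, which is why the next sentence has to be chosen adaptively.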
5.1 Hypotheses
The AEP has been claimed to be found among speakers in Scotland and
Northern Ireland, but there has been little investigation of this feature for these
populations. We therefore seek to investigate the following hypotheses:
– Hypothesis 1: Speakers who use AEP like will also use AEP want, and
speakers who use AEP want will also use AEP need.
– Hypothesis 2: Some subset of speakers will allow inanimate subjects with
AEP want and like, but not StEP want and like. Speakers who allow inani-
mate subjects with StEP want and like will also allow them with AEP want
and like.
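Hypothesis 1 describes an implicational hierarchy (like implies want, and want implies need). A minimal sketch of how one participant's acceptance pattern could be checked against it is given below; the function name and the dictionary representation of judgements are assumptions for illustration, not part of the described system.

```python
from typing import Dict

# Accepting a verb in the AEP implies accepting every verb to its right.
HIERARCHY = ["like", "want", "need"]


def consistent_with_h1(accepts: Dict[str, bool]) -> bool:
    """True if the acceptance pattern respects like -> want -> need."""
    for i, verb in enumerate(HIERARCHY):
        if accepts.get(verb, False):
            implied = HIERARCHY[i + 1:]
            if not all(accepts.get(v, False) for v in implied):
                return False
    return True


need_and_want = consistent_with_h1({"need": True, "want": True, "like": False})
want_without_need = consistent_with_h1({"need": False, "want": True, "like": False})
```

A participant who accepts AEP want but rejects AEP need would be flagged as a potential counterexample and could then be inspected manually.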
Experiment Setup. Based on the results from [31], a pool of 144 sentences were
divided into 24 families, paired into 12 groups comprising both AEP and StEP
sentences. The sentences in each group shared the same set of linguistic features:
main verb (need, want, like), subject (in)animacy, and subject (in)definiteness.
For instance, the group for need, animate subject, and definite subject included
the following sentences.1
The sentences were ranked according to their mean ratings in the baseline
results from [31], which had 50 participants over six versions, each consisting
of 24 sentences covering all combinations of the main verb, (in)definiteness,
(in)animacy, and [±to be] variables. They were presented to participants accord-
ing to Algorithm 1.
The family groups were ordered to present those with main verb need, fol-
lowed by those with main verb want, followed by those with main verb like. For
each rejected sentence participants were asked ‘What would you say instead?’.
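The presentation order and the follow-up prompt can be sketched as follows. This is a simplified illustration under the assumption that each family group is identified by a (main verb, animacy, definiteness) tuple and that a judgement is recorded as the string "accept" or "reject"; neither representation is specified in the text.

```python
from typing import List, Optional, Tuple

Group = Tuple[str, str, str]  # (main verb, animacy, definiteness)

VERB_ORDER = {"need": 0, "want": 1, "like": 2}
FOLLOW_UP = "What would you say instead?"


def order_groups(groups: List[Group]) -> List[Group]:
    """Present need groups first, then want, then like (stable within a verb)."""
    return sorted(groups, key=lambda g: VERB_ORDER[g[0]])


def follow_up_for(judgement: str) -> Optional[str]:
    """Return the follow-up prompt for a rejected sentence, otherwise None."""
    return FOLLOW_UP if judgement == "reject" else None


ordered = order_groups([
    ("like", "animate", "definite"),
    ("need", "inanimate", "definite"),
    ("want", "animate", "indefinite"),
    ("need", "animate", "definite"),
])
```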
Forty-six participants, who were recruited through word of mouth and social
media, completed the survey online. Each answered a minimum of 24 questions;
those who chose to continue could answer up to 30 questions. At the end of the
survey participants were provided with an individualised map comparing their
answers on one of the AEP sentences (without to be) with other users who had
made judgements on sentences with the same set of linguistic features. See Fig. 1.
1
While the sentences may vary in singular/plural subject, this is not a relevant exper-
imental variable, but provided only for variety.
Experiment Setup. The same set of 144 sentences was used, divided into
12 families, paired into six groups, based on main verb and (in)animacy:
(in)definiteness was not used as a variable, as it was deemed irrelevant to any
hypotheses of interest.
A further 18 sentences were added to the set of possible questions in order to
test a number of additional variables: use of adverbs with the AEP (e.g. The books
need sorted alphabetically); use of by-phrases (e.g. My car needs checked by a
mechanic); use of purpose clauses (e.g. The screws need tightened to hold the shelf
up); questions (e.g. Does the door need opened?); negation (e.g. Those carpets
don’t need cleaned); and relative clauses (e.g. Those are the shirts that need
ironed). These additional linguistic features were included to test a number
of other hypotheses examined in previous work, which are tangential to
the hypotheses we address in this paper.
In this iteration the system was coded to recognise comments in response
to ‘What would you say instead?’, in particular, the use of to be or an alter-
native main verb need, want or like. The sentences were presented according to
Algorithm 2, again using participants’ judgements from the baseline survey for
ranking of the 144 original sentences.
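How the comments are recognised is not detailed in the text. As a rough sketch, assuming simple pattern matching on the free-text answer, the two cues of interest (use of to be, and use of a different main verb) could be detected as below. The function `analyse_comment` and the regular expressions are hypothetical and stand in for whatever mechanism the system actually uses.

```python
import re
from typing import Dict, List, Union

# Inflected forms of the three main verbs (assumed patterns).
MAIN_VERB_FORMS = {
    "need": r"\bneed(?:s|ed)?\b",
    "want": r"\bwant(?:s|ed)?\b",
    "like": r"\blike(?:s|d)?\b",
}
TO_BE = re.compile(r"\bto be\b", re.IGNORECASE)


def analyse_comment(comment: str, original_verb: str) -> Dict[str, Union[bool, List[str]]]:
    """Detect 'to be' and any main verb other than the one in the question."""
    alternatives = [
        verb
        for verb, pattern in MAIN_VERB_FORMS.items()
        if verb != original_verb and re.search(pattern, comment, re.IGNORECASE)
    ]
    return {"uses_to_be": bool(TO_BE.search(comment)), "alternative_verbs": alternatives}


with_to_be = analyse_comment("The dog needs to be walked", original_verb="need")
other_verb = analyse_comment("The dog needs walked", original_verb="want")
```

A comment that adds to be suggests the participant prefers the StEP variant, while an alternative main verb suggests the rejection concerns the verb rather than the construction.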
Fifty-three participants were recruited through paid social media advertising
which targeted users in Scotland and Northern Ireland. Each participant gave
judgements on 24 sentences and, as in Case Study 1, was presented with an
individualised map upon completion of the questionnaire and encouraged to
share the survey on social media.
42 of these accepted AEP want. Of these 42, 17 accepted AEP like. A single
participant appeared to accept AEP want, but not AEP need. However, closer
inspection revealed that they did accept several of the sentences with additional
variables (e.g. use of by-phrases), all of which had AEP need, and so did not
contradict this hypothesis.
Of 35 participants who accepted both AEP and StEP want, 34 were asked
about these with inanimate subjects. Five accepted an inanimate subject with
AEP want, but not StEP want; two rejected an inanimate subject with AEP
want but accepted one with StEP want. Of 15 participants who accepted both
AEP and StEP like, 13 were asked about these with inanimate subjects. Two
accepted an inanimate subject with AEP like, but not StEP like; one rejected an
inanimate subject with AEP like but accepted one with StEP like. The rest of the
speakers either accepted or rejected all inanimate subjects in both constructions
for want and like. These results therefore weakly support Hypothesis 2, that
inanimate subjects are more acceptable for want and like in AEP constructions
than StEP constructions.
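The counts reported above can be tallied to make the size of each pattern explicit. In the sketch below the inputs are the figures given in the text; the remaining category is derived by subtraction and covers the speakers who accepted or rejected inanimate subjects in both constructions alike. The function name is illustrative.

```python
from typing import Dict


def summarise(asked: int, aep_only: int, step_only: int) -> Dict[str, int]:
    """Split the participants asked about inanimate subjects into three patterns."""
    same_in_both = asked - aep_only - step_only
    return {"aep_only": aep_only, "step_only": step_only, "same_in_both": same_in_both}


want = summarise(asked=34, aep_only=5, step_only=2)
like = summarise(asked=13, aep_only=2, step_only=1)
```

For both verbs the AEP-only pattern is more frequent than the StEP-only pattern (5 against 2 for want, 2 against 1 for like), while most speakers treat the two constructions alike, which is consistent with describing the support for Hypothesis 2 as weak.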
6 Related Work
6.1 Intelligent Surveys
One intelligent survey system that has already been implemented is the Dynamic
Intelligent Survey Engine (DISE) [29], which aims to make survey creation as
flexible and unrestricted as possible. Similarly to our old system, it uses a
wide variety of data methods and an advanced data collection approach with the
intent of measuring consumer preferences. However, in contrast to our system,
which uses a drag-and-drop interface for creating surveys, surveys in DISE are
created in an XML markup language, which may have a rather steep learning curve
and thus be cumbersome to learn. Furthermore, the system does not allow
conditional triggers for a better user experience, nor does it use its
knowledge to prioritise the most significant questions.
people complete surveys for money. Johnson suggests that using such a platform
can provide a large participant pool with the necessary tools to build an
experiment in a quick and efficient manner [14]. Turkolizer [11] and Turktools [9]
are two tools that run on this crowdsourcing platform. While this approach may
present benefits in large-scale experiments, the platform offers only a basic
statistical analysis of the data. To support any form of knowledge-powered
service, for instance syntactic and semantic evaluation of the results, a
knowledge structure would have to be implicitly hard-coded. As a result,
the experimental data is hard to transmit, link or reuse.
7 Conclusion
With the help of Knowledge Graph, we propose a dynamic approach to the
questionnaire component of the survey, yielding more informative results. The
questions are ordered by a model based on their importance. Once the questions
are ordered, a set of conditional triggers is set to provide a more dynamic
experience, which benefits the researcher by maximising the quality and quantity
of data collected, and the user by creating a more varied survey. Follow-up
questions are asked when the user accepts or rejects certain questions.
In the evaluation we have shown that the dynamic component can have
a positive impact on the quality of the data while limiting the number
of questions asked in the survey. The previous system ran 6 different
surveys, each of which had 24 questions; a total of 50 people participated in
those surveys. With our system we achieved the same results
as the previous study in one iteration of the survey, asking 28.2 questions
on average with the same types of questions and with only 25 participants in the
survey. Such improvement is based on the semantic understanding of survey
questions enabled by knowledge graphs.
References
1. Abernethy, J., Evgeniou, T., Vert, J.P.: An Optimization Framework for Adaptive
Questionnaire Design. INSEAD, Fontainebleau (2004)
2. Callegaro, M., Wells, T., Kruse, Y.: Effects of precoding response options for five
point satisfaction scales in web surveys. In: 2008 PAPOR Conference. Citeseer
(2008)
3. Capterra: Survey software buyers’ guide (2019). https://www.capterra.com/
survey-software/#buyers-guide. Accessed 5 Mar 2019
4. Chen, T.Y., Myers, J.: Worldlikeness: a web-based tool for typological psycholin-
guistic research. Univ. Pennsylvania Working Pap. Linguist. 23(1), 4 (2017)
5. Dolnicar, S., Grün, B., Yanamandram, V.: Dynamic, interactive survey questions
can increase survey data quality. J. Travel Tour. Mark. 30(7), 690–699 (2013)
6. Drummond, A.: Ibex 0.3. 7 manual (2013)
7. Edelstein, E.: This syntax needs studied. In: Micro-syntactic variation in North
American English, pp. 242–268 (2014)
8. Elmes, D.G., Kantowitz, B.H., Roediger III, H.L.: Research Methods in Psychology.
Cengage Learning (2011)
9. Erlewine, M.Y., Kotek, H.: A streamlined approach to online linguistic surveys.
Nat. Lang. Linguist. Theory 34(2), 481–495 (2016)
10. Evans, J.R., Mathur, A.: The value of online surveys. Internet Res. 15(2), 195–219
(2005)
11. Gibson, E., Piantadosi, S., Fedorenko, K.: Using mechanical turk to obtain and
analyze english acceptability judgments. Lang. Linguist. Compass 5(8), 509–524
(2011)
12. Guin, T.D.L., Baker, R., Mechling, J., Ruyle, E.: Myths and realities of respondent
engagement in online surveys. Int. J. Mark. Res. 54(5), 613–633 (2012)
13. Here, M., Now, P.: Bing, bang, bong. Blah
14. Johnson, D.R., Borden, L.A.: Participants at your fingertips: using amazons
mechanical turk to increase student-faculty collaborative research. Teach. Psychol.
39(4), 245–251 (2012)
15. Kaminska, O., McCutcheon, A.L., Billiet, J.: Satisficing among reluctant respon-
dents in a cross-national context. Public Opin. Q. 74(5), 956–984 (2010)
16. Katz, J.: The british-irish dialect quiz. New York Times, 15 February 2019
17. Keller, F., Gunasekharan, S., Mayo, N., Corley, M.: Timing accuracy of web exper-
iments: a case study using the webexp software package. Behav. Res. Methods
41(1), 1–12 (2009)
18. Kropf, M.E., Blair, J.: Eliciting survey cooperation: incentives, self-interest, and
norms of cooperation. Eval. Rev. 29(6), 559–575 (2005)
19. Murray, T.E., Simon, B.L.: At the intersection of regional and social dialects: the
case of like+ past participle in american english. Am. Speech 77(1), 32–69 (2002)
20. Mwamikazi, E., Fournier-Viger, P., Moghrabi, C., Barhoumi, A., Baudouin, R.:
An adaptive questionnaire for automatic identification of learning styles. In: Ali,
M., Pan, J.-S., Chen, S.-M., Horng, M.-F. (eds.) IEA/AIE 2014, Part I. LNCS
(LNAI), vol. 8481, pp. 399–409. Springer, Cham (2014).
https://doi.org/10.1007/978-3-319-07455-9_42
21. Mwamikazi, E., Fournier-Viger, P., Moghrabi, C., Baudouin, R.: A dynamic
questionnaire to further reduce questions in learning style assessment. In: Iliadis, L.,
Maglogiannis, I., Papadopoulos, H. (eds.) AIAI 2014. IAICT, vol. 436, pp. 224–235.
Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-44654-6_22
22. Myers, J.: Minijudge: software for small-scale experimental syntax. Int. J. Comput.
Linguist. Chin. Lang. Process. 12(2), 175–194 (2007)
23. Nokelainen, P., Niemivirta, M., Kurhila, J., Miettinen, M., Silander, T., Tirri, H.:
Implementation of an adaptive questionnaire. In: Proceedings of the ED-MEDIA
Conference, pp. 1412–1413 (2001)
24. Ortigosa, A., Paredes, P., Rodriguez, P.: Ah-questionnaire: an adaptive hierarchical
questionnaire for learning styles. Comput. Educ. 54(4), 999–1005 (2010)
25. Pan, J., et al.: Reasoning Web: Logical Foundation of Knowledge Graph Construc-
tion and Querying Answering. Springer, Switzerland (2017). https://doi.org/10.
1007/978-3-319-49493-7
26. Pan, J., Vetere, G., Gomez-Perez, J., Wu, H.: Exploiting Linked Data and Knowl-
edge Graphs for Large Organisations. Springer, Switzerland (2016). https://doi.
org/10.1007/978-3-319-45654-6
27. Puleston, J., Sleep, D.: The game experiments: researching how gaming techniques
can be used to improve the quality of feedback from online research. In: Proceedings
of ESOMAR Congress (2011)
28. Saleh, A., Bista, K.: Examining factors impacting online survey response rates
in educational research: perceptions of graduate students. J. MultiDiscip. Eval.
13(29), 63–74 (2017)
29. Schlereth, C., Skiera, B.: DISE: dynamic intelligent survey engine. In: Diamantopoulos,
A., Fritz, W., Hildebrandt, L. (eds.) Quantitative Marketing and Marketing
Management, pp. 225–243. Gabler Verlag, Wiesbaden (2012).
https://doi.org/10.1007/978-3-8349-3722-3_11
30. Schütze, C.T.: The Empirical Base of Linguistics: Grammaticality Judgments and
Linguistic Methodology. Language Science Press, Berlin (2016)
31. Soares, R., Edelstein, E., Pan, J.Z., Wyner, A.: Knowledge driven intelligent survey
systems for linguists. In: Ichise, R., Lecue, F., Kawamura, T., Zhao, D., Muggleton,
S., Kozaki, K. (eds.) JIST 2018. LNCS, vol. 11341, pp. 3–18. Springer, Cham
(2018). https://doi.org/10.1007/978-3-030-04284-4_1
32. SoftwareAdvice: Buyer’s guide, March 2019. https://www.softwareadvice.com/za/
survey/#buyers-guide. Accessed 23 Apr 2019
33. Stoet, G.: Psytoolkit: a novel web-based method for running online questionnaires
and reaction-time experiments. Teach. Psychol. 44(1), 24–31 (2017)
CICO: Chemically Induced Carcinogenesis
Ontology
Resource: http://bike.cico.snu.ac.kr/
1 Introduction
1.1 Background
Over the past few decades, much research has been done to better understand and treat
cancer. However, experiments on humans are impossible except for the purpose of
treatment. Therefore, research using model systems, also known as model organisms, is
actively conducted in biomedicine and biology, and as a result, vast amounts of
disease-drug and gene-disease data are being produced. Consequently, a need for
standardization and data structure design for effective data sharing and systematic
management has emerged. The Experimental Factor Ontology (EFO) [1], Disease Ontology
(DO) [2, 3], Gene Ontology (GO) [4], and Chemical Entities of Biological Interest
(ChEBI) [5] are examples
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 242–254, 2020.
https://doi.org/10.1007/978-3-030-41407-8_16
The remainder of the paper is structured as follows. In the next section, we
discuss prior studies related to our research. In Sect. 3, we describe the dataset of the
experiments and the chemical causes of carcinogenesis. Section 4 introduces our ontology
and examines it in a real scenario. Next, Sect. 5 demonstrates an application that is
reusable and allows users to easily access the data. Finally, Sect. 6 presents the
conclusion and future work.
2 Related Work
Early efforts that collect and provide data on animal experiments using chemicals
include the Carcinogenic Potency Database (CPDB) [9], the National Toxicology Program
(NTP), and Toxicology Literature Online (TOXLINE) [10]. Each of these three resources is
provided in a structure designed by its own maintainers for experiments on chemically
induced cancer. In addition, each database differs in the way it collects data and is
provided as an independent system. First, TOXLINE features literature-based data
collection and is up to date. Next, NTP is a program administered by the National Cancer
Institute. NTP plays a critical role in generating, interpreting, and sharing
toxicological information about potentially hazardous substances in our environment.
Finally, the CPDB collects experimental data from the University of California, Berkeley
and the Lawrence Berkeley National Laboratory. Unfortunately, these datasets exist
separately even though
they have the same purpose. In addition, the biggest problem is that each offers only a
simple search service, which reflects only a part of researchers’ requirements and is
difficult to access. For example, all three resources independently collected
experimental data on the likelihood of developing cancer from chemicals, and they are
not integrated. All three systems can only be searched by chemical. This paper presents
a knowledge graph called the Chemically Induced Carcinogenesis Ontology (CICO). First,
we integrate data from the three sources into a knowledge graph and perform ontology
modeling, which improves data reusability for effective sharing. In this paper we use
knowledge graph to mean the total set of entities, their attributes and their relations
with other entities [11]. DBpedia [12] is an example of a general-domain knowledge
graph, and KnowLife [13] is a knowledge graph for biomedicine. In addition, we use the
Disease Ontology (DO) to map the experimental data to the Human-Mouse Disease Connection
(HMDC) database, so as to provide experimental data together with a list of notable
genes in the study.
data, 2% of the hamsters, most of which consisted of the rodent model and 1% of the
primates model in Fig. 2c. These model systems have a strain name that corresponds
to a unique subtype of species. Our system contains a total of 221 strains; the
most commonly used strain is b6c for mouse models and f34 for rat models, as can be
seen in Fig. 2d. However, some experimental data do not contain strain information.
Nonetheless, the strain is important information for researchers, because the capacity
for the chemical and the tolerance of cancer may differ depending on the strain of the
mouse used.
Fig. 2. Descriptive statistics. a, the distribution of 20 types of chemical substance and the number
of experiments used in the experiment. b, various routes and distributions of chemical agents. c,
distribution by various model systems. d, strain classification and distribution of model system.
Third, the concept of the experiment includes the above two factors together with the
type of carcinogenesis and affected tissue. Our collected data includes tissue information
from 193 cancers. The formation of tumors occurred most in liver tissue, followed by lung
tissue, kidney, uterus, brain, and skin. We looked at the site and type of cancer in more
detail. As a result, hepatocellular carcinoma, neoplastic nodule, and hepatoblastoma
were found in the same liver tissue in Table 1. However, because the data is generated
by various researchers, the names of the tumor types varied.
information that animal experimenters are curious about is the duration of the experiment,
the dosage of the chemical, and the way the chemical is administered.
Model System. The Model System class contains information about the animal specimens
used in the experiment. As described in Sect. 3, several species of animals are
included, and the majority of animals are mice and rats. This class has three
DataProperties and one ObjectProperty. The DataProperties are :hasMutagencity,
:hasSystemGender and :hasSystemSpecies, and the ObjectProperty is :relatedStrain.
Mutagencity expresses whether a chemical mutation occurred in a specific tissue or cell
before the experiment was conducted. The gender of the model system can vary depending
on the purpose of each study, so the sex of the animal subjects used in the experiment
can be important information. Finally, the species of the model system may seem similar
to each other but include genetic differences and different immune systems, which are
considered essential for experimental research. Using information from the Model System,
researchers can find specific experiments by species, strain, and gender.
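As an informal illustration of this class, the sketch below mirrors its four properties in a plain Python record and filters records by species, gender and strain. The field names are chosen for readability and mapped to the ontology properties in comments; the sample records are hypothetical, and only the strain names b6c and f34 are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelSystem:
    species: str                   # value of :hasSystemSpecies
    gender: str                    # value of :hasSystemGender
    mutagenicity: Optional[bool]   # value of :hasMutagencity, if recorded
    strain: Optional[str] = None   # target of :relatedStrain; may be missing


def find_systems(
    systems: List[ModelSystem],
    species: Optional[str] = None,
    gender: Optional[str] = None,
    strain: Optional[str] = None,
) -> List[ModelSystem]:
    """Return the model systems matching every criterion that is not None."""
    def matches(s: ModelSystem) -> bool:
        return (
            (species is None or s.species == species)
            and (gender is None or s.gender == gender)
            and (strain is None or s.strain == strain)
        )
    return [s for s in systems if matches(s)]


# Hypothetical records for illustration.
systems = [
    ModelSystem("mouse", "male", False, "b6c"),
    ModelSystem("mouse", "female", False, "b6c"),
    ModelSystem("rat", "male", True, "f34"),
    ModelSystem("hamster", "female", None),
]
female_mice = find_systems(systems, species="mouse", gender="female")
```

The optional strain field reflects the observation that some experimental data do not contain strain information.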
Chemical. The Chemical class holds the name of the chemical and its CAS number,
expressed with the :hasChemicalsName and :hasCAS properties, respectively. Researchers
can use the chemical name or the CAS number to find carcinogenesis experiments.
Effect. Four tables are associated with the Effect class: Tumor, Tissue, Experiment
and HMDC. The Tissue table is a set of records containing the name and abbreviation of
each tissue in which an abnormal change occurred. The Tumor table records the shape or
the location of the cancer in the tissue. The Experiment table is described above, and
the HMDC table is explained below.
HMDC. The HMDC class represents data provided by Mouse Genome Informatics (MGI),
which supplies genes related to specific diseases in human and mouse. Some genes
have the same function but different names depending on the species. This is an
important concept that must be considered when the results of animal experimental
studies are applied to humans. For this reason, it is important for researchers to know
the genes that are notable for specific cancers when conducting experimental studies.
This class has three properties related to the disease (:hasDOID, :hasDiseaseName and
:hasOMIMIds) as well as properties for genes. Genetic information can be represented by
EntrezGene ID, GeneSymbol, and HomologGene; we provide it using the widely used
GeneSymbol.
4.2 Scenario
The first scenario is to obtain the relevant experiments for a given chemical. The
experiments can be found with either the chemical name or the CAS number. For example,
a researcher who uses the chemical ‘1,3-BUTADIENE’ in a laboratory and wants to know
the types of cancer and the tissues that this substance can affect, along with
references, can use the following SPARQL query.
The Chemical class of the ontology we have developed has the :hasChemicalsName and
:hasCAS properties, so a researcher who knows either the name of the chemical or its
CAS number can select the chemical used in an experiment. The Chemical class is related
to the Experiment class through the :potentiallyInducedCarcinogenesisChemical property,
and a reference for each experiment can be obtained with :hasReferenceOfExperiment. The
Experiment class in turn links to the Effect class through the :hasEffectOfExperiment
property. As the scenario shows, the :affectedTumorType property of the Effect class
gives the tumor types, and the :affectedTissue property gives the associated tissues.
Table 2 shows the SPARQL results, which provide researchers with knowledge about the
experiments that used the desired chemical. The results of the chemical-based search
show five of the experiments for Scenario 1. All five experiments use the same chemical,
‘1,3-BUTADIENE’, whose CAS number is 106-99-0. The first experiment reports an
acinar-cell carcinoma in mammary gland tissue and is recorded in the NCI-NTP TR288
report. Experiment 217024, on the other hand, reports a follicular-cell adenoma in
thyroid gland tissue and comes from the literature reference ‘P E: Owen; amih, 48,
407-413; 1987’.
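The path described above (chemical, experiment, effect) can be sketched as a small helper that assembles a SPARQL query string. This is only an illustration: the property names are those given in the text, while the namespace IRI, the variable names, the function name, and the direction of each triple pattern are assumptions and may differ from the published query.

```python
# Hypothetical namespace: the ontology IRI is not given in the text.
CICO_PREFIX = "http://example.org/cico#"


def build_chemical_query(chemical_name: str) -> str:
    """Return a SPARQL query string for a chemical-name search (Scenario 1 sketch)."""
    # Escape backslashes and double quotes so the name is a valid SPARQL literal.
    literal = chemical_name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX : <{CICO_PREFIX}>
SELECT ?experiment ?cas ?tumorType ?tissue ?reference
WHERE {{
  ?chemical   :hasChemicalsName "{literal}" ;
              :hasCAS ?cas .
  # Assumed direction: each experiment points to its chemical and its effect.
  ?experiment :potentiallyInducedCarcinogenesisChemical ?chemical ;
              :hasReferenceOfExperiment ?reference ;
              :hasEffectOfExperiment ?effect .
  ?effect     :affectedTumorType ?tumorType ;
              :affectedTissue ?tissue .
}}"""


query = build_chemical_query("1,3-BUTADIENE")
print(query)
```

A search by CAS number would follow the same pattern, binding the :hasCAS value instead of the name.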
In the second scenario, the study focuses on a specific cancer. First, the researcher
looks up the disease name or its Disease Ontology (DO) ID. We then provide information
on the experiments that induced this cancer in the mouse system, together with the genes
associated with the cancer. The researcher may also want to know the dosage of the
chemicals used in the experiments. For example, a researcher interested in the disease
‘hepatocellular carcinoma’ can search the list of experiments by the disease name or by
the corresponding DOID (DOID: 684). In this study, the genes related to mouse diseases
are provided by the Human-Mouse Disease Connection (HMDC) class. This gives the
researcher insight into the relationship between the disease and its notable genes when
conducting experimental studies. Now, let us look at the SPARQL query for the second
scenario.
Unlike the previous scenario, the second scenario includes disease-gene information in
addition to the experimental data. The researcher finds the HMDC data using the name of
the target disease and uses the :hasDOID property to obtain the DOID corresponding to
the disease name, which links it to the experimental data. We set the value of the
hasSpeciesOfrelatedGene property to mouse to obtain the disease-gene data for the mouse
model. At this point, the genes and the DOID associated with the disease name are known.
Next, the DOID is matched against the effect_sameAsDOID property of the Effect class to
find the experiment data related to the disease. As a result, the experimental data and
the disease-gene data are integrated.
The next step is to get the experiment information with the :hasEffectOfExperiments
property. We obtain the chemical dosage, the model system, and the chemical through the
:dosageOfPossibleInducinogenesChemical, :hasExperimentalModelSystem, and
:potentiallyInducifiedCarcinogenesisChemical properties. The query results are shown in
Table 3. The results of the search by disease name include the chemicals used and their
dosage, the laboratory animals, the Disease Ontology ID, the disease-related genes, and
the experimental references. All four results concern the mouse system, and the
experiments were conducted with different chemicals and dosages. The genes listed in
the result table are mouse genes.
Our study involves two heterogeneous data sets. The first is experimental data on
chemicals, and the other is a collection of disease-gene associations. The two data sets
were generated independently, so diseases are annotated differently in each. The Disease
Ontology (DO) is one way to solve this problem. The Human-Mouse Disease Connection
(HMDC) has a unified name for each disease, whereas the integrated experiment knowledge
graph has no standardized disease names. We therefore need to generate a unified name
for each disease in the experiment knowledge graph. First, we unified the names in the
“affected tissue” and “type of tumor” properties, which can be obtained with the
:affectedTissue and :affectedTumorType properties of the Effect class. We manually
identified 67 “affected tissues” and 388 “types of tumors”. Table 4 shows some of the
DOID assignments we created. The column “# of records” gives the number of records for
each pair of “affected tissue” and “type of tumor”. We manually matched each such pair
with a DOID. We use pairs because it is difficult to determine the type of tumor without
the affected tissue; adenoma as the “type of tumor” in Table 4 is an example. The
tissue-tumor pair with the most records is “all tumor-bearing animals” and “more than
one tumor type”. Unfortunately, no DOID can be assigned to “all tumor-bearing animals”,
because no specific DOID matches it. The next largest set is hepatocellular carcinoma
in liver tissue, with a total of 3,428 rows. Referring to the DO, we assign DOID: 684 to
this pair. Tissue-tumor pairs with the same affected tissue but different types of
tumor may be assigned the same DOID, since this is how the Disease Ontology defines
them; the “type of tumor” values hepatocellular carcinoma and hepatoma in Table 4 are
an example. The tissue with the next largest number of experimental records is lung. In
experiments on lung tissue, the chemical is administered by inhalation into the
respiratory tract. We store the assigned DOID in the Effect class as :sameAsDOID. This
section has shown how these heterogeneous classes can be used together.
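The pairing of “affected tissue” and “type of tumor” with a DOID can be pictured as a lookup table keyed on the pair. The sketch below is illustrative only: it contains just the liver entries mentioned in the text (the hepatoma entry follows the statement that both liver tumor types share a DOID), and the function names and the normalization step are assumptions rather than part of the described pipeline.

```python
from typing import Optional

# (affected tissue, type of tumor) -> DOID, limited to the pairs named in the text.
DOID_BY_PAIR = {
    ("liver", "hepatocellular carcinoma"): "DOID:684",
    ("liver", "hepatoma"): "DOID:684",  # same DOID for a different tumor label
}


def normalize(label: str) -> str:
    """Lower-case a label and collapse whitespace (an assumed cleaning step)."""
    return " ".join(label.lower().split())


def lookup_doid(tissue: str, tumor: str) -> Optional[str]:
    """Return the DOID for a tissue-tumor pair, or None when no DOID matches."""
    return DOID_BY_PAIR.get((normalize(tissue), normalize(tumor)))


print(lookup_doid("Liver", "Hepatocellular  carcinoma"))
print(lookup_doid("all tumor-bearing animals", "more than one tumor type"))
```

Keying on the pair rather than on the tumor type alone reflects the observation above that a tumor type such as adenoma is ambiguous without its tissue.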
We now discuss how to use the resource through search and exploration. To demonstrate
the integrated animal experiment knowledge graph, we have created a visualization tool
that displays experimental data retrieved under various conditions. The tool is based on
the D2RQ framework and is available at http://bike.cico.snu.ac.kr/. The application
provides six search categories: disease name, DOID, chemical name, CAS number, tissue,
and type of tumor. Examples of search terms for each category are as follows. A disease
name search uses the name of a cancer present in the Disease Ontology, for example,
Breast Cancer, Lung Cancer, and hepatocellular carcinoma. A DOID search uses the disease
ID value of the Disease Ontology, for example, DOID: 3910, DOID: 4450, and DOID: 1324.
A chemical name search looks for experiment data using the name of the chemical, for
example, 1,1,1-TRICHLOROETHANE, TECHNICAL GRADE; FUROSEMIDE; METHYLENE CHLORIDE; and
1,2-DICHLOROBENZENE. The CAS number is another way to search by chemical, for example,
75-09-2, 106-99-0, and 95-50-1. A tissue search uses the name of the tissue in which
cancer occurred, for example, liver, lung, and kidney. Finally, a type-of-tumor search
uses terms such as sarcoma, adenocarcinoma, and adenoma. The search results include the
name of the chemical that can cause carcinogenesis, the dose used, the mouse genes
associated with the cancer, and other additional information. Further details of the
retrieved experiments can be viewed through the URI. The Resources tab shows the
instances of all classes. Internally, the application creates a SPARQL query when the
user submits a search request and presents the results to the user through the web UI.
In addition, users can retrieve information directly through the SPARQL endpoint.
On the left of the UI, the menu bar allows the user to navigate the search points and
resources of the Chemically Induced Carcinogenesis Ontology (CICO). Figure 4 shows an
example of a search by a specific disease name, in which researchers conducting in vivo
experiments are interested. The amount of chemical used in each experiment, the notable
genes, and the reference information are displayed by default.
Acknowledgements. This research was supported by the MSIT (Ministry of Science and ICT),
Korea, under the ITRC (Information Technology Research Center) support program, (IITP-2017-
0-00398) supervised by the IITP (Institute for Information & communications Technology Pro-
motion) and the Institute for Information & communications Technology Promotion (IITP) grant
funded by the Korea government (MSIP) (No.2013-0-00109, WiseKB: Big data based self-
evolving knowledge base and reasoning platform). Authors want to thank Junhyuk Shin for the
discussions they had.
References
1. Malone, J., et al.: Modeling sample variables with an experimental factor ontology.
Bioinformatics 26(8), 1112–1118 (2010)
2. Bauer, S., Seelow, D., Horn, D., Robinson, P.N., Ko, S., Mundlos, S.: The human phenotype
ontology : a tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet.
83, 610–615 (2008)
3. Köhler, S., et al.: The human phenotype ontology in 2017. Nucleic Acids Res. 45(D1), D865–
D876 (2017)
4. G. O. Consortium: The gene ontology (GO) database and informatics resource. Nucleic Acids
Res. 32, 258–261 (2004)
5. Hastings, J., et al.: The ChEBI reference database and ontology for biologically relevant
chemistry : enhancements for 2013. Nucleic Acids Res. 41, 456–463 (2013)
6. Chen, S., et al.: Genome-wide CRISPR screen in a mouse model of tumor growth and
metastasis. Cell 160(6), 1246–1260 (2015)
7. Morton, C.L., Houghton, P.J.: Establishment of human tumor xenografts in immunodeficient
mice. Nat. Protoc. 2(2), 247–250 (2007)
8. Blake, J.A., et al.: Mouse genome database (MGD) - 2017: community knowledge resource
for the laboratory mouse. Nucleic Acids Res. 45, 723–729 (2017)
9. Gold, L.S., Manley, N.B., Slone, T.H., Rohrbach, L., Garfinkel, G.B.: Supplement to the
carcinogenic potency database (CPDB): results of animal bioassays published in the general
literature through 1997 and by the national toxicology program in 1997–1998. Toxicol. Sci.
85(2), 747–808 (2005)
10. Schultheisz, R.J.: TOXLINE: evolution of an online interactive bibliographic database. J. Am.
Soc. Inf. Sci. 32(6), 421–9 (1981)
11. Pan, J.Z., Gomez-Perez, J.M., Vetere, G., Wu, H., Zhao, Y., Monti, M.: Enterprise knowl-
edge graph: looking into the future. In: Exploiting Linked Data and Knowledge Graphs in
Large Organisations, pp. 237–249. Springer, Cham (2017). https://doi.org/10.1007/978-3-
319-45654-6_9
12. Lehmann, J., et al.: DBpedia - a large-scale, multilingual knowledge base extracted from
Wikipedia. Semant. Web 6(2), 167–195 (2015)
13. Ernst, P., Siu, A., Weikum, G.: KnowLife: a versatile approach for constructing a large
knowledge graph for biomedical sciences. BMC Bioinf. 16(1), 1–13 (2015)
14. Rinaudo, J.A.S., Farber, E.: The pattern of metabolism of 2-acetylaminofluorene in
carcinogen-induced hepatocyte nodules in comparison to normal liver. Carcinogenesis 7(4),
523–528 (1986)
Retrofitting Soft Rules for Knowledge
Representation Learning
1 Introduction
Knowledge graph (KG) resources such as Freebase [1] and YAGO [2] are widely
used in many natural language processing (NLP) applications. Typically, a
knowledge graph consists of a set of triples {(h, r, t)}, where h, r, and t stand
for head entity, relation, and tail entity, respectively. Although knowledge graphs
are very large, their coverage, or completeness, remains a critical issue. For
example, 75% of the persons in Freebase do not have their nationalities specified [3].
Recently, there has been increased interest in learning distributed representations
of knowledge graphs. By projecting all elements of a knowledge graph into
a dense vector space, the semantic distance between any two elements can be easily
calculated, which enables many applications such as link prediction and triple
classification [4].
Translation-based models, including TransE [5], TransH [6], TransD [7], and
TransR [8], have obtained promising results in learning distributed representa-
tions of knowledge graph. Furthermore, ComplEx [9] achieves the state-of-the-
art performances on KG completion tasks, such as triple classification and link
prediction.
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 255–270, 2020.
https://doi.org/10.1007/978-3-030-41407-8_17
Despite their success in KG completion, the above methods learn knowledge
representations only from the triples of a given KG, and therefore inevitably suffer
from its incompleteness. Logic rules in the form of first-order logic contain
rich information and are useful for an incomplete KG. For example, if the entity
‘Kampala’ appears only once in a KG, as <Kampala, capitalOf, Uganda>, we
can infer the triple <Kampala, locatedIn, Uganda> from the logic rule
<capitalOf ⇒ locatedIn>. In addition, logic rules act as structural constraints
on the learned representations of relations. For these reasons, a number of works
encode hard logic rules (defined by experts) into knowledge representation
learning [10,11]. However, hard logic rules are difficult to collect and are
domain specific. RUGE [12] was therefore the first to employ soft logic rules,
which are extracted automatically by modern rule mining systems [13], to
enhance the knowledge representation. The confidences of soft logic rules
are traditionally calculated from the number of rule instances that do or do not
belong to the knowledge graph. However, the relatedness between relations
and entities is ignored, although it is critical to determining the confidences
of the extracted rules. For example, the greater the similarity between bornIn
and nationalityOf, the more likely it is that the soft rule <bornIn ⇒ nationalityOf>
is valid. Such inference is not considered in existing rule-enhanced
methods for knowledge representation, which is the main difference between them
and our method.
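The Kampala example can be made concrete with a few lines that apply a rule of the form <r1 ⇒ r2> to a set of triples. This is a minimal sketch of hard rule application only; the function name is hypothetical, and a soft rule would additionally carry a confidence rather than being applied unconditionally.

```python
def apply_rule(triples, premise, conclusion):
    """Return the triples implied by <premise => conclusion> that are not yet known."""
    known = set(triples)
    implied = {(h, conclusion, t) for (h, r, t) in known if r == premise}
    return implied - known


kg = {("Kampala", "capitalOf", "Uganda")}
inferred = apply_rule(kg, "capitalOf", "locatedIn")
print(inferred)
```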
[Figure: panels labelled Triples, Rule Mining, Rules, Knowledge Embeddings θ, and Rule
Subspace θ, connected by arrows labelled supervise, projecting, groundings, and
confidence update. Example rule shown: Rule 1: <x, /language/human_language/main_country, y> ⇒
<x, /language/human_language/countries_spoken_in, y>; Conf = 0.68.]
Fig. 1. Simple illustration of the method of retrofitting soft rules for learning
knowledge representation.
2 Related Work
Many structure-based knowledge representation learning methods have been
introduced, such as Neural Tensor Network [4] and Single Layer Model [4].
Recently, various translation-based methods are introduced, including TransE
and its extensions like TransH, TransD and TransR [5–8]. Trouillon [9] employed
complex-valued embeddings to fit the structural information.
A number of methods utilize textual information to enhance the knowledge
representation, including entity names, Wikipedia anchors, entity/triple
descriptions, and text mentions of relations. Entity descriptions were used
to enhance the knowledge representation in [4]. [6] proposed a model which
combines the entity embeddings with word embeddings by means of entity names and
Wikipedia anchors. Zhong [14] improved the model of [6] by aligning entity and
text using entity descriptions. Zhang et al. (2015) proposed to model entities with
word embeddings of entity names or entity descriptions. Xie [15] introduced a
model to learn the embeddings of a knowledge graph by modelling both knowl-
edge triples and entity descriptions. [16] generated different representations for
entities based on the attention from the relation. [17] utilized relation mentions
and entity mentions to enhance the knowledge representations.
The universal schema based models [18,19] enhance the knowledge represen-
tation by incorporating the textual triples, which assume that all the extracted
triples express a relationship between the entity pair, and they treat each pattern
as a separate relation. However, this line of research assumes that all the relation
mentions express relationship between entity pairs, which inevitably introduces
a lot of noisy information. For example, the sentence ‘Miami Dolphins in 1966
and the Cincinnati Bengals in 1968’ does not express any relationship between
‘miami dolphins’ and ‘cincinnati bengals’. Even worse, the diversity of language
leads to the data sparsity problem. Xiao [20] proposed a generative model to han-
dle the ambiguity of relations. Wang [21] extended the translation-based models
by textual information, which assigns a relation with different representations
for different entity pairs.
The path information has been proved to be beneficial for learning knowl-
edge representation. PTransE [22] introduced path information between entities
for enhancing knowledge representation. [23] encoded the relation path and text
for learning knowledge representation. [24] first improved the knowledge embed-
dings based on reinforcement learning. [25] applied 2D convolution directly on
embeddings to model knowledge graph.
Recently, injecting human knowledge into the neural network has become
a new research hotspot, and the hard logic rules were exploited to enhance
the knowledge representation. The hard logic rules and type constraints were
introduced to enhance the knowledge embeddings [10,11,26,27]. [28] proposed to
generate adversarial triples which conform to the logic rules based on adversarial
neural network.
However, all of these models utilize hard logic rules, which are difficult to
extract and usually are KG specific. [12] first enhanced the knowledge repre-
sentation with soft rules extracted from the automatic rule mining system. But
despite its apparent success, there remains a major drawback: the performance
of this method is limited by the confidences of soft rules extracted by rule min-
ing system, which usually ignores the semantic relatedness between the relations
and entities. However, the semantic relatedness is critical to determine the log-
ical relationship among relations. The main difference of our paper is that our
model not only enhances the knowledge representation, but also optimizes the
soft rules jointly.
[¬A] = 1 − [A]
[A ∧ B] = [A][B]
[A ∨ B] = [A] + [B] − [A][B]
[A ⇒ B] = [A][B] − [A] + 1        (1)
where A, B are logical expressions, which can either be a single triple or complex
triples connected by logical conjunctions, such as ¬, ∧, ∨, ⇒; and [A] is the truth
score value of the expression A.
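The four composition rules of Eq. (1) translate directly into code. The sketch below is illustrative: the function names and the example truth scores are hypothetical, and the last lines check that the implication rule agrees with ¬A ∨ B computed from the negation and disjunction rules.

```python
def t_not(a: float) -> float:
    """[¬A] = 1 − [A]"""
    return 1.0 - a


def t_and(a: float, b: float) -> float:
    """[A ∧ B] = [A][B]"""
    return a * b


def t_or(a: float, b: float) -> float:
    """[A ∨ B] = [A] + [B] − [A][B]"""
    return a + b - a * b


def t_implies(a: float, b: float) -> float:
    """[A ⇒ B] = [A][B] − [A] + 1"""
    return a * b - a + 1.0


# Hypothetical truth scores of a premise triple and a conclusion triple.
premise, conclusion = 0.9, 0.6
rule_truth = t_implies(premise, conclusion)
print(round(rule_truth, 2))

# The implication rule coincides with ¬A ∨ B under the same definitions.
assert abs(rule_truth - t_or(t_not(premise), conclusion)) < 1e-12
```

Note that a false premise ([A] = 0) gives a truth score of 1 for the rule regardless of [B], while a fully true premise makes the score equal to [B].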
In this way, qθ (h, r, t) is consistent with the existing triples while satisfying
the soft rules.
Secondly, to encode the soft rules information into the knowledge representa-
tion θ, we update θ based on both the existing triples and qθ (h, r, t). The latter
In this way, knowledge representation θ is optimized by the soft rules while the
soft rules are retrofitted based on the representation iteratively.
4 Experiments
In this section, we describe the settings in our experiments and conduct extensive
experiments on link prediction and triple classification tasks.
base model, such as TransE, TransH, and ComplEx. In this paper, we implement
our framework on top of ComplEx [9], which has achieved state-of-the-art
performance on knowledge graph completion tasks [9], in Python 3.6¹ with
PyTorch². To train the model, we generate negative triples using the local closed
world assumption [3]. Specifically, for a given triple (h, r, t), we generate the
negative instances (h, r, t′), (h, r′, t), and (h′, r, t) by replacing the entity
or relation with a random entity or relation from E and R.
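The corruption step can be sketched as follows. This is an assumption-laden illustration, not the authors' code: the function name is hypothetical, the replacement is drawn uniformly, and the original element is excluded from the draw so that each negative differs from the positive triple; any further filtering implied by the local closed world assumption is not modelled.

```python
import random


def corrupt(triple, entities, relations, rng):
    """Return the three negatives (h, r, t'), (h, r', t), (h', r, t) for one triple."""
    h, r, t = triple
    # Draw each replacement from E or R, excluding the element being replaced.
    t_neg = rng.choice([e for e in entities if e != t])
    r_neg = rng.choice([x for x in relations if x != r])
    h_neg = rng.choice([e for e in entities if e != h])
    return [(h, r, t_neg), (h, r_neg, t), (h_neg, r, t)]


E = ["Kampala", "Uganda", "Nairobi", "Kenya"]
R = ["capitalOf", "locatedIn"]
negatives = corrupt(("Kampala", "capitalOf", "Uganda"), E, R, random.Random(0))
print(negatives)
```

The sketch requires at least two entities and two relations, since otherwise no replacement is available.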
Hyper-Parameter. To reduce training time and to avoid overfitting, we pre-train
the knowledge representation with ComplEx, using the same parameters in
each experiment. For both tasks, the hyper-parameters are selected by grid search
over the following values: the embedding dimension d in {50, 100, 150, 200, 300},
the learning rate η for SGD in {0.1, 0.001, 0.0001}, the margin λ in
{0.5, 1.0, 2.0}, the number of negatives in {2, 5, 10}, the batch size in
{100, 500, 2000}, and the regularization parameter C in {0.1, 0.001, 0.0001}.
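To give a sense of the size of this search space, the grid can be enumerated as a Cartesian product. The value sets are those listed above; the dictionary keys and the helper name are hypothetical, and the training and validation step for each configuration is omitted.

```python
from itertools import product

GRID = {
    "d": [50, 100, 150, 200, 300],      # embedding dimension
    "eta": [0.1, 0.001, 0.0001],        # learning rate for SGD
    "margin": [0.5, 1.0, 2.0],
    "negatives": [2, 5, 10],
    "batch_size": [100, 500, 2000],
    "C": [0.1, 0.001, 0.0001],          # regularization parameter
}


def configurations(grid):
    """Yield one dict per point of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(configurations(GRID))
print(len(configs))  # 5 * 3**5 = 1215 configurations
```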
Note that RUGE-Uniform refers to the RUGE [12] model with the confidences of all
soft rules set to 0.5, and RUGE-AMIE refers to the RUGE model with the confidences
of the soft rules generated by AMIE++. Likewise, Our-Uniform refers to our model
with the confidences of all soft rules set to 0.5, and Our-AMIE refers to our
model with the confidences of the soft rules automatically generated by AMIE++.
The task of link prediction aims to predict the missing head or tail entity of a
triple and is widely employed for evaluating knowledge graph completion
models [21,33]. Given a head entity h (or tail entity t) and a relation r, the
system is asked to return a ranked list of candidate entities. Following [12], we
conduct the link prediction task on the FB15K and YAGO37 datasets.
In the testing phase, for each triple (h, r, t), we replace the head/tail entity
with every entity to construct candidate triples and calculate the scores of the
candidate triples with the score function. We rank all these entities in descending
order of their scores. Based on the entity ranking list, the evaluation metrics
are: (1) the mean reciprocal rank of the correct entities (MRR); (2) the median of
the ranks (MED); and (3) the proportion of correct entities ranked in the top N
(Hit@N). A useful link predictor should achieve a low MED and a high MRR
or Hit@N. We tune the parameters on the validation sets. The best configurations
obtained on the validation sets are d = 200, C = 0.01, π = 0.5, η = 0.01,
and margin λ = 0.5 on FB15K, and d = 200, C = 0.015, π = 0.2, η = 0.01,
and margin λ = 0.5 on YAGO37. Our model is compared to the state-of-the-art
base models, including TransE, DistMult, HolE, ComplEx, PTransE, KALE, and
RUGE, whose results were reported in their papers [5,9,11,12,22,34,35].
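The three metrics can be computed from the list of ranks of the correct entities. The following is a minimal sketch with hypothetical function names and made-up ranks; ties are broken optimistically here, which is an assumption, and any filtering of candidates that are themselves known triples is not shown.

```python
from statistics import median


def rank_of(correct, scores):
    """1-based rank of `correct` when candidates are sorted by descending score."""
    target = scores[correct]
    return 1 + sum(1 for s in scores.values() if s > target)


def link_prediction_metrics(ranks, ns=(1, 3, 5, 10)):
    """Return (MRR, MED, {N: Hit@N}) for a list of ranks of the correct entities."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    med = median(ranks)
    hits = {n: sum(r <= n for r in ranks) / len(ranks) for n in ns}
    return mrr, med, hits


# Hypothetical ranks of the correct entity for four test queries.
ranks = [1, 2, 4, 12]
mrr, med, hits = link_prediction_metrics(ranks)
print(round(mrr, 3), med, hits)
```

Because MRR averages reciprocals, it is dominated by queries ranked near the top, whereas MED is insensitive to a few very poorly ranked queries.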
1 https://www.python.org/.
2 https://pytorch.org/.
               FB15K                                        YAGO37
Method         MRR    MED   Hit@1  Hit@3  Hit@5  Hit@10     MRR    MED   Hit@1  Hit@3  Hit@5  Hit@10
TransE         0.400  4.0   0.246  0.495  0.576  0.662      0.303  13.0  0.218  0.336  0.387  0.475
DistMult       0.644  1.0   0.532  0.730  0.769  0.812      0.365  6.0   0.262  0.411  0.493  0.575
HolE           0.600  2.0   0.485  0.673  0.722  0.779      0.380  7.0   0.288  0.420  0.479  0.551
ComplEx        0.690  1.0   0.598  0.756  0.793  0.837      0.417  4.0   0.320  0.471  0.533  0.603
PTransE        0.679  1.0   0.565  0.768  0.810  0.855      0.403  9.0   0.339  0.444  0.473  0.506
KALE           0.523  2.0   0.383  0.616  0.683  0.762      0.321  9.0   0.215  0.372  0.438  0.522
RUGE-Uniform   0.713  1.0   0.641  0.768  0.792  0.821      0.423  4.0   0.337  0.471  0.536  0.596
RUGE-AMIE      0.768  1.0   0.70   0.815  0.836  0.865      0.431  4.0   0.340  0.482  0.541  0.603
Our-Uniform    0.773  1.0   0.717  0.823  0.840  0.867      0.435  4.0   0.353  0.481  0.542  0.605
Our-AMIE       0.774  1.0   0.709  0.821  0.842  0.871      0.433  4.0   0.345  0.483  0.545  0.606
C = 0.01, π = 0.2, η = 0.01 and margin γ = 0.5 on FB13; and d = 200, C = 0.01,
π = 0.5, η = 0.01 and margin γ = 0.5 on FB15K. Our model is compared to
the state-of-the-art base models, including TransE, TransH, TransR, ComplEx
and RUGE, whose results were reported in [5,6,8,9] and [12], respectively. The
results of various models on triple classification are listed in Table 3.
From Table 3, it can be seen that:
(1) Our model improves the accuracy of triple classification over the base
models.
(2) Our method achieves better results than RUGE on both datasets. This finding
suggests that retrofitting the confidences of the soft rules is important
for enhancing the knowledge representation.
(3) As in link prediction, our model achieves comparable performance under the
different confidence settings. In contrast, the performance of RUGE drops
significantly with uniformly initialized confidences; it even falls below
that of the ComplEx model, which means that improper soft rules can weaken
the performance.
(4) The improvement brought by our approach is slightly lower on FB13 than
on FB15K (+4.9 vs. +6.3 over ComplEx). A likely cause is that FB13 contains
fewer relations than FB15K, so AMIE++ extracts fewer useful soft rules.
5 Detailed Analysis
To better understand the way the proposed method works, this section provides
a detailed analysis of the quality of the confidences of the soft rules, and their
influences on knowledge representation and link prediction tasks.
In this section, we further analyze how the quality of the soft rules evolves
from the initial form, given by AMIE, to the final settings, updated by our
proposed method. Several illustrative examples with different confidence levels
from FB15K are listed in Table 4. The scores in the last column are learned by
our proposed method.
Fig. 2. The MRR of link prediction for each soft rule with different confidence settings
on FB15K. (Color figure online)
Table 5. The triples whose tail entities failed to be ranked in the top 5 candidates.
From Table 5, it can be seen that the first candidate entity predicted by our
model for the first triple is Film Actor. Interestingly, Film Actor, although not
found in Freebase, is in fact a valid candidate according to the Wikipedia page³.
This is also supported by additional evidence: the triple
(Film Actor, /people/person/profession, Justin Timberlake) can be found in the
training set, and according to the 4th rule in Table 4, Film Actor should receive
a high score. Note that other top-ranked entities such as Musician are also valid
candidates. Therefore, it is likely that our model has mined, with high scores,
some new facts beyond the ground truth.
Indeed, the failures are mostly caused by the data sparsity problem, which
results in a relatively small coverage of rules. For example, we find that there is
no soft rule with /people/person/nationality or /film/film/language as the head
relation, which may lead to the false predictions for the second and third triples in
Table 5. Furthermore, the entities in these triples appear only a limited number of
times in the training data (Apache Licence appears only once). All the above findings
suggest that the data sparsity of relations may degrade the effectiveness of our
method (and of AMIE). This problem can be partially solved by adding a small number of
manually defined hard logic rules, which can be easily incorporated into our model.
From Table 6, it can be seen that our model is slower than the ComplEx and
RUGE models. In most cases, however, it is worthwhile to spend an acceptable
amount of additional training time to learn better knowledge embeddings and soft
rules. In the test phase, all these models calculate the truth value of a
candidate triple with formula (2), so the time consumption of the three
models at test time should be basically the same.
6 Conclusions
In this paper, we have proposed a retrofit framework to enhance the knowledge
representation and soft rules with each other in an iterative fashion. The soft rules
are used as regularizations for learning knowledge representation, and the repre-
sentation provides semantic relatedness for retrofitting soft rules. Our final results
3 https://en.wikipedia.org/wiki/Justin_Timberlake.
have achieved new state-of-the-art performance on both link prediction and triple
classification tasks. The additional analysis suggests that our model can effectively
learn the appropriate confidences of the soft rules. Failure analysis shows that our
method may suffer from the data sparsity issue, even though useful rules can still
be extracted. In future work, we plan to extract soft rules in a uniform framework
without depending on current rule mining systems.
References
1. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a
collaboratively created graph database for structuring human knowledge. In: ACM SIGMOD
International Conference on Management of Data, SIGMOD 2008, Vancouver, BC,
Canada, June, pp. 1247–1250 (2008)
2. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge.
In: International Conference on World Wide Web, WWW 2007, Banff, Alberta,
Canada, May, pp. 697–706 (2007)
3. Dong, X., et al.: Knowledge vault: a web-scale approach to probabilistic knowledge
fusion. In: ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pp. 601–610 (2014)
4. Socher, R., Chen, D., Manning, C.D., Ng, A.Y.: Reasoning with neural tensor
networks for knowledge base completion. In: International Conference on Intelligent
Control and Information Processing, pp. 464–469 (2013)
5. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating
embeddings for modeling multi-relational data. In: NIPS, pp. 2787–2795 (2013)
6. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by translating
on hyperplanes. In: AAAI, pp. 1112–1119 (2014)
7. Ji, G., He, S., Xu, L., Liu, K., Zhao, J.: Knowledge graph embedding via dynamic
mapping matrix. In: Meeting of the Association for Computational Linguistics and
the International Joint Conference on Natural Language Processing, pp. 687–696
(2015)
8. Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings
for knowledge graph completion. In: AAAI, pp. 2181–2187 (2015)
9. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embed-
dings for simple link prediction. In: International Conference on Machine Learning,
pp. 2071–2080 (2016)
10. Rocktäschel, T., Singh, S., Riedel, S.: Injecting logical background knowledge into
embeddings for relation extraction. In: HLT-NAACL, pp. 1119–1129 (2015)
11. Guo, S., Wang, Q., Wang, L., Wang, B., Guo, L.: Jointly embedding knowledge
graphs and logical rules. In: EMNLP, pp. 192–202 (2016)
12. Guo, S., Wang, Q., Wang, L., Wang, B., Guo, L.: Knowledge graph embedding
with iterative guidance from soft rules. In: Thirty-Second AAAI Conference on
Artificial Intelligence (2018)
13. Galárraga, L., Teflioudi, C., Hose, K., Suchanek, F.M.: Fast rule mining in onto-
logical knowledge bases with AMIE+. VLDB J. 24(6), 707–730 (2015)
14. Zhong, H., Zhang, J., Wang, Z., Wan, H., Chen, Z.: Aligning knowledge and text
embeddings by entity descriptions. In: EMNLP, pp. 267–272 (2015)
15. Xie, R., Liu, Z., Jia, J., Luan, H., Sun, M.: Representation learning of knowledge
graphs with entity descriptions. In: AAAI, pp. 2659–2665 (2016)
16. Xu, J., Chen, K., Qiu, X., Huang, X.: Knowledge graph representation with jointly
structural and textual encoding. arXiv preprint arXiv:1611.08661 (2016)
17. An, B., Chen, B., Han, X., Sun, L.: Accurate text-enhanced knowledge graph rep-
resentation learning. In: Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long Papers), pp. 745–755 (2018)
18. Riedel, S., Yao, L., McCallum, A., Marlin, B.M.: Relation extraction with matrix
factorization and universal schemas. In: HLT-NAACL. pp. 74–84 (2013)
19. Toutanova, K., Chen, D., Pantel, P., Poon, H., Choudhury, P., Gamon, M.: Repre-
senting text for joint embedding of text and knowledge bases. EMNLP 15, 1499–
1509 (2015)
20. Xiao, H., Huang, M., Zhu, X.: Transg: a generative model for knowledge graph
embedding. In: Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 2316–2325 (2016)
21. Wang, Z., Li, J., Liu, Z., Tang, J.: Text-enhanced representation learning for knowl-
edge graph. In: To appear in IJCAI 2016, pp. 04–17 (2016)
22. Lin, Y., Liu, Z., Luan, H., Sun, M., Rao, S., Liu, S.: Modeling relation paths for
representation learning of knowledge bases. arXiv preprint arXiv:1506.00379 (2015)
23. Toutanova, K., Lin, X.V., Yih, W.T., Poon, H., Quirk, C.: Compositional learning
of embeddings for relation paths in knowledge bases and text. In: ACL2016, vol.
1, pp. 1434–1444 (2016)
24. Xiong, W., Hoang, T., Wang, W.Y.: Deeppath: a reinforcement learning method
for knowledge graph reasoning. arXiv preprint arXiv:1707.0669 (2017)
25. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D knowledge
graph embeddings. In: Thirty-Second AAAI Conference on Artificial Intelligence
(2018)
26. Wang, Q., Wang, B., Guo, L.: Knowledge base completion using embeddings and
rules. In: International Conference on Artificial Intelligence, pp. 1859–1865 (2015)
27. Guo, S., Ding, B., Wang, Q., Wang, L., Wang, B.: Knowledge base completion via
rule-enhanced relational learning. In: Chen, H., Ji, H., Sun, L., Wang, H., Qian,
T., Ruan, T. (eds.) CCKS 2016. CCIS, vol. 650, pp. 219–227. Springer, Singapore
(2016). https://doi.org/10.1007/978-981-10-3168-7_22
28. Minervini, P.: Adversarial sets for regularising neural link predictors. In: Confer-
ence on Uncertainty in Artificial Intelligence (2017)
29. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531 (2015)
30. Hu, Z., Ma, X., Liu, Z., Hovy, E., Xing, E.: Harnessing deep neural networks with
logic rules. arXiv preprint arXiv:1603.06318 (2016)
31. Hájek, P.: Metamathematics of Fuzzy Logic, vol. 4. Springer Science & Business
Media, Dordrecht (1998)
32. Ganchev, K., Gillenwater, J., Taskar, B., et al.: Posterior regularization for
structured latent variable models. J. Mach. Learn. Res. 11, 2001–2049 (2010)
33. Bordes, A., Weston, J., Collobert, R., Bengio, Y.: Learning structured embeddings
of knowledge bases. In: AAAI Conference on Artificial Intelligence, AAAI 2011,
San Francisco, California, USA, August (2011)
34. Yang, B., Yih, W.T., He, X., Gao, J., Deng, L.: Embedding entities and relations
for learning and inference in knowledge bases. Eprint Arxiv (2014)
35. Nickel, M., Rosasco, L., Poggio, T.A., et al.: Holographic embeddings of knowledge
graphs. In: AAAI, pp. 1955–1961 (2016)
Entity Synonym Discovery via Multiple
Attentions
1 Introduction
People often describe a real-world entity in a variety of ways, which makes
text analysis and understanding more challenging. Automatic entity synonym
discovery has therefore become an important task, and it can benefit many
downstream applications, such as web search [7,8], question answering [35],
knowledge graph construction [3], and social media analysis [1].
One straightforward approach is to obtain synonyms from public knowledge
bases, such as WordNet [10], ConceptNet [30] and DBpedia [15]. For example,
WordNet groups terms into synsets, and DBpedia uses Redirects to URIs to
indicate synonyms. However, these synonyms are constructed manually, which
makes the coverage rather limited.
Many efforts have been made to discover synonyms automatically. Some
approaches discover synonyms from query logs [5,25] and web tables [11]. How-
ever, these approaches are limited to structured or semi-structured data. In order
to discover synonyms from massive raw text corpora, two types of approaches are
widely exploited, including the distributional based approaches [31] and pattern
based approaches [20].
The distributional based approaches assume that if two terms appear in sim-
ilar contexts, they are likely to be synonyms. For example, “USA” and “the
United States” are often mentioned in similar contexts, and they both refer
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 271–286, 2020.
https://doi.org/10.1007/978-3-030-41407-8_18
to the same entity, the country “USA”. However, relying only on similar contexts
may introduce noise into synonym discovery. For instance, “USA” and “Canada”
are two different countries, yet they appear in similar contexts in sentences.
Unlike the distributional based approaches, which consider corpus-level
statistics, pattern based approaches emphasize the local contexts, which are
the textual sequences from sentences. For example, we can find the pattern
“commonly known as” from the sentence “The United States of America, commonly
known as the United States”, since “The United States of America” and “the
United States” are synonyms. With this pattern, we can find more synonyms from
sentences in which two synonymous terms co-occur. However, the pattern based
approaches are too strict, so they suffer from low recall.
In practice, judging whether two terms are synonymous relies on a bag of
sentences that mention both terms. Therefore, the input to the synonym
discovery task is mainly a bag of sentences. The challenge is that the
sentences in a bag inevitably contain a lot of noise: many of them do not
reflect the synonym relation and would hurt discovery performance. Thus, it
is crucial to select valuable sentences in a bag for synonym discovery.
However, the approaches mentioned above consider only either corpus-level
statistics or local contexts.
Therefore, in this work, we propose a novel framework, SynMine, which aims
to extract synonyms from massive raw text corpora by leveraging existing syn-
onyms from encyclopedias as distant supervision. The framework can integrate
corpus-level statistics and local contexts in a unified way via a multi-attention
mechanism. Extensive experiments on a real-world text corpus show the effec-
tiveness of SynMine over many baseline approaches.
The rest of the paper is organized as follows. In Sect. 2, we review work related
to our framework. In Sect. 3, we formally define the problem and describe the
proposed framework in Sect. 4. In Sect. 5, we conduct experiments on a real-
world dataset to illustrate the effectiveness of the proposed approach. Finally,
we conclude our work with future directions in Sect. 6.
2 Related Work
Synonym discovery is a crucial task in NLP, and many efforts have been devoted
to it, especially to detecting synonyms from structured or semi-structured data
such as query logs [5,6,25,33] and web table schemas [4,11]. In this work, in
contrast, we aim to mine synonyms from a raw text corpus, which is more
challenging.
Various methods have been developed for this kind of task. Textual pattern
based methods learn frequent textual patterns from seeds and then use these
patterns to discover more target pairs; they have been applied to hypernym
detection [29], relation extraction [24], and information extraction [16].
Distributional based methods attempt to detect synonyms [17,21] and
hypernyms [27] by utilizing distributional features and training a classifier.
Furthermore, Qu et al. [23] proposed a method for synonym discovery that
combines the two. Our approach integrates these two types of methods as well.
Our work is also related to the distant relation extraction task. Most
relation extraction work focuses on supervised methods, which are
time-consuming and need a great deal of manually annotated data. To address
this issue, distant supervision approaches have been proposed that align plain
text with a given KB and regard the alignment as supervision. Nevertheless,
distant supervision suffers from the wrong label problem and may introduce a
lot of noise. Mintz et al. [19] neglected the data noise, while Riedel et
al. [26] adopted the multi-instance method and the at-least-one assumption.
Zeng et al. [34] used piecewise convolutional neural networks (PCNNs) to model
sentence-level features and selected the most likely valid sentence to predict
relations. Lin et al. [18] and Ji et al. [13] incorporated two different
attention mechanisms into PCNNs to make better use of supervision information.
Inspired by these methods, our approach also adopts PCNNs with attention
to capture the most relevant information. In addition, we integrate context
features extracted by the SetExpan model [28] as corpus-level supervision to
improve the effect of attention.
3 Problem Formulation
Given an encyclopedia E, we would like to build a synonym mining framework,
and then discover all synonyms from a given text corpus D.
For convenience, we list the main symbols used in this paper in Table 1.
Symbol | Meaning
$T$ | A set of entity synonym pairs $T = \{\langle t_{i1}, t_{i2}\rangle \mid t_{i1} \approx t_{i2}\}_{i=1}^{|T|}$, where each entity synonym pair contains two terms (i.e., words or phrases) that refer to the same real-world entity. Here, $\approx$ means two terms are synonymous
$S_i$ | A bag of sentences $\{s_1, s_2, \ldots, s_{|S_i|}\}$ for synonym pair $\langle t_{i1}, t_{i2}\rangle \in T$, where each sentence $s_k \in S_i$ contains two terms $t_{i1}$ and $t_{i2}$
$C$ | The context features $C = \{c_1, c_2, \ldots, c_{|F|}\}$, where each context feature $c_i \in C$ is a sequence of terms
$w_{c_i}$ | The global attention weight for the context feature $c_i \in C$
$w_{s_i}$ | The local attention weight for the sentence $s_i \in S$
[Figure residue; surviving labels: Distant Supervision, Global Attention, Local Attention, softmax. Caption not recovered.]
After building the bipartite graph, we assign the weight for each pair of
context feature $c_k \in C$ and synonym pair $p_i = \langle t_{i1}, t_{i2}\rangle$
using the TF-IDF transformation as in [28], which is calculated as:

$$f_{c_k,p_i} = \log(1 + X_{c_k,p_i}) \Big[ \log |T| - \log\Big( \sum_{p_j \in T} X_{c_k,p_j} \Big) \Big]$$

where $X_{c_k,p_i}$ is the co-occurrence count between the context feature $c_k$
and the synonym pair $p_i$. Similar to TF-IDF, each context feature and synonym
pair can be considered as a “term” and “document” respectively.
Therefore, the global weight for context feature $c_k$ can be calculated as:

$$w_{c_k} = \tanh\Big( \frac{\sum_{p_i \in T} f_{c_k,p_i}}{|T|} \Big) \quad (1)$$
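The two formulas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary layout of the co-occurrence counts, and the toy data are assumptions made for illustration, and a feature with no co-occurrences is simply given weight 0.

```python
import math

def tfidf_weight(cooc, feature, pair, num_pairs):
    """Weight f_{c_k,p_i} of one (context feature, synonym pair) edge."""
    counts = cooc.get(feature, {})
    x = counts.get(pair, 0)          # X_{c_k,p_i}
    total = sum(counts.values())     # sum over p_j of X_{c_k,p_j}
    if total == 0:
        return 0.0                   # assumption: unseen feature gets weight 0
    return math.log(1 + x) * (math.log(num_pairs) - math.log(total))

def global_weight(cooc, feature, pairs):
    """Global attention weight w_{c_k} as in Eq. (1)."""
    total = sum(tfidf_weight(cooc, feature, p, len(pairs)) for p in pairs)
    return math.tanh(total / len(pairs))

# Hypothetical toy data: four synonym pairs and two context features.
pairs = ["p1", "p2", "p3", "p4"]
cooc = {
    "commonly known as": {"p1": 1},                           # specific feature
    "is located in": {"p1": 1, "p2": 1, "p3": 1, "p4": 1},    # generic feature
}
weights = {c: global_weight(cooc, c, pairs) for c in cooc}
```

In this toy setting the feature that co-occurs with every pair receives weight 0, while the feature tied to a single pair receives a positive weight, which matches the IDF-like intent of the transformation.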
[Figure residue; surviving labels: context feature, softmax, Local Attention, Sentence Encoder, text corpus. Caption not recovered.]
Local Attention Mechanism. Since the sentences in each bag are obtained
through distant supervision, some sentences do not actually support the
synonym relation between the two terms. Therefore, we apply an attention
model to reduce the impact of these invalid sentences.
Inspired by many knowledge graph embedding approaches, such as
TransE [2], the relation can be represented by the difference vector between
two entities. Thus, the synonym relation could also be represented by the two
synonym items $\mathbf{v}_r = \mathbf{t}_{i1} - \mathbf{t}_{i2}$. Consequently,
if a sentence $s$ expresses the synonym relation, its embedding vector
$\mathbf{p}$ should be similar to the vector $\mathbf{v}_r$, and the sentence
should have higher attention weight. We compute this intra-bag attention
weight using the following formulas:

$$w_{s_i} = \frac{\exp(q_i)}{\sum_{j=1}^{|S|} \exp(q_j)} \quad (2)$$

$$q_i = W_a^{T} (\tanh[\mathbf{p}_i; \mathbf{v}_r]) + b_a$$

1 https://code.google.com/p/word2vec/.
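The intra-bag attention of Eq. (2) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the sentence vectors, the term embeddings, and the parameters `w_a` and `b_a` are hand-picked toy values rather than learned ones, and `attention_weights` is a hypothetical helper name.

```python
import math

def attention_weights(sentence_vecs, v_r, w_a, b_a):
    """Softmax over q_i = w_a^T tanh([p_i; v_r]) + b_a for all sentences in a bag."""
    scores = []
    for p in sentence_vecs:
        concat = list(p) + list(v_r)  # [p_i; v_r]
        q = sum(w * math.tanh(x) for w, x in zip(w_a, concat)) + b_a
        scores.append(q)
    m = max(scores)                   # subtract the max for numerical stability
    exps = [math.exp(q - m) for q in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical 2-dimensional embeddings of the two terms of a synonym pair.
t_i1, t_i2 = [1.0, 0.0], [0.0, 1.0]
v_r = [a - b for a, b in zip(t_i1, t_i2)]   # v_r = t_i1 - t_i2

# Three toy sentence vectors: aligned with v_r, opposed to v_r, and neutral.
bag = [[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]]
w_a = [1.0, -1.0, 0.0, 0.0]                 # assumed (not learned) parameters
b_a = 0.0
w_s = attention_weights(bag, v_r, w_a, b_a)
```

With these toy parameters the sentence whose vector points in the direction of v_r receives the largest weight, and the weights of a bag sum to one.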
5 Experiments
5.1 Experimental Setup
Dataset. We evaluate SynMine and other baseline methods on a real-world
dataset developed from Baidu Baike.² Baidu Baike is a Chinese encyclopedia
that contains more than 15M articles with abundant synonyms.
We collect existing synonym pairs as positive examples and randomly sample
term pairs as negative examples. We then align these pairs with articles in
Baidu Baike to obtain bags of sentences. For evaluation, we randomly
partition them into training, validation and test sets. The statistics are
presented in Table 2.
2 https://baike.baidu.com/.
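The alignment step described above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes sentences are already split, uses naive case-sensitive substring matching instead of entity recognition, and the function name `build_bags` and the example sentences are hypothetical.

```python
def build_bags(pairs, sentences):
    """Collect, for each term pair, the sentences that mention both terms."""
    bags = {pair: [] for pair in pairs}
    for sentence in sentences:
        for t1, t2 in pairs:
            # Simplification: a term is "mentioned" if it occurs as a substring.
            if t1 in sentence and t2 in sentence:
                bags[(t1, t2)].append(sentence)
    return bags

# Hypothetical toy input.
pairs = [("The United States of America", "the United States")]
sentences = [
    "The United States of America, commonly known as the United States, is a country.",
    "Canada borders the United States.",
]
bags = build_bags(pairs, sentences)
```

Only the first sentence enters the bag, because the second one mentions just one of the two terms.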
From Table 3, we have the following observations: (1) BMPM is inferior to the
other methods on all evaluation metrics. This indicates that context features
of term pairs can provide a great deal of useful information for synonym
prediction; (2) PCNN and PCNN+ONE achieve a higher precision but a lower
recall, which indicates that they tend to misclassify positive examples as
negative. It shows that effective sentence selection can alleviate the wrong
label problem, but selecting only one sentence may lose a lot of valuable
information; (3) SynMine obtains the best performance on Recall and F1 score,
although it is slightly inferior to PCNN+ONE in Precision. This is because it
integrates corpus-level statistics and local contexts, both of which are
beneficial to sentence selection.
Figure 4 displays the aggregate precision/recall curves of all methods. We can
see that SynMine achieves the best performance, which supports the
effectiveness of our proposed method and the soundness of global-and-local
attention.
We evaluate the precision of the top 100, top 200 and top 500 results in
Table 4. The results show that: (1) All methods except BMPM achieve high
precision. This indicates that the surface strings of terms cannot reflect the
synonym relation well; (2) SynMine is better able to tolerate noise in
sentences, so it performs best in the top 500 results, because more noise
occurs in the lower-ranked results.
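For reference, precision over the top-k results can be computed as in the sketch below. It assumes the extracted pairs are already sorted by model score and that each one carries a manual 0/1 correctness label; the labels shown are made up for illustration and are not results from the paper.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of correct pairs among the k highest-ranked extractions."""
    top = ranked_labels[:k]
    if not top:
        return 0.0
    return sum(top) / len(top)

# Hypothetical manual judgments (1 = correct synonym pair), best-scored first.
labels = [1, 1, 0, 1, 0]
p_at_2 = precision_at_k(labels, 2)
p_at_5 = precision_at_k(labels, 5)
```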
[Figure: F1 (y-axis, ticks from 0.8 to 0.86) plotted against α (x-axis, ticks from 0 to 1). Caption not recovered.]
Table 4. The precision of the top 100, top 200, and top 500 extracted synonym pairs
upon manual evaluation.
1. 33.3% of errors are caused by a large number of noisy sentences. Distant
supervision may bring in a lot of noise, and some entity pairs have almost no
valid sentences, e.g., the semantics of and
in sentences are implicit.
2. 26.7% of errors are caused by the long distance between two entities. The
CNN model cannot capture long-distance semantics very well. Therefore, entity
pairs separated by a long distance tend to be predicted as negative examples.
For example, has a distance of more than ten words from
, which may dilute the semantics.
3. 23.3% of errors are caused by the separation of coordinative synonym
entities. It is difficult to identify the relation between two entities that
have a coordinative relation, since there is little context between these
entities. For example, and are both the Chinese name
of sweet potato as shown in the Case 3 in Table 7, but and
Table 7. Cases of false negatives, where two terms for synonym prediction are under-
lined. In Case 3, since
and are all the Chinese name of sweet potato, so we use Pinyin to distinguish
them in sentences.
Table 8. Cases of false positives, where two terms for synonym prediction are under-
lined.
As for false positives, we categorize the causes into the following five types,
and several examples are listed in Table 8.
1. 50.0% errors are caused by related entity pairs. Entity pairs with
hypernym relation or causality relation are not distinguished well. e.g.,
is hypernym of , but our model
mispredicts it.
2. 20.0% errors are caused by coordinative entities as discussed for the false
negatives. This type of entity pairs often co-occurs with similar contexts,
such as and .
3. 16.7% errors are caused by a very short distance between two entities. For
instance, “ (Herbal Supplements)” is the source of “ (Hutu-
izi or Elaeagnus pungens thunb)”, so they are not synonymous. While,
in the sentence
, the entities
in parentheses are synonymous with the entities in front of them. For example,
is the common name of , so they are
synonyms. Thus, it is difficult to distinguish the relations only based on contexts
and patterns.
4. 10.0% errors are caused by incomplete entity name. Wrong entity recognition
in sentences will affect the results. For example, the first sentence of Case 2
in Table 8 shows that is synonymous with
6 Conclusion
In this paper, we propose a novel framework, SynMine, to extract synonyms from
massive raw text corpora. The framework integrates corpus-level statistics
and local contexts in a unified way via a multi-attention mechanism. Extensive
experiments on a real-world dataset show the effectiveness of our approach.
In the future, we will explore reinforcement learning techniques [22] to
further reduce the noise in sentences, and utilize advanced pre-trained models
such as BERT [9] to improve the performance. In addition, entity types and the
transitivity of the synonym relation can also be utilized in synonym
prediction. Furthermore, the polysemy of words should also be considered.
References
1. Antoniak, M., Bell, E., Xia, F.: Leveraging paraphrase labels to extract synonyms
from twitter. In: FLAIRS Conference (2015)
2. Bordes, A., Usunier, N., Garcı́a-Durán, A., Weston, J., Yakhnenko, O.: Translating
embeddings for modeling multi-relational data. In: NIPS (2013)
3. Boteanu, A., Kiezun, A., Artzi, S.: Synonym expansion for large shopping tax-
onomies. In: AKBC (2019)
4. Cafarella, M.J., Halevy, A.Y., Wang, D.Z., Wu, E., Zhang, Y.: Webtables: exploring
the power of tables on the web. Proc. VLDB Endow. 1(1), 538–549 (2008)
5. Chakrabarti, K., Chaudhuri, S., Cheng, T., Xin, D.: A framework for robust dis-
covery of entity synonyms. In: KDD (2012)
6. Chaudhuri, S., Ganti, V., Xin, D.: Exploiting web search to generate synonyms for
entities. In: WWW (2009)
7. Cheng, T., Lauw, H.W., Paparizos, S.: Entity synonyms for structured web search.
IEEE Trans. Knowl. Data Eng. 24, 1862–1875 (2012)
8. Clements, M., de Vries, A.P., Reinders, M.J.T.: Detecting synonyms in social tag-
ging systems to improve content retrieval. In: SIGIR (2008)
9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
rectional transformers for language understanding. CoRR abs/1810.04805 (2018)
10. Fellbaum, C.: Wordnet: An Electronic Lexical Database (2000)
11. He, Y., Chakrabarti, K., Cheng, T., Tylenda, T.: Automatic discovery of attribute
synonyms using query logs and table corpora. In: WWW (2016)
12. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.:
Improving neural networks by preventing co-adaptation of feature detectors. CoRR
abs/1207.0580 (2012)
13. Ji, G., Liu, K., He, S., Zhao, J.: Distant supervision for relation extraction with
sentence-level attention and entity descriptions. In: AAAI (2017)
14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR
(2015)
15. Lehmann, J., et al.: Dbpedia - a large-scale, multilingual knowledge base extracted
from wikipedia. Semant. Web 6, 167–195 (2015)
16. Li, Q., et al.: Truepie: discovering reliable patterns in pattern-based information
extraction. In: KDD, pp. 1675–1684. ACM (2018)
17. Lin, D., Zhao, S., Qin, L., Zhou, M.: Identifying synonyms among distributionally
similar words. In: IJCAI, vol. 3, pp. 1492–1493. Citeseer (2003)
18. Lin, Y., Shen, S., Liu, Z., Luan, H., Sun, M.: Neural relation extraction with
selective attention over instances. In: ACL, vol. 1, pp. 2124–2133 (2016)
19. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extrac-
tion without labeled data. In: ACL/IJCNLP, pp. 1003–1011. Association for Com-
putational Linguistics (2009)
20. Nguyen, K.A., Schulte im Walde, S., Vu, N.T.: Distinguishing antonyms and syn-
onyms in a pattern-based neural network. In: EACL, pp. 76–85 (2017)
21. Pantel, P., Crestan, E., Borkovsky, A., Popescu, A., Vyas, V.: Web-scale distribu-
tional similarity and entity set expansion. In: EMNLP, pp. 938–947. Association
for Computational Linguistics (2009)
22. Qin, P., Xu, W., Wang, W.Y.: Robust distant supervision relation extraction via
deep reinforcement learning. In: ACL (2018)
23. Qu, M., Ren, X., Han, J.: Automatic synonym discovery with knowledge bases. In:
KDD, pp. 997–1005. ACM (2017)
24. Qu, M., Ren, X., Zhang, Y., Han, J.: Weakly-supervised relation extraction by
pattern-enhanced embedding learning. In: WWW, pp. 1257–1266. International
World Wide Web Conferences Steering Committee (2018)
25. Ren, X., Cheng, T.: Synonym discovery for structured entities on heterogeneous
graphs. In: WWW (2015)
26. Riedel, S., Yao, L., McCallum, A.: Modeling relations and their mentions without
labeled text. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds.) ECML
PKDD 2010, Part III. LNCS (LNAI), vol. 6323, pp. 148–163. Springer, Heidelberg
(2010). https://doi.org/10.1007/978-3-642-15939-8_10
27. Roller, S., Erk, K., Boleda, G.: Inclusive yet selective: supervised distributional
hypernymy detection. In: COLING, pp. 1025–1036 (2014)
28. Shen, J., Wu, Z., Lei, D., Shang, J., Ren, X., Han, J.: Setexpan: corpus-based
set expansion via context feature selection and rank ensemble. In: ECML/PKDD
(2017)
29. Snow, R., Jurafsky, D., Ng, A.Y.: Learning syntactic patterns for automatic hyper-
nym discovery. In: NIPS, pp. 1297–1304 (2005)
30. Speer, R., Chin, J., Havasi, C.: Conceptnet 5.5: an open multilingual graph of
general knowledge. In: AAAI (2017)
31. Wang, J., Lin, C., Li, M., Zaniolo, C.: An efficient sliding window approach for
approximate entity extraction with synonyms. In: EDBT (2019)
32. Wang, Z., Hamza, W., Florian, R.: Bilateral multi-perspective matching for natural
language sentences. In: IJCAI, pp. 4144–4150 (2017)
33. Wei, X., Peng, F., Tseng, H., Lu, Y., Dumoulin, B.: Context sensitive synonym
discovery for web search queries. In: CIKM, pp. 1585–1588. ACM (2009)
34. Zeng, D., Liu, K., Chen, Y., Zhao, J.: Distant supervision for relation extraction
via piecewise convolutional neural networks. In: EMNLP (2015)
35. Zhou, G., Liu, Y., Liu, F., Zeng, D., Zhao, J.: Improving question retrieval in
community question answering using world knowledge. In: IJCAI (2013)
Towards Association Rule-Based Complex
Ontology Alignment
1 Introduction
Ontology alignment is an important step in enabling computers to query and
reason across the many linked datasets on the semantic web. This is a difficult
challenge because the ontologies underlying different linked datasets can vary in
terms of subject area coverage, level of abstraction, ontology modeling philos-
ophy, and even language. Due to the importance and difficulty of the ontology
alignment problem, it has been an active area of research for over a decade [21].
Ideally, alignment systems should be able to uncover any entity relationship
across two ontologies that can exist within a single ontology. Such relationships
have a wide range of complexity, from simple 1-to-1 equivalence, such as a Person
in one ontology being equivalent to a Human in another ontology, to arbitrary m-
to-n complex relationships, such as a Professor with a hasRank property value
of “Assistant” in one ontology being a subclass of the union of the Faculty
and TenureTrack classes in another. Unfortunately, the majority of the research
activities in the field of ontology alignment remains focused on the simplest
end of this scale – finding 1-to-1 equivalence alignments between ontologies.
Indeed, identifying arbitrarily complex alignment is known to be significantly
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 287–303, 2020.
https://doi.org/10.1007/978-3-030-41407-8_19
harder than finding 1-to-1 equivalences. In the latter case, a naive approach can
compare every entity from the source ontology against every entity in the target
ontology, which is feasible for small- and medium-sized ontologies. However, a
complex alignment can potentially involve many entities from both ontologies,
so pair-wise comparison is insufficient, and the search space becomes very large
even for small ontologies. It is indeed very difficult for either a human expert or
an automated system to evaluate all possible combinations [2,19].
In this paper, we propose a complex alignment algorithm based on association
rule mining. Our algorithm automatically discovers potential complex
correspondences which can then be presented to human experts in order to
effectively generate complex alignment between two ontologies with populated
common instance data. We evaluate the performance of our system on one of the
benchmarks from the complex alignment track of the OAEI 2018,¹ the GeoLink
benchmark, which contains around 74k instances from real-world datasets.
Significant instance data, which is required for the association rule mining
approach, is not available for the remaining benchmarks.² The main
contributions of this paper are the following:
As a side contribution, our analysis of the results shows that shared
instance data between two ontologies can be a good resource for improving
the performance of ontology alignment.
The rest of the paper is organized as follows. Section 2 discusses related work
on ontology alignment using association rule mining and instance data, and on
complex ontology alignment, including existing alignment algorithms and
relevant benchmarks. Section 3 gives background on the FP-growth association
rule mining algorithm. Section 4 illustrates the association rule-based
alignment algorithm in detail, along with the alignment patterns used to
generate the alignment between ontologies. The performance of the system is
analyzed in Sect. 5. Section 6 concludes with a discussion of potential future
work in this area.
2 Related Work
Association rule mining has already been used for finding 1:1 simple alignments.
AROMA [4] is a hybrid, extensional and asymmetric ontology alignment method
that makes use of association rules and a statistical measure. It relies on the idea
that “An entity A will be more specific than or equivalent to an entity B if the
vocabulary used to describe A and its instances tends to be included in that of B
and its instances.” In addition, association rule mining is also used in discovering
rules in ontological knowledge bases [10] and logical linked data compression [15].
1 http://oaei.ontologymatching.org/2018/complex/index.html.
2 It might be available for OAEI 2019.
There are also some instance-based ontology alignment systems that utilize
ABox information to generate 1:1 simple alignments between ontologies. GLUE
[6] uses joint probability distributions to describe the similarity of concepts in
two ontologies. For example, p(A, B) is the probability that an instance in the
domain belongs to both concept A and concept B. If the instances of concept A
and concept B are in isolation, GLUE uses the instances of A to learn a
classifier for A, and then classifies instances of B according to that classifier,
and vice versa. FCA MERGE also utilizes common instances between ontologies
[22]. It extracts instances from a given set of domain-specific text documents
by applying natural language processing techniques. Based on the extracted
instances, FCA MERGE applies mathematical techniques to derive a lattice of
concepts as its structural result. More instance-based alignment systems are
discussed in the survey [26].
There are some related studies on creating algorithms to find complex align-
ment between ontologies. Early work on generating complex alignment is [19,20].
Therein, three complex alignment patterns were described, which are Class by
Attribute Type (CAT), Class by Attribute Value (CAV), and Property Chain
(PC). Based on these patterns, the authors generated complex alignments on
the Conference and Benchmark datasets from the OAEI. [13] identified com-
plex alignments by defining knowledge rules and using a probabilistic framework
to integrate a knowledge-based strategy with standard terminology-based and
structure-based strategies. More recent related work is currently being under-
taken by Thieblin et al. [24]. They propose a complex alignment approach that
relies on the notion of Competency Question for Alignment (CQA). The app-
roach translates a CQA into a SPARQL query and extracts a set of instance
data from the source ontology. Then the matching is performed by finding the
lexically similar surroundings between the set of instance data and the instances
in the target ontology. This approach resulted in the CANARD system [23].
However, the current version of the system is limited to finding complex corre-
spondences that only involve classes. More complex correspondences containing
properties are still not taken into account [23]. Another alignment system that
works on the detection of the complex alignment is the complex version of Agree-
mentMakerLight (AMLC) [9]. This system focuses on the complex Conference
benchmark to find alignments that follow the CAT and CAV patterns.
In OAEI 2018, the first version of the complex alignment track [25] opened
new perspectives in the field of ontology matching. It comprised four different
benchmarks containing complex relations. However, the results from the first
year were rather poor. Only 2 out of 15 systems, AMLC and CANARD, were
able to generate any correct complex correspondences on the complex Conference
and Taxon benchmarks, and the correct number of mappings found was quite
limited. The very limited performance of the two systems of course shows avenues
for improvement in the future. More details of evaluations and results can be
accessed on the OAEI 2018 website.³
3 http://oaei.ontologymatching.org/2018/complex/index.html.
Our algorithm differs from the above methods in several aspects. First,
[9,13,19] focus on computing lexical or terminological similarity to decide on
complex alignments, while our system takes advantage of instance data to gener-
ate association rules between ontologies. While the CANARD system also relies
on the instance data, we use it in completely different ways. In addition, the
current version of CANARD is limited to finding complex correspondences that
involve only classes, while our algorithm does not have this limitation. Second,
our evaluation of results is more detailed, in order to provide insight into how
to improve the performance of complex alignment algorithms. Specifically, we
break the evaluation process down into two subtasks: entity identification and
relationship identification. We utilize a variation of traditional evaluation met-
rics called relaxed precision, recall, and f-measure [7] to present the final results
of the full complex alignment.
3 Background
In order to help the reader understand how we apply association rule mining
and the FP-growth algorithm on the ontology alignment task, we introduce here
some concepts that we frequently mention in the rest of the paper.
Association Rule Mining. Our alignment system mainly depends on a data
mining algorithm called association rule mining, which is a rule-based machine
learning method for discovering interesting relations between variables in large
databases [17]. Over the years, association rule mining has played an important
role in many data mining tasks, such as market basket analysis, web usage
mining, and bioinformatics. Many algorithms for generating association rules
have been proposed, such as Apriori [1] and FP-growth [11]. In this paper,
we use FP-growth to generate association rules between ontologies, since it
mines frequent patterns without candidate generation and has been reported to be
more efficient than Apriori-style algorithms [11], which reduces the run-time of our system.
Transaction Database. Let I = {i1 , i2 , . . . , in } be a set of distinct attributes
called items. Let D = {t1 , t2 , . . . , tm } be a set of transactions where each trans-
action in D has a unique transaction ID and contains a subset of the items
in I. Table 1 shows a list of transactions corresponding to a list of triples. The
data in an ontology can be represented as a set of triples, each consisting of a
subject, a predicate, and an object. Here, each subject serves as a transaction
identifier, and the items of its transaction are the corresponding properties paired
with their objects, with property and object separated by the symbol “|”. That is, a
transaction is a pair T = (s, Z) such that s is a subject, and each member of Z is a
pair (p, o) of a property and an object such that (s, p, o) is a triple.
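As a minimal sketch of this definition, the following Python snippet groups triples by subject and encodes each (property, object) pair as a "property|object" item. The function name build_transactions and the use of prefixed names instead of full URIs are assumptions made for illustration; the sample triples follow the itemsets shown later in the paper for x1 and x3.

```python
from collections import defaultdict


def build_transactions(triples):
    """Group (subject, predicate, object) triples into transactions.

    Each subject becomes a transaction ID, and each (predicate, object)
    pair becomes one item written as "predicate|object".
    """
    transactions = defaultdict(set)
    for subject, predicate, obj in triples:
        transactions[subject].add(f"{predicate}|{obj}")
    return dict(transactions)


# Illustrative triples using prefixed names (an assumption; real data uses URIs).
triples = [
    ("x1", "gbo:hasAward", "y1"),
    ("x1", "gmo:fundedBy", "y2"),
    ("x3", "rdf:type", "gbo:Cruise"),
    ("x3", "rdf:type", "gmo:Cruise"),
]

transactions = build_transactions(triples)
for tid, items in sorted(transactions.items()):
    print(tid, sorted(items))
```

Using a set for the items means that a repeated triple contributes only one item to its transaction, which matches the set-based definition of Z.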
FP-growth. The FP stands for frequent pattern. The FP-growth algorithm is
run on the transaction database in order to determine which combinations of
items co-occur frequently. The algorithm first counts the number of occurrences
of all individual items in the database. Next, it builds an FP-tree structure by
inserting these instances. Items in each instance are sorted by descending order
of their frequency in the dataset, so that the tree can be processed quickly. Items
Towards Association Rule-Based Complex Ontology Alignment 291
in each instance that do not meet the predefined thresholds, such as minimum
support and minimum confidence (see below for these terms), are discarded.
Once all large itemsets have been found, the association rule creation begins.
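The two passes described above (counting items, then inserting frequency-sorted transactions into a tree) can be sketched as follows. This is only an illustration of FP-tree construction, not of the full mining step, and it is not the implementation used in the paper. The names FPNode and build_fp_tree, the absolute support count, and the alphabetical tie-breaking are assumptions.

```python
from collections import Counter


class FPNode:
    """One node of an FP-tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support_count):
    # Pass 1: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}

    # Pass 2: insert each transaction, keeping only frequent items,
    # sorted by descending frequency (ties broken alphabetically).
    root = FPNode(None)
    for t in transactions:
        items = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


# Illustrative transactions (hypothetical counts, items taken from the paper).
award = "rdf:type|gbo:Award"
funding = "rdf:type|gmo:FundingAward"
sponsor = "gbo:hasSponsor|gbo:Organization"
transactions = [[award, funding], [award, funding], [award, funding, sponsor]]

root, frequent = build_fp_tree(transactions, min_support_count=2)
```

Because all three transactions share the same frequent prefix, they collapse into a single path in the tree, which is what makes the structure compact.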
Association Rule. Every association rule is composed of two sides. The left-
hand-side is called the antecedent, and the right-hand-side is the consequent.
These rules indicate that whenever the antecedent is present, the consequent is
likely to be as well. Table 2 shows some examples of association rules generated
from the transaction database in Table 1.
Support. Support indicates how frequently an itemset appears in the dataset.
The FP-growth algorithm finds the frequent itemsets from the dataset based on
the minimum support threshold. In our alignment system, the minimum support
value was examined experimentally and set to 0.001, which gave the best performance.
Confidence. Confidence is an indication of how often an association rule has
been found to be true, i.e. how often the presence of the antecedent is associated
with the presence of the consequent. The minimum confidence can be tuned to
find relatively accurate rules. In this paper, we use a minimum confidence of
0.3 as the default value, and we raise it to 1 when mining association
rules that may contain complex relations, because our algorithm is
precision-oriented.
Lift. Lift is the ratio of the observed support to that expected if the antecedent
and consequent were independent. A lift greater than 1 means that the
two itemsets are positively dependent on one another, which indicates that the
association rule is useful. In our approach, lift is used to choose between otherwise equal options
when detecting simple mappings. When the confidence values of two association
rules are the same, the one with higher lift value is selected as the basis for the
mapping.
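The three measures can be made concrete with a small sketch. The functions below follow the standard definitions of support, confidence and lift. The four toy transactions, the function names, and the helper best_rule (which orders candidate rules by confidence and uses lift only to break ties) are assumptions made for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent).

    Assumes the antecedent occurs at least once in the data.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the support of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )


def best_rule(transactions, rules):
    """Pick the rule with the highest confidence; lift breaks ties."""
    return max(
        rules,
        key=lambda r: (confidence(transactions, *r), lift(transactions, *r)),
    )


# Hypothetical toy data: four transactions over three items.
A = "rdf:type|gbo:Award"
F = "rdf:type|gmo:FundingAward"
C = "rdf:type|gbo:Cruise"
transactions = [{A, F}, {A, F}, {F}, {C}]

print(confidence(transactions, {A}, {F}))
print(lift(transactions, {A}, {F}))
```

In this toy data every transaction containing A also contains F, so the rule A → F has confidence 1, while the reverse rule F → A has lower confidence because one transaction contains F without A.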
We first extract all triples ⟨Subject, Predicate, Object⟩ from the source and
target ontologies. Each item in a triple is expressed as a web URI. After collecting
all of the triples, we prepare the data as follows: we only keep the triples that
contain at least one entity under the source or the target ontology namespace
and also the triples that contain rdf:type information, since our algorithm relies
on this information. After this, there are still some triples containing less useful
information for association rule mining, which follow this format: x rdf:type
owl:NamedIndividual. Such a triple states only that the subject
x is an individual. However, it occurs frequently in the dataset and may introduce
noise when applying the FP-growth algorithm, since the frequency of occurrence
affects the results of FP-growth. We therefore filter such triples out of the dataset
as well.
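The filtering rules above can be sketched as a single predicate over triples. The function name keep_triple and the use of the prefixes gbo: and gmo: as stand-ins for the source and target namespaces are assumptions; the actual data uses full URIs.

```python
SOURCE_NS = "gbo:"  # assumed stand-in for the source ontology namespace
TARGET_NS = "gmo:"  # assumed stand-in for the target ontology namespace


def keep_triple(triple, namespaces=(SOURCE_NS, TARGET_NS)):
    """Return True if the triple should be kept for rule mining."""
    subject, predicate, obj = triple
    # Drop "x rdf:type owl:NamedIndividual": frequent but uninformative.
    if predicate == "rdf:type" and obj == "owl:NamedIndividual":
        return False
    # Keep triples that mention an entity from either ontology namespace ...
    mentions_ontology = any(term.startswith(namespaces) for term in triple)
    # ... and triples that carry rdf:type information.
    return mentions_ontology or predicate == "rdf:type"


# Hypothetical sample triples.
triples = [
    ("x1", "rdf:type", "owl:NamedIndividual"),
    ("x1", "gbo:hasAward", "y1"),
    ("x1", "rdfs:label", "Award 1"),
    ("x3", "rdf:type", "gbo:Cruise"),
]
filtered = [t for t in triples if keep_triple(t)]
```

The owl:NamedIndividual check comes first so that it is not let through by the general rdf:type rule.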
After this filtering process, we generate the transaction database for the FP-
growth algorithm based on all of the remaining triples. The subjects serve as the
transaction IDs, and the predicates with the objects separated by the symbol “|”
are the items for each transaction. Then we replace the object in the triples with
its rdf:type,4 because we focus on generating schema-level (rather than instance-
level) mapping rules between two ontologies, and the type information of the
object is more meaningful than the original URI. If an object in a triple has
rdf:type of a class in the ontology, we replace the URI of the object with its
class. If the object is a data value, the URI of the object is replaced with the
datatype. If the object already is a class in the ontology, it remains unchanged.
Tables 3 and 4 show some examples of the conversion.
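The three replacement cases can be sketched as follows. The function name and the three lookup structures (instance_types, literal_datatypes, classes) are assumptions about how the type information might be held in memory; the sample values mirror the x1, x2 and x3 rows of the conversion tables.

```python
def replace_object_with_type(triple, instance_types, literal_datatypes, classes):
    """Replace the object of a triple by its class or datatype."""
    subject, predicate, obj = triple
    if obj in classes:
        # The object already is a class: leave it unchanged.
        return triple
    if obj in instance_types:
        # The object is an individual: use its rdf:type.
        return (subject, predicate, instance_types[obj])
    if obj in literal_datatypes:
        # The object is a data value: use its datatype.
        return (subject, predicate, literal_datatypes[obj])
    return triple


# Assumed lookup tables for the illustration.
classes = {"gbo:Cruise", "gmo:Cruise", "gbo:Award", "gmo:FundingAward"}
instance_types = {"y1": "gbo:Award", "y2": "gmo:FundingAward"}
literal_datatypes = {"y3": "xsd:string"}

converted = [
    replace_object_with_type(t, instance_types, literal_datatypes, classes)
    for t in [
        ("x1", "gbo:hasAward", "y1"),
        ("x2", "gbo:hasFullName", "y3"),
        ("x3", "rdf:type", "gbo:Cruise"),
    ]
]
```

This sketch assumes one type per object, as in the evaluation data; objects with several types would need the extra handling mentioned in the footnote.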
4
Our evaluation data has only a single type per object. If an object has multiple types,
the subject and predicate can be used as additional information to determine
the correct type, or both types can be kept as two separate triples.
TID   Itemsets
x1    gbo:hasAward|y1, gmo:fundedBy|y2
x2    gbo:hasFullName|y3, gmo:hasPersonName|y4
x3    rdf:type|gbo:Cruise, rdf:type|gmo:Cruise

TID   Itemsets
x1    gbo:hasAward|gbo:Award, gmo:fundedBy|gmo:FundingAward
x2    gbo:hasFullName|xsd:string, gmo:hasPersonName|gmo:PersonName
x3    rdf:type|gbo:Cruise, rdf:type|gmo:Cruise
from the source ontology and whose consequent only contains entities from the
target ontology. The association rules tell us which source entities are related to
which target entities, but they do not give us information on how those entities
are related. In order to determine this, we analyze the output of the association
rule mining step in light of the common alignment patterns introduced in [19,
27]. In the following, we introduce how we leverage these alignment patterns
to filter the association rules and generate the corresponding alignment. The
following examples that we use in this paper are from the GeoLink benchmark
[27]. gbo: is the prefix of the namespace of the GeoLink Base Ontology (GBO),
and gmo: is the prefix of the namespace of the GeoLink Modular Ontology
(GMO). The alignment between the two ontologies contains both simple and
complex correspondences. To deal with redundancy among the generated association
rules, we always keep the simpler rule as the result. For example, suppose our
system generates two association rules. The first states that Cruise in the GBO is
equivalent to the domain of fundedBy with its range of FundingAward in the GMO. The
second states that Cruise in the GBO is equivalent to Cruise in the GMO, which is the
domain of fundedBy. The two mapping rules are therefore semantically equivalent, and
we keep only the second, simpler rule as our result.
1:1 Class Alignment. The first pattern is simple 1-to-1 class relationships.
Classes C1 and C2 are from ontology O1 and ontology O2 , respectively. So,
we target the association rules with the following format:
Association Rule format: rdf:type|C1 → rdf:type|C2
Example: rdf:type|gbo:Award → rdf:type|gmo:FundingAward
Generated Alignment: gbo:Award(x) → gmo:FundingAward(x)
The left and right hand side of the arrow represent the antecedent and conse-
quent in the association rules, respectively. In the example, the association rule
implies that if an individual x has rdf:type of gbo:Award, then x also has rdf:type of
gmo:FundingAward. This means that gbo:Award is a subclass of gmo:FundingAward.
If there is another association rule containing the reverse information, which means
that gmo:FundingAward is also a subclass of gbo:Award then we can generate an
alignment based on the two association rules stating that gbo:Award is equivalent
to gmo:FundingAward. This method of choosing between subsumption and equiva-
lence relationships is used for all of the following types of correspondences as well.
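The choice between subsumption and equivalence reduces to checking whether the reverse rule was also mined. A minimal sketch, assuming rules are stored as (antecedent, consequent) pairs of item strings; the function name relation_for and the src:/tgt: rule are hypothetical.

```python
def relation_for(rule, rules):
    """Return "equivalence" if the reverse rule also exists, else "subsumption"."""
    antecedent, consequent = rule
    if (consequent, antecedent) in rules:
        return "equivalence"
    return "subsumption"


rules = {
    ("rdf:type|gbo:Award", "rdf:type|gmo:FundingAward"),
    ("rdf:type|gmo:FundingAward", "rdf:type|gbo:Award"),
    # Hypothetical one-directional rule, for contrast only.
    ("rdf:type|src:Vessel", "rdf:type|tgt:Platform"),
}

for rule in sorted(rules):
    print(rule, relation_for(rule, rules))
```

Storing the rules in a set makes the reverse lookup a constant-time membership test.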
1:1 Property Alignment. This pattern captures simple 1-to-1 property mappings.
The property can be either an object property or a data property.
(1) Object Property Alignment. Since we have the information of the type of the
object in the association rule, we can use the type information to filter the
mapping candidates. When we align two object properties, the range types
of the properties are usually either equivalent to each other or compatible
(because they are in a subclass or superclass relationship). In this paper, our
algorithm is precision-oriented. Therefore, we require the object properties
in the two ontologies to have equivalent (rather than compatible) ranges in
order to be considered equivalent. Range equivalence is determined through
the results of the simple class alignment introduced above. Object Property
op1 with its range type t1 and object property op2 with its range type t2
are from ontology O1 and ontology O2 , respectively. In order to find this
alignment, we select the association rules with the following format:
Association Rule format: op1 |t1 → op2 |t2
Example: gbo:hasAward|gbo:Award → gmo:fundedBy|gmo:FundingAward
Generated Alignment: gbo:hasAward(x, y) → gmo:fundedBy(x, y)
We know from the results of the simple class alignment that gbo:Award is
equivalent to gmo:FundingAward. This association rule says that gbo:hasAward is
subsumed by gmo:fundedBy. If there is another association rule containing the
reverse relationship, we can generate the mapping that gbo:hasAward is equiva-
lent to gmo:fundedBy.
(2) Data Property Alignment. Similar to aligning object properties, when align-
ing two data properties, the range values of the two properties should be of a
compatible datatype. In this paper, we only investigate equivalent datatypes.
Data Property dp1 with its range value t1 and property dp2 with its range
value t2 are from ontology O1 and ontology O2 , respectively.
Association Rule format: dp1 |t1 → dp2 |t2
Example:
gbo:hasIdentifierValue|xsd:string → gmo:hasIdentifierValue|xsd:string
Generated Alignment:
gbo:hasIdentifierValue(x, y) → gmo:hasIdentifierValue(x, y)
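Both property patterns share the same check: the two range types must be identical or already known to be equivalent. The sketch below illustrates that filter under assumptions of our own: the function name property_alignment, the representation of equivalent classes as a set of frozensets, and the output string format are not taken from the original system.

```python
def property_alignment(rule, equivalent_classes):
    """Turn a rule "p1|t1 -> p2|t2" into a 1:1 property alignment.

    Returns None when either side is an rdf:type item or when the
    range types are neither identical nor known to be equivalent.
    """
    antecedent, consequent = rule
    p1, t1 = antecedent.split("|", 1)
    p2, t2 = consequent.split("|", 1)
    if p1 == "rdf:type" or p2 == "rdf:type":
        return None
    ranges_match = t1 == t2 or frozenset((t1, t2)) in equivalent_classes
    if not ranges_match:
        return None
    return f"{p1}(x, y) → {p2}(x, y)"


# Equivalences assumed to come from the simple class alignment step.
equivalent_classes = {frozenset(("gbo:Award", "gmo:FundingAward"))}

object_rule = ("gbo:hasAward|gbo:Award", "gmo:fundedBy|gmo:FundingAward")
data_rule = (
    "gbo:hasIdentifierValue|xsd:string",
    "gmo:hasIdentifierValue|xsd:string",
)
print(property_alignment(object_rule, equivalent_classes))
print(property_alignment(data_rule, equivalent_classes))
```

Requiring equivalent rather than merely compatible ranges reflects the precision-oriented choice described above; a recall-oriented variant could also accept subclass relationships.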
1:n Class Alignment. This type of pattern was first introduced in [19]. It contains
two different patterns: the Class by Attribute Type pattern (CAT) and the Class
by Attribute Value pattern (CAV). In addition, [27] introduced another pattern
called Class Typecasting.
(4) Class by Attribute Type. This pattern states that a class in the source
ontology is in some relationship to a complex construction in the target
ontology. This complex construction may comprise an object property and
its range type. Class C1 is from ontology O1 , and object property op1 and
its range type t1 are from ontology O2 .
Association Rule format: rdf:type|C1 → op1 |t1
Example: rdf:type|gbo:PortCall → gmo:atPort|gmo:Place
Generated Alignment: gbo:PortCall(x) → gmo:atPort(x, y) ∧ gmo:Place(y)
(5) Class by Attribute Value. This pattern is similar to the previous one. It just
replaces the object property with a data property. Class C1 is from ontology
O1 , and data property dp1 and its datatype of the range value t1 are from
ontology O2 .
Association Rule format: rdf:type|C1 → dp1 |t1
Example: rdf:type|gbo:Identifier → gmo:hasIdentifierScheme|xsd:string
Generated Alignment: gbo:Identifier(x) → gmo:hasIdentifierScheme(x, y)
(7) 1:n Property Typecasting. This pattern is similar in spirit to the Class Type-
casting patterns mentioned above. However, in this case, a property from
one ontology is cast into a class assignment statement in the other ontology.
Association Rule format: p1 |t1 → rdf:type|C2
Example: gbo:hasPlaceType|gbo:PlaceType → rdf:type|gmo:Place
Generated Alignment:
gbo:hasPlaceType(x, y) ∧ gbo:PlaceType(y) → gmo:Place(x)
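The Class by Attribute Type, Class by Attribute Value and Property Typecasting formats can be told apart purely from the shape of the rule. The following sketch reproduces the three generated alignments shown above. The function name one_to_n_alignment and the convention that a range starting with xsd: denotes a datatype are assumptions made for this illustration.

```python
def one_to_n_alignment(rule):
    """Map a single-item rule to a 1:n alignment string, or None."""
    antecedent, consequent = rule
    left_p, left_t = antecedent.split("|", 1)
    right_p, right_t = consequent.split("|", 1)

    if left_p == "rdf:type" and right_p != "rdf:type":
        if right_t.startswith("xsd:"):
            # Class by Attribute Value: the range is a datatype.
            return f"{left_t}(x) → {right_p}(x, y)"
        # Class by Attribute Type: the range is a class.
        return f"{left_t}(x) → {right_p}(x, y) ∧ {right_t}(y)"

    if left_p != "rdf:type" and right_p == "rdf:type":
        # Property Typecasting: a property is cast into a class assertion.
        return f"{left_p}(x, y) ∧ {left_t}(y) → {right_t}(x)"

    return None


cat_rule = ("rdf:type|gbo:PortCall", "gmo:atPort|gmo:Place")
cav_rule = ("rdf:type|gbo:Identifier", "gmo:hasIdentifierScheme|xsd:string")
cast_rule = ("gbo:hasPlaceType|gbo:PlaceType", "rdf:type|gmo:Place")

for rule in (cat_rule, cav_rule, cast_rule):
    print(one_to_n_alignment(rule))
```

Rules with rdf:type on both sides, or on neither side, belong to the 1:1 patterns and are deliberately left to other handlers here.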
m:n Complex Alignment. This group contains the most complex correspon-
dences.
(8) m:n Property Chain. This pattern applies, for example, when a property,
together with type restrictions on one or both of its fillers, in one ontology,
has been used to “flatten” the structure of the other ontology by short-
cutting a property chain in that ontology. The pattern also ensures that
the types of the property fillers involved in the property chain are typed
appropriately in the other ontology. The class C1 and property r1 with its
range restriction t1 are from ontology O1 , and classes Bi and properties pi
with their range restrictions di are from ontology O2 .
Association Rule format:
rdf:type|C1 , r1 |t1 → rdf:type|B1 , p1 |d1 , . . . , rdf:type|Bi , pi |di
Example:
rdf:type|gbo:Award, gbo:hasSponsor|gbo:Organization
→ rdf:type|gmo:FundingAward,
gmo:providesAgentRole|gmo:SponsorRole,
gmo:performedBy|gmo:Organization
Generated Alignment:
gbo:Award(x) ∧ gbo:hasSponsor(x, z) ∧ gbo:Organization(z)
→ gmo:FundingAward(x) ∧
gmo:providesAgentRole(x, y) ∧ gmo:SponsorRole(y) ∧
gmo:performedBy(y, z) ∧ gmo:Organization(z)
In this example, the association rule implies that in the GBO, the prop-
erty gbo:hasSponsor with the domain type of gbo:Award and the range type of
gbo:Organization has been used to “flatten” the complex structure in the GMO
by short-cutting a property chain. Note that in this pattern, C1 and any of the
Bi may be omitted (in which case they are essentially ⊤, the top class).
5 Evaluation
In this section, we show the experimental results of our proposed alignment
algorithm on the OAEI GeoLink benchmark and analyze the results in detail.
The GeoLink benchmark [27] is composed of two ontologies in the geosciences
domain. These two ontologies are both populated with 100% shared instance
data collected from the real-world GeoLink knowledge base [3], in order to help
the evaluation of alignment algorithms depending on instance data.5 The subset
used for this study contains around 74k triples, which is suitable for applying
association rule mining.
We originally planned to compare the performance of our system against the
pattern-based system in [19], CANARD, and AMLC. However, the GeoLink
benchmark is a property-oriented dataset which involves many object or data
properties in the complex correspondences. As we discussed in Sect. 2, CANARD
is currently limited to finding complex mappings that only involve classes. Even
though the pattern-based system in [19] can generate property-based complex
correspondences, such as property chains, the rules that the system follows
largely limit its results, and it finishes without finding any complex alignment
on the GeoLink ontology pair. AMLC currently only works for the complex
Conference benchmark [2,9]. Therefore, there are no complex alignment sys-
tems against which we could compare the performance of our system. So in this
paper we are limited to reporting the performance of our system against the
reference alignment when it comes to the identification of complex alignment.
Performance on the identification of simple alignment is compared against that
of systems that participated in the OAEI 2018.
Because the systems we compare against are only capable of identifying sim-
ple correspondences, we present the results on the simple and complex portions
of the overall alignment separately.6 For simple correspondences, we use the tra-
ditional precision, recall and F-measure metrics, in order to compare against
other simple alignment systems. However, in order to provide more insight into
the underlying nature of the performance on complex correspondences, we take
a slightly different approach. One option would be semantic precision and recall,
which compare correspondences based on their semantic meaning rather than their
syntactic representation [8]. This is done by applying a reasoner to determine when one mapping
is logically equivalent to another. Even though the semantic approaches solve
an important problem for evaluating alignments with complex correspondences,
they still have several limitations. One is that the reasoning takes a significant
amount of time, particularly for large ontologies. Furthermore, such reasoning is
not possible if the merged ontology is not in OWL DL. The GeoLink benchmark
is one example of this case, since there are many correspondences involving an
object property on one side and a data property on another side, which is not
5
https://doi.org/10.6084/m9.figshare.5907172.
6
We are aware that this may not be the most general way to evaluate complex align-
ments, but the community does not yet have any guidelines or tangible results which
could be used. And solving the evaluation problem is out of scope of this paper.
permissible in OWL DL. Instead, we utilize relaxed precision and recall [7]. More
specifically, a correspondence consists of two aspects: the entities involved, and
the relationship between them (e.g. equivalence, subsumption, disjunction). In
order to assess performance on both of these aspects, we evaluate them sepa-
rately. This roughly corresponds to the first and second subtasks described for
some of the test sets within the complex track of the OAEI.7 However, the types
of relationships we consider are limited to equivalence and subsumption rather
than the arbitrary OWL constructs considered there.
the expected output from an alignment system is that hasSponsor in the GBO is
related to FundingAward, providesAgentRole, SponsorRole and performedBy
in the GMO and Award in the GBO. Based on the two lists of entities from
the reference alignment and the matcher, precision, recall, and f-measure can be
calculated.
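Entity identification can be scored with ordinary set-based precision, recall and f-measure over the two entity lists. The sketch below uses the hasSponsor example, with gmo:performedBy left out of the matcher output as an illustrative case; the function name entity_scores is an assumption.

```python
def entity_scores(reference, found):
    """Set-based precision, recall and f-measure over entity lists."""
    reference, found = set(reference), set(found)
    correct = len(reference & found)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


reference_entities = {
    "gbo:hasSponsor",
    "gbo:Award",
    "gmo:FundingAward",
    "gmo:providesAgentRole",
    "gmo:SponsorRole",
    "gmo:performedBy",
}
# Illustrative matcher output that misses one entity.
found_entities = reference_entities - {"gmo:performedBy"}

precision, recall, f_measure = entity_scores(reference_entities, found_entities)
print(precision, recall, f_measure)
```

Here all five returned entities are correct, so precision is 1, while recall is 5/6 because one of the six reference entities is missing.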
(2) Relationship Identification: In terms of the example above, an align-
ment system needs to eventually determine that the relationship between the
two sides is equivalence. Based on our algorithm, if there is only one association
rule holding the information, we consider the relationship to be subsumption. If
there are two association rules containing the information for both directions,
an equivalence relationship is generated. At this stage, we do not further assess
Matcher              1:n Property subsum.  m:n Complex equiv.  m:n Complex subsum.
Reference alignment  5                     26                  17
Our algorithm        3                     15                  7
Relaxed Precision    0.60                  0.90                0.53
Relaxed Recall       0.36                  0.36                0.16
Relaxed F-measure    0.45                  0.51                0.24
other complex relationships. Table 6 shows the similarity values for the different
situations. We apply slightly different penalties to two situations: finding less
information, all of which is correct, and finding more information, part of which
is incorrect. We do not penalize an incorrect relationship with a value of zero,
because that would discard the entity identification output entirely, regardless of
whether it is a reasonable result or a completely incorrect one. In order to
generate the final results, we multiply
the results from the entity identification by the penalty of the relations.8 The
formulas for computing the final results are as follows:
Relaxed precision = Precision entity × Relation similarity
Relaxed recall = Recall entity × Relation similarity
Relaxed f-measure = F-measure entity × Relation similarity
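The three formulas apply the same multiplication to each entity-level score. A minimal sketch follows. The input scores and the relation similarity of 0.8 are hypothetical placeholders and are not values taken from Table 6.

```python
def relaxed_scores(precision_entity, recall_entity, f_entity, relation_similarity):
    """Scale the entity-level scores by the relation similarity."""
    return (
        precision_entity * relation_similarity,
        recall_entity * relation_similarity,
        f_entity * relation_similarity,
    )


# Hypothetical inputs: entity-level scores and a relation similarity penalty.
relaxed_p, relaxed_r, relaxed_f = relaxed_scores(1.0, 0.5, 2 / 3, 0.8)
print(relaxed_p, relaxed_r, relaxed_f)
```

With a relation similarity of 1, meaning the relationship is identified correctly, the relaxed scores equal the entity-level scores.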
Table 7 shows the results of our algorithm. In total there are 48 complex
mappings in the reference alignment. For 1:n property subsumption, our algo-
rithm finds 3 mappings that fall into this category. For example, we find that the
domain of gbo:hasSampleType is equivalent to gmo:PhysicalSample. However, the
correct relationship should be subsumption. So, the final result should be penal-
ized based on Table 6. For m:n complex equivalence, since our default confidence
value for complex alignment is 1, the alignment that we found may miss some
entities that should exist in the alignment. For example, referring to the exam-
ple we use in the entity identification, the expected output from the alignment
system is that the property hasSponsor in the GBO is related to FundingAward,
providesAgentRole, SponsorRole, performedBy in the GMO and Award in the GBO.
However, our algorithm misses one entity which is performedBy in the GMO.
Errors such as this may of course be easily corrected by human interaction. For
m:n complex subsumption, our algorithm does not generate the correct relation-
ships for all the mappings we found. However, overall, our association rule-based
algorithm can effectively come up with rather high quality simple and complex
alignment automatically.9
8
The two scores could also be aggregated with functions other than
multiplication [7], but we do not address this question in
this paper.
9
All the data and alignment that we use and generate can be accessed via the link
http://tiny.cc/rojy4y. We utilize the Apache Spark frequent pattern mining library
to generate association rules.
6 Conclusion
Complex ontology alignment has been discussed for a long time, but relatively
little work has been done to advance the state of the art in this field. In this
paper, we proposed a complex ontology alignment algorithm based on association
rule mining. Our algorithm takes advantage of instance data to mine frequent
patterns, which show us which entities in one ontology are related to which
entities in the other. Then we apply common simple and complex patterns to
arrange these related entities into the formal alignment. We evaluated our system
on the complex alignment benchmark from the OAEI and analyzed the results
in detail to provide a better understanding of the challenges related to complex
ontology alignment research.
There are still some limitations of our algorithm. First, our system relies
on instance data for mining the association rules, which is not available for
all ontology pairs. However, this could possibly be resolved with automated
instance data generation to populate common instances into the ontologies or
instance matching techniques. Second, we incorporate some common patterns
that have been widely accepted in the ontology alignment community in this
paper. This could be another limitation, since the set of mapping patterns in our
system is likely not comprehensive. However, our algorithm is extensible, and more
patterns can easily be added in the future as the need arises. Third, there may be
situations in which association rules fail to find the correct simple property
alignment. For example, suppose the source and target ontologies contain the
properties livesIn and bornIn, respectively. If livesIn|Place and bornIn|Place
frequently co-occur, the association rules would state that livesIn|Place implies
bornIn|Place, and livesIn and bornIn would be considered equivalent. In such cases,
several methods could be applied to improve the performance, such as
lexical comparison or using an external knowledge base to annotate these entities.
Finally, we are collaborating with other benchmark and system developers to
enable such comparisons and to prepare our alignment system to participate in the
complex alignment track of the OAEI.
References
1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large
databases. In: VLDB 1994, Proceedings of 20th International Conference on Very
Large Data Bases, 12–15 September 1994, Santiago de Chile, Chile, pp. 487–499
(1994)
2. Algergawy, A., et al.: Results of the ontology alignment evaluation initiative
2018. In: Proceedings of the 13th International Workshop on Ontology Matching,
OM@ISWC 2018, Monterey, CA, USA, 8 October 2018, pp. 76–116 (2018)
3. Cheatham, M., et al.: The geolink knowledge graph. Big Earth Data 2(2), 131–143
(2018)
4. David, J., et al.: Association rule ontology matching approach. Int. J. Semantic
Web Inf. Syst. 3(2), 27–49 (2007)
5. Djeddi, W.E., et al.: XMap: results for OAEI 2018. In: Proceedings of the 13th
International Workshop on Ontology Matching, OM@ISWC 2018, Monterey, CA,
USA, 8 October 2018, pp. 210–215 (2018)
6. Doan, A., et al.: Ontology matching: a machine learning approach. In: Staab,
S., Studer, R. (eds.) Handbook on Ontologies, pp. 385–404. Springer, Heidelberg
(2004). https://doi.org/10.1007/978-3-540-24750-0_19
7. Ehrig, M., Euzenat, J.: Relaxed precision and recall for ontology matching. In: Inte-
grating Ontologies 2005, Proceedings of the K-CAP 2005 Workshop on Integrating
Ontologies, Banff, Canada, 2 October 2005
8. Euzenat, J.: Semantic precision and recall for ontology alignment evaluation. In:
IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial
Intelligence, Hyderabad, India, 6–12 January 2007, pp. 348–353 (2007)
9. Faria, D., et al.: Results of AML participation in OAEI 2018. In: Proceedings of the
13th International Workshop on Ontology Matching, OM@ISWC 2018, Monterey,
CA, USA, 8 October 2018, pp. 125–131 (2018)
10. Galárraga, L.A., et al.: AMIE: association rule mining under incomplete evidence in
ontological knowledge bases. In: 22nd International World Wide Web Conference,
WWW 2013, Rio de Janeiro, Brazil, 13–17 May 2013, pp. 413–422 (2013)
11. Han, J., et al.: Mining frequent patterns without candidate generation: a frequent-
pattern tree approach. Data Min. Knowl. Discov. 8(1), 53–87 (2004)
12. Hertling, S., Paulheim, H.: DOME results for OAEI 2018. In: Proceedings of the
13th International Workshop on Ontology Matching, OM@ISWC 2018, Monterey,
CA, USA, 8 October 2018, pp. 144–151 (2018)
13. Jiang, S., et al.: Ontology matching with knowledge rules. T. Large-Scale Data
Knowl.-Cent. Syst. 28, 75–95 (2016)
14. Jiménez-Ruiz, E., Grau, B.C., Cross, V.: LogMap family participation in the OAEI
2018. In: Proceedings of the 13th International Workshop on Ontology Matching,
OM@ISWC 2018, Monterey, CA, USA, 8 October 2018, pp. 187–191 (2018)
15. Joshi, A.K., Hitzler, P., Dong, G.: Logical linked data compression. In: Cimiano,
P., Corcho, O., Presutti, V., Hollink, L., Rudolph, S. (eds.) ESWC 2013. LNCS,
vol. 7882, pp. 170–184. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38288-8_12
16. Laadhar, A., et al.: OAEI 2018 results of POMap++. In: Proceedings of the 13th
International Workshop on Ontology Matching, OM@ISWC 2018, Monterey, CA,
USA, 8 October 2018, pp. 192–196 (2018)
17. Piatetsky-Shapiro, G.: Discovery, analysis, and presentation of strong rules. In:
Knowledge Discovery in Databases, pp. 229–248. AAAI/MIT Press (1991)
18. Portisch, J., Paulheim, H.: ALOD2Vec matcher. In: Proceedings of the 13th Inter-
national Workshop on Ontology Matching, OM@ISWC 2018, Monterey, CA, USA,
8 October 2018, pp. 132–137 (2018)
19. Ritze, D., et al.: A pattern-based ontology matching approach for detecting com-
plex correspondences. In: Proceedings of the 4th International Workshop on Ontol-
ogy Matching (OM-2009), Chantilly, USA, 25 October 2009
20. Ritze, D., et al.: Linguistic analysis for complex ontology matching. In: Proceedings
of the 5th International Workshop on Ontology Matching (OM-2010), Shanghai,
China, 7 November 2010
21. Shvaiko, P., Euzenat, J.: Ontology matching: state of the art and future challenges.
IEEE Trans. Knowl. Data Eng. 25(1), 158–176 (2013)
22. Stumme, G., Maedche, A.: FCA-MERGE: bottom-up merging of ontologies. In:
Proceedings of the Seventeenth International Joint Conference on Artificial Intel-
ligence, IJCAI 2001, Seattle, Washington, USA, 4–10 August 2001, pp. 225–234
(2001)
23. Thiéblin, É., et al.: CANARD complex matching system: results of the 2018 OAEI
evaluation campaign. In: Proceedings of the 13th International Workshop on Ontol-
ogy Matching, OM@ISWC 2018, Monterey, CA, USA, 8 October 2018, pp. 138–143
(2018)
24. Thiéblin, É., et al.: Complex matching based on competency questions for align-
ment: a first sketch. In: Proceedings of the 13th International Workshop on Ontol-
ogy Matching, OM@ISWC 2018, Monterey, CA, USA, 8 October 2018, pp. 66–70
(2018)
25. Thiéblin, É., et al.: The first version of the OAEI complex alignment benchmark.
In: Proceedings of the ISWC 2018 Posters & Demonstrations, Industry and Blue
Sky Ideas Tracks at ISWC 2018, Monterey, USA, 8–12 October 2018 (2018)
26. Thiéblin, É., et al.: Survey on complex ontology alignment. Semant. Web J. (2019,
to appear)
27. Zhou, L., Cheatham, M., Krisnadhi, A., Hitzler, P.: A complex alignment bench-
mark: geolink dataset. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol.
11137, pp. 273–288. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_17
Autonomous RDF Stream Processing
for IoT Edge Devices
1 Introduction
Over the last few years, Semantic Web technologies have provided promising
solutions for achieving semantic interoperability in the IoT (Internet of Things)
domain. Ranging from ontologies for describing streams and devices [10,11],
to continuous query processors and stream reasoning agents [8], these efforts
constitute important milestones towards the integration of heterogeneous IoT
platforms and applications. While these different technologies enable the pub-
lication of streams using semantic technologies (e.g., RDF streams), and the
querying of streaming data over ontological representations, most of them tend
to centralise the processing, relegating interactions among IoT devices simply
to data transmission. This approach may be convenient in certain scenarios
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 304–319, 2020.
https://doi.org/10.1007/978-3-030-41407-8_20
Autonomous RDF Stream Processing on IoT Edge Devices 305
where the streams, typically time-annotated RDF data, are integrated following
a top-down approach, for instance using cloud-based solutions for RDF Stream
Processing (RSP). However, in the context of IoT, decentralised integration
paradigms fit better with the distributed nature of autonomous deployments
of smart devices [22]. Moreover, moving the computation closer to the edge net-
works, such as sensor nodes or IoT gateways, will not only create more chances
to improve performance and to reduce network overhead/bottlenecks, but also to
enable flexible and continuous integration of new IoT devices/data sources [19].
Thanks to recent developments in the design of embedded devices, e.g.,
ARM boards [23], single board computers are getting cheaper and smaller while
increasing their computational power. For example, a Raspberry Pi computer costs
less than 40 EUR and is roughly the size of a credit card.
Despite the size, they are powerful enough to run a fully-functioning Linux dis-
tribution that provides both operational and deployment advantages. On the
one hand, they are both power efficient and cost-effective, while computation-
ally powerful. On the other hand, their small sizes make it easier to embed or
bundle them with other IoT devices (e.g., sensors and actuators) as a processing
gateway interfacing with outer networks, called edge devices.
RDF Stream Processing (RSP) [21] extends the RDF data model, enabling
to capture and to process heterogeneous streaming sensor sources under a uni-
fied data model. An RSP engine usually supports a continuous query language
based on SPARQL, e.g. C-SPARQL [3] and CQELS-QL [15]. Hence, an edge
device equipped with an RSP engine could play the role of an autonomous data
processing gateway. Such an autonomous gateway can coordinate the actions
with other peers connected to it to execute a data processing pipe in a collab-
orative fashion. However, to the best of our knowledge, there has not been any
in-depth study on how such a decentralised processing paradigm would work
with edge devices. In particular, an edge device has 10–100 times fewer resources
than a PC counterpart, which was originally the expected execution setting
for an RSP engine. This raises two main questions: how feasible is it to
enable such a paradigm for edge devices, and how would it affect
performance and scalability? Putting our motivation in the context of the 100 billion
ARM chips that have been shipped so far [4], enabling computational and
processing autonomy along with semantic interoperability will have a huge impact
even for a small fraction of this number of devices (e.g. 0.1% would account for
100 million devices).
To this end, this paper investigates how to realise this edge computing
paradigm by extending an RSP engine (i.e., CQELS) as a continuous query feder-
ation engine to enable a decentralised computation architecture for edge devices.
A prototype engine was implemented to empirically study the performance and
scalability aspects in “cooperative sensing” scenarios. Our experimental results
on a realistic setup in Sect. 4 show that our federation engine can
considerably scale the processing throughput of a network of edge devices by
adding more nodes on demand. We believe this is the largest
experiment setup of its kind so far. The main contributions of the paper are
summarised below:
306 M. Nguyen-Duc et al.
device, the physical limits of its hardware quickly become a bottleneck as shown
in Sect. 4. To create a more scalable processing system, we need to decentralise
the processing pipelines of similar queries to a network of edge devices connected
to these stream nodes. The following two sections describe our approach to enable
this type of network.
This federation process can be carried out dynamically thanks to the dynamic
subscription and discovery capability above. Moreover, the processing topology
of such processing pipelines, as in our experiment scenarios of Sect. 4, can be
dynamically configured by changing where and how participant nodes subscribe
themselves to the processing network. For example, we evaluated five different
federation topologies in Sect. 4. The biggest advantage of this federation
mechanism is the ability to dynamically push some processing operations closer to
the streaming nodes to alleviate the network and processing bottlenecks which
often occur at edge devices. Moreover, this mechanism can significantly improve
the processing throughput by adding more processing nodes on demand, as shown in
the experiments in Sect. 4.
Autonomous RDF Stream Processing on IoT Edge Devices 309
back to a lexical representation before sending them to the Stream Output Han-
dler. The Encoder and Decoder share the Dictionary for encoding and decoding.
Instead of using 64-bit integers for encoding nodes as in the original version
of CQELS, the Dictionary of RDF4Led uses 32-bit integers, which entails a smaller
memory footprint for cached data. Therefore, backed by RDF4Led, Fed4Edge
can process 30 million triples with only 80 MB of memory [17] on ARM computing
architectures.
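The effect of the narrower identifiers can be illustrated with a minimal sketch of dictionary encoding. This is not the RDF4Led implementation: the class `Dictionary` and the helpers `pack_triple` and `unpack_triple` are hypothetical names introduced here, and the fixed-width layout via Python's `struct` module is an assumption made only to show that a triple of 32-bit identifiers occupies half the space of a triple of 64-bit identifiers.

```python
import struct


class Dictionary:
    """Hypothetical term dictionary: lexical RDF terms <-> integer ids."""

    def __init__(self):
        self._term_to_id = {}
        self._id_to_term = []

    def encode(self, term):
        ident = self._term_to_id.get(term)
        if ident is None:
            ident = len(self._id_to_term)
            if ident > 0xFFFFFFFF:
                raise OverflowError("more distinct terms than 32-bit ids")
            self._term_to_id[term] = ident
            self._id_to_term.append(term)
        return ident

    def decode(self, ident):
        return self._id_to_term[ident]


def pack_triple(dictionary, s, p, o, fmt="<3I"):
    # "<3I" = three little-endian unsigned 32-bit integers,
    # "<3Q" = three little-endian unsigned 64-bit integers.
    ids = (dictionary.encode(s), dictionary.encode(p), dictionary.encode(o))
    return struct.pack(fmt, *ids)


def unpack_triple(dictionary, raw, fmt="<3I"):
    return tuple(dictionary.decode(i) for i in struct.unpack(fmt, raw))


d = Dictionary()
triple = (":obs1", "sosa:hasSimpleResult", '"31.5"')
raw32 = pack_triple(d, *triple)
raw64 = pack_triple(d, *triple, fmt="<3Q")
print(len(raw32), len(raw64))  # 12 24
print(unpack_triple(d, raw32) == triple)  # True
```

The saving applies only to the encoded identifiers held in buffers and caches; the dictionary itself still stores each lexical term once.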
The Buffer Manager is responsible for managing the buffered data of windows
and then feeding the data to the Dynamic Executor. Furthermore, the Buffer
Manager also manages cached data for querying and writing the static data
in the Thing Directory. Stream data is evicted from the buffer by the data
invalidating policy defined by the window operators [12,15]. Meanwhile, the
flash-aware updating algorithms of RDF4Led are reused in order to achieve fast
updating for static data [17].
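As a rough illustration of window-driven eviction, the sketch below keeps a time-based sliding window over timestamped triples. It assumes a range expressed in seconds and non-decreasing timestamps; the class `TimeWindowBuffer` is a hypothetical name and does not reflect the actual Buffer Manager code.

```python
from collections import deque


class TimeWindowBuffer:
    """Hypothetical time-based sliding window over (timestamp, triple) items."""

    def __init__(self, range_seconds):
        self.range_seconds = range_seconds
        self._items = deque()  # items are kept in arrival order

    def insert(self, timestamp, triple):
        self._items.append((timestamp, triple))
        self.evict(timestamp)

    def evict(self, now):
        # An item is invalidated once it is at least `range_seconds` old.
        cutoff = now - self.range_seconds
        while self._items and self._items[0][0] <= cutoff:
            self._items.popleft()

    def contents(self):
        return [triple for _, triple in self._items]


window = TimeWindowBuffer(range_seconds=300)  # a 5-minute window
window.insert(0, (":obs1", "sosa:hasSimpleResult", "29.0"))
window.insert(100, (":obs2", "sosa:hasSimpleResult", "30.5"))
window.insert(350, (":obs3", "sosa:hasSimpleResult", "31.0"))
print([t[0] for t in window.contents()])  # [':obs2', ':obs3']
```

Because items arrive in timestamp order, eviction only ever inspects the head of the queue, which keeps the cost per insertion small on constrained devices.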
The Dynamic Executor employs a routing-based query execution algorithm
that provides dynamic execution strategies in each node [12,13]. During the
lifetime of a continuous query, the query plan can be changed by redirecting
data flow on the routing network. The Adaptive Optimiser continuously adjusts
the query plan for efficiency according to the data distribution in each execution
step [15,17]. RDF4Led and CQELS employ a similar query execution paradigm.
While CQELS uses routing-based query execution algorithms, RDF4Led exe-
cutes SPARQL with a one-tuple-at-a-time policy. Therefore, the same cost model
of the Adaptive Optimiser can be applied when calculating the best plan for a
query that has static data patterns. The Buffer Manager treats the buffer for
join results of the static patterns as a window, and depending on the available
memory, it will apply the fresh update or incremental update policy.
The Adaptive Federator acts as the query rewriter, which adaptively divides
the input query into subqueries. For the implementation used in our experiments
in Sect. 4, the rewriter will push down operators as close to the streaming nodes
as possible by following the predicate pushdown practice in common logical
optimisation algorithms. The Thing Directory stores the metadata subscribed by the
other Fed4Edge engines (cf. Sect. 2) in the default graph. Similar to [7], such
metadata allows endpoint services of the Fed4Edge engines to be discovered via
the Adaptive Federator. When the Adaptive Federator sends out a subquery, it
notifies the Stream Input Handler to subscribe and listen to the results
returning from the subquery. The Stream Output Handler, in turn, sends
out the subqueries to other nodes or sends back the results to the requester.
the most prominent weather datasets, which contains weather observation data
from 1901 to the present, from nearly 20 K stations around the world. A weather reading
of a station produces an observation that covers measurements for temperature,
wind speed, wind gust, etc. depending on the types of sensors equipped for
that station. Each observation needs approximately 87 RDF triples to map its
values and attributes to the schema illustrated in Fig. 2. The data from different
weather stations was split across multiple devices which acted as streaming nodes
(i.e., the white nodes in Fig. 4). Each streaming node hosts a WebSocket server
which manages WebSocket stream endpoints. The data is read from CSV files in
local storage, then mapped to the RDF data schema in Fig. 2 before streaming
out.
To show the collaborative behaviour of the participant edge nodes, we used
query Q3 (as described in the example of Sect. 2) and query Q4 in Listing 6.
Q3 has aggregation and top-k operators, and Q4 includes a complex join
across windows.
SELECT ?temp ?lat ?lon ?resultTime
WHERE {
  STREAM ?streamURI [LATEST ON ssn:resultTime] {
    ?obs sosa:hasSimpleResult ?temp; sosa:resultTime ?resultTime.
    ?sensor rdf:type iot:TempSensor; made:Observation ?obs.}
  ?streamURI prov:wasGeneratedBy ?sensor. ?sensor sosa:isHostedBy ?station.
  ?station wgs84:Point ?loc. ?loc wgs84:lat ?lat; wgs84:lon ?lon.}
Listing 5. Q2: Return the location where the latest temperature is higher than 30
degrees.
Listing 6. Q4: Return the city where the temperature is higher than 30° and the wind
speed is higher than 15 km in the last 5 min.
4.2 Experiments
Baseline Calibration (Exp1): In this experiment, we calibrated the maximum
processing capability of a processing node as the baseline for the following
federation experiment. We increased the number of streaming nodes to observe the
bottleneck phenomenon whereby adding more streaming nodes decreases the
processing throughput of the network. Each streaming node streams out
recorded data at its maximum capacity. We use query Q1 and two variants of it
as the testing queries. These two variants are made by reducing its four triple
patterns to 1 and 2 patterns, respectively. The throughput is measured by
using a timing stream whereby each streaming node sends timing triples
indicating when it starts and finishes sending its data. In each test
we equally split 500 k–1 M observations among the streaming nodes and
record how much time it takes to process these observations to calculate the
average throughput. Note that we ran the streaming and processing processes on
different physical devices to avoid performance and bandwidth interference
which might have an impact on our measurements.
Fig. 3. The evaluation cluster of 85 Raspberry PI nodes
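The throughput calculation described above can be sketched as follows. This is an assumption about how the timing triples might be aggregated, not the measurement code used in the experiments: the function `average_throughput` is hypothetical, it takes one (start, finish) pair per streaming node, and the numbers in the usage example are invented for illustration.

```python
def average_throughput(timings, observations):
    """Observations per second over the whole test run.

    timings: list of (start, finish) pairs in seconds, one per streaming
             node, as reported by its timing triples (assumed format).
    observations: total number of observations split among the nodes.
    """
    start = min(s for s, _ in timings)
    finish = max(f for _, f in timings)
    elapsed = finish - start
    if elapsed <= 0:
        raise ValueError("finish time must be later than start time")
    return observations / elapsed


# Invented example: two streaming nodes, 500 000 observations in total.
timings = [(0.0, 10.0), (1.0, 12.5)]
print(average_throughput(timings, 500_000))  # 40000.0
```

Taking the earliest start and the latest finish measures the run as seen by the processing node, so a slow streaming node lowers the reported average.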
Fan-out Federation (Exp2): To test the possibility of increasing the processing
throughput by adding more edge nodes as autonomous agents to the network,
we carried out the tests on five topologies as shown in Fig. 4. The first topology
(1-hop) in Fig. 4a was the configuration that gave the peak throughput in Exp1.
Let k be the number of hops the data has to travel to reach the final destination.
We increase k to add more intermediate nodes to this topology to create new
topologies. As a result, we can recursively add n nodes to the root node (k = 2,
namely 2-hop) and then n nodes to each of the root node’s children (k = 3, namely
3-hop), whereby n is called the fanout factor (denoted as n-fanout). Then, we
have $\sum_{i=0}^{k-1} n^i$ as the number of nodes of a topology with k hops and fanout
factor n. We choose n = 2 and n = 4 (corresponding to the number of streaming
nodes at the maximum throughput reported in Exp1 below), thus, we have four
new topologies with 3, 5, 7 and 21 processing nodes in Figs. 4b, c, d, and e.
In each processing topology, each of the lowest processing nodes is connected to 4
streaming nodes. We record the throughput and delay for processing four
queries (Q1, Q2, Q3 and Q4) on these five topologies in a similar fashion to
Exp1.
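The node counts of these topologies follow directly from the fanout factor n and the number of hops k. The short sketch below recomputes them; the function names are introduced here for illustration only, and the count of 4 streaming nodes per lowest-level processing node is taken from the setup described above.

```python
def processing_nodes(n, k):
    """Number of processing nodes: sum of n**i for i = 0 .. k-1."""
    return sum(n ** i for i in range(k))


def leaf_processing_nodes(n, k):
    """Processing nodes at the lowest level, i.e. n**(k-1)."""
    return n ** (k - 1)


STREAMING_NODES_PER_LEAF = 4

for n, k in [(2, 2), (4, 2), (2, 3), (4, 3)]:
    leaves = leaf_processing_nodes(n, k)
    print(n, k, processing_nodes(n, k), leaves,
          leaves * STREAMING_NODES_PER_LEAF)
# 2 2 3 2 8
# 4 2 5 4 16
# 2 3 7 4 16
# 4 3 21 16 64
```

For the largest topology this gives 21 processing nodes fed by 64 streaming nodes, i.e. 85 devices in total.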
[Fig. 4 panels: (a) 1 node (1-hop); (b) 3 nodes (2-hop, 2-fanout); (c) 5 nodes (2-hop, 4-fanout)]
nodes. However, the increase is not consistently correlated with the total number
of processing nodes. In fact, the topology with 5 nodes in Fig. 4c gives a slightly
higher throughput than the topology with 7 nodes in Fig. 4d. This can be
explained by the fact that both topologies have 4 processing nodes at the lowest
level (called leaf processing nodes, i.e., those connected to streaming nodes) but the
data in the latter topology has to travel 1 more hop in comparison with the
former. Due to our push-down rewriting strategy presented in Sect. 3, the
two upper blue nodes in Fig. 4d did not significantly contribute to the overall
throughput but instead caused more communication overhead.
Looking closer at the reported figures, we see a high correlation between the
number of leaf processing nodes, i.e. $n^{k-1}$, and the processing throughput in
all topologies. This shows that our proposed approach is able to linearly scale
a network of IoT devices by adding more devices on demand. In particular,
a network of 21 Raspberry Pi nodes can collaboratively process up to 74 k
triples/second, or equivalently roughly 8500 sensor observations/second, that
are streamed from 64 other streaming nodes. Hence, the above 20 K weather
stations across the globe of NCDC can be queried via such a network with an
update rate of 20–30 observations per minute, which is much faster than the
highest update rate currently supported by NCDC², i.e. ASOS 1-min data. Moreover,
the processing capacity of this network is twice that of the above PC,
yet it costs only roughly half as much. Regarding energy consumption,
each Raspberry Pi only consumes around 2 W in comparison with the 240 W of the
above PC.
Figure 6b reports the average time for each observation to travel through a
processing pipeline specified by each query on different topologies, i.e., average
processing time. It shows that adding more intermediate nodes for queries Q1
and Q2 can lower the average processing time as it can reduce queuing time at
some nodes. That means communication time might be a dominant factor for
the delay in these processing pipelines. In queries Q3 and Q4, we witness a
consistent increase in processing time w.r.t. the number of hops, which is
explained by the nature of queries Q3 and Q4, which need more coordination among
nodes. However, it is interesting that adding 1 hop when organising a network
topology adds just 10–15% delay while the maximum throughput gain is linear in
$n^{k-1}$.
2
https://www.ncdc.noaa.gov/data-access/land-based-station-data.
for local decision making (potentially through reasoning) and for a resource-
optimised distribution of tasks among a set of competing/associated nodes. The
dynamics of these federated processing networks would need to adapt to changing
conditions of load, membership, throughput, and other criteria, with emerging
behaviour patterns on the sensing and processing nodes.
5 Related Work
Semantic interoperability in the IoT domain has gained considerable attention
both in the academic and industrial spheres. Beyond syntactic standards such as
SensorML, semantically rich ontologies such as SSN-O/SOSA [10] have shown a
significant impact in different IoT projects and solutions, such as OpenIoT [24],
SymbIoTe [25], or BigIoT [5]. Other related vocabularies, such as the Things-
Description ontology, have also recently gained support from different IoT ven-
dors, aiming at consolidating it as a backbone representation model for generic
IoT devices and services. Regarding the representation of data streams them-
selves, the VoCaLS vocabulary [27] has been designed as a means for the pub-
lication, consumption, and shared processing of streams. Although these ontol-
ogy resources provide different and complementary ways to represent IoT and
streaming data, they require the necessary infrastructure and software compo-
nents (or agents) able to interpret the stream metadata, and apply coordina-
tion/cooperation mechanisms for federated/decentralised processing, as shown
in this paper.
The processing of continuous streaming data structured according to Semantic
Web standards has been studied in the last decade, generally within the
fields of RDF Stream Processing (RSP) and Stream Reasoning [8]. A number of
RSP engines have been developed in this period, focusing on different aspects
including incremental reasoning, continuous querying, and complex event
processing, among others [3,6,15,20]. However, most of these RDF stream processors
lack the capability to interconnect with each other or to establish cooperation
patterns among them. The coordination among RDF stream processing
nodes is sometimes delegated to a generic cloud-based stream processing platform
such as Apache Storm (e.g., [16]) or Apache Spark (e.g., [20]). In contrast, in
this paper, we investigate a more decentralised environment whereby participant
nodes can be distributed across different organisations. Moreover, the hardware
capabilities of such processing nodes differ from the cloud-based setting,
i.e. they are resource-constrained edge devices.
Regarding the distributed processing and integration of RSP engines on a
truly decentralised architecture, different aspects and building blocks have
surfaced in recent years. Initial attempts to provide HTTP-based service
interfaces for streaming data were explored in [3]. Other contributions in this line
are the RSP Service Interface³ and the SLD Revolution framework [2]. These
propose the establishment of distributed workflows of RSP engines, using
lazy-transformation techniques for optimised interactions among the engines. Further
3
http://streamreasoning.org/resources/rsp-services.
6 Conclusion
This paper presented a continuous query federation approach that uses RSP
engines as autonomous processing agents. The approach enables the coordina-
tion of edge devices’ resources to process query processing pipelines by cooper-
atively delegating partial workload to their peer agents. We implemented our
approach as an open source engine, Fed4Edge, to conduct an empirical study in
“cooperative sensing” scenarios. The extensive experiments of the study show
that scalability can be significantly improved by adding more edge devices
to a network of processing nodes on demand. This opens several interesting
follow-up research challenges in enabling semantic interoperability for the edge
computing paradigm. Our next step will be investigating how to adaptively
optimise the distributed processing pipeline of Fed4Edge. Another interesting
step is studying how communication will affect its performance and scalability
in an Internet-scale setting whereby the processing nodes are distributed
across different networks and countries.
Acknowledgements. This work was funded in part by the German Ministry for Edu-
cation and Research as BBDC 2 - Berlin Big Data Center Phase 2 (ref. 01IS18025A),
Irish Research Council under Grant Number GOIPG/2014/917, HES-SO RCSO ISNet
grant 87057 (PROFILES), and Marie Skłodowska-Curie Programme
H2020-MSCA-IF-2014 (SMARTER project) under Grant No. 661180.
References
1. Balazinska, M., Balakrishnan, H., Stonebraker, M.: Contract-based load manage-
ment in federated distributed systems. In: NSDI 2004 (2004)
2. Balduini, M., Della Valle, E., Tommasini, R.: SLD revolution: a cheaper, faster yet
more accurate streaming linked data framework. In: ESWC (2017)
3. Barbieri, D.F., Braga, D., Ceri, S., Grossniklaus, M.: An execution environment
for C-SPARQL queries. In: EDBT 2010 (2010)
4. Enabling mass IoT connectivity as Arm partners ship 100 billion chips.
http://tiny.cc/uiefcz
5. Bröring, S., et al.: The big iot api-semantically enabling iot interoperability. IEEE
Pervasive Comput. 17(4), 41–51 (2018)
6. Calbimonte, J.-P., Corcho, O., Gray, A.J.G.: Enabling ontology-based access to
streaming data sources. In: Patel-Schneider, P.F., et al. (eds.) ISWC 2010. LNCS,
vol. 6496, pp. 96–111. Springer, Heidelberg (2010).
https://doi.org/10.1007/978-3-642-17746-0_7
4
http://w3id.org/wesp/web-data-streams.
7. Dell’Aglio, D., Della Valle, E., van Harmelen, F., Bernstein, A.: Stream reasoning:
a survey and outlook. Data Sci. 1(1), 59–83 (2017)
8. Dell’Aglio, D., Phuoc, D.L., Le-Tuan, A., Ali, M.I., Calbimonte, J.-P.: On a web
of data streams. In: DeSemWeb@ISWC (2017)
9. Grubenmann, T., Bernstein, A., Moor, D., Seuken, S.: Financing the web of data
with delayed-answer auctions. In: WWW 2018 (2018)
10. Haller, A., et al.: The modular SSN ontology: a joint W3C and OGC standard
specifying the semantics of sensors, observations, sampling, and actuation. Semant.
Web 10(1), 9–32 (2019)
11. Kaebisch, S., Kamiya, T., McCool, M., Charpenay, V.: Web of things (wot) thing
description. W3C, W3C Candidate Recommendation (2019)
12. Le-Phuoc, D.: Operator-aware approach for boosting performance in RDF stream
processing. J. Web Semant. 42, 38–54 (2017)
13. Le-Phuoc, D.: Adaptive optimisation for continuous multi-way joins over rdf
streams. In: Companion Proceedings of the the Web Conference 2018, WWW
2018, pp. 1857–1865 (2018)
14. Le-Phuoc, D., Dao-Tran, M., Le Van, C., Le Tuan, A., Manh Nguyen Duc, T.T.N.,
Hauswirth, M.: Platform-agnostic execution framework towards rdf stream pro-
cessing. In: RDF Stream Processing Workshop at ESWC2015 (2015)
15. Le-Phuoc, D., Dao-Tran, M., Parreira, J.X., Hauswirth, M.: A native and adaptive
approach for unified processing of linked streams and linked data. In: ISWC 2011,
pp. 370–388 (2011)
16. Le-Phuoc, D., Quoc, H.N.M., Van, C.L., Hauswirth, M.: Elastic and scalable pro-
cessing of linked stream data in the cloud. In: ISWC, pp. 280–297 (2013)
17. Le-Tuan, A., Hayes, C., Wylot, M., Le-Phuoc, D.: Rdf4led: An rdf engine for
lightweight edge devices. In: IOT 2018 (2018)
18. Le-Tuan, A., Hingu, D., Hauswirth, M., Le-Phuoc, D.: Incorporating blockchain
into rdf store at the lightweight edge devices. In: Semantic 2019 (2019)
19. Munir, A., Kansakar, P., Khan, S.U.: IFCIoT: integrated fog cloud iot a novel
architectural paradigm for the future internet of things. IEEE Consum. Electron.
Mag. 6(3), 74–82 (2017)
20. Ren, X., Curé, O.: Strider: a hybrid adaptive distributed RDF stream processing
engine. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS, vol. 10587, pp. 559–576.
Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68288-4_33
21. Sakr, S., Wylot, M., Mutharaju, R., Le Phuoc, D., Fundulaki, I.: Processing of
RDF Stream Data. Springer, Cham (2018)
22. Satyanarayanan, M.: The emergence of edge computing. Computer 50(1), 30–39
(2017)
23. Smith, B.: Arm and intel battle over the mobile chip’s future. Computer 41(5),
15–18 (2008)
24. Soldatos, J., et al.: Openiot: open source internet-of-things in the cloud. In: Inter-
operability and open-source solutions for the internet of things. Springer (2015)
25. Soursos, S., Žarko, I.P., Zwickl, P., Gojmerac, I., Bianchi, G., Carrozzo, G.: Towards
the cross-domain interoperability of iot platforms. In: 2016 European Conference
on Networks and Communications (EuCNC), pp. 398–402. IEEE (2016)
26. Tommasini, R., Calvaresi, D., Calbimonte, J.-P.: Stream reasoning agents: blue sky
ideas track. In: AAMAS, pp. 1664–1680 (2019)
27. Tommasini, R., et al.: Vocals: vocabulary and catalog of linked streams. In: Inter-
national Semantic Web Conference (2018)
Certain Answers to a SPARQL Query
over a Knowledge Base
1 Introduction
SPARQL is an expressive SQL-like query language designed for Semantic Web
data, exposed as RDF graphs. Recently, SPARQL has been extended with so-called
entailment regimes, which specify different semantics to query an RDFS or OWL
Knowledge Base (KB), i.e. data enriched with a background theory. This allows
retrieving answers to a query not only over the facts explicitly stated in the KB,
but more generally over what can be inferred from the KB.
The SPARQL entailment regimes are in turn largely influenced by theoretical
work on Ontology Mediated Query Answering (OMQA), notably in the field of
Description Logics (DLs). However, OMQA was initially developed for unions of
conjunctive queries (UCQs), which have a limited expressivity when compared to
SPARQL. It turns out that reconciling the standard (compositional) semantics of
SPARQL on the one hand, and the semantics used for OMQA on the other hand,
called certain answers, is non-trivial.
As an illustration, Example 1 provides a simple KB and SPARQL query. The
dataset (a.k.a. ABox) A states that Alice is a driver, whereas the background theory
(a.k.a. TBox) T states that a driver must have a license (for conciseness, we use DLs
for the TBox, rather than some concrete syntax of OWL). Finally, the SPARQL query q
retrieves all individuals that have a license.
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 320–335, 2020.
https://doi.org/10.1007/978-3-030-41407-8_21
Example 1
A = {Driver(Alice)}
T = {Driver ⊑ ∃hasLicense}
q = SELECT ?x WHERE { ?x hasLicense ?y }
Intuitively, one expects Alice to be retrieved as an answer to q. And it would indeed
be the case under certain answer semantics, if one considers the natural translation of
this query into a UCQ. On the other hand, under the standard semantics of SPARQL
1.1 [8], this query has no answer. This is expected, since the fact that Alice has a
driving license is not present in the ABox. More surprisingly though, under all SPARQL
entailment regimes [6], this query also has no answer.
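The contrast in Example 1 can be made concrete with a small sketch. It relies on the standard observation that, for KBs admitting a canonical model, certain answers to a UCQ can be read off that model; everything else is an assumption made for illustration: the single TBox axiom is hard-coded in the hypothetical function `chase`, and the labelled null `_:n1` stands for the unnamed license.

```python
from itertools import count

# ABox of Example 1: facts are (predicate, argument, ...) tuples.
abox = {("Driver", "Alice")}
named_individuals = {"Alice"}


def chase(facts):
    """Apply Driver ⊑ ∃hasLicense once (hard-coded for Example 1 only)."""
    fresh = count(1)
    result = set(facts)
    for fact in facts:
        if fact[0] == "Driver":
            individual = fact[1]
            has_license = any(
                f[0] == "hasLicense" and f[1] == individual for f in result
            )
            if not has_license:
                result.add(("hasLicense", individual, f"_:n{next(fresh)}"))
    return result


def answers_for_x(facts):
    """SELECT ?x WHERE { ?x hasLicense ?y }, keeping named individuals only."""
    return {
        f[1] for f in facts
        if f[0] == "hasLicense" and f[1] in named_individuals
    }


print(answers_for_x(abox))         # set()
print(answers_for_x(chase(abox)))  # {'Alice'}
```

Evaluating the query over the ABox alone returns nothing, while evaluating it over the completed facts and projecting away ?y returns Alice, which is the answer expected under certain answer semantics.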
This mismatch between certain answers and entailment regimes has already been
discussed in depth in [1], where the interpretation of the OPTIONAL operator of SPARQL
is identified as a challenge, when trying to define a suitable semantics for SPARQL that
complies with certain answers for UCQs. A concrete proposal is also made in [1] in this
direction. Unfortunately, this semantics does not comply with the standard semantics
of SPARQL when the TBox is empty. This means that the same query over a plain RDF
graph may yield different answers, depending on whether it is evaluated under this
semantics, or under the one defined in the SPARQL 1.1 specification [8].
We propose in this article to investigate whether and how this dilemma can be
solved, for the so-called set semantics of SPARQL and certain answers. To this end,
we first formulate in Sect. 4 some requirements to be met by any reasonable semantics
meant to reconcile certain answers and standard SPARQL answers. Then in Sect. 5, we
use these requirements to review different semantics. We also show that all requirements
can be satisfied, for the fragment of SPARQL with SELECT, UNION and OPTIONAL, and
for KBs that admit a unique canonical model. Finally, in Sect. 6, we provide combined
complexity results for query answering under this semantics, over KBs in $\text{DL-Lite}_R$,
one of the most popular DLs tailored for query answering, which corresponds to the
OWL 2 QL standard. We show in particular that upper bounds for this problem match
results already known to hold for SPARQL over plain graphs, which means that under
this semantics, and as far as worst-case complexity is concerned, the presence of a
TBox does not introduce a computational overhead. Before this, Sect. 2 introduces
preliminary notions, and Sect. 3 reviews existing semantics for SPARQL over a KB.
Proofs can be found in the extended version of this paper
(https://arxiv.org/abs/1911.02668).
2 Preliminaries
We assume countably infinite and mutually disjoint sets $N_I$, $N_C$, $N_R$, and $N_V$ of
individuals (constants), concept names (unary predicates), role names (binary predicates),
and variables respectively. We also assume a countably infinite universe U, such that
$N_I \subseteq U$. For clarity, we abstract away from concrete domains (as well as RDF term
types), since these are irrelevant to the content of this paper. We also assume that $N_I$,
$N_C$ and $N_R$ do not contain any reserved term from the RDF/RDFS/OWL vocabularies
(such as rdfs:subClassOf, owl:disjointWith, etc.).
322 J. Corman and G. Xiao
where $h$, $p_i$ are predicates and $x$, $x_i$ are tuples over $N_V$. Abusing notation, we may use $x$
(resp. $x_i$) below to designate the elements of $x$ (resp. $x_i$) viewed as a set. An additional
syntactic requirement on a CQ is that $x \subseteq x_1 \cup \ldots \cup x_m$. The variables in $x$ are called
distinguished, and we use vars(h) to designate the distinguished variables of CQ h. We
focus in this article on CQs where each $p_i$ is unary or binary, i.e. $p_i \in N_C \cup N_R$. A match
for h in an interpretation I is a total function ρ from $x_1 \cup \ldots \cup x_m$ to $\Delta^I$ such that
$\rho(x_i) \in (p_i)^I$ for $i \in \{1..m\}$. A mapping ω is an answer to h over I iff there is a match
ρ for h in I s.t. $\omega = \rho|_{vars(h)}$.
A union of conjunctive queries (UCQ) is a set $q = \{h_1, \ldots, h_n\}$ of CQs sharing the
same distinguished variables, and ω is an answer to q over I iff ω is an answer to some
$h_i$ over I. Finally, ω is a certain answer to q over a KB K iff range(ω) ⊆ aDom(K) and
ω is an answer to q over each I ∈ mod(K). We use certAns(q, K) to designate the set
of certain answers to q over K.
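The definitions of match and answer for CQs and UCQs can be turned into a brute-force evaluator over a single finite interpretation. The sketch below is illustrative only: the representation of an interpretation as a dictionary from predicate names to sets of tuples, and the function names `cq_answers` and `ucq_answers`, are assumptions. It does not compute certain answers, which quantify over all models of a KB.

```python
from itertools import product


def cq_answers(head_vars, atoms, interp):
    """Answers to the CQ h(head_vars) <- atoms over a finite interpretation.

    atoms:  list of (predicate, tuple_of_variables)
    interp: dict mapping each predicate to a set of tuples of elements
    """
    variables = sorted({v for _, vs in atoms for v in vs})
    # Every variable occurs in some atom, so elements occurring in the
    # predicate extensions are the only candidates for a match.
    domain = sorted({e for tuples in interp.values() for t in tuples for e in t})
    answers = set()
    for values in product(domain, repeat=len(variables)):
        rho = dict(zip(variables, values))  # candidate match
        if all(tuple(rho[v] for v in vs) in interp.get(p, set())
               for p, vs in atoms):
            # Restrict the match to the distinguished variables.
            answers.add(frozenset((v, rho[v]) for v in head_vars))
    return answers


def ucq_answers(cqs, interp):
    """An answer to a UCQ is an answer to at least one of its CQs."""
    return set().union(*(cq_answers(h, a, interp) for h, a in cqs))


interp = {"Driver": {("Alice",)}, "hasLicense": {("Alice", "l1")}}
cq = (("x",), [("hasLicense", ("x", "y"))])
print(ucq_answers([cq], interp) == {frozenset({("x", "Alice")})})  # True
```

The enumeration is exponential in the number of variables, which is acceptable for a didactic example but not for real query answering.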
CQs and UCQs have a straightforward representation as SPARQL queries. The CQ
$h(x) \leftarrow p_1(x_1), \ldots, p_m(x_m)$ in SPARQL syntax is written:
$h_1$ union .. union $h_n$
1. Triple patterns are not evaluated over the ABox A, but instead over the so-called
entailed graph, which consists of all ABox assertions entailed by K. This includes
assertions of the form C(a), where C is a complex concept expression allowed in
L. The semantics of other SPARQL operators is preserved.
2. The SPARQL query can use L-concepts in triple patterns, e.g. ∃hasLicense(x).
Consider again Example 1 under the OWL 2 QL entailment regime for instance, which
corresponds (roughly) to the DL $\text{DL-Lite}_R$. In this example, the query ∃hasLicense(x)
has {x → Alice} as unique answer: since the entailed graph contains all ABox
assertions entailed by K, it contains the assertion ∃hasLicense(Alice) (again, we use the
DL syntax rather than OWL, for readability).
So the expressivity of the L-entailment regime is limited by the concepts that can
be expressed in L. This is why [10] proposed to extend the semantics of the OWL 2 QL
profile, retrieving instances of concepts that cannot be expressed in $\text{DL-Lite}_R$ (e.g.
concepts of the form $\exists r_1.\exists r_2$). Still, under this semantics as well as all entailment regimes
defined in the specification, the query select{x} hasLicense(x, y) has no answer over
the KB of Example 1, because the entailed graph does not contain any assertion of the
form hasLicense(Alice, e).
This point was discussed in depth in [1], for the SUJO fragment, and based on
remarks made earlier in [2]. The current paper essentially builds upon this discussion,
which is why we reproduce it below. A first remark made in [2] and [1] is that the opt
operator of SPARQL prevents the usage of certain answers, even when querying a plain
graph (or equivalently, a KB with empty TBox). This can be seen with Example 2.
Example 2
A = {Person(Alice)}
q = Person(x) opt hasLicense(x, y)
Then in [2] and [1] still, the authors remark that in this example, ω can nonetheless
be extended to an answer in every model of ⟨∅, A⟩. This is the main intuition used in [1]
to adapt the definition of certain answers to SPARQL queries with opt. If q is a query
and I an interpretation, let eAns(q, I) designate all mappings that can be extended to
an answer to q in I, i.e.:
$\text{eAns}(q, I) = \{\omega \mid \omega \preceq \omega' \text{ for some } \omega' \in \text{sparqlAns}(q, I)\}$
Then if K is a KB, the set eCertAns(q, K) of mappings that can be extended to an
answer in every model of K is defined as:
$\text{eCertAns}(q, K) = \bigcap_{I \in \text{mod}(K)} \text{eAns}(q, I)$
But as pointed out in [1], eCertAns(q, K) does not comply with SPARQL answers over
a plain graph (i.e. when the TBox is empty). Indeed, if some ω can be extended to
an answer in every model of the KB, then this is also the case of any mapping that
ω extends (e.g. trivially the empty mapping). So in Example 2, eCertAns(q, ⟨∅, A⟩) =
{{}, {x → Alice}}, whereas sparqlAns(q, A) = {{x → Alice}}.
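The behaviour of eAns and eCertAns on Example 2 can be reproduced with a small sketch. One caveat is essential: eCertAns is defined over all models of the KB, whereas the code below intersects over two hand-picked finite interpretations only (the ABox itself and the ABox extended with one license), which is an assumption that happens to suffice for this example. All function names are hypothetical.

```python
from itertools import combinations


def atom(pred, variables, interp):
    """Solution mappings for one atom (no repeated variables assumed)."""
    return [dict(zip(variables, t)) for t in interp.get(pred, set())]


def compatible(w1, w2):
    return all(w2[v] == val for v, val in w1.items() if v in w2)


def opt(left, right):
    """Left join: keep each left mapping, extended whenever possible."""
    out = []
    for w1 in left:
        extended = [{**w1, **w2} for w2 in right if compatible(w1, w2)]
        out.extend(extended if extended else [w1])
    return out


def e_ans(answers):
    """All mappings that some answer extends, i.e. all restrictions."""
    result = set()
    for w in answers:
        items = list(w.items())
        for size in range(len(items) + 1):
            result.update(frozenset(sub) for sub in combinations(items, size))
    return result


def q(interp):
    # q = Person(x) opt hasLicense(x, y)
    return opt(atom("Person", ("x",), interp),
               atom("hasLicense", ("x", "y"), interp))


models = [
    {"Person": {("Alice",)}},                                   # the ABox
    {"Person": {("Alice",)}, "hasLicense": {("Alice", "l1")}},  # one more model
]
e_cert = set.intersection(*(e_ans(q(m)) for m in models))
print(e_cert == {frozenset(), frozenset({("x", "Alice")})})  # True
print(q(models[0]))  # [{'x': 'Alice'}]
```

The empty mapping survives the intersection because it is a restriction of every answer, which is exactly why eCertAns departs from the SPARQL answers over the plain graph.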
The semantics proposed in [1] is designed to solve this issue. The precise scope of
the proposal is so-called well-designed SUJO queries (see [14] for a definition), in some
normal form (no union in the scope of select, join or opt, no select in the scope of
join or opt, and no opt in the scope of join).¹ Given a KB K, the solution consists in
retaining, for each maximal SJO subquery q′, the maximal elements of eCertAns(q′, K)
w.r.t. $\preceq$. An additional restriction is put on the domain of such solution mappings,
based on the so-called pattern-tree representation (defined in [12]) of well-designed
SJO queries. The union operator on the other hand is evaluated compositionally, as
in Definition 1.
But as illustrated by the authors, this proposal does not comply with the standard
semantics for SPARQL over plain graphs. Example 3 below reproduces the one given
in [1, Example 4]:
Example 3
A = {teachesTo(Alice, Bob), knows(Bob, Carol), teachesTo(Alice, Dan)}
q = select{x,z} (teachesTo(x, y) opt knows(y, z))
In this example, sparqlAns(q, A) = {{x → Alice, z → Carol}, {x → Alice}}.
Instead, the semantics proposed in [1] yields {{x → Alice, z → Carol}}.
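The standard SPARQL answers of Example 3 can be checked mechanically with a minimal set-semantics evaluator for atoms, opt and select. The evaluator is a sketch written for this example only; the function names are hypothetical, and it does not implement the semantics proposed in [1].

```python
def atom(pred, variables, graph):
    """Solution mappings for one atom (no repeated variables assumed)."""
    return [dict(zip(variables, t)) for t in graph.get(pred, set())]


def compatible(w1, w2):
    return all(w2[v] == val for v, val in w1.items() if v in w2)


def opt(left, right):
    """Left join: keep each left mapping, extended whenever possible."""
    out = []
    for w1 in left:
        extended = [{**w1, **w2} for w2 in right if compatible(w1, w2)]
        out.extend(extended if extended else [w1])
    return out


def select(variables, mappings):
    """Projection under set semantics; unbound variables are simply absent."""
    return {frozenset((v, w[v]) for v in variables if v in w)
            for w in mappings}


A = {
    "teachesTo": {("Alice", "Bob"), ("Alice", "Dan")},
    "knows": {("Bob", "Carol")},
}
# q = select{x,z} (teachesTo(x, y) opt knows(y, z))
answers = select(("x", "z"),
                 opt(atom("teachesTo", ("x", "y"), A),
                     atom("knows", ("y", "z"), A)))
expected = {
    frozenset({("x", "Alice"), ("z", "Carol")}),
    frozenset({("x", "Alice")}),
}
print(answers == expected)  # True
```

The mapping {x → Alice} comes from the triple teachesTo(Alice, Dan), for which the optional part has no match; it is not merely a restriction of the other answer.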
Section 5.3 below defines a different semantics for evaluating a SPARQL query over a
KB, which coincides not only with certain answers for UCQs (as opposed to the SPARQL
entailment regimes and [10]), but also with the SPARQL specification in the case where
the TBox is empty (as opposed to the proposal made in [1]).
Before continuing, other works need to be mentioned, even though they are not
immediately related to the problem addressed in this paper. First, a modification of
the entailment regimes’ semantics was proposed in [11] for the SJO fragment extended
with the SPARQL FILTER operator. For DLs with negation, it consists in ruling out a
partial solution mapping if it cannot be extended to an answer in any model of the
KB. Finally, another topic of interest when it comes to SPARQL and certain answers,
but which falls out of the scope of this paper, is the treatment of blank nodes, discussed
in the specification of SPARQL entailment regimes [6], and more recently in [7] and [9].
1
This is without loss of expressivity, but normalization may cause an exponential
blowup.
326 J. Corman and G. Xiao
4 Requirements
As seen in the previous section, existing semantics for sparql answers over a KB fail
to comply either with certain answers (for the fragment of sparql that corresponds to
UCQs), or with sparql answers over a plain graph when the TBox is empty.
We will show in Sect. 5 that these two requirements are compatible for some DLs
and fragments of sparql. But first, in this section, we formalize these two requirements,
as properties to be met by any semantics whose purpose is to reconcile certain
answers and sparql answers. We also define three additional requirements (called opt
extension, variable binding and binding provenance), which generalize to KBs some
basic properties of sparql answers over plain graphs. We note that these requirements
apply to arbitrary DLs, whereas Sect. 5 focuses instead on specific families of DLs.
If q is a sparql query and K a KB, we use ans(q, K) below to denote the answers
to q over K under some (underspecified) semantics. This allows us to define properties
to be met by such a semantics.
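Requirement 1 states that ans(q, K) should coincide with certain answers for UCQs:
ans(q, K) = certAns(q, K)
Requirement 2 states that ans(q, K) should coincide with sparql answers over a plain
graph when the TBox is empty:
ans(q, ⟨∅, A⟩) = sparqlAns(q, A)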
As will be seen in the next section, it is possible to define semantics that verify
Requirements 1 and 2, but fail to comply with basic properties of sparql answers over
a plain graph. This is why we define additional requirements.
First, as observed in [11] for instance, the opt operator of sparql was introduced
to “not reject the solutions because some part of the query pattern does not match” [8].
Or in other words, for each answer ω to the left operand of an opt, either ω or some
extension of ω is expected to be present in the answers to the whole expression. Let
⪯g be the partial order over sets of solution mappings defined by Ω1 ⪯g Ω2 iff, for
each ω1 ∈ Ω1, there is an ω2 ∈ Ω2 s.t. ω2 extends ω1. Then this property is expressed with
Requirement 3.
Another important property of sparql answers over plain graphs pertains to bound
variables. Indeed, a sparql query q (with union and/or opt) may allow partial solution
mappings, i.e. whose domain does not cover all variables projected by q. For instance, in
Example 2, ω = {x → Alice} ∈ sparqlAns(q, A), even though the variables projected
by q are x and y. In such a case, we say that variable x is bound by ω, whereas
variable y is not. Then a sparql query may only admit answers that bind certain sets
of variables. For instance the query A(x) opt (R(x, y) join R(y, z)) admits answers
that bind either {x} or {x, y, z}. But it does not admit answers that bind another
Certain Answers to a SPARQL Query over a Knowledge Base 327
set of variables ({y},{x, y}, etc.). So a natural requirement when generalizing sparql
answers to KBs is to respect such constraints. We say that a set X of variables is
admissible for a query q iff there exists a graph A and solution mapping ω s.t. ω ∈
sparqlAns(q, A) and dom(ω) = X. Unfortunately, for queries with OPTIONAL, whether
a given set of variables is admissible for a given query is undecidable. So we adopt
instead a relaxed notion of admissible bindings. For a SUJO query q, we use adm(q) to
denote the family of sets of variables defined inductively as follows:
Definition 2 (Definition of adm(q) for the SUJO fragment)
If q is a triple pattern, then adm(q) = {vars(q)}
adm(selectX q) = { X ∩ X′ | X′ ∈ adm(q) }
adm(q1 join q2 ) = { X1 ∪ X2 | (X1 , X2 ) ∈ adm(q1 ) × adm(q2 ) }
adm(q1 opt q2 ) = adm(q1 ) ∪ adm(q1 join q2 )
adm(q1 union q2 ) = adm(q1 ) ∪ adm(q2 )
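The inductive definition of adm(q) translates directly into a recursive function. The sketch below assumes a hypothetical encoding of queries as nested tuples, with variables written with a leading "?"; only the five cases of Definition 2 are taken from the text.

```python
def adm(q):
    """Family of admissible variable sets of a SUJO query (Definition 2)."""
    kind = q[0]
    if kind == "triple":                  # ("triple", predicate, term, ...)
        return {frozenset(t for t in q[2:] if t.startswith("?"))}
    if kind == "select":                  # ("select", X, q1)
        _, projected, q1 = q
        return {frozenset(projected) & x for x in adm(q1)}
    if kind == "join":                    # ("join", q1, q2)
        _, q1, q2 = q
        return {x1 | x2 for x1 in adm(q1) for x2 in adm(q2)}
    if kind == "opt":                     # ("opt", q1, q2)
        _, q1, q2 = q
        return adm(q1) | adm(("join", q1, q2))
    if kind == "union":                   # ("union", q1, q2)
        _, q1, q2 = q
        return adm(q1) | adm(q2)
    raise ValueError(f"unsupported operator: {kind!r}")


# A(x) opt (R(x, y) join R(y, z))
q_opt = ("opt",
         ("triple", "A", "?x"),
         ("join", ("triple", "R", "?x", "?y"), ("triple", "R", "?y", "?z")))

# A(x) union R(x, y)
q_union = ("union", ("triple", "A", "?x"), ("triple", "R", "?x", "?y"))

adm_opt = adm(q_opt)      # the sets {x} and {x, y, z}
adm_union = adm(q_union)  # the sets {x} and {x, y}
```

Note that the family can grow exponentially with the number of nested opt operators, so this direct enumeration is only meant for small queries.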
We can now formulate the corresponding requirement:
Requirement 4 (Variable binding). For any SUJO query q, KB K and ω ∈ ans(q, K):
dom(ω) ∈ adm(q)
This constraint on variable bindings is still arguably weak though, if one considers
queries with union. Take for instance the query q = A(x) union R(x, y). Then
adm(q) = {{x}, {x, y}}. But the semantics of sparql over plain graphs puts a stronger
requirement on variable bindings. If ω is a solution to q, then ω may bind {x} only if
ω is an answer to the left operand A(x), and ω may bind {x, y} only if ω is an answer
to the right operand R(x, y). It is immediate to see that Requirement 4 on variable
bindings does not enforce this property. So we add as a simple fifth requirement:
5 Semantics
We now investigate different semantics for answering sparql queries over a KB, in view
of the requirements expressed in the previous section. We note that each semantics
is defined for a specific fragment of sparql only, and that this is also the case of
Requirements 1, 4 and 5 (the other two requirements are defined for arbitrary sparql
queries). So when we say below that a semantics defined for fragment L1 satisfies a
requirement defined for fragment L2 , this means that the requirement holds for the
fragment L1 ∩ L2 .
Section 5.1 shows that adopting a compositional interpretation of certain answers,
analogous to sparql entailment regimes (restricted to SUJO queries), is sufficient to
satisfy Requirement 2, but fails to satisfy Requirement 1 for the SJ and U fragments
already. Section 5.2 focuses on DLs with the canonical model property. For these, we
consider generalizing a well-known property of certain answers to UCQs: they are
equivalent to answers over the canonical model, but restricted to those that range over
the active domain of the KB. We show that this solution satisfies Requirements 1 and 2
for the SUJO fragment, but fails to satisfy Requirement 3 for the O fragment already.
Finally, Sect. 5.3 builds upon this last observation, and shows that it is possible to
define a semantics that satisfies all requirements for the SUJO fragment.
Table 1 summarizes our observations (for KBs with the canonical model property
only), together with observations about the proposal made in [1] (discussed in Sect. 3).
Example 5
A = {Driver(Alice)}
T = {Driver ⊑ ∃hasLicense}
q = select{x} (Driver(x) join hasLicense(x, y))
Then certAns(q, ⟨T, A⟩) = {{x → Alice}}, but eRAns(q, ⟨T, A⟩) = ∅.
So entailment regime answers fail to satisfy Requirement 1 for the U and SJ frag-
ments already.
Proposition 1 states that canonical answers comply with sparql answers over a
plain graph (Requirement 2).
Proposition 1. For any SUJO query q and Lcan KB K, canAns(q, K) satisfies Require-
ment 2.
From the observation made above, canonical answers also comply with certain
answers for UCQs (Requirement 1). But they fail to satisfy opt extension (Require-
ment 3), as illustrated with Example 6.
Example 6
A = {Driver(Alice)}
T = {Driver ⊑ ∃hasLicense}
q = Driver(x) opt hasLicense(x, y)
In this example, let K = ⟨T, A⟩. Then canAns(Driver(x), K) = {{x → Alice}}.
However, sparqlAns(q, can(K)) = {{x → Alice, y → e}}, for some e ∉ aDom(K). Therefore
canAns(q, K), which retains only the mappings of sparqlAns(q, can(K)) that range over
aDom(K), is empty. So the mapping {x → Alice} ∈ canAns(Driver(x), K) has no extension in
canAns(q, K), which immediately violates Requirement 3.
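The violation in Example 6 can be checked mechanically. The sketch below hard-codes the answers over the canonical model given in the example, and assumes that canonical answers keep exactly the mappings whose values all lie in the active domain; the function names and the constant "_e" standing for the anonymous element e are hypothetical.

```python
def range_in_active_domain(mappings, active_domain):
    """Keep only the mappings whose values all belong to the active domain."""
    return [w for w in mappings if set(w.values()) <= active_domain]


def extends(w2, w1):
    """True iff w2 agrees with w1 on every variable bound by w1."""
    return all(v in w2 and w2[v] == c for v, c in w1.items())


def opt_extension_holds(left_answers, whole_answers):
    """Requirement 3: each left answer has an extension among the answers."""
    return all(any(extends(w2, w1) for w2 in whole_answers)
               for w1 in left_answers)


active_domain = {"Alice"}

# Answers over can(K), as stated in Example 6 ("_e" is not in the active domain).
left_over_can = [{"?x": "Alice"}]                  # Driver(x)
whole_over_can = [{"?x": "Alice", "?y": "_e"}]     # Driver(x) opt hasLicense(x, y)

can_left = range_in_active_domain(left_over_can, active_domain)
can_whole = range_in_active_domain(whole_over_can, active_domain)

requirement_3_ok = opt_extension_holds(can_left, can_whole)
```

With these inputs `can_whole` is empty while `can_left` still contains {x → Alice}, so `requirement_3_ok` is false.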
We can now generalize maximal admissible canonical answers to the SUJO fragment:
It can be easily verified that Definitions 6 and 8 coincide for SJO queries, since
in this case branch(q) = {q}. Proposition 2 shows that maximal admissible canonical
answers satisfy all requirements expressed in the previous section.
Table 2. Combined complexity of evalsparqlAns and evalmCanAns . “-c” stands for com-
plete, and “A/B” for all fragments between A and B.
6 Complexity
We now provide complexity results for query answering under the semantics defined
in Sect. 5.3, for different sub-fragments of the SUJO fragment, and focusing on KBs in
DL-LiteR [3], a DL tailored for query answering, which corresponds to the owl 2 ql
profile. As is conventional, we focus on the decision problem for query answering, i.e.
the problem evalmCanAns below. We also focus on combined complexity, i.e. measured in
the size of the whole input (KB and query), and leave data complexity (parameterized
either by the size of the query, or of the query and TBox) as future work.
evalmCanAns
Input: DL-LiteR KB K, query q, mapping ω
Decide: ω ∈ mCanAns(q, K)
Complexity of sparql query evaluation over plain graphs has been extensively
studied (see [13] for a recent overview). When these results are tight, they provide us
immediate lower bounds. Indeed, from Proposition 1, certain canonical answers satisfy
Requirement 2, so evalmCanAns is at least as hard as the problem evalsparqlAns below.
evalsparqlAns
Input: graph A, query q, mapping ω
Decide: ω ∈ sparqlAns(q, A)
compared to sparql answers over a plain graph. This observation is analogous to well-
known results for answering UCQs under certain-answer semantics over a DL-LiteR
KB [5], which matches the (NP) upper bound for UCQs over a plain graph.
Before explaining these results, we isolate a key observation:
Proposition 3. If q is a JO query and X1, X2 ⊆ vars(q), then it can be decided in
O(|q|²) whether X1 ∈ max⊆(adm(q) ∩ 2^X2).
The induction guarantees that |min⊆(base(q))| = 1, so that |base(q)| = O(|q|) must
hold. Then in order to decide X1 ∈ max⊆(adm(q) ∩ 2^X2), it is sufficient to: (i) check
whether X1 ∈ adm(q), i.e. check whether X1 ⊆ ⋃{B ∈ base(q) | B ⊆ X1}, and (ii)
check whether there is an X′ ∈ adm(q) ∩ 2^X2 s.t. X1 ⊊ X′. This is the case iff there is
a B ∈ base(q) s.t. X1 ⊊ X1 ∪ B ⊆ X2.
We note that from the definition of adm(q), this property is independent from
the semantics under investigation, so it holds for sparql over a plain graph. It also
follows that deciding whether X ∈ adm(q) for an arbitrary X and JO query q is
tractable (consider the case where X1 = X2 ). Interestingly, this does not hold for the
UJ fragment already. Indeed, immediately from the reduction used in [15] for hardness
of evalsparqlAns in this fragment, deciding X ∈ adm(q) for any X and UJ query q is
NP-hard (we refer to the extended version of this paper for details).
We now sketch the argument used to derive upper bounds for the SUJO, well-
designed SJO* and UJ fragments (proofs can be found in the extended version). For
simplicity, we focus on the well-designed SJO* fragment. The argument for queries with
union is similar, but with additional technicalities, because the definition of certain
canonical answers in this case is more involved (compare Definitions 6 and 8 above).
We also simplify the explanation by assuming that the Gaifman graph of the query is
connected. If G is a graph, we will use V (G) below to designate its vertices.
From the definition of evalmCanAns, ⟨K, q, ω⟩ is a positive instance iff ω ∈
mCanAns(q, K), i.e. iff there is an ω′ s.t. (i) ω = ω′|X for some X ∈ max⊆(adm(q) ∩
2^Y), where Y is the set of variables that ω′ maps into aDom(K), and (ii) ω′ ∈
sparqlAns(q, can(K)).
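Condition (i) can be illustrated with a small sketch that, given one answer over the canonical model, returns its restrictions to the maximal admissible sets of variables bound to active-domain elements. This is one reading of the condition, stated as an assumption: the exponent is taken to be the set of variables that the mapping sends into aDom(K). The family adm(q) is passed in explicitly, and the function name is hypothetical.

```python
def maximal_admissible_restrictions(omega_prime, adm_q, active_domain):
    """Restrictions of omega_prime to the maximal admissible variable sets.

    Assumption of this sketch: only variables that omega_prime maps to
    active-domain elements may be kept.
    """
    kept = frozenset(v for v, c in omega_prime.items() if c in active_domain)
    candidates = [x for x in adm_q if x <= kept]
    maximal = [x for x in candidates if not any(x < y for y in candidates)]
    return [{v: omega_prime[v] for v in x} for x in maximal]


# Data of Example 6: q = Driver(x) opt hasLicense(x, y)
adm_q = {frozenset({"?x"}), frozenset({"?x", "?y"})}
omega_prime = {"?x": "Alice", "?y": "_e"}   # "_e" is outside the active domain
restricted = maximal_admissible_restrictions(omega_prime, adm_q, {"Alice"})
```

Under this reading, the answer {x → Alice, y → e} over the canonical model is trimmed to {x → Alice} rather than discarded, which is the behaviour that opt extension (Requirement 3) asks for.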
So a (non-deterministic) procedure to decide whether ω ∈ mCanAns(q, K) consists
in guessing an extension ω′ of ω, then verifying (i), and then verifying (ii). From Proposition 3
above, (i) can be verified in O(|q|²). For (ii), if ω′ ∈ sparqlAns(q, can(K)), from
well-known properties of can(K) for DL-LiteR, it can be shown that:
– there must exist a subgraph G of can(K) s.t. V (G) ∩ V (A) = ∅, and the size of the
subgraph of G induced by V (G) \ V (A) is linearly bounded by max(|q|, |T |).
– for each maximal connected subgraph G′ of G s.t. V(G′) ∩ V(A) = ∅, it can be
verified in O((|G′| + |T|) · |T|) whether G′ is a subgraph of can(K).
References
1. Ahmetaj, S., Fischl, W., Pichler, R., Šimkus, M., Skritek, S.: Towards reconciling
SPARQL and certain answers. In: Proceedings of the 24th International Conference
on World Wide Web, pp. 23–33. ACM (2015)
2. Arenas, M., Pérez, J.: Querying semantic web data with SPARQL. In: Proceedings
of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of
Database Systems, pp. 305–316. ACM (2011)
3. Artale, A., Calvanese, D., Kontchakov, R., Zakharyaschev, M.: The DL-Lite family
and relations. J. Artif. Intell. Res. 36, 1–69 (2009)
4. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P. (eds.):
The Description Logic Handbook: Theory, Implementation, and Applications.
Cambridge University Press, Cambridge (2003)
5. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable
reasoning and efficient query answering in description logics: the DL-Lite family.
J. Autom. Reason. 39(3), 385–429 (2007)
6. Glimm, B., Ogbuji, C.: SPARQL 1.1 entailment regimes. Technical report, W3C,
March 2013
7. Gutierrez, C., Hernández, D., Hogan, A., Polleres, A.: Certain answers for
SPARQL? In: AMW (2016)
8. Harris, S., Seaborne, A., Prud’hommeaux, E.: SPARQL 1.1 query language. W3C
recommendation, W3C (2013)
9. Hernández, D., Gutierrez, C., Hogan, A.: Certain answers for SPARQL with blank
nodes. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11136, pp. 337–353.
Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00671-6_20
10. Kontchakov, R., Rezk, M., Rodríguez-Muro, M., Xiao, G., Zakharyaschev, M.:
Answering SPARQL queries over databases under OWL 2 QL entailment regime.
In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 552–567. Springer,
Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_35
11. Kostylev, E.V., Cuenca Grau, B.: On the semantics of SPARQL queries with
optional matching under entailment regimes. In: Mika, P., et al. (eds.) ISWC 2014.
LNCS, vol. 8797, pp. 374–389. Springer, Cham (2014). https://doi.org/10.1007/
978-3-319-11915-1_24
12. Letelier, A., Pérez, J., Pichler, R., Skritek, S.: Static analysis and optimization of
semantic web queries. ACM Trans. Database Syst. (TODS) 38(4), 25 (2013)
13. Mengel, S., Skritek, S.: On tractable query evaluation for SPARQL. arXiv preprint
arXiv:1712.08939 (2017)
14. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM
Trans. Database Syst. (TODS) 34(3), 16 (2009)
15. Schmidt, M., Meier, M., Lausen, G.: Foundations of SPARQL query optimization.
In: Proceedings of the 13th International Conference on Database Theory, pp. 4–33.
ACM (2010)
16. Xiao, G., et al.: Ontology-based data access: a survey. In: Proceedings of the
Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-18,
International Joint Conferences on Artificial Intelligence Organization, pp. 5511–
5519, July 2018
External Knowledge-Based Weakly Supervised
Learning Approach on Chinese Clinical Named
Entity Recognition
{bin.dong,shanshan.jiang}@srcb.ricoh.com
1 Introduction
Building a named entity recognition system for clinical text would not only greatly
benefit medical workers in their daily work, but also help to construct large-scale
medical knowledge graphs and perform other downstream tasks such as relation extraction
and knowledge graph reasoning. Since electronic health records are a kind of text with
very strong domain features, the entity recognition task in clinical texts faces even more
challenges than in other domains [1]. Firstly, richly labeled clinical corpora are relatively
difficult to obtain, due to the lack of uniform standards for labeling Chinese clinical health
records [2] and because medical data labeling requires vast human labor [3]. Secondly, clinical text
contains abundant professional medical knowledge, such as rules, dictionaries and so
on. Using external knowledge to guide the entity extraction has been proved an effective
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 336–352, 2020.
https://doi.org/10.1007/978-3-030-41407-8_22
way to pursue higher performance, especially when the supervision signal is weak, and
many works [4–6] have applied such approaches successfully.
Traditional named entity recognition (NER) approaches can be categorized as rule
based [7, 8], dictionary based [9, 10] and machine learning based [11–13] methods.
With the rapid progress of deep learning technology, NER systems based on deep neural
networks (DNNs) have achieved remarkable success. Some end-to-end models based on
recurrent neural networks (RNNs), especially bi-directional long short-term memory
conditional random field (Bi-LSTM-CRF) models, have achieved state-of-the-art results
[14, 15]. To adapt these methods to a specific domain such as medicine and health,
effective techniques include domain-specific feature engineering [16, 17], multi-model
ensembles [18] and the incorporation of external knowledge [19].
Most successful techniques, such as deep learning [20], require large-scale ground-truth
labeled training data. However, it can be difficult to obtain strong supervision in many
tasks due to the high cost of the data labeling process. Thus, it is desirable to enable machine
learning techniques to work with weak supervision. Typically, weak supervision is
classified into three types: incomplete supervision, inexact supervision and inaccurate
supervision [21]. However, it is difficult for a deep learning NER model to reach ideal
performance with only weak supervision, because of the lack of semantic knowledge and
insufficient domain information. At the same time, the lack of training data can easily
lead to over-fitting of the model, which results in low generalization ability. For example,
consider the sentence “患者出现恶心、呕吐、腹痛、腹泻, 间断服用奥美拉唑等药物。”, which
means “The patient had symptoms such as nausea, vomiting, abdominal pain, diarrhea,
and intermittently used omeprazole.” When we use only a small amount of training data to
train a model, the corresponding entity recognition results are as shown in Table 1.
Table 1. Recognition results of NER model with small amount training data.
symptom
2 Base Model
Named entity recognition can be regarded as a sequence labeling task. Unlike
English texts, Chinese texts do not have clear word boundaries, and according to our statistics
on the CCKS-2018 CNER dataset, entity mismatches caused by word segmentation account
for more than 25% of all the labeled entities. We therefore perform the sequence labeling task
at the character level to alleviate such mismatch errors.
We use the BIO (Begin, Inside, Outside) tagging scheme to tag the data into sequences.
Our goal is to predict, for each character of a given sentence s = {w1, w2, ···, wn}, a label
in the same BIO tagging scheme that indicates the entity type and its boundary. The
tagging scheme is shown in Table 2.
Text    上       腹       部       疼              痛              。
Label   B-body   I-body   I-body   B-description   I-description   O
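The character-level BIO scheme can be reproduced with a short sketch that converts entity spans into one tag per character. The span format (start, end, type) with an exclusive end index and the function name are assumptions of this illustration.

```python
def bio_tags(text, entities):
    """Character-level BIO tags from (start, end, type) spans, end exclusive."""
    tags = ["O"] * len(text)
    for start, end, entity_type in entities:
        tags[start] = f"B-{entity_type}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"          # remaining characters
    return tags


sentence = "上腹部疼痛。"
spans = [(0, 3, "body"), (3, 5, "description")]
tags = bio_tags(sentence, spans)
```

For this sentence, `tags` matches the label row of the table above, with the final punctuation mark tagged O.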
The NER model consists of three key components: an embedding layer, a Bi-LSTM layer
and a CRF layer. In the embedding layer, we encode every Chinese character into a tensor
representation that concatenates a 768-dimensional pre-trained embedding, denoted
as e_bert, and a 100-dimensional word2vec [22] embedding, denoted as e_w2v. In order to get
a better context model of texts, we use BERT [23] for the pre-trained embedding. Then
the sequence of tensors is fed into a Bi-LSTM neural network in order to capture the
contextual features. Finally, a conditional random field (CRF) layer is used to capture
the dependencies in tagging and determine the best tagging sequence for the sentence.
Since the effectiveness of BERT pre-training has already been demonstrated, we use BERT
pre-training as part of the character embedding of the input tensor in our base model.
In addition, to improve performance, we concatenate the 100-dimensional word2vec tensor
to the BERT embedding. This form of representation is inspired by recent successful work
[24]. In our experiments, we observed that this hybrid embedding yields
better representations than either the BERT embedding or the word2vec embedding
alone.
The Bi-LSTM network incorporates a gated memory-cell to capture long-range
dependencies within the input tensors along both forward and backward directions.
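For each position t, the LSTM computes $h_t$ from the input $e_t$ and the previous state $h_{t-1}$:
\[ i_t = \sigma(W^{i} e_t + U^{i} h_{t-1} + b^{i}) \tag{2} \]
\[ f_t = \sigma(W^{f} e_t + U^{f} h_{t-1} + b^{f}) \tag{3} \]
\[ \tilde{c}_t = \tanh(W^{c} e_t + U^{c} h_{t-1} + b^{c}) \tag{4} \]
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{5} \]
\[ o_t = \sigma(W^{o} e_t + U^{o} h_{t-1} + b^{o}) \tag{6} \]
\[ h_t = o_t \odot \tanh(c_t) \tag{7} \]
where $h, i, f, o \in \mathbb{R}^{d_h}$ are the $d_h$-dimensional hidden state, input gate, forget gate
and output gate. The trainable parameters of the LSTM are $W^{i}, W^{f}, W^{c}, W^{o} \in \mathbb{R}^{4d_h \times d_e}$,
$U^{i}, U^{f}, U^{c}, U^{o} \in \mathbb{R}^{4d_h \times d_h}$ and $b^{i}, b^{f}, b^{c}, b^{o} \in \mathbb{R}^{4d_h}$. $\sigma$ is the sigmoid function, and
$\odot$ denotes the elementwise product. The Bi-LSTM computes both directions, left $\overrightarrow{h}_t$ and right
$\overleftarrow{h}_t$, and the final output is:
\[ h_t = \overrightarrow{h}_t \oplus \overleftarrow{h}_t \tag{8} \]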
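As an illustration of the recurrence, the following dependency-free sketch implements one LSTM step and the bidirectional concatenation. It uses the standard candidate and cell-state updates for c_t, and stores one d_h × d_e (or d_h × d_h) matrix per gate, which is an assumption of this sketch rather than the paper's parameter layout; the helper names and the parameter dictionary keys are hypothetical.

```python
import math


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def affine(W, e, U, h, b):
    """Computes W e + U h + b componentwise."""
    return [w + u + c for w, u, c in zip(matvec(W, e), matvec(U, h), b)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def lstm_step(e_t, h_prev, c_prev, p):
    """One LSTM step: gates, candidate cell, new cell state, new hidden state."""
    i = sigmoid(affine(p["Wi"], e_t, p["Ui"], h_prev, p["bi"]))
    f = sigmoid(affine(p["Wf"], e_t, p["Uf"], h_prev, p["bf"]))
    c_tilde = [math.tanh(v)
               for v in affine(p["Wc"], e_t, p["Uc"], h_prev, p["bc"])]
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    o = sigmoid(affine(p["Wo"], e_t, p["Uo"], h_prev, p["bo"]))
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def run_lstm(sequence, p, d_h):
    h, c = [0.0] * d_h, [0.0] * d_h
    outputs = []
    for e_t in sequence:
        h, c = lstm_step(e_t, h, c, p)
        outputs.append(h)
    return outputs


def bilstm(sequence, p_forward, p_backward, d_h):
    """Concatenates, per position, the forward and backward hidden states."""
    forward = run_lstm(sequence, p_forward, d_h)
    backward = run_lstm(sequence[::-1], p_backward, d_h)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]


def zero_params(d_e, d_h):
    """All-zero parameters, only useful for checking shapes and gate values."""
    p = {}
    for gate in "ifco":
        p["W" + gate] = [[0.0] * d_e for _ in range(d_h)]
        p["U" + gate] = [[0.0] * d_h for _ in range(d_h)]
        p["b" + gate] = [0.0] * d_h
    return p
```

With all-zero parameters every gate equals 0.5 and the candidate cell is 0, so a step simply halves the previous cell state; this gives a quick sanity check of the update order.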
Here Viterbi algorithm [25] is used to compute [A]i, j and optimal tag sequences for
inference.
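A generic Viterbi decoder over per-position tag scores and a tag-transition matrix can be sketched as follows. The score layout (emissions[t][j] for tag j at position t, transitions[i][j] for moving from tag i to tag j) and the example scores are assumptions for illustration, not values from the paper.

```python
def viterbi(emissions, transitions):
    """Best-scoring tag sequence and its score.

    emissions[t][j]: score of tag j at position t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    back_pointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for ending in tag j at position t.
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
        score = new_score
        back_pointers.append(pointers)
    last = max(range(n_tags), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(back_pointers):   # follow pointers backwards
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Tags: 0 = O, 1 = B, 2 = I. The only constraint encoded here: O -> I is forbidden.
FORBIDDEN = -1e4
transitions = [[0.0, 0.0, FORBIDDEN],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
path, best = viterbi([[0.0, 2.0, 1.0], [0.0, 0.0, 3.0]], transitions)
```

In this toy example the decoder selects B followed by I; a large negative transition score is a common way to rule out invalid BIO sequences such as O followed by I.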
Figure 1 is the architecture of our approach. We start from a labeled seed corpus and
a weak supervision trained model. Because the supervision is rather weak at the initial
stage, the model can only partially predict the right entities from unlabeled data. Our
approach automatically and iteratively enhances the labels of unlabeled data with the
help of external knowledge, and the enhanced data is used to train a more generalized
model at the next iteration. At each iteration, we enhance a set of unlabeled data, update
the external knowledge and train a more generalized model.
1. Rule Construction:
Based on the observation on clinical entity domain features and medical literatures
we design several kinds of rules to remove the noise from the weakly supervised model
output. Some of the example rules are presented in Table 3.
2. Dictionary mining:
To generate the clinical entity dictionaries we collect named entities from various
sources. Besides the seed corpus annotations, we also collect entity names from medical
Rules

Name: Minimum length
Explanation: Any one character entity to be a surgery or a drug
Examples: length(drug) > 1; length(surgery) > 1;

Name: Parentheses and quotation marks mismatch
Explanation: Entity with only one part of parentheses or quotation marks
Examples: “(” + .* + “)”; “(” + .* + “)”; “ + .* + ”; ‘ + .* + ’

Name: Comma ending
Explanation: Entity should not end with comma
Examples: ent[length(ent)-1] not in (“,”, “,”);

Name: Part-of-speech rule filter
Explanation: If the word part-of-speech conflicts the entity type we regard it as a noise word, for example, drug type entity should not be a verb or adjective
Examples: POS(drug) != verb.; POS(drug) != adj.; POS(surgery) != adj.; POS(body) != verb.; …

Name: Special context
Explanation: Some special cases are determined by the context
Examples: Type(“手”) != body when in “手术”; Type(“口”) != body when in “切口”; Type(“心”) != body when in “心律”;

… … …
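Three of the rules above (minimum length, bracket or quotation mismatch, comma ending) are simple enough to sketch as a filter over predicted entities. The function name is hypothetical, and the inclusion of full-width brackets and commas next to the half-width ones is an assumption of this sketch; the part-of-speech and context rules are left out because they need external resources.

```python
# Opening/closing pairs checked for balance (full-width forms are assumed).
PAIRS = [("(", ")"), ("（", "）"), ("“", "”"), ("‘", "’")]
COMMAS = {",", "，"}


def passes_rules(entity, entity_type):
    """Return False if the predicted entity is treated as noise by a rule."""
    # Minimum length: one-character drugs or surgeries are rejected.
    if entity_type in {"drug", "surgery"} and len(entity) <= 1:
        return False
    # Mismatch: an opening mark without its closing mark, or vice versa.
    for opening, closing in PAIRS:
        if entity.count(opening) != entity.count(closing):
            return False
    # Comma ending: the entity must not end with a comma.
    if entity and entity[-1] in COMMAS:
        return False
    return True


predictions = [("奥美拉唑", "drug"), ("药", "drug"),
               ("奥美拉唑(", "drug"), ("腹痛,", "symptom")]
kept = [(e, t) for e, t in predictions if passes_rules(e, t)]
```

Only the first prediction survives the filter in this example.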
342 Y. Duan et al.
At each iteration of our weakly supervised learning approach, we apply the data enhancement
to the training data. We pick a small set from the unlabeled corpus as an unlabeled subset
and output its labels, thereby updating the labeled corpus and the dictionaries. For every
unlabeled subset, we generate the labels from two sources:
1. The output of the current CNER model, with noise eliminated by rules. The newly
learned entity words are added to the external dictionary;
2. For those entities that the current model did not recognize, we use the external dictionary
to complete the labels;
If the model prediction conflicts with the dictionary completion results, we give priority
to the predictions of the current model. And if the dictionary labels one character
as part of multiple entities, we use the prior strategy to decide which label should be chosen.
In this label completion process, the shortest word length s is an important parameter
that should be properly adjusted. We initialize this parameter to 0. As the experiment
proceeds, short entities introduce more and more wrong cases, and the performance
improves more slowly. We therefore gradually increase the shortest word length to 4, which
effectively removes this kind of wrong case.
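The label completion step can be sketched as forward maximum matching against the dictionary, restricted to characters that the model left untagged and to words of at least the shortest word length. This is a simplified illustration: the dictionary format (word to entity type), the function name and the handling of overlaps are assumptions, not the paper's algorithm.

```python
def complete_labels(text, model_tags, dictionary, min_len):
    """Fill untagged spans with dictionary entities, longest match first.

    model_tags: BIO tags predicted by the current model (kept untouched).
    dictionary: maps an entity word to its type.
    min_len:    shortest word length allowed for a dictionary match.
    """
    tags = list(model_tags)
    longest = max((len(word) for word in dictionary), default=0)
    i = 0
    while i < len(text):
        matched = False
        for length in range(min(longest, len(text) - i), 0, -1):
            word = text[i:i + length]
            untagged = all(t == "O" for t in tags[i:i + length])
            if length >= min_len and word in dictionary and untagged:
                entity_type = dictionary[word]
                tags[i] = "B-" + entity_type
                for k in range(i + 1, i + length):
                    tags[k] = "I-" + entity_type
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return tags


text = "间断服用奥美拉唑等药物"
dictionary = {"奥美拉唑": "drug", "药": "drug"}
no_model_tags = ["O"] * len(text)
tags_min0 = complete_labels(text, no_model_tags, dictionary, 0)
tags_min4 = complete_labels(text, no_model_tags, dictionary, 4)
```

With a shortest word length of 0 the single character “药” is tagged as a drug, which is the kind of short, context-free match described above; with a shortest word length of 4 only “奥美拉唑” is completed.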
Thus we finish unlabeled data enhancement at each iteration. As the iterative process
continues, the scale of the labeled corpus is enlarged and the performance of the CNER
model is improved.
Current deep neural network based approaches are mostly end-to-end. In contrast, we propose an
iterative training method that trains a more generalized model over several iterations. The
outline of our CNER approach is described in Algorithm 2.
Our approach starts from a seed corpus and a weakly supervised model. We refer to
the dataset enhanced by our method as the increment corpus. At each iteration, we
append a subset of the unlabeled corpus to the increment corpus. With the model
trained at the previous iteration, we enhance the entire increment corpus, and obtain a new
model by training on the seed corpus together with the increment corpus. This process
ends once all the data in the unlabeled corpus has been used, which yields the final CNER
model.
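The iterative procedure described in this paragraph can be outlined as a short loop. The sketch below takes the training and enhancement steps as caller-supplied functions, since their details are not reproduced here; `train`, `enhance` and the data layout are hypothetical placeholders.

```python
def iterative_training(seed_corpus, unlabeled_subsets, train, enhance):
    """Bootstrapping loop over subsets of the unlabeled corpus.

    train(labeled_data)        -> model
    enhance(model, sentences)  -> labeled data for those sentences
    """
    model = train(seed_corpus)                 # weakly supervised start
    increment_sentences = []
    for subset in unlabeled_subsets:
        increment_sentences.extend(subset)     # grow the increment corpus
        # Re-label the whole increment corpus with the latest model.
        increment_corpus = enhance(model, increment_sentences)
        model = train(seed_corpus + increment_corpus)
    return model


# Toy stand-ins: a "model" is just the size of its training data.
def toy_train(labeled_data):
    return len(labeled_data)


def toy_enhance(model, sentences):
    return [(sentence, f"labeled-by-model-{model}") for sentence in sentences]


final_model = iterative_training([("a", "x"), ("b", "y")],
                                 [["c"], ["d", "e"]],
                                 toy_train, toy_enhance)
```

In the toy run the training set grows from 2 to 3 to 5 examples, reflecting that the seed corpus is reused at every iteration while the increment corpus is re-labeled in full.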
4 Experiments
4.1 Dataset
Parameter Value
Size of Bert embedding 768
Size of word embedding 100
Drop-out 0.2
Loss function Cross Entropy
Adam learning rate 0.001
Early stopping patience 5 epochs
Shortest match length Initialize: 0; Final: 4
We denote the approach using all the labeled training data as the Fully Supervised approach.
With the 10% labeled seed corpus we trained a baseline model, denoted as the Seed
Corpus Supervised approach. In order to show the effect of the dictionary, we also
ran an experiment without the dictionary in our external knowledge; this is denoted as Our
Approach (without dictionary).
Approaches P R F1
Fully Supervised 87.42 86.72 87.07
Seed Corpus Supervised 76.98 77.94 77.46
Our Approach (without dictionary) 87.13 78.89 82.80
Our Approach 85.34 86.15 85.74
[Fig. 3: line chart of F1-score (vertical axis, 76 to 88) at 10% to 100% on the horizontal axis, with three series: Fully Supervised, Our Approach, and Our Approach (without dictionary).]
From Fig. 3 we can see that, compared to the model trained on fully labeled data, our
approach starts to catch up from the early stage. Although the improvement in
performance slows down in the middle stage, thanks to the reasonable control
of noise we finally achieve a satisfactory result. Compared to the approach without
the dictionary, the role of the dictionary is not obvious at the early stage, but as the
corpus grows, our approach gradually widens the gap.
The overall experimental results show that our approach is an effective way to perform the
CNER task with weak supervision. Under the guidance of external knowledge, we achieved
over 98% of the performance of the model trained on fully labeled data, with only 10% of
the labeled training data.
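The reported figures can be cross-checked with the usual harmonic-mean definition of F1. The sketch below recomputes F1 from the precision and recall values of the results table and the ratio between our approach and the fully supervised model; the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) from the results table.
results = {
    "Fully Supervised": (87.42, 86.72, 87.07),
    "Seed Corpus Supervised": (76.98, 77.94, 77.46),
    "Our Approach (without dictionary)": (87.13, 78.89, 82.80),
    "Our Approach": (85.34, 86.15, 85.74),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in results.items()}
ratio = results["Our Approach"][2] / results["Fully Supervised"][2]
```

The recomputed values agree with the reported F1 scores up to rounding, and the ratio is about 0.985, consistent with the claim of over 98% of the fully supervised performance.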
The detailed recognition performance of the different methods is presented in Table 8.
From the performance statistics, we can see that the model without the dictionary
achieves good precision, but its recall is relatively low. After incorporating the external
dictionary, we obtain a clear improvement in recall, at the cost of a slight decline in
precision.
Based on the experimental results, we find that drug and surgery entities
gain the most obvious improvements in F1-score with our approach compared to the weakly
supervised method (drug: from 69.49 to 83.15, surgery: from 68.52 to 83.60), because
these entity types benefit more from the external knowledge. Our approach even
outperforms the fully labeled model on the drug type (our approach: 83.15, fully labeled
model: 80.67). Because drug names often contain words transliterated from English,
for example “吉西他滨” means “Gemcitabine” and “伊利替康” means “Irinotecan”,
this kind of entity is very hard to model with word-embedding features alone, and dictionary
completion is rather powerful on drug entities. Meanwhile, surgery entities often
appear in the form of long phrases, for example “腹腔镜下乙状结肠根治性切除术”
means “laparoscopic radical resection of the sigmoid colon”, and the advantage of our
dictionary completion algorithm shows when dealing with such long phrases, since we
use the maximum matching algorithm. The description type, on the contrary, gains only a
slight improvement from external knowledge compared to the weakly supervised model
(weakly supervised: 81.38, our approach: 83.22), and it even gets a lower F1-score than the
method without the dictionary (without dictionary: 86.96, our approach: 83.22).
This is because description entities are not easily defined by external knowledge.
For example, the description entities “胀”, which means “inflation”, and “痛”, which means
“ache”, can be wrongly recognized as symptoms. The meaning of these words is easily confused
with other kinds of entities and with non-entity phrases.
Dictionary-completed data faces another problem. The external dictionary marks
an entity as soon as the word appears and does not take the context into consideration, which
introduces many wrong cases, especially when the entity word is very short. Our solution
is to raise the shortest word length at the later iterations when conducting dictionary
completion: because the model has already learned these short entities well, we use the
dictionary mainly to label longer words, rather than introducing more noise.
5 Related Work
5.1 Clinical Named Entity Recognition
As a key application area of artificial intelligence, named entity recognition from
clinical text has attracted considerable attention. Researchers in this field
have proposed many effective approaches. Existing methods can be
categorized as rule-based approaches, dictionary-based approaches, machine learning
approaches and deep learning approaches.
In early CNER research, rule-based approaches were dominant.
Friedman [7], Zeng [28] and Savova [29] built successful systems for named entity
recognition on medical texts. However, rules cannot be exhaustively enumerated, and
writing them always incurs a vast engineering cost.
Dictionary-based CNER systems can effectively locate the entities that appear
in the dictionaries [30]. However, the performance of this kind of method depends heavily
on the quality of the dictionaries, and such systems cannot handle entities that do not appear
in the dictionaries. Meanwhile, in Chinese many characters or words have multiple
meanings, and the corresponding entities must be determined from the context, which
dictionary-based approaches cannot do; this leads to low precision.
Machine learning based approaches usually treat CNER as a sequence labeling
problem. Classical methods are hidden Markov models [12, 30], maximum entropy
Markov models [11], conditional random fields [13, 31] and support vector machines
[32]. This kind of approach requires heavy feature engineering, and it is
rather difficult to find the best combination of features.
The use of deep neural network for NER was pioneered by Collobert [33] in 2011,
who proposed convolutional neural networks (CNNs) over the sequence of words. Huang
et al. [34] proposed bidirectional LSTM encoder to replace CNN encoder. Lample et al.
[35] introduced hierarchy in the architecture by replacing hand-engineered character-
level features in prior works with additional bidirectional LSTM. The sequential CRF
on top of the recurrent layers ensures that the optimal sequence of tags over the entire
sentence is obtained.
6 Conclusion
In this work, we introduce an iterative weakly supervised learning architecture to perform
the CNER task. We propose a bootstrapping CNER method that integrates external
knowledge acquired from rule construction and dictionary mining, and achieves
performance close to that of the model trained on fully labeled data. Our approach effectively
reduces the amount of labeled training data needed, and makes good use
of external knowledge in performing the CNER task.
In the future, our work will focus on more effective methods for acquiring useful
external knowledge automatically, and on making the iterative bootstrapping process more
efficient.
Acknowledgments. We sincerely thank the reviewers for their insightful comments and valuable
suggestions. This work is supported by the National Natural Science Foundation
of China under Grant no. 61772505 and the National Key R&D Program of China under Grant
2018YFB1005100.
References
1. Kundeti, S.R., Vijayananda, J., Mujjiga, S., Kalyan, M.: Clinical named entity recognition:
challenges and opportunities. In: 2016 IEEE International Conference on Big Data (Big Data),
pp. 1937–1945. Washington, DC (2016)
2. Jiang, Z., Zhao, F., Guan, Y., Yang, J.: Research on Chinese electronic medical record oriented
lexical corpus annotation. Chin. High Technol. Lett. 24(6), 609–615 (2014)
3. Deleger, L., et al.: Overview of the bacteria biotope task at BioNLP shared task 2016. In:
Proceedings of the 4th BioNLP Shared Task Workshop, pp. 12–22 (2016)
4. Alfonseca, E., Manandhar, S.: An unsupervised method for general named entity recognition
and automated concept discovery. In: Proceedings of the 1st international conference on
general WordNet, Mysore, India, pp. 34–43 (2002)
5. Nadeau, D., Turney, P.D., Matwin, S.: Unsupervised named-entity recognition: generating
gazetteers and resolving ambiguity. In: Lamontagne, L., Marchand, M. (eds.) AI 2006.
LNCS (LNAI), vol. 4013, pp. 266–277. Springer, Heidelberg (2006). https://doi.org/10.1007/
11766247_23
6. Sekine, S., Nobata, C.: Definition, dictionaries and tagger for extended named entity hierarchy.
In: Proceedings of the language resources and evaluation conference (LREC), pp. 1977–1980
(2004)
7. Friedman, C., Alderson, P.O., Austin, J.H.M., Cimino, J.J., Johnson, S.B.: A general natural-
language text processor for clinical radiology. J. Am. Med. Inform. Assoc. 1(2), 161–174
(1994)
8. Fukuda, K., Tamura, A., Tsunoda, T., Takagi, T.: Toward information extraction: identifying
protein names from biological papers. In: Pacific Symposium on Biocomputing, pp. 707–718
(1998)
9. Rindflesch, T.C., Tanabe, L., Weinstein, J.N., Hunter, L.: Edgar, extraction of drugs, genes
and relations from the biomedical literature. In: Biocomputing 2000, pp. 517–528. World
Scientific (1999)
10. Gaizauskas, R., Demetriou, G., Humphreys, K.: Term recognition and classification in
biological science journal articles. In: Computational Terminology for Medical & Biological
Applications Workshop of the 2nd International Conference on NLP, pp. 37–44 (2000)
11. Mccallum, A., Freitag, D., Pereira, F.: Maximum entropy markov models for information
extraction and segmentation. In: Proceedings of the 17th International Conference on Machine
Learning, pp. 591–598 (2000)
12. Zhou, G.D., Su, J.: Named entity recognition using an HMM-based chunk tagger. In: Meeting
on Association for Computational Linguistics, pp. 473–480 (2002)
13. Mccallum, A., Li, W.: Early results for named entity recognition with conditional random
fields, feature induction and web-enhanced lexicons. In: Conference on Natural Language
Learning at Hlt-Naacl, pp. 188–191 (2003)
14. Gridach, M.: Character-level neural network for biomedical named entity recognition. J.
Biomed. Inform. 70, 85–91 (2017)
15. Habibi, M., Webber, L., Neves, M., Wiegandt, D.L., Leser, U.: Deep learning with word
embeddings improves biomedical named entity recognition. Bioinformatics 33(14), i37–i48
(2017)
16. Wang, Y., Ananiadou, S., Tsujii, J.: Improve Chinese clinical named entity recognition per-
formance by using the graphical and phonetic feature. In: IEEE International Conference on
Bioinformatics and Biomedicine (BIBM), pp. 5386–5488 (2018)
17. Yang, X., Huang, W.: A conditional random fields approach to clinical name entity recognition.
In: 2018 CEUR Workshop Proceedings, vol. 2242, pp. 1–6 (2018)
18. Luo, L., Li, N.: DUTIR at the CCKS-2018 Task1: A neural network ensemble approach
for Chinese clinical named entity recognition. http://CEUR-WS.org/Vol-2242/paper02.pdf
(2018)
19. Zhang, S., Elhadad, N.: Unsupervised biomedical named entity recognition: experiments with
clinical and biological texts. J. Biomed. Inform. 46(6), 1088–1098 (2013)
20. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
21. Zhou, Z.-H.: A brief introduction to weakly supervised learning. Natl. Sci. Rev. 5(1), 44–53
(2018)
22. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in
vector space. In: Proceedings of Workshop at ICLR (2013)
23. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pretraining of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
24. Zhang, J., Qin, Y., Zhang, Y., Liu, M., Ji, D.: Extracting entities and events as a single task
using a transition-based neural model. In: Proceedings of the Twenty-Eighth International
Joint Conference on Artificial Intelligence Main track, pp. 5422–5428
25. Rabiner, L.R.: A tutorial on hidden markov models and selected applications in speech
recognition. Readings Speech Recogn. 77(2), 267–296 (1990)
26. Liu, Y., Tan, Q., Shen, K.X.: The word segmentation rules and automatic word segmentation
methods for Chinese information processing. Tsinghua University Press and Guang Xi, p. 36
(1994)
27. Xue, N.: Chinese word segmentation as character tagging. Int. J. Comput. Linguist. Chin.
Lang. Process. 8(1), 29–48 (2003). February 2003: Special Issue on Word Formation and
Chinese Language Processing
28. Zeng, Q.T., Goryachev, S., Weiss, S., Sordo, M., Murphy, S.N., Lazarus, R.: Extracting prin-
ciple diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural
language processing system. BMC Med. Inform. Decis. Mak. 6(1), 1–9 (2006)
29. Savova, G.K., et al.: Mayo clinical text analysis and knowledge extraction system (ctakes):
architecture, component evaluation and applications. J. Am. Med. Inform. Assoc. JAMIA
17(5), 507–513 (2010)
30. Song, M., Yu, H., Han, W.: Developing a hybrid dictionary-based bio-entity recognition
technique. BMC Med. Inform. Decis. Mak. 15(S-1), S9 (2015)
31. Skeppstedt, M., Kvist, M., Nilsson, G.H., Dalianis, H.: Automatic recognition of disor-
ders, findings, pharmaceuticals and body structures from clinical text. J. Biomed. Inform.
49(20140), 148–158 (2014)
32. Ju, Z., Wang, J., Zhu, F.: Named entity recognition from biomedical text using SVM. In:
International Conference on Bioinformatics and Biomedical Engineering, pp. 1–4 (2011)
33. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural
language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)
34. Huang, Z., Xu, W., Yu, K.: Bidirectional lstm-crf models for sequence tagging. arXiv preprint
arXiv:1508.01991 (2015)
35. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures
for named entity recognition. In: Proceedings of NAACL-HLT, pp. 260–270 (2016)
36. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In:
Meeting of the Association for Computational Linguistics, pp. 189–196 (1995)
37. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT: Pro-
ceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers
(1998)
38. Collins, M., Singer, Y.: Unsupervised models for named entity classification. In:
EMNLP/VLC-99 (1999)
39. Kozareva, Z.: Bootstrapping named entity recognition with automatically generated gazetteer
lists. In: EACL. The Association for Computer Linguistics (2006)
40. Teixeira, J., Sarmento, L., Oliveira, E.: A bootstrapping approach for training a ner with
conditional random fields. In: Antunes, L., Pinto, H.S. (eds.) EPIA 2011. LNCS (LNAI), vol.
7026, pp. 664–678. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24769-
9_48
41. Shen, Y.: Deep active learning for named entity recognition. In: ICLR (2018)
42. Kim, J., Ko, Y., Seo, J.: A bootstrapping approach with CRF and deep learning models
for improving the biomedical named entity recognition in multi-domains. IEEE Access 7,
70308–70318 (2019)
Metadata Application Profile Provenance
with Extensible Authoring Format
and PAV Ontology
1 Introduction
Metadata application profiles (MAPs) are data element schemas drawn from various
namespaces, mixed and customized for a specific application [9]. MAPs are the
best mechanism for expressing consensus on a metadata instance: they document
the elements, policies, guidelines, and vocabularies for that particular
implementation, along with the schemas and applicable constraints. Application profiles
also provide term usage specifications and support interoperability by
representing domain consensus, alignment, and the structure of the data [1,10].
1. Distinguish the source of the application profile from the published versions
to baseline the concepts of authoring formats and expression formats for
application profiles.
2. Identifying and retrieving application profiles and their versions, including
changelogs, can be automated with the help of semantic linking of MAP
resources.
3. A MAP source in an interoperable authoring format that includes an actionable
timeline can help to maintain the longevity of the schema. Declared roles
of contribution can act as a means of provenance for MAP resources.
The Dublin Core Metadata Initiative (DCMI) defines one of the earliest guidelines
for expressing application profiles, which can be in various formats, as Description
Set Profiles (DSP). DSP is a constraint language for Dublin Core Application
Profiles (DCAP) based on the Singapore Framework for application profiles [18].
XML or RDF can be used as an expression format for DSP.
The Singapore Framework recommends publishing application profiles in
human-readable expression formats as documentation, with detailed usage
guidelines aimed at maximizing reusability and interoperability. Expressing an
application profile in human-readable formats requires many more components than
textual descriptions of first-order elements such as properties and classes. As
a result, the expression of an application profile in human-readable formats is
curating application profiles. For creating application profiles, there are not many
well-accepted authoring formats or pre-processors.
2 Related Work
As an application format, DCMI proposed a constraint language for Dublin
Core Application Profiles named Description Set Profile (DSP). As an authoring
format for DSP, a MoinMoin wiki syntax was introduced to embed application
profiles in web pages [7]. Later, Simple DSP (SDSP), a simplified form
of DSP that uses spreadsheets as an authoring format, was developed as part of the
Metabridge project [17]. More recently, the DCMI Application Profile Special
Interest Group has been working on improving DSP [6]. The Library of Congress BIBFRAME
project developed a web-based editor for BIBFRAME profiles [5]. The Linked Data
for Production 2 (LD4P2) project modified the BIBFRAME editor and released it for
general application profile creation as the Sinopia Profile Editor [11].
None of these authoring formats is extensible. A format’s
extensibility is critical to its acceptance: it lets different communities
adopt a simple base format and introduce their own domain-specific requirements. It
also helps to create different standard formats from the same source document
without relying only on the common elements. The authors previously proposed an
extensible authoring format named Yet Another Metadata Application Profile
(YAMA) [25] using YAML¹ syntax and validated its extensibility over existing
similar proposals [24].
Li and Sugimoto proposed a provenance model named DSP-PROV [13] to
keep track of structural changes in metadata schemas. The DSP-PROV model
applies PROV to the Dublin Core Application Profile. In contrast to that
proposal, this paper treats application profile documents as digital resources
and attempts to use a lightweight ontology to map the different versions of the
published MAP and its provenance.
3 Methodology
The authors extend a previously proposed MAP authoring
format with an actionable timeline [23]. Because the format is intended
to be a complete source for MAP authoring and versioning, a lightweight ontology
is introduced to annotate the authoring and versioning of a MAP. The ontology
can express the different versions of the MAP as
well as its stakeholders and authoring source.
YAMA is extended with two different sets of change mapping options. One is an
actionable change record named ‘changesets’: a collection of changes declared using a
custom adaptation of JSON Patch, along with minimal metadata for the set
of changes. Changesets are declared within the ‘changes’ path of the YAMA
document. JSON Patch was originally intended for use with the HTTP PATCH method for
¹ https://yaml.org
{
  "op": "remove",
  "path": "/statements/statement_id/"
}
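As a rough illustration of how such a remove operation could be applied to a profile held in memory, the sketch below walks a slash-separated path through nested dictionaries and deletes the addressed node. The `profile` structure, the helper names, and the decision to ignore leading and trailing slashes are assumptions made for this sketch; they are not taken from the YAMA specification or from RFC 6902.

```python
from typing import Any, Dict, List

def split_path(path: str) -> List[str]:
    """Split a slash-separated path, ignoring leading and trailing slashes (an assumption)."""
    return [part for part in path.split("/") if part]

def apply_remove(document: Dict[str, Any], path: str) -> Dict[str, Any]:
    """Remove the node addressed by `path`, modifying and returning `document`."""
    *parents, leaf = split_path(path)
    node = document
    for key in parents:
        node = node[key]        # descend to the parent of the target
    del node[leaf]
    return document

# Hypothetical profile fragment; real YAMA documents carry more fields.
profile = {"statements": {"statement_id": {"label": "Example"}, "pr_type": {"max": 0}}}
patch = {"op": "remove", "path": "/statements/statement_id/"}
if patch["op"] == "remove":
    apply_remove(profile, patch["path"])
```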
² https://tools.ietf.org/html/rfc6902
³ https://tools.ietf.org/html/rfc5789
⁴ https://tools.ietf.org/html/rfc2616
# YAMA
changes:
  cs_20181108_01:
    version: 1.2
    previous_version: 1.1
    date: 2018-11-08
    changeset:
      ch_20181108_01:
        op: replace # remove, add, replace, copy, test
        path: /statements/pr_type/max
        value: n
        previous_value: 0
Element           Usage
version           Version of the MAP after the change
previous_version  Version of the MAP to which the change is applied
date              Date of change in ISO 8601 (not the date of release)
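The change record above can be read as an instruction that moves a profile from one version to the next. The following sketch applies the `replace` operations of one record and uses `previous_version` and `previous_value` as consistency checks. It is only an interpretation under stated assumptions: the record is written as a Python dictionary mirroring the YAML, the profile is assumed to store its version in a top-level `version` key, versions are kept as strings, and the function name is hypothetical.

```python
from typing import Any, Dict

def apply_changeset(profile: Dict[str, Any], record: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the 'replace' operations of one change record to `profile` in place."""
    if profile["version"] != record["previous_version"]:
        raise ValueError("change record does not apply to this version")
    for change in record["changeset"].values():
        if change["op"] != "replace":
            raise NotImplementedError(change["op"])  # other operations are omitted here
        *parents, leaf = [part for part in change["path"].split("/") if part]
        node = profile
        for key in parents:
            node = node[key]
        if node[leaf] != change["previous_value"]:
            raise ValueError(f"unexpected current value at {change['path']}")
        node[leaf] = change["value"]
    profile["version"] = record["version"]
    return profile

# Dictionary form of the change record shown above.
record = {
    "version": "1.2",
    "previous_version": "1.1",
    "date": "2018-11-08",
    "changeset": {
        "ch_20181108_01": {
            "op": "replace",
            "path": "/statements/pr_type/max",
            "value": "n",
            "previous_value": 0,
        }
    },
}
profile = {"version": "1.1", "statements": {"pr_type": {"max": 0}}}
apply_changeset(profile, record)
```

Checking the previous value before writing makes a record fail loudly when it is applied to the wrong baseline instead of silently overwriting data.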
Fig. 6. YAMA with actionable changesets and changelogs mapped to their expected
outputs
the efforts required for expressing resources using an ontology. Being more lightweight
than PROV-O is the main reason for considering PAV as a means of expressing
MAP resources [4].
There are vocabularies similar to PAV, such as Dublin Core Terms (DC
Terms) [3], PROV-O [12], OPM [16], and the Provenance Vocabulary [8]. Among
these, PROV-O is the most suitable and has been considered in many other
studies for expressing MAP provenance. PROV-O acts as a generic framework
for describing provenance across a wide range of applications. However,
PROV-O alone may not be suitable for expressing the details needed for
provenance that specifically involves authoring and versioning. PAV can be considered
a specialization of PROV-O that provides more straightforward relationships
for expressing common provenance of digital resources on the web [4]. PROV-O
implements terms useful for tracing the origin of a resource, its derivations, and
the relationships between these different resources. PROV-O is also capable of
expressing the different entities that contributed to the resource. In short, PROV-O
can be considered a general provenance data model that is extendable for
domain-specific provenance information. For example, PROV-O does not distinguish
between authors, editors, and contributors, which is a notable distinction
in use cases such as collaborative MAP authoring and publishing based on public
repositories such as GitHub.
A PAV-based framework is proposed in the context of MAP authoring and
publishing with these intentions.
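To give a feel for what PAV statements about a MAP version might look like, the sketch below assembles a small Turtle description using the PAV properties `pav:version`, `pav:previousVersion`, `pav:authoredBy`, `pav:contributedBy`, and `pav:curatedBy`. The resource and agent IRIs under `example.org`, the function name, and the assignment of people to roles are hypothetical; the mapping actually proposed between PAV and YAMA elements is the one summarized in Table 2, which this sketch does not attempt to reproduce.

```python
from typing import Dict, List

PAV = "http://purl.org/pav/"  # PAV ontology namespace

def pav_turtle(resource: str, roles: Dict[str, List[str]], version: str, previous: str) -> str:
    """Serialize version and role statements about `resource` as Turtle text."""
    statements = [
        f'    pav:version "{version}"',
        f"    pav:previousVersion <{previous}>",
    ]
    for prop, agents in roles.items():
        for agent in agents:
            statements.append(f"    pav:{prop} <{agent}>")
    lines = [f"@prefix pav: <{PAV}> .", "", f"<{resource}>"]
    lines.append(" ;\n".join(statements) + " .")
    return "\n".join(lines)

# Hypothetical IRIs and role assignments, for illustration only.
roles = {
    "authoredBy": ["https://example.org/people/alice"],
    "contributedBy": ["https://example.org/people/bob"],
    "curatedBy": ["https://example.org/people/carol"],
}
turtle = pav_turtle(
    "https://example.org/map/1.2", roles, "1.2", "https://example.org/map/1.1"
)
```

Keeping each role as a separate property is what lets a consumer tell an author from a contributor without additional modeling.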
Table 2. Subset of PAV authoring properties mapped to YAMA MAP metadata ele-
ments
4 Validation
To validate the proposal, a popular public application profile, the DCAT
Application Profile for data portals in Europe (DCAT-AP), can be used. DCAT-AP
is an application profile based on W3C’s Data Catalog Vocabulary (DCAT).
DCAT is implemented for describing public sector datasets in Europe to enable
cross-data-portal search for open datasets. DCAT-AP
is published on the Joinup portal⁶, but the sources are maintained in a GitHub
repository⁷. The DCAT-AP repository does not use any authoring format or
preprocessor but maintains and releases the MAP in individual expression formats.
⁶ https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe
⁷ https://github.com/SEMICeu/DCAT-AP
6 Conclusion
References
1. Baca, M.: Introduction to Metadata, July 2016. http://www.getty.edu/
publications/intrometadata
2. Ben-Kiki, O., Evans, C., döt Net, I.: YAML Ain’t Markup Language (YAML™)
Version 1.2, October 2009. https://yaml.org/spec/1.2/spec.html
3. Board, D.U.: DCMI: DCMI Metadata Terms. https://www.dublincore.org/
specifications/dublin-core/dcmi-terms/2012-06-14/
4. Ciccarese, P., Soiland-Reyes, S., Belhajjame, K., Gray, A.J., Goble, C., Clark, T.:
PAV ontology: provenance, authoring and versioning. J. Biomed. Semant. 4(1), 37
(2013). https://doi.org/10.1186/2041-1480-4-37
5. Library of Congress, L.: BIBFRAME Profile Editor (2018). http://bibframe.org/
profile-edit/
6. Coyle, K.: RDF-AP, January 2017. https://github.com/kcoyle/RDF-AP, original-
date: 2017-01-12T15:38:41Z
7. Enoksson, F.: DCMI: A MoinMoin Wiki Syntax for Description Set Profiles, Octo-
ber 2008. http://www.dublincore.org/specifications/dublin-core/dsp-wiki-syntax/
8. Hartig, O., Zhao, J.: Publishing and consuming provenance metadata on the web of
linked data. In: McGuinness, D.L., Michaelis, J.R., Moreau, L. (eds.) IPAW 2010.
LNCS, vol. 6378, pp. 78–90. Springer, Heidelberg (2010).
https://doi.org/10.1007/978-3-642-17819-1_10
9. Heery, R., Patel, M.: Application profiles: mixing and matching metadata schemas.
Ariadne (25) (2000). http://www.ariadne.ac.uk/issue/25/app-profiles/
10. Hillmann, D.: Metadata standards and applications (2006). http://
managemetadata.com/, publisher: Metadata Management Associates LLC
11. LD4P2: Sinopia Profile Editor (2019). https://profile-editor.sinopia.io/
12. Lebo, T., et al.: Prov-o: The prov ontology. W3C recommendation 30 (2013)
13. Li, C., Sugimoto, S.: Provenance description of metadata application profiles
for long-term maintenance of metadata schemas. J. Documentation 74(1), 36–61
(2018). https://doi.org/10.1108/JD-03-2017-0042
14. Malta, M.C., Baptista, A.A.: A method for the development of Dublin core appli-
cation profiles (Me4dcap V0.2): detailed description. In: Proceedings of the Inter-
national Conference on Dublin Core and Metadata Applications, p. 14 (2013)
15. Malta, M.C., Baptista, A.A.: A panoramic view on metadata application profiles
of the last decade. Int. J. Metadata Semant. Ontol. 9(1), 58 (2014). https://doi.
org/10.1504/IJMSO.2014.059124
16. Moreau, L., et al.: The open provenance model core specification (v1.1). Futur.
Gener. Comput. Syst. 27(6), 743–756 (2011). https://doi.org/10.1016/j.future.
2010.07.005
17. Nagamori, M., Kanzaki, M., Torigoshi, N., Sugimoto, S.: Meta-bridge: a develop-
ment of metadata information infrastructure in Japan. In: Proceedings Interna-
tional Conference on Dublin Core and Metadata Applications, p. 6 (2011)
18. Nilsson, M., Baker, T., Johnston, P.: DCMI: The Singapore Framework for Dublin
Core Application Profiles, January 2008. http://dublincore.org/specifications/
dublin-core/singapore-framework/
19. Nottingham, M., Bryan, P.: JavaScript Object Notation (JSON) Patch, April 2013.
https://tools.ietf.org/html/rfc6902
20. The DCAT Application profile for data portals in Europe (DCAT-AP), April 2019.
https://github.com/SEMICeu/DCAT-AP, original-date: 2017-09-13T07:53:27Z
21. Svensson, L.A.R.V.: Negotiating Profiles in HTTP, March 2017. https://
profilenegotiation.github.io/I-D-Accept-Schema/I-D-accept-schema
22. Svensson, L.G., Atkinson, R., Car, N.J.: Content Negotiation by Profile, April
2019. https://www.w3.org/TR/dx-prof-conneg/
23. Thalhath, N., Nagamori, M., Sakaguchi, T.: YAMA: Yet Another Metadata Appli-
cation Profile (2019). https://purl.org/yama/spec/latest
24. Thalhath, N., Nagamori, M., Sakaguchi, T., Sugimoto, S.: Authoring formats
and their extensibility for application profiles. In: Jatowt, A., Maeda, A., Syn,
S.Y. (eds.) ICADL 2019. LNCS, vol. 11853, pp. 116–122. Springer, Cham (2019).
https://doi.org/10.1007/978-3-030-34058-2_12
25. Thalhath, N., Nagamori, M., Sakaguchi, T., Sugimoto, S.: Yet another meta-
data application profile (YAMA): authoring, versioning and publishing of appli-
cation profiles. In: International Conference on Dublin Core and Metadata Appli-
cations, pp. 114–125 (2019). https://dcpapers.dublincore.org/pubs/article/view/
4055. ISSN 1939-1366
An Ontology-Based Development of
Activity Knowledge and System Design
1 Introduction
Human activities are diversifying in ways that require appropriate knowledge
processing in accordance with the domain of each activity. In order to support
human activities using information technologies, it is necessary to come up with
a description form of knowledge that can be processed and with knowledge rep-
resentations that make understanding and reasoning easier. In the knowledge
engineering field, many domain ontologies have been developed to resolve these
issues. Domain ontology here is defined as the knowledge conceptualized and
hierarchized from a specific viewpoint in a specific domain. It enables not only
deeper understanding of domain knowledge but also the collection and utilization
of adequate information through multiple data integrations.
© Springer Nature Switzerland AG 2020
X. Wang et al. (Eds.): JIST 2019, LNCS 12032, pp. 369–384, 2020.
https://doi.org/10.1007/978-3-030-41407-8_24
The most pressing issue with domain ontologies is that it is difficult (i) for
domain experts to verify their validity [1] and (ii) to regularly improve the ontology
in pace with developments in the field. Collaboration between computer engineers
and domain experts is thus extremely important for building domain knowledge.
Previous works in the field of knowledge representation have mainly pursued
machine readability. We feel that a workflow design that connects both kinds of
experts, together with a method for easily extracting and organizing knowledge, is
important. In particular, a knowledge representation that is easy for domain experts
to understand and that integrates with a domain ontology would be helpful for our
activities.
1.2 Contributions
2 Related Work
There seems to be a demand for studies that provide knowledge development and
guidelines. This section focuses on the relevance of representing domain knowl-
edge in ontologies in a way that is easier for the domain experts and machines
(collaborative development and consumption) with regard to readability. We
describe the existing approaches that would be suitable for this purpose.
2.1 Methodologies
perform the process, (2) another knowledge that describes “what” like ontology
does to ensure the consistency of the terms. To develop activity knowledge, a
process that enables domain experts and ontology experts to represent domain
knowledge including “what” and “how” from the same perspective is necessary.
Many visualization tools have been developed to promote understanding of
ontologies. The Visual Notation for OWL Ontologies (VOWL) can be customized to
preference [14]. OntoGraph was developed to provide documentation on
existing OWL ontologies. It can create separate graphs for the classes, object and data
properties, and individuals of an ontology [1]. Another approach to
representing ontologies is spreadsheets, which are used in many domains. Populous has
been used for ontology development, for example for the Kidney and Urinary Pathway
Ontology (KUPO) [9]. These tools cannot visualize complex structures that contain blank
nodes. However, some works have addressed this problem through skolemization, for which
existing tools provide reliable support for customization [4,15]. For that reason,
we feel it is necessary to use another description form or tool. Activity knowledge
that provides high readability might help solve this problem.
The workflows proposed in some tools [3,18,19] allow domain experts to easily
modify and scale ontologies in step with the rapidly changing needs of the application.
Furthermore, they provide other features such as version control and visualization.
As a framework of knowledge representation in the product design field, the
Functional Ontology was developed [11]. This ontology deals with knowledge
about design rationale and can provide generality with high reusability by
separating the function (what to achieve) from the way of function achievement (how to
achieve it). In this study, we relax this notion and use it when developing the
activity knowledge.
The problem with classical guitar is that there are no general textbooks, which
tends to result in a lack of knowledge sharing among players or teachers and
hinders the understanding and progress of students. Therefore, our previous
work involved the collection and organization of knowledge related to classical
guitar techniques [5]. In this study, we revised that knowledge by utilizing the
Convincing Human Action Rationalized Model (CHARM)[16].
The basic components of CHARM are the action, which implies a change
of states, the instance of action, the doer, additional information, and the
manner. This model presents the action as a purpose and more detailed partial
actions, but we distinguished between them to deal with the abstract purpose.
In addition, there is an “achievement method” between the action layers that
indicates the conceptualized principles of the physical law in state change [11].
We relaxed the idea of the achievement method (hereinafter called the “method”)
and defined it as “the technique for achieving the purpose.”
There are three main contents of our activity knowledge: purpose, action,
and method. In addition, we created items of detailed information as needed.
Figure 2 shows an example of the activity knowledge developed on the basis of
the above ideas. We adopted classical guitar renditions as the method. The green
rectangle indicates the purpose, the black squares and blue letters present the
method (guitar rendition), the blue rectangles indicate actions, and the orange
rectangles describe the detailed information. In this example, the “Artificial
harmonics” method is defined to achieve the purpose of “changing the timbre”
and is performed in the order of the following three actions: “press string,” “touch
string with right hand,” and “pluck string.” When performing “touch string with
right hand,” the index finger would be used. We can clearly distinguish the
content of the knowledge in this way. However, in order to process this knowledge,
we have to control the terminologies. Ontology helps achieve this by drawing
from common concepts and conventions.
as needed; for instance, “ornament tone” indicates added notes for decoration
in the Ornament rendition (Fig. 3).
The description rules of actions are as follows: describe the order of actions
by “action+number,” express the simultaneity and continuity of actions, such
as ‘action A is performed while action B is being performed,’ by “primary-action” and
“conditional-action,” and explain the details of actions by several properties such
as “used finger” and “place of action.”
Figure 4 shows the description of Artificial harmonics. This rendition is
described as a sequence of two actions (action1 and action2). In action1, players
pluck a string with a finger on the body side of a guitar while pressing the same
string with a finger on the neck side and touching a string with the indice (index
finger) on the body side. The playing-action of the primary-action is “pluck
string body” with one of three fingers; anular (third finger), medio (middle fin-
ger), or pulgar (thumb), and two conditional-actions are “press string neck” and
“touch string body” with properties for usage of the fingers. Then, in action2,
they release the indice from a string. The playing-action of the primary-action
is described as “release finger from the string body,” and the conditional-action
is “press string neck.”
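The description rules and the Artificial harmonics example can be mirrored in a small data structure, which may help when such knowledge needs to be processed by a program. The sketch below is an illustration only: the class names, the field names, and the `describe` helper are assumptions introduced here, and the actual system stores activity knowledge in its own format.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PlayingAction:
    name: str
    properties: Dict[str, List[str]] = field(default_factory=dict)

@dataclass
class Step:
    primary_action: PlayingAction
    conditional_actions: List[PlayingAction] = field(default_factory=list)

@dataclass
class Rendition:
    name: str
    purpose: str
    steps: List[Step]  # list order encodes action1, action2, ...

artificial_harmonics = Rendition(
    name="Artificial harmonics",
    purpose="change the timbre",
    steps=[
        Step(
            primary_action=PlayingAction(
                "pluck string body", {"used finger": ["anular", "medio", "pulgar"]}
            ),
            conditional_actions=[
                PlayingAction("press string neck"),
                PlayingAction("touch string body", {"used finger": ["indice"]}),
            ],
        ),
        Step(
            primary_action=PlayingAction(
                "release finger from the string body", {"used finger": ["indice"]}
            ),
            conditional_actions=[PlayingAction("press string neck")],
        ),
    ],
)

def describe(rendition: Rendition) -> List[str]:
    """Render each step as 'actionN: primary while conditions'."""
    lines = []
    for number, step in enumerate(rendition.steps, start=1):
        conditions = ", ".join(a.name for a in step.conditional_actions)
        lines.append(f"action{number}: {step.primary_action.name} while {conditions}")
    return lines

summary = describe(artificial_harmonics)
```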
Table 1. Variations in knowledge. (a) and (b) are the names of the purposes that
correspond to the Timbre rendition and Percussion rendition of GRO.
Subjective Evaluation. We also asked the domain experts about their opin-
ions regarding the activity knowledge and GRO to examine the effects of the
ontology-based development of activity knowledge. Table 2 lists our questions
about readability (Q.1), appropriateness (Q.2), and usefulness (Q.3). Experts
gave scores on five-point scales, ranging from 1 (strongly disagree) to 5 (strongly
agree).
As shown in Table 3, all evaluations were positive. With regard to readability,
the score of activity knowledge was better than that of GRO, which means the
description form was easy for the domain experts to understand. On the other
hand, its adequacy scores were lower than GRO’s because the structures of
the ontology are clear. According to the evaluation of the usefulness and the
Table 2. Questions.
Table 3. Scores.
experts’ comments, we found that improving the activity knowledge with domain
ontology promoted understanding on the part of the domain experts. These
results demonstrate that the ontology-based development of activity knowledge
enables (1) the discovery of items in a domain, (2) the improvement of knowledge
representation, and (3) deep understanding on the part of domain experts.
4 System Design
This section describes the kNeXaR (kNowledge eXplication AugmenteR) sys-
tem we designed and developed to support the ontology-based development of
activity knowledge.
4.1 Architecture
kNeXaR was designed to describe activity knowledge based on the domain ontology.
In the beginning, we dealt with nursing care-related knowledge, but now
we intend to widen the domain to include instrument performance, education,
and the manufacturing industry. The system’s architecture, shown in Fig. 5, has
the following components: Ontology, Declarative Knowledge, and Procedural Knowledge.
The ontology is managed in OWL (Web Ontology Language) and all activity
knowledge is managed in XML. However, we must extend data availability to
RDF and SPARQL in order to process and reason over the knowledge. Declarative
knowledge here is an extension of the ontology that serves as a platform for adding or
linking all kinds of information (expertise, individual know-how, videos, etc.). Note that
we do not use declarative knowledge in this study. Procedural knowledge here
refers to the activity knowledge and can be developed by various domain experts
in accordance with domains, facilities, communities, and so on.
4.2 Functions
Figure 6 shows the activity knowledge-related windows: the right one presents the
list of the ontology terms, the middle one is the edit window for describing activity
knowledge, and the left one presents the described knowledge. The description
items of activity knowledge are based on CHARM: the Action, the Case for
designating the case in which an action is performed, the Subject who performs an action,
the Object of an action, the Noun of an action, the Verb of an action, the Details
of verb to explain the action more fully, any Risk that is expected, the Instance
Fig. 6. Active windows used to describe the activity knowledge. The terms of the
ontology to be selected are on the right, the items to describe or edit knowledge are in
the middle, and lists of described knowledge are on the left.
of noun, a Line, which is the option to change from a simple line to an arrow,
and ThisKey and JumpKey to link between actions. The Risk and Instance of
noun items allow for the description of long sentences forming paragraphs. Data such
as PDF and JPEG files can be attached to the Risk description item.
To keep the terminology of activity knowledge consistent, we recommend
ontology-based development as follows. First, the user imports or describes an
ontology. Then, the user selects concepts or properties of the ontology by clicking
on the magnifying glass icon to the right of each item in the edit window. In the
case shown in Fig. 6, the “select ontology” window of the Action item presents
the registered action-related concepts of the Guitar Rendition Ontology. “Pluck
string body” is selected and transcribed to the item in the edit window. The
described activity knowledge is then added to the list.
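The selection step described above can be sketched in code. The following is a minimal illustration and not the implementation of kNeXaR: the names ActivityItem, select_term and ACTION_TERMS are hypothetical, as is the small term set. Only the description items (Action, Case, Subject and so on) and the term “pluck string body” are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical subset of action-related concepts. A real ontology would be
# imported from a file rather than listed by hand.
ACTION_TERMS: Set[str] = {
    "pluck string body",
    "pluck string neck",
    "touch string body",
}


@dataclass
class ActivityItem:
    """One described action, with the description items named in the text."""
    action: str
    case: Optional[str] = None
    subject: Optional[str] = None
    obj: Optional[str] = None
    noun: Optional[str] = None
    verb: Optional[str] = None
    details_of_verb: Optional[str] = None
    risk: Optional[str] = None
    instance_of_noun: Optional[str] = None
    line_is_arrow: bool = False  # the Line option: simple line or arrow
    this_key: Optional[str] = None
    jump_key: Optional[str] = None


def select_term(candidate: str, ontology_terms: Set[str]) -> str:
    """Return the term only if it is registered in the ontology."""
    if candidate not in ontology_terms:
        raise ValueError(f"{candidate!r} is not an ontology term")
    return candidate


# Select a registered concept, transcribe it to the Action item,
# and add the described knowledge to the list.
described: List[ActivityItem] = []
described.append(ActivityItem(action=select_term("pluck string body", ACTION_TERMS)))
```

Rejecting unregistered terms at selection time is one way to obtain the terminology control discussed in this section; a tool could equally offer only the registered terms in a picker, as the “select ontology” window does.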
Using kNeXaR, we described the activity knowledge that we improved in
Sect. 3.3 (Fig. 7). The topmost white rectangle presents the purpose of the
knowledge, the red letters indicate methods (guitar rendition), the white rectan-
gles (except the purpose) present actions, and the orange rectangles present the
detailed information described in Instance of noun. The usability of the infor-
mation system depends not only on the functions but also on a design that is
attractive to users. We need to modify the design according to each user’s or
facility’s preference.
The results show that all terms were covered by ontology terms, so the
matching rate reached 100% (Table 4). The number of kinds of action also
decreased, which indicates that the terminology was controlled effectively by
restricting it to ontological terms. Table 5 shows the details of the term changes.
For example, “press string” was modified to “press string body” and “press
string neck” (the latter two are subclasses of the former). “Place the right hand
on the fingerboard side,” which contained multiple pieces of information, was
divided into two: the action was changed to “move to position body” and the
detailed information, “place of action: fingerboard side,” was added. “Place of
action” is an ontological term. We also controlled the verbal representation, for
instance, from “put a finger” to “touch a string,” which express the same state.
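The measures reported in Table 4 amount to simple counting. The sketch below assumes that the matching rate is the share of action descriptions whose term is registered in the ontology, and that “kinds of action” is the number of distinct action terms; both readings are assumptions made for illustration, and the function names and sample lists are hypothetical. Under this reading, the reported 52.6% would correspond to 10 matching actions out of 19.

```python
def matching_rate(actions, ontology_terms):
    """Percentage of action descriptions that are ontology terms."""
    if not actions:
        return 0.0
    matched = sum(1 for action in actions if action in ontology_terms)
    return round(100.0 * matched / len(actions), 1)


def kinds_of_action(actions):
    """Number of distinct action terms."""
    return len(set(actions))


# Hypothetical ontology subset and action lists (not the data of the study).
ontology_terms = {"touch string body", "touch string neck", "press string body"}
before = ["put the palm of right hand", "touch string body", "press string"]
after = ["touch string body", "touch string body", "press string body"]

print(matching_rate(before, ontology_terms))  # 33.3
print(matching_rate(after, ontology_terms))   # 100.0
print(kinds_of_action(before), kinds_of_action(after))  # 3 2

# Consistency check against Table 4: 10 of 19 actions.
print(round(100.0 * 10 / 19, 1))  # 52.6
```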
Table 4. Changes of action in the activity knowledge for “Change the timbre”:
comparison between the result of the domain experts’ improvement (Before) and
the version rewritten using the ontology selection function (After).

                     Before    After
Number of actions         19
Matching rate        52.6%     100.0%
Kinds of action      10        8
Table 5. Details of the term changes.

Before                               After
Action                               Action                                 Detailed information
place the right hand on neck side    change position body                   place of action: neck side
pluck string                         pluck string body, pluck string neck
place the right hand on bridge side  change position body                   place of action: bridge side
touch string                         touch string body, touch string neck
put the palm of right hand           touch string body                      part of hand or finger: palm
put the little finger of right hand  touch string body                      used finger: little finger
press string                         press string body, press string neck
touch string with left hand          touch string neck
touch string with right hand         touch string body
release the touched finger           release finger from the string,
                                     release finger from the string body
5 Conclusion
In this study, we have presented a process for developing domain knowledge
using two types of knowledge representation: activity knowledge and a domain
ontology. We carried out an ontology-based development of activity knowledge
in detail for musical instrument performance. Through the improvement of the
activity knowledge by domain experts based on the Guitar Rendition Ontology,
we made the following observations: (1) the discovery of items in a domain,
(2) the improvement of knowledge representation, and (3) deep understanding
on the part of domain experts. Furthermore, we developed a system named kNeXaR
(kNowledge eXplication AugmenteR) to help describe activity knowledge based
on ontological terms, and demonstrated that kNeXaR can control the terms. In
future work, we will test the process in different domains using kNeXaR.
References
1. Westerinen, A., Tauber, R.: Ontology development by domain experts (without using
the “O” word). Appl. Ontol. 12, 299–311 (2017)
2. Bada, M., et al.: A short study on the success of the gene ontology. Web Semant.
Sci. Serv. Agents World Wide Web 1(2), 235–240 (2004)
3. Halilaj, L., et al.: VoCol: an integrated environment to support version-controlled
vocabulary development. In: Blomqvist, E., Ciancarini, P., Poggi, F., Vitali, F.
(eds.) EKAW 2016. LNCS (LNAI), vol. 10024, pp. 303–319. Springer, Cham (2016).
https://doi.org/10.1007/978-3-319-49004-5_20
4. Hogan, A.: Skolemising blank nodes while preserving isomorphism. In: Proceedings
of the 24th International Conference on World Wide Web, International World
Wide Web Conferences Steering Committee, pp. 430–440 (2015)
5. Iino, N., Nishimura, S., Fukuda, K., Watanabe, K., Jokinen, K., Nishimura, T.:
Development and use of an activity model based on structured knowledge - a music
teaching support system. In: IEEE International Conference on Data Mining, The
5th International Workshop on the Market of Data (2017)
6. Iino, N., Nishimura, S., Nishimura, T., Fukuda, K., Takeda, H.: The guitar ren-
dition ontology for teaching and learning support. In: IEEE 13th International
Conference on Semantic Computing (2019). https://doi.org/10.1109/icosc.2019.8665532
7. Joo, S., Koide, S., Takeda, H., Horyu, D., Takezaki, A., Yoshida, T.: Agriculture
activity ontology: an ontology for core vocabulary of agriculture activity. In: The
15th International Semantic Web Conference (2016)
8. Joo, S., Koide, S., Takeda, H., Horyu, D., Takezaki, A., Yoshida, T.: A building
model for domain knowledge graph based on agricultural knowledge graph, SIG-
SWO-047-10 (2019)
9. Jupp, S., et al.: Populous: a tool for building OWL ontologies from templates.
BMC Bioinform. 13(Supplement 1) (2012)
10. Kamruzzaman, S.Md., Krisnadhi, A., Hitzler, P.: OWLAx: a protégé plugin to sup-
port ontology axiomatization through diagramming. In: 15th International Seman-
tic Web Conference (2016)
11. Kitamura, Y., Koji, Y., Mizoguchi, R.: An ontological model of device function:
industrial deployment and lessons learned. Appl. Ontol. 1(3–4), 237–262 (2006)
12. Kolozali, S., Barthet, M., Fazekas, G., Sandler, M.: Knowledge representation issues
in musical instrument ontology design. In: 12th International Society for Music
Information Retrieval Conference, pp. 465–470 (2011)
13. Lisena, P., et al.: Controlled vocabularies for music metadata. In: Proceedings of
the 19th ISMIR Conference, pp. 424–430 (2018)
14. Lohmann, S., Negru, S., Haag, F., Ertl, T.: Visualizing ontologies with VOWL.
Semant. Web 7(4), 399–419 (2016)
15. Mallea, A., Arenas, M., Hogan, A., Polleres, A.: On blank nodes. In: Aroyo, L.,
et al. (eds.) ISWC 2011. LNCS, vol. 7031, pp. 421–437. Springer, Heidelberg (2011).
https://doi.org/10.1007/978-3-642-25073-6_27
16. Nishimura, S., et al.: CHARM as activity model to share knowledge and transmit
activity knowledge and its application to nursing guidelines integration. J. Adv.
Comput. Intell. Intell. Inform. 17(2), 208–220 (2013)
17. Rashid, S.M., McGuinness, D.L., Roure, D.D.: A music theory ontology. In: Inter-
national Workshop on Semantic Applications for Audio and Music (2018)
18. Stellato, A., et al.: VocBench: a web application for collaborative development of
multilingual thesauri. In: Gandon, F., Sabou, M., Sack, H., d’Amato, C., Cudré-
Mauroux, P., Zimmermann, A. (eds.) ESWC 2015. LNCS, vol. 9088, pp. 38–53.
Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18818-8_3
19. Tudorache, T., Nyulas, C., Noy, N.F., Musen, M.A.: WebProtégé: a collaborative
ontology editor and knowledge acquisition tool for the web. Semant. Web 4(1),
89–99 (2013). https://doi.org/10.3233/SW-2012-0057
20. Uschold, M., Gruninger, M.: Ontologies: principles, methods and applications.
Knowl. Eng. Rev. 11(2), 93–136 (1996)
21. Raimond, Y., Abdallah, S., Sandler, M., Giasson, F.: The music ontology. In: Pro-
ceedings of the International Conference on Music Information Retrieval, pp. 417–
422 (2007)
Author Index