Finding Out Noisy Patterns For Relation Extraction of Bangla Sentences
1, February 2020
ABSTRACT
Relation extraction is one of the most important tasks in natural language processing. It is the process
of extracting relationships from text. The extracted relationships hold between two or more entities of
certain types, and these relations may be expressed through different patterns. The goal of this paper is
to find the noisy patterns for relation extraction from Bangla sentences. The work requires seed tuples,
each containing two entities and the relation between them. Seed tuples can be obtained from Freebase, a
large collaborative knowledge base of general, structured information for public use. However, no
Freebase is available for the Bangla language, so we built a Bangla Freebase; this was the main challenge
of the work, and the resource can be used for other NLP tasks. We then identified the noisy patterns for
relation extraction by measuring a conflict score.
KEYWORDS
Natural Language Processing, Relation Extraction, Bangla, Conflict Score, Noisy Pattern
1. INTRODUCTION
Natural language processing (NLP) is a branch of artificial intelligence concerned with the
interaction between humans and computers through human language. Its goal is to bridge the gap
between human communication and computer understanding. Relation extraction (RE) is a
fundamental topic of NLP: it is the task of finding semantic relationships between pairs of
entities. Relation extraction is essential for many well-known tasks such as knowledge base
completion, question answering, medical science and ontology construction [1]. A great deal of
unstructured electronic text is available on the web, such as newspapers, articles, journals,
blogs, and government and private documents. Unstructured text can be turned into structured
data by annotating it with semantic information.
Entities can be, for example, persons, organizations or locations. We have to identify the types of
the entities in a sentence. A relation is defined in the form of a tuple t = (e1, e2, ..., en), where
the ei are entities in a predefined relation r within a document D. Most relation extraction systems
focus on binary relations. Examples of binary relations include born-in(Ruma, Dhaka) and
father-of(John David, Eric David) [8]. Relation extraction methods include supervised, distantly
supervised and unsupervised methods.
In the supervised method, sentences in a corpus are first hand-labeled for the presence of entities and
the relations between them. Systems built for Automatic Content Extraction (ACE) then extract lexical,
syntactic and semantic features and use supervised classifiers to label the relation between a
given pair of entities in a test-set sentence. Labeled training data is expensive to produce and thus
limited in quantity. The distantly supervised method instead aligns texts to a given knowledge base
(KB) and uses the alignment to learn a relation extractor [3]. Such methods use large structured data
sources (such as Freebase) as the distant supervision signal. Because they do not need a hand-labeled
dataset and KBs have grown quickly in recent years, they are more efficient.
DOI: 10.5121/ijnlc.2020.9102
International Journal on Natural Language Computing (IJNLC) Vol.9, No.1, February 2020
In our work, we tried to find the noisy patterns for distantly supervised relation extraction from
Bangla sentences. This method needs seed tuples, which can be obtained from a knowledge base such as
Freebase. But no Freebase is available in Bangla, so we built a Bangla Freebase which contains a
large number of relations.
2. PREVIOUS STUDY
Distant supervision is an efficient method for scaling relation extraction to very large corpora
that contain many relations. The authors of [2] proposed a sentence-level attention model to select
the valid instances, which makes full use of the supervision information from knowledge bases. They
also extracted entity descriptions from Freebase and Wikipedia pages to supply background knowledge
for their task. The background knowledge not only provides more information for predicting
relations, but also brings better entity representations for the attention module. They conducted
three experiments on a widely used dataset, and the experimental results showed that their approach
significantly outperforms all the baseline systems [2].
Modern models of relation extraction for tasks like ACE are based on supervised learning of
relations from small hand-labeled corpora. The authors of [3] used a paradigm that does not require
labeled corpora. This paradigm avoids the domain dependence of ACE-style algorithms and allows the
use of corpora of any size. Their experiments used Freebase, a large semantic database of several
thousand relations, to provide distant supervision. For each pair of entities that appears in some
Freebase relation, they found all sentences containing those entities in a large unlabeled corpus
and extracted textual features to train a relation classifier. Their algorithm combines the
advantages of supervised IE and unsupervised IE [3].
Much work has been done on relation extraction for English. In this work, we address relation
extraction from Bangla sentences, on which little research has been done, so it should benefit
research on this language. This is the novelty of our work.
3. CREATING BANGLA FREEBASE
Freebase is a large collaborative knowledge base of general, structured information for public use.
Its structured data was harvested from many sources, including individual, user-submitted wiki
contributions. Its aim is to create a global resource so that people (and machines) can access
common information more effectively [9]. It is available in English.
In Freebase, a triple has the form (e1, r, e2), where e1 and e2 are the two entities and r is the
relation between them. Relations found in a known knowledge graph (KG) can thus be used to generate
a large amount of data [4]. As mentioned earlier, we created our own Bangla Freebase, which contains
a large number of relations, with the help of the Wikidata Query Service and the SPARQL query
language. It is a large knowledge base. The number of Bangla articles on the internet is growing day
by day, so it has become necessary to have a structured data store in Bangla. Our Bangla Freebase
consists of different types of concepts (topics) and relationships between those topics. These cover
areas such as popular culture (e.g. films, music, books, sports), location information (restaurants,
locations, businesses), scholarly information (linguistics, biology, astronomy), birthplaces (of
poets, politicians, actors, actresses) and general knowledge (Wikipedia). We collected more than 100
relations according to our needs. Using SPARQL queries, anyone can retrieve the relations they
require. This knowledge base should therefore be helpful for relation extraction and other NLP
(Natural Language Processing) work on the Bangla language.
readable name, each item has a list of labels in each language associated with it. So we will see that
the English (en) label at Q1490 is "Tokyo", and that the item also has corresponding labels for
Japanese (ja), Bangla (bn) and so on.
Item | Name | Occupation | Country
http://www.wikidata.org/entity/Q4665322 | (Abdul Gaffar Chowdhury) | (Poet) | (Bangladesh)
http://www.wikidata.org/entity/Q4665454 | (Abdul Quadir) | (Poet) | (Bangladesh)
http://www.wikidata.org/entity/Q4667573 | (Abid Azad) | (Poet) | (Bangladesh)
http://www.wikidata.org/entity/Q4670213 | (Abu Hena Mustafa Kamal) | (Poet) | (Bangladesh)
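The kind of Wikidata query described in this section can be sketched in Python as follows. This is a minimal illustration, not the queries actually used to build the Bangla Freebase: the function names `build_query`, `run_query` and `to_seed_tuples` are hypothetical, and the identifiers are assumptions chosen to match the rows above (P106 for occupation, P27 for country of citizenship, Q49757 for poet, Q902 for Bangladesh). The label service is asked for Bangla labels first, falling back to English.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed identifiers: P106 = occupation, P27 = country of citizenship,
# Q49757 = poet, Q902 = Bangladesh.
QUERY_TEMPLATE = """
SELECT ?person ?personLabel WHERE {{
  ?person wdt:P106 wd:{occupation} ;
          wdt:P27  wd:{country} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "bn,en". }}
}}
LIMIT {limit}
"""


def build_query(occupation="Q49757", country="Q902", limit=100):
    """Return a SPARQL query for people with the given occupation and country."""
    return QUERY_TEMPLATE.format(
        occupation=occupation, country=country, limit=limit
    ).strip()


def run_query(query):
    """Send the query to the Wikidata Query Service (needs network access)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "bangla-freebase-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def to_seed_tuples(response, relation, e2):
    """Convert a SPARQL JSON response into (e1, r, e2) seed tuples."""
    tuples = []
    for row in response["results"]["bindings"]:
        label = row.get("personLabel", {}).get("value")
        if label:
            tuples.append((label, relation, e2))
    return tuples
```

Calling `to_seed_tuples(run_query(build_query()), "occupation", "Poet")` would then yield triples in the (e1, r, e2) form described above, one per returned item.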
4. METHODOLOGY
Relation extraction is highly significant in NLP-based work, and there are many methods for RE.
In our work, we use distant supervision for relation extraction. In distant supervision, an existing
knowledge base, such as Freebase, is used to gather examples for the relations we want to extract;
these examples form the training data. For example, Freebase contains the fact that Paris is the
capital of France. We then label each pair of "France" and "Paris" that appear in the same sentence
as a positive example for the "capital_of_the_country" relation. In this way a large amount of
(possibly noisy) training data can be generated. In this research we needed seed tuples, which we
collected from the Bangla Freebase we built. Each seed tuple contains two entities and their
relation. The entities may be of different types, such as person, organization, location or film.
We then extracted features from the sentences in a large corpus that contain those entities. Our
goal is thus to extract the relation between two entities from sentences in triple format and map
the triple elements to those existing in a knowledge base [5]. After that, we decided whether the
extracted features are valid or not for each relation by measuring the conflict score.
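The labeling step described above can be sketched as follows. This is a minimal illustration under stated assumptions: entities are matched by plain substring search, and a "pattern" is taken to be the text between the two entities, which is a simplification rather than the feature extraction actually used in this work. The function names and the example sentences are hypothetical.

```python
def label_sentences(seed_tuples, sentences):
    """Label every sentence that mentions both entities of a seed tuple."""
    labeled = []
    for sentence in sentences:
        for e1, relation, e2 in seed_tuples:
            if e1 in sentence and e2 in sentence:
                labeled.append((sentence, e1, e2, relation))
    return labeled


def pattern_between(sentence, e1, e2):
    """Return the text between the two entities, or None if one is missing."""
    i, j = sentence.find(e1), sentence.find(e2)
    if i == -1 or j == -1:
        return None
    if i < j:
        return sentence[i + len(e1):j].strip()
    return sentence[j + len(e2):i].strip()


seeds = [("Paris", "capital_of_the_country", "France")]
corpus = [
    "Paris is the capital of France.",
    "Paris hosted a film festival in France last year.",  # noisy match
    "Dhaka is the capital of Bangladesh.",
]
examples = label_sentences(seeds, corpus)
patterns = [pattern_between(s, e1, e2) for s, e1, e2, _ in examples]
```

The second sentence is labeled as a positive example even though it does not express the relation, which is the kind of noise the conflict score is meant to filter out.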
4.1. Named Entity Recognition
For our work, we had to identify the entities in each sentence. For entity identification, we used
word-level features (e.g., token, prefix, suffix) and list-lookup features (e.g., gazetteers). The
gazetteers include names of countries, major cities, common personal names, organization names, etc.
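A gazetteer lookup combined with simple word-level features might look like the sketch below. The gazetteer entries, the prefix and suffix lengths, and the prefix-matching rule (used here so that an inflected form such as "ঢাকায়" still matches "ঢাকা") are all assumptions made for illustration; they are not the feature set used in this work.

```python
# Hypothetical gazetteer: surface form -> entity type.
GAZETTEER = {
    "ঢাকা": "location",      # Dhaka
    "বাংলাদেশ": "location",  # Bangladesh
    "রুমা": "person",         # Ruma
}


def token_features(token):
    """Word-level features; slices count Unicode code points, not syllables."""
    return {"token": token, "prefix": token[:3], "suffix": token[-2:]}


def tag_entities(sentence):
    """Return (token, gazetteer entry, entity type) for each matched token."""
    entities = []
    for token in sentence.split():
        word = token.strip("।,.")
        for name, entity_type in GAZETTEER.items():
            # Prefix match lets inflected forms map to the base name.
            if word.startswith(name):
                entities.append((word, name, entity_type))
                break
    return entities


tagged = tag_entities("রুমা ঢাকায় জন্মগ্রহণ করেন।")  # "Ruma was born in Dhaka."
```

Prefix matching can over-generate on short names, so a practical system would likely combine the lookup with the other word-level features rather than rely on it alone.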
For each pair of entities in a seed tuple that appears in some Freebase relation, we used
Wikipedia because it is relatively up to date. We found all sentences containing those entities
in Wikipedia or a large unlabeled corpus and collected them. Then we processed them and
extracted textual features. A part of our corpus is shown below:
The threshold value is fixed at 0.3. If the conflict score of a pattern is less than or equal to the
threshold, it is a valid pattern; otherwise the pattern is invalid, or noisy, for the relation. For
person and organization entities we take three relations. A person's workplace and birthplace are
not necessarily the same, which helps us find the conflicting patterns.
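The thresholding rule can be sketched as follows. The formula is an assumption: the score is taken to be the number of occurrences with conflict divided by the number of valid occurrences, which is consistent with table rows such as 1/20 = 0.05, 2/16 = 0.125 and 9/1 = 9. The function names are hypothetical.

```python
THRESHOLD = 0.3  # threshold value stated in the text


def conflict_score(conflicts, valid):
    """Assumed definition: conflicting occurrences per valid occurrence."""
    if valid == 0:
        return float("inf")  # a pattern that is never valid is always noisy
    return conflicts / valid


def classify(conflicts, valid, threshold=THRESHOLD):
    """Return "Valid" if the score is at or below the threshold, else "Invalid"."""
    return "Valid" if conflict_score(conflicts, valid) <= threshold else "Invalid"


# (pattern gloss, occurrences with conflict, valid occurrences)
rows = [("was born in/at", 1, 20), ("works at", 9, 1)]
results = {pattern: classify(c, v) for pattern, c, v in rows}
```

Under this assumption a pattern becomes noisy once it conflicts in more than 30% as many sentences as it is valid in.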
No. | Patterns we get | The number of patterns with conflict | The number of valid patterns | Conflict Score | Valid or Invalid
1. | (was born in/at) | 1 | 20 | 0.05 | Valid
2. | (was born in/at) | 2 | 17 | 0.18 | Valid
3. | (works at) | 9 | 1 | 9 | Invalid
So, the valid patterns are: (was born in), (is born in), (born at/in), (the birthplace is). Other
patterns, such as (works at), are noisy patterns for this relation, so the sentences containing
them will be removed.
Relation 2: Here, the relation is "place_of_work". The entities are person and location. The
conflict scores of different patterns for this relation are given below.
No. | Patterns we get | The number of patterns with conflict | The number of valid patterns | Conflict Score | Valid or Invalid
1. | (works/work at) | 2 | 16 | 0.125 | Valid
2. | (had been working) | 0 | 31 | 0 | Valid
3. | (works/work at) | 2 | 21 | 0.01 | Valid
4. | (was/were born in) | 12 | 1 | 12 | Invalid
5. | (has/have gone to travel) | 5 | 2 | 2.5 | Invalid
6. | (works/work at) | 1 | 19 | 0.05 | Valid
7. | (has been appointed to the work) | 3 | 18 | 0.16 | Valid
8. | (arranged the party) | 5 | 3 | 1.67 | Invalid
So, the valid patterns are: (works at), (had been working), (works/work at), (works at), (has been
appointed to the work). The other patterns, (was/were born in), (has/have gone to travel) and
(arranged the party), are noisy patterns.
Relation 3: Here, the relation is "living-place". The entities are person and location. The
conflict scores of different patterns for this relation are given below:
Table 6. Valid pattern identification for (living-place) relation
No. | Patterns we get | The number of patterns with conflict | The number of valid patterns | Conflict Score | Valid or Invalid
1. | (lives/live in) | 4 | 20 | 0.2 | Valid
2. | (has been living) | 2 | 11 | 0.11 | Valid
3. | (works at) | 5 | 3 | 1.67 | Invalid
4. | (are the permanent resident) | 1 | 14 | 0.08 | Valid
5. | (has gone to travel) | 3 | 1 | 3 | Invalid
6. | (works at) | 4 | 2 | 2 | Invalid
No. | Patterns we get | The number of patterns with conflict | The number of valid patterns | Conflict Score | Valid or Invalid
1. | (has directed) | 4 | 17 | 0.24 | Valid
2. | (has produced) | 6 | 7 | 0.86 | Invalid
3. | (directed) | 1 | 20 | 0.05 | Valid
4. | (has directed) | 4 | 22 | 0.19 | Valid
5. | (film director is) | 1 | 9 | 0.11 | Valid
6. | (has acted) | 7 | 4 | 1.75 | Invalid
So, the valid patterns are: (has directed), (directed), (has directed), (the film director is).
Relation 5: Here, the relation is "film_producer". The entities are person and film. The conflict
scores of different patterns for this relation are given below:
Table 8. Valid pattern identification for the (film-producer) relation.
No. | Patterns we get | The number of patterns with conflict | The number of valid patterns | Conflict Score | Valid or Invalid
1. | (has produced) | 1 | 10 | 0.1 | Valid
2. | (has acted) | 5 | 1 | 5 | Invalid
3. | (directed) | 10 | 2 | 5 | Invalid
6. FUTURE WORK
In this research work, we worked on relation extraction from Bangla sentences. RE is used in
information extraction. As the number of Bengali articles on the Web is increasing, this work is
significant for research on the Bangla language, which is a very rich language. In future, we will
build a classifier in which the noisy patterns for any relation are removed.
7. CONCLUSIONS
Relation extraction is a fundamental topic in NLP. In this work, we built a Freebase for Bangla, a
large collection of structured data, using the Wikidata Query Service. It will be helpful for
further research, such as natural language processing of Bangla, where researchers need seed
tuples. Researchers in areas such as entity extraction and reconciliation, data mining, the Semantic
Web, information retrieval, and ontology creation and analysis can also use it to support their
work. With the help of this Freebase, we obtained our seed tuples. We worked on Bangla sentences for
relation extraction using distant supervision, and then identified the noisy patterns using the
conflict score.
ACKNOWLEDGEMENTS
We are thankful to the Department of Computer Science & Engineering, Jahangirnagar
University.
REFERENCES
[1] Liu, Liyuan & Ren, Xiang & Zhu, Qi & Zhi, Shi & Gui, Huan & Ji, Heng & Han, Jiawei.
“Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach.” arXiv
preprint arXiv:1707.00166 [cs.CL] (2017).
[2] Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, (2017). “Distant Supervision for Relation Extraction with
Sentence-Level Attention and Entity Descriptions”, Semantic Scholar.
[3] Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky, “Distant supervision for relation extraction
without labeled data”
[4] Wang, Guanying & Zhang, Wen & Wang, Ruoxu & Zhou, Yalin & Chen, Xi & Zhang, Wei & Zhu, Hai
& Chen, Huajun.“Label-Free Distant Supervision for Relation Extraction via Knowledge Graph
Embedding.” In proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing (2018).
[5] Bayu Distiawan Trisedya, Gerhard Weikum, Jianzhong Qi, et al. “Neural Relation Extraction for
Knowledge Base Enrichment”. In proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics (July, 2019).
[6] F. H. Marc Weise, Steffen Lohmann, “Ld-vowl: Extracting and visualizing schema information for
linked data”.
[7] D. Hernández, A. Hogan, and M. Krötzsch, (2015). “Reifying RDF: what works well with wikidata?”
Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems,
vol. 1457 of CEUR Workshop Proceedings, pp. 32–47. CEUR-WS.org, 2015.
[8] Nguyen Bach, Sameer Badaskar. “A Review of Relation Extraction.” Semantic Scholar.
[9] K. D. Bollacker, P. Tufts, T. Pierce, and R. Cook, (2007). “A platform for scalable, collaborative,
structured information integration.”
[10] Zeng, D. & Dai, Y. & Li, F. & Sherratt, R.S. & Wang, J. “Adversarial learning for distant supervised
relation extraction.” Computers, Materials and Continua (2018).
[11] Wang, Dongsheng & Tiwari, Prayag & Garg, Sahil & Zhu, Hongyin & Bruza, Peter. (2019).
“Structural block driven - enhanced convolutional neural representation for relation extraction.”
Applied Soft Computing. 86. 105913. 10.1016/j.asoc.2019.105913.
Authors
Rukaiya Habib is an M.Sc. student of Computer Science and Engineering at Jahangirnagar
University, Savar, Dhaka, Bangladesh. She completed her B.Sc., also in Computer Science and
Engineering, at Jahangirnagar University in 2017. She is interested in research on Natural
Language Processing, Mobile Ad hoc Networks and Computer Vision.
Md. Musfique Anwar was awarded a PhD degree by the Department of Computer Science and Software
Engineering, Faculty of Science, Engineering and Technology, Swinburne University of Technology,
Melbourne, Australia in 2018. He received an M.Sc. degree from the Department of Intelligence
Science and Technology, Graduate School of Informatics, Kyoto University, Japan in 2013 and a B.Sc.
degree in Computer Science and Engineering from Jahangirnagar University, Savar, Dhaka, Bangladesh
in 2006. He has been a faculty member since 2008 and is currently an Associate Professor in the
Department of Computer Science and Engineering of Jahangirnagar University, Savar, Dhaka,
Bangladesh. Currently, his research focuses on Data Mining, Social Network Analysis, Natural
Language Processing and Software Engineering. He received the Best Student Paper award at the 29th
Australasian Database Conference (ADC) in 2018, the Best Poster award at the 26th Australasian
Database Conference (ADC) in 2015 and the Best Poster Paper award at the International Workshop on
Computer Vision and Intelligent Systems-2019 (IWCVIS2019) in 2019.