Seminar Article1


Ambo University

Institute Of Technology

Hachalu Hundessa Campus

School Of Informatics And Electrical Engineering

Department Of Information Technology

An Article Review On

Building Wordnet For Afaan Oromoo

Submitted By

Aster Olkeba…………………………UGR/51469/13

Efrata teriesa………………...……...UGR/51121/13

Meti Daniel…………………………..UGR/51550/13

Rufael Yosef…………………………UGR/51636/13
ABSTRACT
This study presents a novel combined approach for capturing both semantic and structural relationships
within Afaan Oromo, an under-resourced language, by integrating the extraction of synonym relations
from Word2Vec and hyponym relations from Lexico-Syntactic Patterns (LSP). The research aims to
enhance the understanding of linguistic relationships in Afaan Oromo and improve the performance of
natural language processing (NLP) applications tailored to the language.

The experimental results demonstrate the effectiveness of the combined approach in capturing synonym
and hyponym relations within Afaan Oromo text data. Evaluation metrics for Word2Vec embeddings and
the extraction of hyponym relations from LSP provide quantitative measures of performance, showcasing
the robustness of the combined approach. Furthermore, the integration of these relations enriches the
understanding of semantic and structural relationships within the language, as evidenced by the
application of the extracted relations in NLP tasks such as word sense disambiguation, semantic similarity
assessment, and knowledge base construction.

The comparative analysis highlights the advantages of the combined approach over baseline methods,
emphasizing the effectiveness of leveraging Word2Vec and LSP for Afaan Oromo. Additionally,
qualitative examples of synonym and hyponym relations extracted from Afaan Oromo text data provide
further insights into the depth and richness of the linguistic relationships captured through this approach.

Overall, the combined extraction of synonym and hyponym relations from Word2Vec and LSP offers a
robust framework for advancing the understanding of Afaan Oromo, contributing to the development of
more accurate and linguistically informed NLP applications, and facilitating broader efforts in language
processing and resource development for under-resourced languages. This approach has the potential to
enhance language technology and linguistic research for Afaan Oromo and other languages with limited
linguistic resources.

Table of Contents
Abstract
List of Figures
1.Introduction
1.1.Background
1.2. Afaan Oromoo language
1.3. Statement Of The Problem
1.4.Review of Literature
2.Objectives Of The Article
2.1.General Objective
2.2.Specific Objectives
2.3 Materials and methods
2.3.1.Development Tools and Techniques
2.3.2. Word Embedding (Word2vec)
2.3.3.Lexical syntactic patterns (LSP)
2.3.4. Extraction of synonym relations from Word2Vec and hyponym relations from LSP
2.3.5.Experimental Results
3. Conclusion

LIST OF FIGURES
Figure 1: The system architecture of the automatic Afaan Oromoo WordNet generation system
Figure 2: Python program to calculate the similarity of two Afaan Oromoo words

1.Introduction

1.1.Background
According to Wales and Sanger (2017), WordNet is a lexical database for natural languages. Miller et al.
(August 1993) describe it as an online lexical reference system whose design was influenced by
contemporary psycholinguistic ideas of human lexical memory. It organizes the words of a particular
language into sets of synonyms called synsets, gives brief, general definitions, and records the different
semantic links between these synsets. Its two goals are (1) to combine a dictionary and a thesaurus in a
more easily usable form and (2) to support AI and automatic text analysis applications. Princeton
University produced WordNet, a manually built electronic lexical database for English, in 1986, and it is
still being worked on there (Fellbaum, 2006). The psycholinguist George A. Miller was motivated by
artificial intelligence researchers' experiments exploring human semantic recall (Collins & Quillian,
1968) to test the underlying assumptions on a large scale (Miller et al., August 1993).

In a communication aid module for retrieving full-text messages, WordNet has been utilized as a
comprehensive semantic lexicon. This allows for the creation of more expansive searches through
keyword design. In order to provide users with well-organized and integrated access to information,
WordNet has begun to be used as a language knowledge tool. This integration has become increasingly
important with the development of multiple database access systems, and WordNet's ability to identify
and interpret semantic equivalents is particularly helpful in this regard (Morato et al., 2003).

Major search engines, IR research projects, numerous natural language processing (NLP) applications,
and numerous artificial intelligence (AI) domains have long used the collection of words found in WordNet
(Fishkin, 2005). However, applications often do not understand the meaning of the terms they process.
WordNet can retrieve the following information for a particular word or phrase (Wales & Sanger, 2017):

• Synonyms: terms with the same or similar meanings.
• Hypernyms: the general term used to describe a class of particulars (e.g., soil is a sort of land).
• Hyponyms: members of a class of terms (e.g., clay is a kind of soil).
• Holonyms: the names of wholes of which other words are parts (e.g., nutrients are a part of the soil).
• Meronyms: parts of a holonym (e.g., soil is a part of the ground).

This study aims to develop a WordNet for Afaan Oromoo to support NLP applications for the language.
Because no WordNet exists for Afaan Oromoo, the system extracts resources such as synonyms and
hyponyms from Afaan Oromoo texts.

1.2. Afaan Oromoo language


The Afaan Oromoo language is an Afro-Asiatic macrolanguage comprising several varieties, including
Southern Oromo, Orma, Sakuye, Munyo, Waata/Sanye, and West Central Oromo. Oromo is a dialect
continuum, with differences accumulating over distance. It is a Cushitic language spoken by more than 50
million people in Somalia, Kenya, Ethiopia, and Egypt and is the third-largest language in Africa. Oromo
is also spoken outside Ethiopia: in the United States, Australia, Canada, and various European cities,
communities speak the language and teach it to their children, and interested foreigners take Oromo
classes. In Oromia, it has high status and is an official language. It has its own writing system, which
uses the Latin script. Its oral tradition is very rich, and there is now a substantial body of literary
works written in Afaan Oromoo, along with modern arts such as music and folk arts. Oromo people speak
Afaan Oromoo, as well as Amharic, Tigrinya, Guragegna and Omotic languages. They are mainly Muslim and
Christian, while only 3% still follow the customary religion based on the worship of the god Waaqa.
Oromo are mainly farmers and cattle herders. They have distinguished themselves throughout history for
their strong military organization (Erena, 2017).

1.3. Statement Of The Problem


So, what were the challenges the author faced in this linguistic adventure?

• Cracking the Synonym and Hyponym Code: Afaan Oromoo poses a particular challenge because
words can mean different things. The author's mission was to identify the synonyms and
hyponyms, making sure words were on the same page for natural language processing and AI
applications.
• Choosing the Right Word-Flavor Algorithm: Imagine standing in a spice bazaar, trying to pick
the best spices for your dish. The author had to decide between continuous bag-of-words (CBOW)
and Skip Gram, looking not just at accuracy but also at the speed at which these algorithms
worked on Afaan Oromoo.
• Unlocking the Power of Lexico-Syntactic Patterns: Biru wasn't satisfied with just one trick and
wanted to see whether Lexico-Syntactic Patterns could play nice with Word2Vec. It's like mixing
your favorite ingredients to create a magical recipe, in this case a recipe for better precision,
recall, and F-measure.
• Dancing Around Afaan Oromoo's Unique Style: Writing in Afaan Oromoo is an art, full of
expressive noun phrases. Biru had to waltz through this unique writing style, knowing it could
add a layer of complexity to the whole process.
• Dodging Typos and Spelling Hiccups: It's like walking through a linguistic minefield. Biru
understood that typos and small errors could sneak in and disrupt the harmony, and so kept an
eye out for them.

In a nutshell, the author's approach was like crafting a linguistic masterpiece: a blend of
algorithmic finesse, a dash of linguistic understanding, and a sprinkle of proactive problem-solving.

1.4.Review of Literature
The literature review will explore Afaan Oromoo's lexical resources and linguistic research, utilizing
academic publications, databases, and cultural documentation. Dialectal variations will be accounted for,
and limited linguistic research will be addressed. Consistency and accuracy in representing Afaan
Oromoo in the chosen script will be ensured, especially considering the transition from Ge'ez to Qubee
script. Unique semantic nuances and culturally specific concepts will be accurately represented.

2.Objectives Of The Article

2.1.General Objective
The general objective is to develop a robust and culturally sensitive Afaan Oromoo WordNet that
supports language learning, natural language processing, and the preservation of the language's
unique identity.

2.2.Specific Objectives
• Data from various sources, including linguistic experts, native speakers, and existing linguistic
resources, will be compiled to create a comprehensive lexical database for Afaan Oromoo.
• Dialectal variations in Afaan Oromoo will be incorporated into the WordNet while maintaining
coherence, addressing the challenges posed by regional spoken forms.
• Existing linguistic research and documentation for Afaan Oromoo will be systematically analyzed
and utilized to enhance the development of the WordNet.
• Consistency and accuracy in representing Afaan Oromoo in the chosen script will be ensured
through careful consideration of the language's transition from the Ge'ez script to the Qubee
script.
• Unique semantic nuances and culturally specific concepts within Afaan Oromoo will be
accurately represented in the WordNet to preserve the language's rich cultural heritage.

2.3 Materials and methods


This study uses merged and distributional semantics approaches to build the WordNet: a word
embedding model generates related terms from Afaan Oromoo documents.

The system architecture is depicted in Figure 1. The architecture is modularized into sub-components
covering preprocessing tasks (tokenization, stop-word and number removal), application of
lexico-syntactic patterns, co-occurrence manipulation, and similarity computation. In addition, the
architecture includes text operations (i.e., lexical analysis and stop-word elimination), extraction of
synonym relations by word embedding, and extraction of hyponym relations by lexico-syntactic patterns,
with and without the word embedding component. The system accepts document collections and applies
text operations to the documents to generate the nearest neighbors of a term. These nearest neighbors
are the automatically generated WordNet terms for that particular term. The system also applies text
operations to the document collections to retrieve the hyponym relations in the texts that match the
created patterns.
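
The preprocessing steps named above (tokenization, stop-word removal, and number removal) can be sketched in a few lines of Python. This is only a minimal illustration and not the study's implementation: the function name `preprocess` and the tiny stop-word list are assumptions made for the example, since the article does not give the actual list.

```python
import re

# Hypothetical stop-word list for illustration only; the study's list is not given.
STOP_WORDS = {"fi", "kan", "akka", "keessa", "irraa"}

def preprocess(text):
    """Lowercase and tokenize the text, then drop numbers and stop words."""
    # A token is a run of letters, optionally containing apostrophes
    # (as in "bu'aa"), or a run of digits.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*|\d+", text.lower())
    return [t for t in tokens if not t.isdigit() and t not in STOP_WORDS]

print(preprocess("Boqqolloo fi qamadii 2 bitame"))
```

The resulting token lists would then be passed on to the word embedding and pattern-matching components.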

Figure 1: The system architecture of the automatic Afaan Oromoo WordNet generation system.

2.3.1.Development Tools and Techniques


No large, standard corpus is available for the Afaan Oromoo language, because standards have not been
developed and corpus preparation is time-consuming. A small corpus (2746 KB) with 680 pages and
387,224 tokens was therefore collected from various news and media sources, including VOA, Bariisaa,
the Holy Bible, and online educational resources. The corpus was used to derive lexical syntactic
patterns and to extract hyponym relations from Afaan Oromoo texts for the WordNet. The Python
programming language was used to implement word embedding, as shown in Figure 2.

Figure 2: Python program to calculate the similarity of two Afaan Oromoo words
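
The figure itself is not reproduced in this text. As a rough idea of what such a program computes, the sketch below measures the similarity of two words as the cosine of the angle between their vectors. It assumes cosine similarity is the measure used, which the article does not state, and the three-dimensional vectors are made-up toy values rather than trained embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors invented for illustration; real vectors would come from a trained model.
vectors = {
    "mana": [0.9, 0.1, 0.3],
    "gamoo": [0.8, 0.2, 0.4],
    "bishaan": [0.1, 0.9, 0.2],
}

print(cosine_similarity(vectors["mana"], vectors["gamoo"]))
print(cosine_similarity(vectors["mana"], vectors["bishaan"]))
```

With these toy values the first pair scores higher than the second, which is the behavior expected of words that occur in similar contexts.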

2.3.2. Word Embedding (Word2vec)


Word embedding is a language modeling and feature learning technique in Natural Language
Processing (NLP) that maps words or phrases to vectors of real numbers; Word2Vec is one widely
used family of word embedding models. It uses two-layer neural networks to reconstruct the
linguistic contexts of words, creating a vector space in which words that share contexts lie in
close proximity. Two model architectures, continuous skip-gram and continuous bag-of-words
(CBOW), are used to create a distributed representation of words.
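
The difference between the two architectures is easiest to see in the training examples each one draws from a sentence. The sketch below is a simplified illustration, not the study's code: the function names and the window size are assumptions, and a real Word2Vec implementation adds sampling and a neural network on top of these pairs.

```python
def context_window(tokens, i, window):
    """Words within `window` positions to the left and right of position i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the center word."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

def cbow_pairs(tokens, window=2):
    """CBOW: predict the center word from all of its context words together."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

sentence = ["a", "b", "c"]
print(skipgram_pairs(sentence, window=1))
print(cbow_pairs(sentence, window=1))
```

Skip-gram produces one training example per (center, context) pair, while CBOW produces one per position, which is one reason the two can differ in speed and accuracy on a small corpus.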

2.3.3.Lexical syntactic patterns (LSP)


Lexical Syntactic Patterns (LSP) are linguistic structures that indicate semantic relationships
between words, aiding in identifying concepts and conceptual relations in natural language
text. These patterns are strings of words paired with syntactic structures, allowing
lexico-syntactic structures and their semantic relations to be expressed.

This study creates seven lexico-syntactic and semantic patterns from Afaan Oromoo texts for
automatic ontology building. These patterns are general, domain/application-independent, and
operate at the sentence level. They are used to derive taxonomic and non-taxonomic relations
and axioms from sentences and phrases. Semantic patterns are language independent.
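
To make the idea of a sentence-level pattern concrete, the sketch below applies a single pattern as a regular expression. The article does not list the seven patterns, so the pattern here ("X kan akka Y1, Y2 fi Y3", roughly "X such as Y1, Y2 and Y3") is a hypothetical example chosen for illustration, as are the function name and the sample sentence.

```python
import re

WORD = r"[\w']+"
# Hypothetical pattern, not one of the study's seven:
# "<hypernym> kan akka <hyponym>, <hyponym> fi <hyponym>"
PATTERN = re.compile(rf"({WORD}) kan akka ((?:{WORD}, )*{WORD}(?: fi {WORD})?)")

def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs for every match of the pattern."""
    pairs = []
    for match in PATTERN.finditer(sentence.lower()):
        hypernym, tail = match.group(1), match.group(2)
        # The tail lists hyponyms separated by commas and a final "fi".
        for hyponym in re.split(r", | fi ", tail):
            pairs.append((hyponym, hypernym))
    return pairs

print(extract_hyponyms("Midhaan kan akka boqqolloo, qamadii fi garbuu ni oomishama"))
```

Each extracted pair reads "hyponym is a kind of hypernym"; a full system would run several such patterns over every sentence in the corpus.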

Lexico syntactic pattern with Word embedding

In this study, Word2Vec is integrated with LSP to enhance its performance. This integration is
motivated by the observation that words sharing the same hyponym relation are positioned in
close proximity within the Word2Vec embedding space. Consequently, Word2Vec treats these
words as similar due to their neighboring words.
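
One way this observation could be used is to keep only those hyponym candidates from a pattern match that lie close to the other candidates in the embedding space. The sketch below illustrates that idea under stated assumptions: the filtering rule (mean cosine similarity to the other candidates), the threshold of 0.5, and the two-dimensional toy vectors are all invented for the example, and the article does not describe how the study actually combines the two components.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_cohyponyms(candidates, vectors, threshold=0.5):
    """Keep candidates whose mean similarity to the other candidates is >= threshold."""
    kept = []
    for word in candidates:
        others = [c for c in candidates if c != word and c in vectors]
        if word not in vectors or not others:
            continue
        mean_sim = sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)
        if mean_sim >= threshold:
            kept.append(word)
    return kept

# Toy vectors invented for illustration only.
toy_vectors = {"boqqolloo": [0.9, 0.1], "qamadii": [0.8, 0.2], "mana": [0.1, 0.9]}
print(filter_cohyponyms(["boqqolloo", "qamadii", "mana"], toy_vectors))
```

In this toy case the outlier candidate is dropped because it is far from the other two, which is the kind of precision gain the integration aims for.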

2.3.4. Extraction of synonym relations from Word2Vec and hyponym relations from LSP
The WordNet relations extracted by the Word2Vec and LSP models are not adjusted further. The
system includes synonym relations based on their weight and removes that number before saving
the similar words in the Afaan Oromoo WordNet.
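
The selection step described here can be sketched as follows. The sketch assumes that the "weight" is the similarity score attached to each nearest neighbor and that a cut-off is applied to it; the function name, the cut-off of 0.7, the limit of five words, and the sample scores are all hypothetical.

```python
def select_synonyms(neighbors, min_weight=0.7, top_n=5):
    """Rank (word, weight) neighbors by weight, keep the strong ones, drop the weights."""
    ranked = sorted(neighbors, key=lambda pair: pair[1], reverse=True)
    return [word for word, weight in ranked[:top_n] if weight >= min_weight]

# Made-up neighbor list for the word "mana"; scores are illustrative only.
neighbors = [("gamoo", 0.91), ("bishaan", 0.32), ("godoo", 0.78)]
print(select_synonyms(neighbors))
```

Only the words themselves are returned, matching the description that the number is removed before the entry is saved.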

2.3.5.Experimental Results

This study evaluates the performance of the proposed approaches, namely the Word2Vec system
and the lexico-syntactic pattern system. The proposed knowledge-based WSD algorithm is assessed,
with 80.09% and 85.04% accuracy for the CBOW and Skip Gram algorithms, respectively. Lexical
syntactic patterns are created from texts, with results presented for Afaan Oromoo.

3. Conclusion
This paper presents models for building a WordNet for Afaan Oromoo, aiming to automatically extract
synonym and hyponym WordNet relations from Afaan Oromoo texts. Word2Vec is developed for
synonym extraction, LSP for hyponym extraction, and Word2Vec integrated with LSP for improved
performance. The system is evaluated using 30 Afaan Oromoo word pairs and 7 lexical patterns. It is
tested using the CBOW algorithm and the Skip Gram algorithm, with the Skip Gram algorithm
performing better than the CBOW algorithm. LSP is evaluated using IR metrics
such as precision, recall, and F-measure. However, factors such as the writer's use of multiple adjectives and
potential misspellings and typographical errors could affect the accuracy of the results.
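
For reference, the three IR metrics mentioned above can be computed from the set of extracted relations and a gold-standard set, as in the minimal sketch below. The function name and the example relation sets are hypothetical; the article does not report the underlying counts.

```python
def precision_recall_f1(extracted, gold):
    """Precision, recall, and F-measure of extracted relations against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical relation identifiers: 4 extracted, 3 of them among the 6 gold relations.
extracted = {"r1", "r2", "r3", "x1"}
gold = {"r1", "r2", "r3", "r4", "r5", "r6"}
print(precision_recall_f1(extracted, gold))
```

Precision penalizes wrongly extracted relations, recall penalizes missed ones, and the F-measure is their harmonic mean.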
