Ontology Learning
Janardhana R. Punuru
Jianhua Chen
Computer Science Dept.
Louisiana State University, USA
Presentation Outline
Introduction
Concept extraction
Taxonomical relation learning
Non-taxonomical relation learning
Conclusions and Future Work
Introduction
Ontology
An ontology OL of a domain D is a specification of
a conceptualisation of D, or simply, a data model
describing D. An OL typically consists of:
A list of concepts important for domain D
A list of attributes describing the concepts
A list of taxonomical (hierarchical) relationships
among these concepts
A list of (non-hierarchical) semantic
relationships among these concepts
Sample (partial) Ontology: Electronic Voting Domain
[Diagram: domain texts --> ontology extraction --> ontology]
Challenges in Text Processing
Unstructured texts
Ambiguity in English text
Multiple senses of a word
Multiple parts of speech, e.g., "like" can occur in 8 PoS:
Verb: Fruit flies like a banana
Noun: We may not see its like again
Adjective: People of like tastes agree
Adverb: The rate is more like 12 percent
Preposition: Time flies like an arrow
etc.
Lack of a closed set of lexical categories
Noisy texts
Requirement of very large training text sets
Lack of standards in text processing
Challenges in Knowledge Acquisition from Texts
Lack of standards in knowledge representation
Lack of fully automatic techniques for KA
Lack of techniques for coverage of whole texts
Existing techniques typically consider word
frequencies, co-occurrence statistics, and syntactic
patterns, ignoring other useful information in
the texts
Full-fledged natural language understanding is still
computationally infeasible for large text collections
Our Approach
Concept Extraction: Existing Methods
Frequency-based methods
Text-to-Onto [Maedche & Volz 2001]
Use syntactic patterns and extract concepts
matching the patterns
[Paice & Jones 1993]
Use WordNet
[Gelfand et al. 2004] start from a base word list;
for each w in the list, add the hypernyms and
hyponyms of w in WordNet to the list
Concept Extraction: Our Approach
Background: Text Processing
Many local election officials and voting machine companies are fighting paper trails, in part
because they will create more work and will raise difficult questions if the paper and electronic
tallies do not match.
POS Tagging: Many/JJ local/JJ election/NN officials/NNS and/CC voting/NN machine/NN
companies/NNS are/VBP fighting/VBG paper/NN trails,/NN in/IN part/NN because/IN they/PRP
will/MD create/VB more/JJR work/NN and/CC will/MD raise/VB difficult/JJ questions/NNS if/IN
the/DT paper/NN and/CC electronic/JJ tallies/NNS do/VBP not/RB match./JJ
NP Chunking: [ Many/JJ local/JJ election/NN officials/NNS ] and/CC [ voting/NN machine/NN
companies/NNS ] are/VBP fighting/VBG [ paper/NN trails,/NN ] in/IN [ part/NN ] because/IN [
they/PRP ] will/MD create/VB [ more/JJR work/NN ] and/CC will/MD raise/VB [ difficult/JJ
questions/NNS ] if/IN [ the/DT paper/NN ] and/CC [ electronic/JJ tallies/NNS ] do/VBP not/RB [
match./JJ]
Stopword Elimination: local/JJ election/NN officials/NNS, voting/NN machine/NN
companies/NNS , paper/NN trails,/NN, part/NN, work/NN, difficult/JJ questions/NNS,
paper/NN, electronic/JJ tallies/NNS, match./JJ
Morphological Analysis: local election official, voting machine company, paper trail, part, work,
difficult question, paper, electronic tally
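The last two steps above (stopword elimination and morphological analysis) can be sketched as follows. The stopword list and the naive singularize() are illustrative stand-ins for the real resources (a full stopword list and WordNet-style morphology):

```python
# Sketch of stopword elimination and morphological analysis on tagged
# noun-phrase chunks. The stopword list is illustrative, not the one
# used in the work; singularize() stands in for a real morphological
# analyzer such as WordNet's morphy.
STOPWORDS = {"the", "a", "an", "they", "more", "many"}

def singularize(word):
    """Very naive plural stripping."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize_chunk(chunk):
    """chunk: list of (token, pos) pairs from the NP chunker."""
    kept = [tok.lower().strip(".,") for tok, pos in chunk
            if tok.lower() not in STOPWORDS]
    return " ".join(singularize(t) for t in kept)

print(normalize_chunk([("electronic", "JJ"), ("tallies", "NNS")]))
# electronic tally
```

Applied to each chunk from the example sentence, this yields the normalized phrases shown above (e.g. "voting machine company", "paper trail").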
WNSCA + {PE, POP}
Take the top n% of noun phrases (by frequency), and select only
those with fewer than 4 senses in WordNet ==> obtain T, a set of noun phrases
Make a base list L of words from T
PE: add to T, any noun phrase np from NP, if the head-
word (ending word) in np is in L
POP: add to T, any noun phrase np from NP, if some
word in np is in L
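A minimal Python sketch of this selection procedure. The sense dictionary and the toy threshold values stand in for WordNet sense lookups and the tuned parameters; only the structure of WNSCA, PE, and POP follows the slide:

```python
# Sketch of WNSCA with the PE and POP extensions, following the slide.
# Sense counts come from a stand-in dictionary; the actual method
# queries WordNet for the number of senses of the head word.
def wnsca(noun_phrases, freq, senses, top_frac=0.5, max_senses=4):
    """noun_phrases: NP strings; freq: NP -> corpus frequency;
    senses: word -> WordNet sense count (stand-in here)."""
    ranked = sorted(noun_phrases, key=freq.get, reverse=True)
    top = ranked[:max(1, int(len(ranked) * top_frac))]
    # WNSCA: keep top NPs whose head word has fewer than max_senses senses
    T = {np for np in top
         if senses.get(np.split()[-1], max_senses) < max_senses}
    L = {w for np in T for w in np.split()}  # base word list from T
    # PE: add any NP whose head (ending) word is in L
    pe = {np for np in noun_phrases if np.split()[-1] in L}
    # POP: add any NP containing some word in L
    pop = {np for np in noun_phrases if any(w in L for w in np.split())}
    return T, T | pe, T | pop
```

On a toy input, PE admits only NPs sharing a head word with T, while POP also admits NPs sharing any word (e.g. "paper ballot" via "paper trail"), which is why POP trades precision for recall in the evaluations below.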
Evaluation: Precision and Recall
Let S be the set of concepts extracted by the system and
T the set of relevant (reference) concepts. Then:

Precision = |S ∩ T| / |S|
Recall    = |S ∩ T| / |T|
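The two formulas translate directly to set operations (the example sets are illustrative):

```python
# Precision and recall over the extracted set S and the reference
# (human-judged relevant) set T, as defined on the slide.
def precision_recall(S, T):
    overlap = len(S & T)
    return overlap / len(S), overlap / len(T)

S = {"voting machine", "paper trail", "tally", "banana"}
T = {"voting machine", "paper trail", "tally", "ballot", "precinct"}
p, r = precision_recall(S, T)
print(p, r)  # 0.75 0.6
```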
Evaluations on the E-voting Domain
[Chart: precision (0-100%) vs. frequency threshold (Top 10%, 20%, 50%, 75%)
for Raw Freq, WNSCA, W+PE, and W+POP]
Evaluations on the E-voting Domain
[Chart: recall (0-90%) vs. frequency threshold (Top 10%, 25%, 50%, 75%)
for Raw Freq, WNSCA, W+PE, and W+POP]
TF*IDF Measure
TF*IDF(t_ij) = f_ij * Log( |D| / |D_i| )

|D|: total number of documents
|D_i|: number of documents containing term t_i
f_ij: frequency of term t_i in document d_j
TF*IDF(t_ij): TF*IDF measure for term t_i in document d_j
Comparison with the tf.idf method
[Chart: counts of Retrieved and Retrieved & Relevant NPs (0-325), plus
Precision, Recall, and F-measure, for tf.idf, WNSCA, W+PE, and W+POP]
Evaluations on the TNM Domain
[Chart: Precision, Recall, and F-measure (0-90%) on the TNM domain
for tf*idf, WNSCA, W+10%PE, and W+10%POP]
Taxonomy Extraction: Existing Methods
Verb V     VF*ICF(V)
produce    25.010
check      24.674
ensure     23.971
purge      23.863
create     23.160
include    23.160
say        23.151
restore    23.088
certify    23.047
pass       23.047
pass 23.047
Relation label assignment by
Log-likelihood ratio measure
For a concept pair (C1, C2) and a verb V, let S(C1, C2) be the set of
sentences containing both C1 and C2, and S(V) the set of sentences
containing V. Define:

n_C  = |S(C1, C2)|
n_V  = |S(V)|
n_CV = |S(C1, C2) ∩ S(V)|
N    = total number of sentences considered (the union of all S(V_i)
       and all S(C_j, C_k))

Log-likelihood ratio: -2 Log λ = -2 Log [ L(H1) / L(H2) ]

where H1 is the hypothesis that V occurs independently of the pair
(C1, C2), and H2 the hypothesis that it does not.

For concept pair (C1, C2), select the verb V with the highest -2 Log λ value
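The slide names the ratio but not the likelihood functions; a standard instantiation is Dunning's binomial formulation, sketched here under that assumption:

```python
import math

def log_l(k, n, p):
    """Log binomial likelihood, dropping the C(n, k) term
    (it cancels in the ratio)."""
    def xlogy(x, y):
        # convention 0 * log(0) = 0 guards the boundary cases
        return x * math.log(y) if x > 0 else 0.0
    return xlogy(k, p) + xlogy(n - k, 1 - p)

def llr(n_cv, n_c, n_v, N):
    """-2 log lambda for verb V and concept pair (C1, C2):
    n_cv = |S(C1,C2) ∩ S(V)|, n_c = |S(C1,C2)|,
    n_v = |S(V)|, N = total sentences."""
    p = n_v / N                    # H1: P(V) independent of the pair
    p1 = n_cv / n_c                # H2: P(V | pair present)
    p2 = (n_v - n_cv) / (N - n_c)  # H2: P(V | pair absent)
    h1 = log_l(n_cv, n_c, p) + log_l(n_v - n_cv, N - n_c, p)
    h2 = log_l(n_cv, n_c, p1) + log_l(n_v - n_cv, N - n_c, p2)
    return 2 * (h2 - h1)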
Experiments on the E-voting Domain
Table II Example concept pairs