Automatic and Semantic Pre - Selection of Features Using Ontology For DM On Data Sets Related Cancer
Adriana da Silva Jacinto
Department of Computer Science
Faculty of Technology Professor Jessen Vidal (Fatec – SJC)
Sao Jose dos Campos, Brazil
[email protected]

Ricardo da Silva Santos; José Maria Parente de Oliveira
Department of Computer Science
Aeronautics Institute of Technology (ITA)
Sao Jose dos Campos, Brazil
{rsantos, parente}@ita.br
Integrated Center of Biochemistry Research
University of Mogi das Cruzes
Mogi das Cruzes, Brazil
Abstract: Analysis of medical data sets can reveal important information, especially data coming from patients with cancer. Such analysis can be improved by the use of data mining. However, to obtain coherent results from data mining, the semantically most relevant features should be selected correctly, which usually does not happen. Therefore, this work presents a proposal for automatic and semantic pre-selection of features. Four well-known data sets coming from patients with cancer were used in experiments, and the results show the validity of the proposal.

Keywords: ontology; cancer; feature; selection; methods; semantic; search

I. INTRODUCTION

Health is one of the most important areas of research for society, and cancer is a serious subject of investigation. Cancer is a term used for diseases in which abnormal cells divide without control and are able to invade other tissues, forming a malignant tumor in most cases [14, 29].

In 2025, the estimate is that there will be 19.3 million new cases of cancer around the world [14]. Therefore, the study of the clinical profile of patients with tumors is very important, because it can lead to better actions by public health managers and elicit more appropriate treatments for patients. In addition, researchers and organizations have focused on identifying causes of the disease or factors that may promote its recurrence in patients [14, 29].

Those studies are based on the analysis of several data sets, including those that describe thousands of genes coming from patients with cancer or tumors [4, 29]. Thus, data mining is a valuable tool for that issue, because it is the exploration and analysis of databases through a variety of statistical techniques and machine learning algorithms to discover important rules and patterns [22].

However, the success of data mining depends on correct feature selection, i.e., the choice of the most relevant features [11]. Feature selection, in turn, depends on methods and is highly related to the knowledge of domain experts.

Generally, experts in Medicine are not available to support a data miner full time. The medical area contains several terms and peculiarities, which can make it difficult to select the semantically most relevant features. Moreover, if the data set is composed of thousands of genes as features, a computational resource capable of making a semantic pre-selection of features must be, facilitating the work of a genetic
feature, called concept. Therefore, this issue is addressed by semantic search.

C. Semantic Search

Semantic search over documents refers to finding information based on the presence and meaning of words [21]. It aims to cover aspects such as [33]: morphological variations, synonyms with correct senses, generalizations, concept matching, knowledge matching, natural language queries and questions, the ability to point to an uninterrupted paragraph and the most relevant sentence, and so on.

For this, semantic search employs a domain ontology or a lexical ontology to obtain the relations among the concepts. In addition, semantic search may use natural language processing [21].

Natural language processing (NLP) is an interdisciplinary field of Computer Science, Artificial Intelligence, and Linguistics that works on solutions to allow computers to process and understand human languages [23].

III. RELATED WORKS

Attentive to the semantic aspect of features, Kuo et al. [15] use ontology to facilitate data mining tasks. They propose an approach in which a domain ontology is used to prepare semantic groups of features for the association task during the pre-processing phase. Association rules are generated, revised and compared by a domain expert. However, this approach is manual because a computational tool for natural language processing is out of the scope of that work.

Mansingh et al. [35] focus on the association task. Among the several rules generated by data mining, that work describes an ontology-based method to select the rules most interesting to the application domain. Selecting the most relevant features (input) is harder than choosing the most meaningful rules (output). Computationally, verifying a solution is easier than finding it.

Aubrecht and Žáková [34] describe a system called SumatraTT that maps a table from a relational database to an ontology, facilitating the understanding and visualization of concepts.

Wu et al. [25] show how ontology can aid in the detection of semantic mistakes during data mining, improving the efficiency and efficacy of KDD.

In summary, given the benefits that the use of semantics can provide, some authors have reported the use of ontology with databases and data mining, but this work investigates whether there are advantages in the preprocessing of data for KDD.

IV. SEMANTIC PRE-SELECTION OF FEATURES

Basically, the approach of semantic pre-selection of features is divided into 7 steps, as follows.

1 – A data set with x features {A1, A2, ..., Ax}, y tuples and a prediction feature {AS} is the input of the application.

2 – A combination of feature selection filter methods chooses the most relevant features of the data set, {M1, M2, …}. This choice is based only on statistical analysis. The possible number of combinations is given by (1), where n is the number of available feature selection methods and p is the number of them used.

\binom{n}{p} = \frac{n!}{p!\,(n-p)!} \qquad (1)

3 – A feature is selected when: a) its statistical weight (pm) is higher than a threshold; b) it is indicated by every method of the combination.

4 – From the data set, only the names of features are taken. Those names are compared to a domain ontology and to a lexical ontology. A lexical ontology is used with thesauri or dictionaries to recognize words. The domain ontology used is the NCI Ontology [29], and the lexical ontology is WordNet [24].

5 – Each feature is related to an equivalent concept of the domain ontology. If a feature is not found in the domain ontology, an automatic search for synonyms, hypernyms and other relations occurs in WordNet [24].

This proposal performs an automatic normalization procedure on the names of features, i.e., a string treatment. This procedure converts strings to lowercase, discards accents, deletes blank spaces and hyphens, and removes numeric digits and punctuation.

After normalization of the strings, the comparison between a feature and an ontology concept is made by calculating the word similarity measure [31] shown in (2), and withdrawing [17].

(2)

Letters s and t represent the two strings to be compared. Expression com(s,t) represents the number of characters that appear in both strings, but in a different order. Expression transp(s,t) refers to the number of character transpositions that occurred. The first calculation of σ(s,t) refers to σJaro(s,t). The second calculation of σ(s,t) refers to the measure σJaro-Winkler(s,t), where P is the size of the common prefix of the two strings and Q is a constant.

6 – Only the features related to an equivalent and new concept are pre-selected as semantically relevant. Features related to repeated concepts are removed because this indicates redundancy. A feature not related to any concept is considered semantically irrelevant.

7 – The union of the semantically relevant features, e.g. {A1, A2}, and the statistically relevant features, e.g. {A1, A4}, gives the output of the proposal.

Fig. 1 presents the proposal, called semantic pre-selection of features by use of ontology.
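
The name normalization and string comparison of step 5 can be illustrated with a minimal sketch. It assumes the commonly used definitions of the Jaro and Jaro-Winkler measures; the value of the constant Q is not stated in the text, so the scaling constant 0.1, the prefix cap of 4, the 0.9 matching threshold, and all function names below are illustrative assumptions, not the authors' implementation.

```python
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents, and drop spaces, hyphens, digits and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    # After NFKD, accents become separate combining marks, which are not alphabetic.
    return "".join(ch for ch in decomposed if ch.isalpha())


def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1] (standard definition, assumed here)."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    common = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                common += 1
                break
    if common == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    # A transposition is a pair of common characters that appear in swapped order.
    transp = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (common / len(s) + common / len(t) + (common - transp) / common) / 3


def jaro_winkler(s: str, t: str, q: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler: boosts the Jaro score by the common prefix length P times Q."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * q * (1 - base)


def best_concept(feature: str, concepts: list[str], threshold: float = 0.9):
    """Return the most similar concept name, or None if no score reaches the threshold."""
    target = normalize(feature)
    score, concept = max((jaro_winkler(target, normalize(c)), c) for c in concepts)
    return concept if score >= threshold else None
```

For example, `best_concept("Tumor-Size", ["Tumor Size", "Age"])` returns `"Tumor Size"`, because both names normalize to the same string. In the proposal, a feature with no match in the domain ontology would next be looked up in WordNet, which this sketch does not cover.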
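
The statistical selection of steps 2 and 3 and the final combination of steps 6 and 7 can be sketched as follows. This is only an illustration under stated assumptions: a method is taken to "indicate" a feature when the weight it assigns exceeds the threshold, the method names (`info_gain`, `chi_square`, `relief`), the weights, the threshold of 0.5, and the feature-to-concept mapping are all hypothetical, and the mapping is assumed to come from an earlier ontology matching step.

```python
from itertools import combinations
from math import comb


def statistically_relevant(weights, methods, threshold):
    """Features whose weight exceeds the threshold under every method in `methods`."""
    selected = None
    for method in methods:
        indicated = {f for f, w in weights[method].items() if w > threshold}
        selected = indicated if selected is None else selected & indicated
    return selected if selected is not None else set()


def semantically_relevant(feature_to_concept):
    """Keep the first feature of each concept; drop unmatched and repeated ones."""
    seen, kept = set(), []
    for feature, concept in feature_to_concept.items():
        if concept is None or concept in seen:
            continue  # semantically irrelevant, or redundant with an earlier feature
        seen.add(concept)
        kept.append(feature)
    return kept


# Hypothetical filter methods: n = 3 available, p = 2 used per combination.
methods_available = ["info_gain", "chi_square", "relief"]
n_combinations = comb(len(methods_available), 2)
method_pairs = list(combinations(methods_available, 2))

# Hypothetical weights assigned by the two methods of one combination.
weights = {
    "info_gain": {"A1": 0.8, "A2": 0.2, "A3": 0.1, "A4": 0.7},
    "chi_square": {"A1": 0.9, "A2": 0.3, "A3": 0.6, "A4": 0.75},
}
statistical = statistically_relevant(weights, ["info_gain", "chi_square"], 0.5)

# Hypothetical result of ontology matching: A3 repeats a concept, A4 has none.
feature_to_concept = {"A1": "Age", "A2": "Tumor Size", "A3": "Age", "A4": None}
semantic = semantically_relevant(feature_to_concept)

# Step 7: the output is the union of both sets of features.
output = set(semantic) | statistical
print(n_combinations, sorted(statistical), semantic, sorted(output))
```

With these made-up values the statistical set is {A1, A4}, the semantic set is {A1, A2}, and the output is {A1, A2, A4}, mirroring the example sets given in step 7.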