
International Conference on Information Society (i-Society 2014)

Automatic and Semantic Pre-Selection of Features Using Ontology for Data Mining on Data Sets Related to Cancer

Adriana da Silva Jacinto
Department of Computer Science
Faculty of Technology Professor Jessen Vidal (Fatec - SJC)
Sao Jose dos Campos, Brazil
[email protected]

Ricardo da Silva Santos; José Maria Parente de Oliveira
Department of Computer Science
Aeronautics Institute of Technology (ITA)
Sao Jose dos Campos, Brazil
{rsantos, parente}@ita.br

Integrated Center of Biochemistry Research
University of Mogi das Cruzes
Mogi das Cruzes, Brazil

Abstract — Analysis of medical data sets can reveal important information, especially data coming from patients with cancer. That analysis can be improved by the use of data mining. However, to obtain coherent results from data mining, the correct selection of the more semantically relevant features should occur but, usually, that does not happen. Therefore, this work presents a proposal for the automatic and semantic pre-selection of features. Four well-known data sets coming from patients with cancer were used in the experiments, and the results show the validity of the proposal.

Keywords: ontology; cancer; feature; selection; methods; semantic; search

I. INTRODUCTION

Health is one of the most important areas of research for society, and cancer is a serious issue of investigation. Cancer is a term used for diseases in which abnormal cells divide without control and are able to invade other tissues, forming a malignant tumor in most cases [14, 29].

For 2025, the estimate is that there will be 19.3 million new cases of cancer around the world [14]. Therefore, the study of the clinical profile of patients with tumors is very important, because it can lead to better actions by public health managers and elicit more appropriate treatments for patients. In addition, researchers and organizations have focused on identifying causes of the disease or factors that may promote its recurrence in patients [14, 29].

Those studies are based on the analysis of several data sets, including those that describe thousands of genes coming from patients with cancer or tumors [4, 29]. Thus, data mining is a valuable tool in that context, because it is the exploration and analysis of databases through a variety of statistical techniques and machine learning algorithms to discover important rules and patterns [22].

However, the success of data mining depends on correct feature selection, i.e., the election of the most relevant features [11]. Thus, feature selection depends on methods and is highly related to the knowledge of domain experts.

Generally, experts in Medicine are not available to support a full-time data miner. The medical area contains several terms and peculiarities, which can make the selection of the semantically more relevant features difficult. Moreover, if the data set is composed of thousands of genes as features, a computational resource capable of making a semantic pre-selection of features is needed, facilitating the work of a genetics expert.




Kuo et al. [15] mention that a person who is not familiar with the specific domain cannot easily understand the meanings of several features, and it is unclear which of these features are useful in data mining. Consequently, a large amount of time is spent on feature selection for data mining.

Several feature selection methods have been developed to improve the pre-processing of the data [1]. However, in general, they employ only statistical techniques to select the features and do not capture the semantics of the features. Therefore, some semantically more relevant features may be excluded if they are not statistically significant, while irrelevant ones may be selected due to their large statistical contribution.

To provide semantic support to a data miner, this work proposes computationally capturing the semantics of features by using an ontology in conjunction with a combination of traditional feature selection methods, promoting their semantic enrichment.

This paper is organized as follows. Section 2 outlines the main concepts needed to understand the proposal. Section 3 describes related works. Section 4 presents the basic architecture of the proposal. Section 5 reports the results obtained from experiments. Finally, Section 6 highlights the main contribution of this work.

II. BACKGROUND

A. Knowledge Discovery in Databases (KDD)

Data Mining is one of the stages of Knowledge Discovery in Databases, which uses statistical techniques and artificial intelligence to discover meaningful patterns and rules in large volumes of data [22].

According to the type of knowledge to be extracted, Data Mining has a variety of tasks, such as prediction, sequence analysis, classification, clustering and association [22, 26].

Before data mining, the pre-processing of the data reduces the dimensionality of the data by removing missing data and decreasing the original number of features, through feature selection by discarding irrelevant or redundant features [22].

For a classification task, a feature selection method can be:

• Embedded, which is incorporated into the classification algorithm [22];

• Wrapper, which uses the classification algorithm as a black box to evaluate subsets of features according to their predictive ability [11];

• Filter, in which the selection of features occurs regardless of the classification algorithm [11].

This work identified the core approach used by various feature selection methods, grouping them by similarities. The considered methods are only those of the filter type, i.e., those whose implementation and execution are independent of the mining algorithm.

In the literature, the main approaches for feature selection are based on Entropy, Consistency, Matrix Resources, Rough Sets, and Similarity and Correlation.

Entropy and related measures (probability calculus, degree of symmetric uncertainty, information gain, mutual information and frequency values) quantify the disorder among the elements of a data set, based on the calculation of probability [28]. Low entropy means a homogeneous data set.
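To make the entropy-based approach concrete, the following minimal Java sketch computes the entropy of a prediction feature and the information gain of one input feature over a tiny, hypothetical set of tuples. The class name, the toy values and the diagnosis labels are illustrative assumptions, not data from the experiments reported later.

    import java.util.*;

    public class InfoGainSketch {
        // Shannon entropy of a list of nominal class labels.
        static double entropy(List<String> labels) {
            Map<String, Integer> counts = new HashMap<>();
            for (String l : labels) counts.merge(l, 1, Integer::sum);
            double h = 0.0, n = labels.size();
            for (int c : counts.values()) {
                double p = c / n;
                h -= p * (Math.log(p) / Math.log(2));
            }
            return h;
        }

        // Information gain of feature values x with respect to class labels y.
        static double informationGain(List<String> x, List<String> y) {
            Map<String, List<String>> partitions = new HashMap<>();
            for (int i = 0; i < x.size(); i++)
                partitions.computeIfAbsent(x.get(i), k -> new ArrayList<>()).add(y.get(i));
            double conditional = 0.0;
            for (List<String> part : partitions.values())
                conditional += (part.size() / (double) y.size()) * entropy(part);
            return entropy(y) - conditional;
        }

        public static void main(String[] args) {
            // Toy example: one input feature and a binary prediction feature.
            List<String> feature = Arrays.asList("high", "high", "low", "low");
            List<String> diagnosis = Arrays.asList("malignant", "malignant", "normal", "normal");
            System.out.println("IG = " + informationGain(feature, diagnosis)); // 1.0 bit
        }
    }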
Another approach adopted by feature selection methods is the calculation of consistency, which refers to coherency: if two tuples have the same values for the input features, they must have equal values for the prediction feature. If that does not happen, an instance of inconsistency is found [7]. This way, the subset of features that displays the lowest level of inconsistency is chosen by the selection method.
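A similarly minimal sketch of the consistency idea follows: it counts how many tuples share the same input values but fall outside the majority class of that value pattern, which is the usual way an inconsistency count is obtained [7]. The toy tuples are assumptions made only for illustration.

    import java.util.*;

    public class ConsistencySketch {
        // Counts tuples whose input values match another tuple's but whose class differs.
        static int inconsistencies(List<List<String>> inputs, List<String> classes) {
            Map<List<String>, Map<String, Integer>> classCountsByInput = new HashMap<>();
            for (int i = 0; i < inputs.size(); i++)
                classCountsByInput
                    .computeIfAbsent(inputs.get(i), k -> new HashMap<>())
                    .merge(classes.get(i), 1, Integer::sum);
            int count = 0;
            for (Map<String, Integer> byClass : classCountsByInput.values()) {
                int total = byClass.values().stream().mapToInt(Integer::intValue).sum();
                int majority = byClass.values().stream().mapToInt(Integer::intValue).max().orElse(0);
                count += total - majority; // tuples outside the majority class are inconsistent
            }
            return count;
        }

        public static void main(String[] args) {
            List<List<String>> inputs = Arrays.asList(
                Arrays.asList("yes", "low"), Arrays.asList("yes", "low"), Arrays.asList("no", "high"));
            List<String> classes = Arrays.asList("recurrence", "no-recurrence", "no-recurrence");
            System.out.println(inconsistencies(inputs, classes)); // 1
        }
    }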
Continuing the description of the approaches found, there is the use of matrix resources (the SVD matrix, the Laplacian matrix), which consists of the singular value decomposition of a data matrix, calculating the cosine or the covariance between two columns of the matrix [19].

Since different vectors form a data set, these vectors can be compared, establishing between them a measure that is calculated based on a metric. This approach is called similarity or correlation, and the vectors can be tuples or features [26, 27].

Concluding the description of the approaches, the theory of rough sets conducts tests with all possible subsets of features to investigate which one has the best quality of approximation to the original set of features [18, 6].

Some feature selection methods use more than one approach to obtain the subset of most relevant features. However, as all the described approaches perform their calculations taking into account only the data and not the semantics of each feature, this goal often is not reached. Thus, those approaches should be refined and enriched by semantic analysis, considering the meaningfulness of each feature and simulating expert intervention.

B. Ontology

Gruber defines an ontology as an explicit specification of a conceptualization [9]. The specification can be represented in a declarative formalism defining a set of objects (classes) and the relationships among them. According to Guarino, an ontology is a conceptual model that captures and explains the vocabulary used in semantic applications [10].

Jacinto and Oliveira [13] state that the use of ontology allows a better understanding of various issues, particularly with regard to problems that do not have an exact or mathematical answer.

An ontology can be manipulated by a computer when it is described in a language such as OWL (Web Ontology Language), which became a recommended standard of the W3C (World Wide Web Consortium) [8].
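Because the proposal reads the OWL domain ontology programmatically, a minimal sketch of that kind of access is given below using the Jena API [20]. It only loads an ontology file and prints the names of its classes; the file name is a placeholder, and the package names are those of current Apache Jena releases (the Jena 2.x versions contemporary with the paper used the com.hp.hpl.jena packages), so this is an assumed setup rather than the authors' actual code.

    import org.apache.jena.ontology.OntClass;
    import org.apache.jena.ontology.OntModel;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.util.iterator.ExtendedIterator;

    public class OntologyBrowser {
        public static void main(String[] args) {
            // Load an OWL ontology into an in-memory model (the path is a placeholder).
            OntModel model = ModelFactory.createOntologyModel();
            model.read("file:ontology.owl");

            // List the named classes (concepts) declared in the ontology.
            ExtendedIterator<OntClass> classes = model.listClasses();
            while (classes.hasNext()) {
                OntClass c = classes.next();
                if (c.getLocalName() != null) {
                    System.out.println(c.getLocalName());
                }
            }
        }
    }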
In this way, the use of ontology makes the semantic and automatic analysis of features, and of the relationships among them, viable in a computational manner. However, the names of the features coming from the data sets must be searched in the domain ontology, which can use a different nomenclature to describe the same feature, called a concept. Therefore, this issue is addressed by semantic search.




C. Semantic Search

Semantic search over documents refers to finding information based on the presence and meaning of words [21]. Its goal is to contemplate aspects such as [33]: morphological variations, synonyms with correct senses, generalizations, concept matching, knowledge matching, natural language queries and questions, the ability to point to the relevant paragraph and the most relevant sentence, and so on.

For this, semantic search employs a domain ontology or a lexical ontology to obtain the relations among the concepts. In addition, semantic search may use natural language processing [21].

Natural language processing (NLP) is an interdisciplinary field of Computer Science, Artificial Intelligence and Linguistics which works on solutions to allow computers to process and understand human languages [23].

III. RELATED WORKS

Sensitive to the semantic aspect of features, Kuo et al. [15] use ontology to facilitate data mining tasks. They propose an approach in which a domain ontology is used to prepare semantic groups of features for the association task during the pre-processing phase. Association rules are generated, revised and compared by a domain expert. However, this approach is manual, because a computational tool for natural language processing is out of the scope of that work.

Mansingh et al. [35] focus on the association task. Among the several rules generated in data mining, that work describes a method based on ontology to select the rules most interesting to the application domain. Selecting the more relevant features (input) is harder than choosing the more meaningful rules (output): computationally, verifying a solution is easier than finding it.

Aubrecht and Žáková [34] describe a system called SumatraTT that maps a table from a relational database to an ontology, facilitating the understanding and visualization of concepts.

Wu et al. [25] expose how ontology can aid in the detection of semantic mistakes during data mining, improving the efficiency and efficacy of KDD.

In summary, given the benefits that the use of semantics can provide, some authors have reported the use of ontology with databases and data mining, but this work investigates whether there are advantages in the preprocessing of data for KDD.

IV. SEMANTIC PRE-SELECTION OF FEATURES

Basically, the approach of semantic pre-selection of features is divided into 7 steps, as follows.

1 – A data set with x features {A1, A2, ..., Ax}, y tuples and a prediction feature {AS} is the input of the application.

2 – A combination of feature selection filter methods {M1, M2, ...} chooses the most relevant features of the data set. This choice is based only on statistical analysis. The possible number of combinations is given by (1), where n is the number of available feature selection methods and p is the number of methods used in the combination:

C(n, p) = n! / (p! (n - p)!)    (1)

3 – A feature is selected when: a) its statistical weight (pm) is higher than a threshold; b) it is indicated by every method of the combination.

4 – From the data set, only the names of the features are taken. Those names are compared to a domain ontology and to a lexical ontology. The lexical ontology is used together with a thesaurus or dictionaries to recognize words. The domain ontology used is the NCI Ontology [29], and WordNet is the lexical ontology [24].

5 – Each feature is related to an equivalent concept of the domain ontology. If some feature is not found in the domain ontology, an automatic search for synonyms, hypernyms and other relations occurs on WordNet [24].

This proposal performs an automatic normalization procedure on the names of the features, which is a treatment of strings. This procedure converts strings to lowercase, discards grammatical accents, deletes blank spaces and hyphens, and withdraws numeric digits and punctuation.
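A minimal Java sketch of such a normalization routine is shown below, assuming exactly the rules just listed; it is illustrative and not the prototype's actual implementation.

    import java.text.Normalizer;

    public class NameNormalizer {
        // Normalizes a feature name before comparing it with ontology concepts.
        static String normalize(String name) {
            String s = name.toLowerCase();
            // Decompose accented characters and drop the combining accent marks.
            s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
            // Remove blank spaces, hyphens, digits and punctuation.
            s = s.replaceAll("[\\s\\-0-9\\p{Punct}]", "");
            return s;
        }

        public static void main(String[] args) {
            System.out.println(normalize("Dislocation_of"));   // dislocationof
            System.out.println(normalize("Irradiação - 2"));   // irradiacao
        }
    }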
After the normalization of the strings, the comparison between a feature and an ontology concept is made by calculating the similarity measure of words [31] shown in (2), following [17]:

σJaro(s, t) = (1/3) [ com(s,t)/|s| + com(s,t)/|t| + (com(s,t) - transp(s,t)) / com(s,t) ]

σJaro-Winkler(s, t) = σJaro(s, t) + P · Q · (1 - σJaro(s, t))    (2)

The letters s and t represent the two strings to be compared. The expression com(s,t) represents the number of characters that appear in both strings (the matching characters), and transp(s,t) refers to the number of transpositions that occurred, i.e., matching characters that appear in a different order. The first calculation of σ(s,t) refers to the σJaro(s,t) measure; the second refers to the σJaro-Winkler(s,t) measure, where P is the size of the common prefix of the two strings and Q is a constant.
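For reference, a self-contained Java sketch of the Jaro-Winkler calculation follows. It uses the common formulation in which transpositions are counted as half the number of out-of-order matching characters and the constant Q is taken as 0.1 with a prefix of at most four characters; those constants are assumptions, since the paper does not state them.

    public class JaroWinkler {
        // Jaro similarity between strings s and t (0 = no similarity, 1 = identical).
        static double jaro(String s, String t) {
            if (s.equals(t)) return 1.0;
            if (s.isEmpty() || t.isEmpty()) return 0.0;
            int window = Math.max(0, Math.max(s.length(), t.length()) / 2 - 1);
            boolean[] sMatched = new boolean[s.length()];
            boolean[] tMatched = new boolean[t.length()];
            int matches = 0;
            for (int i = 0; i < s.length(); i++) {
                int lo = Math.max(0, i - window), hi = Math.min(t.length() - 1, i + window);
                for (int j = lo; j <= hi; j++) {
                    if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                        sMatched[i] = true;
                        tMatched[j] = true;
                        matches++;
                        break;
                    }
                }
            }
            if (matches == 0) return 0.0;
            // Count matched characters that appear in a different order in the two strings.
            int outOfOrder = 0, k = 0;
            for (int i = 0; i < s.length(); i++) {
                if (!sMatched[i]) continue;
                while (!tMatched[k]) k++;
                if (s.charAt(i) != t.charAt(k)) outOfOrder++;
                k++;
            }
            double m = matches;
            return (m / s.length() + m / t.length() + (m - outOfOrder / 2.0) / m) / 3.0;
        }

        // Jaro-Winkler: boosts the Jaro score for strings sharing a common prefix P (Q = 0.1).
        static double jaroWinkler(String s, String t) {
            double sim = jaro(s, t);
            int prefix = 0;
            while (prefix < Math.min(4, Math.min(s.length(), t.length()))
                    && s.charAt(prefix) == t.charAt(prefix)) prefix++;
            return sim + prefix * 0.1 * (1.0 - sim);
        }

        public static void main(String[] args) {
            System.out.println(jaroWinkler("dislocation", "disruption"));
            System.out.println(jaroWinkler("uptake", "intake"));
        }
    }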
6 – Only the features related to an equivalent and new concept are pre-selected as semantically relevant. Features related to repeated concepts are removed, because this indicates redundancy. A feature not related to any concept is considered semantically irrelevant.

7 – The union of the semantically relevant features {A1, A2} and the statistically relevant features {A1, A4} gives the output of the proposal.
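The decision logic of steps 3 and 7 can be illustrated with a small Java sketch over hypothetical inputs; the feature names A1, A2 and A4 follow the example above, while the numeric weights and the two method outputs are invented only for illustration.

    import java.util.*;

    public class PreSelectionSketch {
        public static void main(String[] args) {
            // Hypothetical statistical weights (pm) from a ranking method and the
            // subset indicated by a second method of the combination.
            Map<String, Double> weightsByRanking = Map.of("A1", 0.42, "A2", 0.004, "A4", 0.15);
            Set<String> chosenBySubsetMethod = Set.of("A1", "A4");
            double threshold = 0.01;

            // Step 3: keep a feature only if its weight passes the threshold
            // and it is indicated by every method of the combination.
            Set<String> statistical = new TreeSet<>();
            for (Map.Entry<String, Double> e : weightsByRanking.entrySet())
                if (e.getValue() > threshold && chosenBySubsetMethod.contains(e.getKey()))
                    statistical.add(e.getKey());

            // Hypothetical outcome of the semantic pre-selection (steps 4-6).
            Set<String> semantic = Set.of("A1", "A2");

            // Step 7: the union of both sets is the output of the proposal.
            Set<String> output = new TreeSet<>(statistical);
            output.addAll(semantic);
            System.out.println(output); // [A1, A2, A4]
        }
    }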
Fig. 1 presents the proposal, called semantic pre-selection of features by the use of ontology.




Figure 1. Semantic Pre-Selection of Features.

V. EXPERIMENTS AND RESULTS

A. Experiments

Given a data set and a prediction feature, the goal of the experiments was to perform a semantic and automatic pre-selection of features for the classification task in data mining, comparing the outcomes with those coming from a purely statistical selection of features.

Four data sets related to cancer were used: Lymphography, Breast Cancer, Primary Tumor and Central Nervous System. The first three data sets were obtained from the University Medical Center, Institute of Oncology, Ljubljana, Yugoslavia [16] and are available at [4]. The last data set was obtained from the Broad Institute and was also used by [5]. Each data set is described as follows.

• Lymphography – This data set contains 18 features from 148 patients. The prediction feature is the diagnosis, that is, the aim is to classify the patient's situation as normal, metastasis, malignant lymphoma or fibrosis, based on the input features.

• Breast Cancer – From 9 features, the goal is to know whether there is a risk of tumor recurrence in patients who have already received treatment.

• Primary Tumor – With 17 features from 339 patients, the aim is to determine where the first tumor appeared, which means knowing in which part of the body the cancer started.

• Central Nervous System – This data set contains 7129 genes coming from 60 patients who had cancer in the central nervous system and received treatment. The prediction feature is survival or not of the sickness, according to the patient's genetics.

Data sets with genes as input features have two inconveniences for feature selection: a) these data sets have thousands of genes; b) each gene does not have a unique identity. Several resources and applications have been created to perform the conversion of gene nomenclature [32]. This work used MADGene [30] to obtain the GenBank Accession of each gene, which is the standard used by the NCI ontology.

A prototype of the proposal of this work was implemented in the Java programming language, using the Jena APIs [20], Weka [12] and extJWNL [2]. A notebook computer with an Intel Core 2 Duo processor, 2 GHz, 2 GB RAM and a 138 GB HD was used.

In the experiments, a combination of traditional feature selection methods was used: Information Gain Attribute Ranking (IG) [28], which employs the concept of entropy; and Consistency Subset Evaluation (CSE) [7], which uses the calculation of consistency to elect the most relevant features. The threshold used was 0.01.
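As an illustration of how this combination can be run, the sketch below uses the Weka attribute selection API [12] with InfoGainAttributeEval plus a Ranker threshold of 0.01, and ConsistencySubsetEval with a best-first search. The ARFF file name is a placeholder and the class names are those of the Weka 3.6-era releases available at the time of the paper (ConsistencySubsetEval was later moved out of the core distribution), so this is an assumed setup rather than the authors' code.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.ConsistencySubsetEval;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class StatisticalSelection {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("lymphography.arff"); // placeholder file
            data.setClassIndex(data.numAttributes() - 1);          // prediction feature

            // Information Gain ranking with a 0.01 threshold.
            AttributeSelection ig = new AttributeSelection();
            Ranker ranker = new Ranker();
            ranker.setThreshold(0.01);
            ig.setEvaluator(new InfoGainAttributeEval());
            ig.setSearch(ranker);
            ig.SelectAttributes(data);

            // Consistency-based subset evaluation with a best-first search.
            AttributeSelection cse = new AttributeSelection();
            cse.setEvaluator(new ConsistencySubsetEval());
            cse.setSearch(new BestFirst());
            cse.SelectAttributes(data);

            System.out.println("IG:  " + java.util.Arrays.toString(ig.selectedAttributes()));
            System.out.println("CSE: " + java.util.Arrays.toString(cse.selectedAttributes()));
        }
    }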
B. Results

Table 1 shows the outcomes.

TABLE I. OUTCOMES

It presents the names of each data set, the feature selection coming from the traditional methods and the results of the automatic and semantic pre-selection of features, together with the corresponding concept found in the ontology.

In the Lymphography data set, the features uptake_in and dislocation_of were not found in the domain ontology. Therefore, through the relationships of synonymy, hypernymy, hyponymy and others in WordNet, those features were connected to the concepts Disruption and Ingestion, respectively, in an automated way.

When a feature does not appear in boldface, this indicates that the semantic pre-selection confirms the selection of the attribute made in a statistical way. This occurs with the majority of the features of the Primary Tumor data set: Skin, pleura and others.

Features in boldface mean that there was no overlap between the traditional selection of features and the semantic pre-selection. If a feature coming from the traditional methods is not matched to a concept, then it has only statistical significance; probably, an expert would not usually relate this feature to the prediction feature.

That is the case of the X82206 gene (H. sapiens mRNA for alpha-centractin) in the Central Nervous System data set. Querying [3], there is apparently no relationship between this gene and the prediction feature, i.e., survival.

A feature that appears only in the semantic part indicates a situation in which the semantic pre-selection points out a semantically relevant feature that was ignored by the traditional methods. The NM_000314 gene (phosphatase and tensin homolog) fits this situation, because it is related to several types of cancer and tumors [3].




VI. CONCLUSIONS

To assist a data miner in electing the semantically more relevant features in data sets related to cancer, this paper presented a proposal to perform the semantic pre-selection of features by computational means, using a domain ontology (the NCI ontology) and a lexical ontology (WordNet). Furthermore, the proposal does not ignore the statistical significance of the data, since it used a combination of two traditional methods of feature selection: CSE and IG.

Experiments performed with four known data sets related to cancer patients indicated the validity of the proposal.

From the foregoing, a data miner can obtain larger semantic meaning for their classification models, saving time and labor. In addition, from thousands of features, a domain expert can rely on a semantic pre-selection of features, focusing on refining the resulting set.

As future work, the pre-selection of features will be refined by assigning a semantic weight to each feature in order to obtain better outcomes.

ACKNOWLEDGMENT

Thanks to M. Zwitter and M. Soklic for providing the data sets Lymphography, Breast Cancer and Primary Tumor.

REFERENCES

[1] P. K. Ammu and V. Preeja. Review on Feature Selection Techniques of DNA Microarray Data. International Journal of Computer Applications (0975-8887), vol. 61, no. 12, pp. 39-44, January 2013.
[2] A. Autayeu. extJWNL (Extended Java WordNet Library) API. Available at: <http://extjwnl.sourceforge.net/>. Access Date: 25 May, 2013.
[3] NCBI. Available at: <http://www.ncbi.nlm.nih.gov>. Access Date: 17 April, 2014.
[4] UCI Machine Learning Repository. Available at: <http://archive.ics.uci.edu/ml/>. Access Date: 15 May, 2014.
[5] S. L. Pomeroy, et al. Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression. Nature, 415:436-442, January 2002.
[6] A. Chouchoulas and Q. Shen. Rough set-aided keyword reduction for text categorization. Applied Artificial Intelligence: An International Journal, 15(9):843-873, 2001.
[7] M. Dash and H. Liu. Consistency-Based Search in Feature Selection. Artificial Intelligence, 151(1-2):155-176, December 2003.
[8] M. Dean and G. Schreiber. Web Ontology Language (OWL) reference. 2004. Available at: <http://www.w3.org/TR/owl-ref>. Access Date: 12 June, 2014.
[9] T. R. Gruber. Toward Principles for the Design of Ontologies Used for Knowledge Sharing. Int. J. Human-Computer Studies, 43 (1995), 907-928.
[10] N. Guarino. Formal Ontology, Conceptual Analysis and Knowledge Representation. Int. J. Human-Computer Studies, 43 (1995), 625-640.
[11] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research 3, pp. 1157-1182, 2003.
[12] M. Hall et al. The WEKA Data Mining Software: An Update. SIGKDD Explorations, vol. 11, issue 1, 2009.
[13] A. S. Jacinto and J. M. P. Oliveira. A Process for Solving Ill-Structured Problem Supported by Ontology and Software Tools. In: Liñán Reyes, M.; Flores Arias, J. M.; González de la Rosa, J. J.; Langer, J.; Bellido Outeiriño, F. J.; Moreno-Muñoz, A. (Org.). IT Revolutions – Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 1st ed. Berlin: Springer Berlin Heidelberg, 2012, vol. 82, pp. 212-226.
[14] J. Ferlay, et al. GLOBOCAN 2012 v1.0, Cancer Incidence and Mortality Worldwide: IARC CancerBase No. 11 [Internet]. Lyon, France: International Agency for Research on Cancer, 2013. Available at: <http://globocan.iarc.fr>. Access Date: 10 June, 2014.
[15] Y.-T. Kuo, A. Lonie, and L. Sonenberg. Domain Ontology Driven Data Mining: A Medical Case Study. In: Proceedings of the 2007 ACM SIGKDD Workshop on Domain Driven Data Mining (DDDM2007), San Jose, California, USA, August 2007, pp. 11-17.
[16] R. Michalski, I. Mozetic, J. Hong, and N. Lavrac. The Multi-Purpose Incremental Learning System AQ15 and its Testing Application to Three Medical Domains. In: Proceedings of the Fifth National Conference on Artificial Intelligence, pp. 1041-1045. Philadelphia, PA: Morgan Kaufmann, 1986.
[17] J. Euzenat and P. Shvaiko. Ontology Matching. Springer-Verlag, 2007. ISBN 978-3-540-49611-3.
[18] Z. Pawlak. Rough sets. International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341-356, 1982. Plenum, New York, NY.
[19] K. Pearson. On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine 2 (11): 559-572, 1901.
[20] Apache Jena. Available at: <http://jena.apache.org/>. Access Date: 15 January, 2013.
[21] K. Bontcheva, V. Tablan, and H. Cunningham. Semantic Search over Documents and Ontologies. In: N. Ferro (ed.), Bridging Between Information Retrieval and Databases, Lecture Notes in Computer Science, vol. 8173. Berlin: Springer Berlin Heidelberg, 2014, pp. 31-53.
[22] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining, 1st Edition. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.
[23] H. Tennant. Natural Language Processing: An Introduction to an Emerging Technology. Petrocelli Books, Inc., United States, 1981. ISBN: 0-89433-100-0.
[24] WordNet [Online]. Available at: <http://wordnet.princeton.edu>. Access Date: 10 June, 2013.
[25] C.-A. Wu, W.-Y. Lin, C.-L. Jiang, and C.-C. Wu. Toward intelligent data warehouse mining: An ontology-integrated approach for multi-dimensional association mining. Expert Systems with Applications, 38(9):11011-11023, 2011. doi:10.1016/j.eswa.2011.02.144.
[26] M. Zaki and W. Meira Jr. Fundamentals of Data Mining Algorithms. Cambridge University Press (in press), 2009. 555p. Available at: <http://www.dcc.ufmg.br/miningalgorithms/>. Access Date: 10 June, 2013.
[27] K. Kira and L. A. Rendell. The Feature Selection Problem: Traditional Methods and a New Algorithm. In: Proceedings of the 10th Conference on Artificial Intelligence, pp. 129-136, Menlo Park, CA, 1992.
[28] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991; S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In: Proceedings of the International Conference on Information and Knowledge Management, pp. 148-155, 1998.
[29] National Cancer Institute [Online]. Available at: <http://www.cancer.gov/>. Access Date: 12 June, 2014.
[30] D. Baron, A. Bihouée, R. Teusan, E. Dubois, F. Savagner, M. Steenman, R. Houlgatte, and G. Ramstein. MADGene: retrieval and processing of gene identifier lists for the analysis of heterogeneous microarray datasets. Bioinformatics, 27(5):725-726, March 2011. Epub 2011 Jan 6. PubMed PMID: 21216776.
[31] M. A. Jaro. Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491-498, 1995. doi:10.1002/sim.4780140510. PMID 7792443.
[32] H. Judith Mary, S. Seetha Lakshmi, and R. Shobana (Correspondence: Acharya KK, [email protected]). A compilation of Gene ID conversion resources. In: Startbioinfo, 07 June 2011. Available at: <http://www.shodhaka.com/cgi-bin/startbioinfo/simpleresources.pl?tn=Gene ID conversion/>. Access Date: 10 June, 2013.




[33] G. Sudeepthi, G. Anuradha, and M. Surendra Prasad Babu. A survey on semantic web search engines. International Journal of Computer Science Issues, vol. 9, no. 1, March 2012.
[34] P. Aubrecht and M. Žáková. Data Preprocessing Using Ontologies. In: Mobile Computing Meets Knowledge Management. Praha: Czech Technical University in Prague, 2005, pp. 21-24. ISBN 80-903198-0-7.
[35] G. Mansingh, K.-M. Osei-Bryson, and H. Reichgelt. Using ontologies to facilitate post-processing of association rules by domain experts. Information Sciences, 181(3):419-434, 2011. doi:10.1016/j.ins.2010.09.027.

