Data Mining: A Database Perspective
Abstract
Data mining on large databases has been a major concern in the research
community, due to the difficulty of analyzing huge volumes of data using
only traditional OLAP tools. This sort of process demands a great deal of
computational power, memory and disk I/O, which can only be provided by
parallel computers. We present a discussion of how database technology can
be integrated with data mining techniques. Finally, we also point out
several advantages of addressing data-consuming activities through a tight
integration of a parallel database server and data mining techniques.
1 Introduction
Data mining techniques have been increasingly studied [7, 9, 21],
especially in their application to real-world databases. One typical
problem is that databases tend to be very large, and these techniques
often repeatedly scan the entire data set. Sampling has long been
used, but subtle differences among sets of objects become less
evident in a sample.
This work provides an overview of some important data mining
techniques and their applicability to large databases. We also point
out several advantages of using a database management system (DBMS)
to manage and process information instead of conventional flat files.
This approach has been a major concern of several researchers,
because it is a very natural solution: DBMSs have been successfully
used in business management and may currently store valuable hidden
knowledge.
One requirement of data mining is the efficiency and scalability of
mining algorithms. This makes parallelism even more relevant as a
way of processing long-running tasks in a timely manner. In this
context, parallel database systems play an important role, because
they can offer, among other advantages, a transparent and painless
implementation of parallelism to process large data sets.
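As a rough illustration of the kind of data parallelism a parallel
database server hides from the application, the following minimal
sketch partitions a record set horizontally, counts class labels in
each partition independently, and merges the partial counts. The
function names, the record layout and the use of Python threads are
assumptions made for illustration only; a real parallel DBMS
distributes partitions across processors or nodes and performs these
steps internally.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Round-robin horizontal partitioning (hypothetical placement policy)."""
    return [records[i::n_parts] for i in range(n_parts)]


def count_partition(part):
    """Partial aggregate: class label frequencies within one partition."""
    return Counter(label for _, label in part)


def parallel_class_counts(records, n_parts=4):
    """Count labels per partition concurrently, then merge the partial results."""
    parts = partition(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_partition, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merging partial counts is a simple sum
    return total


# Illustrative records: (record id, class label).
records = [(i, "high" if i % 3 == 0 else "low") for i in range(12)]
counts = parallel_class_counts(records)
```

The threads here only show the structure of the computation (partition,
partial aggregate, merge); they are not meant to demonstrate a speedup.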
It is important to notice that, when we mention the use of large
amounts of information in data mining, we are not referring to the
usual large DBMSs, which can hold more than one terabyte of data. As
data mining methods often repeatedly scan the data set, mining a
database of that size has not yet been reported in the literature.
The remainder of this work is organized as follows. Section 2
presents some important mining techniques currently implemented
in data mining systems. Section 3 describes how these techniques
can be applied to both flat files and DBMSs, emphasizing the
advantages of the latter approach, and Section 4 presents important
advantages to be considered when using parallel databases to mine
knowledge. Section 5 describes a case study and an implementation
of a well-known classifier algorithm with a tightly-coupled
integration with a database system. Section 6 points out some related
data mining systems, whereas Section 7 presents our conclusions and
final observations.
Even when using databases, most current data mining applications
have only a loose connection with them. They treat the database
simply as a container from which data is extracted directly into the
main memory of the computer responsible for running the data mining
algorithm, just before the main execution begins. This approach
limits the amount of data that can be used, forcing applications to
filter information and use only part of it to discover patterns.
Alternatively, some applications dynamically issue queries to the
database, but work in a client/server architecture. This means that,
depending on the amount of data being transferred, unnecessary
network traffic is generated. Moreover, they are often written in
programming languages that have no integration with the database
system.
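The data-transfer cost described above can be illustrated with a
minimal sketch. It assumes Python's built-in sqlite3 module as a
stand-in for a DBMS, and the table name, column names and rows are
hypothetical. The first function follows the loosely-coupled pattern:
every row is fetched to the client and counted there. The second
pushes the counting into the database with GROUP BY, so only one row
per group is returned. Since sqlite3 runs in-process, no real network
transfer occurs; the row counts merely indicate how much data would
have to travel in a client/server setting.

```python
import sqlite3
from collections import Counter


def build_demo_db():
    """Create an in-memory table with hypothetical columns and rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (age_group TEXT, risk TEXT)")
    rows = [
        ("young", "high"), ("young", "high"), ("young", "low"),
        ("senior", "low"), ("senior", "low"), ("senior", "high"),
    ]
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    return conn


def class_counts_loose(conn):
    """Loosely coupled: fetch every row to the client, then count there."""
    rows = conn.execute("SELECT age_group, risk FROM customers").fetchall()
    return Counter(rows), len(rows)


def class_counts_pushed(conn):
    """Counting done by the database: one result row per (attribute, class)."""
    result = conn.execute(
        "SELECT age_group, risk, COUNT(*) FROM customers "
        "GROUP BY age_group, risk"
    ).fetchall()
    return Counter({(a, r): n for a, r, n in result}), len(result)


conn = build_demo_db()
loose, rows_moved_loose = class_counts_loose(conn)
pushed, rows_moved_pushed = class_counts_pushed(conn)
```

Both functions produce the same counts, but the first moves every
stored row to the client, while the second moves only as many rows as
there are distinct (attribute value, class) combinations.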
3.2.2 Tightly-coupled
7 Conclusions
Data mining and its application to large databases have been
extensively studied due to the increasing difficulty of analyzing large
volumes of data using only OLAP tools. This difficulty pointed out
the need for an automated process to discover interesting and hidden
patterns in real-world data sets. The ability to handle large amounts
of information has been a major concern in many recent data mining
applications. Parallel processing plays an important role in this
context, since only parallel machines can provide sufficient
computational power, memory and disk I/O.
We described some important data mining techniques, briefly
presenting each one and showing how it can contribute to the
pattern discovery process. Furthermore, we presented several
advantages of implementing a data mining method using a DBMS
instead of conventional flat files. Our practical work exploited
many specific characteristics of DBMSs, providing a tightly-coupled
integration of a data mining technique with a parallel database server
using a complex application. We have exercised adverse situations,
such as a large number of attributes and discrete and continuous
attributes with many distinct values, observing problems and
solutions during the whole process.
Experimental results have shown performance bottlenecks when
using a DBMS as compared to flat files, due to the nature of the
current SQL offered by most commercial databases. However, if we
assume that data mining of large databases is going to become a
routine task in the future and that DBMS vendors are going to
implement new features that could help the data mining process, then
database servers will play an important role in this context. It is
possible that changes to current DBMSs and to the SQL language could
enable data mining operations to be performed more efficiently.
Besides, emerging object-relational technology opens a potential
area that remains to be exploited. Other interfaces, such as those
for integrating with indexing and optimization mechanisms, will be
available in the near future, and can offer a means of intervening
in the parallel optimization process.
References
[1] Agrawal, R., Ghosh, S., Imielinski, T., Iyer, B., & Swami, A.,
An Interval classifier for database mining applications, Proc. of
VLDB Conference, Vancouver, Canada, pp. 560-573, 1992.
[2] Agrawal, R., Imielinski, T., & Swami, A., Mining association rules
between sets of items in large databases, Proc. of Int. Conf. ACM
SIGMOD, Washington D. C. pp. 207-216, 1993.
[3] Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P.,
Automatic Subspace Clustering of High Dimensional Data for Data
Mining Applications, Proc. of the ACM SIGMOD Int. Conf. on
Management of Data, Seattle, Washington, 1998.
[4] Agrawal, R., Mehta, M., Shafer, J., & Srikant, R., The Quest
Data Mining System, Proc. of the 2nd Int. Conf. on Knowledge
Discovery in Databases and Data Mining, Portland, Oregon, 1996.
[5] Agrawal, R., & Shim, K., Developing Tightly-Coupled Data Min-
ing Applications on a Relational Database System, Proc. of 2nd
Int. Conf. on Knowledge Discovery in Databases and Data Min-
ing, Portland, Oregon, 1996.
[6] Agrawal, R., & Srikant, R., Fast Algorithms for mining associa-
tion rules, Proc. of the 20th VLDB Int. Conf., Santiago, Chile,
1994.
[7] Bigus, J. P., Data Mining with Neural Networks, McGraw-Hill,
1996.
[8] Breiman, L., Friedman, J., Olshen, R., & Stone, C., Classification
and Regression Trees, Wadsworth International Group, 1984.
[9] Chen, M. S., Han, J., & Yu, P. S., Data Mining: An Overview
from Database Perspective, IEEE Trans. on Knowledge and Data
Engineering, Vol. 8, No. 6, pp. 866-883, 1996.
[10] Fayyad, U., Djorgovski, S., & Weir, N., Automating the
Analysis and Cataloging of Sky Surveys. In Advances in Knowledge
Discovery and Data Mining, pp. 471-493, AAAI Press, 1996.
[11] Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P., From Data
Mining to Knowledge Discovery: An Overview. In Advances in
Knowledge Discovery and Data Mining, pp. 1-34, AAAI Press,
1996.
[12] Freitas, A., Generic, Set-oriented Primitives to Support
Data-parallel Knowledge Discovery in Relational Database Systems,
PhD diss., 1997,
http://cswww.essex.ac.uk/SystemsArchitecture/DataMining/alex/thesis.html.
[13] Freitas, A., & Lavington, S. H., Mining Very Large Databases
With Parallel Processing, Kluwer Academic Publishers, 1998.
[14] Hallmark, G., Oracle Parallel Warehouse Server, Proc. of ICDE,
pp. 314-320, 1997.
[15] Han, J., Fu, Y., Koperski, K., Melli, G., Wang, W., & Zaiane,
O., Knowledge Mining in Databases: An Integration of Machine
Learning Methodologies with Database Technologies, Canadian
AI Magazine, 1995.
[16] Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K.,
Li, D., Lu, Y., Rajan, A., Stefanovic, N., Xia, B., & Zaiane, O.,
DBMiner: A System for Mining Knowledge in Large Relational
Databases. Proc. Int. Conf. on KDD, Portland, Oregon, 1996.
[17] Holsheimer, M., Kersten, M. L., & Siebes, A., Data Surveyor:
Searching the Nuggets in Parallel. In Advances in Knowledge
Discovery and Data Mining, pp. 447-467, AAAI Press, 1996.
[18] Kufrin, R., Decision Trees on Parallel Processors, Proc. of the
IJCAI Workshop on Parallel Processing for Artificial Intelligence,
pp. 87-95, 1995.
[19] Mehta, M., Agrawal, R., & Rissanen, J., SLIQ: A Fast Scalable
Classifier for Data Mining, Proc. of the 5th Int. Conf. on
Extending Database Technology (EDBT), Avignon, France, 1996.
[20] Mehta, M., & DeWitt, D.J., Data Placement in shared-nothing
parallel database systems. VLDB Journal, Springer-Verlag, 1997.
[21] Mitchell, T. M., Machine Learning. McGraw-Hill, 1997.
[22] Navathe, S. B., & Ra, M., Vertical Partitioning for Database
Design: A Graphical Algorithm. Proc. of SIGMOD Int. Conf.,
pp. 440-450, 1989.
[23] Oracle Corporation. Oracle Parallel Server Concepts & Admin-
istration Rel. 7.3. Oracle Technical Manual.
[24] Quinlan, J., C4.5: Programs for Machine Learning. Morgan
Kaufmann, 1993.
[25] Shafer, J., Agrawal, R., & Mehta, M., SPRINT: A Scalable
Parallel Classifier for Data Mining, Proc. of the 22nd Int. Conf. on
VLDB, Mumbai, India, 1996.
[26] Sousa, M. S., Mattoso, M.L.Q., & Ebecken, N.F.F., Data Mining:
A Tightly-Coupled Implementation using a Parallel Database
Server. Proc. Int. Conf. on DEXA Workshop Parallel Databases:
innovative applications and new architecture, IEEE CS, Vienna,
Austria, 1998.
[27] Sousa, M. S., Mattoso, M.L.Q., & Ebecken, N.F.F., Data Mining
on Parallel Database Systems. Proc. Int. Conf. on PDPTA: Special
Session on Parallel Data Warehousing, CSREA Press, Las
Vegas, 1998.
[28] Srikant, R., & Agrawal, R., Mining Sequential Patterns: Gen-
eralizations and Performance Improvements, Proc. of the 5th
Int. Conf. on Extending Database Technology (EDBT), Avignon,
France, 1996.