Data Mining

1.0 INTRODUCTION
1.1 What is Data Mining and Data Warehousing?
1.2 History Of Data Mining
2.0 THEORY
2.1 Technological Infrastructure Of Data Mining
2.2 Working Of Data Mining
2.3 Analysis at Different Levels
2.4 Elements of Data Mining
2.5 Advantages of Data Mining
2.6 Disadvantages of Data Mining
2.7 Applications of Data Mining
2.8 Future of Data Mining
3.0 CONCLUSION
4.0 FUTURE OF DATA MINING

With the increased and widespread use of technology, interest in data mining has grown rapidly. Companies now use data mining techniques to examine their databases for trends, relationships, and outcomes that can enhance their overall operations, and to discover new patterns that may allow them to serve their customers better. Data mining provides numerous benefits to businesses, government, and society, as well as to individuals. However, like many technologies, data mining also has negative consequences, such as the invasion of privacy rights. This paper explores the advantages as well as the disadvantages of data mining.

WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities. In 1995, WalMart computers processed over 1 million complex data queries.

1.1 DATA MINING AND DATA WAREHOUSING: Data mining (DM) is also called Knowledge Discovery in Databases (KDD) or simply knowledge discovery. It has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and as "the science of extracting useful information from large data sets or databases".

Data warehousing is defined as a process of centralized data management and retrieval. It represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis. A data warehouse is a copy of transaction data specifically structured for querying and reporting.

1.2 HISTORY OF DATA MINING: For many years, statistics have been used to analyze data in an effort to find correlations, patterns, and dependencies. However, with advances in technology, more and more data have become available, greatly exceeding the human capacity to analyze them manually. Before the 1990s, the data collected by bankers, credit card companies, department stores, and so on were little used. In recent years, as computational power has increased, the idea of data mining has emerged. Data mining is a term used to describe the process of discovering patterns and trends in large data sets in order to find useful decision-making information. With data mining, the information obtained from bankers, credit card companies, and department stores can be put to good use.

2.1 TECHNOLOGICAL INFRASTRUCTURE OF DATA MINING: Today, data mining applications are available on systems of all sizes, for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. There are two critical technological drivers:

Size of the database: the more data being processed and maintained, the more powerful the system required.

Query complexity: the more complex the queries and the greater the number of queries being processed, the more powerful the system required.

Relational database storage and management technology is adequate for many data mining applications smaller than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support larger applications. Some vendors have added extensive indexing capabilities to improve query performance. Others use new hardware architectures such as Massively Parallel Processors (MPP) to achieve order-of-magnitude improvements in query time. For example, MPP systems from NCR link hundreds of high-speed Pentium processors to achieve performance levels exceeding those of the largest supercomputers.

2.2 WORKING OF DATA MINING: Data mining is a component of a wider process called knowledge discovery in databases. It involves scientists and statisticians, as well as those working in other fields such as machine learning, artificial intelligence, information retrieval, and pattern recognition. Before a data set can be mined, it first has to be cleaned. This cleaning process removes errors, ensures consistency, and takes missing values into account. Next, computer algorithms are used to mine the clean data, looking for unusual patterns. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

Classes: Stored data is used to locate data in predetermined groups. This information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences.

Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.

Sequential patterns: Data is mined to anticipate behavior patterns and trends.
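The association relationship above can be made concrete with a small sketch. The code below is illustrative only: the transactions are invented, and `pair_rules` and its `min_support` threshold are hypothetical names, not part of any particular data mining product. It counts how often pairs of items appear together (support) and how often the second item appears given the first (confidence), which is the basic idea behind the beer-diaper example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data: each set is one point-of-sale transaction.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
]

def pair_rules(transactions, min_support=0.3):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(transactions)
    # How many transactions contain each single item.
    item_counts = Counter(item for t in transactions for item in t)
    # How many transactions contain each unordered pair of items.
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be interesting
        # Confidence of "a -> b" is P(b | a); emit both directions.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

rules = pair_rules(transactions)
for antecedent, consequent, support, confidence in sorted(rules):
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

Real association mining tools consider item sets of any size and prune the search space far more aggressively; restricting the sketch to pairs keeps the two measures easy to follow.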


2.3 ANALYSIS AT DIFFERENT LEVELS: Different levels of analysis are available:

Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.

Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically.

Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.

Rule induction: The extraction of useful if-then rules from data based on statistical significance.
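Of the techniques listed above, the nearest neighbor method is the simplest to sketch in a few lines. The example below is a minimal illustration, assuming Euclidean distance and a majority vote as the way of combining the k classes; the function name `knn_classify` and the small historical dataset are hypothetical.

```python
import math
from collections import Counter

def knn_classify(record, history, k=3):
    """Classify `record` by majority vote among its k nearest historical records.

    `history` is a list of (features, label) pairs; distance is Euclidean.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort the historical records by distance to the new record, keep the k closest.
    nearest = sorted(history, key=lambda pair: math.dist(record, pair[0]))[:k]
    # Combine the classes of the neighbors with a simple majority vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical dataset:
# (age, income in thousands) -> whether the customer responded to an offer.
history = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
    ((48, 85), "yes"),
]

print(knn_classify((47, 82), history, k=3))
print(knn_classify((27, 32), history, k=1))
```

In practice the features would usually be scaled to comparable ranges first, since an unscaled attribute with large values dominates the distance.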










multidimensional data. Graphics tools are used to illustrate data relationships.

2.4 ELEMENTS OF DATA MINING: Data mining consists of five major elements:

Extract, transform, and load transaction data onto the data warehouse system.

Store and manage the data in a multidimensional database system.

Provide data access to business analysts and information technology professionals.

Analyze the data by application software.

Present the data in a useful format, such as a graph or table.
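The flow through these elements can be sketched on a very small scale. The example below is only an illustration under stated assumptions: the raw records are invented, the `transform` function is hypothetical, and an in-memory SQLite table stands in for the warehouse (SQLite is a relational store, not a multidimensional database system).

```python
import sqlite3

# Extract: hypothetical raw point-of-sale records (store, product, amount as text).
raw_rows = [
    ("store-1", " Diapers ", "12.50"),
    ("store-1", "beer", "8.00"),
    ("store-2", "BEER", "8.00"),
    ("store-2", "diapers", ""),  # missing amount: dropped during transform
]

def transform(rows):
    """Normalize product names and discard rows without a usable amount."""
    for store, product, amount in rows:
        if not amount.strip():
            continue
        yield store, product.strip().lower(), float(amount)

# Load: an in-memory SQLite table stands in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", transform(raw_rows))
conn.commit()

# Analyze: total revenue per product.
totals = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()

# Present: print the result as a small table.
for product, total in totals:
    print(f"{product:<10}{total:>8.2f}")
```

A production warehouse would separate these stages into scheduled jobs and add access control for analysts; the sketch only shows how the stages hand data to one another.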

2.5 ADVANTAGES OF DATA MINING:

Marketing/Retailing: Data mining can aid direct marketers by providing them with useful and accurate trends about their customers' purchasing behavior. Based on these trends, marketers can direct their marketing attention to their customers with more precision. For example, marketers of a software company may advertise their new software to consumers who have an extensive software purchasing history. In addition, data mining may help marketers predict which products their customers may be interested in buying. Through this prediction, marketers can surprise their customers and make the customers' shopping experience a pleasant one. Retail stores can benefit from data mining in similar ways. For example, using the trends provided by data mining, store managers can arrange shelves, stock certain items, or provide a certain discount that will attract their customers.

Banking/Crediting: Data mining can assist financial institutions in areas such as credit reporting and loan information. For example, by examining previous customers with similar attributes, a bank can estimate the level of risk associated with each given loan. In addition, data mining can assist credit card issuers in detecting potentially fraudulent credit card transactions. Although the data mining technique is not 100% accurate in its predictions about fraudulent charges, it does help credit card issuers reduce their losses.

Law enforcement: Data mining can aid law enforcers in identifying criminal suspects, as well as in apprehending these criminals, by examining trends in location, crime type, habit, and other patterns of behavior.

Researchers: Data mining can assist researchers by speeding up their data analysis process, thus allowing them more time to work on other projects.

2.6 DISADVANTAGES OF DATA MINING:

Privacy issues: Personal privacy has always been a major concern in this country. In recent years, with the widespread use of the Internet, concerns about privacy have increased tremendously. Because of privacy issues, some people do not shop on the Internet. They are afraid that somebody may gain access to their personal information and then use that information in an unethical way, thus causing them harm. Although it is against the law to sell or trade personal information between different organizations, the sale of personal information has occurred.

Security issues: Although companies have a lot of personal information about us available online, they do not have sufficient security systems in place to protect that information. For example, the Ford Motor Credit Company recently had to inform 13,000 of its consumers that their personal information, including Social Security number, address, account number, and payment history, had been accessed by hackers who broke into a database belonging to the Experian credit reporting agency. This incident illustrates that companies are willing to disclose and share your personal information, but they are not taking proper care of it. With so much personal information available, identity theft could become a real problem.

Misuse of information/inaccurate information: Trends obtained through data mining that are intended for marketing or other ethical purposes may be misused. Unethical businesses or people may use the information obtained through data mining to take advantage of vulnerable people or to discriminate against a certain group of people. In addition, the data mining technique is not 100 percent accurate; mistakes do happen, and they can have serious consequences.

2.7 ETHICAL ISSUES: As with many technologies, both positives and negatives lie in the power of data mining. There are, of course, valid arguments on both sides. Here are the positive as well as the negative aspects of data mining from different perspectives.

According to consumers, data mining benefits businesses more than it benefits them. Consumers may benefit from data mining when companies customize their products and services to fit the consumers' individual needs. However, the consumers' privacy may be lost as a result of data mining.

Through data mining, financial and insurance companies are able to detect patterns of fraudulent credit card usage, identify the behavior patterns of high-risk customers, and analyze claims. Data mining helps these companies minimize their risk and increase their profits. Since companies are able to minimize their risk, they may be able to charge customers a lower interest rate or a lower premium.

Data mining can aid law enforcers in identifying criminal suspects and apprehending these criminals. It can help reduce the amount of time and effort that law enforcers have to spend on any one particular case, thus allowing them to deal with more problems. Hopefully, this would make the country a safer place. In addition, data mining may help reduce terrorist acts by allowing government officers to identify and locate potential terrorists early.

2.8 APPLICATIONS:

Business intelligence

Bioinformatics

Chemoinformatics

Intelligence science

Discovery science

Loyalty card

Quantitative structure-activity relationship

3.0 CONCLUSION: Data mining can be beneficial for businesses, governments, and society, as well as for the individual person. However, the major flaw of data mining is that it increases the risk of privacy invasion. Currently, business organizations do not have sufficient security systems to protect the information that they obtain through data mining from unauthorized access; thus, the use of data mining should be restricted. In the future, when companies are willing to spend money to develop sufficient security systems to protect consumer data, the use of data mining may be supported.

Database marketing software applications will have a tremendous impact on how business is done in the future. Although the core data mining technology is here today, developers need to take what already exists and turn it into something that business users










combine data mining technology with a thorough understanding of business problems and present the results in a way that the user can understand.

4.0 FUTURE OF DATA MINING: What does the future have in store for data mining? In the end, much of what is called data mining will likely end up as standard tools built into database or data warehouse software products. To motivate this statement, I would like to use the field of spell checking software as an example. Just look back ten years to the infancy of computer word processing. Many companies made spell checking software. You would usually buy a spell checker as a separate piece of software for use with whatever word processor you might have. Sometimes the spell checker wouldn't understand a particular word processor's file format. Some spell checkers might even have required you to dump your document as an ASCII file before they would check the spelling (on the ASCII file). In that case, you would have had to make corrections manually in the original document. Eventually the spell checkers became more user friendly and understood every possible document format. Functionality also increased. The future of spell checking probably looked pretty rosy

