DM - Unit4
Definition - History of Data Mining- Features of Data Mining - Types of Data Mining - Data
Mining Vs Data Warehousing- Advantages and Disadvantages of Data Mining - Data Mining
Applications - Challenges of Implementation in Data mining - Steps involved in Data Mining -
Classification of Data Mining Systems.
4.1 DATA MINING
Data mining is one of the most useful techniques that help entrepreneurs, researchers, and
individuals to extract valuable information from huge sets of data. Data mining is also
called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes
Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern
evaluation, and Knowledge presentation.
Data mining is the process of extracting information from huge sets of data to identify patterns,
trends, and useful data that allow a business to make data-driven decisions. In other words, data
mining is the process of investigating hidden patterns of information from various perspectives
and categorizing them into useful data. That data is collected and assembled in areas such as data
warehouses for efficient analysis by data mining algorithms, which supports decision making and
other data requirements and ultimately helps cut costs and generate revenue.
Data mining is the act of automatically searching large stores of information to find trends
and patterns that go beyond simple analysis procedures. Data mining utilizes complex
mathematical algorithms to segment the data and evaluate the probability of future events. Data
mining is also known as Knowledge Discovery in Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases to
solve business problems. It primarily turns raw data into useful information.
Data mining is similar to data science: it is carried out by a person, in a specific situation, on a
particular data set, with an objective. The process includes various types of services such as text
mining, web mining, audio and video mining, pictorial data mining, and social media mining. It
is done through software that may be simple or highly specialized. By outsourcing data mining, the
work can be done faster and with lower operating costs. Specialized firms can also use new
technologies to collect data that is impossible to locate manually. Vast amounts of
information are available on various platforms, but very little of it is accessible as knowledge.
The biggest challenge is to analyze the data to extract important information that can be used to
solve a problem or to develop the company. Many powerful tools and techniques are
available to mine data and gain better insight from it.
The term "Data Mining" was introduced in the 1990s, but data mining is the evolution of a field
with an extensive history.
Early techniques for identifying patterns in data include Bayes' theorem (1700s) and the
development of regression (1800s). The spread and growing power of computing have boosted data
collection, storage, and manipulation as data sets have grown in size and complexity.
Explicit hands-on data investigation has progressively been augmented with indirect, automatic
data processing and other computer science discoveries such as neural networks, clustering,
genetic algorithms (1950s), decision trees (1960s), and support vector machines (1990s).
Data mining origins are traced back to three family lines: Classical statistics, Artificial
intelligence, and Machine learning.
Classical statistics:
Statistics is the basis of most of the technology on which data mining is built, such as regression
analysis, standard deviation, standard distribution, variance, discriminant analysis,
cluster analysis, and confidence intervals. All of these are used to analyze data and the
relationships within data.
Artificial Intelligence:
Machine Learning:
Relational Database:
A relational database is a collection of multiple data sets formally organized into tables, records,
and columns, from which data can be accessed in various ways without having to reorganize the
database tables. Tables convey and share information, which facilitates data searchability,
reporting, and organization.
Data warehouses:
A data warehouse is the technology that collects data from various sources within the
organization to provide meaningful business insights. The huge amount of data comes from
multiple places such as Marketing and Finance. The extracted data is utilized for analytical
purposes and helps in decision-making for a business organization. The data warehouse is
designed for the analysis of data rather than transaction processing.
Data Repositories:
A data repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an IT
structure, for example, a group of databases in which an organization keeps various kinds of
information.
Object-Relational Database:
One of the primary objectives of the Object-relational data model is to close the gap between the
Relational database and the object-oriented model practices frequently utilized in many
programming languages, for example, C++, Java, C#, and so on.
Transactional Database:
A transactional database refers to a database management system (DBMS) that can
undo a database transaction if it is not performed correctly. Although this was once a unique
capability, today most relational database systems support
transactional database activities.
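The undo behaviour described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumed names: the accounts table, the transfer helper, and the balances are hypothetical and not part of the notes.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts; undo every change if a step fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical in-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # overdraws, so both updates are undone
print(ok, failed, balances(conn))
```

The second transfer leaves both balances exactly as they were after the first one, which is the "undo" property the paragraph refers to.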
Data Mining vs Data Warehousing (each pair below is one row of the comparison):

Data mining: Data mining is the process of determining data patterns.
Data warehousing: A data warehouse is a database system designed for analytics.

Data mining: Data mining is generally considered to be the process of extracting useful data from a large set of data.
Data warehousing: Data warehousing is the process of combining all the relevant data.

Data mining: Business entrepreneurs carry out data mining with the help of engineers.
Data warehousing: Data warehousing is carried out entirely by engineers.

Data mining: In data mining, data is analyzed repeatedly.
Data warehousing: In data warehousing, data is stored periodically.

Data mining: Data mining uses pattern recognition techniques to identify patterns.
Data warehousing: Data warehousing is the process of extracting and storing data in a way that allows easier reporting.

Data mining: One of the most useful data mining techniques is the detection and identification of unwanted errors that occur in the system.
Data warehousing: One of the advantages of the data warehouse is its ability to update frequently. That is why it is ideal for business entrepreneurs who want to stay up to date.

Data mining: Data mining techniques are cost-efficient compared to other statistical data applications.
Data warehousing: The responsibility of the data warehouse is to simplify every type of business data.

Data mining: Data mining techniques are not 100 percent accurate, which may lead to serious consequences in certain conditions.
Data warehousing: In the data warehouse, there is a high possibility that the data required for analysis by the company has not been integrated into the warehouse, which can lead to loss of data.

Data mining: Companies can benefit from this analytical tool by obtaining suitable and accessible knowledge-based data.
Data warehousing: A data warehouse stores a huge amount of historical data that helps users analyze different periods and trends to make future predictions.
Advantages:
o It is a quick process that makes it easy for new users to analyze enormous amounts of
data in a short time.
Disadvantages:
o Organizations may sell useful customer data to other
organizations for money. American Express has reportedly sold its customers' credit card
purchase data to other organizations.
o Much data mining analytics software is difficult to operate and needs advanced training to
work with.
o Different data mining tools operate in distinct ways because of the different algorithms
used in their design. Therefore, selecting the right data mining tool is a very
challenging task.
o Data mining techniques are not precise, which may lead to severe consequences in
certain conditions.
Data mining is primarily used by organizations with intense consumer demands, such as retail,
communication, financial, and marketing companies, to determine prices, consumer preferences,
product positioning, and the impact on sales, customer satisfaction, and corporate profits. Data
mining enables a retailer to use point-of-sale records of customer purchases to develop products
and promotions that help the organization attract customers.
Data mining is widely used in the following areas:
Data mining in healthcare has excellent potential to improve the health system. It uses data and
analytics to gain better insights and to identify best practices that enhance health care services
and reduce costs. Analysts use approaches such as machine learning, multi-dimensional
databases, data visualization, soft computing, and statistics. Data mining can be
used to forecast the number of patients in each category. Such procedures help ensure that patients
get intensive care at the right place and at the right time. Data mining also enables healthcare
insurers to recognize fraud and abuse.
Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group
of products, then you are more likely to buy another group of products. This technique may
enable the retailer to understand the purchase behavior of a buyer. This data may assist the
retailer in understanding the requirements of the buyer and altering the store's layout accordingly.
Analytical comparisons can also be made between the results of various stores and between
customers in different demographic groups.
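The "buy one group, likely to buy another" hypothesis is usually quantified with support and confidence. The sketch below assumes a tiny hypothetical list of baskets; the item names and helper functions are illustrative only.

```python
# Hypothetical point-of-sale baskets; each set is one customer's purchase.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return count(both, baskets) / count(antecedent, baskets)

print(support({"bread", "butter"}, baskets))       # 3 of 5 baskets -> 0.6
print(confidence({"bread"}, {"butter"}, baskets))  # 3 of 4 bread baskets -> 0.75
print(confidence({"butter"}, {"bread"}, baskets))  # 3 of 3 butter baskets -> 1.0
```

A retailer could read the last line as "every basket with butter also had bread", which is the kind of pattern that might inform shelf placement.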
Educational data mining (EDM) is a newly emerging field concerned with developing techniques that
extract knowledge from the data generated in educational environments. The objectives of EDM
include predicting students' future learning behavior, studying the impact of educational
support, and advancing learning science. An institution can use data mining to make precise
decisions and to predict students' results. With these results, the institution can
concentrate on what to teach and how to teach it.
Knowledge is the best asset possessed by a manufacturing company. Data mining tools can
help find patterns in a complex manufacturing process. Data mining can be used in
system-level design to obtain the relationships between product architecture, product
portfolio, and the data needs of customers. It can also be used to forecast the product
development period, cost, and expectations, among other tasks.
Customer Relationship Management (CRM) is all about acquiring and retaining customers,
enhancing customer loyalty, and implementing customer-oriented strategies. To build a good
relationship with its customers, a business organization needs to collect data and analyze it.
With data mining technologies, the collected data can be used for analytics.
Billions of dollars are lost to fraud. Traditional methods of fraud detection are
time-consuming and complex. Data mining provides meaningful patterns and
turns data into information. An ideal fraud detection system should protect the data of all
users. Supervised methods start from a collection of sample records that are
classified as fraudulent or non-fraudulent. A model is constructed from this data, and the
model is then used to identify whether a new record is fraudulent or not.
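The supervised idea above (labelled sample records, then a model that labels new records) can be sketched with a nearest-centroid classifier. This is a deliberately simple stand-in chosen for illustration, not a method the notes prescribe, and the two features (transaction amount, transactions in the last hour) and all sample values are assumptions.

```python
from math import dist

def fit_centroids(samples):
    """Average the feature vectors of each label to get one centroid per class."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(s / counts[label] for s in acc) for label, acc in sums.items()}

def classify(centroids, features):
    """Assign the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical labelled records: (amount, transactions in the last hour).
training = [
    ((20.0, 1.0), "legit"),
    ((35.0, 2.0), "legit"),
    ((50.0, 1.0), "legit"),
    ((900.0, 9.0), "fraud"),
    ((1200.0, 12.0), "fraud"),
    ((1000.0, 15.0), "fraud"),
]

model = fit_centroids(training)
print(classify(model, (40.0, 1.0)))    # close to the "legit" centroid
print(classify(model, (950.0, 10.0)))  # close to the "fraud" centroid
```

Real systems would scale the features and use far richer models; the point here is only the train-then-classify shape of a supervised method.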
Apprehending a criminal is relatively easy, but getting the truth out of a suspect is a very
challenging task. Law enforcement may use data mining techniques to investigate offenses,
monitor suspected terrorist communications, and so on. These techniques include text mining, which
seeks meaningful patterns in data that is usually unstructured text. The information collected
from previous investigations is compared, and a model for lie detection is constructed.
The digitalization of the banking system is expected to generate an enormous amount of data
with every new transaction. Data mining can help bankers solve business-related
problems in banking and finance by identifying trends, causalities, and correlations in
business information and market costs that are not immediately evident to managers or executives
because the data volume is too large or is produced too rapidly for experts to screen.
Managers may use this data for better targeting, acquiring, retaining, segmenting, and
maintaining profitable customers.
Although data mining is very powerful, it faces many challenges during its execution. These
challenges may relate to performance, data, methods, techniques, and so on. The process of
data mining becomes effective when the challenges or problems are correctly recognized and
adequately resolved.
Incomplete and Noisy Data:
Data mining is the process of extracting useful data from large volumes of data. Real-world
data is heterogeneous, incomplete, and noisy. Data in huge quantities will often be
inaccurate or unreliable. These problems may occur because of faulty measuring instruments or
human error. Suppose a retail chain collects the phone numbers of customers who spend more
than $500, and the accounting employees enter the information into their system. An employee may
mistype a digit when entering a phone number, which results in incorrect data. Some
customers may not be willing to disclose their phone numbers, which results in incomplete
data. The data could also be changed by human or system error. All these problems (noisy
and incomplete data) make data mining challenging.
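The phone number example can be turned into a small check that separates missing values from malformed ones. This is a sketch under an assumed rule (a valid number has exactly 10 digits); the customer names and the classify_phone helper are hypothetical.

```python
import re

EXPECTED_DIGITS = 10  # assumption for this sketch; real formats vary by country

def classify_phone(raw):
    """Return ('ok', digits), ('noisy', None) or ('missing', None) for one raw entry."""
    if raw is None or not raw.strip():
        return ("missing", None)      # customer did not disclose a number
    digits = re.sub(r"\D", "", raw)   # drop spaces, dashes and other non-digits
    if len(digits) != EXPECTED_DIGITS:
        return ("noisy", None)        # a digit was dropped or added during entry
    return ("ok", digits)

# Hypothetical customer records as keyed in by staff.
records = [
    ("Asha", "98765 43210"),
    ("Ben", "9876-5432"),
    ("Chen", None),
    ("Dia", ""),
]

report = {name: classify_phone(phone)[0] for name, phone in records}
print(report)
```

A length check like this catches dropped or extra digits but not a wrong digit in a number of the right length, which is one reason noisy data remains hard to eliminate.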
Data Distribution:
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and video,
images, complex data, spatial data, time series, and so on. Managing these various types of data
and extracting useful information is a tough task. Most of the time, new technologies, new tools,
and methodologies would have to be refined to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.
Data Privacy and Security:
Data mining can lead to serious issues in terms of data security, governance, and privacy.
For example, if a retailer analyzes the details of purchased items, it reveals data about
the buying habits and preferences of customers without their permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary way
of showing the output to the user in a presentable form. The extracted data should convey the
exact meaning of what it intends to express. However, representing the information to the
end user in a precise and easy way is often difficult. Because both the input data and the output
information are complicated, efficient and effective data visualization processes need to be
implemented for the presentation to succeed.
There are many more challenges in data mining in addition to the problems mentioned above.
More problems come to light as the actual data mining process begins, and the success of data
mining relies on overcoming all these difficulties.
Prerequisites
Before learning the concepts of Data Mining, you should have a basic understanding of
Statistics, Database Knowledge, and Basic programming language.
The basic steps of data mining are listed below.
1. Cleaning of Data:
The first step in data mining is cleaning incomplete or dirty data in order to meet the industry
standard. Otherwise, there will be system failures and poor insights, which cost more
time and effort. Depending on the requirements of the specific industry, specialists use multiple
methods or tools to accomplish this task.
2. Integration of Data:
In the second step, the specialists perform data integration, which refers to combining data
from multiple sources and data sets for analysis. It is a crucial step that draws on different
databases and acts as a second layer of data cleaning. The main purpose here is to improve data
quality by eliminating inconsistent information.
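Integration and the removal of inconsistencies can be sketched as a keyed merge of two sources that also reports conflicting fields. The two sources, their fields, and the "primary source wins" rule are assumptions made for this illustration.

```python
def integrate(primary, secondary):
    """Merge two {id: record} sources; the primary source wins on conflicts.

    Returns the merged records and a list of (id, field, primary, secondary) conflicts.
    """
    merged, conflicts = {}, []
    for key in sorted(primary.keys() | secondary.keys()):
        a = primary.get(key, {})
        b = secondary.get(key, {})
        for field in sorted(a.keys() & b.keys()):
            if a[field] != b[field]:
                conflicts.append((key, field, a[field], b[field]))
        merged[key] = {**b, **a}  # values from `a` overwrite those from `b`
    return merged, conflicts

# Hypothetical customer records held by two departments.
sales_db = {
    101: {"name": "Asha", "city": "Chennai"},
    102: {"name": "Ben", "city": "Pune"},
}
support_db = {
    102: {"name": "Ben", "city": "Mumbai"},
    103: {"name": "Chen", "city": "Delhi"},
}

merged, conflicts = integrate(sales_db, support_db)
print(sorted(merged))  # [101, 102, 103]
print(conflicts)       # [(102, 'city', 'Pune', 'Mumbai')]
```

Listing the conflicts rather than silently overwriting them lets a specialist decide which source to trust for each field.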
3. Reduction of Data:
Once the cleaning process is complete, the data is reduced so that its quality
improves further. Specialists take a smaller volume of data and reduce its structure so that it
still sums up the main message. Machine learning is often used along with several data
mining tools for smooth performance in this third step.
4. Transformation of Data:
Every data mining task has its own mining goals, which are clarified in the fourth step. In this
phase, the specialists consolidate all the prepared data through methods such as
data mapping, normalisation, aggregation, and others. As a result, the quality of the data
improves further, and the specialists move one step closer to creating a final report.
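Two of the transformations named above, normalisation and aggregation, can be sketched in a few lines. Min-max scaling is used here as one common form of normalisation; the income and sales figures and both helper names are hypothetical.

```python
from collections import defaultdict

def min_max_normalise(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # all values equal: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def aggregate_by(rows, key, value):
    """Sum the `value` field of the rows for each distinct `key` field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

incomes = [20000, 35000, 50000, 80000]
print(min_max_normalise(incomes))  # [0.0, 0.25, 0.5, 1.0]

sales = [
    {"region": "North", "sales": 120.0},
    {"region": "South", "sales": 80.0},
    {"region": "North", "sales": 30.0},
]
print(aggregate_by(sales, "region", "sales"))  # {'North': 150.0, 'South': 80.0}
```

Normalisation puts attributes with different ranges on a comparable scale, while aggregation summarises detailed rows at the level the mining task needs.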
5. Data Mining:
Though the entire process is known as data mining, this step specifically covers the mining
tasks. Modelling techniques used in this step include classification, clustering, and others. The
specialists use multiple data mining tools and other intelligent methods to come up with
models, which are essentially the extracted information.
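As one example of the clustering mentioned in this step, the sketch below runs a one-dimensional k-means on customer spending. The starting centroids are passed in explicitly to keep the run deterministic; the spending values and the kmeans_1d name are assumptions for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group numbers around the given starting centroids (simple 1-D k-means)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # assignments are stable, so stop early
            break
        centroids = updated
    return centroids, clusters

# Hypothetical monthly spending of six customers.
spending = [12, 15, 14, 90, 95, 100]
centres, groups = kmeans_1d(spending, centroids=[12, 100])
print(groups)   # [[12, 15, 14], [90, 95, 100]]
print(centres)
```

The two groups that emerge (low spenders and high spenders) are an instance of the "model" this step produces from the prepared data.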
6. Pattern Analysis:
Data mining finds patterns of relationships in the data. In the
sixth step, the specialists arrive at their insights and discuss them with business
owners so that new decisions can be taken. Everything from sales to employee behaviour and
customer needs is discussed in this step.
7. Presentation of Knowledge:
After the discussion, the specialists usually present a final report that includes all the
relevant information from the process, including their insights on overall business
performance and its recurring problems. Companies receive the report and recognise the patterns in
their behaviour so that they can improve in the future.
Data mining refers to the process of extracting important data from raw data. It analyses the data
patterns in huge sets of data with the help of various software tools. Since its development,
data mining has been adopted by researchers in the research and development field.
With data mining, businesses have been found to gain more profit. It has helped not only in
understanding customer demand but also in developing effective strategies to improve overall
business turnover. It has helped in determining business objectives for making clear decisions.
Data collection, data warehousing, and computer processing are some of the strongest pillars
of data mining. Data mining uses mathematical algorithms to segment the data
and assess the probability of future events.
To understand the system and meet the desired requirements, data mining can be classified into
the following systems:
A data mining system can be classified based on the types of databases that have been mined. A
database system can be further segmented based on distinct principles, such as data models,
types of data, etc., which further assist in classifying a data mining system.
For example, if we classify a data mining system by the data model of its database, we get
relational, transactional, object-relational, or data warehouse mining systems.
A data mining system categorized by the kind of knowledge mined may have the following
functionalities:
1. Characterization
2. Discrimination
3. Association and Correlation Analysis
4. Classification
5. Prediction
6. Outlier Analysis
7. Evolution Analysis
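As an illustration of one functionality from the list, outlier analysis, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score rule). The threshold of 2.0 and the daily sales figures are assumptions, and this is only one of several ways to define an outlier.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:          # all values identical: nothing can be an outlier
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily sales counts with one unusual day.
daily_sales = [102, 98, 101, 97, 100, 103, 99, 250]
print(find_outliers(daily_sales))  # [250]
```

Whether a flagged value is an error to clean or an event worth investigating (such as fraud) depends on the application.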
A data mining system can also be classified based on the type of techniques that are
incorporated. These techniques can be assessed based on the degree of user interaction
involved or the methods of analysis employed.
Data mining systems classified based on the applications they are adapted to are as follows:
1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail