Data Warehousing and Data Mining
Database and Data Warehousing
History of data warehousing
Evolution in organization use of data warehouses
Data Warehouse Architecture
Benefits of data warehousing
Strategic uses of data warehousing
Disadvantages of data warehouses
Data mart
Data mining
Data mining for decision support
Text mining
OLAP
Data warehousing integration
Business intelligence
What product promotions have the biggest impact on revenue?
What impact will new products/services have on revenue and margins?
Data warehousing is
Subject-oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.

Data warehousing combines data from multiple, usually varied, sources into one comprehensive and easily manipulated database. Common ways of accessing a data warehouse include queries, analysis and reporting. Because data warehousing produces a single database in the end, the number of sources is limited only by the volume the system can handle. The final result is homogeneous data, which can be more easily manipulated.
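The four properties above can be illustrated with a minimal sketch. The two source systems, their field names, and the sample records are all hypothetical; the point is only that differently shaped sources are transformed into one homogeneous, time-stamped, append-only store.

```python
from datetime import date

# Two hypothetical source systems that record the same kind of fact differently.
sales_system = [{"cust": "C1", "amt_usd": "120.50", "day": "2024-03-01"}]
service_system = [{"customer_id": "C1", "cost_cents": 4500, "date": "2024-03-02"}]

def from_sales(rec):
    """Map a sales record onto the common warehouse format."""
    return {"customer": rec["cust"], "amount": float(rec["amt_usd"]),
            "period": date.fromisoformat(rec["day"]), "source": "sales"}

def from_service(rec):
    """Map a service record onto the common warehouse format."""
    return {"customer": rec["customer_id"], "amount": rec["cost_cents"] / 100,
            "period": date.fromisoformat(rec["date"]), "source": "service"}

# Non-volatile: rows are only appended, never updated or removed.
warehouse = []

def load(rows, transform):
    """Integrate one source by transforming each record and appending it."""
    warehouse.extend(transform(r) for r in rows)

load(sales_system, from_sales)
load(service_system, from_service)
```

Every row carries a `period` (time-variant) and the same keys regardless of origin (integrated), so later queries do not need to know which system a fact came from.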
OLTP: Online Transaction Processing
Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries).
OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse.
e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
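The example query above constrains three dimensions at once (city, time of day, month). A minimal sketch using Python's built-in sqlite3 module follows; the `calls` table, its columns, and the sample rows are assumptions made for illustration, not a schema from the source.

```python
import sqlite3

# Hypothetical call-detail fact table; all names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "2023-12-04 10:15:00", 12.0),
        ("Pune", "2023-12-11 16:40:00", 8.0),
        ("Pune", "2023-12-11 20:05:00", 30.0),    # outside 9AM-5PM
        ("Pune", "2023-11-20 11:00:00", 50.0),    # not December
        ("Mumbai", "2023-12-05 09:30:00", 99.0),  # different city
    ],
)

def average_spend(conn, city, month, start_hour, end_hour):
    """Average call amount for one city, one month, and an hour-of-day window."""
    row = conn.execute(
        """
        SELECT AVG(amount) FROM calls
        WHERE city = ?
          AND strftime('%m', started_at) = ?
          AND CAST(strftime('%H', started_at) AS INTEGER) >= ?
          AND CAST(strftime('%H', started_at) AS INTEGER) < ?
        """,
        (city, month, start_hour, end_hour),
    ).fetchone()
    return row[0]

pune_december_daytime = average_spend(conn, "Pune", "12", 9, 17)
print(pune_december_daytime)
```

Because an analyst may swap any of the three constraints at will, the workload cannot be tuned in advance the way a fixed OLTP transaction can.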
Warehouse (DSS)
Subject Oriented
Used to analyze business
Summarized and refined
Snapshot data
Integrated Data
Knowledge User (Manager)
Large volumes accessed at a time (millions)
Mostly Read (Batch Update)
Redundancy present
Database Size 100 GB - few terabytes
Query throughput is the performance metric
Hundreds of users
Managed by subsets
To summarize ...
OLTP Systems are used to run a business
The data are selected from various sources, then integrated and stored in a single, consistent format. Data warehouses contain current detailed data, historical detailed data, lightly and highly summarized data, and metadata. Current and historical data are voluminous because they are stored at the highest level of detail. Lightly and highly summarized data are kept readily accessible to save processing time when users request them.

Metadata are data about data. They are important for designing, constructing, retrieving, and controlling the warehouse data.
Technical metadata include where the data come from, how the data were changed, how the data are organized, how the data are stored, who owns the data, who is responsible for the data and how to contact them, who can access the data, and the date of last update.
Business metadata include what data are available, where the data are, what the data mean, how to access the data, predefined reports and queries, and how current the data are.
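The difference between detailed, lightly summarized, and highly summarized data can be sketched as follows. The sample rows and the choice of daily and monthly totals as the two summary levels are assumptions for illustration only.

```python
from collections import defaultdict

# Illustrative detailed sales rows: (ISO date, product, amount).
detail = [
    ("2024-01-03", "pen", 5.0),
    ("2024-01-03", "ink", 7.0),
    ("2024-01-17", "pen", 3.0),
    ("2024-02-02", "pen", 4.0),
]

def summarize(rows, key_length):
    """Total the amounts, grouped by the first key_length characters of the date."""
    totals = defaultdict(float)
    for day, _product, amount in rows:
        totals[day[:key_length]] += amount
    return dict(totals)

lightly_summarized = summarize(detail, 10)  # one total per day
highly_summarized = summarize(detail, 7)    # one total per month
```

Storing such totals ahead of time trades extra space for faster answers to common requests, while the detailed rows remain available for questions the summaries cannot answer.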
Business advantages
It provides business users with a customer-centric view of the company's heterogeneous data by helping to integrate data from sales, service, manufacturing and distribution, and other customer-related business systems.
It provides added value to the company's customers by allowing them to access better information when data warehousing is coupled with internet technology.
It consolidates data about individual customers and provides a repository of all customer contacts for segmentation modeling, customer retention planning, and cross-sales analysis.
It removes barriers among functional areas by offering a way to reconcile views from multiple areas, thus providing a look at activities that cross functional lines.
It reports on trends across multidivisional, multinational operating units, including trends or relationships in areas such as merchandising and production planning.
Strategic use
Crew assignment, aircraft development, mix of fares, analysis of route profitability, frequent flyer program promotions
Customer service, trend analysis, product and service promotions, reduction of IS expenses
Customer service, new information service, fraud detection
Reduction of operational expenses
Risk management, market movements analysis, customer tendencies analysis, portfolio management
Trend analysis, buying pattern analysis, pricing policy, inventory control, sales promotions, optimal distribution channel
New product and service promotions, reduction of IS budget, profitability analysis
Distribution decisions, product promotions, sales decisions, pricing policy
Intelligence gathering
Product development; Operations; marketing Product development; marketing Operations Product development; Operations; marketing Distribution; marketing
Data Marts
A data mart is a scaled-down version of a data warehouse that focuses on a particular subject area.
A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs.
Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization.
Usually designed to support the unique business requirements of a specified department or business process
Implemented as the first step in proving the usefulness of the technologies to solve business problems

Reasons for creating a data mart
Easy access to frequently needed data
Creates collective view by a group of users
Improves end-user response time
Ease of creation in less time
Lower cost than implementing a full data warehouse
Potential users are more clearly defined than in a full data warehouse
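The idea of a data mart as a subject-focused subset of the warehouse can be sketched in a few lines. The row layout, the `department` field, and the `build_data_mart` helper are hypothetical names chosen for illustration.

```python
# Illustrative warehouse rows covering several departments.
warehouse_rows = [
    {"department": "marketing", "region": "west", "revenue": 100.0},
    {"department": "marketing", "region": "east", "revenue": 150.0},
    {"department": "finance", "region": "west", "revenue": 900.0},
]

def build_data_mart(rows, department, columns):
    """Keep only one department's rows and only the columns that group needs."""
    return [
        {column: row[column] for column in columns}
        for row in rows
        if row["department"] == department
    ]

marketing_mart = build_data_mart(warehouse_rows, "marketing", ["region", "revenue"])
```

The smaller, narrower result is what gives a data mart its faster response time and lower cost for the department that uses it.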
Data warehouse: organizationally structured
Data mart: small, flexible, customized by department; OLAP; source is the departmentally structured data warehouse
Data Mining
Data mining is the process of extracting information from the company's various databases and reorganizing it for purposes other than those the databases were originally intended for. It provides a means of extracting previously unknown, predictive information from the base of accessible data in data warehouses. The data mining process differs from one organization to another, depending on the nature of the data and of the organization. Data mining tools use sophisticated, automated algorithms to discover hidden patterns, correlations, and relationships among organizational data. Data mining tools are used to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. For example, in targeted marketing, data mining can use data on past promotional mailings to identify the targets most likely to maximize the return on the company's investment in future mailings.
Classification: infers the defining characteristics of a certain group
Clustering: identifies groups of items that share a particular characteristic
Association: identifies relationships between events that occur at one time
Sequencing: similar to association, except that the relationship exists over a period of time
Forecasting: estimates future values based on patterns within large sets of data
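As one illustration of the association type, the sketch below counts how often pairs of items occur together in the same transaction. The baskets are invented sample data, and support and confidence are standard association measures that the source does not name, so treat them as an added assumption.

```python
from collections import Counter
from itertools import combinations

# Invented sample transactions; each set holds items bought at one time.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
    {"milk"},
]

def pair_support(baskets):
    """Fraction of baskets in which each pair of items appears together."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
butter_implies_bread = confidence(baskets, "butter", "bread")
```

Sequencing would use the same counting idea but over transactions ordered in time for the same customer rather than within a single basket.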
Data mining tools are needed to extract the buried information "ore". The miner is often an end user, empowered by data drills and other power query tools to ask ad hoc questions and get answers quickly, with little or no programming skill. The data mining environment usually has a client/server architecture. Because of the large amounts of data, it is sometimes necessary to use parallel processing for data mining. Data mining tools are easily combined with spreadsheets and other end-user software development tools, enabling the mined data to be analyzed and processed quickly and easily. Data mining yields five types of information: associations, sequences, classifications, clusters and forecasts. "Striking it rich" often involves finding unexpected, valuable results.
Fraud detection
Direct marketing
Models customer flows in theme parks; analyzes safety of amusement park rides
Predicts which customers will buy new policies; identifies behavior patterns that increase insurance risk; spots fraudulent claims
Optimizes product design, balancing manufacturability and safety; improves shop-floor scheduling and machine utilization
Ranks successful therapies for different illnesses; predicts drug efficacy; discovers new drugs and treatments
Analyzes seismic data for signs of underground deposits; prioritizes drilling locations; simulates underground flows to improve recovery
Discerns buying-behavior patterns; predicts how customers will respond to marketing campaigns
Text mining
Text mining is the application of data mining to unstructured or less structured text files.
Operates with less structured information
Frequently focused on document format rather than document content
Online Analytical Processing (OLAP): term coined by E. F. Codd in a 1993 paper commissioned by Arbor Software
Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
OLAP = Multidimensional Database
Online analytical processing refers to end-user activities, such as DSS modeling using spreadsheets and graphics, that are done online. OLAP involves many different data items in complex relationships. The objective of OLAP is to analyze complex relationships and look for patterns, trends and exceptions.
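Two typical multidimensional operations, rolling up along a dimension and slicing on a fixed value, can be sketched with plain Python. The fact rows, dimension names, and helper functions are illustrative assumptions, not a particular OLAP product's API.

```python
from collections import defaultdict

# Illustrative fact rows with three dimensions and one measure.
facts = [
    {"product": "pen", "region": "west", "quarter": "Q1", "sales": 10},
    {"product": "pen", "region": "east", "quarter": "Q1", "sales": 20},
    {"product": "ink", "region": "west", "quarter": "Q1", "sales": 5},
    {"product": "pen", "region": "west", "quarter": "Q2", "sales": 40},
]

def roll_up(facts, dimensions, measure="sales"):
    """Total the measure for each combination of the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[d] for d in dimensions)] += fact[measure]
    return dict(totals)

def slice_cube(facts, **fixed):
    """Keep only the facts whose dimensions match the fixed values."""
    return [f for f in facts if all(f[k] == v for k, v in fixed.items())]

by_product = roll_up(facts, ["product"])
west_by_quarter = roll_up(slice_cube(facts, region="west"), ["quarter"])
```

Comparing `west_by_quarter` across quarters is the kind of step an analyst would take when looking for trends or exceptions.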
FASMI: Fast Analysis of Shared Multidimensional Information
Strengths of OLAP
It is a powerful visualization paradigm
Figure: end users, direct use, data visualization, use of knowledge, generate knowledge, knowledge base.
Businesses run on information and the knowledge of how to put that information to use. Knowledge is not readily available; it is continuously constructed from data and/or information, in a process that may not be simple or easy. The transformation of data into knowledge may be accomplished in several ways. It starts with data collected from various sources and stored in simple databases.
Data can be processed, organized, and stored in a data warehouse and then analyzed (e.g., by using analytical processing) by end users for decision support. Some of the data are converted to information prior to storage in the data warehouse, and some of the data and/or information can be analyzed to generate knowledge. For example, by using data mining, a process that looks for unknown relationships and patterns in the data, knowledge regarding the impact of advertising on a specific group of customers can be generated. This generated knowledge is stored in an organizational knowledge base, a repository of accumulated corporate knowledge and of purchased knowledge. The knowledge in the knowledge base can be used to support less experienced end users, or to support complex decision making. Both the data and the information, at various times during the process, and the knowledge derived at the end of the process, may need to be presented to users.
Business Intelligence
One ultimate use of the data gathered and processed in the data life cycle is for business intelligence. Business intelligence generally involves the creation or use of a data warehouse and/or data mart for storage of data, and the use of front-end analytical tools such as Oracle's Sales Analyzer and Financial Analyzer or MicroStrategy's Web. Such tools can be employed by end users to access data, ask queries, request ad hoc (special) reports, examine scenarios, create CRM activities, devise pricing strategies, and much more.
Using the business intelligence software, the user can ask queries, request ad hoc reports, or conduct any other analysis. For example, deep analysis can be carried out by performing multilayer queries. Because all the databases are linked, one can search for the products a store has too much of and then determine, based on previous sales, which of these products commonly sell with popular items. After planning a promotion to move the excess stock along with the popular products (by bundling them together, for example), one can dig deeper to see where this promotion would be most popular (and most profitable). The results of the request can be reports, predictions, alerts, and/or graphical presentations. These can be disseminated to decision makers to help them in their decision-making tasks.
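The multilayer query described above can be sketched as three small steps: find overstocked products, find popular products, then count how often the two were bought together. All data, thresholds, and function names here are hypothetical and chosen only to make the steps concrete.

```python
from collections import Counter

# Illustrative data: stock on hand, units sold last period, and past baskets.
stock = {"umbrella": 500, "raincoat": 40, "boots": 30}
units_sold = {"umbrella": 20, "raincoat": 300, "boots": 25}
past_baskets = [
    {"umbrella", "raincoat"},
    {"umbrella", "raincoat"},
    {"umbrella", "boots"},
    {"raincoat"},
]

def overstocked(stock, units_sold, max_ratio):
    """Products whose stock exceeds max_ratio times their recent sales."""
    return [p for p in stock if stock[p] / max(units_sold[p], 1) > max_ratio]

def popular(units_sold, min_units):
    """Products that sold at least min_units in the period."""
    return {p for p, n in units_sold.items() if n >= min_units}

def bundle_candidates(excess, popular_items, baskets):
    """Count past baskets in which an excess product sold with a popular one."""
    pairs = Counter()
    for basket in baskets:
        for item in excess:
            if item in basket:
                for other in basket & popular_items:
                    if other != item:
                        pairs[(item, other)] += 1
    return pairs.most_common()

excess = overstocked(stock, units_sold, max_ratio=5)
bundles = bundle_candidates(excess, popular(units_sold, min_units=100), past_baskets)
```

A further layer, such as breaking the pair counts down by store or region, would correspond to the "dig deeper" step of deciding where the promotion should run.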
More advanced applications of business intelligence include outputs such as financial modeling, budgeting, resource allocation, and competitive intelligence.