CA2 Notes
Data is an important asset of any business, but its value to an organisation depends on its quality. Data
warehousing is the process of collecting and managing data from a number of different sources. The data warehouse is
effectively a secure, electronic store of business data that builds up a historical trove of data for future analysis
and insight. It relies on several computer-based technologies and techniques which create BI systems that
contribute towards data visualisation, reporting and analysis. Effectively, data warehousing is a component of BI
architecture. A data warehouse organises a heterogeneous collection of data sources under a unified schema. Two main
construction approaches are used, each with its own strengths and weaknesses: the top-down approach and the
bottom-up approach, explained below.
The initial approach, developed by Bill Inmon and known as the top-down approach, starts with building a single,
central data warehouse for the whole company. External data is merged and processed through the ETL (Extract,
Transform, Load) process and subsequently stored in the data warehouse. Specialised data marts for different
organisational departments, for instance the finance department, are then formed from it. The strength of this method
is that it offers a clear structure for managing data; however, it can be expensive as well as time-consuming, and for
that reason it is ideal mainly for large organisations.
1. External Sources: An external source is a source from which data is collected, irrespective of the type of data.
The data can be structured, semi-structured or unstructured.
2. Staging Area: Since the data extracted from the external sources does not follow a particular format, it needs
to be validated before being loaded into the data warehouse. For this purpose, it is recommended to use an ETL tool.
E(Extract): Data is extracted from the external sources.
T(Transform): Data is transformed into the standard format.
L(Load): Data is loaded into the data warehouse after transforming it into the standard format.
3. Data Warehouse: After cleansing, the data is stored in the data warehouse as the central repository. The
warehouse itself actually stores the metadata, while the actual data is stored in the data marts. Note that in this
top-down approach the data warehouse stores the data in its purest form.
4. Data Marts: A data mart is also part of the storage component. It stores the information of a particular function
of an organisation that is handled by a single authority. There can be as many data marts in an organisation as
there are functions. We can also say that a data mart contains a subset of the data stored in the data warehouse.
5. Data Mining: Data mining is the practice of analysing the big data present in the data warehouse. It is used to
find the hidden patterns present in the database or data warehouse with the help of data-mining algorithms. Inmon
defines this approach as follows: the data warehouse acts as a central repository for the complete organisation,
and data marts are created from it after the complete data warehouse has been built.
The bottom-up approach is Ralph Kimball's approach, built around the construction of individual data marts that serve
specific business goals or functions such as marketing or sales. These data marts are extracted, transformed and
loaded first, giving the organisation the ability to generate reports quickly. In turn, these data marts are integrated
into the more centralised and broad data warehouse system. This method is more flexible, cheaper, and best
recommended for smaller organisations. Nevertheless, it can create data silos and disparities, which may prevent an
organisation from having a coherent perspective across its various departments.
1. First, the data is extracted from external sources (the same as in the top-down approach).
2. Then the data goes through the staging area (as explained above) and is loaded into data marts instead of the
data warehouse. The data marts are created first and provide reporting capability; each addresses a single
business area.
3. These data marts are then integrated into a data warehouse. Kimball describes this approach as follows: the data
marts are created first and provide a thin view for analysis, and the data warehouse is created after the complete
set of data marts has been built.
In a data warehouse, a schema defines how the system is organised, covering all the database entities (fact tables,
dimension tables) and their logical associations.
The Star Schema is the simplest and most effective schema in a data warehouse. In this model, a fact table in the
centre surrounded by multiple dimension tables resembles a star.
The fact table maintains one-to-many relations with all the dimension tables. Every row in a fact table is associated
with its dimension table rows with a foreign key reference.
For this reason, navigating among the tables in this model to query aggregated data is easy, and an end-user can
easily understand the structure. Hence all Business Intelligence (BI) tools support the Star schema model well.
While designing star schemas the dimension tables are purposefully de-normalized. They are wide with many
attributes to store the contextual data for better analysis and reporting.
Example:
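As a concrete stand-in, here is a minimal sketch using Python's standard-library sqlite3 module. Every table and column name (fact_sales, dim_product, and so on) is an illustrative assumption rather than something from the notes:

import sqlite3

con = sqlite3.connect(":memory:")

# De-normalized dimension tables: wide, with contextual attributes.
con.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT, brand TEXT)")
con.execute("CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER)")

# Central fact table: each row points at its dimension rows by foreign key.
con.execute("""CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units_sold INTEGER, revenue REAL)""")

# Aggregated query: one direct join per dimension, easy to navigate.
rows = con.execute("""
    SELECT d.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d ON f.date_id = d.date_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY d.year, p.category""").fetchall()

Note how each dimension is reached from the fact table in a single join, which is what makes aggregated queries easy to write and understand.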
A star schema acts as the input for designing a Snowflake schema. Snowflaking is a process that completely normalises
all the dimension tables of a star schema.
The arrangement of a fact table in the centre surrounded by multiple hierarchies of dimension tables looks like a
snowflake in the Snowflake schema model. Every fact table row is associated with its dimension table rows with a
foreign key reference.
While designing Snowflake schemas, the dimension tables are purposefully normalised. Foreign keys are added at each
level of the dimension tables to link each level to its parent attribute. The complexity of the Snowflake schema is
directly proportional to the number of hierarchy levels in the dimension tables.
Example:
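Continuing the same illustrative sqlite3 sketch (the hierarchy product -> brand -> category is an assumption made up for the example), the product dimension is now normalised into levels, each linked to its parent by a foreign key:

import sqlite3

con = sqlite3.connect(":memory:")

# Normalized dimension hierarchy: category <- brand <- product, with each
# level linked to its parent attribute by a foreign key.
con.execute("CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT)")
con.execute("""CREATE TABLE dim_brand (brand_id INTEGER PRIMARY KEY, brand_name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id))""")
con.execute("""CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT,
    brand_id INTEGER REFERENCES dim_brand(brand_id))""")
con.execute("""CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id), revenue REAL)""")

# The same aggregate now needs one extra join per hierarchy level.
rows = con.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_brand b ON p.brand_id = b.brand_id
    JOIN dim_category c ON b.category_id = c.category_id
    GROUP BY c.category_name""").fetchall()

The extra join per hierarchy level is the complexity cost the Snowflake schema trades for normalised storage.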
OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) are two key concepts in the context
of Business Intelligence. They serve different purposes and are designed to handle different types of data workloads.
OLTP (Online Transaction Processing):
1. Purpose: OLTP systems are designed to manage and process transactional data in real-time. These systems
support day-to-day operations like sales, order processing, and inventory management.
2. Data Structure: OLTP databases are typically normalized, meaning data is stored in many related tables to
reduce redundancy and optimize for speed and efficiency in transaction processing.
3. Operations: OLTP systems focus on insert, update, and delete operations that handle real-time transactions.
4. Speed: They are optimized for quick query response times and high transaction throughput.
5. Data Size: The datasets are generally smaller compared to OLAP and grow incrementally as transactions
occur.
6. OLTP focuses on real-time data, high transaction volumes, and operational tasks.
7. Example: A bank's transaction system that records every deposit and withdrawal as it happens.
OLAP (Online Analytical Processing):
1. Purpose: OLAP systems are designed for complex querying and analysis of large datasets. They support
decision-making processes by providing insights into trends, patterns, and summarizing business performance.
2. Data Structure: OLAP databases are typically denormalized (e.g., star or snowflake schema) for faster
querying. The data is aggregated and organized to facilitate reporting and multidimensional analysis.
3. Operations: OLAP systems focus on read operations, with users performing complex queries, aggregations,
and slicing/dicing of data.
4. Speed: They are optimized for fast read access and complex query execution, but might not be as fast for
transactional processing.
5. Data Size: OLAP systems handle large amounts of historical data, which is used for reporting, trend analysis,
and forecasting.
6. OLAP Focuses on large datasets, complex queries, and analytical tasks to assist in decision-making.
7. Example: A company’s data warehouse where executives analyse sales performance over the past year.
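To make the contrast concrete, here is a small sketch in Python with sqlite3 (the sales table is invented for the example): the first half issues the kind of single-row, real-time writes an OLTP system handles, while the second issues a read-only, slicing-and-dicing aggregate typical of OLAP:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, month TEXT, amount REAL)")

# OLTP-style workload: many small real-time writes touching single rows.
con.execute("INSERT INTO sales (region, month, amount) VALUES (?, ?, ?)", ("North", "2024-01", 120.0))
con.execute("UPDATE sales SET amount = 130.0 WHERE id = 1")

# OLAP-style workload: a read-only aggregate over the whole history,
# slicing by region and dicing by month.
rows = con.execute("""
    SELECT region, month, SUM(amount) AS total, COUNT(*) AS n
    FROM sales GROUP BY region, month""").fetchall()
print(rows)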
Data mining refers to the process of extracting meaningful information and knowledge from large datasets. It involves
the use of various statistical and computational techniques to discover patterns, trends, and relationships. By analysing
vast amounts of data, data mining can reveal valuable insights that help organizations make informed business
decisions.
One of the key techniques used in data mining is association rule mining, which aims to find interesting relationships
between variables in large datasets. For example, in a retail setting, data mining can help identify patterns such as
“customers who buy product A are likely to also purchase product B.” This information can be used to optimize
product placement and marketing strategies to increase sales and customer satisfaction.
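A minimal sketch of this idea in plain Python, using toy shopping baskets invented for the example: it counts item pairs and reports support and confidence for rules of the form "A -> B":

from collections import Counter
from itertools import combinations

# Toy shopping baskets, invented for the example.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

n = len(baskets)
item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(pair for b in baskets for pair in combinations(sorted(b), 2))

# support(A -> B) = P(A and B); confidence(A -> B) = P(B given A).
for (a, b), c in pair_counts.items():
    support, confidence = c / n, c / item_counts[a]
    if support >= 0.4:  # report only the frequent pairs
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")

On this data the rule "bread -> butter" comes out with support 0.60 and confidence 0.75, which is exactly the kind of relationship a retailer would act on.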
Another important aspect of data mining is clustering, which involves grouping similar data points together based on
certain characteristics. This technique is useful for segmenting customers into different categories for targeted
marketing campaigns. By understanding the preferences and behaviours of each customer segment, businesses can
tailor their offerings to better meet the needs of their target audience.
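As an illustrative sketch, assuming scikit-learn is available and using invented customer features, k-means (one common clustering algorithm, not necessarily the only choice) groups customers into segments:

import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

# Invented customer features: [annual spend, % of purchases on discount].
X = np.array([
    [200,  5], [220,  8], [210,  6],   # low spend, rarely uses discounts
    [900, 60], [950, 65], [880, 55],   # high spend, discount-driven
])

# Group similar customers into two segments based on these features.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)                   # cluster id assigned to each customer
print(kmeans.cluster_centers_)  # "typical" customer in each segment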
1. Extract, Transform, Load (ETL)
Extract, Transform, Load (ETL) is the backbone of data mining. It involves extracting data from various sources,
transforming it into a suitable format, and loading it into a data warehouse. The extraction process retrieves data from
sources like databases and cloud storage. The transformation cleanses and integrates the data, ensuring accuracy.
Finally, the data is loaded into a warehouse for further analysis.
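A minimal end-to-end sketch of ETL in Python, with invented source records and an in-memory SQLite database standing in for the warehouse:

import sqlite3

# Extract: records as they might arrive from a source system; the dirty
# values are deliberate, and all names here are invented for the example.
raw_rows = [
    {"customer": " Alice ", "amount": "120.50", "date": "2024-01-05"},
    {"customer": "BOB",     "amount": "80",     "date": "2024-01-06"},
    {"customer": "",        "amount": "n/a",    "date": "2024-01-07"},
]

def transform(row):
    """Cleanse and standardise one record; return None if it is invalid."""
    name = row["customer"].strip().title()
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    return (name, amount, row["date"]) if name else None

# Load: insert only the rows that survived transformation.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer TEXT, amount REAL, date TEXT)")
clean = [t for r in raw_rows if (t := transform(r)) is not None]
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
print(con.execute("SELECT * FROM sales").fetchall())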
2. Store Data
Once data is transformed and loaded, it is stored in a multidimensional database system. This storage method enables
complex queries and data aggregation across multiple dimensions, such as time or geography. Multidimensional
storage enhances in-depth analysis and supports rapid information retrieval, which is essential for informed decision-
making.
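As a small illustration of dimensional aggregation, assuming pandas is available, a pivot table summarises a measure across two dimensions (time and geography) the way a cube-style query would; the data is invented for the example:

import pandas as pd  # assumes pandas is available

# Invented fact data with two dimensions: time (year) and geography (region).
sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["North", "South", "North", "South"],
    "revenue": [100.0, 150.0, 120.0, 170.0],
})

# A pivot table aggregates the measure across both dimensions at once,
# mimicking a small multidimensional (cube-style) view of the data.
cube = sales.pivot_table(values="revenue", index="region", columns="year", aggfunc="sum")
print(cube)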
3. Provide Access
Providing access to the data is crucial for business analysts and IT professionals. Business analysts need access to
generate insights and support strategic decisions, while IT professionals manage and maintain data infrastructure. User
roles and permissions typically govern this access, ensuring that different levels of access are granted based on the
user’s role within the organization.
4. Analyse Data
Analysing the data is the core of the data mining process. Application software, including statistical tools and machine
learning algorithms, is used to uncover patterns, correlations, and trends. This analysis leads to actionable insights,
such as predicting future trends or detecting anomalies. The analysis process is often iterative, refining results to
achieve accuracy.
5. Present Data
Finally, data must be presented in useful formats, such as graphs, tables, dashboards, and reports. Effective data
presentation is key to making insights understandable and actionable for stakeholders. Visual representations help
convey complex information quickly, while detailed reports provide depth for thorough analysis, facilitating informed
decision-making.
The process of data mining is often described as consisting of three broad stages. The initial exploration stage
usually starts with data preparation, which involves cleaning the data, transforming it, and selecting subsets of
records from data sets with large numbers of variables. Then relevant variables must be identified and the complexity
of candidate models determined, so that exploratory analyses can be carried out using a wide variety of graphical and
statistical methods. Broken down further, the steps are:
1. Data Collection:
o Collect all the data from various sources like sales reports, customer information, or website visits.
2. Data Cleaning:
o Clean the data by removing errors, fixing mistakes, and filling in missing information.
3. Data Selection:
o Choose only the important data needed for your analysis (not all data will be useful).
4. Data Transformation:
o Convert the data into a useful format. For example, turning text into numbers or combining different
pieces of data into one.
o It’s like chopping, mixing, or cooking the ingredients to get them ready for the dish.
5. Data Mining:
o This is where you actually analyse the data using special techniques (like looking for patterns or
trends).
6. Pattern Evaluation:
o Check the patterns you found to see if they are important or useful for decision-making.
7. Knowledge Representation:
o Present the results in an easy-to-understand way (like charts or graphs) so business leaders can make
decisions.
o Finally, you serve the dish in a nice way so everyone can enjoy it!
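Tying the steps together, here is a deliberately tiny sketch in plain Python; the records and the "pattern" (frequently bought items) are invented purely to show one pass through the pipeline:

from collections import Counter

# 1-2. Collect and clean: gather raw records and drop the incomplete ones.
raw = [("Alice", "bread"), ("Bob", "bread"), (None, "jam"),
       ("Alice", "butter"), ("Bob", "butter"), ("Cara", "jam")]
clean = [r for r in raw if all(r)]

# 3-4. Select and transform: keep only the field needed for the analysis.
products = [product for _, product in clean]

# 5. Mine: look for a simple pattern (which items are bought most often).
counts = Counter(products)

# 6. Evaluate the patterns: keep only items appearing in enough records.
threshold = len(clean) / 3
frequent = {p: c for p, c in counts.items() if c >= threshold}

# 7. Represent the knowledge: a text chart a manager can read at a glance.
for product, count in sorted(frequent.items(), key=lambda x: -x[1]):
    print(f"{product:8s} {'#' * count} ({count})")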
Q6. Discuss the applications of data mining / Describe a real-world application of data mining.
Data mining is the process of discovering patterns and extracting useful information from large sets of data. Here are
some simple applications of data mining in everyday life:
1. Customer Recommendations
Example: When you shop online, you often see suggestions like "Customers who bought this also bought..."
This is data mining at work. It analyses what other customers have purchased to recommend products you
might like.
2. Fraud Detection
Example: Banks and credit card companies use data mining to spot unusual transactions. If someone suddenly
makes a large purchase in a different country, the system can flag it as potentially fraudulent.
3. Market Basket Analysis
Example: Grocery stores analyse what items are frequently bought together. If many customers buy bread and
butter together, the store might place them near each other or offer discounts on one when you buy the other.
4. Customer Segmentation
Example: Companies use data mining to group customers based on their buying habits. For instance, they
might find that some customers prefer organic products while others look for discounts. This helps businesses
tailor their marketing strategies.
5. Predictive Maintenance
Example: In manufacturing, data mining can predict when machines are likely to fail based on historical data.
This allows companies to perform maintenance before a breakdown occurs, saving time and money.
6. Healthcare Insights
Example: Hospitals analyse patient data to identify trends in diseases or treatment outcomes. This can help in
improving patient care and developing better treatment plans.
7. Sports Analytics
Example: Sports teams use data mining to analyse player performance and game strategies. This helps
coaches make informed decisions about training and game tactics.
One real-world application of data mining is customer recommendation systems used by online retailers like
Amazon.
How It Works:
1. Collecting Data: When you shop online, the website collects data about what you browse, what you add to
your cart, and what you purchase. It also looks at data from other customers.
2. Finding Patterns: Data mining analyses this information to find patterns. For example, if many customers who
bought a particular book also bought a specific set of headphones, the system notes this connection.
3. Making Recommendations: When you visit the site, it uses these patterns to suggest products you might like.
So, if you look at a book, it might say, "Customers who bought this also bought..." and show you related
items.
Example in Action:
Imagine you’re looking for a new phone case on an online store. After you view a few options, the site suggests a
screen protector and a portable charger because many other customers who bought the same phone case also bought
those items. This not only helps you find what you need but also encourages you to buy more, benefiting both you and
the retailer.
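A minimal sketch of this co-purchase idea in plain Python, with invented purchase histories; real recommender systems are far more sophisticated, but the counting logic is the same in spirit:

from collections import Counter

# Invented purchase histories, keyed by customer id.
purchases = {
    "u1": {"phone case", "screen protector", "charger"},
    "u2": {"phone case", "screen protector"},
    "u3": {"phone case", "charger"},
    "u4": {"book", "headphones"},
}

def recommend(item, k=2):
    """Suggest the k items most often co-purchased with `item`."""
    co_counts = Counter()
    for basket in purchases.values():
        if item in basket:
            co_counts.update(basket - {item})
    return [other for other, _ in co_counts.most_common(k)]

print(recommend("phone case"))  # e.g. ['screen protector', 'charger']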