DMBI Simplified
Q1) Explain KDD process using figure. What is data mining? Write down a short
note on the KDD process.
● Data mining: It refers to the extraction, or "mining", of knowledge from large
amounts of data.
● Knowledge base: It stores domain knowledge that is used to guide the search for
interesting patterns, which help in making business decisions.
● Data mining engine: It consists of a set of functional modules like prediction,
cluster analysis, association and correlation analysis, etc.
● Pattern evaluation module: It interacts with data mining modules to find
interesting patterns.
● An enormous amount of data is accumulating in files, databases and other
repositories.
● It is necessary to develop powerful tools for analysis and interpretation of such
data.
● This results in extraction of interesting knowledge that helps to take business
related decisions.
● KDD stands for Knowledge Discovery in Databases.
● Data mining is a part of the KDD process.
Knowledge discovery in databases is an iterative process with the following steps:
1. Data cleaning: noisy and irrelevant data are removed from the collection.
2. Data integration: multiple data sources are combined into a common source, and
the format of all the data is unified into a similar structure.
3. Data selection: relevant data for the analysis is retrieved from the data
collection.
4. Data transformation: the selected data is transformed into forms appropriate for
the mining procedure. Also known as data consolidation.
5. Data mining: clever techniques are used to extract potentially useful patterns.
6. Pattern evaluation: interesting patterns presenting knowledge are identified.
7. Knowledge presentation: the discovered knowledge is presented to the user.
Users should be able to understand and interpret the data mining results.
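As a rough illustration (not from the source), the KDD steps can be sketched in Python with pandas and scikit-learn; the file names, column names and the choice of clustering as the mining step are assumptions made only for this example:

```python
# Illustrative sketch of the KDD stages with pandas/scikit-learn.
# File names, column names and the clustering step are assumptions.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# 1. Data cleaning: drop duplicates and rows with missing key values
raw = pd.read_csv("sales_2023.csv")            # hypothetical source file
clean = raw.drop_duplicates().dropna(subset=["customer_id", "amount"])

# 2. Data integration: combine a second source into a common structure
other = pd.read_csv("sales_online_2023.csv")   # hypothetical second source
combined = pd.concat([clean, other], ignore_index=True)

# 3. Data selection: keep only the attributes relevant to the analysis
selected = combined[["customer_id", "amount", "visits"]]

# 4. Data transformation: scale numeric attributes into a common range
features = MinMaxScaler().fit_transform(selected[["amount", "visits"]])

# 5. Data mining: discover customer groups with a clustering technique
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# 6. Pattern evaluation: keep only clusters large enough to be interesting
sizes = pd.Series(model.labels_).value_counts()
interesting = sizes[sizes > 30]

# 7. Knowledge presentation: report the result to the user
print("Customer segments found:", interesting.to_dict())
```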
Q2) What are the major issues in data mining? Define the term data mining?
Discuss the major issues in data mining.
Data mining refers to the extraction, or mining, of knowledge from large amounts of
data.
● It is a computational process of discovering interesting patterns from data
warehouses and databases.
● It involves methods like artificial intelligence, machine learning, statistics and
DBMS.
5 Major issues in data mining:
1. Mining methodology
2. User interaction
3. Performance and scalability
4. Diversity of data type
5. Application and social impact
1. Mining methodology
○ Mining different kinds of knowledge in database
○ This requires developing different types of data mining techniques
○ Mining of knowledge at multiple levels of abstraction
○ Incorporation of background knowledge
○ Handling the noisy and unwanted data
2. User interaction
○ The complex interface of data mining software.
○ Expression and visualisation of data mining results.
○ Handling noise and incomplete data.
○ Pattern evaluation with the help of visualisation techniques.
○ Background study of the domain.
3. Performance and scalability
○ Efficiency and scalability of data mining algorithms.
○ Real time execution of data mining algorithm.
○ Parallel, distributed and incremental mining algorithms.
4. Diversity of data types
○ Handling complex data types like temporal data, biological sequences,
spatial data and web data.
○ Handling relational data types like structured and semi-structured data.
○ Data mining of multiple sources at global level like the world wide web.
5. Application and social impacts
○ Protection of data security, integrity and privacy.
○ Understanding data sensitivity.
○ Intelligent query processing.
○ Build data mining functions in commonly used business software.
Q3) Describe various methods for handling missing data values. List and describe
the methods for handling missing values in data cleaning.
Data cleaning is a process to fill the missing values, smooth out noise and correct the
inconsistencies of the data.
Missing data
There are 6 ways to handle missing data:
1. Ignore the tuple
2. Manually filling
3. Use a global constant
4. Use a measure of central tendency for the attribute
5. Use attribute mean or median for all samples belonging to the same class
6. Use the most probable value
1. Ignore the tuple
○ Usually done when the class label is missing.
○ Not very effective when the tuple has only a few missing values.
○ Does not work well when the percentage of missing values per attribute varies
considerably.
2. Manually filling
○ It is time consuming.
○ Not possible for large data sets.
○ Not reliable when value to be filled is not easily determined.
3. Using a global constant
○ Replace all missing attribute values by the same constant.
○ Example: replace every missing value with the label "unknown".
○ Simple but not recommended.
○ Because all these identical constant values ("unknown") may together look like
an interesting pattern.
○ Data mining programs may then mistakenly consider this valuable information.
4. Using a measure of central tendency for the attribute
○ Use of mean, median and mode.
○ Mean - for symmetric numeric data.
○ Median - for asymmetric numeric data.
○ Mode - for nominal data.
5. Using the attribute mean or median for all samples belonging to the same
class as the given tuple
○ Replace missing symmetric numeric values with the class-wise mean.
○ Replace missing skewed (asymmetric) numeric values with the class-wise median.
○ Example: replace the missing value with the mean income value for
customers in the same credit risk category.
6. Using the most probable value
○ Using various methods to predict the missing values.
Methods like:
i. Regression
ii. Inference-based tools using a Bayesian formalism
iii. Decision tree induction
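A minimal pandas sketch of several of these strategies; the small DataFrame, column names and class attribute are invented for illustration:

```python
# Sketch of common missing-value strategies with pandas (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [52000, None, 31000, None, 48000],
    "occupation":  ["engineer", None, "clerk", "clerk", None],
})

# 1. Ignore the tuple: drop rows whose class label is missing
dropped = df.dropna(subset=["credit_risk"])

# 3. Use a global constant for a nominal attribute
df["occupation_const"] = df["occupation"].fillna("unknown")

# 4. Use a measure of central tendency for the attribute
df["income_mean"]   = df["income"].fillna(df["income"].mean())            # symmetric data
df["income_median"] = df["income"].fillna(df["income"].median())          # skewed data
df["occ_mode"]      = df["occupation"].fillna(df["occupation"].mode()[0]) # nominal data

# 5. Use the class-wise mean: mean income within the same credit-risk category
df["income_classmean"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)

print(df)
```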
Q4) What is noise? Explain binning method.
Noise is a random error or variance in a measured variable.
● Noise results in unreliable and poor output.
● Smoothing is done to remove noise.
● Smoothing techniques include binning, regression and clustering.
Binning:
● It is a top down splitting technique based on a specified number of bins.
● Does not use class information.
● It is an unsupervised discretisation technique.
● It smooths a sorted data value by consulting its neighbourhood (the values around it).
● The sorted values are distributed into a number of buckets (bins).
● In this way it performs local smoothing.
Example:
1) Equal-frequency partitioning: The price data are first sorted and then partitioned
into equal-frequency bins of size 3.
2) Smoothing by means: Each value in a bin is replaced by the mean value of the
bin.
3) Smoothing by median: Each bin value is replaced by the bin median.
4) Smoothing by boundaries: The minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
● The larger the bin width, the greater the effect of the smoothing.
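A short Python sketch of equal-frequency binning and the three smoothing variants, using an illustrative sorted price list and a bin size of 3:

```python
# Equal-frequency binning followed by smoothing (illustrative price data).
sorted_prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3
bins = [sorted_prices[i:i + bin_size] for i in range(0, len(sorted_prices), bin_size)]

# Smoothing by bin means: every value becomes the mean of its bin
by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

# Smoothing by bin medians: every value becomes the median of its bin
by_medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closest of min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)    # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_medians)  # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```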
Other methods for removing noisy data
● Regression: It is a technique that conforms data values to a function. Linear
regression involves finding the best line to fit two attributes so that one attribute
can be used to predict the other.
● Outlier analysis: Outliers may be detected by clustering, where similar values are
organised into groups or clusters. Values falling outside of the clusters are
considered outliers.
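A rough sketch of clustering-based outlier detection, using scikit-learn's KMeans as one possible clustering technique; the data and the "very small cluster = outliers" rule are assumptions made for illustration:

```python
# Clustering-based outlier detection sketch (illustrative data and rule).
import numpy as np
from sklearn.cluster import KMeans

values = np.array([21, 22, 23, 25, 26, 27, 80, 19, 24, 95]).reshape(-1, 1)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)

labels = km.labels_
counts = np.bincount(labels)           # size of each cluster
outlier_mask = counts[labels] <= 1     # treat singleton clusters as outliers

print(values.ravel()[outlier_mask])    # likely the extreme values 80 and 95
```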
Q5) Explain mean, median, mode, variance, standard deviation and 5 number
summary with suitable database example.
Mean: The arithmetic average of the data values, obtained by summing them and dividing by their count.
Example:
● Suppose you randomly sampled six acres of forest land and came up
with the following counts of a weed in this region: 34, 43, 81, 106, 106 and
115.
● We compute the sample mean by adding the values and dividing by the
number of samples, 6: (34 + 43 + 81 + 106 + 106 + 115) / 6 = 485 / 6 ≈ 80.8.
Median:
● An alternative measure that is less affected by outliers is the median.
● The median is the middle value of the sorted data.
● If we have an even number of values, we take the average of the two
middle values.
● The median is better for describing the typical value.
● It is often used for income and home prices.
Example:
● Suppose you randomly selected 10 house prices in your area. You are
interested in the typical house price. In $100,000 the prices were: 2.7, 2.9,
3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8.
● If we computed the mean, we would say that the average house price is
$744,000.
● Although this number is true, it does not reflect the price for available
housing in your area.
● Since there is an even number of outcomes, we take the average of the
middle two values: (3.7 + 4.1) / 2 = 3.9, i.e. a median price of $390,000.
Mode:
● The mode is the value that occurs most frequently in the data set.
● In general, a data set with two or more modes is multimodal.
● At the other extreme, if each data value occurs only once, then there is no
mode.
● Using the house-price example above, 4.7 is the mode, since it is the only
value that occurs twice.
Variance and standard deviation:
● The standard deviation measures how far the data values lie from the mean;
the variance is the square of the standard deviation.
Example:
● The owner of the Indian restaurant is interested in how much people spend
at the restaurant.
● He examines 10 randomly selected receipts for parties of four and writes
down the following data.
● 44, 50, 38, 96, 42, 47, 40, 39, 46, 50
● He calculates the mean by adding the values and dividing by 10: mean =
492 / 10 = 49.2.
● The squared deviations from the mean sum to 2599.6, so the sample variance
is 2599.6 / 9 ≈ 288.8 and the standard deviation is √288.8 ≈ 17.
● Since the standard deviation can be thought of as measuring how far the
data values lie from the mean, we take the mean and move one standard
deviation in either direction.
● The mean for this example was about 49.2 and the standard deviation was
17.
● We have: 49.2 - 17 = 32.2 and 49.2 + 17 = 66.2
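The restaurant-receipt figures above can be reproduced, together with the five-number summary asked for in the question, with a short Python sketch using the standard statistics module and NumPy:

```python
# Descriptive statistics for the restaurant receipts example.
import statistics
import numpy as np

receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

mean   = statistics.mean(receipts)       # 49.2
median = statistics.median(receipts)     # 45.0 (average of the two middle values)
mode   = statistics.mode(receipts)       # 50 (most frequent value)
var    = statistics.variance(receipts)   # sample variance
std    = statistics.stdev(receipts)      # sample standard deviation, about 17

# Five-number summary: minimum, Q1, median, Q3, maximum
q1, q2, q3 = np.percentile(receipts, [25, 50, 75])
five_number = (min(receipts), q1, q2, q3, max(receipts))

print(mean, median, mode, round(std, 1))
print(five_number)
```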
Q6) What is market basket analysis? Explain association rules with confidence
and support.
Market basket analysis is a modelling technique based on the theory that if you buy a
certain group of items, you are more (or less) likely to also buy another group of items.
● For an association rule A ⇒ B, support is the fraction of all transactions that contain
both A and B.
● Confidence is the fraction of the transactions containing A that also contain B, i.e.
confidence(A ⇒ B) = support(A ∪ B) / support(A).
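A small Python sketch of computing support and confidence for a rule such as {bread} ⇒ {butter}; the transaction list is invented for illustration:

```python
# Support and confidence for one association rule (illustrative transactions).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

antecedent, consequent = {"bread"}, {"butter"}
n = len(transactions)

both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support    = both / n     # fraction of all transactions containing bread AND butter
confidence = both / ante  # of the transactions with bread, fraction that also have butter

print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # 0.60 and 0.75
```

Here support = 3/5 = 0.60 (three of the five transactions contain both items) and confidence = 3/4 = 0.75 (three of the four transactions containing bread also contain butter).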
Q7) Do feature wise comparison between classification and prediction.
Prediction: It constructs a model and uses the model to predict unknown or missing
values.
Classification and prediction can be compared on the following criteria:
● Accuracy − Ability of classifier to predict the class label correctly and the
ability of the predictor to predict the value of missing data value or unknown
values.
● Speed − This refers to the computational cost in generating and using the
classifier or predictor.
● Robustness − Ability of a classifier or predictor to make correct predictions
from given noisy data.
● Scalability − Ability to construct the classifier or predictor efficiently which can
handle a large amount of data.
● Interpretability − It refers to the extent to which the classifier or predictor can be
understood, i.e. the level of insight it provides.
What is classification?
Example
● A bank loan officer wants to analyze the data in order to know which
customers (loan applicants) are risky and which are safe.
● A marketing manager at a company needs to analyze whether a customer with a
given profile will buy a new computer.
● In both of the above examples, a model or classifier is constructed to predict
the categorical labels. These labels are risky or safe for loan application data
and yes or no for marketing data.
What is prediction?
Example
● Suppose the marketing manager needs to predict how much a given
customer will spend during a sale at his company.
● In this example we need to predict a numeric value. Therefore the
data analysis task is an example of numeric prediction.
● In this case, a model or a predictor is constructed that predicts a
continuous-valued function or ordered value.
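A minimal scikit-learn sketch of the distinction: a classifier predicts a categorical label (safe/risky), while a numeric predictor predicts a continuous value; the tiny datasets are invented for illustration:

```python
# Classification (categorical label) vs numeric prediction (continuous value).
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: loan applicants described by (income, existing_debt) -> label
X_cls = [[50, 5], [20, 15], [80, 2], [25, 20], [60, 8]]
y_cls = ["safe", "risky", "safe", "risky", "safe"]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[55, 6]]))      # categorical label, e.g. 'safe'

# Numeric prediction: customers described by (age, income) -> amount spent in a sale
X_reg = [[25, 30], [40, 60], [35, 45], [50, 80], [28, 35]]
y_reg = [300, 750, 520, 980, 360]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[33, 50]]))     # continuous value
```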
Relevance Analysis
● Databases may also contain irrelevant attributes.
● Correlation analysis is used to know whether any two given attributes are
related.
Data Transformation and Reduction
1) Normalization
● The data is transformed using normalization.
● It involves scaling all values for given attributes in order to make them fall
within a small specified range.
● Normalization is useful when methods involving distance measurements or neural
networks are used (see the sketch after this list).
2) Generalization
● The data can also be transformed by generalizing it to higher-level concepts.
● For this purpose we can use the concept hierarchies.
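A small sketch of the min-max normalization mentioned in point 1 above, scaling an attribute into the range [0, 1]; the income values are illustrative:

```python
# Min-max normalization: v' = (v - min) / (max - min) * (new_max - new_min) + new_min
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

incomes = [12000, 35000, 58000, 73600, 98000]   # illustrative attribute values
print(min_max_normalize(incomes))                # all values now fall in [0, 1]
```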
Q8) Explain the three-tier data warehouse architecture.
Data warehouse architecture consists of three tiers: the bottom, middle and top tier.
1. Bottom tier
○ It is the data warehouse database server.
○ It has a relational database system.
○ Back end tools and utilities are used to feed data into this tier.
○ These backend tools perform extraction, cleaning and loading.
○ The data are extracted using application program interfaces known as
gateways, such as ODBC and JDBC.
○ This tier also contains a metadata repository.
○ It stores information about the data warehouse and its content.
2. Middle tier
○ It has an OLAP server.
○ It is implemented using either ROLAP or MOLAP models.
○ This application presents an abstracted view of the database.
○ This layer is a mediator between the end user and the database.
○ ROLAP - It maps operation on multidimensional data to standard
relational operations.
○ MOLAP - It directly implements the multidimensional data and
operations.
3. Top tier
○ It is a front end client layer.
○ It contains tools for query reporting, analysis and data mining.
○ These tools help to get data out from the data warehouse.
○ This layer helps to present reports and analysis to the end user.
○ End users interact with this layer.
Q9) Explain star, snowflake and fact constellation schema for multidimensional
database.
1. Star schema
○ It is the most widely used data model.
○ It consists of:
i. Fact table - a central table holding the bulk of the data, with no redundancy.
ii. Dimension table - a smaller table for each dimension.
○ The dimension tables form a radial pattern around the central fact table.
○ So it resembles a starburst.
2. Snowflake schema
○ Dimensional tables are kept in normalized form to reduce redundancy.
○ Such tables are easy to maintain and save storage space.
○ Dimension tables are organised in hierarchical manner.
○ The snowflake structure can reduce the effectiveness of browsing, since more
joins are needed to answer a query.
○ So it is not as popular as the star schema.
3. Fact constellation schema
○ It contains multiple fact tables that share dimension tables.
○ It can be viewed as a collection of star schemas, so it is also called a galaxy schema.
○ It is used for sophisticated applications such as enterprise-wide data warehouses.
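As a rough illustration of how these schemas are queried, a star schema's central fact table can be joined to its small dimension tables on their keys; the table contents below are invented examples (sketched with pandas):

```python
# Star-schema style query: join the fact table to dimension tables, then aggregate.
import pandas as pd

fact_sales = pd.DataFrame({
    "time_key": [1, 1, 2, 2],
    "item_key": [10, 11, 10, 11],
    "units_sold": [120, 80, 150, 60],
})
dim_time = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
dim_item = pd.DataFrame({"item_key": [10, 11], "item": ["Mobile", "Modem"]})

result = (fact_sales
          .merge(dim_time, on="time_key")
          .merge(dim_item, on="item_key")
          .groupby(["quarter", "item"])["units_sold"].sum())
print(result)
```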
Q10) Define a data cube and explain three operations on it. What is cuboid?
Explain any three OLAP operations on a data cube with example.
Data cube - It allows data to be modelled and viewed in multiple dimensions. Each
dimension represents some attribute in the database and the cells store aggregated
measure values.
Cuboid - Each subset of the dimensions defines a cuboid, which shows the data at a
particular level of summarization; the lattice of all such cuboids forms the data cube.
OLAP operations
1. Roll-up
● Rollup is performed by climbing up the concept hierarchy.
● By rolling up, data is aggregated by ascending the location hierarchy.
● Initially the concept hierarchy was at the city level.
● By climbing up, the aggregation moves from the city level to the country level.
● Roll-up can also be performed by dimension reduction, in which one or more
dimensions are removed.
2. Drill Down
● Drill-down is the reverse of roll-up.
● It navigates from less detailed data to more detailed data.
● It is performed by stepping down a concept hierarchy (for example, from quarter
to month) or by introducing a new dimension.
3. Slice
● It selects one particular dimension and provides a new sub-cube.
● Slice is performed on one of the dimensions.
● One of the dimensions is used as criteria for slicing.
● It will form a new sub-cube.
4. Dice
● It selects two or more dimensions.
● Example
● Dice operation is performed on the given cube.
● Three dimensions used as criteria.
● Location = Toronto or Vancouver
● Time = Q1 or Q2
● Item = Mobile or Modem
5. Pivot
● It is also known as rotation.
● It rotates the data axes in view to provide an alternative presentation of the data.
● Consider the example:
● In this example, the item and location axes are rotated.
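A pandas sketch of roll-up, slice and dice on a small sales "cube"; the figures and the city → country hierarchy are illustrative assumptions:

```python
# OLAP-style operations on a tiny sales cube held as a DataFrame.
import pandas as pd

cube = pd.DataFrame({
    "country": ["Canada", "Canada", "Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver", "New York", "New York"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "item":    ["Mobile", "Modem", "Mobile", "Modem", "Mobile", "Modem"],
    "sales":   [605, 825, 400, 512, 1087, 966],
})

# Roll-up: climb the location hierarchy from city level to country level
rollup = cube.groupby(["country", "quarter", "item"])["sales"].sum()

# Slice: fix one dimension (time = Q1) to get a sub-cube
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once
dice = cube[cube["city"].isin(["Toronto", "Vancouver"])
            & cube["quarter"].isin(["Q1", "Q2"])
            & cube["item"].isin(["Mobile", "Modem"])]

print(rollup)
```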
Q11) OLAP and OLTP.
● OLTP: Its transactions are the original source of data.
● OLAP: OLTP databases are the source of data for OLAP.
Web mining
1. Web content mining
○ It is the process of extracting useful information from the contents of
web documents.
○ Web content consists of text, images, audio, video, lists and tables.
○ Issues in text mining are topic discovery and tracking, extracting
association patterns, clustering of web documents and classification of
web pages.
2. Web structure mining
○ It consists of structures of web graphs.
○ Web graphs have web pages as nodes and hyperlinks as edges.
○ A hyperlink connects two different locations, within the same web page or across
different web pages.
○ The content of a web page can also be organised in tree structure
format.
3. Web usage mining
○ It is the process of discovering interesting usage patterns from web
usage data.
○ Usage data includes identity of the web user and their browsing
behaviour.
○ Web server data include IP addresses, page references and access times.
○ Web robots are programs that automatically retrieve information by following the
hyperlink structure of the web.
○ Web robots consume network bandwidth, and their sessions make it difficult to
perform clickstream analysis of genuine user behaviour.
Text mining
Text mining is the process of extracting interesting and non-trivial patterns or
knowledge from unstructured text documents.
Text analysis process
Applications
Q14) Explain Hadoop storage HDFS. Discuss the main features of Hadoop
distributed file system.
● All the data nodes are spread across various machines.
● This system is designed in such a way that user data never flows through the
Namenode.
Namenode
● The namenode is the master server that manages the file system namespace and
regulates clients' access to files.
● It maintains the mapping of file blocks to datanodes and executes operations such
as opening, closing and renaming files and directories.
Datanode
● Datanodes store the actual data blocks and serve read and write requests from
clients.
● They also perform block creation, deletion and replication as instructed by the
namenode.
Features
● It supports the MapReduce processing model, which has two separate functions:
Map and Reduce.
● Users can create a directory and store files inside them.
● It also supports third-party file systems such as cloud storage and Amazon Simple
Storage Service (S3).
Data replication:
● Each file is stored as a sequence of blocks, and every block is replicated (three
copies by default) across datanodes for fault tolerance.
Data organization
● HDFS supports large files by splitting them into blocks of 64 MB (the default block
size in early Hadoop versions).
● The blocks of a file are placed on separate datanodes.
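A back-of-the-envelope Python sketch of how a file maps to 64 MB blocks and replicas, assuming the common default replication factor of 3; the file size is hypothetical:

```python
# Rough block/replica arithmetic for an HDFS file (illustrative numbers).
import math

block_size_mb = 64
replication = 3            # common default replication factor
file_size_mb = 200         # hypothetical file

blocks = math.ceil(file_size_mb / block_size_mb)
total_copies = blocks * replication

print(f"{blocks} blocks, {total_copies} block replicas stored across datanodes")
# 200 MB -> 4 blocks; with replication 3, 12 block replicas in total
```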
Q15) Explain big data and Big Data analytics. Define big data. Discuss various
applications of big data and 3 V's of big data.
Big data is a collection of very large volumes of data, available from various sources
and in varying degrees of complexity, generated at different speeds, which cannot be
processed using traditional technologies, methods, algorithms or off-the-shelf
commercial solutions.
Big data analytics is the process of examining large and various types of data sets to
uncover hidden patterns, unknown correlations, market trends, customer preference
and other useful information that can help organisations to make business decisions.
The 3 V's of big data are:
1. Volume
○ Relational databases are not able to handle the volume of big data.
○ A PC might have had 10 GB of storage in 2000.
○ Today Facebook adds 500 TB of new data every day.
○ Big data volume keeps increasing as records, images and videos are stored,
consuming thousands of terabytes every day.
2. Velocity
○ It refers to the speed of generation of data.
○ Data is generated very fast, at an exponential rate.
○ It is difficult to process data arriving at this velocity.
○ Data is created in real time, but it is hard to process it in real time.
○ The data produced may be structured, semi-structured or unstructured.
○ Ad impressions capture user behaviour at millions of events per second.
○ High frequency stock trading algorithms reflect market changes within
microseconds.
○ Online gaming system supports millions of concurrent users.
○ Infrastructure and sensors generate massive data in real time.
3. Variety
○ It refers to the heterogeneous sources and the structured and
unstructured nature of data.
○ Big data is not just about numbers, dates and strings.
○ Variety of data types like 3D data, audio, video, text, files and social
media are generated everyday.
○ Traditional database systems are designed to handle a few types of data.
○ Big Data analytics includes all the different types of data.