DMBI Simplified
Q1) Explain KDD process using figure. What is data mining? Write down a short
note on the KDD process.
● Data mining: It refers to the extraction, or "mining", of knowledge from large
amounts of data.
● Knowledge base: It stores domain knowledge that is used to guide the search for
interesting patterns, which help in making business decisions.
● Data mining engine: It consists of a set of functional modules like prediction,
cluster analysis, association and correlation analysis, etc.
● Pattern evaluation module: It interacts with data mining modules to find
interesting patterns.
● An enormous amount of data is accumulating in files, databases and other
repositories.
● It is necessary to develop powerful tools for analysis and interpretation of such
data.
● This results in extraction of interesting knowledge that helps to take business
related decisions.
● KDD stands for Knowledge Discovery in Databases.
● Data mining is a part of the KDD process.
Knowledge discovery in databases is an iterative process with the following steps:
1. Data cleaning: noisy and irrelevant data are removed from the collection.
2. Data integration: multiple data sources are combined into a common source, and
the format of all the data is unified into a similar structure.
3. Data selection: relevant data for the analysis is retrieved from the data
collection.
4. Data transformation: the selected data is transformed into forms appropriate for
the mining procedure. Also known as data consolidation.
5. Data mining: clever techniques are used to extract potentially useful patterns.
6. Pattern evaluation: interesting patterns presenting knowledge are identified.
7. Knowledge presentation: the discovered knowledge is presented to the user.
Users should be able to understand and interpret the data mining results.
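As a rough illustration (not from the source), the KDD steps can be sketched in Python with pandas and scikit-learn; the file names, column names and the choice of clustering as the mining step are assumptions made only for this example:

```python
# Illustrative sketch of the KDD stages with pandas/scikit-learn.
# File names, column names and the clustering step are assumptions.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# 1. Data cleaning: drop duplicates and rows with missing key values
raw = pd.read_csv("sales_2023.csv")            # hypothetical source file
clean = raw.drop_duplicates().dropna(subset=["customer_id", "amount"])

# 2. Data integration: combine a second source into a common structure
other = pd.read_csv("sales_online_2023.csv")   # hypothetical second source
combined = pd.concat([clean, other], ignore_index=True)

# 3. Data selection: keep only the attributes relevant to the analysis
selected = combined[["customer_id", "amount", "visits"]]

# 4. Data transformation: scale numeric attributes into a common range
features = MinMaxScaler().fit_transform(selected[["amount", "visits"]])

# 5. Data mining: discover customer groups with a clustering technique
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# 6. Pattern evaluation: keep only clusters large enough to be interesting
sizes = pd.Series(model.labels_).value_counts()
interesting = sizes[sizes > 30]

# 7. Knowledge presentation: report the result to the user
print("Customer segments found:", interesting.to_dict())
```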
Q2) What are the major issues in data mining? Define the term data mining?
Discuss the major issues in data mining.
Data mining refers to the extraction, or mining, of knowledge from large amounts of
data.
● It is a computational process of discovering interesting patterns from data
warehouses and databases.
● It involves methods like artificial intelligence, machine learning, statistics and
DBMS.
5 Major issues in data mining:
1. Mining methodology
2. User interaction
3. Performance and scalability
4. Diversity of data type
5. Application and social impact
1. Mining methodology
○ Mining different kinds of knowledge in database
○ This requires developing different types of data mining techniques
○ Mining of knowledge at multiple levels of abstraction
○ Incorporation of background knowledge
○ Handling the noisy and unwanted data
2. User interaction
○ The complex interface of data mining software.
○ Expression and visualisation of data mining results.
○ Handling noise and incomplete data.
○ Pattern evaluation with the help of visualisation techniques.
○ Background study of the domain.
3. Performance and scalability
○ Efficiency and scalability of data mining algorithms.
○ Real time execution of data mining algorithm.
○ Parallel, distributed and incremental mining algorithms.
4. Diversity of data types
○ Handling complex data types like temporal data, biological sequences,
spatial data and web data.
○ Handling relational data types like structured and semi-structured data.
○ Data mining of multiple sources at global level like the world wide web.
5. Application and social impacts
○ Protection of data security, integrity and privacy.
○ Understanding data sensitivity.
○ Intelligent query processing.
○ Build data mining functions in commonly used business software.
Q3) Describe various methods for handling missing data values. List and describe
the methods for handling missing values in data cleaning.
Data cleaning is a process to fill the missing values, smooth out noise and correct the
inconsistencies of the data.
Missing data
There are 6 ways to handle missing data:
1. Ignore the tuple
2. Manually filling
3. Use a global constant
4. Use a measure of central tendency for the attribute
5. Use attribute mean or median for all samples belonging to the same class
6. Use the most probable value
1. Ignore the tuple
○ Usually done when the class label is missing.
○ Not very effective when the tuple has only a few missing values.
○ Does not work well when the percentage of missing values per attribute varies
considerably.
2. Manually filling
○ It is time consuming.
○ Not possible for large data sets.
○ Not reliable when value to be filled is not easily determined.
3. Using a global constant
○ Replace all missing attribute values by the same constant.
○ Example: replace every missing value with the label "unknown".
○ Simple but not recommended.
○ Because all these identical constant values ("unknown") may together look like
an interesting pattern.
○ Data mining programs may then mistakenly consider this valuable information.
4. Using a measure of central tendency for the attribute
○ Use of mean, median and mode.
○ Mean - for symmetric numeric data.
○ Median - for asymmetric numeric data.
○ Mode - for nominal data.
5. Using the attribute mean or median for all samples belonging to the same
class as the given tuple
○ Replace missing symmetric numeric values with the class-wise mean.
○ Replace missing skewed (asymmetric) numeric values with the class-wise median.
○ Example: replace the missing value with the mean income value for
customers in the same credit risk category.
6. Using the most probable value
○ Using various methods to predict the missing values.
Methods like:
i. Regression
ii. Inference-based tools using a Bayesian formalism
iii. Decision tree induction
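A minimal pandas sketch of several of these strategies; the small DataFrame, column names and class attribute are invented for illustration:

```python
# Sketch of common missing-value strategies with pandas (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [52000, None, 31000, None, 48000],
    "occupation":  ["engineer", None, "clerk", "clerk", None],
})

# 1. Ignore the tuple: drop rows whose class label is missing
dropped = df.dropna(subset=["credit_risk"])

# 3. Use a global constant for a nominal attribute
df["occupation_const"] = df["occupation"].fillna("unknown")

# 4. Use a measure of central tendency for the attribute
df["income_mean"]   = df["income"].fillna(df["income"].mean())            # symmetric data
df["income_median"] = df["income"].fillna(df["income"].median())          # skewed data
df["occ_mode"]      = df["occupation"].fillna(df["occupation"].mode()[0]) # nominal data

# 5. Use the class-wise mean: mean income within the same credit-risk category
df["income_classmean"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)

print(df)
```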
Q4) What is noise? Explain binning method.
Noise is a random error or variance in a measured variable.
● Noise results in unreliable and poor output.
● Smoothing is done to remove noise.
● Smoothing techniques include binning, regression and clustering.
Binning:
● It is a top down splitting technique based on a specified number of bins.
● Does not use class information.
● It is an unsupervised discretisation technique.
● It smooths a sorted data value by consulting its neighbourhood (the values around it).
● The sorted values are distributed into a number of buckets (bins).
● In this way it performs local smoothing.
Example:
1) Equal-frequency partitioning: The price data are first sorted and then partitioned
into equal-frequency bins of size 3.
2) Smoothing by means: Each value in a bin is replaced by the mean value of the
bin.
3) Smoothing by median: Each bin value is replaced by the bin median.
4) Smoothing by boundaries: The minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
● The larger the bin width, the greater the effect of the smoothing.
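A short Python sketch of equal-frequency binning and the three smoothing variants, using an illustrative sorted price list and a bin size of 3:

```python
# Equal-frequency binning followed by smoothing (illustrative price data).
sorted_prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3
bins = [sorted_prices[i:i + bin_size] for i in range(0, len(sorted_prices), bin_size)]

# Smoothing by bin means: every value becomes the mean of its bin
by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

# Smoothing by bin medians: every value becomes the median of its bin
by_medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closest of min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)    # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_medians)  # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```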
Other methods for removing noisy data
● Regression: It is a technique that conforms data values to a function. Linear
regression involves finding the best line to fit two attributes so that one attribute
can be used to predict the other.
● Outlier analysis: Outliers may be detected by clustering, where similar values are
organised into groups or clusters. Values falling outside of the clusters are
considered outliers.
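A rough sketch of clustering-based outlier detection, using scikit-learn's KMeans as one possible clustering technique; the data and the "very small cluster = outliers" rule are assumptions made for illustration:

```python
# Clustering-based outlier detection sketch (illustrative data and rule).
import numpy as np
from sklearn.cluster import KMeans

values = np.array([21, 22, 23, 25, 26, 27, 80, 19, 24, 95]).reshape(-1, 1)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)

labels = km.labels_
counts = np.bincount(labels)           # size of each cluster
outlier_mask = counts[labels] <= 1     # treat singleton clusters as outliers

print(values.ravel()[outlier_mask])    # likely the extreme values 80 and 95
```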
Q5) Explain mean, median, mode, variance, standard deviation and 5 number
summary with suitable database example.
Mean: The arithmetic average of the data values, obtained by summing them and dividing by their count.
Example:
● Suppose you randomly sampled six acres of forest land and came up
with the following counts of a weed in this region: 34, 43, 81, 106, 106 and
115.
● We compute the sample mean by adding the values and dividing by the
number of samples, 6: (34 + 43 + 81 + 106 + 106 + 115) / 6 = 485 / 6 ≈ 80.8.
Median:
● An alternative measure that is less affected by outliers is the median.
● The median is the middle value of the sorted data.
● If we have an even number of values, we take the average of the two
middle values.
● The median is better for describing the typical value.
● It is often used for income and home prices.
Example:
● Suppose you randomly selected 10 house prices in your area. You are
interested in the typical house price. In $100,000 the prices were: 2.7, 2.9,
3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8.
● If we computed the mean, we would say that the average house price is
$744,000.
● Although this number is true, it does not reflect the price for available
housing in your area.
● Since there is an even number of outcomes, we take the average of the
middle two values: (3.7 + 4.1) / 2 = 3.9, i.e. a median price of $390,000.
Mode:
● The mode is the value that occurs most frequently in the data set.
● In general, a data set with two or more modes is multimodal.
● At the other extreme, if each data value occurs only once, then there is no
mode.
● Using the house-price example above, 4.7 is the mode, since it is the only
value that occurs twice.
Variance and standard deviation:
● The standard deviation measures how far the data values lie from the mean;
the variance is the square of the standard deviation.
Example:
● The owner of the Indian restaurant is interested in how much people spend
at the restaurant.
● He examines 10 randomly selected receipts for parties of four and writes
down the following data.
● 44, 50, 38, 96, 42, 47, 40, 39, 46, 50
● He calculates the mean by adding the values and dividing by 10: mean =
492 / 10 = 49.2.
● The squared deviations from the mean sum to 2599.6, so the sample variance
is 2599.6 / 9 ≈ 288.8 and the standard deviation is √288.8 ≈ 17.
● Since the standard deviation can be thought of as measuring how far the
data values lie from the mean, we take the mean and move one standard
deviation in either direction.
● The mean for this example was about 49.2 and the standard deviation was
17.
● We have: 49.2 - 17 = 32.2 and 49.2 + 17 = 66.2
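The restaurant-receipt figures above can be reproduced, together with the five-number summary asked for in the question, with a short Python sketch using the standard statistics module and NumPy:

```python
# Descriptive statistics for the restaurant receipts example.
import statistics
import numpy as np

receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

mean   = statistics.mean(receipts)       # 49.2
median = statistics.median(receipts)     # 45.0 (average of the two middle values)
mode   = statistics.mode(receipts)       # 50 (most frequent value)
var    = statistics.variance(receipts)   # sample variance
std    = statistics.stdev(receipts)      # sample standard deviation, about 17

# Five-number summary: minimum, Q1, median, Q3, maximum
q1, q2, q3 = np.percentile(receipts, [25, 50, 75])
five_number = (min(receipts), q1, q2, q3, max(receipts))

print(mean, median, mode, round(std, 1))
print(five_number)
```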
Q6) What is market basket analysis? Explain association rules with confidence
and support.
Market basket analysis is a modelling technique based on the theory that if you buy a
certain group of items, you are more (or less) likely to also buy another group of items.
● For an association rule A ⇒ B, support is the fraction of all transactions that contain
both A and B.
● Confidence is the fraction of the transactions containing A that also contain B, i.e.
confidence(A ⇒ B) = support(A ∪ B) / support(A).
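A small Python sketch of computing support and confidence for a rule such as {bread} ⇒ {butter}; the transaction list is invented for illustration:

```python
# Support and confidence for one association rule (illustrative transactions).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

antecedent, consequent = {"bread"}, {"butter"}
n = len(transactions)

both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support    = both / n     # fraction of all transactions containing bread AND butter
confidence = both / ante  # of the transactions with bread, fraction that also have butter

print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # 0.60 and 0.75
```

Here support = 3/5 = 0.60 (three of the five transactions contain both items) and confidence = 3/4 = 0.75 (three of the four transactions containing bread also contain butter).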
Q7) Do feature wise comparison between classification and prediction.
Prediction: It constructs a model and uses the model to predict unknown or missing
values.
Classification and prediction can be compared on the following criteria:
● Accuracy − Ability of classifier to predict the class label correctly and the
ability of the predictor to predict the value of missing data value or unknown
values.
● Speed − This refers to the computational cost in generating and using the
classifier or predictor.
● Robustness − Ability of a classifier or predictor to make correct predictions
from given noisy data.
● Scalability − Ability to construct the classifier or predictor efficiently which can
handle a large amount of data.
● Interpretability − It refers to the extent to which the classifier or predictor can be
understood, i.e. the level of insight it provides.
What is classification?
Example
● A bank loan officer wants to analyze the data in order to know which
customers (loan applicants) are risky and which are safe.
● A marketing manager at a company needs to analyze whether a customer with a
given profile will buy a new computer.
● In both of the above examples, a model or classifier is constructed to predict
the categorical labels. These labels are risky or safe for loan application data
and yes or no for marketing data.
What is prediction?
Example
● Suppose the marketing manager needs to predict how much a given
customer will spend during a sale at his company.
● In this example we need to predict a numeric value. Therefore the
data analysis task is an example of numeric prediction.
● In this case, a model or a predictor is constructed that predicts a
continuous-valued function or ordered value.
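A minimal scikit-learn sketch of the distinction: a classifier predicts a categorical label (safe/risky), while a numeric predictor predicts a continuous value; the tiny datasets are invented for illustration:

```python
# Classification (categorical label) vs numeric prediction (continuous value).
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: loan applicants described by (income, existing_debt) -> label
X_cls = [[50, 5], [20, 15], [80, 2], [25, 20], [60, 8]]
y_cls = ["safe", "risky", "safe", "risky", "safe"]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[55, 6]]))      # categorical label, e.g. 'safe'

# Numeric prediction: customers described by (age, income) -> amount spent in a sale
X_reg = [[25, 30], [40, 60], [35, 45], [50, 80], [28, 35]]
y_reg = [300, 750, 520, 980, 360]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[33, 50]]))     # continuous value
```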
Relevance Analysis
● Databases may also contain irrelevant attributes.
● Correlation analysis is used to know whether any two given attributes are
related.
Data Transformation and Reduction
1) Normalization
● The data is transformed using normalization.
● It involves scaling all values for given attributes in order to make them fall
within a small specified range.
● Normalization is useful when methods involving distance measurements or neural
networks are used (see the sketch after this list).
2) Generalization
● The data can also be transformed by generalizing it to higher-level concepts.
● For this purpose we can use the concept hierarchies.
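A small sketch of the min-max normalization mentioned in point 1 above, scaling an attribute into the range [0, 1]; the income values are illustrative:

```python
# Min-max normalization: v' = (v - min) / (max - min) * (new_max - new_min) + new_min
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

incomes = [12000, 35000, 58000, 73600, 98000]   # illustrative attribute values
print(min_max_normalize(incomes))                # all values now fall in [0, 1]
```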
Q8) Explain the three-tier data warehouse architecture.
Data warehouse architecture consists of three tiers: the bottom, middle and top tier.
1. Bottom tier
○ It is the data warehouse database server.
○ It has a relational database system.
○ Back end tools and utilities are used to feed data into this tier.
○ These backend tools perform extraction, cleaning and loading.
○ The data are extracted using application program interfaces known as
gateways, such as ODBC and JDBC.
○ This tier also contains a metadata repository.
○ It stores information about the data warehouse and its content.
2. Middle tier
○ It has an OLAP server.
○ It is implemented using either ROLAP or MOLAP models.
○ This application presents an abstracted view of the database.
○ This layer is a mediator between the end user and the database.
○ ROLAP - It maps operation on multidimensional data to standard
relational operations.
○ MOLAP - It directly implements the multidimensional data and
operations.
3. Top tier
○ It is a front end client layer.
○ It contains tools for query reporting, analysis and data mining.
○ These tools help to get data out from the data warehouse.
○ This layer helps to present reports and analysis to the end user.
○ End users interact with this layer.
Q9) Explain star, snowflake and fact constellation schema for multidimensional
database.
1. Star schema
○ It is the most widely used data model.
○ It consists of:
i. Fact table - a central table holding the bulk of the data, with no redundancy.
ii. Dimension table - a smaller table for each dimension.
○ The dimension tables form a radial pattern around the central fact table.
○ So it resembles a starburst.
2. Snowflake schema
○ Dimensional tables are kept in normalized form to reduce redundancy.
○ Such tables are easy to maintain and save storage space.
○ Dimension tables are organised in hierarchical manner.
○ The snowflake structure can reduce the effectiveness of browsing, since more
joins are needed to answer a query.
○ So it is not as popular as the star schema.
3. Fact constellation schema
○ It contains multiple fact tables that share dimension tables.
○ It can be viewed as a collection of star schemas, so it is also called a galaxy schema.
○ It is used for sophisticated applications such as enterprise-wide data warehouses.
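As a rough illustration of how these schemas are queried, a star schema's central fact table can be joined to its small dimension tables on their keys; the table contents below are invented examples (sketched with pandas):

```python
# Star-schema style query: join the fact table to dimension tables, then aggregate.
import pandas as pd

fact_sales = pd.DataFrame({
    "time_key": [1, 1, 2, 2],
    "item_key": [10, 11, 10, 11],
    "units_sold": [120, 80, 150, 60],
})
dim_time = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
dim_item = pd.DataFrame({"item_key": [10, 11], "item": ["Mobile", "Modem"]})

result = (fact_sales
          .merge(dim_time, on="time_key")
          .merge(dim_item, on="item_key")
          .groupby(["quarter", "item"])["units_sold"].sum())
print(result)
```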
Q10) Define a data cube and explain three operations on it. What is cuboid?
Explain any three OLAP operations on a data cube with example.
Data cube - It allows data to be modelled and viewed in multiple dimensions. Each
dimension represents some attribute in the database and the cells store aggregated
measure values.
Cuboid - Each subset of the dimensions defines a cuboid, which shows the data at a
particular level of summarization; the lattice of all such cuboids forms the data cube.
OLAP operations
1. Roll-up
● Rollup is performed by climbing up the concept hierarchy.
● By rolling up, data is aggregated by ascending the location hierarchy.
● Initially the concept hierarchy was at the city level.
● By climbing up, the aggregation moves from the city level to the country level.
● Roll-up can also be performed by dimension reduction, in which one or more
dimensions are removed.
2. Drill Down
● Drill-down is the reverse of roll-up.
● It navigates from less detailed data to more detailed data.
● It is performed by stepping down a concept hierarchy (for example, from quarter
to month) or by introducing a new dimension.
3. Slice
● It selects one particular dimension and provides a new sub-cube.
● Slice is performed on one of the dimensions.
● One of the dimensions is used as criteria for slicing.
● It will form a new sub-cube.
4. Dice
● It selects two or more dimensions.
● Example
● Dice operation is performed on the given cube.
● Three dimensions used as criteria.
● Location = Toronto or Vancouver
● Time = Q1 or Q2
● Item = Mobile or Modem
5. Pivot
● It is also known as rotation.
● It rotates the data axes in view to provide an alternative presentation of the data.
● Consider the example:
● In this example, the item and location axes are rotated.
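A pandas sketch of roll-up, slice and dice on a small sales "cube"; the figures and the city → country hierarchy are illustrative assumptions:

```python
# OLAP-style operations on a tiny sales cube held as a DataFrame.
import pandas as pd

cube = pd.DataFrame({
    "country": ["Canada", "Canada", "Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver", "New York", "New York"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "item":    ["Mobile", "Modem", "Mobile", "Modem", "Mobile", "Modem"],
    "sales":   [605, 825, 400, 512, 1087, 966],
})

# Roll-up: climb the location hierarchy from city level to country level
rollup = cube.groupby(["country", "quarter", "item"])["sales"].sum()

# Slice: fix one dimension (time = Q1) to get a sub-cube
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once
dice = cube[cube["city"].isin(["Toronto", "Vancouver"])
            & cube["quarter"].isin(["Q1", "Q2"])
            & cube["item"].isin(["Mobile", "Modem"])]

print(rollup)
```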
Q11) OLAP and OLTP.
● OLTP: Its transactions are the original source of data.
● OLAP: OLTP databases are the source of data for OLAP.
Web mining
1. Web content mining
○ It is the process of extracting useful information from the contents of
web documents.
○ Web content consists of text, images, audio, video, lists and tables.
○ Issues in text mining are topic discovery and tracking, extracting
association patterns, clustering of web documents and classification of
web pages.
2. Web structure mining
○ It consists of structures of web graphs.
○ Web graphs have web pages as nodes and hyperlinks as edges.
○ A hyperlink connects two different locations, within the same web page or across
different web pages.
○ The content of a web page can also be organised in tree structure
format.
3. Web usage mining
○ It is the process of discovering interesting usage patterns from web
usage data.
○ Usage data includes identity of the web user and their browsing
behaviour.
○ Web server data include IP addresses, page references and access times.
○ Web robots are programs that automatically retrieve information by following the
hyperlink structure of the web.
○ Web robots consume network bandwidth, and their sessions make it difficult to
perform clickstream analysis of genuine user behaviour.
Text mining
Text mining is the process of extracting interesting and non-trivial patterns or
knowledge from unstructured text documents.
Text analysis process
Applications
Q14) Explain Hadoop storage HDFS. Discuss the main features of Hadoop
distributed file system.
● All the data nodes are spread across various machines.
● This system is designed in such a way that user data never flows through the
Namenode.
Namenode
● The namenode is the master server that manages the file system namespace and
regulates clients' access to files.
● It maintains the mapping of file blocks to datanodes and executes operations such
as opening, closing and renaming files and directories.
Datanode
● Datanodes store the actual data blocks and serve read and write requests from
clients.
● They also perform block creation, deletion and replication as instructed by the
namenode.
Features
● It supports the MapReduce processing model, which has two separate functions:
Map and Reduce.
● Users can create a directory and store files inside them.
● It also supports third-party file systems such as cloud storage and Amazon Simple
Storage Service (S3).
Data replication:
● Each file is stored as a sequence of blocks, and every block is replicated (three
copies by default) across datanodes for fault tolerance.
Data organization
● HDFS supports large files by splitting them into blocks of 64 MB (the default block
size in early Hadoop versions).
● The blocks of a file are placed on separate datanodes.
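A back-of-the-envelope Python sketch of how a file maps to 64 MB blocks and replicas, assuming the common default replication factor of 3; the file size is hypothetical:

```python
# Rough block/replica arithmetic for an HDFS file (illustrative numbers).
import math

block_size_mb = 64
replication = 3            # common default replication factor
file_size_mb = 200         # hypothetical file

blocks = math.ceil(file_size_mb / block_size_mb)
total_copies = blocks * replication

print(f"{blocks} blocks, {total_copies} block replicas stored across datanodes")
# 200 MB -> 4 blocks; with replication 3, 12 block replicas in total
```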
Q15) Explain big data and Big Data analytics. Define big data. Discuss various
applications of big data and 3 V's of big data.
Big data is a collection of very large volumes of data, available from various sources
and in varying degrees of complexity, generated at different speeds, which cannot be
processed using traditional technologies, methods, algorithms or off-the-shelf
commercial solutions.
Big data analytics is the process of examining large and various types of data sets to
uncover hidden patterns, unknown correlations, market trends, customer preference
and other useful information that can help organisations to make business decisions.
The 3 V's of big data are:
1. Volume
○ Relational databases are not able to handle the volume of big data.
○ A PC might have had 10 GB of storage in 2000.
○ Today Facebook adds 500 TB of new data every day.
○ Big data volume keeps increasing as records, images and videos are stored,
consuming thousands of terabytes every day.
2. Velocity
○ It refers to the speed of generation of data.
○ Data is generated very fast, at an exponential rate.
○ It is difficult to process data arriving at this velocity.
○ Data is created in real time, but it is hard to process it in real time.
○ The data produced may be structured, semi-structured or unstructured.
○ Ad impressions capture user behaviour at millions of events per second.
○ High frequency stock trading algorithms reflect market changes within
microseconds.
○ Online gaming system supports millions of concurrent users.
○ Infrastructure and sensors generate massive data in real time.
3. Variety
○ It refers to the heterogeneous sources and the structured and
unstructured nature of data.
○ Big data is not just about numbers, dates and strings.
○ Variety of data types like 3D data, audio, video, text, files and social
media are generated everyday.
○ Traditional database systems are designed to handle a few types of data.
○ Big Data analytics includes all the different types of data.