DMBI Simplified


DATA MINING AND BUSINESS INTELLIGENCE 

Q1) Explain KDD process using figure. What is data mining? Write down a short 
note on the KDD process. 
 
● Data mining: It refers to the extraction (mining) of knowledge from large 
amounts of data. 
● Knowledge base: It is the domain knowledge used to guide the search for 
interesting patterns that help in making business decisions. 
● Data mining engine: ​It consists of a set of functional modules like prediction, 
cluster analysis, association and correlation analysis, etc. 
● Pattern evaluation module: It interacts with the data mining modules to focus 
the search on interesting patterns. 
● An enormous amount of data keeps accumulating in files, databases and other 
repositories. 
● It is necessary to develop powerful tools for analysis and interpretation of such 
data. 
● This results in the extraction of interesting knowledge that helps in making 
business-related decisions. 
● KDD stands for Knowledge Discovery in Databases. 
● Data mining is a part of the KDD process. 
 

 
 
Knowledge discovery in databases is an iterative process consisting of the following steps: 
 
1. Data cleaning: noisy and irrelevant data are removed from the collection. 
2. Data integration: multiple data sources are combined into a common source, and 
the data is unified into a similar structure and format. 
3. Data selection: ​relevant data for the analysis is retrieved from the data 
collection. 

4. Data transformation: the selected data is transformed into forms appropriate for the 
mining procedure. This step is also known as data consolidation. 
5. Data mining: ​clever techniques are used to extract potentially useful patterns. 
6. Pattern evaluation: ​interesting patterns presenting knowledge are identified. 
7. Knowledge presentation: the discovered knowledge is presented to the 
user. Users should be able to understand and interpret the data mining results. 
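
As a rough illustration of these steps, here is a minimal Python sketch of a KDD-style pipeline using pandas. The data frames, column names and bin edges below are invented for illustration and are not part of the original notes.

```python
import pandas as pd

# Stand-ins for two real data sources (values invented for illustration)
sales = pd.DataFrame({"customer_id": [1, 2, 2, 3, 3],
                      "amount": [120.0, None, 80.0, 200.0, 150.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "age": [23, 41, 67]})

# 1. Data cleaning: remove noisy / incomplete records
sales = sales.dropna(subset=["amount"])

# 2. Data integration: combine the two sources into a common frame
data = sales.merge(customers, on="customer_id")

# 3. Data selection: keep only the attributes relevant to the analysis
data = data[["customer_id", "age", "amount"]]

# 4. Data transformation (consolidation): scale amount into [0, 1]
data["amount_scaled"] = (data["amount"] - data["amount"].min()) / \
                        (data["amount"].max() - data["amount"].min())

# 5. Data mining: a simple pattern - average spend per age group
pattern = data.groupby(pd.cut(data["age"], bins=[0, 30, 60, 100]),
                       observed=False)["amount"].mean()

# 6./7. Pattern evaluation and knowledge presentation
print(pattern)
```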
 
Q2) What are the major issues in data mining? Define the term data mining? 
Discuss the major issues in data mining. 
 
Data mining refers to the extraction (mining) of knowledge from large amounts of 
data. 
● It is a computational process of discovering interesting patterns from data 
warehouses and databases. 
● It involves methods like artificial intelligence, machine learning, statistics and 
DBMS. 
 
5 Major issues in data mining: 
 
1. Mining methodology 
2. User interaction 
3. Performance and scalability 
4. Diversity of data type 
5. Application and social impact 
 
1. Mining methodology 
○ Mining different kinds of knowledge in databases. 
○ This requires developing different types of data mining techniques. 
○ Mining knowledge at multiple levels of abstraction. 
○ Incorporation of background knowledge. 
○ Handling noisy and unwanted data. 
 
2. User interaction 
○ The complex interface of data mining software. 
○ Expression and visualisation of data mining results. 
○ Handling noise and incomplete data. 
○ Pattern evaluation with the help of visualisation techniques. 
○ Background study of the domain. 
 
3. Performance and scalability 

○ Efficiency and scalability of data mining algorithms. 
○ Real time execution of data mining algorithm. 
○ Parallel distributed and incremental mining problems. 
 
4. Diversity of data types 
○ Handling complex data types like temporal data, biological sequences, 
spatial data and web data. 
○ Handling relational data types like structured and semi-structured data. 
○ Data mining from multiple sources at a global level, like the World Wide Web. 
 
5. Application and social impacts 
○ Protection of data security, integrity and privacy. 
○ Understanding data sensitivity. 
○ Intelligent query processing. 
○ Build data mining functions in commonly used business software. 
 
Q3) Describe various methods for handling missing data values. List and describe 
the methods for handling missing values in data cleaning. 
 
Data cleaning is a process to fill in missing values, smooth out noise and correct 
inconsistencies in the data. 
 
Missing data 
 
There are​ 6 ways​ to handle missing data: 
 
1. Ignore the tuple 
2. Manually filling 
3. Use a global constant 
4. Use a measure of central tendency for the attribute 
5. Use attribute mean or median for all samples belonging to the same class 
6. Use the most probable value 
 
1. Ignore the tuple 
○ Usually done when the class label is missing. 
○ Not very effective when the tuple has only a few missing values, since the 
remaining values are discarded as well. 
○ It performs poorly when the percentage of missing values per attribute varies 
considerably. 
 
2. Manually filling 

○ It is time consuming. 
○ Not feasible for large data sets. 
○ Not reliable when the value to be filled in is not easily determined. 
 
3. Using a global constant 
○ Replace all missing attribute values by the same constant, for example 
'unknown'. 
○ Simple, but not recommended. 
○ Because all the replaced values share the same constant ('unknown'), the 
mining program may mistake them for an interesting pattern and treat 
them as valuable information. 
 
4. Using a measure of Central tendency for the attribute 
○ Use of mean, median and mode. 
○ Mean ​- for symmetric numeric data. 
○ Median ​- for asymmetric numeric data. 
○ Mode ​- for nominal data. 
 
5. Using the attribute mean or median for all samples belonging to the same 
class as the given tuple 
○ Replace a missing symmetric numeric value with the mean of the tuple's class. 
○ Replace a missing skewed (asymmetric) numeric value with the median of the tuple's class. 
○ Example​: replace the missing value with the mean income value for 
customers in the same credit risk category. 
 
6. Using the most probable value 
○ Using various methods to predict the missing values. 
Methods like:  
i. Regression  
ii. Inference-based tools using Bayesian formalism 
iii. Decision tree induction 
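
A minimal pandas sketch of methods 1, 3, 4 and 5 above; the DataFrame, column names and values are invented for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high"],
    "income":      [30000, None, 52000, None],
})

# 1. Ignore the tuple: drop rows with a missing value
dropped = df.dropna(subset=["income"])

# 3. Use a global constant (simple but not recommended)
constant = df["income"].fillna(-1)

# 4. Use a measure of central tendency for the attribute (mean here)
overall_mean = df["income"].fillna(df["income"].mean())

# 5. Use the mean of all samples in the same class (credit_risk group)
class_mean = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean"))

print(class_mean)
```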
 
Q4) What is noise? Explain binning method. 
 
Noise is a random error or variance in a measured variable. 
 
● Noise results in unreliable and poor output. 
● Smoothing is done to remove noise. 
● Smoothing techniques include binning, regression and clustering. 
 

Binning:  
 
● It is a top-down splitting technique based on a specified number of bins. 
● Does not use class information. 
● It is an unsupervised discretisation technique. 
● It smooths a sorted data value by consulting its neighbourhood (the values around it). 
● Sorted values are distributed into a number of buckets (bins). 
● In this way it performs local smoothing. 
 

 
 
Example: 
 
1) Equal frequency: the price data is first sorted and then partitioned into 
equal-frequency bins of size 3. 
2) Smoothing by means​: Each value in a bin is replaced by the mean value of the 
bin. 
3) Smoothing by median​: Each bin value is replaced by the bin median. 
4) Smoothing by boundaries: the minimum and maximum values in a given bin are 
identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. 
 
● Larger bin width = greater smoothing effect. 
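
A small sketch of equal-frequency binning, smoothing by bin means and smoothing by bin boundaries. The sorted price data (4, 8, 15, 21, 21, 24, 25, 28, 34) is the classic textbook example for bins of size 3 and is assumed here, not taken from the notes above.

```python
import numpy as np

# Sorted data for price (in dollars); assumed example values
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency partitioning into bins of size 3
bins = prices.reshape(-1, 3)

# Smoothing by bin means: replace every value by its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: replace by the closest of min/max in the bin
lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)           # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_bounds.ravel())  # [ 4  4 15 21 21 24 25 25 34]
```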
 
Other methods for removing noisy data 
 

● Regression: It is a technique that conforms data values to a function. It involves 
finding the best line to fit two attributes, so that one attribute can be used to 
predict the other. 
● Outlier analysis: Outliers are detected by clustering, where similar values are organised 
into groups (clusters). Values falling outside the clusters are considered 
outliers. 
 
Q5) Explain mean, median, mode, variance, standard deviation and 5 number 
summary with suitable database example. 
Mean​:  

● The sample mean is the average. 


● It is calculated as the sum of all the observed outcomes divided by the total 
number of events.  

 
Example​: 

● Suppose you randomly surveyed six acres of forest land and came up 
with the following counts of a certain weed in each region: 34, 43, 81, 106, 106 and 
115. 
● We compute the sample mean by adding the counts and dividing by the 
number of samples, 6: (34 + 43 + 81 + 106 + 106 + 115) / 6 = 485 / 6 ≈ 80.83. 

● We can say that the sample mean of the weed count is about 80.83. 


● The mean does not always depict the typical outcome. 
● If one value is very far from the rest of the values, it will strongly affect the mean. 
● Such values are called outliers. 
 
Median​:  

● An alternative measure that is less affected by outliers is the median. 
● The median is the middle score. 
● If we have an even number of values, we take the average of the two middle values. 
● The median is better for describing the typical value. 
● It is often used for income and home prices. 
Example: 

● Suppose you randomly selected 10 house prices in your area and you are 
interested in the typical house price. In units of $100,000 the prices were: 2.7, 2.9, 
3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8. 
● If we computed the mean, we would say that the average house price is 
$744,000. 
● Although this number is true, it does not reflect the price of the available 
housing in your area. 
● Since there is an even number of outcomes, we take the average of the 
middle two values: (3.7 + 4.1) / 2 = 3.9. 

● The median house price is $390,000.  


● This better reflects what house shoppers should expect to spend. 
 
Mode:  
● The mode for a set of data is the value that occurs most frequently in the set. 
● The mode is another measure of central tendency. 
● It can be determined for both qualitative and quantitative attributes. 
● It is possible for the greatest frequency to correspond to several different 
values, which results in more than one mode. 
● Data sets with one, two, or three modes are respectively called unimodal, 
bimodal, and trimodal. 

● In general, a dataset with two or more modes is multimodal. 
● At the other extreme, if each data value occurs only once, then there is no mode. 
● Using the house price example above, the mode is 4.7 ($470,000), since it occurs twice. 
 

Variance & Standard Deviation:  

● They are measures of data dispersion. 


● They indicate how the data is distributed around the mean. 
● A low standard deviation means the data values lie close to the mean. 
● A high standard deviation means the data is spread out over a large range of 
values. 
● For example, a pharmaceutical engineer develops a new drug that 
regulates sugar in the blood. Suppose she finds out that the average sugar 
content after taking the medication is at the optimal level. This does not mean 
that the drug is effective: it is possible that half of the patients have 
dangerously low sugar content while the other half have dangerously high 
content. 
● Instead of being an effective regulator, the drug would be a deadly poison. What 
the engineer needs is a measure of how far the data is spread apart. This is 
what the variance and standard deviation provide. 
 
● The sample variance is: s² = Σ(xᵢ − x̄)² / (n − 1) 
● The standard deviation is the square root of the variance: s = √( Σ(xᵢ − x̄)² / (n − 1) ) 
Example:  

● Calculate the mean x̄. 


● Write a table that subtracts the mean from each observed value. 
● Square each of the differences. 
● Add this column. 
● Divide by n − 1, where n is the number of items in the sample; this is the variance. 

● The owner of an Indian restaurant is interested in how much people spend 
at the restaurant. 
● He examines 10 randomly selected receipts for parties of four and writes 
down the following data: 
● 44, 50, 38, 96, 42, 47, 40, 39, 46, 50 
● He calculates the mean by adding and dividing by 10: 492 / 10 = 49.2. 
● Build the table of deviations from the mean: subtracting 49.2 from each value 
and squaring gives squared deviations that sum to 2599.6. 
● Now 2599.6 / (10 − 1) ≈ 288.8. 


● Hence the variance is approximately 289. 
● And the standard deviation is the square root of 289, which is 17. 

● Since the standard deviation can be thought of as measuring how far the 
data values lie from the mean, we take the mean and move one standard 
deviation in either direction. 
● The mean for this example was about 49.2 and the standard deviation was 17. 
● We have: 49.2 - 17 = 32.2 and 49.2 + 17 = 66.2 
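
The statistics in this question can be checked with a short Python snippet. The five-number summary (minimum, Q1, median, Q3, maximum) mentioned in the question is also shown here, using the restaurant receipts as an assumed example data set.

```python
import statistics
import numpy as np

receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

print(statistics.mean(receipts))      # 49.2
print(statistics.median(receipts))    # 45.0 (average of the middle two: 44 and 46)
print(statistics.mode(receipts))      # 50 (occurs twice)
print(statistics.variance(receipts))  # ~288.8 (sample variance, divides by n - 1)
print(statistics.stdev(receipts))     # ~17.0

# Five-number summary: min, Q1, median, Q3, max
print(np.percentile(receipts, [0, 25, 50, 75, 100]))
```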

Q6)  What  is  market  basket  analysis?  Explain  association  rules  with  confidence 
and support. 

Market Basket Analysis is a modelling technique based upon the theory that if you buy a 
certain group of items, you are more (or less) likely to buy another group of items. 

● It is an example of frequent itemset mining. 


● Purpose is to determine what products customers purchase together. 
● It is a technique which identifies the strength of association between pairs of 
products purchased together. 
● It helps to identify the pattern of co-occurrence when two or more things take 
place together. 
● The rules are derived from the frequencies of co-occurrence. 
● The rules can be applied in pricing strategies, product placement and various 
cross selling strategies. 
● It takes data at the transaction level and makes a list of items bought by a 
customer in a single purchase. 
● This data is then used to build If-Then rules for the items purchased. 
● Example: IF {Milk, Eggs} THEN {Bread} 
● The IF part is called the Antecedent. 
● The THEN part is called the Consequent. 
● SUPPORT is the fraction (probability) of transactions that contain all the items 
in the rule, i.e. Milk, Eggs and Bread together. 
● CONFIDENCE is the conditional probability that a customer who buys the 
antecedent (Milk and Eggs) also buys the consequent (Bread). 
● The algorithms for performing market basket analysis are straightforward. 
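
A minimal sketch computing support and confidence for the rule IF {Milk, Eggs} THEN {Bread}; the five transactions below are invented for illustration.

```python
# Hypothetical transactions (each is the set of items in one purchase)
transactions = [
    {"Milk", "Eggs", "Bread"},
    {"Milk", "Eggs"},
    {"Milk", "Bread"},
    {"Eggs", "Bread", "Butter"},
    {"Milk", "Eggs", "Bread", "Butter"},
]

antecedent = {"Milk", "Eggs"}
consequent = {"Bread"}
rule_items = antecedent | consequent

n = len(transactions)
n_antecedent = sum(antecedent <= t for t in transactions)
n_rule = sum(rule_items <= t for t in transactions)

support = n_rule / n                 # fraction containing Milk, Eggs and Bread
confidence = n_rule / n_antecedent   # P(Bread | Milk and Eggs)

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# support = 0.40, confidence = 0.67
```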

Q7) Do feature wise comparison between classification and prediction. 

Classification: It predicts categorical class labels. 

Prediction: ​It constructs a model and uses the model to predict unknown or missing 
values. 

● We can build a classification model to categorize bank loan applications as 
either safe or risky. 
● We can build a prediction model to predict the expenditure in dollars of potential 
customers on computer equipment, given their income and occupation. 
Comparison: 

● Accuracy − The ability of the classifier to predict the class label correctly, and the 
ability of the predictor to predict the value of a missing or unknown attribute. 
● Speed ​− This refers to the computational cost in generating and using the 
classifier or predictor. 
● Robustness ​− Ability of a classifier or predictor to make correct predictions 
from given noisy data. 
● Scalability ​− Ability to construct the classifier or predictor efficiently which can 
handle a large amount of data.   
● Interpretability − It refers to the extent to which the classifier or predictor 
can be understood, i.e. how easily its results can be interpreted. 

What is classification? 
Example 
● A bank loan officer wants to analyze the data in order to know which 
customer (loan applicants) are risky or which are safe. 
● A marketing manager at a company needs to analyze a customer with a given 
profile, who will buy a new computer. 
● In both of the above examples, a model or classifier is constructed to predict 
the categorical labels. These labels are risky or safe for loan application data 
and yes or no for marketing data. 

What is prediction? 
Example 
● Suppose the marketing manager needs to predict how much a given 
customer will spend during a sale at his company. 
● In this example we need to predict a numeric value; therefore the 
data analysis task is an example of numeric prediction. 
● In this case, a model or a predictor will be constructed that predicts a 
continuous-valued-function or ordered value. 
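
A small sketch of the two tasks, assuming scikit-learn is installed; the toy feature values and labels below are invented for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a categorical label (loan is 'safe' or 'risky')
X_loans = [[25, 20000], [40, 60000], [35, 15000], [50, 90000]]  # [age, income]
y_loans = ["risky", "safe", "risky", "safe"]
clf = DecisionTreeClassifier().fit(X_loans, y_loans)
print(clf.predict([[30, 70000]]))   # e.g. ['safe']

# Prediction: predict a continuous value (expected spend in dollars)
X_cust = [[30000], [50000], [70000], [90000]]                   # [income]
y_spend = [400, 700, 1000, 1300]
reg = LinearRegression().fit(X_cust, y_spend)
print(reg.predict([[60000]]))       # e.g. [850.]
```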
 

Classification and Prediction Issues 


 
Data Cleaning 
● It involves removing the noise and treatment of missing values. 
● The noise is removed by applying smoothing techniques. 
● The missing values are replaced by the most commonly occurring value for 
that attribute. 

Relevance Analysis 
● Databases may also contain irrelevant attributes. 
● Correlation analysis is used to know whether any two given attributes are 
related. 
 
Data Transformation and Reduction  

1) Normalization 
● The data is transformed using normalization. 
● It involves scaling all values for given attributes in order to make them fall 
within a small specified range. 
● Normalization is useful when methods involving measurements or distances 
(such as neural networks or nearest-neighbour classifiers) are used; a small 
scaling sketch follows this list. 
2) Generalization 
● The data can also be transformed by generalizing it to higher-level concepts. 
● For this purpose we can use the concept hierarchies. 
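
A minimal sketch of min-max normalization to the range [0, 1]; the income values are assumed for illustration.

```python
import numpy as np

income = np.array([12000, 35000, 58000, 99000], dtype=float)

# Min-max normalization: scale values into the range [0, 1]
normalized = (income - income.min()) / (income.max() - income.min())
print(normalized)  # [0.    0.264 0.529 1.   ] (approximately)
```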

 

Q8) Explain data warehouse architecture. 

Data warehouse architecture consists of 3 tiers: the bottom, middle and top tier. 

1. Bottom tier 
○ It is the data warehouse database server. 
○ It has a relational database system. 
○ Back end tools and utilities are used to feed data into this tier. 
○ These backend tools perform extraction, cleaning and loading. 
○ Data is extracted using application programs known as gateways, such as 
ODBC and JDBC. 
○ This tier also contains a metadata repository. 
○ It stores information about the data warehouse and its content. 

2. Middle tier 
○ It has an OLAP server. 
○ It is implemented using either ROLAP or MOLAP models. 
○ This application presents an abstracted view of the database. 
○ This layer is a mediator between the end user and the database. 
○ ROLAP ​- It maps operation on multidimensional data to standard 
relational operations. 
○ MOLAP ​- It directly implements the multidimensional data and 
operations. 

3. Top tier 
○ It is a front end client layer. 
○ It contains tools for query reporting, analysis and data mining. 
○ These tools help to get data out from the data warehouse. 
○ This layer presents reports and analysis results to the end user. 
○ End users interact with this layer. 

 

 
 

Q9) Explain star, snowflake and fact constellation schema for multidimensional 
database. 

● The ER data model is used to design relational databases. 


● Such models are used for on-line transaction processing. 
● The most popular data model for data warehouses is the multidimensional 
model. 

Following are the types of multidimensional data model: 

1. Star schema 
○ It is the most widely used data warehouse model. 
○ It consists of: 
i. Fact table - a central table containing the bulk of the data, with no redundancy. 
ii. Dimension table - a smaller table for each dimension. 
○ The dimension tables are arranged in a radial pattern around the central fact table. 
○ So it resembles a starburst. 

2. Snowflake schema 
○ Dimensional tables are kept in normalized form to reduce redundancy. 
○ Such tables are easy to maintain and save storage space. 
○ Dimension tables are organised in a hierarchical manner. 
○ The snowflake structure reduces the effectiveness of browsing. 
○ So it is not as popular as the star schema. 

 

3. Fact constellation schema 


○ It is a collection of star schema and so it is also called a galaxy schema. 
○ It allows multiple fact tables to share dimension tables. 
○ For example, the dimension tables time, item and location may be shared 
between the sales and shipping fact tables. 
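
To illustrate the star schema idea, here is a tiny sketch of a fact table joined to two dimension tables with pandas; the tables, keys and values are invented for illustration.

```python
import pandas as pd

# Dimension tables (one small table per dimension)
dim_item = pd.DataFrame({"item_key": [1, 2], "item_name": ["Mobile", "Modem"]})
dim_loc  = pd.DataFrame({"loc_key": [10, 20], "city": ["Toronto", "Vancouver"]})

# Central fact table: foreign keys to the dimensions plus the measure
fact_sales = pd.DataFrame({
    "item_key": [1, 1, 2],
    "loc_key":  [10, 20, 10],
    "units_sold": [605, 300, 825],
})

# A typical star-schema query: join facts to dimensions, then aggregate
report = (fact_sales
          .merge(dim_item, on="item_key")
          .merge(dim_loc, on="loc_key")
          .groupby(["city", "item_name"])["units_sold"].sum())
print(report)
```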

 

Q10) Define a data cube and explain three operations on it. What is cuboid? 
Explain any three OLAP operations on a data cube with example. 

Data Cube - Each dimension represents some attribute in the database (e.g., time, item, 
location) and each cell stores the value of some measure (e.g., sales amount). 

● It allows data to be modelled and viewed in multiple dimensions. 
● It is defined by dimensions and facts. 
● Cuboid - each subset of the dimensions gives one cuboid, i.e. one level of 
summarization of the data; the lattice of all such cuboids forms the data cube, 
with the least summarized cuboid called the base cuboid. 

OLAP operations 

1. Roll-up 

It performs aggregation on a data cube in the following ways: 

a. By climbing up concept hierarchy. 


b. By dimension reduction. 

● Roll-up is performed by climbing up a concept hierarchy. 
● By rolling up, data is aggregated by ascending the location hierarchy. 
● Initially the concept was at the city level. 
● By climbing up, the concept is transformed from the city level to the country level. 
● While rolling up, one or more dimensions are removed. 

2. Drill Down 

It is the reverse operation of roll up. 

a. By stepping down the concept hierarchy 


b. By introducing a new dimension 
● It is performed by stepping down a concept hierarchy. 
● By drilling down, the data is broken into finer detail by descending the time hierarchy. 
● Initially the concept was quarter (Q1, Q2, Q3, Q4). 
● By drilling down, it is transformed from the quarter level to the month level. 
● One or more dimensions are added. 
● It moves from less detailed data to more detailed data. 

3. Slice 
● It selects one particular dimension and provides a new sub-cube. 
● Slice is performed on one of the dimensions. 
● One of the dimensions is used as criteria for slicing. 
● It will form a new sub-cube. 

4. Dice 
● It selects two or more dimensions. 
● Example 
● Dice operation is performed on the given cube. 
● Three dimensions used as criteria. 
● Location = Toronto or Vancouver 
● Time = Q1 or Q2 
● Item = Mobile or Modem 

 

5. Pivot 
● It is also known as rotation. 
● It rotates the data axis in view to provide alternative presentation. 
● Consider the example: 
● In this example, the item and location axes are rotated. 
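
The OLAP operations above can be imitated on a small pandas DataFrame; the sales figures below are invented for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "country": ["Canada", "Canada", "Canada", "Canada"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["Mobile", "Modem", "Mobile", "Modem"],
    "amount":  [605, 825, 300, 512],
})

# Roll-up: aggregate from the city level up to the country level
rollup = sales.groupby(["country", "quarter"])["amount"].sum()

# Slice: fix one dimension (time = Q1) to get a sub-cube
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions
dice = sales[(sales["quarter"].isin(["Q1", "Q2"])) &
             (sales["city"].isin(["Toronto", "Vancouver"])) &
             (sales["item"].isin(["Mobile", "Modem"]))]

# Pivot: rotate the axes to get an alternative presentation
pivot = sales.pivot_table(index="item", columns="city",
                          values="amount", aggfunc="sum")
print(rollup, pivot, sep="\n\n")
```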

Q11) OLAP and OLTP. 

OLTP | OLAP 

Supports short transactions | Supports long transactions 

It is an operational processing system | It is an information processing system 

Purpose is to control and run fundamental business tasks | Purpose is to help with problem solving and decision support 

Requires less space if historical data is archived | Requires more space 

Tables are normalised | Tables are not normalised 

Its transactions are the original source of data | OLTP databases are the source of OLAP data 

Database size: 100 MB to GB | Database size: 100 GB to TB 

It contains current data | It contains historical data 

It is application oriented | It is information oriented 
 

Q12) Explain Apriori property and algorithm. Generate candidate itemsets, 
frequent itemsets and association rules using the Apriori algorithm. 

Refer to darshan pdf and YouTube videos 

Q13) Explain basic concepts of text mining and web mining. 

Web mining 

● It is the use of data mining techniques to automatically discover and extract 
information from web documents and services. 
● Web mining is broadly divided into three categories: 
 

1. Web content mining 
○ It is the process of extracting useful information from the contents of 
web documents. 
○ It consists of text, images, audio, video, lists and tables. 
○ Issues here include topic discovery and tracking, extracting 
association patterns, clustering of web documents and classification of 
web pages. 
2. Web structure mining 
○ It consists of structures of web graphs. 
○ Web graphs have web pages as nodes and hyperlinks as edges. 
○ A hyperlink connects two different locations, within the same web page or 
across different web pages. 
○ The content of a web page can also be organised in tree structure 
format. 
3. Web usage mining 
○ It is the process of discovering interesting usage patterns from web 
usage data​. 
○ Usage data includes the identity of web users and their browsing 
behaviour. 
○ Web server data includes IP addresses, page references and access times. 
○ Web robots are used to retrieve information from the hyperlink 
structure of the web. 
○ Web robots consume a lot of network bandwidth, and their requests make it 
difficult to perform clickstream analysis. 

Text mining 

● It is the process of deriving high quality information from text. 


● Statistical pattern learning is used to study patterns and trends. 

Text analysis process 

● A set of textual materials is collected for analysis. 
● It is collected from the web, a file system or a database. 
● Text analytics systems apply natural language processing techniques such as 
part-of-speech tagging, syntactic parsing and linguistic analysis. 
● Named entity recognition is used to identify the names of people, 
organizations, place, symbols and so on. 
● Disambiguation is required if there are different meanings of the same word. 
● Example: 'Ford' can refer to former US president, vehicle manufacturer or a 
movie star. 
● Data such as telephone numbers and email addresses are found with the help 
of pattern matching algorithms. 
● Lastly, the noun phrases and other terms which refer to the same object are 
identified. 
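
A small sketch of the pattern-matching step for phone numbers and e-mail addresses; the regular expressions and sample text are assumptions for illustration, not patterns given in the notes.

```python
import re

text = "Contact sales at sales@example.com or call 022-1234-5678."

# Simple illustrative patterns for e-mail addresses and dashed phone numbers
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
phones = re.findall(r"\d{2,4}-\d{3,4}-\d{3,4}", text)

print(emails)  # ['sales@example.com']
print(phones)  # ['022-1234-5678']
```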

Applications 

● Data mining and business intelligence 


● National security 
● Social media monitoring 
● Automatic ad placement 
● Sentiment analysis tools 
● E-Discovery 

Q14) Explain Hadoop storage HDFS. Discuss the main features of Hadoop 
distributed file system. 

Architecture and storage 

● It is an open-source software framework licensed under the Apache V2 licence. 
● It provides a software framework for big data processing in real-time applications. 
● Hadoop architecture has two parts: 
a. Hadoop distributed file system. 
b. Mapreduce engine. 
● It is designed to run on commodity hardware. 
● It uses a block-structured file system. 
● Each file is divided into blocks of predetermined size. 
● These blocks are stored in clusters of one or more machines. 
● It follows Master/Slave architecture. 
● A cluster has a single Namenode (master) and several Datanodes (slaves). 

● All the data nodes are spread across various machines. 
● This system is designed in such a way that user data never flows through the 
Namenode. 

Namenode  

● It maintains and manages the blocks present in the data nodes. 


● It is a highly available server. 
● It manages the file system namespace and controls access to files by clients. 
● It executes operations like opening, closing and renaming files and directories. 
● It determines the mapping of blocks to Datanodes. 

Datanode  

● It runs on commodity hardware, which is inexpensive. 
● It need not be of high quality or highly available. 
● It is a block server that stores the data in the local file system (e.g., ext3 or ext4). 
● It performs operations like block creation, deletion and replication. 

Journal and Checkpoint 

● The journal is kept in the native file system. 
● It is a modification log of the namespace image. 
● It is updated for every client transaction. 
● A checkpoint is a persistent record of the namespace image. 
● It is also stored in the native file system. 
● Checkpoint nodes can create new checkpoint files on startup or restart. 
● The Namenode does not update or modify checkpoint files in place. 

Features 

● It supports the traditional hierarchical file organisation. 

● It uses the concept of MapReduce which has two separate functions. 
● Users can create a directory and store files inside them. 
● It also supports third-party file systems like cloud storage, for example Amazon 
Simple Storage Service (S3). 

Data replication:  

● HDFS replicates file blocks for fault tolerance. 


● The number of replicas of a file (the replication factor) is specified at the time the file is created. 
● All replication decisions are taken by the Namenode. 
● HDFS uses an intelligent replica placement model for reliability and 
performance. 
● It also uses network bandwidth efficiently. 

Data organization 

● HDFS supports large files by dividing them into one or more blocks of size 64 MB. 
● Each block is placed on a separate Datanode. 

● HDFS applications need a write-once-read-many access model for files. 


● A file once created, written and closed need not be changed. 
● HDFS has master slave architecture. 
● HDFS cluster consists of a single name node that manages file systems. 
● HDFS supports traditional hierarchical file organisation. 
● Users can create directories and store files inside them. 
● HDFS does not support hard links or soft links. 
● Files in HDFS have strictly one writer at any time. 

Q15) Explain big data and Big Data analytics. Define big data. Discuss various 
applications of big data and 3 V's of big data. 

Big Data is a collection of very large volumes of data available from various sources, in 
varying degrees of complexity, generated at different speeds, which cannot be 
processed using traditional technologies, methods, algorithms or off-the-shelf commercial 
solutions. 

● Big data is similar to small data but bigger in size. 


● It requires different techniques, tools and architecture for analysis. 

Big data analytics is the process of examining large and varied data sets to 
uncover hidden patterns, unknown correlations, market trends, customer preferences 
and other useful information that can help organisations make business decisions. 

● It processes a very large quantity of digital information that cannot be 
analysed using traditional techniques. 
● Organisations like Google, LinkedIn and Facebook are using Big Data analytics. 

Example: 

● Walmart handles more than 1 million customer transactions every hour. 


● Facebook handles 40 million photos from its user base. 
● Twitter generates 70 b of data daily. 

3 characteristics of big data 

1. Volume 
○ Relational databases are not able to handle the volume of big data. 
○ A PC might have had 10 GB of storage in 2000. 
○ Today Facebook adds 500 TB of new data every day. 
○ Big data volume increases with the storage of records, images and videos, 
which consume thousands of terabytes every day. 

2. Velocity 
○ It refers to the speed of generation of data. 
○ Data is generated very fast, at an exponential rate. 
○ It is difficult to process data at this velocity. 
○ Data is created in real time but it is hard to process it in real time. 
○ The data produced may be structured, semi-structured or unstructured. 
○ Ad impressions capture user behaviour at millions of events per second. 
○ High-frequency stock trading algorithms reflect market changes within 
microseconds. 
○ Online gaming systems support millions of concurrent users. 
○ Infrastructure and sensors generate massive amounts of data in real time. 

3. Variety 
○ It refers to the heterogeneous sources and the structured and 
unstructured nature of data. 
○ Big data is not just about numbers, dates and strings. 
○ Variety of data types like 3D data, audio, video, text, files and social 
media are generated everyday. 
○ Traditional database systems are designed to handle a few types of data. 
○ Big Data analytics includes all the different types of data. 

