Data Analytics 3 Marks Questions


DATA ANALYTICS

5 MARKS QUESTIONS

Q1. Write a short note on associative data mining.

Q2. Write a short note on deep mining.

Q3. Applications of clustering.

Q4. Quadratic discriminant analysis.

Q5. Describe in detail the role of statistical models in data analytics.

Q6. Descriptive statistics.

Q7. Difference between probability distribution & descriptive statistics.

Q8. Differentiate between overfitting & underfitting and how to deal with them.

Q9. Explain the principle of a neural network.

Q10. Explain why SVM is more accurate than logistic regression, with an example.

Q11. Justify why SVM is so fast.

Q12. Justify what are the best practices in big data analytics.

Q13. Assess the techniques in big data analytics.

Q14. Explain ANOVA.

Q15. Different features of Hadoop.

Q16. Illustrate how Hadoop is related to big data analytics.

3 MARKS QUESTIONS

Q1. Explain data cleansing.

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate,
or incomplete data within a dataset. When combining multiple data sources, there are many
opportunities for data to be duplicated or mislabeled.
For example, if you conduct a survey and ask people for their phone numbers, people may
enter their numbers in different formats.
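
As an illustration, here is a minimal sketch (assuming pandas; the names and phone numbers are made-up survey records) of removing duplicates and normalizing inconsistently formatted phone numbers:

```python
# A minimal data-cleaning sketch: drop duplicate rows and keep only the digits
# of each phone number so all entries share one format.
import pandas as pd

df = pd.DataFrame({
    "name":  ["Ana", "Ben", "Ben", "Cara"],                       # made-up respondents
    "phone": ["(555) 123-4567", "555.987.6543", "555.987.6543", "5551112222"],
})

df = df.drop_duplicates()                                          # remove duplicate rows
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)       # strip non-digit characters
print(df)
```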

Q2. Explain what big data analytics is.

Big data analytics is the use of advanced analytic techniques against very large, diverse data
sets that include structured, semi-structured and unstructured data, from different sources, and
in different sizes from terabytes to zettabytes. Big data is a term applied to data sets whose size
or type is beyond the ability of traditional relational databases to capture, manage and process
the data with low latency.

For example: big data comes from sensors, devices, video/audio, networks, log files,
transactional applications, web, and social media — much of it generated in real time and at a
very large scale.

Q3. Explain why SVM is more accurate than logistic regression.

SVM tries to find the "best" margin (the distance between the separating line and the support
vectors) that separates the classes, and this reduces the risk of error on the data. Logistic
regression does not; instead, it can settle on different decision boundaries with different weights
that are only near the optimal point.

Since SVM can handle complex data, there is less room for error compared to logistic
regression. Logistic regression is more sensitive to outliers, so SVM performs better in the
presence of outliers.
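
The following is a rough sketch (assuming scikit-learn; the dataset is synthetic with a little label noise) of how the two classifiers can be compared on the same data; the actual results depend on the data and hyperparameters:

```python
# Compare an SVM and logistic regression on the same noisy synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Two-class data with ~5% flipped labels to mimic noise/outliers
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf").fit(X_train, y_train)                    # maximum-margin classifier
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("SVM accuracy:   ", svm.score(X_test, y_test))
print("LogReg accuracy:", logreg.score(X_test, y_test))
```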

Q4. Describe the various hierarchical models of cluster analysis.

Q5. How can the initial number of clusters for K-means algorithm be estimated?

Use the elbow method: compute the clustering algorithm (e.g., k-means) for different values of k,
for instance by varying k from 1 to 10 clusters. For each k, calculate the total within-cluster sum
of squares (WSS), then plot the curve of WSS against the number of clusters k. The location of
the bend (the "elbow") in the plot is generally taken as an indicator of the appropriate number of
clusters.
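
A minimal sketch of this procedure (assuming scikit-learn and matplotlib; the blob data is synthetic):

```python
# Elbow method: fit k-means for k = 1..10 and plot the within-cluster sum of squares.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # toy data

ks = range(1, 11)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Within-cluster sum of squares (WSS)")
plt.show()   # the bend ("elbow") of the curve suggests a reasonable k
```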

Q6. Explain how Hadoop is related to big data.

Big data and Hadoop are interrelated. In simple words, big data is a massive amount of data
that cannot be stored, processed, or analyzed using traditional methods. Big data consists of
vast volumes of various types of data generated at high speed. To overcome the problem of
storing, processing, and analyzing big data, Hadoop is used. Hadoop is a framework that is
used to store and process big data in a distributed and parallel way. In Hadoop, storing vast
volumes of data becomes easy as the data is distributed across various machines, and the data
is also processed in parallel, which saves time.

Q7. Short note on probability distribution.

A probability distribution gives the possible outcomes of a random event together with how likely
each outcome is. It is defined on the underlying sample space, i.e., the set of possible outcomes
of a random experiment. This set could be a set of real numbers, a set of vectors, or a set of any
other entities.

Types of Probability Distribution

1. Normal or Continuous Probability Distribution:

In this distribution, the set of possible outcomes can take on values in a continuous range; the
normal distribution is the most common example.

2. Binomial or Discrete Probability Distribution:

A distribution is called a discrete probability distribution when the set of possible outcomes is
discrete in nature; the binomial distribution is a common example.
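
A small sketch (assuming NumPy and SciPy) that draws samples from a continuous (normal) and a discrete (binomial) distribution:

```python
# Continuous vs. discrete distributions: normal samples are real numbers,
# binomial samples are whole counts of successes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

normal_samples = rng.normal(loc=0.0, scale=1.0, size=5)    # continuous outcomes
binomial_samples = rng.binomial(n=10, p=0.5, size=5)       # discrete outcomes in 0..10

print("Normal  :", normal_samples)
print("Binomial:", binomial_samples)

# Probability of exactly 6 successes in 10 trials with p = 0.5
print("P(X = 6):", stats.binom.pmf(6, n=10, p=0.5))
```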

Q8. Difference between active learning & reinforcement learning.

ACTIVE LEARNING vs. REINFORCEMENT LEARNING

• Active learning is a special case of machine learning in which a learning algorithm can
interactively query a user to label new data points with the desired outputs. Reinforcement
learning (RL) is an area of machine learning concerned with how software agents ought to take
actions in an environment in order to maximize the notion of cumulative reward.
• Active learning is a technique that is applied in supervised learning settings. Reinforcement
learning is a different paradigm, where we do not have labels and therefore cannot use
supervised learning.
• Active learning is based on the concept of interactively querying for the most informative data
points. Reinforcement learning is based on a rewards-and-punishments mechanism, which can
be both active and passive.

Q9. Write down a few problems that data analysts usually encounter while performing analysis.
• The amount of data being collected.
• Collecting meaningful and real time data.
• Visual representation of data.
• Inaccessible data.
• Poor quality data.
• Budget.
• Scaling data analysis.

Q10. Explain in detail the challenges of conventional systems.

• It cannot work on unstructured data efficiently.
• It is built on top of the relational data model.
• It is batch oriented, and we need to wait for nightly ETL (extract, transform and load) and
transformation jobs to complete before the required insight is obtained.
• Parallelism in a traditional analytics system is achieved through costly hardware like MPP
(Massively Parallel Processing) systems.
• Inadequate support for aggregated summaries of data.

Q11. Difference between data mining & data analytics.

DATA MINING vs. DATA ANALYTICS

• Data mining is a process of extracting useful information, patterns, and trends from raw data.
Data analysis is a method that can be used to investigate, analyze, and demonstrate data to find
useful information.
• Data mining includes the intersection of databases, machine learning, and statistics. Data
analytics requires expertise in computer science, mathematics, statistics, and AI.
• Data mining is also called KDD (Knowledge Discovery in Databases). Data analytics is of
various types: text analytics, predictive analysis, data mining, etc.
• The data mining output gives the data pattern. The data analysis output is a verified hypothesis
or insights based on the data.
• The best example of a data mining application is in the e-commerce sector. The best example
of data analysis is the study of the census.

Q12. Short note on two-way ANOVA.

A two-way ANOVA is an extension of the one-way ANOVA. A two-way ANOVA is used to
estimate how the mean of a quantitative variable changes according to the levels of two
categorical variables. Use a two-way ANOVA when you want to know how two independent
variables, in combination, affect a dependent variable. ANOVA is a statistical method that stands
for analysis of variance; it was developed by Ronald Fisher in 1918.
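
A minimal two-way ANOVA sketch (assuming pandas and statsmodels; the crop_yield, fertilizer, and water columns are made-up example variables):

```python
# Two-way ANOVA: how two categorical factors (and their interaction) affect a
# quantitative outcome.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "fertilizer": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "water":      ["low", "high", "low", "high", "low", "high", "low", "high"],
    "crop_yield": [20, 25, 22, 30, 19, 26, 23, 31],
})

model = ols("crop_yield ~ C(fertilizer) * C(water)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # ANOVA table with both factors and their interaction
```
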
Q13. What is the role of the activation function in a neural network?

The primary role of the Activation Function is to transform the summed weighted input from
the node into an output value to be fed to the next hidden layer or as output.
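
For instance, a small sketch (assuming NumPy) of two common activation functions applied to a node's summed weighted input:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)        # passes positive values, zeroes out negatives

def sigmoid(x):
    return 1 / (1 + np.exp(-x))    # squashes values into the range (0, 1)

z = np.array([-2.0, 0.0, 3.0])     # example summed weighted inputs
print("ReLU:   ", relu(z))
print("Sigmoid:", sigmoid(z))
```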

Q14. Difference between a cluster & a centroid.

Q15. Write down the applications of association rules.

• Items purchased on a credit card, such as rental cars and hotel rooms, give
insight into the next products that customers are likely to buy.
• Optional services purchased by telecommunication users (call waiting, call
forwarding, DSL, speed call, etc.) help decide how to bundle these functions to
maximize revenue.
• Banking services used by retail customers (money market accounts, CDs,
investment services, car loans, etc.) help recognize customers likely to need
other services.
• Unusual groups of insurance claims can be an indication of fraud and can
trigger further investigation.
• Medical patient histories can indicate likely complications based on a definite
set of treatments.

Q16. Describe Overfitting & Underfitting.

Overfitting: Overfitting occurs when our machine learning model tries to cover all the data points,
or more than the required data points, present in the given dataset. Because of this, the model
starts capturing the noise and inaccurate values present in the dataset, and all these factors
reduce the efficiency and accuracy of the model. The overfitted model has low bias and high
variance. It occurs mainly in supervised learning.

Underfitting: Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. The model is not able to learn enough from the training data, and
hence it reduces the accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.
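
A rough sketch (assuming scikit-learn and NumPy; the data is synthetic) that contrasts underfitting and overfitting by fitting polynomials of different degrees and comparing train/test errors:

```python
# Degree 1 underfits (high bias); degree 15 overfits (low train error, high test error).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)        # noisy non-linear data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          "train MSE:", round(mean_squared_error(y_tr, model.predict(X_tr)), 3),
          "test MSE:",  round(mean_squared_error(y_te, model.predict(X_te)), 3))
```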


Q17. What is clustering?

"A way of grouping the data points into different clusters, consisting of similar data
points. The objects with the possible similarities remain in a group that has less or
no similarities with another group." It is an unsupervised learning method, hence no
supervision is provided to the algorithm, and it deals with the unlabeled dataset. The
clustering technique is commonly used for statistical data analysis. Some most
common uses of this technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
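
A minimal clustering sketch (assuming scikit-learn; the blob data is synthetic) showing the cluster assigned to each point and the cluster centroids (relevant to Q14 as well):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)   # unlabeled toy data

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster labels :", km.labels_[:10])       # cluster each point belongs to
print("Cluster centers:", km.cluster_centers_)   # the centroid of each cluster
```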

Q18. What is an outlier?

An outlier is a data point that differs significantly from the other observations in a dataset. Outliers
are an important part of a dataset; they can hold useful information about our data. Outliers can
give helpful insights into the data you're studying, and they can have an effect on statistical results.
They can also help you discover inconsistencies and detect errors in your statistical processes.
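
A small sketch (assuming NumPy; the numbers are made up) of flagging outliers with the common 1.5 × IQR rule:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95])              # 95 is an obvious outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr           # typical "fences"

print("Outliers:", data[(data < lower) | (data > upper)])
```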

Q19. Different types of sampling techniques used by data analysts?

1. Simple Random Sampling

In simple random sampling, the researcher selects the participants randomly.

2. Systematic Sampling

In systematic sampling, every member of the population is given a number, as in simple random
sampling, but instead of selecting members at random, they are chosen at regular intervals (for
example, every 10th member); see the sketch after this list.

3. Stratified Sampling

In stratified sampling, the population is subdivided into subgroups, called strata, based
on some characteristics (age, gender, income, etc.). After forming a subgroup, you can
then use random or systematic sampling to select a sample for each subgroup. This
method allows you to draw more precise conclusions because it ensures that every
subgroup is properly represented.

4. Cluster Sampling

In cluster sampling, the population is divided into subgroups, but each subgroup has
similar characteristics to the whole sample.
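
A minimal sketch (assuming pandas; the population of 100 ids is made up) of the first two techniques, simple random and systematic sampling:

```python
import pandas as pd

population = pd.DataFrame({"id": range(100)})

simple_random = population.sample(n=10, random_state=0)   # 10 members chosen at random
systematic = population.iloc[::10]                        # every 10th member

print(simple_random["id"].tolist())
print(systematic["id"].tolist())
```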

Q20. Describe data wrangling in data analytics.

Data wrangling is the process of removing errors and combining complex data sets to make them more
accessible and easier to analyze. Due to the rapid expansion of the amount of data and data sources
available today, storing and organizing large quantities of data for analysis is becoming increasingly
necessary. The 6 steps to perform data wrangling are as follows (a small sketch follows the list):

• Step 1: Data Discovery.
• Step 2: Data Structuring.
• Step 3: Data Cleaning.
• Step 4: Data Enriching.
• Step 5: Data Validating.
• Step 6: Data Publishing.
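
A small sketch (assuming pandas; the orders/customers tables are made up) touching a few of these steps, namely structuring, cleaning, enriching, and validating:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2, None], "amount": [100, 250, 250, 80]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})

orders = orders.drop_duplicates().dropna(subset=["customer_id"])   # cleaning
orders["customer_id"] = orders["customer_id"].astype(int)          # structuring (fix the dtype)
orders = orders.merge(customers, on="customer_id", how="left")     # enriching with region data
assert orders["amount"].ge(0).all()                                # simple validation check
print(orders)                                                      # ready for publishing/analysis
```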

Q21. Explain the term normal distribution.

The normal distribution is the proper term for a probability bell curve. In a standard normal
distribution, the mean is zero and the standard deviation is 1. A normal distribution has zero skew
and a kurtosis of 3. Normal distributions are symmetrical, but not all symmetrical distributions are
normal. The normal distribution, also known as the Gaussian distribution, is a probability
distribution that is symmetric about the mean.
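
For reference, the probability density function of a normal distribution with mean μ and standard deviation σ is:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
```
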
Q22. Types of hypotheses in hypothesis testing.

1. Alternative Hypothesis

The alternative hypothesis (H1), or the research hypothesis, states that there is a relationship
between two variables (where one variable affects the other). The alternative hypothesis is the
main driving force for hypothesis testing.

2. Null Hypothesis

The null hypothesis (H0) aims to nullify the alternative hypothesis by implying that there exists no
relation between the two variables in statistics. It states that the effect of one variable on the
other is solely due to chance and no empirical cause lies behind it.

3. Non-Directional Hypothesis

The non-directional hypothesis states that the relation between the two variables has no direction.

4. Directional Hypothesis

The directional hypothesis, on the other hand, asserts the direction of the effect of the relationship
that exists between the two variables.

5. Statistical Hypothesis

A statistical hypothesis is a hypothesis that can be verified to be plausible on the basis of
statistics.
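
A minimal sketch (assuming SciPy; the two groups are made-up sample data) of a non-directional test where H0 states the group means are equal and H1 states they differ:

```python
from scipy import stats

group_a = [12.1, 11.8, 12.5, 12.0, 11.9]
group_b = [12.9, 13.1, 12.7, 13.3, 12.8]

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # two-sided (non-directional) t-test
print("t =", round(t_stat, 3), "p =", round(p_value, 4))
# A small p-value (e.g. < 0.05) is evidence against H0 in favour of H1.
```
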
Q23. Characteristics of Big Data.

5 V's of Big Data

o Volume: The name big data itself is related to enormous size. Big data is the vast volume of
data generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.

o Variety: Big data can be structured, unstructured, and semi-structured, and is collected from
different sources.

o Veracity: Veracity means how reliable the data is. There are many ways to filter or translate
the data, and veracity is about being able to handle and manage data efficiently. Big data is also
essential in business development. For example, Facebook posts with hashtags.

o Value: Value is an essential characteristic of big data. It is not just any data that we process or
store; it is the valuable and reliable data that we store, process, and also analyze.

o Velocity: Velocity refers to the speed at which data is created in real time. It covers the speed
of incoming data streams, the rate of change, and bursts of activity. The primary aspect of big
data is to provide the demanded data rapidly.

Q24. Type I & Type II error comparison.


Q25. Explain EDA.

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate
data sets and summarize their main characteristics, often employing data visualization
methods. It helps determine how best to manipulate data sources to get the answers
you need, making it easier for data scientists to discover patterns, spot anomalies, test
a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis
testing task, and it provides a better understanding of data set variables and the relationships
between them. It can also help determine whether the statistical techniques you are considering
for data analysis are appropriate.

EDA techniques continue to be a widely used method in the data discovery process today.
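
A minimal EDA sketch (assuming pandas, seaborn, and matplotlib; seaborn downloads its small bundled "tips" example dataset the first time it is used):

```python
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("tips")    # small example dataset shipped with seaborn

print(df.head())                 # peek at the raw records
print(df.describe())             # summary statistics of the numeric columns
print(df.isna().sum())           # check for missing values

sns.pairplot(df)                 # pairwise plots to spot patterns and anomalies
plt.show()
```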
