Data Analytics 3 Marks Questions
5 MARKS QUESTIONS
Q5. Describe in detail the role of statistical models in data analytics.
Q8. Differentiate between overfitting & underfitting and how to deal with them.
Q10. Explain why SVM is more accurate than logistic regression, with an example.
Q12. Justify the best practices in Big Data analytics.
3 MARKS QUESTIONS
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate,
or incomplete data within a dataset. When combining multiple data sources, there are many
opportunities for data to be duplicated or mislabeled.
For example, if you conduct a survey and ask people for their phone numbers, people may
enter their numbers in different formats.
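As a minimal sketch of this kind of cleaning (the survey responses and the 10-digit phone format are assumptions for illustration):

```python
import re

# Normalise phone numbers entered in different formats and drop
# duplicates; assumes 10-digit numbers for illustration.
def clean_phone(raw):
    digits = re.sub(r"\D", "", raw)      # keep digits only
    return digits[-10:] if len(digits) >= 10 else None

survey = ["(555) 123-4567", "555-123-4567", "5551234567", "n/a"]
cleaned = {clean_phone(p) for p in survey if clean_phone(p)}
print(cleaned)   # three formats collapse to one record; "n/a" is dropped
```

After normalisation, the three different formats are recognised as the same number, which is exactly the duplication problem described above.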
Big data analytics is the use of advanced analytic techniques against very large, diverse data
sets that include structured, semi-structured and unstructured data, from different sources, and
in different sizes from terabytes to zettabytes. Big data is a term applied to data sets whose size
or type is beyond the ability of traditional relational databases to capture, manage and process
the data with low latency.
For example: big data comes from sensors, devices, video/audio, networks, log files,
transactional applications, web, and social media — much of it generated in real time and at a
very large scale.
SVM tries to find the "best" margin (the distance between the separating line and the
support vectors) between the classes, and this reduces the risk of error on the data.
Logistic regression does not; it can produce different decision boundaries with different
weights that are merely near the optimal point.
Since SVM can handle complex data, there is less room for error compared to logistic
regression. Logistic regression is also more sensitive to outliers, so SVM performs better
in the presence of outliers.
Q5. How can the initial number of clusters for K-means algorithm be estimated?
Compute the clustering algorithm (e.g., k-means) for different values of k, for instance by
varying k from 1 to 10 clusters. For each k, calculate the total within-cluster sum of
squares (WSS). Plot the curve of WSS against the number of clusters k; the bend (the
"elbow") in the curve indicates a suitable number of clusters.
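The elbow approach above can be sketched with a basic pure-Python k-means (Lloyd's algorithm); the 1-D data and the seed are made up for illustration:

```python
import random

def kmeans_wss(points, k, iters=20, seed=0):
    """Run basic Lloyd's k-means on 1-D data and return the
    total within-cluster sum of squares (WSS) for this k."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[j].append(p)
        # Recompute centroids as cluster means (keep old if a cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum((p - centroids[j]) ** 2
               for j in range(k) for p in clusters[j])

# Two well-separated groups: WSS drops sharply at k = 2 (the "elbow").
data = [1.0, 1.2, 0.9, 1.1, 9.0, 9.2, 8.8, 9.1]
for k in (1, 2, 3):
    print(k, round(kmeans_wss(data, k), 3))
```

Plotting these WSS values against k would show the sharp drop from k = 1 to k = 2 and only marginal improvement after that, which is how the elbow is read.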
Big data and Hadoop are interrelated. In simple words, big data is a massive amount of data
that cannot be stored, processed, or analyzed using traditional methods. Big data
consists of vast volumes of various types of data generated at high speed. To
overcome the problem of storing, processing, and analyzing big data, Hadoop is used.
Hadoop is a framework used to store and process big data in a distributed and
parallel way. In Hadoop, storing vast volumes of data becomes easy because the data is
distributed across various machines, and the data is also processed in parallel, which saves
time.
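Hadoop's processing model (MapReduce) can be illustrated with a toy, single-machine sketch; the splits below stand in for data blocks stored on different nodes, and a real cluster would run the map phase on separate machines:

```python
from collections import Counter
from functools import reduce

# Toy sketch of the MapReduce idea behind Hadoop: data is split across
# "nodes", each split is processed independently (map), and the partial
# results are merged (reduce).
splits = [
    "big data needs distributed storage",
    "hadoop stores big data across machines",
    "hadoop processes data in parallel",
]

def map_phase(split):
    return Counter(split.split())              # per-split word counts

partials = [map_phase(s) for s in splits]      # runs independently per node
total = reduce(lambda a, b: a + b, partials)   # merge the partial counts
print(total["data"], total["hadoop"])          # 3 2
```

The key point is that each `map_phase` call touches only its own split, so the work parallelises naturally across machines; only the small partial counts need to be combined.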
A probability distribution gives the probabilities of the possible outcomes of a random
event. It is defined on the underlying sample space, the set of possible outcomes of a
random experiment; this set could be a set of real numbers, a set of vectors, or a set of
other entities. A distribution is called a discrete probability distribution when the set of
outcomes is discrete in nature.
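A fair six-sided die is a simple discrete distribution; its probabilities sum to 1, and its mean and variance follow directly from the definition:

```python
from fractions import Fraction

# Discrete probability distribution of a fair six-sided die:
# each outcome has probability 1/6, and the probabilities sum to 1.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

assert sum(pmf.values()) == 1
mean = sum(face * p for face, p in pmf.items())                # expected value
var = sum((face - mean) ** 2 * p for face, p in pmf.items())   # variance
print(mean, var)   # 7/2 35/12
```

Using exact fractions makes it easy to verify the textbook values E[X] = 7/2 and Var(X) = 35/12.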
Q9. Write down a few problems that data analysts usually encounter while performing analysis.
• The amount of data being collected.
• Collecting meaningful and real time data.
• Visual representation of data.
• Inaccessible data.
• Poor quality data.
• Budget constraints.
• Scaling data analysis.
The primary role of the Activation Function is to transform the summed weighted input from
the node into an output value to be fed to the next hidden layer or as output.
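A few common activation functions, sketched in Python (the weights, inputs, and bias below are made-up values for illustration):

```python
import math

# Common activation functions that transform a node's summed
# weighted input into its output value.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # squashes to (0, 1)

def relu(x):
    return max(0.0, x)                  # zero for negative inputs

def tanh(x):
    return math.tanh(x)                 # squashes to (-1, 1)

# A node with weights w, inputs a, and bias b outputs f(w . a + b).
w, a, b = [0.5, -0.3], [1.0, 2.0], 0.1
z = sum(wi * ai for wi, ai in zip(w, a)) + b   # summed weighted input
print(sigmoid(z), relu(z), tanh(z))
```

The choice of f changes the node's output range, which is why different activations suit different layers (e.g., sigmoid for probabilities, ReLU in hidden layers).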
• Items purchased on a credit card, such as rental cars and hotel rooms,
provide insight into the products that customers are likely to buy next.
• Optional services purchased by telecom users (call waiting, call
forwarding, DSL, speed dial, etc.) help decide how to bundle these
services to maximize revenue.
• Banking services used by retail customers (money market accounts, CDs,
investment services, car loans, etc.) help identify customers likely to need
other services.
• Unusual groups of insurance claims can be an indication of fraud and can
trigger further investigation.
• Medical patient histories can suggest likely complications based on
certain sets of treatments.
Overfitting: Overfitting occurs when our machine learning model tries to cover all the data points,
or more than the required data points, present in the given dataset. Because of this, the model
starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce
the efficiency and accuracy of the model. An overfitted model has low bias and high variance.
Overfitting occurs mainly in supervised learning.
Underfitting: Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. The model is not able to learn enough from the training data,
which reduces accuracy and produces unreliable predictions.
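A toy sketch of both failure modes (the data-generating line y = 2x, the "memorising" model, and the noise level are illustrative assumptions): an underfit model ignores the input, an overfit model memorises the training points, and a least-squares line sits between them:

```python
import random

# Noisy linear data y = 2x + noise; test points fall between training points.
rng = random.Random(1)
train = [(x, 2 * x + rng.gauss(0, 1)) for x in range(20)]
test  = [(x + 0.5, 2 * (x + 0.5) + rng.gauss(0, 1)) for x in range(20)]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Underfit: ignores x entirely (high bias).
mean_y = sum(y for _, y in train) / len(train)
underfit = lambda x: mean_y

# Overfit: memorises training points, falls back to the nearest one (high variance).
table = dict(train)
overfit = lambda x: table.get(x, table[min(table, key=lambda t: abs(t - x))])

# Reasonable fit: least-squares line through the training data.
n = len(train); sx = sum(x for x, _ in train); sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train); sxy = sum(x * y for x, y in train)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
linear = lambda x: slope * x + intercept

print("train MSE:", mse(underfit, train), mse(overfit, train), mse(linear, train))
print("test  MSE:", mse(underfit, test),  mse(overfit, test),  mse(linear, test))
```

The memorising model scores a perfect zero error on the training set but degrades on unseen points, while the mean-only model is poor on both; the linear fit generalises best, which is the balance the answer above describes.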
"A way of grouping the data points into different clusters, consisting of similar data
points. The objects with the possible similarities remain in a group that has less or
no similarities with another group." It is an unsupervised learning method, hence no
supervision is provided to the algorithm, and it deals with the unlabeled dataset. The
clustering technique is commonly used for statistical data analysis. Some most
common uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Outliers are an important part of a dataset. They can hold useful information about your data:
outliers can give helpful insights into the data you're studying, and they can affect statistical
results. This can help you discover inconsistencies and detect errors in your statistical processes.
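One common way to flag outliers is Tukey's IQR rule, sketched below (the data values are made up; 95 plays the role of a likely data-entry error):

```python
import statistics

def iqr_outliers(values):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 95]   # 95 is a likely data-entry error
print(iqr_outliers(data))
```

Whether a flagged point is an error to remove or a genuine extreme to investigate is exactly the judgment call the paragraph above describes.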
2. Systematic Sampling
In systematic sampling, members of the population are selected at regular intervals,
for example every kth individual from an ordered list.
3. Stratified Sampling
In stratified sampling, the population is subdivided into subgroups, called strata, based
on some characteristics (age, gender, income, etc.). After forming a subgroup, you can
then use random or systematic sampling to select a sample for each subgroup. This
method allows you to draw more precise conclusions because it ensures that every
subgroup is properly represented.
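The stratified approach can be sketched as follows (the population, the strata, and the 10% sampling fraction are hypothetical):

```python
import random

# Stratified sampling: split the population into strata, then draw a
# simple random sample from each stratum.
def stratified_sample(population, strata_key, frac, seed=0):
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(strata_key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(frac * len(members)))   # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of (person_id, age_group) records:
# 60 "under30" and 30 "over30" members.
people = [(i, "under30" if i % 3 else "over30") for i in range(90)]
s = stratified_sample(people, strata_key=lambda p: p[1], frac=0.1)
print(len(s))   # 6 + 3 = 9, both subgroups represented
```

Because each stratum is sampled separately, the smaller "over30" group cannot be missed by chance, which is the precision advantage described above.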
4. Cluster Sampling
In cluster sampling, the population is divided into subgroups (clusters), where each
cluster is intended to have characteristics similar to the whole population; entire
clusters are then randomly selected.
Data wrangling is the process of removing errors and combining complex data sets to make them more
accessible and easier to analyze. Due to the rapid expansion of the amount of data and data sources
available today, storing and organizing large quantities of data for analysis is becoming increasingly
necessary.
The six steps commonly used to perform data wrangling are: 1. Discovery, 2. Structuring,
3. Cleaning, 4. Enriching, 5. Validating, and 6. Publishing.
The normal distribution is the proper term for a probability bell curve. In the standard
normal distribution, the mean is 0 and the standard deviation is 1; it has zero skew and a
kurtosis of 3. Normal distributions are symmetrical, but not all symmetrical distributions
are normal. The normal distribution, also known as the Gaussian distribution, is a
probability distribution that is symmetric about the mean.
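The standard normal density f(x) = exp(-x²/2) / √(2π) can be written and its symmetry checked directly:

```python
import math

# Standard normal pdf: f(x) = exp(-x**2 / 2) / sqrt(2 * pi)
def std_normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Symmetric about the mean (0 for the standard normal):
print(std_normal_pdf(1.0) == std_normal_pdf(-1.0))   # True
# The peak sits at the mean:
print(round(std_normal_pdf(0.0), 4))                 # 0.3989
```

The equality at ±1 illustrates the symmetry about the mean stated above, and the peak value 1/√(2π) ≈ 0.3989 occurs exactly at the mean.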
Q22. Types of hypotheses in hypothesis testing.
1. Null Hypothesis (H0)
2. Alternative Hypothesis (H1)
o Variety: Big Data can be structured, unstructured, or semi-structured, and is
collected from different sources.
o Veracity: Veracity refers to how reliable the data is. It covers the ability to filter,
translate, and manage data of uncertain quality efficiently.
o Value: Value is an essential characteristic of big data. It is not just the data that we
process or store; it is the valuable and reliable data that we store, process, and
also analyze.
o Velocity: Velocity refers to the speed at which data is created, often in real time. It covers
the speed of incoming data streams, rates of change, and activity bursts. A
primary aspect of Big Data is providing demanded data rapidly.
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate
data sets and summarize their main characteristics, often employing data visualization
methods. It helps determine how best to manipulate data sources to get the answers
you need, making it easier for data scientists to discover patterns, spot anomalies, test
a hypothesis, or check assumptions.
EDA is primarily used to see what data can reveal beyond the formal modeling or
hypothesis-testing task, and it provides a better understanding of data set
variables and the relationships between them. It can also help determine whether the
statistical techniques you are considering for data analysis are appropriate.
EDA techniques continue to be a widely used method in the data discovery process today.
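A minimal first EDA pass can be done with summary statistics alone (the ages below are made-up data for illustration):

```python
import statistics

# Summary statistics reveal the shape of a variable before any modelling.
ages = [23, 25, 31, 35, 35, 41, 47, 52, 58, 99]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": round(statistics.stdev(ages), 2),
    "min": min(ages),
    "max": max(ages),
}
print(summary)
# The gap between the mean and the median hints at right skew, and the
# maximum (99) is worth checking as a possible anomaly.
```

Even this small pass surfaces the two things EDA is for: a pattern (skew) and a candidate anomaly (the 99), each prompting a follow-up question before formal modelling.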