DS Assignment No 2
Q. No. | Questions | CO | BT
1. Unit I
A. Explain normal distribution in detail.
B. Explain types of structured data with examples.
C. What do you mean by robust estimates? Which estimates of location are robust in
nature? Explain.
D. Define Correlation. Explain different types of correlation. Also describe Correlation
coefficient.
E. Describe standard deviation, Variance and Interquartile Range with example.
F. What is an outlier? Explain in brief with example.
G. Calculate the mean, median, and standard deviation for the given set of data
{1, 2, 2, 3, 5}
2. Unit II
A. Explain correlation and its types.
B. Explain standard deviation and interquartile range and write python code to
compute standard deviation and interquartile range.
C. Explain the process of Bootstrapping with example.
D. Define Bias. Explain selection Bias.
E. Explain Normal Distribution and binomial distribution in detail.
F. Explain confidence intervals in detail
G. Explain sampling distribution of statistics in detail.
H. Illustrate the central limit theorem with a neat diagram.
3. Unit III
A. Define hypothesis. Explain its types.
B. Explain A/B testing with examples.
C. Explain statistical significance and p-Values with an example.
D. Explain ANOVA test in detail.
E. Explain chi-square test with example.
4. Unit IV
5. Unit V
6. The difference between the expected sample value and the estimated value of the
parameter is called as?
a) bias
b) error
c) contradiction
d) difference
Answer: a
Explanation: The difference between the expected sample value and the estimated value of
the parameter is called bias. A sample statistic used to estimate a parameter is unbiased if
the mean of its sampling distribution is exactly equal to the true value of the parameter
being estimated.
7. Which of the following is a subset of a population?
a) distribution
b) sample
c) data
d) set
Answer: b
8. Any population which we want to study is referred to as?
a) standard population
b) final population
c) infinite population
d) target population
Answer: d
For example, if one of the columns in a dataframe has ordinal data, we will have to
preprocess it; in Python, the scikit-learn package offers an OrdinalEncoder to deal with
ordinal data.
The next step is to dive deeper into structured data and how we can use third party packages
and libraries to manipulate such structures. We have mainly two types of structures or data
storage models:
1. Rectangular
2. Non-Rectangular
Rectangular Data
Mostly all analyses in data science are done with a rectangular two-dimensional data object
like a dataframe, spreadsheet, CSV file, or a database table.
This mainly consists of rows that represent records (observations) and
columns (features/variables). A dataframe, on the other hand, is a special data structure with a
tabular format that offers super-efficient operations to manipulate the data.
Dataframes are the most commonly used data structures and it’s important to cover a few
definitions here:
Data frame
Rectangular data structure (like a spreadsheet) for efficient manipulation and application of
statistical and machine learning models.
Feature
A column within a dataframe is commonly referred to as a feature.
Synonyms — attribute, input, predictor, variable
Outcome
Many data science projects involve predicting an outcome — often a yes/no outcome.
Synonyms — dependent variable, response, target, output
Records
A row within a dataframe is commonly referred to as a record.
Synonyms — case, example, instance, observation, pattern, sample
Example:
Dataframe of a cricket match data
Relational database tables have one or more columns designated as an index, essentially a
row number. This can vastly improve the efficiency of certain database queries. In a pandas
dataframe, an automatic integer index is created based on the order of the rows. In pandas, it
is also possible to set multilevel/hierarchical indexes to improve the efficiency of certain
operations.
The parameter σ defines the shape (spread) of the distribution. An example of the PDF
of a normal distribution with μ = 6 and σ = 2 is shown in the figure.
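As a quick sketch (not part of the original text), the density of this distribution can be evaluated with scipy.stats.norm; the values μ = 6 and σ = 2 are reused from the example above:

```python
from scipy import stats

# Normal distribution with mu = 6, sigma = 2 (the values from the text)
dist = stats.norm(loc=6, scale=2)

peak = dist.pdf(6)   # density is highest at the mean: 1/(sigma*sqrt(2*pi))
tail = dist.pdf(12)  # three standard deviations above the mean, much smaller
```

Because the normal distribution is symmetric about μ, exactly half the probability mass lies below the mean.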
Binomial Distribution
In binomial distribution each trial has two possible outcomes with definite probabilities.
(buy/don’t buy, click/don’t click, survive/die, and so on)
For example, flipping a coin 10 times is a binomial experiment with 10 trials, each trial
having two possible outcomes (heads or tails); see Figure 2-14. Such yes/no or 0/1 outcomes
are termed binary outcomes, and they need not have 50/50 probabilities.
Any probabilities that sum to 1.0 are possible. It is conventional in statistics to term the “1”
outcome the success outcome; it is also common practice to assign “1” to the more rare
outcome.
The binomial distribution is the frequency distribution of the number of successes (x) in a
given number of trials (n) with specified probability (p) of success in each trial. There is a
family of binomial distributions, depending on the values of n and p. The binomial
distribution would answer a question like:
Ex: If the probability of a click converting to a sale is 0.02, what is the probability of
observing 0 sales in 200 clicks?
To determine the probability of x or fewer successes in n trials, R provides the
function pbinom():
pbinom(2, 5, 0.1)
This would return 0.9914, the probability of observing two or fewer successes in five
trials, where the probability of success for each trial is 0.1.
The scipy.stats module implements a large variety of statistical distributions. For the binomial
distribution in Python, use the functions stats.binom.pmf and stats.binom.cdf:
from scipy import stats
stats.binom.pmf(2, n=5, p=0.1)
stats.binom.cdf(2, n=5, p=0.1)
The mean of a binomial distribution is n × p; its variance is n × p(1 − p).
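A minimal sketch tying the scipy calls above to the n × p mean (the parameters n = 5, p = 0.1 are reused from the example):

```python
from scipy import stats

n, p = 5, 0.1

pmf_2 = stats.binom.pmf(2, n=n, p=p)  # P(X = 2)
cdf_2 = stats.binom.cdf(2, n=n, p=p)  # P(X <= 2), the 0.9914 quoted above
mean = stats.binom.mean(n, p)         # n * p = 0.5
```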
Q. Bootstrapping
An easy and effective way to estimate the sampling distribution of a statistic, or of model parameters,
is to draw additional samples, with replacement, from the sample itself and recalculate the statistic or
model for each resample. This procedure is called the bootstrap, and it does not necessarily involve
any assumptions about the data or the sample statistic being normally distributed.
Bootstrap sample
We simply replace each observation after each draw; that is, we sample with replacement. In this way
we effectively create an infinite population in which the probability of an element being drawn
remains unchanged from draw to draw. The algorithm for a bootstrap resampling of the mean, for a
sample of size n, is as follows:
a. Draw a sample value, record it, and then replace it.
b. Repeat n times.
c. Record the mean of the n resampled values.
d. Repeat steps a–c R times.
e. Use the R results to calculate their standard deviation (this estimates the standard
error of the sample mean), produce a histogram or boxplot, or find a confidence interval.
The major Python packages don't provide an implementation of the bootstrap approach, but it
can be implemented using the scikit-learn function resample (here loans_income is a pandas
Series of incomes):
import pandas as pd
from sklearn.utils import resample

results = []
for nrepeat in range(1000):
    sample = resample(loans_income)  # resample with replacement, same size
    results.append(sample.median())
results = pd.Series(results)
print('Bootstrap Statistics:')
print(f'original: {loans_income.median()}')
print(f'bias: {results.mean() - loans_income.median()}')
print(f'std. error: {results.std()}')
The bootstrap can be used with multivariate data, where the rows are sampled as units.
A model might then be run on the bootstrapped data, for example, to estimate the stability (variability)
of model parameters, or to improve predictive power. With classification and regression trees (also
called decision trees), running multiple trees on bootstrap samples and then averaging their
predictions (or, for classification, taking a majority vote) is known as bagging.
Correlation
Exploratory data analysis in many modeling projects (whether in data science or in research) involves
examining correlation among predictors, and between predictors and a target variable.
Variables X and Y (each with measured data) are said to be positively correlated if high values of X
go with high values of Y, and low values of X go with low values of Y.
If high values of X go with low values of Y, and vice versa, the variables are negatively correlated.
Correlation coefficient
A metric that measures the extent to which numeric variables are associated with one another (ranges from –1 to
+1).
Correlation matrix
A table where the variables are shown on both rows and columns, and the cell values are the correlations
between the variables.
Consider these two vectors, perfectly correlated in the sense that each goes from low to high:
v1 = {1, 2, 3} and v2 = {4, 5, 6}.
The vector sum of products is 1·4 + 2·5 + 3·6 = 32. Now try shuffling one of them and recalculating:
the vector sum of products will never be higher than 32. So this sum of products could be used as a
metric; that is, the observed sum of 32 could be compared to lots of random shufflings (in fact, this
idea relates to a resampling-based estimate).
More useful is a standardized variant: the correlation coefficient, which gives an estimate of the
correlation between two variables that always lies on the same scale.
To compute Pearson’s correlation coefficient, we multiply deviations from the mean for variable 1
times those for variable 2, and divide by the product of the standard deviations:
The correlation coefficient always lies between +1 (perfect positive correlation) and –1 (perfect
negative correlation); 0 indicates no correlation.
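The definition above can be sketched directly in Python; the data here is made up, and the manual calculation is cross-checked against numpy's built-in np.corrcoef:

```python
import numpy as np

# Made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson's r: sum of products of deviations from the mean, divided by
# (n - 1) times the product of the standard deviations (ddof=1 throughout)
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    (len(x) - 1) * x.std(ddof=1) * y.std(ddof=1))

# Cross-check against numpy's built-in correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]
```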
Variables can have an association that is not linear, in which case the correlation coefficient may not
be a useful metric. The relationship between tax rates and revenue raised is an example: as tax rates
increase from zero, the revenue raised also increases. However, once tax rates reach a high level and
approach 100%, tax avoidance increases and tax revenue actually declines.
The table shown above, called a correlation matrix, shows the correlation between the daily returns for
telecommunication stocks from July 2012 through June 2015.
From the table, you can see that Verizon (VZ) and ATT (T) have the highest correlation. Level 3
(LVLT), which is an infrastructure company, has the lowest correlation with the others. Note the
diagonal of 1s (the correlation of a stock with itself is 1) and the redundancy of the information above
and below the diagonal.
A correlation matrix like this can be visualized using seaborn's heatmap function. The accompanying
source code repository includes Python code to generate a more comprehensive visualization.
(variability, also referred to as dispersion, measures whether the data values are tightly clustered or
spread out. )
Variance
The sum of squared deviations from the mean divided by n − 1, where n is the number of data values.
Synonym: mean squared error
Standard deviation
The square root of the variance.
The best-known estimates of variability are the variance and the standard deviation, which are based
on squared deviations. The variance is an average of the squared deviations, and the standard
deviation is the square root of the variance.
Interquartile range
A common measurement of variability is the difference between the 25th percentile and the 75th
percentile, called the interquartile range (or IQR).
The most widely used estimates of variation are based on the differences, or deviations, between the
estimate of location and the observed data.
For a set of data {1, 4, 4}, the mean is 3 and the median is 4. The deviations from the mean are the
differences: 1 – 3 = –2, 4 – 3 = 1, 4 – 3 = 1. These deviations tell us how dispersed the data is around
the central value.
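The {1, 4, 4} example can be verified in a few lines (a sketch, using numpy's ddof=1 to get the n − 1 denominator described above):

```python
import numpy as np

data = np.array([1, 4, 4])

deviations = data - data.mean()   # [-2., 1., 1.], as computed above
variance = np.var(data, ddof=1)   # sum of squared deviations / (n - 1)
std_dev = np.std(data, ddof=1)    # square root of the variance
```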
The pandas data frame provides methods for calculating the standard deviation and quantiles. Using the
quantiles, we can easily determine the IQR. For the robust MAD, we use the function robust.scale.mad
from the statsmodels package (state here is a data frame with a Population column):
state['Population'].std()
state['Population'].quantile(0.75) - state['Population'].quantile(0.25)
robust.scale.mad(state['Population'])
Here is a simple example: {3,1,5,3,6,7,2,9}. We sort these to get {1,2,3,3,5,6,7,9}. The 25th
percentile is at 2.5, and the 75th percentile is at 6.5, so the interquartile range is 6.5 – 2.5 = 4.
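A sketch of the same calculation in numpy; note that percentile definitions vary, and the method='midpoint' option (available in newer numpy versions) is the one that reproduces the hand calculation above:

```python
import numpy as np

data = [3, 1, 5, 3, 6, 7, 2, 9]

# 'midpoint' interpolation matches the hand calculation: 25th percentile
# 2.5, 75th percentile 6.5; numpy's default 'linear' method would differ
q25 = np.percentile(data, 25, method='midpoint')
q75 = np.percentile(data, 75, method='midpoint')
iqr = q75 - q25  # 6.5 - 2.5 = 4.0
```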
An A/B test is an experiment with two groups to establish which of two treatments, products,
procedures, or the like is superior.
Often one of the two treatments is the standard existing treatment, or no treatment. If a standard (or
no) treatment is used, it is called the control. A typical hypothesis is that a new treatment is better
than the control.
A/B tests are common in web design and marketing, since results are so readily measured. Some
examples of A/B testing include:
1. Testing two soil treatments to determine which produces better seed germination
2. Testing two therapies to determine which suppresses cancer more effectively
3. Testing two prices to determine which yields more net profit
A proper A/B test has subjects that can be assigned to one treatment or another. The subject might be
a person, a plant seed, a web visitor; the key is that the subject is exposed to the treatment.
Ideally, subjects are randomized (assigned randomly) to treatments. Any difference between the
treatment groups is due to one of two things:
• The effect of the different treatments
• Luck of the draw in which subjects are assigned to which treatments (i.e., the random assignment
may have resulted in the naturally better-performing subjects being concentrated in A or B)
The most common metric in data science is a binary variable: click or no-click, buy or don’t buy,
fraud or no fraud, and so on. Those results would be summed up in a 2×2 table.
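As an illustration (the counts here are made up, not from the text), such a 2×2 table of binary outcomes can be tested for a significant difference with scipy's chi-square test:

```python
import numpy as np
from scipy import stats

# Hypothetical click counts for two page designs
#                  click  no-click
table = np.array([[  30,     970],   # treatment A
                  [  45,     955]])  # treatment B

chi2, p_value, dof, expected = stats.chi2_contingency(table)
# dof is 1 for a 2x2 table; a small p_value suggests a real difference
```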
Q hypothesis test?
Hypothesis Tests
Hypothesis tests, also called significance tests, are ubiquitous in the traditional statistical analysis of
published research. Their purpose is to help you learn whether random chance might be responsible
for an observed effect.
Null hypothesis
The hypothesis that chance is to blame.
Alternative hypothesis
Counterpoint to the null (what you hope to prove).
One-way test
Hypothesis test that counts chance results only in one direction.
Two-way test
Hypothesis test that counts chance results in two directions.
An A/B test is typically constructed with a hypothesis in mind. For example, the hypothesis might be
that price B produces higher profit. Why do we need a hypothesis? Why not just look at the outcome
of the experiment and go with whichever treatment does better?
The answer lies in the tendency of the human mind to underestimate the scope of natural random
behavior. One manifestation of this is the failure to anticipate extreme events, or so-called "black
swans".
Another manifestation is the tendency to misinterpret random events as having patterns of some
significance. Statistical hypothesis testing was invented as a way to protect researchers from being
fooled by random chance.
Null hypothesis: the baseline assumption that any observed difference between the groups is due to
chance is termed the null hypothesis. Our hope, then, is that we can in fact prove the null hypothesis
wrong and show that the outcomes for groups A and B are more different than what chance might
produce.
Alternative Hypothesis
Hypothesis tests by their nature involve not just a null hypothesis but also an offsetting alternative
hypothesis. Here are some examples:
● Null = “no difference between the means of group A and group B”;
● alternative = “A is different from B” (could be bigger or smaller)
A one-tail hypothesis test often fits the nature of A/B decision making, in which a decision is required
and one option is typically assigned “default” status unless the other proves better.
If you want a hypothesis test to protect you from being fooled by chance in either direction, the
alternative hypothesis is bidirectional (A is different from B; could be bigger or smaller).
In such a case, you use a two-way (or two-tail) hypothesis. This means that extreme chance results in
either direction count toward the p-value.
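A small permutation-test sketch makes the two-way idea concrete: extreme differences in either direction count toward the p-value. The group values here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up conversion values for groups A and B
a = np.array([12.1, 11.4, 13.0, 12.7, 11.9, 12.5])
b = np.array([12.9, 13.4, 12.8, 13.7, 13.1, 12.6])
observed = b.mean() - a.mean()

pooled = np.concatenate([a, b])
count = 0
n_perm = 5000
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[len(a):].mean() - pooled[:len(a)].mean()
    # two-way test: results at least as extreme in EITHER direction count
    if abs(diff) >= abs(observed):
        count += 1
p_value = count / n_perm
```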
Selection bias
Bias resulting from the way in which observations are selected.
Data snooping
Extensive hunting through data in search of something interesting.
Regression
Regression is a statistical technique used in data science to model the relationship between one
or more independent variables and a dependent variable. The goal of regression analysis is to
identify the most important predictors of the dependent variable, and to quantify the strength and
direction of those relationships. Common types of regression include:
1. Simple linear regression: Models the relationship between a single independent variable
and a dependent variable using a linear equation.
2. Multiple linear regression: Models the relationship between multiple independent
variables and a dependent variable using a linear equation.
3. Logistic regression: Models the relationship between one or more independent variables
and a binary dependent variable, such as whether a customer will buy a product or not.
4. Polynomial regression: Models the relationship between an independent variable and a
dependent variable using a higher-order polynomial equation, such as a quadratic or
cubic function.
5. Ridge regression: Used to handle multicollinearity, where the independent variables are
highly correlated with each other.
6. Lasso regression: Used to select the most important independent variables and reduce
overfitting.
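A minimal sketch of the first type, simple linear regression, on synthetic data (using scipy.stats.linregress; the true slope 2 and intercept 1 are assumptions of the example, not values from the text):

```python
import numpy as np
from scipy import stats

# Synthetic data: y ~ 2x + 1 with a little noise
rng = np.random.default_rng(42)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=20)

# Fit a line; the estimated slope and intercept should be close
# to the true values used to generate the data
fit = stats.linregress(x, y)
```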
Regression analysis is widely used in data science for many applications, including:
1. Predictive modeling: Predicting future outcomes based on past data, such as predicting
customer churn or stock prices.
2. Forecasting: Forecasting future trends based on historical data, such as predicting sales
or demand for a product.
3. Causal inference: Determining the causal relationship between two variables, such as the
impact of a marketing campaign on sales.
4. Feature selection: Identifying the most important independent variables to include in a
predictive model.
Overall, regression analysis is a powerful tool for understanding the relationships between
variables and for making predictions from data.
---
Factor Variables
Factor variables, also known as categorical variables, are variables that take on a
limited number of distinct categories rather than numeric values. Examples include
gender (male or female), education level (high school, college, or graduate), and
marital status. Because they are not continuous variables, they cannot be directly used
in a regression equation; indicator variables are used instead to model the relationship
between the factor variable and the dependent variable.
To use a factor variable in a regression analysis, the variable must first be encoded
as a set of indicator (dummy) variables. An indicator variable is a binary
variable that takes on the value of 1 if the category is present, and 0 otherwise. For
example, if the factor variable is gender, two indicator variables would be created:
one for male (coded as 1 if male, 0 if female) and one for female (coded as 1 if
female, 0 if male).
The regression equation for a model that includes a factor variable with k categories
would be:

y = β0 + β1x1 + β2x2 + ... + β(k−1)x(k−1) + ε

where x1, x2, ..., x(k−1) are the k − 1 indicator variables for the k categories of the factor
variable. One of the k categories is chosen as the reference category and is absorbed into
the intercept; the other k − 1 categories each receive an indicator variable.
The coefficient β1 represents the difference in the dependent variable between the
reference category and the first category, β2 represents the difference between the
reference category and the second category, and so on.
In summary, factor variables require special treatment in regression analysis through the
creation of indicator variables. The use of indicator variables allows a regression model
to capture these relationships.
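In Python, pandas can create the indicator variables described above. This sketch uses a hypothetical gender column; drop_first=True removes the reference category, leaving the k − 1 indicators for a k-category factor:

```python
import pandas as pd

# Hypothetical data frame with a factor (categorical) column
df = pd.DataFrame({'gender': ['male', 'female', 'male', 'female']})

# One indicator column per category; drop_first=True drops the reference
# category ('female' here, the first alphabetically), leaving k - 1 columns
dummies = pd.get_dummies(df['gender'], drop_first=True).astype(int)
```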
---
Regression diagnostics
it refer to a set of procedures used to evaluate the quality and appropriateness of a regression
model in data science. Regression diagnostics are important because they help ensure that the
assumptions underlying the regression model are met, and that the model provides a good fit to
the data.
The following are some common regression diagnostic techniques used in data science:
1. Residual analysis: Residuals are the differences between the predicted values and the
actual values of the dependent variable. Residual analysis involves examining the
distribution of residuals and checking for patterns or trends in the residuals. A good
regression model should have residuals that are normally distributed with no discernible
patterns.
2. Influence analysis: Influence analysis is used to identify outliers or influential data points
that have a large impact on the regression model. Outliers and influential data points can
have a significant effect on the estimated coefficients and can lead to biased results.
3. Multicollinearity analysis: Multicollinearity occurs when the independent variables are
highly correlated with each other. Multicollinearity can lead to unstable and unreliable
regression coefficients, making it difficult to interpret the results.
4. Goodness of fit analysis: Goodness of fit analysis involves evaluating how well the
regression model fits the data. This can be done by examining the R-squared value, which
measures the proportion of variance in the dependent variable that is explained by the
independent variables.
Overall, regression diagnostics are an important part of the data science process, as they help
ensure that regression models are valid and provide reliable results. By using regression
diagnostics, data scientists can identify and address potential issues with their models and
improve the quality of their analyses.
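A sketch of the first diagnostic, residual analysis, on synthetic data: fit a line, compute residuals, and confirm they center on zero (the slope and intercept values here are invented for the example):

```python
import numpy as np
from scipy import stats

# Synthetic data for a hypothetical linear model y = 3x + 2 + noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=50)

fit = stats.linregress(x, y)
predicted = fit.intercept + fit.slope * x
residuals = y - predicted

# For a well-specified least-squares fit with an intercept, residuals
# have mean zero; a residual plot should show no discernible pattern
mean_resid = residuals.mean()
```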
---
Polynomial and spline regression are two types of regression models commonly
used in data science to model non-linear relationships between the dependent and
independent variables.
Polynomial regression involves fitting a polynomial function to the data. A
polynomial function is a function of the form y = a0 + a1x + a2x^2 + ... + anx^n, where
y is the dependent variable, x is the independent variable, and a0, a1, a2, ..., an are the
coefficients estimated from the data.
Spline regression, on the other hand, involves fitting a piecewise polynomial function
to the data. A spline is a curve that is defined by a set of polynomial functions, each
fitted over a subinterval of the data. The polynomial pieces are connected at a set of
points called knots. The knots are chosen based on the data, and the number and
location of the knots can be adjusted to control the flexibility of the spline.
Polynomial and spline regression are both useful when the relationship between the
dependent and independent variables is non-linear. Polynomial regression
is simpler and more computationally efficient than spline regression, but it may not
capture complex local patterns well. Spline regression is more flexible,
but it can be more computationally intensive and may require more data to estimate
the knots reliably.
Overall, polynomial and spline regression are valuable tools in the data scientist's
toolkit for modeling non-linear relationships in data.
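A sketch of polynomial regression with numpy: fitting a degree-2 polynomial to noise-free quadratic data recovers the coefficients exactly (the function y = x^2 − 3x + 1 is invented for the example):

```python
import numpy as np

# Synthetic non-linear data generated from y = x^2 - 3x + 1
x = np.linspace(-5, 5, 50)
y = x ** 2 - 3 * x + 1

# Degree-2 polynomial fit; coefficients are returned highest power first
coeffs = np.polyfit(x, y, deg=2)
```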
---
Linear regression and logistic regression are two types of models that are commonly
used in data science, but they are used for different purposes.
Linear regression is used to model the relationship between a continuous dependent
variable and one or more independent variables. It assumes that the
relationship between the variables is linear, which means that the change in the
dependent variable is proportional to the change in the independent variables.
Logistic regression, on the other hand, is used to model the relationship between a
binary dependent variable (i.e., one that takes on two possible values, typically 0 and
1) and one or more independent variables.
Logistic regression models output probabilities, or the likelihood that the dependent
variable will take on a certain value (e.g., the probability that a customer will
purchase a product).
In terms of assumptions, linear regression assumes that the residuals (i.e., the
differences between the predicted values and the actual values of the dependent
variable) are normally distributed and have constant variance across all levels of the
independent variables. Logistic regression assumes that the relationship between
the independent variables and the dependent variable is log-linear, meaning that the
log-odds of the dependent variable taking on a certain value are a linear function of the
independent variables.
In summary, while both linear regression and logistic regression are used to model
relationships between variables, they are used for different types of dependent
variables and have different assumptions. Linear regression is used for continuous
dependent variables, while logistic regression is used for binary dependent variables.
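To make the contrast concrete, here is a bare-bones logistic regression fit by gradient descent on invented one-feature data (a teaching sketch, not a production implementation; in practice a library routine would be used):

```python
import numpy as np

# Made-up one-feature data: larger x makes outcome 1 more likely
x = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5])
y = np.array([0,   0,   0,   0,   1,   1,   1,   1])

# Fit w and b by gradient descent on the log-loss
w, b = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))  # predicted probabilities
    w -= lr * np.mean((p - y) * x)           # log-loss gradient w.r.t. w
    b -= lr * np.mean(p - y)                 # log-loss gradient w.r.t. b

# The model outputs probabilities, not continuous values
probs = 1.0 / (1.0 + np.exp(-(w * x + b)))
```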
---
Naive Bayes is a popular classification algorithm in machine learning and data science that
is based on Bayes' theorem. It is called "naive" because it assumes that the features (i.e.,
independent variables) are independent of each other, which is usually not the case in real-world
data.
The algorithm works by calculating the probability of each class (i.e., the dependent variable)
given the values of the features. It does this by first calculating the prior probability of each class,
which is the probability of the class occurring without any knowledge of the features. It then
calculates the likelihood of each feature given each class, which is the probability of the feature
having a certain value given that the class is true. Finally, it uses Bayes' theorem to calculate the
posterior probability of each class given the observed features, and it predicts the class with the
highest posterior probability.
Naive Bayes can be used for both binary and multi-class classification problems. It is particularly
useful for text classification tasks such as spam detection and sentiment analysis. It is also
computationally efficient and requires relatively little training data compared to other machine
learning algorithms. However, its assumption of feature independence can lead to inaccurate
predictions in certain cases where the features are highly correlated.
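A hand-rolled sketch of the idea for a single Gaussian-distributed feature (the training data is invented): priors and per-class likelihoods are combined in log space, and the highest-scoring class wins:

```python
import numpy as np

# Tiny made-up training set: one feature, two classes
X = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.8])
y = np.array([0,   0,   0,   1,   1,   1])

def gaussian_nb_predict(x_new, X, y):
    """Score each class by log-prior + Gaussian log-likelihood."""
    scores = {}
    for cls in np.unique(y):
        feats = X[y == cls]
        prior = np.log(len(feats) / len(X))          # P(class)
        mu, var = feats.mean(), feats.var(ddof=1)
        # log of the Gaussian density of x_new under this class
        loglik = -0.5 * np.log(2 * np.pi * var) - (x_new - mu) ** 2 / (2 * var)
        scores[cls] = prior + loglik
    return max(scores, key=scores.get)               # highest posterior wins
```

With more than one feature, the "naive" independence assumption lets the per-feature log-likelihoods simply be summed.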
---
Discriminant analysis is a statistical technique used in data science to classify data into
two or more groups based on a set of predictor variables.
The technique works by creating a linear combination of the predictor variables that
maximizes the separation between the groups. This linear combination is called the
discriminant function. The discriminant function can be used to classify new data
points into one of the groups.
There are two types of discriminant analysis: two-group discriminant analysis and
multiple-group discriminant analysis. In two-group discriminant analysis there are
only two groups to classify the data into, while in multiple-group discriminant
analysis there are more than two.
The discriminant function is derived by first calculating the mean and covariance of
each predictor variable for each group. These statistics are then used to calculate
the weights of the discriminant function.
Discriminant analysis is used in many areas of data science and machine learning,
and it is particularly useful when
there are multiple predictor variables that can be used to classify the data. However,
it assumes that the predictor variables are normally distributed and have equal
covariance matrices across all groups. If these assumptions are not met, the
accuracy of the classification may be reduced.
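For the two-group case, the discriminant function described above can be sketched with numpy (synthetic, well-separated groups; w = S⁻¹(μ1 − μ0) with a pooled covariance S is Fisher's classical form):

```python
import numpy as np

# Made-up two-group data with two predictor variables
rng = np.random.default_rng(7)
group0 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
group1 = rng.normal([3.0, 3.0], 0.5, size=(50, 2))

# Group means and (equal-weight) pooled covariance
mu0, mu1 = group0.mean(axis=0), group1.mean(axis=0)
pooled_cov = (np.cov(group0.T) + np.cov(group1.T)) / 2

# Fisher's two-group discriminant direction
w = np.linalg.solve(pooled_cov, mu1 - mu0)

def classify(point):
    # Assign by which side of the projected midpoint the point falls on
    midpoint = w @ (mu0 + mu1) / 2
    return int(w @ point > midpoint)
```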
---
Evaluating classification models is an important step in data science, as it helps to
determine the accuracy and effectiveness of the models in predicting the class labels
of new data. There are several metrics and techniques for doing so, including accuracy,
the confusion matrix, precision, recall, and the F1 score.
1. Accuracy: Accuracy is the proportion of correct predictions out of all predictions made.
2. Confusion matrix: A confusion matrix is a table summarizing the counts of true
positives, true negatives, false positives, and false negatives.
3. Precision: Precision is the proportion of true positive predictions out of all positive
predictions, measuring the exactness of the model's predictions.
4. Recall: Recall is the proportion of true positive predictions out of the total
actual positives.
5. F1 score: The F1 score is a metric that combines both precision and recall into a
single score (their harmonic mean).
6. Receiver operating characteristic (ROC) curve: The ROC curve is a plot of the
true positive rate (TPR) against the false positive rate (FPR) at different
classification thresholds. It is useful for comparing classification models.
7. Area under the curve (AUC): The AUC is the area under the ROC curve. It is a
metric that measures the overall performance of the model, with higher values
indicating better performance.
The choice of evaluation metric depends on the specific problem and goals of the
classification task. It is also important to use techniques such as cross-validation
to ensure that the evaluation results are not overly dependent on a particular
train/test split.
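The core metrics can be computed by hand from the confusion-matrix counts; the labels below are invented so the arithmetic is easy to follow:

```python
import numpy as np

# Hypothetical true labels and model predictions
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])

# Confusion-matrix counts
tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives:  3
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives: 0
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives: 1
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives:  2

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```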