DADM FAQs
1. What is the logic behind the statement, "2 Standard deviations away"?
This comes from the empirical rule for normally distributed data (not the Central Limit
Theorem): approximately 95% of the observations fall within the range of mean ± 2 standard
deviations. It is used to understand the range within which we can expect most of the data
points to lie.
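The 95% figure can be checked numerically for a normal distribution. The sketch below uses `statistics.NormalDist` from Python's standard library; the mean of 50 and standard deviation of 10 are made-up values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 50, standard deviation 10
dist = NormalDist(mu=50, sigma=10)

lower = dist.mean - 2 * dist.stdev   # mean - 2 standard deviations
upper = dist.mean + 2 * dist.stdev   # mean + 2 standard deviations

# Probability of an observation falling between the two limits
share_within_2sd = dist.cdf(upper) - dist.cdf(lower)
print(f"Range: {lower} to {upper}")
print(f"Share within 2 standard deviations: {share_within_2sd:.4f}")
```

The share comes out at roughly 0.95 whatever mean and standard deviation are chosen, as long as the data are normally distributed.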
Within and between both describe variation. Within signifies the variation within each of the
groups, and between describes how far each group's mean is from the overall mean.
3. What do sums of squares imply? Please explain the logic and where it should be
applied.
Sum of squares is the term used for variation. There are two kinds: the within-group sum of
squares and the between-group sum of squares. Dividing each by its degrees of freedom and
taking the ratio of between-group to within-group variation gives the F statistic. The p-value
corresponding to the F statistic helps in understanding whether the group means are
significantly different or not.
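As a minimal sketch of how the two sums of squares combine into an F statistic, the example below works through a one-way layout by hand. The three groups are invented data, and the variable names are only illustrative.

```python
from statistics import mean

# Hypothetical data: three groups with three observations each
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

all_values = [v for g in groups for v in g]
grand_mean = mean(all_values)

# Between-group sum of squares: how far each group mean is from the overall mean
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

# Degrees of freedom turn sums of squares into mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_statistic = (ss_between / df_between) / (ss_within / df_within)
print(ss_between, ss_within, f_statistic)
```

For this data the between-group sum of squares is 26, the within-group sum of squares is 6, and F works out to 13 (up to floating-point rounding). A large F means the group means differ by much more than the spread inside the groups would suggest.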
4. F-Statistic - What is it? In addition, what is its logic and where should it be applied?
5. In regression what is R and R square? What is the logic behind this and where
should it be applied?
Multiple R is the correlation coefficient, and R square tells us how much of the variation in
Y is explained by X or by multiple X’s.
6. What does the coefficient value denote, given that it can come out either positive or
negative?
The coefficient value for each X variable shows how much of an impact that X will have
on Y when X increases by 1 unit, keeping all other variables constant. A positive sign says a
1 unit increase in X will increase Y, and a negative sign suggests that a 1 unit increase in X
will decrease Y.
7. When P<0.05 holds and the coefficient value is also negative, what does it
mean?
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
Yes, one can create two columns, one for males and another for females, and see if there is
any significant difference between males and females.
12. In a two-sample t-test, how do we decide which column to put first? Is it based on the
number of observations?
It does not matter which column goes first or second.
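To see why the order of the two columns is not important, the sketch below computes a two-sample t statistic by hand in both orders. It assumes the unequal-variance (Welch) form of the statistic, and the two samples are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(sample_a, sample_b):
    """Two-sample t statistic assuming unequal variances (Welch form)."""
    se = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se

# Hypothetical samples
col_1 = [12, 15, 14, 10, 13]
col_2 = [9, 11, 10, 8, 12, 10]

t_forward = t_statistic(col_1, col_2)
t_reversed = t_statistic(col_2, col_1)
print(t_forward, t_reversed)
```

Swapping the columns only flips the sign of the t statistic; its absolute value, and therefore the two-tailed p-value, stays the same.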
13. When positive coefficient of one predictor is a result in simple linear regression,
however it goes the other way round (negative coefficient) when all predictors were
If there are predictors whose coefficients logically have the wrong sign (+ or -), it may be
due to multicollinearity (an independent variable being highly correlated with other
independent variables). For this course, we will interpret coefficient values only for
significant variables. Variable significance can be understood from the p-value. The issue of
multicollinearity can be addressed by removing highly collinear variables one by one,
building regression models iteratively, and then picking the best one based on the
adjusted R-square value.
14. What if the regression coefficient is negative while the p value is very small? What
should be concluded? Because as per p value the variable should be a significant
contributor But as per the regression coefficient it shows the negative relationship
(intercept is negative too) ...but the negative relationship doesn't fall under the common
sensical conclusion of the business model
Correlation coefficients are used to measure the strength of the linear relationship between
two variables. A correlation coefficient greater than zero indicates a positive relationship
while a value less than zero signifies a negative relationship. As you have said, this does
indicate an unexpected pattern wherein a person who makes more money has a negative
balance.
If there are predictors whose coefficients logically have the wrong sign (+ or -), it may be
due to multicollinearity (an independent variable being highly correlated with other
independent variables). For this course, we will interpret coefficient values only for
significant variables. Variable significance can be understood from the p-value. The issue of
multicollinearity can be addressed by removing highly collinear variables one by one,
building regression models iteratively, and then picking the best one based on the
adjusted R-square value.
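One simple way to spot candidate collinear variables before rebuilding the model is to look at pairwise correlations between the predictors. The sketch below is only a first screening step under stated assumptions: the predictor names and values are hypothetical, and the 0.8 cut-off is an arbitrary choice for illustration, not a rule from the course.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical predictors; x2 is almost exactly twice x1
predictors = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8],
    "x3": [3, 1, 4, 2, 5],
}

THRESHOLD = 0.8  # assumed cut-off for "highly correlated", not from the course

# Keep every pair of predictors whose absolute correlation exceeds the cut-off
flagged = [
    (a, b)
    for a, b in combinations(predictors, 2)
    if abs(pearson_r(predictors[a], predictors[b])) > THRESHOLD
]
print(flagged)
```

Here only the pair x1 and x2 is flagged, so one of the two would be a candidate for removal before refitting the regression and comparing adjusted R-square values.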
15. Is the correlation coefficient the same as the regression coefficient?
No. Both describe the relationship between two variables, say x and y, but they measure
different things. The correlation coefficient measures the degree (strength and direction) of
the linear relationship, whereas the regression coefficient is a parameter that describes how
much one variable changes when the other changes.
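The difference between the two quantities shows up clearly when the units of y change. The sketch below, using invented data, computes both the correlation coefficient and the least-squares slope for y and for the same y multiplied by 10.

```python
from math import sqrt
from statistics import mean

def correlation_and_slope(x, y):
    """Return (Pearson r, least-squares slope of y on x)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy), sxy / sxx

# Hypothetical data; y_scaled is y multiplied by 10 (for example, a change of units)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_scaled = [v * 10 for v in y]

r1, slope1 = correlation_and_slope(x, y)
r2, slope2 = correlation_and_slope(x, y_scaled)
print(r1, slope1)
print(r2, slope2)
```

The correlation is identical in both cases because it is unit-free, while the slope grows tenfold because it is expressed in units of y per unit of x.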
16. Does the NORM.INV function calculate the value of a random variable at a particular
probability?
The NORM.INV function is categorized under Excel Statistical functions. It calculates
the inverse of the normal cumulative distribution for a supplied probability, with a given
mean and standard deviation.
For example, suppose we are given a normally distributed random variable that is denoted
by x. If we wish to find the value of x below which the bottom 5% of the distribution lies,
we can use the NORM.INV function.
As a financial analyst, the function is useful in stock market analysis. We can use
NORM.INV to understand how a portfolio is affected by any additions or withdrawals
made.
Formula
=NORM.INV(probability,mean,standard_dev)
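For readers who want to reproduce the same calculation outside Excel, a rough equivalent can be sketched with `statistics.NormalDist.inv_cdf` from Python's standard library. The wrapper name `norm_inv` and the example values (mean 100, standard deviation 15) are assumptions made for illustration.

```python
from statistics import NormalDist

def norm_inv(probability, mean, standard_dev):
    """Value x such that P(X <= x) = probability for a normal distribution."""
    return NormalDist(mu=mean, sigma=standard_dev).inv_cdf(probability)

# Hypothetical example: cut-off for the bottom 5% when mean = 100, sd = 15
cutoff = norm_inv(0.05, 100, 15)
print(round(cutoff, 2))
```

With these inputs the cut-off is about 75.33, meaning roughly 5% of values from this distribution are expected to fall below it.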
17. R-square measures the percentage of the variation in Y that is collectively explained by
the X variables in the model.
Does this mean R-square = the correlation between 2 items?
But if I draw a correlation matrix, the correlation value looks like the square root of R-square.
The coefficient of correlation is the “R” value which is given in the summary table in the
Regression output. R square is also called the coefficient of determination. Multiply R
times R to get the R square value. In other words, Coefficient of Determination is the
square of the Coefficient of Correlation.
As additional reading material, we recommend you go through this link.
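The relationship between R and R square can be verified by hand for a simple regression with one X. The sketch below uses invented data and computes R square from the explained variation and, separately, the correlation coefficient.

```python
from math import sqrt
from statistics import mean

# Hypothetical data for a simple regression with one X
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = mean(x), mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)   # total variation in Y

# Least-squares line
slope = sxy / sxx
intercept = my - slope * mx

# R square = 1 - (unexplained variation / total variation)
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
r_square = 1 - sse / syy

# Correlation coefficient between X and Y
r = sxy / sqrt(sxx * syy)

print(r_square, r ** 2)
```

Both printed values agree (about 0.6 here), which is why the correlation in a correlation matrix looks like the square root of R square when there is a single X.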
18. How to find out if a distribution is positively or negatively skewed? Should we consider
the signs before the coefficients or should we consider mean>median (if positively
skewed) and median> mean (if negatively skewed)?
If the mean is greater than the mode, the distribution is positively skewed. If the mean is
less than the mode, the distribution is negatively skewed. If the mean is greater than the
median, the distribution is positively skewed. If the mean is less than the median, the
distribution is negatively skewed.
As additional reading material, we recommend you go through this link.
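The mean versus median comparison can be sketched as a quick check. The data below are invented, and the comparison is a rule of thumb rather than a formal skewness test.

```python
from statistics import mean, median

# Hypothetical values: most are similar, one is much higher
data = [20, 22, 25, 27, 30, 200]

m, med = mean(data), median(data)
if m > med:
    shape = "positively skewed"
elif m < med:
    shape = "negatively skewed"
else:
    shape = "roughly symmetric"
print(m, med, shape)
```

Here the single high value pulls the mean (54) well above the median (26), so the data are reported as positively skewed.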
19. My doubt is regarding one tail and two tail. After a two-sample t-test, there are rows
which mention P(T<=t) one-tail and similarly for two-tail. When considering this p-value,
on what basis are one tail and two tail chosen? Some answers consider only one tail and
others two tail. My question is how to differentiate.
When in doubt, it is almost always more appropriate to use a two-tailed test. A one-tailed
test is only justified if you have a specific prediction about the direction of the difference
(e.g., Group A scoring higher than Group B), and you are completely uninterested in the
possibility that the opposite outcome could be true (e.g., Group A scoring lower than
Group B).
The criterion p-value < 0.05 is preferable as it is more generally used. The equivalent rule is
|t-statistic| > critical value. If the t-statistic is negative, we ignore the negative sign
for this purpose: we are only interested in the absolute value when comparing with the
critical value (the critical value is always positive). This is the situation in the case study
that is referred to.
21. I was trying to calculate the p-value through the dummy method using a t-test and
through Excel formulas. The two give different p-values. In the first case the p-value was
less than 0.05 and in the second case it was greater than 0.05. How do I go about it? Which
method is more appropriate?
In addition, could you explain when we should use zero as the hypothesized mean difference
and when a particular value (in this case 500)?
A box plot is a technique that lets you see the distribution. Its lines mark the minimum
permissible value, the 25th percentile, the median, the 75th percentile and the maximum
permissible value. Samples that fall above the maximum or below the minimum are treated as
outliers. Outliers can strongly influence results and should be considered in the context of
the business domain. You may decide to treat outliers with the capping and flooring technique.
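A minimal sketch of the box plot limits and of capping and flooring follows. It assumes the common convention that the permissible range extends 1.5 times the interquartile range beyond the quartiles; that convention and the sample values are assumptions, not taken from the course material.

```python
from statistics import quantiles

# Hypothetical sample with one unusually large value
data = [12, 14, 15, 15, 16, 18, 19, 21, 22, 60]

# 25th percentile, median and 75th percentile (one common interpolation convention)
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed permissible range: 1.5 * IQR beyond the quartiles
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [v for v in data if v < lower_limit or v > upper_limit]

# Capping and flooring: pull each outlier back to the nearest limit
capped = [min(max(v, lower_limit), upper_limit) for v in data]
print(q1, q2, q3, outliers)
print(capped)
```

Only the value 60 falls outside the limits here, and after capping it is replaced by the upper limit. Whether that replacement is sensible depends on the business context.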
24. If there is a business claim to verify as true or false, should we verify it through a
t-test or through ANOVA?
A t-test is applicable if you have 2 groups. For ANOVA, you need more than 2. After you
perform ANOVA, if the means are different, you may pick any 2-group combination (for
example, out of 3 groups) and do a t-test to check whether the groups picked have similar
means.
25. How do we write the null statement for a given problem? What if I write the statement
incorrectly? Would the analysis give different results?
Please go by the guiding principles. The null hypothesis should always contain the equality.
The alternate hypothesis should contain one of not equal to, greater than, or less than. If
you follow these principles, your inference will not be incorrect.
26. Request to provide simple handholding on statistical jargon. This will help us
understand and relate better when we are introduced to complex subject formulas.
Example: p-value, t-stat
Please look for this in the classroom recordings and/or the additional video content shared.
The list is extensive, and you may need to consistently broaden your knowledge base by
referring to materials from class, videos, online resources, etc.
27. We are introduced to different kinds of testing. Request support in how to make
judicious decision while choosing, which test to perform?
29. What is R Square? What is the significance of squaring them? (and then taking Sqrt to
get the Correlation coefficient).
R square describes the explained variation. Taking the square root of R square gives you the
measure of correlation. Variation is calculated as the sum of (x - x_bar)^2. If you did not
square the deviations of the sample values from the mean, the positives and negatives would
cancel out when summed and give a total of 0. Hence, we square them to get the variation.
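The cancelling-out effect is easy to see on a small, invented sample:

```python
from statistics import mean

# Hypothetical observations
x = [4, 8, 6, 10, 2]
x_bar = mean(x)

deviations = [v - x_bar for v in x]
squared_deviations = [d ** 2 for d in deviations]

print(sum(deviations))          # positives and negatives cancel
print(sum(squared_deviations))  # total variation around the mean
```

The plain deviations sum to 0, while the squared deviations sum to 40, which is why the squared form is used to measure variation.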
30. How do we identify the best solution for a business case? I understand that it depends
on the inputs available and the output required. The question is how we develop the
understanding of which test (x, y or z) to apply to get results.
2 sample – T-test; more than 2 – Anova; if you try to understand impact of multiple
independent variables on a dependent variable – Regression
31. If independent variables are highly correlated with each other, it is not recommended
to do multiple regression analysis. But in some cases the independent variables are
highly correlated with each other, so I would like to know what we do in that situation.
The answers do not differ. Please remember to divide the sum of squared deviations from
the mean by (n-1) not n, the sample size. Also, look for standard deviation in the output, not
standard error.
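The two points in this answer, dividing by (n-1) rather than n and reading standard deviation rather than standard error, can be sketched on a small invented sample using Python's standard library.

```python
from math import sqrt
from statistics import mean, pvariance, stdev, variance

# Hypothetical sample
data = [4, 8, 6, 10, 2]
n = len(data)
x_bar = mean(data)
ss = sum((v - x_bar) ** 2 for v in data)   # sum of squared deviations from the mean

sample_variance = ss / (n - 1)      # divide by n - 1 (sample variance)
population_variance = ss / n        # divide by n, which gives a smaller number

# The library functions follow the same two conventions
print(sample_variance, variance(data))
print(population_variance, pvariance(data))

standard_deviation = stdev(data)                # spread of the observations
standard_error = standard_deviation / sqrt(n)   # spread of the sample mean
print(standard_deviation, standard_error)
```

A manual calculation that divides by n, or that is compared against the standard error, will not match the standard deviation in the descriptive statistics output.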
33. What if, for a given column, the most repeated values are two different values, let's
say 600 repeats 5 times and 400 also repeats 5 times? How does descriptive analysis pick
the mode?
Whichever value reaches the maximum count first is the mode reported by Excel in this case.
The mode of 1,3,3,5,4,4 is reported as 3 and the mode of 1,4,4,5,3,3 is reported as 4.
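The same two lists can be tried in Python, whose `statistics.mode` (Python 3.8 and later) happens to follow a similar first-encountered convention for ties. This only illustrates the idea and is not Excel's implementation; `statistics.multimode` is shown as a way to see every tied value.

```python
from statistics import mode, multimode

first = [1, 3, 3, 5, 4, 4]
second = [1, 4, 4, 5, 3, 3]

# mode() returns the first of the tied values encountered in the data
print(mode(first), mode(second))

# multimode() lists every value that shares the highest count
print(multimode(first), multimode(second))
```

When a column has tied modes, listing all of them avoids relying on the order in which the values happen to appear.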
34. In the variance formula, it was suggested that the X_i value be picked from a certain
range after or before the mean. When X_i - X_bar was manually calculated in the lesson, I
noticed the X_i value picked was X_1. What decides the X_i value? How do we choose it if we
do the calculation manually?
All the X_i values are present in the formula. The formula (X_i - X_bar)^2 can be
applied to X_1 and then dragged down to the other cells, so that it is applied to the
other X_i as well.
In variance, dispersion is first calculated as the square of the difference from the mean for
each observation. Then, these quantities are averaged over all the observations (in the
average, n-1 is taken instead of n). In this sense, variance is a measure of average
dispersion.
36. For currency, I understood that it is not real currency when you measure variance, but
we also talk about the variance of height in sq inches and sq cms. Are there any other units
of measurement to be known other than these two, currency and height, and how do we
represent them in Excel? Kindly elaborate.
The variance of a currency is similarly in squared units, e.g. square-Rs or square-$, even if
that does not make sense in terms of the currency. If the observations have units, the
variance has squared units. To avoid this problem, we work with the standard deviation, the
square root of the variance. Standard deviation is in the original data units. Excel is a
calculator, so it does not use any units in general. There are specific currency calculators
in Excel, but we have not implemented statistics in those.
40. When calculating variance in descriptive statistics, what is X_i in the variance formula?
X_i in the formula refers to all the data points. X_1 is the first data point, X_2 is the second
data point, and X_n is the last data point (if there are n data points). Within a summation
formula, we replace X_i with each data point and then add up whatever term the formula
requires.
42. Please clarify mean, median and mode in terms of their usefulness and application.
For given sample data, how do we identify whether the mean or the median is useful?
This depends on the application the measures are being put to. For example, measurement
of income. Mean is “per capita income of India”, median is “income of the average Indian”
and mode is “the most common level of income”. These have their own uses.
43. When the mean is higher than the median and mode, can I conclude that the sample data
contains outliers, or is an outlier altogether a different concept?
It means that there are some high values that are not balanced by corresponding low
values. These may be outliers, or they may represent natural skewness (positive skewness in
the data). For example, income. High income may represent outliers, or it may represent the
natural fact that a few people can have much higher income than others can.