DADM FAQs


Data Analysis for Decision Making Frequently Asked Questions:

1. What is the logic behind the statement, "2 Standard deviations away"?

This comes from the empirical rule for normally distributed data (not the Central Limit
Theorem): approximately 95% of the observations fall in the range of mean +/- 2 standard
deviations. It is used to understand the range within which we can expect most of the
data points to lie.
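As a rough illustration of the 2-standard-deviation statement, the sketch below computes the share of a normal distribution that lies within k standard deviations of the mean. The function name share_within is made up for this example, and the result assumes the data really are normally distributed.

```python
from statistics import NormalDist

def share_within(k: float) -> float:
    """Share of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    # Roughly 68%, 95% and 99.7% for k = 1, 2, 3
    print(f"within {k} SD: {share_within(k):.4f}")
```

For real data that is skewed or heavy-tailed, the observed share within 2 standard deviations can differ from 95%.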

2. In ANOVA, what is within and between?

Within and between refer to two sources of variation. Within signifies the variation of the
observations inside each group around that group's own mean, and between states how far
each group mean is from the overall mean.

3. What does sums of squares imply? Please explain the logic and where should it be
applied?

Sums of squares is the term used for variation. There is a within-group sum of squares and
a between-group sum of squares. Dividing each by its degrees of freedom gives a mean square,
and the ratio of the between-group mean square to the within-group mean square gives the
F statistic. The p-value corresponding to the F statistic helps in understanding whether the
group means are significantly different or not.

4. F-Statistic - What is it? In addition, what is its logic and where should it be applied?

Please refer to Q3 above.
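The F statistic described in Q3 can be sketched in a few lines. This is only an illustration: the function name one_way_anova and the three small groups are made up, and the p-value step is left out because it needs the F distribution (the Excel ANOVA output reports it directly).

```python
from statistics import mean

def one_way_anova(groups):
    """Return (ss_between, ss_within, f_stat) for a list of groups of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between: how far each group mean is from the overall mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within: how far each observation is from its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat

# Hypothetical data: three groups of three observations each
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ss_between, ss_within, f_stat = one_way_anova(groups)
print(ss_between, ss_within, f_stat)
```

A large F means the group means are spread out far more than the scatter inside the groups would explain.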

5. In regression what is R and R square? What is the logic behind this and where
should it be applied?

Multiple R is the correlation coefficient, and R square tells us how much of the variation in
Y is explained with the help of X or multiple X's.

6. What does the coefficient value denote when it comes out either positive or negative?

The coefficient value for each X variable shows how much Y changes when that X increases
by 1 unit, keeping all other variables constant. A positive sign says that a 1-unit increase
in X will increase Y, and a negative sign says that a 1-unit increase in X will decrease Y.

7. When P < 0.05 and the coefficient value is also negative, what does it mean?

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

This file is meant for personal use by [email protected] only.


Sharing or publishing the contents in part or full is liable for legal action.

The X variable is significant in explaining the variation in Y. A 1-unit increase in X
decreases Y by the absolute value of the coefficient, keeping all other variables constant.

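8. In correlation there are positive and negative values. A negative value implies negative
correlation. This means the variable moves against the compared variable: if the X value
increases, the Y value decreases.

Correct. Note that correlation describes how the two variables move together; by itself it
does not show that X causes the change in Y.
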
9. Should we treat a correlation of 0.08 as near zero, since it gives an almost flat trend
line, or as an 8% significant relation? When we find the correlation between two variables,
which lies in the range of -1 to +1, and get a value such as 0.08, the trendline in the
scatter plot is only slightly linear in the upward direction. Should we consider it as 8%
linear growth, or should we ignore it, stating that 0.08 is too low to indicate any linear
correlation between the variables?

A correlation of 0.08 is a coefficient on the -1 to +1 scale, not a percentage or a
significance level, and a value this close to zero is considered weak. Significance is a
separate check: establish a level of significance first (the general standard is 0.05) and
compare the p-value against it. If the p-value is greater than 0.05, the model is usually not
worth considering, and we would not suggest changing the significance level for one's
convenience. You could ignore this variable in a simple linear regression, where there is
just one X variable. However, in a multiple linear regression problem we could still consider
this variable, as we are looking at a combination of X variables to explain the variance in Y.

10. If in the question it is asked to do linear regression and state coefficients, does it mean
we have to show the multiple R, which is the correlation coefficient? Or only the
coefficients for different variables which are the slopes?
Only the coefficients are required for each variable. Do keep in mind that we use these
coefficients only if their individual p-values are also significant (less than alpha).

11. While performing two-sample T-test, one of the columns contains non-numerical
value such as men, women and the other column contains balance of the credit cards, Can
I create two columns Male and Female (balance)?

Yes, one can create two columns of balances, one for males and another for females, and test
whether there is any significant difference between males and females.

12. In a two-sample T-test, how do we decide which column to put first? Is it based on the
number of observations?

The order does not matter; either column can go first. Swapping the columns only flips the
sign of the t-statistic.

13. A predictor has a positive coefficient in simple linear regression, but the sign flips
(negative coefficient) when all predictors are included in multiple linear regression, and
the p-value shows that it is a significant variable, though that may not reflect the actual
situation. (For example, income has a positive coefficient in simple linear regression;
when multiple linear regression is carried out it has a large negative coefficient, but with
a p-value less than 0.05, which still shows it is a significant variable.) How should such
situations be interpreted, and should that variable be kept now that it has a negative
coefficient?

If there are predictors whose sign (+ or -) does not make logical sense, it may be due to a
multicollinearity issue (an independent variable highly correlated with other independent
variables). For this course, we will just interpret the coefficient values of the
significant variables. Variable significance can be understood from the p-value. The issue of
multicollinearity can be addressed by removing highly collinear variables one by one,
building regression models iteratively, and then picking the best one based on the
adjusted R-square.

14. What if the regression coefficient is negative while the p-value is very small? What
should be concluded? As per the p-value the variable should be a significant contributor,
but the regression coefficient shows a negative relationship (the intercept is negative
too), and the negative relationship does not fit the common-sense conclusion of the
business model.

The sign of a coefficient gives the direction of the relationship between the two variables:
a value greater than zero indicates a positive relationship, while a value less than zero
signifies a negative relationship. As you have said, this does indicate an unexpected
pattern, wherein a person who makes more money has a lower balance.

As in Q13 above, predictors whose sign (+ or -) does not make logical sense may be affected
by multicollinearity. The same approach applies: interpret the coefficients only for the
significant variables, and address multicollinearity by removing highly collinear variables
one by one and comparing the resulting models on adjusted R-square.

15. Is the correlation coefficient the same as the regression coefficient?

No. Both describe the relationship between two variables, say x and y, but they answer
different questions. Correlation measures the degree (strength and direction) of the
relationship, whereas the regression coefficient determines how much one variable changes
when the other changes by one unit.

16. Does the NORM.INV function calculate the value of the random variable at a particular
probability?

Yes. The NORM.INV function is categorized under Excel Statistical functions. It calculates
the inverse of the normal cumulative distribution for a supplied probability, with a given
distribution mean and standard deviation. In other words, it returns the value x such that
the probability to the left of x in that normal distribution equals the supplied probability.

For example, suppose we are given a normally distributed random variable denoted by x. If we
wish to find the value of x that marks off the bottom 5% of the distribution, we can use the
NORM.INV function with a probability of 0.05.

For a financial analyst, the function is useful in stock market analysis. We can use
NORM.INV to understand how a portfolio is affected by any additions or withdrawals
made.

Formula

=NORM.INV(probability,mean,standard_dev)

Please read more about this here.
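For readers who want to check a NORM.INV result outside Excel, Python's standard library offers the same inverse normal calculation. This is a sketch only: the wrapper name norm_inv and the mean of 100 and standard deviation of 15 are made up for the example.

```python
from statistics import NormalDist

def norm_inv(probability: float, mean: float, standard_dev: float) -> float:
    """Value x such that P(X <= x) equals probability for a normal X."""
    return NormalDist(mu=mean, sigma=standard_dev).inv_cdf(probability)

# Hypothetical distribution: mean 100, standard deviation 15
cutoff = norm_inv(0.05, 100, 15)
print(round(cutoff, 2))  # the value below which the bottom 5% lies

# Going the other way (value -> probability) recovers the 0.05
print(round(NormalDist(100, 15).cdf(cutoff), 4))
```

The second print shows the relationship between the two directions: the cumulative probability at the returned value is the probability that was supplied.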

17. R-square measures the percentage of the variation in Y that is collectively explained by
the X variables in the model. Does that mean R-square equals the correlation between two
items? When I draw the correlation matrix, the correlation value looks like the square root
of R-square.

The coefficient of correlation is the "R" value (Multiple R) given in the summary table of
the Regression output. R square is also called the coefficient of determination. Multiply R
by R to get the R square value. In other words, the coefficient of determination is the
square of the coefficient of correlation, which is why the value in the correlation matrix
looks like the square root of R square when there is a single X variable.
As additional reading material, we recommend you go through this link.
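The relationship between the correlation coefficient, the regression coefficient (Q15) and R square (Q17) can be checked numerically for a simple regression with one X. The sketch below uses made-up data and a made-up function name, and it computes R square from the explained variation rather than by squaring R, so that the two can be compared.

```python
from statistics import mean

def simple_regression_summary(x, y):
    """Return (r, slope, r_square) for a one-variable linear regression."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    r = sxy / (sxx * syy) ** 0.5          # correlation coefficient
    slope = sxy / sxx                     # regression coefficient of X
    intercept = y_bar - slope * x_bar
    # Unexplained variation: squared gaps between actual and fitted Y
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r_square = 1 - ss_residual / syy      # share of variation explained
    return r, slope, r_square

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [20, 40, 50, 70, 90]
r, slope, r_square = simple_regression_summary(x, y)
print(r, slope, r_square)
```

Here the slope is in units of Y per unit of X, while r is unit-free and always between -1 and +1; squaring r gives the same number as R square.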

18. How to find out if a distribution is positively or negatively skewed? Should we consider
the signs before the coefficients or should we consider mean>median (if positively
skewed) and median> mean (if negatively skewed)?

As a rule of thumb, if the mean is greater than the mode or the median, the distribution is
positively skewed. If the mean is less than the mode or the median, the distribution is
negatively skewed.
As additional reading material, we recommend you go through this link.
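The mean-versus-median comparison above can be written as a small check. This is a sketch of the rule of thumb only, with a made-up function name and made-up data; it is not a formal skewness coefficient.

```python
from statistics import mean, median

def skew_direction(values):
    """Rule-of-thumb skew direction from comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "no clear skew"

print(skew_direction([1, 2, 3, 4, 20]))     # a few high values pull the mean up
print(skew_direction([1, 17, 18, 19, 20]))  # a few low values pull the mean down
```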

19. My doubt is regarding one tail and two tail. After a two-sample t-test, the output has
rows for P(T<=t) one-tail and P(T<=t) two-tail. On what basis do we choose between the
one-tail and two-tail p-values? Some answers use only one tail and others two tail. How do
we differentiate?

When in doubt, it is almost always more appropriate to use a two-tailed test. A one-tailed
test is only justified if you have a specific prediction about the direction of the difference
(e.g., Group A scoring higher than Group B), and you are completely uninterested in the
possibility that the opposite outcome could be true (e.g., Group A scoring lower than
Group B).

We recommend you to go through this link for additional reading.

20. Video 27-week 3-"Case Study"


To reject the null hypothesis, we must check whether the p-value is less than the
significance level and the t-statistic is greater than the critical value.
In the above-mentioned case study, we can see (at 24:25) that p-value < significance
level, but t-stat < critical value. How did we decide to reject the null hypothesis then?

The criterion p-value < 0.05 is preferable, as it is more generally used. The equivalent
rule is |t-statistic| > critical value. If the t-statistic is negative, we ignore the negative
sign for this purpose: we are only interested in the absolute value when comparing with the
critical value (the critical value is always positive). This is the situation in the case
study referred to.
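The absolute-value rule can be sketched as follows. Everything here is illustrative: the two samples are made up, the function names are hypothetical, the unequal-variance form of the t-statistic is an assumption, and the critical value 2.306 is the standard two-tailed 5% table value for 8 degrees of freedom rather than something computed by the code.

```python
from statistics import mean, variance

def two_sample_t(sample_a, sample_b):
    """t-statistic for the difference in means (unequal-variance form)."""
    na, nb = len(sample_a), len(sample_b)
    standard_error = (variance(sample_a) / na + variance(sample_b) / nb) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / standard_error

def reject_null_two_tailed(t_stat, critical_value):
    """Compare the absolute value of t with the (always positive) critical value."""
    return abs(t_stat) > critical_value

# Hypothetical samples
group_a = [10, 12, 11, 13, 9]
group_b = [15, 17, 16, 18, 14]
t_stat = two_sample_t(group_a, group_b)
critical_value = 2.306  # assumed: two-tailed, alpha = 0.05, 8 degrees of freedom

print(t_stat)                                          # negative here
print(t_stat > critical_value)                         # naive comparison misleads
print(reject_null_two_tailed(t_stat, critical_value))  # absolute value is used
```

A negative t-statistic only means the first sample has the lower mean; the size of the difference is what gets compared with the critical value.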

21. I was trying to calculate the p value through dummy method using T test and through
excel formulas. Both tend to have different P value. In the first case P value was less than
0.05 and in the second case, it was greater than 0.05. How do I go about it? Which
method is more appropriate?

In addition, could you explain when we should have zero as the hypothesized mean difference
and when a particular value (in this case 500)?

Here are a few key points:

a. Remember that using a dummy of 0 is a workaround to execute a one-sample t-test in
Excel, so that comparisons against a standard (i.e., a one-sample t-test) can be executed as
a two-sample t-test.
b. Always begin by framing the alternative and null hypotheses, and only then look at the
p-values in the output.
c. The hypothesized mean difference is 0 only when you are comparing two variables
(e.g., Wing A vs Wing C in the video). When you are comparing a single variable against a
standard (e.g., 500), the standard value is the hypothesized mean difference, since the
dummy is assumed to take the value 0. The Excel sheet attached presents detailed comments
on your doubt.

Doubt_1.xlsx

22. When is T-test conducted?

A T-test is conducted when you have two groups and would like to determine whether there is
any significant difference in the group means. For example, given sales scores for 2 teams,
you would like to see whether they are performing at an equal level or differently.

23. I wanted to understand box plot in more detail.

A box plot is a technique that lets you see the distribution. The various lines mark the
minimum permissible range, the 25th percentile, the median, the 75th percentile and the
maximum permissible range. The samples that are above the maximum or below the minimum
permissible range are treated as outliers. Results can be very sensitive to these outliers,
so they should be considered in the context of the business domain. You may decide to treat
outliers with the capping and flooring technique.
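Capping and flooring can be sketched as below. The FAQ does not say how the permissible range is computed, so this example assumes the common convention of 1.5 times the interquartile range beyond the 25th and 75th percentiles; the function names and data are made up, and Excel's own quartile calculation may give slightly different cut-offs.

```python
from statistics import quantiles

def box_plot_fences(values):
    """Minimum and maximum permissible values (assumed 1.5 * IQR convention)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def cap_and_floor(values):
    """Pull values outside the permissible range back to the nearest limit."""
    low, high = box_plot_fences(values)
    return [min(max(v, low), high) for v in values]

# Hypothetical data with one very large value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(box_plot_fences(data))
print(cap_and_floor(data))
```

Whether to cap, floor, remove or keep an outlier remains a business decision; the code only shows the mechanics.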

24. If there is a business claim to verify as true or false, do we verify it through a T-test
or through ANOVA?

A T-test is applicable if you have 2 groups. For ANOVA, you need to have more than 2. After
you perform ANOVA, if the means are different, you may pick any 2-group combination (for
example, 2 out of 3) and do a t-test to observe whether the groups picked have similar means.

25. How do I write the null statement for a given problem? What if I write the statement
wrongly; would the analysis give different results?

Please go by the guiding principles. The null hypothesis should always contain the equality.
The alternative hypothesis should have one of: not equal to, greater than, or less than. If
you follow these principles, your inference will not be incorrect.

26. Request to provide simple handholding on statistical jargon. This will help us
understand and relate better when we are introduced to complex subject formulas.
Example: p-value, t-stat.

Please look for this in the classroom recordings and/or the additional video content shared.
The list is extensive, and you may need to consistently broaden your knowledge base by
referring to materials from class, videos, online sources, etc.

27. We are introduced to different kinds of testing. Request support in how to make
judicious decision while choosing, which test to perform?

Two samples: T-test. More than two samples: ANOVA. If you are trying to understand the
impact of multiple independent variables on a dependent variable: Regression.

28. What can be the application of covariance?

The application of covariance is similar to that of correlation. Both measure the
relationship between 2 variables, and correlation is the standardized version of covariance.
As covariance doesn't have a fixed range like correlation, we end up using correlation
rather than covariance.
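A small sketch of why correlation is preferred: covariance changes when the units change, while correlation does not. The data and function names below are made up for illustration, and the sample (n-1) form of covariance is assumed.

```python
from statistics import mean, stdev

def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance standardized by both standard deviations."""
    return covariance(x, y) / (stdev(x) * stdev(y))

# Hypothetical data: the same heights expressed in metres and in centimetres
height_m = [1.5, 1.6, 1.7, 1.8]
height_cm = [h * 100 for h in height_m]
weight_kg = [50, 58, 66, 80]

print(covariance(height_m, weight_kg), covariance(height_cm, weight_kg))
print(correlation(height_m, weight_kg), correlation(height_cm, weight_kg))
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged, which is what makes correlation comparable across variable pairs.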

29. What is R Square? What is the significance of squaring them? (and then taking Sqrt to
get the Correlation coefficient).

R square describes the explained variation. Taking the square root of R square gives you the
measure of correlation. Variation is calculated as the sum of (x - x-bar)^2. If you don't
square the terms, each one is just the deviation of a sample value from the mean, and when
you sum them up, the positives and negatives cancel out to give a summation of 0. Hence, we
square them to get the variation.

30. How do we identify the best solution for a business case? I understand that it depends
on the inputs available and the output required. The question is how we arrive at the
understanding that we should apply test x, y or z to get the results.

Two samples: T-test. More than two samples: ANOVA. If you are trying to understand the
impact of multiple independent variables on a dependent variable: Regression.

31. If the independent variables are highly correlated with each other, it is not
recommended to do multiple regression analysis. But in some cases the independent variables
are highly correlated with each other, so what do we do then?

In real-world scenarios, multicollinearity exists, wherein many independent variables
are highly correlated amongst themselves.

In that situation, we treat this by performing several iterations of regression. In each
iteration, the aim is to include one of the correlated variables and exclude the others, and
while doing so check the p-value, adjusted R-square etc. to identify the best regression line.

32. In descriptive analysis, why are the variance and standard deviation from the Data
Analysis tab different from the values calculated manually? Kindly explain.

The answers do not differ. Please remember to divide the sum of squared deviations from
the mean by (n-1) not n, the sample size. Also, look for standard deviation in the output, not
standard error.
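The two usual causes of a mismatch, dividing by n instead of n-1 and reading the standard error instead of the standard deviation, can be seen side by side in this sketch. The data are made up for illustration.

```python
from statistics import pvariance, stdev, variance

# Hypothetical column of data
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_variance = variance(data)        # divides by n - 1
population_variance = pvariance(data)   # divides by n
standard_deviation = stdev(data)        # square root of the sample variance
standard_error = standard_deviation / len(data) ** 0.5

print(sample_variance, population_variance)
print(standard_deviation, standard_error)
```

If a manual calculation matches population_variance or standard_error, it is being compared against the wrong quantity.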

33. What if, for a given column, two different values are tied as the most repeated? Say 600
repeats 5 times and 400 also repeats 5 times. How does descriptive analysis pick the mode?

Whichever one reaches the maximum count first is the mode reported by Excel in this case.
The mode of 1,3,3,5,4,4 is reported as 3 and the mode of 1,4,4,5,3,3 is reported as 4.
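Python's standard library behaves the same way on these two examples, which makes it a convenient way to experiment with ties. This sketch assumes Python 3.8 or later (earlier versions raise an error on ties), and it is not a claim that Python and Excel use an identical rule in every case.

```python
from statistics import mode, multimode

first = mode([1, 3, 3, 5, 4, 4])    # tie between 3 and 4; 3 is encountered first
second = mode([1, 4, 4, 5, 3, 3])   # same values, but 4 is encountered first
all_modes = multimode([1, 3, 3, 5, 4, 4])  # lists every tied value

print(first, second, all_modes)
```

When a tie matters for the business question, listing all tied values is safer than relying on whichever one happens to be reported.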

34. In the variance formula, it was suggested to pick the X_i value from a certain range
before or after the mean. When X_i - X_bar was calculated manually in the lesson, I noticed
that the X_i value picked was X_1. How does one decide which X_i value to use when doing it
manually?

All the X_i values are present in the formula. The formula (X_i - X_bar)^2 can be applied to
X_1 and then dragged down, so that it is applied to the other cells (the other X_i) as well.

35. How is variance a measure of average dispersion?

In variance, dispersion is first calculated as the square of the difference from the mean for
each observation. Then, these quantities are averaged over all the observations (in the
average, n-1 is taken instead of n). In this sense, variance is a measure of average
dispersion.

36. I understood that it is not real currency when you measure the variance of a currency,
and we also talk about the variance of height in sq inches and sq cm. Are there any other
units of measurement to know besides these two, and how do we represent them in Excel?
Kindly elaborate.

The variance of a currency is similarly in squared units, e.g. square-Rs or square-$, even
if that does not make sense in terms of the currency. If the observations have units, the
variance has squared units. To avoid this problem, we work with the standard deviation, the
square root of the variance, which is in the original data units. Excel is a calculator, so
it does not use any units in general. There are specific currency calculators in Excel, but
we have not implemented statistics in those.

37. In lesson 13 we talk about one standard deviation, two standard deviations and also
three. Why do we have multiple standard deviations? Are we talking about the same single
column of data having multiple standard deviations in that lesson? I am confused.

Within the same data, we can talk of one, two, or three (or however many) standard
deviations from the mean. The mean value m is the same, and the standard deviation value s
is the same. One standard deviation from the mean refers to the interval m-s to m+s. Two
standard deviations from the mean refers to the interval m-2s to m+2s. Three standard
deviations from the mean refers to the interval m-3s to m+3s.

38. DESCRIPTIVE STATISTICS CALCULATION of standard deviation and variance


The formulas are as provided in the videos.

39. Mode in central tendency of descriptive statistics


The mode is the value that occurs the highest number of times in a column of data.
(Also see Q42.)

40. When calculating variance in descriptive statistics, what is X_i in the formula?

X_i in the formula refers to all the data points. X_1 is the first data point, X_2 is the
second data point, and X_n is the last data point (if there are n data points). Within a
summation formula, we replace X_i with each data point in turn and then add up whatever term
the formula requires.

41. Units for variance


Square of the unit for the data. For example, if the column of data is cm, then variance is in
sq-cm. If the data has no units, for example a count of the number of defects, then the
variance also has no units and is dimensionless.

42. Please clarify mean, median and mode in terms of their usefulness and application.
For given sample data, how do we identify whether the mean or the median is useful?

This depends on the application the measures are being put to. Take the measurement of
income as an example: the mean is the "per capita income of India", the median is the
"income of the average Indian", and the mode is "the most common level of income". These
have their own uses.

43. When the mean is higher than the median and mode, can I conclude that the sample data
contains outliers, or is an outlier altogether a different concept?

It means that there are some high values that are not balanced by corresponding low
values. These may be outliers, or they may represent natural skewness (positive skewness)
in the data. Take income as an example: high incomes may represent outliers, or they may
represent the natural fact that a few people can have much higher income than others.