Video Questions


Video Questions # 9A, 9B, 9C, 13 - 17

VANGUARDIA PROGRAM

Video Questions Segment 9 (A, B, C)

• When testing a hypothesis statistically, what do we mean by one-tailed or
by two-tailed testing?

In a one-tailed test, the alternative hypothesis specifies a direction: we test
only whether a parameter is higher than a specific value, or only whether it is
lower. In a two-tailed test, the alternative hypothesis has no direction: we
test whether the parameter differs from the specified value in either
direction, so the rejection region is split between both tails of the
distribution.
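
As a minimal sketch of the difference, the snippet below converts a test
statistic into a one-tailed and a two-tailed p-value under a standard normal
distribution. The value z = 1.80 is a hypothetical illustration and is not
taken from the example data set.

```python
from statistics import NormalDist

z = 1.80  # hypothetical test statistic, for illustration only
std_normal = NormalDist()  # mean 0, standard deviation 1

# One-tailed (upper tail): probability of a value at least this large.
p_one_tailed = 1 - std_normal.cdf(z)
# Two-tailed: probability of a value at least this extreme in either direction.
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))

print(f"one-tailed p = {p_one_tailed:.4f}")
print(f"two-tailed p = {p_two_tailed:.4f}")
```

With these hypothetical numbers the one-tailed p-value falls below 5% while the
two-tailed p-value does not, which is why the choice has to be made before
looking at the data.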

• When testing a hypothesis statistically, what is the meaning of the
concept of ‘degrees of freedom’?

The degrees of freedom are the number of independent values that remain free
to vary once the quantities needed for the test have been estimated from the
data. They determine which distribution of the test statistic is used to test
the hypothesis (for example, n - 1 for a t-test on a single sample mean).

• When testing a hypothesis statistically, what do we mean by the
‘significance level’ for testing?

The significance level for testing is the probability that we accept of
rejecting the null hypothesis when it is in fact true. It is the threshold
against which the p-value of the test is compared: the null hypothesis is
rejected only when the p-value is at or below that level.

• What, for example, do we mean by a ‘certainty level of 95% or more’ or a
‘p-level of 5% or less’?

A certainty level of 95% means that the procedure used gives a correct
conclusion 95% of the time when the null hypothesis is true; for example, 95%
of the confidence intervals built in this way contain the population value. On
the other hand, a p-level of 5% or less means that, if the null hypothesis were
true, results at least as extreme as those observed would occur by chance in
5% of the cases or fewer.

• How is the level of significance for testing determined? Who sets this
level of significance?

The level of significance is set by the person who is doing the research,
before the test is carried out. It depends on the type of study and the topic,
and on how costly a wrong conclusion would be. It is the risk that you accept
of rejecting the null hypothesis while it is correct.

• What does the level of significance depend on?

The level of significance depends on the decision of the researcher, that is,
on the risk of wrongly rejecting a true null hypothesis that is acceptable for
the problem being studied.

• In statistical testing, what is meant by ‘Type I errors’ and ‘Type II errors’,
and by the ‘Power’ of a test?

A Type I error occurs when we reject the null hypothesis, concluding that
something is there, while in fact there is nothing there. A Type II error
occurs when we fail to reject the null hypothesis and so do not detect
something that is actually present. The power of a test is the probability
that it detects an effect that is really present, that is, 1 minus the
probability of a Type II error.
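
The relationship between the significance level, the Type II error and the
power can be sketched for a one-sided z test of a mean. This is an illustration
under stated assumptions: the population standard deviation is treated as
known, and the function name `power_one_sided_z` and all numbers are
hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of a one-sided z test of H0: mu = mu0 when the true mean is mu0 + delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)     # rejection threshold under H0
    shift = delta * sqrt(n) / sigma            # shift of the statistic under the true mean
    return 1 - std_normal.cdf(z_crit - shift)  # P(reject H0 | H0 is false)

# Hypothetical numbers: true effect 1.0, standard deviation 2.65, n = 30.
power = power_one_sided_z(delta=1.0, sigma=2.65, n=30)
beta = 1 - power  # probability of a Type II error
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

When `delta` is 0 the function returns `alpha` itself: with no real effect, the
only way to reject is a Type I error.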

• Explain in your own words what is the ‘Central Limit Theorem’ and what
are the useful implications of that theorem for studying sample means or
sample percentages when taking a random sample from a population.

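The central limit theorem states that, for random samples drawn from a
population with any probability distribution (with finite variance), the
distribution of the sample mean approaches a normal distribution as the sample
size grows. If the sample is large, the approximation is good; if n is less
than 30, the approximation is good only if the distribution of the population
is not too different from a normal distribution. The useful implication is
that the normal distribution can be used to build confidence intervals and
tests for sample means and sample percentages, whatever the distribution of
the population.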
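
A small simulation can illustrate the theorem. The sketch below draws repeated
samples from an exponential population (an assumption chosen because it is
strongly skewed) and compares the spread of the sample means with 1 / sqrt(n).
The function name `sample_means` and the sample sizes are hypothetical.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(n, repeats=1000):
    """Means of `repeats` random samples of size n from a skewed population."""
    # Exponential population with mean 1 and standard deviation 1 (far from normal).
    return [sum(random.expovariate(1.0) for _ in range(n)) / n for _ in range(repeats)]

for n in (5, 30, 100):
    means = sample_means(n)
    # Expect a centre near 1 and a spread near 1 / sqrt(n).
    print(n, round(mean(means), 3), round(stdev(means), 3), round(1 / n ** 0.5, 3))
```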

• In the context of the Law of Large Numbers, explain in words (or in maths)
the relationship between the size of the random sample drawn from a
population and the precision with which inferences can be made about
the characteristics of the population from which the sample is drawn.

As the size n of the random sample increases, the sample mean tends to the
population mean, and the standard error of the mean (the standard deviation
divided by the square root of n) decreases. Precision therefore improves with
the square root of the sample size: to halve the standard error, the sample
must be four times as large.

• In this context, a distinction is made between ‘independent’ samples and
‘dependent’ samples. Explain what is meant by this difference and why it
may be important to know if you are testing differences between
dependent or between independent observations.

Independent samples are made of unrelated observations, for example two
different groups of respondents. Dependent (paired) samples are made of
observations that are linked, for example the same respondents measured twice.
The two cases are tested in different ways, so it is important to know what
kind of data we are using: with dependent samples the test is carried out on
the differences within each pair, which removes the variation between
respondents and can make the test more sensitive.

• A distinction is made between the significance of a relationship and the
strength of that relationship. Explain or illustrate what is meant by this
distinction in the case of a correlation coefficient and in the case of a
(two-way or bivariate) contingency table.

The significance of a relationship tells us whether the relationship observed
in the sample is likely to exist in the population or could be due to chance;
the strength tells us how closely the variables are related. For a correlation
coefficient, the strength is the value of the coefficient itself (between -1
and +1), while its significance also depends on the sample size: a weak
correlation can be significant in a large sample. For a contingency table, the
significance is given by the Chi-square test, while the strength is given by a
measure of association based on it or by the size of the differences between
the row or column percentages.
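
The distinction can be sketched with the usual t statistic for a correlation
coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2). The two cases below are
hypothetical, and the function name `t_for_correlation` is an illustration.

```python
from math import sqrt

def t_for_correlation(r, n):
    """t statistic for H0: population correlation = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

# Hypothetical cases: a weak correlation in a large sample, a strong one in a small sample.
weak_large = t_for_correlation(r=0.10, n=1000)
strong_small = t_for_correlation(r=0.60, n=8)
print(f"r = 0.10, n = 1000: t = {weak_large:.2f}")
print(f"r = 0.60, n = 8:    t = {strong_small:.2f}")
```

The weak correlation gives a t value above the two-tailed 5% threshold of
about 1.96, while the strong correlation stays below the tabulated threshold of
2.447 for 6 degrees of freedom: significance and strength are different things.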

Video Questions Segment 13

• Survey data are typically entered as a data matrix in a spreadsheet.
Considering such a data matrix, list some of the major questions that can
be asked / answered by data analysis.

Some of the questions that can be asked by data analysis are:


- How do you use data collected on a representative sample to describe the
population from which you took the sample?
- Are there differences between subgroups in the population from which you
took the sample?
- Can you find and identify subgroups in your sample/population?
- Are there pairwise relationships between some variables or characteristics
that you have measured?
- Are some groups of variables or characteristics that you measured related,
interdependent?
- Can you pool some of these into new variables with attractive measurement
properties?
- Can you compute and investigate dependence relationships between some of
your survey variables?
- Can you model both interdependence and dependence relationships among
the measured variables?
- How do you analyze similarity data, preference data and
interaction/relationship data?

• What are the problems that you may encounter with the quality of your
data? How do you analyze a data matrix for possible data problems? How
do you deal with such data problems?

Some of the problems with the quality of the data and their solutions are:

Data errors: unacceptable data values.
Data-analysis packages provide you with an overview of the values for each
variable, making unacceptable values apparent. If an error cannot be corrected,
deal with the datum as missing.

Outliers: extreme data values.
Outliers are extreme, but correct values of a variable. Because of their extreme
value, they have an evident impact on the result of your analyses. You must
decide what to do with outliers: deal with them as missing, retain them in your
data but use procedures to mitigate their impact on your results, or accept them
as valid.

Missing data.
Missing data are either accepted as missing, reducing your effective number of
observations, or replaced by ‘acceptable’ values. Missing data are best
reported with a specific code, reducing your effective number of observations in
one of two ways:
- Listwise deletion: eliminates the complete row from your data set,
- Pairwise deletion: eliminates only the missing datum, retaining the other data.
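
The two deletion strategies can be sketched as follows. The small data matrix
is hypothetical, `None` is used as the missing-data code, and the helper name
`pairwise` is an illustration.

```python
# Hypothetical data matrix: rows are respondents, None marks a missing datum.
rows = [
    {"x1": 1, "x2": 4, "x3": 2},
    {"x1": 0, "x2": None, "x3": 5},
    {"x1": 1, "x2": 3, "x3": None},
    {"x1": 0, "x2": 2, "x3": 3},
]

# Listwise deletion: keep only rows with no missing value at all.
listwise = [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, a, b):
    """Pairwise deletion: keep rows where both variables a and b are present."""
    return [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]

print(len(listwise))                    # rows usable for every analysis
print(len(pairwise(rows, "x1", "x2")))  # rows usable for the x1-x2 analysis
print(len(pairwise(rows, "x1", "x3")))  # rows usable for the x1-x3 analysis
```

Listwise deletion leaves fewer observations, but every analysis then uses the
same respondents; pairwise deletion keeps more data at the cost of different
analyses resting on different subsets.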

Video Questions Segment 14

• Discuss how you will give a description of categorical (nominal) variables
in your data set.

To describe categorical variables in a data set, we can use frequency tables
and frequency diagrams (bar charts), which show how the observations are
distributed over the categories, together with the mode. This description can
help to form hypotheses about the distribution in the population.

• Describe how you can test a hypothesis about the distribution of
observations over a number of categories.

A hypothesis about the distribution of observations over a number of categories
can be tested using a one-way Chi-square test with (k-1) degrees of freedom.
The Chi-square number is based on the (squared) discrepancies between the
observed frequencies and the frequencies that would be expected according to
the null hypothesis to be tested.

• Assume that you have made 500 observations, of which 150 belong to a
first category, 100 to a second, 160 to a third category and 90 to a fourth
category. Test the (null) hypothesis that the four categories actually may
have equal probability.

Category   Observed    Expected    Difference   Squared      Squared difference /
           frequency   frequency                difference   expected frequency
1          150         125         25           625          5
2          100         125         -25          625          5
3          160         125         35           1225         9.8
4          90          125         -35          1225         9.8

n = 500, Ho: expected frequency of 125 per category
chi2 = 29.6, df = 3, p = 1.68E-06
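
The computation in the table can be reproduced with a short script. This is a
sketch using only the Python standard library; the helper `chi2_sf` is a
hypothetical name for the upper-tail probability of the Chi-square
distribution, written here in its closed form for whole-number degrees of
freedom.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (closed form, integer df)."""
    h = x / 2
    if df % 2 == 0:
        terms = (h ** j / gamma(j + 1) for j in range(df // 2))
        return exp(-h) * sum(terms)
    terms = (h ** (j - 0.5) / gamma(j + 0.5) for j in range(1, (df - 1) // 2 + 1))
    return erfc(sqrt(h)) + exp(-h) * sum(terms)

observed = [150, 100, 160, 90]
expected = [sum(observed) / len(observed)] * len(observed)  # equal probability: 125 each

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi2_sf(chi2, df)
print(f"chi2 = {chi2:.1f}, df = {df}, p = {p_value:.2e}")
```

The p-value is far below 5%, so the hypothesis of equal probabilities for the
four categories is rejected.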

• Take variable x4 in the example data set. Test the hypothesis that the
numbers of the (ordered) response categories follow a Normal
distribution. Explain how you would carry out the test and carry out the
test.

To carry out the test, it is necessary to distribute the data into categories and
then the process can be numerical or graphical.

Numerical solution

Category   Category   Standard normal    Area under     Observed      Expected      Difference   Squared      Squared difference /
           mean       value of a         normal curve   frequencies   frequencies                difference   expected frequency
                      category border
<19        16         -1.055             14.25%         7             4             3            9.61         2.46
19≤x<25    22         -0.416             25.19%         9             4             5            23.04        5.49
25≤x<31    28         0.224              24.94%         7             5             2            2.56         0.47
31≤x<37    34         0.863              12.98%         3             6             -3           9.00         1.50
37≤x<43    40         1.503              10.18%         2             5             -3           7.84         1.63
43≤x<49    46         2.142              5.85%          1             3             -2           5.29         1.60
≥49        52         2.782                             1             2             -1           1.96         0.82
Sum                                                     30            30

chi2           13.98
df             6
p              0.011

mean           25.9
s. deviation   9.4
min            13.0
max            52.0
sum val        786

Graphical Solution

• Compute the new variable x3+x4+x6 in the data set. For that variable
find the median, the interquartile range and the range between the 10th
and 90th percentile value. Compare the median to the mean and the mode.

X3+X4+X6
Median                10.50
Mean                  11.07
Mode                  9.00
Interquartile range   4.00
10th percentile       8.10
90th percentile       15.00
min                   6.00
max                   17.00
Standard deviation    2.65
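
These descriptive statistics can be recomputed from the 30 values of x3+x4+x6
listed under Segment 15. The sketch below uses the Python `statistics` module;
quartiles and percentiles depend on the interpolation convention chosen (here
`method="inclusive"`, an assumption), so those figures may differ slightly
from the table.

```python
from statistics import mean, median, mode, quantiles, stdev

# Values of x3+x4+x6 for the 30 observations, as listed under Segment 15.
x = [15, 12, 11, 11, 13, 9, 13, 12, 17, 10, 13, 9, 11, 12, 10,
     17, 14, 10, 9, 8, 6, 10, 10, 8, 9, 9, 9, 15, 11, 9]

q1, q2, q3 = quantiles(x, n=4, method="inclusive")  # one of several quartile conventions

print("median", median(x))
print("mean", round(mean(x), 2))
print("mode", mode(x))
print("sample standard deviation", round(stdev(x), 2))
print("interquartile range", q3 - q1)
```

The mean lies above the median, which in turn lies above the mode, as expected
for a distribution with a longer tail towards high values.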

Video Questions Segment 15


• Compute the new variable x3+x4+x6 on the example data set. Compute
the standardized values of that new variable.

X3+X4+X6 Standardized values
15 1.48
12 0.35
11 -0.03
11 -0.03
13 0.73
9 -0.78
13 0.73
12 0.35
17 2.24
10 -0.40
13 0.73
9 -0.78
11 -0.03
12 0.35
10 -0.40
17 2.24
14 1.11
10 -0.40
9 -0.78
8 -1.16
6 -1.91
10 -0.40
10 -0.40
8 -1.16
9 -0.78
9 -0.78
9 -0.78
15 1.48
11 -0.03
9 -0.78

• Verify that the mean of the new variable is 0 and that the standard
deviation is 1.0

X3+X4+X6 (standardized values)
Mean                 0.00
Standard deviation   1.00
min                  -1.91
max                  2.24

• Verify that the correlation between the new variable and its standardized
values is equal to 1.0

X3+X4+X6 Standardized values Centered (X3+X4+X6) Centered Stand. Val. Cross product
15 1.48 3.93 1.48 5.84
12 0.35 0.93 0.35 0.33
11 -0.03 -0.07 -0.03 0.00
11 -0.03 -0.07 -0.03 0.00
13 0.73 1.93 0.73 1.41
9 -0.78 -2.07 -0.78 1.61
13 0.73 1.93 0.73 1.41
12 0.35 0.93 0.35 0.33
17 2.24 5.93 2.24 13.28
10 -0.40 -1.07 -0.40 0.43
13 0.73 1.93 0.73 1.41
9 -0.78 -2.07 -0.78 1.61
11 -0.03 -0.07 -0.03 0.00
12 0.35 0.93 0.35 0.33
10 -0.40 -1.07 -0.40 0.43
17 2.24 5.93 2.24 13.28
14 1.11 2.93 1.11 3.25
10 -0.40 -1.07 -0.40 0.43
9 -0.78 -2.07 -0.78 1.61
8 -1.16 -3.07 -1.16 3.55
6 -1.91 -5.07 -1.91 9.68
10 -0.40 -1.07 -0.40 0.43
10 -0.40 -1.07 -0.40 0.43
8 -1.16 -3.07 -1.16 3.55
9 -0.78 -2.07 -0.78 1.61
9 -0.78 -2.07 -0.78 1.61
9 -0.78 -2.07 -0.78 1.61
15 1.48 3.93 1.48 5.84
11 -0.03 -0.07 -0.03 0.00
9 -0.78 -2.07 -0.78 1.61
sum 0.00 0.00 76.89
Standard deviation 2.65 1.00
Correlation 1.0 (sum of cross products / ((n - 1) x 2.65 x 1.00) = 76.89 / 76.85)
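
The standardization and the two verifications can be sketched in a few lines.
The sample standard deviation (n - 1 denominator) is assumed, which matches
the value 2.65 in the table; the helper name `pearson` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

x = [15, 12, 11, 11, 13, 9, 13, 12, 17, 10, 13, 9, 11, 12, 10,
     17, 14, 10, 9, 8, 6, 10, 10, 8, 9, 9, 9, 15, 11, 9]

m, s = mean(x), stdev(x)      # sample standard deviation (n - 1 denominator)
z = [(v - m) / s for v in x]  # standardized values

def pearson(a, b):
    """Pearson correlation computed from centered cross products."""
    ma, mb = mean(a), mean(b)
    cross = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cross / sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))

print("mean of z:", round(mean(z), 2))
print("standard deviation of z:", round(stdev(z), 2))
print("correlation of x and z:", round(pearson(x, z), 2))
```

The correlation is 1 because standardizing is a linear transformation with a
positive slope, which does not change the relative position of any value.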

Video Questions Segment 16


• Take the new variable x° = x3+x4+x6 computed on the example data set
for the n = 30 observations. Compute the mean and the standard deviation
of x°. Test the 2-tailed hypothesis that the population mean of x° equals
10, using a 95% certainty level.

X3+X4+X6 (values)
Mean                           11,07
Standard deviation             2,65
min                            6,00
max                            17,00
Ho: mean                       10,00
Certainty level                95%
Student's t (critical value)   2,05
t (computed)                   2,20
Conclusion                     Reject Ho

The computed t value (2,20) exceeds the critical Student's t value (2,05) for a
certainty level of 95% and 29 degrees of freedom, so the null hypothesis should
be rejected.
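
The test can be reproduced from the 30 values of x3+x4+x6 listed under
Segment 15. This is a sketch: the critical value 2.045 is the tabulated
two-tailed 5% value of Student's t for 29 degrees of freedom, entered by hand
rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

x = [15, 12, 11, 11, 13, 9, 13, 12, 17, 10, 13, 9, 11, 12, 10,
     17, 14, 10, 9, 8, 6, 10, 10, 8, 9, 9, 9, 15, 11, 9]

mu0 = 10.0          # hypothesized population mean
t_critical = 2.045  # tabulated two-tailed 5% value, 29 degrees of freedom

standard_error = stdev(x) / sqrt(len(x))
t = (mean(x) - mu0) / standard_error
reject_h0 = abs(t) > t_critical
print(f"t = {t:.2f}, reject H0: {reject_h0}")
```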

• For the new variable x° as computed above, draw a 2-tailed 95%
confidence interval around the value of the sample mean.
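
A minimal sketch of how such an interval could be computed, assuming the usual
t-based formula (sample mean plus or minus t times the standard error) and the
tabulated value 2.045 for 29 degrees of freedom:

```python
from math import sqrt
from statistics import mean, stdev

x = [15, 12, 11, 11, 13, 9, 13, 12, 17, 10, 13, 9, 11, 12, 10,
     17, 14, 10, 9, 8, 6, 10, 10, 8, 9, 9, 9, 15, 11, 9]

t_critical = 2.045  # tabulated two-tailed 5% value, 29 degrees of freedom

margin = t_critical * stdev(x) / sqrt(len(x))
lower, upper = mean(x) - margin, mean(x) + margin
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```

The hypothesized value of 10 tested earlier lies just outside this interval,
which agrees with the rejection of Ho at the 95% certainty level.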

• Variable x1 has a sample mean of 50%. Draw a 95% confidence interval (2-
tailed) around this value of the mean.
• In the example above (n° 3), with a sample mean of 0.50, how large a
sample size do you need in order to obtain a 95% confidence interval of
+/- .03?

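The sample size needed to obtain a 95% confidence interval of +/- 0.03, taking
0.03 as twice the standard deviation of the sample percentage, is calculated
with the following expressions:

\[ \sigma = 0.03 / 2 = 0.015, \qquad \mu = 50\% \]
\[ 0.03 = 2 \sqrt{0.5 \cdot \frac{1 - 0.5}{n}} \]
\[ 0.015^2 = 0.5 \cdot \frac{1 - 0.5}{n} \]
\[ n = \frac{0.25}{0.015^2} \approx 1111, \text{ or roughly } 1100 \]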
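
The same calculation can be sketched as a small function. The name
`sample_size_for_proportion` is hypothetical; the default multiplier z = 2
follows the rounding used in the answer, and 1.96 is the more exact normal
value for 95%.

```python
from math import ceil

def sample_size_for_proportion(p, margin, z=2.0):
    """Smallest n for which z * sqrt(p * (1 - p) / n) does not exceed the margin."""
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size_for_proportion(0.5, 0.03))          # multiplier 2
print(sample_size_for_proportion(0.5, 0.03, z=1.96))  # multiplier 1.96
```

Because p = 0.5 maximizes p(1 - p), this is the most demanding case: any other
sample percentage would need a smaller sample for the same margin.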

• If the sample % is 50%, as in the data set for variable x1, then how likely is
it that the population percentage of x1 is actually less than 40%? (one-
tailed test)

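Using the data for x1, the standard deviation is 𝜎 = 0.51, so the standard
error of the sample percentage is 0.51/√30 ≈ 0.093. The value of 40% lies
about 1.07 standard errors below the sample percentage of 50%, so the
probability that the population percentage is less than 40% is about 15%.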
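
The calculation can be sketched with a normal approximation (an assumption;
a Student's t distribution with 29 degrees of freedom has heavier tails and
gives a slightly larger probability):

```python
from math import sqrt
from statistics import NormalDist

n = 30
sample_p = 0.50
sd = 0.51  # standard deviation of x1 quoted in the answer

standard_error = sd / sqrt(n)
z = (0.40 - sample_p) / standard_error
probability = NormalDist().cdf(z)  # one-tailed, normal approximation
print(f"z = {z:.2f}, probability = {probability:.2f}")
```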

Video Questions Segment 17


• In the example data set, variable x1 identifies two subgroups; 15
respondents belong to subsample ‘0’ and 15 belong to subsample ‘1’.
Assuming that these are two unrelated or ‘independent’ subsamples, test
the hypothesis that the population means to the two groups on variable
x4 are different (95% confidence, 2-tailed).
Obs.   X4 (x1)     Obs.   X4 (x2)
1      5           16     5
2      2           17     3
3      3           18     4
4      3           19     3
5      3           20     2
6      2           21     1
7      4           22     1
8      5           23     1
9      5           24     3
10     1           25     2
11     3           26     4
12     2           27     4
13     3           28     5
14     4           29     5
15     4           30     1

Variable   Mean   Standard deviation   Degrees of freedom   t      Probability
x1         3,27   1,18                 14,00
x2         2,93   1,48                 14,00
Combined          0,49                 28,00                0,68   0,50

The null hypothesis, x1-x2=0, cannot be rejected because the probability of the
test (50%) is far greater than the 5% significance level that corresponds to a
confidence of 95%.
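
The independent-samples test can be sketched from the X4 values of the two
subsamples. The sketch uses sample variances (n - 1 denominators) and the
tabulated two-tailed 5% value of Student's t for 28 degrees of freedom, 2.048,
entered by hand. It gives t of about 0.66, slightly below the 0,68 in the
table; the table's standard deviations are consistent with an n denominator,
which is an inference and is not stated in the source.

```python
from math import sqrt
from statistics import mean, variance

group_1 = [5, 2, 3, 3, 3, 2, 4, 5, 5, 1, 3, 2, 3, 4, 4]  # observations 1-15
group_2 = [5, 3, 4, 3, 2, 1, 1, 1, 3, 2, 4, 4, 5, 5, 1]  # observations 16-30

n1, n2 = len(group_1), len(group_2)
df = n1 + n2 - 2
# Pooled variance, using sample variances (n - 1 denominators).
pooled_var = ((n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)) / df
standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (mean(group_1) - mean(group_2)) / standard_error

t_critical = 2.048  # tabulated two-tailed 5% value, 28 degrees of freedom
print(f"t = {t:.2f}, df = {df}, reject H0: {abs(t) > t_critical}")
```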

• Now carry out the same test using a regression approach, i.e. regressing
variable x4 on a dummy variable to represent membership of group 2;
verify that the results of this approach are identical to those of the first
approach.
Obs.   X4 (x1)     Obs.   X4 (x2)
1      5,00        16     1,73
2      2,00        17     -0,27
3      3,00        18     0,73
4      3,00        19     -0,27
5      3,00        20     -1,27
6      2,00        21     -2,27
7      4,00        22     -2,27
8      5,00        23     -2,27
9      5,00        24     -0,27
10     1,00        25     -1,27
11     3,00        26     0,73
12     2,00        27     0,73
13     3,00        28     1,73
14     4,00        29     1,73
15     4,00        30     -2,27

Variable   Mean    Standard deviation   Degrees of freedom   t       Probability
x1         3,27    1,22                 14,00
x2         -0,33   1,53                 14,00
Combined           2,76                 28,00                -0,66   0,51

The test with the dummy variable gives a probability of 51%, close to the 50%
obtained in the previous analysis.
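
The regression approach can be sketched with ordinary least squares written
out by hand. The slope on the dummy variable equals the difference between the
two group means, and its t statistic is the pooled two-sample t statistic. The
variable names are illustrations.

```python
from math import sqrt
from statistics import mean

x4 = [5, 2, 3, 3, 3, 2, 4, 5, 5, 1, 3, 2, 3, 4, 4,   # group 1
      5, 3, 4, 3, 2, 1, 1, 1, 3, 2, 4, 4, 5, 5, 1]   # group 2
dummy = [0] * 15 + [1] * 15  # 1 marks membership of group 2

mx, my = mean(dummy), mean(x4)
sxx = sum((d - mx) ** 2 for d in dummy)
sxy = sum((d - mx) * (y - my) for d, y in zip(dummy, x4))
slope = sxy / sxx            # mean of group 2 minus mean of group 1
intercept = my - slope * mx  # mean of group 1

residual_ss = sum((y - intercept - slope * d) ** 2 for d, y in zip(dummy, x4))
df = len(x4) - 2
slope_se = sqrt(residual_ss / df / sxx)
t = slope / slope_se
print(f"slope = {slope:.2f}, t = {t:.2f}, df = {df}")
```

The sign of t is negative only because the dummy codes group 2 as 1; the size
of the statistic and the degrees of freedom are those of the two-sample test.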

• Carry out the problem for n° 1 above, assuming that the 15 observations
are paired (dependent). Compare the significance level of this test with
that of the test for problem n° 1.
Obs.   X4 (x1)     Obs.   X4 (x2)
1      5,00        16     1,73
2      2,00        17     -0,27
3      3,00        18     0,73
4      3,00        19     -0,27
5      3,00        20     -1,27
6      2,00        21     -2,27
7      4,00        22     -2,27
8      5,00        23     -2,27
9      5,00        24     -0,27
10     1,00        25     -1,27
11     3,00        26     0,73
12     2,00        27     0,73
13     3,00        28     1,73
14     4,00        29     1,73
15     4,00        30     -2,27

Variable   Mean    Standard deviation   Degrees of freedom   t      Probability
x1         3,27    1,18                 14,00
x2         -0,33   1,48                 14,00
Combined           1,48                 28,00                1,23   0,23
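
For comparison, a paired test on the raw X4 values can be sketched as follows.
It assumes that observation i is paired with observation i + 15, which the
source does not state, and it uses the sample standard deviation of the 15
differences with 14 degrees of freedom; its result therefore need not coincide
with the table above.

```python
from math import sqrt
from statistics import mean, stdev

group_1 = [5, 2, 3, 3, 3, 2, 4, 5, 5, 1, 3, 2, 3, 4, 4]  # observations 1-15
group_2 = [5, 3, 4, 3, 2, 1, 1, 1, 3, 2, 4, 4, 5, 5, 1]  # observations 16-30

# Assumed pairing: observation i with observation i + 15.
differences = [a - b for a, b in zip(group_1, group_2)]

df = len(differences) - 1
standard_error = stdev(differences) / sqrt(len(differences))
t = mean(differences) / standard_error

t_critical = 2.145  # tabulated two-tailed 5% value, 14 degrees of freedom
print(f"t = {t:.2f}, df = {df}, reject H0: {abs(t) > t_critical}")
```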
