
Allama Iqbal Open University Islamabad

Semester: Autumn, 2020

ASSIGNMENT No. 2

Course code: 8614


Course Title: Educational Statistics

Student Name: MUHAMMAD MURSLAIN


Father Name: MUKHTAR AHMED
Roll no: BY677681

Contact # 0313-3484356
Address: MOHALLA FARIDIA, P.O KHAS, GOGRAN, TEHSIL
& DISTRICT LODHRAN
Tutor Name: Zafar Iqbal
Tutor Contact # 0300-6816742

Q.1 Define hypothesis testing and the logic behind hypothesis testing.


ANS: What Is Hypothesis Testing?

Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population
parameter. The methodology employed by the analyst depends on the nature of the data used and the reason for the
analysis. Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data. Such data may
come from a larger population, or from a data-generating process. The word "population" will be used for both of
these cases in the following descriptions.

KEY TAKEAWAYS

Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data. The test provides evidence concerning the plausibility of the hypothesis, given the data. Statistical analysts test a hypothesis by measuring and examining a random sample of the population being analyzed.

How Hypothesis Testing Works

In hypothesis testing, an analyst tests a statistical sample, with the goal of providing evidence on the plausibility of the null hypothesis. Statistical analysts test a hypothesis by measuring and examining a random sample of the population being analyzed. All analysts use a random population sample to test two different hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis is usually a hypothesis of equality between population parameters; e.g., a null hypothesis may state that the population mean return is equal to zero. The alternative hypothesis is effectively the opposite of a null hypothesis; e.g., the population mean return is not equal to zero. Thus, the two are mutually exclusive, and only one can be true. However, one of the two hypotheses will always be true.

Four Steps of Hypothesis Testing

All hypotheses are tested using a four-step process:

The first step is for the analyst to state the two hypotheses so that only one can be right.

The next step is to formulate an analysis plan, which outlines how the data will be evaluated.

The third step is to carry out the plan and physically analyze the sample data.

The fourth and final step is to analyze the results and either reject the null hypothesis, or state that the null hypothesis is plausible, given the data.

Real-World Example of Hypothesis Testing

If, for example, a person wants to test that a penny has exactly a 50% chance of landing on heads, the null hypothesis would be that 50% is correct and the alternative hypothesis would be that 50% is not correct. Mathematically, the null hypothesis would be represented as Ho: P = 0.5. The alternative hypothesis would be denoted as "Ha" and be identical to the null hypothesis, except with the equal sign struck-through, meaning that it does not equal 50%. A random sample of 100 coin flips is taken, and the null hypothesis is then tested. If it is found that the 100 coin flips were distributed as 40 heads and 60 tails, the analyst would conclude that a penny does not have a 50% chance of landing on heads, would reject the null hypothesis, and would accept the alternative hypothesis. If, on the other hand, there were 48 heads and 52 tails, then it is plausible that the coin could be fair and still produce such a result. In cases such as this where the null hypothesis is "accepted," the analyst states that the difference between the expected results (50 heads and 50 tails) and the observed results (48 heads and 52 tails) is "explainable by chance alone."
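The coin-flip example can be reproduced in a few lines of Python. The sketch below uses a normal-approximation z-test for a proportion (an exact binomial test would be a reasonable alternative); the helper function name is our own, not a library API.

```python
# A minimal sketch of the coin-flip test, assuming SciPy is installed.
import math
from scipy.stats import norm

def proportion_ztest_twosided(successes, n, p0=0.5):
    """Two-sided z-test for H0: p = p0, using the normal approximation."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    z = (p_hat - p0) / se
    return 2 * norm.sf(abs(z))          # two-sided p-value

print(proportion_ztest_twosided(40, 100))  # ~0.046: reject H0 at alpha = 0.05
print(proportion_ztest_twosided(48, 100))  # ~0.689: fail to reject H0
```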
Part b: The Logic of Hypothesis Testing

As just stated, the logic of hypothesis testing in statistics involves four steps. State the Hypothesis: We state a hypothesis (guess) about a population. Usually the hypothesis concerns the value of a population parameter. Define the Decision Method: We define a method to make a decision about the hypothesis. The method involves sample data. Gather Data: We obtain a random sample from the population. Make a Decision: We compare the sample data with the hypothesis about the population. Usually we compare the value of a statistic computed from the sample data with the hypothesized value of the population parameter. If the data are consistent with the hypothesis, we conclude that the hypothesis is reasonable. NOTE: We do not conclude it is right, but reasonable! AND: We actually do this by rejecting the opposite hypothesis (called the NULL hypothesis). More on this below. If there is a big discrepancy between the data and the hypothesis, we conclude that the hypothesis was wrong.
We expand on those steps in this section:

First Step: State the Hypothesis. Stating the hypothesis actually involves stating two opposing hypotheses about the value of a population parameter. Example: Suppose we are interested in the effect of prenatal exposure to alcohol on the birth weight of rats. Also, suppose that we know that the mean birth weight of the population of untreated lab rats is 18 grams. Here are the two opposing hypotheses:

The Null Hypothesis (Ho). This hypothesis states that the treatment has no effect. For our example, we formally state: The null hypothesis (Ho) is that prenatal exposure to alcohol has no effect on the birth weight for the population of lab rats. The birth weight will be equal to 18 grams. This is denoted Ho: μ = 18.

The Alternative Hypothesis (H1). This hypothesis states that the treatment does have an effect. For our example, we formally state: The alternative hypothesis (H1) is that prenatal exposure to alcohol has an effect on the birth weight for the population of lab rats. The birth weight will be different than 18 grams. This is denoted H1: μ ≠ 18.

Second Step: Define the Decision Method. We must define a method that lets us decide whether the sample mean is different from the hypothesized population mean. The method will let us conclude whether to reject the null hypothesis or to not reject it; that is, whether the treatment (prenatal alcohol) has an effect (on birth weight). We will go into details later.

Third Step: Gather Data. Now we gather data. We do this by obtaining a random sample from the population. Example: A random sample of rats receives daily doses of alcohol during pregnancy. At birth, we measure the weight of the sample of newborn rats. The weights, in grams, are shown in the table. We calculate the mean birth weight. Experiment 1: Sample Mean = 13 grams.

Fourth Step: Make a Decision. We make a decision about whether the mean of the sample is consistent with our null hypothesis about the population mean. If the data are consistent with the null hypothesis, we conclude that the null hypothesis is reasonable. Formally: we do not reject the null hypothesis. If there is a big discrepancy between the data and the null hypothesis, we conclude that the null hypothesis was wrong. Formally: we reject the null hypothesis. Example: We compare the observed mean birth weight with the hypothesized value, under the null hypothesis, of 18 grams. If a sample of rat pups which were exposed to prenatal alcohol has a birth weight "near" 18 grams, we conclude that the treatment does not have an effect. Formally: We do not reject the null hypothesis that prenatal exposure to alcohol has no effect on the birth weight for the population of lab rats. If our sample of rat pups has a birth weight "far" from 18 grams, we conclude that the treatment does have an effect. Formally: We reject the null hypothesis that prenatal exposure to alcohol has no effect on the birth weight for the population of lab rats. For this example, we would probably decide that the observed mean birth weight of 13 grams is "different" than the value of 18 grams hypothesized under the null hypothesis. Formally: We reject the null hypothesis that prenatal exposure to alcohol has no effect on the birth weight for the population of lab rats.
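A minimal sketch of the fourth step, using a one-sample t-test in Python. The weights below are hypothetical, since the assignment's data table is not reproduced in the text (only the sample mean of 13 grams is given).

```python
# One-sample t-test of H0: mu = 18 grams, assuming SciPy is installed.
from scipy import stats

weights = [12, 14, 11, 13, 12, 15, 13, 14]  # assumed illustrative data, mean = 13
t_stat, p_value = stats.ttest_1samp(weights, popmean=18)
print(t_stat, p_value)  # a large negative t and a tiny p-value: reject H0
```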
Q.2 Explain types of ANOVA. Describe possible situations in which each type should be used.
ANS: An ANOVA test is a way to find out if survey or experiment results are significant. In other words, it helps you to figure out if you need to reject the null hypothesis or accept the alternate hypothesis. Basically, you're testing groups to see if there's a difference between them. Examples of when you might want to test different groups: A group of psychiatric patients are trying three different therapies: counseling, medication and biofeedback. You want to see if one therapy is better than the others. A manufacturer has two different processes to make light bulbs. They want to know if one process is better than the other. Students from different colleges take the same exam. You want to see if one college outperforms the other.

What Does "One-Way" or "Two-Way" Mean?

One-way or two-way refers to the number of independent variables (IVs) in your Analysis of Variance test. One-way has one independent variable (with 2 levels). For example: brand of cereal. Two-way has two independent variables (each can have multiple levels). For example: brand of cereal, calories.

What are "Groups" or "Levels"?

Groups or levels are different groups within the same independent variable. In the above example, your levels for "brand of cereal" might be Lucky Charms, Raisin Bran, Cornflakes: a total of three levels. Your levels for "calories" might be sweetened, unsweetened: a total of two levels. Let's say you are studying if an alcoholic support group and individual counseling combined is the most effective treatment for lowering alcohol consumption. You might split the study participants into three groups: Medication only, Medication and counseling, Counseling only. Your dependent variable would be the number of alcoholic beverages consumed per day.
If your groups or levels have a hierarchical structure (each level has unique subgroups), then use a nested ANOVA for the analysis.

What Does "Replication" Mean?

It's whether you are replicating (i.e. duplicating) your test(s) with multiple groups. With a two-way ANOVA with replication, you have two groups and individuals within that group are doing more than one thing (i.e. two groups of students from two colleges taking two tests). If you only have one group taking two tests, you would use a two-way ANOVA without replication.

Types of Tests

There are two main types: one-way and two-way. Two-way tests can be with or without replication. One-way ANOVA between groups: used when you want to test two groups to see if there's a difference between them. Two-way ANOVA without replication: used when you have one group and you're double-testing that same group. For example, you're testing one set of individuals before and after they take a medication to see if it works or not. Two-way ANOVA with replication: two groups, and the members of those groups are doing more than one thing. For example, two groups of patients from different hospitals trying two different therapies.

One-Way ANOVA

A one-way ANOVA is used to compare two means from two independent (unrelated) groups using the F-distribution. The null hypothesis for the test is that the two means are equal. Therefore, a significant result means that the two means are unequal. Examples of when to use a one-way ANOVA: Situation 1: You have a group of individuals randomly split into smaller groups and completing different tasks. For example, you might be studying the effects of tea on weight loss and form three groups: green tea, black tea, and no tea. Situation 2: Similar to situation 1, but in this case the individuals are split into groups based on an attribute they possess. For example, you might be studying leg strength of people according to weight. You could split participants into weight categories (obese, overweight and normal) and measure their leg strength on a weight machine.

Limitations of the One-Way ANOVA

A one-way ANOVA will tell you that at least two groups were different from each other. But it won't tell you which groups were different. If your test returns a significant F-statistic, you may need to run a post hoc test (like the Least Significant Difference test) to tell you exactly which groups had a difference in means.
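A minimal sketch of the tea example as a one-way ANOVA in Python, assuming SciPy is installed. The weight-loss values (kg) are hypothetical, invented for illustration.

```python
# One-way ANOVA across three independent groups.
from scipy.stats import f_oneway

green_tea = [3.2, 2.8, 4.1, 3.6, 2.9]
black_tea = [2.1, 2.5, 1.8, 2.9, 2.2]
no_tea    = [0.8, 1.2, 0.5, 1.1, 0.9]

f_stat, p_value = f_oneway(green_tea, black_tea, no_tea)
print(f_stat, p_value)  # a small p-value says at least two group means differ,
                        # but not which ones: that needs a post hoc test
```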
Two-Way ANOVA

A two-way ANOVA is an extension of the one-way ANOVA. With a one-way, you have one independent variable affecting a dependent variable. With a two-way ANOVA, there are two independent variables. Use a two-way ANOVA when you have one measurement variable (i.e. a quantitative variable) and two nominal variables. In other words, if your experiment has a quantitative outcome and you have two categorical explanatory variables, a two-way ANOVA is appropriate. For example, you might want to find out if there is an interaction between income and gender for anxiety level at job interviews. The anxiety level is the outcome, or the variable that can be measured. Gender and income are the two categorical variables. These categorical variables are also the independent variables, which are called factors in a two-way ANOVA.

The factors can be split into levels. In the above example, income level could be split into three levels: low, middle and high income. Gender could be split into three levels: male, female, and transgender. Treatment groups are all possible combinations of the factors. In this example there would be 3 x 3 = 9 treatment groups.

Main Effect and Interaction Effect
The results from a two-way ANOVA will calculate a main effect and an interaction effect. The main effect is similar to a one-way ANOVA: each factor's effect is considered separately. With the interaction effect, all factors are considered at the same time. Interaction effects between factors are easier to test if there is more than one observation in each cell. For the above example, multiple stress scores could be entered into cells. If you do enter multiple observations into cells, the number in each cell must be equal. Two null hypotheses are tested if you are placing one observation in each cell. For this example, those hypotheses would be: H01: All the income groups have equal mean stress. H02: All the gender groups have equal mean stress. For multiple observations in cells, you would also be testing a third hypothesis: H03: The factors are independent, or the interaction effect does not exist. An F-statistic is computed for each hypothesis you are testing.

Assumptions for Two-Way ANOVA: The population must be close to a normal distribution. Samples must be independent. Population variances must be equal. Groups must have equal sample sizes.
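A minimal sketch of the income x gender stress example as a two-way ANOVA, using statsmodels. The stress scores are hypothetical, with two observations per cell (equal cell sizes, as required above).

```python
# Two-way ANOVA with interaction, via an OLS model and an ANOVA table.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "income": ["low"] * 4 + ["middle"] * 4 + ["high"] * 4,
    "gender": ["male", "male", "female", "female"] * 3,
    "stress": [7, 8, 6, 7, 5, 6, 4, 5, 3, 4, 2, 3],
})

model = ols("stress ~ C(income) * C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # two main effects plus the interaction
```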
MANOVA" MANOVA is just an witli several t-Tiruchen vihribiu llur to many other tests. and experiments in the
poll is to find out these G i ... vour dependent variable is changed by manipulating the independent variable. The test
helps to answer many research questions, includin345-7308411 Do changes to the independely DESTIVe
statalsienincant effects on dependent varinbles? What are the interactions among dependent variables? What are the
interactions among independent variables? MANOVA ExampleSuppose you wanted to find out if a difference in
textbooks affected students scores in math and science Improvements in math and science means that there are two
dependent variables, so a MANOVA IS appropriate An ANOVA will give you a single univariate) f-value while a
MANOVA will give you a multivariate F value. MANOVA tesis the multiple dependent variables by creating new.
artificial, dependent yanables that maximize group differences. These new dependent variables are lincar
combinations of the measured dependent variables. Interpreting the MANOVA results If the multivannte F value
indicates the texts tautically simnificant, this means that something is significant. In the above eam, you would not
know if math scores have improved, science scores have improved for both. Ance you have a simnificant result you
would then have to look at each individual common the univariate F tests to see which dependent variables)
contributed to the statistically significant result. Advantages and Disadvantares of MANOVAMS. ANOVA
Advantages MANOVA enables you to test multiple doncnden mub MANOVA can protect against Time I crrors.
Disadvantages MANOVA is many times more complicated than ANOVA Making it allence to see which
independent variables ite affectinu denenden van het One degree of freedom is lost with the addition of
cal variable The dependent variables should be upcomited Es much as possible. They are correlated, the loss in
decreto freedom means that thing in mind advanliges in icluding Once than one dependent var ute on the test.
Reference (SFSU) Back to Top What is Factorial ANY
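A minimal sketch of the textbook example as a MANOVA, using statsmodels. The math and science scores are hypothetical.

```python
# MANOVA with two dependent variables and one grouping factor.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.DataFrame({
    "textbook": ["A"] * 5 + ["B"] * 5,
    "math":     [70, 72, 68, 75, 71, 80, 83, 79, 85, 82],
    "science":  [65, 68, 64, 70, 66, 77, 80, 75, 82, 78],
})

mv = MANOVA.from_formula("math + science ~ textbook", data=df)
print(mv.mv_test())  # multivariate tests: Wilks' lambda, Pillai's trace, etc.
```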
What is Factorial ANOVA?

A factorial ANOVA is an analysis of variance test with more than one independent variable, or factor. It can also refer to more than one level of independent variable. For example, an experiment with a treatment group and a control group has one factor (the treatment) but two levels (the treatment and the control). The terms two-way and three-way refer to the number of factors or levels in your test. Four-way ANOVA and above are rarely used because the results are complex and difficult to interpret. A two-way ANOVA has two factors (independent variables) and one dependent variable. For example, time spent studying and prior knowledge are factors that affect how well you do on a test. A three-way ANOVA has three factors (independent variables) and one dependent variable. For example, time spent studying, prior knowledge and hours of sleep are factors that affect how well you do on a test. Factorial ANOVA is an efficient way of conducting a test. Instead of performing a series of experiments where you test one independent variable against one dependent variable, you can test all independent variables at the same time.

Variability: In a one-way ANOVA, variability is due to the differences between groups and the differences within groups. In factorial ANOVA, each level and factor are paired up with each other ("crossed"). This helps you to see what interactions are going on between the levels and factors. If there is an interaction, then the differences in one factor depend on the differences in another.

Let's say you were running a two-way ANOVA to test male/female performance on a final exam. The subjects had either 4, 6, or 8 hours of sleep. IV1: SEX (Male/Female). IV2: SLEEP (4/6/8). DV: Final Exam Score. A two-way factorial ANOVA would help answer the following questions:

Is sex a main effect? In other words, do men and women differ significantly on their exam performance?

Is sleep a main effect? In other words, do people who have 4, 6, or 8 hours of sleep differ significantly in their performance?

Is there a significant interaction between factors? In other words, how do hours of sleep and sex interact with regard to exam performance?

Can any differences in sex and exam performance be found in the different levels of sleep?

Assumptions of Factorial ANOVA: Normality: the dependent variable is normally distributed. Independence: observations and groups are independent of each other. Equality of variance: the population variances are equal across factors/groups.

How to Run an ANOVA: These tests are very time-consuming by hand. In nearly every case you'll want to use software. For example, several options are available in Excel: two-way ANOVA in Excel with replication and without replication; one-way ANOVA in Excel 2013. ANOVA tests in statistics packages are run on parametric data. If you have rank or ordered data, you'll want to run a non-parametric ANOVA (usually found under a different heading in the software, like "nonparametric tests").

Steps: It is unlikely you'll want to do this test by hand, but if you must, these are the steps you'll want to take: Find the mean for each of the groups. Find the overall mean (the mean of the groups combined). Find the Within Group Variation: the total deviation of each member's score from the Group Mean. Find the Between Group Variation: the deviation of each Group Mean from the Overall Mean. Find the F statistic: the ratio of Between Group Variation to Within Group Variation.

ANOVA vs. t Test: A Student's t-test will
tell you if there is a significant variation between groups. A t-test compares means, while the ANOVA compares variances between populations. You could technically perform a series of t-tests on your data. However, as the groups grow in number, you may end up with a lot of pair comparisons that you need to run. ANOVA will give you a single number (the F-statistic) and one p-value to help you support or reject the null hypothesis.

Repeated Measures ANOVA

A repeated measures ANOVA is almost the same as a one-way ANOVA, with one main difference: you test related groups, not independent ones. It's called Repeated Measures because the same group of participants is being measured over and over again. For example, you could be studying the cholesterol levels of the same group of patients at 1, 3, and 6 months after changing their diet. For this example, the independent variable is "time" and the dependent variable is "cholesterol." The independent variable is usually called the within-subjects factor.
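A minimal sketch of the cholesterol example as a repeated measures ANOVA, using statsmodels' AnovaRM. The cholesterol values are hypothetical.

```python
# Repeated measures ANOVA: the same patients measured at three time points.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

df = pd.DataFrame({
    "patient":     [1, 2, 3, 4] * 3,
    "month":       [1] * 4 + [3] * 4 + [6] * 4,   # the within-subjects factor
    "cholesterol": [220, 240, 210, 230, 205, 228, 200, 221, 195, 215, 190, 208],
})

res = AnovaRM(df, depvar="cholesterol", subject="patient", within=["month"]).fit()
print(res)  # F-test for the effect of time on cholesterol
```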
Repeated measures ANOVA is similar to a simple multivariate design. In both tests, the same participants are measured over and over. However, with repeated measures the same characteristic is measured under a different condition. For example, blood pressure is measured over the condition "time." With a simple multivariate design, it is a different characteristic that is measured each time. For example, you could measure blood pressure and heart rate over time.

Reasons to Use Repeated Measures ANOVA: When you collect data from the same participants over a period of time, individual differences (a source of between-group differences) are reduced. Testing is more powerful because the sample size isn't divided between groups. The test can be economical, as you're using the same participants.

Assumptions for Repeated Measures ANOVA: The results from your repeated measures ANOVA will be valid only if the following assumptions haven't been violated: There must be one independent variable and one dependent variable. The dependent variable must be a continuous variable, on an interval scale or a ratio scale. The independent variable must be categorical, either on the nominal scale or ordinal scale. Ideally, levels of dependence between pairs of groups should be equal ("sphericity"). Corrections are possible if this assumption is violated.

Repeated Measures ANOVA in SPSS: Steps. Step 1: Click "Analyze", then hover over "General Linear Model". Click "Repeated Measures".

Step 7: Click "Plots" and use the arrow keys to move the factor from the left box onto the Horizontal Axis box.

[Screenshot: Repeated Measures: Profile Plots dialog]

Step 9: Click "Options", then transfer your factors from the left box to the Display Means for box on the right. Step 10: Click the following check boxes: Compare main effects; Descriptive Statistics; Estimates of Effect Size. Step 11: Select "Bonferroni" from the drop-down menu under Confidence Interval Adjustment. Step 12: Click "Continue" and then click "OK" to run the test.

Sphericity

In statistics, sphericity (ε) refers to Mauchly's sphericity test, which was developed in 1940 by John W. Mauchly, who co-developed ENIAC, one of the first general-purpose electronic computers. Definition: Sphericity is used as an assumption in repeated measures ANOVA. The assumption states that the variances of the differences between all possible group pairs are equal. If your data violate this assumption, it can result in an increase in the Type I error rate (the incorrect rejection of the null hypothesis).

It's very common for repeated measures ANOVA to result in a violation of the assumption. If the assumption has been violated, corrections have been developed that avoid increases in the Type I error rate. The correction is applied to the degrees of freedom in the F-distribution.

Mauchly's Sphericity Test: Mauchly's test for sphericity can be run in the majority of statistical software, where it tends to be the default test for sphericity. Mauchly's test is sensitive to sample size: it may fail to detect sphericity in small samples, and it may over-detect in large samples. If the test returns a small p-value (p < .05), this is an indication that your data have violated the assumption. The following output from an SPSS repeated measures ANOVA shows that the significance level for Mauchly's test is .274, which suggests that the assumption has not been violated for this data.

Mauchly's Test of Sphericity (Within-Subjects Effect: TIME)

Mauchly's W = .691; Approx. Chi-Square = 2.588; df = 2; Sig. = .274
Epsilon: Greenhouse-Geisser = .764; Huynh-Feldt = .908; Lower-bound = .500

Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent variables is proportional to an identity matrix.
a. May be used to adjust the degrees of freedom for the averaged tests of significance. Corrected tests are displayed in the Tests of Within-Subjects Effects table.
b. Design: Intercept. Within Subjects Design: TIME.

Image: UVM.EDU

You would report the above result by stating that the assumption of sphericity had not been violated: χ²(2) = 2.588, p = .274. If your test returned a small p-value, you should apply a correction, using either the Greenhouse-Geisser correction or the Huynh-Feldt correction. When ε < 0.75 (or you don't know what the value for the statistic is), use the Greenhouse-Geisser correction. When ε > 0.75, use the Huynh-Feldt correction.

Q.3 What is the range of correlation coefficient? Explain strong, moderate and weak relationship.

ANS: The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. The values range between -1.0 and 1.0. A calculated number greater than 1.0 or less than -1.0 means that there was an error in the correlation measurement. A correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0 shows a perfect positive correlation. A correlation of 0.0 shows no linear relationship between the movement of the two variables. Correlation statistics can be used in finance and investing. For example, a correlation coefficient could be calculated to determine the level of correlation between the price of crude oil and the stock price of an oil-producing company, such as Exxon Mobil Corporation. Since oil companies earn greater profits as oil prices rise, the correlation between the two variables is highly positive.

Understanding the Correlation Coefficient

There are several types of correlation coefficients, but the one that is
most common is the Pearson correlation (r). This measures the strength and direction of the linear
relationship between two variables. It cannot capture nonlinear relationships between
two variables and cannot differentiate between dependent and independent variables. A value of exactly 1.0 means
there is a perfect positive relationship between the two variables. For a positive increase in one variable, there is
also a positive increase in the second variable. A value of -1.0 means there is a perfect negative relationship
between the two variables. This shows that the variables move in opposite directions - for a positive increase in one
variable, there is a decrease in the second variable. If the correlation between two variables is 0, there is no linear
relationship between them.
The strength of the relationship varies in degree based on the value of the correlation coefficient. For example, a value of 0.2 shows there is a positive correlation between two variables, but it is weak and likely unimportant. Analysts in some fields of study do not consider correlations important until the value surpasses at least 0.8. However, a correlation coefficient with an absolute value of 0.9 or greater would represent a very strong relationship.

Investors can use changes in correlation statistics to identify new trends in the financial markets, the economy, and stock prices. In such applications, correlation coefficient values with an absolute value less than 0.8 are often not considered significant.

Correlation Statistics and Investing: The correlation between two variables is particularly helpful when
significant. Correlation Statistics and Investing The correlation between two variables is particularly helpful when
investing in the financial markets. For example, a correlation can be helpful in determining how well a mutual
fund
performs relative to its benchmark index, or another fund or asset class. By adding a low or negatively
correlated mutual fund to an existing portfolio, the investor gains diversification benefits. In other words investors
can use negatively-correlated assets or securities to henge their portfolio and reduce market risk due to volatility or
wild price fluctuations. Many investors hedge the price risk of a portfolio, which effectively reduces any capital
gains or losses because they want the divideild decome or yield from the stock or security. Correlation statistics
asolows investors to determine when the corrokauon between two variables changes. For example-bank stocks
typically have a highls positive correlation to interest rates since loan rates ata pilto calculated based oncker interest
rates. If the stock price of a bank is falling while interest rates are rising, investors can glean that something's
askew. If the stock prices of similar banks in the sector are also rising, investors can conclude that the declining
bank stocle is not due to interest rates. Instead, the poorly-performing bank is likely dealing with an internal
fundamental issue. Correlation Coefficient Equation To calculate the Pearson product-zontent contato @musí tilst
determine the covariance of the two variables in question. Next, one must calculate each variable's standard
deviation.
The correlation coefficient is determined by dividing the covariance by the product of the two variables' standard deviations:

ρxy = Cov(x, y) / (σx σy)

where ρxy is the Pearson product-moment correlation coefficient, Cov(x, y) is the covariance of variables x and y, σx is the standard deviation of x, and σy is the standard deviation of y. Standard deviation is a measure of the dispersion of
data from its average. Covariance is a measure of how two variables change together, but its magnitude is
unbounded, so it is difficult to interpret. By dividing covariance by the product of the two standard deviations, one
can calculate the normalized version of the statistic. This is the correlation coefficient.

Part b: If we wish to label the strength of the association, for absolute values of r: 0 to 0.19 is regarded as very weak, 0.2 to 0.39 as weak, 0.40 to 0.59 as moderate, 0.6 to 0.79 as strong, and 0.8 to 1 as very strong correlation. However, these are rather arbitrary limits, and the context of the results should be considered. A short sketch applying these bands is shown below.
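A minimal sketch that computes Pearson's r with SciPy and labels its strength using the (admittedly arbitrary) bands above; label_strength is our own helper, and the data are hypothetical.

```python
# Compute r and map |r| onto the strength bands given above.
from scipy.stats import pearsonr

def label_strength(r):
    a = abs(r)
    if a < 0.2:
        return "very weak"
    if a < 0.4:
        return "weak"
    if a < 0.6:
        return "moderate"
    if a < 0.8:
        return "strong"
    return "very strong"

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 7, 8]   # hypothetical paired observations
r, p = pearsonr(x, y)
print(r, label_strength(r))
```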
Q.4 Explain chi square independence test. In what situation should it be applied?

ANS: Chi-Square Test of Independence

The Chi-Square test of independence is used to determine if there is a significant relationship between two nominal (categorical) variables. The frequency of each category for one nominal variable is compared across the categories of the second nominal variable. The data can be displayed in a contingency table where each row represents a category for one variable and each column represents a category for the other variable. For example, say a researcher wants to examine the relationship between gender (male vs. female) and empathy (high vs. low). The chi-square test of independence can be used to examine this relationship. The null hypothesis for this test is that there is no relationship between gender and empathy. The alternative hypothesis is that there is a relationship between gender and empathy (e.g., there are more high-empathy females than high-empathy males).

Calculate the Chi-Square Statistic by Hand: First we have to calculate the expected value of the two nominal variables: the expected count for a cell is its row total multiplied by its column total, divided by the overall sample size. After calculating the expected values, we apply the following formula to calculate the value of the chi-square test of independence:

χ² = Σ (O − E)² / E

where χ² is the chi-square test of independence statistic, O is the observed value of the two nominal variables, and E is the expected value of the two nominal variables. The degrees of freedom are calculated by using the following formula: DF = (r − 1)(c − 1), where DF = degrees of freedom, r = number of rows, and c = number of columns.

Hypotheses
Null hypothesis: Assumes that there is no association between the two variables. Alternative hypothesis: Assumes
that there is an association between the two variables. Hypothesis testing: Hypothesis testing for the chi-square test of independence proceeds as it does for other tests like ANOVA, where a test statistic is computed and compared to a critical value. The critical value for the chi-square statistic is determined by the level of significance (typically .05) and the degrees of freedom. The degrees of freedom for the chi-square are calculated using the following formula: df = (r − 1)(c − 1), where r is the number of rows and c is the number of columns. If the observed chi-square test statistic is greater than the critical value, the null hypothesis can be rejected.

Chi-Square Test for Independence: This lesson explains how to conduct a chi-square test for independence. The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables. For example, in an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether gender is related to voting preference. The sample problem at the end of the lesson considers this example.

When to Use the Chi-Square Test for Independence: The test procedure described in this lesson is appropriate when the following conditions are met: The sampling method is simple random sampling. The variables under study are each categorical. If sample data are displayed in a contingency table, the expected frequency count for each cell of the table is at least 5. This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.
State the Hypotheses: Suppose that Variable A has r levels, and Variable B has c levels. The null hypothesis states that knowing the level of Variable A does not help you predict the level of Variable B. That is, the variables are independent. Ho: Variable A and Variable B are independent. Ha: Variable A and Variable B are not independent. The alternative hypothesis is that knowing the level of Variable A can help you predict the level of Variable B. Note: Support for the alternative hypothesis suggests that the variables are related; but the relationship is not necessarily causal, in the sense that one variable "causes" the other.

Formulate an Analysis Plan: The analysis plan describes how to use sample data to accept or reject the null hypothesis. The plan should specify the following elements. Significance level: Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10; but any value between 0 and 1 can be used. Test method: Use the chi-square test for independence to determine whether there is a significant relationship between two categorical variables.

Analyze Sample Data: Using sample data, find the degrees of freedom, expected frequencies, test statistic, and the P-value associated with the test statistic. The approach described in this section is illustrated in the sample problem at the end of this lesson. Degrees of freedom: The degrees of freedom (DF) is equal to DF = (r − 1)(c − 1), where r is the number of levels for one categorical variable, and c is the number of levels for the other categorical variable. Expected frequencies: The expected frequency counts are computed separately for each level of one categorical variable at each level of the other categorical variable. Compute r × c expected frequencies, according to the following formula: Er,c = (nr × nc) / n, where Er,c is the expected frequency count for level r of Variable A and level c of Variable B, nr is the total number of sample observations at level r of Variable A, nc is the total number of sample observations at level c of Variable B,
and n is the total sample size. Test statistic: The test statistic is a chi-square random variable (χ²) defined by the following equation:

χ² = Σ [ (Or,c − Er,c)² / Er,c ]

where Or,c is the observed frequency count at level r of Variable A and level c of Variable B, and Er,c is the expected frequency count at level r of Variable A and level c of Variable B.
P-value: The P-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic is a chi-square, use the Chi-Square Distribution Calculator to assess the probability associated with the test statistic. Use the degrees of freedom computed above.

Interpret Results: If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level, and rejecting the null hypothesis when the P-value is less than the significance level.

Test Your Understanding. Problem: A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were classified by gender (male or female) and by voting preference (Republican, Democrat, or Independent). Results are shown in the contingency table below.
Voting Preferences

              Rep   Dem   Ind   Row total
Male          200   150    50         400
Female        250   300    50         600
Column total  450   450   100        1000

Is there a gender gap? Do the men's voting preferences differ
significantly from the women's preferences? Use a 0.05 level of significance.

Solution: The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work through those steps below.

State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis. Ho: Gender and voting preferences are independent. Ha: Gender and voting preferences are not independent.

Formulate an analysis plan. For this analysis, the significance level is 0.05. Using sample data, we will conduct a chi-square test for independence.

Analyze sample data. Applying the chi-square test for independence to the sample data, we compute the degrees of freedom, the expected frequency counts, and the chi-square test statistic. Based on the chi-square statistic and the degrees of freedom, we determine the P-value.

DF = (2 − 1)(3 − 1) = 2

Er,c = (nr × nc) / n
E1,1 = (400 × 450) / 1000 = 180000/1000 = 180
E1,2 = (400 × 450) / 1000 = 180000/1000 = 180
E1,3 = (400 × 100) / 1000 = 40000/1000 = 40
E2,1 = (600 × 450) / 1000 = 270000/1000 = 270
E2,2 = (600 × 450) / 1000 = 270000/1000 = 270
E2,3 = (600 × 100) / 1000 = 60000/1000 = 60

χ² = Σ [ (Or,c − Er,c)² / Er,c ]
χ² = (200 − 180)²/180 + (150 − 180)²/180 + (50 − 40)²/40 + (250 − 270)²/270 + (300 − 270)²/270 + (50 − 60)²/60
χ² = 400/180 + 900/180 + 100/40 + 400/270 + 900/270 + 100/60
χ² = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2

where DF is the degrees of freedom, r is the number of levels of gender, c is the number of levels of the voting preference, nr is the number of observations from level r of gender, nc is the number of observations from level c of voting preference, n is the number of observations in the sample, Er,c is the expected frequency count when gender is level r and voting preference is level c, and Or,c is the observed frequency count when gender is level r and voting preference is level c.

The P-value is the probability that a chi-square statistic having 2 degrees of freedom is more extreme than 16.2. We use the Chi-Square Distribution Calculator to find P(χ² > 16.2) = 0.0003.

Interpret results. Since the P-value (0.0003) is less than the significance level (0.05), we cannot accept the null hypothesis. Thus, we conclude that there is a relationship between gender and voting preference.
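A minimal sketch that reproduces the voting-preference example with SciPy's chi-square test of independence.

```python
# Chi-square test of independence on the contingency table above.
from scipy.stats import chi2_contingency

observed = [[200, 150, 50],    # male:   Rep, Dem, Ind
            [250, 300, 50]]    # female: Rep, Dem, Ind

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, dof, p)  # ~16.2 with 2 degrees of freedom, p ~ 0.0003
print(expected)      # matches the Er,c values computed above
```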

Q.5 Correlation is a prerequisite of Regression Analysis. Explain.

ANS: Establishing Correlation is a prerequisite for Linear Regression.

We can't use Linear Regression unless there is a Linear Correlation. The following compare-and-contrast table may help in understanding both concepts:

                               Correlation                       Linear Regression
Description                    Inferential statistics            Inferential statistics
Purpose                        Description (present or past)     Prediction, designed experiments
Statistic                      r                                 R, R2, R2-adjusted
Number of variables            Paired, 2                         2, 3, or more
Roles of variables             No differentiation between        1 or more independent x;
                               the variables                     1 dependent y = f(x)
Fits a line through the data   Implicitly                        Explicitly: y = a + bx
Cause and effect               Does not address                  Attempts to show

Reproduced by permission of John Wiley and Sons, Inc. from the book
Statistics from A to Z - Confusing Concepts Clarified

Correlation analysis describes the present or past situation. It uses Sample data to infer a property of the source
Population or Process. There is no looking into the future. The purpose of Linear Regression, on the other hand, is
to define a Model (a linear equation) which can be used
to predict the results of Designed Experiments.

Correlation mainly uses the Correlation Coefficient, r. Regression also uses r, but employs a variety of
other Statistics.

Correlation analysis and Linear regression both attempt to determine whether 2 Variables vary in synch.
Linear Correlation is limited to 2 Variables, which can be plotted on a 2-dimensional X-y graph. Linear
Regression can go to 3 or more Variables/ dimensions.

In Correlation, we ask to what degree the plotted data forms a shape that seems to follow an imaginary
line that would go through it. But we don't try to specify that line. In Linear Regression, that line is the whole point.
We calculate a best-fit line through the data: y= a + bx.

Correlation Analysis does not attempt to identify a Cause-Effect relationship. Regression does.
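A minimal sketch of the contrast just described, in Python with SciPy: pearsonr gives only r (no line is specified), while linregress explicitly fits the best-fit line y = a + bx. The data are hypothetical.

```python
# Correlation (one number) vs. regression (an explicit fitted line).
from scipy.stats import linregress, pearsonr

x = [1, 2, 3, 4, 5, 6, 7]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9]

r, _ = pearsonr(x, y)            # correlation: a single number, no line specified
fit = linregress(x, y)           # regression: an explicit line y = a + b*x
print(r)                         # close to +1: strong positive linear association
print(fit.intercept, fit.slope)  # a and b of the best-fit line
```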

In this section we will first discuss correlation analysis, which is used to quantify the association between two continuous variables (e.g., between an independent and a dependent variable or between two independent variables). Regression analysis is a related technique to assess the relationship between an outcome variable and one or more risk factors or confounding variables. The outcome variable is also called the response or dependent variable, and the risk factors and confounders are called the predictors, or explanatory or independent variables. In regression analysis, the dependent variable is denoted "y" and the independent variables are denoted by "x". [NOTE: The term "predictor" can be misleading if it is interpreted as the ability to predict even beyond the limits of the data. Also, the term "explanatory variable" might give an impression of a causal effect in a situation in which inferences should be limited to identifying associations. The terms "independent" and "dependent" variable are less subject to these interpretations, as they do not strongly imply cause and effect.]

Correlation Analysis: In correlation analysis, we estimate a sample correlation coefficient, more specifically the Pearson Product Moment correlation coefficient. The sample correlation coefficient, denoted r, ranges between -1 and 1 and quantifies the direction and strength of the linear association between the two variables. The correlation between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher levels of one variable are associated with lower levels of the other). The sign of the correlation coefficient indicates the direction of the association. The magnitude of the correlation coefficient indicates the strength of the association. For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.

LISA: [I find this description confusing. You say that the correlation coefficient is a measure of the "strength of association", but if you think about it, isn't the slope a better measure of association? We use risk ratios and odds ratios to quantify the strength of association, i.e., when an exposure is present, how many times more likely the outcome is. The analogous quantity in correlation is the slope, i.e., for a given increment in the independent variable, how many times is the dependent variable going to increase? And "r" (or perhaps better R-squared) is a measure of how much of the variability in the dependent variable can be accounted for by differences in the independent variable. The analogous measure for a dichotomous variable and a dichotomous outcome would be the attributable proportion, i.e., the proportion of Y that can be attributed to the presence of the exposure.]

It is important to note that there may be a non-linear association between two continuous variables,
but computation of a correlation coefficient does not detect this. Therefore, it is always important to evaluate
the data carefully before computing a correlation coefficient. Graphical displays are particularly useful to
explore associations between variables.
The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the
X-axis and the other along the Y-axis.

1. Strong, Positive
2. Weak, Positive
3. No correlation
4. Strong, Negative

Scenario 1 depicts a strong positive association (r = 0.9), similar to what we might see for the correlation between infant birth weight and birth length. Scenario 2 depicts a weaker association (r = 0.2) that we might expect to see between age and body mass index (which tends to increase with age). Scenario 3 might depict the lack of association (r approximately 0) between the extent of media exposure in adolescence and age at which adolescents initiate sexual activity. Scenario 4 might depict the strong negative association (r = -0.9) generally observed between the number of hours of exercise per week and percent body fat.

Example - Correlation of Gestational Age and Birth Weight: A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

Infant ID# Gestational Age (wks) Birth weight(gm)


1 34.7 1895
2 36.0 2030
3 29.3 1440
4 40.1 2835
5 35.7 3090
6 42.4 3827
7 40.3 3260
8 37.3 2690
9 40.9 3285
10 38.3 2920
11 38.5 3430
12 41.4 3657
13 39.7 3685
14 37.7 3345
15 41.1 3260
16 38.0 2680
17 38.7 2005

We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and gestational age is the independent variable. Thus y = birth weight and x = gestational age.

[Scatter plot: birth weight (grams) vs. gestational age (weeks)]

Each point represents an (x, y) pair (in this case the gestational age, measured in weeks, and the birth weight, measured in grams). Note that the independent variable is on the horizontal axis (or X-axis), and the dependent variable is on the vertical axis (or Y-axis). The scatter plot shows a positive or direct association between gestational age and birth weight. Infants with shorter gestational ages are more likely to be born with lower weights and infants with longer gestational ages are more likely to be born with higher weights.

The sample correlation coefficient is computed as r = Cov(x, y) / (sx sy), where Cov(x, y) is the sample covariance of x and y, and sx and sy are the sample standard deviations of x and y (the square roots of the sample variances). The variances of x and y measure the variability of the x scores and y scores around their respective sample means, considered separately. The covariance measures the variability of the (x, y) pairs around the mean of x and mean of y, considered simultaneously. To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and also the covariance of gestational age and birth weight. We first summarize the gestational age data. The mean gestational age is 650.1/17 = 38.24 weeks. To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below.
[Computation table not reproduced.] The resulting sample correlation coefficient indicates a strong positive correlation. As we noted, sample correlation coefficients range from -1 to +1. In practice, meaningful correlations (i.e., correlations that are clinically or practically important) can be as small as 0.4 (or -0.4) for positive (or negative) associations. There are also statistical tests to determine whether an observed correlation is statistically significant or not (i.e., statistically significantly different from zero). Procedures to test whether an observed sample correlation is suggestive of a statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller.
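A minimal sketch computing the sample correlation for the gestational age and birth weight data tabulated above, using SciPy.

```python
# Pearson correlation for the 17 infants in the table above.
from scipy.stats import pearsonr

gestational_age = [34.7, 36.0, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
                   38.3, 38.5, 41.4, 39.7, 37.7, 41.1, 38.0, 38.7]
birth_weight = [1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
                2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005]

r, p = pearsonr(gestational_age, birth_weight)
print(r, p)  # r is strongly positive, consistent with the scatter plot
```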
