The minimum number of subjects required for research on human health, thermal comfort and
productivity is a frequently asked question. In this paper the idea of power analysis, which helps to
determine the required sample size as well as to interpret research results, is introduced in order to
promote good practice of power analysis in research on the relationship between humans and the
building environment. How to calculate effect size from a published article or from experimental data
is presented with plenty of examples. The effect sizes of several physiological and psychological
measurements indicating the effect of indoor environment quality on human health, thermal comfort
and productivity are presented, which can serve as references when researchers plan their own studies.
How to determine the required sample size when planning a study and how to interpret research
results with power analysis are also illustrated step by step with examples. Finally, how to make
decisions when evaluating study results is summarized. It is expected that these examples and the
summary will help researchers to better apply power analysis in indoor environment quality (IEQ)
studies. Some statistical terms used in this paper, such as power analysis, effect size, and t-test, are
explained in detail in the Appendix.
effect of creatine on cognitive function in young adults, Rawson et al. (2008) mentioned "Sample size
was estimated based on data from McMorris et al. and conducted assuming a power (1 − β) of 0.80 and
α = 0.05." [7]. However, the lack of power analysis in the research area on interaction between humans
and the building environment may partly be due to a poor understanding of what it is, what it can tell
us and how it works. An understanding of statistical power supports two main applications [8]: (1) to
estimate the prospective power of a study, and (2) to estimate the parameters required to achieve
a desired level of statistical power for a proposed study. The aim of this paper is to promote good
practice of power analysis in studies on the relationship between humans and the building environment
by introducing the concept of power analysis and illustrating how to apply it. The required sample
size for some commonly used statistical tests and the expected effect size of some productivity
measurements are presented.

2. What is power analysis?

The statistical power of a study depends on two main factors: how big an effect (the effect size) the
research hypothesis predicts and how many subjects are in the study (the sample size) [9]. In any
study, the bigger the difference you expect between the two populations, that is, the greater the
effect size, the more statistical power in the study. As to the sample size, basically, the more people
there are in the study, the more statistical power. Sample size affects statistical power because the
larger the sample size, the smaller the standard deviation of the means. Statistical power is also
affected by the significance level chosen, whether a one-tailed or two-tailed test is used, and the
kind of hypothesis-testing procedure used. Further details on power analysis can be found in
Appendix A.

2.1. Types of power analysis

A priori and post hoc power analyses are the two most common types of power analysis. A priori power
analysis is usually used to determine the necessary sample size N of a study and provides an efficient
method of controlling statistical power before a study is actually conducted, and can therefore be
recommended whenever resources such as the time and money required for data collection are not
critical [10]. Post hoc analysis is less ideal than a priori analysis because only α is controlled,
not β [11]. A post hoc analysis can be used to assess whether or not a published statistical test in
fact had a fair chance of rejecting an incorrect null hypothesis. There is a third variant, compromise
power analysis, which can be useful both before and after data collection [10]. It provides
a pragmatic solution to the frequently encountered problem that the ideal sample size N calculated by
an a priori power analysis exceeds the available resources. In such a situation, a researcher could
specify the maximum affordable sample size and use a compromise power analysis to compute α and
1 − β associated with a P-value. Alternatively, if a study has already been conducted but has not yet
been analyzed, a researcher could ask for a reasonable decision criterion that guarantees perfectly
balanced error risks given the size of the sample and the critical effect size in which he or she is
interested. In this paper we will focus on the first two types of power analysis.
Post hoc power analysis should not be confused with the so-called retrospective power analysis, in
which the effect size is estimated from the sample data of the study and used to calculate the
observed power, a sample estimate of the true power [10]. Computer software such as SPSS readily
performs these calculations under the guise of "observed power".
One motivation for the use of retrospective power calculation is the desire to assess the strength of
evidence for a null hypothesis, something that standard hypothesis tests are not designed for [8].
But the observed power can never fulfill this goal [12]. A retrospective power calculation is done by
estimating the population effect size using the observed effect size among the sample data, so it is
based on the highly questionable assumption that the sample effect size is essentially identical to the
effect size in the population from which it was drawn [13]. Obviously, this assumption is likely to be
false, and the more so the smaller the sample. In fact, the observed power is determined completely by
the significance level of a test (P-value) and therefore adds nothing to the interpretation of
results [12]. One important thing to note here is that post hoc analysis, like a priori analysis,
requires researchers to specify the population effect size on a priori grounds.
However, retrospective power is frequently interpreted in much the same way as post hoc power
analysis. This is particularly problematic when retrospective power calculations are used to interpret
the results of a significance test [8]. The observed power is a mere function of the observed effect
size and hence of the observed P-value, so in general, statistical significance (lower P-value) will
result in high observed power and non-significance (P > 0.05, etc.) will result in low observed power.
If the observed power for non-significant results is used as an indication of the strength of evidence
for the null hypothesis, it will erroneously suggest that the lower a P-value the stronger the evidence
is in favor of the null hypothesis [12]. For significant results high observed power will act to
(falsely) strengthen the conclusions that the researcher has drawn. In either case retrospective
powers are highly undesirable [8].

2.2. The importance of sufficient statistical power

Statistical significance is extremely important, but sophisticated researchers and readers of research
understand that there is more to the story of a study's result than P < 0.05 or ns (not significant).
Following Cohen's (1962) pioneering work on the power of statistical tests in behavioral research [14],
many authors have stressed the necessity of statistical power analysis.
Sample size calculations and power analysis are often critical for researchers to address specific
scientific hypotheses and confirm credible treatment effects. Determining statistical power is very
important when planning a study. If you do a study in which the statistical power is low, even if the
research hypothesis is true, the study will probably not give statistically significant results. Thus,
the time and expense of carrying out the study would probably not be worthwhile [15]. Calculating
statistical power when planning a study helps to determine how many subjects are needed. If the sample
size is too small, the statistical tests may not be adequate to detect a difference that in reality is
there. The smaller the sample is or the smaller the true difference if it exists, the greater is the
probability of accepting the null hypothesis in error. The importance of this to experiments or
surveys should be obvious. If, for example, the indoor environment quality has negative effects on
occupants' health and productivity, but the effects are not obvious or not large at the early stage or
during a short investigation period, which is especially true for laboratory studies, the t-test or
analysis of variance (ANOVA) may lead the researcher to accept the null hypothesis and say that there
is no significant effect when in reality there is. The consequence could be serious, as the negative
effect of indoor environment quality on occupants' health should be prevented at its early stage.
Moreover, even a small productivity loss may result in a large economic loss, since the cost of the
people in an office is an
order of magnitude higher than the cost of maintaining and operating the building [16]. So with
inadequate statistical power, the work is likely to be not only a waste of time and energy, but also
misleading.
Understanding statistical power is also extremely important when looking at the results of a study,
particularly for making sense of results that are not statistically significant or results that are
statistically significant but not practically significant. The commonest factors which make a test
unable to detect a change include too few samples, too small a difference between the means, and
a large variation in the values making up the means. Thus it is necessary to check whether or not the
test could have shown a difference where the difference existed in reality. A statistically not
significant result from a study with high statistical power does suggest that either the research
hypothesis is false or that there is less of an effect than predicted. For example, suppose a study
investigating the effect of ventilation rate on human cognitive function is planned with a statistical
power of 90% for an estimated effect size of 0.5. The study then has a 90% chance of yielding
a statistically significant difference if the ventilation rate really affects cognitive function and
human productivity to that extent. Implicit in this is that if no statistically significant result is
found, an effect of 0.5 would have been missed with a probability of only 10%, so either the
ventilation rate really does not affect human cognitive function or the effect size is in fact less
than 0.5. In contrast, a statistically not significant result from a study with low statistical power
is truly inconclusive. However, as Cohen (1988) put it, 'the null hypothesis had been mistakenly not
rejected and a real effect was ignored because of inconclusive results – indeed, often treated as
nonexistent' [9].

3. Calculation of effect size

In hypothesis testing when a result has a small P-value, we say that it is "statistically
significant". In common usage, the word significant means "important". It is therefore tempting to
think that statistically significant results must always be important. This is not the case, as the
P-value does not measure practical significance. What it does measure is the degree of confidence we
can have that the true value is really different from the value specified by the null hypothesis. When
the P-value is small, then we can be confident that the true value is really different. This does not
necessarily imply that the difference is large enough to be of practical importance. Sometimes
statistically significant results do not have any scientific or practical importance. It is effect
size that measures the difference between the true value and the value specified by the null
hypothesis and hence indicates whether the difference is practically important. Cohen came up with
some effect size conventions based on the effects found in psychology and behavioral research in
general [9]. The effect size is discussed in detail in Appendix A.
In the practical setting the population values are typically not known and must be estimated from
sample statistics. There are several versions of effect size based on means, differing in which
statistics are used. Cohen's d, which is defined as the difference between two means divided by
a standard deviation for the data [9], is one of the commonly used measurements of effect size. This
paper provides some simple methods and examples to calculate Cohen's d for both t-tests and ANOVA from
experimental data as well as published research [17,18]. Cohen's d has two advantages over other
effect size measurements [18]. First, its burgeoning popularity is making it the standard. Thus, its
calculation enables immediate comparison to an increasingly large number of published studies. Second,
Cohen's suggestion for effect size conventions enables us to compare an experiment's effect size
results to known

3.1. Methods for effect size calculation

3.1.1. Calculation of effect size of t-tests

For the two types of t-test, Cohen (1988) defined the effect size of 0.2, 0.5, and 0.8 as small,
medium, and large, respectively [9]. Cohen's d is calculated with formula (1).

$$ d = \frac{\bar{x}_1 - \bar{x}_2}{S} \tag{1} $$

$$ S = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2}} \tag{1a} $$

where $\bar{x}$ = group mean, $S$ = standard deviation, $n$ = number of subjects, and subscripts 1 and
2 refer to the two groups.
To estimate Cohen's d from a published article, the means, number of subjects, as well as the standard
deviation of the two groups should be listed. For example, Kahl (2005) investigated the effect of room
temperature on mental task performance with 176 subjects (140 women and 36 men) [19]. The performance
(mean ± standard deviation) in the Reading task was 7.63 ± 3.17 for the male group and 6.04 ± 2.48 for
the female group. With these data, S is first calculated with formula (1a) as 2.618. Then Cohen's d is
obtained with formula (1). The result is 0.61, indicating that males performed better than females
with a medium size effect.
When an experiment that uses a t-test does not list standard deviations but does list standard errors
(SE), the standard deviations can be calculated from the number of subjects n as follows:

$$ S = SE \sqrt{n} \tag{2} $$

The t statistic can be used to calculate Cohen's d when a study that uses a t-test does not list the
standard deviation:

$$ d = t \sqrt{\frac{n_1 + n_2}{n_1 n_2} \cdot \frac{n_1 + n_2}{n_1 + n_2 - 2}} \tag{3} $$

In this situation, the t statistic and the number of subjects within each group should be listed.
Usually the published article will show the t statistic. Here is another example. Raymore et al.
(2001) compared the socioeconomic index (SEI) of students who went away from home to go to college
versus those who stayed at home. Here is an excerpt from their results section: "…females who had left
home were from higher SEI homes (N = 115) than college females who had not left home (N = 74)
(t = 4.19, df = 187, p < 0.5)" [20]. Using these data, Cohen's d is 0.63, calculated with formula (3).
If the article does not list the number of subjects in each group but does list the total number of
subjects, Cohen's d can be estimated using formula (3a), assuming that both groups have roughly equal
numbers of subjects. With $n_1 = n_2$, formula (3) reduces to

$$ d \approx \frac{2t}{\sqrt{df}} \tag{3a} $$

where $df = n_1 + n_2 - 2$ is the number of degrees of freedom of the t-test.
However, formulas (3) and (3a) cannot be used for repeated-measures designs. The paired-samples t-test
is used to test the null hypothesis that the average of the differences between a series of paired
observations is zero. Observations are paired when, for example, they are performed on the same
samples or subjects. The effect size d is defined as:

$$ d = \frac{|\bar{x}_z|}{S_z} = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{S_1^2 + S_2^2 - 2 r_{12} S_1 S_2}} \tag{4} $$
where $r_{12}$ denotes the correlation between the two random variables, and $\bar{x}_z$ and $S_z$ are
the mean and standard deviation of the differences z between the paired observations.
Comparing formula (1) with formula (4), it can be seen that the main difference in the calculation of
effect size between the t-test and the paired-samples t-test is the correlation parameter $r_{12}$.
The paired-samples t-test is used for situations in which each subject is measured twice and the data
of the two measures are dependent on each other. Therefore the calculation of effect size for the
paired-samples t-test should take into account the correlation between the two scores. The correlation
parameter $r_{12}$ is typically larger than zero. The prior evidence suggests that for thermal comfort
or productivity measurements the value of $r_{12}$ varies from 0.7 to 1.0 [1,22–24].
Here is one more example. In a repeated-measures experiment, the increment of transcranial magnetic
stimulation (TMS)-evoked thumb movements falling in the training target zone (TTZ) was 8.8 ± 2.7% with
levodopa and 2.6 ± 1.0% with placebo (mean ± standard error) [21]. The standard deviations calculated
with formula (2) were 8.1% and 3.0% respectively. Unfortunately the correlation $r_{12}$ between the
two data sets was not reported, and $r_{12}$ = 0.8 was assumed based on our previous studies [22,23].
Therefore the effect size calculated with formula (4) would be 1.04.
It can be seen from formulas (1) and (4) that effect size is affected considerably by the variation of
the samples. For a particular pair of means, as the standard deviation increases, the effect size
becomes smaller. Compared with the t-test with independent samples, the effect size of the
paired-samples t-test will be larger when the two scores are strongly correlated ($r_{12}$ > 0.5 for
equal standard deviations), because it takes into account the correlation between the two scores.

3.2. Calculate effect size of analysis of variance (ANOVA)

Cohen (1988) defined the effect size of 0.1, 0.25, and 0.4 as small, medium, and large, respectively,
for the F-test [9]. Cohen's f is an appropriate effect size measure in the context of an F-test for
ANOVA. Cohen's f is defined as:

$$ f = \frac{S_m}{S} \tag{5} $$

In formula (5), $S_m$ is the standard deviation of the group means $\bar{x}_i$, and $S$ is the common
standard deviation within each group.
A different but equivalent way to specify the effect size is in terms of partial Eta squared $\eta^2$,
which is defined as

$$ \eta^2 = \frac{S^2_{Between} \cdot df_{Between}}{S^2_{Between} \cdot df_{Between} + S^2_{Within} \cdot df_{Within}} \tag{6} $$

where $S^2$ is the population variance, $df$ is the degree of freedom, and the subscripts Between and
Within mean between-group and within-group.
That is, $\eta^2$ is the ratio of the between-groups variance to the total variance and can be
interpreted as "proportion of variance explained by group membership". When the between-groups and
within-groups variance estimates are not available, as is often true in published research, it is
possible to figure out $\eta^2$ directly from the F statistic and the degrees of freedom. The
formula is:

$$ \eta^2 = \frac{F \cdot df_{Between}}{F \cdot df_{Between} + df_{Within}} \tag{7} $$

The relationship between $\eta^2$ and $f$ is:

$$ f = \sqrt{\frac{\eta^2}{1 - \eta^2}} \tag{8} $$

Consider once again Kahl's (2005) study [19]. Here is an excerpt from the results section: "There was
a main effect of temperature on subjective physical comfort across all task: reading
(F(3,174) = 4.99, p < 0.01) …". In this example, dfBetween = 3, dfWithin = 176, F = 4.99. Therefore,

$$ \eta^2 = \frac{F \cdot df_{Between}}{F \cdot df_{Between} + df_{Within}} = \frac{4.99 \cdot 3}{4.99 \cdot 3 + 176} = 0.078 $$

$$ f = \sqrt{\frac{\eta^2}{1 - \eta^2}} = \sqrt{\frac{0.078}{1 - 0.078}} = 0.29 $$

The result indicates that there was only a medium effect of temperature on subjective physical comfort
for the reading task, although the effect was highly statistically significant.

3.3. IEQ study

In Sections 3.1 and 3.2, the methods to calculate effect size from published research have been
provided with some examples. However, when searching for published research for this purpose, it was
found that most published IEQ studies did not present enough or standard statistical results, only
indicating a P-value (such as P < 0.01), so it is impossible to calculate effect size based on their
results. In this section, the effect sizes of some productivity research conducted by the authors are
presented.
The methods of productivity measurement can be classified into three categories [25]: (1) performance
measurement; (2) physiological parameter measurement; and (3) subjective questionnaires. The
performance measurements include neurobehavioral tests and simulated office work. The distinguishing
characteristic of the neurobehavioral approach is its emphasis on the identification and measurement
of behavioral deficits, for the influence of environment on brain functions manifests behaviorally.
The neurobehavioral approach is neurobiologically justified since the central nervous system displays
particular sensitivity to environmental disturbance. As a result, behavioral changes represent an
avenue through which to evaluate early and less obvious effects of environmental factors. The
rationale for using physiological methods is based on the reasoning that physiological measures of
activation or arousal are associated with increased activity in the nervous system, which is equated
with an increase in the stress on the worker. Common physiological measurements include:
(a) cardiovascular measures (heart rate, blood pressure); (b) respiratory system (respiration rate,
oxygen consumption); (c) nervous system (brain activity, muscle tension, pupil size);
(d) biochemistry (catecholamine). Subjective questionnaires include rating scales for self-rated
performance, workload assessment (e.g. NASA_TLX), or motivation to do work. The rating scales are
useful tools in tapping workers' internal feelings.
The effect sizes of the three types of measurement are calculated based on our former studies
[1,22–24], as shown in Table 1, which can serve as a reference for figuring out the required sample
size when planning similar studies. The partial Eta squared $\eta^2$ is obtained from the analysis of
the experimental data with the SPSS software. The correct ratio or memory capacity (accuracy) and the
response time of each test were used as the performance indices of neurobehavioral tests. The effect
size conventions for ANOVA (the effect size of 0.1, 0.25, and 0.4 as small, medium, and large,
respectively) were applied here.
As shown in Table 1, the effect size indicates that there was a large variation of heart rate
variability (HRV) with perception of thermal comfort, suggesting that HRV may be related to thermal
comfort and may be useful for understanding the mechanism of thermal comfort [22,23]. It also can be
seen from Table 1 that room
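The worked examples in Sections 3.1 and 3.2 can be reproduced with a few lines of code. The following is a minimal sketch, not part of the original study: the function names are hypothetical, the pooled standard deviation uses n1 + n2 in the denominator as in formula (1a), and the input numbers are the ones quoted in the text (Kahl, Raymore et al., and the levodopa experiment with the assumed correlation of 0.8).

```python
import math


def pooled_sd(n1, s1, n2, s2):
    """Formula (1a): pooled standard deviation (denominator n1 + n2, as in the text)."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2))


def cohens_d(mean1, mean2, s):
    """Formula (1): difference between two means divided by a standard deviation."""
    return (mean1 - mean2) / s


def d_from_t(t, n1, n2):
    """Formula (3): Cohen's d from an independent-samples t statistic."""
    return t * math.sqrt((n1 + n2) / (n1 * n2) * (n1 + n2) / (n1 + n2 - 2))


def cohens_d_paired(mean1, mean2, s1, s2, r12):
    """Formula (4): Cohen's d for paired samples with correlation r12."""
    return abs(mean1 - mean2) / math.sqrt(s1 ** 2 + s2 ** 2 - 2 * r12 * s1 * s2)


def eta_squared_from_f(f_stat, df_between, df_within):
    """Formula (7): partial eta squared from F and the degrees of freedom."""
    return f_stat * df_between / (f_stat * df_between + df_within)


def cohens_f(eta_sq):
    """Formula (8): Cohen's f from partial eta squared."""
    return math.sqrt(eta_sq / (1 - eta_sq))


# Kahl (2005), Reading task: 36 men (7.63, SD 3.17) vs 140 women (6.04, SD 2.48)
s_kahl = pooled_sd(36, 3.17, 140, 2.48)
d_kahl = cohens_d(7.63, 6.04, s_kahl)

# Raymore et al. (2001): t = 4.19 with group sizes 115 and 74
d_raymore = d_from_t(4.19, 115, 74)

# Levodopa vs placebo: SDs 8.1 and 3.0 from the text, r12 = 0.8 is the text's assumption
d_paired = cohens_d_paired(8.8, 2.6, 8.1, 3.0, 0.8)

# Kahl (2005), ANOVA: F = 4.99, dfBetween = 3, dfWithin = 176 as used in the text
eta_sq = eta_squared_from_f(4.99, 3, 176)
f_kahl = cohens_f(eta_sq)

print(round(s_kahl, 3), round(d_kahl, 2), round(d_raymore, 2))
print(round(d_paired, 2), round(eta_sq, 3), round(f_kahl, 2))
```

Rounded to the precision used in the text, the sketch gives S = 2.618, d = 0.61, d = 0.63, d = 1.04, η² = 0.078 and f = 0.29.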
Table 1
Effect size of three types of productivity measurement: performance measurement,
physiological parameter measurement, and subjective rating questionnaires
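An effect size taken from a reference table feeds directly into sample size planning. As a rough sketch only, the power of a paired-samples t-test with effect size d (formula (4)) can be approximated with the standard normal distribution. This is an assumption made here for illustration: dedicated power analysis software works with the noncentral t distribution, so it reports somewhat lower power and somewhat larger sample sizes, especially for small samples. The function names are hypothetical, and d = 0.5, α = 0.05 and a target power of 0.8 are conventional illustrative inputs.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def approx_power_paired(d, n, alpha=0.05, two_tailed=True):
    """Approximate power of a paired-samples t-test (normal approximation)."""
    delta = d * math.sqrt(n)  # approximate noncentrality parameter
    if two_tailed:
        z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
        # probability of landing in either rejection region
        return _STD_NORMAL.cdf(delta - z_crit) + _STD_NORMAL.cdf(-delta - z_crit)
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha)
    return _STD_NORMAL.cdf(delta - z_crit)


def approx_sample_size_paired(d, power=0.8, alpha=0.05, two_tailed=True):
    """Approximate number of subjects needed to reach the target power."""
    tail = alpha / 2 if two_tailed else alpha
    z_alpha = _STD_NORMAL.inv_cdf(1 - tail)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / d) ** 2)


# A priori: subjects needed for d = 0.5, alpha = 0.05 (two-tailed), power = 0.8
n_needed = approx_sample_size_paired(0.5)

# Post hoc: power for 12 subjects (two-tailed) and 30 subjects (one-tailed), d = 0.5
power_12 = approx_power_paired(0.5, 12)
power_30 = approx_power_paired(0.5, 30, two_tailed=False)

print(n_needed, round(power_12, 2), round(power_30, 2))
```

For comparison, the post hoc examples in this paper, computed with power analysis software, report a power of 0.35 for 12 subjects (two-tailed) and 0.85 for 30 subjects (one-tailed), so the approximation is slightly optimistic and is best used only for a first estimate.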
Fig. 5. Procedure of sample size calculation with a priori power analysis.

Determine the expected effect size. As we have mentioned, the magnitude of the predicted effect size
can be estimated in many ways, such as running a pilot study, or based on some precise theory, on
previous experience with research of a similar kind, or on Cohen's conventions. (5) Figure out the
required sample size with software or graphs or tables for power analysis. Next, how to determine
sample size step by step with a priori power analysis is illustrated with an example.
An experiment will be performed to study the physiological mechanism of the effects of air temperature
on human productivity. HRV is selected as the physiological index assessing the autonomic nervous
system function. Four temperature levels, from cold to warm, will be investigated. The first step is
to specify the hypothesis test on a parameter. It is supposed that either cold or warm will increase
the HRV value, so it is a two-tailed test. The second step is to determine the type of experimental
design. A within-subject design is used in this study, so the ANOVA with repeated measures will be
used as the statistical model. The nonsphericity correction ε is set to be 0.5, and the correlation
factor among repeated measures is set to be 0.5. The number of groups and repeated measures also can
be determined. In this study, all subjects will be exposed to the four temperature conditions, so
there is one group with a repetition number of 4. Then the level of statistical power and the
significance value should be determined. The statistical power level is set to be 0.8, and the
significance level α is set to be 0.05, according to the widely used rule. The fourth step is to
determine the expected effect size. In this example, the expected effect size can be estimated based
on the previous study and determined to be 0.5 (as shown in Table 1). The effect size can also be
calculated

Fig. 6. Example of calculation of required sample size with power analysis software.

Post hoc power analysis often makes sense after a study has already been conducted and helps to
interpret the study results. Do the results that are statistically not significant prove no effects
of treatments, or are they just inconclusive results? The procedure of post hoc power analysis is
similar to a priori power analysis and can be divided into 3 steps: (1) gather the needed information:
the type of hypothesis test and statistical model, and sample size, etc.; (2) determine the
significance level and the expected effect size. The effect size is also estimated by the above
mentioned methods; and (3) calculate the statistical power with software or graphs. Tsutsumi et al.
(2007) investigated the effect of humidity on human productivity under transient conditions from a hot
and humid environment to a thermally neutral condition with 12 subjects in repeated measures [28]. No
significant effect of humidity on occupants' performance was found. So can we assume that humidity has
no effect on human productivity? How powerful is the study? A post hoc power analysis is performed to
interpret the result. As shown in Fig. 7a, the post hoc power analysis is selected and the two-tailed
paired-samples t-test is selected as the statistical model. The total sample size is 12 and α is set
to be 0.05. Then, a medium effect size is estimated on a priori grounds to be 0.5 referring to
Table 1. Now all needed input data are ready. With these data, the statistical power of the study is
calculated to be only 0.35. The relatively low power suggests that the statistically not significant
results from the study are only inconclusive, and one should not come to the conclusion that humidity
has no effect on human productivity; validation of the results in further investigations with more
subjects is needed.
Let's consider another example of post hoc power analysis. Bakó-Biró et al. (2004) investigated the
effect of the presence of personal computers (PCs) on perceived air quality, SBS symptoms and
productivity [2]. It was reported that performance of text typing was significantly reduced when PCs
were present. Thirty female subjects were exposed to the two conditions, the presence or absence of
PCs. The performance data were analyzed using the paired t-test. Reported P-values were for
a one-tailed test, i.e., in the expected direction that the presence of PCs had negative effects on
productivity. With these data, a post hoc power analysis is performed on this study, as shown in
Fig. 7b. Again a medium effect size is estimated on a priori grounds to be 0.5 referring to Table 1.
It can be seen that the statistical power of the study is 0.85, which is higher than the widely
required level (0.8). The post hoc power analysis implies that the results of this study should be
statistically reliable and important, considering the sample size is not very large.

5.3. Proper application of power analysis

Up to now we have understood that a priori power analysis is used to determine the necessary sample
size N of a test given
a desired α level, a desired power level (1 − β), and the size of the effect to be detected with
probability 1 − β. This provides an efficient method of controlling statistical power before a study
is actually conducted.
One motivation to do a priori power analysis is to make sure that enough subjects are involved in the
study. Therefore, many people may assume that the more subjects in the study, the more important its
results. In a sense, just the reverse is true [29]. A study with a very small effect size may also
reach statistical significance. This is likely to happen when a study has high statistical power due
to other factors, especially a large sample size.

Fig. 7. Example of post hoc power analysis indicating (a) relatively low statistical power; and (b) acceptable statistical power.

The message here is that in judging a study's results, there are two questions. First, is the result
statistically significant? If it is, you can consider there to be a real effect. The next question
then is whether the effect size is large enough for the result to be useful or interesting, especially
if the study has practical implications. If the sample is small, you can assume that a statistically
significant result is probably also practically important. But if the sample is very large, you must
consider the effect size directly, as it is quite possible that the effect size is too small to be
useful.
If the result is not statistically significant, the statistical power of the study is considered.
A non-significant result from a study with
low statistical power is truly inconclusive. However, a non-significant result from a study with high
statistical power does suggest that either the research hypothesis is false or that there is less of
an effect than was predicted when figuring statistical power [29]. Table 2 summarizes the role of
significance, sample size, and statistical power in interpreting research results.

6. Discussion

For some studies, it may be surprising to see how large the samples needed to detect the predicted
effect with sufficient statistical power are. Sample size is but one of several quality
characteristics of a statistical study; so if sample size is held fixed, we should focus on other
aspects of study quality. For example, better instruments can be found that will bring the study up
to a reasonable standard. Possible improvements in study design may also help to reduce the variance
of effect size estimation, as illustrated by the curves shown in Figs. 1–4. A fundamental advantage
of the within-subject design is the reduction in error variance associated with individual
differences: in a between-subject design, even though subjects are randomly assigned to groups, the
two groups may differ with regard to important individual difference factors that affect the
dependent variable. With a within-subject design, the conditions are always exactly equivalent with
respect to individual difference variables since the subjects are the same in the different
conditions. People typically vary less within themselves (that is, when compared to themselves) than
when compared to others. Another advantage of the within-subject design is that it has more
statistical power. The first reason is that the number of "subjects" is in effect increased relative
to a between-subject design. For example, in the experiment comparing the productivity of four
temperature conditions [1], since the 24 subjects were repeatedly exposed to the four temperature
conditions, it had four times as many "subjects" as it would have had if a between-subject design had
been used. The second reason is that repeated-measures design removes

varies with the independent variable. Carryover effects should be controlled with a balanced design
when a within-subject design is utilized [27].
It should also be kept in mind that effect size should be given emphasis in the discussion of
results. The Publication Manual of the American Psychological Association states the accepted
standard of how to present psychology research results: "For the reader to fully understand the
importance of your findings, it is almost always necessary to include some index of effect
size." [29]. Effect size not only plays an important role in power analysis, but also is a crucial
ingredient in meta-analysis. Meta-analysis is an important development in recent years in statistics
that has had a profound effect on many fields, especially psychology and behavioral studies. This
procedure combines results from different studies, even results using different methods of
measurement. When combining results, the crucial thing combined is the effect sizes. Using
meta-analysis, researchers can combine the results of several studies that evaluated the effect of
indoor environment quality on occupants' productivity or other aspects, and thus provide an overall
effect size evaluating to what extent productivity is affected by the change of working conditions.
It would also tell how effect sizes differ for studies done in different countries or different
populations. So including effect size when reporting the results of a study could help future
researchers to figure out statistical power when planning their own studies, and more importantly,
could provide useful information for future meta-analysts who will combine the results of many
related studies.

7. Conclusions

In this paper the methods to calculate effect size from experimental data or published research are
presented in detail along with many examples. It is recommended to include effect size when reporting
the results of a study. How to determine the right sample size when planning a study and how to
interpret research results with power analysis are also illustrated step by step with examples.
The examples are expected to help researchers involved in IEQ
the variance due to individual overall differences among subjects. research to better understand power analysis and plan their own
In repeated-measure design, not the actual score, but its difference study as well as evaluate research results. A priori power analysis
from that subject’s mean across conditions is compared. Therefore can be used to figure out proper sample size, therefore it can
the variation due to overall between person tendencies is elimi- provide an efficient method of controlling statistical power before
nated - everyone’s scores only vary from their own mean. Now it a study is actually conducted, while the post hoc power analysis is
can be seen why the t-test for dependent means and one-way important when looking at the results of a research. Non-significant
ANOVA for repeated-measures have more statistical power and result with high statistical power does suggest the research
require fewer sample compared with their independent designs. So hypothesis is false, but a non-significant result with low statistical
within-subject design is recommended for human related investi- power is inconclusive. As to the statistical significant results, the
gations, considering the large individual difference and the diffi- critical question is whether the effect size is large enough for the
culty to recruit a large number of subjects. However it should be results to have practical implications.
noted that there is a fundamental disadvantage of the within-
subject design, which can be referred to as ‘‘carryover effects’’, two
basic types of which were practice and fatigue effects, meaning that
participation in one condition may affect performance in other
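
The decision rules stated in the conclusions above can be sketched as a small helper. This is only an
illustrative sketch: the function name `interpret_result`, the 0.80 cut-off for "high" power, and the
`min_practical_effect` threshold are assumptions introduced here, not values prescribed by the paper.

```python
def interpret_result(significant, power, effect_size,
                     min_practical_effect, high_power=0.80):
    """Return a verbal reading of a study outcome.

    significant: True if the test rejected the null hypothesis.
    power: statistical power (1 - beta) of the test, between 0 and 1.
    effect_size: observed standardized effect size (its absolute value is used).
    min_practical_effect: smallest effect treated as practically relevant
        (hypothetical, study-specific threshold chosen by the researcher).
    high_power: cut-off above which power counts as "high" (assumed 0.80).
    """
    if not 0.0 <= power <= 1.0:
        raise ValueError("power must lie between 0 and 1")
    if significant:
        # For significant results the question is practical importance.
        if abs(effect_size) >= min_practical_effect:
            return "significant: effect large enough to have practical implications"
        return "significant: effect may be too small to matter in practice"
    if power >= high_power:
        return "non-significant with high power: research hypothesis probably false"
    return "non-significant with low power: inconclusive"


# Hypothetical inputs, not taken from any study.
print(interpret_result(False, 0.35, 0.30, min_practical_effect=0.20))
print(interpret_result(False, 0.92, 0.05, min_practical_effect=0.20))
print(interpret_result(True, 0.85, 0.10, min_practical_effect=0.20))
```

The thresholds are deliberately left as parameters, because what counts as high power or as a
practically relevant effect depends on the study.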
Table 2

Outcome statistically   Sample size   Statistical power   Conclusion
significant
Yes                     Small         High                Important results
Yes                     Large         Low                 Might or might not have practical importance
No                      Small         Low                 Inconclusive
No                      Large         High                Research hypothesis probably false

Appendix A

Hypothesis testing

Hypothesis testing is the use of statistics to assess whether observed data provide sufficient
evidence against a given hypothesis. The usual process of hypothesis testing consists of five steps [29].
(1) Formulate the null hypothesis H0 (commonly, that the observations are the result of pure chance)
    and the alternative hypothesis H1 (commonly, that the observations show a real effect combined
    with a component of chance variation). This is important, as mis-stating the hypotheses will
    muddy the rest of the process.
(2) Consider the assumptions being made in doing the test (for example, assumptions about statistical
    independence or about the form of the distributions of the observations) and identify a test
    statistic that can be used to assess the truth of the null hypothesis. This is equally important,
    as invalid assumptions will mean that the results of the test are invalid.
(3) Compute the P-value, which is the probability, assuming that the null hypothesis is true, of
    observing a result at least as extreme as the test statistic. The smaller the P-value, the
    stronger the evidence against the null hypothesis.
(4) Compare the P-value to an acceptable significance value α (sometimes called an alpha value).
    Popular levels of significance are 5% (0.05) and 1% (0.01).
(5) Decide either to fail to reject the null hypothesis or to reject it in favor of the alternative.
    If P ≤ α, the observed effect is statistically significant, and the null hypothesis is rejected
    in favor of the alternative hypothesis.

When a researcher makes a directional hypothesis, the null hypothesis is also, in a sense, directional.
For example, if the research hypothesis is μ1 > μ2, then the null hypothesis is μ1 ≤ μ2. Thus, to
reject the null hypothesis, the value of the test statistic has to fall into one specified tail of its
sampling distribution. For this reason, the test of a directional hypothesis is called a one-tailed
test. A directional hypothesis considers only one tail (the other tail is ignored as irrelevant to H1),
so all of α can be placed in that one tail. When a researcher predicts an effect but does not predict a
particular direction for the effect, the hypothesis is called non-directional. To test the significance
of a non-directional hypothesis, the possibility that the sample could be extreme at either tail of the
comparison distribution has to be taken into account; the test is therefore called a two-tailed test.
A two-tailed test requires us to consider both sides of the H0 distribution, so we split α and place
half in each tail.

There are two kinds of decision errors in hypothesis testing: Type 1 error and Type 2 error. You make a
Type 1 error if you reject the null hypothesis when in fact the null hypothesis is true. The
significance value α indicates the chance of making a Type 1 error. A Type 2 error is the error of
failing to reject a null hypothesis when it is in fact not true. The probability of making a Type 2
error is called β.

$$ S_{\mathrm{pooled}} = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}} \tag{A2} $$

where x = group mean, S = standard deviation, n = number of subjects, and subscripts 1 and 2 refer to
the two groups.

The magnitude of the t-statistic reflects the strength of the evidence that the two means differ: the
larger the absolute t-value, the more likely the difference between means is statistically
significant. This test is used only when it can be assumed that the two distributions have the same
variance. When this assumption is seriously violated, Welch’s t-test should be used. When there is
only one sample that has been tested twice (repeated measures) or when there are two samples that have
been matched or “paired”, the paired t-test is used. In the paired t-test, the differences between all
pairs must be calculated [30,31]. The average and standard deviation of those differences are then
used to calculate the t-statistic.

Analysis of variance (ANOVA)

The statistical procedure for testing differences among the means of more than two groups is called
the analysis of variance (ANOVA) [15]. There are several types of ANOVA, depending on the number of
treatments and the way they are applied to the subjects in the experiment. Two commonly used types are
one-way ANOVA and one-way ANOVA for repeated measures. The fixed-effects one-way ANOVA can be viewed as
an extension of the two-group t-test for a difference of means to more than two groups, and is
typically used to test for differences among at least three groups. One-way ANOVA for repeated
measures is used when the subjects are subjected to repeated measures; this means that the same
subjects are used for each treatment.

The null hypothesis in an ANOVA is that the several populations being compared all have the same mean.
The fundamental technique of ANOVA is a partitioning of the total sum of squares (SS) into components
related to the effects used in the model [29]. For example, for a simplified ANOVA with one type of
treatment at different levels, the total sum of squares can be divided into within-group SS and
between-group SS:

$$ SS_{\mathrm{Total}} = SS_{\mathrm{Within}} + SS_{\mathrm{Between}} \tag{A3} $$

The number of degrees of freedom (df) can be partitioned in a similar way and specifies the chi-square
distribution which describes the associated sums of squares.

$$ df_{\mathrm{Total}} = df_{\mathrm{Within}} + df_{\mathrm{Between}}; \quad df_{\mathrm{Between}} = r - 1; \quad df_{\mathrm{Within}} = n - r \tag{A4} $$
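
The pooled standard deviation of Eq. (A2) and the partition in Eqs. (A3) and (A4) can be checked
numerically with a short sketch. It assumes the usual pooled-variance form of the independent-samples
t statistic and the usual F ratio (between-group mean square divided by within-group mean square), and
it takes n as the total number of observations and r as the number of groups. The function names and
the two small data sets are hypothetical and serve only as an illustration.

```python
from math import sqrt
from statistics import mean, variance


def pooled_sd(group1, group2):
    """Pooled standard deviation as in Eq. (A2)."""
    n1, n2 = len(group1), len(group2)
    s1_sq, s2_sq = variance(group1), variance(group2)  # sample variances
    return sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))


def t_independent(group1, group2):
    """Pooled-variance t statistic for two independent groups."""
    n1, n2 = len(group1), len(group2)
    standard_error = pooled_sd(group1, group2) * sqrt(1 / n1 + 1 / n2)
    return (mean(group1) - mean(group2)) / standard_error


def one_way_anova(groups):
    """Sum-of-squares and df partition for a one-way, between-subject ANOVA."""
    values = [x for g in groups for x in g]
    grand_mean = mean(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    df_between = len(groups) - 1            # r - 1
    df_within = len(values) - len(groups)   # n - r
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_total": ss_total,
        "ss_within": ss_within,
        "ss_between": ss_between,
        "df_between": df_between,
        "df_within": df_within,
        "f_ratio": f_ratio,
    }


# Hypothetical performance scores under two conditions (illustration only).
warm = [4.0, 5.0, 6.0]
cool = [7.0, 8.0, 9.0, 10.0]

result = one_way_anova([warm, cool])
t_value = t_independent(warm, cool)
print(result)
print(t_value ** 2)  # with two groups, t squared matches the F ratio
```

With only two groups the squared t statistic equals the F ratio, which is one way to see the one-way
ANOVA as an extension of the two-group t-test.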
null hypothesis and hence indicates whether the difference is practically important. Standardized
effect size is the name given to a family of indices that measure the magnitude of a treatment effect,
and is very important because it allows us to compare the magnitude of experimental treatments from
one experiment to another [15]. In essence, a standardized effect size is the difference between two
means divided by the standard deviation of the two conditions. It is the division by the standard
deviation that enables us to compare effect sizes across experiments. Stated as a formula:

$$ q = \frac{\mu_1 - \mu_2}{\sigma} \tag{A6} $$

In equation (A6), σ refers to the standard deviation of the two populations, assuming that both
populations have the same standard deviation. In practice, the pooled standard deviation s_pooled is
commonly used [10]. The pooled standard deviation is the square root of the average of the squared
standard deviations:

$$ s_{\mathrm{pooled}} = \sqrt{\frac{s_1^2 + s_2^2}{2}} $$

level (1 − β), the pre-specified significance level α, and the population effect size to be detected
with probability 1 − β. In contrast to a priori power analysis, post hoc power analysis often makes
sense after a study has already been conducted. In post hoc analysis, 1 − β is computed as a function
of α, the population effect size parameter, and the sample size N used in a study. In compromise power
analysis, both α and 1 − β are computed as functions of the effect size, sample size N, and the error
probability ratio q = β/α. To illustrate, setting q to 1 would mean that the researcher prefers
balanced Type 1 and Type 2 error risks (α = β), whereas a q of 4 would imply that β = 4α. In
sensitivity analysis, the critical population effect size is computed as a function of α, 1 − β, and
sample size N. Finally, criterion analysis computes α (and the associated decision criterion) as a
function of 1 − β, the effect size, and a given sample size.
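
Three of the analysis types described above (a priori, post hoc, and sensitivity) can be sketched for
a two-sided comparison of two independent groups of equal size. The sketch relies on the normal
approximation to the test statistic rather than the noncentral t distribution, so its results may
differ slightly from those of dedicated power software. The function names and the example values
(effect size 0.5, α = 0.05, power 0.80) are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def post_hoc_power(d, n_per_group, alpha=0.05):
    """Post hoc: approximate power (1 - beta) from effect size, n and alpha."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # The small chance of rejecting in the opposite tail is ignored.
    return _STD_NORMAL.cdf(abs(d) * sqrt(n_per_group / 2) - z_crit)


def a_priori_n(d, alpha=0.05, power=0.80):
    """A priori: approximate subjects needed per group for the target power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / abs(d)) ** 2)


def sensitivity_effect(n_per_group, alpha=0.05, power=0.80):
    """Sensitivity: smallest effect size detectable with the given n and power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return (z_crit + z_power) * sqrt(2 / n_per_group)


# Illustrative values only.
n_needed = a_priori_n(0.5, alpha=0.05, power=0.80)
print(n_needed)
print(post_hoc_power(0.5, n_needed))
print(sensitivity_effect(n_needed))
```

Each function solves the same approximate relation for a different unknown, which mirrors how the
analysis types differ only in which quantity is computed from the others.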
