Version 1.1 (11/11/2019)

Introduction to sample size calculation using G*Power

James Bartlett ([email protected])

Principles of frequentist statistics

The problem of low statistical power

Installing G*Power

Types of tests
T-tests
Correlation
Analysis of Variance (ANOVA)

More advanced topics

How to calculate an effect size from a test statistic

How to increase statistical power

References

Principles of frequentist statistics


In order to utilise power analysis, it is important to understand the statistics we commonly use in
psychology, known as frequentist or classical statistics. In this theory of statistics, probability is
assigned to the long-run frequency of observations, rather than to the likelihood of one particular
event. This is where p values come from. The formal definition of a p value is the probability of
observing a result at least as extreme as the one observed, assuming the null hypothesis is true
(Cohen, 1994). This means a small p value indicates the results are surprising if the null hypothesis
is true, and a large p value indicates the results are not very surprising if the null is true. The aim of
this branch of statistics is to help you make decisions and limit the number of errors you will make
in the long run (Neyman, 1977). There is a real emphasis on the long run here, as the probabilities
do not relate to individual cases or studies, but tell you the probability attached to the procedure if
you repeated it many times.

There are two important concepts here: alpha and beta. Alpha is the probability of concluding
there is an effect when there is not one (type I error). This is normally set at .05 (5%) and it is
the threshold we look at for a significant effect. Setting alpha to .05 means we are willing to
make a type I error 5% of the time in the long run. Beta is the probability of concluding there is
not an effect when there really is one (type II error). This is normally set at .2 (20%), which
means we are willing to make a type II error 20% of the time in the long run. These values are
commonly used in psychology, but you could change them. However, any change should ideally
decrease, rather than increase, the number of errors you are willing to accept.

Power is the ability to detect an effect if there is one there to be found. Or in other words “if an
effect is a certain size, how likely are we to find it?” (Baguley, 2004, p. 73). Power relates to beta,
as power is 1 - beta. Therefore, if we set beta to .2, we can expect to detect a particular effect
size 80% of the time if we repeated the procedure over and over. Carefully designing an
experiment in advance allows you to control the type I and type II error rate you would expect in
the long run. However, studies have shown that these two concepts are not given much thought
when designing experiments.

The problem of low statistical power


There is a long history of warnings about low power. One of the first articles was ​Cohen (1962)
who found that the sample sizes used in articles only provided enough power to detect large
effects (by Cohen’s guidelines a standardised mean difference of 0.80). The sample sizes were
too small to reliably detect small to medium effects. This was also the case in the 1980s
(Sedlmeier & Gigerenzer, 1989)​, and it is still a problem in contemporary research ​(Button et al.,
2013)​. One reason for this is that people often use “rules of thumb”, such as always including 20
participants per cell. However, this is not an effective strategy. Even experienced researchers
overestimate the power provided by a given sample size, and underestimate the number of
participants required for a given effect size ​(Bakker, Hartgerink, Wicherts, & van der Maas,
2016)​. This shows you need to think carefully about power when you are designing an
experiment.

The implications of low power are a waste of resources and a lack of progress. A study that is not
sensitive to detect the effect of interest will just produce a non-significant finding more often than
not. However, this does not mean there is no effect, just that your test was not sensitive enough
to detect it. One analogy (paraphrased from this ​lecture by Richard Morey) to help understand
this is trying to tell two pictures apart that are very blurry. You can try and squint, but you just
cannot make out the details to compare them with any certainty. This is like trying to find a
significant difference between groups in an underpowered study. There might be a difference,
but you just do not have the sensitivity to differentiate the groups. In order to design an
experiment to be informative, it should be sufficiently powered to detect effects which you think
are practically interesting ​(Morey & Lakens, 2016)​. This is sometimes called the smallest effect
size of interest. Your test should be sensitive enough to avoid missing any values that you
would find practically or theoretically interesting. Fortunately, there is a way to calculate how
many participants are required to provide a sufficient level of power known as power analysis.

In the simplest case, there is a direct relationship between statistical power, the effect you are
interested in, alpha, and the sample size. This means that if you know three of these values,
you can calculate the fourth. For more complicated types of analyses, you need some additional
parameters, but we will tackle this as we come to it. The most common types of power analysis
are a priori and sensitivity.

An a priori power analysis tells you how many participants are required to detect a given effect
size. A sensitivity power analysis tells you what effect sizes your sample size is sensitive to
detect. Both of these types of power analysis can be important for designing a study and
interpreting the results. If you need to calculate how many participants are required to detect a
given effect, you can perform an a priori power analysis. If you know how many participants you
have (for example you may have a limited population or did not conduct an a priori power
analysis), you can perform a sensitivity power analysis to calculate which effect sizes your study
is sensitive enough to detect.

Another type of power analysis you might come across is post-hoc. This provides you with the
observed power given the sample size, effect size, and alpha. You can actually get SPSS to
provide this in the output. However, this type of power analysis is not recommended as it fails to
consider the long run aspect of these statistics. There is no probability attached to individual
studies. There is either an effect observed (significant ​p value), or there is not an effect
observed (non-significant ​p value). I highly recommend ignoring this type of power analysis and
focusing on ​a priori​ or sensitivity power analyses.

For this guide, we are going to look at how you can use G*Power ​(Faul, Erdfelder, Buchner, &
Lang, 2009) to estimate the sample size you need to detect the effect you are interested in, and
the considerations you need to make when designing an experiment.

Installing G*Power
G*Power is a free piece of software developed at Universität Düsseldorf in Germany.
Unfortunately it is no longer in development, with the last update being in July 2017. Therefore,
the aim of this guide is to help you navigate using G*Power as it is not the most user friendly
programme. You can download G*Power on ​this page​. Under the heading “download” click on
the appropriate version for whether you have a Windows or Mac computer. Follow the
installation instructions and open it up when it has finished installing.

Types of tests

T-tests
To start off, we will look at the simplest example in t-tests. We will look at how you can calculate
power for an independent samples and paired samples t-test.

Independent samples t-test (a priori)


If you open G*Power, you should have a window that looks like this:

We are going to begin by seeing how you can calculate power ​a priori for an independent
samples t-test. First, we will explore what each section of this window does.
● Test family - To select the family of tests, such as t tests, F tests (ANOVA), or χ². We need
the default t tests for this example, so keep it as it is.
● Statistical test - To select the specific type of test. Within each family, there are several
different types of test. For the t-test, you can have two groups, matched pairs, and
several others. For this example, we need two groups.
● Type of power analysis - This is where we choose whether we want an a priori or
sensitivity power analysis. For this example we want a priori to calculate the sample size
we need in advance to detect a given effect.
● Input parameters - these are the values we need to specify to conduct the power
analysis, we will go through these in turn.
○ Tails - is the test one- or two-tailed?
○ Effect size d - this is the standardised effect size known as Cohen’s d. Here we
can specify our smallest effect size of interest.
○ α err prob - this is our long-run type I error rate, which is conventionally set at
.05.
○ Power (1 - β err prob) - this is our long run power. Power is normally set at .80
(80%), but some researchers argue that this should be higher at .90 (90%).
○ Allocation ratio N2 / N1 - this specifically applies to tests with two groups. If this is
set to 1, sample size is calculated by specifying equal group sizes. Unequal
group sizes could be specified by changing this parameter.
● Output parameters - if we have selected all the previous options and pressed calculate,
this is where our required sample size will be.

The most difficult part in calculating the required sample size is deciding on an effect size. The
end of this guide is dedicated to helping you think about or calculate the effect size needed to
power your own studies. When you are less certain of the effects you are anticipating, you can
use general guidelines. For example, Cohen’s (1988) guidelines (e.g. small: Cohen’s d = 0.2,
medium: Cohen’s d = 0.5, large: Cohen’s d = 0.8) are still very popular. Other studies have tried
estimating the kind of effects that can be expected from particular fields. For this example, we
will use ​Richard, Bond, & Stokes-Zoota (2003) who conducted a gargantuan meta-analysis of
25,000 studies from different areas of social psychology. They wanted to quantitatively describe
the last century of research and found that across all studies, the average standardised effect
size was d = 0.43. We can use this as a rough guide to how many participants we would need to
detect an effect of this size.

We can plug these numbers into G*Power and select the following parameters: tail(s) = two,
effect size d = 0.43, α err prob = .05, Power (1 - β err prob) = 0.8, and Allocation ratio N2 / N1 =
1. You should get the following window:

This tells us that to detect the average effect size in social psychology, we would need two
groups of 86 participants (N = 172) to achieve 80% power in a two-tailed test. This is a much
bigger sample size than what you would normally find for the average t-test reported in a journal
article. This would be great if you had lots of resources, but as a psychology student, you may
not have the time to collect this amount of data. For modules that require you to conduct a small
research project, follow the sample size guidelines in the module, but think about what sample
size you would need if you were to conduct the study full scale and incorporate it into your
discussion.
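If you prefer to double-check this outside G*Power, the same calculation can be run in Python with the statsmodels library. This is my own addition and assumes you have Python and statsmodels available; the guide itself only uses G*Power.

```python
# A minimal cross-check of the a priori calculation (not part of the original
# G*Power workflow): solve for the sample size per group.
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.43,        # smallest effect of interest (Cohen's d)
                                   alpha=0.05,              # long-run type I error rate
                                   power=0.80,              # 1 - beta
                                   ratio=1,                 # equal group sizes (N2 / N1 = 1)
                                   alternative="two-sided")
print(math.ceil(n_per_group))  # ~86 per group, matching the G*Power output
```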

Now that we have explored how many participants we would need to detect the average effect
size in social psychology, we can tinker with the parameters to see how the number of
participants changes. This is why it is so important to perform a power analysis before you start
collecting data, as you can explore how changing the parameters impacts the number of
participants you need. This allows you to be pragmatic and save resources where possible.

● Tail(s) - if you change the number of tails to one, this decreases the number of
participants in each group from 86 to 68. This saves a total of 36 participants. If your
experiment takes 30 minutes, that is saving you 18 hours worth of work while still
providing your experiment with sufficient power. However, using one-tailed tests can be
a contentious area. See ​Ruxton & Neuhäuser (2010) for an overview of when you can
justify using one-tailed tests.
● α err prob - setting alpha to .05 says in the long run, we want to limit the amount of type I
errors we make to 5%. Some suggest this is too high, and we should use a more
stringent error rate. If you change α err prob to .01, we would need 128 participants in
each group, 84 more participants than our first estimate (42 more hours of data
collection).
● Power (1 - β err prob) - this is where we specify the amount of type II errors we are
willing to make in the long run. This also has a conventional level of .80. There are also
calls for studies to be designed with a lower type II error rate by increasing power to .90.
This has a similar effect to lowering alpha. If we raise Power (1 - β err prob) to .90, we
would need 115 participants in each group, 58 more than our first estimate (29 more
hours of data collection).

It is important to balance creating an informative experiment with the amount of resources
available. This is why it is crucial that the power analysis is performed in the planning phase of a
study, as these kinds of decisions can be made before any participants have been recruited.

How can this be reported?


If we were to state this in a proposal or participants section of a report, the reader needs the
type of test and parameters in order to recreate your estimates. For the original example, we
could report it like this:

“In order to detect an effect size of Cohen’s d = 0.43 with 80% power (alpha = .05,
two-tailed), G*Power suggests we would need 86 participants per group (N = 172) in an
independent samples t-test”.

This provides the reader with all the information they would need in order to reproduce the
power analysis, and ensure you have calculated it accurately.

Independent samples t-test (sensitivity)


Selecting an effect size of interest for an ​a priori power analysis would be an effective strategy if
we wanted to calculate how many participants are required before the study began. Now
imagine we had already collected data and knew the sample size, or had access to a specific
population of a known size. In this scenario, we would conduct a sensitivity power analysis. This
would tell us what effect sizes the study would be powered to detect in the long run for a given
alpha, beta, and sample size. This is helpful for interpreting your results in the discussion, as
you can outline what effect sizes your study was sensitive enough to detect, and which effects
would be too small for you to reliably detect. If you change type of power analysis to sensitivity,
you will get the following screen with slightly different input parameters:

All of these parameters should look familiar apart from Sample size group 1 and 2, and effect
size d is now under Output Parameters. Imagine we had finished collecting data and we knew
we had 40 participants in each group. If we enter 40 for both group 1 and 2, and enter the
standard details for alpha (.05), power (.80), and tails (two), we get the following output:

This tells us that the study is sensitive to detect effect sizes of d = 0.63 with 80% power. This
helps us to interpret the results sensibly if your result was not significant. If you did not plan with
power in mind, you can see what effect sizes your study is sensitive to detect. We would not
have enough power to reliably detect effects smaller than d = 0.63 with this number of
participants. It is important to highlight here that power exists along a curve. We have 80%
power to detect effects of d = 0.63, but we have 90% power to detect effects of approximately d
= 0.73 or 50% power to detect effects of around d = 0.45. This can be seen in the following
figure which you can create in G*Power using the X-Y plot for a range of values button:

This could also be done for an ​a priori power analysis, where you see the power curve for the
number of participants rather than effect sizes. This is why it is so important you select your
smallest effect size of interest when planning a study, as it will have greater power to detect
larger effects, but power decreases if the effects are smaller than anticipated.
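As a cross-check of the sensitivity analysis, the same numbers can be reproduced in Python with statsmodels (my own sketch, not part of the G*Power workflow): solve for the detectable effect size at 80% power, then trace part of the power curve.

```python
# Sensitivity sketch: smallest detectable d with 40 per group, plus the power
# curve across a range of effect sizes for the same design.
import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Smallest effect detectable with 80% power and n = 40 per group
d_80 = analysis.solve_power(nobs1=40, alpha=0.05, power=0.80, ratio=1,
                            alternative="two-sided")
print(round(d_80, 2))  # ~0.63, in line with G*Power

# Power exists along a curve: power at n = 40 for a range of effect sizes
for d in np.arange(0.3, 0.9, 0.1):
    power = analysis.power(effect_size=d, nobs1=40, alpha=0.05, ratio=1,
                           alternative="two-sided")
    print(f"d = {d:.1f}: power = {power:.2f}")
```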

How can this be reported?


We can also state the results of a sensitivity power analysis in a report, and the best place is in
the discussion as it helps you to interpret your results. For the example above, we could report it
like this:

“An independent samples t-test with 40 participants per group (N = 80) would be
sensitive to effects of Cohen’s d = 0.63 with 80% power (alpha = .05, two-tailed). This
means the study would not be able to reliably detect effects smaller than Cohen’s d =
0.63”.

This provides the reader with all the information they would need in order to reproduce the
sensitivity power analysis, and ensure you have calculated it accurately.

Paired samples t-test (a priori)


In the first example, we looked at how we could conduct a power analysis for two groups of
participants. Now we will look at how you can conduct a power analysis for a within-subjects
design consisting of two conditions. If you select Means (matched pairs) from the statistical test
area, you should get a window like below:

Now this is even simpler than when we wanted to conduct a power analysis for an independent
samples t-test. We only have four parameters as we do not need to specify the allocation ratio.
As it is a paired samples t-test, every participant must contribute a value for each condition. If
we repeat the parameters from before and expect an effect size of d = 0.43 (here it is called dz
for the within-subjects version of Cohen’s d), your window should look like this:

This suggests we would need 45 participants to achieve 80% power using a two-tailed test. This
is 127 participants fewer than our first estimate (saving approximately 64 hours of data
collection). This is a very important lesson. Using a within-subjects design will always save you
participants for the simple reason that instead of every participant contributing one value, they
are contributing two values. Therefore, it approximately halves the sample size you need to
detect the same effect size (I recommend Daniël Lakens' blog post to learn more). When you
are designing a study, think about whether you could convert the design to within-subjects to
make it more efficient.
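If you want to verify the paired-samples figure outside G*Power, a minimal sketch in Python (my own addition, assuming statsmodels is installed) treats the paired test as a one-sample test on the difference scores.

```python
# Paired (matched pairs) a priori calculation: dz is used as the one-sample
# effect size on the difference scores.
import math
from statsmodels.stats.power import TTestPower

paired = TTestPower()
n_pairs = paired.solve_power(effect_size=0.43, alpha=0.05, power=0.80,
                             alternative="two-sided")
print(math.ceil(n_pairs))  # ~45 participants, each measured in both conditions
```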

How can this be reported?


For this example, we could report it like this:

“In order to detect an effect size of Cohen’s ​d = 0.43 with 80% power (alpha = .05,
two-tailed), G*Power suggests we would need 45 participants in a paired samples t-test”.

This provides the reader with all the information they would need in order to reproduce the
power analysis, and ensure you have calculated it accurately.

Paired samples t-test (sensitivity)


If we change the type of power analysis to sensitivity, we can see what effect sizes a
within-subjects design is sensitive enough to detect. Imagine we sampled 30 participants without
performing an a priori power analysis. Set the inputs to .05 (alpha), .80 (power), and 30 (total
sample size), and you should get the following output when you press calculate:

This shows that the design would be sensitive to detect an effect size of d = 0.53 with 30
participants. Remember power exists along a curve, as we would have more power for larger
effects, and lower power for smaller effects. Plot the curve using X-Y plot if you are interested.
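The same sensitivity calculation can be sketched in Python (my own cross-check, not part of the guide):

```python
# Smallest dz detectable with 30 pairs at 80% power.
from statsmodels.stats.power import TTestPower

dz = TTestPower().solve_power(nobs=30, alpha=0.05, power=0.80,
                              alternative="two-sided")
print(round(dz, 2))  # ~0.53, in line with the G*Power output
```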

How can this be reported?


For this example, we could report it like this:

“A paired samples t-test with 30 participants would be sensitive to effects of Cohen’s d =
0.53 with 80% power (alpha = .05, two-tailed). This means the study would not be able
to reliably detect effects smaller than Cohen’s d = 0.53”.

Correlation
The next simplest type of statistical test is a correlation between two variables.

Correlation (a priori)


To work out the sample size required to detect a certain effect size, we need to select the Exact
test family and correlation: bivariate normal model. You should have a window that looks like
this:

Some of the input parameters are the same as we have seen previously, but we have two new
options:
● Correlation ρ H1 - This refers to the correlation you are interested in detecting. In the
case of correlation, this is your smallest effect size of interest.
● Correlation ρ H0 - This refers to the null hypothesis. In most statistical software, this is
assumed to be 0 as you want to test if the correlation is significantly different from 0, i.e.
no correlation. However, you could change this to any value you want to compare your
alternative correlation coefficient to.

For the first example, we will turn back to the meta-analysis by ​Richard, Bond, & Stokes-Zoota
(2003)​. The effect size can be converted between Cohen’s d and r (the correlation coefficient). If
you want to convert between different effect sizes, I recommend section 13 of this ​online
calculator​. Therefore, the average effect size in social psychology is equivalent to r = .21. If we
wanted to detect a correlation equivalent to or larger than .21, we could enter the following
parameters: tails (two), Correlation ρ H1 (.21), α err prob (.05), Power (0.8), and Correlation ρ
H0 (0). This should produce the following window:

This suggests that we would need 175 participants to detect a correlation of .21 with 80%
power. This may seem like a lot of participants, but this is what is necessary to detect a
correlation this small. Similar to the t-test, we can play around with the parameters to see how it
changes how many participants are required.
● Tail(s) - for a two-tailed correlation, we are interested in whether the correlation is
equivalent to or larger than ±.21. However, we may have good reason to expect that the
correlation is going to be positive, and it would be a better idea to use a one-tailed test.
Now we would only need 138 participants to detect a correlation of .21, which would be
37 participants fewer saving 19 hours of data collection.
● Power (1 - β err prob) - Perhaps we do not want to miss out on detecting the correlation
20% of the time in the long run, and wanted to conduct the test with greater sensitivity.
We would need 59 more participants (30 more hours of data collection) for a total of 234
participants to detect the correlation with 90% power (two-sided).
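G*Power uses an exact calculation under the bivariate normal model. As a rough cross-check (my own sketch using Python and scipy, not the guide's method), the common Fisher z approximation lands within a participant or so of the same answer.

```python
# Fisher z approximation for correlation sample size:
# n ≈ ((z_{alpha/2} + z_{beta}) / arctanh(r))^2 + 3
import math
from scipy.stats import norm

r, alpha, power = 0.21, 0.05, 0.80
z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed critical z
z_beta = norm.ppf(power)
n = ((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3
print(math.ceil(n))  # ~176, within a participant or so of G*Power's exact 175
```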

How can this be reported?


For this example, we could report it like this:

“In order to detect a Pearson’s correlation coefficient of ​r = .21 with 80% power (alpha =
.05, two-tailed), G*Power suggests we would need 175 participants”.

Correlations (sensitivity)
Like t-tests, if we know how many participants we have access to, we can see what effects our
design is sensitive enough to detect. In many neuroimaging studies, researchers will look at the
correlation between a demographic characteristic (e.g. age or number of cigarettes smoked per
day) and the amount of activation in a region of the brain. Neuroimaging studies are typically
very small as they are expensive to run, so you often find sample sizes of only 20 participants. If
we specify tails (two), alpha (.05), power (.80), and sample size (20), you should get the
following window:

This shows that with 20 participants, we would only have 80% power to detect correlations of r =
.58 in the long run. We would only have enough power to detect a large correlation by Cohen’s
guidelines. Note there is a new option here called Effect direction. This does not change the
size of the effect, but converts it to a positive or negative correlation depending on whether you
expect it to be bigger or smaller than 0.
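The Fisher z approximation can also be inverted for a sensitivity-style check (again an approximation I am adding, not G*Power's exact method, so expect it to sit a little above the exact value):

```python
# Solve for the detectable correlation given n = 20 using the Fisher z
# approximation (G*Power's exact method gives .58).
import math
from scipy.stats import norm

n, alpha, power = 20, 0.05, 0.80
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
r = math.tanh((z_alpha + z_beta) / math.sqrt(n - 3))
print(round(r, 2))  # ~0.59, close to G*Power's exact 0.58
```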

How can this be reported?


For this example, we could report it like this:

“A Pearson’s correlation coefficient with 20 participants would be sensitive to effects of r
= .58 with 80% power (alpha = .05, two-tailed). This means the study would not be able
to reliably detect correlations smaller than r = .58”.

Analysis of Variance (ANOVA)


The final type of test we are going to explore in this guide is ANOVA. This is where you have
three or more groups or conditions. We will look at both the between-subjects and
within-subjects variants.

One-way between-subjects ANOVA (a priori)


We will start with between-subjects for when we have three or more groups. In order to calculate
how many participants we need, we will first need to select F tests as the Test family, and then
select ANOVA: Fixed effects, omnibus, one-way. You should have a screen that looks like
this:

Most of the input parameters are the same as what we have dealt with for t-tests and
correlation. However, we have a different effect size (Cohen’s f) to think about, and we need to
specify the number of groups we are interested in sampling which will normally be three or
more.

ANOVA is an omnibus test that compares the means across three or more groups. This means
Cohen’s d would not be informative as it describes the standardised mean difference between
two groups. In order to describe the average effect across many groups, there is Cohen’s f.
Cohen (1988) provided guidelines for this effect size too, with values of .10 (small), .25
(medium), and .40 (large). However, this effect size is not normally reported in journal articles or
produced by statistics software. In its place, we normally see partial eta-squared (η²p), which
describes the percentage of variance explained by the independent variable when the other
variables are partialed out. In other words, it isolates the effect of that particular independent
variable. When there is only one IV, η²p will provide the same result as eta-squared (η²).
Fortunately, G*Power can convert from η²p to Cohen’s f in order to calculate the sample size.

With many effect sizes, you can convert one to the other. For example, you can convert
between r and Cohen’s d and, useful to us here, you can convert between Cohen’s d and η²p. In
order to convert between the different effect sizes, there is section 13 of this handy online
calculator. A typical effect size in psychology is η²p = .04, which equates to Cohen’s d = 0.40. In
order to use η²p in G*Power, we need to convert it to Cohen’s f. Next to Effect size f, there is a
button called Determine which will open a new tab next to the main window. From Select
procedure, specify Effect size from variance, and then click Direct. Here is where you specify the
η²p you are powering the experiment for. Enter .04, and you should have a screen that looks like this:

If you click Calculate and transfer to main window, it will input the Cohen’s f value for you in the
main G*Power window. Finally, input alpha (.05), power (.80), and groups (3), and you should
get the following output:

This shows us that we would need 237 participants split across three groups in order to power
the effect at 80%. G*Power assumes you are going to recruit equal sample sizes which would
require 79 participants in each group. We can play around with some of the parameters to see
how it changes how many participants are required.
● Alpha - If we wanted to make fewer type I errors in the long-run, we could select a more
stringent alpha level of .01. We would now need 339 participants (113 per group) to
detect the effect with 80% power. This means 102 participants more, which would take
51 more hours of data collection.
● Power (1 - β err prob) - Perhaps we do not want to miss out on detecting the effect 20%
of the time in the long run, and wanted to conduct the test with greater sensitivity. We
would need 72 more participants (36 more hours of data collection) for a total of 309
participants to detect the effect with 90% power.
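For readers who want to sanity-check this outside G*Power, a minimal Python sketch (my addition, assuming statsmodels is available) performs the same conversion from partial eta-squared to Cohen's f and then solves for the total sample size.

```python
# Convert partial eta-squared to Cohen's f, then solve for total N for a
# one-way between-subjects ANOVA with three groups.
import math
from statsmodels.stats.power import FTestAnovaPower

eta_p2 = 0.04
f = math.sqrt(eta_p2 / (1 - eta_p2))   # Cohen's f ≈ 0.20, as G*Power's Determine tab computes

n_total = FTestAnovaPower().solve_power(effect_size=f, alpha=0.05,
                                        power=0.80, k_groups=3)
print(math.ceil(n_total))  # total N; G*Power reports 237 (79 per group) for the same inputs
```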

How can this be reported?


For this example, we could report it like this:

“In order to detect an effect of η²p = .04 with 80% power in a one-way between-subjects
ANOVA (three groups, alpha = .05), G*Power suggests we would need 79 participants in
each group (N = 237)”.

One-way between-subjects ANOVA (sensitivity)


Now that we know how many participants we would need to detect a given effect size, we can
consider how sensitive a study would be if we knew the sample size. Imagine that we had 72
participants split across four groups, and we wanted to know what effect sizes this is powered to
detect. Select sensitivity for type of power analysis, and enter alpha (.05), power (.80), sample
size (72), and number of groups (4). You should get the following output:

This shows us that we have 80% power to detect effect sizes of Cohen’s f = 0.40. This equates
to a large effect, and we can convert it to η²p using the online calculator. This is equivalent to an
effect of η²p = .14. As a reminder, power exists along a curve. Cohen’s f = 0.40 is the smallest
effect size we can detect reliably at 80% power. However, we would have greater power to
detect larger effects, and lower power to detect smaller effects. It is all about what effect sizes
you do not want to miss out on. The power curve for 72 participants and four groups looks like
this:
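A hedged Python version of this sensitivity analysis (again my own sketch, not the guide's G*Power steps) solves for the smallest Cohen's f and converts it back to partial eta-squared.

```python
# Sensitivity for a one-way between-subjects ANOVA: 72 participants in four groups.
from statsmodels.stats.power import FTestAnovaPower

f = FTestAnovaPower().solve_power(nobs=72, alpha=0.05, power=0.80, k_groups=4)
eta_p2 = f ** 2 / (1 + f ** 2)   # convert Cohen's f back to partial eta-squared
print(round(f, 2), round(eta_p2, 2))  # roughly f = 0.40 and eta_p2 = .14, as in G*Power
```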

How can this be reported?


For this example, we could report it like this:

“A one-way between-subjects ANOVA with 72 participants across four groups would be
sensitive to effects of η²p = .14 with 80% power (alpha = .05). This means the study
would not be able to reliably detect effects smaller than η²p = .14”.

One-way within-subjects ANOVA (a priori)


Now it is time to see what we can do when we want to calculate power for three or more levels
in a within-subjects design. This is going to be the most complicated design we look at in this
guide, as there are a few extra considerations we need to make. In order to calculate power for
a within-subjects design, we need to select ANOVA: Repeated measures, within factors. You
should have a screen that looks like this:

The first three input parameters should be familiar by now. The number of groups should be 1
as we have a fully within-subjects design. The number of measurements is the number of
conditions we have in our within-subjects IV. To keep it simple, we will work with three
conditions, so enter 3 as the number of measurements. Now we have a couple of unfamiliar
parameters.

The correlation among repeated measures is something we will not need to worry about for
most applications, but it’s important to understand why it is here in the first place. In a
within-subjects design, one of the things that affects power is how correlated the measurements
are. As the measurements come from the same people on similar conditions, they are usually
correlated. If there was 0 correlation between the conditions, the sample size calculation would
be very similar to a between-subjects design. As the correlation increases towards 1, the
sample size you would require to detect a given effect will get smaller. The option is here as
G*Power assumes the effect size (Cohen’s f) and the correlation between conditions are
separate. However, if you are using η²p from SPSS, the correlation is already factored into the
effect size as it is based on the sums of squares. This means G*Power would provide a totally
misleading value for the required sample size. In order to tell G*Power the correlation is already
factored into the effect size, click on Options at the bottom of the window and choose which
effect size specification you want. For our purposes, we need “as in SPSS”. Select that option
and click OK, and you will notice that the correlation among repeated measures parameter has
disappeared. This is because we no longer need it when we use η²p from SPSS.

The second unfamiliar input parameter is the nonsphericity correction. If you have used a
within-subjects ANOVA in SPSS, you may be familiar with the assumption of sphericity. If
sphericity is violated, it can lead to a larger number of type I errors. Therefore, a nonsphericity
correction (e.g. Greenhouse-Geisser) is applied to decrease the degrees of freedom which
reduces power in order to control type I error rates. This means if we suspect the measures may
violate the sphericity assumption, we would need to factor this into the power analysis in order
to adequately power the experiment. To begin, we will leave the correction at 1 for no
correction, but later we will play around with lower values in order to explore the effect of a
nonsphericity correction on power.

For the first power analysis, we will use the same typical effect size found in psychology as the
between-subjects example. Click Determine, and enter .04 for partial η² (make sure the effect size
is set to “as in SPSS” in Options). Click Calculate and transfer to main window to convert it to Cohen’s f.
We will keep alpha (.05) and power (.80) at their conventional levels. Click calculate to get the
following window:

This shows that we would need 119 participants to complete three conditions for 80% power. If
we compare this to the sample size required for the same effect size in a between-subjects
design, we would need 118 fewer participants than the 237 participants before. This would save
59 hours worth of data collection. This should act as a periodic reminder that within-subjects
designs are more powerful than between-subjects designs.

Now it is time to play around with the parameters to see how it affects power.
● Number of measurements - One of the interesting things you will find is that if we recalculate
this for four conditions instead of three, we actually need fewer participants. We would need 90
participants to detect this effect across four conditions for 80% power. This is because
each participant is contributing more measurements, so the total number of observations
increases.
● Nonsphericity correction - Going back to three conditions, we can see the effect of a
more stringent nonsphericity correction by decreasing the parameter. If we have three
conditions, this can range from 0.5 to 1, with 0.5 being the most stringent correction (for
a different number of conditions, the lower bound can be calculated as 1 / (m - 1),
where m is the number of conditions. So for four conditions, it would be 1 / 3 = 0.33, but
we would use 0.34 as it must be bigger than the lower bound). If we selected 0.5, we
would need 192 participants to detect the effect size across three conditions. This is 73
more participants (37 more hours of data collection) than if we were to assume we do
not need to correct for nonsphericity. You might be wondering how you select the value
for nonsphericity correction. Hobson and Bishop (2016) have a supplementary section of
their article dedicated to their power analysis. This is a helpful source for seeing how a
power analysis is reported in a real study, and they choose the most stringent
nonsphericity correction. This means they are less likely to commit a type II error as they
may be overestimating the power they need, but this may not be feasible if you have fewer
resources. A good strategy is exploring different values and thinking about the maximum
number of participants you can recruit in the time and resources you have available.

How can this be reported?


For the first example, we could report it like this:

“In order to detect an effect of partial eta squared = .04 with 80% power in a one-way
within-subjects ANOVA (three conditions, alpha = .05, nonsphericity correction = 1),
G*Power suggests we would need 119 participants”.

One-way within-subjects ANOVA (sensitivity)

The final thing to cover is to explore how sensitive a within-subjects design would be once we
know the sample size we are dealing with. Change type of power analysis to sensitivity. If we
did not conduct an ​a priori power analysis, but ended up with 61 participants and three
conditions, we would want to know what effect sizes we can reliably detect. If we retain the
same settings, and include 61 as the total sample size, we get the following output once we
click calculate:

This shows us that we would have 80% power to detect effect sizes of Cohen’s f = 0.29. This
corresponds to η²p = .08, or a medium effect size.

How can this be reported?


For this example, we could report it like this:

“A one-way within-subjects ANOVA with 61 participants across three conditions would
be sensitive to effects of η²p = .08 with 80% power (alpha = .05). This means the study
would not be able to reliably detect effects smaller than η²p = .08”.

More advanced topics


For the time being, one-way ANOVA is going to be the most complicated design that is covered
in this guide. This is because powering interactions in factorial designs is not very accurate in
G*Power. See this ​blog post for why G*Power drastically underestimates the sample you would
need to power an interaction. In order to adequately power interaction effects, you ideally need
to simulate your experiment. This means you would program the effect sizes you are expecting
across multiple factors and see how many times it would return a significant result if you
repeated it many times. This cannot be done in SPSS or G*Power, so it would require you to
learn a programming language like R or Python. Fortunately, two researchers have made this
process more user friendly by creating an online app to calculate power for main and interaction
effects in factorial designs. There is an article explaining their app (Lakens & Caldwell, 2019)
and the ​online app itself. If this is something you are interested in and you are struggling to use
it, please send me an email.
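To give a flavour of what such a simulation involves, here is a minimal Monte Carlo sketch in Python (my own illustration, not the Lakens & Caldwell app; the cell means, standard deviation, and sample size are made-up values you would replace with your own smallest effects of interest).

```python
# Simulation-based power for the interaction in a 2x2 between-subjects design:
# generate data many times and count how often the interaction is significant.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n_per_cell, n_sims, sd = 50, 1000, 1.0
cell_means = {("a1", "b1"): 0.0, ("a1", "b2"): 0.0,
              ("a2", "b1"): 0.0, ("a2", "b2"): 0.5}  # interaction-only effect (hypothetical values)

significant = 0
for _ in range(n_sims):
    rows = []
    for (a, b), mu in cell_means.items():
        y = rng.normal(mu, sd, n_per_cell)
        rows.append(pd.DataFrame({"a": a, "b": b, "y": y}))
    data = pd.concat(rows, ignore_index=True)
    model = smf.ols("y ~ C(a) * C(b)", data=data).fit()
    p_interaction = anova_lm(model, typ=2).loc["C(a):C(b)", "PR(>F)"]
    significant += p_interaction < 0.05

print(significant / n_sims)  # proportion of significant interactions = estimated power
```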

Alternatively, another way to efficiently work out the sample size is to check on your results as
you are collecting data. You might not have a fully informed idea of the effect size you are
expecting, or you may want to stop the study half way through if you already have convincing
evidence. However, this must be done extremely carefully. If you keep collecting data and
testing to see whether your results are significant, this drastically increases the type I error rate
(Simmons, Nelson, & Simonsohn, 2011). If you check enough times, your study will eventually
produce a significant ​p value by chance even if the null hypothesis was really true. In order to
check your results before collecting more data, you need to perform a process called sequential
analysis (Lakens, 2014). This means that you can check the results intermittently, but for each
time you check the results you must perform a type I error correction. This works like a
Bonferroni correction for pairwise comparisons. For one method of sequential analysis, if you
check the data twice, your alpha would be .029 instead of .05 in order to control the increase in
type I error rate. This means for both the first and second look at the data, you would use an
alpha value of .029 instead of .05. See Lakens (2014) for an overview of this process.

How to calculate an effect size from a test statistic


Throughout this guide, we have used effect size guidelines or meta-analytic effect sizes in order
to select an effect size. However, these may not be directly applicable to the area of research
you are interested in. You may want to replicate or extend an article you have read. One of the
problems you may encounter with this, especially in older articles, is effect sizes are not
consistently reported. This is annoying, but fortunately you can recalculate effect sizes from
other information available to you, such as the test statistic and sample size. There is a direct
relationship between the test statistic and effect size. This means if you have access to the
sample size and test statistic, then you can recalculate an effect size based on this. You can
also use descriptive statistics, but these may not be reported for each analysis. Due to APA
style reporting, the test statistic and sample size should always be reported. There is a handy
online app created by Katherine Wood for calculating effect sizes from the information available
to you. We will be using this for the examples below.

For the first example, we will recalculate the effect size from Diemand-Yauman et al. (2011) as
they report a t-test with Cohen’s d, so we can see how well the recalculated effect size fits in
with what they have reported. On page three of their article, there is the following sentence: “An
independent samples t-test revealed that this trend was statistically significant (t(220) = 3.38, ​p
< .001, Cohen’s d = 0.45)”. From the methods, we know there are 222 participants, and the t
statistic equals 3.38. We can use this information in the ​online app to calculate the effect size.
Select the independent samples t-test tab, and enter 222 into total N and 3.38 into t Value. If
you click calculate, you should get the following output:

This shows that the recalculated effect size is nice and consistent. The value is 0.45 in both the
article and our recalculation. This window shows the range of information that you can enter to
recalculate the effect size. The minimum information you need is the total N and t Value, or you
will get an error message.
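The conversion the app performs can also be done by hand. Here is a small Python sketch (my own addition), which assumes equal group sizes, which is what makes it reproduce the reported value.

```python
# Cohen's d from an independent samples t statistic, assuming equal groups:
# d = t * sqrt(1/n1 + 1/n2)
import math

t, n_total = 3.38, 222
n1 = n2 = n_total / 2
d = t * math.sqrt(1 / n1 + 1 / n2)
print(round(d, 2))  # ~0.45, matching Diemand-Yauman et al. (2011)
```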

If you needed to recalculate the effect size for a paired samples t-test, then the options look very
similar. You only have one sample size to think about, as each participant will complete both
conditions. Therefore, we will move on to recalculating an effect size for ANOVA. Please note,
this calculator only works for between-subjects ANOVA, including main effects and interactions.
If you need the effect size for a one-way within-subjects ANOVA or a factorial mixed ANOVA,
then you would need the full SPSS output, which is not likely to be included in an article. If it is
available, you can use this ​spreadsheet by Daniël Lakens to calculate the effect size, but it’s
quite a lengthy process.

For the next example, we will use a one-way ANOVA reported in James et al. (2015). On page
1210, there is the following sentence: “there was a significant difference between groups in
overall intrusion frequency in daily life, F(3, 68) = 3.80, p = .01, η²p = .14”. If we click on the F
tests tab of the online calculator, we can enter 3.80 for the F statistic, 3 for treatment degrees of
freedom, and 68 for residual degrees of freedom. You should get the following output:

This shows that we get the same estimate for η²p as what is reported in the original article. Both
values are .14.
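Again, the formula behind the calculator is simple enough to check by hand (my own note, using the statistics reported in the article):

```python
# Partial eta-squared from an F statistic:
# eta_p2 = (F * df_effect) / (F * df_effect + df_error)
F, df_effect, df_error = 3.80, 3, 68
eta_p2 = (F * df_effect) / (F * df_effect + df_error)
print(round(eta_p2, 2))  # ~0.14, matching James et al. (2015)
```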

How to increase statistical power

Increasing the effect size


The first option is to increase the unstandardised effect size. If you are manipulating your IV,
you could increase the effect by increasing the dose or exposure. For example, if you were
interested in the effect of alcohol on a certain behaviour, using a moderate dose of alcohol
should have a larger effect than a small dose of alcohol in comparison to a no alcohol control
group.

Decreasing the variability of your effect


It may not always be possible to just increase the unstandardised effect size. Perhaps you are
observing a behaviour rather than manipulating it. This is where you can increase the
standardised effect size by making it easier to detect. In G*Power, we have been using Cohen’s
d, which is the standardised mean difference. This is the difference between groups or
conditions divided by the standard deviation. This means that the difference is converted to a
uniform scale expressed in standard deviations. For example, say we wanted to compare two
groups on a reaction time experiment. There is a 7ms difference between group A and group B,
with a standard deviation of 20ms. This corresponds to a standardised mean difference of d =
0.35 (7 / 20). However, what would happen if we could measure our reaction times more
precisely? Now, instead of a standard deviation of 20ms, we still have a 7ms difference but with
a standard deviation of 10ms. This corresponds to a standardised mean difference of
d = 0.70 (7 / 10). This means we have doubled the standardised effect size by decreasing
measurement error, but the unstandardised effect size has remained the same. This is an
important point for designing experiments. Try and think carefully about how you are measuring
your dependent variable. By using a more precise measure, you could decrease the number of
participants you need while maintaining an adequate level of statistical power. It may not always
be possible to halve the variability, but even a 25% decrease in variability here could save you
114 participants in a between subjects design (two tailed, 80% power). To see how the
standardised effect size increases as you progressively decrease the variability in
measurement, see this plot below:

As the standard deviation decreases, it makes it easier to detect the same difference between
the two groups indicated by the decreasing red shaded area. For an overview of how
measurement error can impact psychological research, see ​Schmidt & Hunter (1996)​.
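To see how this cashes out in sample size terms, here is a small sketch (my own numbers, continuing the 7ms example above, and assuming Python with statsmodels is available):

```python
# How shrinking the standard deviation changes the required sample size for
# the same unstandardised 7 ms difference.
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
difference = 7  # ms, the unstandardised effect stays the same throughout
for sd in (20, 15, 10):  # original, 25% smaller, 50% smaller
    d = difference / sd
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                       ratio=1, alternative="two-sided")
    print(f"SD = {sd} ms: d = {d:.2f}, ~{math.ceil(n_per_group)} per group")
```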

If you are using a cognitive task, another way to decrease the variability is to increase the
number of trials the participant completes (see Baker et al., 2019). The idea behind this is that
experiments may have high within-subject variance, that is, the variance within each condition is
high for each participant. One way to decrease this is to increase the number of observations per
condition, as it increases the precision of the estimate in each participant. Therefore, if you are
limited in the number of participants you can collect, an alternative would be to make each
participant complete a larger number of trials.

References

Baguley, T. (2004). Understanding statistical power in the context of applied research. Applied Ergonomics, 35(2), 73–80. https://doi.org/10.1016/j.apergo.2004.01.002

Baker, D. H., Vilidaite, G., Lygo, F. A., Smith, A. K., Flack, T. R., Gouws, A. D., & Andrews, T. J. (2019, July 3). Power contours: Optimising sample size and precision in experimental psychology and human neuroscience. Retrieved from https://arxiv.org/abs/1902.06122

Bakker, M., Hartgerink, C. H. J., Wicherts, J. M., & van der Maas, H. L. J. (2016). Researchers' intuitions about power in psychological research. Psychological Science, 27(8), 1069–1077. https://doi.org/10.1177/0956797616647519

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65(3), 145–153. https://doi.org/10.1037/h0045186

Cohen, J. (1988). Statistical power analysis for the behavioural sciences. New Jersey: Lawrence Erlbaum Associates.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.

Diemand-Yauman, C., Oppenheimer, D. M., & Vaughan, E. B. (2011). Fortune favors the bold (and the italicized): Effects of disfluency on educational outcomes. Cognition, 118(1), 111–115.

Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41(4), 1149–1160. https://doi.org/10.3758/BRM.41.4.1149

Hobson, H. M., & Bishop, D. V. (2016). Mu suppression – a good measure of the human mirror neuron system? Cortex, 82, 290–310.

James, E. L., Bonsall, M. B., Hoppitt, L., Tunbridge, E. M., Geddes, J. R., Milton, A. L., & Holmes, E. A. (2015). Computer game play reduces intrusive memories of experimental trauma via reconsolidation-update mechanisms. Psychological Science, 26(8), 1201–1215.

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710.

Lakens, D., & Caldwell, A. R. (2019, May 28). Simulation-based power analysis for factorial ANOVA designs. Retrieved from https://doi.org/10.31234/osf.io/baxsf

Morey, R. D., & Lakens, D. (2016). Why most of psychology is statistically unfalsifiable. Retrieved from https://github.com/richarddmorey/psychology_resolution/blob/master/paper/response.pdf

Neyman, J. (1977). Frequentist probability and frequentist statistics. Synthese, 36(1), 97–131.

Richard, F. D., Bond, C. F., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7(4), 331–363. https://doi.org/10.1037/1089-2680.7.4.331

Ruxton, G. D., & Neuhäuser, M. (2010). When should we use one-tailed hypothesis testing? Methods in Ecology and Evolution, 1(2), 114–117. https://doi.org/10.1111/j.2041-210X.2010.00014.x

Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1(2), 199–223. https://doi.org/10.1037/1082-989X.1.2.199

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105(2), 309–316.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
