Sam S Webb*1, Margaret Jane Moore*1, Anna Yamshchikova1, Valeska Kozik1, Mihaela D Duta1, Irina Voiculescu1 & Nele Demeyere1
(*joint first authors)
1 Department of Experimental Psychology, University of Oxford
Author Note
Abstract
Objective
Complex Figure Copy Tasks are one of the most commonly employed neuropsychological tests.
However, manual scoring of this test is time-consuming, requires training, and can still be
inconsistent between different examiners. We aimed to develop and evaluate a novel, automated method for scoring a tablet-based Figure Copy Task.
Method
A cohort of 261 healthy adults and 203 stroke survivors completed the digital Oxford Cognitive
Screen – Plus Figure Copy Task. Responses were independently scored by two trained human raters and by the novel Automated Scoring Program.
Results
Overall, the Automated Scoring Program was able to reliably extract and identify the separate
figure elements (average sensitivity and specificity of 92.10% and 90.20% respectively) and
assigned total scores which agreed well with manual scores (ICC = .83). Receiver Operating Characteristic analyses demonstrated that, when classifying impairment relative to manual scores, the Automated Scoring Program had an overall sensitivity and specificity of 80% and 93.40% respectively (AUC = 86.70%). Automated total scores also reliably distinguished
between different clinical impairment groups with sub-acute stroke survivors scoring
significantly worse than longer term survivors, who in turn scored worse than neurologically
healthy adults.
Conclusions
These results demonstrate that the novel automated scoring algorithm was able to reliably extract
and accurately score Figure Copy Task data, even in cases where drawings were highly distorted due to comorbid fine-motor deficits. This represents a significant advancement as this novel technology can be employed to produce immediate, unbiased, and reproducible scores for Figure Copy Task responses in clinical and research environments.
Key points
Question
We aimed to develop and evaluate a novel, automated method for scoring a tablet-based Figure
Copy Task
Findings
The novel automated scoring algorithm was able to reliably extract and accurately score Figure
Copy Task data, even in cases where drawings were highly distorted due to comorbid fine-motor
deficits.
Importance
This represents a significant advancement as this novel technology can be employed to produce
immediate, unbiased, and reproducible scores for Figure Copy Task responses in clinical and
research environments
Next steps
Neuropsychological tests provide standardized measures of cognitive function which are widely used to characterize impairments and to establish brain-behavior relationships (Crawford et al., 1992; Ellis & Young, 2013). However, comparisons
which employ these metrics can be limited by the quality of scoring of these neuropsychological
tests. For example, tests which require subjective examiner judgements may introduce
potentially confounding noise into neuropsychological analyses (Barker et al., 2011; Franzen,
2000; Moore et al., 2019; Watkins, 2017). Inter-rater reliability traditionally is improved by
multiple independent raters or tests (Franzen, 2000; Huygelier et al., 2020). However, more
demanding scoring procedures often are prohibitively time-consuming and can lead to studies
opting to rely on small, selected samples rather than larger, generalizable patient cohorts, or
similarly to only complete limited cognitive measures (e.g. MMSE (Folstein et al., 1983)) which
reduce the informational richness. For these reasons, identifying new methods for efficiently and consistently scoring neuropsychological tests is an important avenue for improving both the scope and reliability of neuropsychological investigations. Here, we focus
on validating this approach in a specific, prominently studied, clinical cohort of stroke survivors,
as an example group where these automated scoring measures may improve methods to further
elucidate specific aspects of domain-specific cognitive impairments in Complex figure copy and
recall.
The Figure Copy test is one of the most commonly employed neuropsychological
assessment methods used to evaluate visuospatial constructional ability and nonverbal memory
in clinical environments (Shin et al., 2006). In traditional versions of this test, participants
complete two drawings of a composite geometric shape. First, participants are presented with a
target image and are asked to copy it from sight. Next, the target figure is removed and
participants are asked to reproduce it from memory (Demeyere et al., 2020; Schreiber et al.,
1999). The Rey-Osterrieth Complex Figure Test (ROCFT) (Somerville et al., 2000) is the most
well-known figure copy test, though many variations, including computerised versions (e.g.
Demeyere et al., 2020; Humphreys et al., 2017; Schreiber et al., 1999; Taylor, 1969), are in use.
Successful completion of any figure copy task requires participants to coordinate fine-
motor movements, employ visuospatial perception, maintain visual images in working memory,
and effectively plan and organise their responses (Shin et al., 2006). The Figure Copy Task has
been found to act as a reliable metric of a wide range of cognitive functions, and is therefore
useful for establishing a diverse range of brain-behavior relationships. Chechlacz et al. (2014)
conducted a voxel-lesion symptom mapping study aiming to identify the neural correlates of a
range of deficits captured by performance in a figure copy task. Analysis of this single
behavioral assessment yielded significant and distinct neural correlates associated with general
poor performance, lateralized omissions, spatial positioning errors, global feature impairment,
and local feature impairment (Chechlacz et al., 2014). Similarly, Chen et al. (2016) conducted a
lesion mapping study investigating the correlates of principal component analysis-derived factors
underlying figure copy performance. This investigation identified brain regions associated with
high-level motor control, visuo-motor transformation, and multistep object use using only
behavioral data from a figure copy task. This wide range of assessed cognitive functions makes
the figure copy task an extremely valuable tool both for clinical diagnostic purposes and for research applications.
The Figure Copy Task is comparatively simple to complete while assessing a diverse range of functions. These advantages mean that this task is frequently employed within clinical
neuropsychological evaluations. A survey conducted by Rabin et al. (2016) found that the
ROCFT was the eighth most popular single neuropsychological assessment employed by a
sample of 512 North American neuropsychologists, with 7.6% reporting using this test (Rabin et
al., 2016). Previous research has suggested that Figure Copy Task performance can effectively
distinguish between various clinical populations (Alladi et al., 2006; Demeyere et al., 2020;
Freeman et al., 2000). For example, Freeman et al. (2000) administered the Rey-Osterrieth
Complex Figure test to a cohort of patients with Alzheimer’s disease, ischemic vascular dementia, and Parkinson’s disease and found significant group differences, with patients with Alzheimer’s disease performing significantly worse than patients diagnosed
with vascular dementia or Parkinson’s Disease (Freeman et al., 2000). These findings suggest
that patients’ Figure Copy Task scores may provide clinically relevant information which can be used to support differential diagnosis.
Patient performance on the Figure Copy Task is generally scored manually. For example,
examiners score performance on the Oxford Cognitive Screen – Plus (OCS-Plus) figure copy
task by reporting the presence, accuracy, and position of each individual figure element. However, examiners may disagree on which drawn line represents which element, especially in cases
where a patient has committed many errors. A significant amount of training is required to
ensure high agreement. This reliance on subjective examiner judgements inevitably introduces
human biases into Figure Copy scores. Relying on subjective interpretations of objective criteria
can result in systematic scoring biases, potentially undermining the validity of large-scale
comparisons involving Figure Copy Test data, especially in cases where multiple independent
examiners are involved. Automated algorithms have been repeatedly demonstrated to be able to
perform many diagnostic and classification tasks with greater sensitivity and specificity than human experts.
For this reason, several automated tools have been developed to quantify performance on
neuropsychological tests. Chen et al. (2020) developed a deep-learning based automated scoring
tool for the Clock Drawing Task, a common component of dementia screening batteries (Agrell
& Dehlin, 1998; Pinto & Peters, 2009). This investigation compared algorithmic and expert
assigned scores in a cohort of 1315 outpatients and concluded that the algorithm exhibited a
comparative scoring accuracy of 98.54% (Chen et al., 2020). Similarly, Moetesum et al. (2015)
applied an automated approach to assessing performance on the Bender Gestalt Test (Koppitz,
1964) within a sample of 18 healthy adults. The performance of this algorithm varied
dramatically depending on the specific gestalt component being assessed (range = 6/18 (overlap)
Two figure-copy specific automated scoring algorithms have been developed. First,
Canham et al. (2000) developed an automated scoring software for the commonly used Rey-
Osterrieth Complex Figure test. In this task, responses are generally manually scored by
categorising each of the target figure’s 18 elements according to whether or not they are present,
accurately drawn, and correctly placed within the response figure. Canham et al.'s (2000)
automated software matched these scoring criteria by first identifying distorted areas of patient
drawings, then locating and grading basic geometric shapes while employing unary metrics to
remove unsuitable features from patient drawings. This method was found to perform well on
real patient data with 75% of features being within 5% of the manually assigned scores and
98.6% within 10% (Canham et al., 2000). Second, the most recent, “state-of-the-art” figure copy
scoring tool was designed by Vogt et al. (2019), which demonstrated a .88 Pearson correlation
with human ratings of Rey-Osterrieth Complex Figure performance. While this performance is near the documented human inter-rater agreement (.94), equivalence testing revealed that these
scoring methods did not produce strictly equivalent total scores. However, these algorithms were
designed specifically to score data from the Rey-Osterrieth Complex Figure test and do not generalise to other figure copy tasks.
The purpose of the present investigation is to develop an automated scoring tool to score
the OCS-Plus (Demeyere et al., 2020) Figure Copy Task. This project aims to evaluate the
efficacy of this automated scoring tool by comparing automated versus manually assigned scores
and identifying potential sources of systematic disagreement. The utility of this automated
software for distinguishing between different clinical populations is also explored. Ultimately,
this project aims to deliver a robust automated clinical scoring tool which provides immediate scores for Figure Copy Task responses.
Methods
Participants
A cohort of 261 neurologically healthy adults were recruited as well as 203 stroke
survivors who completed the Figure Copy task within the OCS-Plus Tablet Screening Project
(REC reference: 18/SC/0044, IRAS project ID: 241571). Of the stroke survivors, 49 were tested on the Figure Copy test within 6 months of their stroke (termed sub-acute stroke participants) and 154 stroke survivors were tested on the Figure Copy test on or after 6 months post-stroke (termed chronic stroke participants). The neurologically healthy adults were recruited as part of the OCS-Plus validation project (Demeyere et al., 2020) from an existing pool of older healthy volunteers. Participants were included in the OCS-Plus project if they were able to provide informed
consent, had sufficient English language proficiency to comprehend instructions, were at least 18
years old, and were able to remain alert for at least 20 minutes. The exclusion criteria included
inability to provide informed consent, insufficient English language proficiency, and inability to remain alert for the testing session.
Table 1
Summary demographics of the samples (49 = sub-acute stroke, 154 chronic stroke, 261 healthy
adults)
Note. HA refers to healthy adults, AS to sub-acute stroke, and CS to chronic stroke groups. For
age, education, days since stroke, and lesion volume, groups were compared using independent t-
tests and we report the Cohen’s d effect size. For handedness, sex, stroke side, stroke type, we
used chi squared analysis and report Cramer’s V effect size. * refers to significance below .05.
We collected additional measures from clinical notes including the Barthel Index
(Mahoney & Barthel, 1965) and the Oxford Cognitive Screen (Demeyere et al., 2015) to measure
functional ability and domain-specific cognitive impairment. As part of the 6-month follow up
protocol for the overarching study, we collected data on the Hospital Anxiety and Depression
Scale (Zigmond & Snaith, 1983), to measure anxiety and depression, the Stroke Impact Scale
(Duncan et al., 2002) to measure the domain-specific impact of stroke, and the Quality of Life
Scale (Al-Janabi, Flynn, & Coast, 2012) to assess the quality of life of the participants post-
stroke.
Table 2
Additional tests of mood, cognitive impairment, stroke severity, and the impact of stroke, for most of the stroke survivor sample (Sub-Acute stroke =
49, chronic stroke = 154)
Test battery | Measure | Sub-Acute n | Sub-Acute missing (%) | Sub-Acute M | Sub-Acute SD | Chronic n | Chronic missing (%) | Chronic M | Chronic SD
Barthel Index | Total | – | – | – | – | 124 | 19 | 16 | 5
Oxford Cognitive Screen | Number of domains impaired | 49 | 0 | 3 | 2 | 148 | 4 | 3 | 2
Hospital Anxiety and Depression Scale | Anxiety | – | – | – | – | 135 | 12 | 6 | 4
Hospital Anxiety and Depression Scale | Depression | – | – | – | – | 134 | 13 | 5 | 4
Stroke Impact Scale | Total | 49 | 0 | 115 | 142 | 154 | 0 | 242 | 111
Stroke Impact Scale | Strength | 46 | 6 | 6 | 8 | 139 | 10 | 12 | 6
Stroke Impact Scale | Hand | 46 | 6 | 6 | 9 | 141 | 8 | 15 | 9
Stroke Impact Scale | ADL | 46 | 6 | 15 | 19 | 143 | 7 | 35 | 15
Stroke Impact Scale | IADL | 47 | 4 | 12 | 15 | 141 | 8 | 25 | 12
Stroke Impact Scale | Mobility | 47 | 4 | 14 | 17 | 143 | 7 | 31 | 13
Stroke Impact Scale | Communication | 46 | 6 | 13 | 16 | 142 | 8 | 28 | 9
Stroke Impact Scale | Emotion | 47 | 4 | 14 | 17 | 141 | 8 | 29 | 10
Stroke Impact Scale | Memory | 44 | 10 | 12 | 15 | 142 | 8 | 26 | 9
Quality of Life Scale | Total | – | – | – | – | 128 | 17 | 15 | 3
Note. Sub-Acute and Chronic refer to when the Figure Copy test was administered, so either before 6 months post-stroke (termed sub-acute) or
greater or equal to 6 months post-stroke. ADL refers to activities of daily living, IADL refers to instrumental activities of daily living.
The OCS-Plus is a tablet-based cognitive screening tool designed to detect cognitive impairments within clinical and sub-clinical populations using fine-grained measures (Demeyere et al., 2020). The OCS-Plus version used in this investigation was created in MATLAB 2014b and was run on a Microsoft Surface Pro computer tablet (Windows 10 Pro, version 1511). The OCS-Plus begins with a small practice to ensure that even those with limited experience with computer-tablet technology can complete the tasks accurately; this practice involves tapping a shape in the centre of the screen and drawing a line between two small dots. The OCS-
Plus includes a computerized Figure Copy Task which is designed to be inclusive for severely
impaired patients, including a simple, multi-element target figure. In this task, participants are
asked to copy a composite geometric shape (Figure 1) once from sight and again from memory,
immediately following completion of the copy condition. Participants are not informed that they
will be asked to remember the figure until the beginning of the memory condition. Participants
are instructed to complete their drawing using a tablet stylus within a marked area underneath the
target figure. Participants are allowed unlimited time to complete each of these drawing tasks.
Figure 1
The target figure in the OCS-Plus Figure Copy Task. See Figure 2 for individual figure element
definitions.
The OCS-Plus software records the (x,y) coordinates and time stamps of all pen strokes, allowing full, detailed reconstruction of the drawing process. Each completed drawing is
assigned a total score out of 60 with each of the individual 20 figure elements being scored
independently according to three independent criteria: presence, accuracy, and position (Figure
2). An element is scored as present if it has been drawn anywhere in the response figure.
Perseverative responses are not quantitatively penalized but are noted by the examiner. Elements
are marked as accurate if they are drawn with reasonable accuracy as could be expected from a
person with typical drawing ability. Reasonable allowances are made to account for the use of a
tablet computer stylus on the relatively slippery screen surface and comorbid age-related fine
motor impairments (e.g. arthritis). For example, slight inaccuracies in line joining as well as
obvious attempts to correct such errors (e.g. doubling up a line to ensure that it is straight) are not penalised. Elements are marked as correctly positioned if their placement relative to the rest of the response figure is accurate. As in accuracy scores, allowances are made to account for tablet usage and age-related
fine motor impairments. Scorers are instructed to only penalise each drawing position error once
and to disregard cases in which position errors within one element have led to placement errors
within neighbouring elements. These criteria are used to assign a score out of three for each
individual element shown in Figure 2, and these element scores are summed to produce a total
score. This scoring procedure is repeated for the copy and recall drawing conditions.
A full scoring manual detailing the exact instructions given to scorers is openly available
on the Open Science Framework (Foster & Deardorff, 2017; https://osf.io/9dwpv/). Human raters
completed approximately two hours of training with the manual to complete the manual ratings.
The average time required to manually score a figure copy response varied between 1 and 5
minutes, depending on the degree of distortion and error present within the response drawing.
The automated scoring program requires less than 5 seconds to score a drawing, so
implementation of automated scoring can be expected to save between 2-10 minutes per
participant (2 drawings each). Note that we automatically scored full points (18 points) for border
elements if the participant had used the drawing area border as the figure border (i.e., had drawn
no border elements). This approach was adopted in order to avoid penalising participants who
used the outer border of the rectangular drawing area as the figure border. Given that this error
pattern occurred in both healthy adult participants (n=23 in the copy condition, n=27 in the recall
condition) and patients (n=14 in the copy condition and n= 11 in the recall condition), we judged
this to not have represented a clinical deficit, such as closing-in behaviour, and instead attributed
these errors to the on-screen drawing area being too similar in size and shape to the figure border, along with potential misunderstanding of the instructions. Given the small number of affected responses (at most 14 among patients, and somewhat more among healthy adults), this scoring decision did not significantly impact the results of the conducted analyses.
Figure 2
Note. Elements are divided into three sections: A) border lines, B) internal dividers, and C) the
detail components. Each element is given a score out of three with one point being awarded
based on presence, accuracy, and position independently. If viewed in black and white, each
element on the figure is highlighted individually in red in each mini figure. Figure available at
The Automated Scoring Program created in this project was developed in Python 3.7 and
employs functions from the packages SciPy (Jones et al., 2001), Shapely (Gillies, 2015), Kivy
(Virbel et al., 2011), and PyLaTeX (Fennema, 2014). This program employs output variables
created by the OCS-Plus software including (x,y) coordinates of patient responses, time stamps,
and final drawing images. Before scoring each element, this software first pre-processes these data in six steps.
First, in noise removal, all pen strokes totalling fewer than five pixels are removed, as
these responses represent very small marks which were most likely created by accidentally
touching the pen to the tablet. Similarly, all elements which are abnormally distant from other
elements are removed, as these marks are unlikely to be a part of a participant’s intended
response. Abnormal distance is determined by calculating the centroid of each element and then using k-dimensional trees (Maneewongvatana & Mount, 2002) to find nearest neighbours for each of the centroids within a distance r, such that r = (1/2) min(h, w), where h is the height of the figure and w is its width. Second, participant drawings are normalised. This step is essential due to the large variance in participant response sizes, orientation angles, and positions within the drawing area. Each drawing is translated so that its bottom left-most point lies at coordinate (0,0), and the x and y axes are then scaled to a standard range.
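As a concrete illustration of this first pre-processing step, a minimal Python sketch using SciPy's k-d tree is shown below. The stroke representation, helper names, and neighbour rule are assumptions rather than the published implementation; only the five-pixel minimum and the radius r come from the description above.

```python
# Minimal sketch of the noise-removal step, assuming each pen stroke is an
# (n, 2) NumPy array of (x, y) points. Helper names are hypothetical.
import numpy as np
from scipy.spatial import cKDTree


def remove_noise(strokes, min_points=5):
    """Drop tiny strokes and strokes whose centroid has no neighbour within r."""
    # 1. Remove very small marks (fewer than `min_points` recorded pixels).
    strokes = [s for s in strokes if len(s) >= min_points]
    if len(strokes) < 2:
        return strokes

    # 2. One centroid per remaining stroke.
    centroids = np.array([s.mean(axis=0) for s in strokes])

    # 3. Radius r is half of the smaller figure dimension (height or width).
    all_points = np.vstack(strokes)
    width, height = np.ptp(all_points, axis=0)
    r = 0.5 * min(height, width)

    # 4. Query a k-d tree for neighbours within r; keep strokes whose centroid
    #    has at least one other centroid nearby (the query returns the point itself).
    tree = cKDTree(centroids)
    counts = np.array([len(tree.query_ball_point(c, r)) for c in centroids])
    return [s for s, n in zip(strokes, counts) if n > 1]
```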
In the third step, circular elements are identified within the normalised response drawing.
Circles are defined as a continuous path which meets the criteria detailed in Figure 2. The values
of these parameter cut-offs were adjusted to the values which optimise overall performance. In the fourth step, each remaining path is simplified using the Douglas-Peucker algorithm, which processes a series of points on a single curve and outputs a simplified element path composed of straight lines (Douglas & Peucker, 1973). Vector calculations are then used to determine the angle between multiple lines on each simplified curve, to identify turning points, and to split continuous paths into their component line segments.
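The sketch below illustrates how the path-simplification and angle calculations might look, using Shapely's simplify method (a Douglas-Peucker implementation). The tolerance value and function names are illustrative assumptions, not the tuned parameters of the Automated Scoring Program.

```python
# Illustrative sketch of path simplification and turning-angle detection.
import numpy as np
from shapely.geometry import LineString


def simplify_path(points, tolerance=3.0):
    """Reduce a drawn curve to a few straight segments (Douglas-Peucker)."""
    simplified = LineString(points).simplify(tolerance, preserve_topology=False)
    return np.asarray(list(simplified.coords))


def turning_angles(vertices):
    """Angle (degrees) between consecutive segments of a simplified path."""
    vecs = np.diff(vertices, axis=0)                      # segment vectors
    angles = []
    for v1, v2 in zip(vecs[:-1], vecs[1:]):
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return angles
```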
In step five, star and cross figure elements are identified by finding all sets of lines
composed of intersecting paths where the length of each line is less than half of the drawing’s
total height and individual line lengths are within the third quartile plus 1.5 times the interquartile
range of each of the intersecting lines. Line sets of three or more lines where the smallest angle
between lines is greater than or equal to 30 degrees are defined as stars and sets of 2 or more
lines of which the smallest angle between lines is greater than or equal to a threshold, empirically
determined at 36 degrees are defined as crosses. Finally, in the last step, line elements of the
response figure are identified. The orientation of each remaining unclassified figure element is
determined as either vertical, horizontal, right, or left slanted by calculating the angles between
simplified lines and the normalised x-axis. Euclidean distance calculations (Deza & Deza, 2009)
are then used to match each drawn line to its corresponding element in the target figure.
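A hedged sketch of these classification rules follows. The line representation, the orientation tolerance, and the tie-breaking order (star checked before cross) are assumptions; the 30-degree and 36-degree thresholds are those stated above.

```python
# Sketch of the star/cross decision rule and orientation labelling; each line is
# given as a pair of endpoints ((x1, y1), (x2, y2)), which is an assumption.
import itertools
import numpy as np


def line_angle_deg(l1, l2):
    """Smallest angle between two line segments, in degrees (0-90)."""
    v1 = np.subtract(l1[1], l1[0])
    v2 = np.subtract(l2[1], l2[0])
    cos = abs(np.dot(v1, v2)) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))


def classify_intersecting_lines(lines):
    """Return 'star', 'cross', or None for a set of mutually intersecting lines."""
    if len(lines) < 2:
        return None
    smallest = min(line_angle_deg(a, b) for a, b in itertools.combinations(lines, 2))
    if len(lines) >= 3 and smallest >= 30:
        return "star"
    if smallest >= 36:
        return "cross"
    return None


def orientation(line, tol=20):
    """Label a segment as horizontal, vertical, or slanted (tol is an assumed cut-off;
    which slant label applies depends on the y-axis convention of the tablet data)."""
    dx, dy = np.subtract(line[1], line[0])
    angle = np.degrees(np.arctan2(dy, dx)) % 180          # angle from the x-axis
    if angle < tol or angle > 180 - tol:
        return "horizontal"
    if abs(angle - 90) < tol:
        return "vertical"
    return "right-slanted" if angle < 90 else "left-slanted"
```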
Once this six-step pre-processing is completed, response drawing total scores are calculated by scoring each identified element against the three criteria of presence, accuracy, and position. The Automated Scoring Program marks an element as present if it has
been identified in the pre-processing steps described above. Accuracy scoring criteria differ
based on the element being assessed. For components such as circles, stars, and crosses to be
successfully identified by the pre-processing, they must be drawn with a reasonable degree of
accuracy. For this reason, if a circle, star, or cross is marked as being present, it is also scored as
being accurate. The accuracy of linear elements is scored by calculating the best fit line of the drawn points and comparing it to the corresponding target line segment.
The distance between a drawn point and a target point in 2D space is calculated as the
absolute difference between their respective x and y-coordinates. Linear elements are scored as
accurate if the maximum distance from any point of the target element to the best fit regression
line is less than 10, the length of the best fit line is greater than or equal to 70% of the target path
length, and the angle between the best fit line and target line segment is less than 10 degrees. If
two line segments, which are defined as separate in the figure template, are drawn as a
continuous line in the participant’s drawing, the algorithm is able to split the drawn line segment
in order to assess fit of the separated line segments to the original template as to avoid
underscoring presence.
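The following sketch shows one way the linear accuracy criteria above could be implemented. The total-least-squares fit and the perpendicular point-to-line distance are assumptions about details the text leaves open, whereas the thresholds (10 units, 70% of target length, 10 degrees) are those stated above.

```python
# Hedged sketch of the accuracy check for a linear element.
import numpy as np


def line_is_accurate(drawn, target, max_dist=10.0, min_len_ratio=0.7, max_angle=10.0):
    drawn, target = np.asarray(drawn, float), np.asarray(target, float)

    # Best-fit direction of the drawn points via total least squares (PCA).
    centre = drawn.mean(axis=0)
    _, _, vt = np.linalg.svd(drawn - centre)
    direction = vt[0]                                   # unit direction of best-fit line
    normal = np.array([-direction[1], direction[0]])    # unit normal to that line

    # 1. Maximum distance from any target point to the best-fit line.
    dists = np.abs((target - centre) @ normal)
    # 2. Extent of the drawn points along the fit, versus the target length.
    drawn_len = np.ptp((drawn - centre) @ direction)
    target_len = np.linalg.norm(target[-1] - target[0])
    # 3. Angle between the fitted direction and the target segment.
    t_vec = target[-1] - target[0]
    cos = abs(np.dot(direction, t_vec)) / np.linalg.norm(t_vec)
    angle = np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

    return (dists.max() < max_dist
            and drawn_len >= min_len_ratio * target_len
            and angle < max_angle)
```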
Finally, element position is scored by comparing the location of drawn paths to the
location of the corresponding element within the target figure. The algorithm assigns each drawn
linear element to its corresponding target element, if it has the same orientation and the distance
between the elements is less than 20% of the total drawn figure height. As these position criteria
have to be met in order to identify a line, such a line is automatically scored as being in the
correct position. The detail elements star, cross, and circle are scored as positioned correctly if
their distance from the target location is less than 50% of the drawn figure height. Similarly to manual scoring, the automated scoring program awards full border element points if no border elements are present. This scoring process results in a total score out of 60 points for
each response drawing. Full details on the design and implementation of this Automated Scoring
Program can be found in the original master's dissertation
(Yamshchikova, 2019). The Figure Copy software can be downloaded for Academic Use from
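As a simple illustration of the position rule for detail elements and of how element sub-scores combine into the total out of 60, consider the sketch below; the data structures and function names are hypothetical.

```python
# Position rule for detail elements (circle, star, cross) and element scoring.
import numpy as np


def detail_position_correct(drawn_centroid, target_centroid, figure_height):
    """Detail elements are positioned correctly if within 50% of the drawn figure height."""
    dist = np.linalg.norm(np.subtract(drawn_centroid, target_centroid))
    return dist < 0.5 * figure_height


def element_score(present, accurate, positioned):
    """Each element contributes up to 3 points; 20 elements give a total out of 60."""
    return int(present) + int(accurate) + int(positioned)
```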
Data analysis
The manual scoring data included in this investigation was completed independently by
Anon author 1 (rater 1) and Anon author 2 (rater 2). Both raters were trained to score drawings
and both scored all 928 responses included in this investigation. During scoring, all figures were
randomized and anonymized so that raters were blind to drawing condition, participant group,
and identity. First, the degree of agreement between human rater scores was assessed. Given that
figure copy total scores represent an aggregate measure which may not accurately capture inter-
element variation, these analyses were conducted both on total scores and on an element-wise basis.
Agreement was measured in two ways. First, summed scores were compared using an intraclass
correlation coefficient (ICC; model, ICC1: i.e., single scores, random raters), which measures the
ratio of true variance divided by true variance plus error variance (Koo & Li, 2016) and ranges
from 0 to 1. Cohen’s kappa reliability statistic was used for binary data, such as whether a presence, accuracy, or position point was awarded or not, and is scaled as a standardised correlation coefficient
to enable cross-study interpretation (McHugh, 2012). This investigation employs the ICC
reliability benchmarks proposed by Koo and Li (2016): <=.50 = poor reliability; >.50 - <=.75 = moderate reliability; >.75 - <=.90 = good reliability; >.90 = excellent reliability. All Cohen’s kappa calculations employ the agreement benchmarks defined by McHugh (2012): 0-.20 = no agreement; .21-.39 = minimal agreement; .40-.59 = weak agreement; .60-.79 = moderate agreement; .80-.90 = strong agreement; >.90 = almost perfect agreement.
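For readers unfamiliar with these statistics, the sketch below computes a one-way, single-rater ICC and Cohen's kappa from first principles. The published analyses used the R packages psych and irr, so this Python version is purely illustrative.

```python
# Minimal sketches of ICC(1) (total scores) and Cohen's kappa (binary sub-scores).
import numpy as np


def icc1(ratings):
    """One-way random, single-rater ICC; `ratings` is an (n_targets, k_raters) array."""
    ratings = np.asarray(ratings, float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    target_means = ratings.mean(axis=1)
    ms_between = k * np.sum((target_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - target_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def cohens_kappa(r1, r2):
    """Cohen's kappa for two binary ratings (e.g. element present vs absent)."""
    r1, r2 = np.asarray(r1, bool), np.asarray(r2, bool)
    po = np.mean(r1 == r2)                                   # observed agreement
    pe = np.mean(r1) * np.mean(r2) + np.mean(~r1) * np.mean(~r2)  # chance agreement
    return (po - pe) / (1 - pe)
```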
Next, the agreement between the Automated Scoring Program and aggregate human
scores was determined. Element-wise sensitivity (True Positives / (True Positives + False Negatives)) and specificity (True Negatives / (True Negatives + False Positives)) were calculated. In these calculations, False Negatives represented cases in which an element was identified by manual scoring, but not by the automated program. Conversely, False Positives represented cases where an element was identified by the automated program, but not by human raters. Sensitivity analysis is typically used to determine whether a test correctly distinguishes one group of cases from another; in our case, whether an element is present or not. The benchmark for interpretation is that sensitivity + specificity should be close to or above 1.50 (or 150-200 when expressed as percentages).
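The sensitivity, specificity, and the Youden index reported later in the Results follow directly from these counts, as in the small illustrative helper below.

```python
# Illustrative helper mirroring the definitions above.
def diagnostic_summary(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)      # proportion of manually "present" elements recovered
    specificity = tn / (tn + fp)      # proportion of manually "absent" elements recovered
    youden = sensitivity + specificity - 1
    return {"sensitivity": sensitivity, "specificity": specificity, "youden": youden}
```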
We also examined how the Automated Scoring Program resolved cases in which the
raters did not assign identical scores. Next, a qualitative analysis of cases in which the automated
program was and was not able to extract meaningful scores was conducted. Finally, the known-
group discriminability of total scores assigned by the Automated Scoring Program was
examined.
Statistical analyses were conducted in R (version 3.5.1 (2018-07-02), R Core Team,
2018), the data and analyses scripts used to generate this manuscript are openly available
(https://osf.io/3k6gs/). We used the following packages for statistical analyses and visualisation
ggplot2 (version 3.3.2) (Wickham, 2016), cowplot (version 1.1.0) (Wilke, 2019), psych (version
1.8.12) (Revelle, 2018), irr (version 0.84.1) (Gamer et al., 2019), pROC (version 1.16.2) (Robin
et al., 2011), rcompanion (version 2.3.7) (Mangiafico, 2019), rstatix (version 0.4.0)
Results
The average total score assigned by human raters was 57.72 (SD=6.09) for copy
condition drawings and 44.4 (SD= 10.69) for recall condition responses. Raters exhibited a high
degree of agreement between assigned total scores, with a cumulative intraclass correlation
(ICC) of .97, F(927,927)=58.40, p <.001, 95% CI [.96-.97]. This close agreement was present
within both the copy (ICC =.97, F(463,463)=57.56, p <.001, 95% CI [.96-.97]) and recall (ICC
Of the 55680 elements scored, only 3.64% were assigned conflicting element sub-scores
by the assessors. Of all elements, raters disagreed on position scores most frequently (1.30%),
followed by accuracy scores (1.26%), and then presence (1.08%). See Figure 3. Raters were
found to disagree on more recall condition elements (5.51%) than copy condition elements
(1.78%). This difference is likely due to the comparatively greater quality variation present
within delayed recall drawing responses (recall variance = 126.63, copy variance = 17.41).
Figure 3
Note. Between-rater total score comparisons across both copy and recall condition figure copy
drawings (N=928) demonstrated a high degree of agreement across total accuracy, position, and
presence scores (ICCs = .95, .96, and .96 respectively). The dashed line represents perfect
correlation (slope =1, intercept = 0) in order to demonstrate deviation of agreement, and solid
line reflecting best fit line. Figure available at https://osf.io/2qwfn/ under a CC-BY4.0 license.
Next, elements which caused the highest degree of disagreement between the raters were
identified. The most frequent element to be disagreed upon across all sub-scores was the middle
bottom right interior divider slanted line (element 11, see Table 4) where the human raters
disagreed on all three sub-scores a total of 37 / 928 times (3.99%). The small left vertical interior
divider line (element 12) had the highest number of two sub-score disagreements (4.42%, n =
41), with position representing the most commonly disputed sub-score. Finally, the circle
(element 14) had the highest number of cases in which human raters differed within a single sub-score.
The comparative accuracy of the automated figure copy scores was evaluated against the
manually assigned scores. For the element-wise analyses, only element scores where both raters
agreed (96.36% of all scores) were included in these analyses. For total score comparisons, we
averaged the two raters total scores. This procedure was adopted to ensure the Automated
Scoring Program was able to accurately score figures versus agreed-upon scores before moving
on to more complex cases. Overall, the scores assigned by the automated program and raters
exhibited a high degree of agreement both in terms of total score (ICC = .83, F(927,927)=21.90,
p <.001, 95% CI [.23-.93]) and element scores (Cohen’s k = .63, 95% CI [.63-.63], p<.001). The
same was true for the recall condition (total score ICC = .83, F(463,463)=22.69, p <.001, 95% CI
[.23-.94], element-wise agreement Cohen’s k =.71, 95% CI [.71-.72], p<.001), and copy
condition (total score ICC = .58, F(463,463)=7.20, p <.001, 95% CI [.01-.81], element-wise
A further way to compare the Automated Scoring Program to the human raters’ scoring is to examine whether the same participants are identified as impaired under either scoring method.
Receiver operating characteristic (ROC) analyses were conducted to compare total score
binarized impairment categorizations (i.e., scores more than 2SDs below the mean) of the automated
assigned scores to those based on the standard manual scoring and cut-offs in copy and recall
conditions. In this way we directly compared the impairment classification between manually
and automatically derived scores, rather than trying to determine presence of a stroke event.
When compared to impairment categorisations made based on manual scores overall (i.e., across
both copy and recall conditions), the Automated Scoring Program was found to have a total AUC of 86.70%, with an overall sensitivity of 80% and specificity of 93.40%. Performance was high in each condition, with a slightly lower Youden index in the copy condition (sensitivity = 79.13% copy & 80.70% recall; specificity = 90.14% copy & 96.81% recall; Youden index = .69 copy & .78 recall). When overall sensitivity and specificity are summed, we get a value of 173.44% or 1.73 in raw units, meaning that our test had above excellent ability to determine impairment classification compared
to manual scores (Power et al., 2013). Table 3 summarises the average scores attained by each
sample group per copy and recall condition, and presents group specific sensitivity and
specificity statistics.
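A hedged sketch of this comparison is shown below; the paper used pROC in R, so the scikit-learn calls and variable names here are illustrative assumptions that simply mirror the described logic (manual impairment labels as ground truth, automated totals as the continuous predictor).

```python
# Illustrative ROC comparison of automated totals against manual impairment labels.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def roc_vs_manual(manual_totals, auto_totals, healthy_mean, healthy_sd):
    manual_totals = np.asarray(manual_totals, float)
    auto_totals = np.asarray(auto_totals, float)

    # Ground-truth impairment from the manual scores (lower score = impaired).
    cutoff = healthy_mean - 2 * healthy_sd
    impaired = manual_totals < cutoff

    # Lower automated totals should indicate impairment, so negate the score.
    auc = roc_auc_score(impaired, -auto_totals)
    fpr, tpr, _ = roc_curve(impaired, -auto_totals)
    return auc, fpr, tpr
```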
Figure 4
Note. There were separate cut offs for recall and copy conditions of the Figure Copy Task.
Impairment on the task was classified as scoring more than 2SDs below the mean score, and the overall graph takes into account both conditions. Figure available at https://osf.io/q6zys/ under a CC-BY4.0 license.
Table 3
Summary statistics of performance of the participants scored by human raters and the Automated Scoring Program,
Condition | Group | Human total score M (SD) | Algorithm total score M (SD) | True positive | True negative | False positive | False negative | Sens. | Spec.
Figure Copy | Healthy adults | 59.03 (2.90) | 55.65 (4.68) | 2.30 | 91.19 | 2.68 | 3.83 | 37.52 | 97.14
Figure Copy | Sub-Acute stroke | 50.2 (13.45) | 44.84 (11.62) | 24.49 | 59.18 | 16.33 | 0 | 100 | 78.37
Figure Copy | Chronic stroke | 57.33 (5.98) | 48.96 (7.95) | 12.99 | 70.13 | 16.88 | 0 | 100 | 80.60
Figure Copy | Overall | 57.72 (6.09) | 52.29 (7.94) | 8.19 | 80.82 | 8.84 | 2.16 | 79.13 | 90.14
Figure Recall | Healthy adults | 47.83 (7.97) | 43.95 (8.70) | 1.53 | 95.40 | 1.15 | 1.92 | 44.35 | 98.81
Figure Recall | Sub-Acute stroke | 32.36 (14.31) | 31.57 (12.86) | 26.53 | 71.43 | 2.04 | 0 | 100 | 97.22
Figure Recall | Chronic stroke | 41.69 (10.84) | 34.13 (11.67) | 18.83 | 71.43 | 5.84 | 3.90 | 82.84 | 92.44
Figure Recall | Overall | 44.4 (10.69) | 39.38 (11.49) | 9.91 | 84.91 | 2.80 | 2.37 | 80.7 | 96.81
Note. We present sensitivity/specificity of the impairment classifications of the Automated Scoring Program (e.g., <2SDs from healthy
adult mean total score) compared to ground truth impairment classifications of the manually scored total scores, per group and overall.
To further illustrate the degree of agreement between scoring methods overall in terms of
total score, we classified assigned automated scores into five categories: (1) a direct match with averaged rater total scores, (2) within 5% of averaged rater total scores, (3) between 5% (not inclusive) and 10% (inclusive) of averaged rater total scores, (4) between 10% and 15% deviation from averaged rater total scores, and (5) greater than 15% deviation from averaged rater scores (e.g. Canham et al., 2000). We found that 83.51% of scores from the algorithm were within
15% of the average rater scores (39.76% within 5%) and that the maximum deviation was 52%
(n=1). In this single extreme case, the participant had drawn a non-element outside of the figure
boundary, but within the maximum bounds, skewing the normalisation process such that the
algorithm failed to recognise one side of the otherwise perfect figure. 16.16% were scored by the
Figure 5
Illustrates the agreement between the human raters and the automated scoring program in both copy and recall conditions.
Note. The top panel illustrates the relationship between automated and average manual Figure
Copy total scores (N=928). Lower panels present a comparison of accuracy, position, and
presence element scores. The dashed line represents perfect agreement (slope =1, intercept = 0),
and the linear best fit line is in black. This reveals that the automatic scoring algorithm tends to underscore relative to the human raters, and that this underscoring is largely driven by stricter accuracy scoring. Figure available at
Table 4 presents the automated program’s proportion of element hits, misses, false
positives, and correct rejections for element-wise presence scores versus the human raters.
Overall, the automated algorithm was found to exhibit an average element sensitivity of 90.10%
and an average specificity of 92.20%. See supplementary materials for sensitivity tables for each
element score (i.e., presence, accuracy, and position) and condition separately.
Table 4
The Automated Scoring Program’s proportion of presence score hits, misses, false positives, and
correct rejections versus the matched human rater data for combined copy and recall condition
drawings
Next, Cohen’s k analyses were performed to evaluate the degree of agreement between
automated and manual element accuracy, position, and presence scores. These scoring methods
were found to exhibit a high degree of agreement on position (k =.82, 95% CI [.82-.84], p<.001)
and presence (k =.76, 95% CI [.76-.77], p<.001) scores, but a lower degree of agreement within
accuracy scores (k =.41, 95% CI [.41-.42], p<.001). The greatest source of disagreement between
the automated and manual scorings was found to be element accuracy false positives (22.63%
accuracy false positives versus 2.81% position and 4.72% presence false positives), resulting in a lower overall agreement for the accuracy sub-score.
Thus far, only data from element sub-scores in which human raters agreed with one
another, or averaged total scores, has been considered. However, it is also important to
investigate the performance of the automated program in more ambiguous cases. For this reason,
the Automated Scoring Program was then evaluated within drawings where human raters
disagreed. To do this, we examined element or total score cases in which the two raters did not
agree.
First, ICC analyses were conducted to identify the degree of agreement between
automated scores and individual rater’s assigned scores within cases where raters had assigned
different total scores (37.07% of the time). Automated scores exhibited high consistency with
both rater one’s assigned scores (ICC = .80, F(553,553)=17.91, p <.001, 95% CI [.25-.92]) and
rater two’s (ICC = .79, F(553,553)=18.73, p <.001, 95% CI [.13-.92]) in cases where raters had assigned different total scores.
We then examined how the Automated Scoring Program resolved these rater
disagreements. In cases where there was clear disagreement between human raters (i.e., all three element points disagreed upon) or cases with closer agreement between human raters (i.e., only one element point disagreed upon), the Automated Scoring Program tended to award more points.
Cases in which raters disagreed on a single element sub-score were classed as one-point
disagreements. Similarly, cases where raters disagreed on all three element sub-scores are termed
three-point disagreements. Element 11 (middle bottom right slanted line) had the most three-
point disagreements, and element 14 (the circle) had the most one-point disagreements. The
algorithm scored the majority of participants three points (46.77% for element 11, 71.66% for
element 14), and the consistency between the Automated Scoring Program and the average
human rater score was moderate (element 11 ICC = .61, F(36,36)=8.32, p <.001, 95% CI
[-.02- .85], element 14 ICC = .61, F(56,56)=10.44, p <.001, 95% CI [-.08- .85])).
However, in cases where the scoring was more ambiguous (i.e., disagreement by two points was common on the small left vertical interior divider line, element 12), the automated scoring algorithm scored less favourably, giving the majority of participants zero points
(37.39%). For these participants, however, the consistency between the Automated Scoring
Program and the average human rater total score was still good (ICC = .69, F(40,40)=15.80, p <.001).
When looking at all element score disagreements between raters regardless of the degree of disagreement (i.e., by one point or more), the automated scoring program withheld accuracy, position, and presence points more often (57.64% of the time) than it awarded them (42.36% of the time); this was especially the case for accuracy, where the Automated Scoring Program withheld far more accuracy points than it awarded. This can be seen in Figure 4.
Overall, the automated scores matched well with the scores assigned by human raters
with the majority of the automated total scores being within 15% of the manually assigned
scores. The Automated Scoring Program was reliably able to extract and identify figure elements
in drawings. For example, drawings which contained distorted or disconnected lines (Figure 6
panel A), partial copies (Figure 6 panel B), additional elements (Figure 6 panel C), and mild
tremor (Figure 6 panel D) were generally scored accurately. Overall, the Automated Scoring
Program was able to successfully discriminate elements from a wide range of imperfectly drawn
figures.
Figure 6
Examples of distorted drawings from which the algorithm correctly identified and scored figure elements.
The ability of the Automated Scoring Program to effectively quantify Figure Copy Task responses was generally robust, with <10.06% element false positives and <1.17% element false negatives; however, there are specific response patterns which may result in systematic scoring failures. For example, the
scoring algorithm is less accurate in cases where extra elements were included (Figure 7 Panels
A, D, F). The Automated Scoring Program also struggles in cases where participants extend line
elements beyond their template boundaries (Figure 7, Panel B), had attempted to correct
mistakes (Figure 7, Panel C), or had failed to draw figure border elements (Figure 7, Panel E).
Figure 7
Sample figures in which response features caused inaccuracies within the Automated Scoring
Program.
Note. Elements which distorted the automated scoring process are highlighted in red. The
Automated Scoring Program was found to struggle when additional figure elements were drawn
(A, C, D, F), when borders were missing (E), or when elements were extended beyond their
template boundaries (B). The number of participants whom the raters identified as making these errors is noted in the panel labels. Note that panel E shows a drawing with a perfect score, due to awarding
full border element points if no border element is present and/or the participant used the edge of
the on-screen drawing area as the figure border. Figure available at https://osf.io/8ahtu/ under a
CC-BY4.0 license.
The Automated Scoring Program was found to encounter the most difficulty when
scoring circle and star elements, as the algorithm must employ precise criteria (e.g. number and
angle of intersecting lines) to identify these features. For example, the automated algorithm
struggled to identify the circular element, missing 7.22% of circles which were marked as
present by human raters. This systematic false negative specifically occurred when the circle was
drawn as an arc, as multiple distinct overlapping lines, or another non-closed path (Figure 8
panel A). Star elements also may not be correctly identified, with the automated program missing
5.60% of stars marked as present by human raters. This false negative occurs if stars are drawn as
a single continuous path, rather than as distinct lines (Figure 8 panel B). However, it should be
noted that overall the inaccurate scoring by the algorithm was comparatively infrequent.
Figure 8
Further examples in which response features caused inaccuracies within the Automated Scoring
Program.
Note. Illustrates examples of figures where circular elements (Panel A) and star elements (Panel
B) were not identified by the Automated Scoring Program. Circles which were drawn as
multiple, overlapping lines or multiple separated lines were not identified (Panel A, distinct lines
highlighted in red and green, or for black and white print, in disjointed translucent highlight).
Similarly, stars which were drawn as a continuous line were missed by the Automated Scoring Program (Panel B).
Known-group discriminability
In order to sanity check the scoring of the Automated Scoring Program, we compared the three
sample groups (i.e., sub-acute stroke, chronic stroke, and healthy adults), to see if they performed
differently from each other. Given typical recovery trajectories following stroke, we would
expect that the sub-acute stroke group would score lower than the chronic stroke group, and that
the chronic stroke group would score lower than the healthy adult group. An ANCOVA analysis
was conducted to establish the differences between the healthy adult and stroke survivor groups
in their total scores while controlling for demographic differences in age and
education. For the copy condition, the ANCOVA revealed a significant effect of group, when
controlling for age and education (F(2,386)=40.80, p <.001). Tukey HSD test indicated that
healthy adults performed significantly better than both sub-acute stroke survivors (Mdifference=7.65,
p<.001, d=-1.72) and chronic stroke survivors (Mdifference=3.96, p<.001, d=-1.10). For stroke
survivors specifically, lesion volume was added to the model as a covariate and Tukey HSD
demonstrated significant differences between the sub-acute and chronic stroke survivors (Mdifference=4.69, p=.01, d=-.46). For the recall condition, when only controlling for the effects of age and education on total score, healthy adults again performed significantly better in the recall
condition than both sub-acute stroke survivors (Mdifference=8.94, p<.001, d=-1.31) and chronic
stroke survivors (Mdifference=4.63, p<.001, d=-.99). When lesion volume was additionally added for the recall data, the Tukey HSD analysis revealed no significant difference between sub-acute and chronic stroke survivors in the recall condition (Mdifference=3.89, p=.13, d=-.21). See Figure 9 for the distributions of total scores by group and condition.
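For illustration, the general structure of this covariate-adjusted group comparison can be sketched as follows. The published analyses were conducted in R (rstatix), so this statsmodels version and the assumed column names ("total", "group", "age", "education") are illustrative only.

```python
# Illustrative ANCOVA: figure copy total score by group, adjusting for covariates.
import statsmodels.api as sm
from statsmodels.formula.api import ols


def group_ancova(df):
    """Type-II ANCOVA table for a data frame with total, group, age, and education columns."""
    model = ols("total ~ C(group) + age + education", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)
```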
Figure 9
Illustrates the distributions of total scores from the three groups on the Figure Copy test, in both copy and recall conditions.
Note. White dot represents separate group means, individual scores are in light blue, and score
distributions are presented in darker blue. Sub-Acute and Chronic refer to stroke survivors tested before and on or after 6 months post-stroke. Figure available at https://osf.io/pfmcu/ under a CC-BY4.0
license.
Discussion
This investigation aimed to develop a novel, automated program to score the OCS-Plus
Figure Copy Task (Demeyere et al., 2020) and to evaluate the accuracy and utility of this
automated tool. Overall, the automated scoring algorithm was able to reliably extract and
identify individual figure elements and to assign total scores which agreed well with manual
scores across both the copy and recall conditions. Compared to overall impairment
categorisations based on manual scores, the Automated Scoring Program had a high overall
sensitivity and specificity and was reliably able to distinguish between different clinical
impairment groups. The novel automated program was found to be generally robust and very
close to the manual scoring overall. There is a clear benefit of automating Figure Copy scoring,
in terms of time and cost savings, in particular allowing this screening assessment to be used
without the need for highly trained neuropsychologists to administer and score the task. At the
group level, the scoring tool is clearly able to distinguish groups, and diagnostic accuracy
compared to manual scoring was very high with an overall AUC of 86.7%. At an individual
patient level, we did note some specific response patterns which resulted in systematic scoring
failures on the automatic tool. Even though these were low in incidence, if the scoring algorithm is used to inform individual clinical decisions, presenting the automated scores with full visualization of the original drawing in the reports is key to help
interpret all scores at the individual level (see https://osf.io/y9xvk/ for an example output of the
Automated Scoring Program). Overall, the very high alignment with manual scoring means this
program represents a significant and pragmatic advancement over traditionally employed manual
scoring procedures, setting the scene for potential implementation in wide-scale screening.
Within this investigation, the two human raters were found to assign scores with a high
degree of agreement. This consistency was present across individual element sub-scores as well
as within both copy and recall condition data. When human raters did not assign identical scores,
the source of disagreement was most commonly individual element position and accuracy scores.
However, the human raters in this investigation completed an extensive training program
designed to standardize assigned scores which is typically not feasible to implement at scale
within clinical environments. In order to allow more automated and wider-scale cognitive screening to be conducted, reducing the need for high-level training and time-intensive manual scoring is essential.
Overall, the automated scoring algorithm was able to reliably extract and accurately score
individual elements within patient Figure Copy Task responses. In cases where human raters
assigned identical scores, there was a high degree of consistency between automated and
manually assigned total scores and moderate agreement within individual element scores. The
overall human-algorithm score correlations in this investigation were largely similar to those
reported by Vogt et al. (2019) (0.83 versus 0.88 respectively). Within individual elements, the
Automated Scoring Program demonstrated extremely high sensitivity (92.10%) and specificity (90.20%). The disagreements which did occur do not necessarily represent algorithmic errors, but instead suggest the use of slightly different, but not necessarily
less valid, scoring criteria. For example, the automated program has a tendency to be stricter than
humans raters when awarding points within the accuracy element sub-score. However, despite
this systematic difference within accuracy sub-scores, the vast majority of automated total scores
(83.51%) were within 15% of manually assigned total scores. These findings suggest that the
automated program employs slightly different element scoring criteria than the human raters, but
this variance does not result in substantial changes within response total scores.
The performance of the automated scoring algorithm was also separately investigated
within responses where human raters did not assign identical scores. This is a particularly critical comparison, as there is no objective method for resolving such human rater disagreements. Within individual response elements
which were assigned different scores by human raters, the automated program tended to employ
more lenient scoring criteria. For example, when a specific element was scored as being
inaccurately drawn by one rater but accurately drawn by the other, the Automated Scoring
Program was more likely to report that the element had been drawn accurately. Despite this
tendency to be more lenient, as a whole, automated scores exhibited high consistency with both
rater one and rater two’s assigned total scores in cases where both raters disagreed. This indicates
that the automated program’s systematic lenient scoring of disagreed upon individual elements
does not appear to produce systematic biases within overall response scores. In any case where
quantitative scores are assigned to responses which do not have a clear “ground truth” score, some degree of scoring ambiguity is unavoidable. Importantly, the automated program applies its criteria with complete consistency where human raters may not, and provides the clear advantage of being able to reproduce its scores exactly.
The Automated Scoring Program was found to exhibit several clear strengths over
manual scoring procedures. First, it was able to systematically assign completely reproducible
scores even in cases where drawings were distorted. Given that this investigation included data
from a representative sample of sub-acute stroke survivors exhibiting a range of common post-
stroke cognitive impairments, responses were frequently extremely dissimilar to the target figure.
The automated program was found to cope well with drawing inaccuracies due to co-morbid fine
motor impairments, omissions due to visuospatial deficits, perseveration errors, and other
common post-stroke impairment patterns. This robustness greatly adds to the automated
program’s potential clinical utility. Second, while manual scoring of Complex Figure Copy
drawings requires training and time to complete, the automated program is able to instantly
produce detailed score breakdowns. This makes employing an automated scoring procedure
extremely time efficient, which is a valued attribute especially within clinical settings. Finally,
the scores generated by the automated algorithm are completely reproducible. These
standardized scores are one of the greatest advantages of employing automated over manual
scoring methods, as they facilitate valid score comparisons across many different raters in many different settings.
Despite these advantages, some potential weaknesses were identified within the
automated scoring procedure. First, there are specific response patterns which were found to
result in systematic underscoring. For example, the automated program struggled to identify
circle and star elements which did not meet its exact mathematical extraction criteria but were
easily identifiable by human raters (Figure 8). Similarly, the Automated Scoring Program
struggled to accurately score drawings when large, extra features were present within the
response space (Figure 7). These failures occurred infrequently but represent a potential avenue
for improving the automated scoring procedure, or even simply providing an extra element of
confidence ratings for each figure, to flag up those which may have been underscored. Future
research should aim to identify more flexible methods for identifying more complex elements
and for preventing the presence of large extra elements from distorting figure segmentation.
Finally, the automated program employs slightly different element sub-scoring strategies than
human raters. Where a circle, star, or cross element can be identified by the automated program, it is automatically scored as “accurate” due to the equations having specific placement and line intersection requirements. This means that these three specific detail elements cannot be scored as present without also being scored as accurate. However, this
difference in scoring was not found to result in significant disagreements between automated and
human-assigned scores.
Within the broader automated scoring literature, it has been suggested that the best results are achieved when scoring procedures employ limited human feedback to address
minor weaknesses in otherwise robust algorithms. For example, Wang et al (2016) developed a
deep learning algorithm to identify and segment potentially cancerous tissue in mammograms
which found that a trained pathologist achieved an AUC of 99.6% whilst the automated
segmentation program achieved an AUC of 96.6%. However, when the automated output was briefly reviewed by the trained pathologist to remove obvious false positive cell clusters, the maximal AUC of 99.5% was achieved whilst retaining the time-efficiency benefit of employing
an automated scoring method (Wang et al., 2016). A similar approach could potentially be taken
to improve the performance of the automated scoring program presented within this
investigation. For example, human raters could quickly screen all figures assigned very low
scores by the algorithm to flag cases where normalizing errors have produced false negative
scores. Overall, the automated scoring program was found to provide a robust and reliable
method for analyzing a wide range of Figure Copy Task responses. However, future
investigations should further explore clinical feasibility and acceptability and, within this,
investigate whether employing a collaborative scoring approach could maximize the efficacy and
accuracy of automated scoring processes. Importantly, the automated scores were found to
reliably distinguish between participants falling into different impairment groups. On average,
sub-acute stroke survivors were assigned significantly lower scores than chronic stroke
survivors, who were in turn assigned lower average scores than neurologically healthy adults.
These findings are in line with expectations, demonstrating the external validity of automated
Figure Copy Task scores. Receiver Operating Characteristic curve analysis demonstrated that, compared to
overall impairment categorizations based on manual scores, the Automated Scoring Program had
an extremely high overall sensitivity and specificity (80% and 93.4% respectively) (AUC =
86.7%). This finding illustrates that impairment classifications based on automated scores alone
are largely comparable to those assigned by human raters. Taken together, this external validity
and ability to identify overall impairment highlight the automated scoring program’s potential
clinical utility.
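As a minimal sketch of the human-in-the-loop screen suggested above, the following code flags automated totals that fall below an illustrative cut-off so that a human rater can check them for extraction failures. The threshold value, the flag_for_review function, and the (participant_id, total_score) input format are assumptions for illustration only, not part of the published scoring program.

# Minimal sketch of the proposed human-in-the-loop screen.
# `scored_figures` is assumed to be a list of (participant_id, total_score)
# pairs produced by the automated program; the threshold is illustrative.
REVIEW_THRESHOLD = 10  # hypothetical cut-off, not a validated value

def flag_for_review(scored_figures, threshold=REVIEW_THRESHOLD):
    """Return IDs of figures whose automated total score is low enough that a
    human rater should quickly check them for extraction failures."""
    return [pid for pid, score in scored_figures if score <= threshold]

flagged = flag_for_review([("p01", 42), ("p02", 6), ("p03", 31)])
print(flagged)  # ['p02']

Because only the flagged subset would be re-examined manually, most of the time savings of fully automated scoring would be preserved.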
The automated approach presented here therefore has clear potential to benefit cognitive evaluations within both clinical and research settings. From a clinical perspective, automated
scoring offers a time-efficient solution for standardizing Figure Copy Scores in order to more
reliably detect impairment patterns across many different patient groups. Examiners will no
longer have to complete time-consuming scoring or training procedures, and will be provided
with immediate, highly detailed scoring results. This in turn may help improve the speed and
accuracy with which cognitive deficits are identified and open the door to wider population-based cognitive screening and (assisted) self-
assessments. From a research perspective, employing automated Figure Copy Scoring helps
reduce bias due to the reliance on subjective examiner judgments. This is a critical advantage, as
it facilitates valid, large-scale comparisons of Figure Copy Task data collected by different
examiners, within different patient groups or research settings. Automated scoring is also
completely reproducible, augmenting the reliability of any findings based on analyses of this
scoring data. Overall, the results of this investigation strongly suggest that the novel, automated
Figure Copy Scoring tool is a robust and reliable scoring methodology which can be employed to
produce immediate, unbiased, and reproducible scores for Complex Figure Copy Task responses in clinical and research environments.
Limitations
There are several potential avenues through which future research can aim to expand on
the findings of this investigation. First, Complex Figure Copy Tasks are not only commonly
employed within stroke patients, but are also regularly administered to patients with suspected
dementia, traumatic brain injury, and other neurological deficits. Patients falling within each of
these impairment categories may exhibit different error patterns within Figure Copy Tasks.
Future research can aim to investigate whether this Automated Scoring Program performs
equally well across these patient groups and to determine whether these Figure Copy tasks can
reliably differentiate between a wider range of clinical populations. Second, this Automated
Scoring Program was specifically designed to score the computerised OCS-Plus Figure Copy
Task and does not generalise to other Figure Copy Task stimuli. In order to facilitate automated
scoring of other common Figure Copy Tasks (such as the Rey-Osterrieth Complex Figure), future
research will need to develop additional, specialised automated scoring algorithms. Similarly, the
automated algorithm relies on detailed (x, y) coordinates and timestamps produced by a tablet
interface. Although computerised testing is becoming increasingly common within
clinical and research environments (e.g. Bauer et al., 2012), many Figure Copy Tasks are still
administered in pen and paper format, and the embedding of computerised testing in routine clinical practice remains an ongoing process.
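To clarify this dependency, the sketch below shows one plausible representation of the time-stamped stroke data the algorithm requires. The class and field names are assumptions for illustration and do not reflect the actual OCS-Plus data format; pen-and-paper administrations provide none of this information, which is why the present algorithm cannot be applied to scanned drawings directly.

# Illustrative sketch of the kind of input the automated algorithm depends on:
# time-stamped (x, y) samples grouped by pen stroke, as captured by a tablet.
# All names here are assumptions for illustration, not the OCS-Plus format.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Stroke:
    samples: List[Tuple[float, float, float]]  # (x, y, timestamp in ms)

@dataclass
class FigureCopyResponse:
    participant_id: str
    strokes: List[Stroke]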
Conclusions
This investigation presents a novel, automated scoring tool for the OCS-Plus Figure Copy Task
(Demeyere et al., 2020). Overall, the automated scoring algorithm was able to reliably extract
and identify individual figure elements and to assign total scores which agreed well with manual
scores across both the copy and recall conditions. This automated program was reliably able to
identify overall impairment patterns and distinguish between different clinical impairment
groups. This represents a significant advancement as this novel technology can be employed to
produce immediate, unbiased, and reproducible scores for Complex Figure Copy Task responses
in clinical and research environments. More generally, the findings of this investigation suggest
that automated scoring procedures can be implemented to improve the scope and quality of neuropsychological assessments.
References
Agrell, B., & Dehlin, O. (1998). The clock-drawing test. Age and Ageing, 27(3), 399–403.
Alladi, S., Arnold, R., Mitchell, J., Nestor, P. J., & Hodges, J. R. (2006). Mild cognitive
https://doi.org/10.1017/S0033291705006744
Al-Janabi, H., Flynn, T. N., & Coast, J. (2012). Development of a self-report measure of
Barker, L. A., Morton, N., Morrison, T. G., & McGuire, B. E. (2011). Inter-rater reliability of the
https://doi.org/10.3109/02699052.2011.597046
Canham, R., Smith, S. L., & Tyrrell, A. M. (2000). Automated scoring of a neuropsychological
test: The Rey Osterrieth complex figure. Proceedings of the 26th Euromicro Conference.
https://doi.org/10.1109/EURMIC.2000.874519
Chechlacz, M., Novick, A., Rotshtein, P., Bickerton, W.-L., Humphreys, G. W., & Demeyere, N.
Chen, H., Pan, X., Lau, J. K. L., Bickerton, W.-L., Pradeep, B., Taheri, M., Humphreys, G., &
https://doi.org/10.1016/j.nicl.2016.04.007
Chen, S., Stromer, D., Alabdalrahim, H. A., Schwab, S., Weih, M., & Maier, A. (2020).
Crawford, J. R., Parker, D. M., McKinnley, W., & McKinlay, W. W. (1992). A Handbook of
Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science,
243(4899), 1668–1674.
Demeyere, N., Haupt, M., Webb, S. S., Strobel, L., Milosevich, E. T., Moore, M. J., Wright, H.,
Demeyere, N., Riddoch, M. J., Slavkova, E. D., Bickerton, W. L., & Humphreys, G. W. (2015).
Deza, M. M., & Deza, E. (2009). Encyclopedia of distances. In Encyclopedia of distances (pp. 1–
583). Springer.
Douglas, D. H., & Peucker, T. K. (1973). Algorithms for the reduction of the number of points
Duncan, P. W., Reker, D. M., Horner, R. D., Samsa, G. P., Hoenig, H., LaClair, B. J., & Dudley,
Folstein, M. F., Robins, L. N., & Helzer, J. E. (1983). The mini-mental state examination.
Foster, E. D., & Deardorff, A. (2017). Open Science Framework (OSF). Journal of the Medical
Freeman, R. Q., Giovannetti, T., Lamar, M., Cloud, B. S., Stern, R. A., Kaplan, E., & Libon, D.
Gamer, M., Lemon, J., Fellows, I., & Puspendra, S. (2019). irr: Various Coefficients of
https://CRAN.R-project.org/package=irr
Gillies, S. (2015). The Shapely user manual: Shapely 1.2 and 1.3 documentation.
Humphreys, G. W., Duta, M. D., Montana, L., Demeyere, N., McCrory, C., Rohr, J., Kahn, K.,
Tollman, S., & Berkman, L. (2017). Cognitive function in low-income and low-literacy
settings: Validation of the tablet-based Oxford Cognitive Screen in the Health and Aging
Jones, E., Oliphant, T., & Peterson, P. (2001). SciPy: Open source scientific tools for Python.
www.scipy.org
Kassambara, A. (2020). rstatix: Pipe-Friendly Framework for Basic Statistical Tests: Vol. 0.4.0.
Mahoney, F., & Barthel, D. W. (1965). Functional evaluation: The Barthel Index. A simple index
searching with clustered point sets. Data Structures, Near Neighbor Searches, and
Meehl, P. E. (1954). Clinical versus statistical prediction: A theoretical analysis and a review of
the evidence.
Moetesum, M., Siddiqi, I., Masroor, U., & Djeddi, C. (2015). Automated scoring of Bender
Gestalt Test using image analysis techniques. 2015 13th International Conference on
https://doi.org/10.1109/ICDAR.2015.7333845
Moore, M. J., Vancleef, K., Shalev, N., Husain, M., & Demeyere, N. (2019). When neglect is
https://doi.org/10.1136/jnnp-2018-319668
Pinto, E., & Peters, R. (2009). Literature review of the Clock Drawing Test as a tool for
Power, M., Fell, G., & Wright, M. (2013). Principles for high-quality, high-value testing. BMJ
R Core Team. (2018). R: A language and environment for statistical computing. R Foundation
Rabin, L. A., Paolillo, E., & Barr, W. B. (2016). Stability in Test-Usage Practices of Clinical
Neuropsychologists in the United States and Canada Over a 10-Year Period: A Follow-
206–230. https://doi.org/10.1093/arclin/acw007
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011).
pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77.
Schreiber, H. E., Javorsky, D. J., Robinson, J. E., & Stern, R. A. (1999). Rey-Osterrieth Complex
Shin, M.-S., Park, S.-Y., Park, S.-R., Seol, S.-H., & Kwon, J. S. (2006). Clinical and empirical
applications of the Rey–Osterrieth Complex Figure Test. Nature Protocols, 1(2), 892–
899. https://doi.org/10.1038/nprot.2006.115
Somerville, J., Tremont, G., & Stern, R. A. (2000). The Boston qualitative scoring system as a
Virbel, M., Hansen, T., & Lobunets, O. (2011). Kivy–a framework for rapid creation of
Wang, D., Khosla, A., Gargeya, R., Irshad, H., & Beck, A. H. (2016). Deep learning for
https://doi.org/10.1080/13854046.2017.1317364
Wickham, H. (2011). The split-apply-combine strategy for data analysis. Journal of Statistical
Wilke, C. O. (2019). cowplot: Streamlined Plot Theme and Plot Annotations for “ggplot2”: Vol.
1.0.0.
Yamshchikova, A. (2019). Automated scoring: An investigation into Figure Copy Task, a
Zigmond, A. S., & Snaith, R. P. (1983). The hospital anxiety and depression scale. Acta