Guarnera et al. (2017) - Why Do Forensic Experts Disagree
Recently, the National Research Council, Committee on Identifying the Needs of the Forensic Science Community (2009) and President’s Council of Advisors on Science and Technology (PCAST; 2016) identified significant concerns about unreliability and bias in the forensic sciences. Two broad categories of problems also appear applicable to forensic psychology: (1) unknown or insufficient field reliability of forensic procedures, and (2) experts’ lack of independence from those requesting their services. We overview and integrate research documenting sources of disagreement and bias in forensic psychology evaluations, including limited training and certification for forensic evaluators, unstandardized methods, individual evaluator differences, and adversarial allegiance. Unreliable opinions can result in arbitrary or unjust legal outcomes for forensic examinees, as well as diminish confidence in psychological expertise within the legal system. We present recommendations for translating these research findings into policy and practice reforms intended to improve reliability and reduce bias in forensic psychology. We also recommend avenues for future research to continue to monitor progress and suggest new reforms.
Lucy A. Guarnera, Department of Psychology, University of Virginia; Daniel C. Murrie, Institute of Law, Psychiatry, and Public Policy, University of Virginia School of Medicine; Marcus T. Boccaccini, Department of Psychology, Sam Houston State University.

Correspondence concerning this article should be addressed to Lucy A. Guarnera, Department of Psychology, University of Virginia, P.O. Box 400400, Charlottesville, VA 22904-4400. E-mail: [email protected]

Imagine you are a criminal defendant or civil litigant undergoing a forensic evaluation by a psychologist, psychiatrist, or other clinician. The forensic evaluator has been tasked with answering a difficult psycholegal question about you and your case. For example, “Were you sane or insane at the time of the offense? How likely is it that you will be violent in the future? Are you psychologically stable enough to fulfill your job duties?” The forensic evaluator interviews you, reads records about your history, speaks to some sources close to you, and perhaps administers some psychological tests. The evaluator then forms a forensic opinion about your case—and the opinion is not in your favor. You might wonder whether most forensic clinicians would have reached this same opinion. Would a second (or third,
The National Research Council, Committee on Identifying the Needs of the Forensic Science Community (2009) and President’s Council of Advisors on Science and Technology (PCAST; 2016) reviewed the state of forensic science, covering a wide range of disciplines including analyses of DNA, fingerprints, hair, tire treads, bite marks, and ballistics. Both governmental councils concluded that the error rates of many forensic techniques are unknown, and that forensic scientists are prone to a variety of contextual biases. Consistent with the National Research Council (NRC) and PCAST’s concerns, research has documented subjectivity and bias even in the forensic science procedures that courts have considered most reliable, such as analyses of DNA (Dror & Hampikian, 2011) and fingerprints (Dror & Rosenthal, 2008).

While forensic evaluators strive for objectivity and seek to avoid conflicts of interest (American Psychological Association, 2013), a forensic opinion may be influenced by multiple sources of variability and bias that can be powerful enough to cause independent evaluators to form different opinions about the same defendant (see Figure 1). The purpose of this review is to summarize and integrate research documenting various sources of disagreement in forensic evaluations, as well as suggest promising avenues of future research. We also present recommendations for translating these research findings into policy and practice reforms intended to improve the reliability of forensic evaluations.

The NRC and PCAST reports identified two broad categories of problems in forensic science that appear applicable to forensic psychology: (1) unknown or insufficient field reliability of forensic procedures, and (2) experts’ lack of independence from those requesting their services. We address both of these areas in turn.

Nezworski, & Stejskal, 1996). In general, the field reliability of forensic opinions is either unknown or far from perfect. For example, a recent meta-analysis concluded that for evaluations of adjudicative competency—one of the most common forensic psychology procedures—pairs of independent evaluators assessing the same defendant disagreed in approximately 15%–30% of cases (Guarnera & Murrie, in press). This corresponds to rater agreement coefficients (i.e., Cohen’s kappa)1 in the range of .30–.65, which indicates fair to moderate agreement according to most kappa interpretation schemes (e.g., Landis & Koch, 1977). Field reliability rates for other common forensic opinions are similar although generally somewhat lower; pairs of independent evaluators tend to disagree in approximately 25%–35% of sanity cases (κ ≈ .25–.65; Guarnera & Murrie, in press) and almost half (45%) of conditional release cases (κ = .19; Acklin, Fuger, & Gowensmith, 2015). As a related issue, the interrater reliability of forensic assessment instruments scored under routine practice conditions in the field is often poorer than what has been documented in controlled validation studies and reported in test manuals (C. S. Miller, Kimonis, Otto, Kline, & Wasserman, 2012).

1 See, generally, Gwet (2014) for a more in-depth definition and discussion of interrater reliability.
We discuss many possible reasons for these less-than-ideal field reliability rates, but one key foundational explanation is that forming a forensic opinion is an extraordinarily difficult task. For example, evaluations of legal sanity require clinicians to use limited and often contradictory information to draw conclusions about the mental state of a defendant at the time they committed the crime, which may have been months or even years ago. A survey of a variety of medical and psychological procedures confirms that complex decision tasks involving the integration of multiple sources of data, such as rating child behavior problems or classifying stroke severity, tend to settle at fair to moderate reliability rates (kappa or intraclass correlation [ICC] ≈ .30–.75; Meyer, Mihura, & Smith, 2005). This is in contrast to simple object counts (e.g., counting decayed or missing teeth) or physical measurements (e.g., measuring organ size on an ultrasound), where reliability tends to be higher, with rater agreement coefficients greater than .90 (Meyer et al., 2005). Along these lines, Mossman (2013) recently performed mathematical simulations of competency evaluations and concluded that fair to moderate reliability estimates were about as good as could reasonably be expected given the inherent difficulty of the task.

Limited Training and Certification for Forensic Evaluators

Besides the unreliability that may be intrinsic to a complex, ambiguous task such as forensic evaluation, research has identified multiple extrinsic sources of expert disagreement. One such source is limited training and certification for forensic evaluators. While specialized training programs and board certifications have become far more commonplace and rigorous since the early days of the field in the 1970s and 1980s, the training and certification of typical clinicians conducting forensic evaluations today remains variable and often poor (DeMatteo, Marczyk, Krauss, & Burl, 2009). For example, only about one third to one half of states have any state-level certification in forensic mental health assessment, and those that do may have weak standards (e.g., attend one brief, initial training session or have previous clinical experience; Gowensmith, Pinals, & Karas, 2015).

Thus, many states continue to have the bulk of their forensic evaluations performed by “occasional experts,” general clinicians without specialized forensic training (Grisso, 1987, p. 833).2 Unsurprisingly, studies assessing the thoroughness, relevance, and accuracy of the reports forensic clinicians submit to the court routinely find them deficient (Fuger, Acklin, Nguyen, Ignacio, & Gowensmith, 2014). For example, Skeem and colleagues (1998) found that competency evaluators’ reports in Utah failed to incorporate legally relevant aspects of competency and failed to adequately describe the reasoning underlying their final forensic opinion.

2 Occasional experts are likely more common in rural or other underresourced areas where forensic mental health assessments are needed, but no highly trained, board-certified forensic clinicians are available. Thus, the court’s only option may be a general clinician without specialized training in forensic assessment.
This training gap is important because empirical research suggests that evaluators with greater training produce more reliable forensic opinions. A compelling recent study conducted in Hawaii examined interrater reliability rates for three types of common forensic opinions (adjudicative competency, legal sanity, and violence risk assessment) both before and after the state adopted more stringent certification standards in 2014 (Gowensmith, Sledd, & Sessarego, 2014). These new standards included a mandatory 3-day training, written test, submission of a mock report, peer review process, and continuing education. Postcertification, reliability rates improved for all three types of evaluations (competency: 13% increase, p = .08; sanity: 17% increase, p = .04; risk: 29% increase, p = .001). Gowensmith and colleagues’ (2014) results provide the first direct evidence that more stringent state-level certification standards can improve the reliability of forensic opinions.

.78 and .74, respectively) than less structured instruments like the Psychopathy Checklist—Revised (PCL-R; Hare, 2003), which showed an ICC_1 of .60 in the field. Even within the PCL-R, more objective items with explicit scoring rules (e.g., criminal versatility, juvenile delinquency, revocation of conditional release; ICC_A1 = .75–.80) tend to show greater field reliability than more subjective items requiring impressionistic judgments (e.g., impulsivity, glibness, callousness; ICC_A1 = .23–.36; Sturup
on agreeableness may have been less willing to assume that equivocal data from the case files indicated psychopathy.

Regarding attitudes, early studies found that evaluators’ personal attitudes toward the insanity defense predicted whether they reached an insanity opinion in case vignettes (Homant & Kennedy, 1987). Vignette-based research and practitioner surveys have found that evaluators with pro-death-penalty attitudes are more likely to find hypothetical defendants competent for execution (Palker-Corell, 2007) or accept referrals for death penalty evaluations (Neal, 2016). Furthermore, evaluators themselves appear to acknowledge the potential influence of attitudes on their forensic work. In a recent qualitative study, many forensic evaluators identified preexisting personal, moral, or political values as influences on their forensic opinions (Neal & Brodsky, 2016).

Forensic Psychologists’ Lack of Independence From the Retaining Party

Upon these concerns about unknown or less-than-ideal field reliability of forensic psychology procedures, we now add concerns about forensic experts’ lack of independence from those requesting their services (NRC, 2009). As far back as the 1800s, legal experts have lamented the apparent frequency of scientific experts espousing the views of the side that hired them (perhaps for financial gain), leading one judge to comment, “[T]he vicious method of the Law, which permits and requires each of the opposing parties to summon the witnesses on the party’s own account[,] . . . naturally makes the witness himself a partisan” (Wigmore, 1924). More modern surveys continue to identify partisan bias as judges’ main concern about expert testimony, citing experts who appear to “abandon objectivity” and “become advocates” for the retaining party (Krafka, Dunn, Johnson, Cecil, & Miletich, 2002, p. 328).

Research on forensic psychologists working within adversarial settings appears to validate some of these concerns about adversarial allegiance, the tendency for experts to reach conclusions that support the party who retained them (Murrie et al., 2009). Some early studies suggested that clinicians drifted toward opinions favorable to the retaining party in real-life civil litigation following a mining disaster (Zusman & Simon, 1983) and in case vignettes simulating sanity evaluations (Otto, 1989).

More recently, using scores from structured risk instruments (e.g., PCL-R, Static-99R) as a convenient way to quantify differences in expert opinion, researchers examining archival data found large scoring differences according to side of retention—prosecution-retained evaluators produced higher risk scores that made the examinee look more dangerous, while defense-retained evaluators produced lower risk scores that made the examinee look more benign. For example, Murrie et al. (2009) found an average difference of 5.8 points on the PCL-R (score range: 0–40) between opposing sexually violent predator (SVP) evaluators in Texas, a difference twice the standard error of measurement reported in the test manual (Hare, 2003).3

Recent surveys also suggest that evaluators tend to interpret risk scores in a way that favors the side that retained them (Boccaccini, Chevalier, Murrie, & Varela, 2015; Chevalier, Boccaccini, Murrie, & Varela, 2015). For example, Chevalier et al. (2015) found that 94% of state-retained SVP evaluators reported using high-risk/need norms for the Static-99R (a way of interpreting scores that makes the examinee seem more risky, as compared to routine sample norms), but only 33% of respondent-retained evaluators reported using high-risk/need norms. Thus, two opposing evaluators who arrive at the same numerical score on a risk assessment instrument might still draw biased conclusions that favor the retaining side through differing norm selection.

3 SVP refers to sexually violent predator provisions, which allow for sexual offenders to be civilly committed after completing their criminal sentence. While SVP proceedings are technically civil, they still involve an adversarial arrangement, with different forensic psychologists testifying for the state and for the respondent (i.e., the individual being considered for commitment).
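To convey the size of the 5.8-point gap reported by Murrie et al. (2009), a brief calculation may help. The passage implies a standard error of measurement (SEM) of roughly 2.9 points (half of 5.8), and under classical test theory the difference between two independent, unbiased scorings of the same examinee has standard deviation SEM × √2. The sketch below is ours; the SEM value is inferred from the passage rather than quoted from the test manual.

    import math

    sem = 2.9                      # SEM implied by the passage (half of 5.8); illustrative
    sd_diff = sem * math.sqrt(2)   # SD of the difference between two independent scorings

    z = 5.8 / sd_diff              # how extreme a 5.8-point gap is, in SD units
    p_gap = math.erfc(z / math.sqrt(2))   # two-tailed normal tail probability
    print(f"z = {z:.2f}; P(|difference| >= 5.8) = {p_gap:.2f}")
    # -> z of about 1.41, p of about .16: if scoring error were purely random,
    #    gaps this large would arise in only a minority of pairs, yet 5.8 points
    #    was the average gap between opposing SVP evaluators.

Under these assumptions, random measurement error alone is an implausible explanation for an average between-side difference of that magnitude.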
These surveys and field studies of adversarial allegiance cannot rule out the possibility of selection effects creating the observed scoring differences (Murrie & Boccaccini, 2015). Attorneys may preselect evaluators whom they know to be sympathetic to their point of view, or gather preliminary opinions from multiple evaluators and ultimately retain only the most favorable opinion. Furthermore, evaluators may self-select according to preexisting attitudes or preferences, choosing to accept or decline particular types of cases or cases from particular referral sources (Neal, 2016). To eliminate the possible influence of selection effects, Murrie and colleagues (2013) conducted an experiment where practicing forensic evaluators were randomly assigned to believe they were working for the prosecution or the defense on a real-world case consultation. Even with random assignment, evaluators still tended to score cases in the direction of allegiance. Unsurprisingly, allegiance effects were larger for the PCL-R (medium to large effect sizes; d = 0.55–0.85) than for the more structured and objective Static-99R (small effect sizes; d = 0.20–0.42).4 While the Murrie et al. (2013) experiment used sex offender case files scored with popular risk assessment instruments, other types of forensic evaluations and instruments likely show the same vulnerability to adversarial allegiance.

4 These effect sizes held true for three out of four cases included in the study. One case, involving an individual with exceptionally low risk, did not show evidence of adversarial allegiance. All evaluators rated this individual as similarly low risk, regardless of side of retention.
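For readers who think in raw scores rather than effect sizes, Cohen’s d can be translated back into instrument points by multiplying by the pooled standard deviation. The sketch below is ours, and the pooled standard deviation of 8 PCL-R points is an assumed value chosen purely for illustration (the study’s actual score dispersion is not reported in this passage).

    # Cohen's d = (mean_1 - mean_2) / pooled_sd, so a given d implies a raw-score
    # gap of d * pooled_sd. The SD of 8 points is an assumption for illustration.
    pooled_sd = 8.0
    for d in (0.20, 0.55, 0.85):
        print(f"d = {d:.2f} -> about {d * pooled_sd:.1f} PCL-R points between sides")
    # -> 1.6, 4.4, and 6.8 points: under this assumed SD, the "medium to large"
    #    allegiance effects amount to several points on a 0-40 scale, even with
    #    evaluators randomly assigned to a side.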
Future Directions for Research, Practice, and Policy

The research overviewed here points to the growing realization that some portion of every forensic opinion—perhaps a larger portion than we might now acknowledge—has more to do with the examiner than the examinee. This is a serious problem that risks arbitrary or unjust outcomes for those undergoing forensic evaluations, as well as diminishing the legal system’s confidence in psychological expertise. Unreliable evaluations can also put the community at risk (e.g., assigning a low risk score to a truly high-risk individual likely to offend again). At the same time, some degree of unreliability and bias on complex human decision tasks is unavoidable in light of our “bounded rationality” (Gigerenzer & Goldstein, 1996). Given this tension, what next steps are possible to prevent forensic psychology from becoming the NRC or PCAST’s next target?

Just as research has helped uncover these problems, further research can continue to define the scope of the problem and suggest solutions. As a much-needed first step, foundational research should establish field reliability rates for various types of forensic evaluations in order to assess the current situation and gauge progress toward improvement. Only a handful of field reliability studies exist for a few types of forensic evaluations (i.e., adjudicative competency, legal sanity, conditional release), and virtually nothing is known about the field reliability of other types of evaluations, particularly civil evaluations. If error rates of forensic psychology procedures were widely known, legal decision makers might be able to weight their confidence in psychological testimony according to the reliability of the procedure in question (Butler, 2013). In addition, by carefully cataloguing variables specific to the examiner, examinee, and evaluation context from which reliability figures are drawn, field reliability research can also shed light on factors associated with better or worse reliability, suggesting further avenues for improvement (Guarnera & Murrie, in press).

Given that increased standardization of forensic methods has the potential to ameliorate multiple sources of unreliability and bias described here, more investigation of forensic instruments, checklists, practice guidelines, and other methods of standardization is a second research priority (Ægisdóttir et al., 2006). Some of this research should continue to focus on creating standardized tools for forensic evaluations and populations for which none are currently available, particularly civil evaluations such as guardianship, child protection, fitness for duty, and civil torts like emotional injury (Heilbrun & Brooks, 2010). Future research can also continue to seek improvements to the currently modest predictive accuracy of risk assessment instruments (Fazel, Singh, Doll, & Grann, 2012). However, given the current gap between the availability of forensic instruments and their limited use by forensic evaluators in the field, perhaps more pressing is research on the implementation of forensic instruments in routine practice. More qualitative (e.g., Pinals, Tillbrook, & Mumley, 2006) and quantitative (e.g., Neal & Grisso, 2014) investigations of how instruments are administered in routine practice, why instruments are or are not used, and what practical obstacles evaluators encounter are needed. Without greater understanding of how instruments are (or are not) implemented in practice—particularly in rural or other underresourced areas—continuing to develop new tools may not translate to their increased use in the field.
Third, a clear recommendation for improving evaluator reliability is that states without standards for the training and certification of forensic experts should adopt them, and states with weak standards (e.g., mere workshop attendance) should strengthen them. What is less clear, however, is what kinds and doses of training can improve reliability with the greatest efficiency. Drawing from extensive research in industrial and organizational psychology, credentialing requirements that mimic the type of work evaluators do as part of their job (e.g., mock reports, peer review, apprenticing) may foster professional competency better than requirements dissimilar to job duties (e.g., written tests; Phillips, 1998). Given that both evaluators and certifying bodies have limited time and resources, research into the most potent ingredients of successful forensic credentialing is a third research priority.

Even while this important research remains to be done, practicing forensic evaluators still have many options to reduce the impact of unreliability and bias in their own work. While many clinicians cite introspection (i.e., looking inward in order to identify one’s own biases) as a primary method to counteract personal ideology, idiosyncratic responses to examinees, and other individual differences (Neal & Brodsky, 2016), research suggests that introspection is ineffective and may even be counterproductive (Pronin, Lin, & Ross, 2002). Thus, more disciplined changes to personal practice are needed. For example, when conducting evaluations for which well-validated structured tools exist, evaluators could commit to using such tools as a personal standard of practice. This would entail justifying to themselves (or preferably colleagues) why they did or did not use an available tool for a particular case. Practicing forensic evaluators could also use simple debiasing methods to counteract confirmation bias, such as the “consider-the-opposite” technique in which evaluators ask themselves, “What are some reasons my initial judgment might be wrong?” (Mussweiler, Strack, & Pfeiffer, 2000). To increase personal accountability, evaluators could keep organized records of their own forensic opinions and instrument scores, or even help organize larger databases for evaluators within their own institution or locality (Lerner & Tetlock, 1999). Using these personal data sets, evaluators might look for mean differences in their own instrument scores when retained by the prosecution versus the defense, or compare their own base rates of incompetency and insanity findings to those of their colleagues.
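As a minimal sketch of what such record-keeping could support (the log entries below are hypothetical, not drawn from any cited study), an evaluator’s personal data set can be screened for a retention-side gap in a few lines:

    # Hypothetical personal log: (side of retention, PCL-R total score) pairs.
    my_cases = [
        ("prosecution", 28), ("prosecution", 31), ("prosecution", 24),
        ("defense", 22), ("defense", 19), ("defense", 25),
    ]

    def mean_score(cases, side):
        """Average instrument score across the cases retained by one side."""
        scores = [score for s, score in cases if s == side]
        return sum(scores) / len(scores)

    gap = mean_score(my_cases, "prosecution") - mean_score(my_cases, "defense")
    print(f"prosecution minus defense mean: {gap:+.1f} PCL-R points")
    # A persistent positive gap across many cases would be a warning sign of
    # allegiance. The same check applies to base rates of incompetency or
    # insanity findings (proportions of cases rather than mean scores).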
Ambitious evaluators could even experiment with blinding themselves to the source of referral in order to counteract adversarial allegiance (Robertson & Kesselheim, 2016). For example, evaluators could try using a case manager, an individual who communicates with attorneys and controls the inflow and outflow of information, in order to prevent irrelevant biasing information (such as the identity of the retaining party) from reaching the evaluator (Dror, 2013). Evaluators may soon be able to market (to attorneys or the court) their willingness to serve as blinded experts, since research suggests that mock jurors view the testimony of blinded experts as more credible (Robertson & Yokum, 2012).

Although individual evaluators can make many voluntary changes today in order to reduce the impact of unreliability and bias on their forensic opinions, other reforms require wider-ranging structural transformation. For example, state-level legislative action is needed to mandate more than one independent forensic opinion. Requiring more than one independent opinion is a powerful way to combat unreliability and bias by reducing the impact of any one evaluator’s error (Larrick, 2004). For example, by statute, Hawaii mandates three independent, nonadversarial forensic opinions for all felony defendants being evaluated for adjudicative competency and legal sanity (Hawaii Revised Statutes, 2003, sections 704-404 and 704-406). Only nine other states require more than one competency evaluator, and 14 states allow (but do not require) more than one evaluator (Gowensmith et al., 2015). For more states to join these ranks, state legislators would need to prioritize funding for multiple forensic evaluations per defendant, likely a substantial outlay. Similarly, more stringent state-level certification standards would require considerable financial investment in the infrastructure necessary to organize trainings, vet certification materials, maintain records, and enforce compliance.
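A back-of-the-envelope calculation illustrates why multiple independent opinions help. Assuming, purely for illustration, that each evaluator errs on a given case with probability .20 and that errors are uncorrelated, a majority-of-three opinion errs roughly half as often:

    # P(majority of 3 wrong) = P(exactly 2 wrong) + P(all 3 wrong),
    # with independent errors at an assumed per-evaluator rate of .20.
    p = 0.20
    p_majority = 3 * p**2 * (1 - p) + p**3
    print(f"single evaluator: {p:.3f}, majority of three: {p_majority:.3f}")
    # -> 0.200 vs 0.104. Correlated errors (shared records, shared biases)
    #    would shrink this gain, which is why the statute's requirement that
    #    the opinions be genuinely independent matters.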
Even slower to change than state legislation and infrastructure might be existing legal norms, such as judges’ current willingness to admit nonblinded, partisan experts. While authoritative calls to action like the NRC and PCAST reports may have some influence, most legal change only happens by the accretion of legal precedent, which is a slow and unpredictable process. Thus, radical changes regarding the roles and expectations of forensic experts—such as “hot tubbing,” a system pioneered in Australia where opposing experts are questioned simultaneously and can also question each other (Edmund, 2009)—seem unlikely to take root any time soon in the American legal system. Regardless, we hope the growing awareness of problems of unreliability and bias in the forensic sciences—in the wake of the NRC and PCAST reports—can spur on legal reforms, as well as create urgency to prioritize some of these larger structural and funding changes within forensic psychology.
References

Acklin, M. W., Fuger, K., & Gowensmith, W. N. (2015). Examiner agreement and judicial consensus in forensic mental health evaluations. Journal of Forensic Psychology Practice, 15, 318–343. http://dx.doi.org/10.1080/15228932.2015.1051447

Ægisdóttir, S., White, M. J., Spengler, P. M., Maugherman, A. S., Anderson, L. A., Cook, R. S., . . . Rush, J. D. (2006). The meta-analysis of clinical judgment project: Fifty-six years of accumulated research on clinical versus statistical prediction. Counseling Psychologist, 34, 341–382. http://dx.doi.org/10.1177/0011000005285875

American Psychological Association. (2013). Specialty guidelines for forensic psychology. American Psychologist, 68, 7–19. http://dx.doi.org/10.1037/a0029889

Boccaccini, M. T., Chevalier, C. S., Murrie, D. C., & Varela, J. G. (2015). Psychopathy Checklist–Revised use and reporting practices in sexually violent predator evaluations. Sexual Abuse. Advance online publication. http://dx.doi.org/10.1177/1079063215612443

Boccaccini, M. T., Turner, D. B., & Murrie, D. C. (2008). Do some evaluators report consistently higher or lower PCL-R scores than others? Findings from a statewide sample of sexually violent predator evaluations. Psychology, Public Policy, and Law, 14, 262–283. http://dx.doi.org/10.1037/a0014523

Butler, H. A. (2013). Debiasing juror perceptions of the infallibility of forensic identification evidence: The utility of educational and perspective-taking debiasing methods (Unpublished doctoral dissertation). Claremont Graduate University, Claremont, CA.

Chevalier, C. S., Boccaccini, M. T., Murrie, D. C., & Varela, J. G. (2015). Static-99R reporting practices in sexually violent predator cases: Does norm selection reflect adversarial allegiance? Law and Human Behavior, 39, 209–218. http://dx.doi.org/10.1037/lhb0000114

DeMatteo, D., Marczyk, G., Krauss, D. A., & Burl, J. (2009). Educational and training models in forensic psychology. Training and Education in Professional Psychology, 3, 184–191. http://dx.doi.org/10.1037/a0014582

Dror, I. E. (2013). Practical solutions to cognitive and human factor challenges in forensic science. Forensic Science Policy & Management, 4, 105–113. http://dx.doi.org/10.1080/19409044.2014.901437

Dror, I. E., & Hampikian, G. (2011). Subjectivity and bias in forensic DNA mixture interpretation. Science & Justice, 51, 204–208. http://dx.doi.org/10.1016/j.scijus.2011.08.004

Dror, I., & Rosenthal, R. (2008). Meta-analytically quantifying the reliability and biasability of forensic experts. Journal of Forensic Sciences, 53, 900–903. http://dx.doi.org/10.1111/j.1556-4029.2008.00762.x

Edmund, G. (2009). Merton and the hot tub: Scientific conventions and expert evidence in Australian civil procedure. Law and Contemporary Problems, 72, 159–189. http://www.jstor.org/stable/40647170

Epperson, D. L., Kaul, J. D., Goldman, R., Hout, S. J., Hesselton, D., & Alexander, W. (1998). Minnesota Sex Offender Screening Tool—Revised (MnSOST-R). St. Paul, MN: Minnesota Department of Corrections.

Fazel, S., Singh, J. P., Doll, H., & Grann, M. (2012). Use of risk assessment instruments to predict violence and antisocial behaviour in 73 samples involving 24,827 people: Systematic review and meta-analysis. British Medical Journal, 345, e4692. http://dx.doi.org/10.1136/bmj.e4692

Fuger, K. D., Acklin, M. W., Nguyen, A. H., Ignacio, L. A., & Gowensmith, W. N. (2014). Quality of criminal responsibility reports submitted to the Hawaii judiciary. International Journal of Law and Psychiatry, 37, 272–280. http://dx.doi.org/10.1016/j.ijlp.2013.11.020

Gigerenzer, G., & Goldstein, D. G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103, 650–669. http://dx.doi.org/10.1037/0033-295X.103.4.650

Gowensmith, W. N., Pinals, D. A., & Karas, A. C. (2015). States’ standards for training and certifying evaluators of competency to stand trial. Journal of Forensic Psychology Practice, 15, 295–317. http://dx.doi.org/10.1080/15228932.2015.1046798

Gowensmith, W. N., Sledd, M., & Sessarego, S. (2014). The impact of stringent certification standards on forensic evaluator reliability. Paper presented at the annual meeting of the American Psychological Association, Washington, DC.

Grisso, T. (1987). The economic and scientific future of forensic psychological assessment. American Psychologist, 42, 831–839. http://dx.doi.org/10.1037/0003-066X.42.9.831

Guarnera, L. A., & Murrie, D. C. (in press). Field reliability of competency and sanity opinions: A systematic review and meta-analysis. Psychological Assessment.

Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters. Gaithersburg, MD: Advanced Analytics.

Hare, R. D. (2003). The Hare Psychopathy Checklist–Revised (2nd ed.). Toronto, Ontario, Canada: Multi-Health Systems.

Hawaii Revised Statutes, Vol. 14, 704-404 (2003).

Heilbrun, K., & Brooks, S. (2010). Forensic psychology and forensic science: A proposed agenda for the next decade. Psychology, Public Policy, and Law, 16, 219–253. http://dx.doi.org/10.1037/a0019138

Helmus, L., Thornton, D., Hanson, R. K., & Babchishin, K. M. (2012). Improving the predictive accuracy of Static-99 and Static-2002 with older sex offenders: Revised age weights. Sexual Abuse, 24, 64–101.

Homant, R. J., & Kennedy, D. B. (1987). Subjective factors in clinicians’ judgments of insanity: Comparison of a hypothetical case and an actual case. Professional Psychology: Research and Practice, 18, 439–446. http://dx.doi.org/10.1037/0735-7028.18.5.439

Krafka, C., Dunn, M. A., Johnson, M. T., Cecil, J. S., & Miletich, D. (2002). Judge and attorney experiences, practices, and concerns regarding expert testimony in federal civil trials. Psychology, Public Policy, and Law, 8, 309–332. http://dx.doi.org/10.1037/1076-8971.8.3.309

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174. http://dx.doi.org/10.2307/2529310

Larrick, R. P. (2004). Debiasing. In D. J. Koehler & N. Harvey (Eds.), Blackwell handbook of judgment and decision making (pp. 316–338). Oxford, UK: Blackwell. http://dx.doi.org/10.1002/9780470752937.ch16

Lerner, J. S., & Tetlock, P. E. (1999). Accounting for the effects of accountability. Psychological Bulletin, 125, 255–275. http://dx.doi.org/10.1037/0033-2909.125.2.255

Meyer, G. J., Mihura, J. L., & Smith, B. L. (2005). The interclinician reliability of Rorschach interpretation in four data sets. Journal of Personality Assessment, 84, 296–314. http://dx.doi.org/10.1207/s15327752jpa8403_09

Miller, A. K., Rufino, K. A., Boccaccini, M. T., Jackson, R. L., & Murrie, D. C. (2011). On individual differences in person perception: Raters’ personality traits relate to their Psychopathy Checklist—Revised scoring tendencies. Assessment, 18, 253–260. http://dx.doi.org/10.1177/1073191111402460

Miller, C. S., Kimonis, E. R., Otto, R. K., Kline, S. M., & Wasserman, A. L. (2012). Reliability of risk assessment measures used in sexually violent predator proceedings. Psychological Assessment, 24, 944–953. http://dx.doi.org/10.1037/a0028411

Mossman, D. (2013). When forensic examiners disagree: Bias, or just inaccuracy? Psychology, Public Policy, and Law, 19, 40–55. http://dx.doi.org/10.1037/a0029242

Murrie, D. C., & Boccaccini, M. T. (2015). Adversarial allegiance among forensic experts. Annual Review of Law and Social Science, 11, 37–55. http://dx.doi.org/10.1146/annurev-lawsocsci-120814-121714

Murrie, D. C., Boccaccini, M. T., Guarnera, L. A., & Rufino, K. A. (2013). Are forensic experts biased by the side that retained them? Psychological Science, 24, 1889–1897. http://dx.doi.org/10.1177/0956797613481812

Murrie, D. C., Boccaccini, M. T., Turner, D. B., Meeks, M., Woods, C., & Tussey, C. (2009). Rater (dis)agreement on risk assessment measures in sexually violent predator proceedings: Evidence of adversarial allegiance in forensic evaluation? Psychology, Public Policy, and Law, 15, 19–53. http://dx.doi.org/10.1037/a0014897

Murrie, D. C., Boccaccini, M. T., Zapf, P. A., Warren, J. I., & Henderson, C. E. (2008). Clinician variation in findings of competence to stand trial. Psychology, Public Policy, and Law, 14, 177–193. http://dx.doi.org/10.1037/a0013578

Murrie, D. C., & Warren, J. I. (2005). Clinician variation in rates of legal sanity opinions: Implications for self-monitoring. Professional Psychology: Research and Practice, 36, 519–524. http://dx.doi.org/10.1037/0735-7028.36.5.519

Mussweiler, T., Strack, F., & Pfeiffer, T. (2000). Overcoming the inevitable anchoring effect: Considering the opposite compensates for selective accessibility. Personality and Social Psychology Bulletin, 26, 1142–1150. http://dx.doi.org/10.1177/01461672002611010

National Research Council, Committee on Identifying the Needs of the Forensic Science Community. (2009). Strengthening forensic science in the United States: A path forward. Washington, DC: National Academies Press. Retrieved from https://www.ncjrs.gov/pdffiles1/nij/grants/228091.pdf

Neal, T. M. (2016). Are forensic experts already biased before adversarial legal parties hire them? PLoS ONE, 11, e0154434. http://dx.doi.org/10.1371/journal.pone.0154434

Neal, T., & Brodsky, S. L. (2016). Forensic psychologists’ perceptions of bias and potential correction strategies in forensic mental health evaluations. Psychology, Public Policy, and Law, 22, 58–76. http://dx.doi.org/10.1037/law0000077

Neal, T., & Grisso, T. (2014). Assessment practices and expert judgment methods in forensic psychology and psychiatry: An international snapshot.

Pronin, E., Lin, D. Y., & Ross, L. (2002). The bias blind spot: Perception of bias in self versus others. Personality and Social Psychology Bulletin, 28, 369–381. http://dx.doi.org/10.1177/0146167202286008

Robertson, C. T., & Kesselheim, A. S. (2016). Blinding as a solution to bias: Strengthening biomedical science, forensic science, and law. San Diego, CA: Elsevier.

Robertson, C. T., & Yokum, D. V. (2012). The effect of blinded experts on juror verdicts. Journal of Empirical Legal Studies, 9, 765–794. http://dx.doi