
Chapter 7 Psychological Assessment Reference: Cohen; Psychological Testing and Assessment

Chapter 7: Utility
• in everyday language, we use the term utility to refer to the usefulness of some thing or some process
- in psychometrics, utility (test utility) refers to how useful a test is
• some frequently raised utility-related questions:
- how useful is this test in terms of cost?
- how useful is this test in terms of time?
- what is the comparative utility of this test?
‣ comparative utility: how useful this test is as compared to another test
- what is the clinical utility of this test?
‣ clinical utility: how useful it is for diagnostic assessment or treatment purposes
- what is the diagnostic utility of this neurological test?
‣ diagnostic utility: how useful it is for classification purposes
- is this personnel test used for promoting middle-management employees more useful than using no test at all?
- how useful is the training program in place for new recruits?
- should this new intervention be used in place of an existing intervention?

7.1 What Is Utility?
• utility: the usefulness or practical value of testing to improve efficiency and/or to aid in decision making | the usefulness or practical value of a training program or intervention
- in this definition, "testing" refers to anything from a single test to a large-scale testing program that employs a battery of tests
- for simplicity, we refer to the utility of one individual test
- judgments concerning the utility of a test are made on the basis of test reliability and validity data as well as on other data

Factors That Affect a Test's Utility
• a number of considerations are involved in making a judgment about the utility of a test

A. Psychometric Soundness
• psychometric soundness: the reliability and validity of a test
- a test is psychometrically sound for a particular purpose if reliability and validity coefficients are acceptably high
- how can an index of utility be distinguished from an index of reliability or validity?
‣ answer: an index of reliability can tell us how consistently a test measures what it measures, and an index of validity can tell us whether a test measures what it purports to measure
‣ an index of utility can tell us the practical value of the information derived from test scores
• in previous chapters, it was noted that reliability sets a ceiling on validity
- it may be tempting to conclude that a comparable relationship exists between validity and utility, and that "validity sets a ceiling on utility"
- in many instances, such a conclusion would be defensible
- after all, a test must be valid to be useful
• unfortunately, few things about utility theory and its application are simple and uncomplicated
- the higher the criterion-related validity of test scores for making a particular decision, the higher the utility of the test is likely to be
- however, there are exceptions to this general rule
- many factors may enter into an estimate of a test's utility, and there are great variations in the ways in which the utility of a test is determined
‣ e.g. in a study of the utility of a test used for personnel selection, the selection ratio may be very high
‣ if the selection ratio is very high, most people who apply for the job are being hired
‣ under such circumstances, the validity of the test may have little to do with the test's utility
• the other side of the coin — would it be accurate to conclude that "a valid test is a useful test"?
- it is not the case that "a valid test is a useful test"
- people refer to a test as "valid" if scores on the test have been shown to be good indicators of how the person will score on the criterion
• e.g. one way of monitoring the drug use of cocaine users being treated on an outpatient basis is through regular urine tests
- as an alternative, researchers developed a patch which could detect cocaine use through sweat
- in a study designed to explore the utility of the sweat patch with 63 opiate-dependent volunteers who were seeking treatment, investigators found a 92% level of agreement between a positive urine test for cocaine and a positive test on the sweat patch for cocaine
- these results would seem to be encouraging for the developers of the patch
- however, this high rate of agreement occurred only when the patch had been untampered with and properly applied — which wasn't that often
- overall, the researchers felt compelled to conclude that the sweat patch had limited utility as a means of monitoring drug use in outpatient treatment facilities
- even though a test may be psychometrically sound, it may have little utility — particularly if the targeted testtakers demonstrate a tendency to "tamper with," or otherwise fail to scrupulously follow, the test's directions

B. Costs
• factors variously referred to as economic, financial, or budget-related in nature must certainly be taken into account
- in fact, one of the most basic elements in any utility analysis is the financial cost of the selection device under study
- cost: disadvantages, losses, or expenses in both economic and noneconomic terms
• used with respect to test utility decisions, the term costs can be interpreted in the traditional, economic sense — expenditures associated with testing or not testing
- if testing is to be conducted, then it may be necessary to allocate funds to purchase
1. a particular test
2. a supply of blank test protocols
3. computerized test processing, scoring, and interpretation
- costs of testing may also come in the form of
1. payment to professional personnel and staff associated with test administration, scoring, and interpretation
2. facility rental, mortgage, and/or other charges related to the usage of the test facility
3. insurance, legal, accounting, licensing, and other routine costs of doing business
- in some settings, these costs may be offset by revenue, such as fees paid by testtakers
‣ e.g. private clinics
- in others, these costs will be paid from the test user's funds, which may in turn derive from sources such as private donations or government grants
‣ e.g. research organizations
• the economic costs listed here are the easy ones to calculate
- not so easy to calculate are other economic costs, particularly those associated with not testing or testing with an instrument that turns out to be ineffective
‣ e.g. what if a commercial airline converted its current hiring and training program to a much less expensive program with much less rigorous (and perhaps ineffective) testing for all personnel?
‣ what economic (and noneconomic) consequences do you envision might result from such action?
‣ would cost-cutting actions such as those described previously be prudent from a business perspective?
• the resulting cost savings from elimination of such assessment programs would pale in comparison to the probable losses in customer revenue once word got out about the airline's strategy for cost cutting
- additionally, revenue losses would be irrevocably compounded by any safety-related incidents (with their attendant lawsuits) that occurred as a consequence of such imprudent cost cutting
• mention of the variable of "loss of confidence" brings us to another meaning of "costs" in terms of utility analyses — costs in terms of loss
- noneconomic costs of drastic cost cutting by the airline might come in the form of harm to airline passengers and crew as a result of incompetent pilots and crews

- although people and insurance companies do place dollar amounts on the loss of life and limb, for our purposes we can still categorize such tragic losses as noneconomic in nature
• other noneconomic costs of testing can be far more subtle
- e.g. consider a published study that examined the utility of taking four X-ray pictures as compared to two X-ray pictures in routine screening for fractured ribs among potential child abuse victims
‣ a four-view series of X-rays differed significantly from the more traditional two-view series in terms of the number of fractures identified
‣ these authors found diagnostic utility in adding two X-ray views to the more traditional protocol
‣ the financial cost of using the two additional X-rays was seen as worth it, given the consequences and potential costs of failing to diagnose the injuries
- here, the noneconomic cost concerns the risk of letting a potential child abuser continue to abuse a child without detection

C. Benefits
• when evaluating the utility of a particular test, an evaluation is made of the costs incurred by testing as compared to the benefits accrued from testing
- benefit: profits, gains, or advantages in both economic and noneconomic terms
• from an economic perspective, the cost of administering tests can be minuscule when compared to the financial returns a successful testing program can yield
- e.g. if a new personnel testing program results in the selection of employees who produce significantly more than other employees, then the program will have been responsible for greater productivity on the part of the new employees
‣ greater productivity may lead to greater overall company profits
• there are also many potential noneconomic benefits
- in industrial settings, a partial list of such noneconomic benefits (many with economic benefits as well) includes:
‣ increase in the quality of workers' performance
‣ increase in the quantity of workers' performance
‣ decrease in the time needed to train workers
‣ reduction in the number of accidents
‣ reduction in worker turnover
• the cost of administering tests can be well worth it if the result is certain noneconomic benefits
- e.g. consider the admissions program in place at most universities
‣ educational institutions that pride themselves on their graduates are often on the lookout for ways to improve the way that they select applicants for their programs; it is to the credit of a university that its graduates succeed at their chosen careers
‣ a large portion of happy, successful graduates enhances the university's reputation and sends the message that the university is doing something right
‣ related benefits to a university that has students successfully going through its programs include high morale, a good learning environment, and reduced load on counselors and on disciplinary boards
‣ a good work environment and a good learning environment are not necessarily things that money can buy
‣ such outcomes can result from a well-administered admissions program that consistently selects qualified students who will keep up with the work and "fit in" to the environment of a particular university
• one of the noneconomic benefits of a diagnostic test used to make decisions about involuntary hospitalization of psychiatric patients is a benefit to society at large
- persons are frequently confined for psychiatric reasons if they are harmful to themselves or others
- the more useful tools of assessment are, the safer society will be from individuals intent on inflicting harm
- clearly, the potential noneconomic benefit derived from the use of such diagnostic tools is great
- it is also true, however, that the potential economic costs are great when errors are made
- errors in clinical determination in cases of involuntary hospitalization may cause people who are not threats to be denied their freedom
- stakes involving the utility of tests can indeed be quite high
• how do professionals in the field of testing and assessment balance variables such as psychometric soundness, benefits, and costs?
- how do they decide that the benefits outweigh the costs and that a test or intervention indeed has utility?
- other, less definable elements — such as prudence, vision, and, for lack of a better term, common sense — must be ever-present in the process
- a psychometrically sound test of practical value is worth paying for, even when the dollar cost is high, if the potential benefits of its use are also high or if the potential costs of not using it are high

7.2 Utility Analysis

What Is a Utility Analysis?
• utility analysis: a family of techniques that entail a cost–benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment | an umbrella term covering various possible methods, each requiring various kinds of data to be inputted and yielding various kinds of output
- a utility analysis is not one specific technique used for one specific objective
- some are quite sophisticated, employing high-level mathematical models for weighting the different variables under consideration
- others are more straightforward and can be readily understood in terms of answers to relatively uncomplicated questions
‣ e.g. "Which test gives us more bang for the buck?"
• a utility analysis may be undertaken for the purpose of evaluating whether the benefits of using a test outweigh the costs
- the utility analysis will help make decisions regarding whether:
‣ one test is preferable to another test for a specific purpose
‣ one tool of assessment is preferable to another tool of assessment for a specific purpose (e.g. a test vs. a behavioral observation)
‣ the addition of one or more tests (or other tools of assessment) to one or more tests (or other tools of assessment) that are already in use is preferable for a specific purpose
‣ no testing or assessment is preferable to any testing or assessment
• the endpoint of a utility analysis is typically an educated decision about which of several possible courses of action is optimal
- e.g. the use of a particular approach to assessment in selecting managers could save a telephone company more than $13 million over four years
• a solid foundation in the language of this endeavor is essential

How Is a Utility Analysis Conducted?
• the specific objective of a utility analysis will dictate what sort of information will be required as well as the specific methods to be used
- there are two general approaches to utility analysis

A. Expectancy Data
• some utility analyses will require little more than converting a scatterplot of test data to an expectancy table
- expectancy table: can provide an indication of the likelihood that a testtaker will score within some interval of scores on a criterion measure — an interval that may be categorized as "passing," "acceptable," or "failing"
- an expectancy table can provide vital information to decision-makers
‣ e.g. with regard to the utility of a new and experimental personnel test in a corporate setting, an expectancy table might indicate that the higher a worker's score is on this new test, the greater the probability that the worker will be judged successful
‣ by instituting this new test on a permanent basis, the company could reasonably expect to improve its productivity
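The bookkeeping behind an expectancy table is simple enough to sketch. The following is a minimal illustration, not from the text: the score bands, criterion categories, and paired records are all made up, and the table is just the proportion of each outcome within each test-score interval.

```python
# Minimal sketch: build an expectancy table from hypothetical (test score, criterion outcome) pairs.
from collections import defaultdict

# made-up data: (test score, criterion category assigned later)
records = [(52, "failing"), (58, "acceptable"), (61, "acceptable"), (67, "passing"),
           (71, "passing"), (74, "acceptable"), (83, "passing"), (88, "passing")]

def score_band(score):
    """Assign a test score to an arbitrary 20-point interval, e.g. '60-79'."""
    low = (score // 20) * 20
    return f"{low}-{low + 19}"

counts = defaultdict(lambda: defaultdict(int))
for score, outcome in records:
    counts[score_band(score)][outcome] += 1

# expectancy table: proportion of each criterion outcome within each score band
for band in sorted(counts, key=lambda b: int(b.split("-")[0])):
    total = sum(counts[band].values())
    row = {outcome: round(n / total, 2) for outcome, n in counts[band].items()}
    print(band, row)
```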
• return on investment: the ratio of benefits to costs
• tables that could be used as an aid for personnel directors in their decision-making chores were published by Taylor and Russell
- Taylor-Russell tables: provide an estimate of the extent to which inclusion of a particular test in the selection system will improve selection
- specifically, the tables provide an estimate of the percentage of employees hired by the use of a particular test who will be successful at their jobs, given different combinations of three variables: the test's validity, the selection ratio used, and the base rate
• the value assigned for the test's validity is the computed validity coefficient
- selection ratio: a numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired
‣ e.g. if there are 50 positions and 100 applicants, then the selection ratio is 50/100, or .50
- base rate: the percentage of people hired under the existing system for a particular position
‣ e.g. if a firm employs 25 computer programmers and 20 are considered successful, the base rate would be .80
- with knowledge of the validity coefficient of a particular test along with the selection ratio, reference to the Taylor-Russell tables provides the personnel officer with an estimate of how much using the test would improve selection over existing methods
• a sample Taylor-Russell table is presented in Table 7–1
- this table is for a base rate of .60, meaning that 60% of those hired under the existing system are successful in their work
- down the left are validity coefficients for a test that could be used to help select employees
- across the top are selection ratios, which reflect the proportion of the people applying for the jobs who will be hired
- if a test is introduced to help select employees in a situation with a selection ratio of .20 and if the new test has a predictive validity coefficient of .55, then the table shows that the base rate will increase to .88
- rather than 60% of the hired employees being expected to perform successfully, a full 88% can be expected to do so
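The published Taylor-Russell tables are derived from a bivariate-normal model of the predictor-criterion relationship. As a rough, illustrative check on an entry such as the .88 above (this sketch is not the tables themselves, and it assumes standardized, bivariate-normal predictor and criterion scores):

```python
# Sketch: approximate a Taylor-Russell entry from the bivariate-normal model.
from scipy.stats import norm, multivariate_normal

def taylor_russell_estimate(validity, selection_ratio, base_rate):
    """Approximate proportion of selected applicants expected to be successful."""
    x_cut = norm.ppf(1 - selection_ratio)   # predictor cut: top `selection_ratio` are hired
    y_cut = norm.ppf(1 - base_rate)         # criterion cut: top `base_rate` succeed now
    joint = multivariate_normal(mean=[0.0, 0.0],
                                cov=[[1.0, validity], [validity, 1.0]])
    # P(X >= x_cut and Y >= y_cut) by inclusion-exclusion on the joint CDF
    p_selected_and_successful = (1 - norm.cdf(x_cut) - norm.cdf(y_cut)
                                 + joint.cdf([x_cut, y_cut]))
    return p_selected_and_successful / selection_ratio   # condition on being selected

# the worked example above: validity .55, selection ratio .20, base rate .60
print(round(taylor_russell_estimate(0.55, 0.20, 0.60), 2))   # close to the tabled .88
```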
• one limitation of the Taylor-Russell tables is that the relationship between the predictor (the test) and the criterion (rating of performance on the job) must be linear
- e.g. if there is some point at which job performance levels off no matter how high the score on the test, use of the Taylor-Russell tables would be inappropriate
- another limitation of the Taylor-Russell tables is the potential difficulty of identifying a criterion score that separates "successful" from "unsuccessful" employees
• the potential problems of the Taylor-Russell tables were avoided by the Naylor-Shine tables
- Naylor-Shine tables: a set of tables that provide an indication of the difference in average criterion scores for the selected group as compared with the original group | entails obtaining the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures
• both tables can assist in judging the utility of a particular test, the former by determining the increase over current procedures and the latter by determining the increase in average score on some criterion measure
- with both, the validity coefficient used must be one obtained by concurrent validation procedures — obtained with respect to current employees hired by the selection process in use at the time of the study
• the fact is that many other kinds of variables might enter into hiring and other sorts of personnel selection decisions (including decisions relating to promotion and firing)
- additional variables might include applicants' minority status, physical or mental health, or drug use
- given that many variables may affect a personnel selection decision, of what use is a given test in the decision process?
• expectancy data provided by the Taylor-Russell or Naylor-Shine tables could be used to shed light on many utility-related decisions, particularly those confined to questions concerning the validity of an employment test and the selection ratio employed
- in many instances, however, the purpose of a utility analysis is to answer a question related to costs and benefits in terms of dollars and cents
- the answer may be found using the Brogden-Cronbach-Gleser formula

B. The Brogden-Cronbach-Gleser Formula
• the work of Brogden, Cronbach, and Gleser has been immortalized in the Brogden-Cronbach-Gleser (BCG) formula
- Brogden-Cronbach-Gleser formula: used to calculate the dollar amount of a utility gain resulting from the use of a particular selection instrument under specified conditions
- utility gain: an estimate of the benefit (monetary or otherwise) of using a particular test or selection method
- utility gain = (N)(T)(rxy)(SDy)(Zm) − (N)(C)
- the first part of the formula represents the benefits
‣ N: the number of applicants selected per year
‣ T: the average length of time in the position (tenure)
‣ rxy: the (criterion-related) validity coefficient for the given predictor and criterion
‣ SDy: the standard deviation of performance (in dollars) of employees
‣ Zm: the mean (standardized) score on the test for selected applicants
- the second part of the formula represents the cost
‣ N: the number of applicants
‣ C: the cost of the test for each applicant
• a difficulty in using this formula is estimating the value of SDy, a value that is, quite literally, estimated
- one recommended way to estimate SDy is by setting it equal to 40% of the mean salary for the job
• suppose 60 Federale Express (FE) drivers are selected per year and that each driver stays with FE for one and a half years
- further suppose that the standard deviation of performance of the drivers is about $9,000 (40% of annual salary), that the criterion-related validity of FERT (Federale Express Road Test) scores is .40, and that the mean standardized FERT score for selected applicants is +1.0
- benefits: 60 × 1.5 × .40 × $9,000 × 1.0 = $324,000
- when the costs of testing ($24,000) are subtracted from the financial benefits of testing ($324,000), it can be seen that the utility gain amounts to $300,000
• would it be wise for a company to make an investment of $24,000 to receive a return of about $300,000?
- most people (and corporations) would be more than willing if they knew that the return would be more than $12.50 for each dollar invested
- clearly, with such a return on investment, using the FERT does provide a cost-effective method of selecting delivery drivers
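The arithmetic of the worked example is easy to wrap in a small helper. This is only a sketch of that arithmetic; the figures are the ones given above, and the total testing cost is passed in directly rather than as N × C because the example reports only the $24,000 total. Substituting SDp (the standard deviation of productivity) for SDy gives the productivity-gain variant described next.

```python
# Minimal sketch of the Brogden-Cronbach-Gleser utility gain:
# utility gain = (N)(T)(rxy)(SDy)(Zm) - (N)(C)
def bcg_utility_gain(n_selected, tenure_years, validity, sd_y_dollars,
                     mean_std_test_score, total_testing_cost):
    """Dollar gain from using the selection test under the stated assumptions."""
    benefits = n_selected * tenure_years * validity * sd_y_dollars * mean_std_test_score
    return benefits - total_testing_cost   # total_testing_cost corresponds to (N)(C)

# Federale Express example from the text: 60 drivers/year, 1.5-year tenure,
# validity .40, SDy = $9,000, mean standardized FERT score +1.0, $24,000 testing cost.
gain = bcg_utility_gain(60, 1.5, 0.40, 9_000, 1.0, 24_000)
print(gain)                      # 300000.0
print(round(gain / 24_000, 2))   # a return of about 12.5 dollars per dollar invested
```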
• a modification of the BCG formula exists for researchers who prefer their findings in terms of productivity gains rather than financial ones
- productivity gain: an estimated increase in work output
- productivity gain = (N)(T)(rxy)(SDp)(Zm) − (N)(C)
- in this modification of the formula, the value of the standard deviation of productivity, SDp, is substituted for the value of the standard deviation of performance in dollars, SDy
- the result is a formula that helps estimate the percentage increase in output expected through the use of a particular test
• throughout this text, we have sought to illustrate psychometric principles with reference to contemporary, practical illustrations from everyday life
- e.g. in recent years, there have increasingly been calls for police to wear body cameras as a means to reduce inappropriate use of force against citizens
- in response to such demands, important questions regarding the utility of such systems have been raised — that is, will they really make a difference in the behavior of police personnel?

C. Decision Theory and Test Utility
• Cronbach and Gleser's Psychological Tests and Personnel Decisions: the most oft-cited application of statistical decision theory to the field of psychological testing
- the idea of applying statistical decision theory to questions of test utility was conceptually appealing and promising, and an authoritative textbook of the day reflects the great enthusiasm with which this marriage of enterprises was greeted:
‣ the basic decision-theory approach to selection and placement has a number of advantages over the more classical approach based upon the correlation model
- generally, Cronbach and Gleser presented:
1. a classification of decision problems
2. various selection strategies ranging from single-stage processes to sequential analyses
3. a quantitative analysis of the relationship between test utility, the selection ratio, cost of the testing program, and expected value of the outcome
4. adaptive treatment: a recommendation that in some instances job requirements be tailored to the applicant's ability instead of the other way around
• let's illustrate decision theory
- recall the definition of five terms that you learned in the previous chapter: base rate, hit rate, miss rate, false positive, and false negative
- imagine that you developed a procedure called the Vapor Test (VT), which was designed to determine if alive subjects are indeed breathing
- the procedure for the VT entails having the examiner hold a mirror under the subject's nose for a minute and observing whether the subject's breath fogs the mirror
- 100 introductory psychology students are administered the VT, and it is concluded that 89 were, in fact, breathing (whereas 11 are deemed, on the basis of the VT, not to be breathing)
- is the VT a good test? no.
- because the base rate is 100% of the population, we really don't even need a test to measure the characteristic breathing
- if we did need such a measurement procedure, we wouldn't use one that was inaccurate in approximately 11% of the cases
- a test is obviously of no value if the hit rate is higher without using it
- one measure of the value of a test lies in the extent to which its use improves on the hit rate that exists without its use
• suppose a test is administered to a group of 100 job applicants and that some cutoff score is applied to distinguish applicants who will be hired (applicants judged to have passed the test) from applicants whose employment application will be rejected (applicants judged to have failed the test)
- further suppose that some criterion measure will be applied some time later to ascertain whether the newly hired person was considered a success or a failure
- if the test is a perfect predictor (if its validity coefficient is equal to 1), then two distinct types of outcomes can be identified:
1. applicants will score at or above the cutoff score on the test and be successful at the job
2. applicants will score below the cutoff score and would not have been successful at the job
• in reality, few, if any, employment tests are perfect predictors
- consequently, two additional types of outcomes are possible:
3. some applicants will score at or above the cutoff score, be hired, and fail at the job
4. some applicants who scored below the cutoff score and were not hired could have been successful at the job
- people in the third category could be categorized as false positives
- those in the fourth category could be categorized as false negatives
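These four outcomes are easy to tabulate once a cut score is chosen. The sketch below is purely illustrative: the applicant records and the cut score are made up, and success on the criterion is treated as known for everyone, which a real validation study could only approximate.

```python
# Sketch: classify selection outcomes as hits, false positives, and false negatives
# for a given cut score. Each applicant is a hypothetical (test score, succeeded) pair.
applicants = [(92, True), (85, True), (78, False), (74, True), (66, False), (58, False)]

def outcome_counts(applicants, cut_score):
    counts = {"true_positive": 0, "false_positive": 0, "false_negative": 0, "true_negative": 0}
    for score, success in applicants:
        predicted_pass = score >= cut_score
        if predicted_pass and success:
            counts["true_positive"] += 1     # hired and succeeded (a hit)
        elif predicted_pass and not success:
            counts["false_positive"] += 1    # hired but failed at the job
        elif not predicted_pass and success:
            counts["false_negative"] += 1    # rejected but would have succeeded
        else:
            counts["true_negative"] += 1     # rejected and would have failed (a hit)
    return counts

print(outcome_counts(applicants, cut_score=75))
```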
• in this illustration, logic tells us that if the selection ratio is, say, 90%, then the cutoff score will probably be set lower than if the selection ratio is 5%
- if the selection ratio is 90%, then it is a good bet that the number of false positives will be greater than if the selection ratio is 5%
- conversely, if the selection ratio is only 5%, it is a good bet that the number of false negatives will be greater than if the selection ratio is 90%
• decision theory provides guidelines for setting optimal cutoff scores
- in setting such scores, the relative seriousness of making false-positive or false-negative selection decisions is frequently taken into account
‣ e.g. it is prudent for an airline personnel office to set cutoff scores on tests for pilots that might result in a false negative (a qualified pilot being rejected) as opposed to a cutoff score that would allow a false positive (an unqualified pilot being hired)
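One way to make "relative seriousness" concrete is to assign each error type a cost and compare candidate cut scores by their total weighted cost. The weights and error counts below are hypothetical, chosen only to mirror the airline example.

```python
# Sketch (all numbers hypothetical): compare two candidate cut scores by the
# cost-weighted errors they would produce, treating a false positive
# (unqualified pilot hired) as far more serious than a false negative.
COST_FALSE_POSITIVE = 100.0
COST_FALSE_NEGATIVE = 1.0

def expected_cost(n_false_positives, n_false_negatives):
    return (n_false_positives * COST_FALSE_POSITIVE
            + n_false_negatives * COST_FALSE_NEGATIVE)

print(expected_cost(3, 1))  # lenient cut score: 3 FP, 1 FN -> 301.0
print(expected_cost(0, 6))  # strict cut score:  0 FP, 6 FN -> 6.0 (preferred here)
```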
• principles of decision theory applied to problems of test utility have led to some enlightening and impressive findings
- Schmidt and his colleagues demonstrated in dollars and cents how the utility of a company's selection program (and the validity coefficient of the tests used in that program) can play a critical role in the profitability of the company
- they asked supervisors to rate (in terms of dollars) the value of good, average, and poor programmers
- this information was used in conjunction with other information, including these facts:
1. each year the employer hired 600 new programmers
2. the average programmer remained on the job for about 10 years
3. the Programmer Aptitude Test currently in use as part of the hiring process had a validity coefficient of .76
4. it cost about $10 per applicant to administer the test
5. the company currently employed more than 4,000 programmers
• Schmidt and his colleagues made a number of calculations using different values for some of the variables
- e.g. knowing that some of the tests previously used in the hiring process had validity coefficients ranging from .00 to .50, they varied the value of the test's validity coefficient (along with other factors such as different selection ratios) and examined the relative efficiency of the various conditions
‣ among their findings was that the existing selection ratio and selection process provided a great gain in efficiency over a previous situation (where the gain was equal to almost $6 million per year)
‣ the existing selection ratio and selection process provided an even greater gain in efficiency over another previously existing situation; here the gain in efficiency in one year was estimated to be equal to over $97 million
• the employer in the previous study was the U.S. government
- Hunter and Schmidt (1981) applied the same type of analysis to the national workforce and made a compelling argument with respect to the critical relationship between valid tests and measurement procedures and our national productivity
• employers are reluctant to use decision-theory-based strategies in their hiring practices because of the complexity of their application and the threat of legal challenges
- although decision theory approaches to assessment hold great promise, this promise has yet to be fulfilled

Some Practical Considerations
• a number of practical matters must be considered when conducting utility analyses
- e.g. issues related to existing base rates can affect the accuracy of decisions made on the basis of tests
‣ attention must be paid to this factor when the base rates are extremely low or high because such a situation may render the test useless as a tool of selection
- focusing on the area of personnel selection, there are some other practical matters to keep in mind

A. The Pool of Job Applicants
• there exists, "out there," what seems to be a limitless supply of potential employees just waiting to be evaluated and possibly selected for employment
- e.g. utility estimates such as those derived by Schmidt and his colleagues are based on the assumption that there will be a ready supply of viable applicants from which to choose and fill positions
- perhaps for some types of jobs, that is the case
- there are certain jobs, however, that require such unique skills or demand such great sacrifice that there are relatively few people who would even apply, let alone be selected
- the pool of possible job applicants for a particular type of position may also vary with the economic climate
- it may be that in periods of high unemployment there are significantly more people in the pool of possible job applicants than in periods of high employment
• related to issues concerning the available pool of job applicants is the issue of how many people would actually accept the employment position offered to them even if they were found to be qualified candidates
- utility models are constructed on the assumption that all of the people selected by a personnel test accept the position they are offered
- in fact, many of the top performers on the test are people who, because of their superior and desirable abilities, are also being offered positions by other potential employers
- consequently, top performers on the test are probably the least likely of all of the job applicants to actually be hired
- utility estimates thus tend to overestimate the utility of the measurement tool
- these estimates may have to be adjusted downward by as much as 80% in order to provide a more realistic estimate of the utility of a tool of assessment

B. The Complexity of the Job
• the same sorts of approaches to utility analysis are put to work for positions that vary greatly in terms of complexity
- the same sorts of data are gathered, the same sorts of analytic methods may be applied, and the same sorts of utility models may be invoked for different positions
- yet, as Hunter observed, the more complex the job, the more people differ on how well or poorly they do that job
- whether or not the same utility models apply to jobs of varied complexity, and whether or not the same utility analysis methods are equally applicable, remain matters of debate

C. The Cut Score In Use
• cut score | cutoff score: a reference point derived as a result of a judgment and used to divide a set of data into two or more classifications, with some action to be taken or some inference to be made on the basis of the classifications
- reference is frequently made to different types of cut scores
‣ e.g. a distinction can be made between a relative cut score and a fixed cut score
• relative cut score | norm-referenced cut score: a reference point that is set based on norm-related considerations rather than on the relationship of test scores to a criterion | this type of cut score is set with reference to the performance of a group (or some target segment of a group)
• envision your instructor announcing on the first day of class that, for each of the four examinations, the top 10% of all scores on each test would receive the grade of A
- the cut score in use would be relative to the scores achieved by a targeted group (in this case, the top 10% of the class)
- the score used to define who would and would not achieve the grade of A on each test could be quite different for each of the four tests, depending upon where the boundary line for the 10% cutoff fell on each test
• fixed cut score | absolute cut score: a reference point that is typically set with reference to a judgment concerning a minimum level of proficiency required to be included in a particular classification
- consider the score achieved on the road test for a driver's license
- the performance of other would-be drivers has no bearing upon whether an individual testtaker is classified as "licensed" or "not licensed"
- all that really matters here is: "Is this driver able to meet the fixed and absolute score on the road test necessary to be licensed?"
• a distinction can also be made between the terms multiple cut scores and multiple hurdles as used in decision-making processes
- multiple cut scores: the use of two or more cut scores with reference to one predictor for the purpose of categorizing testtakers
‣ e.g. your instructor may have multiple cut scores in place every time an examination is administered, and each class member will be assigned to one category (A, B, C, D, or F) on the basis of scores on that examination
‣ meeting or exceeding one cut score will result in an A for the examination, and so forth
- this is an example of multiple cut scores being used with a single predictor
• we may also speak of multiple cut scores being used in an evaluation that entails several predictors, wherein applicants must meet the requisite cut score on every predictor to be considered for the position
- a more sophisticated but cost-effective multiple cut-score method can involve several "hurdles" to overcome
• multiple hurdle: at every stage in a multistage selection process, a cut score is in place for each predictor used
- the cut score used for each predictor will be designed to ensure that each applicant possesses some minimum level of a specific attribute or skill
- multiple hurdles may be thought of as one collective element of a multistage decision-making process in which the achievement of a particular cut score on one test is necessary in order to advance to the next stage of evaluation in the selection process
‣ e.g. in applying to colleges, applicants may have to successfully meet some standard in order to move to the next stage in a series of stages
- each stage entails unique demands (and cut scores) to be successfully met, or hurdles to be overcome, if an applicant is to proceed to the next stage
• multiple-hurdle selection methods assume that an individual must possess a certain minimum amount of knowledge, skill, or ability for each attribute measured by a predictor to be successful in the desired position (a minimal sketch of such a screen follows)
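The sketch below illustrates that assumption; the predictors, cut scores, and applicant profile are all hypothetical, and failing any single hurdle ends consideration regardless of how strong the other scores are.

```python
# Sketch of a multiple-hurdle screen: a fixed cut score per predictor, applied in
# stages; failing any single hurdle removes the applicant from further consideration.
# All predictors, cut scores, and applicant scores are made-up illustrations.
HURDLES = [                      # (predictor, minimum required score), applied in order
    ("written_exam", 70),
    ("driving_test", 80),
    ("interview", 60),
]

def passes_all_hurdles(scores: dict) -> bool:
    for predictor, cut_score in HURDLES:
        if scores.get(predictor, 0) < cut_score:
            return False         # screened out at this stage
    return True                  # cleared every hurdle

applicant = {"written_exam": 85, "driving_test": 78, "interview": 90}
print(passes_all_hurdles(applicant))   # False -> screened out at the driving test
```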
• is that really the case? could it be that a very high score in one stage of a multistage evaluation "balances out" a relatively low score in another stage of the evaluation?
- compensatory model of selection: an assumption is made that high scores on one attribute can, in fact, compensate for low scores on another attribute
- a person strong in some areas and weak in others can perform as successfully in a position as a person with moderate abilities in all areas relevant to the position in question
• the compensatory model is appealing, especially when post-hire training or other opportunities are available to develop proficiencies
- consider an applicant with strong driving skills but weak customer service skills
- all it might take for this applicant to blossom into an outstanding employee is some additional education and training in customer service
• when a compensatory selection model is in place, the individual making the selection will differentially weight the predictors being used in order to arrive at a total score
- such differential weightings may reflect value judgments made on the part of the test developers regarding the relative importance of different criteria used in hiring
‣ e.g. safe driving history may be weighted higher in the selection formula than is customer service
‣ this weighting might be based on a company-wide "safety first" ethic
‣ it may also be based on a company belief that skill in driving safely is less amenable to education and training than skill in customer service
- the statistical tool that is ideally suited for making such selection decisions within the framework of a compensatory model is multiple regression (see the sketch after this list)
- other tools, as we will see in what follows, are used to set cut scores
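A compensatory composite is, mechanically, a weighted sum of predictor scores compared against a single cut score; in practice the weights would come from a multiple regression of the criterion on the predictors. Everything in the sketch below (weights, scores, and the composite cut score) is hypothetical.

```python
# Sketch of a compensatory selection composite: predictors are differentially
# weighted (the weights stand in for multiple-regression coefficients) and the
# weighted total, not any single predictor, is compared with one cut score.
WEIGHTS = {"driving_test": 0.7, "customer_service": 0.3}   # "safety first" weighting
COMPOSITE_CUT_SCORE = 75.0

def composite_score(scores: dict) -> float:
    return sum(WEIGHTS[p] * scores[p] for p in WEIGHTS)

applicant = {"driving_test": 90, "customer_service": 55}   # strong driver, weak on service
total = composite_score(applicant)
print(total, total >= COMPOSITE_CUT_SCORE)   # 79.5 True -> high driving score compensates
```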

7.3 Methods for Setting Cut Scores
• if you have ever had the experience of earning a grade of B when you came oh-so-close to the cut score needed for a grade of A, then you have no doubt spent some time pondering the way that cut scores are determined
- educators, researchers, and others with diverse backgrounds have spent countless hours questioning, debating, and — judging from the nature of the heated debates in the literature — agonizing about various aspects of cut scores
- cut scores applied to a wide array of tests may be used to make various "high-stakes" decisions, a partial listing of which includes:
‣ who gets into what college or graduate school
‣ who is certified to practice a particular occupation
‣ who is accepted for employment or promoted
‣ who is legally able to drive
‣ who is legally competent to stand trial
‣ who is considered to be legally intoxicated
‣ who is not guilty by reason of insanity
• journal articles, books, and other scholarly publications wrestle with issues regarding the optimal method of "making the cut" with cut scores
- become acquainted with various methods in use today for setting fixed and relative cut scores
- although no one method has won universal acceptance, some methods are more popular than others

The Angoff Method
• Angoff method: can be applied to personnel selection tasks as well as to questions regarding the presence or absence of a particular trait, attribute, or ability
- used for purposes of personnel selection, experts in the area provide estimates regarding how likely it is that testtakers who have at least minimal competence for the position will answer test items correctly
- for purposes relating to the determination of whether or not testtakers possess a particular trait, attribute, or ability, an expert panel makes judgments concerning the way a person with that trait, attribute, or ability would respond to test items
- in both cases, the judgments of the experts are averaged to yield cut scores for the test
- persons who score at or above the cut score are considered high enough to be hired or to be sufficiently high in the trait, attribute, or ability of interest
- this simple technique has wide appeal and works well
- the Achilles heel of the Angoff method is when there is low inter-rater reliability and major disagreement regarding how certain populations of testtakers should respond to items
- it may be time for "Plan B," a strategy for setting cut scores that is driven more by data and less by subjective judgments
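The averaging step described above is small enough to sketch: each judge estimates, item by item, the probability that a minimally competent testtaker answers correctly, a judge's estimates are summed into that judge's recommended cut score, and the recommendations are averaged. The judges and probabilities below are invented for illustration.

```python
# Sketch of Angoff-style cut-score setting with made-up judge ratings.
# Each row: one judge's estimated probability that a minimally competent
# testtaker answers each of five items correctly.
judge_ratings = [
    [0.9, 0.8, 0.6, 0.7, 0.5],   # judge 1
    [0.8, 0.7, 0.5, 0.8, 0.6],   # judge 2
    [0.9, 0.9, 0.7, 0.6, 0.4],   # judge 3
]

# A judge's implied cut score is the sum of their item probabilities
# (the expected number of items a minimally competent testtaker gets right).
per_judge_cuts = [sum(ratings) for ratings in judge_ratings]
cut_score = sum(per_judge_cuts) / len(per_judge_cuts)
print([round(c, 1) for c in per_judge_cuts])   # [3.5, 3.4, 3.5]
print(round(cut_score, 2))                     # 3.47 -> rounded in practice, e.g. 3 or 4 of 5 items
```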
The Known Groups Method
• known groups method | method of contrasting groups: entails collection of data on the predictor of interest from groups known to possess, and not to possess, a trait, attribute, or ability of interest
- based on an analysis of these data, a cut score is set on the test that best discriminates the two groups' test performance
• consider a hypothetical online college called Internet Oxford University (IOU), which offers a remedial math course for students who have not been adequately prepared in high school for college-level math
- but who needs to take remedial math before taking regular math?
- senior personnel in the IOU Math Department prepare a placement test called the "Who Needs to Take Remedial Math? Test" (WNTRMT)
- the next question is, "What shall the cut score on the WNTRMT be?"
- it is answered by administering the test to a selected population and then setting a cut score based on the performance of two contrasting groups:
1. students who successfully completed college-level math
2. students who failed college-level math
• the WNTRMT is administered to all incoming freshmen — IOU collects all the test data and holds it for a semester
- it then analyzes the scores of two approximately equal-sized groups of students who took college-level math courses: a group who passed the course and a group whose final grades were a D or an F
- IOU statisticians will now use these data to choose the score that best discriminates the two groups from each other, which is the score at the point of least difference between the two groups
- the two groups are indistinguishable at a score of 6
- consequently, the cut score on the WNTRMT shall be 6
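One simple way to operationalize "the score that best discriminates the two groups" is to try each candidate cut and count how many members of the contrasting groups it would misclassify. The WNTRMT-style scores below are invented, and this tally is only a stand-in for the point-of-least-difference approach described above.

```python
# Sketch of a known-groups cut score: pick the score that best separates the two
# groups, here operationalized as the candidate cut minimizing misclassifications.
passed_math = [6, 7, 8, 8, 9, 10, 11, 12]   # invented scores of students who later passed
failed_math = [1, 2, 3, 4, 4, 5, 5, 6]      # invented scores of students who later failed

def misclassified(cut):
    # passing students sent to remedial math + failing students who would skip it
    return sum(s < cut for s in passed_math) + sum(s >= cut for s in failed_math)

candidates = range(min(failed_math), max(passed_math) + 1)
best_cut = min(candidates, key=misclassified)
print(best_cut, misclassified(best_cut))   # 6 1 -> these made-up groups separate best at 6
```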
• the main problem with using known groups is that determination of where to set the cutoff score is inherently affected by the composition of the contrasting groups
- no standard set of guidelines exists for choosing contrasting groups
- in the IOU example, the university officials could have chosen to contrast just the A students with the F students when deriving a cut score
- other types of problems in choosing scores from contrasting groups occur in other studies
‣ e.g. in setting cut scores for a clinical measure of depression, just how depressed do respondents from the depressed group have to be?
‣ how "normal" should the respondents in the nondepressed group be?

IRT-Based Methods
• the methods described thus far for setting cut scores are based on classical test score theory
- cut scores are typically set based on testtakers' performance across all the items on the test; some portion of the total number of items on the test must be scored "correct" in order for the testtaker to "pass" the test
• within an IRT framework, however, things can be done differently
- each item is associated with a particular level of difficulty
- in order to "pass" the test, the testtaker must answer items that are deemed to be above some minimum level of difficulty, which is determined by experts and serves as the cut score
• there are several IRT-based methods for determining the difficulty level reflected by a cut score
- item-mapping method: a technique that has found application in setting cut scores for licensing examinations | entails the arrangement of items in a histogram, with each column in the histogram containing items deemed to be of equivalent value
‣ judges are presented with sample items from each column and are asked whether or not a minimally competent licensed individual would answer those items correctly about half the time
‣ if so, that difficulty level is set as the cut score; if not, the process continues until the appropriate difficulty level has been selected
‣ typically, the process involves several rounds of judgments in which experts may receive feedback regarding how their ratings compare to ratings made by other experts
- bookmark method: more typically used in academic applications | begins with the training of experts with regard to the minimal knowledge, skills, and/or abilities that testtakers should possess in order to "pass"
‣ subsequent to this training, the experts are given a book of items, with one item printed per page, such that the items are arranged in ascending order of difficulty
‣ the expert then places a "bookmark" between the two pages (that is, the two items) that separate testtakers who have acquired the minimal knowledge, skills, and/or abilities from those who have not
‣ the bookmark serves as the cut score
‣ additional rounds of bookmarking with the same or other judges may take place as necessary
‣ in the end, the level of difficulty to use as the cut score is decided upon by the test developers
- of course, none of these procedures is free of possible drawbacks
‣ concerns include issues regarding the training of experts, possible floor and ceiling effects, and the optimal length of item booklets
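The data handling behind a single bookmark round is straightforward: order the items by their IRT difficulty estimates and read off the difficulty at the page where a judge places the bookmark. The item names, difficulty values, and bookmark position below are invented, and this deliberately glosses over how a panel's bookmarks are reconciled across rounds.

```python
# Sketch of one bookmark-method round: items are presented one per page in
# ascending order of IRT difficulty, and the judge's bookmark picks the cut.
items = {"item_a": -1.2, "item_b": -0.4, "item_c": 0.1, "item_d": 0.7, "item_e": 1.5}

# "ordered item booklet": one item per page, easiest first
booklet = sorted(items, key=items.get)

bookmark_page = 3                       # judge's bookmark sits before the 4th page
cut_item = booklet[bookmark_page]       # first item the minimally competent testtaker
                                        # is not expected to master
cut_difficulty = items[cut_item]
print(booklet)                          # ['item_a', 'item_b', 'item_c', 'item_d', 'item_e']
print(cut_item, cut_difficulty)         # item_d 0.7
```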
Other Methods
• many other methods of cut-score setting exist
- Hambleton and Novick presented a decision-theoretic approach to setting cut scores
- R.L. Thorndike proposed a norm-referenced method for setting cut scores called the method of predictive yield
‣ method of predictive yield: took into account the number of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores
- discriminant analysis | discriminant function analysis: a family of statistical techniques used to shed light on the relationship between identified variables and two (and in some cases more) naturally occurring groups (a brief sketch follows)
‣ e.g. the relationship between scores on a test and persons judged to be successful at a job vs. persons judged to be unsuccessful at a job
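For the discriminant-analysis route, an off-the-shelf implementation such as scikit-learn's LinearDiscriminantAnalysis can be used to relate test scores to the two naturally occurring groups. The scores and group labels below are invented, and this is only one of several ways such an analysis could be set up.

```python
# Sketch: linear discriminant analysis separating "successful" (1) from
# "unsuccessful" (0) employees on the basis of a single test score.
# Scores and group labels are invented for illustration.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

scores = np.array([[52], [58], [61], [64], [70], [73], [79], [85]])  # one predictor
success = np.array([0, 0, 0, 1, 0, 1, 1, 1])                         # criterion groups

lda = LinearDiscriminantAnalysis().fit(scores, success)
print(lda.predict([[62], [76]]))          # predicted group for two new scores
print(lda.predict_proba([[62], [76]]))    # estimated probabilities of group membership
```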
• given the importance of setting cut scores and how much can be at stake for individuals "cut" by them, research and debate on the issues involved are likely to continue