TOS Outline PsychAssess 2.0


Psychological Assessment

#BLEPP2023
Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Psych Pearls
Psychometric Properties and Principles
Psychometric properties are essential in constructing, selecting, and interpreting tests.
Psychological Testing - process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior
- numerical in nature
- individual or by group
- administrators can be interchangeable without affecting the evaluation
- requires technician-like skills in terms of administration and scoring
- yields a test score or series of test scores
- minutes to a few hours
Psychological Assessment - gathering and integration of psychology-related data for the purpose of making a psychological evaluation
- answers the referral question through the use of different tools of evaluation
- individual
- the assessor is the key to the process of selecting tests and/or other tools of evaluation
- requires an educated selection of tools of evaluation, skill in evaluation, and thoughtful organization and integration of data
- entails logical problem-solving that brings to bear many sources of data designed to answer the referral question
- Educational: evaluate abilities and skills relevant in a school context
- Retrospective: draw conclusions about psychological aspects of a person as they existed at some point in time prior to the assessment
- Remote: subject is not in physical proximity to the person conducting the evaluation
- Ecological Momentary: “in the moment” evaluation of specific problems and related cognitive and behavioral variables at the very time and place that they occur
- Collaborative: the assessor and assessee may work as “partners” from initial contact through final feedback
- Therapeutic: therapeutic self-discovery and new understanding are encouraged
- Dynamic: describes an interactive approach to psychological assessment that usually follows the model: evaluation > intervention of some sort > evaluation
o Psychological Test – device or procedure designed to measure variables related to psychology
▪ Content: subject matter
▪ Format: form, plan, structure, arrangement, layout
▪ Item: a specific stimulus to which a person responds overtly, and this response is scored or evaluated
▪ Administration Procedures: one-to-one basis or group administration
▪ Score: code or summary statement, usually but not necessarily numerical in nature, that reflects an evaluation of performance on a test
▪ Scoring: the process of assigning scores to performances
▪ Cut-Score: reference point derived by judgement and used to divide a set of data into two or more classifications
▪ Psychometric Soundness: technical quality
▪ Psychometrics: science of psychological measurement
▪ Psychometrist or Psychometrician: refers to a professional who uses, analyzes, and interprets psychological data
Ability or Maximal Performance Test – assesses what a person can do
1. Achievement Test – measurement of previous learning
- used to measure general knowledge in a specific period of time
- used to assess mastery
- relies mostly on content validity
- fact-based or conceptual
2. Aptitude – refers to the potential for learning or acquiring a specific skill
- tends to focus on informal learning
- relies mostly on predictive validity
3. Intelligence – refers to a person’s general potential to solve problems, adapt to changing environments, think abstractly, and profit from experience
Human Ability – considerable overlap of achievement, aptitude, and intelligence tests
Typical Performance Test – measures usual or habitual thoughts, feelings, and behavior
- indicates how testtakers think and act on a daily basis
- uses interval scales
- no right and wrong answers
Personality Test – measures individual dispositions and preferences
- designed to identify characteristics
- measured idiographically or nomothetically
1. Structured Personality Tests – provide statements, usually self-report, and require the subject to choose between two or more alternative responses
2. Projective Personality Tests – unstructured, and the stimulus or response is ambiguous
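A cut-score, as defined above, is just a judgment-derived reference point that divides a set of scores into classifications. A minimal sketch in Python; the threshold and labels are illustrative, not from the source:

```python
def classify(scores, cut_score, labels=("fail", "pass")):
    """Divide raw scores into two classifications using a cut-score.

    Scores at or above the cut-score get labels[1]. The cut-score itself
    (75 below) is an illustrative judgment call, not a value from the source.
    """
    return [labels[1] if s >= cut_score else labels[0] for s in scores]

print(classify([60, 75, 82, 74], 75))  # → ['fail', 'pass', 'pass', 'fail']
```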
Hi :) this reviewer is FREE! u can share it with others but never sell it okay? let’s help each other <3 -aly
3. Attitude Test – elicits personal beliefs and opinions
4. Interest Inventories – measure likes and dislikes as well as one’s personality orientation towards the world of work
Other Tests:
1. Speed Tests – the interest is in the number of items a testtaker can answer correctly in a specific period
2. Power Tests – reflect the level of difficulty of the items the testtakers answer correctly
3. Values Inventory
4. Trade Test
5. Neuropsychological Test
6. Norm-Referenced Test
7. Criterion-Referenced Test
o Interview – method of gathering information through direct communication involving reciprocal exchange
Standardized/Structured – questions are prepared
Non-standardized/Unstructured – pursue relevant ideas in depth
Semi-Standardized/Focused – may probe further on a specific number of questions
Non-Directive – subject is allowed to express his feelings without fear of disapproval
▪ Mental Status Examination: determines the mental status of the patient
▪ Intake Interview: determines why the client came for assessment; a chance to inform the client about the policies, fees, and process involved
▪ Social Case: biographical sketch of the client
▪ Employment Interview: determines whether the candidate is suitable for hiring
▪ Panel Interview (Board Interview): more than one interviewer participates in the assessment
▪ Motivational Interview: used by counselors and clinicians to gather information about some problematic behavior, while simultaneously attempting to address it therapeutically
o Portfolio – samples of one’s ability and accomplishment
o Case History Data – refers to records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee
▪ Case Study: a report or illustrative account concerning a person or an event that was compiled on the basis of case history data
▪ Groupthink: result of the varied forces that drive decision-makers to reach a consensus
o Behavioral Observation – monitoring of the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions
▪ Naturalistic Observation: observe humans in a natural setting
▪ SORC Model: Stimulus, Organismic Variables, Actual Response, Consequence
o Role Play – defined as acting an improvised or partially improvised part in a simulated situation
▪ Role Play Test: assessees are directed to act as if they were in a particular situation
o Other tools include computers and physiological devices (biofeedback devices)
Psychological Assessment Process
1. Determining the Referral Question
2. Acquiring Knowledge Relating to the Content of the Problem
3. Data Collection
4. Data Interpretation
o Hit Rate – accurately predicts success or failure
o Profile – narrative description, graph, table, or other representation of the extent to which a person has demonstrated certain targeted characteristics as a result of the administration or application of tools of assessment
o Actuarial Assessment – an approach to evaluation characterized by the application of empirically demonstrated statistical rules as the determining factor in assessors’ judgements and actions
o Mechanical Prediction – application of computer algorithms together with statistical rules and probabilities to generate findings and recommendations
o Extra-Test Behavior – observations made by an examiner regarding what the examinee does and how the examinee reacts during the course of testing that are indirectly related to the test’s specific content but of possible significance to interpretation
Parties in Psychological Assessment
1. Test Author/Developer – creates the tests or other methods of assessment
2. Test Publishers – publish, market, sell, and control the distribution of tests
3. Test Reviewers – prepare evaluative critiques based on the technical and practical aspects of the tests
4. Test Users – use the tests or other methods of assessment
5. Test Takers – those who take the tests
6. Test Sponsors – institutions or government bodies that contract test developers for various testing services
7. Society
o Test Battery – selection of tests and assessment procedures typically composed of tests designed to measure different variables but having a common objective
Assumptions about Psychological Testing and Assessment
Assumption 1: Psychological Traits and States Exist
o Trait – any distinguishable, relatively enduring way in which one individual varies from another
- permits people to predict the present from the past
- characteristic patterns of thinking, feeling, and behaving that generalize across similar situations, differ systematically between individuals, and remain rather stable across time
- Psychological Trait – intelligence, specific intellectual abilities, cognitive style, adjustment, interests, attitudes, sexual orientation and preferences, psychopathology, etc.
o States – distinguish one person from another but are relatively less enduring
- characteristic pattern of thinking, feeling, and behaving in a concrete situation at a specific moment in time
- identify those behaviors that can be controlled by manipulating the situation
o Psychological traits exist as constructs
- Construct: an informed, scientific concept developed or constructed to explain a behavior, inferred from overt behavior
- Overt Behavior: an observable action or the product of an observable action
o A trait is not expected to be manifested in behavior 100% of the time
o Whether a trait manifests itself in observable behavior, and to what degree it manifests, is presumed to depend not only on the strength of the trait in the individual but also on the nature of the situation (situation-dependent)
o The context within which behavior occurs also plays a role in helping us select appropriate trait terms for observed behaviors
o Definitions of trait and state also refer to a way in which one individual varies from another
o Assessors may make comparisons among people who, because of their membership in some group or for any number of other reasons, are decidedly not average
Assumption 2: Psychological Traits and States can be Quantified and Measured
o Once the trait, state, or other construct to be measured has been defined, a test developer considers the types of item content that would provide insight into it, to gauge the strength of that trait
o Measuring traits and states by means of a test entails developing not only appropriate test items but also appropriate ways to score the test and interpret the results
o Cumulative Scoring – assumption that the more the testtaker responds in a particular direction keyed by the test manual as correct or consistent with a particular trait, the higher that testtaker is presumed to be on the targeted ability or trait
Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior
o The tasks in some tests mimic the actual behaviors that the test user is attempting to understand
o Such tests only yield a sample of the behavior that can be expected to be emitted under nontest conditions
Assumption 4: Tests and Other Measurement Techniques have Strengths and Weaknesses
o Competent test users understand and appreciate the limitations of the tests they use, as well as how those limitations might be compensated for by data from other sources
Assumption 5: Various Sources of Error are Part of the Assessment Process
o Error – refers to something that is more than expected; it is a component of the measurement process
▪ Refers to a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test
▪ Error Variance – the component of a test score attributable to sources other than the trait or ability measured
o Potential sources of error variance:
1. Assessors
2. Measuring instruments
3. Random errors such as luck
o Classical Test Theory – each testtaker has a true score on a test that would be obtained but for the action of measurement error
Assumption 6: Testing and Assessment can be Conducted in a Fair and Unbiased Manner
o Despite the best efforts of many professionals, fairness-related questions and problems do occasionally arise
o In all questions about tests with regard to fairness, it is important to keep in mind that tests are tools: they can be used properly or improperly
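Cumulative scoring (Assumption 2 above) reduces to counting responses that match the keyed direction: the more keyed responses, the higher the testtaker is presumed to stand on the targeted trait. A minimal sketch with an illustrative 5-item true/false key:

```python
def cumulative_score(responses, key):
    """Cumulative scoring: each response matching the keyed (trait-consistent)
    direction adds one point, so a higher total implies more of the targeted
    trait or ability. The key below is illustrative, not from any manual."""
    return sum(1 for r, k in zip(responses, key) if r == k)

key = ["T", "F", "T", "T", "F"]                           # keyed answers
print(cumulative_score(["T", "F", "F", "T", "F"], key))   # → 4
```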
Assumption 7: Testing and Assessment Benefit Society
o Considering the many critical decisions that are based on testing and assessment procedures, we can readily appreciate the need for tests
Reliability
o Reliability – dependability or consistency of the instrument, or of scores obtained by the same person when re-examined with the same test on different occasions or with different sets of equivalent items
▪ A test may be reliable in one context but unreliable in another
▪ Estimates the range of possible random fluctuations that can be expected in an individual’s score
▪ Free from errors
▪ More items = higher reliability
▪ Minimizing error
▪ Using only a representative sample to obtain an observed score
▪ The true score cannot be found
▪ Reliability Coefficient: index of reliability; a proportion that indicates the ratio between the true score variance on a test and the total variance
o Classical Test Theory (True Score Theory) – the score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also error
▪ Error: refers to the component of the observed test score that does not have to do with the testtaker’s ability
▪ Errors of measurement are random
▪ When you average all the observed scores obtained over a period of time, the result would be closest to the true score
▪ The greater the number of items, the higher the reliability
▪ Factors that contribute to consistency: stable attributes
▪ Factors that contribute to inconsistency: characteristics of the individual, test, or situation which have nothing to do with the attribute being measured but still affect the scores
o Goals of Reliability:
✓ Estimate errors
✓ Devise techniques to improve testing and reduce errors
o Variance – useful in describing sources of test score variability
▪ True Variance: variance from true differences
▪ Error Variance: variance from irrelevant random sources
Measurement Error – all of the factors associated with the process of measuring some variable, other than the variable being measured
- the difference between the observed score and the true score
- Positive: can increase one’s score
- Negative: can decrease one’s score
- Sources of Error Variance:
a. Item Sampling/Content Sampling: refers to variation among items within a test as well as to variation among items between tests
- the extent to which a testtaker’s score is affected by the content sampled on a test, and by the way the content is sampled, is a source of error variance
b. Test Administration – testtaker’s motivation or attention, environment, etc.
c. Test Scoring and Interpretation – may employ objective-type items amenable to computer scoring of well-documented reliability
Random Error – source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process (e.g., noise, temperature, weather)
Systematic Error – source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
- has a consistent effect on the true score
- the SD does not change, but the mean does
▪ Reliability refers to the proportion of total variance attributed to true variance
▪ The greater the proportion of the total variance attributed to true variance, the more reliable the test
▪ Error variance may increase or decrease a test score by varying amounts, so the consistency of the test score, and thus the reliability, can be affected
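Classical test theory above treats an observed score as true score plus random error (X = T + E), and reliability as the ratio of true variance to total variance. A quick simulation illustrates the idea; the distribution parameters (mean 100, true SD 15, error SD 5) are illustrative:

```python
import random

def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

random.seed(0)
true_scores = [random.gauss(100, 15) for _ in range(10_000)]  # T
errors      = [random.gauss(0, 5) for _ in range(10_000)]     # E, random
observed    = [t + e for t, e in zip(true_scores, errors)]    # X = T + E

# Reliability as true variance over total (observed) variance
r_xx = variance(true_scores) / variance(observed)
print(round(r_xx, 2))  # theoretical value: 15**2 / (15**2 + 5**2) = 0.90
```

Because the error is random, averaging many observations of the same person would converge on the true score, which is the intuition behind the "average of observed scores approaches the true score" note above.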
Test-Retest Reliability
Error: Time Sampling
- time sampling reliability
- an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the test
- appropriate when evaluating the reliability of a test that purports to measure an enduring and stable attribute, such as a personality trait
- established by comparing the scores obtained from two successive measurements of the same individuals and calculating a correlation between the two sets of scores
- the longer the time that passes, the greater the likelihood that the reliability coefficient will be insignificant
- Carryover Effects: happen when the test-retest interval is short, wherein the second test is influenced by the first because testtakers remember or practiced the previous test = inflated correlation/overestimation of reliability
- Practice Effect: scores on the second session are higher due to the experience of the first session of testing
- test-retest with a longer interval might be affected by other extraneous factors, thus resulting in a low correlation
- lower correlation = poor reliability
- Mortality: problems with absences in the second session (just remove the first tests of those absent)
- Coefficient of Stability
- Statistical tool: Pearson r or Spearman rho
Parallel Forms/Alternate Forms Reliability
Error: Item Sampling (immediate), Item Sampling changes over time (delayed)
- established when at least two different versions of the test yield almost the same scores
- has the most universal applicability
- Parallel Forms: for each form of the test, the means and the variances are EQUAL; same items, different positionings/numberings
- Alternate Forms: simply a different version of a test that has been constructed so as to be parallel
- the tests should contain the same number of items, the items should be expressed in the same form and should cover the same type of content, and the range and difficulty must also be equal
- if there is test leakage, use the form that is not mostly administered
- Counterbalancing: technique to avoid carryover effects for parallel forms by using a different sequence for each group
- can be administered on the same day or at different times
- most rigorous and burdensome, since test developers have to create two forms of the test
- main problem: the difference between the two tests
- test scores may be affected by motivation, fatigue, or intervening events
- the means and the variances of the observed scores must be equal for the two forms
- Statistical tool: Pearson r or Spearman rho
Internal Consistency (Inter-Item Reliability)
Error: Item Sampling, Homogeneity
- used when tests are administered once
- consistency among items within the test
- measures the internal consistency of the test, which is the degree to which each item measures the same construct
- measurement for unstable traits
- if all items measure the same construct, then the test has good internal consistency
- useful in assessing homogeneity
- Homogeneity: if a test contains items that measure a single trait (unifactorial)
- Heterogeneity: degree to which a test measures different factors (more than one factor/trait)
- more homogeneous = higher inter-item consistency
- KR-20: used for the inter-item consistency of dichotomous items (intelligence tests, personality tests with yes or no options, multiple choice); unequal variances, dichotomously scored
- KR-21: if all the items have the same degree of difficulty (speed tests); equal variances, dichotomously scored
- Cronbach’s Coefficient Alpha: used when the two halves of the test have unequal variances and on tests containing non-dichotomous items
- Average Proportional Distance: measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores
Split-Half Reliability
Error: Item Sampling; Nature of the Split
- obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered ONCE
- useful when it is impractical or undesirable to assess reliability with two tests or to administer a test twice
- you cannot just divide the items in the middle, because doing so might spuriously raise or lower the reliability coefficient; instead, randomly assign items, or assign odd-numbered items to one half and even-numbered items to the other half
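Cronbach's coefficient alpha described above (of which KR-20 is the special case for dichotomously scored items) can be computed directly from a score matrix: alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). A minimal sketch; the data are illustrative:

```python
def cronbach_alpha(items):
    """Cronbach's coefficient alpha for a score matrix (rows = testtakers,
    columns = items). For dichotomously scored (0/1) items this reduces
    to KR-20."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    k = len(items[0])                                        # number of items
    item_vars = [var([row[j] for row in items]) for j in range(k)]
    total_var = var([sum(row) for row in items])             # total-score variance
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Five testtakers answering three dichotomous items (1 = keyed response);
# illustrative data, not from the source
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
print(round(cronbach_alpha(data), 2))  # → 0.79
```

When every item measures the same thing perfectly (all columns identical), alpha is exactly 1, matching the note above that a unifactorial, homogeneous test has the highest inter-item consistency.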
- Spearman-Brown Formula: allows a test developer or user to estimate internal consistency reliability from the correlation of two halves of a test, as if each half had been the length of the whole test (assumes the halves have equal variances)
- Spearman-Brown Prophecy Formula: estimates how many more items are needed in order to achieve the target reliability
- multiply the estimate by the original number of items
- Rulon’s Formula: counterpart of the Spearman-Brown formula; the ratio of the variance of the differences between the odd and even splits to the variance of the total, combined odd-even score
- if the reliability of the original test is relatively low, then the developer could create new items, clarify test instructions, or simplify the scoring rules
- equal variances, dichotomously scored
- Statistical tool: Pearson r or Spearman rho
Inter-Scorer Reliability
Error: Scorer Differences
- the degree of agreement or consistency between two or more scorers with regard to a particular measure
- used for coding nonverbal behavior
- observer differences
- Fleiss’ Kappa: determines the level of agreement between TWO or MORE raters when the method of assessment is measured on a CATEGORICAL SCALE
- Cohen’s Kappa: two raters only
- Krippendorff’s Alpha: two or more raters; based on observed disagreement corrected for disagreement expected by chance
o Tests designed to measure one factor (homogeneous) are expected to have a high degree of internal consistency, and vice versa
o Dynamic – trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experience
o Static – barely changing or relatively unchanging
o Restriction of range or restriction of variance – if the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower
o Power Tests – when the time limit is long enough to allow testtakers to attempt all items
o Speed Tests – generally contain items of uniform level of difficulty with a time limit
▪ Reliability should be based on performance from two independent testing periods using test-retest, alternate-forms, or split-half reliability
o Criterion-Referenced Tests – designed to provide an indication of where a testtaker stands with respect to some variable or criterion
▪ As individual differences decrease, a traditional measure of reliability would also decrease, regardless of the stability of individual performance
o Classical Test Theory – everyone has a “true score” on a test
▪ True Score: genuinely reflects an individual’s ability level as measured by a particular test
▪ Random error
o Domain Sampling Theory – estimates the extent to which specific sources of variation under defined conditions contribute to the test score
▪ Considers the problem created by using a limited number of items to represent a larger and more complicated construct
▪ Test reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample
▪ Generalizability Theory: based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation
▪ Universe: test situation
▪ Facets: number of items in the test, amount of review, and the purpose of test administration
▪ According to Generalizability Theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained (universe score)
▪ Decision Study: developers examine the usefulness of test scores in helping the test user make decisions
▪ Systematic error
o Item Response Theory – the probability that a person with X ability will be able to perform at a level of Y on a test
▪ Focus: item difficulty
▪ Latent-Trait Theory
▪ a system of assumptions about measurement and the extent to which each item measures the trait
▪ The computer is used to focus on the range of item difficulty that helps assess an individual’s ability level
▪ If you get several easy items correct, the computer will then move on to more difficult items
▪ Difficulty: attribute of not being easily accomplished, solved, or comprehended
▪ Discrimination: degree to which an item differentiates among people with higher or lower levels of the trait, ability, etc.
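The Spearman-Brown formula described earlier on this page has the closed form r_new = n*r / (1 + (n-1)*r), where n is the factor by which the test is lengthened; the prophecy form solves the same equation for n. A minimal sketch with illustrative reliability values:

```python
def spearman_brown(r, n=2.0):
    """Spearman-Brown: estimated reliability of a test lengthened by factor n,
    given reliability r of the current form. n=2 corrects a half-test
    correlation up to full-test length (equal-variance halves assumed)."""
    return n * r / (1 + (n - 1) * r)

def prophecy_n(r, r_target):
    """Prophecy form: by what factor must the test be lengthened to reach
    r_target? Multiply by the current number of items for the new length."""
    return r_target * (1 - r) / (r * (1 - r_target))

print(spearman_brown(0.5))   # half-test r = .50 corrects to ~.67 at full length
print(prophecy_n(0.6, 0.8))  # need ~2.67x as many items to go from .60 to .80
```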
▪ Dichotomous: can be answered with only one of two alternative responses
▪ Polytomous: 3 or more alternative responses
o Standard Error of Measurement – provides a measure of the precision of an observed test score
▪ The standard deviation of errors is the basic measure of error
▪ Index of the amount of inconsistency, or the amount of expected error, in an individual’s score
▪ Allows us to quantify the extent to which a test provides accurate scores
▪ Provides an estimate of the amount of error inherent in an observed score or measurement
▪ Higher reliability, lower SEM
▪ Used to estimate or infer the extent to which an observed score deviates from a true score
▪ Standard Error of a Score
▪ Confidence Interval: a range or band of test scores that is likely to contain the true score
o Standard Error of the Difference – can aid a test user in determining how large a difference should be before it is considered statistically significant
o Standard Error of Estimate – refers to the standard error of the difference between the predicted and observed values
o Confidence Interval – a range or band of test scores that is likely to contain the true score
▪ Tells us the likely position of the true score within the specified range and confidence level
▪ The larger the range, the higher the confidence
o If the reliability is low, you can increase the number of items or use factor analysis and item analysis to increase internal consistency
o Reliability Estimates – the nature of the test will often determine the reliability metric:
a) Homogeneous (unifactor) or heterogeneous (multifactor)
b) Dynamic (unstable) or static (stable)
c) Range of scores is restricted or not
d) Speed test or power test
e) Criterion-referenced or non-criterion-referenced
o Test Sensitivity – detects true positives
o Test Specificity – detects true negatives
o Base Rate – proportion of the population that actually possesses the characteristic of interest
o Selection Ratio – number of available positions compared to the number of applicants
o Four Possible Hit and Miss Outcomes
1. True Positive (Sensitivity) – predicted success that does occur
2. True Negative (Specificity) – predicted failure that does occur
3. False Positive (Type 1) – predicted success that does not occur
4. False Negative (Type 2) – predicted failure, but the person succeeds
Validity
o Validity – a judgment or estimate of how well a test measures what it is supposed to measure
▪ Evidence about the appropriateness of inferences drawn from test scores
▪ Degree to which the measurement procedure measures the variables it is supposed to measure
▪ Inference – a logical result or deduction
▪ May diminish as the culture or times change
✓ Predicts future performance
✓ Measures the appropriate domain
✓ Measures the appropriate characteristics
o Validation – the process of gathering and evaluating evidence about validity
o Validation Studies – yield insights regarding a particular population of testtakers as compared to the norming sample described in a test manual
o Internal Validity – degree of control among variables in the study (increased through random assignment)
o External Validity – generalizability of the research results (increased through random selection)
o Conceptual Validity – focuses on individuals with their unique histories and behaviors
▪ A means of evaluating and integrating test data so that the clinician’s conclusions make accurate statements about the examinee
o Face Validity – what a test appears to measure to the person being tested, rather than what the test actually measures
Content Validity
- describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
- when the proportion of material covered by the test approximates the proportion of material covered in the course
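The standard error of measurement discussed above has the standard closed form SEM = SD * sqrt(1 - r), and the confidence interval around an observed score follows from it. A minimal sketch with illustrative values (SD 15, reliability .91):

```python
def sem(sd, reliability):
    """Standard error of measurement: the higher the reliability,
    the lower the SEM."""
    return sd * (1 - reliability) ** 0.5

def confidence_interval(observed, sd, reliability, z=1.96):
    """Band of scores likely to contain the true score (z=1.96 gives ~95%).
    A wider band means higher confidence that it contains the true score."""
    e = z * sem(sd, reliability)
    return observed - e, observed + e

print(sem(15, 0.91))                       # 15 * sqrt(.09) ≈ 4.5
print(confidence_interval(100, 15, 0.91))  # ~95% band around an observed 100
```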
- Test Blueprint: a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items, and so forth
- more logical than statistical
- concerned with the extent to which the test is representative of a defined body of content consisting of topics and processes
- a panel of experts can review the test items and rate them in terms of how closely they match the objective or domain specification
- examine whether items are essential, useful, and necessary
- construct underrepresentation: failure to capture important components of a construct
- construct-irrelevant variance: happens when scores are influenced by factors irrelevant to the construct
- Lawshe: developed the formula for the Content Validity Ratio
- Zero CVR: exactly half of the experts rate the item as essential
Criterion Validity
- more statistical than logical
- a judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest, the measure of interest being the criterion
- Criterion: the standard on which a judgment or decision may be based
- Characteristics: relevant, valid, uncontaminated
- Criterion Contamination: occurs when the criterion measure includes aspects of performance that are not part of the job, or when the measure is affected by “construct-irrelevant” (Messick, 1989) factors that are not part of the criterion construct
1. Concurrent Validity: the test scores are obtained at about the same time as the criterion measures; economically efficient
2. Predictive Validity: measures of the relationship between test scores and a criterion measure obtained at a future time
- Incremental Validity: the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use; used to improve prediction
- related to predictive validity, wherein it is defined as the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use
Construct Validity
- both logical and statistical
- a judgment about the appropriateness of inferences drawn from test scores regarding individual standing on a variable called a construct
- Construct: an informed, scientific idea developed or hypothesized to describe or explain behavior; unobservable, presupposed traits that one may invoke to describe test behavior or criterion performance
- One way a test developer can improve the homogeneity of a test containing dichotomous items is by eliminating items that do not show significant correlation coefficients with total test scores
- If it is an academic test and high scorers on the entire test for some reason tended to get a particular item wrong while low scorers got it right, then the item is obviously not a good one
- Some constructs lend themselves more readily than others to predictions of change over time
- Method of Contrasted Groups: demonstrate that scores on the test vary in a predictable way as a function of membership in a group
- If a test is a valid measure of a particular construct, then the scores of the group of people who do not have that construct should differ from the scores of those who really possess it
- Convergent Evidence: scores on the test undergoing construct validation tend to be highly correlated with scores on another established, validated test that measures the same construct
- Discriminant Evidence: a validity coefficient showing little relationship between test scores and/or other variables with which scores on the test being construct-validated should not be correlated
- Indicators of construct validity: the test is homogeneous; test scores increase or decrease as a function of age, the passage of time, or experimental manipulation; pretest-posttest differences; scores differ between groups; scores correlate with scores on other tests in accordance with what is predicted
o Factor Analysis – designed to identify factors or specific variables that are typically attributes, characteristics, or dimensions on which people may differ
▪ Developed by Charles Spearman
▪ Employed as a data reduction method
something about the criterion measure that is not ▪ Used to study the interrelationships among set of
explained by predictors already in use variables
Construct Validity (Umbrella Validity) ▪ Identify the factor or factors in common between
- covers all types of validity test scores on subscales within a particular test
▪ Explanatory FA: estimating or extracting factors;
deciding how many factors must be retained
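Lawshe's Content Validity Ratio can be computed directly from the panel ratings; a minimal sketch (the function name is my own):

```python
def content_validity_ratio(n_essential, n_experts):
    """Lawshe's CVR = (n_e - N/2) / (N/2), where n_e is the number of
    panelists rating the item "essential" and N is the panel size."""
    half = n_experts / 2
    return (n_essential - half) / half

# Exactly half of a 10-expert panel rates the item essential -> CVR = 0
print(content_validity_ratio(5, 10))   # 0.0
# All 10 experts rate it essential -> CVR = 1
print(content_validity_ratio(10, 10))  # 1.0
```

Note how the "Zero CVR" case above falls straight out of the formula: when n_e equals N/2, the numerator is zero.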
Hi :) this reviewer is FREE! u can share it with others but never sell it okay? let’s help each other <3 -aly
▪ Confirmatory FA: researchers test the degree to which a hypothetical model fits the actual data
▪ Factor Loading: conveys information about the extent to which the factor determines the test score or scores
▪ can be used to obtain both convergent and discriminant validity
o Cross-Validation – revalidation of the test on a criterion based on another group different from the original group from which the test was validated
▪ Validity Shrinkage: decrease in validity after cross-validation
▪ Co-Validation: validation of more than one test on the same group
▪ Co-Norming: norming of more than one test on the same group
o Bias – a factor inherent in a test that systematically prevents accurate, impartial measurement
▪ prejudice, preferential treatment
▪ prevented during test development through a procedure called Estimated True Score Transformation
o Rating – a numerical or verbal judgement that places a person or an attribute along a continuum identified by a scale of numerical or word descriptors known as a Rating Scale
▪ Rating Error: intentional or unintentional misuse of the scale
▪ Leniency Error: the rater is lenient in scoring (Generosity Error)
▪ Severity Error: the rater is strict in scoring
▪ Central Tendency Error: the rater's ratings tend to cluster in the middle of the rating scale
▪ one way to overcome rating errors is to use rankings
▪ Halo Effect: tendency to give a high score due to failure to discriminate among conceptually distinct and potentially independent aspects of a ratee's behavior
o Fairness – the extent to which a test is used in an impartial, just, and equitable way
o Attempting to define the validity of the test will be futile if the test is NOT reliable
Utility
o Utility – the usefulness or practical value of testing to improve efficiency
o can tell us something about the practical value of the information derived from scores on the test
o helps us make better decisions
o higher criterion-related validity = higher utility
o one of the most basic elements in utility analysis is the financial cost of the selection device
o Cost – disadvantages, losses, or expenses in both economic and noneconomic terms
o Benefit – profits, gains, or advantages
o the cost of test administration can be well worth it if the result is certain noneconomic benefits
o Utility Analysis – a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment
o Expectancy Table – provides an indication that a testtaker will score within some interval of scores on a criterion measure (passing, acceptable, failing)
o might indicate future behaviors; if successful, the test is working as it should
o Taylor-Russell Tables – provide an estimate of the extent to which inclusion of a particular test in the selection system will improve selection
o Selection Ratio – a numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired
o Base Rate – the percentage of people hired under the existing system for a particular position
o one limitation of the Taylor-Russell Tables is that the relationship between the predictor (test) and the criterion must be linear
o Naylor-Shine Tables – entail obtaining the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures
o Brogden-Cronbach-Gleser Formula – used to calculate the dollar amount of a utility gain resulting from the use of a particular selection instrument
o Utility Gain – an estimate of the benefit of using a particular test
o Productivity Gains – an estimated increase in work output
o high-performing applicants may have been offered jobs in other companies as well
o the more complex the job, the more people differ on how well or poorly they do that job
o Cut Score – a reference point derived as a result of a judgement and used to divide a set of data into two or more classifications
Relative Cut Score – a reference point based on norm-related considerations (norm-referenced); e.g., NMAT
Fixed Cut Scores – set with reference to a judgement concerning a minimum level of proficiency required; e.g., Board Exams
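The Brogden-Cronbach-Gleser computation can be sketched as follows. The functional form (a benefit term minus the total cost of testing) follows the textbook formula, but the parameter names and the numbers are illustrative only:

```python
def utility_gain(n_selected, tenure_years, validity, sd_y, mean_z_selected,
                 cost_per_applicant, n_applicants):
    """Brogden-Cronbach-Gleser sketch: dollar utility gain =
    N selected * tenure * validity (r_xy) * SDy * mean standardized
    predictor score of those selected, minus the cost of testing
    all applicants."""
    benefit = n_selected * tenure_years * validity * sd_y * mean_z_selected
    cost = cost_per_applicant * n_applicants
    return benefit - cost

# Illustrative numbers: 10 hires staying 2 years, validity .40, SDy $10,000,
# selectees averaging 1 SD above the mean, $25 per test, 200 applicants
print(utility_gain(10, 2, 0.40, 10_000, 1.0, 25, 200))  # roughly a $75,000 gain
```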
Multiple Cut Scores – the use of two or more cut scores with reference to one predictor for the purpose of categorization
Multiple Hurdle – a multi-stage selection process; a cut score is in place for each predictor
Compensatory Model of Selection – assumption that high scores on one attribute can compensate for lower scores on another
o Angoff Method – for setting fixed cut scores
▪ low interrater reliability
o Known Groups Method – collection of data on the predictor of interest from groups known to possess, and not to possess, a trait of interest
▪ the determination of where to set the cutoff score is inherently affected by the composition of the contrasting groups
o IRT-Based Methods – cut scores are typically set based on testtakers' performance across all the items on the test
▪ Item-Mapping Method: arrangement of items in a histogram, with each column containing items deemed to be of equivalent value
▪ Bookmark Method: an expert places a "bookmark" between the two pages deemed to separate testtakers who have acquired the minimal knowledge, skills, and/or abilities from those who have not
o Method of Predictive Yield – takes into account the number of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores
o Discriminant Analysis – sheds light on the relationship between identified variables and two naturally occurring groups
Reason for accepting or rejecting instruments and tools based on Psychometric Properties
Reliability
o Basic Research = 0.70 to 0.90
o Clinical Setting = 0.90 to 0.95
Validity
Item Difficulty
Item Discrimination
P-Value
o P-Value ≤ α: reject the null hypothesis
o P-Value > α: accept (fail to reject) the null hypothesis
Research Methods and Statistics (20)
Statistics Applied in Research Studies on Tests and Test Development
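The P-Value decision rule can be sketched as a small helper, with α as the chosen significance level:

```python
def decide(p_value, alpha=0.05):
    """Reject the null hypothesis when p <= alpha; otherwise retain it."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(decide(0.03))  # reject H0
print(decide(0.20))  # fail to reject H0
```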
Measures of Central Tendency – statistics that indicate the average or midmost score between the extreme scores in a distribution
- Goal: identify the most typical or representative score of the entire group
- also called Measures of Central Location
Mean – the average of all the raw scores
- equal to the sum of the observations divided by the number of observations
- for interval and ratio data (when normally distributed)
- the point of least squares
- the balance point for the distribution
- susceptible to outliers
Median – the middle score of the distribution
- for ordinal, interval, and ratio data
- when there are extreme scores, use the median
- identical for sample and population
- also used when there is an unknown or undetermined score
- used with "open-ended" categories (e.g., 5 or more, more than 8, at least 10)
- if the distribution is skewed for ratio/interval data, use the median
Mode – the most frequently occurring score in the distribution
- Bimodal Distribution: if there are two scores that occur with the highest frequency
- not commonly used
- useful in analyses of a qualitative or verbal nature
- for nominal scales, discrete variables
- the value of the mode gives an indication of the shape of the distribution as well as a measure of central tendency
Measures of Spread or Variability – statistics that describe the amount of variation in a distribution
- give an idea of how well the measure of central tendency represents the data
- a large spread of values means large differences between individual scores
Range – equal to the difference between the highest and the lowest score
- provides a quick but gross description of the spread of scores
- when its value is based on extreme scores of the distribution, the resulting description of variation may be understated or overstated
Interquartile Range – the difference between Q3 and Q1
Semi-Interquartile Range – the interquartile range divided by 2
Standard Deviation – an approximation of the average deviation around the mean
- gives detail of how far above or below the mean a score falls
- equal to the square root of the average squared deviations about the mean
- equal to the square root of the variance
- a distance from the mean
Variance – equal to the arithmetic mean of the squares of the differences between the scores in a distribution and their mean
- the average squared deviation around the mean
Measures of Location
Percentile or Percentile Rank – not linearly transformable; converged at the middle, while the outer ends show large intervals
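These measures can be verified with Python's standard statistics module (pvariance and pstdev compute the population forms, matching the "average squared deviation" definition above):

```python
import statistics

scores = [10, 12, 12, 15, 21]

print(statistics.mean(scores))       # mean: 70 / 5 = 14
print(statistics.median(scores))     # middle score: 12
print(statistics.mode(scores))       # most frequent score: 12
print(max(scores) - min(scores))     # range: 21 - 10 = 11
print(statistics.pvariance(scores))  # average squared deviation: 74 / 5 = 14.8
print(statistics.pstdev(scores))     # square root of the variance
```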
- expressed in terms of the percentage of persons in the standardization sample who fall below a given score
- indicates the individual's relative position in the standardization sample
Quartile – dividing points between the four quarters in the distribution
- a specific point
- Quarter: refers to an interval
Decile/STEN – divides the distribution into 10 equal parts
Skewness – a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean
Correlation
Pearson R – interval/ratio + interval/ratio
Spearman Rho – ordinal + ordinal
Biserial – artificial dichotomous + interval/ratio
Point Biserial – true dichotomous + interval/ratio
Phi Coefficient – nominal (true dichotomous) + nominal (true/artificial dichotomous)
Tetrachoric – artificial dichotomous + artificial dichotomous
Kendall's – 3 or more ordinal/rank variables
Rank Biserial – nominal + ordinal
Differences
T-Test Independent – two separate groups, random assignment
- e.g., blood pressure of male and female grad students
T-Test Dependent – one group, two scores
- e.g., blood pressure of grad students before and after the lecture
One-Way ANOVA – 3 or more groups, tested once
- e.g., people in different socio-economic statuses and the differences in their salaries
One-Way Repeated Measures – 1 group, measured at least 3 times
- e.g., measuring the focus level of board reviewers during the morning, afternoon, and night sessions of review
Two-Way ANOVA – 3 or more groups, tested for 2 variables
- e.g., people in different socio-economic statuses and the differences in their salaries and their eating habits
ANCOVA – used when you need to control for an additional variable which may be influencing the relationship between your independent and dependent variables
ANOVA Mixed Design – 2 or more groups, measured 3 or more times
- e.g., Young Adults, Middle Adults, and Old Adults' blood pressure measured during breakfast, lunch, and dinner
Non-Parametric Tests
Mann-Whitney U Test – counterpart of the independent t-test
Wilcoxon Signed Rank Test – counterpart of the dependent t-test
Kruskal-Wallis H Test – counterpart of the one-way ANOVA
Friedman Test – counterpart of the repeated-measures ANOVA
Lambda – for 2 groups of nominal data
Chi-Square
Goodness of Fit – used to measure differences; involves nominal data and only one variable with 2 or more categories
Test of Independence – used to measure correlation; involves nominal data and two
variables with two or more categories
Regression – used when one wants to provide a framework of prediction on the basis of one factor in order to predict the probable value of another factor
Linear Regression of Y on X – Y = a + bX; used to predict the unknown value of variable Y when the value of variable X is known
Linear Regression of X on Y – X = c + dY; used to predict the unknown value of variable X using the known variable Y
o True Dichotomy – a dichotomy in which there are only fixed possible categories
o Artificial Dichotomy – a dichotomy in which there are other possibilities in a certain category
Methods and Statistics used in Research Studies and Test Construction
Test Development
o Test Development – an umbrella term for all that goes into the process of creating a test
I. Test Conceptualization – brainstorming of ideas about what kind of test a developer wants to publish
- the stage wherein the following are determined: construct, goal, user, taker, administration, format, response, benefits, costs, interpretation
- determines whether the test will be norm-referenced or criterion-referenced
- Pilot Work/Pilot Study/Pilot Research: preliminary research surrounding the creation of a prototype of the test
- attempts to determine how best to measure a targeted construct
- entails literature reviews and experimentation, as well as the creation, revision, and deletion of preliminary items
II. Test Construction – the stage in the process that entails writing test items, revisions, formatting, and setting scoring rules
- it is not good to create an item that contains numerous ideas
- Item Pool: the reservoir or well from which items will or will not be drawn for the final version of the test
- Item Banks: relatively large and easily accessible collections of test questions
- Computerized Adaptive Testing: an interactive, computer-administered test-taking process wherein the items presented to the testtaker are based in part on the testtaker's performance on previous items
- the test administered may be different for each testtaker, depending on performance on the items presented
- reduces floor and ceiling effects
- Floor Effects: occur when there is some lower limit on a survey or questionnaire and a large percentage of respondents score near this lower limit (testtakers have low scores)
- Ceiling Effects: occur when there is some upper limit on a survey or questionnaire and a large percentage of respondents score near this upper limit (testtakers have high scores)
- Item Branching: the ability of the computer to tailor the content and order of presentation of items on the basis of responses to previous items
- Item Format: the form, plan, structure, arrangement, and layout of individual test items
- Dichotomous Format: offers two alternatives for each item
- Polychotomous Format: each item has more than two alternatives
- Category Format: a format where respondents are asked to rate a construct
1. Checklist – the subject receives a long list of adjectives and indicates whether each one is characteristic of himself or herself
2. Guttman Scale – items are arranged from weaker to stronger expressions of attitude, belief, or feeling
- Selected-Response Format: requires testtakers to select a response from a set of alternative responses
1. Multiple Choice – has three elements: a stem (question), a correct option, and several incorrect alternatives (distractors or foils); should have one correct answer; alternatives should be grammatically parallel, of similar length, and fit grammatically with the stem; avoid ridiculous distractors, excessive length, "all of the above", and "none of the above" (with four options, a 25% chance of guessing correctly)
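Item branching in computerized adaptive testing, where the next item depends on the previous response, can be sketched as follows; the difficulty levels and the one-step rule are my own illustration, not a prescribed algorithm:

```python
LEVELS = ["very easy", "easy", "medium", "hard", "very hard"]

def next_level(current, was_correct):
    """Item branching sketch: present a harder item after a correct
    response and an easier one after a miss, staying inside the bank."""
    i = LEVELS.index(current)
    i = min(i + 1, len(LEVELS) - 1) if was_correct else max(i - 1, 0)
    return LEVELS[i]

print(next_level("medium", True))     # hard
print(next_level("medium", False))    # easy
print(next_level("very hard", True))  # very hard (already at the ceiling)
```

Because high scorers keep climbing and low scorers keep descending, testtakers rarely pile up at the extremes of the score range, which is how adaptive administration reduces floor and ceiling effects.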
- Effective Distractors: a distractor chosen equally by both high- and low-performing groups; enhances the consistency of test results
- Ineffective Distractors: may hurt the reliability of the test because they are time-consuming to read and can limit the number of good items
- Cute Distractors: less likely to be chosen; may affect the reliability of the test because the testtakers may guess from the remaining options
2. Matching Item – the testtaker is presented with two columns: Premises and Responses
3. Binary Choice – usually takes the form of a sentence that requires the testtaker to indicate whether the statement is or is not a fact (a 50% chance of guessing correctly)
- Constructed-Response Format: requires testtakers to supply or create the correct answer, not merely select it
1. Completion Item – requires the examinee to provide a word or phrase that completes a sentence
2. Short-Answer – should be written clearly enough that the testtaker can respond succinctly with a short answer
3. Essay – allows creative integration and expression of the material
- Scaling: the process of setting rules for assigning numbers in measurement
Primary Scales of Measurement
1. Nominal – involves classification or categorization based on one or more distinguishing characteristics
- labels and categorizes observations but does not make any quantitative distinctions between observations
- mode
2. Ordinal – rank ordering on some characteristic is also permissible
- median
3. Interval – contains equal intervals; has no absolute zero point (even negative values have an interpretation)
- a zero value does not mean that none of the property is present
4. Ratio – has a true zero point (if the score is zero, it means none/null)
- easiest to manipulate
Comparative Scales of Measurement
1. Paired Comparison – produces ordinal data by presenting pairs of two stimuli which respondents are asked to compare
- the respondent is presented with two objects at a time and asked to select one object according to some criterion
2. Rank Order – respondents are presented with several items simultaneously and asked to rank them in order of priority
3. Constant Sum – respondents are asked to allocate a constant sum of units, such as points, among a set of stimulus objects with respect to some criterion
4. Q-Sort Technique – sort objects based on similarity with respect to some criterion
Non-Comparative Scales of Measurement
1. Continuous Rating – rate the objects by placing a mark at the appropriate position on a continuous line that runs from one extreme of the criterion variable to the other
- e.g., rating Guardians of the Galaxy as the best Marvel movie of Phase 4
2. Itemized Rating – having numbers or brief descriptions associated with each category
- e.g., 1 if you like the item the most, 2 if so-so, 3 if you hate it
3. Likert Scale – respondents indicate their own attitudes by checking how strongly they agree or disagree with carefully worded statements that range from very positive to very negative toward the attitudinal object
- the principle of measuring attitudes by asking people to respond to a series of statements about a topic, in terms of the extent to which they agree with them
4. Visual Analogue Scale – a 100-mm line that allows subjects to express the magnitude of an experience or belief
5. Semantic Differential Scale – derives the respondent's attitude toward the given object by asking him to select an appropriate position on a scale between two bipolar opposites
6. Stapel Scale – developed to measure the direction and intensity of an attitude simultaneously
7. Summative Scale – the final score is obtained by summing the ratings across all the items
8. Thurstone Scale – involves the collection of a variety of different statements about a phenomenon, which are ranked by an expert panel in order to develop the questionnaire
- allows multiple answers
9. Ipsative Scale – the respondent must choose between two or more equally socially acceptable options
III. Test Tryout – the test should be tried out on people who are similar in critical respects to the people for whom the test was designed
- an informal rule of thumb: no fewer than 5 subjects, and preferably as many as 10, for each item (the more, the better)
- risk of using few subjects = phantom factors emerge
- should be executed under conditions as identical as possible
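A summative (Likert-type) score simply sums the item ratings; a minimal sketch (the reverse-keying of negatively worded items is standard practice, though not spelled out in the outline):

```python
def likert_total(responses, reverse_items=frozenset(), points=5):
    """Summative scale sketch: sum the item ratings; reverse-keyed items
    are flipped first (on a 5-point scale, 1<->5 and 2<->4)."""
    return sum((points + 1 - r) if i in reverse_items else r
               for i, r in enumerate(responses))

# Three 5-point items; the third is reverse-keyed: 5 + 4 + (6 - 2) = 13
print(likert_total([5, 4, 2], reverse_items={2}))  # 13
```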
- A good test item is one that is answered correctly by high scorers as a whole
- Empirical Criterion Keying: administering a large pool of test items to a sample of individuals who are known to differ on the construct being measured
- Item Analysis: statistical procedures used to analyze and evaluate test items
- Discriminability Analysis: employed to examine the correlation between each item and the total score of the test
- Item: suggests a sample of behavior of an individual
- Table of Specification: a blueprint of the test in terms of the number of items per difficulty, topic importance, or taxonomy
- Guidelines for item writing: define clearly what you want to measure, generate an item pool, avoid long items, keep the level of reading difficulty appropriate for those who will complete the test, avoid double-barreled items, and consider using both positively and negatively worded items
- Double-Barreled Items: items that convey more than one idea at the same time
- Item Difficulty: defined by the number of people who get a particular item correct
- Item-Difficulty Index: calculated as the proportion of the total number of testtakers who answered the item correctly; the larger the value, the easier the item
- Item-Endorsement Index: for personality testing, the percentage of individuals who endorsed an item on a personality test
- The optimal average item difficulty is approximately 50%, with items on the test ranging in difficulty from about 30% to 80%
- Omnibus Spiral Format: items in an ability test are arranged in increasing difficulty
- Item-Reliability Index: provides an indication of the internal consistency of a test; the higher the Item-Reliability Index, the greater the test's internal consistency
- Item-Validity Index: designed to provide an indication of the degree to which a test measures what it purports to measure; the higher the Item-Validity Index, the greater the test's criterion-related validity
- Item-Discrimination Index: a measure of item discrimination; the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly
- Extreme Group Method: compares people who have done well with those who have done poorly
- Discrimination Index: the difference between these proportions
- Point-Biserial Method: correlation between a dichotomous variable and a continuous variable
- Item-Characteristic Curve: graphic representation of item difficulty and discrimination
- Guessing: a problem that has eluded any universally accepted solution
- Item analyses taken under speed conditions yield misleading or uninterpretable results
- Restrict item analysis on a speed test only to the items completed by the testtaker
- The test developer should ideally administer the test to be item-analyzed with generous time limits to complete the test
Scoring Items/Scoring Models
1. Cumulative Model – the testtaker obtains a measure of the level of the trait; thus, high scores may suggest a high level of the trait being measured
2. Class Scoring/Category Scoring – testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is similar in some way
3. Ipsative Scoring – compares a testtaker's score on one scale within a test to another scale within that same test; the two scales measure unrelated constructs
IV. Test Revision – characterize each item according to its strengths and weaknesses
- As revision proceeds, the advantage of writing a large item pool becomes more apparent, because some items are removed and must be replaced by items from the item pool
- Administer the revised test under standardized conditions to a second appropriate sample of examinees
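The item-difficulty index and the extreme-group discrimination index described above reduce to simple proportions; a minimal sketch (function names are my own):

```python
def item_difficulty(n_correct, n_total):
    """Item-difficulty index: the proportion of testtakers answering the
    item correctly. The larger the value, the easier the item."""
    return n_correct / n_total

def discrimination_index(p_high, p_low):
    """Extreme group method: d = proportion correct among high scorers
    minus proportion correct among low scorers."""
    return p_high - p_low

print(item_difficulty(30, 60))          # 0.5 -> the optimal average difficulty
print(discrimination_index(0.75, 0.25)) # 0.5 -> the item favors high scorers
```

A negative d flags the suspect case noted earlier, where low scorers outperform high scorers on an item.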
- Cross-Validation: revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion; often results in validity shrinkage
- Validity Shrinkage: the decrease in item validities that inevitably occurs after cross-validation
- Co-Validation: conducted on two or more tests using the same sample of testtakers
- Co-Norming: the creation of norms or the revision of existing norms
- Anchor Protocol: a test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies
- Scoring Drift: a discrepancy between the scoring in an anchor protocol and the scoring of another protocol
- Differential Item Functioning: an item functions differently in one group of testtakers known to have the same level of the underlying trait
- DIF Analysis: test developers scrutinize group-by-group item response curves, looking for DIF items
- DIF Items: items that respondents from different groups at the same level of the underlying trait have different probabilities of endorsing as a function of their group membership
o Computerized Adaptive Testing – an interactive, computer-administered test-taking process wherein the items presented to the testtaker are based in part on the testtaker's performance on previous items
▪ the test administered may be different for each testtaker, depending on performance on the items presented
▪ reduces floor and ceiling effects
▪ Floor Effects: occur when there is some lower limit on a survey or questionnaire and a large percentage of respondents score near this lower limit (testtakers have low scores)
▪ Ceiling Effects: occur when there is some upper limit on a survey or questionnaire and a large percentage of respondents score near this upper limit (testtakers have high scores)
▪ Item Branching: the ability of the computer to tailor the content and order of presentation of items on the basis of responses to previous items
▪ Routing Test: a subtest used to direct or route the testtaker to a suitable level of items
▪ Item-Mapping Method: setting cut scores via a histographic representation of items and expert judgments regarding item effectiveness
o Basal Level – the level at which the minimum criterion number of correct responses is obtained
o Computer Assisted Psychological Assessment – standardized test administration is assured for testtakers, and variation is kept to a minimum
▪ test content and length are tailored according to the taker's ability
Statistics
o Measurement – the act of assigning numbers or symbols to characteristics of things according to rules
Descriptive Statistics – methods used to provide a concise description of a collection of quantitative information
Inferential Statistics – methods used to make inferences from observations of a small group of people, known as a sample, to a larger group of individuals, known as a population
o Magnitude – the property of "moreness"
o Equal Intervals – the difference between two points at any place on the scale has the same meaning as the difference between two other points that differ by the same number of scale units
o Absolute 0 – when nothing of the property being measured exists
o Scale – a set of numbers whose properties model empirical properties of the objects to which the numbers are assigned
Continuous Scale – takes on any value within the range, and the possible values within that range are infinite
- used to measure a variable which can theoretically be divided
Discrete Scale – can be counted; has distinct, countable values
- used to measure a variable which cannot theoretically be divided
o Error – refers to the collective influence of all the factors on a test score or measurement beyond those specifically measured by the test or measurement
▪ the degree to which the test score/measurement may be wrong, considering other factors like the state of the testtaker, the venue, the test itself, etc.
▪ measurement with a continuous scale always involves error
Four Levels of Scales of Measurement
Nominal – involves classification or categorization based on one or more distinguishing characteristics
- labels and categorizes observations but does not make any quantitative distinctions between observations
- mode
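The three scale properties defined above (magnitude, equal intervals, absolute zero) accumulate across the four levels of measurement; a small lookup table, my own arrangement of the outline's definitions, makes the progression explicit:

```python
# Which properties each level of measurement possesses, per the outline's
# definitions of magnitude, equal intervals, and absolute 0.
SCALE_PROPERTIES = {
    "nominal":  {"magnitude": False, "equal_intervals": False, "absolute_zero": False},
    "ordinal":  {"magnitude": True,  "equal_intervals": False, "absolute_zero": False},
    "interval": {"magnitude": True,  "equal_intervals": True,  "absolute_zero": False},
    "ratio":    {"magnitude": True,  "equal_intervals": True,  "absolute_zero": True},
}

print(SCALE_PROPERTIES["interval"]["absolute_zero"])  # False: no true zero point
print(SCALE_PROPERTIES["ratio"]["absolute_zero"])     # True: zero means none/null
```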
Ordinal - rank ordering on some characteristic is also permissible
- median
Interval - contains equal intervals; has no absolute zero point (even negative values have an interpretation)
- A zero value does not mean it represents none
Ratio - has a true zero point (if the score is zero, it means none/null)
- Easiest to manipulate
o Distribution – defined as a set of test scores arrayed for recording or study
o Raw Scores – straightforward, unmodified accounting of performance that is usually numerical
o Frequency Distribution – all scores are listed alongside the number of times each score occurred
o Independent Variable – the variable being manipulated in the study
o Quasi-Independent Variable – nonmanipulated variable used to designate groups
▪ Factor: for ANOVA
Post-Hoc Tests – used in ANOVA to determine which mean differences are significantly different
Tukey's HSD Test – allows the computation of a single value that determines the minimum difference between treatment means that is necessary for significance
o Measures of Central Tendency – statistics that indicate the average or midmost score between the extreme scores in a distribution
▪ Goal: identify the most typical or representative score of the entire group
Mean – the average of all the raw scores
- Equal to the sum of the observations divided by the number of observations
- Interval and ratio data (when normally distributed)
- Point of least squares
- Balance point for the distribution
Median – the middle score of the distribution
- Ordinal, interval, ratio
- Useful in cases where relatively few scores fall at the high end of the distribution or relatively few scores fall at the low end of the distribution
- In other words, for extreme scores (skewed distributions), use the median
- Identical for sample and population
- Also used when there is an unknown or undetermined score
- Used with "open-ended" categories (e.g., 5 or more, more than 8, at least 10)
- For ordinal data
Mode – most frequently occurring score in the distribution
- Bimodal Distribution: if there are two scores that occur with the highest frequency
- Not commonly used
- Useful in analyses of a qualitative or verbal nature
- For nominal scales, discrete variables
- Value of the mode gives an indication of the shape of the distribution as well as a measure of central tendency
o Variability – an indication of how scores in a distribution are scattered or dispersed
o Measures of Variability – statistics that describe the amount of variation in a distribution
o Range – equal to the difference between the highest and the lowest score
▪ Provides a quick but gross description of the spread of scores
▪ When its value is based on extreme scores of the distribution, the resulting description of variation may be understated or overstated
o Quartiles – dividing points between the four quarters in the distribution
▪ Specific points
▪ Quarter: refers to an interval
▪ Interquartile Range: measure of variability equal to the difference between Q3 and Q1
▪ Semi-interquartile Range: equal to the interquartile range divided by 2
o Standard Deviation – equal to the square root of the average squared deviations about the mean
▪ Equal to the square root of the variance
▪ Variance: equal to the arithmetic mean of the squares of the differences between the scores in a distribution and their mean
▪ Distance from the mean
o Normal Curve – also known as the Gaussian curve
o Bell-shaped, smooth, mathematically defined curve that is highest at its center
o Asymptotic – approaches but never touches the axis
o Tail – 2-3 standard deviations above and below the mean
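The measures of central tendency and variability above can be computed directly with Python's standard library; a minimal sketch (the score list is hypothetical, chosen only for illustration):

```python
import statistics

# Hypothetical test scores (not from the reviewer)
scores = [82, 85, 85, 88, 90, 94, 97]

mean = statistics.mean(scores)      # sum of observations / number of observations
median = statistics.median(scores)  # middle score of the distribution
mode = statistics.mode(scores)      # most frequently occurring score

range_ = max(scores) - min(scores)      # highest score minus lowest score
variance = statistics.pvariance(scores) # mean squared deviation from the mean
sd = statistics.pstdev(scores)          # square root of the variance

print(mean, median, mode, range_, sd)
```

Note that `pvariance`/`pstdev` treat the data as the whole population; `variance`/`stdev` would apply the sample (n − 1) correction instead.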


o Skewed is associated with abnormal, perhaps because a skewed distribution deviates from the symmetrical, so-called normal distribution
o Symmetrical Distribution – the right side of the graph is a mirror image of the left side
▪ Has only one mode and it is in the center of the distribution
▪ Mean = Median = Mode
o Skewness – nature and extent to which symmetry is absent
o Positively Skewed – relatively few scores fall at the high end of the distribution
▪ Mean > Median > Mode
▪ The exam is difficult
▪ More items that were easier would have been desirable in order to better discriminate at the lower end of the distribution of test scores
o Negatively Skewed – relatively few of the scores fall at the low end of the distribution
▪ Mean < Median < Mode
▪ The exam is easy
▪ More items of a higher level of difficulty would make it possible to better discriminate between scores at the upper end of the distribution
o Kurtosis – steepness of a distribution in its center
Platykurtic – relatively flat
Leptokurtic – relatively peaked
Mesokurtic – somewhere in the middle
▪ High kurtosis = high peak and fatter tails
▪ Low kurtosis = rounded peak and thinner tails
o Standard Score – a raw score that has been converted from one scale to another scale
o Z-Scores – result from the conversion of a raw score into a number indicating how many SD units the raw score is below or above the mean of the distribution
▪ Identify and describe the exact location of each score in a distribution
▪ Standardize an entire distribution
▪ Zero plus or minus one scale
▪ Can have negative values
▪ Requires that we know the value of the variance to compute the standard error
o T-Scores – a scale with a mean set at 50 and a standard deviation set at 10
▪ Fifty plus or minus ten scale
▪ A raw score 5 standard deviations below the mean would be equal to a T-score of 0
▪ A raw score that falls at the mean has a T of 50
▪ A raw score 5 standard deviations above the mean would be equal to a T of 100
▪ No negative values
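The z-score and T-score conversions described above can be sketched as follows (the raw scores are hypothetical and the helper names are illustrative):

```python
import statistics

raw_scores = [50, 60, 70, 80, 90]   # hypothetical raw scores
mean = statistics.mean(raw_scores)  # 70
sd = statistics.pstdev(raw_scores)  # ~14.14

def z_score(x, mean, sd):
    """How many SD units x is above (+) or below (-) the mean."""
    return (x - mean) / sd

def t_score(z):
    """Rescale z to a mean of 50 and SD of 10, which avoids negative values in practice."""
    return 50 + 10 * z

z = z_score(90, mean, sd)
print(round(z, 2), round(t_score(z), 1))
```

The same `50 + 10 * z` pattern generalizes to the other scales in the table that follows (stanine: 5 + 2z, IQ: 100 + 15z, and so on).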
▪ Used when the population variance is unknown
o Stanine – a method of scaling test scores on a nine-point standard scale with a mean of five (5) and a standard deviation of two (2)
o Linear Transformation – one that retains a direct numerical relationship to the original raw score
o Nonlinear Transformation – required when the data under consideration are not normally distributed
o Normalizing the distribution involves stretching the skewed curve into the shape of a normal curve and creating a corresponding scale of standard scores, a scale that is technically referred to as a Normalized Standard Score Scale
o Generally preferable to fine-tune the test according to difficulty or other relevant variables so that the resulting distribution will approximate the normal curve
o STEN – standard ten; divides a scale into 10 units

             Mean    SD
Z-Score         0     1
T-Score        50    10
Stanine         5     2
STEN          5.5     2
IQ            100    15
GRE or SAT    500   100

o Hypothesis Testing – statistical method that uses sample data to evaluate a hypothesis about a population
Alternative Hypothesis – states there is a change, difference, or relationship
Null Hypothesis – no change, no difference, or no relationship
o Alpha Level or Level of Significance – used to define the concept of "very unlikely" in a hypothesis test
o Critical Region – composed of extreme values that are very unlikely to be obtained if the null hypothesis is true
o If sample data fall in the critical region, the null hypothesis is rejected
o The alpha level for a hypothesis test is the probability that the test will lead to a Type I error
o Directional Hypothesis Test or One-Tailed Test – statistical hypotheses specify either an increase or a decrease in the population mean
o T-Test – used to test hypotheses about an unknown population mean and variance
▪ Can be used in "before and after" types of research
▪ The sample must consist of independent observations; that is, there is no consistent, predictable relationship between the first observation and the second
▪ The population that is sampled must be normal
▪ If the distribution is not normal, use a large sample
o Correlation Coefficient – number that provides us with an index of the strength of the relationship between two things
o Correlation – an expression of the degree and direction of correspondence between two things
▪ + and − = direction
▪ Number ranging from -1 to +1 = magnitude
▪ Positive – same direction, either both going up or both going down
▪ Negative – inverse direction: either the DV goes up and the IV goes down, or the IV goes up and the DV goes down
▪ 0 = no correlation
o Pearson r/Pearson Correlation Coefficient/Pearson Product-Moment Coefficient of Correlation – used when the two variables being correlated are continuous and linear
▪ Devised by Karl Pearson
▪ Coefficient of Determination – an indication of how much variance is shared by the X- and Y-variables
o Spearman Rho/Rank-Order Correlation Coefficient/Rank-Difference Correlation Coefficient – frequently used when the sample size is small and when both sets of measurements are ordinal
▪ Developed by Charles Spearman
o Outlier – extremely atypical point located at a relatively long distance from the rest of the coordinate points in a scatterplot
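Both coefficients can be computed in plain Python. A minimal sketch of Pearson r and a no-tie Spearman rho (the paired scores are hypothetical; real implementations additionally average the ranks of tied values):

```python
import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation for two continuous, linear variables."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Rank both variables, then apply Pearson r to the ranks (no-tie case)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    return pearson_r(ranks(x), ranks(y))

# Hypothetical paired scores
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 7, 8]
print(round(pearson_r(x, y), 3))  # positive: the variables rise together
r_squared = pearson_r(x, y) ** 2  # coefficient of determination: shared variance
```

Squaring r, as in the last line, gives the coefficient of determination mentioned above.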
o Regression Analysis – used for prediction
▪ Predicts the values of a dependent or response variable based on values of at least one independent or explanatory variable
▪ Residual: the difference between an observed value of the response variable and the value of the response variable predicted from the regression line
▪ The Principle of Least Squares
▪ Standard Error of Estimate: standard deviation of the residuals in regression analysis
▪ Slope: determines how much the Y variable changes when X is increased by 1 point
o T-Test (Independent) – comparison or determination of differences
▪ 2 different groups/independent samples + interval/ratio scales (continuous variables)
Equal Variance – the 2 groups' variances are equal
Unequal Variance – the groups' variances are unequal
o T-Test (Dependent)/Paired Test – one group, nominal (either matched or repeated measures) + 2 treatments
o One-Way ANOVA – 1 IV with 3 or more groups, 1 DV; comparison of differences
o Two-Way ANOVA – 2 IVs, 1 DV
o Critical Value – reject the null and accept the alternative if [obtained value > critical value]
o P-Value (Probability Value) – reject the null and accept the alternative if [p-value < alpha level]
o Norms – refer to the performances by defined groups on a particular test
o Age-Related Norms – certain tests have different normative groups for different age groups
o Tracking – tendency to stay at about the same level relative to one's peers
Norm-Referenced Tests – compare each person with the norm
Criterion-Referenced Tests – describe specific types of skills, tasks, or knowledge that the test taker can demonstrate
Selection of Assessment Methods and Tools and Uses, Benefits, and Limitations of Assessment Tools and Instruments (32)
Identify appropriate assessment methods, tools (2)
1. Test – measuring device or procedure
- Psychological Test: device or procedure designed to measure variables related to psychology
Ability or Maximal Performance Test – assesses what a person can do
a. Achievement Test – measurement of previous learning
b. Aptitude Test – refers to the potential for learning or acquiring a specific skill
c. Intelligence Test – refers to a person's general potential to solve problems, adapt to changing environments, think abstractly, and profit from experience
Human Ability – considerable overlap of achievement, aptitude, and intelligence tests
Typical Performance Test – measures usual or habitual thoughts, feelings, and behavior
Personality Test – measures individual dispositions and preferences
a. Structured Personality Tests – provide statements, usually self-report, and require the subject to choose between two or more alternative responses
b. Projective Personality Tests – unstructured; the stimulus or response is ambiguous
c. Attitude Tests – elicit personal beliefs and opinions
d. Interest Inventories – measure likes and dislikes as well as one's personality orientation toward the world of work
- Purpose: for evaluation, drawing conclusions about some aspects of a person's behavior, therapy, decision-making
- Settings: Industrial, Clinical, Educational, Counseling, Business, Courts, Research
- Population: Test Developers, Test Publishers, Test Reviewers, Test Users, Test Sponsors, Test Takers, Society
Levels of Tests
1. Level A – anyone, under the direction of a supervisor or consultant
2. Level B – psychometricians and psychologists only
3. Level C – psychologists only
2. Interview – method of gathering information through direct communication involving reciprocal exchange
- can be structured, unstructured, semi-structured, or non-directive
- Mental Status Examination: determines the mental status of the patient
- Intake Interview: determines why the client came for assessment; a chance to inform the client about the policies, fees, and process involved
- Social Case: biographical sketch of the client
- Employment Interview: determines whether the candidate is suitable for hiring
- Panel Interview (Board Interview): more than one interviewer participates in the assessment
- Motivational Interview: used by counselors and clinicians to gather information about some problematic behavior, while simultaneously attempting to address it therapeutically
3. Portfolio – samples of one's ability and accomplishment
- Purpose: usually in industrial settings, for evaluation of future performance
4. Case History Data – refers to records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee
5. Behavioral Observation – monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions
- Naturalistic Observation: observing humans in a natural setting
6. Role Play – defined as acting an improvised or partially improvised part in a simulated situation
- Role Play Test: assessees are directed to act as if they are in a particular situation
- Purpose: Assessment and Evaluation
- Settings: Industrial, Clinical
- Population: Job Applicants, Children
7. Computers – using technology to assess a client; thus, computers can serve as test administrators and very efficient test scorers
8. Others: videos, biofeedback devices
Intelligence Tests
Stanford-Binet Intelligence Scale 5th Ed. (SB-5)
[C]
- 2-85 years old
- individually administered
- norm-referenced
- Scales: Verbal, Nonverbal, and Full Scale (FSIQ)
- Nonverbal and Verbal Cognitive Factors: Fluid Reasoning, Knowledge, Quantitative Reasoning, Visual-Spatial Processing, Working Memory
- age-scale and point-scale format
- originally created to identify mentally disabled children in Paris
- the 1908 scale introduced the Age Scale format and Mental Age
- the 1916 scale significantly applied the IQ concept
- Standard Scores: 100 (mean), 15 (SD)
- Scaled Scores: 10 (mean), 3 (SD)
- co-normed with the Bender-Gestalt and Woodcock-Johnson Tests
- based on the Cattell-Horn-Carroll Model of General Intellectual Ability
- no accommodations for PWDs
- 2 routing tests
- with teaching items, floor level, and ceiling level
- provides behavioral observations during administration
Wechsler Intelligence Scales (WAIS-IV, WPPSI-IV, WISC-V)
[C]
- WAIS (16-90 years old), WPPSI (2-6 years old), WISC (6-16 years old)
- individually administered
- norm-referenced
- Standard Scores: 100 (mean), 15 (SD)
- Scaled Scores: 10 (mean), 3 (SD)
- addresses weaknesses in the Stanford-Binet
- can also assess functioning in people with brain injury
- evaluates patterns of brain dysfunction
- yields FSIQ, Index Scores (Verbal Comprehension, Perceptual Reasoning, Working Memory, and Processing Speed), and subtest-level scaled scores
Raven's Progressive Matrices (RPM)
[B]
- 4-90 years old
- nonverbal test
- used to measure general intelligence & abstract reasoning
- multiple choice, abstract reasoning
- group test
- IRT-based
Culture Fair Intelligence Test (CFIT)
[B]
- nonverbal instrument to measure analytical and reasoning ability in abstract and novel situations
- measures individual intelligence in a manner designed to reduce, as much as possible, the influence of culture
- individual or by group
- aids in the identification of learning problems and helps in making more reliable and informed decisions in relation to the special education needs of children
Purdue Non-Language Test
[B]
- designed to measure mental ability; consists entirely of geometric forms
- culture-fair
- self-administering
Panukat ng Katalinuhang Pilipino
- basis for screening, classifying, and identifying needs that will enhance the learning process
- in business, it is utilized as a predictor of occupational achievement by gauging an applicant's ability and fitness for a particular job

- essential for determining one's capacity to handle the challenges associated with certain degree programs
- Subtests: Vocabulary, Analogy, Numerical Ability, Nonverbal Ability
Wonderlic Personnel Test (WPT)
- assesses the cognitive ability and problem-solving aptitude of prospective employees
- multiple choice, answered in 12 minutes
Armed Services Vocational Aptitude Battery
- most widely used aptitude test in the US
- multiple-aptitude battery that measures developed abilities and helps predict future academic and occupational success in the military
Kaufman Assessment Battery for Children-II (KABC-II)
- Alan & Nadeen Kaufman
- for assessing cognitive development in children
- 3 to 18 years old
Personality Tests
Minnesota Multiphasic Personality Inventory (MMPI-2)
[C]
- multiphasic personality inventory intended for use with both clinical and normal populations to identify sources of maladjustment and personal strengths
- Starke Hathaway and J. Charnley McKinley
- helps in diagnosing mental health disorders, distinguishing normal from abnormal
- can be administered to someone who shows no guilt feelings for committing a crime
- individual or by groups
- Clinical Scales: Hypochondriasis, Depression, Hysteria, Psychopathic Deviate, Masculinity/Femininity, Paranoia, Psychasthenia (Anxiety, Depression, OCD), Schizophrenia, Hypomania, Social Introversion
- Lie Scale (L Scale): items that are somewhat negative but apply to most people; assesses the likelihood of the test taker approaching the instrument with a defensive mindset
- High L scale = faking good
- High F scale = faking bad, severe distress, or psychopathology
- Superlative Self-Presentation Scale (S Scale): a measure of defensiveness; checks whether you intentionally distort answers to look better
- Correction Scale (K Scale): reflection of the frankness of the testtaker's self-report
- K Scale: reveals a person's defensiveness around certain questions and traits; also faking good
- the K scale is sometimes used to correct scores on five clinical scales; the scores are statistically corrected for an individual's overwillingness or unwillingness to admit deviance
- "Cannot Say" (CNS) Scale: measures the items a person does not answer
- High ? Scale: the client might have difficulties with reading, psychomotor retardation, or extreme defensiveness
- True Response Inconsistency (TRIN): five true, then five false answers
- Varied Response Inconsistency (VRIN): random true or false
- Infrequency-Psychopathology Scale (Fp): reveals intentional or unintentional over-reporting
- FBS Scale: "symptom validity scale" designed to detect intentional over-reporting of symptoms
- Back Page Infrequency (Fb): reflects significant change in the testtaker's approach to the latter part of the test
Myers-Briggs Type Indicator (MBTI)
- Katharine Cook Briggs and Isabel Briggs Myers
- self-report inventory designed to identify a person's personality type, strengths, and preferences
- Extraversion-Introversion Scale: where you prefer to focus your attention and energy, on the outer world of external events or your inner world of ideas and experiences
- Sensing-Intuition Scale: how you take in information, whether you take it in as it is or focus on interpreting and adding meaning to it
- Thinking-Feeling Scale: how you make decisions, logically or by following what your heart says
- Judging-Perceiving Scale: how you orient to the outer world; your style in dealing with the outer world, whether you get things decided or stay open to new information and options
Edwards Personal Preference Schedule (EPPS)
[B]
- designed primarily as an instrument for research and counseling purposes, to provide quick and convenient measures of a number of relatively normal personality variables
- based on Murray's Need Theory
- objective, forced-choice inventory for assessing the relative importance that an individual places on 15 personality variables
- useful in personal counseling and with non-clinical adults
- individual
Guilford-Zimmerman Temperament Survey (GZTS)
- items are stated affirmatively rather than in question form, using the 2nd person pronoun
- measures 10 personality traits: General Activity, Restraint, Ascendance, Sociability, Emotional Stability, Objectivity, Friendliness, Thoughtfulness, Personal Relations, Masculinity
NEO Personality Inventory (NEO-PI-R)
- standard questionnaire measure of the Five Factor Model; provides systematic assessment of emotional, interpersonal, experiential, attitudinal, and motivational styles
- gold standard for personality assessment
- self-administered
- Neuroticism: identifies individuals who are prone to psychological distress
- Extraversion: quantity and intensity of energy directed outward
- Openness to Experience: active seeking and appreciation of experiences for their own sake
- Agreeableness: the kind of interactions an individual prefers, from compassion to tough-mindedness
- Conscientiousness: degree of organization, persistence, control, and motivation in goal-directed behavior
Panukat ng Ugali at Pagkatao/Panukat ng Pagkataong Pilipino
- indigenous personality test
- taps specific values, traits, and behavioral dimensions related or meaningful to the study of Filipinos
Sixteen Personality Factor Questionnaire
- Raymond Cattell
- constructed through factor analysis
- evaluates personality on two levels of traits
- Primary Scales: Warmth, Reasoning, Emotional Stability, Dominance, Liveliness, Rule-Consciousness, Social Boldness, Sensitivity, Vigilance, Abstractedness, Privateness, Apprehension, Openness to Change, Self-Reliance, Perfectionism, Tension
- Global Scales: Extraversion, Anxiety, Tough-Mindedness, Independence, Self-Control
Big Five Inventory-II (BFI-2)
- Soto & John
- assesses the Big 5 domains and 15 facets
- free for noncommercial purposes to researchers and students
Projective Tests
Rorschach Inkblot Test
[C]
- Hermann Rorschach
- 5 years old and above
- subjects look at 10 ambiguous inkblot images and describe what they see in each one
- once used to diagnose mental illnesses like schizophrenia
- Exner System: coding system used in this test
- Content: the name or class of objects used in the patient's responses
Content:
1. Nature
2. Animal Feature
3. Whole Human
4. Human Feature
5. Fictional/Mythical Human Detail
6. Sex
Determinants:
1. Form
2. Movement
3. Color
4. Shading
5. Pairs and Reflections
Location:
1. W – the whole inkblot was used to depict an image
2. D – a commonly described part of the blot was used
3. Dd – an uncommonly described or unusual detail was used
4. S – the white space in the background was used
Thematic Apperception Test
[C]
- Christiana Morgan and Henry Murray
- 5 and above
- 31 picture cards serve as stimuli for stories and descriptions about relationships or social situations
- popularly known as the picture interpretation technique because it uses a standard series of provocative yet ambiguous pictures about which the subject is asked to tell a story
- also modified for African American testtakers
Children's Apperception Test
- Bellak & Bellak
- 3-10 years old
- based on the idea that animals engaged in various activities were useful in stimulating projective storytelling by children
Hand Test
- Edward Wagner
- used to measure action tendencies, particularly acting out and aggressive behavior, in adults and children
- 10 cards (1 blank)
Apperceptive Personality Test (APT)
- Holmstrom et al.
- an attempt to address the criticisms of the TAT
- introduced objectivity in the scoring system
- 8 cards include males and females of different ages and minority group members
- testtakers respond to a series of multiple-choice questions after storytelling
Word Association Test (WAT)
- Rapaport et al.
- presentation of a list of stimulus words; the assessee responds verbally or in writing with the first thing that comes to mind
Rotter Incomplete Sentences Blank (RISB)
- Julian Rotter & Janet Rafferty
- Grade 9 to adulthood
- most popular SCT
Sacks' Sentence Completion Test (SSCT)
- Joseph Sacks and Sidney Levy
- 12 years old and older
- asks respondents to complete 60 items with the first thing that comes to mind across four areas: Family, Sex, Interpersonal Relationships, and Self-Concept
Bender-Gestalt Visual Motor Test
[C]
- Lauretta Bender
- 4 years and older
- consists of a series of durable template cards, each displaying a unique figure; examinees are asked to draw each figure as they observe it
- provides interpretative information about an individual's development and neuropsychological functioning
- reveals the maturation level of visuomotor perception, which is associated with language ability and various functions of intelligence
House-Tree-Person Test (HTP)
- John Buck and Emmanuel Hammer
- 3 years and up
- measures aspects of a person's personality through interpretation of drawings and responses to questions
- can also be used to assess brain damage and general mental functioning
- measures the person's psychological and emotional functioning
- The house reflects the person's experience of their immediate social world
- The tree is a more direct expression of the person's emotional and psychological sense of self
- The person is a more direct reflection of the person's sense of self
Draw-A-Person Test (DAP)
- Florence Goodenough
- 4 to 10 years old
- a projective drawing task that is often utilized in psychological assessments of children
- aspects such as the size of the head, placement of the arms, and even details such as whether teeth were drawn are thought to reveal a range of personality traits
- helps people who have anxieties about taking tests (no strict format)
- can assess people with communication problems
- relatively culture-free
- allows for self-administration
Kinetic Family Drawing
- Burns & Kaufman
- derived from Hulse's FDT; the family is drawn "doing something"
Clinical & Counseling Tests
Millon Clinical Multiaxial Inventory-IV (MCMI-IV)
- Theodore Millon
- 18 years old and above
- for diagnosis and treatment of personality disorders
- exaggeration of polarities results in maladaptive behavior
- Pleasure-Pain: the fundamental evolutionary task
- Active-Passive: one adapts to the environment or adapts the environment to one's self
- Self-Others: investing in others versus investing in oneself
Beck Depression Inventory (BDI-II)
- Aaron Beck
- 13 to 80 years old
- 21-item self-report that taps Major Depressive symptoms according to the criteria in the DSM
MacAndrew Alcoholism Scale (MAC & MAC-R)
- from the MMPI-2
- personality & attitude variables thought to underlie alcoholism

California Psychological Inventory (CPI-III)
- attempts to evaluate personality in normally adjusted individuals
- has validity scales that determine faking bad and faking good
- interpersonal style and orientation, normative orientation and values, cognitive and intellectual function, and role and personal style
- has special purpose scales, such as managerial potential, work orientation, creative temperament, leadership potential, amicability, law enforcement orientation, tough-mindedness
Rosenberg Self-Esteem Scale
- measures global feelings of self-worth
- 10-item, 4-point Likert scale
- used with adolescents
Dispositional Resilience Scale (DRS)
- measures psychological hardiness, defined as the ability to view stressful situations as meaningful, changeable, and challenging
Ego Resiliency Scale-Revised
- measures ego resiliency or emotional intelligence
HOPE Scale
- developed by Snyder
- Agency: cognitive model with goal-driven energy
- Pathway: capacity to construct routes to meet goals
- good measure of hope for traumatized people
- positively correlated with healthy psychological adjustment, high achievement, good problem-solving skills, and positive health-related outcomes
Satisfaction with Life Scale (SWLS)
- overall assessment of life satisfaction as a cognitive judgmental process
Positive and Negative Affect Schedule (PANAS)
- measures the level of positive and negative emotions a test taker has during the test administration
Strengths and weaknesses of assessment tools (2)
Test
- Pros: can gather a sample of behavior objectively with lesser bias; flexible, can be verbal or nonverbal
- Cons: in crisis situations, when relatively rapid decisions need to be made, it can be impractical to take the time required to administer and interpret tests
Interview
- Pros: can take note of verbal and nonverbal cues; flexible; time- and cost-effective; both structured and unstructured formats allow clinicians to place responses in a wider, more meaningful context; can also be used to help predict future behaviors; allows clinicians to establish rapport and encourage client self-exploration
- Cons: sometimes, due to negligence of the interviewer and interviewee, important information can be missed; the interviewer's effect on the interviewee; various errors such as the halo effect, primacy effect, etc.; interrater reliability; interviewer bias
Portfolio
- Pros: provides a comprehensive illustration of the client which highlights strengths and weaknesses
- Cons: can be very demanding; time-consuming
Observation
- Pros: flexible; suitable for subjects that cannot be studied in a lab setting; more realistic; affordable; can detect patterns
- Cons: for private practitioners, it is typically not practical or economically feasible to spend hours out of the consulting room observing clients as they go about their daily lives; lack of scientific control, ethical considerations, and potential for bias from observers and subjects; unable to draw cause-and-effect conclusions; lack of control; lack of validity; observer bias
Case History
- Pros: can fully show the experience of the observer in the program; sheds light on an individual's past and current adjustment as well as on the events and circumstances that may have contributed to any changes in adjustment
- Cons: cannot be used to generalize a phenomenon
Role Play
- Pros: encourages individuals to come together to find solutions and to get to know how their colleagues think; the group can discuss ways to potentially resolve the situation, and participants leave with as much information as possible, resulting in more efficient handling of similar real-life scenarios
- Cons: may not be as useful as the real thing in all situations; time-consuming; expensive; inconvenient to assess in a real situation; while some employees will be comfortable role playing, others are less adept at getting into the required mood needed to actually replicate a situation
Test Administration, Scoring, Interpretation and Usage (20)
Detect Errors and Impacts in Tests
Issues in Intelligence Testing
1. Flynn Effect – progressive rise in intelligence scores that is expected to occur on a normed intelligence test from the date when the test was first normed
▪ Gradual increase in the general intelligence among newborns
▪ Frog Pond Effect: theory that individuals evaluate themselves as worse when in a group of high-performing individuals
2. Culture Bias of Testing
▪ Culture-Free: attempt to eliminate culture so nature can be isolated
▪ Impossible to develop because culture is evident in its influence since the birth of an individual, and the interaction between nature and nurture is cumulative, not relative
▪ Culture Fair: minimize the influence of culture with regard to various aspects of the evaluation procedures
▪ Fair to all, fair to some cultures, fair only to one culture
▪ Culture Loading: the extent to which a test incorporates the vocabulary, concepts, traditions, knowledge, etc. of a particular culture
Errors: Reliability
o Classical Test Theory (True Score Theory) – a score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also an error score, and thus the reliability
▪ Error: refers to the component of the observed test score that does not have to do with the testtaker’s ability
▪ Errors of measurement are random
▪ The greater the number of items, the higher the reliability
▪ Factors that contribute to inconsistency: characteristics of the individual, test, or situation, which have nothing to do with the attribute being measured but still affect the scores
o Error Variance – variance from irrelevant random sources
Measurement Error – all of the factors associated with the process of measuring some variable, other than the variable being measured
- difference between the observed score and the true score
- Positive: can increase one’s score
- Negative: can decrease one’s score
- Sources of Error Variance:
a. Item Sampling/Content Sampling
b. Test Administration
c. Test Scoring and Interpretation
Random Error – source of error in measuring a targeted variable, caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process (e.g., noise, temperature, weather)
Systematic Error – source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
- has a consistent effect on the true score
- the SD does not change, but the mean does
▪ Error variance may increase or decrease a test score by varying amounts, so the consistency of the test can be affected
Test-Retest Reliability
Error: Time Sampling
- the longer the time that passes, the greater the likelihood that the reliability coefficient will be insignificant
- Carryover Effects: happen when the test-retest interval is short, wherein the second test is influenced by the first because testtakers remember or have practiced the previous test = inflated correlation/overestimation of reliability
- Practice Effect: scores on the second session are
higher due to their experience of the first session of testing
- a test-retest with a longer interval might be affected by other extraneous factors, thus resulting in a low correlation
- target time for the next administration: at least two weeks
Parallel Forms/Alternate Forms Reliability
Error: Item Sampling (immediate), Item Sampling changes over time (delayed)
- Counterbalancing: technique to avoid carryover effects for parallel forms, by using a different sequence for groups
- most rigorous and burdensome, since test developers create two forms of the test
- main problem: difference between the two tests
- test scores may be affected by motivation, fatigue, or intervening events
- create a large set of questions that address the same construct, then randomly divide the questions into two sets
Internal Consistency (Inter-Item Reliability)
Error: Item Sampling Homogeneity
Split-Half Reliability
Error: Item Sampling: Nature of Split
Inter-Scorer Reliability
Error: Scorer Differences
o Standard Error of Measurement – provides a measure of the precision of an observed test score
▪ Standard deviation of errors as the basic measure of error
▪ Index of the amount of inconsistency, or the amount of expected error, in an individual’s score
▪ Allows us to quantify the extent to which a test provides accurate scores
▪ Provides an estimate of the amount of error inherent in an observed score or measurement
▪ Higher reliability, lower SEM
▪ Used to estimate or infer the extent to which an observed score deviates from a true score
▪ Also known as the Standard Error of a Score
▪ Confidence Interval: a range or band of test scores that is likely to contain the true score
o Standard Error of the Difference – can aid a test user in determining how large a difference should be before it is considered statistically significant
o Standard Error of Estimate – refers to the standard error of the difference between the predicted and observed values
o Four Possible Hit and Miss Outcomes:
1. True Positives (Sensitivity) – predicted success that does occur
2. True Negatives (Specificity) – predicted failure that does occur
3. False Positives (Type 1) – predicted success that does not occur
4. False Negatives (Type 2) – predicted failure, but success occurs
Errors due to Behavioral Assessment
1. Reactivity – when evaluated, the behavior increases
- Hawthorne Effect
2. Drift – moving away from what one has learned toward idiosyncratic definitions of behavior
- subjects should be retrained at some point in time
- Contrast Effect: cognitive bias that distorts our perception of something when we compare it to something else, by enhancing the differences between them
3. Expectancies – tendency for results to be influenced by what test administrators expect to find
- Rosenthal/Pygmalion Effect: the test administrator’s expected results influence the result of the test
- Golem Effect: negative expectations decrease one’s performance
4. Rating Errors – intentional or unintentional misuse of the scale
- Leniency Error: rater is lenient in scoring (Generosity Error)
- Severity Error: rater is strict in scoring
- Central Tendency Error: the rater’s ratings tend to cluster in the middle of the rating scale
- Halo Effect: tendency to give a high score due to failure to discriminate among conceptually distinct and potentially independent aspects of a ratee’s behavior
- snap judgement on the basis of a positive trait
- Horn Effect: opposite of the Halo Effect
- One way to overcome rating errors is to use rankings
5. Fundamental Attribution Error – tendency to explain someone’s behavior based on internal factors such as personality or disposition, and to underestimate the influence that external factors (the situation) have on another person’s behavior
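The reliability and decision quantities named above (SEM, confidence interval, Standard Error of the Difference, split-half reliability, and sensitivity/specificity) have standard closed-form expressions in classical test theory, though the outline does not print the formulas. A minimal Python sketch; all numbers (SD = 15, reliability = .91, the outcome counts) are hypothetical, chosen only to illustrate the calculations:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard Error of Measurement: SD * sqrt(1 - r). Higher reliability -> lower SEM."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(observed: float, sd: float, reliability: float, z: float = 1.96):
    """Band of scores likely to contain the true score (95% for z = 1.96)."""
    margin = z * sem(sd, reliability)
    return observed - margin, observed + margin

def sed(sd: float, r1: float, r2: float) -> float:
    """Standard Error of the Difference between two test scores: SD * sqrt(2 - r1 - r2)."""
    return sd * math.sqrt(2 - r1 - r2)

def spearman_brown(r_half: float, n: float = 2) -> float:
    """Project a split-half correlation up to full-test reliability."""
    return n * r_half / (1 + (n - 1) * r_half)

def sensitivity_specificity(tp: int, tn: int, fp: int, fn: int):
    """Hit rates from the four outcomes: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical IQ-style test: SD = 15, reliability = .91
print(round(sem(15, 0.91), 2))                 # 4.5
print(confidence_interval(110, 15, 0.91))      # about (101.2, 118.8)
print(round(spearman_brown(0.80), 3))          # 0.889
print(sensitivity_specificity(40, 45, 5, 10))  # (0.8, 0.9)
```

Note how the SEM example makes the "higher reliability, lower SEM" bullet concrete: with r = .91 the 95% band around an observed 110 spans roughly ±9 points.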
- Barnum Effect: people tend to accept vague personality descriptions as accurate descriptions of themselves (Aunt Fanny Effect)
o Bias – a factor inherent in a test that systematically prevents accurate, impartial measurement
▪ Prejudice, preferential treatment
▪ Prevented during test development through a procedure called Estimated True Score Transformation
Ethical Principles and Standards of Practice (19)
o If mistakes were made, psychologists should do something to correct or minimize them
o If an ethical violation made by another psychologist is witnessed, they should resolve the issue through informal resolution, as long as it does not violate any confidentiality rights that may be involved
o If informal resolution is not enough or appropriate, referral to state or national committees on professional ethics, state licensing boards, or the appropriate institutional authorities can be done. Still, the confidentiality rights of the professional in question must be kept.
o Failure to cooperate in an ethics investigation is itself an ethics violation, unless they request deferment of adjudication of an ethics complaint
o Psychologists must file complaints responsibly by checking the facts about the allegations
o Psychologists DO NOT deny persons employment, advancement, admissions, tenure, or promotion based solely upon their having made, or their being the subject of, an ethics complaint
▪ Just because they are questioned by the ethics committee or involved in an ongoing ethics investigation does not mean they may be discriminated against or denied advancement
▪ Unless the outcome of the proceedings is already considered
o Psychologists should provide services within the boundaries of their competence, which is based on the amount of training, education, experience, or consultation they have had
o When they are tasked to provide services to clients who are deprived of mental health services (e.g., communities far from urban cities) but have not yet obtained the needed competence for the job, they can still provide services AS LONG AS they make a reasonable effort to obtain the competence required, just to ensure that services are not denied to those communities
o During emergencies, psychologists provide services to individuals even though they have yet to complete the competency/training needed, just to ensure that services are not denied. However, the services are discontinued once the appropriate services are available
o Psychologists should discuss the limits of confidentiality, and the uses of the information that would be generated from the services, with the persons and organizations with whom they establish scientific or professional relationships
o Before recording voices or images, they must first obtain permission from all persons involved or their legal representatives
o Only discuss confidential information with persons clearly concerned/involved with the matters
o Disclosure is allowed with appropriate consent
▪ Disclosure without consent is not allowed UNLESS mandated by law
o No disclosure of confidential information that could lead to the identification of a client unless they have obtained prior consent or the disclosure cannot be avoided
▪ Only disclose necessary information
o Exemptions to disclosure:
✓ If the client is disguised/identity is protected
✓ Has consent
✓ Legally mandated
o Psychologists can create public statements as long as they take responsibility for them
▪ They cannot compensate employees of the media in return for publicity in a news item
▪ Paid advertisements must be clearly recognizable
▪ When commenting publicly via the internet, media, etc., they must ensure that their statements are based on their professional knowledge, in accord with appropriate psych literature and practice, consistent with ethics, and do not indicate that a professional relationship has been established with the recipient
o Must provide accurate information and obtain approval prior to conducting research
o Informed consent is required, which includes:
✓ Purpose of the research
✓ Duration and procedures
✓ Right to decline and withdraw
✓ Consequences of declining or withdrawing
✓ Potential risks, discomfort, or adverse effects
✓ Benefits
✓ Limits of confidentiality
✓ Incentives for participation
✓ Researcher’s contact information
o Permission for recording images or voices is needed unless the research consists solely of naturalistic
observations in public places, or the research design includes deception
▪ Consent must be obtained during debriefing
o Dispensing with or omitting informed consent is allowed only when:
1. The research would not create distress or harm
▪ Study of normal educational practices conducted in educational settings
▪ Anonymous questionnaires, naturalistic observation, archival research
▪ Confidentiality is protected
2. Permitted by law
o Avoid offering excessive incentives for research participation that could coerce participation
o DO NOT conduct a study that involves deception unless the use of deceptive techniques in the study has been justified
▪ Must be discussed as early as possible and not at the conclusion of data collection
o They must give participants the opportunity to learn about the nature, results, and conclusions of the research and make sure that there are no misconceptions about the research
o Must ensure the safety of animal subjects and minimize their discomfort, infection, illness, and pain
▪ If pain is involved, procedures must be justified and be as minimal as possible
▪ During termination, they must do it rapidly and minimize the pain
o Must not present portions of another’s work or data as their own
▪ Must take responsibility and credit, including authorship credit, only for work they have actually performed or to which they have substantially contributed
▪ Faculty advisors discuss publication credit with students as early as possible
o After publishing, they should not withhold data from other competent professionals who intend to reanalyze the data
▪ Shared data must be used only for the declared purpose
o RA 9258 – Guidance and Counseling Act of 2004
o RA 9262 – Violence Against Women and Children
o RA 7610 – Child Abuse
o RA 9165 – Comprehensive Dangerous Drugs Act of 2002
o RA 11469 – Bayanihan to Heal as One Act
o RA 7277 – Magna Carta for Disabled Persons
o RA 11210 – Expanded Maternity Leave Law
o RA 11650 – Inclusive Education Law
o RA 10173 – Data Privacy Act
o House Bill 4982 – SOGIE Bill
o Art. 12 of Revised Penal Code – Insanity Plea
end

congratulations on reaching the end of this reviewer!! i hope u learned something!! :D

one day, we will be remembered.

- aly <3
Hi :) this reviewer is FREE! u can share it with others but never sell it okay? let’s help each other <3 -aly