Lecture 3 (Scales of Measurement)


Variables, Scales of Measurement, and Goodness of Measures

VARIABLES
• Definition: Variables are properties or characteristics of people or things that vary
in quality or magnitude from person to person or object to object (Miller &
Nicholson, 1976)
• Demographic characteristics
• Personality traits
• Communication styles or competencies
• Constructs
• To be a variable, a variable must vary (i.e., not be a constant); that is, it must take on different values, levels, intensities, or states
Definitions
• Variable: “any entity that can take on a variety of different values” (Wrench et al., 2008, p. 104)
• gender
• self-esteem
• managerial style
• stuttering severity
• Attributes, values, and levels are the variations in a variable
• Attribute: political party
• Values: Democrat, Republican, Independent, etc.
• Attribute: self-esteem
• Levels: high, medium, low
Independent Variable
• the variable that is manipulated either by the researcher or by nature or
circumstance
• independent variables are also called “stimulus,” “input,” or “predictor” variables
• analogous to the “cause” in a cause-effect relationship
“operationalization” of the independent variable
• Operationalization: translating an abstract concept into a tangible, observable form in an experiment
• Operationalization can include:
• variations in stimulus conditions (public schools versus home schooling)
• variations in levels or degrees (mild vs. moderate vs. strong fear appeals)
• variations based on standardized scales or diagnostic instruments (low vs. high self-esteem scores)
• variations in “intact” or “self-selected” groups (smokers vs. non-smokers)
varieties and types of variables
• Discrete variables
• Nominal variables: distinct, mutually exclusive categories
• religions: Christians, Muslims, Jews, etc.
• occupations: truck driver, teacher, engineer
• marital status: single, married, divorced
• Dichotomous variables: true/false, female/male, Democrat/Republican
• Ordered variables: mutually exclusive categories, but with an order, sequence, or hierarchy
• fall, winter, spring, summer
• K-6, junior high, high school, college
• Concrete versus abstract variables
• concrete: relatively fixed, unchanging
• gender
• ethnicity
• abstract: dynamic, transitory
• mood, emotion
• occupation
Continuous variables
• include constant increments or gradations, which can be arithmetically compared and contrasted
• IQ scores
• self-esteem scores
• age
• heart rate, blood pressure
• number of gestures
dependent variable
• a variable that is observed or measured, and that is influenced or changed by the
independent variable
• dependent variables are also known as “response” or “output” or “criterion” variables
• analogous to the “effect” in a cause-effect relationship

confounding variable
• also known as extraneous variables or intervening variables
• confounding variables “muddy the waters”
• alternative causal or contributory factors that unintentionally influence the results of an experiment but are not the subject of the study
mediating & moderating variables
• a.k.a. moderating, intervening, intermediary, or mediating variables
• a second or third variable that can increase or decrease the relationship between an independent and a dependent variable
• for example, whether listeners are persuaded more by the quality or
quantity of arguments is moderated by their degree of involvement in
an issue.
operationalization
• definition: the specific steps or procedures required to translate an abstract
concept into a concrete, testable variable
• example: high versus low self-esteem (split-half or top vs. bottom third?)
• example: on-line versus traditional classroom (how much e-learning constitutes an “on-line”
class?)

examples of operationalization
• credibility (high versus low)
• culture/ethnicity (self-report)
• type of speech therapy (in-clinic vs. at school, vs. at home)
• compliance-gaining strategy preferences (positive versus negative, self-benefit versus other benefit)
• “powerless” language style
• fear appeals (mild, moderate, strong)
• food server touch versus no touch
Scales of Measurement
SCALES
• A scale is a tool or mechanism by which individuals are distinguished as to how they differ from one another on the variables of interest to our study. The scale or tool could be a gross one, in the sense that it would only broadly categorize individuals on certain variables, or it could be a fine-tuned tool that would differentiate individuals on the variables with varying degrees of sophistication.
Measurement and Measurement Scales

• Measurement is the foundation of any scientific investigation
• Everything we do begins with the measurement of whatever it is we want to study
• Definition: measurement is the assignment of numbers to objects
• Example: when we use a personality test such as the EPQ (Eysenck Personality Questionnaire) to obtain a measure of Extraversion (“how outgoing someone is”), we are measuring that personality characteristic by assigning a number (a score on the test) to an object (a person)
Four Types of Measurement Scales
• Nominal
• Ordinal
• Interval
• Ratio
• The scales are distinguished on the relationships assumed to exist
between objects having different scale values
• The four scale types are ordered in that all later scales have all the
properties of earlier scales—plus additional properties
Nominal Scale
• Nominal scales are used for labeling variables, without any quantitative value. “Nominal” scales could simply be called “labels.”
• Not really a ‘scale’, because it does not scale objects along any dimension
• A good way to remember all of this is that “nominal” sounds a lot like “name,” and nominal scales are kind of like “names” or labels.
• Examples of nominal scales: gender, religion, political party, marital status
Ordinal Scale
• With ordinal scales, it is the order of the values that is important and significant, but the differences between them are not really known.
• Consider a 4-point happiness rating. We know that a #4 is better than a #3 or #2, but we don’t know, and cannot quantify, how much better it is. For example, is the difference between “OK” and “Unhappy” the same as the difference between “Very Happy” and “Happy”? We can’t say.
• Numbers are used to place objects in order
• But there is no information regarding the differences (intervals) between points on the scale
• Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, discomfort, etc.
• “Ordinal” is easy to remember because it sounds like “order,” and that’s the key with ordinal scales: it is the order that matters, but that’s all you really get from these. (A brief sketch of nominal versus ordinal data follows.)
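To make the distinction concrete, here is a minimal sketch in Python (an illustration, assuming the pandas library is installed; the data are hypothetical). An ordered categorical captures rank order but, like any ordinal scale, says nothing about the size of the gaps.

import pandas as pd

# Nominal: category labels with no inherent order
religion = pd.Categorical(["Christian", "Muslim", "Jewish"])

# Ordinal: the same machinery, but with an explicit order declared
happiness = pd.Categorical(
    ["Unhappy", "OK", "Very Happy", "Happy"],
    categories=["Unhappy", "OK", "Happy", "Very Happy"],
    ordered=True,
)

print(happiness.min(), "<", happiness.max())  # Unhappy < Very Happy
# Differences between the codes are NOT meaningful: the gap between
# "OK" and "Unhappy" is undefined on an ordinal scale.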
Interval Scale
• Interval scales are numeric scales in which we know not only the order,
but also the exact differences between the values.  
• The classic example of an interval scale is Celsius temperature because
the difference between each value is the same.  For example, the
difference between 60 and 50 degrees is a measurable 10 degrees, as is
the difference between 80 and 70 degrees.  Time is another good
example of an interval scale in which the increments are known,
consistent, and measurable.
• Interval scales not only tell us about order, but also about the value
between each item. The interval differences are meaningful
• Here’s the problem with interval scales: they don’t have a “true zero.”
 For example, there is no such thing as “no temperature.”  Without a
true zero, it is impossible to compute ratios. 
• Therefore, we can’t defend ratio relationships
True Zero… What Does It Mean?
• Essentially, if a scale can have a negative value, then it has no true zero point. 

If a level of measurement has a true zero point, then a value of 0 means you have nothing.
An interval scale (a scale where the difference or interval between the values is important,
as opposed to, for example, ordinal or ranked data, where the difference between 1st and
2nd is not necessarily the same as the difference between 4th and 5th), such as
temperature in °C, does not have a true zero point. This means that 0 °C is not the coldest temperature, as you can have a negative number of °C, and as such 20 °C is not twice as warm as 10 °C, since the scale doesn't start at zero.

A ratio scale is the same as an interval scale, but it has a true zero point. Some examples of this are mass, length, currency, and Kelvin: 2 kg, 2 m, $2, and 2 K are all twice the value of 1 kg, 1 m, $1, and 1 K. (The quick check below makes the temperature case concrete.)
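A quick arithmetic check of this point, as a sketch: the same two temperatures compared in Celsius (arbitrary zero) and in Kelvin (true zero).

def c_to_k(celsius: float) -> float:
    # Convert degrees Celsius to Kelvin; 0 K is absolute zero.
    return celsius + 273.15

print(20 / 10)                  # 2.0 -- naively "twice as warm"
print(c_to_k(20) / c_to_k(10))  # ~1.035 -- the actual ratio of temperatures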
Ratio Scale
• Ratio scales are the ultimate nirvana when it comes to measurement scales because
i. they tell us about the order,
ii. they tell us the exact value between units, AND
iii. they also have an absolute zero–which allows for a wide range of both descriptive and inferential statistics to be applied.
• Ratio scales provide a wealth of possibilities when it comes to statistical analysis. These variables can be meaningfully added, subtracted, multiplied, and divided (ratios). Central tendency can be measured by mode, median, or mean; measures of dispersion, such as standard deviation and coefficient of variation, can also be calculated from ratio scales. (A sketch of these statistics follows.)
[Image: a device that provides two examples of ratio scales (height and weight)]
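Here is a minimal sketch of those statistics (assuming NumPy and SciPy are installed; the weight data are hypothetical).

import numpy as np
from scipy import stats

weights_kg = np.array([62.0, 70.5, 70.5, 81.2, 95.0])  # hypothetical sample

mean = weights_kg.mean()
median = np.median(weights_kg)
mode = stats.mode(weights_kg, keepdims=False).mode
std = weights_kg.std(ddof=1)                 # sample standard deviation
cv = std / mean                              # coefficient of variation
ratio = weights_kg.max() / weights_kg.min()  # meaningful only with a true zero

print(f"mean={mean:.1f} median={median:.1f} mode={mode} "
      f"std={std:.1f} cv={cv:.2f} max/min={ratio:.2f}")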
• Now that we know the four different types of
scales that can be used to measure the
operationally defined dimensions and elements
of a variable, it is necessary to examine the
methods of scaling (that is, assigning numbers or
symbols) to elicit the attitudinal responses of
subjects toward objects, events, or persons.
• There are two main categories of attitudinal scales (not to be
confused with the four different types of scales)—the rating
scale and the ranking scale.
Rating scales
• have several response categories and are used to elicit
responses with regard to the object, event, or person
studied.
Ranking scales,
• on the other hand, make comparisons between or among
objects, events, or persons and elicit the preferred choices
and ranking among them.
RATING SCALES
The following rating scales are often used in organizational research:
1. Dichotomous scale
2. Category scale
3. Likert scale
4. Numerical scales
5. Semantic differential scale
6. Itemized rating scale
7. Fixed or constant sum rating scale
8. Stapel scale
9. Graphic rating scale
10. Consensus scale
RANKING SCALES
1. Paired Comparison
2. Forced Choice
3. Comparative Scale
Rating Scales
Dichotomous Scale
• The dichotomous scale is used to elicit a Yes or No answer, as in the example below.
Note that a nominal scale is used to elicit the response.
• Do you own a car? Yes No

Category Scale
• The category scale uses multiple items to elicit a single response as per the following
example. This also uses the nominal scale.
• Where in northern California do you reside?
• North Bay
• South Bay
• East Bay
• Peninsula
• Other
Likert Scale
• The Likert scale is designed to examine how strongly subjects agree or disagree with statements on a 5-point scale with the following anchors (a scoring sketch follows below):
Strongly Disagree   Disagree   Neither Agree Nor Disagree   Agree   Strongly Agree
        1               2                   3                 4            5
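A minimal scoring sketch: each respondent's 1-5 ratings on a set of Likert items are summed into a total attitude score. The data and the reverse-scored item are hypothetical.

# Hypothetical ratings on four Likert items per respondent
responses = {
    "respondent_1": [4, 5, 3, 4],
    "respondent_2": [2, 1, 2, 3],
}

# Reverse-code negatively worded items (hypothetically, the third item)
# so that a higher number always means stronger agreement.
REVERSED = {2}  # zero-based indices of reverse-scored items

for who, items in responses.items():
    scored = [6 - r if i in REVERSED else r for i, r in enumerate(items)]
    print(who, "total =", sum(scored))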

Semantic Differential Scale
• Several bipolar attributes are identified at the extremes of the scale, and respondents are asked to indicate their attitudes, on what may be called a semantic space, toward a particular individual, object, or event on each of the attributes.
• The bipolar adjectives used, for instance, would employ such terms as Good–Bad, Strong–Weak, Hot–Cold. The semantic differential scale is used to assess respondents’ attitudes toward a particular brand, advertisement, object, or individual. The responses can be plotted to obtain a good idea of their perceptions. This is treated as an interval scale.
Numerical Scale
• The numerical scale is similar to the semantic differential scale, with the
difference that numbers on a 5-point or 7-point scale are provided, with bipolar
adjectives at both ends, as illustrated below. This is also an interval scale.
• How pleased are you with your new real estate agent?
Extremely Pleased   7   6   5   4   3   2   1   Extremely Displeased
Itemized Rating Scale
• A 5-point or 7-point scale with anchors, as needed, is provided for each item, and the respondent states the appropriate number on the side of each item, or circles the relevant number against each item, as per the examples that follow. The responses to the items are then summated.
Fixed or Constant Sum Scale
• The respondents are here asked to distribute a given number of points across various items, as per the example below. This is more in the nature of an ordinal scale. (A small validation sketch follows the example.)
• In choosing a toilet soap, indicate the importance you attach to each of the following five aspects by allotting
points for each to total 100 in all.
• Fragrance —
• Color —
• Shape —
• Size —
• Texture of lather —
• Total points 100
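A small validation sketch for the constant-sum response above; the point allocation shown is hypothetical.

# Hypothetical allocation of 100 points across the five soap aspects
allocation = {
    "Fragrance": 30,
    "Color": 10,
    "Shape": 15,
    "Size": 15,
    "Texture of lather": 30,
}

total = sum(allocation.values())
assert total == 100, f"points must total 100, got {total}"

# The point totals yield a rank order of importance (ordinal in nature)
ranked = sorted(allocation, key=allocation.get, reverse=True)
print(ranked)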
Stapel Scale
This scale simultaneously measures both the direction and intensity of the attitude toward the items under study. The characteristic of interest to the study is placed at the center, with a numerical scale ranging, say, from +3 to –3 on either side of the item, as illustrated below. This gives an idea of how close or distant the individual response to the stimulus is. Since this does not have an absolute zero point, it is an interval scale.
• State how you would rate your supervisor’s abilities with respect to each of the characteristics mentioned below, by circling the appropriate number.
            +3                    +3                    +3
            +2                    +2                    +2
            +1                    +1                    +1
  Adopting Modern Technology   Product Innovation   Interpersonal Skills
            –1                    –1                    –1
            –2                    –2                    –2
            –3                    –3                    –3
Graphic Rating Scale
• A graphical representation helps the respondents to indicate on this scale their answers
to a particular question by placing a mark at the appropriate point on the line, as in the
following example. This is an ordinal scale, though the following example might appear
to make it look like an interval scale.
On a scale of 1 to 10, how would you rate your supervisor?
10  Excellent
 5  All right
 1  Very bad
This scale is easy to respond to. The brief descriptions on the scale points are meant to
serve as a guide in locating the rating rather than represent discrete categories.
The faces scale, which depicts faces ranging from smiling to sad, is also a graphic rating scale, used to obtain responses regarding people’s feelings with respect to some aspect, say, how they feel about their jobs.
Consensus Scale
• Scales are also developed by consensus, where a panel of judges selects
certain items, which in its view measure the relevant concept. The items
are chosen particularly based on their pertinence or relevance to the
concept. Such a consensus scale is developed after the selected items are
examined and tested for their validity and reliability.

• One such consensus scale is the Thurstone Equal Appearing Interval Scale,
where a concept is measured by a complex process followed by a panel of
judges. Using a pile of cards containing several descriptions of the concept,
a panel of judges offers inputs to indicate how close or not the statements
are to the concept under study. The scale is then developed based on the
consensus reached. However, this scale is rarely used for measuring
organizational concepts because of the time necessary to develop it.
• Let us say that there are four product lines and the manager
seeks information that would help decide which product line
should get the most attention.
• Let us now assume that 35% of the respondents choose the first product, 25% the second, and 20% each choose products three and four as important to them.
• The manager cannot then conclude that the first product is
the most preferred since 65% of the respondents did not
choose that product! Alternative methods used are the
paired comparisons, forced choice, and the comparative
scale, which are discussed below under Ranking Scales.
RANKING SCALES
• Ranking scales are used to tap preferences between two objects or among more than two objects or items (ordinal in nature). However, such ranking may not give definitive clues to some of the answers sought.
Paired Comparison
• The paired comparison scale is used when, among a small number of
objects, respondents are asked to choose between two objects at a
time. This helps to assess preferences.
• If, for instance, in the previous example, during the paired comparisons,
respondents consistently show a preference for product one over
products two, three, and four, the manager reliably understands which
product line demands his utmost attention.
• However, as the number of objects to be compared increases, so does the number of paired comparisons: with n objects there are n(n–1)/2 pairs. The greater the number of objects or stimuli, the greater the number of paired comparisons presented to the respondents, and the greater the respondent fatigue (see the sketch below).
• Hence paired comparison is a good method if the number of stimuli presented is small.
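A sketch of how the comparisons grow, using only Python's standard library; the product names are hypothetical placeholders.

from itertools import combinations

products = ["Product 1", "Product 2", "Product 3", "Product 4"]

pairs = list(combinations(products, 2))         # every unordered pair
print(len(pairs), "comparisons for 4 objects")  # 6

for n in (4, 8, 16):
    print(f"{n} objects -> {n * (n - 1) // 2} paired comparisons")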
Forced Choice
• The forced choice enables respondents to rank objects relative to one another, among the alternatives provided. This is easier for the respondents, particularly if the number of choices to be ranked is limited.
• Rank the following magazines that you would like to subscribe to in order of preference, assigning 1 to the most preferred choice and 4 to the least preferred.
•Geo —
•Express —
•SAMA —
•Aj —

Comparative Scale
The comparative scale provides a benchmark or a point of reference to assess attitudes toward the
current object, event, or situation under study. An example of the use of comparative scale follows.
• In a volatile financial environment, compared to stocks, how wise or useful is it to invest in Treasury bonds? Please circle the appropriate response.
More Useful        About the Same        Less Useful
  1       2              3               4       5
RELIABILITY
• The reliability of a measure indicates the extent to which it is without bias (error free) and hence ensures consistent measurement across time and across the various items in the instrument. In other words, the reliability of a measure is an indication of the stability and consistency with which the instrument measures the concept and helps to assess the “goodness” of a measure.
Stability of Measures
• The ability of a measure to remain the same over time, despite uncontrollable testing conditions or the state of the respondents themselves, is indicative of its stability and low vulnerability to changes in the situation. This attests to its “goodness” because the concept is stably measured, no matter when it is done.
Two tests of stability are
• test–retest reliability and
• parallel-form reliability.
Test–Retest Reliability
• The reliability coefficient obtained with a repetition of the same measure on a second
occasion is called test–retest reliability. That is, when a questionnaire containing
some items that are supposed to measure a concept is administered to a set of
respondents now, and again to the same respondents, say several weeks to 6 months
later, then the correlation between the scores obtained at the two different times
from one and the same set of respondents is called the test–retest coefficient. The higher it is, the better the test–retest reliability and, consequently, the stability of the measure across time. (A small computational sketch follows.)
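A small computational sketch (assuming NumPy): the test–retest coefficient is the Pearson correlation between the same respondents' scores at the two administrations. The scores are hypothetical.

import numpy as np

time_1 = np.array([12, 18, 25, 30, 22, 15])  # first administration
time_2 = np.array([14, 17, 27, 29, 20, 16])  # same respondents, weeks later

r = np.corrcoef(time_1, time_2)[0, 1]
print(f"test-retest reliability r = {r:.2f}")  # closer to 1 = more stable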
Parallel-Form Reliability
• When responses on two comparable sets of measures tapping the same construct
are highly correlated, we have parallel-form reliability. Both forms have similar items
and the same response format, the only changes being the wordings and the order or
sequence of the questions. What we try to establish here is the error variability
resulting from wording and ordering of the questions. If two such comparable forms
are highly correlated, we may be fairly certain that the measures are reasonably
reliable, with minimal error variance caused by wording, ordering, or other factors.
Internal Consistency of Measures
The internal consistency of measures is indicative of the homogeneity of the items in the measure that tap the construct. In other words, the items should “hang together” as a set and be capable of independently measuring the same concept, so that the respondents attach the same overall meaning to each of the items. This can be seen by examining whether the items and the subsets of items in the measuring instrument are highly correlated. Consistency can be examined through the inter-item consistency reliability and split-half reliability tests.
Inter-item Consistency Reliability
• This is a test of the consistency of respondents’ answers to all the items in a measure. To the degree that items are independent measures of the same concept, they will be correlated with one another. The most popular test of inter-item consistency reliability is Cronbach’s coefficient alpha (Cronbach’s alpha; Cronbach, 1946), which is used for multipoint-scaled items. The higher the coefficient, the better the measuring instrument. (A computational sketch follows.)
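A computational sketch of Cronbach's alpha (assuming NumPy), using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores); the response matrix is hypothetical.

import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    # scores: rows = respondents, columns = items
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 5 respondents x 4 multipoint-scaled items
data = np.array([
    [4, 5, 4, 4],
    [2, 1, 2, 3],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [1, 2, 1, 2],
])
print(f"Cronbach's alpha = {cronbach_alpha(data):.2f}")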
Split-Half Reliability
• Split-half reliability reflects the correlations between two halves of an
instrument. The estimates would vary depending on how the items in
the measure are split into two halves.
• The AP Psych exam, for example, can be checked this way: a test-taker’s scores on the odd-numbered questions are compared with the same test-taker’s scores on the even-numbered questions, and if the two scores are the same or similar, the test has a high degree of reliability.
• In short, a test is split into two halves (odds and evens); if the scores on the two halves are similar, then the test is reliable. (A sketch follows.)
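A sketch of split-half reliability (assuming NumPy): correlate odd-item and even-item half scores, then apply the Spearman-Brown correction (a common refinement, not mentioned above) to estimate full-length reliability. The data are hypothetical.

import numpy as np

# Hypothetical data: 5 respondents x 6 items
scores = np.array([
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 3, 2, 2],
    [3, 3, 4, 3, 3, 4],
    [5, 5, 5, 4, 5, 5],
    [1, 2, 1, 2, 2, 1],
])

odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5
even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6

r = np.corrcoef(odd_half, even_half)[0, 1]
full_length = 2 * r / (1 + r)  # Spearman-Brown: corrects for halved length
print(f"half-half r = {r:.2f}, corrected = {full_length:.2f}")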
• It should be noted that the consistency of the judgment of several
raters on how they view a phenomenon or interpret some responses
is termed interrater reliability, and should not be confused with the
reliability of a measuring instrument.
• Interrater reliability is especially relevant when the data are obtained through observations, projective tests, or unstructured interviews, all of which are liable to be subjectively interpreted. (One common statistic for two raters, Cohen’s kappa, is sketched below.)
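A sketch of one common two-rater statistic, Cohen's kappa (assuming scikit-learn is installed); the observers' category codes are hypothetical.

from sklearn.metrics import cohen_kappa_score

rater_a = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]
rater_b = ["agree", "neutral", "neutral", "disagree", "agree", "agree"]

kappa = cohen_kappa_score(rater_a, rater_b)  # chance-corrected agreement
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect, 0 = chance level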
• It is important to note that reliability is a necessary but not sufficient
condition of the test of goodness of a measure. For example, one
could very reliably measure a concept establishing high stability and
consistency, but it may not be the concept that one had set out to
measure. Validity ensures the ability of a scale to measure the
intended concept.
• We will now discuss the concept of validity.
VALIDITY
• We examined earlier, the terms internal validity and external validity in the
context of experimental designs. That is, we were concerned about the issue
of the authenticity of the cause-and-effect relationships (internal validity), and
their generalizability to the external environment (external validity).
• We are now going to examine the validity of the measuring instrument itself.
That is, when we ask a set of questions (i.e., develop a measuring instrument)
with the hope that we are tapping the concept, how can we be reasonably
certain that we are indeed measuring the concept we set out to do and not
something else? This can be determined by applying certain validity tests.
• Several types of validity tests are used to test the goodness of measures and
writers use different terms to denote them.
• For the sake of clarity, we may group validity tests under three broad headings:
content validity, criterion-related validity, and construct validity.
Content Validity
• Content validity ensures that the measure includes an adequate and
representative set of items that tap the concept. The more the scale items
represent the domain or universe of the concept being measured, the greater
the content validity. To put it differently, content validity is a function of how
well the dimensions and elements of a concept have been delineated.
• A panel of judges can attest to the content validity of the instrument.
• Face validity is considered by some as a basic and a very minimum index
of content validity. Face validity indicates that the items that are intended to
measure a concept, do on the face of it look like they measure the concept.
Some researchers do not see fit to treat face validity as a valid component of content validity.
Criterion-Related Validity
• Criterion-related validity is established when the measure differentiates individuals on a criterion it is
expected to predict. This can be done by establishing concurrent validity or predictive validity, as
explained below.
• Concurrent validity is established when the scale discriminates individuals who are known to be
different; that is, they should score differently on the instrument as in the example that follows.
• If a measure of work ethic is developed and administered to a group of welfare recipients, the scale
should differentiate those who are enthusiastic about accepting a job and glad of an opportunity to
be off welfare, from those who would not want to work even when offered a job. Obviously, those
with high work ethic values would not want to be on welfare and would yearn for employment to be
on their own. Those who are low on work ethic values, on the other hand, might exploit the
opportunity to survive on welfare for as long as possible, deeming work to be a drudgery. If both
types of individuals have the same score on the work ethic scale, then the test would not be a
measure of work ethic, but of something else.
• Predictive validity indicates the ability of the measuring instrument to differentiate among
individuals with reference to a future criterion. For example, if an aptitude or ability test
administered to employees at the time of recruitment is to differentiate individuals on the basis of
their future job performance, then those who score low on the test should be poor performers and
those with high scores good performers.
Construct Validity
• Construct validity testifies to how well the results obtained from the use of
the measure fit the theories around which the test is designed. This is
assessed through convergent and discriminant validity, which are explained
below.
• Convergent validity is established when the scores obtained with two different instruments measuring the same concept are highly correlated.
• Discriminant validity is established when, based on theory, two variables are predicted to be uncorrelated, and the scores obtained by measuring them are indeed empirically found to be so. (A small correlation sketch follows.)
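A small correlation sketch (assuming NumPy): the convergent correlation should be high, the discriminant correlation near zero. All scores are hypothetical.

import numpy as np

scale_a = np.array([10, 14, 18, 22, 26, 30])  # instrument A, concept X
scale_b = np.array([11, 13, 19, 21, 27, 29])  # instrument B, same concept X
unrelated = np.array([5, 29, 11, 23, 8, 17])  # theoretically unrelated trait

convergent = np.corrcoef(scale_a, scale_b)[0, 1]      # expect high
discriminant = np.corrcoef(scale_a, unrelated)[0, 1]  # expect near zero
print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")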
POINTS TO PONDER
• Validity can be established in different ways.
• In sum, the goodness of measures is established through the different
kinds of validity and reliability. The results of any research can only be as
good as the measures that tap the concepts in the theoretical frame- work.
• We need to use well-validated and reliable measures to ensure that our
research is scientific.
• Fortunately, measures have been developed for many important concepts
in organizational research and their psychometric properties (i.e., the
reliability and validity) established by the developers.
• Thus, researchers can use instruments already reputed to be “good,” rather than laboriously develop their own measures. When using these
measures, however, researchers should cite the source (i.e., the author
and reference) so that the reader can seek more information if necessary.
POINTS TO PONDER Cont…
• It is not unusual that two or more equally good measures are developed for
the same concept.
• When more than one scale exists for any variable, it is preferable to use the
measure that has better reliability and validity and is also more frequently
used.
• At times, we may also have to adapt an established measure to suit the
setting. For example, a scale that is used to measure job performance, job
characteristics, or job satisfaction in the manufacturing industry may have to
be modified slightly to suit a utility company or a health care organization.
• The work environment in each case is different and the wordings in the
instrument may have to be suitably adapted. However, in doing this, we are
tampering with an established scale, and it would be advisable to test it for
the adequacy of the validity and reliability afresh.
