
AN INVESTIGATION INTO THE CONTENT VALIDITY OF A VIETNAMESE STANDARDIZED TEST OF ENGLISH PROFICIENCY (VSTEP.3-5) READING TEST

Nguyen Thi Phuong Thao*


Center for Language Testing and Assessment, VNU University of Languages and International Studies,
Pham Van Dong, Cau Giay, Hanoi, Vietnam

Received 07 March 2018; Revised 26 July 2018; Accepted 31 July 2018

Abstract: This paper investigated the content validity of a Vietnamese Standardized Test of English Proficiency (VSTEP.3-5) reading test via both qualitative and quantitative methods. The aim of the study was to evaluate the relevance and the coverage of the content of this test against the description in the test specification and the actual performance of examinees. Drawing on a content analysis by three testing experts using Bachman and Palmer's (1996) framework, together with an analysis of test scores, the study found a relatively high consistency of the test content with the test design framework and with the test takers' performance. These findings help confirm the content validity of the specific test paper investigated. However, a need for content review is also raised, as the analysis revealed some problems.

Keywords: language testing, content validity, reading comprehension test, standardized test

1. Introduction

In foreign language testing, it is crucial to ensure test validity – one of the six significant qualities (along with reliability, authenticity, practicality, interactiveness and impact) that make up test usefulness (Bachman & Palmer, 1996). Accordingly, designing a valid reading test is of great concern to language educators and researchers (Bachman & Palmer, 1996; Alderson, 2000; Jin Yan, 2002).

The Vietnamese Standardized Test of English Proficiency (VSTEP.3-5) has been implemented for Vietnamese learners of English since March 2015. The test aims at assessing English proficiency from level 3 to level 5 according to the Common European Framework of Reference for Languages for Vietnamese learners (CEFR-VN), or from level B1 to level C1 according to the Common European Framework of Reference for Languages (CEFR), for users in various majors and professions, across four skills. There have not been many studies on this test: only two articles have been published, one on rater consistency in rating L2 learners' writing tasks by Nguyen Thi Quynh Yen (2016) and one on the washback effect of the test on the graduation standard for English-major students at the University of Languages and International Studies (ULIS), Vietnam National University (VNU) by Nguyen Thuy Lan (2017). Analysis of the test itself has so far been under-researched.

* Tel.: 84-963716969, Email: [email protected]
1 This study was completed under the sponsorship of the University of Languages and International Studies (ULIS-VNU) in the project N.16.23.

Like the tests of the other skills, the reading tests have been developed, designed and expected to be valid in their use. It is of importance that the test measures what it is supposed to measure (Henning, 2001: 91). In this sense, validity "refers to the interpretations or actions that are made on the basis of test scores" and "must be evaluated with respect to the purpose of the test and how the test is used" (Sireci, 2009). In the scope of this study, the author would like to evaluate the content validity of a specific VSTEP.3-5 reading test with a focus on the content of the test and the test scores. The results of this study, to an extent, are expected to respond to public concerns about the quality of the test.

2. Literature review

2.1. Models of validity

As researchers have claimed, validity is the most important quality of test interpretation or test use (Bachman, 1990). The inferences or decisions we make based on test scores determine the test's meaningfulness, appropriateness and usefulness (American Psychological Association, 1985). In examining such qualities related to the validity of a test, test scores play the key role but are not the only factor, as they need to be considered together with the teaching syllabus, the test specification and other factors. As a result, the concept of validity has been seen from different perspectives, which has led to different viewpoints on how to categorize this most crucial quality of a test. Given the purpose and scope of this paper, the researcher will present two main types of validity and discuss how content validity can be examined.

Content validity

As test users, we have a tendency to examine the test content, which can be seen from the copy of the test and/or the test design guidelines. In other words, test specifications and example items are to be investigated. Likewise, when designing a test, test developers also pay attention to the content or ability domain covered in the test, from which test tasks/items are generated. Therefore, consideration of the test content plays an important role for both test users and test developers. "Demonstrating that a test is relevant to and covers a given area of content or ability is therefore a necessary part of validation" (Bachman, 1990:244). In this sense, content validity is concerned with whether or not the content of the test is "sufficiently representative and comprehensive for the test to be a valid measure of what it is supposed to measure" (Henning, 2001:91).

As regards the evidential basis of content validity, Bachman (1990) discussed the following two aspects: content relevance and content coverage. Content relevance requires "the specification of the behavioral domain in question and the attendant specification of the task or test domain" (Messick, 1980:1017). According to Bachman (1990), content relevance should be considered in the specification of the ability domain – the constructs to be tested – and the test method facets – aspects of the whole testing procedure. This is directly linked with the test design process, to see whether the items generated for the test can reflect the constructs to be measured and the nature of the responses that the test taker is expected to make. The second aspect of content validity is content coverage, or "the extent to which the tasks required in the test adequately represent the behavioral domain in question" (Bachman, 1990:245). In test validation, this is the basis for evaluating how well the test items represent the domain(s); in other words, how well they match the specification.

The limitation of content validity is that it does not take into account the actual performance of test takers (Cronbach, 1971; Bachman, 1990). It is an essential part of the validation process, but it is not sufficient by itself, as inferences about examinees' abilities cannot be made from it.

Construct validity

According to Bachman (1990:254), construct validity "concerns the extent to which performance on tests is consistent with predictions that we make on the basis of a theory of abilities, or constructs." This is related to the way test scores are interpreted and how this interpretation can reflect the abilities the test aims to measure.

By the 1980s, this model was widely accepted as a general approach to validity through the work of Messick (1980, 1988, 1989). Messick adopted a broadly defined version of the construct model to make it a unifying framework for validity, incorporating all evidence for validity (namely content and criterion evidence) into construct validity. He considered the two models' supporting roles in showing the relevance of test tasks to the construct of interest, and in validating secondary measures of a construct against its primary measures. According to Messick (1988, 1989), there are three major positive impacts of utilizing the construct model as the unified framework for validity. Firstly, the construct model focuses on a range of issues in the interpretations and uses of test scores, not just on the correlation of test scores with specific criteria in specific settings for specific test takers. Secondly, it emphasizes the pervasive role of the assumptions embedded in score interpretations. Finally, the construct model allows for the possibility of alternative interpretations and uses of test scores. As can be seen from this analysis, construct validity is based on the interpretation of test scores in "a two-step process, from score to construct and from construct to use" (Kane, 2006:21).

2.2. Examining the content validity of the test

In the previous parts of the literature review, content validity and construct validity have been discussed on their own. In this section, content validity is examined in its link to construct validity, in the view of some recent researchers, to explain why the author chose to cover both the test content and test performances in the analysis.

As synthesized by Messick (1980), content validity, together with criterion validity, is seen as part of construct validity under a "unifying concept." However, rather than referring to "types", "categories", or "aspects" of validity, the current standards propose a validation framework based on five "sources of validity evidence" (AERA et al., 1999: 11, cited in Sireci, 2009). The five sources include test content, response processes, internal structure, relations to other variables, and consequences of testing. Among them, evidence based on test content "refers to traditional forms of content validity evidence" (Sireci, 2009: 30).

Furthermore, Lissitz and Samuelsen (2007: 482) are "attempting to move away from a unitary theory focused on construct validity and to reorient educators to the importance of content validity and the general problem of test development." Chalhoub-Deville (2009:242) firmly supported this focus of attention on content validity, which should be examined through "the qualities of test content, the interpretation and uses of test scores, the consequences of proposed score interpretation and uses, and theory refinement." The investigation of content validity, according to Chalhoub-Deville (2009), follows the operationalization of content that Lissitz and Samuelsen presented

in their 2007 article. It includes test standards and tasks, which are captured by the domain description of the test in general and the test specification in particular. As a result, the content validity of the test can be primarily seen from the comparison between the test tasks/items and the test specification. This is what we do before the test event, called "a priori validity evidence" (Weir, 2005). After the test event, "a posteriori validity evidence" is collected, related to scoring validity, criterion-related validity and consequential validity (Weir, 2005). To ensure scoring validity, which is considered "the superordinate for all the aspects of reliability" (Weir, 2005:22), test administrators and developers need to see the "extent to which test results are stable over time, consistent in terms of the content sampling, and free from bias" (Weir, 2005:23). In this sense, scoring validity helps provide evidence to support the content validity.

In summary, the current paper followed a combination of methods in assessing the content validity of the reading test. It is a process spanning before and after the test event. In the pre-test stage, the test content was judged by comparing it with the test specification. Later, the test scores were analyzed in the post-test stage for support of the content validity, by examining whether the content of specific items needs reviewing based on the analysis of item difficulty and item fit to the test specification.

3. Research methodology

3.1. Research subjects

The researcher chose a VSTEP.3-5 reading test used in one of the examinations administered by the University of Languages and International Studies (ULIS), Vietnam National University, Hanoi (VNU). This is one of the four separate skill tests that examinees are required to complete in order to achieve the final result of the VSTEP.3-5 test. Like the tests of the other skills, the reading test focuses on evaluating English language learners' reading proficiency from level 3 (B1) to level 5 (C1). There are four reading passages, with 10 four-option multiple choice questions per passage, for test takers to complete in a total time of 60 minutes. The passages vary in length and topic. As a case study which is seen as the basis for future research, this paper focused on only one test.

The particular test assessed was selected at random from a sample pool of VSTEP.3-5 tests which have undergone the same procedure of design and review. This aims at providing objectivity to the study. Also, only tests that were taken by at least 100 candidates were included in the sample pool, to increase the reliability of the test score analysis.

3.2. Research participants

For the pre-test stage, three experienced lecturers who have been working in the field of language testing and assessment participated in the evaluation of the test content, working with both the test paper and the test specification based on a framework of language task characteristics – including setting, test rubric, input, expected response, and the relationship between input and response – originally proposed by Bachman and Palmer (1996).

The research participants also included 598 test takers who took the VSTEP.3-5 reading test. This population is a combination of English-major and non-major students at VNU and candidates of various ages working in a range of fields throughout the country. Therefore, the test scores are expected to reflect the performance of a variety of English language learners taking the reading test.

3.3. Research questions

1. To what extent is the content of the reading test compatible with the test specification?

2. To what extent do the reading test results reflect its content validity?

3.4. Research methods and data analysis

The study made use of both quantitative and qualitative data collection. Firstly, an analysis of the test paper comparing it with the test specification was conducted. The framework followed the original one proposed by Bachman and Palmer (1996). This widely used framework in language testing has been applied in previous studies such as Bachman and Palmer (1996), Carr (2006), Manxia (2008) and Dong (2011). However, as Manxia (2008) notes, the framework was not designed for any particular type of test task or examination. Given the nature of reading and the characteristics of reading tests, "characteristics of the input" and "characteristics of the expected response" are the facets advised to be evaluated. In this study, "input" refers to the four reading passages that test takers were asked questions about during their examination. It covers length, language of input, domain and text level. This is also an adaptation of Bachman and Palmer's model, since it is closely related to the test specification – the blueprint or guidelines of test design that test writers are supposed to follow. "Expected response" concerns the response type and, specifically, the options of each question. The analysis pointed out how similarly or differently the test paper under evaluation was written compared with the test specification. To be specific, regarding characteristics of the input, the study compared the length, language of input, domain and text level; in terms of the expected response, the response type and reading skills were analyzed. The analysis was conducted by comparing these features of the test with the description in the test specification.

The data were collected using the Compleat Lexical Tutor software version 6.2, a vocabulary profiler tool (http://www.lextutor.ca/); the software provides statistical data on an inputted text based on research with the British National Corpus (BNC), representing a vocabulary profile across the K1 to K20 frequency lists. Moreover, the readability index was checked on the website https://readable.io/ and cross-checked against the result from Microsoft Word. The website reports the level of a text as A, B or C, rather than as one of the six CEFR levels.
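For illustration, K1+K2 percentages of the kind reported later in Table 1 can be approximated in a few lines of Python. This is a minimal sketch, not the profiler's actual implementation: the file names (k1.txt, k2.txt, passage1.txt) are hypothetical stand-ins, and a real profiler such as Compleat Lexical Tutor also groups inflected forms into word families, which this token-level version omits.

```python
# Sketch of a K1+K2 coverage statistic, as reported by vocabulary
# profilers. The BNC-based frequency lists (k1.txt, k2.txt: one
# lower-cased headword per line) are hypothetical stand-ins.
import re

def load_wordlist(path):
    """Read one lower-cased headword per line into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def k1_k2_coverage(text, k1_words, k2_words):
    """Percentage of tokens in `text` found in the K1 or K2 list."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in k1_words or t in k2_words)
    return 100.0 * hits / len(tokens)

if __name__ == "__main__":
    k1 = load_wordlist("k1.txt")  # hypothetical first-1000-word BNC list
    k2 = load_wordlist("k2.txt")  # hypothetical second-1000-word BNC list
    passage = open("passage1.txt", encoding="utf-8").read()
    print(f"K1+K2 coverage: {k1_k2_coverage(passage, k1, k2):.2f}%")
```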
After that, more qualitative data were collected through a group discussion between the researcher and the three experts who analyzed the test paper. In the discussion, the experts shared their thoughts on the test, related to the proposed and estimated item difficulty levels and the characteristics of the stems and options, as well as an overall evaluation of the compatibility between the investigated test paper and the reading test specification. These two methods helped collect the data to answer research question one, which concerns the compatibility between the test items/questions and the test specification.

Secondly, the test scores were reported with descriptive statistics and item response theory (IRT) results as a means of incorporating examinee performance into the Bachman and Palmer model. IRT is basically related to "accurate test scoring and development of test items" (An & Yung, 2014). There are several parameters that can be calculated; however, this study focused on the item measure, i.e. item difficulty, and item fit, to see how the examinees' performance on each item/question matches the estimated item/question levels in the test specification. In this way, we can evaluate the quality of the items with a real pool of examinees.

4. Results and discussion

4.1. Research question 1: To what extent is the content of the reading test compatible with the test specification?

As presented in the methodology, Bachman and Palmer's framework was adopted in this study with a focus on the analysis of the characteristics of the input and the response.

Characteristics of the input

In terms of the input, attention was paid to specific features suited to reading passages. Table 1 displays a detailed illustration of the analysis, comparing the requirements in the test specification with their manifestations in the investigated test paper.

Table 1. Characteristics of the input

Length
  Test specification: Passages 1, 2, 3: ~400 words per passage; Passage 4: ~500 words
  Test paper: Passage 1: 452 words; Passage 2: 450 words; Passage 3: 456 words; Passage 4: 503 words

Language of input – Vocabulary
  Test specification: Passages 1, 2: mostly high-frequency words, some low-frequency words; Passages 3, 4: more low-frequency words are expected
  Test paper: K1+K2 words – Passage 1: 94.31%; Passage 2: 87.23%; Passage 3: 77.13%; Passage 4: 77.41%

Language of input – Grammar
  Test specification: Passages 1, 2, 3: a combination of simple, compound and complex sentences; Passage 4: a majority of compound and complex sentences
  Test paper: Passages 1-4: the majority are compound and complex sentences

Domain
  Test specification: each passage should belong to one of the four domains: personal, public, educational and occupational
  Test paper: Passages 1 & 2: educational domain; Passages 3 & 4: public domain

Text level
  Test specification: Passage 1: B1 level; Passages 2 & 3: B2 level; Passage 4: C1 level
  Test paper: Passage 1: Level B (average grade level 6.9, reading ease 76.6%); Passage 2: Level B (average grade level 9.4, reading ease 58.4%); Passage 3: Level C (average grade level 11, reading ease 50%); Passage 4: Level C (average grade level 13.8, reading ease 34.5%)

The table shows that the test was generally an effective realization of the test specification with regard to the investigated characteristics of the input. Most of the description was satisfactorily met in the four reading passages. Regarding the length of the input and the domain, all the passages fell within the accepted range of word counts, as the total word count may fluctuate within 10% of the target, and they belonged to reasonable domains with suitable topics. In terms of the lexical resource of the input, according to O'Keeffe and Farr (2003) and Dang and Webb (2016), cited in Szudarski (2018), the first two thousand words, i.e. the K1 and K2 words, are the high-frequency ones, while the rest belong to the K3 and lower-frequency bands, the academic word list and off-list words. Based on these studies, it can be claimed that the proportion of high- and low-frequency words in the four passages satisfied the test specification. Last but not least, the text level should be mentioned in this study, as it is a priority of the test design according to the test specification. As the goal of the test is to distinguish examinees' reading proficiency at levels B1, B2 and C1, the requirement in the test specification also aims at these three levels, as seen from the table. The four passages were checked with the website https://readable.io/ and Microsoft Word; however, it is admitted that there is no official tool to assess the readability of an inputted text. Therefore, the result should be considered a reference which partially reflects the requirement and needs more discussion with the test reviewers.

As regards the discussion with the three reviewers, positive comments on the quality of the texts were noted. Reviewer 1 saw a good job in the capability to discriminate the levels of the four passages, i.e. the difficulty level increased steadily from passage 1 to passage 4. Also, the variety of specific topics allowed examinees to demonstrate a breadth of understanding. This feedback was also reported by reviewers 2 and 3. Reviewer 2, however, pointed out the problem with grammatical structures that the above table displays: compound and complex sentences outnumbered simple ones in all four texts, which might be challenging for readers at lower levels like B1 to process. For the text level, the experts emphasized the role of test developers in evaluating the difficulty of the input, which should not depend solely on a readability tool. It is ultimately the test writer's expertise in analyzing the language of the passage that best assesses the reading level of a text.
Characteristics of the response

Following the analysis suggested by Manxia (2008), this paper focused on two features of the response, namely response type and reading skills. Strictly speaking, the reading skills could be treated under the input, as they attach to the test items; however, to make the analysis coherent and compatible with the test specification, the researcher decided to keep both the test items and the item options in this part. The analysis results can be seen in Table 2.

Table 2. Characteristics of the response

Response type
  Test specification: multiple choice questions with four options
  Test paper: multiple choice questions with four options

Reading skills
  Test specification: reading for main idea; reading for specific information/details; reading for reference; understanding vocabulary in context; understanding implicit/explicit author's opinion/attitude; reading for inference; understanding the organizational patterns of the passage; understanding the purpose of the passage
  Test paper: reading for main idea; reading for specific information/details; reading for reference; reading for vocabulary in context; reading for author's opinion/attitude; reading for inference; understanding the organizational patterns of the passage; understanding the purpose of the passage

The table shows that the test met the requirements of the test specification in terms of response type and reading skills. All forty items were written in the form of multiple choice with four options and covered a number of sub-skills that the test specification suggested for different question levels. For an in-depth analysis of the test items, to evaluate the extent to which they matched the test specification, i.e. the content coverage, three reviewers were arranged to work individually and then discuss in a group to assess the quality of the test items. In the assessment, firstly, all reviewers agreed that there was a range of question types aimed at different skills in the test, and all of these types appeared in the test specification. Secondly, the majority of the questions or items appropriately reflected the intended item difficulty. The test covers three CEFR levels (B1, B2, and C1); furthermore, the test specification adds three levels of complexity (low, mid, high) to each level, creating nine question levels for the test. Due to the confidentiality of the test, a detailed description cannot be presented here for either the test specification or the current test itself. In this research, the reviewers all claimed that nine levels of difficulty could be pointed out from the forty items. However, a problem came about in this aspect, as fewer B1 low questions were found than planned. Conversely, there were more B1 mid, B2 low and B2 mid questions in the investigated paper than in the test specification. There was agreement among the test reviewers that the number of high-level items was larger than in the test specification. This explains the finding that low-level test takers had difficulty with this test, i.e. the test was more difficult than the test specification requires. The reviewers also commented on the tendency to have several questions testing a specific skill in one passage. For example, in passage 2, four out of ten questions focus on sentence meaning, whether explicitly or implicitly expressed; and another passage had one question for main idea and one question for main purpose. In fact, this is not mentioned in the test specification as a constraint for the test designers; however, the test specification recommends that the test writer balance and vary the kinds of skills tested, in each passage particularly and in the whole test overall.

To sum up, it can be concluded in this study that the test paper followed the test

specification with all requirements regarding its content. The analysis of the input and response, by presenting statistical data and reviewers' feedback, made it possible to confirm the content validity via the content relevance and content coverage of the test.

4.2. Research question 2: To what extent do the reading test results reflect its content validity?

The evidence to answer this question was obtained from the analysis of test scores using descriptive statistics and the IRT model.

Descriptive statistics

The descriptive statistics of the reading test are presented in Table 3 and Figure 1.
Table 3. Score distribution of the test (N = 598)

Items N Min Max Mean Mode SD Skewness Kurtosis


40 598 4 37 15.080/40 15 5.082 .288 -.153

Figure 1. Score distribution of the test (N = 598)

It can be seen that the mean score is relatively low, at 15.080/40. More importantly, the skewness is positive (.288), showing that the score distribution is slightly skewed to the right. This indicates that the reading test was rather difficult for the test takers. The initial analysis of descriptive statistics strengthened the comments that the three experts made about the level of the test, providing an overall impression that it is more difficult than what is required in the specification.
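The figures in Table 3 are standard sample moments of the 598 raw scores and can be reproduced with pandas; the following is a minimal sketch, assuming the scores sit in a hypothetical scores.csv with a single score column and that the software behind Table 3 uses the same sample-adjusted skewness and excess-kurtosis formulas as pandas.

```python
# Sketch of Table 3's descriptive statistics from the raw scores (0-40).
import pandas as pd

scores = pd.read_csv("scores.csv")["score"]   # hypothetical file/column names
print("N:       ", scores.count())
print("Min/Max: ", scores.min(), scores.max())
print("Mean:    ", round(scores.mean(), 3))
print("Mode:    ", scores.mode().iloc[0])
print("SD:      ", round(scores.std(), 3))    # sample SD (ddof=1)
print("Skewness:", round(scores.skew(), 3))   # positive => scores pile up low
print("Kurtosis:", round(scores.kurt(), 3))   # excess kurtosis
```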
IRT results

In order to get a detailed description of the test items and personal performance, IRT results focusing on item difficulty and item fit to the test specification were collected. These are significant tools for assessing whether the content specification is maintained in the real test.

As shown in Table 4, the mean measures (difficulty) for item and person are .00 and -.62 respectively, which means the test takers found the test difficult in general. Additionally, it is evident from Table 4 that the infit and outfit statistics for both item and person are within the desirable range, which is from .8 to 1.2 for the mean square and from -2 to +2 for the z-standardized values (Wright & Linacre, 1994). Therefore, it is safe to say that, overall, the data fit the model expectations for both person and item. That is, the test is productive for measuring the construct of reading comprehension, and the data have reasonable predictability in general.

Table 4. Measure, fit statistics, reliability, and separation of the test (N = 598)

        Measure          Infit            Outfit
        Mean     SE      MNSQ    ZSTD     MNSQ    ZSTD     Reliability   Separation
Item    .00      .10     1.00    -.3      1.03    .0       .99           9.37

Furthermore, the reliability estimate for the reading items and the item separation resulting from the Rasch analysis are high, at .99 and 9.37 respectively, showing very high internal consistency for the items in the reading test. Simply put, the test has a wide spread of item difficulty, and the number of test takers was large enough to confirm a reproducible item difficulty hierarchy. This matches the description in the test specification that the item difficulty levels range from B1 low to C1 high, and it also matches the qualitative analysis from the three test reviewers presented under research question 1.
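As background (this relation is standard in the Rasch literature rather than stated in the paper), the separation index G is the ratio of the true spread of the measures to their average standard error, and it determines the reliability R:

\[
R \;=\; \frac{G^{2}}{1+G^{2}}, \qquad G \;=\; \sqrt{\frac{R}{1-R}} .
\]

With R = .99 this gives G ≈ 9.9, of the same order as the reported separation of 9.37; the difference reflects the rounding of the reliability to two decimals.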
Item and person measure easiest. It is easily seen from Figure 2 that the
First, a correlation analysis was run to examine the correlations between the person measure and the test takers' raw scores, and between the item measure and the proportion correct (p) value. The results are presented in Table 5, which shows that the correlations are nearly perfect, very close to ±1. From such results, the reading raw scores can be used legitimately to determine the performers' level of reading proficiency.
Table 5. Correlations between person measure and raw scores, item measure and proportion correct (N = 598)

                          Person measure   Item measure
Raw scores                .995***
Proportion correct (p)                     -.992***

*** p < .001
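A minimal sketch of how the two correlations in Table 5 can be computed, assuming the Rasch person and item estimates have been exported to hypothetical CSV files with the column names used below:

```python
# Sketch of the Table 5 correlations: Pearson r between Rasch person
# measures and raw scores, and between item measures and proportion
# correct (p). File and column names are hypothetical.
import pandas as pd
from scipy.stats import pearsonr

persons = pd.read_csv("person_measures.csv")   # columns: measure, raw_score
items = pd.read_csv("item_measures.csv")       # columns: measure, p_value

r_person, sig1 = pearsonr(persons["measure"], persons["raw_score"])
r_item, sig2 = pearsonr(items["measure"], items["p_value"])
# Harder items (higher logits) are answered correctly less often,
# so the item-side correlation is strongly negative.
print(f"person measure vs raw score:        r = {r_person:.3f} (p = {sig1:.2g})")
print(f"item measure vs proportion correct: r = {r_item:.3f} (p = {sig2:.2g})")
```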
Secondly, the item measure (item difficulty) of the test was investigated through Rasch analysis. Table 6 provides the logit values of the items, which represent the difficulty of the items (item measure) as estimated by the Rasch model. In the Rasch model, an item with a higher logit value is more difficult, thus requiring higher ability to answer. Figure 2 illustrates the spread of the test takers' reading proficiency levels and the difficulty range of the reading items on the same measure scale. As observed from the table and the figure, the item difficulties in the reading test ranged widely, from -2.09 to 2.05, with the mean set at 0 by the model. Items 2, 13, and 28 are the most challenging, while items 7, 11, and 1 are the easiest. It is easily seen from Figure 2 that the spread of item difficulty covered nearly the whole range of the persons' abilities. Only the persons at the very top and bottom of the scale did not have items of equivalent levels. That is, the easiest item still seemed difficult for several examinees, and there were a few examinees whose reading proficiency surpassed the highest level tested. However, in general, the test could measure the proficiency of the vast majority of test takers. That the test cannot measure English reading proficiency at either extreme (low and high) should not be considered detrimental to the test quality, because the VSTEP.3-5 does not aim at identifying examinees' English proficiency at all six CEFR levels. Instead, the test targets only three levels: B1, B2, and C1. Therefore, if examinees are at level A1, A2, or C2, their ability is not likely to be well measured by the VSTEP.3-5. It can be considered that the test items fulfilled their purpose of focusing on three specific levels of the CEFR, rather than spreading across all six levels.
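The paper does not say which program produced the Rasch estimates, so the following is only a sketch of one standard way to obtain such logit difficulties and infit/outfit mean squares: joint maximum likelihood (JML) estimation on the 598 × 40 matrix of scored responses. Operational programs (e.g. Winsteps) add estimation-bias corrections and special handling of perfect scores, which this illustration omits; responses.csv is a hypothetical input file.

```python
# Sketch of a JML Rasch analysis: item difficulties (logits) and
# infit/outfit mean squares, as reported in Tables 4 and 6.
import numpy as np

def fit_rasch(X, n_iter=100, lr=1.0):
    """X: persons x items binary matrix. Returns (theta, b) in logits."""
    n_persons, n_items = X.shape
    theta = np.zeros(n_persons)            # person abilities
    b = np.zeros(n_items)                  # item difficulties
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        W = P * (1.0 - P)                  # Fisher information weights
        # Newton-Raphson steps for persons and items
        theta += lr * (X - P).sum(axis=1) / W.sum(axis=1)
        b -= lr * (X - P).sum(axis=0) / W.sum(axis=0)
        b -= b.mean()                      # anchor: mean item difficulty = 0
    return theta, b

def item_fit(X, theta, b):
    """Return (infit MNSQ, outfit MNSQ) per item."""
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    W = P * (1.0 - P)
    Z2 = (X - P) ** 2 / W                  # squared standardized residuals
    outfit = Z2.mean(axis=0)               # unweighted mean square
    infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)  # information-weighted
    return infit, outfit

X = np.loadtxt("responses.csv", delimiter=",")  # hypothetical 598 x 40 file
theta, b = fit_rasch(X)
infit, outfit = item_fit(X, theta, b)
for i in np.argsort(b):                     # easiest to hardest
    print(f"item {i + 1:2d}: measure {b[i]:+.2f}  "
          f"infit {infit[i]:.2f}  outfit {outfit[i]:.2f}")
```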

Table 6. Item measure and item fit of the test (N = 598)

Item   Measure   Infit MNSQ   Infit ZSTD   Outfit MNSQ   Outfit ZSTD
1 -1.78 0.90 -2.24 0.83 -2.87
2 2.05 1.06 0.51 1.57 2.99
3 0.31 0.86 -3.57 0.83 -3.36
4 -1.52 0.80 -5.47 0.73 -5.70
5 -0.45 0.94 -2.55 0.93 -2.37
6 -1.28 0.84 -5.07 0.80 -4.90
7 -2.09 0.89 -2.02 0.78 -2.91
8 -0.77 0.96 -1.78 0.95 -1.70
9 -1.32 0.89 -3.34 0.85 -3.57
10 0.17 1.03 0.77 1.04 0.80
11 -2.02 0.95 -0.95 0.83 -2.36
12 0.50 1.05 1.08 1.09 1.43
13 1.69 1.00 0.02 1.15 1.11
14 -0.33 0.95 -1.91 0.95 -1.57
15 -0.41 1.00 0.09 1.01 0.28
16 -0.30 0.95 -1.74 0.95 -1.44
17 -0.26 1.01 0.52 1.01 0.42
18 0.37 1.04 0.95 1.08 1.33
19 0.27 0.98 -0.57 0.99 -0.17
20 0.38 1.07 1.74 1.10 1.79
21 -0.53 1.04 1.76 1.05 1.75
22 -0.23 1.02 0.82 1.04 1.01
23 -1.03 0.97 -1.14 0.97 -0.89
24 0.36 1.09 2.24 1.15 2.65
25 0.27 1.07 1.74 1.09 1.63
26 -0.33 0.93 -2.88 0.93 -2.27
27 -0.09 1.06 2.08 1.10 2.44
28 1.65 1.11 1.11 1.40 2.72
29 0.07 1.01 0.27 1.04 0.84
30 -0.03 0.91 -2.86 0.90 -2.61
31 1.05 1.06 0.94 1.26 2.64
32 0.72 1.04 0.71 1.09 1.18
33 0.78 1.04 0.73 1.10 1.35
34 0.30 1.10 2.58 1.14 2.60
35 0.62 1.09 1.78 1.14 2.02
36 0.74 1.11 2.02 1.19 2.44
37 0.76 1.03 0.62 1.08 1.09
38 0.43 0.98 -0.43 0.96 -0.60
39 0.74 1.08 1.47 1.13 1.66
40 0.54 1.01 0.28 1.02 0.34

Figure 2. Person maps of items of the test (N = 598)


Furthermore, the Rasch analysis also reveals the actual difficulty of the items. It is illustrated in Figure 2 that several items do not follow the difficulty order they were intended for. For example, at the top of the scale, items 2 and 13, which were designed to be at a lower level, are shown above items 31 and 32, which are of a higher level. From the figure, more problematic items can be seen: items 3, 10, and 12 are more difficult than expected, whilst items 23 and 26 are easier. This means some items do not

perform as expected with this group of test takers. As a result, content review is necessary for them. This point deserves more effort in item review before and after the test, as it is directly related to the test content regarding item difficulty. Again, this is what the three test reviewers commented on in their analysis, when showing that it was hard to find low-level items in the test, while more items were found at mid or higher levels compared with the test specification. It can be claimed that the statistical analysis did support the content analysis of the test in this study.

5. Conclusion

5.1. Summary of major findings

The qualitative and quantitative data analysis has shown that both the test content and the test results reflect the test's content validity. In the first place, the test paper followed the guidelines of the test specification in its input characteristics, such as length, language, domain and text level, and in its response features of type and skills. This claim is made from the data comparison and the three test reviewers' feedback. What was developed in the test covered the main requirements of the test specification, and this was proved by the reviewers' analysis of the test paper. Some problems, nevertheless, were seen to remain. The texts chosen for the test had a majority of compound and complex structures, while the first two passages should contain more simple structures according to the test specification. With an online readability tool, the analysis also showed that the readability level of one passage was higher than it should have been. This is not a particularly big concern, but it is worth noting for future test review.

Secondly, a wide range of question difficulty levels, spreading from B1 low to C1 high, was reported, following the CEFR levels applied for the VSTEP.3-5. There was agreement between the reviewers about the variety of item difficulty levels throughout the test, especially that all nine required levels appear in the test. However, the analysis from the three experts and the test scores reveals a gap between the proposed difficulty and the actual difficulty of some items. In the test, some questions did not follow the difficulty order assigned to them, and their levels seemed to be higher or lower than planned. This leads the researcher to believe that the test is a bit more difficult than what is designed in the test specification. As a result, it is necessary that the specific items pointed out by the analysis be edited. The item revision should begin by reviewing the reading skills assessed by each question, to reduce the concentration of questions testing the same skill in any one text. Additionally, some options that were excessively challenging in terms of lexis and grammatical structure should be rewritten.

Generally speaking, the investigated test can be considered a success in guaranteeing the content validity of the VSTEP.3-5 reading comprehension test.

5.2. Limitations of the study

It cannot be denied that the current research has some limitations which should be taken into consideration in future studies. As this is a small-scale study, the focus was one reading test with three reviewers involved. Therefore, to reach generalized conclusions, more tests should be investigated.

References

Vietnamese

Nguyễn Thúy Lan (2017). Một số tác động của bài thi đánh giá năng lực tiếng Anh theo chuẩn đầu ra đối với việc dạy tiếng Anh tại Trường Đại học Ngoại ngữ - Đại học Quốc gia Hà Nội. Nghiên cứu Nước ngoài, 33(6), 123-141.

English

Alderson, J.C. (2000). Assessing Reading. Cambridge: Cambridge University Press.

Bachman, L. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.

Bachman, L. & Palmer, A. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.

Carr, N.T. (2006). The factor structure of test task characteristics and examinee performance. Language Testing, 23(3), 269-289. Available through http://ltj.sagepub.com/. Accessed 01/03/2018 14:15.

Chalhoub-Deville, M. (2009). Content validity considerations in language testing contexts. In R.W. Lissitz (Ed.), The concept of validity (pp. 241-259). Charlotte, NC: Information Age Publishing, Inc.

Cronbach, L.J. (1971). Test validation. In R.L. Thorndike (Ed.), Educational Measurement, 2nd ed. (pp. 443-507). Washington, DC: American Council on Education.

Dong, B. (2011). A content validity study of TEM-8 Reading Comprehension (2008-2010). Kristianstad University Sweden. Available through www.diva-portal.se/smash/get/diva2:428958/FullText01.pdf. Accessed 20/02/2018 09:00.

Henning, G. (2001). A guide to language testing: Development, evaluation and research. Beijing: Foreign Language Teaching and Research Press.

Kane, M.T. (2006). Validation. In R.L. Brennan (Ed.), Educational Assessment, 4th ed. (pp. 17-64). New York: American Council on Education.

Lissitz, R.W. & Samuelsen, K. (2007). A suggested change in terminology and emphasis regarding validity and education. Educational Researcher, 36(8), 437-448.

Manxia, D. (2008). Content validity study on reading comprehension tests of NMET. CELEA Journal, 31(4), 29-39.

Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational Measurement, 3rd ed. (pp. 13-103). New York: American Council on Education and Macmillan.

Nguyen Thi Quynh Yen (2016). Rater Consistency in Rating L2 Learners' Writing Task. VNU Journal of Science: Foreign Studies, 32(2), 75-84.

O'Keeffe, A. & Farr, F. (2003). Using language corpora in language teacher education: pedagogic, linguistic and cultural insights. TESOL Quarterly, 37(3), 389-418.

Sireci, S.G. (2009). Packing and unpacking sources of validity evidence: History repeats itself again. In R.W. Lissitz (Ed.), The concept of validity (pp. 19-39). Charlotte, NC: Information Age Publishing, Inc.

Szudarski, P. (2018). Corpus Linguistics for Vocabulary: A Guide for Research. Routledge Corpus Linguistics Guides. New York: Routledge.

Weir, C.J. (2005). Language Testing and Validation: An Evidence-Based Approach. Basingstoke: Palgrave Macmillan.

Wright, B.D. & Linacre, J.M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370-371.

A STUDY ON THE CONTENT VALIDITY OF A READING TEST IN THE FORMAT OF THE VIETNAMESE STANDARDIZED TEST OF ENGLISH PROFICIENCY, LEVELS 3-5 (VSTEP.3-5)

Nguyễn Thị Phương Thảo

Center for Language Testing and Assessment, VNU University of Languages and International Studies, Pham Van Dong, Cau Giay, Hanoi, Vietnam

Abstract: This article presents the results of a study on the content validity of a reading test in the format of the Vietnamese Standardized Test of English Proficiency, levels 3-5 (VSTEP.3-5), through the analysis of quantitative and qualitative data. The aim of the study is to evaluate how well the content of the test matches the test specification and the actual ability of the candidates. The study invited three lecturers with expertise in language testing and assessment to analyze the test content using Bachman and Palmer's (1996) framework of test task characteristics. At the same time, the study analyzed the actual scores of the 598 candidates who took this test. The study shows that the content validity of the investigated test is supported by the analysis instruments. However, the test should still be reviewed and refined with regard to several problems the study has pointed out.

Keywords: language testing and assessment, content validity, reading comprehension test, standardized test
