
Duolingo English Test: Technical Manual

Duolingo Research Report


May 20, 2024 (44 pages)
https://englishtest.duolingo.com/research

Ramsey Cardwell∗, Ben Naismith∗, Geoffrey T. LaFlair∗, and Steven Nydick∗

Abstract
The Duolingo English Test Technical Manual provides an overview of the design, development, administration, and scoring of the
Duolingo English Test. Furthermore, the Technical Manual reports validity, reliability, and fairness evidence, as well as test-taker
demographics and the statistical characteristics of the test. This is a living document whose purpose is to provide up-to-date information
about the Duolingo English Test, and it is updated on a regular basis.

Last Update: May 20, 2024

Note: We would like to acknowledge the contributions of Burr Settles, the creator of the Duolingo English Test and author of the first Technical Manual.

∗ Duolingo, Inc.

Corresponding author:
Ramsey Cardwell, PhD
Duolingo, Inc. 5900 Penn Ave, Pittsburgh, PA 15206, USA
Email: [email protected]

Contents

1 Introduction
2 Theoretical Basis
  2.1 Test Constructs
3 Test Task Types
  3.1 Reading Tasks
  3.2 Listening Tasks
  3.3 Writing Tasks
  3.4 Speaking Tasks
4 Test Development
  4.1 Task and Item Development
  4.2 Item Generation and Review
5 Item Delivery and Scoring
  5.1 CAT Delivery
  5.2 CAT Item Scoring
  5.3 Open-Ended Speaking and Writing Task Scoring
  5.4 Subscores
  5.5 Score Reliability
6 Access and Accommodations
  6.1 Access
  6.2 Accommodations
7 Test Administration and Security
  7.1 Test Administration
  7.2 Onboarding
  7.3 Administration Rules
  7.4 Proctoring and Reporting
8 Test-Taker Demographics
9 Test Performance Statistics
  9.1 Score Distributions
  9.2 Relationship with Other Tests
10 Quality Control
  10.1 Analytics for Quality Assurance in Assessment
  10.2 Proctoring Quality Assurance
11 Conclusion
12 Appendix


1 Introduction
The Duolingo English Test (DET) is a measure of English language proficiency for communication and use in English-medium settings,
covering the full range of language proficiency. It assesses test-taker ability to use linguistic skills that are required for literacy,
conversation, comprehension, and production. The test is designed for maximum accessibility while maintaining high measurement
accuracy and using authentic multimodal inputs; it is delivered via the internet, without a testing center, and is available 24 hours a day,
365 days a year. In addition, as a computer-adaptive test (CAT), it is designed to be efficient; the test takes approximately one hour to
complete, though as a CAT the exact time varies for each test taker. The test uses a variety of task types that provide maximal coverage
of the English language proficiency construct while being feasible to develop, administer, and score at scale. In all areas of the test, high
standards of security and psychometric quality are maintained.
In adherence to the standards for test documentation (Chapter 7) in the Standards for Educational and Psychological Testing (AERA et al., 2014), this technical manual provides an overview of all aspects of the DET so that stakeholders can make informed decisions about how to interpret and use DET scores. Like the Standards themselves, this technical manual begins with more theoretical foundations before
proceeding to operational topics. It contains a presentation of:
• the test’s tasks, the constructs they cover, and how they are created using human-in-the-loop generative AI
• the adaptive delivery and scoring of test items using computational psychometrics
• the test’s accessibility, delivery, and proctoring and security processes
• demographic information of the test-taker population
• the statistical characteristics of the test
Since its inception in 2016, the social mission of the DET has been to lower barriers to education access for English language learners
around the world. The DET achieves this goal by leveraging technological advances in annual test updates to produce an accessible
and affordable high-stakes language proficiency test that produces valid, fair, and reliable test scores. These scores are intended to be
interpreted as reflecting test-taker English language proficiency and to be used in a variety of settings, including for post-secondary
admissions decisions. To date, the success of this mission is evidenced by the widespread adoption of the DET by more than 5,000
academic programs in 100 countries.

2 Theoretical Basis
The Duolingo English Test employs a novel assessment ecosystem (Burstein et al., 2022) composed of an integrated set of frameworks
related to language assessment, design, psychometrics, test security, and test-taker experience (TTX). Furthermore, the processes and
documentation of the DET—including test development, scoring, and documentation of validity, reliability and fairness evidence—have
been externally evaluated against the AERA et al. (2014) standards and are continually internally evaluated against the Responsible
Artificial Intelligence (RAI) Standards (Burstein, 2023). These theoretical underpinnings motivate the research philosophy and values
of the DET, which aim to make the DET test-taker-centered by taking advantage of the latest developments in technology (including
machine learning and artificial intelligence), applied linguistics, psychometrics, and assessment science.
On a more fundamental level, the Duolingo English Test subscribes to the interactionalist perspective of what a test can in fact assess
(Chapelle, 1998; Messick, 1989, 1996; Young, 2011). In this conceptualization, test-taker performance reflects two elements and their
interaction: 1) the underlying traits of the test taker (English language skills, strategies, and competencies), and 2) the context-specific
behaviors of the test taker (task performance). For example, an individual may evidence a certain level of spoken proficiency during a
face-to-face conversation but may struggle with the exact same conversation on the phone. It is therefore necessary to always consider the
characteristics of the setting (including the task type and language modality) when drawing conclusions about a test-taker’s underlying
traits. This theory of language aligns closely with the tenets of the communicative language ability (CLA) model, which calls for language
assessments to be informed by “language ability in its totality” (Bachman & Palmer, 1996, p. 67). Such an approach is also consistent
with the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2020), which conceptualizes language
use as the leveraging of communicative language competences in various contexts with corresponding conditions and constraints (p. 32).
The result of the above considerations is a modern test that equally meets the assessment criteria and the needs of stakeholders, and which
is continually being evaluated and iterated upon in all aspects of our assessment processes. Together, these ecosystem frameworks, testing
standards, and research philosophy support a test validity argument built on a digitally informed chain of inferences, appropriate for a
digital-first assessment of this nature and consistent with professional standards of practice. As a result, the adaptive DET can be seen
to assess test takers’ proficiency in General English and English for Academic Purposes (EAP), both of which are essential for success
in a range of academic or professional settings. The DET considers EAP to consist of the language knowledge and skills necessary to
perform common communicative and pedagogical tasks in a range of educational contexts across academic disciplines. This definition
highlights the importance of communicative competence in educational settings.


2.1 Test Constructs


Here we describe the constructs being tested, that is, “the specific definition of an ability that provides the basis for a given assessment
or assessment task and for interpreting scores derived from this task” (Bachman & Palmer, 2010, p. 43). The DET measures test-taker
ability to use the language skills required for literacy, conversation, comprehension, and production, including the skills necessary for
success in academic contexts. These integrated skill areas correspond to the DET subscores, and each subscore can be interpreted as a
combination of two of the more traditional language subskills of speaking, writing, reading, and listening (SWRL). (For white papers
on how the DET assesses each SWRL skill, see Park et al. (2023) for speaking; Goodwin et al. (2023) for writing; Park et al. (2022)
for reading, and Goodwin and Naismith (2023) for listening.) In addition, certain DET task types target the assessment of vocabulary*
because vocabulary knowledge is critical to proficiency in all language skills areas. (See Park et al. (2024) for how the DET assesses
vocabulary.) Figure 1 shows the relationship between traditional language subskills and DET subscores (see Section 5.4).

Figure 1. Relationship between SWRL language skills and DET subscores

In total, the DET has 14 different graded task types that collectively measure test-taker proficiency in the English-language constructs
described above. The creation and selection of this specific combination of task types is guided by the DET Ecosystem (Burstein et al.,
2022), especially the Language Assessment Design Framework. In this framework, task design and scoring target constructs relevant for
General and Academic English language proficiency. Test security is also an aspect of the DET Ecosystem, and having a variety of task
types provides some protection against certain cheating strategies. In addition to test use validity and security, another consideration in
test design is ensuring a delightful TTX. As a result of these considerations, DET tasks are intuitive, reducing the need for test-specific
preparation. All DET task types are summarized in Tables 1 and 2 and are described individually in the subsequent section.†

3 Test Task Types


We now describe the 14 task types (and their sub-tasks) in Table 1. The DET task types include both closed-ended task types (e.g., C-test
and Yes/No Vocabulary) and open-ended task types (e.g., Picture Description and Writing Sample). Many of these tasks are integrative,
requiring the test taker to demonstrate proficiency with multiple skills, for example Dictation (listening and writing), Elicited Imitation
(reading and speaking), or Interactive Listening (listening and writing). Some task types are also multimodal, incorporating images,
animations, audio, and written text.
Tasks also vary in their authenticity, that is, “the degree of correspondence of characteristics of a given language test task to the features
of a TLU [target language use] task” (Bachman & Palmer, 2010, p. 23). For example, highlighting relevant text in a reading passage as
part of the Interactive Reading task is highly authentic as it is a skill that many test takers will employ in academic (or other) contexts. In
contrast, the Yes/No Vocabulary task, which targets vocabulary knowledge, is less authentic, as deciding whether a word is real is not an activity that test takers are likely to encounter outside of a language testing context. The DET deliberately includes tasks with different
levels of authenticity in order to maximize the efficiency of the test for measurement accuracy while also fully covering the intended
constructs.

*
Here we use the term Vocabulary in the broad sense, also referred to as Lexis, which includes the knowledge and use of words, word parts, multi-word units, and the
connections between them.

†
See Section 5.4 for information on subscores.


Table 1. Constructs and Task Types

Literacy (skills: Reading, Writing). Reading and writing English from basic informational text to advanced expository/persuasive texts at CEFR levels A1–C2.
Task types: Interactive Listening (S), Interactive Reading, Interactive Writing, Writing Sample, Picture Description (writing), C-test, Vocabulary in Context, Yes/No Vocabulary

Conversation (skills: Listening, Speaking). Listening and producing spoken English from basic discourse (e.g., informational) to advanced discourse (e.g., lectures) at CEFR levels A1–C2.
Task types: Interactive Listening (DC), Extended Speaking (audio prompt), Extended Speaking (text prompt), Speaking Sample, Picture Description (speaking), Dictation, Elicited Imitation

Comprehension (skills: Reading, Listening). Understanding spoken and written English from basic informational discourse (e.g., mini-talks) to advanced discourse (e.g., extended monologues) at CEFR levels A1–C2.
Task types: Interactive Listening (DC), Interactive Reading, C-test, Vocabulary in Context, Dictation, Elicited Imitation, Yes/No Vocabulary

Production (skills: Speaking, Writing). Producing spoken and written English from basic informational discourse (e.g., paragraphs) to advanced discourse (e.g., persuasive arguments) at CEFR levels A1–C2.
Task types: Interactive Listening (S), Interactive Writing, Extended Speaking (audio prompt), Extended Speaking (text prompt), Speaking Sample, Writing Sample, Picture Description (speaking), Picture Description (writing), Elicited Imitation

Note: S = summarization, DC = dialogue completion.

Each test task type corresponds to two test constructs (as indicated in Table 1 by each task type being listed twice), so task types cannot be presented linearly by construct. Task types are instead organized by the traditional language skills of reading,
listening, writing, and speaking. Within each subsection, we begin with the more authentic tasks. For the order in which tasks are
presented on the test and the number of each task type administered in a test session, see Section 5.1.

3.1 Reading Tasks


In this subsection we describe the test task types Interactive Reading, C-Test, Vocabulary in Context, and Yes/No Vocabulary. These
tasks primarily involve reading, although C-Test and Vocabulary in Context require test takers to respond with written input.


Interactive Reading

The Interactive Reading task type complements the other test task types that assess reading with a focus on reading processes (C-test
and Elicited Imitation) by focusing on reading comprehension (Park et al., 2022). This task type requires test takers to engage with
a text by sequentially performing a series of sub-tasks tapping different subconstructs of reading (reading for information, reading for
comprehension, reading for orientation, identifying cues and inferring) and all using the same text as the stimulus. This task type is
interactive in the sense that as the test progresses, different portions of the text are presented to the test taker, ensuring a comprehensive
evaluation of their reading skills. By incorporating interactivity, the DET interactive tasks provide a dynamic and immersive testing
experience that better reflects real-world language usage.

The first sub-task shows the test taker the first half of the text with 5–10 words missing (see Figure 2); test takers must select the word
that best fits each blank. Next, test takers are shown the remainder of the text with one sentence missing (see Figure 3); test takers
must select the sentence that best completes the passage from among several options. With the text now complete, test takers are shown
sequentially two questions and asked to highlight the part of the text that contains the answer (see Figure 4). Test takers are then asked
to select an idea that appears in the passage from among several options, only one of which is correct (see Figure 5). Finally, test takers
are asked to choose the best title for the text from among several options (see Figure 6).

Each Interactive Reading passage is classified by genre as either narrative or expository; each test taker receives one narrative passage
and one expository passage. Passages cover a range of topics and reflect the educational and occupational domains of language
use. Additionally, the number of complete-the-sentence blanks across the two tasks is controlled such that each test taker receives
approximately the same number. The time limit to complete the longer Interactive Reading task (i.e., more blanked words in the first
sub-task) is one minute longer than the time limit of the shorter task.

Figure 2. Example Interactive Reading "Complete the Sentences" Sub-task


Figure 3. Example Interactive Reading "Complete the Passage" Sub-task

Figure 4. Example Interactive Reading "Highlight the Answer" Sub-task

C-test

The C-test task type (see Figure 7) measures a test-taker’s global language proficiency in the written modality (Norris, 2018), capturing
chiefly knowledge of vocabulary and grammar (Eckes & Grotjahn, 2006). In addition, C-test scores correlate moderately well with
discrete language components including reading ability (Khodadady, 2014; Klein-Braley, 1997), spelling skills (Khodadady, 2014), and
vocabulary (Karimi, 2011). It has been shown that scores from C-tests are significantly correlated with scores from many other major
language proficiency tests (Daller et al., 2021; Khodadady, 2014).

Figure 5. Example Interactive Reading "Identify the Idea" Sub-task

Figure 6. Example Interactive Reading "Title the Passage" Sub-task

In this task, the test taker is presented with a short text. The first and last sentences of the text are fully intact, while alternating words in the intervening sentences are “damaged” by deleting the second half of the word. Test takers respond to the C-test items by completing the damaged words in the paragraph. Test takers need to rely on context and discourse information to reconstruct the damaged words, which include both function and content words spanning multiple parts of speech (i.e., lexical and morphosyntactic categories). The task thus relates to linguistic subskills including reading for information, inferring, and orthographic control (i.e., adherence to spelling and punctuation conventions).
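To illustrate the damaging convention described above, here is a minimal Python sketch. It reflects the general C-test format rather than the DET's production code; the `damage_word` helper and the underscore display for deleted characters are assumptions made for this example.

```python
def damage_word(word: str) -> str:
    """Delete the second half of a word; underscores stand in for the blanks
    the test taker must fill (one per deleted character)."""
    keep = (len(word) + 1) // 2
    return word[:keep] + "_" * (len(word) - keep)

def make_c_test(sentences: list[str]) -> str:
    """Damage every other word in the intervening sentences; the first and
    last sentences remain fully intact."""
    out = [sentences[0]]
    for sentence in sentences[1:-1]:
        words = sentence.split()
        for i in range(1, len(words), 2):  # alternating words
            words[i] = damage_word(words[i])
        out.append(" ".join(words))
    out.append(sentences[-1])
    return " ".join(out)

# Example with hypothetical text:
# make_c_test([
#     "The zoo opened early that morning.",
#     "Visitors arrived before the gates were ready.",
#     "Everyone enjoyed the day.",
# ])
# -> "The zoo opened early that morning. Visitors arri___ before th_ gates we__ ready. Everyone enjoyed the day."
```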

The C-test passages themselves reflect a range of different authentic text types from the educational and private domains of language
use, including fiction (e.g., colloquial narratives), news articles, and textbook passages. The linguistic features of these passages have
been carefully analyzed to ensure a variety of text types and difficulty levels. In total, more than 150 linguistic features are annotated and
accounted for, including features related to parts of speech, verb types, and passage length (see McCarthy et al., 2021, for the complete
list).


Figure 7. Example C-test Task

Vocabulary in Context

The Vocabulary in Context task type (see Figure 8) measures aspects of vocabulary knowledge relating to meaning, form, and use (Nation,
2001, 2013, 2022) – not only does it assess a wide range of different words (breadth), it also assesses how well a test taker knows different
dimensions of these words (depth), and it requires them to access these words in a limited amount of time (fluency). In terms of the
CEFR, this task type relates to the key elements of the Vocabulary range and Vocabulary control scales. As described in relation to the
Yes/No Vocabulary task type, vocabulary knowledge is essential for proficiency in all language skills.

The Vocabulary in Context task type follows the format of the productive vocabulary levels test (PVLT; Laufer & Nation, 1999), a measure of controlled (as opposed to “free”) productive vocabulary knowledge. In our version of this task type, test takers are
presented with a sentence that includes a damaged word (the first part of the word is given, and the second part is blank). Candidates for
target words are sourced from a large English language reference corpus across four different parts of speech (adjectives, adverbs, nouns,
and verbs). The sentences in which the target words are contextualized reflect five different genres (literary, textbook, news, personal
writing, and conversation).

Test takers must then complete the word so that it makes sense in the context of the sentence. The damaged portion of the word has a
blank space for each character, giving test takers clues about the length of the finished word and constraining possible responses. To
ensure that test takers are administered items that tap into multiple aspects of vocabulary knowledge, the test administration requires that
all test takers receive at least one item containing either an antonym or a synonym of the target word as a context clue, and at least one
additional item containing the target word in a collocation.

Yes/No Vocabulary

The “yes/no” vocabulary test (see Figure 9) measures breadth of receptive vocabulary knowledge (Beeckmans et al., 2001). Such
knowledge is a critical element of learning a second/additional language (L2) and impacts all language skills, including reading (Laufer,
1992; Roche & Harrington, 2014), listening (Bonk, 2000; Staehr, 2008), speaking (Milton, 2013; Milton et al., 2010), and writing (Kyle
& Crossley, 2016; Ruegg et al., 2011). More specifically, the “yes/no” vocabulary test has been shown to predict listening, reading, and
writing abilities (McLean et al., 2020; Milton et al., 2010; Staehr, 2008), and it has been used to assess vocabulary knowledge at various
CEFR levels (Milton, 2010).


Figure 8. Example Vocabulary in Context Task

In this task type, test takers are presented, one at a time, with stimuli that are either a written English word or a pseudo-word designed
to appear English-like* . Test takers respond by selecting “Yes” or “No” within five seconds after each word is presented. In order to
accurately distinguish real and pseudo-words quickly, test takers need to possess receptive knowledge of a range of lexis, including
spelling conventions and morphology. Traditional yes/no vocabulary tests simultaneously present a large set (e.g., 100) of mixed-
difficulty stimuli. On the DET, individual yes/no vocabulary stimuli are presented adaptively, based on how the test taker performed on
previous items (see Section 5.1 for more on the computer-adaptive administration).
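The pseudo-word stimuli are generated by a neural model and then filtered (the operational filtering, described in a footnote below, excludes real words and forms that resemble real words too closely). A minimal sketch of one plausible orthographic filtering step follows; the candidate list, word list, and similarity threshold are illustrative assumptions, and the phonetic check used operationally is omitted here.

```python
from difflib import SequenceMatcher

def too_similar(candidate: str, real_words: set[str], threshold: float = 0.9) -> bool:
    """Reject candidates that are orthographically too close to a real word."""
    return any(SequenceMatcher(None, candidate, w).ratio() >= threshold
               for w in real_words)

def filter_pseudowords(candidates: list[str], real_words: set[str]) -> list[str]:
    """Keep only candidates that are not real words (in any spelling) and are
    not overly similar to real words."""
    kept = []
    for cand in candidates:
        lowered = cand.lower()
        if lowered in real_words:
            continue
        if too_similar(lowered, real_words):
            continue
        kept.append(cand)
    return kept

# filter_pseudowords(["blurket", "colour"], {"color", "colour", "blanket"})
# -> ["blurket"]   ("colour" is a real word, so it is filtered out)
```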

3.2 Listening Tasks


In this subsection we describe the test task types Interactive Listening and Dictation. These tasks primarily involve listening to audio
stimuli, although both task types also require test takers to write.

Interactive Listening

The Interactive Listening task type contributes to measurement of the constructs of L2 listening, reading, and writing (LaFlair et al.,
2023). It complements the Dictation task type, which focuses more on listening processes, by also measuring aspects of interactional
competence. This listening task type is interactive in that it requires test takers to participate in a situationally driven conversation in a
university setting. The Interactive Listening task demonstrates correspondence to the TLU domain of English-medium postsecondary
studies through the conversation topics, interlocutors (students and professors), and communicative functions, which include asking for
clarification about lecture content, making requests, gathering information, asking for advice, planning study sessions, and participating
in other university-oriented conversations (Biber & Conrad, 2019).

An Interactive Listening task starts with a scenario that describes who the test taker is talking with and for what purpose. This scenario is
followed by the dialogue completion sub-task. Some dialogue completions require the test taker to select the first turn in the conversation,
while others start with the interlocutor. After each interlocutor turn (which is presented in audio format only), the test taker must select
the best response (among multiple options presented in writing) to continue the conversation (see Figure 10). The test taker receives
visual feedback after each response; if the response is correct, the box around the text turns green; otherwise, the box turns red, and the correct response is shown. In this way, test takers can respond to the remaining turns based on the intended input. Once the conversation ends, the test taker may use any remaining time to review the conversation before proceeding to the summary sub-task (see Figure 11). In the summarization sub-task, the test taker has 75 seconds to compose a written summary of the conversation.

*
We use an LSTM recurrent neural network trained on the English dictionary to create realistic pseudo-words, filtering out any real words, including any acceptable spellings, and pseudo-words that orthographically or phonetically resemble real English words too closely.

Figure 9. Example Yes/No Vocabulary Task
Each Interactive Listening task exhibits one of three types of conversations: student–student conversations that focus on requests, advice
seeking, and other university-oriented purposes; student–professor conversations that focus on similar purposes; and student–professor conversations that focus on information seeking, where the student needs to get information about a specific topic from their professor.
Each test session includes two Interactive Listening tasks: one student–student conversation and one student–professor conversation.

Figure 10. Example Interactive Listening "Dialogue Completion" Sub-task


Figure 11. Example Interactive Listening "Summarization" Sub-task

Dictation

Dictation is an integrated skills task (listening and writing) that assesses test-taker ability to recognize individual words and to hold them
in memory long enough to accurately reproduce them; both abilities are critical for spoken language understanding (Bradlow & Bent,
2002; Buck, 2001; Smith & Kosslyn, 2007). Dictation tasks have also been found to be associated with language-learner intelligibility
in speech production (Bradlow & Bent, 2008).

For the DET Dictation task, test takers listen to a spoken sentence or short passage and then transcribe it using the computer keyboard
(see Figure 12). Test takers have one minute to listen to the stimulus and transcribe what they heard. They can play the stimulus up
to three times. The stimuli used for Dictation items cover a range of language functions including requesting information, expressing
opinions, and stating facts. They also exhibit authentic language features such as contractions and elisions (e.g., pronouncing “camera”
as /kamra/). Given the time constraint and limited number of replays, an understanding of the vocabulary and grammar used in the
stimulus is necessary for error-free task completion.

3.3 Writing Tasks


In this subsection we describe the three open-ended writing task types, which measure test takers’ English writing abilities: Interactive
Writing, Writing Sample, and Picture Description (writing). All writing task types elicit written responses that evidence writing
proficiency in terms of the writing subconstructs of Content, Discourse coherence, Grammar, and Vocabulary and proficiency in
discussing topics in the different domains described in the CEFR (Personal, Public, Educational, and Professional).

The written prompts for the independent writing tasks ask test takers to recount an experience, give examples and recommendations, or argue a point of view, requiring them to use cohesive devices and formulate an argument. In the case of Interactive Writing (Figure 13), there are two parts. First, test takers are asked to respond to a prompt as described above. Next, their first response is analyzed in real time for topic-relevant themes, and they are asked a related follow-up question based on their initial response. This design is intended to more authentically reflect real-world writing scenarios and to elicit greater evidence of test takers' true writing proficiency. The final independent writing task in a test administration is the Writing Sample (Figure 14); a test taker's written response to this task is provided to institutions with which the test taker shares their results.

In the Picture Description (writing) task type (Figure 15), the stimuli (i.e., the photos) were selected by people with graduate degrees in
applied linguistics. These images are designed to give test takers the opportunity to display their full range of written language abilities
as they contain stimulating depictions of people, animals, and objects in a wide range of contexts. The images are therefore capable of eliciting writing from low-, mid-, and high-proficiency test takers. The Picture Description task provides opportunities for test takers to use descriptive language, whereas the other two writing task types require test takers to demonstrate more discursive knowledge of writing in addition to language knowledge (Weigle, 2002).

Figure 12. Example Dictation Task

Figure 13. Example Interactive Writing Task


Figure 14. Example Writing Sample Task

Figure 15. Example Picture Description (writing) Task

3.4 Speaking Tasks

In this subsection we describe the test task types Extended Speaking, Speaking Sample, Picture Description, and Elicited Imitation.
These task types require test takers to respond by speaking in response to various stimuli (text prompts, audio prompts, and images).


Open-Ended Speaking Tasks

Each test session includes multiple task types requiring an open-ended spoken response: prompt-based speaking tasks (Extended
Speaking [audio prompt], Extended Speaking [text prompt], and Speaking Sample, which is shared with institutions) and an image-based speaking task, Picture Description (speaking). All of these task types require test takers to speak for an extended time period
and to leverage different aspects of their organizational knowledge (e.g., grammar, vocabulary, and discourse coherence) and functional
elements of their pragmatic language knowledge (e.g., ideational knowledge; Bachman & Palmer, 1996). All open-ended speaking task
types elicit samples that evidence speaking proficiency in terms of the speaking subconstructs of Content, Discourse coherence, Grammar,
Vocabulary, Fluency, and Pronunciation. As with the writing tasks described previously, test takers must demonstrate proficiency in
discussing topics in the different domains described in the CEFR (Personal, Public, Educational, and Professional).
The open-ended speaking tasks are administered after the computer-adaptive portion of the test. As with the Picture Description (writing)
task, the stimuli (i.e., the photos) for Picture Description (speaking) were selected by people with graduate degrees in applied linguistics.
These images contain stimulating depictions of people, animals, and objects in a wide range of contexts, enabling test takers of all
ability levels to demonstrate lexical and grammatical proficiency. For the independent speaking prompts, three are presented as written
prompts and one as an aural prompt (see Figures 16–18). These prompts ask test takers to recount an experience, give examples and
recommendations, or argue a point of view. A recording of a test taker’s spoken response to the Speaking Sample task is provided to
institutions with which the test taker shares their results.

Figure 16. Example Picture Description (speaking) Task

Elicited Imitation

The read-aloud variation of the elicited imitation task (see Figure 19) is an integrated skills task measuring test-taker reading and speaking
abilities (Jessop et al., 2007; Litman et al., 2018; Vinther, 2002). The goal of this task is to evaluate the intelligibility and fluency of
speech production (i.e., phonological control), which are affected by segmental/phonemic and suprasegmental properties like intonation,
rhythm, and stress (Anderson-Hsieh et al., 1992; Derwing et al., 1998; Field, 2005; Hahn, 2004). Furthermore, intelligibility is correlated
with overall spoken comprehensibility (Derwing & Munro, 1997; Derwing et al., 1998; Munro & Derwing, 1995), meaning that this task
type can capture aspects of speaking proficiency.
This task type requires test takers to read, understand, and speak a sentence. After reading the target sentence, test takers respond by using the computer's microphone to record themselves reading the sentence aloud. Elicited Imitation prompts span a range of topics and language functions, including language use characteristic of academic contexts. In addition to performing automatic speech recognition, the DET uses state-of-the-art speech technologies to extract features of spoken language from test-taker responses, such as acoustic and fluency features that predict these properties, thus evaluating the general intelligibility of speech.


Figure 17. Example Extended Speaking (text prompt) / Speaking Sample Task

Figure 18. Example Extended Speaking (audio prompt) Task

4 Test Development

The Duolingo English Test’s test development processes are encompassed by the item factory, which is both a conceptual framework and
a system of interconnected software platforms and human review processes used by the DET for scalable high-volume item development
(von Davier et al., 2024). The item factory, based on the principles of architecture for smart manufacturing, is designed to streamline
the creation and review of test items through intelligent automation while incorporating expert human oversight. This system leverages advanced Natural Language Processing (NLP) and Machine Learning (ML) technologies to automate the initial generation of test items, ensuring a high level of linguistic variety and complexity suitable for a diverse global audience. Additionally, the item factory embodies human-in-the-loop AI, in which human oversight and quality control is built into every step (see Burstein et al., 2022).

Figure 19. Example Elicited Imitation Task
The item factory’s human review processes ensure consistency and efficiency while producing items that meet the quality standards
for a high-stakes international English language test, including fairness towards all test takers from diverse sociocultural backgrounds.
Involving humans at every stage of item development and review provides additional quality control and oversight, in alignment with
the DET’s RAI Standards (Burstein, 2023). See Figure 20 for a simplified heuristic of the item development process.

4.1 Task and Item Development


All Duolingo English Test task types are designed and approved by language testing specialists, adhering to the principles of the Expanded
Evidence-Centered Design (e-ECD; Arieli-Attali et al., 2019) embedded within the DET’s theoretical assessment ecosystem (Burstein
et al., 2022). Drawing on theoretical background such as the CEFR (Council of Europe, 2001, 2020), assessment researchers create
several variants of new item types for field testing to arrive at the final task design. The creation of these task types leverages authentic
English-language content, which not only ensures the relevance and real-world applicability of the items but also provides input for
automatic item generation (AIG). During the item design stage, the item type’s AIG process is also developed and refined. Thus, even
though all DET task types are designed by experts, AIG allows for the efficient production of a large number of items for each task type,
catering to different levels of proficiency and supporting a comprehensive and secure assessment system.
Designs for new item types are tested through the DET’s field testing platform, which allows prospective test takers to opt into trying
out new item types during the practice test. This enables DET researchers to test alternate item designs through randomized experiments
and iterate on item features based on empirical evidence. Furthermore, individual items that do not perform well psychometrically in
pilot testing are excluded from the operational test item pool. All generated items also go through an extensive item review process.

4.2 Item Generation and Review


Each DET item type is generated using unique methods and prompts. For example, Attali et al. (2022) details the use of Generative
Pre-trained Transformer 3 (GPT-3) to generate the passages and response options for the Interactive Reading item type. GPT-based AIG
involves fine-tuning the prompts to produce items aligned with the task specifications and minimize the proportion of items that are
unusable or require manual revision. Once a task’s design and generation prompt are finalized, hundreds or even thousands of items can
be generated in a short period. Due to the large item pool made possible by AIG, each test taker only sees a minuscule proportion of existing items, and any two test sessions are unlikely to share a single item. However, not all generated items meet the quality standards for use on the test, so a series of automatic filters and human reviews take place to prevent problematic content from appearing on the test.

Figure 20. DET Item Generation Process
Upon generation, items undergo a rigorous review process that combines automated checks with expert human evaluation. This multi-
stage review process guarantees that each item meets strict quality standards before being included in any item pool. Automated checks
are designed to catch obvious issues of linguistic accuracy and social appropriateness. The automatic filters use language models to
detect potentially biased or discriminatory language patterns. Items are also automatically screened for any words or phrases that would
not be acceptable in any context, such as terms for specific acts of violence.
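A minimal sketch of this kind of automated screen follows. It is illustrative only: the blocklist, the `bias_score_fn` classifier, and the threshold are placeholders, not the DET's actual filters.

```python
def passes_automatic_filters(item_text: str,
                             blocklist: set[str],
                             bias_score_fn,
                             threshold: float = 0.5) -> bool:
    """Screen a generated item: reject it if it contains a disallowed term,
    or if a bias/offensiveness classifier scores it above the threshold."""
    words = {w.strip(".,;:!?\"'()").lower() for w in item_text.split()}
    if words & blocklist:
        return False
    return bias_score_fn(item_text) < threshold

# Usage (with a placeholder classifier that always returns 0.0):
# passes_automatic_filters("A short passage about campus life.", {"slur"}, lambda t: 0.0)
# -> True
```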
Following the automated checks is the item quality review (IQR) stage, in which each item is reviewed (and potentially edited) by
subject matter experts (SMEs) to ensure the items are of sufficiently high quality. Written item content is reviewed for adherence to
English writing conventions. Audio stimuli are reviewed for accuracy of pronunciation and overall comprehensibility. Item content is
also fact-checked to ensure that test takers are not distracted or confused by inaccurate assertions (an example of mitigating the influence
of sociocognitive factors on test performance). Both automatic filters and SME reviews are facilitated by a flexible platform and human
management system that allow for seamless transitions between phases and collaboration among SMEs.
Next, items go through a sensitivity review referred to as fairness and bias (FAB) review. In the FAB review stage, each item is judged
by two or three human reviewers against internal FAB guidelines to ensure items are fair towards test takers of diverse identities and
backgrounds (e.g., cultural, socioeconomic, and gender). FAB raters are selected to represent diverse identities, perspectives, geographic
locations, work contexts, and linguistic backgrounds. As well, all raters have demonstrated experience and interest in promoting equity
and diversity. Raters are trained to identify potential sources of construct-irrelevant variance due to either specialized knowledge (e.g.,
highly technical or culture-specific information) or potentially offensive or distracting content (e.g., cultural stereotypes or descriptions
of violence). Items flagged for FAB issues are either edited to resolve the issue or excluded from the item bank if the issue is too extensive.
FAB rating data are also used to improve automatic flagging of potentially problematic items. In addition, differential item functioning
(DIF) analyses are conducted regularly after the test administrations (Belzak, 2023; Belzak et al., 2023).
The DET’s item review platform is an internally developed tool for the coordination and oversight of the item review process. It assigns
items to reviewers and provides reviewers with an interface for editing and rating items. It also tracks editorial suggestions and delivers
feedback on the outcomes of such suggestions. Reviewers also regularly receive training items interspersed among their assigned items
to monitor inter-rater consistency and provide corrective feedback to the reviewers. The platform facilitates management of the item
review process by allowing operations managers to set deadlines and track the progress of items through the process. The platform also


serves a quality assurance function by enabling the monitoring of quality of reviewers’ work (e.g., reviewing time and volume, inter-
reviewer agreement) and quality of item content (e.g., the number of edits to an item). Access to the item review platform is protected
with multiple security measures to prevent unauthorized access and maintain the security of the test content.

5 Item Delivery and Scoring


This section explains how the computer-adaptive portion of the Duolingo English Test works and how the items are scored. Additionally,
it provides information about the automated scoring systems for the speaking and writing tasks and how they were evaluated.

Once items are generated, calibrated ($\hat{b}_i$ estimates are made), and placed in the item pool, the DET uses CAT approaches to administer
and score tests (Segall, 2005; Wainer, 2000). Because computer-adaptive administration gives items to test takers conditional on their
estimated ability, CATs have been shown to be shorter (Thissen & Mislevy, 2000) and provide uniformly precise scores for most test
takers when compared to fixed-form tests (Weiss & Kingsbury, 1984).
The primary advantage of a CAT is that it can estimate test-taker ability ($\theta$) more precisely with fewer test items. The precision of the $\theta$ estimate depends on the item sequence: test takers of higher ability are best assessed by items with higher difficulty $b_i$ (and likewise for lower values of $\theta$ and $b_i$). The true value of a test taker's ability is unknown before test administration. As a result, an iterative, adaptive algorithm is required.

5.1 CAT Delivery


Each test session begins with a set of Yes/No Vocabulary tasks followed by Vocabulary in Context tasks. Table 2 lists the task types
administered on each Duolingo English Test session in the approximate order of administration. The task types are loosely organized
into the focus areas of “Linguistic resources” (which emphasizes more granular/bottom-up linguistic competences) and “Skills mastery”
(which emphasizes more communicative competence). The linguistic-resources task types are administered at the beginning of the test
because they take considerably less time per item and therefore provide an estimate of a test taker’s proficiency before administering the
more time-intensive skills-mastery task types. This measurement efficiency is what allows the DET to be completed in under an hour,
improving the accessibility of the test.

Table 2. Task Types and Administration Order

Task Type | Name for Test Takers | Adaptive | Freq

Focus area 1: Linguistic resources
Yes/No Vocabulary | Read and Select | Yes | 15–18
Vocabulary in Context | Fill in the Blanks | Yes | 6–9
C-test* | Read and Complete | Yes | 4–6
Dictation* | Listen and Type | Yes | 6–9
Elicited Imitation* | Read Aloud | Yes | 4–6

Focus area 2: Skills mastery
Interactive Reading | Interactive Reading | Yes | 2
Interactive Listening | Interactive Listening | Yes | 2
Picture Description (writing) | Write About the Photo | No | 3
Interactive Writing | Interactive Writing | No | 1
Picture Description (speaking) | Speak About the Photo | No | 1
Extended Speaking (text prompt) | Read, Then Speak | No | 1
Extended Speaking (audio prompt) | Listen, Then Speak | No | 2
Writing Sample | Writing Sample | No | 1
Speaking Sample | Speaking Sample | No | 1

*These item types are interspersed in the computer-adaptive portion of the test.

After the initial task, the CAT algorithm makes a provisional estimate $\hat{\theta}_t$ based on the test taker's responses up to time point $t$. Then the difficulty of the next item is selected as a function of the current estimate: $b_{t+1} = f(\hat{\theta}_t)$. The provisional estimate $\hat{\theta}_t$ is updated after each administered item. Essentially, $\hat{\theta}_t$ is the expected a posteriori estimate based on all the administered items up to time point $t$. This process repeats until a stopping criterion is satisfied.
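To make the cycle concrete, the following minimal sketch implements a simplified version of this loop. The item pool representation, the `answer_fn` response source, the `update_theta` EAP routine, and the fixed-item-count stopping rule are all illustrative assumptions; the operational DET also applies time limits, content constraints, and the other rules described below.

```python
def next_item(pool, theta_hat, administered):
    """Pick the unadministered item whose difficulty b is closest to the
    current ability estimate, i.e., b_{t+1} = f(theta_hat_t)."""
    candidates = [item for item in pool if item["id"] not in administered]
    return min(candidates, key=lambda item: abs(item["b"] - theta_hat))

def run_cat(pool, answer_fn, update_theta, n_items=20):
    """Administer items one at a time, updating the provisional EAP ability
    estimate after each response; here the stopping rule is a fixed item count."""
    responses, administered = [], set()
    theta_hat = 0.0                                # start at the prior mean
    for _ in range(n_items):
        item = next_item(pool, theta_hat, administered)
        administered.add(item["id"])
        responses.append((item, answer_fn(item)))  # answer_fn returns 1 or 0
        theta_hat = update_theta(responses)        # expected a posteriori update
    return theta_hat, responses
```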


The CAT approach, combined with concise and predictive task types, helps to minimize test administration time significantly. DET
sessions are variable-length, meaning that exam duration and number of items vary across administrations. The iterative, adaptive
procedure continues until the test exceeds a maximum length in terms of minutes or items, as long as a minimum number of items has
been administered. Almost all test takers complete the DET within an hour (including speaking and writing; excluding onboarding and
uploading).
Once the stopping criterion is satisfied, an ability estimate is calculated on responses to each CAT task type separately. These score
estimates of each CAT task type are then used with the scores of the Interactive Listening, speaking, and writing tasks to compute an
overall score and the four subscores.

5.2 CAT Item Scoring


All test items are graded automatically via statistical procedures appropriate for the task type. For three CAT task types—C-test,
Vocabulary in Context, and Yes/No Vocabulary—each task comprises discrete items to which responses are deemed correct or incorrect.
In the case of C-test and Vocabulary in Context tasks, completing each damaged word is a distinct item. For Yes/No Vocabulary, deciding
whether an individual stimulus is a real English word is an item. Such task types are scored with item response theory (IRT) models: a
2PL (two-parameter logistic) in the case of C-test and Vocabulary in Context tasks, and a 3PL with a fixed lower asymptote for Yes/No
Vocabulary tasks. Regression calibration (Carroll et al., 2006) was used in item calibration to control for the differing ability of test
takers responding to each item, since items are administered based on a test taker’s responses to previous items, and thus more difficult
items are seen by more able test takers. Each item of the aforementioned task types has a unique set of 2PL item parameters, which are
then used to estimate an expected ability based on a test taker’s responses to all administered items of the same type.
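A standard way to compute such an expected a posteriori (EAP) ability estimate under the 2PL model is numerical integration over a quadrature grid with a standard-normal prior. The sketch below is illustrative only; the grid settings and prior are assumptions, not the DET's actual configuration.

```python
import numpy as np

def p_correct_2pl(theta, a, b):
    """2PL probability of a correct response: 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_estimate(responses, grid=np.linspace(-4, 4, 121)):
    """EAP ability estimate from (a, b, score) triples, standard-normal prior."""
    prior = np.exp(-0.5 * grid ** 2)           # unnormalized N(0, 1) density
    likelihood = np.ones_like(grid)
    for a, b, score in responses:
        p = p_correct_2pl(grid, a, b)
        prob = p if score == 1 else 1.0 - p
        likelihood *= prob
    posterior = prior * likelihood
    posterior /= posterior.sum()
    return float((grid * posterior).sum())     # posterior mean of theta

# Example: three responses as (discrimination a, difficulty b, correct=1/incorrect=0)
# eap_estimate([(1.2, -0.5, 1), (0.9, 0.3, 1), (1.5, 1.0, 0)])
```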
For Dictation items, responses are graded on a [0, 1] scale as a function of the edit distance* . The maximum grade value is 1, occurring
when the provided response is identical to the expected response. Values less than one indicate various degrees of accuracy. Item grades
are then used, in combination with item difficulty parameters, to estimate test-taker ability. Because a substantial proportion of Dictation
responses receive a perfect grade, item difficulty parameters are estimated with a custom model that combines elements of models for
both discrete and continuous data, similar to the model of Molenaar et al. (2022).
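A minimal sketch of this kind of edit-distance-based grade on the [0, 1] scale is shown below. The specific normalization (dividing by the longer string's length) is an assumption for illustration, not necessarily the operational grading function.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def dictation_grade(response: str, reference: str) -> float:
    """Grade in [0, 1]: 1 means the response matches the reference exactly."""
    if not reference:
        return 1.0 if not response else 0.0
    distance = levenshtein(response.lower(), reference.lower())
    return max(0.0, 1.0 - distance / max(len(response), len(reference)))

# dictation_grade("I cant come tomorrow", "I can't come tomorrow")  # -> ~0.95
```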
The responses to the Elicited Imitation tasks are aligned against an expected reference text, and similarities and differences in the
alignment are evaluated to produce a grade on a [0, 1] scale. In addition to the alignment of responses to the target text, Elicited Imitation
grades also incorporate data on speaking features like pronunciation and rate of speech. An IRT model is used to produce scores on the
𝜃 scale.
The Interactive Reading task type comprises both selected-response sub-tasks with a clearly defined number of response options and a
highlight sub-task with a large, undefined number of possible responses. Items for selected-response sub-tasks are graded dichotomously
(correct/incorrect) and scores estimated via 2PL IRT models. Responses to the highlighting sub-task are compared to a single correct
answer (a particular part of the text). For grading purposes, a text selection is defined as a point in the two-dimensional space for the
location of the start and end indices of the selection. A continuous grade between 0 and 1 is then calculated based on the discrepancy
(geometric distance) between the point representations of the response and the correct answer. Scores for the highlight sub-tasks are
produced using the same scoring model as the Dictation tasks.
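The highlight-grading rule can be sketched as follows, treating each selection as the point (start, end) in index space. The linear mapping from distance to grade and the `max_distance` constant are illustrative assumptions rather than the DET's actual parameters.

```python
import math

def highlight_grade(response_span, key_span, max_distance=200.0):
    """Grade in [0, 1] from the Euclidean distance between the (start, end)
    character indices of the test taker's selection and the correct answer."""
    (r_start, r_end), (k_start, k_end) = response_span, key_span
    distance = math.hypot(r_start - k_start, r_end - k_end)
    return max(0.0, 1.0 - distance / max_distance)

# An exact match earns 1.0; a selection offset by a few characters still earns
# most of the credit, while a selection far from the key earns little.
# highlight_grade((120, 185), (118, 190))  # -> ~0.97
```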
The final CAT task type—Interactive Listening—also comprises a mixture of sub-tasks. Items for the dialogue completion sub-task
are graded dichotomously and scores estimated via 2PL IRT models. The written summary sub-task is scored using a sub-task-specific
automated scoring model, as described in the next section.

5.3 Open-Ended Speaking and Writing Task Scoring


The speaking and writing tasks are scored by automated scoring systems developed by experts at Duolingo in the fields of machine
learning (ML), natural language processing (NLP), and applied linguistics: the Duo Speaking Scorer and Duo Writing Scorer, with
separate scoring models for the different task types. These models evaluate each item response based on a number of theoretical speaking
and writing subconstructs (i.e., factors contributing to speaking and writing quality). These subconstructs are reflected in human scoring
rubrics† and are operationalized for automated scoring through the measurement of numerous research-supported linguistic features.
Table 3 presents these subconstructs for speaking and writing and provides examples of how these subconstructs are described in both
human and automated scoring.

*
“Edit distance” is a concept from natural language processing referring to the number of single-letter modifications necessary to transform one character string into
another. It is used as a measure of similarity between two text strings.

†
http://go.duolingo.com/DET_speaking_and_writing_rubrics


Table 3. Open-ended speaking and writing scoring subconstructs

Content
  Example dimensions: Task achievement, Relevance, Effect on the reader/listener, Appropriacy of style, Development
  Example automated feature: the cosine similarity between the prompt's embedding and the response's embedding (relevance feature)

Discourse coherence
  Example dimensions: Clarity, Cohesion, Progression of ideas, Appropriacy of format, Structure (writing only)
  Example automated feature: predicted coherence rating (0–6 scale) from a language model fine-tuned on SME ratings

Lexis
  Example dimensions: Lexical diversity, Lexical sophistication, Word choice, Word formation, Error severity, Spelling (writing only)
  Example automated feature: the proportion of lemmatized words from the response that are level CEFR C1 and above (lexical sophistication feature)

Grammar
  Example dimensions: Range of structures, Grammatical complexity, Error frequency, Error severity, Appropriacy
  Example automated feature: the mean tree depth among the dependency trees of each sentence in the response (grammatical complexity feature)

Fluency (speaking only)
  Example dimensions: Speed, Chunking, Breakdowns, Repairs
  Example automated feature: number of words per second (speed feature)

Pronunciation (speaking only)
  Example dimensions: Intelligibility, Comprehensibility, Individual sounds, Word stress, Rhythm, Sentence stress, Intonation, Connected speech
  Example automated feature: the acoustic model's confidence in the transcription (intelligibility feature)
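As a concrete example of one automated feature from Table 3, the relevance feature is described as the cosine similarity between the prompt's embedding and the response's embedding. A minimal sketch follows, assuming some text-embedding function `embed` (the actual embedding model is not specified in the manual).

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def relevance_feature(prompt: str, response: str, embed) -> float:
    """Relevance feature: cosine similarity between prompt and response embeddings;
    `embed` can be any sentence encoder returning a 1-D numpy array."""
    return cosine_similarity(embed(prompt), embed(response))
```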

Duo Writing Scorer was evaluated on 2,460 test sessions,* and the Duo Speaking Scorer was evaluated on 1,922 test sessions. Each
session had 1-2 responses rated by a human rater using the rubrics previously described, and the human ratings for each session
were averaged. These raters are experts in the field of English language teaching and assessment and possess the following minimum
requirements: 5 or more years of TESOL experience with adults; TESOL certification (e.g., Cambridge DELTA); relevant bachelor’s
degree or equivalent; expert English proficiency (C2 on CEFR); and experience in high-stakes speaking/writing assessment. Raters are
also selected to ensure a diversity of nationalities, geographic locations, linguistic backgrounds (including English varieties), and work
experiences. Raters undergo initial training, must pass a certification test, and are regularly monitored. Table 4 shows the correlation
(Pearson r) between the average human rating and the automated score (i.e., Human-Machine) and between human ratings for single
responses (i.e., Human-Human) for open-ended speaking and writing tasks. Pearson r correlations over 0.50 can be interpreted as large
(Cohen, 1988). That human–machine and human–human correlations are so similar indicates that the automated scorers are attending
to and equally weighting the same subconstructs of speaking and writing as the human raters.

Table 4. Correlation of productive skills tasks by scoring method

Task type | Human–Machine (r) | Human–Human (r)
Speaking tasks | 0.84 | 0.85
Writing tasks | 0.85 | 0.87

*
The dataset used for evaluating the Duo Writing Scorer pre-dated the introduction of the Interactive Listening summarization task.
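The evaluation statistic itself is a standard Pearson correlation between session-level automated scores and the averaged human ratings. A minimal sketch of that computation (illustrative only, with hypothetical toy data) is:

```python
import numpy as np

def human_machine_agreement(machine_scores, human_ratings_per_session):
    """Average the available human ratings for each session, then correlate
    the session-level human averages with the automated scores (Pearson r)."""
    human_avg = np.array([np.mean(r) for r in human_ratings_per_session])
    machine = np.asarray(machine_scores, dtype=float)
    return float(np.corrcoef(machine, human_avg)[0, 1])

# human_machine_agreement([4.5, 3.0, 5.0], [[4, 5], [3], [5, 5]])
```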


5.4 Subscores
In addition to the overall score, the DET reports four subscores* that are also on a scale of 10–160 and assess four integrated skill
areas: Literacy (reading and writing tasks), Conversation (speaking and listening tasks), Comprehension (reading and listening tasks),
and Production (speaking and writing tasks). LaFlair (2020) provides multivariate analyses of DET response data that support this
skill configuration and shows that subscores estimating these skills have reliability and added value (beyond an overall score) that meet
professional standards for subscore validity (Sinharay & Haberman, 2008). Each subscore can be interpreted as a combination of two
of the more traditional language subskills: speaking, writing, reading, and listening (SWRL). Figure 21 shows the relationship between
the DET task types, the subscores, and SWRL subskills.

Figure 21. Contribution of Task Types to DET Subscores

5.5 Score Reliability


The reliability of the DET is evaluated by examining the relationship between multiple scores from repeat test takers (test–retest
reliability) and the standard error of measurement (SEM). The data used to estimate each of these measures come from a subset of
the 478,378 certified tests administered between May 20, 2023 and May 19, 2024.
There are two main challenges with using repeaters to estimate test reliabilities for the full test-taking population. The first is that
repeaters are a self-selected, non-random subset of the full testing population. People who choose to repeat tend to represent a more
homogeneous, lower-ability subpopulation than the full testing population. Unless addressed, this reduction in ability heterogeneity will
tend to artificially reduce estimated reliabilities based on repeaters. The second challenge is that repeaters not only self-select to repeat
the test, but also self-select when to repeat the test. Some repeaters take the test twice in a short period, while other repeaters may wait

*
Due to the way the subscores are computed, there may be cases where test takers with the same overall score have different subscore profiles.


a year or more to retest. The more time that passes between repeat test takers’ sessions, the more opportunity there is for change in test
takers’ true proficiency. Change in proficiency over time, which varies across individuals, must also be accounted for to avoid artificially
reducing reliability estimates.

In order to address the challenges inherent to test–retest reliability, the analysis was conducted on a sample of repeaters who took the
DET twice within seven days. The restriction to such repeaters is intended to reduce the impact of heterogeneous proficiency changes
on estimated test–retest reliability. To address the fact that repeaters are different from the full population of first-time test takers, DET
assessment scientists used Minimum Discriminant Information Adjustment (MDIA; Haberman, 1984). Specifically, MDIA was used
to compute weights so that the weighted repeater sample matches all first-time test takers with respect to country, first language, age,
gender, computer operating system (Windows vs MacOS), TOEFL overall scores, IELTS overall scores, and the means and variances
of the DET scores on the first attempt. Weighting in this manner mitigates the potential biasing effects of repeater self-selection on
test–retest reliability estimates (Haberman & Yao, 2015). A weighted test–retest correlation was calculated for the overall score and
all four subscores. Bootstrapping was used to calculate normal 95% confidence intervals for each reliability estimate. This reliability
estimation method is described in greater detail in Belzak and Lockwood (in press).
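
A minimal sketch of the weighted correlation and bootstrap steps is given below. It assumes the MDIA weights have already been computed (the weighting step itself is not shown) and is illustrative rather than the production reliability code.

import numpy as np

def weighted_pearson(x, y, w):
    # Pearson correlation of first- and second-attempt scores under repeater weights w
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float) / np.sum(w)
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

def bootstrap_normal_ci(x, y, w, n_boot=2000, seed=0):
    # normal-approximation 95% confidence interval for the weighted test-retest correlation
    rng = np.random.default_rng(seed)
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    n = len(x)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)             # resample repeaters with replacement
        boot.append(weighted_pearson(x[idx], y[idx], w[idx]))
    r_hat = weighted_pearson(x, y, w)
    se = np.std(boot, ddof=1)
    return r_hat, (r_hat - 1.96 * se, r_hat + 1.96 * se)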

The point estimates and confidence intervals of the reliabilities for the DET overall score and subscores are shown in Table 5. The
subscore reliabilities are slightly lower than the overall score reliability. This finding is expected because subscores are calculated from
a smaller number of items. The SEM is estimated based on the standard deviation of the overall score or subscore and the corresponding
test–retest reliability estimate. The SEM is a statistic that reflects the accuracy or precision of a score, indicating how much a person’s
score might vary if the test were taken multiple times. A smaller SEM indicates higher test reliability.
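
Under the classical test theory relationship between reliability and measurement error (stated here as a gloss on the description above, not as the exact operational computation), the SEM follows directly from these two quantities:

    SEM = SD × √(1 − r_test–retest)

For example, combining the overall-score standard deviation reported in Table 12 (20.23) with the overall test–retest reliability of 0.94 gives 20.23 × √(1 − 0.94) ≈ 4.96, matching the overall SEM in Table 5.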

Table 5. Test-Retest Reliability and SEM Estimates (May 20, 2023 — May 19, 2024)

Score Test–Retest Lower CI Upper CI SEM


Literacy 0.91 0.91 0.91 6.50
Conversation 0.92 0.91 0.92 6.25
Comprehension 0.90 0.90 0.91 6.27
Production 0.95 0.94 0.95 5.40
Overall 0.94 0.94 0.94 4.96

6 Access and Accommodations


Given the Duolingo English Test’s mission to lower barriers and increase opportunities for English learners, broad accessibility is
one of the central motivations for the test’s existence and a primary consideration in any changes to the test. A combination of
universally accessible features and accommodations for test takers with disabilities ensures that all test takers have an equal opportunity
to demonstrate their English proficiency.

6.1 Access
The DET reflects principles of Universal Design (UD), a framework for designing products and spaces with the goal of maximum
accessibility from the start; the concept originated in the field of architecture but has also been adapted to assessment design (Thompson
et al., 2002). Maximizing test accessibility through intentional design benefits all test takers, both those with and without disabilities,
while simultaneously reducing the need for selective accommodations. The ethos of UD is evident in the origin of the DET and the DET’s
assessment ecosystem (Burstein et al., 2022), in which all aspects of test design and administration are infused with consideration of the
test-taker experience (TTX). The DET’s at-home on-demand approach, intuitive user interface, and asynchronous proctoring collectively
are designed to reduce physical, socioeconomic, and psychological barriers to test access and optimal test performance.

Perhaps the most salient accessibility benefit of the DET is that at-home testing obviates the need to travel to a physical test center.
Traveling to a test center can be burdensome for both socioeconomic and disability-related reasons. Test centers are necessarily
concentrated in relatively large urban areas, and some countries do not have any test centers that administer high-stakes ELP tests.
It is also not guaranteed that a prospective test taker can obtain a test seat at their closest test center at a time that meets their needs.
Many test takers therefore must spend time and money to travel significant distances, even internationally, in order to take a test. This
burden is compounded for test takers with disabilities, who might require special transportation or assistance. For such test takers, even
local travel can pose a non-trivial barrier. The DET allows most individuals to have their English proficiency evaluated from the most
accessible location—their own home.

[Figure legends: test centers per million people (0–10); internet users per million people (0–1,000,000); annual DET test takers per million people (0–750)]

Figure 22. Heatmaps of Test Center Accessibility as of 2024 (top), Internet Accessibility as of 2023 (middle), and Concentration of DET Test Takers in 12
months as of 2024 (bottom)


The AuthaGraph maps (Rudis & Kunimune, 2020) in Figure 22 visualize the issue of physical test access by showing the concentration
of test centers in the world (top panel) compared to internet penetration in the world (middle panel), and the concentration of DET test
takers (bottom panel; for all tests administered since August 1, 2017). The top two panels of Figure 22 show how much more easily
an internet-based test can be accessed than a test center (although Central Africa is admittedly underserved by both models). While
the ratio of population to internet access and to test center access is a somewhat limited metric, the potential audience for the DET is
clearly orders of magnitude larger than those with access to traditional test centers. As a more specific example, Figure 23 shows the
approximate locations of DET test takers relative to test centers currently approved for UK visa purposes. There are test takers on every
inhabited continent who would have to travel more than 1,000 km to reach the nearest approved test center. By delivering assessments
on-demand, 24 hours a day, on any of the world’s estimated two billion internet-connected computers, the DET is at the forefront of
maximizing test access while maintaining test use validity and test security.

Figure 23. Example: Distance of DET Test Takers from Test Centers Approved for UK Visas

In addition to lowering physical barriers to test access, the DET also embodies accessibility in the economic sense, most obviously
through its registration fee, which is a fraction of alternative tests’ fees. Furthermore, the DET does not charge extra fees for sharing
scores with institutions or appealing proctoring decisions. The DET’s at-home on-demand nature removes the need to travel to a test
center, potentially representing a cost saving several times greater than the test fee itself. These factors collectively reduce a potentially
insurmountable barrier to taking an English language proficiency test, and also make it more feasible for test takers to reattempt the
test if needed. The DET’s Access Program further reduces socioeconomic barriers for test takers with the greatest need by routinely
providing fee waivers to institutions, providing fee waivers to organizations working with populations affected by natural disasters and
armed conflicts, and partnering with the UNHCR to provide college counseling to refugee students.

Once test takers have gained access to the DET, the test’s design also reduces construct-irrelevant barriers to optimal test performance that
could arise during the testing experience. Testing at home gives test takers control over the setup of their testing environment, including
the furniture, lighting, and equipment, allowing them to take the test comfortably. This feature is particularly beneficial for test takers
with disabilities who may require medical devices or special computer equipment such as screen magnification or a special keyboard.
The ability to test in a comfortable and familiar environment can also reduce test anxiety (Stowell & Bennett, 2010). The relatively
short duration of the test, facilitated by the DET’s adaptive nature, may be beneficial for test takers who cannot sit and/or concentrate
continuously for long periods due to physical and/or psychological disabilities. The DET’s user interface complies with W3C Web
Content Accessibility Guidelines (WCAG) 2.1 Level AA. Furthermore, the DET’s use of asynchronous proctoring (see Section 7.4)
likely has a positive impact on TTX, as it does not require interaction with a human proctor and the accompanying concerns about
privacy and potential interruptions during testing.


6.2 Accommodations
The DET’s inherently accessible design features reduce the need for certain testing accommodations (e.g., extended breaks between test
sections). Nevertheless, the DET provides accommodations for both physical (e.g., visual or hearing impairment) and psychological
(e.g., autism spectrum disorder) conditions that could constitute construct-irrelevant barriers to optimal test performance. To receive an
accommodation, test takers must submit a request at https://englishtest.duolingo.com/accommodations describing both their reason
for requesting an accommodation (with supporting documentation, if applicable) and the accommodation requested. The available
accommodation options are

• 50% extra time per question


• Accessibility devices (alternate keyboard, etc.)
• Hearing aids
• Headphones
• Listening device
• Screen magnifier/reader
• Other accommodation (to be described by the test taker)

All requests for documented needs are accommodated to the extent reasonable. To ensure accessibility, we have significantly streamlined
the process for requesting accommodations compared to the industry standard. The DET requests similar documentation to other English
proficiency tests but only requires test takers to fill out a single online form. All inquiries receive a response within three days.

7 Test Administration and Security


The Duolingo English Test is administered to test takers via the internet. The security of DET scores is ensured through a robust and
secure onboarding process, automated security measures, rules that test takers must adhere to during the test administration, and a strict
proctoring process. All test sessions are proctored after the test has been administered and prior to score reporting. Additional security
is also provided by the DET’s large item bank, CAT format, and active monitoring of item exposure rates, which collectively minimize
the probability that test takers can gain any advantage through item pre-knowledge (i.e., exposure to test content before encountering it
during an operational test session). Item pre-knowledge is further minimized by preventing repeat test takers (i.e., individuals who take
the test more than once) from seeing the same item within a certain period.

Overall, the test security framework is an essential dimension of the larger assessment ecosystem (Burstein et al., 2022), used to protect
the integrity of test scores at all stages of the assessment process (LaFlair et al., 2022). The remainder of this section presents a summary
of the information found in the Security, Proctoring, and Accommodations whitepaper (Duolingo English Test, 2021).

7.1 Test Administration


Test takers are required to take the test alone in a quiet environment on a laptop or desktop computer running Windows or macOS and
equipped with a front-facing camera, a microphone, and speakers (headphones are not permitted). An internet connection with at least
2 Mbps download speed and 1 Mbps upload speed is recommended for test sessions. Test takers are required to take the test through the
DET desktop app, which provides a more stable and secure test-taking experience. Test takers are prompted to download and install the
desktop app after clicking “Start Test” on the DET website. The desktop app automatically prevents navigation away from the test and
blocks tools such as spelling and grammar checkers and automatic word completion.

7.2 Onboarding
Before the test is administered, test takers complete an onboarding process. This process checks that the computer’s microphone and
speaker work. It is also at this time that test takers are asked to show identification and are informed of the test’s administration rules,
which they must agree to follow before proceeding. In order to ensure test-taker identity, an identity document (ID) must be presented
to the webcam during onboarding. An image of the ID is captured* . IDs must meet certain criteria, such as being government-issued,
currently valid, and including a clear picture of the test taker.

*
ID images are stored temporarily in a highly secure digital repository in compliance with all applicable data privacy regulations and best practices.


7.3 Administration Rules


The behaviors that are prohibited during an administration of the DET are listed below. These rules require test takers to remain visible
to their cameras at all times and to keep their camera and microphone enabled throughout the test administration. The rules are displayed
in the test taker’s chosen interface language* to ensure comprehension. Test takers are required to acknowledge understanding and agree
to these rules before proceeding with the test. If the test session is automatically terminated for reasons such as moving the mouse off-
screen or a technical error, a test taker may attempt the test again for free, up to a total of three times. Test takers may contact customer
support to obtain additional test attempts in the case of recurring technical errors. Other reasons for test cancellation include:

• Leaving the camera preview


• Looking away from the screen
• Covering ears
• Leaving the web browser
– Leaving the window with the cursor
– Exiting full-screen mode
• Speaking when not instructed to do so
• Communicating with another person at any point
• Allowing others in the room
• Using any outside reference material
• Using a phone or other device
• Writing or reading notes
• Disabling the microphone or camera

7.4 Proctoring and Reporting


After the test has been completed and uploaded, all DET sessions undergo a thorough proctoring review by trained human proctors
with TESOL/applied linguistics expertise. This review is supplemented by artificial intelligence to call proctors’ attention to suspicious
behavior. Proctors have access to both audio and video recordings of the entire test session, including both a view of the test taker and a
recording of the computer screen. Each test session is reviewed independently by at least two proctors. When necessary, the test session
is sent to a third level of review, to be evaluated by a senior proctor or operations manager. This process takes no more than 48 hours
after the test has been uploaded. After the process has been completed, score reports are sent electronically to the test taker and any
institutions with which they have elected to share their scores. Test takers can share their scores with an unlimited number of institutions.
While AI provides assistance at every stage of proctoring, the proctors make the final decision on whether to certify a test. Certain
invalid results are eligible to be appealed within 72 hours by submitting a form from the test taker’s homepage describing the reason for
the appeal. Once the form has been submitted, the test taker will receive an emailed response within four business days informing them
of the appeal ruling.

8 Test-Taker Demographics
This section summarizes test-taker demographics based on all certified Duolingo English Test sessions between May 20, 2023 and May
19, 2024. During the onboarding and offboarding process of each test administration, test takers are asked to report their first language
(L1), date of birth, reason for taking the test, and their gender identity. The issuing country/region of test takers’ identity documents is
logged when they show government-issued identification during the onboarding process.

Reporting gender identity during the onboarding process is optional, but reporting date of birth is required. Table 6 shows an
approximately even distribution of male and female gender identities. However, the gender distribution of test takers varies considerably
across countries. Figure 24 depicts the proportion of reported gender identities for all countries with more than 300 test takers; the
distribution ranges from 78% male in the most male-skewed country to 66% female in the most female-skewed country.

The median test-taker age is 22. Table 7 shows that 81% of DET test takers are between 16 and 30 years of age at the time of test
administration.

*
Currently available user interface languages: Chinese, English, French, German, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Portuguese, Russian,
Spanish, Thai, Turkish, Vietnamese


Table 6. Percentages of Test-Taker Gender (May 20, 2023 — May 19, 2024)

Gender Percentage
Female 47.42%
Male 52.44%
Other 0.14%
Total 100.00%

Table 7. Percentages of Test-Taker Age (May 20, 2023 — May 19, 2024)

Age Percentage
< 16 3.28%
[16, 21) 32.18%
[21, 26) 34.34%
[26, 31) 14.93%
[31, 41) 11.30%
≥ 41 3.97%
Total 100.00%

Test takers are asked to report their L1s during the onboarding process. The most common first languages of DET test takers include
Mandarin, Spanish, Arabic, English* , French, and Portuguese (see Table 8). There are 148 unique L1s represented by test takers of the
DET, and the test has been administered to test takers from 214 countries and dependent territories. The full tables of all test-taker L1s
and places of origin can be found in the Appendix (Section 12).

Table 8. Most Frequent Test-Taker L1s (May 20, 2023 — May 19, 2024)

First Language
Chinese - Mandarin
Spanish
English
Telugu
Arabic
Portuguese
Korean
French
Hindi
Indonesian

For each test session, the issuing country of the test taker’s identity document is recorded, as well as the country in which they are taking
the test. For 82% of test takers, the ID issuing country and the country in which they take the test are the same. The other 18% represent
test takers who are presumably residing outside of their country of origin when they take the DET. Tables 9 and 10 display, for such test
takers, the top ten testing locations and the top ten ID issuing countries, respectively.

Test takers are also asked to optionally indicate their intention for taking the DET, choosing among applying to a school (secondary,
undergraduate, or graduate), job-related purposes, or none of the above. Table 11 presents the distribution of test-taker intentions.

*
60% of English-L1 test takers come from India and Canada

Figure 24. Proportion of Reported Gender Identities for all Countries and Territories with >300 Test Takers in Past 12 Months (only every other country
labeled)


Table 9. Most Frequent Testing Locations for Test Takers Residing Outside Their Country of Origin (May 20, 2023 — May 19, 2024)

Top testing locations


USA
Canada
UK
Ireland
China
Germany
Singapore
UAE
Hong Kong
France

Table 10. Most Frequent ID Issuing Countries for Test Takers Residing Outside Their Country of Origin (May 20, 2023 — May 19, 2024)

Top ID origins
China
India
South Korea
Brazil
Ukraine
Mexico
USA
Colombia
Japan
France

Table 11. Percentages of Test-Taker Intention (May 20, 2023 — May 19, 2024)

Intention Percentage
Undergrad 46.63%
Grad 37.45%
Secondary School 6.29%
Work 1.96%
None of the Above 7.39%

9 Test Performance Statistics


This section provides an overview of the statistical characteristics of the Duolingo English Test, including information about the score
distributions and concordance with tests of similar constructs. For reliability estimates of the overall score and subscores, see Section
5.5.

9.1 Score Distributions


Figure 25 shows the distribution of scores for the overall score and subscores (on the x-axis of each plot) using data from tests administered
between May 20, 2023 and May 19, 2024. From top to bottom, the panels show the distribution of test scores for the four subscores
and the overall score using three different visualization techniques. The left panels show a boxplot of the test scores. The center panels
show the density function of the test scores, and the right panels show the empirical cumulative distribution function (ECDF) of the test
scores. (The value of the ECDF at a given test score is the proportion of scores at or below that point.) The density plot (center panel)
provides evidence that the test is of appropriate difficulty (neither too easy nor too hard) for the test-taker population, given that there is
no apparent ceiling effect on the overall score.
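
As a minimal illustration of how the right-hand panels are constructed (not the code used to produce Figure 25), the ECDF value at each reported score point can be computed directly from a vector of scores:

import numpy as np

def ecdf_at(scores, grid):
    # proportion of scores at or below each grid point (the ECDF evaluated on the reporting scale)
    scores = np.asarray(scores, dtype=float)
    return np.array([np.mean(scores <= g) for g in grid])

det_scale = np.arange(10, 161, 5)    # DET reporting scale: 10 to 160 in steps of 5
# ecdf_at(overall_scores, det_scale) would give the curve in the right-hand "Total" panel,
# where overall_scores is a hypothetical vector of reported overall scores
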
The plots in Figure 25 show some negative skew, which is reflected in the descriptive statistics in Table 12. The overall score mean
and median are 110.26 and 110 respectively, and the interquartile range is 25. Tables 17–19 in the Appendix show the percentage and
cumulative percentage of the total test scores and subscores. These are numerical, tabled representations of the plots in Figure 25.


[Figure 25 panel rows, top to bottom: Literacy, Conversation, Comprehension, Production, Total; x-axis: score scale 10–160]
Figure 25. Boxplots (left), Density Plots (middle), and Empirical Cumulative Distribution Plots (right) of the Overall Score and Subscores.


Table 12. Descriptive Statistics for Total and Subscores (n = 99,415) (May 20, 2023 — May 19, 2024)

Score Mean SD 25th Percentile Median 75th Percentile


Comprehension 117.74 20.03 105 120 130
Conversation 101.23 21.68 90 100 115
Literacy 111.07 21.81 100 110 125
Production 88.86 23.35 75 90 105
Total 110.26 20.23 100 110 125

9.2 Relationship with Other Tests

In 2022, correlational and concordance studies were conducted (Cardwell et al., 2024) to examine the relationship between DET scores
and scores from TOEFL iBT and IELTS Academic—tests designed to measure similar constructs of English language proficiency and
used for the same purpose of postsecondary admissions. The data for these studies are the results of certified DET sessions since the
launch of the Integrated Reading task type on March 29, 2022, as well as associated TOEFL or IELTS scores for a subset of test takers.

DET assessment scientists designed a study to collect official TOEFL and IELTS score reports from DET test takers. Test takers could
submit official score reports in exchange for payment or a credit to take the DET again (referred to subsequently as the “official score
data”). Prior to any analysis, official score data were assembled, checked, and cleaned by Duolingo assessment scientists and a research
assistant. In order to achieve recommended minimum sample sizes of 1,500* (Kolen & Brennan, 2004, p. 304) for both TOEFL and
IELTS data, as well as to represent a greater range of test-taker ability, the official score data were supplemented with self-report data.
DET test takers have the opportunity to voluntarily report TOEFL or IELTS results at the end of each test session. Table 13 reports the
sizes of the final analytic samples after data cleaning (e.g., removing out-of-range scores and records with invalid subscore–overall score
relationships) and restricting the data to DET–TOEFL and DET–IELTS score pairs from test dates less than four months apart.

Table 13. Sample Sizes for Correlation and Concordance Analyses (March 29, 2022 — August 05, 2022)

TOEFL IELTS
Official 328 1,643
Self-report 1,095 4,420

Correlation

Pearson’s correlation coefficients were estimated from official score data to evaluate the relationship between the DET and the TOEFL
iBT and IELTS Academic (Table 14). The correlation coefficients show strong, positive relationships of DET scores with TOEFL iBT
scores and with IELTS scores. These relationships are visualized in Figure 26. The left panel shows the relationship between the DET
and TOEFL iBT, and the right panel shows the relationship between the DET and IELTS. Values in parentheses are the sample sizes
corresponding to each condition.

Table 14. Correlations Between DET Scores and TOEFL / IELTS Scores (March 29, 2022 — August 05, 2022)

TOEFL IELTS
All candidates .71 (328) .65 (1,643)
Center-based .82 (183) —
Home Edition .61 (145) —

*
This recommended minimum is for the equivalent-groups design. The necessary minimum sample size for a single-group design is theoretically smaller, but a specific
number is not given, and so we take 1,500 as the acceptable minimum.


[Figure 26 panels: DET score (10–160) vs. TOEFL iBT (0–120, left; Center-based and Home Edition marked separately) and vs. IELTS Academic (0–9, right)]

Figure 26. Relationship Between Test Scores

Concordance

Given that a sample size of 1,500 is the recommended minimum for building a concordance table using standard equipercentile equating
methods (Kolen & Brennan, 2004, p. 304), self-report and official data were both included in the concordance study. Assessment
scientists first used data of individuals who both self-reported an external score and submitted an official score report to estimate potential
reporting bias in self-report data. MDIA (Haberman, 1984) was used to correct for this reporting bias. Follow-up analyses demonstrated
that the resulting, adjusted scores had approximately the same properties as the official scores. The DET–IELTS concordance results
computed on the official data and on the combined data were compared to confirm that the combined data set is unbiased. The sample
of those who took both the DET and IELTS was sufficiently large to allow for this comparison. After correcting for reporting bias, the
self-report and official data were then combined prior to performing final equating. For individuals with external scores in both the
self-report and official score data, only the official score records were retained in the combined data.
Two types of equating were compared in a single-group equating design: equipercentile (Kolen & Brennan, 2004) and kernel equating
(von Davier et al., 2004). The equating study was conducted using the equate (Albano, 2016) and kequate (Andersson et al., 2013)
packages in R (R Core Team, 2022). Additionally, the data were presmoothed using log-linear models (von Davier et al., 2004) prior to
applying the equating methods. The equating methods were evaluated by looking at the final concordance as well as the standard error
of equating (SEE), which were estimated via bootstrapping. The final concordance was very similar when comparing equipercentile
and kernel equating methods. The standard errors were also very similar across equating methods, although kernel equating had slightly
lower and more stable standard errors than equipercentile equating, especially for IELTS given the shorter scale. For these reasons,
kernel equating was chosen as the final equating method. See Cardwell et al. (2024) for more details on the methods underlying the DET’s
concordance tables.
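
As a simplified sketch of the single-group equipercentile idea (illustrative only; the operational tables are built with presmoothing and kernel equating using the R packages cited above), a concordance can be read off by matching percentile ranks in a paired sample:

import numpy as np

def equipercentile_table(det_scores, external_scores, det_scale):
    # for each DET score on the reporting scale, return the external-test score with the same
    # percentile rank in the paired sample (a simplified equipercentile linking)
    det_scores = np.asarray(det_scores, dtype=float)
    external_scores = np.asarray(external_scores, dtype=float)
    table = {}
    for x in det_scale:
        p = np.mean(det_scores <= x)                 # percentile rank of x among observed DET scores
        table[x] = float(np.quantile(external_scores, p))
    return table

# e.g., equipercentile_table(det, ielts, np.arange(10, 161, 5)) for hypothetical paired DET-IELTS scores
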
The concordance with IELTS exhibits less error overall because the IELTS score scale contains fewer distinct score points (19 possible
band scores between 1 and 9) than the DET (31 possible score values), meaning test takers with the same DET score are very likely to
have the same IELTS score. Conversely, the TOEFL scale contains a greater number of distinct score points (121 unique score values),
leading to relatively more cases where a particular DET score can correspond to multiple TOEFL scores, which inflates the SEE. The
concordance tables can be found on the DET scores page (https://englishtest.duolingo.com/scores).

10 Quality Control
The unprecedented flexibility, complexity, and high-stakes nature of the Duolingo English Test pose quality assurance challenges. In
order to ensure the test is of high quality at all times, it is necessary to continuously monitor processes associated with the DET ecosystem
frameworks and key summary statistics of the test. Doing so allows for the prompt identification and correction of any anomalies.

10.1 Analytics for Quality Assurance in Assessment


The DET utilizes a custom-built psychometric quality assurance system, Analytics for Quality Assurance in Assessment (AQuAA; Liao
et al., 2022), to continuously monitor test metrics and trends in the test data. AQuAA is an interactive dashboard that blends educational
data mining techniques and psychometric theory, allowing the DET’s psychometricians and assessment scientists to continuously monitor


and evaluate the interaction between the test items, the test administration and scoring algorithms, and the samples of test takers, ensuring
scores are consistent over many test administrations. As depicted in Figure 27, test data such as test-taker demographics, item response
durations, and item scores are automatically imported into AQuAA from DET databases. These data are then used to calculate various
statistics, producing intermediate data files and data visualizations, which are regularly reviewed by a team of psychometricians in order
to promptly detect and respond to any anomalous events.

Figure 27. DET Quality Control Procedures

AQuAA monitors metrics over time in the following five categories, adjusting for seasonality effects.

1. Scores: Overall scores, subscores, and task type scores are tracked. Score-related statistics include the location and spread of
scores, inter-correlations between scores, internal consistency reliability measures and SEM, and correlation with self-reported
external measures.

2. Test-taker profile: The composition of the test-taker population is tracked over time, as demographic trends partially explain
seasonal variability in test scores. Specifically tracked are the percentages of test takers by country, first language (L1), gender,
age, intent in taking the test, and other background variables. In addition, many of the score statistics are tracked across major
test-taker groups.

3. Repeaters: Repeaters are defined as those who take the test more than once within a 30-day window. The prevalence, demographic
composition, and test performance of the repeater population are tracked. The performance of the repeater population is tracked
with many of the same test score statistics identified above, with additional statistics that are specific to repeaters: testing location
and distribution of scores from both the first and second test attempt, as well as their score change, and test–retest reliability (and
SEM).

4. Item analysis: Item quality is quantified with three categories of item performance statistics—item difficulty, item discrimination,
and item slowness (response time). Tracking these statistics allows for setting expectations about the item bank with respect to
item performance, flagging items with extreme and/or inadequate performance, and detecting drift in measures of performance
across time.

5. Item exposure: An important statistic in this category is the item exposure rate, which is calculated as the number of test
administrations containing a certain item divided by the total number of test administrations (a minimal computation is sketched
after this list). Tracking item exposure rates can help flag under- or over-exposure of items. Values of item exposure statistics
result from the interaction of various factors, including the size of the item bank and the item selection algorithm.
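
As a minimal sketch of the exposure-rate computation described in item 5 (assuming a hypothetical session log that records the item IDs administered in each test session), the rates could be computed as follows:

from collections import Counter

def item_exposure_rates(sessions):
    # sessions: an iterable of collections of item IDs administered in each test session
    counts = Counter()
    n_sessions = 0
    for items in sessions:
        n_sessions += 1
        counts.update(set(items))                    # count each item at most once per session
    return {item: count / n_sessions for item, count in counts.items()}

# item_exposure_rates([{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A"}])["A"] == 0.75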

The quality assurance of the DET is a combination of automatic processes and human review processes. The AQuAA system is used as
the starting point for the human review process, and the human review process, in turn, helps AQuAA to evolve into a more powerful tool
to detect assessment validity issues. Figure 28 depicts the human review process following every week’s update of AQuAA; assessment
experts meet to review all metrics for any potential anomalies. Automatic flags have also been implemented to indicate results that
warrant closer attention. The assessment experts review each flag individually to determine whether it is a false alarm or requires
further action. If the alarm is believed to be caused by a validity issue, follow-up actions are taken to determine the severity and
urgency of the issue, fix it, and document it. Improvements are regularly made to the automatic flagging mechanisms to minimize
false positives and false negatives, thereby improving AQuAA’s functionality.

Figure 28. AQuAA Expert Review Process

While the primary purpose of AQuAA is to facilitate quality control, it also helps DET developers continually improve the exam. Insights
drawn from AQuAA are used to direct the maintenance and improvement of other aspects of the assessment, such as item development.
Additionally, the AQuAA system itself is designed to be flexible, with the possibility to modify and add metrics in order to adapt as the
DET continues to evolve.

10.2 Proctoring Quality Assurance


In addition to psychometric quality assurance, DET proctoring quality is monitored regularly by assessment scientists and SMEs. A
variety of tools and metrics are used to evaluate decision consistency among DET proctors and improve accuracy of decision-making in
accordance with proctoring guidelines. These tools and metrics include:

Tools

• Monthly reports that track and evaluate proctors’ decisions over the last 12 months
– Used to identify outlier proctors, who then undergo retraining with senior proctors
• Proctor calibration tool that evaluates proctors’ decisions on the same test sessions and automatically provides immediate feedback
about the consensus answer (i.e., what the majority of proctors decide about a test session)
• Calibration meetings between senior and junior proctors, where senior proctors provide feedback on difficult proctoring sessions
in a group setting
• Personal training sessions where more experienced proctors shadow less experienced proctors and provide feedback
• Weekly quizzes on proctoring process changes

Metrics

• Percentage of test sessions determined to have rule violations, cheating outcomes, identification issues, or technical errors across
time
– Changes in the test taker population (e.g., due to seasonal trends or market forces) can lead to differences in these trends


• Variability in proctors’ decisions across all test sessions proctored, as well as on the same test sessions (e.g., see proctor calibration
tool)
• Percentage of decisions overturned between proctors with more and less experience
• Outliers in the percentage of flagged test-taker behaviors, both in terms of under- and overuse (e.g., see monthly reports)
• Average number of minutes taken to proctor a test, controlling for decision type (i.e., rule violation, cheating, etc.) and accuracy
of decision
• Test-taker score differences as a function of the type of test-taker behavior that is flagged
The tools and metrics used to monitor proctoring decisions help maintain high-quality, consistent proctoring by continually providing
formative feedback to proctors and identifying proctors in need of additional training or re-calibration. Additionally, insights from
proctoring quality assurance processes can lead to improvements in test administration and security. For instance, we can identify how
and where test takers most often violate rules unintentionally and then modify instructions to minimize rule violation. Maintaining a
high degree of consistency across proctors reinforces the security of the DET and ensures that test-taker sessions are reviewed equitably.

11 Conclusion
This version of the Technical Manual was produced on May 20, 2024. It provides a detailed overview of all facets of the Duolingo English
Test and reports evidence for the DET’s validity, reliability, and fairness as outlined in the Standards for Educational and Psychological
Testing (AERA et al., 2014). Updated versions of this document will be released to reflect changes to the test and new research findings.


12 Appendix
Table 15. Test-Taker L1s in Alphabetical Order (May 20, 2023 — May 19, 2024)

Afrikaans English Kanuri Minangkabau Swedish


Akan Estonian Kashmiri Mongolian Tagalog
Albanian Ewe Kazakh Mossi Tajik
Amharic Farsi Khmer Nauru Tamil
Arabic Fijian Kikuyu Nepali Tatar
Armenian Finnish Kinyarwanda Northern Sotho Telugu
Assamese French Kirundi Norwegian Thai
Aymara Fulah Kongo Oriya Tibetan
Azerbaijani Ga Konkani Oromo Tigrinya
Bambara Galician Korean Palauan Tonga
Bashkir Ganda Kosraean Pohnpeian Tswana
Basque Georgian Kurdish Polish Turkish
Belarusian German Kurukh Portuguese Turkmen
Bemba Greek Lao Punjabi Twi
Bengali Guarani Latvian Pushto Uighur
Bikol Gujarati Lingala Romanian Ukrainian
Bosnian Gwichin Lithuanian Russian Umbundu
Bulgarian Hausa Luba-Lulua Samoan Urdu
Burmese Hebrew Luo Santali Uzbek
Catalan Hiligaynon Luxembourgish Serbian Vietnamese
Cebuano Hindi Macedonian Sesotho Wolof
Chichewa (Nyanja) Hungarian Madurese Shona Xhosa
Chinese - Cantonese Icelandic Malagasy Sindhi Yapese
Chinese - Mandarin Igbo Malay Sinhalese Yoruba
Chuvash Iloko Malayalam Slovak Zhuang
Croatian Indonesian Maltese Slovenian Zulu
Czech Italian Mandingo Somali
Danish Japanese Marathi Spanish
Dutch Javanese Marshallese Sundanese
Efik Kannada Mende Swahili


Table 16. Test-Taker Country Origins in Alphabetical Order (May 20, 2023 — May 19, 2024)
Afghanistan Djibouti Lesotho Saint Kitts and Nevis
Albania Dominica Liberia Saint Lucia
Algeria Dominican Republic Libya Saint Vincent and the Grenadines
American Samoa Ecuador Liechtenstein Samoa
Andorra Egypt Lithuania San Marino
Angola El Salvador Luxembourg Sao Tome and Principe
Anguilla Equatorial Guinea Macao Saudi Arabia
Antigua and Barbuda Eritrea Madagascar Senegal
Argentina Estonia Malawi Serbia
Armenia Eswatini Malaysia Seychelles
Aruba Ethiopia Maldives Sierra Leone
Australia Faroe Islands Mali Singapore
Austria Fiji Malta Sint Maarten (Dutch)
Azerbaijan Finland Marshall Islands Slovakia
Bahamas France Mauritania Slovenia
Bahrain Gabon Mauritius Solomon Islands
Bangladesh Gambia Mexico Somalia
Barbados Georgia Micronesia (Federated States) South Africa
Belarus Germany Monaco South Sudan
Belgium Ghana Mongolia Spain
Belize Gibraltar Montenegro Sri Lanka
Benin Greece Montserrat State of Palestine
Bermuda Greenland Morocco Sudan
Bhutan Grenada Mozambique Suriname
Bolivarian Republic of Venezuela Guatemala Myanmar Sweden
Bolivia Guinea Namibia Switzerland
Bosnia and Herzegovina Guinea-Bissau Nauru Taiwan
Botswana Guyana Nepal Tajikistan
Brazil Haiti Netherlands Thailand
Brunei Darussalam Holy See New Zealand Timor-Leste
Bulgaria Honduras Nicaragua Togo
Burkina Faso Hong Kong Niger Tonga
Burundi Hungary Nigeria Trinidad and Tobago
Cabo Verde Iceland North Macedonia Tunisia
Cambodia India Norway Turkey
Cameroon Indonesia Oman Turkmenistan
Canada Iraq Pakistan Turks and Caicos Islands
Cayman Islands Ireland Palau Uganda
Central African Republic Isle of Man Panama Ukraine
Chad Israel Papua New Guinea United Arab Emirates
Chile Italy Paraguay United Kingdom of Great Britain and Northern Ireland
China Jamaica Peru United Republic of Tanzania
Colombia Japan Philippines United States of America
Comoros Jersey Poland Uruguay
Congo Jordan Portugal Uzbekistan
Congo (Democratic Republic) Kazakhstan Puerto Rico Vanuatu
Costa Rica Kenya Qatar Viet Nam
Croatia Kiribati Republic of Korea Virgin Islands (U.S.)
Cuba Kuwait Republic of Moldova Yemen
Cyprus Kyrgyzstan Romania Zambia
Czechia Lao People’s Democratic Republic Russian Federation Zimbabwe
Côte d’Ivoire Latvia Rwanda Åland Islands
Denmark Lebanon Saint Helena, Ascension and Tristan da Cunha


Table 17. Percentage Distribution Overall Score (May 20, 2023 — May 19, 2024)

Total Percentage Cumulative percentage


160 0.10% 100.00%
155 0.48% 99.90%
150 1.29% 99.42%
145 2.40% 98.14%
140 3.70% 95.74%
135 5.08% 92.03%
130 6.58% 86.96%
125 8.00% 80.38%
120 9.29% 72.37%
115 10.25% 63.08%
110 10.53% 52.83%
105 9.99% 42.30%
100 8.54% 32.31%
95 6.94% 23.77%
90 5.05% 16.83%
85 3.66% 11.78%
80 2.50% 8.12%
75 1.72% 5.62%
70 1.16% 3.90%
65 0.80% 2.74%
60 0.55% 1.94%
55 0.39% 1.39%
50 0.29% 1.00%
45 0.20% 0.72%
40 0.15% 0.52%
35 0.11% 0.37%
30 0.08% 0.26%
25 0.06% 0.18%
20 0.05% 0.12%
15 0.03% 0.08%
10 0.04% 0.04%


Table 18. Subscore Percentage Distributions (May 20, 2023 — May 19, 2024)

Conversation Literacy Comprehension Production


160 0.07% 0.60% 0.67% 0.04%
155 0.16% 1.03% 1.56% 0.07%
150 0.43% 1.86% 3.14% 0.17%
145 1.03% 2.97% 4.70% 0.39%
140 1.91% 4.12% 6.03% 0.75%
135 2.98% 5.42% 7.47% 1.23%
130 4.08% 6.59% 8.84% 1.88%
125 5.36% 7.75% 9.98% 2.60%
120 6.75% 8.91% 10.46% 3.46%
115 8.10% 9.74% 10.20% 4.48%
110 9.06% 9.78% 9.18% 5.64%
105 9.60% 9.26% 7.67% 6.81%
100 9.58% 8.02% 5.95% 8.02%
95 8.89% 6.49% 4.38% 8.78%
90 7.73% 4.95% 3.05% 9.05%
85 6.31% 3.59% 2.12% 8.97%
80 4.98% 2.59% 1.45% 8.34%
75 3.73% 1.82% 0.97% 7.12%
70 2.74% 1.26% 0.64% 5.75%
65 1.97% 0.90% 0.43% 4.39%
60 1.41% 0.63% 0.33% 3.33%
55 0.97% 0.47% 0.21% 2.41%
50 0.68% 0.33% 0.15% 1.81%
45 0.48% 0.24% 0.11% 1.30%
40 0.31% 0.18% 0.08% 0.94%
35 0.22% 0.13% 0.06% 0.68%
30 0.16% 0.11% 0.04% 0.49%
25 0.11% 0.07% 0.03% 0.35%
20 0.08% 0.05% 0.03% 0.26%
15 0.06% 0.03% 0.02% 0.20%
10 0.06% 0.10% 0.04% 0.29%


Table 19. Subscore Cumulative Percentage Distributions (May 20, 2023 — May 19, 2024)

Conversation Literacy Comprehension Production


160 100.00% 100.00% 100.00% 100.00%
155 99.93% 99.40% 99.33% 99.96%
150 99.77% 98.37% 97.77% 99.89%
145 99.34% 96.51% 94.63% 99.72%
140 98.31% 93.55% 89.93% 99.32%
135 96.40% 89.42% 83.90% 98.57%
130 93.42% 84.01% 76.43% 97.34%
125 89.34% 77.41% 67.59% 95.46%
120 83.98% 69.67% 57.62% 92.86%
115 77.23% 60.75% 47.15% 89.39%
110 69.13% 51.01% 36.95% 84.92%
105 60.07% 41.23% 27.77% 79.28%
100 50.47% 31.97% 20.10% 72.47%
95 40.90% 23.95% 14.15% 64.45%
90 32.01% 17.46% 9.77% 55.66%
85 24.28% 12.51% 6.72% 46.61%
80 17.96% 8.92% 4.60% 37.65%
75 12.98% 6.33% 3.16% 29.31%
70 9.25% 4.51% 2.19% 22.19%
65 6.51% 3.25% 1.54% 16.44%
60 4.53% 2.35% 1.11% 12.05%
55 3.12% 1.71% 0.78% 8.72%
50 2.16% 1.25% 0.57% 6.31%
45 1.48% 0.91% 0.42% 4.50%
40 1.00% 0.67% 0.31% 3.20%
35 0.70% 0.49% 0.23% 2.26%
30 0.47% 0.36% 0.17% 1.59%
25 0.31% 0.25% 0.13% 1.09%
20 0.20% 0.19% 0.10% 0.75%
15 0.12% 0.13% 0.07% 0.49%
10 0.06% 0.10% 0.04% 0.29%

References

AERA, APA, & NCME. (2014). Standards for educational and psychological testing.
Albano, A. (2016). equate: An R package for observed-score linking and equating. Journal of Statistical Software, 74(8), 1–36. https://d
oi.org/10.18637/jss.v074.i08
Anderson-Hsieh, J., Johnson, R., & Koehler, K. (1992). The relationship between native speaker judgments of nonnative pronunciation
and deviance in segmentals, prosody, and syllable structure. Language Learning, 42, 529–555. https://doi.org/10.1111/j.1467-
1770.1992.tb01043.x
Andersson, B., Bränberg, K., & Wiberg, M. (2013). Performing the kernel method of test equating with the package kequate. Journal of
Statistical Software, 55(6), 1–25. http://www.jstatsoft.org/v55/i06/
Arieli-Attali, M., Ward, S., Thomas, J., Deonovic, B., & von Davier, A. A. (2019). The expanded evidence-centered design (e-ECD)
for learning and assessment systems: A framework for incorporating learning goals and processes within assessment design.
Frontiers in Psychology, 10, 853. https://doi.org/10.3389/fpsyg.2019.00853
Attali, Y., Runge, A., LaFlair, G. T., Yancey, K., Goodwin, S., Park, Y., & von Davier, A. A. (2022). The interactive reading task:
Transformer-based automatic item generation. Frontiers in Artificial Intelligence, 5. https://doi.org/10.3389/frai.2022.903077
Bachman, L., & Palmer, A. (1996). Language testing in practice. Oxford University Press.
Bachman, L., & Palmer, A. (2010). Language assessment in practice: Developing language assessments and justifying their use in the
real world. Oxford University Press.
Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van de Velde, H. (2001). Examining the yes/no vocabulary test: Some
methodological issues in theory and practice. Language Testing, 18(3), 235–274. https://doi.org/10.1177/02655322010180030
1
Belzak, W. C. (2023). The multidimensionality of measurement bias in high-stakes testing: Using machine learning to evaluate complex
sources of differential item functioning. Educational Measurement: Issues and Practice, 42(1), 24–33. https://doi.org/10.1111
/emip.12486
Belzak, W. C., & Lockwood, J. R. (in press). Estimating test-retest reliability in the presence of self-selection bias and learning/practice
effects. Applied Psychological Measurement.
Belzak, W. C., Naismith, B., & Burstein, J. (2023). Ensuring fairness of human- and AI-generated test items. In N. Wang, G. Rebolledo-
Mendez, V. Dimitrova, N. Matsuda, & O. C. Santos (Eds.), Artificial intelligence in education. posters and late breaking
results, workshops and tutorials, industry and innovation tracks, practitioners, doctoral consortium and blue sky (pp. 701–
707). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-36336-8_108
Biber, D., & Conrad, S. (2019). Register, genre, and style (2nd). Cambridge University Press. https://doi.org/10.1017/9781108686136
Bonk, W. J. (2000). Second language lexical knowledge and listening comprehension. International Journal of Listening, 14(1), 14–31.
https://doi.org/10.1080/10904018.2000.10499033
Bradlow, A., & Bent, T. (2002). The clear speech effect for non-native listeners. Journal of the Acoustical Society of America, 112,
272–284. https://doi.org/10.1121/1.1487837
Bradlow, A., & Bent, T. (2008). Perceptual adaptation to non-native speech. Cognition, 106, 707–729. https://doi.org/10.1016/j.cogniti
on.2007.04.005
Buck, G. (2001). Assessing listening. Cambridge University Press.
Burstein, J. (2023). The Duolingo English Test Responsible AI Standards (tech. rep.) (Updated March 29, 2024). https://go.duolingo.co
m/ResponsibleAI
Burstein, J., LaFlair, G. T., Kunnan, A. J., & von Davier, A. A. (2022). A theoretical assessment ecosystem for a digital-first assessment—
The Duolingo English Test (Duolingo Research Report No. DRR-22-01). Duolingo. https://go.duolingo.com/ecosystem
Cardwell, R. L., Nydick, S. W., Lockwood, J., & von Davier, A. A. (2024). Practical considerations when building concordances between
English tests. Language Testing, 41(1), 192–202. https://doi.org/10.1177/02655322231195027
Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M. (2006). Measurement error in nonlinear models: A modern perspective.
CRC press.
Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces
between second language acquisition and language testing research (pp. 32–70). Cambridge University Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Routledge. https://doi.org/10.4324/9780203771587
Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge
University Press.
Council of Europe. (2020). Common European Framework of Reference for Languages: Learning, teaching, assessment – companion
volume. https://www.coe.int/lang-cefr
Daller, M., Müller, A., & Wang-Taylor, Y. (2021). The C-test as predictor of the academic success of international students. International
Journal of Bilingual Education and Bilingualism, 24(10), 1502–1511. https://doi.org/10.1080/13670050.2020.1747975
Derwing, T., & Munro, M. (1997). Accent, intelligibility, and comprehensibility: Evidence from four L1s. Studies in Second Language
Acquisition, 19(1), 1–16. https://www.jstor.org/stable/44488664


Derwing, T., Munro, M., & Wiebe, G. (1998). Evidence in favor of a broad framework for pronunciation instruction. Language Learning,
48, 393–410. https://doi.org/10.1111/0023-8333.00047
Duolingo English Test. (2021). Duolingo English Test: Security, proctoring, and accommodations (tech. rep.). Duolingo. https://duolin
go-papers.s3.amazonaws.com/other/det-security-proctoring-whitepaper.pdf
Eckes, T., & Grotjahn, R. (2006). A closer look at the construct validity of C-tests. Language Testing, 23(3), 290–325. https://doi.org/1
0.1191/0265532206lt330oa
Field, J. (2005). Intelligibility and the listener: The role of lexical stress. TESOL Quarterly, 39, 399–423. https://doi.org/10.2307/3588487
Goodwin, S., Attali, Y., LaFlair, G. T., Runge, A., Park, Y., von Davier, A. A., & Yancey, K. P. (2023). Duolingo English Test: Writing
construct (Duolingo Research Report No. DRR-22-03). Duolingo. https://go.duolingo.com/scored-writing
Goodwin, S., & Naismith, B. (2023). Assessing listening on the Duolingo English Test (Duolingo Research Report No. DRR-23-02).
Duolingo. http://duolingo-testcenter.s3.amazonaws.com/media/resources/listening-whitepaper.pdf
Haberman, S. (1984). Adjustment by minimum discriminant information. The Annals of Statistics, 12, 971–988. https://doi.org/10.1214
/aos/1176346715
Haberman, S., & Yao, L. (2015). Repeater analysis for combining information from different assessments. Journal of Educational
Measurement, 52, 223–251. https://doi.org/10.1111/jedm.12075
Hahn, L. (2004). Primary stress and intelligibility: Research to motivate the teaching of suprasegmentals. TESOL Quarterly, 38, 201–223.
https://doi.org/10.2307/3588378
Jessop, L., Suzuki, W., & Tomita, Y. (2007). Elicited imitation in second language acquisition research. Canadian Modern Language
Review, 64(1), 215–238. https://doi.org/10.3138/cmlr.64.1.215
Karimi, N. (2011). C-test and vocabulary knowledge. Language Testing in Asia, 1(4), 7. https://doi.org/10.1186/2229-0443-1-4-7
Khodadady, E. (2014). Construct validity of C-tests: A factorial approach. Journal of Language Teaching and Research, 5. https://doi.o
rg/10.4304/jltr.5.6.1353-1362
Klein-Braley, C. (1997). C-Tests in the context of reduced redundancy testing: An appraisal. Language Testing, 14(1), 47–84. https://do
i.org/10.1177/026553229701400104
Kolen, M., & Brennan, R. (2004). Test equating methods and practices. Springer-Verlag.
Kyle, K., & Crossley, S. A. (2016). The relationship between lexical sophistication and independent and source-based writing. Journal
of Second Language Writing, 34, 12–24. https://doi.org/10.1016/j.jslw.2016.10.003
LaFlair, G. T. (2020). Duolingo English Test: Subscores (Duolingo Research Report No. DRR-20-03). Duolingo. https://duolingo-pape
rs.s3.amazonaws.com/reports/subscore-whitepaper.pdf
LaFlair, G. T., Langenfeld, T., Baig, B., Horie, A. K., Attali, Y., & von Davier, A. A. (2022). Digital-first assessments: A security
framework. Journal of Computer Assisted Learning. https://doi.org/10.1111/jcal.12665
LaFlair, G. T., Runge, A., Attali, Y., Park, Y., Church, J., & Goodwin, S. (2023). Interactive listening—The Duolingo English Test
(Duolingo Research Report No. DRR-23-01). Duolingo.
Laufer, B., & Nation, P. (1999). A vocabulary-size test of controlled productive ability. Language Testing, 16(1), 33–51. https://doi.org
/10.1177/026553229901600103
Laufer, B. (1992). Reading in a foreign language: How does L2 lexical knowledge interact with the reader’s general academic ability.
Journal of Research in Reading, 15(2), 95–103. https://doi.org/10.1111/j.1467-9817.1992.tb00025.x
Liao, M., Attali, Y., Lockwood, J. R., & von Davier, A. A. (2022). Maintaining and monitoring quality of a continuously administered
digital assessment. Frontiers in Education, 7. https://doi.org/10.3389/feduc.2022.857496
Litman, D., Strik, H., & Lim, G. S. (2018). Speech technologies and the assessment of second language speaking: Approaches, challenges,
and opportunities. Language Assessment Quarterly, 15(3), 294–309. https://doi.org/10.1080/15434303.2018.1472265
McCarthy, A. D., Yancey, K. P., LaFlair, G. T., Egbert, J., Liao, M., & Settles, B. (2021). Jump-starting item parameters for adaptive
language tests. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 883–899. https:
//doi.org/10.18653/v1/2021.emnlp-main.67
McLean, S., Stewart, J., & Batty, A. O. (2020). Predicting L2 reading proficiency with modalities of vocabulary knowledge: A
bootstrapping approach. Language Testing, 37(3), 389–411. https://doi.org/10.1177/0265532219898380
Messick, S. (1989). Validity. In Educational measurement, 3rd ed (pp. 13–103). American Council on Education.
Messick, S. (1996). Validity of performance assessments. In G. W. Phillips (Ed.), Technical issues in large-scale performance assessment
(pp. 1–18). US Department of Education, Office of Educational Research; Improvement.
Milton, J. (2010). The development of vocabulary breadth across the CEFR levels. In I. Bartning, M. Martin, & I. Vedder (Eds.),
Communicative proficiency and linguistic development: Intersections between SLA and language testing research (pp. 211–
232, Vol. 1). EuroSLA.
Milton, J. (2013). Measuring the contribution of vocabulary knowledge to proficiency in the four skills. In C. Bardel, C. Lindqvist, &
B. Laufer (Eds.), Eurosla monographs series 2 (pp. 57–78). European Second Language Association.


Milton, J., Wade, J., & Hopkins, N. (2010). Aural word recognition and oral competence in English as a foreign language. In R. Chacón-
Beltrán, C. Abello-Contesse, & M. Torreblanca-López (Eds.), Insights into non-native vocabulary teaching and learning
(pp. 83–98, Vol. 52). Multilingual Matters.
Molenaar, D., Cúri, M., & Bazán, J. L. (2022). Zero and one inflated item response theory models for bounded continuous data. Journal
of Educational and Behavioral Statistics, 47(6), 693–735. https://doi.org/10.3102/10769986221108455
Munro, M., & Derwing, T. (1995). Foreign accent, comprehensibility, and intelligibility in the speech of second language learners.
Language Learning, 45, 73–97. https://doi.org/10.1111/j.1467-1770.1995.tb00963.x
Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge University Press.
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.). Cambridge University Press.
Nation, I. S. P. (2022). Learning vocabulary in another language (3rd ed.). Cambridge University Press. https://doi.org/10.1017/978100
9093873
Norris, J. (2018). Developing C-tests for estimating proficiency in foreign language research. Peter Lang. https://doi.org/10.3726/b13235
Park, Y., LaFlair, G. T., Attali, Y., Runge, A., & Goodwin, S. (2022). Duolingo English Test: Interactive reading (Duolingo Research
Report No. DRR-22-02). Duolingo. https://duolingo-papers.s3.amazonaws.com/other/mpr-whitepaper.pdf
Park, Y., Cardwell, R., Goodwin, S., Naismith, B., LaFlair, G., Loh, K., & Yancey, K. (2023). Assessing speaking on the Duolingo English
Test (Duolingo Research Report No. DRR-23-03). Duolingo. https://duolingo-testcenter.s3.amazonaws.com/media/resources/speaking-
whitepaper.pdf
Park, Y., Cardwell, R. L., & Naismith, B. (2024). Assessing vocabulary on the Duolingo English Test (Duolingo Research Report
No. DRR-24-01). Duolingo.
R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria.
https://www.R-project.org/
Roche, T., & Harrington, M. (2014). Vocabulary knowledge and its relationship with EAP proficiency and academic achievement in an
English-medium university in Oman. In R. Al-Mahrooqi & A. Roscoe (Eds.), Focusing on EFL reading: Theory and practice
(pp. 27–41). Cambridge Scholars Publishing.
Rudis, B., & Kunimune, J. (2020). Imago: Hacky world map geojson based on the imago projection [R package version 0.1.0]. https://g
it.rud.is/hrbrmstr/imago
Ruegg, R., Fritz, E., & Holland, J. (2011). Rater sensitivity to qualities of lexis in writing. TESOL Quarterly, 45(1), 63–80. http://www
.jstor.org/stable/41307616
Segall, D. O. (2005). Computerized adaptive testing. In K. Kempf-Leonard (Ed.), Encyclopedia of social measurement (pp. 429–438).
Elsevier. https://doi.org/10.1016/B0-12-369398-5/00444-8
Sinharay, S., & Haberman, S. J. (2008). Reporting subscores: A survey (Research Memorandum No. RM-08-18). ETS. https://www.ets.org/Media/Researc
h/pdf/RM-08-18.pdf
Smith, E., & Kosslyn, S. (2007). Cognitive psychology: Mind and brain. Pearson/Prentice Hall.
Stæhr, L. S. (2008). Vocabulary size and the skills of listening, reading and writing. Language Learning Journal, 36(2), 139–152. https://doi
.org/10.1080/09571730802389975
Stowell, J. R., & Bennett, D. (2010). Effects of online testing on student exam performance and test anxiety. Journal of Educational
Computing Research, 42(2), 161–171. https://doi.org/10.2190/EC.42.2.b
Thissen, D., & Mislevy, R. (2000). Testing algorithms. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed.,
pp. 103–135). Routledge.
Thompson, S. J., Johnstone, C. J., & Thurlow, M. L. (2002). Universal design applied to large scale assessments (Synthesis Report 44). National Center on Educational Outcomes, University of Minnesota.
Vinther, T. (2002). Elicited imitation: A brief overview. International Journal of Applied Linguistics, 12(1), 54–73. https://doi.org/10.1
111/1473-4192.00024
von Davier, A. A., Attali, Y., Runge, A., Church, J., Park, Y., & LaFlair, G. (2024). The item factory: Intelligent automation in support of
test development at scale. In Machine learning, natural language processing, and psychometrics. Information Age Publishing.
von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. Springer Science & Business Media.
Wainer, H. (2000). Computerized adaptive testing: A primer (2nd ed.). Routledge.
Weigle, S. C. (2002). Assessing writing. Cambridge University Press.
Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational
Measurement, 21(4), 361–375. https://doi.org/10.1111/j.1745-3984.1984.tb01040.x
Young, R. (2011). Interactional competence in language learning, teaching, and testing. In E. Hinkel (Ed.), Handbook of research
in second language teaching and learning (pp. 426–443, Vol. 2). Routledge.