
Behavior Research Methods (2019) 51:1928–1941

https://doi.org/10.3758/s13428-018-1176-7

childes-db: A flexible and reproducible interface to the child language data exchange system
Alessandro Sanchez1 · Stephan C. Meylan2,3 · Mika Braginsky3 · Kyle E. MacDonald1 · Daniel Yurovsky4 ·
Michael C. Frank1

Published online: 8 January 2019


© The Psychonomic Society, Inc. 2019

Abstract
The Child Language Data Exchange System (CHILDES) has played a critical role in research on child language development, particularly in characterizing the early language learning environment. Access to these data can be both complex for novices and difficult to automate for advanced users, however. To address these issues, we introduce childes-db, a database-formatted mirror of CHILDES that improves data accessibility and usability by offering novel interfaces, including browsable web applications and an R application programming interface (API). Along with versioned infrastructure that facilitates reproducibility of past analyses, these interfaces lower barriers to analyzing naturalistic parent–child language, allowing for a wider range of researchers in language and cognitive development to easily leverage CHILDES in their work.

Keywords Child language · Corpus linguistics · Reproducibility · R packages · Research software

Alessandro Sanchez and Stephan C. Meylan are co-first authors.

Corresponding author: Stephan C. Meylan

1 Department of Psychology, Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
2 Duke University, Durham, NC, USA
3 MIT, Cambridge, MA 02139, USA
4 University of Chicago, Chicago, IL 60637, USA

Introduction

What are the representations that children learn about language, and how do they emerge from the interaction of learning mechanisms and environmental input? Developing facility with language requires learning a great many interlocking components: meaningful distinctions between sounds (phonology), names of particular objects and actions (word learning), meaningful sub-word structure (morphology), rules for how to organize words together (syntax), and context-dependent and context-independent aspects of meaning (semantics and pragmatics). The key to learning all of these systems is the contribution of the child’s input (exposure to linguistic and non-linguistic data) in the early environment. While in-lab experiments can shed light on linguistic knowledge and some of the implicated learning mechanisms, characterizing this early environment requires additional research methods and resources.

One of the key methods that has emerged to address this gap is the collection and annotation of speech to and by children, often in the context of the home. Starting with Roger Brown’s (1973) work on Adam, Eve, and Sarah, audio recordings (and more recently video recordings) have been augmented with rich, searchable annotations to allow researchers to address a number of questions regarding the language-learning environment. Focusing on language learning in naturalistic contexts also reveals that children have, in many cases, productive and receptive abilities exceeding those demonstrated in experimental contexts. Often, children’s most revealing and sophisticated uses of language emerge in the course of naturalistic play.

While corpora of early language acquisition are extremely useful, creating them requires significant resources. Collecting and transcribing audio and video is costly and extremely time-consuming; even orthographic transcription (i.e., transcriptions with minimal phonetic detail) can take ten times the duration of the original recording

(MacWhinney, 2000). Automated, machine learning-based methods like automatic speech recognition (ASR) have provided only modest gains in efficiency. Such systems are limited both by the less-than-ideal acoustic properties of home recordings, and also by the poor fit of language models built on adult-directed, adult-produced language samples to child-directed and child-produced speech. Thus, researchers’ desires for data in analyses of child language corpora can very quickly outstrip their resources.

Established in 1984 to address this issue, the Child Language Data Exchange System (CHILDES) aims to make transcripts and recordings relevant to the study of child language acquisition available to researchers as free, public datasets (MacWhinney, 2000, 2014; MacWhinney & Snow, 1985). CHILDES now archives tens of thousands of transcripts and associated media across 20+ languages, making it a critical resource for characterizing both children’s early productive language use and their language environment. As the first major effort to consolidate and share transcripts of child language, CHILDES has been a pioneer in the move to curate and disseminate large-scale behavioral datasets publicly.

Since its inception, a tremendous body of research has made use of CHILDES data. Individual studies are too numerous to list, but classics include studies of morphological over-regularization (Marcus et al., 1992), distributional learning (Redington, Chater, & Finch, 1998), word segmentation (Goldwater, Griffiths, & Johnson, 2009), the role of frequency in word learning (Goodman, Dale, & Li, 2008), and many others. Some studies analyze individual examples in depth (e.g., Snyder, 2007), some track multiple child-caregiver dyads (e.g., Meylan, Frank, Roy, & Levy, 2017), and still others use the aggregate properties of all child or caregiver speech pooled across corpora (e.g., Montag, Jones, & Smith, 2015; Redington et al., 1998).

Nonetheless, there are some outstanding challenges working with CHILDES, both for students and for advanced users. The CHILDES ecosystem uses a specialized file format (CHAT), which is stored as plain text but includes structured annotations grouped into parallel information “tiers” on separate lines. These tiers allow for a searchable plaintext transcript of an utterance to be stored along with structured annotations of its phonological, morphological, or syntactic content. These files are usually analyzed using a command-line program (CLAN) that allows users to count word frequencies, compute statistics (e.g., mean length of utterance, or MLU), and execute complex searches against the data. While this system is flexible and powerful, mastering the CHAT codes and especially the CLAN tool with its many functions and flags can be daunting. These technical barriers decrease the ease of exploration by a novice researcher or in a classroom exercise.

On the opposite end of the spectrum, for data-oriented researchers who are interested in doing large-scale analyses of CHILDES, the current tools are also not ideal. CLAN software is an excellent tool for interactive exploration, but, as a free-standing application, it can be tricky to build into a processing pipeline written in Python or R. Thus, researchers who would like to ingest the entire corpus (or some large subset) into a computational analysis typically write their own parsers of the CHAT format to extract the subset of the data they would like to use (e.g., Kline, 2012; Meylan et al., 2017; Redington et al., 1998; Yang, 2013).

The practice of writing custom parsers is problematic for a number of reasons. First, effort is wasted in implementing the same features again and again. Second, this process can introduce errors and inconsistencies in data handling due to difficulties dealing with the many special cases in the CHAT standard. Third, these parsing scripts are rarely shared (and when they are, they typically break with subsequent revisions to the dataset), leading to much greater difficulty in reproducing the exact numerical results from previous published research that used CHILDES (see e.g., Meylan et al., 2017 for an example). Fourth, the CHILDES corpus itself is a moving target: computational work using the entire corpus at one time point may include a different set of data than subsequent work as corpora are added and revised. Currently, there is no simple way for researchers to document exactly which version of the corpus has been used, short of creating a full mirror of the data. These factors together lead to a lack of computational reproducibility, a major problem that keeps researchers from verifying or building on published research (Donoho, 2010; Stodden et al., 2016).

In the current manuscript, we describe a system for extending the functionality of CHILDES to address these issues. Our system, childes-db, is a database-formatted mirror of CHILDES that allows access through an application programming interface (API). This infrastructure allows the creation of web applications for browsing and easily visualizing the data, facilitating classroom use of the dataset. Further, the database can be accessed programmatically by advanced researchers, obviating the need to write one-off parsers of the CHAT format. The database is versioned for access to previous releases, allowing computational reproducibility of particular analyses.

We begin by describing the architecture of childes-db and the web applications that we provide. Next, we describe the childesr API, which provides a set of R functions for programmatic access to the data while abstracting away many of the technical details. We conclude by presenting several worked examples of specific uses of the system (both web apps and the R API) for research and teaching.

Design and technical approach

As described above, CHILDES is most often approached as a set of distinct CHAT files, which are then parsed by users, often using CLAN. In contrast to this parsing approach, which entails the sequential processing of strings, childes-db treats CHILDES as a set of linked tables, with records corresponding to intuitive abstractions such as words, utterances, and transcripts (see Kline, 2012 for an earlier example of deriving a single tabular representation of a CHILDES transcript). Users of data analysis languages like R or Julia, libraries like Pandas, or Structured Query Language (SQL) will recognize operations on tabular representations of data such as filtering (subsetting), sorting, aggregation (grouping), and joins (merges). These operations obviate the need for users to consider the specifics of the CHAT representation; instead, they simply request the entities they need for their research and allow the API to take care of the formatting details. We begin by orienting readers to the design of the system via a top-level description and motivation for the design of the database schema, then provide details on the database’s current technical implementation and the versioning scheme. Users primarily interested in accessing the database can skip these details and focus on access through the childesr API and the web apps.

Database format

At its core, childes-db is a database consisting of a set of linked tabular data stores where records correspond to linguistic entities like words, utterances, and sampling units like transcriptions and corpora. The smallest unit of abstraction tracked by the database is a token, treated here as the standard (or citation) orthographic form of a word. Using the standardized written form of the word facilitates the computation of lexical frequency statistics for comparison or aggregation across children or time periods. Deviations from the citation form, which are particularly common in the course of language development and often of special interest to researchers, are kept as a separate (possibly null) field associated with each token.

Many of the other tables in the database describe hierarchical collections built out of tokens (utterance, transcript, corpus, and collection) and store attributes appropriate for each level of description. Every entity includes attributes that link it to all higher-order collections, e.g., an utterance lists the transcript, corpus, and collection to which it belongs. An utterance contains one or more word tokens and includes fields such as the utterance type (e.g., declarative, interrogative, etc.), total number of tokens, and the total number of morphemes if the morphological structure is available in the original CHAT file. A transcript consists of one or more utterances and includes the date collected, the name of the target child, the age in days if defined, and the filename from CHILDES. A corpus consists of one or more transcripts, corresponding to well-known collections like the Brown (Brown, 1973) or Providence (Demuth, Culbertson, & Alter, 2006) corpus. Finally, a collection is a superordinate collection of corpora generally corresponding to a geographic region, following the convention in CHILDES. Because every record can be linked to a top-level collection (generally corresponding to a language), each table includes data from all languages represented in CHILDES (Fig. 1).

Participants (generally children and caregivers) are represented separately from the token hierarchy because it is common for the same children to appear in multiple transcripts. A participant identifier is associated with every word and utterance, including a name, role, three-letter CHILDES identifier (CHI = child, MOT = mother, FAT = father, etc.), and the range of ages for which they are observed (or age of corresponding child, in the case of caregivers). For non-child participants (caregivers and others), the record additionally contains an identifier for the corresponding target child, such that data corresponding to children and their caregivers can be easily associated.

Fig. 1 Database schema for ‘childes-db’. Tokens are linked to superordinate groupings of utterances, transcripts, corpora, and collections (red
arrows). All tokens and utterances are additionally associated with a participant (blue arrows)
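
The linked-table design described above can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module rather than MySQL, and every table name, column name, and row below is a hypothetical simplification chosen for illustration; it is not the actual childes-db schema. The sketch shows two properties named in the text: each token carries direct links to its higher-order groupings, and each caregiver record points to its target child.

```python
import sqlite3

# Hypothetical, simplified schema. Table and column names are illustrative
# assumptions, not the real childes-db schema.
SCHEMA = """
CREATE TABLE participant (
    id INTEGER PRIMARY KEY,
    code TEXT,                -- three-letter CHILDES code, e.g. CHI or MOT
    role TEXT,
    target_child_id INTEGER   -- set for caregivers, NULL for the child
);
CREATE TABLE utterance (
    id INTEGER PRIMARY KEY,
    transcript_id INTEGER,
    corpus_id INTEGER,
    collection_id INTEGER,
    speaker_id INTEGER REFERENCES participant(id)
);
CREATE TABLE token (
    id INTEGER PRIMARY KEY,
    gloss TEXT,               -- citation (standard orthographic) form
    utterance_id INTEGER REFERENCES utterance(id),
    transcript_id INTEGER,    -- direct links to every higher-order grouping
    corpus_id INTEGER,
    collection_id INTEGER,
    speaker_id INTEGER REFERENCES participant(id)
);
"""


def build_demo_db():
    """Create an in-memory database holding a few made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO participant VALUES (?, ?, ?, ?)",
        [(1, "CHI", "Target_Child", None), (2, "MOT", "Mother", 1)],
    )
    conn.executemany(
        "INSERT INTO utterance VALUES (?, ?, ?, ?, ?)",
        [(1, 10, 100, 1000, 2), (2, 10, 100, 1000, 1), (3, 10, 100, 1000, 2)],
    )
    # (gloss, utterance_id, speaker_id); the mother says "the blue ball" and "blue".
    tokens = [("the", 1, 2), ("blue", 1, 2), ("ball", 1, 2),
              ("ball", 2, 1), ("blue", 3, 2)]
    conn.executemany(
        "INSERT INTO token (gloss, utterance_id, transcript_id, corpus_id,"
        " collection_id, speaker_id) VALUES (?, ?, 10, 100, 1000, ?)",
        tokens,
    )
    return conn


def caregiver_word_counts(conn, child_id):
    """Join tokens to participants, keep caregivers of one child, count by gloss."""
    return conn.execute(
        """
        SELECT t.gloss, COUNT(*) AS n
        FROM token AS t
        JOIN participant AS p ON p.id = t.speaker_id
        WHERE p.target_child_id = ?
        GROUP BY t.gloss
        ORDER BY n DESC, t.gloss
        """,
        (child_id,),
    ).fetchall()


def tokens_in_corpus(conn, corpus_id):
    """No join is needed here because each token stores its corpus directly."""
    return conn.execute(
        "SELECT COUNT(*) FROM token WHERE corpus_id = ?", (corpus_id,)
    ).fetchone()[0]
```

Calling caregiver_word_counts(build_demo_db(), 1) returns the caregiver word counts for the one child in the toy data. Storing the higher-order identifiers on every token duplicates some information, but it lets common filters (by corpus or collection) run without multi-level joins, which appears to be the trade-off the schema description implies.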

Technical implementation

childes-db is stored as a MySQL database, an industry-standard, open-source relational database that can be accessed directly from a wide range of programming languages. The childes-db project provides access to hosted, read-only databases on a publicly accessible server for direct access and childesr (described below). The project also hosts compressed .sql exports for local installation. While the former is appropriate for most users, local installation can provide performance gains by allowing a user to access the database on their machine or on their local network, as well as allowing users to store derived information in the same database.

In order to import the CHILDES corpora into the MySQL schema described above, they must first be accurately parsed and subsequently vetted to ensure their integrity. We parse the XML (eXtensible Markup Language) release of CHILDES hosted by childes.talkbank.org using the NLTK library in Python (Bird & Loper, 2004). Logic implemented in Python converts the linear, multi-tier parse into a tabular format appropriate for childes-db. This logic includes decisions that we review below regarding what information sources are captured in the current release of the database and which are left for future development.

The data imported into childes-db are subject to data integrity checks to ensure that our import of the corpora is accurate and preferable to ad hoc parsers developed by many individual researchers. In order to evaluate our success in replicating CLAN parses, we compared unigram counts in our database with those output by CLAN, the command-line tool built specifically for analysis of transcripts coded in CHAT. We used the CLAN commands FREQ and MLU to obtain total token counts and mean lengths of utterance for every speaker in every transcript and compared these values to our own using the Pearson correlation coefficient. The correlations were .99 and .98 for the unigram count and MLU data, respectively, indicating reliable parsing.

Versioning

The content of CHILDES changes as additional corpora are added or transcriptions are updated; as of time of writing, these changes are not systematically tracked in a public repository.¹ To facilitate reproducibility of past analyses, we introduce a simple versioning system by adding a new complete parse of the current state of CHILDES every 6 months or as warranted by changes in CHILDES. By default, users interact with the most recent version of the database available. To support reproduction of results with previous versions of the database, we continue to host recent versions (up to the last 3 years/six versions) through our childesr API so that researchers can run analyses against specific historical versions of the database. For versions more than 3 years old, we host compressed .sql files that users may download and serve using a local installation of MySQL server (for which we provide instructions).

¹ Specific versions of the database, tracked using the version control system Git, can be obtained by emailing the maintainers of the CHILDES project. While tracking line-level changes with Git provides detailed information about what has changed, our method allows researchers to access the relevant version programmatically by simply adding an argument to a function call.

Current annotation coverage

The current implementation of childes-db emphasizes the computation of lexical statistics, and consequently focuses on reproducing the words, utterances, and speaker information in CHILDES transcripts. For this reason, we do not preserve all of the information available in CHILDES, such as:

• Sparsely annotated tiers, e.g., phonology (%pho) and situation (%sit)
• Media links
• Tone direction and stress
• Filled pauses
• Reformulations, word revision, and phrase revision, e.g., <what did you>[//] how can you see it?
• Paralinguistic material, e.g., [=! cries]

At present, childes-db focuses strictly on the contents of CHILDES, and does not include material in related TalkBank projects such as PhonBank, AphasiaBank, or DementiaBank. We will prioritize the addition of these information sources and others in response to community feedback.

Interfaces for accessing childes-db

We first discuss the childes-db web apps and then introduce the childesr R package.

Interactive web apps

The ability to easily browse and explore the CHILDES corpora is a cornerstone of the childes-db project. To this end, we have created powerful, yet easy-to-use interactive web applications that enable users to visualize various dimensions of the CHILDES corpus: frequency counts, mean lengths of utterance, type-token ratios, and

more. All of this can be done without understanding command-line tools.²

² The LuCiD toolkit (Chang, 2017) provides related functionality for a number of common analyses. In contrast to those tools, which focus on filling gaps not covered by CLAN (e.g., the use of n-gram models, incremental sentence generation, and distributional word classification), our web apps focus on covering the same common tasks as CLAN, but yielding visualizations for the web browser.

Our web apps are built using Shiny, a software package that enables easy app construction using R. Underneath the hood, each web app makes calls to our childesr API and subsequently plots the data using the popular R plotting package ggplot2. A user’s only task is to configure exactly what should be plotted through a series of buttons, sliders, and text boxes. The user may specify what collection, corpus, child, age range, caregiver, etc., should be included in a given analysis. The plot is displayed and updated in real time, and the underlying data are also available for download alongside the plot. All of these analyses may also be reproduced using the childesr package, but the web apps are intended for the casual user who seeks to extract developmental indices quickly and without any technical overhead.

Frequency counts

The lexical statistics of language input to children have long been an object of study in child language acquisition research. Frequency counts of words in particular may provide insight into the cognitive, conceptual, and linguistic experience of a young child (see e.g., Ambridge, Kidd, Rowland, & Theakston, 2015 for review). In this web app, inspired by ChildFreq (Bååth, 2010), we provide users the ability to search for any word spoken by a participant in the CHILDES corpora and track the usage of that word by a child or caregiver over time. Because of the various toggles available to the user that can subset the data, a user may view word frequency curves for a single child in the Brown corpus or all Spanish-speaking children, if desired. In addition, users can plot frequency curves belonging to caregivers alongside their child for convenient side-by-side comparisons. A single word or multiple words may be entered into the input box (Fig. 2).

Derived measures

The syntactic complexity and lexical diversity of children’s speech are similarly critical metrics for acquisition researchers (Miller & Chapman, 1981; Watkins, Kelly, Harbers, & Hollis, 1995). There are a number of well-established measures of children’s speech that operationalize complexity and diversity, and have many applications in speech-language pathology (SLP), where measures outside of the normal range may be indicative of speech, language, or communication disorders.

Several of the most common of these measures are available in the Derived Measures app, which plots these measures across age for a given subset of data, again specified by collection, corpora, children, and speakers. As with the Frequency Counts app, caregivers’ lexical diversity measures can be plotted alongside children’s. We have currently implemented the following measures:

• MLU-w (mean length of utterance in words),
• MLU-m (mean length of utterance in morphemes),
• TTR (type-token ratio, a measure of lexical diversity; Templin, 1957),
• MTLD (measure of textual lexical diversity; Malvern & Richards, 1997),
• HD-D (lexical diversity via the hypergeometric distribution; McCarthy & Jarvis, 2010)

As with the Frequency Counts app, a user may subset the data as they choose, compare measures between caregivers and children, and aggregate across children from different corpora (Fig. 3).

Population viewer

In many cases, a researcher may want to view the statistics and properties of corpora (e.g., their size, number of utterances, number of tokens) before choosing a target corpus or set of corpora for an analysis. This web app is intended to provide a basic overview regarding the scale and temporal extent of various corpora in CHILDES, as well as give researchers insight into the aggregate characteristics of CHILDES. For example, examining the aggregate statistics reveals that coverage in CHILDES peaks at around 30 months (Fig. 4).

The childesr package

Although the interactive analysis tools described above cover some of the most common use cases of CHILDES data, researchers interested in more detailed and flexible analyses will want to interface directly with the data in childes-db. Making use of the R programming language (R Core Team, 2017), we provide the childesr package. R is an open-source, extensible statistical computing environment that is rapidly growing in popularity across fields and is increasing in use in child language research (e.g., Norrman & Bylund, 2015; Song, Shattuck-Hufnagel, & Demuth, 2015). The childesr package abstracts away the details of connecting to and querying the database. Users can take advantage of the tools developed in the popular dplyr package (Wickham, Francois, Henry, & Müller, 2017), which makes manipulating large datasets quick and

Fig. 2 The frequency counts application allows users to track the frequency of words across various subgroups of children

Fig. 3 The derived measures application allows users to view several measures of children’s speech in CHILDES that operationalize complexity
and diversity

easy. We describe the commands that the package provides and then give several worked examples of analyses using the package.

The childesr package is easily installed via CRAN, the Comprehensive R Archive Network. To install, simply type: install.packages("childesr"). After installation, users have access to functions that can be used to retrieve tabular data from the database:

• get_collections() gives the names of available collections of corpora (“Eng-NA”, “Spanish”, etc.)
• get_corpora() gives the names of available corpora (“Brown”, “Clark”, etc.)
• get_transcripts() gives information on available transcripts (language, date, target child demographics)
• get_participants() gives information on transcript participants (name, role, demographics)
• get_speaker_statistics() gives summary statistics for each participant in each transcript (number of utterances, number of types, number of tokens, mean length of utterance)
• get_utterances() gives information on each utterance (glosses, stems, parts of speech, utterance type, number of tokens, number of morphemes, speaker information, target child information)
• get_types() gives information on each type within each transcript (gloss, count, speaker information, target child information)
• get_tokens() gives information on each token (gloss, stem, part of speech, number of morphemes, speaker information, target child information)

Each of these functions takes arguments that restrict the query to a particular subset of the data (e.g., by collection, by corpus, by speaker role, by target child age, etc.) and returns the output in the form of a table. All functions support the specification of the database version to use. For more detailed documentation, see the package repository (http://github.com/langcog/childesr).

Using childes-db: worked examples

In this section, we give a number of examples of how childes-db can be used in both research and teaching, using both the web apps and the R API. Note that all of these examples use dplyr syntax (Wickham et al., 2017); several accessible introductions to this framework are available online (e.g., Wickham & Grolemund, 2016).

Research applications

Color frequency

One common use of CHILDES is to estimate the frequency with which children hear different words. These frequency

Fig. 4 The population viewer application allows users to investigate the statistics of corpora in CHILDES

estimates are used both in the development of theory (e.g., frequent words are learned earlier; Goodman et al., 2008), and in the construction of age-appropriate experimental stimuli. One benefit of the childes-db interface is that it allows for easy analysis of how the frequencies of words change over development. Many of our theories in which children learn the structure of language from its statistical properties implicitly assume that these statistics are stationary, i.e., unchanging over development (e.g., Saffran, Aslin, & Newport, 1996). However, a number of recent analyses show that the frequencies with which infants encounter both linguistic and visual properties of their environment may change dramatically over development (Fausey, Jayaraman, & Smith, 2016), and these changing distributions may produce similarly dramatic changes in the ease or difficulty with which these regularities can be learned (Elman, 1993).

To demonstrate how one might discover such non-stationarity, we take as a case study the frequency with which children hear the color words of English (e.g., “blue”, “green”). Color words tend to be learned relatively late by children, potentially in part due to the abstractness of the meanings to which they refer (see Wagner, Dobkins, & Barner, 2013). However, within the set of color words, the frequency with which these words are heard predicts a significant fraction of the variance in their order of acquisition (Yurovsky, Wagner, Barner, & Frank, 2015). But are these frequencies stationary? For example, do children hear “blue” as often at 12 months as they do at 24 months? We answer this question in two ways: first using the web apps, and then using the childesr package.

Using web apps. To investigate whether the frequency of color words is stationary over development, a user can navigate to the Frequency app, and enter a set of color words into the Word selector separated by a comma: here “blue, red, green”. Because the question of interest is about the frequency of words in the input (rather than produced by children), the Speaker field can be set to reflect this choice. In this example, we select “Mother”. Because children learn most of their basic color words by the age of 5, the age range 1–5 years is a reasonable choice for Ages to include. The results of these selections are shown in Fig. 5. We can also create a hyperlink to store this set

Fig. 5 An example of using the frequency shiny app to explore how children’s color input changes over development

of choices so that we can share these results with others (or with ourselves in the future) by clicking on the Share Analysis button in the bottom left corner.

From this figure, it seems likely that children hear “blue” more frequently early in development, but the trajectories of “red” and “green” are less clear. We also do not have a good sense of the errors of these measurements, are limited to just a few colors at a time before the plot becomes too crowded, and cannot combine frequencies across speakers. To perform this analysis in a more compelling and complete way, a user can use the childesr interface.

Using childesr. We can analyze these learning trajectories using childesr by breaking the process into five steps: (1) define our words of interest, (2) find the frequencies with which children hear these words, (3) find the proportion of the total words children hear that these frequencies account for, (4) aggregate across transcripts and children to determine the error in our estimates of these proportions, and (5) plot the results.

For this analysis, we will define our words of interest as the basic color words of English (except for gray, which children hear very rarely). We store these in the colors variable, and then use the get_types() function from childesr to get the type frequency of each of these words in all of the corpora in CHILDES. All other functions are provided by base R or the tidyverse package. For demonstration, we look only at the types produced by the speakers in each corpus tagged as Mother and Father. We also restrict ourselves to children from 1–5 years old (12–60 months), and look only at the North American English corpora.

To normalize correctly (i.e., to ask what proportion of the input children hear consists of these color words), we need to know how many total words these children hear from their parents in these transcripts. To do this, we use the get_speaker_statistics() function, which will return a total number of tokens (num_tokens) for each of these speakers.

We now join these two pieces of information together: how many times each speaker produced each color word, and how many total words they produced. We then group the data into 6-month age bins, and compute the proportion of tokens accounted for by each color word for each child in each 6-month bin. For comparability with the web app analysis, these proportions are converted to parts per million words.

Finally, we use non-parametric bootstrapping to estimate 95% confidence intervals for our estimates of the parts per million words of each color term with the tidyboot package.
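
The normalization step described above (combine per-word counts with per-speaker totals, bin by age, convert to parts per million) can be sketched in a few lines. The paper's own analysis is written in R with childesr and dplyr; the Python below is only an illustrative stand-in. The function names, the tuple layouts, and the toy numbers are assumptions made for this sketch, and for brevity records are keyed by child and age rather than by transcript and speaker.

```python
from collections import defaultdict


def age_bin(age_months, width=6):
    """Lower edge, in months, of the age bin that contains age_months."""
    return int(age_months // width) * width


def words_per_million(type_counts, speaker_totals, bin_width=6):
    """Parts-per-million frequency of each word per child and age bin.

    type_counts:    iterable of (child, age_months, word, count)
    speaker_totals: iterable of (child, age_months, num_tokens)
    Both layouts are hypothetical; they stand in for the tables that a
    type-level query and a speaker-statistics query would return.
    """
    word_sums = defaultdict(int)
    total_sums = defaultdict(int)
    for child, age, word, count in type_counts:
        word_sums[(child, age_bin(age, bin_width), word)] += count
    for child, age, num_tokens in speaker_totals:
        total_sums[(child, age_bin(age, bin_width))] += num_tokens
    # Divide each word's summed count by the total caregiver tokens in the
    # same child-by-bin cell, skipping cells that have no recorded tokens.
    return {
        (child, b, word): 1e6 * n / total_sums[(child, b)]
        for (child, b, word), n in word_sums.items()
        if total_sums[(child, b)] > 0
    }


# Made-up input: two transcripts fall in the 24-month bin, one in the 30-month bin.
demo_counts = [
    ("Adam", 27.5, "blue", 3),
    ("Adam", 29.0, "blue", 1),
    ("Adam", 31.0, "red", 2),
]
demo_totals = [
    ("Adam", 27.5, 1000),
    ("Adam", 29.0, 1000),
    ("Adam", 31.0, 4000),
]
demo_ppm = words_per_million(demo_counts, demo_totals)
```

Summing counts and totals within a bin before dividing weights longer transcripts more heavily than averaging per-transcript proportions would; either choice could be defensible, and the text does not say which one the original code uses.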

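The bootstrapping step can be sketched in the same spirit. The paper uses the tidyboot R package; the function below is a generic percentile bootstrap written with the Python standard library, not a port of tidyboot, and its name, defaults, and sample values are assumptions for illustration. It resamples per-child estimates with replacement and reads the confidence limits off the sorted resampled means.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of values.

    Returns (mean, lower, upper). A fixed seed keeps the result repeatable.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample n observations with replacement, n_boot times, keeping each mean.
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(values), lower, upper


# Made-up per-child estimates (parts per million) for one word in one age bin.
demo_values = [480.0, 510.0, 530.0, 620.0, 700.0, 455.0, 590.0, 640.0]
demo_mean, demo_lower, demo_upper = bootstrap_mean_ci(demo_values)
```

Resampling children rather than individual tokens treats each child as the unit of analysis, which matches the description of aggregating across transcripts and children; with only a handful of children per bin the resulting intervals will be wide and should be read with caution.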
Figure 6 shows the results of these analyses: Input frequency varies substantially over the
1–5 year range for nearly every color word.

Fig. 6 Color frequency as a function of age. Points represent means across transcripts, error bars represent 95% confidence intervals computed by
nonparametric bootstrap. Y-axes are free because non-stationarity is evaluated within each color word, while their absolute frequencies vary widely

Gender

Gender has long been known to be an important factor for early vocabulary growth, with girls
learning more words earlier than boys (Huttenlocher, Haight, Bryk, Seltzer, & Lyons, 1991).
Parent-report data from ten languages suggest that female children have larger vocabularies
on average than male children in nearly every language (Eriksson et al., 2012). Comparable
cross-linguistic analysis of naturalistic production data has not been conducted, however,
and these differences are easy to explore using childesr. By pulling data from the
transcript_by_speaker table, a user has access to a set of derived linguistic measures that
are often used to evaluate a child’s grammatical development. In this worked example, we walk
through a sample analysis that explores gender differences in early lexical diversity.

First, we use the childesr function call get_speaker_statistics() to pull data relating to
the aforementioned derived measures for children and their transcripts. Note that we
exclusively select the children’s production data, and exclude their caregivers’ speech.

This childesr call retrieves data from all collections and corpora, including those languages
for which there are very sparse data. In order to make any substantial inferences from our
analysis, we begin by filtering the dataset to include only languages for which there are a
large number of transcripts (> 500). We also restrict our analysis to children under the age
of 4 years.

Our transcript_by_speaker table contains multiple derived measures of lexical diversity; here
we use MTLD (McCarthy, 2005). MTLD is derived from the average length of sequences of
orthographic words that stay above a pre-specified type-token ratio, making it more robust to
transcript length than simple TTR. We start by filtering to include only those children for
whom a sex was defined in the transcript, who speak a language in our subset of languages
with a large number of transcripts, and who are in the appropriate age range. We then compute
an average MTLD score for each child at each age point by aggregating across transcripts
while keeping
information about the child’s sex and language. Note that the transcripts for one child in
particular, “Leo” in the eponymous German corpus, were a collection of his most complex
utterances (as caregivers were instructed to record); this child was excluded from the
analysis.

The data contained in CHILDES come from a diverse array of studies reflecting varying
circumstances of data collection. This point is particularly salient in our gender analysis
due to potential non-independence issues that may emerge from the inclusion of many
transcripts from longitudinal studies. To account for non-independence, we fit a linear mixed
effects model with a gender * age (treated as a quadratic predictor) interaction as fixed
effects, child identity as a random intercept, and gender + age by language as a random
slope, the maximal converging random effects structure (Barr, Levy, Scheepers, & Tily,
2013).³ The plot below displays the average MTLD scores for various children at different
ages, split by gender, with a line corresponding to the prediction of our fit mixed effects
model.

This plot reveals a slight gender difference in linguistic productivity in young children,
replicating the moderate female advantage found by Eriksson et al. (2012). The goal of this
analysis was to showcase an example of using childesr to explore the CHILDES dataset. We also
highlighted some of the potential pitfalls, sparsity and non-independence, that emerge in
working with a diverse set of corpora, many of which were collected in longitudinal studies
(Fig. 7).

Fig. 7 MTLD (Measure of textual lexical diversity) scores as a function of age. Points represent mean scores for individual children across
transcripts. Females have a slight advantage in linguistic productivity over males

³ All code and analyses are available at https://github.com/langcog/childes-db-paper

Teaching with childes-db

In-class demonstrations

Teachers of courses on early language acquisition often want to illustrate the striking
developmental changes in children’s early language. One method is to present static displays
that show text from parent–child conversations extracted from CHILDES or data visualizations
of various metrics of production and input (e.g., MLU or Frequency), but one challenge of
such graphics is that they cannot be modified during a lecture and thus rely on the
instructor selecting examples that will be compelling to students. In contrast, in-class
demonstrations can be a powerful way to explain complex concepts while increasing student
engagement with the course materials.

Consider the following demonstration about children’s first words. Diary studies and
large-scale studies using parent report show that children’s first words tend to fall into a
fairly small number of categories: people, food, body parts, clothing, animals, vehicles,
toys, household objects, routines, and activities or states (Clark, 2009; Fenson et al.,
1994; Tardif et al., 2008). The key insight is that young children talk about what is going
on around them: people they see every day, toys and small household objects they can
manipulate, or food they can control. To illustrate this point, an instructor could:

1. introduce the research question (e.g., What are the types of words that children first
   produce?),
2. allow students to reflect or do a pair-and-share discussion with their neighbor,
3. show the trajectory of a single lexical item while explaining key parts of the
   visualization (see panel a of Fig. 8),
4. elicit hypotheses from students about the kinds of words that children are likely to
   produce,
5. make real-time queries to the web application to add students’ suggestions and talk
   through the updated plots (panels b and c of Fig. 8), and
6. finish by entering a pre-selected set of words that communicate the important takeaway
   point (panel d of Fig. 8).

Tutorials and programming assignments

One goal for courses on applied natural language processing (NLP) is for students to get
hands-on experience using NLP tools to analyze real-world language data. A primary challenge
for the instructor is to decide how much time should be spent teaching the requisite
programming skills for accessing and formatting language data, which are typically
unstructured. One pedagogical strategy is to abstract away these details and avoid having
students deal with obtaining data and formatting text. This approach shifts students’ effort
away from data cleaning and towards programming analyses that encourage the exploration and
testing of interesting hypotheses. In particular, the childesr API provides instructors with
an easy-to-learn method for giving students programmatic access to child language data.

For example, an instructor could create a programming assignment with the specific goal of
reproducing the key findings in the case studies presented above: color words or gender.
Depending on the students’ knowledge of R, the instructor could decide how much of the
childesr starter code to provide before asking students to generate their own plots and
write-ups. The instructor could then easily compare students’ code and plots to the expected
output to measure learning progress. In addition to specific programming assignments, the
instructor could use the childes-db and childesr workflow as a tool for facilitating student
research projects that are designed to address new research questions.
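As one example of supporting material for an assignment built on the gender case study, students could be asked to work through the lexical diversity measure itself. The sketch below follows the commonly described MTLD procedure (McCarthy & Jarvis, 2010): count how many times a running type-token ratio drops to a threshold, add a partial factor for the remainder, and average a forward and a backward pass. It is written in Python on a plain token list; the function names and the 0.72 default are assumptions of this sketch, and it is not claimed to match how the MTLD values stored in childes-db were computed.

```python
def _mtld_one_direction(tokens, threshold=0.72):
    """Tokens divided by the number of 'factors' found in a single pass."""
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        # A factor is complete once the running TTR reaches the threshold
        # (implementations differ on whether this comparison is strict).
        if len(types) / count <= threshold:
            factors += 1.0
            types.clear()
            count = 0
    if count > 0:
        # Partial credit for the unfinished final segment.
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    # Undefined when the TTR never drops (e.g. every token is unique).
    return len(tokens) / factors if factors > 0 else float("nan")


def mtld(tokens, threshold=0.72):
    """Average of the forward and backward single-pass scores."""
    tokens = [t.lower() for t in tokens]
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2.0


repetitive = mtld(["a"] * 10)
alternating = mtld(["a", "b"] * 5)
```

A maximally repetitive sample completes a factor every two tokens, so its score is low, while more varied samples take longer to reach the threshold and score higher.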

Fig. 8 Worked example of using the web applications for in-class teaching. Panels a–d show how an instructor could dynamically build a plot
during a lecture to demonstrate a key concept in language acquisition

Conclusions

We have presented childes-db, a database-formatted mirror of the CHILDES dataset. This
database, together with the R API and web apps, facilitates the use of child language data.
For teachers, students, and casual explorers, the web apps allow browsing and demonstration.
For researchers interested in scripting more complex analyses, the API allows them to
abstract away from the details of the CHAT format and easily create reproducible analyses of
the data. We hope that these functionalities broaden the set of users who can easily interact
with CHILDES data, leading to future insights into the process of language acquisition.

childes-db addresses a number of needs that have emerged in our own research and teaching,
but there are still a number of limitations that point the way to future improvements. For
example, childes-db currently operates only on transcript data, without links to the
underlying media files; in the future, adding such links may facilitate further computational
and manual analyses of phonology, prosody, social interaction, and other phenomena by
providing easy access to the video and audio data. Further, we have focused on including the
most common and widely used tiers of CHAT annotation into the database first, but our plan is
eventually to include the full range of tiers. Finally, a wide range of further interactive
analyses could easily be added to the current suite of web apps. We invite other researchers
to join us in both suggesting and contributing new functionality as our system grows and
adapts to researchers’ needs.

Author Notes Thanks to Brian MacWhinney for advice and guidance, and to Melissa Kline for her
work on CLANtoR, which formed a starting point for our work. This work is supported by a
Jacobs Advanced Research Fellowship to MCF.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

References

Ambridge, B., Kidd, E., Rowland, C. F., & Theakston, A. L. (2015). The ubiquity of frequency effects in first language acquisition. Journal of Child Language, 42(2), 239–273.
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.
Bååth, R. (2010). Childfreq: An online tool to explore word frequencies in child language. Lucs Minor, 16, 1–6.
Bird, S., & Loper, E. (2004). NLTK: The natural language toolkit. In Proceedings of the Association for Computational Linguistics Workshop on Interactive Poster and Demonstration sessions.
Brown, R. (1973). A first language. The early stages. Cambridge: Harvard University Press.

Chang, F. (2017). The luCID language researcher’s toolkit [computer software]. Retrieved from http://www.lucid.ac.uk/resources/for-researchers/toolkit/.
Clark, E. V. (2009). First language acquisition. Cambridge: Cambridge University Press.
Demuth, K., Culbertson, J., & Alter, J. (2006). Word-minimality, epenthesis and CODA licensing in the early acquisition of English. Language and Speech, 49(2), 137–173.
Donoho, D. L. (2010). An invitation to reproducible computational research. Biostatistics, 11(3), 385–388.
Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48(1), 71–99.
Eriksson, M., Marschik, P. B., Tulviste, T., Almgren, M., Pérez Pereira, M., Wehberg, S., ..., Gallego, C. (2012). Differences between girls and boys in emerging language skills: Evidence from 10 language communities. British Journal of Developmental Psychology, 30(2), 326–343.
Fausey, C. M., Jayaraman, S., & Smith, L. B. (2016). From faces to hands: Changing visual input in the first two years. Cognition, 152, 101–107.
Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Thal, D. J., Pethick, S. J., ..., Stiles, J. (1994). Variability in early communicative development. Monographs of the Society for Research in Child Development, i–185.
Goldwater, S., Griffiths, T. L., & Johnson, M. (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1), 21–54.
Goodman, J. C., Dale, P. S., & Li, P. (2008). Does frequency count? Parental input and the acquisition of vocabulary. Journal of Child Language, 35(3), 515–531.
Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M., & Lyons, T. (1991). Early vocabulary growth: Relation to language input and gender. Developmental Psychology, 27(2), 236.
Kline, M. (2012). CLANtoR. http://github.com/mekline/CLANtoR/. GitHub. https://doi.org/10.5281/zenodo.1196626.
MacWhinney, B. (2000). The CHILDES project: The Database Vol. 2. Hove: Psychology Press.
MacWhinney, B. (2014). The CHILDES project: Tools for analyzing talk, volume ii: The database. Hove: Psychology Press.
MacWhinney, B., & Snow, C. (1985). The child language data exchange system. Journal of Child Language, 12(2), 271–295.
Malvern, D. D., & Richards, B. J. (1997). A new measure of lexical diversity. British Studies in Applied Linguistics, 12, 58–71.
Marcus, G. F., Pinker, S., Ullman, M., Hollander, M., Rosen, T. J., Xu, F., & Clahsen, H. (1992). Overregularization in language acquisition. Monographs of the Society for Research in Child Development, i–178.
McCarthy, P. M. (2005). An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD). Dissertation Abstracts International, 66, 12.
McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392.
Meylan, S. C., Frank, M. C., Roy, B. C., & Levy, R. (2017). The emergence of an abstract grammatical category in children’s early speech. Psychological Science, 28(2), 181–192.
Miller, J. F., & Chapman, R. S. (1981). The relation between age and mean length of utterance in morphemes. Journal of Speech, Language, and Hearing Research, 24(2), 154–161.
Montag, J. L., Jones, M. N., & Smith, L. B. (2015). The words children hear: Picture books and the statistics for language learning. Psychological Science, 26(9), 1489–1496.
Norrman, G., & Bylund, E. (2015). The irreversibility of sensitive period effects in language development: evidence from second language acquisition in international adoptees. Developmental Science, 19(3), 513–520.
R Core Team (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for statistical computing. Retrieved from https://www.R-project.org/.
Redington, M., Chater, N., & Finch, S. (1998). Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science, 22(4), 425–469.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294), 1926–1928.
Snyder, W. (2007). Child language: The parametric approach. London: Oxford University Press.
Song, J. Y., Shattuck-Hufnagel, S., & Demuth, K. (2015). Development of phonetic variants (allophones) in 2-year-olds learning American English: A study of alveolar stop /t, d/ codas. Journal of Phonetics, 52, 152–169.
Stodden, V., McNutt, M., Bailey, D. H., Deelman, E., Gil, Y., Hanson, B., ..., Taufer, M. (2016). Enhancing reproducibility for computational methods. Science, 354(6317), 1240–1241.
Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N., & Marchman, V. A. (2008). Baby’s first 10 words. Developmental Psychology, 44(4), 929.
Templin, M. (1957). Certain language skills in children: Their development and interrelationships (monograph series no 26). Minneapolis: University of Minnesota, the Institute of Child Welfare.
Wagner, K., Dobkins, K., & Barner, D. (2013). Slow mapping: Color word learning as a gradual inductive process. Cognition, 127(3), 307–317.
Watkins, R. V., Kelly, D. J., Harbers, H. M., & Hollis, W. (1995). Measuring children’s lexical diversity: Differentiating typical and impaired language learners. Journal of Speech, Language, and Hearing Research, 38(6), 1349–1355.
Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data. Sebastopol: O’Reilly Media, Inc.
Wickham, H., Francois, R., Henry, L., & Müller, K. (2017). Dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr.
Yang, C. (2013). Ontogeny and phylogeny of language. Proceedings of the National Academy of Sciences, 110(16), 6324–6327.
Yurovsky, D., Wagner, K., Barner, D., & Frank, M. C. (2015). Signatures of domain-general categorization mechanisms in color word learning. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society.
