Regression Modeling for Linguistic Data
About this ebook


In the first comprehensive textbook on regression modeling for linguistic data in a frequentist framework, Morgan Sonderegger provides graduate students and researchers with an incisive conceptual overview along with worked examples that teach practical skills for realistic data analysis. The book features extensive treatment of mixed-effects regression models, the most widely used statistical method for analyzing linguistic data. 

Sonderegger begins with preliminaries to regression modeling: assumptions, inferential statistics, hypothesis testing, power, and other errors. He then covers regression models for non-clustered data: linear regression, model selection and validation, logistic regression, and applied topics such as contrast coding and nonlinear effects. The last three chapters discuss regression models for clustered data: linear and logistic mixed-effects models as well as model predictions, convergence, and model selection. The book’s focused scope and practical emphasis will equip readers to implement these methods and understand how they are used in current work.

  • The only advanced discussion of modeling for linguists
  • Uses R throughout, in practical examples using real datasets
  • Extensive treatment of mixed-effects regression models
  • Contains detailed, clear guidance on reporting models
  • Equal emphasis on observational data and data from controlled experiments
  • Suitable for graduate students and researchers with computational interests across linguistics and cognitive science
Language: English
Publisher: The MIT Press
Release date: June 6, 2023
ISBN: 9780262362467
    Book preview

    Regression Modeling for Linguistic Data - Morgan Sonderegger

    Preface

    This book introduces applied regression analysis for analyzing linguistic data, using R. It aims to provide both conceptual understanding and practical skills through extensive examples, using three different kinds of linguistic data:

    Preliminaries to regression modeling (chapters 1–3): assumptions, inferential statistics, hypothesis testing, power, and other errors.

    Regression models for nonclustered data (chapters 4–7): linear regression, model selection and validation, logistic regression, and practical topics (e.g., contrast coding, post hoc tests, nonlinear effects).

    Regression models for clustered data (chapters 8–10): linear and logistic mixed-effects models, and practical topics (e.g., model predictions, convergence, model selection).

    The book started as a minor revision of Quantitative Methods for Linguistic Data (Sonderegger, Wagner, and Torreira 2018), co-authored with Francisco Torreira and Michael Wagner, which itself grew out of lectures for a one-semester graduate quantitative methods course at McGill Linguistics that is taught most years. I ended up rewriting the entire manuscript, including adding several new chapters. So this book is best understood as a new text, which incorporates aspects of QMLD. I thank Michael and Francisco for their understanding, and letting me incorporate their work here.

    I see this book as a frozen version of an evolving document. Any feedback is welcome ([email protected]) and will hopefully be incorporated in a future version. I hope to update the book’s website, which also contains all code and datasets, as the text is updated in the future.¹

    Audience This book is for graduate students and researchers in linguistics and other language sciences, who work with quantitative data. This includes data-heavy subfields of linguistics (e.g., experimental syntax/semantics/phonology, phonetics, psycholinguistics, corpus linguistics, language acquisition, sociolinguistics), as well as communication sciences and disorders, psychologists or cognitive scientists of language, and so on. I have taught this material since 2013 to students and faculty (about 50% from each group). I use linguist and language scientist interchangeably in this book for brevity, similarly for linguistics and language sciences.² The text assumes knowledge of elementary linguistics, using terms such as phone or lexical item without comment, but should be understandable regardless (you’ll just have to think of some variables as y or x rather than as voice onset time or place of articulation).

    Background This book joins many existing texts on quantitative analysis/statistics for linguists, which are also mostly practical introductions using R, including Brezina (2018), Eddington (2016), Garcia (2021), Gómez (2014), and Gries (2021, 2013), Johnson (2008), Levshina (2015), Rietveld and Van Hout (1993, 2005), Vasishth and Broe (2011), Vasishth et al. (2021), and Winter (2019). Baayen (2008) has been particularly influential. These texts differ in many respects, such as the type of linguistic data assumed and the theory/practice balance, but mostly share two aspects: starting from scratch and broad scope (statistics or quantitative analysis generally).

    The approach of this book, described in more detail in chapter 1, is different. Its goals are narrower—conceptual understanding and practical experience with regression modeling of linguistic data—and I assume you have some experience with statistics and R. It does not cover other (important) quantitative tools, such as classification methods or exploratory data analysis. While I don’t assume you have read any of the preceding books, the current book can be seen as complementary to them.

    I focus on regression models because these are the main form of statistical analysis in papers (using quantitative data) published in major journals, but they are complex tools to use in practice. I give practical and detailed treatments of a smaller number of topics, describing decision points and the pros and cons of different methods, conventions in the current language sciences literature, and how to report your analyses. The goal is to equip you to use these methods in practice and understand how they are used in current work. This book will not cover the full range of possible regression models (e.g., Poisson or multinomial models), but extending to them should be straightforward after you’ve used this book.

    Other differences from existing texts are a greater focus on data from (laboratory) phonology, phonetics, and language variation and change (though other areas are represented) and equal emphasis on observational data (especially from speech corpora) and data from controlled experiments. These reflect my own background as a linguist working primarily in these areas, often using corpus data—but I have tried to keep the presentation useful for language scientists generally.

    Regression modeling is used across many scientific fields, and we can learn from best practices in other fields to better analyze linguistic data. At the same time, it’s easier to learn data analysis with data that looks like your own, and analysis tools become specialized for particular kinds of data as they are used in a field over time. I try to give context and places to read more for particular topics of interest—from (statistics for) language sciences, as well as behavioral sciences, social sciences, and ecology—both in the text and in Further Reading sections at the end of each chapter. No statistical method used in this book is new, so it is neither possible nor useful to give comprehensive citations; I give references I am familiar with and have found helpful, including particularly detailed treatments of linguistic data by other authors (in books, cited above, plus articles). My goal is to provide useful entry points to the vast literature on statistical methods for when you want to learn more.

    What you need to know This is an intermediate text. It assumes previous exposure to quantitative methods, such as from one of the books above, but aims to be useful for readers with different backgrounds.

    You should be familiar with secondary school math concepts, such as algebra, logarithms, exponentials, summation notation, and basic linear algebra (vectors, matrices, matrix multiplication); as well as some probability theory, specifically what the following terms mean: probability distribution, normal distribution, random variable, binomial distribution, and conditional probability. Most important is familiarity with statistics, at the level of a first course:

    Descriptive statistics: data summarization and visualization: the meaning of concepts such as mean, mode, standard deviation, quantile, and correlation; how to make and read common statistical plots (e.g., boxplots, histograms, density plots, scatterplots).

    Inferential statistics: the idea of sample and population, basic hypothesis testing concepts (p-values, test statistics), and tests (t-tests, χ²-tests), and maybe basic analysis of variance.

    The book focuses on conceptual understanding and practical skills by working in R, but without actually providing instruction in R. Thus, I assume a working knowledge of R and R programming, including both base R and some familiarity with tidyverse functionality (packages such as ggplot2, dplyr: https://tidyverse.org).³

    In practice, dozens of graduate students with a variety of backgrounds in these areas (including not much) have done well in the course using this book’s material, with some doing extra work to catch up. In particular, many students have learned R from scratch at the same time as using this material, using online tutorials for base R and R for Data Science (Grolemund and Wickham 2016) for tidyverse, and students with less math background than described have done well.

    Some resources for math/probability/statistics are Khan Academy videos, Sharon Goldwater’s math tutorials, general statistics books listed in chapter 2, Further Reading, and those for linguistics listed earlier; for descriptive statistics and visualization, Grolemund and Wickham (2016) (general), Gries (2021, chapter 3), and Garcia (2021, part 2) (linguistics) are particularly thorough.

    Caveat I am not a statistician, but a self-taught practitioner with some master's-level math/statistics training. Books written by practitioners can be useful because of the perspective of working with this kind of data, but they can also contain errors. If you frequently use a particular tool for data analysis, I recommend (eventually) consulting a more authoritative source; one goal of this book is to equip you to do so. This is particularly important because statistical practice is not static: this book emphasizes practical skills (e.g., which package to use, best practices for fitting and interpreting models) for which best practices are constantly evolving. My aims are for this book to be useful for analyzing linguistic data and as (technically) correct as possible.

    How to use this book This book is ideally read while executing the code shown in code blocks on your own computer, for example, by pasting them into RStudio or the R console. The code in each chapter is independent, so you should always be able to start reading/coding at the beginning of a chapter. I often refer to objects that have been created in previous code, and sometimes the output of running code is not actually shown (I’m assuming you can see it in your console).

    Most code is shown in the actual PDF; this is the code I assume you are able to run. An important exception is code for creating plots, which is usually omitted because code to make decent plots is verbose. You can find all code (including plotting code) in the code file for each chapter, on the book’s website.

    Each chapter consists of main text and boxes, inspired by McElreath (2020)’s Statistical Rethinking. Boxed text is not essential for understanding the main text and gives extra information. There are two types of boxes: Broader Context boxes provide more in-depth explanations of technical concepts or math, connect to other approaches, discuss common misunderstandings, and so on. Practical Note boxes discuss aspects of statistical analysis you’re unlikely to delve into until you actually use these methods in your own work, such as statistical reporting or R details.

    Offsetting materials in boxes is intended to make the book easier to use. On a first read you can focus just on the essentials by skipping boxes, and when analyzing your own data later or learning more about a particular method, the boxes call out relevant material.

    Datasets This book uses publicly available datasets, all of which are available on the book’s website or in the languageR package. I am grateful to dataset authors for their willingness to post data publicly: Michael Wagner (givenness); Timo Roettger and Bodo Winter (neutralization); Francisco Torreira, Seán Roberts, and Stephen Levinson (transitions); Michael McAuliffe and Hye-Young Bang (turkish_if0); Max Bane and Peter Graff (vot); and R. Harald Baayen and co-authors (english, regularity).

    Acknowledgments Many people deserve thanks for the long journey to a book. Above all, students in McGill classes from 2013 to 2021 provided feedback, as did students who continued to use the materials for their research—especially Hye-Young Bang, Amélie Bernard, Guilherme Garcia, Claire Honda, Oriana Kilbourn-Ceron, Donghyun Kim, Bing’er Jiang, Jeff Lamontagne, James Tanner, and Connie Ting. Substantial editing and typesetting work have been done by David Fleischer, Claire Honda, Jacob Hoover, Vanna Willerton, and especially Michaela Socolof. Colleagues who have offered encouragement and comments over the years include Meghan Clayards, Jessamyn Schertz, Tyler Kendall, Michael Wagner, Paul Boersma, Christian Di Canio, Volya Kapatsinski, Roger Levy, Jane Stuart-Smith, Tim O’Donnell, Alan Yu, Bodo Winter, Tania Zamuner, and five anonymous reviewers. Special thanks go to James Kirby for comments on the entire manuscript and reminders to keep going. I am grateful to you all, including those I have forgotten. Finally, I am thankful to Katie for her companionship, support, and especially patience—it’s finally done.

    Morgan Sonderegger

    January 2022

    1. The website is currently at https://osf.io/pnumg/. Please check my personal web page if the OSF link no longer works when you read this.

    2. The different connotations of the two terms are not relevant for statistical analysis.

    3. Some people have strong opinions about base R versus tidyverse. If you don’t like one idiom or the other, it will be perfectly possible to follow along; I just won’t be explaining particular functions in detail.

    1

    Preliminaries

    1.1 Our R Toolset

    This book exclusively uses R, which has become the de facto standard across language sciences. R is free, relatively easy to learn, and incorporates very broad functionality through packages. It is an excellent default for visualization and statistical analysis.

    A downside of using R is that you must decide what dialect to use: the core set of base packages, which are stable but clunky, or a better alternative set of packages that have been developed, which may become obsolete.¹ The currently dominant dialect is tidyverse, a family of packages that offer wide functionality and an elegant implementation based on the tidy data philosophy (Wickham et al. 2019; Grolemund and Wickham 2016), but that evolve over time. The world being what it is, base R and tidyverse zealots can be easily found online.

    I agree with Winter (2019) that the most realistic option is to learn both dialects: you need to know base R to function, while tidyverse functionality is often superior in practice, and both are widely used in online resources (e.g., StackOverflow pages). This book uses both, leaning toward tidyverse functionality when available, but often showing both base and tidy ways to do the same thing, in line with my general philosophy of showing you alternative methods so you can choose for yourself. This book also often uses data and functions from the languageR package associated with Baayen (2008)’s Analyzing Linguistic Data (Baayen and Shafaei-Bajestan 2019).
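    To make the two-dialects point concrete, here is a small sketch of my own (not from the book) computing the same summary in base R and in dplyr, using R's built-in iris data:

    library(dplyr)

    # Base R: mean Sepal.Length per species
    aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

    # tidyverse (dplyr): the same summary
    iris %>%
      group_by(Species) %>%
      summarise(mean_sepal_length = mean(Sepal.Length))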

    The book’s code will be kept updated on its website, currently osf.io/pnumg (or see my personal website). Check there if any code doesn’t work using your (future) R/tidyverse version. Appendix B shows the exact R and package versions used to compile this book.

    1.2 Our Approach

    The primary goals of this book are conceptual understanding and practical experience with regression modeling, in R.

    Conceptual understanding My general philosophy is that understanding the statistical/data analysis methods you use ("why do X," "what is X") is of primary importance, and the best way to do this is through practical demonstration ("how to do X"). Developing conceptual understanding also often requires math, whether via equations or simulation, because this is the underlying language. Conceptual understanding is the most practical thing you can learn for data analysis, because statistical methods (i.e., best practices for doing X) change over time, but new methods always build on old ones. For this reason, the book focuses on fewer topics—such as covering linear and logistic regression, but not other types of linear models—in greater detail, and in a cumulative fashion: linear mixed-effects models build on linear regression, which builds on t-tests.

    Practical experience Conceptual understanding alone will not let you analyze data. There are often little tricks, or best practices that you learn with experience, which are essential in practice, but don’t come up unless you actually spend a lot of time analyzing data. So this book contains a lot of working code, integrated into the main text, showing how to do everything discussed, using a good set of R packages for 2022.² The only exception is code for making plots, which is not shown in the text because it is verbose.

    I strongly recommend actually running the code as you read, because just as one cannot learn martial arts by watching Bruce Lee movies, you can’t learn to program statistical models by only reading a book (McElreath 2020, xiii). To facilitate this, R files of just the code for each chapter (including plotting code) are posted on the book’s website.

    Statistical modeling Regression analysis is a statistical modeling approach to data analysis, where we seek to interpret some data with respect to research questions or hypotheses we have, by building and interpreting a model. A different approach to data analysis, which language scientists often learn, is a hypothesis testing approach: the researcher applies one of a fixed set of tests (e.g., t-test, one-way analysis of variance), depending on the type of data and the question being asked. Hypothesis testing is one foundation of statistical modeling (see chapter 2), but the underlying philosophy is quite different (see, e.g., Rodgers 2010; Gelman and Hill 2007). The statistical modeling approach is more flexible, but harder to learn—it involves making choices, and thinking about the data (box 1.1).

    Box 1.1

    Broader Context: Trade-Offs versus Flowcharts

    Many researchers, including language scientists, just want to know how to analyze their data—they see statistical analysis as an onerous task that one would rather leave to others (Baayen 2008, viii)—and don’t want to have to choose among different methods depending on the pros and cons. This is understandable, and to meet this demand, statistics textbooks often use a flowchart/recipe approach to guide researchers in choosing a method given their data (e.g., compare two normally distributed groups ⇒ two-sample t-test). This approach is simple, and requires less conceptual understanding, but it has serious disadvantages in practice: one’s data often does not fit neatly into a flowchart box (e.g., you can’t assess normality, or a reviewer asks for a different method), in which case you don’t know what to do. Also, no intuition is developed for the consequences of using different analysis methods, as in different papers in the literature. If you understand the pros and cons of different methods, you are better able to address the scientific questions you want to ask about your data, and you will be a more informed consumer of the literature.

    The statistical modeling approach recognizes three central facts about data analysis. First, we are not building a model of the actual generating mechanism of the data (e.g., neurons, vocal tract muscles), which for linguistic data is usually unknown; at best we are building a process model to gain insight into the research questions motivating the analysis (McElreath 2020, section 1.2).³ It follows that while there are incorrect ways to analyze one’s data, there is never a single right way—data analysis requires an educated choice of method, and different choices carry different risks and rewards. Finally, data analysis always takes place in a scientific context: the hypotheses or research questions motivating your analysis are fundamental to choosing the analysis method, because the goal of the analysis is to address these questions.⁴

    These points inform this book’s presentation. Rather than showing you the right way to do an analysis (e.g., fit and report a linear mixed-effects model), I show decision points, the pros and cons of different paths, and what is done in current practice. I try to introduce methods in the context of concrete research questions; a corollary is that this book uses fewer datasets than many other books do.

    1.3 Context

    This section defines terminology and notation used throughout the book, including classic oppositions among experimental/observational studies, correlation/causation, and exploratory/confirmatory analysis.

    1.3.1 Types of Data and Study

    In this book linguistic data is often used as a shorthand for any quantitative dataset produced in a linguistic study.⁵ These come most often from laboratory experiments or linguistic corpora (of speech or text), but they could also be from typology (typological frequencies), computational studies (e.g., output of a speech recognizer as parameters are varied), observation of language acquisition (what words children know at age X) or language change, lexicons, and many other sources.

    These sources can be divided into experimental studies, where the researcher constructs a world and manipulates some variables (x) to observe the effect on other(s) (y), and correlational or observational studies, where the researcher observes the real world (e.g., Field, Miles, and Field 2012, section 1.6.1). Linguistic data can be experimental or observational, for example, respectively, from controlled laboratory experiments versus corpus studies.

    This distinction is closely related to causation versus correlation (i.e., data description). In the classic formulation, causality (x → y) can only be inferred from an experimental study, but it is unclear whether the results generalize to the real world, while observational studies are always correlational (x ∼ y), but the results have greater ecological validity. In reality, inferring causality is hard even for experimental data (e.g., there may be unobserved confounders, a nonrandom sample), and it is safest to always assume that we are fitting correlational models and bear the correlation is not causation adage in mind when interpreting our results. Section 4.2 discusses this further for regression models.

    1.3.2 Exploratory and Confirmatory Analysis

    Traditionally, data analysis can be exploratory or confirmatory (EDA, CDA; Tukey 1977): exploring the data, often by visualization, to generate hypotheses versus testing known hypotheses in novel data, for example, fitting a regression model and calculating p-values for each coefficient. An alternative characterization is that CDA/EDA is any data analysis that does/doesn’t involve statistical modeling. The ideal is that EDA precedes CDA. In realistic data analysis, the boundary between the two is often unclear (Tukey 1980), especially for modern regression models (e.g., Gelman 2004). It is common to explore and confirm using the same dataset, including going back and forth, and in many linguistic studies (especially observational) the exact statistical model to be run cannot be fully specified in advance. The exact balance is tricky and depends on the context.

    Rather than discussing exploration versus confirmation in depth, this book assumes that you are familiar with the basic issues relevant for regression modeling:

    EDA (read: making many plots) is critical before any statistical analysis, even if the statistical methods have been completely prespecified: to find problems with the data, get intuitions about what the data say that the fitted model can be checked against, and so on.

    Testing hypotheses (CDA) suggested by the data (EDA) is dangerous: this makes it less likely your results generalize to new data, and you can easily end up hypothesizing after the results are known (sometimes referred to as HARKing) after you see what terms are significant.

    Both exploratory and confirmatory phases are valuable for regression analysis: fitting a prespecified model to data will often miss important aspects of the data, while a model whose structure comes (only) from examining empirical plots may not generalize to new data.

    A (published) statistical analysis can be exploratory or confirmatory or both. Confirmatory analyses have higher status, but this is just convention—exploratory studies are very important and should be reported as exploratory (not written as if they are confirmatory). Most realistic data analysis is actually both exploratory and confirmatory, and it is important in your writeup to specify which part of your analysis was performed in exploratory mode versus confirmatory mode. (For example, some terms in a regression may be based on scientific hypotheses and some from examining empirical plots.)

    Some places to read more are Winter (2019, chapter 16), Gries (2021, chapter 4, section 5.5), Nicenboim et al. (2018) and Roettger (2019) (for linguistic data); Baguley (2012, section 1.3), and references already discussed.

    Most regression models that this book covers are confirmatory, but they are used in analysis pipelines including exploratory steps. These include making empirical plots (before modeling) and checking fitted models (model validation) to detect problems that can lead to refitting.

    Box 1.2

    Broader Context: Assumptions about Our World

    This book assumes a classic frequentist framework, in an idealized world:

    1. The goal is to estimate parameter values (e.g., the mean voice onset time for stops produced by American English speakers).

    2. These parameters have true (population) values, which are approximated by taking a sample.

    3. The population from which the sample is taken is infinitely large.

    4. Samples drawn from the population are representative and random (e.g., samples are from all American English speakers, randomly).

    Assumptions 1 and 2 are assumptions of frequentist data analysis, as opposed to Bayesian (see box 2.1). Assumptions 3 and 4 are idealizations about the sample that are assumed by most statistical methods; in reality a researcher is usually working with a convenience sample, which she hopes is representative/random enough. While assumptions 3 and 4 are important, they form part of the general issue of how the data were obtained, which I abstract away from in this book. Note that one prerequisite for assumption 4 is that observations are assumed to be independent. This is sometimes true for linguistic data, in which case methods from chapters 4 to 7 can be used, but often not; this is the primary motivation for mixed-effects regression models (chapters 8–10).

    For more on these points, see Navarro (2016, chapter 10), Kline (2013, chapter 2), or other statistics textbooks (some are listed in section 2.8).

    1.3.3 Mathematical Notation

    This book’s notation largely follows Gelman and Hill (2007). Values of parameters (the population values) are written with Greek letters: μ, σ. Estimates of parameters are written with a hat: μ̂, σ̂. Random variables corresponding to observed data are typically written with lowercase Roman letters, with subscripts denoting individual observations. For example, y could be observed reaction time in data from a laboratory experiment, and y₁, …, yₙ are the values of the n observations. However, parameters that are proportions are written with p (whose estimate would be p̂), and some Greek letters (𝜖, δ, γ) are used for error terms, which are discussed when they are first used (chapters 4, 8). Sample means are written with a bar: ȳ is the mean of y₁, …, yₙ.

    The ∼ notation is used to describe how a random variable is distributed. For example, "individual observations yᵢ follow a normal distribution with mean 1 and standard deviation 5" is written yᵢ ∼ N(1, 5). N(μ, σ) means "a normal probability density with mean μ and standard deviation σ."

    1. For example, Baayen (2008) uses Lattice packages that are now obsolete.

    2. Note that there is no best set of tools: R packages change rapidly, and different tools work for different people. For example, I prefer customizing my plots carefully; you may be fine with just using existing prewritten functions. You may strongly prefer base R or tidyverse functionality.

    3. This is simplified somewhat: McElreath distinguishes between process models, which are well-specified quantitative causal models of the process (e.g., how speakers parse sentences, how vowels are realized acoustically), and statistical models, which are the actual models we fit to data.

    4. On the centrality of scientific questions for statistical analysis, see Speed (1986) and McElreath (2020, chapter 17) (who cites it).

    5. This shorthand is just for convenience, because only quantitative data are relevant for us.

    6. In linguistics proper, experimental [data/linguistics] is commonly used in two senses: a laboratory experiment, or any study that primarily uses quantitative data, which would include most sources of what I am calling quantitative data. I avoid usage like experimental linguistics because of this ambiguity and only use experimental as defined in the text.

    2

    Samples, Estimates, and Hypothesis Tests

    This chapter and the next cover basics of inferential statistics: going from a finite sample of data from a population to inferences about the population, with the goal of [drawing] conclusions about which parameter values are supported by the data and which are not (Hoenig and Heisey 2001, 4).

    Regression modeling is a type of inferential statistics that builds on concepts covered in these two chapters. They first cover estimation of population values and differences using sample statistics (section 2.2), uncertainty in these estimates (section 2.3), and assessment of the reliability of conclusions that we reach about population values/differences (sections 2.4–2.7) using hypothesis testing. Chapter 3 covers the size of estimates and different kinds of errors that we can make in assessing the size and reliability of an effect.

    I assume you already have some exposure to the topics in the current chapter, which are covered in depth in many sources; some are listed in section 2.8. However, these topics are covered in very different ways in different settings (e.g., in a statistics class vs. an R tutorial). For language scientists learning regression modeling, it is useful to establish a common set of concepts, terminology, and practical guidelines, using linguistic data examples. This is the goal of this chapter.

    2.1 Preliminaries

    2.1.1 Packages

    This discussion assumes that you have loaded the tidyverse and languageR libraries (section 1.1).

    library(tidyverse) is a shortcut to load a set of tidyverse packages (section 1.1).¹ You can alternatively just install and load single packages as needed. For this book, the dplyr, ggplot2, and tidyr packages are the most important (Wickham et al. 2021; Wickham 2016, 2021).
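    A minimal loading block consistent with this setup (assuming both packages are already installed; otherwise run install.packages() first):

    # Load the tidyverse meta-package (dplyr, ggplot2, tidyr, ...) and languageR
    library(tidyverse)
    library(languageR)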

    2.1.2 Data

    The transitions dataset

    This discussion assumes that you have loaded the transitions dataset:
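    The actual loading code is provided on the book's website; as a sketch, assuming the dataset has been downloaded as a CSV file (the filename here is hypothetical):

    # Hypothetical filename; the real file and loading code are on the book's website
    transitions <- read_csv("transitions.csv")
    head(transitions)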

    This dataset (described in more detail in appendix A.1) comes from a study by Roberts, Torreira, and Levinson (2015) that examines approximately 20,000 transitions between conversational turns in a corpus of telephone calls. Each conversation (column file) is between two different speakers. Of interest is what factors affect transition durations (column dur): how long it takes after one speaker finishes speaking for the other speaker to begin. The before and after speakers for each turn are called speaker A and speaker B (columns spkA, spkB). For example, in conversation SW3154.EAF (the first rows of the dataframe), the two speakers are SPKR1290 and SPKR1288, and which one is speaker A or B alternates:

    (Here, … indicates omitted lines of R output. You can always run the code yourself to see full output.)

    Observations from the same conversation are not independent—because individual speakers probably have characteristic durations—but independent observations are assumed by methods introduced in this chapter. Thus, we take a small subset of the data where observations are plausibly independent, by choosing a random observation from each conversation:
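    The book's code (with its own random seed) is on the website; one way to do this with dplyr, as a sketch, is:

    set.seed(1)  # placeholder seed; the book fixes its own seed so everyone gets the same subset
    transitions_sub <- transitions %>%
      group_by(file) %>%       # one group per conversation
      slice_sample(n = 1) %>%  # pick one random turn transition from each conversation
      ungroup()
    nrow(transitions_sub)      # one row per conversation (349 in the book's data)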

    I assume that you have run these commands, so the dataframe transitions_sub exists (n = 349), and you are using the same random dataframe.

    2.1.3 Notational Conventions

    This chapter refers to individual datasets and R objects. R libraries (e.g., tidyverse, ggplot) are kept in plain text, while datasets are referred to in teletype, such as the transitions dataset. Teletype is used for objects in R code, such as the transitions_sub dataframe, or individual columns of the dataframe. A fundamental data type in R is the factor, a categorical variable that takes on discrete values. Factors (typically columns of a dataframe) are written using teletype and individual levels with SMALL CAPS. For example, the factor sexB in the transitions dataset has levels F and M.

    2.2 Point Estimation

    In a quantitative study we are often interested in estimating single numbers (called point estimates in statistics) that characterize an aspect of the world. For example, in the transitions data, we may be interested in the effect of speaker B’s gender (column sexB: values F, M).² This could be quantified by three numbers:

    How long are transitions if speaker B is male?

    How long are transitions if speaker B is female?

    What is the effect of gender on transition duration (the difference between male and female durations)?

    2.2.1 Population and Sample

    In quantitative studies we are typically interested in population values of a parameter—their true values in the world, under the model of the world we are assuming (box 2.1).

    Box 2.1

    Broader Context: Frequentist and Bayesian Statistics

    There are two major approaches to statistical inference, corresponding to different philosophies of what probability means. The assumption that true values of parameters exist implies we are doing frequentist statistics rather than Bayesian statistics, where inference results in a probability distribution describing degrees of belief over possible values of the parameter. This is simply a pragmatic choice—frequentist methods are vastly more common in behavioral and social sciences, though Bayesian methods offer some serious advantages and are making inroads. Many sources describe the general differences between Bayesian and frequentist approaches (e.g., Dienes 2008, chapter 4; McElreath 2020, chapter 1), and Nicenboim and Vasishth (2016) and Vasishth, Nicenboim, et al. (2018) are good starting points for Bayesian methods for analyzing linguistic data in particular.

    Typically population values refer to a context beyond just the setting for the study—we are probably interested in gender effects on transition time among all speakers of American English, not just all American English speakers who volunteered to be recorded for this corpus. However, in the real world we never observe population values; we can only take a sample of size n and make an inference about the population values.

    For example, to estimate the preceding three quantities using the transitions_sub data (n = 349), we could use:

    The average value of dur when sexB is F or M (91 msec, 259 msec)

    The difference between these averages (−167 msec)

    These values can be calculated for the transitions_sub data using functions from dplyr:
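    The book's exact code is on its website; a dplyr sketch that computes these numbers:

    # Mean transition duration by speaker B's gender
    transitions_sub %>%
      group_by(sexB) %>%
      summarise(mean_dur = mean(dur))

    # Difference between the two means (female minus male)
    with(transitions_sub, mean(dur[sexB == "F"]) - mean(dur[sexB == "M"]))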

    These estimates are not the same as the population values, for several reasons: the sample may (a) not be representative of the desired population (e.g., all American English speakers) or (b) not be truly random, and (c) the sample is finite. While (a) and (b) are important, they form part of the general issue of how the data were obtained, which we are abstracting away from in this book (section 1.3, box 1.2). We thus assume we do have a random sample from the population of interest. This leaves (c), which is a fundamental issue addressed by statistical inference: estimation of (population) quantities of interest, whose true values we will never actually know, based on a finite sample.

    2.2.2 Sampling Distribution of the Sample Mean

    In inferential statistics, the general setup is that we have a data sample from a quantitative study, which we assume is representative and random. We use this sample to calculate sample statistics, which are estimates of the population values of quantities we care about—typically parameters of a statistical model.

    Ideally a sample statistic should be an unbiased estimator of the population value: the statistic’s average value should be the same as the population value, meaning that if we kept repeating the study and computing the statistic, averaging these values would get us closer and closer to the true value.

    The most basic sample statistic is the sample mean, which is the average of n observations (written x₁, …, xₙ):

    x̄ = (x₁ + ⋯ + xₙ)/n  (2.1)

    The sample mean approximates the population mean, which we write μ. To understand how the sample mean is related to the population mean, we can explore using simulations where we know the population distribution.

    Suppose that durations of transitions (dur) to female speakers in the transitions data were in fact drawn from a normal distribution, with mean μ = 200 and standard deviation (SD) σ = 450, which we write N(200, 450) (see section 1.3.3 on notation). These are the (made-up) population values.

    We are interested in the sampling distribution of the sample mean: How likely are we to calculate different values of x̄ if we kept drawing random samples? We can plot a good approximation of this distribution as follows:

    Draw a sample of n observations from the distribution N(200, 450).

    Calculate x̄ for the sample.

    Repeat steps 1 and 2 many times (nsim), and plot a histogram showing the distribution of x̄ values. (A code sketch of these steps follows.)
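    A minimal sketch of this simulation (my own code, not the book's; the book's version, including plotting code, is on its website):

    nsim <- 100000   # number of simulated samples
    n <- 10          # sample size
    mu <- 200        # made-up population mean from the text
    sigma <- 450     # made-up population SD from the text

    # Draw nsim samples of size n, computing the sample mean of each
    sample_means <- replicate(nsim, mean(rnorm(n, mean = mu, sd = sigma)))

    # Quick-and-dirty histogram (the book's figures use more elaborate plotting code)
    hist(sample_means, breaks = 100,
         main = "Sampling distribution of the sample mean")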

    Figure 2.1 (top row) shows these histograms when the sample mean is calculated over 5, 10, and 50 observations (with nsim = 100,000). The distribution of the sample mean gets narrower for larger n. Thus, how certain we should be about our observed sample mean (91 msec) depends a lot on sample size: if n = 5, we would be likely to calculate a sample mean that is at least this far (109 msec) from the true value, just by chance.

    Figure 2.1

    Sampling distribution of the sample mean (histograms), calculated over n observations drawn from a normal distribution with μ = 200 and SD σ, for varying n and σ. Dotted lines show the probability distribution of observations [N(200, σ)].

    The distribution of the sample mean is also narrower if the quantity that we are estimating is less variable (that is, smaller σ), as illustrated in the bottom row of figure 2.1. Thus, the more observations in the sample or the less variable the quantity we are estimating, the more precise (= less variable) is the mean value that we calculate based on the sample.

    As suggested by the shape of the distributions in figure 2.1, the sample mean is itself normally distributed (box 2.2). The mean of this distribution is μ—because the sample mean is an unbiased estimator of the population mean—and its SD is σ/√n. This can be written more succinctly as

    x̄ ∼ N(μ, σ/√n)  (2.2)

    Box 2.2

    Practical Note: Normal Distributions Refresher

    It is useful to know some properties and notation for normal distributions that come up frequently in regression modeling (and in R output). The probability density for a normal distribution with mean μ and standard deviation σ is

    f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))

    This is often abbreviated as the N(μ, σ) distribution. (Or as N(μ, σ²), depending on the author.) σ² is the variance, and the inverse variance (1/σ²) is the precision.

    A normal distribution with mean 0 and SD 1 is called a standard normal distribution, written N(0, 1). It is common (in statistics texts or in R output) to use z (or Z) to refer to any random variable that is expected to follow a standard normal distribution. If you draw an observation z from such a random variable, the probability that |z| < 1, |z| < 2, or |z| < 3 is 0.68, 0.954, and 0.997 (respectively). That is, about 68% of probability lies within one σ from the mean, about 95% lies within two σ, and almost all probability lies within three σ. Given the ubiquity of the 95% significance criterion in language sciences, it is also useful to remember that exactly 95% of probability lies within 1.96 σ from the mean. But in general 2 is close enough to 1.96 to represent 95% probability.
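    These probabilities are easy to verify in R (a quick check of my own, not shown in the book text):

    # P(|Z| < 1), P(|Z| < 2), P(|Z| < 3) for a standard normal
    pnorm(c(1, 2, 3)) - pnorm(c(-1, -2, -3))
    ## 0.6826895 0.9544997 0.9973002

    # The z value with exactly 95% of probability within +/- z
    qnorm(0.975)
    ## 1.959964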

    A very useful property of normal distributions is closure under linear combination:

    If Z ∼ N(μ, σ) and a and b are constants, then

    a + bZ ∼ N(a + bμ, |b|σ)

    That is, adding a constant increases the mean and multiplying by a constant multiplies the variance (which is now b²σ²).

    If Z₁ ∼ N(μ₁, σ₁) and Z₂ ∼ N(μ₂, σ₂) and Z₁ and Z₂ are independent, then

    Z₁ + Z₂ ∼ N(μ₁ + μ₂, √(σ₁² + σ₂²))

    That is, the mean and variance of the sum are just the sums of the individual means and variances.

    One application is normality of the sample mean of n normally distributed (and independent) observations. This follows from the last two equations, because the sample mean is just a sum of normally distributed random variables divided by a constant.

    using the ∼ notation for describing the distribution of a random variable (section 1.3.3). The σ/√n term quantifies the observation from figure 2.1: either higher sample size or lower variability (in the data we’re analyzing) leads to a more precise estimate.

    2.2.3 Nonnormal Distributions and the Central Limit Theorem

    Much of regression modeling boils down to estimating mean values, as we did for estimating the mean of a normal distribution (as well as quantifying uncertainty in the estimates, as we’ll do in section 2.3). But in general, we’ll want to analyze data beyond just continuous variables drawn from a normal distribution—much linguistic data is discrete (e.g., yes/no responses in an experiment, syntactic construction A vs. B observed in a corpus), and much continuous-valued linguistic data isn’t normally distributed (e.g., word frequencies, phonetic parameters such as voice onset time, reaction times). What happens if we take the sample mean for observations from a nonnormal distribution?

    For example, consider the Dutch verb regularity data (dataframe regularity) from the languageR package, described in more detail on its help page (type ?regularity). This dataset, originally from Baayen and Moscoso del Prado Martín (2005), lists 700 Dutch irregular and regular verbs (column Regularity) and includes lexical and distributional variables that may help predict whether a verb is regular, including the verb’s frequency (column WrittenFrequency) and which auxiliary verb is used to form certain past tenses (column Auxiliary: levels HEBBEN, ZIJN, ZIJNHEB). In this sample, 159 verbs (23%) are irregular.

    Suppose we are trying to estimate a single (population) probability, p: how often Dutch verbs are irregular. (If we picked a random verb in a large Dutch dictionary, how likely would we be to select an irregular one?) We observe n Dutch verbs, x₁, …, xₙ, each of which is 0 (regular) or 1 (irregular).

    Our estimate for p is the sample proportion:

    p̂ = (x₁ + ⋯ + xₙ)/n  (2.3)

    which is the proportion of verbs that were irregular. Equation (2.3) looks the same as equation (2.1), because they are both sample means, but each xᵢ in equation (2.3) follows a Bernoulli distribution rather than a normal distribution. The numerator of equation (2.3) is thus a count, which follows a binomial distribution, not a normal distribution.

    To examine the probability distribution of the sample proportion, we can use the same simulation procedure as in section 2.2.2. Figure 2.2 shows the distribution for different sample sizes, assuming that p = 0.1 or 0.4 (made-up values), with a dotted line showing p.
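    A sketch of this simulation (again my own code; the book's is on its website):

    nsim <- 100000
    n <- 50
    p <- 0.4   # made-up population probability from the text

    # Each sample proportion is the mean of n Bernoulli(p) observations (0s and 1s)
    sample_props <- replicate(nsim, mean(rbinom(n, size = 1, prob = p)))

    hist(sample_props, breaks = 50,
         main = "Sampling distribution of the sample proportion")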

    Figure 2.2

    Sampling distribution of the sample proportion (p̂) for n observations of a Bernoulli random variable with probability p (histogram), varying n and p. Dotted lines show the value of p.

    The distribution of the sample proportion looks somewhat normal for n = 10, and by n = 50 looks perfectly normal—even though the distribution of the actual random variable whose mean is being estimated is not normal (it only takes on values 0 or 1). This illustrates one of the most important results of probability theory, the central limit theorem: for a large enough sample from any random variable with mean μ and SD σ, the sampling distribution of the sample mean is approximately normally distributed with mean μ and SD σ/√n (equation (2.2)).³

    The central limit theorem essentially says that the larger the sample you collect, the closer to normally distributed the sample mean is. This remarkable result is frequently used in inferential statistics, because it allows us to apply the same tools (for dealing with normally distributed data) to many different kinds of data. Nonetheless, as the preceding example shows, it is important to bear in mind that any normal approximation is still an approximation, which depends on sample size and the exact distribution being approximated.

    Box 2.3

    Broader Context: Variability in Bernoulli and Binomial Distributions

    When estimating the sample mean of a normally distributed quantity we can vary n and σ, but for the sample proportion (figure 2.2) there is no σ. This is because for a Bernoulli random variable (e.g., a coin flip) the only free parameter is the probability of a success (p). The distribution still has an SD σ; it is just a function of the free parameter p:

    σ = √(p(1 − p))

    This quantity is maximized when p = 0.5 and approaches 0 as p gets closer to 0 or 1. The (population) standard error of the sample proportion is just √(p(1 − p)/n), which decreases either for higher n or for p further from 0.5—the pattern seen in figure 2.2. Intuitively, the further p is from 0.5, the more certain you can be about the outcome of an individual observation. (If p = 1 or 0, there is no uncertainty.)
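    As a quick worked example (my own, plugging in the regularity data figures from section 2.2.3, 159 of 700 verbs irregular, as if they were population values):

    p_obs <- 159 / 700                         # observed proportion of irregular verbs (about 0.23)
    n <- 700
    sd_bernoulli <- sqrt(p_obs * (1 - p_obs))  # SD of a single 0/1 observation, treating p_obs as p
    se_prop <- sd_bernoulli / sqrt(n)          # standard error of the sample proportion
    c(p = p_obs, sd = sd_bernoulli, se = se_prop)
    ## approximately: p 0.227, sd 0.419, se 0.0158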

    The fact that Bernoulli distributions don’t have an independent variance parameter will be important for understanding logistic regressions (chapter 6).

    2.3 Uncertainty and Interval Estimation

    Almost as important as estimating the value of a quantity is estimating the uncertainty in our estimate, measured either by a single number or by a range of values (called an interval estimate).

    2.3.1 Standard Error

    For our estimate of the sample mean, we saw that the width of the distribution in figure 2.1, σ/√n, quantifies how much uncertainty there is in the sample mean as an estimate of the population mean. But in general we do not know σ (the population value) and must estimate it. An unbiased estimator for σ is

    s = √(Σ(xᵢ − x̄)²/(n − 1))  (2.4)

    which looks almost identical to the formula for calculating an SD of a sample, except with n − 1 in the denominator instead of n (which corrects for finite sample size [Rice 2006, section 7.3]).
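    R's built-in sd() already uses the n − 1 denominator; a quick check (my own sketch):

    x <- c(2, 4, 4, 4, 5, 5, 7, 9)

    sd(x)                                          # built-in: n - 1 denominator
    ## 2.13809

    sqrt(sum((x - mean(x))^2) / (length(x) - 1))   # same value, computed by hand
    ## 2.13809

    sqrt(sum((x - mean(x))^2) / length(x))         # "divide by n" version is smaller
    ## 2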

    We can then define the standard error (SE) of the sample mean, which is an unbiased estimator of σ/√n:

    SE = s/√n  (2.5)

    The SE estimates how much error there is, on average (across many samples), in our estimate of the population mean μ using x̄.

    One consequence of the central limit theorem is that (for large enough n) we can use s/√n as an approximate SE when estimating any sample mean (equation (2.5)), just by replacing σ by an estimate of the SD. Intuitively, whatever we’re trying to estimate, our estimate will be more precise for larger sample size or lower variability.

    For example, when estimating a proportion, an unbiased estimator for σ is

    s = √(p̂(1 − p̂))

    and the standard error is

    SE = √(p̂(1 − p̂)/n)

    Box 2.4

    Practical Note: Standard Error and Sample Size

    Technically, we should call σ/√n the standard error and s/√n the estimated standard error (section 7.3). This book is almost always referring to the latter, so it will use SE to mean estimated SE except where there is ambiguity.

    Something useful to remember is that error scales as the square root of sample size (because the standard error has a √n in the denominator). This is the reason why collecting more data has diminishing returns: doubling sample size only decreases error by a factor of 1.41, and to halve error you need four times as much data! Note that the (estimated) SE won’t be exactly halved if you collect four times the data, because this number (s/√n) is itself an estimate, and s will change for a new sample. Nonetheless, the true SE will be halved. (See exercise 2.1.)

    This basic relationship (error ∝ 1/√n) holds for many kinds of errors and is useful in practical settings, such as planning a new study (how much more data should you collect to observe a hypothesized effect?) or critically assessing empirical patterns in published work (how much should you trust the means of cells A and B if cell B contains half as much data?).
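    The scaling is easy to see numerically (a small sketch of my own):

    n <- c(100, 200, 400)
    se <- 1 / sqrt(n)   # SE for a quantity whose SD estimate is 1

    se[1] / se[2]   # doubling n shrinks the SE by a factor of sqrt(2), about 1.41
    se[1] / se[3]   # quadrupling n halves the SE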

    2.3.2 Confidence Intervals

    In isolation a standard error is not an intuitive measure of uncertainty, because it does not give a sense of which values are likely. One commonly used notion is a confidence interval (CI): a range of values that is X% likely to contain the population value. Most often, X = 95%.

    2.3.2.1 Z-based confidence intervals

    Consider
