Regression Modeling for Linguistic Data
About this ebook


In the first comprehensive textbook on regression modeling for linguistic data in a frequentist framework, Morgan Sonderegger provides graduate students and researchers with an incisive conceptual overview along with worked examples that teach practical skills for realistic data analysis. The book features extensive treatment of mixed-effects regression models, the most widely used statistical method for analyzing linguistic data. 

Sonderegger begins with preliminaries to regression modeling: assumptions, inferential statistics, hypothesis testing, power, and other errors. He then covers regression models for non-clustered data: linear regression, model selection and validation, logistic regression, and applied topics such as contrast coding and nonlinear effects. The last three chapters discuss regression models for clustered data: linear and logistic mixed-effects models as well as model predictions, convergence, and model selection. The book’s focused scope and practical emphasis will equip readers to implement these methods and understand how they are used in current work.

  • The only advanced discussion of modeling for linguists
  • Uses R throughout, in practical examples using real datasets
  • Extensive treatment of mixed-effects regression models
  • Contains detailed, clear guidance on reporting models
  • Equal emphasis on observational data and data from controlled experiments
  • Suitable for graduate students and researchers with computational interests across linguistics and cognitive science
Language: English
Publisher: The MIT Press
Release date: June 6, 2023
ISBN: 9780262362467
    Book preview

    Regression Modeling for Linguistic Data - Morgan Sonderegger

    Preface

    This book introduces applied regression analysis for analyzing linguistic data, using R. It aims to provide both conceptual understanding and practical skills through extensive examples, using three different kinds of linguistic data:

    Preliminaries to regression modeling (chapters 1–3): assumptions, inferential statistics, hypothesis testing, power, and other errors.

    Regression models for nonclustered data (chapters 4–7): linear regression, model selection and validation, logistic regression, and practical topics (e.g., contrast coding, post hoc tests, nonlinear effects).

    Regression models for clustered data (chapters 8–10): linear and logistic mixed-effects models, and practical topics (e.g., model predictions, convergence, model selection).

    The book started as a minor revision of Quantitative Methods for Linguistic Data (Sonderegger, Wagner, and Torreira 2018), co-authored with Francisco Torreira and Michael Wagner, which itself grew out of lectures for a one-semester graduate quantitative methods course at McGill Linguistics that is taught most years. I ended up rewriting the entire manuscript, including adding several new chapters. So this book is best understood as a new text, which incorporates aspects of QMLD. I thank Michael and Francisco for their understanding, and letting me incorporate their work here.

    I see this book as a frozen version of an evolving document. Any feedback is welcome ([email protected]) and will hopefully be incorporated in a future version. I hope to update the book’s website, which also contains all code and datasets, as the text is updated in the future.¹

    Audience This book is for graduate students and researchers in linguistics and other language sciences, who work with quantitative data. This includes data-heavy subfields of linguistics (e.g., experimental syntax/semantics/phonology, phonetics, psycholinguistics, corpus linguistics, language acquisition, sociolinguistics), as well as communication sciences and disorders, psychologists or cognitive scientists of language, and so on. I have taught this material since 2013 to students and faculty (about 50% from each group). I use linguist and language scientist interchangeably in this book for brevity, similarly for linguistics and language sciences.² The text assumes knowledge of elementary linguistics, using terms such as phone or lexical item without comment, but should be understandable regardless (you’ll just have to think of some variables as y or x rather than as voice onset time or place of articulation).

    Background This book joins many existing texts on quantitative analysis/statistics for linguists, which are also mostly practical introductions using R, including Brezina (2018), Eddington (2016), Garcia (2021), Gómez (2014), and Gries (2021, 2013), Johnson (2008), Levshina (2015), Rietveld and Van Hout (1993, 2005), Vasishth and Broe (2011), Vasishth et al. (2021), and Winter (2019). Baayen (2008) has been particularly influential. These texts differ in many respects, such as the type of linguistic data assumed and the theory/practice balance, but mostly share two aspects: starting from scratch and broad scope (statistics or quantitative analysis generally).

    The approach of this book, described in more detail in chapter 1, is different. Its goals are narrower—conceptual understanding and practical experience with regression modeling of linguistic data—and I assume you have some experience with statistics and R. It does not cover other (important) quantitative tools, such as classification methods or exploratory data analysis. While I don’t assume you have read any of the preceding books, the current book can be seen as complementary to them.

    I focus on regression models because these are the main form of statistical analysis in papers (using quantitative data) published in major journals, but they are complex tools to use in practice. I give practical and detailed treatments of a smaller number of topics, describing decision points and the pros and cons of different methods, conventions in the current language sciences literature, and how to report your analyses. The goal is to equip you to use these methods in practice and understand how they are used in current work. This book will not cover the full range of possible regression models (e.g., Poisson or multinomial models), but extending to them should be straightforward after you’ve used this book.

    Other differences from existing texts are a greater focus on data from (laboratory) phonology, phonetics, and language variation and change (though other areas are represented) and equal emphasis on observational data (especially from speech corpora) and data from controlled experiments. These reflect my own background as a linguist working primarily in these areas, often using corpus data—but I have tried to keep the presentation useful for language scientists generally.

    Regression modeling is used across many scientific fields, and we can learn from best practices in other fields to better analyze linguistic data. At the same time, it’s easier to learn data analysis with data that looks like your own, and analysis tools become specialized for particular kinds of data as they are used in a field over time. I try to give context and places to read more for particular topics of interest—from (statistics for) language sciences, as well as behavioral sciences, social sciences, and ecology—both in the text and in Further Reading sections at the end of each chapter. No statistical method used in this book is new, so it is neither possible nor useful to give comprehensive citations; I give references I am familiar with and have found helpful, including particularly detailed treatments of linguistic data by other authors (in books, cited above, plus articles). My goal is to provide useful entry points to the vast literature on statistical methods for when you want to learn more.

    What you need to know This is an intermediate text. It assumes previous exposure to quantitative methods, such as from one of the books above, but aims to be useful for readers with different backgrounds.

    You should be familiar with secondary school math concepts, such as algebra, logarithms, exponentials, summation notation, and basic linear algebra (vectors, matrices, matrix multiplication); as well as some probability theory, specifically what the following terms mean: probability distribution, normal distribution, random variable, binomial distribution, and conditional probability. Most important is familiarity with statistics, at the level of a first course:

    Descriptive statistics: data summarization and visualization: the meaning of concepts such as mean, mode, standard deviation, quantile, and correlation; how to make and read common statistical plots (e.g., boxplots, histograms, density plots, scatterplots).

    Inferential statistics: the idea of sample and population, basic hypothesis testing concepts (p-values, test statistics), and tests (t-tests, χ²-tests), and maybe basic analysis of variance.

    The book focuses on conceptual understanding and practical skills by working in R, but without actually providing instruction in R. Thus, I assume a working knowledge of R and R programming, including both base R and some familiarity with tidyverse functionality (packages such as ggplot2, dplyr: https://tidyverse.org).³

    In practice, dozens of graduate students with a variety of backgrounds in these areas (including not much) have done well in the course using this book’s material, with some doing extra work to catch up. In particular, many students have learned R from scratch at the same time as using this material, using online tutorials for base R and R for Data Science (Grolemund and Wickham 2016) for tidyverse, and students with less math background than described have done well.

    Some resources for math/probability/statistics are Khan Academy videos, Sharon Goldwater’s math tutorials, general statistics books listed in chapter 2, Further Reading, and those for linguistics listed earlier; for descriptive statistics and visualization, Grolemund and Wickham (2016) (general), Gries (2021, chapter 3), and Garcia (2021, part 2) (linguistics) are particularly thorough.

    Caveat I am not a statistician, but a self-taught practitioner with some master's-level math/statistics training. Books written by practitioners can be useful because of the perspective of working with this kind of data, but they can also contain errors. If you frequently use a particular tool for data analysis, I recommend (eventually) consulting a more authoritative source; one goal of this book is to equip you to do so. This is particularly important because statistical practice is not static: this book emphasizes practical skills (e.g., which package to use, best practices for fitting and interpreting models) for which best practices are constantly evolving. My aims are for this book to be useful for analyzing linguistic data and as (technically) correct as possible.

    How to use this book This book is ideally read while executing the code shown in code blocks on your own computer, for example, by pasting them into RStudio or the R console. The code in each chapter is independent, so you should always be able to start reading/coding at the beginning of a chapter. I often refer to objects that have been created in previous code, and sometimes the output of running code is not actually shown (I’m assuming you can see it in your console).

    Most code is shown in the actual PDF; this is the code I assume you are able to run. An important exception is code for creating plots, which is usually omitted because code to make decent plots is verbose. You can find all code (including plotting code) in the code file for each chapter, on the book’s website.

    Each chapter consists of main text and boxes, inspired by McElreath (2020)’s Statistical Rethinking. Boxed text is not essential for understanding the main text and gives extra information. There are two types of boxes: Broader Context boxes provide more in-depth explanations of technical concepts or math, connect to other approaches, discuss common misunderstandings, and so on. Practical Note boxes discuss aspects of statistical analysis you’re unlikely to delve into until you actually use these methods in your own work, such as statistical reporting or R details.

    Offsetting materials in boxes is intended to make the book easier to use. On a first read you can focus just on the essentials by skipping boxes, and when analyzing your own data later or learning more about a particular method, the boxes call out relevant material.

    Datasets This book uses publicly available datasets, all of which are available on the book’s website or in the languageR package. I am grateful to dataset authors for their willingness to post data publicly: Michael Wagner (givenness); Timo Roettger and Bodo Winter (neutralization); Francisco Torreira, Seán Roberts, and Stephen Levinson (transitions); Michael McAuliffe and Hye-Young Bang (turkish_if0); Max Bane and Peter Graff (vot); and R. Harald Baayen and co-authors (english, regularity).

    Acknowledgments Many people deserve thanks for the long journey to a book. Above all, students in McGill classes from 2013 to 2021 provided feedback, as did students who continued to use the materials for their research—especially Hye-Young Bang, Amélie Bernard, Guilherme Garcia, Claire Honda, Oriana Kilbourn-Ceron, Donghyun Kim, Bing’er Jiang, Jeff Lamontagne, James Tanner, and Connie Ting. Substantial editing and typesetting work have been done by David Fleischer, Claire Honda, Jacob Hoover, Vanna Willerton, and especially Michaela Socolof. Colleagues who have offered encouragement and comments over the years include Meghan Clayards, Jessamyn Schertz, Tyler Kendall, Michael Wagner, Paul Boersma, Christian Di Canio, Volya Kapatsinski, Roger Levy, Jane Stuart-Smith, Tim O’Donnell, Alan Yu, Bodo Winter, Tania Zamuner, and five anonymous reviewers. Special thanks go to James Kirby for comments on the entire manuscript and reminders to keep going. I am grateful to you all, including those I have forgotten. Finally, I am thankful to Katie for her companionship, support, and especially patience—it’s finally done.

    Morgan Sonderegger

    January 2022

    1. The website is currently at https://osf.io/pnumg/. Please check my personal web page if the OSF link no longer works when you read this.

    2. The different connotations of the two terms are not relevant for statistical analysis.

    3. Some people have strong opinions about base R versus tidyverse. If you don’t like one idiom or the other, it will be perfectly possible to follow along; I just won’t be explaining particular functions in detail.

    1

    Preliminaries

    1.1 Our R Toolset

    This book exclusively uses R, which has become the de facto standard across language sciences. R is free, relatively easy to learn, and incorporates very broad functionality through packages. It is an excellent default for visualization and statistical analysis.

    A downside of using R is that you must decide what dialect to use: the core set of base packages, which are stable but clunky, or a better alternative set of packages that have been developed, which may become obsolete.¹ The currently dominant dialect is tidyverse, a family of packages that offer wide functionality and an elegant implementation based on the tidy data philosophy (Wickham et al. 2019; Grolemund and Wickham 2016), but that evolve over time. The world being what it is, base R and tidyverse zealots can be easily found online.

    I agree with Winter (2019) that the most realistic option is to learn both dialects: you need to know base R to function, while tidyverse functionality is often superior in practice, and both are widely used in online resources (e.g., StackOverflow pages). This book uses both, leaning toward tidyverse functionality when available, but often showing both base and tidy ways to do the same thing, in line with my general philosophy of showing you alternative methods so you can choose for yourself. This book also often uses data and functions from the languageR package associated with Baayen (2008)’s Analyzing Linguistic Data (Baayen and Shafaei-Bajestan 2019).
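    To make the two-dialects point concrete, here is a small sketch of my own (not from the book) computing the same summary in base R and in dplyr, using R's built-in iris data:

    library(dplyr)

    # Base R: mean Sepal.Length per species
    aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

    # tidyverse (dplyr): the same summary
    iris %>%
      group_by(Species) %>%
      summarise(mean_sepal_length = mean(Sepal.Length))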

    The book’s code will be kept updated on its website, currently osf.io/pnumg (or see my personal website). Check there if any code doesn’t work using your (future) R/tidyverse version. Appendix B shows the exact R and package versions used to compile this book.

    1.2 Our Approach

    The primary goals of this book are conceptual understanding and practical experience with regression modeling, in R.

    Conceptual understanding My general philosophy is that understanding the statistical/data analysis methods you use ("why do X," "what is X") is of primary importance, and the best way to do this is through practical demonstration ("how to do X"). Developing conceptual understanding also often requires math, whether via equations or simulation, because this is the underlying language. Conceptual understanding is the most practical thing you can learn for data analysis, because statistical methods (i.e., best practices for doing X) change over time, but new methods always build on old ones. For this reason, the book focuses on fewer topics—such as covering linear and logistic regression, but not other types of linear models—in greater detail, and in a cumulative fashion: linear mixed-effects models build on linear regression, which builds on t-tests.

    Practical experience Conceptual understanding alone will not let you analyze data. There are often little tricks, or best practices that you learn with experience, which are essential in practice, but don’t come up unless you actually spend a lot of time analyzing data. So this book contains a lot of working code, integrated into the main text, showing how to do everything discussed, using a good set of R packages for 2022.² The only exception is code for making plots, which is not shown in the text because it is verbose.

    I strongly recommend actually running the code as you read, because just as one cannot learn martial arts by watching Bruce Lee movies, you can’t learn to program statistical models by only reading a book (McElreath 2020, xiii). To facilitate this, R files of just the code for each chapter (including plotting code) are posted on the book’s website.

    Statistical modeling Regression analysis is a statistical modeling approach to data analysis, where we seek to interpret some data with respect to research questions or hypotheses we have, by building and interpreting a model. A different approach to data analysis, which language scientists often learn, is a hypothesis testing approach: the researcher applies one of a fixed set of tests (e.g., t-test, one-way analysis of variance), depending on the type of data and the question being asked. Hypothesis testing is one foundation of statistical modeling (see chapter 2), but the underlying philosophy is quite different (see, e.g., Rodgers 2010; Gelman and Hill 2007). The statistical modeling approach is more flexible, but harder to learn—it involves making choices, and thinking about the data (box 1.1).

    Box 1.1

    Broader Context: Trade-Offs versus Flowcharts

    Many researchers, including language scientists, just want to know how to analyze their data—they see statistical analysis as an onerous task that one would rather leave to others (Baayen 2008, viii)—and don’t want to have to choose among different methods depending on the pros and cons. This is understandable, and to meet this demand, statistics textbooks often use a flowchart/recipe approach to guide researchers in choosing a method given their data (e.g., compare two normally distributed groups ⇒ two-sample t-test). This approach is simple, and requires less conceptual understanding, but it has serious disadvantages in practice: one’s data often does not fit neatly into a flowchart box (e.g., you can’t assess normality, or a reviewer asks for a different method), in which case you don’t know what to do. Also, no intuition is developed for the consequences of using different analysis methods, as in different papers in the literature. If you understand the pros and cons of different methods, you are better able to address the scientific questions you want to ask about your data, and you will be a more informed consumer of the literature.

    The statistical modeling approach recognizes three central facts about data analysis. First, we are not building a model of the actual generating mechanism of the data (e.g., neurons, vocal tract muscles), which for linguistic data is usually unknown; at best we are building a process model to gain insight into the research questions motivating the analysis (McElreath 2020, section 1.2).³ It follows that while there are incorrect ways to analyze one’s data, there is never a single right way—data analysis requires an educated choice of method, and different choices carry different risks and rewards. Finally, data analysis always takes place in a scientific context: the hypotheses or research questions motivating your analysis are fundamental to choosing the analysis method, because the goal of the analysis is to address these questions.⁴

    These points inform this book’s presentation. Rather than showing you the right way to do an analysis (e.g., fit and report a linear mixed-effects model), I show decision points, the pros and cons of different paths, and what is done in current practice. I try to introduce methods in the context of concrete research questions; a corollary is that this book uses fewer datasets than many other books do.

    1.3 Context

    This section defines terminology and notation used throughout the book, including classic oppositions among experimental/observational studies, correlation/causation, and exploratory/confirmatory analysis.

    1.3.1 Types of Data and Study

    In this book linguistic data is often used as a shorthand for any quantitative dataset produced in a linguistic study.⁵ These come most often from laboratory experiments or linguistic corpora (of speech or text), but they could also be from typology (typological frequencies), computational studies (e.g., output of a speech recognizer as parameters are varied), observation of language acquisition (what words children know at age X) or language change, lexicons, and many other sources.

    These sources can be divided into experimental studies, where the researcher constructs a world and manipulates some variables (x) to observe the effect on other(s) (y), and correlational or observational studies, where the researcher observes the real world (e.g., Field, Miles, and Field 2012, section 1.6.1). Linguistic data can be experimental or observational, for example, respectively, from controlled laboratory experiments versus corpus studies.

    This distinction is closely related to causation versus correlation (i.e., data description). In the classic formulation, causality (x → y) can only be inferred from an experimental study, but it is unclear whether the results generalize to the real world, while observational studies are always correlational (x ∼ y), but the results have greater ecological validity. In reality, inferring causality is hard even for experimental data (e.g., there may be unobserved confounders, a nonrandom sample), and it is safest to always assume that we are fitting correlational models and bear the correlation is not causation adage in mind when interpreting our results. Section 4.2 discusses this further for regression models.

    1.3.2 Exploratory and Confirmatory Analysis

    Traditionally, data analysis can be exploratory or confirmatory (EDA, CDA; Tukey 1977): exploring the data, often by visualization, to generate hypotheses versus testing known hypotheses in novel data, for example, fitting a regression model and calculating p-values for each coefficient. An alternative characterization is that CDA/EDA is any data analysis that does/doesn’t involve statistical modeling. The ideal is that EDA precedes CDA. In realistic data analysis, the boundary between the two is often unclear (Tukey 1980), especially for modern regression models (e.g., Gelman 2004). It is common to explore and confirm using the same dataset, including going back and forth, and in many linguistic studies (especially observational) the exact statistical model to be run cannot be fully specified in advance. The exact balance is tricky and depends on the context.

    Rather than discussing exploration versus confirmation in depth, this book assumes that you are familiar with the basic issues relevant for regression modeling:

    EDA (read: making many plots) is critical before any statistical analysis, even if the statistical methods have been completely prespecified: to find problems with the data, get intuitions about what the data say that the fitted model can be checked against, and so on.

    Testing hypotheses (CDA) suggested by the data (EDA) is dangerous: this makes it less likely your results generalize to new data, and you can easily end up hypothesizing after the results are known (sometimes referred to as HARKing) after you see what terms are significant.

    Both exploratory and confirmatory phases are valuable for regression analysis: fitting a prespecified model to data will often miss important aspects of the data, while a model whose structure comes (only) from examining empirical plots may not generalize to new data.

    A (published) statistical analysis can be exploratory or confirmatory or both. Confirmatory analyses have higher status, but this is just convention—exploratory studies are very important and should be reported as exploratory (not written as if they are confirmatory). Most realistic data analysis is actually both exploratory and confirmatory, and it is important in your writeup to specify which part of your analysis was performed in exploratory mode versus confirmatory mode. (For example, some terms in a regression may be based on scientific hypotheses and some from examining empirical plots.)

    Some places to read more are Winter (2019, chapter 16), Gries (2021, chapter 4, section 5.5), Nicenboim et al. (2018) and Roettger (2019) (for linguistic data); Baguley (2012, section 1.3), and references already discussed.

    Most regression models that this book covers are confirmatory, but they are used in analysis pipelines including exploratory steps. These include making empirical plots (before modeling) and checking fitted models (model validation) to detect problems that can lead to refitting.

    Box 1.2

    Broader Context: Assumptions about Our World

    This book assumes a classic frequentist framework, in an idealized world:

    1. The goal is to estimate parameter values (e.g., the mean voice onset time for stops produced by American English speakers).

    2. These parameters have true (population) values, which are approximated by taking a sample.

    3. The population from which the sample is taken is infinitely large.

    4. Samples drawn from the population are representative and random (e.g., samples are from all American English speakers, randomly).

    Assumptions 1 and 2 are assumptions of frequentist data analysis, as opposed to Bayesian (see box 2.1). Assumptions 3 and 4 are idealizations about the sample that are assumed by most statistical methods; in reality a researcher is usually working with a convenience sample, which she hopes is representative/random enough. While assumptions 3 and 4 are important, they form part of the general issue of how the data were obtained, which I abstract away from in this book. Note that one prerequisite for assumption 4 is that observations are assumed to be independent. This is sometimes true for linguistic data, in which case methods from chapters 4 to 7 can be used, but often not; this is the primary motivation for mixed-effects regression models (chapters 8–10).

    For more on these points, see Navarro (2016, chapter 10), Kline (2013, chapter 2), or other statistics textbooks (some are listed in section 2.8).

    1.3.3 Mathematical Notation

    This book’s notation largely follows Gelman and Hill (2007). Values of parameters (the population values) are written with Greek letters: μ, σ. Estimates of parameters are written with a hat: μ̂, σ̂. Random variables corresponding to observed data are typically written with lowercase Roman letters, with subscripts denoting individual observations. For example, y could be observed reaction time in data from a laboratory experiment, and y₁, …, yₙ are the values of the n observations. However, parameters that are proportions are written with p (whose estimate would be p̂), and some Greek letters (𝜖, δ, γ) are used for error terms, which are discussed when they are first used (chapters 4, 8). Sample means are written with a bar: ȳ is the mean of y₁, …, yₙ.

    The ∼ notation is used to describe how a random variable is distributed. For example, "individual observations yᵢ follow a normal distribution with mean 1 and standard deviation 5" is written yᵢ ∼ N(1, 5). N(μ, σ) means "a normal probability density with mean μ and standard deviation σ."

    1. For example, Baayen (2008) uses Lattice packages that are now obsolete.

    2. Note that there is no best set of tools: R packages change rapidly, and different tools work for different people. For example, I prefer customizing my plots carefully; you may be fine with just using existing prewritten functions. You may strongly prefer base R or tidyverse functionality.

    3. This is simplified somewhat: McElreath distinguishes between process models, which are well-specified quantitative causal models of the process (e.g., how speakers parse sentences, how vowels are realized acoustically), and statistical models, which are the actual models we fit to data.

    4. On the centrality of scientific questions for statistical analysis, see Speed (1986) and McElreath (2020, chapter 17) (who cites it).

    5. This shorthand is just for convenience, because only quantitative data are relevant for us.

    6. In linguistics proper, experimental [data/linguistics] is commonly used in two senses: a laboratory experiment, or any study that primarily uses quantitative data, which would include most sources of what I am calling quantitative data. I avoid usage like experimental linguistics because of this ambiguity and only use experimental as defined in the text.

    2

    Samples, Estimates, and Hypothesis Tests

    This chapter and the next cover basics of inferential statistics: going from a finite sample of data from a population to inferences about the population, with the goal of [drawing] conclusions about which parameter values are supported by the data and which are not (Hoenig and Heisey 2001, 4).

    Regression modeling is a type of inferential statistics that builds on concepts covered in these two chapters. They first cover estimation of population values and differences using sample statistics (section 2.2), uncertainty in these estimates (section 2.3), and assessment of the reliability of conclusions that we reach about population values/differences (sections 2.4–2.7) using hypothesis testing. Chapter 3 covers the size of estimates and different kinds of errors that we can make in assessing the size and reliability of an effect.

    I assume you already have some exposure to the topics in the current chapter, which are covered in depth in many sources; some are listed in section 2.8. However, these topics are covered in very different ways in different settings (e.g., in a statistics class vs. an R tutorial). For language scientists learning regression modeling, it is useful to establish a common set of concepts, terminology, and practical guidelines, using linguistic data examples. This is the goal of this chapter.

    2.1 Preliminaries

    2.1.1 Packages

    This discussion assumes that you have loaded the tidyverse and languageR libraries (section 1.1).

    library(tidyverse) is a shortcut to load a set of tidyverse packages (section 1.1).¹ You can alternatively just install and load single packages as needed. For this book, the dplyr, ggplot2, and tidyr packages are the most important (Wickham et al. 2021; Wickham 2016, 2021).
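    A minimal loading block consistent with this setup (assuming both packages are already installed; otherwise run install.packages() first):

    # Load the tidyverse meta-package (dplyr, ggplot2, tidyr, ...) and languageR
    library(tidyverse)
    library(languageR)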

    2.1.2 Data

    The transitions dataset

    This discussion assumes that you have loaded the transitions dataset:
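    The actual loading code is provided on the book's website; as a sketch, assuming the dataset has been downloaded as a CSV file (the filename here is hypothetical):

    # Hypothetical filename; the real file and loading code are on the book's website
    transitions <- read_csv("transitions.csv")
    head(transitions)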

    This dataset (described in more detail in appendix A.1) comes from a study by Roberts, Torreira, and Levinson (2015) that examines approximately 20,000 transitions between conversational turns in a corpus of telephone calls. Each conversation (column file) is between two different speakers. Of interest is what factors affect transition durations (column dur): how long it takes after one speaker finishes speaking for the other speaker to begin. The before and after speakers for each turn are called speaker A and speaker B (columns spkA, spkB). For example, in conversation SW3154.EAF (the first rows of the dataframe), the two speakers are SPKR1290 and SPKR1288, and which one is speaker A or B alternates:

    (Here, … indicates omitted lines of R output. You can always run the code yourself to see full output.)

    Observations from the same conversation are not independent—because individual speakers probably have characteristic durations—but independent observations are assumed by methods introduced in this chapter. Thus, we take a small subset of the data where observations are plausibly independent, by choosing a random observation from each conversation:
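    The book's code (with its own random seed) is on the website; one way to do this with dplyr, as a sketch, is:

    set.seed(1)  # placeholder seed; the book fixes its own seed so everyone gets the same subset
    transitions_sub <- transitions %>%
      group_by(file) %>%       # one group per conversation
      slice_sample(n = 1) %>%  # pick one random turn transition from each conversation
      ungroup()
    nrow(transitions_sub)      # one row per conversation (349 in the book's data)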

    I assume that you have run these commands, so the dataframe transitions_sub exists (n = 349), and you are using the same random dataframe.

    2.1.3 Notational Conventions

    This chapter refers to individual datasets and R objects. R libraries (e.g., tidyverse, ggplot) are kept in plain text, while datasets are referred to in teletype, such as the transitions dataset. Teletype is used for objects in R code, such as the transitions_sub dataframe, or individual columns of the dataframe. A fundamental data type in R is the factor, a categorical variable that takes on discrete values. Factors (typically columns of a dataframe) are written using teletype and individual levels with SMALL CAPS. For example, the factor sexB in the transitions dataset has levels F and M.

    2.2 Point Estimation

    In a quantitative study we are often interested in estimating single numbers (called point estimates in statistics) that characterize an aspect of the world. For example, in the transitions data, we may be interested in the effect of speaker B’s gender (column sexB: values F, M).² This could be quantified by three numbers:

    How long are transitions if speaker B is male?

    How long are transitions if speaker B is female?

    What is the effect of gender on transition duration (the difference between male and female durations)?

    2.2.1 Population and Sample

    In quantitative studies we are typically interested in population values of a parameter—their true values in the world, under the model of the world we are assuming (box 2.1).

    Box 2.1

    Broader Context: Frequentist and Bayesian Statistics

    There are two major approaches to statistical inference, corresponding to different philosophies of what probability means. The assumption that true values of parameters exist implies we are doing frequentist statistics rather than Bayesian statistics, where inference results in a probability distribution describing degrees of belief over possible values of the parameter. This is simply a pragmatic choice—frequentist methods are vastly more common in behavioral and social sciences, though Bayesian methods offer some serious advantages and are making inroads. Many sources describe the general differences between Bayesian and frequentist approaches (e.g., Dienes 2008, chapter 4; McElreath 2020, chapter 1), and Nicenboim and Vasishth (2016) and Vasishth, Nicenboim, et al. (2018) are good starting points for Bayesian methods for analyzing linguistic data in particular.

    Typically population values refer to a context beyond just the setting for the study—we are probably interested in gender effects on transition time among all speakers of American English, not just all American English speakers who volunteered to be recorded for this corpus. However, in the real world we never observe population values; we can only take a sample of size n and make an inference about the population values.

    For example, to estimate the preceding three quantities using the transitions_sub data (n = 349), we could use:

    The average value of dur when sexB is F or M (91 msec, 259 msec)

    The difference between these averages (−167 msec)

    These values can be calculated for the transitions_sub data using functions from dplyr:
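    The book's exact code is on its website; a dplyr sketch that computes these numbers:

    # Mean transition duration by speaker B's gender
    transitions_sub %>%
      group_by(sexB) %>%
      summarise(mean_dur = mean(dur))

    # Difference between the two means (female minus male)
    with(transitions_sub, mean(dur[sexB == "F"]) - mean(dur[sexB == "M"]))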

    These estimates are not the same as the population values, for several reasons: the sample may (a) not be representative of the desired population (e.g., all American English speakers) or (b) not be truly random, and (c) the sample is finite. While (a) and (b) are important, they form part of the general issue of how the data were obtained, which we are abstracting away from in this book (section 1.3, box 1.2). We thus assume we do have a random sample from the population of interest. This leaves (c), which is a fundamental issue addressed by statistical inference: estimation of (population) quantities of interest, whose true values we will never actually know, based on a finite sample.

    2.2.2 Sampling Distribution of the Sample Mean

    In inferential statistics, the general setup is that we have a data sample from a quantitative study, which we assume is representative and random. We use this sample to calculate sample statistics, which are estimates of the population values of quantities we care about—typically parameters of a statistical model.

    Ideally a sample statistic should be an unbiased estimator of the population value: the statistic’s average value should be the same as the population value, meaning that if we kept repeating the study and computing the statistic, averaging these values would get us closer and closer to the true value.

    The most basic sample statistic is the sample mean, which is the average of n observations (written x₁, …, xₙ):

    x̄ = (x₁ + ⋯ + xₙ)/n  (2.1)

    The sample mean approximates the population mean, which we write μ. To understand how the sample mean is related to the population mean, we can explore using simulations where we know the population distribution.

    Suppose that durations of transitions (dur) to female speakers in the transitions data were in fact drawn from a normal distribution, with mean μ = 200 and standard deviation (SD) σ = 450, which we write N(200, 450) (see section 1.3.3 on notation). These are the (made-up) population values.

    We are interested in the sampling distribution of the sample mean: How likely are we to calculate different values of x̄ if we kept drawing random samples? We can plot a good approximation of this distribution as follows:

    Draw a sample of n observations from the distribution N(200, 450).

    Calculate x̄ for the sample.

    Repeat steps 1 and 2 many times (nsim), and plot a histogram showing the distribution of x̄ values. (A code sketch of these steps follows.)
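    A minimal sketch of this simulation (my own code, not the book's; the book's version, including plotting code, is on its website):

    nsim <- 100000   # number of simulated samples
    n <- 10          # sample size
    mu <- 200        # made-up population mean from the text
    sigma <- 450     # made-up population SD from the text

    # Draw nsim samples of size n, computing the sample mean of each
    sample_means <- replicate(nsim, mean(rnorm(n, mean = mu, sd = sigma)))

    # Quick-and-dirty histogram (the book's figures use more elaborate plotting code)
    hist(sample_means, breaks = 100,
         main = "Sampling distribution of the sample mean")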

    Figure 2.1 (top row) shows these histograms when the sample mean is calculated over 5, 10, and 50 observations (with nsim = 100,000). The distribution of the sample mean gets narrower for larger n. Thus, how certain we should be about our observed sample mean (91 msec) depends a lot on sample size: if n = 5, we would be likely to calculate a sample mean that is at least this far (109 msec) from the true value, just by chance.

    Figure 2.1

    Sampling distribution of the sample mean (histograms), calculated over n observations drawn from a normal distribution with μ = 200 and SD σ, for varying n and σ. Dotted lines show the probability distribution of observations [N(200, σ)].

    The distribution of the sample mean is also narrower if the quantity that we are estimating is less variable (that is, smaller σ), as illustrated in the bottom row of figure 2.1. Thus, the more observations in the sample or the less variable the quantity we are estimating, the more precise (= less variable) is the mean value that we calculate based on the sample.

    As suggested by the shape of the distributions in figure 2.1, the sample mean is itself normally distributed (box 2.2). The mean of this distribution is μ—because the sample mean is an unbiased estimator of the population mean—and its SD is σ/√n. This can be written more succinctly as

    x̄ ∼ N(μ, σ/√n)  (2.2)

    Box 2.2

    Practical Note: Normal Distributions Refresher

    It is useful to know some properties and notation for normal distributions that come up frequently in regression modeling (and in R output). The probability density for a normal distribution with mean μ and standard deviation σ is

    f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))

    This is often abbreviated as the N(μ, σ) distribution. (Or as N(μ, σ²), depending on the author.) σ² is the variance, and the inverse variance (1/σ²) is the precision.

    A normal distribution with mean 0 and SD 1 is called a standard normal distribution, written N(0, 1). It is common (in statistics texts or in R output) to use z (or Z) to refer to any random variable that is expected to follow a standard normal distribution. If you draw an observation z from such a random variable, the probability that |z| < 1, |z| < 2, or |z| < 3 is 0.68, 0.954, and 0.997 (respectively). That is, about 68% of probability lies within one σ from the mean, about 95% lies within two σ, and almost all probability lies within three σ. Given the ubiquity of the 95% significance criterion in language sciences, it is also useful to remember that exactly 95% of probability lies within 1.96 σ from the mean. But in general 2 is close enough to 1.96 to represent 95% probability.
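    These probabilities are easy to verify in R (a quick check of my own, not shown in the book text):

    # P(|Z| < 1), P(|Z| < 2), P(|Z| < 3) for a standard normal
    pnorm(c(1, 2, 3)) - pnorm(c(-1, -2, -3))
    ## 0.6826895 0.9544997 0.9973002

    # The z value with exactly 95% of probability within +/- z
    qnorm(0.975)
    ## 1.959964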

    A very useful property of normal distributions is closure under linear combination:

    If Z ∼ N(μ, σ) and a and b are constants, then

    a + bZ ∼ N(a + bμ, |b|σ)

    That is, adding a constant increases the mean and multiplying by a constant multiplies the variance (which is now b²σ²).

    If Z₁ ∼ N(μ₁, σ₁) and Z₂ ∼ N(μ₂, σ₂) and Z₁ and Z₂ are independent, then

    Z₁ + Z₂ ∼ N(μ₁ + μ₂, √(σ₁² + σ₂²))

    That is, the mean and variance of the sum are just the sums of the individual means and variances.

    One application is normality of the sample mean of n normally distributed (and independent) observations. This follows from the last two equations, because the sample mean is just a sum of normally distributed random variables divided by a constant.

    using the ∼ notation for describing the distribution of a random variable (section 1.3.3). The σ/√n term quantifies the observation from figure 2.1: either higher sample size or lower variability (in the data we’re analyzing) leads to a more precise estimate.

    2.2.3 Nonnormal Distributions and the Central Limit Theorem

    Much of regression modeling boils down to estimating mean values, as we did for estimating the mean of a normal distribution (as well as quantifying uncertainty in the estimates, as we’ll do in section 2.3). But in general, we’ll want to analyze data beyond just continuous variables drawn from a normal distribution—much linguistic data is discrete (e.g., yes/no responses in an experiment, syntactic construction A vs. B observed in a corpus), and much continuous-valued linguistic data isn’t normally distributed (e.g., word frequencies, phonetic parameters such as voice onset time, reaction times). What happens if we take the sample mean for observations from a nonnormal distribution?

    For example, consider the Dutch verb regularity data (dataframe regularity) from the languageR package, described in more detail on its help page (type ?regularity). This dataset, originally from Baayen and Moscoso del Prado Martín (2005), lists 700 Dutch irregular and regular verbs (column Regularity) and includes lexical and distributional variables that may help predict whether a verb is regular, including the verb’s frequency (column WrittenFrequency) and which auxiliary verb is used to form certain past tenses (column Auxiliary: levels HEBBEN, ZIJN, ZIJNHEB). In this sample, 159 verbs (23%) are irregular.

    Suppose we are trying to estimate a single (population) probability, p: how often Dutch verbs are irregular. (If we picked a random verb in a large Dutch dictionary, how likely would we be to select an irregular one?) We observe n Dutch verbs, x₁, …, xₙ, each of which is 0 (regular) or 1 (irregular).

    Our estimate for p is the sample proportion:

    p̂ = (x₁ + ⋯ + xₙ)/n  (2.3)

    which is the proportion of verbs that were irregular. Equation (2.3) looks the same as equation (2.1), because they are both sample means, but each xᵢ in equation (2.3) follows a Bernoulli distribution rather than a normal distribution. The numerator of equation (2.3) is thus a count, which follows a binomial distribution, not a normal distribution.

    To examine the probability distribution of the sample proportion, we can use the same simulation procedure as in section 2.2.2. Figure 2.2 shows the distribution for different sample sizes, assuming that p = 0.1 or 0.4 (made-up values), with a dotted line showing p.
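    A sketch of this simulation (again my own code; the book's is on its website):

    nsim <- 100000
    n <- 50
    p <- 0.4   # made-up population probability from the text

    # Each sample proportion is the mean of n Bernoulli(p) observations (0s and 1s)
    sample_props <- replicate(nsim, mean(rbinom(n, size = 1, prob = p)))

    hist(sample_props, breaks = 50,
         main = "Sampling distribution of the sample proportion")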

    Figure 2.2

    Sampling distribution of the sample proportion (p̂) for n observations of a Bernoulli random variable with probability p (histogram), varying n and p. Dotted lines show the value of p.

    The distribution of the sample proportion looks somewhat normal for n = 10, and by n = 50 looks perfectly normal—even though the distribution of the actual random variable whose mean is being estimated is not normal (it only takes on values 0 or 1). This illustrates one of the most important results of probability theory, the central limit theorem: for a large enough sample from any random variable with mean μ and SD σ, the sampling distribution of the sample mean is approximately normally distributed with mean μ and SD σ/√n (equation (2.2)).³

    The central limit theorem essentially says that the larger the sample you collect, the closer to normally distributed the sample mean is. This remarkable result is frequently used in inferential statistics, because it allows us to apply the same tools (for dealing with normally distributed data) to many different kinds of data. Nonetheless, as the preceding example shows, it is important to bear in mind that any normal approximation is still an approximation, which depends on sample size and the exact distribution being approximated.

    Box 2.3

    Broader Context: Variability in Bernoulli and Binomial Distributions

    When estimating the sample mean of a normally distributed quantity we can vary n and σ, but for the sample proportion (figure 2.2) there is no σ. This is because for a Bernoulli random variable (e.g., a coin flip) the only free parameter is the probability of a success (p). The distribution still has an SD σ; it is just a function of the free parameter p:

    σ = √(p(1 − p))

    This quantity is maximized when p = 0.5 and approaches 0 as p gets closer to 0 or 1. The (population) standard error of the sample proportion is just √(p(1 − p)/n), which decreases either for higher n or for p further from 0.5—the pattern seen in figure 2.2. Intuitively, the further p is from 0.5, the more certain you can be about the outcome of an individual observation. (If p = 1 or 0, there is no uncertainty.)
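    As a quick worked example (my own, plugging in the regularity data figures from section 2.2.3, 159 of 700 verbs irregular, as if they were population values):

    p_obs <- 159 / 700                         # observed proportion of irregular verbs (about 0.23)
    n <- 700
    sd_bernoulli <- sqrt(p_obs * (1 - p_obs))  # SD of a single 0/1 observation, treating p_obs as p
    se_prop <- sd_bernoulli / sqrt(n)          # standard error of the sample proportion
    c(p = p_obs, sd = sd_bernoulli, se = se_prop)
    ## approximately: p 0.227, sd 0.419, se 0.0158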

    The fact that Bernoulli distributions don’t have an independent variance parameter will be important for understanding logistic regressions (chapter 6).

    2.3 Uncertainty and Interval Estimation

    Almost as important as estimating the value of a quantity is estimating the uncertainty in our estimate, measured either by a single number or by a range of values (called an interval estimate).

    2.3.1 Standard Error

    For our estimate of the sample mean, we saw that the width of the distribution in figure 2.1, σ/√n, quantifies how much uncertainty there is in the sample mean as an estimate of the population mean. But in general we do not know σ (the population value) and must estimate it. An unbiased estimator for σ is

    s = √(Σ(xᵢ − x̄)²/(n − 1))  (2.4)

    which looks almost identical to the formula for calculating an SD of a sample, except with n − 1 in the denominator instead of n (which corrects for finite sample size [Rice 2006, section 7.3]).
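    R's built-in sd() already uses the n − 1 denominator; a quick check (my own sketch):

    x <- c(2, 4, 4, 4, 5, 5, 7, 9)

    sd(x)                                          # built-in: n - 1 denominator
    ## 2.13809

    sqrt(sum((x - mean(x))^2) / (length(x) - 1))   # same value, computed by hand
    ## 2.13809

    sqrt(sum((x - mean(x))^2) / length(x))         # "divide by n" version is smaller
    ## 2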

    We can then define the standard error (SE) of the sample mean, which is an unbiased estimator of σ/√n:

    SE = s/√n  (2.5)

    The SE estimates how much error there is, on average (across many samples), in our estimate of the population mean μ using x̄.

    One consequence of the central limit theorem is that (for large enough n) we can use s/√n as an approximate SE when estimating any sample mean (equation (2.5)), just by replacing σ by an estimate of the SD. Intuitively, whatever we’re trying to estimate, our estimate will be more precise for larger sample size or lower variability.

    For example, when estimating a proportion, an unbiased estimator for σ is

    s = √(p̂(1 − p̂))

    and the standard error is

    SE = √(p̂(1 − p̂)/n)

    Box 2.4

    Practical Note: Standard Error and Sample Size

    Technically, we should call σ/√n the standard error and s/√n the estimated standard error (section 7.3). This book is almost always referring to the latter, so it will use SE to mean estimated SE except where there is ambiguity.

    Something useful to remember is that error scales as the square root of sample size (because the standard error has a √n in the denominator). This is the reason why collecting more data has diminishing returns: doubling sample size only decreases error by a factor of 1.41, and to halve error you need four times as much data! Note that the (estimated) SE won’t be exactly halved if you collect four times the data, because this number (s/√n) is itself an estimate, and s will change for a new sample. Nonetheless, the true SE will be halved. (See exercise 2.1.)

    This basic relationship (error ∝ 1/√n) holds for many kinds of errors and is useful in practical settings, such as planning a new study (how much more data should you collect to observe a hypothesized effect?) or critically assessing empirical patterns in published work (how much should you trust the means of cells A and B if cell B contains half as much data?).
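    The scaling is easy to see numerically (a small sketch of my own):

    n <- c(100, 200, 400)
    se <- 1 / sqrt(n)   # SE for a quantity whose SD estimate is 1

    se[1] / se[2]   # doubling n shrinks the SE by a factor of sqrt(2), about 1.41
    se[1] / se[3]   # quadrupling n halves the SE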

    2.3.2 Confidence Intervals

    In isolation a standard error is not an intuitive measure of uncertainty, because it does not give a sense of which values are likely. One commonly used notion is a confidence interval (CI): a range of values that is X% likely to contain the population value. Most often, X = 95%.

    2.3.2.1 Z-based confidence intervals

    Consider
