Building Statistical Models in Python: Develop useful models for regression, classification, time series, and survival analysis

About this ebook

The ability to proficiently perform statistical modeling is a fundamental skill for data scientists and essential for businesses reliant on data insights. Building Statistical Models in Python is a comprehensive guide that will empower you to leverage mathematical and statistical principles in data assessment, understanding, and inference generation.

This book not only equips you with skills to navigate the complexities of statistical modeling, but also provides practical guidance for immediate implementation through illustrative examples. Through emphasis on application and code examples, you’ll understand the concepts while gaining hands-on experience. With the help of Python and its essential libraries, you’ll explore key statistical models, including hypothesis testing, regression, time series analysis, classification, and more.

By the end of this book, you’ll gain fluency in statistical modeling while harnessing the full potential of Python's rich ecosystem for data analysis.

Language: English
Release date: Aug 31, 2023
ISBN: 9781804612156
    Book preview

    BIRMINGHAM—MUMBAI

    Building Statistical Models in Python

    Copyright © 2023 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Ali Abidi

    Publishing Product Manager: Sanjana Gupta

    Senior Editor: Sushma Reddy

    Technical Editor: Rahul Limbachiya

    Copy Editor: Safis Editing

    Book Project Manager: Kirti Pisat

    Project Coordinator: Farheen Fathima

    Proofreader: Safis Editing

    Indexer: Hemangini Bari

    Production Designer: Prashant Ghare

    Marketing Coordinator: Nivedita Singh

    First published: August 2023

    Production reference: 3310823

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul's Square

    Birmingham

    B3 1RB, UK.

    ISBN 978-1-80461-428-0

    www.packtpub.com

    To my parents, Thieu and Tang, for their enormous support and faith in me.

    To my wife, Tam, for her endless love, dedication, and courage.

    – Huy Hoang Nguyen

    To my daughter, Lydie, for demonstrating how work and dedication regenerate inspiration and creativity. To my wife, Helene, for her love and support.

    – Paul Adams

    To my partner, Kate, who has always supported my endeavors.

    – Stuart Miller

    Contributors

    About the authors

    Huy Hoang Nguyen is a mathematician and data scientist with extensive experience in advanced mathematics, strategic leadership, and applied machine learning research. He holds a PhD in Mathematics, as well as two Master’s degrees in Applied Mathematics and Data Science. His previous work focused on Partial Differential Equations, Functional Analysis, and their applications in Fluid Mechanics. After transitioning from academia to the healthcare industry, he has undertaken a variety of data science projects, ranging from traditional machine learning to deep learning.

    Paul Adams is a Data Scientist with a background primarily in the healthcare industry. Paul applies statistics and machine learning in multiple areas of industry, focusing on projects in process engineering, process improvement, metrics and business rules development, anomaly detection, forecasting, clustering, and classification. Paul holds an MSc in Data Science from Southern Methodist University.

    Stuart Miller is a Machine Learning Engineer with a wide range of experience. Stuart has applied machine learning methods to various projects in industries ranging from insurance to semiconductor manufacturing. Stuart holds degrees in data science, electrical engineering, and physics.

    About the reviewers

    Krishnan Raghavan is an IT professional with over 20 years of experience in software development and delivery excellence across multiple domains and technologies, ranging from C++, Java, and Python to data warehousing and big data tools and technologies.

    When not working, Krishnan likes to spend time with his wife and daughter, reading fiction and nonfiction as well as technical books. Krishnan tries to give back to the community by being part of the GDG Pune Volunteer Group, helping the team organize events. Currently, he is unsuccessfully trying to learn how to play the guitar.

    You can connect with Krishnan via LinkedIn.

    I would like to thank my wife Anita and daughter Ananya for giving me the time and space to review this book.

    Karthik Dulam is a Principal Data Scientist at EDB. He is passionate about all things data, with a particular focus on data engineering, statistical modeling, and machine learning. He has a diverse background delivering machine learning solutions for the healthcare, IT, automotive, telecom, tax, and advisory industries. He actively engages with students as a guest speaker at esteemed universities, delivering insightful talks on machine learning use cases.

    I would like to thank my wife, Sruthi Anem, for her unwavering support and patience. I also want to thank my family, friends, and colleagues who have played an instrumental role in shaping the person I am today. Their unwavering support, encouragement, and belief in me have been a constant source of inspiration.

    Table of Contents

    Preface

    Part 1: Introduction to Statistics

    1

    Sampling and Generalization

    Software and environment setup

    Population versus sample

    Population inference from samples

    Randomized experiments

    Observational study

    Sampling strategies – random, systematic, stratified, and clustering

    Probability sampling

    Non-probability sampling

    Summary

    2

    Distributions of Data

    Technical requirements

    Understanding data types

    Nominal data

    Ordinal data

    Interval data

    Ratio data

    Visualizing data types

    Measuring and describing distributions

    Measuring central tendency

    Measuring variability

    Measuring shape

    The normal distribution and central limit theorem

    The Central Limit Theorem

    Bootstrapping

    Confidence intervals

    Standard error

    Correlation coefficients (Pearson’s correlation)

    Permutations

    Permutations and combinations

    Permutation testing

    Transformations

    Summary

    References

    3

    Hypothesis Testing

    The goal of hypothesis testing

    Overview of a hypothesis test for the mean

    Scope of inference

    Hypothesis test steps

    Type I and Type II errors

    Type I errors

    Type II errors

    Basics of the z-test – the z-score, z-statistic, critical values, and p-values

    The z-score and z-statistic

    A z-test for means

    z-test for proportions

    Power analysis for a two-population pooled z-test

    Summary

    4

    Parametric Tests

    Assumptions of parametric tests

    Normally distributed population data

    Equal population variance

    T-test – a parametric hypothesis test

    T-test for means

    Two-sample t-test – pooled t-test

    Two-sample t-test – Welch’s t-test

    Paired t-test

    Tests with more than two groups and ANOVA

    Multiple tests for significance

    ANOVA

    Pearson’s correlation coefficient

    Power analysis examples

    Summary

    References

    5

    Non-Parametric Tests

    When parametric test assumptions are violated

    Permutation tests

    The Rank-Sum test

    The test statistic procedure

    Normal approximation

    Rank-Sum example

    The Signed-Rank test

    The Kruskal-Wallis test

    Chi-square distribution

    Chi-square goodness-of-fit

    Chi-square test of independence

    Chi-square goodness-of-fit test power analysis

    Spearman’s rank correlation coefficient

    Summary

    Part 2: Regression Models

    6

    Simple Linear Regression

    Simple linear regression using OLS

    Coefficients of correlation and determination

    Coefficients of correlation

    Coefficients of determination

    Required model assumptions

    A linear relationship between the variables

    Normality of the residuals

    Homoscedasticity of the residuals

    Sample independence

    Testing for significance and validating models

    Model validation

    Summary

    7

    Multiple Linear Regression

    Multiple linear regression

    Adding categorical variables

    Evaluating model fit

    Interpreting the results

    Feature selection

    Statistical methods for feature selection

    Performance-based methods for feature selection

    Recursive feature elimination

    Shrinkage methods

    Ridge regression

    LASSO regression

    Elastic Net

    Dimension reduction

    PCA – a hands-on introduction

    PCR – a hands-on salary prediction study

    Summary

    Part 3: Classification Models

    8

    Discrete Models

    Probit and logit models

    Multinomial logit model

    Poisson model

    The Poisson distribution

    Modeling count data

    The negative binomial regression model

    Negative binomial distribution

    Summary

    9

    Discriminant Analysis

    Bayes’ theorem

    Probability

    Conditional probability

    Discussing Bayes’ Theorem

    Linear Discriminant Analysis

    Supervised dimension reduction

    Quadratic Discriminant Analysis

    Summary

    Part 4: Time Series Models

    10

    Introduction to Time Series

    What is a time series?

    Goals of time series analysis

    Statistical measurements

    Mean

    Variance

    Autocorrelation

    Cross-correlation

    The white-noise model

    Stationarity

    Summary

    References

    11

    ARIMA Models

    Technical requirements

    Models for stationary time series

    Autoregressive (AR) models

    Moving average (MA) models

    Autoregressive moving average (ARMA) models

    Models for non-stationary time series

    ARIMA models

    Seasonal ARIMA models

    More on model evaluation

    Summary

    References

    12

    Multivariate Time Series

    Multivariate time series

    Time-series cross-correlation

    ARIMAX

    Preprocessing the exogenous variables

    Fitting the model

    Assessing model performance

    VAR modeling

    Step 1 – visual inspection

    Step 2 – selecting the order of AR(p)

    Step 3 – assessing cross-correlation

    Step 4 – building the VAR(p,q) model

    Step 5 – testing the forecast

    Step 6 – building the forecast

    Summary

    References

    Part 5: Survival Analysis

    13

    Time-to-Event Variables – An Introduction

    What is censoring?

    Left censoring

    Right censoring

    Interval censoring

    Type I and Type II censoring

    Survival data

    Survival Function, Hazard and Hazard Ratio

    Summary

    14

    Survival Models

    Technical requirements

    Kaplan-Meier model

    Model definition

    Model example

    Exponential model

    Model example

    Cox Proportional Hazards regression model

    Step 1

    Step 2

    Step 3

    Step 4

    Step 5

    Summary

    Index

    Other Books You May Enjoy

    Preface

    Statistics is a discipline of study used for applying analytical methods to answer questions and solve problems using data, in both academic and industry settings. Many of its methods have been around for centuries, while others are much more recent. Statistical analyses and their results are fairly straightforward to present to both technical and non-technical audiences. Furthermore, producing results with statistical analysis does not necessarily require large amounts of data or compute resources and can be done fairly quickly, especially when using a programming language such as Python, which is relatively easy to learn and work with.

    While artificial intelligence (AI) and advanced machine learning (ML) tools have become more prominent and popular in recent years with the increased accessibility of compute power, performing statistical analysis as a precursor to larger-scale AI and ML projects enables a practitioner to assess feasibility and practicality before committing the larger compute resources and project architecture those projects require.

    This book provides a wide variety of tools that are commonly used to test hypotheses and provide basic predictive capabilities to analysts and data scientists alike. The reader will walk through the basic concepts and terminology required for understanding the statistical tools in this book prior to exploring the different tests and the conditions under which they are applicable. Further, the reader will gain the knowledge needed to assess the performance of the tests. Throughout, examples will be provided in the Python programming language to get readers started with understanding their data using the tools presented, which are applicable to some of the most common questions faced in the data analytics industry. The topics we will walk through include:

    An introduction to statistics

    Regression models

    Classification models

    Time series models

    Survival analysis

    Understanding the tools provided in these sections will provide the reader with a firm foundation from which further independent growth in the statistics domain can more easily be achieved.

    Who this book is for

    Professionals in most industries can benefit from the tools in this book. The tools provided are useful primarily at a higher level of inferential analysis, but can be applied to deeper levels depending on the industry in which the practitioner wishes to apply them. The target audiences of this book are:

    Industry professionals with limited statistical or programming knowledge who would like to learn to use data for testing hypotheses they have in their business domain

    Data analysts and scientists who wish to broaden their statistical knowledge and find a set of tools and their implementations for performing various data-oriented tasks

    The ground-up approach of this book seeks to provide entry into the knowledge base for a wide audience and therefore should neither discourage novice-level practitioners nor exclude advanced-level practitioners from the benefits of the materials presented.

    What this book covers

    Chapter 1, Sampling and Generalization, describes the concepts of sampling and generalization. The discussion of sampling covers several common methods for sampling data from a population and discusses the implications for generalization. This chapter also discusses how to set up the software required for this book.

    Chapter 2, Distributions of Data, provides a detailed introduction to types of data, common distributions used to describe data, and statistical measures. This chapter also covers common transformations used to change distributions.

    Chapter 3, Hypothesis Testing, introduces the concept of statistical tests as a method for answering questions of interest. This chapter covers the steps to perform a test, the types of errors encountered in testing, and how to perform power analysis using the z-test.

    Chapter 4, Parametric Tests, further discusses statistical tests, providing detailed descriptions of common parametric statistical tests, the assumptions of parametric tests, and how to assess the validity of parametric tests. This chapter also introduces the concept of multiple tests and provides details on corrections for multiple tests.

    Chapter 5, Non-parametric Tests, discusses how to perform statistical tests when the assumptions of parametric tests are violated, using a class of tests with fewer assumptions called non-parametric tests.

    Chapter 6, Simple Linear Regression, introduces the concept of a statistical model with the simple linear regression model. This chapter begins by discussing the theoretical foundations of simple linear regression and then discusses how to interpret the results of the model and assess the validity of the model.

    Chapter 7, Multiple Linear Regression, builds on the previous chapter by extending the simple linear regression model into additional dimensions. This chapter also discusses issues that occur when modeling with multiple explanatory variables, including multicollinearity, feature selection, and dimension reduction.

    Chapter 8, Discrete Models, introduces the concept of classification and develops a model for classifying observations into discrete levels of a categorical response variable. This chapter starts by developing a model for binary classification and then extends the model to multinomial classification. Finally, the Poisson and negative binomial models are covered.

    Chapter 9, Discriminant Analysis, discusses several additional models for classification, including linear discriminant analysis and quadratic discriminant analysis. This chapter also introduces Bayes’ Theorem.

    Chapter 10, Introduction to Time Series, introduces time series data, discussing the time series concept of autocorrelation and the statistical measures for time series. This chapter also introduces the white noise model and stationarity.

    Chapter 11, ARIMA Models, discusses models for univariate time series. This chapter starts by discussing models for stationary time series and then extends the discussion to non-stationary time series. Finally, this chapter provides a detailed discussion of model evaluation.

    Chapter 12, Multivariate Time Series, builds on the previous two chapters by introducing the concept of a multivariate time series and extends ARIMA models to multiple explanatory variables. This chapter also discusses time series cross-correlation.

    Chapter 13, Time-to-Event Variables – An Introduction, introduces survival data, also called time-to-event data. This chapter discusses the concept of censoring and the impact of censoring on survival data. Finally, the chapter discusses the survival function, hazard, and hazard ratio.

    Chapter 14, Survival Models, building on the previous chapter, provides an overview of several models for survival data, including the Kaplan-Meier model, the Exponential model, and the Cox Proportional Hazards model.

    To get the most out of this book

    You will need access to download and install open-source packages implemented in the Python programming language, available through PyPI.org or the Anaconda Python distribution. A background in statistics is helpful but not necessary; this book does assume a decent background in basic algebra. Each unit of this book is independent of the other units, but the chapters within each unit build upon each other. Thus, we advise you to begin each unit with that unit's first chapter to understand the content.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    Download the example code files

    You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Building-Statistical-Models-in-Python. If there’s an update to the code, it will be updated in the GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.

    A block of code is set as follows:

    A = [3,5,4]

    B = [43,41,56,78,54]

    permutation_testing(A,B,n_iter=10000)

    Any command-line input or output is written as follows:

    pip install SomePackage

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: Select System info from the Administration panel.

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Share Your Thoughts

    Once you’ve read Building Statistical Models in Python, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

    The perks don’t stop there – you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below

    https://packt.link/free-ebook/978-1-80461-428-0

    Submit your proof of purchase

    That’s it! We’ll send your free PDF and other benefits to your email directly

    Part 1: Introduction to Statistics

    This part will cover the statistical concepts that are foundational to statistical modeling.

    It includes the following chapters:

    Chapter 1, Sampling and Generalization

    Chapter 2, Distributions of Data

    Chapter 3, Hypothesis Testing

    Chapter 4, Parametric Tests

    Chapter 5, Non-Parametric Tests

    1

    Sampling and Generalization

    In this chapter, we will describe the concept of populations and sampling from populations, including some common sampling strategies. The discussion of sampling will lead into a section on generalization, which will be discussed as it relates to using samples to draw conclusions about their respective populations. When modeling for statistical inference, it is necessary to ensure that samples can be generalized to populations, and the subjects in this chapter provide an in-depth overview of that bridge.

    We will cover the following main topics:

    Software and environment setup

    Population versus sample

    Population inference from samples

    Sampling strategies – random, systematic, stratified, and clustering

    Software and environment setup

    Python is one of the most popular programming languages for data science and machine learning, thanks to the large open source community that has driven the development of its data science libraries. Python’s ease of use and flexible nature have made it a prime candidate in the data science world, where experimentation and iteration are key features of the development cycle. While there are newer languages in development for data science applications, such as Julia, Python currently remains the key language for data science due to its wide breadth of open source projects, supporting applications from statistical modeling to deep learning. We have chosen to use Python in this book due to its positioning as an important language for data science and its demand in the job market.

    Python is available for all major operating systems: Microsoft Windows, macOS, and Linux. Additionally, the installer and documentation can be found at the official website: https://www.python.org/.

    This book is written for Python version 3.8 (or higher), and we recommend using the most recent stable version of Python available. The code in this book is unlikely to be compatible with Python 2.7, and most actively maintained libraries have dropped support for Python 2.7 since its official support ended in 2020.

    The libraries used in this book can be installed with the Python package manager, pip, which is included with contemporary versions of Python. More information about pip can be found here: https://docs.python.org/3/installing/index.html. Once pip is available, packages can be installed from the command line. Here is basic usage at a glance:

    Install a new package using the latest version:

    pip install SomePackage

    Install the package with a specific version, version 2.1 in this example:

    pip install SomePackage==2.1

    A package that is already installed can be upgraded with the --upgrade flag:

    pip install SomePackage --upgrade

    In general, it is recommended to use separate Python virtual environments for different projects and to keep project dependencies separate from system directories. Python provides a virtual environment utility, venv, which is part of the standard library in contemporary versions of Python. Virtual environments allow you to create isolated Python environments, each with its own interpreter and set of installed dependencies. Using virtual environments can prevent package version issues and conflicts when working on multiple Python projects. Details on setting up and using virtual environments can be found here: https://docs.python.org/3/library/venv.html.
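    As a minimal sketch, a virtual environment can be created and activated from the command line as follows (the environment name, stats-env, is an arbitrary choice):

    python -m venv stats-env

    source stats-env/bin/activate

    On Windows, run stats-env\Scripts\activate instead. Once the environment is activated, pip install commands will place packages inside stats-env rather than in the system-wide Python installation.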

    While we recommend the use of Python and Python’s virtual environments for environment setups, a highly recommended alternative is Anaconda. Anaconda is a free (enterprise-ready) analytics-focused distribution of Python by Anaconda Inc. (previously Continuum Analytics). Anaconda distributions come with many of the core data science packages, common IDEs (such as Jupyter and Visual Studio Code), and a graphical user interface for managing environments. Anaconda can be installed using the installer found at the Anaconda website here: https://www.anaconda.com/products/distribution.

    Anaconda comes with its own package manager, conda, which can be used to install new packages similarly to pip.

    Install a new package using the latest version:

    conda install SomePackage

    Upgrade a package that is already installed:

    conda upgrade SomePackage
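    As with venv, conda can create isolated environments. A minimal sketch follows (the environment name and Python version are arbitrary choices):

    conda create -n stats-env python=3.10

    conda activate stats-env

    Packages installed with conda install inside this environment will not affect other environments.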

    Throughout this book, we will make use of several core libraries in the Python data science ecosystem, such as NumPy for array manipulations, pandas for higher-level data manipulations, and matplotlib for data visualization. The package versions used for this book are contained in the following list. Please ensure that the versions installed in your environment are equal to or greater than the versions listed. This will help ensure that the code examples run correctly:

    statsmodels 0.13.2

    Matplotlib 3.5.2

    NumPy 1.23.0

    SciPy 1.8.1

    scikit-learn 1.1.1

    pandas 1.4.3

    The packages used for the code in this book are shown in Figure 1.1. The __version__ attribute can be used to print a package’s version in code.
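    For example, a short script along the following lines (a sketch, assuming the packages above are already installed) prints the version of each core package:

    # Each of these packages exposes its version via the __version__ attribute

    import matplotlib, numpy, pandas, scipy, sklearn, statsmodels

    for pkg in (matplotlib, numpy, pandas, scipy, sklearn, statsmodels):

        print(pkg.__name__, pkg.__version__)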

    Figure 1.1 – Package versions used in this book

    Having set up the technical environment for the book, let’s get into the statistics. In the next sections, we will discuss the concepts of population and sampling. We will demonstrate sampling strategies with code implementations.

    Population versus sample

    In general, the goal of statistical modeling is to answer a question about a group by making an inference about that group. The group we are making an inference on could be machines in a production factory, people voting in an election, or plants on different plots of land. The entire group, every individual item or entity, is referred to as the population. In most cases, the population of interest is so large that it is not practical, or even possible, to collect data on every entity in the population. For instance, using the voting example, it would probably not be possible to poll every person who voted in an election. Even if it were possible to reach all the voters for the election of interest, many voters might not consent to polling, which would prevent collection on the entire population. An additional consideration would be the expense of polling such a large group. These factors make it practically impossible to collect population statistics in our example of vote polling, and similar prohibitive factors exist in many cases where we may want to assess a population-level attribute.

    Fortunately, we do not need to collect data on the entire population of interest. Inferences about a population can be made using a subset of the population, called a sample. This is the main idea of statistical modeling: a model is created using a sample, and inferences are made about the population.
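    To make the idea concrete, here is a minimal sketch (using NumPy, with made-up values) of drawing a simple random sample from a large population:

    import numpy as np

    rng = np.random.default_rng(seed=42)

    # Hypothetical population: heights (in cm) of 100,000 plants

    population = rng.normal(loc=150, scale=10, size=100_000)

    # Simple random sample of 200 plants, drawn without replacement

    sample = rng.choice(population, size=200, replace=False)

    print(population.mean(), sample.mean())

    When the sample is representative, its mean will tend to land close to the population mean, which is exactly what lets us reason about the population from the sample.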

    In order to make valid inferences about the population of interest using a sample, the sample must be representative of the population of interest, meaning that the sample should contain the variation found in the population. For example, if we were interested in making an inference about plants in a field, it is unlikely that samples from one corner of the field would be sufficient for inferences about the larger population. There would likely be variations in plant characteristics over the entire field. We could think of various reasons why there might be variation. For this example, we will consider some examples from Figure 1.2.

    Figure 1.2 – Field of plants

    The figure shows that Sample A is near a forest. This sample area may be affected by the presence of the forest; for example, some of the plants in that sample may receive less sunlight than plants in the other samples. Sample B sits between the main irrigation lines. It’s conceivable that this sample receives more water on average than the other two samples, which may have an effect on the plants in this sample. Finally, Sample C is near a road. This sample may see other effects that are not seen in Sample A or B.

    If samples were only taken from one of those sections, the inferences from those samples would be biased and would not be valid for the broader population. Thus, samples would need to be taken from across the entire field to create a sample that is more likely to be representative of the population of plants. When taking samples from populations, it is critical to ensure the sampling method is robust to possible issues, such as the influence of irrigation and shade in the previous example. Whenever taking a sample from a population, it’s important to identify and mitigate possible sources of bias, because biases in data will affect your model and skew your conclusions.

    In the next section, various methods for sampling from a dataset will be discussed. An additional consideration is the sample size. The sample size impacts the type of statistical tools we can use, the distributional assumptions that can be made about the sample, and the confidence of inferences and predictions. The impact of sample size will be explored in depth in Chapter 2, Distributions of Data, and Chapter 3, Hypothesis Testing.

    Population inference from samples

    When using a statistical model to make inferential conclusions about a population from a sample subset of that population, the study design must account for degrees of uncertainty in its variables similar to those in the population. This is the variation mentioned earlier in this chapter. To appropriately draw inferential conclusions about a population, any statistical model must be structured around a chance mechanism. Studies structured around these chance mechanisms are called randomized experiments and provide an understanding of both correlation and causation.

    Randomized experiments

    There are two primary characteristics of a randomized experiment:

    Random sampling, colloquially referred to as random selection

    Random assignment of treatments, which is the nature of the study

    Random sampling

    Random sampling (also called random selection) is designed with the intent of creating a sample representative of the overall population, so that statistical models generalize to the population well enough to assign cause-and-effect outcomes. For random sampling to be successful, the population of interest must be well defined, and every member of the population must have a chance of being selected. Considering the example of polling voters, all voters must be willing to be polled; once all voters are entered into a lottery, random sampling can be used to subset voters for modeling. Sampling from only those voters who are willing to be polled introduces sampling bias into statistical modeling, which can lead to skewed results. The sampling method in the scenario where only some voters are willing to participate is called self-selection. Any information obtained and modeled from self-selected samples – or any non-random samples – cannot be used for inference.

    Random assignment of treatments

    The random assignment of treatments refers to two motivators:

    The first motivator is to gain an understanding of specific input variables and their influence on the response – for example, understanding whether assigning treatment A to a specific individual may produce more favorable outcomes than a placebo.

    The second motivator is to remove the impact of external variables on the outcomes of a study. These external variables, called confounding variables (or confounders), are important to remove as they often prove difficult to control. They may have unpredictable values or even be unknown to the researcher. The consequence of leaving confounders unaccounted for is that the outcomes of a study may not be replicable, which can be costly. While confounders can influence outcomes, they can also influence input variables, as well as the relationships between those variables.

    Referring back to the example in the earlier section, Population versus sample, consider a farmer who decides to start using pesticides on his crops and wants to test two different brands. The farmer knows there are three distinct areas of the land: plot A, plot B, and plot C. To determine the success of the pesticides and prevent damage to the crops, the farmer randomly chooses 60 plants from each plot for testing (this is called stratified random sampling, where random sampling is performed within each stratum – here, each plot), as sketched below.
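    Here is a minimal sketch of that stratified draw, using pandas with a hypothetical DataFrame of plants labeled by plot:

    import numpy as np

    import pandas as pd

    rng = np.random.default_rng(seed=0)

    # Hypothetical field data: 500 plants in each of plots A, B, and C

    plants = pd.DataFrame({"plot": np.repeat(["A", "B", "C"], 500), "height_cm": rng.normal(150, 10, size=1_500)})

    # Stratified random sample: 60 plants drawn at random from each plot

    strat_sample = plants.groupby("plot", group_keys=False).sample(n=60, random_state=0)

    print(strat_sample["plot"].value_counts())

    Because the draw is stratified, each plot contributes exactly 60 plants, so no area of the field is over- or under-represented in the sample.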
